Mapping tables that go (Form, Treebank POS) --> (Lemma, UD POS, Morphological features) #212

honnibal · 2015-09-24T06:45:06Z

Hi all,

I've been working on moving spaCy (an NLP pipeline, http://spacy.io ) to the UD scheme for some time now. I'm currently trying to produce some mapping files, which I think might also be of use to others. Some of what I'm doing might exist already --- if so, I'd appreciate pointers to the resources :). If not, I'm looking for a little help in making sure I'm applying the annotation schemes correctly, particularly the morphological scheme.

First, as a point of terminology, is there a standard way to describe language/treebank-specific POS schemes, e.g. the VBZ, NNS etc scheme used in the PTB? For now, I'll call these tags "XPOS", for "extended POS tags". I'll reserve the term "POS tag" for one of the 17 UD POS tags.

As a second point of terminology, I'll call the text-field of an inflected token an "orthographic form" (as opposed to a lemma).

Often an XPOS tag maps to a single POS, and zero or more morphological features. I've found useful mapping tables like this in the Interset project. For my purposes, I'm formatting these as JSON files. I call this a "Tag Map": https://github.com/honnibal/spaCy/blob/master/lang_data/en/tag_map.json . Example entry:

"NNS": {"pos": "noun", "number": "plur"}

Sometimes the XPOS doesn't map cleanly, e.g. the TO tag in the Penn Treebank. So I want to have another mapping file, which is keyed by an XPOS and an orthographic form. This is also necessary for exceptional forms in the language, which encode additional morphological features in their orthographic form that are not captured by their XPOS. Personal pronouns seem a common example. An example entry:

 "me/PRP":  {"lemma": "-PRON-", "PronType": "Prs", "Person": "One",   "Number": "Sing",  "Case": "Acc"}

(Question: What's the favoured lemma for personal pronouns? I've gone with the special value "-PRON-" above, but it'd be nice to align this with what other people are doing.)
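
For illustration, here is a minimal sketch of the two-level lookup described above: a tag map keyed by XPOS, plus a table of form-specific entries keyed by "form/XPOS" that overrides it. The function name `analyse`, the `PRP` tag-map entry, and the lowercasing of the form are assumptions made for the example, not spaCy's actual implementation.

```python
# Tag-level defaults, keyed by XPOS.
# The NNS entry is from the post; the PRP entry is an assumption for illustration.
TAG_MAP = {
    "NNS": {"pos": "noun", "number": "plur"},
    "PRP": {"pos": "pron"},
}

# Form-specific entries, keyed by "form/XPOS" as in the example entry above.
EXCEPTIONS = {
    "me/PRP": {"lemma": "-PRON-", "PronType": "Prs", "Person": "One",
               "Number": "Sing", "Case": "Acc"},
}


def analyse(form, xpos):
    """Merge the tag-level entry with any form-specific entry.

    The form-specific entry wins on conflicting keys. Lowercasing the
    form before lookup is an assumption of this sketch.
    """
    analysis = dict(TAG_MAP.get(xpos, {}))
    analysis.update(EXCEPTIONS.get(f"{form.lower()}/{xpos}", {}))
    return analysis
```

With these tables, `analyse("dogs", "NNS")` falls back to the tag-level entry alone, while `analyse("me", "PRP")` picks up the extra pronoun features from the form-specific table.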

The token/XPOS pair will still fail to map cleanly in some cases. When it does, my idea is to simply transform the XPOS scheme itself, using the Treebank parse. This treats the XPOS as an arbitrary detail of the statistical modelling: a place to jointly predict the part-of-speech and morphology, which rule-based post-processing then transforms into the UD format.

Okay! Finally, the questions. First, three general questions:

Any mapping tables of this form already created?
Sources of equivalent information? Anything like this in the HPSG or LFG grammars?
I've read the UD documentation, and looked over the Interset materials. Does it sound like I've missed some major documentation?

Now some specifics:

In English, is "you" number unmarked, or is it Number=Sing,Plur? Equivalent question for case and gender. Are these clear/central applications of the annotation scheme, or are they difficult edge cases?
Is "mine" Case=Acc?
Is "myself" Case=Acc?

Sorry this was so long!

Thanks,
Matthew Honnibal

fginter · 2015-09-24T07:07:53Z

Hi Matthew,

Can't speak for other languages and your question might be specifically targeting English, but FWIW this is what holds for Finnish:

The mapping is LEMMA+XPOS+XFEAT -> POS+FEAT and in two cases even LEMMA+XPOS+XFEAT+DEPREL -> POS+FEAT (where X marks original data in the original treebank). Note that XFEAT is not present in the final UD file.
The mapping is implemented in the dev-ud branch of the Finnish-dep-parser GitHub project here: https://github.com/TurkuNLP/Finnish-dep-parser/tree/dev-ud/morpho-sd2ud It's not a simple lookup table, but a set of scripts which transform the output of the Finnish morphological analyzer (XPOS+XFEAT) into the UD versions POS+FEAT.

Best,

Filip

honnibal · 2015-09-24T07:28:12Z

Thanks. I'm aiming for multilingual support. I'm hoping the simpler mapping will work for most languages, although I know some languages might need to work differently.

How does your lemmatization work? Do you go ORTH+XPOS -> LEMMA+XPOS+XFEAT?

fginter · 2015-09-24T07:35:16Z

Finnish lemmatization uses the two-level morphological analyzer OMorFi. That is distributed with the parser. So we go ORTH -> [morpho analyzer] -> several LEMMA+XPOS+XFEAT alternatives -> [bunch of scripts] -> several LEMMA+POS+FEAT alternatives. And then a CRF (Marmot) to disambiguate the competing readings.
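
The shape of that pipeline can be sketched roughly as follows. Everything here is a stand-in: `TOY_ANALYSES` replaces the morphological analyzer with invented placeholder readings (not real OMorFi output), `convert` replaces the conversion scripts with two made-up rules, and `disambiguate` simply takes the best-scoring reading under a caller-supplied function rather than running a CRF over the sentence.

```python
from typing import Callable, Dict, List, Optional, Tuple

Reading = Tuple[str, str, Dict[str, str]]  # (lemma, tag, features)

# Stand-in for the analyzer: form -> candidate (LEMMA, XPOS, XFEAT) readings.
# All forms, tags and features here are invented placeholders.
TOY_ANALYSES: Dict[str, List[Reading]] = {
    "form1": [
        ("lemma_a", "XTAG_N", {"xnum": "sg"}),
        ("lemma_b", "XTAG_V", {"xtense": "prs"}),
    ],
}

XPOS_TO_POS = {"XTAG_N": "NOUN", "XTAG_V": "VERB"}


def analyze(form: str) -> List[Reading]:
    """Return every candidate reading of a form (empty if unknown)."""
    return list(TOY_ANALYSES.get(form, []))


def convert(reading: Reading) -> Reading:
    """Turn one (LEMMA, XPOS, XFEAT) reading into (LEMMA, POS, FEAT)."""
    lemma, xpos, xfeat = reading
    feat: Dict[str, str] = {}
    if xfeat.get("xnum") == "sg":
        feat["Number"] = "Sing"
    if xfeat.get("xtense") == "prs":
        feat["Tense"] = "Pres"
    return lemma, XPOS_TO_POS[xpos], feat


def disambiguate(candidates: List[Reading],
                 score: Callable[[Reading], float]) -> Reading:
    """Pick the highest-scoring reading (a trivial stand-in for the CRF)."""
    return max(candidates, key=score)


def tag_token(form: str,
              score: Callable[[Reading], float]) -> Optional[Reading]:
    """ORTH -> analyzer -> converted alternatives -> one chosen reading."""
    candidates = [convert(r) for r in analyze(form)]
    return disambiguate(candidates, score) if candidates else None
```

The point of the sketch is only the ordering: conversion is applied to every alternative before disambiguation, so the disambiguator chooses among readings that are already in the target scheme.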

ftyers · 2015-09-24T08:00:45Z

We were working on this for Kazakh at the TurkLang conference last week. Here is the spreadsheet we've been writing to be able to have a consistent mapping between our two standards (KNC, Apertium) and UD. Note that we conceive the mapping as unidirectional.

https://docs.google.com/spreadsheets/d/1Q4J3axzxYFFZebhfaCOvgtIy26OgpZcym-mVA_yX7FI/edit#gid=130548646

As Filip mentioned, in the worst case you do need (LEMMA)+XPOS+XFEAT+DEPREL (for example, to make a pronoun/determiner or noun/adjective distinction).
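
One way to picture a table that only sometimes needs the lemma or the dependency relation is a lookup that tries the most specific key first and falls back to more general ones. The sketch below is hypothetical: the tag `XPRON`, the rule entries, and the function `lookup_upos` are invented for illustration, and XFEAT is left out of the key to keep it short.

```python
from typing import Dict, Optional, Tuple

# Key: (lemma, xpos, deprel); None acts as a wildcard in that position.
Key = Tuple[Optional[str], str, Optional[str]]

# Hypothetical rules; tag and relation names are illustrative only.
RULES: Dict[Key, str] = {
    (None, "XPRON", None): "PRON",   # default reading for the tag
    (None, "XPRON", "det"): "DET",   # same tag, used as a determiner
}


def lookup_upos(lemma: str, xpos: str,
                deprel: Optional[str] = None) -> Optional[str]:
    """Try the most specific key first, then progressively more general ones."""
    for key in (
        (lemma, xpos, deprel),
        (None, xpos, deprel),
        (lemma, xpos, None),
        (None, xpos, None),
    ):
        if key in RULES:
            return RULES[key]
    return None
```

Here the same source tag comes out as `DET` when the token's relation is `det` and as `PRON` otherwise, which is the kind of distinction a tag-only table cannot make.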

dan-zeman · 2015-09-24T12:13:40Z

@honnibal : These tables of yours could be quite useful and they can obviously improve over Interset. But as others mentioned, even with lemma it cannot be perfect. Beyond that, it is probably easier to write scripts than tables, and to consider the tree structure (that's what I do).

As for your specific questions, I would say that myself is Case=Acc. I would not mark case for mine because it can be used both as subject and object, without changing form. I wouldn't mark gender, number and case for you (the rule of thumb is: if your list of values contains all values available for the language (such as Number=Plur,Sing for English), then drop the feature entirely).
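
The rule of thumb can be written down as a small filter. This is a sketch under stated assumptions: the `ALL_VALUES` inventory is a simplified, assumed list of values per feature for English, and `prune_features` is a hypothetical helper name.

```python
# Assumed inventory of values per feature, simplified for illustration.
ALL_VALUES = {
    "Number": {"Sing", "Plur"},
    "Case": {"Nom", "Acc"},
}


def prune_features(candidates):
    """Apply the rule of thumb to a form's candidate feature values.

    `candidates` maps a feature name to the set of values the form is
    compatible with. A feature whose set covers every value available
    for the language is dropped; the rest are joined in sorted order.
    """
    kept = {}
    for feat, values in candidates.items():
        values = set(values)
        if not values or values == ALL_VALUES.get(feat):
            continue  # uninformative: no value, or every value
        kept[feat] = ",".join(sorted(values))
    return kept
```

Under this inventory, a form compatible with both numbers and both cases, as "you" is in the discussion above, ends up with no Number or Case feature at all, while a form restricted to one value keeps it.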

On the other hand, we have not settled on the scale between form and function as criteria for morphological (and POS) distinctions. For instance the German STTS tagset is very context-sensitive: the nouns inflect for case quite rarely (and usually the case is determined by the article), yet every noun has one of the four case values assigned. So if you are willing and able to disambiguate, you could distinguish between you that is Case=Nom and you that is Case=Acc. (In pure theory, the same could be done with Number, but I'm afraid that it would be often undecidable even for human annotators.)

honnibal · 2015-09-24T12:55:51Z

Thanks all. It sounds like I'll need an additional process to add further morphological features after parsing for many languages, which I hadn't thought about.

> As for your specific questions, I would say that myself is Case=Acc. I would not mark case for mine because it can be used both as subject and object, without changing form.

myself makes sense as Case=Acc, thanks. But I don't understand how mine can be a subject? I would've said it can only be used in a predicative context, or as the object of a preposition.

> (the rule of thumb is: if your list of values contains all values available for the language (such as Number=Plur,Sing for English), then drop the feature entirely).

Okay, thanks. So, there's no distinction between "none of the above" and "any of the above", right?

Here's my current pronouns table for English:

https://github.com/honnibal/spaCy/blob/master/lang_data/en/morphs.json

I'm working on the auxiliaries now.

dan-zeman · 2015-09-24T14:07:53Z

mine ... English is not my native language and it is quite possible that I am using it wrongly. I thought I could say something like: Your car is green. Mine is red. Is that ungrammatical?

honnibal · 2015-09-24T14:23:22Z

Ah --- yeah, of course.

fginter · 2015-09-24T14:26:45Z

All cases in UD English are here

honnibal · 2015-10-03T02:21:26Z

@fginter --- Thanks!!! That search engine is fantastically helpful.

manning · 2015-10-16T21:59:06Z

For English, we have a translator from Penn Treebank trees to UPOS tags: https://github.com/stanfordnlp/CoreNLP/blob/master/data/edu/stanford/nlp/upos/ENUniversalPOS.tsurgeon . It's a Tsurgeon translation file. Unlike for Google's universal POS, there are a number of instances where a translation from XPOS just isn't possible without seeing syntactic context.

manning · 2015-11-15T00:07:29Z

Seems like basically all questions are answered, to the extent they will be ... so closing this.

This was referenced Oct 3, 2015

TODO: Finish English morphology table, in lang_data/en/morphs.json , to move forward on multi-lingual explosion/spaCy#124

Closed

Use universal dependencies for cross-lingual consistency explosion/spaCy#128

Closed

manning added the labels question, UPOS (Universal part-of-speech tags: definitions and examples), and features on Oct 27, 2015

manning mentioned this issue Nov 8, 2015

Change naming in CoNLL-U: CPOSTAG --> UPOSTAG #229

Closed

manning added this to the lg-specific v1.2 milestone Nov 15, 2015

manning closed this as completed Nov 15, 2015
