Mapping tables that go (Form, Treebank POS) --> (Lemma, UD POS, Morphological features) #212
Comments
Hi Matthew, Can't speak for other languages and your question might be specifically targeting English, but FWIW this is what holds for Finnish:
Best, Filip
Thanks. I'm targeting multi-lingual. I'm hoping the simpler mapping will work for most languages, although I know some languages might need to work differently. How does your lemmatization work? Do you go
Finnish lemmatization uses the two-level morphological analyzer
We were working on this for Kazakh at the TurkLang conference last week. Here is the spreadsheet we've been writing to be able to have a consistent mapping between our two standards (KNC, Apertium) and UD. Note that we conceive the mapping as unidirectional. As Filip mentioned in the worst case you do need
@honnibal : These tables of yours could be quite useful and they can obviously improve over Interset. But as others mentioned, even with lemma it cannot be perfect. Beyond that, it is probably easier to write scripts than tables, and to consider the tree structure (that's what I do). As for your specific questions, I would say that myself is On the other hand, we have not settled on the scale between form and function as criteria for morphological (and POS) distinctions. For instance the German STTS tagset is very context-sensitive: the nouns inflect for case quite rarely (and usually the case is determined by the article), yet every noun has one of the four case values assigned. So if you are willing and able to disambiguate, you could distinguish between you that is
Thanks all. It sounds like I'll need an additional process to add further morphological features after parsing for many languages, which I hadn't thought about.
myself makes sense as
Okay, thanks. So, there's no distinction between "none of the above" and "any of the above", right? Here's my current pronouns table for English: https://github.com/honnibal/spaCy/blob/master/lang_data/en/morphs.json I'm working on the auxiliaries now.
mine ... English is not my native language and it is quite possible that I am using it wrongly. I thought I could say something like: Your car is green. Mine is red. Is that ungrammatical?
Ah --- yeah, of course.
All cases in UD English are here
@fginter --- Thanks!!! That search engine is fantastically helpful.
For English, we have a translator from Penn Treebank trees to UPOS tags: https://github.com/stanfordnlp/CoreNLP/blob/master/data/edu/stanford/nlp/upos/ENUniversalPOS.tsurgeon . It's a Tsurgeon translation file. Unlike for Google's universal POS, there are a number of instances where a translation from XPOS just isn't possible without seeing syntactic context.
Seems like basically all questions are answered, to the extent they will be ... so closing this.
Hi all,
I've been working on moving spaCy (an NLP pipeline, http://spacy.io ) to the UD scheme for some time now. I'm currently trying to produce some mapping files, which I think might also be of use to others. Some of what I'm doing might exist already --- if so, I'd appreciate pointers to the resources :). If not, I'm looking for a little help in making sure I'm applying the annotation schemes correctly, particularly the morphological scheme.
First, as a point of terminology, is there a standard way to describe language/treebank-specific POS schemes, e.g. the VBZ, NNS, etc. scheme used in the PTB? For now, I'll call these tags "XPOS", for "extended POS tags". I'll reserve the term "POS tag" for one of the 17 UD POS tags.
As a second point of terminology, I'll call the text-field of an inflected token an "orthographic form" (as opposed to a lemma).
Often an XPOS tag maps to a single POS, and zero or more morphological features. I've found useful mapping tables like this in the Interset project. For my purposes, I'm formatting these as JSON files. I call this a "Tag Map": https://github.com/honnibal/spaCy/blob/master/lang_data/en/tag_map.json . Example entry:
Sometimes the XPOS doesn't map cleanly, e.g. the TO tag in the Penn Treebank. So I want to have another mapping file, which is keyed by an XPOS and an orthographic form. This is also necessary for exceptional forms in the language, which encode additional morphological features in their orthographic form that are not captured by their XPOS. Personal pronouns seem a common example. An example entry:
(Question: What's the favoured lemma for personal pronouns? I've gone with the special value "-PRON-" above, but it'd be nice to align this with what other people are doing.)
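The two-level lookup described above (a tag map keyed by XPOS, overridden by a table keyed by XPOS plus orthographic form) might be sketched in Python roughly as follows. This is only an illustration under assumptions: the entries, the dictionary layout, and the `analyse` helper are hypothetical and are not copied from the linked spaCy files; the fallback to the UD tag `X` for unknown XPOS values is likewise an arbitrary choice.

```python
# Hypothetical entries for illustration only; not taken from spaCy's lang_data.
# Level 1: XPOS -> UD POS plus zero or more morphological features.
TAG_MAP = {
    "NNS": {"pos": "NOUN", "Number": "Plur"},
    "VBZ": {"pos": "VERB", "VerbForm": "Fin", "Tense": "Pres",
            "Number": "Sing", "Person": "3"},
    "PRP": {"pos": "PRON", "PronType": "Prs"},
}

# Level 2: (XPOS, lower-cased orthographic form) -> extra or overriding values.
EXCEPTIONS = {
    ("PRP", "i"): {"lemma": "-PRON-", "Person": "1",
                   "Number": "Sing", "Case": "Nom"},
    ("PRP", "me"): {"lemma": "-PRON-", "Person": "1",
                    "Number": "Sing", "Case": "Acc"},
}


def analyse(form, xpos):
    """Map (form, XPOS) to (lemma, UD POS, features).

    The lemma is None when no exception entry supplies one; a real
    pipeline would call a lemmatizer at that point.
    """
    # Start from the tag-level entry; unknown tags fall back to UD "X".
    entry = dict(TAG_MAP.get(xpos, {"pos": "X"}))
    # Form-specific values are layered on top of the tag-level ones.
    entry.update(EXCEPTIONS.get((xpos, form.lower()), {}))
    lemma = entry.pop("lemma", None)
    pos = entry.pop("pos")
    return lemma, pos, entry
```

For example, `analyse("I", "PRP")` combines `PronType` from the tag map with the person, number and case features from the exception entry, while `analyse("dogs", "NNS")` uses the tag map alone.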
The token/XPOS pair will still fail to map cleanly in some cases. Where it doesn't, my idea is simply to transform the XPOS scheme given the treebank parse. The idea is to see the XPOS as an arbitrary detail of the statistical modelling: a place to jointly predict the part-of-speech and morphology, which rule-based post-processing then transforms into the UD format.
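As one small piece of that post-processing, the final step of writing features out in UD format could look something like the sketch below. It assumes the features are held in a plain dict, and the `format_feats` name is hypothetical. It follows the CoNLL-U convention of `Name=Value` pairs joined by `|`, sorted alphabetically by feature name, with `_` for an empty set.

```python
def format_feats(feats):
    """Serialise a feature dict into a CoNLL-U style FEATS string."""
    if not feats:
        # CoNLL-U uses an underscore for an empty feature set.
        return "_"
    # Sort by feature name, ignoring case, then join as Name=Value pairs.
    pairs = sorted(feats.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs)
```

With this helper, a dict such as `{"Person": "1", "Case": "Nom"}` is written with `Case` before `Person`, whatever order the keys were inserted in.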
Okay! Finally, the questions. First, three general questions:
Now some specifics:
Sorry this was so long!
Thanks,
Matthew Honnibal