TODO: Finish English morphology table, in lang_data/en/morphs.json , to move forward on multi-lingual #124

honnibal · 2015-10-03T03:19:23Z

The English morphology table needs to be finished, in the language-independent format. This is blocking multi-lingual development, because we need to have the English data in standard format to serve as an example reference.

See here: UniversalDependencies/docs#212
And here: http://spacy.io/tutorials/add-a-language/

Documentation on morphological scheme
http://universaldependencies.github.io/docs/

Useful search tool for reference: http://bionlp-www.utu.fi/dep_search/?db=English&search=Mine%7Cmine+%3Cnsubj+_

The morphological tables and lemmatization are the main pieces of language-specific work that need to be completed to support each new language. The code has now been updated to be language neutral.

Some morphologically rich languages may challenge the current architecture, and require exceptional processing. The code is set up to allow each component to be subclassed easily, to support special-case pipelines.

Linguistic Preliminaries

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part of speech.

We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form.

Examples:

Context: I was reading the paper
Lemma: read
Part of Speech: Verb
Morphological Features: VerbForm=Ger
Surface Form: reading

Context: I don't watch the news, I read the paper.
Lemma: read
Part of Speech: Verb
Morphological Features: VerbForm=Fin, Mood=Ind, Tense=Pres
Surface Form: read

Context: I read the paper yesterday
Lemma: read
Part of Speech: Verb
Morphological Features: VerbForm=Fin, Mood=Ind, Tense=Past
Surface Form: read

Note that the same lemma and different morphological features can be expressed by the same surface form, especially when written (the past tense of "read" is distinct in speech from the base form and present, but is written identically).
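
To make the ambiguity concrete, the three examples above can be written down as plain records and grouped by surface form. This is only an illustrative sketch: the `Analysis` type and `group_by_surface` helper are hypothetical names invented here, not part of spaCy.

```python
from collections import defaultdict
from typing import NamedTuple


class Analysis(NamedTuple):
    """One (lemma, POS, features) -> surface form pairing. Hypothetical type."""
    lemma: str
    pos: str
    features: frozenset
    surface: str


# The three examples from the text above.
ANALYSES = [
    Analysis("read", "VERB", frozenset({"VerbForm=Ger"}), "reading"),
    Analysis("read", "VERB",
             frozenset({"VerbForm=Fin", "Mood=Ind", "Tense=Pres"}), "read"),
    Analysis("read", "VERB",
             frozenset({"VerbForm=Fin", "Mood=Ind", "Tense=Past"}), "read"),
]


def group_by_surface(analyses):
    """Collect every analysis that is written with the same surface form."""
    by_surface = defaultdict(list)
    for analysis in analyses:
        by_surface[analysis.surface].append(analysis)
    return dict(by_surface)


# Surface forms that correspond to more than one feature bundle.
ambiguous = {
    surface: items
    for surface, items in group_by_surface(ANALYSES).items()
    if len(items) > 1
}
```

Here only the written form "read" ends up in `ambiguous`, since it covers both the present and the past analysis, which is why the surface form alone cannot determine the features.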

Some languages are more regular than others in how surface forms are constructed from lemmas and morphemes. As far as I'm aware, all languages have at least a few irregular words, where the surface forms must be enumerated, and cannot be generated by the usual process.

Morphology in spaCy

spaCy's morphological processing proceeds as follows:

1. During tokenization, the tokenizer consults a mapping table, specials.json, which allows sequences of characters to be mapped to multiple tokens. Each token may be assigned a part of speech and one or more morphological features.
2. The part-of-speech tagger then assigns each token an extended POS tag. In the API, these tags are known as .tag. For now we will call these XPOS. An XPOS tag expresses the part of speech (e.g. VERB) and some amount of morphological information, e.g. that the verb is in the ING form. The set of XPOS tags differs by language, and represents a compromise between desired detail and economy of representation.
3. A mapping table, morphs.json, is then consulted, which maps (surface form, XPOS) to (lemma, POS, morphological features). This table handles exceptional cases, where the specific surface form and XPOS return a result that can't be captured by generalized rules.
4. For words whose POS is not set by a prior process, a mapping table, tag_map.json, maps the XPOS to (POS, morphological features).
5. Finally, a rule-based deterministic lemmatizer maps (surface form, XPOS, POS, morphological features) to (surface form, XPOS, lemma, POS, morphological features), without consulting the context of the token. Currently the lemmatizer also accepts list-based exception files, acquired from WordNet, for words whose surface form-to-lemma mapping is unpredictable, but where the XPOS --> (POS, morphological features) mapping is predictable. This should be changed: these exceptions should be moved into morphs.json as well.
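
The lookup order in the last three steps can be sketched roughly as follows. Everything here is an assumption made for illustration: the tables are tiny in-memory stand-ins whose layout may not match the real morphs.json and tag_map.json, the suffix rules are invented, and `analyze` and `lemmatize` are hypothetical function names rather than spaCy's API.

```python
# Stand-in for morphs.json: (surface form, XPOS) -> (lemma, POS, features).
# Hypothetical entries; the real file layout may differ.
MORPHS = {
    ("better", "JJR"): ("good", "ADJ", {"Degree": "Cmp"}),
    ("mice", "NNS"): ("mouse", "NOUN", {"Number": "Plur"}),
}

# Stand-in for tag_map.json: XPOS -> (POS, features).
TAG_MAP = {
    "VBG": ("VERB", {"VerbForm": "Ger"}),
    "NNS": ("NOUN", {"Number": "Plur"}),
    "JJR": ("ADJ", {"Degree": "Cmp"}),
}

# Invented suffix rules per POS, tried in order: (suffix, replacement).
SUFFIX_RULES = {
    "VERB": [("ing", ""), ("ed", "")],
    "NOUN": [("ies", "y"), ("s", "")],
    "ADJ": [("er", "")],
}


def lemmatize(surface, pos):
    """Deterministic, context-free lemmatizer: apply the first matching rule."""
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        # Require a stem of at least two characters before the suffix.
        if surface.endswith(suffix) and len(surface) > len(suffix) + 1:
            return surface[: len(surface) - len(suffix)] + replacement
    return surface


def analyze(surface, xpos):
    """Return (lemma, POS, features) for a token already tagged with XPOS."""
    key = (surface.lower(), xpos)
    if key in MORPHS:
        # Exceptional case: the table supplies everything directly.
        return MORPHS[key]
    # Regular case: the tag map gives POS and features, rules give the lemma.
    pos, features = TAG_MAP[xpos]
    return lemmatize(surface.lower(), pos), pos, dict(features)
```

With these toy tables, `analyze("mice", "NNS")` is answered by the exception table, while `analyze("reading", "VBG")` falls through to the tag map and the suffix rules. The point of the ordering is that irregular forms never reach the generalized rules.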

Previously in spaCy

At launch, spaCy used an ad hoc set of morphological features, optimized for English. A small set of union features were added. This was a placeholder for a more formal scheme.

Since I developed that in 2014, the Universal Dependencies and Interset projects have continued, with impressive results. spaCy is now adopting their morphological schema. (Eventually we should also adopt the UD syntax, at first as an alternate representation, and eventually as the standard form. We will probably want to learn an intermediate representation and map into UD as a post-process.)

What needs to be done

1. Finish the English morphs.json, found in lang_data/en/morphs.json. The pronouns are reasonably complete. We need the auxiliaries next, and exceptional comparative and superlative adjectives (e.g. better, best). Finally, the noun.exc, verb.exc etc. files from WordNet should be merged into morphs.json.
2. Create a German morphs.json, currently stubbed out in lang_data/de/morphs.json.
3. Update the lemmatizer to avoid using the old exception files, and to allow more flexible rule writing.
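
As a rough sketch of the merge mentioned in the first item: WordNet's exception lists are plain text with one inflected form per line followed by one or more base forms. The code below parses that format and adds entries to an in-memory table keyed by (surface form, XPOS). The table layout, the `parse_exc` and `merge_exceptions` names, and the `guess_xpos` callback are all hypothetical. In particular, the exception files do not record an XPOS tag, so how to assign one is an open question that this sketch leaves to the caller.

```python
def parse_exc(text):
    """Parse WordNet exception-list text: 'inflected base [base ...]' per line."""
    pairs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # Keep only the first listed base form (a simplifying assumption).
            pairs[fields[0]] = fields[1]
    return pairs


def merge_exceptions(morphs, exc_pairs, pos, guess_xpos):
    """Return a copy of morphs with missing exception entries added.

    morphs maps (surface, XPOS) -> (lemma, POS, features). Existing entries
    win over the exception list. Features are left empty here, on the
    assumption that the tag map supplies them later.
    """
    merged = dict(morphs)
    for surface, lemma in exc_pairs.items():
        xpos = guess_xpos(surface, lemma)
        merged.setdefault((surface, xpos), (lemma, pos, {}))
    return merged


# Example with a two-line excerpt in noun.exc format.
noun_exc = "geese goose\nmice mouse\n"
existing = {("mice", "NNS"): ("mouse", "NOUN", {"Number": "Plur"})}
merged = merge_exceptions(
    existing,
    parse_exc(noun_exc),
    pos="NOUN",
    guess_xpos=lambda surface, lemma: "NNS",  # naive: treat every entry as plural
)
```

Hand-written entries take priority over imported ones, so merging the exception files would not overwrite work already done on the table.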


dpk · 2015-12-01T19:51:14Z

Will there be German support shipping with spaCy soon, then? That would be very good for me!

lock · 2018-05-09T10:12:11Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

This was referenced Oct 3, 2015

Additional Language Support #15

Closed

Is there any interest in adding new languages? #134

Closed

honnibal added the enhancement Feature requests and improvements label Jan 18, 2016

honnibal closed this as completed Oct 20, 2016

lock bot locked as resolved and limited conversation to collaborators May 9, 2018
