TODO: Finish English morphology table, in lang_data/en/morphs.json, to move forward on multi-lingual #124

Closed
honnibal opened this issue Oct 3, 2015 · 2 comments
Labels
enhancement Feature requests and improvements

Comments

@honnibal
Member

honnibal commented Oct 3, 2015

The English morphology table needs to be finished, in the language-independent format. This is blocking multi-lingual development, because we need to have the English data in standard format to serve as an example reference.

See here: UniversalDependencies/docs#212
And here: http://spacy.io/tutorials/add-a-language/

Documentation on morphological scheme
http://universaldependencies.github.io/docs/

Useful search tool for reference: http://bionlp-www.utu.fi/dep_search/?db=English&search=Mine%7Cmine+%3Cnsubj+_

The morphological tables and lemmatization are the main pieces of language-specific work that need to be completed to support each new language. The code has now been updated to be language neutral.

Some morphologically rich languages may challenge the current architecture, and require exceptional processing. The code is set up to allow each component to be subclassed easily, to support special-case pipelines.

Linguistic Preliminaries

Inflectional morphology is the process by which the root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part of speech.

We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form.

Examples:

Context: I was reading the paper
Lemma: read
Part of Speech: Verb
Morphological Features: VerbForm=Ger
Surface Form: reading

Context: I don't watch the news, I read the paper.
Lemma: read
Part of Speech: Verb
Morphological Features: VerbForm=Fin, Mood=Ind, Tense=Pres
Surface Form: read

Context: I read the paper yesterday
Lemma: read
Part of Speech: Verb
Morphological Features: VerbForm=Fin, Mood=Ind, Tense=Past
Surface Form: read

Note that the same lemma with different morphological features can be expressed by the same surface form, especially in writing: the past tense of "read" is pronounced differently from the base form and the present tense, but is written identically.
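To make the ambiguity concrete, here is a minimal sketch that stores the three analyses above as plain tuples and groups them by surface form. The names `ANALYSES`, `parse_features` and `group_by_surface` are made up for illustration and are not part of spaCy. The pipe-separated `Key=Value` feature string is an assumption borrowed from the Universal Dependencies CoNLL-U convention, not something this issue specifies.

```python
from collections import defaultdict

# (lemma, POS, features, surface form), copied from the three examples above.
# The pipe-separated feature string is an assumed notation (UD CoNLL-U style).
ANALYSES = [
    ("read", "VERB", "VerbForm=Ger", "reading"),
    ("read", "VERB", "VerbForm=Fin|Mood=Ind|Tense=Pres", "read"),
    ("read", "VERB", "VerbForm=Fin|Mood=Ind|Tense=Past", "read"),
]


def parse_features(feats):
    """Split a 'Key=Value|Key=Value' string into a dict."""
    if not feats:
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def group_by_surface(analyses):
    """Map each surface form to every (lemma, POS, features) that produces it."""
    by_surface = defaultdict(list)
    for lemma, pos, feats, surface in analyses:
        by_surface[surface].append((lemma, pos, parse_features(feats)))
    return dict(by_surface)
```

Calling `group_by_surface(ANALYSES)` yields two distinct analyses under the key "read", which is exactly the case the tagger has to resolve from context.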

Some languages are more regular than others in how surface forms are constructed from lemmas and morphemes. As far as I'm aware, all languages have at least a few irregular words, where the surface forms must be enumerated, and cannot be generated by the usual process.

Morphology in spaCy

spaCy's morphological processing proceeds as follows:

  • During tokenization, the tokenizer consults a mapping table specials.json, which allows sequences of characters to be mapped to multiple tokens. Each token may be assigned a part of speech and one or more morphological features.
  • The part-of-speech tagger then assigns each token an extended POS tag. In the API, these tags are known as .tag. For now we will call these XPOS. An XPOS tag expresses the part-of-speech (e.g. VERB) and some amount of morphological information, e.g. that the verb is in the ING form. The set of XPOS tags differs by language, and represents a compromise between desired detail and economy of representation.
  • A mapping table morphs.json is then consulted, which maps (surface form, XPOS) to (lemma, POS, morphological features). This table allows exceptional cases to be handled, where the specific surface form and XPOS return a result that can't be captured by generalized rules.
  • For words whose POS is not set by a prior process, a mapping table tag_map.json maps the XPOS to (POS, morphological features).
  • Finally, a rule-based deterministic lemmatizer maps (surface form, XPOS, POS, morphological features) to (surface form, XPOS, lemma, POS, morphological features), without consulting the context of the token. Currently the lemmatizer also accepts list-based exception files, acquired from WordNet, for words whose surface form-to-lemma mapping is unpredictable, but where the XPOS --> (POS, morphological features) mapping is predictable. This should be changed --- these exceptions should be moved into the morphs.json as well.
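The lookup order described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not spaCy's actual code: the in-memory tables `MORPHS`, `TAG_MAP` and `SUFFIX_RULES`, their layout, and the functions `lemmatize` and `analyze` are all hypothetical, and the tokenizer's specials.json step is left out. The real JSON files may be structured differently.

```python
# Hypothetical miniature tables; the real files' layout may differ.

# Exceptions: (surface form, XPOS) -> (lemma, POS, features)
MORPHS = {
    ("was", "VBD"): ("be", "VERB", {"VerbForm": "Fin", "Mood": "Ind", "Tense": "Past"}),
    ("better", "JJR"): ("good", "ADJ", {"Degree": "Cmp"}),
}

# General mapping: XPOS -> (POS, features)
TAG_MAP = {
    "VBG": ("VERB", {"VerbForm": "Ger"}),
    "VBD": ("VERB", {"VerbForm": "Fin", "Mood": "Ind", "Tense": "Past"}),
    "NNS": ("NOUN", {"Number": "Plur"}),
}

# Toy suffix rules: POS -> list of (suffix, replacement)
SUFFIX_RULES = {
    "VERB": [("ing", ""), ("ed", "")],
    "NOUN": [("s", "")],
}


def lemmatize(surface, pos):
    """Apply the first matching suffix rule; fall back to the surface form."""
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if surface.endswith(suffix) and len(surface) > len(suffix):
            return surface[: len(surface) - len(suffix)] + replacement
    return surface


def analyze(surface, xpos):
    """Resolve (surface, XPOS) to lemma, POS and features, exceptions first."""
    exception = MORPHS.get((surface, xpos))
    if exception is not None:
        lemma, pos, feats = exception
    else:
        # No exception entry: take POS and features from the tag map,
        # then let the context-free rule lemmatizer produce the lemma.
        pos, feats = TAG_MAP.get(xpos, ("X", {}))
        lemma = lemmatize(surface.lower(), pos)
    return {"surface": surface, "xpos": xpos, "lemma": lemma,
            "pos": pos, "feats": dict(feats)}
```

With these toy tables, `analyze("was", "VBD")` is answered entirely by the exception table, while `analyze("reading", "VBG")` falls through to the tag map and the suffix rules. Keeping the exception lookup first is what lets irregular forms override the general rules.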

Previously in spaCy

At launch, spaCy used an ad hoc set of morphological features, optimized for English. A small set of union features were added. This was a placeholder for a more formal scheme.

Since I developed that in 2014, the Universal Dependencies and Interset projects have continued, with impressive results. spaCy is now adopting their morphological schema. (Eventually we should also adopt the UD syntax, at first as an alternate representation, and eventually as the standard form. We will probably want to learn an intermediate representation and map into UD as a post-process.)

What needs to be done

  • Finish the English morphs.json, found in lang_data/en/morphs.json. The pronouns are reasonably complete. We need the auxiliaries next, and the exceptional comparative and superlative adjectives (e.g. better, best). Finally, the WordNet exception files (noun.exc, verb.exc, etc.) should be merged into morphs.json.
  • Create a German morphs.json, currently stubbed out in lang_data/de/morphs.json.
  • Update the lemmatizer to avoid using the old exception files, and to allow more flexible rule writing.
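As a rough starting point for the merge step, the sketch below folds WordNet-style exception lines into a morphs-like dictionary. Several things here are assumptions: that each .exc line has the form "inflected base" separated by whitespace, that the table is nested as XPOS -> surface form -> entry, and that the entry uses a "lemma" key. The function names are hypothetical. Since the .exc files carry no XPOS information, the caller has to pass in the candidate tags, and each entry is added under every one of them, which is crude and would need manual review.

```python
def parse_exc_lines(lines):
    """Yield (surface, lemma) pairs from 'inflected base [base ...]' lines."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            # Only the first base form is kept in this sketch.
            yield parts[0], parts[1]


def merge_exceptions(morphs, exc_lines, xpos_tags):
    """Add exception lemmas under each candidate XPOS tag.

    Existing entries are left untouched, so hand-written data wins.
    The nested layout (XPOS -> surface -> {"lemma": ...}) is an assumption.
    """
    for surface, lemma in parse_exc_lines(exc_lines):
        for xpos in xpos_tags:
            entry = morphs.setdefault(xpos, {}).setdefault(surface, {})
            entry.setdefault("lemma", lemma)
    return morphs
```

For example, merging the line "ate eat" with candidate tags `["VBD", "VBN"]` adds an "ate" entry under both tags without overwriting anything already present in the table.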
@dpk

dpk commented Dec 1, 2015

Will there be German support shipping with spaCy soon, then? That would be very good for me!

@honnibal honnibal added the enhancement Feature requests and improvements label Jan 18, 2016
@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018