Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
The majority of the languages in spaCy rely on a lookup for lemmatization. That means most language models contain a huge static file (usually called lookup.py or lemmatizer.py) that maps every word in the language to its lemma. The author believes this approach does not scale. The following section explains why.
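To make the lookup approach concrete, here is a minimal sketch; the table below is a tiny illustrative excerpt, not the actual contents of any spaCy lookup file:

```python
# A minimal sketch of lookup-based lemmatization; the table is a
# tiny illustrative excerpt, not real spaCy data.
LOOKUP = {
    "plays": "play",
    "played": "play",
    "playing": "play",
}

def lemmatize(word):
    # Every inflected form must be a key in the table,
    # otherwise the word is returned unchanged.
    return LOOKUP.get(word, word)

print(lemmatize("played"))  # play
print(lemmatize("plane"))   # plane (unknown words fall through unchanged)
```

The crucial point is that every single inflected form of every word needs its own entry, which is exactly what blows up for a morphologically rich language.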
Reasons NOT to use a lookup for lemmatization
For a language like Greek, a lookup approach to lemmatization is almost impossible. There are plenty of reasons for this:
Verb forms in Greek are dependent on the tense.
Let me guess your first reaction: so what? English verb forms depend on tense too. For example, the verb "play" appears in two tense forms: "play, played". However, the same verb in Greek, "παίζω", appears in the following tense forms: "παίζω, έπαιζα, έπαιξα, παίξω". That's double, right? Assuming Greek has roughly the same number of verbs as English, this alone leads to a 2x increase in the size of the verb section of the lookup. And we have just started.
Verb forms in Greek are dependent on grammatical persons.
In English it goes like this:
I/You/We/They play, He/She/It plays.
The situation in Greek language is much more complicated:
Εγώ παίζω, Εσύ παίζεις, Αυτός/Αυτή/Αυτό παίζει, Εμείς παίζουμε, Εσείς παίζετε, Αυτοί παίζουν.
You spot the difference, right? A single verb causes roughly a 3x increase in the verb section of the lookup, again assuming Greek has as many verbs as English. Combined with the first bullet, verb peculiarities alone lead to roughly a 6x increase.
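In lookup terms, the two bullets above translate into entries like these. The Greek forms are real, but the tables are hypothetical excerpts for illustration only:

```python
# Illustrative lookup entries (not taken from any real file) showing how a
# single Greek verb multiplies rows across tenses and grammatical persons.
GREEK_VERB_ENTRIES = {
    # present tense, persons other than first singular
    "παίζεις": "παίζω", "παίζει": "παίζω",
    "παίζουμε": "παίζω", "παίζετε": "παίζω", "παίζουν": "παίζω",
    # other tense forms
    "έπαιζα": "παίζω", "έπαιξα": "παίζω", "παίξω": "παίζω",
}

ENGLISH_VERB_ENTRIES = {"plays": "play", "played": "play", "playing": "play"}

# The single Greek verb already needs 8 entries against 3 for English.
print(len(GREEK_VERB_ENTRIES), len(ENGLISH_VERB_ENTRIES))  # 8 3
```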
Noun/Adjective forms in Greek are dependent on grammatical case (nominative/genitive/accusative/vocative).
In English, nouns are not case-dependent; they only change form from singular to plural. In Greek, however, nouns change with case. For example, the noun "γήπεδο" appears in forms like "γήπεδο, γηπέδου, γηπέδων", and the noun "ναυμαχία" in forms like "ναυμαχία, ναυμαχίας, ναυμαχίες, ναυμαχιών". This complicates things a lot and can more than double the noun section of the lookup table.
Adjective forms in Greek are dependent on the gender.
An English speaker may find this hard to grasp. In English, a single sentence covers all three genders: "he/she/it is beautiful". In Greek, however, we have three separate sentences:
Αυτός είναι όμορφος.
Αυτή είναι όμορφη.
Αυτό είναι όμορφο.
This causes a 3x increase in the adjective section of the lookup compared to the English lookup table.
Adjective forms in Greek differ in singular and plural form.
You are probably used to saying "You (singular)/You (plural) are beautiful", but in Greek these would be two different sentences in terms of the adjective form.
In a way, adjectives in Greek behave like nouns, which means that their plural form differs from their singular form.
Adjective forms in Greek are dependent on the cases.
It's the same as bullet 3, but for adjectives.
How to create a lookup table
If you are still not convinced that a lookup is a poor approach to lemmatization, or if you are simply curious how to create one, this section is for you.
First, you need to collect a list of all the lemmas in the language. Such a list can be assembled by parsing Wiktionary. If you don't want to do this yourself, or don't know how, that's not a problem: we have already collected a list of lemmas grouped by part-of-speech tag. You can navigate here and find the related files.
After that, collect a large list of Greek words. You can create this list manually (from a Wikipedia dump, perhaps) or use an existing one, like the one that can be found here. Then run a stemmer over the list and group words that share a stem. For each group, find which of its words is a lemma and create lookup entries mapping the rest of the group to it. Finally, clean the lookup (review and remove wrong entries, add some exceptions, etc.) and voilà, your lookup is ready.
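The grouping step could be sketched as follows. Note that `stem` here is a deliberately naive placeholder, not a real Greek stemmer, and the word and lemma lists are tiny samples standing in for the real data:

```python
from collections import defaultdict

# Naive placeholder stemmer: strips a few common verb endings.
# A real implementation would use a proper Greek stemmer.
def stem(word):
    for suffix in ("ουμε", "ετε", "ουν", "εις", "ει", "ω"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

LEMMAS = {"παίζω"}  # sample of the previously collected lemma list
words = ["παίζω", "παίζεις", "παίζει", "παίζουμε"]

# Group words by their common stem.
groups = defaultdict(list)
for w in words:
    groups[stem(w)].append(w)

# For each group, map every non-lemma form to the lemma found in the group.
lookup = {}
for group in groups.values():
    lemma = next((w for w in group if w in LEMMAS), None)
    if lemma:
        for w in group:
            if w != lemma:
                lookup[w] = lemma
```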
A 22MB lookup produced for the Greek language using this approach can be found here.
Rule based approach
A rule-based approach to lemmatization was introduced in the English language model, and the same logic has been transferred to the Greek language model. The steps needed to achieve rule-based lemmatization with spaCy are the following:
- Collect (normal) lemmas for each POS tag. For the Greek language, have a look here.
- Collect exception lemmas for each POS tag. For the Greek language, have a look here.
- Write rules for each POS tag that transform the suffix of a token so that the token becomes a lemma. If that sounds obscure, don't worry, here it is in more detail. Rule-based lemmatization in spaCy works as follows: given a token, it finds the token's POS tag and tries to replace a suffix of the word with another suffix in order to convert it into a lemma of the same POS tag. So the idea is to notice the patterns (for example, lemma adjectives in Greek often end in "-δης", e.g. "πνευματώδης", while adjectives that are not lemmas often end in "-δεις", e.g. "πνευματώδεις") and write appropriate mappings between the suffixes (for example, a rule for adjectives that maps "-δεις" to the lemma suffix "-δης"). Lemmatization rules for the Greek language can be found here.
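The steps above can be sketched as follows. The data structures mirror the index/exception/rule split described in the list, but the actual rules, exceptions, and lemmas shown here are tiny illustrative stand-ins, not the real Greek data:

```python
# Tiny illustrative stand-ins for the real exception/rule/index files.
LEMMA_EXC = {"adj": {}}                   # irregular word -> lemma
LEMMA_RULES = {"adj": [["δεις", "δης"]]}  # suffix rewrites per POS tag
LEMMA_INDEX = {"adj": {"πνευματώδης"}}    # known lemmas per POS tag

def lemmatize(word, pos):
    # Exceptions win over rules.
    if word in LEMMA_EXC[pos]:
        return LEMMA_EXC[pos][word]
    # Try each suffix rewrite; accept the first candidate that is a known lemma.
    for old, new in LEMMA_RULES[pos]:
        if word.endswith(old):
            candidate = word[: -len(old)] + new
            if candidate in LEMMA_INDEX[pos]:
                return candidate
    return word  # no rule applied: return the word unchanged

print(lemmatize("πνευματώδεις", "adj"))  # πνευματώδης
```

Because candidates are validated against the lemma index, a rule that happens to match an unrelated word does no harm: the bogus candidate is simply rejected.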
Exceptions and rules completely define the rule-based lemmatization procedure, which is more scalable, more memory-efficient, and less painful than the traditional lookup approach.
We have overridden spaCy's default Lemmatizer. We kept the same logic as the default lemmatizer but wrote our own in order to optimize it for the Greek language.
The most important change is that, given an input word, it first checks whether the word is already a lemma, and applies the transformation rules only if it is not.
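A minimal sketch of that check, using illustrative names rather than the actual implementation:

```python
# Illustrative lemma set; the real one comes from the Greek lemma index files.
LEMMA_INDEX = {"πνευματώδης", "όμορφος"}

def lemmatize(word, rules):
    if word in LEMMA_INDEX:
        return word  # already a lemma: skip the rules entirely
    # Otherwise, fall back to the usual suffix-rewriting rules.
    for old, new in rules:
        if word.endswith(old):
            candidate = word[: -len(old)] + new
            if candidate in LEMMA_INDEX:
                return candidate
    return word
```

Skipping the rules for words that are already lemmas avoids needless suffix rewrites and protects lemmas whose endings happen to match a rule.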