Google Summer of Code 2018 Project - spaCy now speaks Greek
Welcome to the home repository of Greek language integration for spaCy.
This project is developed for Google Summer of Code 2018, under the auspices of GFOSS - Open Technologies Alliance.
- Project links
- Problem Statement
- Sentence splitter
- Stop words
- Norm exceptions
- Named Entities dataset
- Lexical attributes
- Part of Speech Tagger
- NER Tagger
- Noun chunks
- Sentiment Analyzer
- Topic classifier
- Future work
A day-by-day timeline tracks all the progress made during Google Summer of Code 2018. You can view the timeline here.
There is also a report page for the final evaluation for Google Summer of Code 2018. You can view the report page here.
Most important is the project Wiki, which documents every aspect of adding the Greek language to spaCy. You can view the project Wiki here.
There is also the NLPBuddy repository. NLPBuddy is a side project that grew out of Google Summer of Code. It is built on top of spaCy, supports high-quality NLP features such as syntax analysis, emotion analysis and topic classification, and makes use of the Greek language support. You can find the repository here and the Wiki page of this demo here.
Problem statement and project goals
We live in the era of data. Every minute, 3.8 billion internet users produce content: more than 120 million emails, 500,000 Facebook comments and 3 million Google searches. To process that amount of data efficiently, we need to process natural language. Open source projects such as spaCy, textblob and NLTK contribute significantly in that direction and thus need to be reinforced.
This project is about improving the quality of Natural Language Processing for the Greek language.
The project goals can be categorized as follows:
- Addition of Greek language to spaCy. Status: Complete
- Production of models for Part-Of-Speech (POS) tagging, Dependency Analysis (DEP) and Named Entity Recognition (NER), with and without word vectors. Status: Complete
- An open-source text analysis tool (demo) in which everyone can perform common NLP tasks in 7 languages. Status: Complete.
- Bonus goal: Usage of the addition of Greek language for sentiment analysis and other challenging NLP tasks.
Addition of Greek language to spaCy
The Greek language has been successfully integrated into spaCy, which was the most important goal of the project.
There were two pull requests for this purpose: the first added the language, and the second contained important optimizations that made Greek support probably the most complete in terms of features after English.
Addition of the language: You can see the first pull request here.
Optimizations to the Greek language class: You can see the second pull request here.
Each part of the process of integrating the Greek language into spaCy is discussed in detail on the Wiki page of the project.
Greek language models
Two models for the Greek language have been produced. The process of uploading them to the spaCy releases is ongoing.
After that, you will be able to install them with the following commands:
python3 -m spacy download el_core_web_sm
python3 -m spacy download el_core_web_lg
The Greek language models support most of the capabilities that you will find in the deliverables section. Sentence splitting, tokenization, Part of Speech tagging, syntax analysis using DEP tags, Named Entity Recognition, lexical attribute extraction, norm exceptions and stop-word lists are all included in the Greek language models. The large Greek model (el_core_web_lg) includes word vectors, so it supports features such as similarity detection between texts. You can find more about model production, usage and maintenance on the models page of the wiki. Some visualizations from the models usage:
An open-source text analysis tool has been developed as a demonstration of the project results.
The demo leverages spaCy's capabilities to extract as much information as possible from raw text.
Try the demo yourself: https://nlpbuddy.io
Briefly, in this demo you can perform the following tasks with your text:
- Language identification (performed using the langid library).
- Text tokenization.
- Sentence splitting.
- Part of Speech tags identification.
- Named Entity Recognition (Location, Person, Organization).
- Text summarization (uses Gensim's implementation of the TextRank algorithm).
- Keywords extraction.
- For the Greek language:
- Text classification among the following categories: Sports, Science, World News, Greek News, Environment, Politics, Art, Health. The Greek classifier is built with FastText and is trained on 20,000 articles labeled with these categories. Accuracy reaches 90%.
- Text subjectivity analysis.
- Emotion analysis. It detects the main text emotion among the following emotions: Anger, Disgust, Fear, Happiness, Sadness, Surprise.
- Lexical attributes. Find numerals, URLs and email addresses.
- Noun chunks. Get the noun phrases of the text such as "the red bicycle".
Currently, it supports the features mentioned above for text in one of the following languages: Greek, English, German, Spanish, Portuguese, French, Italian and Dutch.
Note: All the functionality that the demo supports (and some more) is implemented as modules, so anybody can use them independently. Those modules are discussed extensively in the deliverables section. The central idea is that this Google Summer of Code project should produce results that will later be used by people all around the world. For that reason, my mentor, Markos Gogoulos, and I have implemented an API for the demo so anybody can access the results that it provides (see more here).
Improvements in spaCy
A side goal of the project is to empower spaCy itself.
There is an open dialogue with the creators of spaCy, whom we would like to thank for their continuous support and enthusiasm.
A pull request for documentation improvements was successfully merged.
The pull request was about a small error found in the spaCy documentation in the pseudocode provided for overriding the spaCy tokenizer.
You can see the pull request here.
I have been invited to write an article for the Explosion AI blog about the integration of the Greek language into spaCy, owing to the innovative approaches followed during Google Summer of Code 2018. The article is still being written and reviewed, and it may be published after the end of Google Summer of Code. A link to the post will be published here when it is ready.
In the process of integrating the Greek language into spaCy, some new approaches were followed. Hopefully, these approaches will inspire work on other languages too.
- Greek is the second language to follow a rule-based lemmatization procedure.
- There was no available data for training a NER classifier, so data had to be created. A fast procedure for annotating data with the Prodigy annotation tool is proposed for future reference.
Deliverables are independent functionality submodules and/or useful resources that were produced either while integrating the Greek language into spaCy or while experimenting with spaCy's functionality and implementing the demo.
A list of the deliverables and a short description of each of them follows. You can find the functionality submodules in the res/modules folder of the project repo (here), serving as examples for usage.
You can use this submodule with one of the produced Greek models to split your sentence(s) into tokens, independently of the other spaCy modules.
Sample input: Θέλω να μου σπάσεις αυτήν την πρόταση σε κομμάτια
Sample output: [Θέλω, να, μου, σπάσεις, αυτήν, την, πρόταση, σε, κομμάτια]
Submodule link.
This submodule lemmatizes sentences.
Τα σύμβολα του αγώνα.
Original token: Τα , Lemma: τα
Original token: σύμβολα , Lemma: σύμβολο
Original token: του , Lemma: του
Original token: αγώνα , Lemma: αγώνας
Original token: . , Lemma: .
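The rule-based idea can be illustrated with a minimal sketch. It is not the project's lemmatizer: the suffix rules and the small lemma index below are hypothetical and were chosen only to reproduce the sample above. Each rule rewrites a word ending, and a candidate is accepted only if it appears in the index of known lemmas.

```python
# Hypothetical suffix rules (old ending -> new ending); illustrative only.
SUFFIX_RULES = [("α", "ο"), ("α", "ας")]

# Hypothetical index of known lemmas, used to validate rule output.
LEMMA_INDEX = {"σύμβολο", "αγώνας"}


def lemmatize(word, rules=SUFFIX_RULES, index=LEMMA_INDEX):
    """Return the first rule-generated candidate that is a known lemma.

    Falls back to the lowercased word when no rule yields a known lemma.
    """
    word = word.lower()
    if word in index:
        return word
    for old, new in rules:
        if word.endswith(old):
            candidate = word[: len(word) - len(old)] + new
            if candidate in index:
                return candidate
    return word


for token in ["Τα", "σύμβολα", "του", "αγώνα", "."]:
    print(f"Original token: {token}, Lemma: {lemmatize(token)}")
```

Validating candidates against an index keeps a broad rule such as "α" to "ο" from producing non-words, which is why new rules can be added without rewriting existing ones.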
You can use this submodule with one of the produced Greek models to split a Greek text into sentences, independently of the rest of the spaCy modules.
Αυτή είναι μια πρόταση. Αυτή είναι μια δεύτερη πρόταση. Και αυτή μια τρίτη πρόταση.
[ Αυτή είναι μια πρόταση., Αυτή είναι μια δεύτερη πρόταση., Και αυτή μια τρίτη πρόταση.]
In computing, stop words are words which are filtered out before or after processing of natural language data. Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.
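As a minimal sketch of how a stop-word list is typically applied, the snippet below filters a token list against a small set. The set is illustrative only and is not the list shipped with the Greek language class; the function name remove_stop_words is likewise hypothetical.

```python
# Illustrative stop-word set; NOT the list shipped with the Greek language class.
STOP_WORDS = {"ο", "η", "το", "και", "να", "σε", "μου", "την"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["Θέλω", "να", "μου", "σπάσεις", "αυτήν", "την", "πρόταση", "σε", "κομμάτια"]
print(remove_stop_words(tokens))
```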
spaCy usually tries to normalise words with different spellings to a single, common spelling. This has no effect on any other token attributes, or tokenization in general, but it ensures that equivalent tokens receive similar representations. This can improve the model's predictions on words that weren't common in the training data, but are equivalent to other words – for example, "realize" and "realise", or "thx" and "thanks".
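Conceptually, norm exceptions behave like a lookup table from variant spellings to a common form. The sketch below uses the English pairs mentioned above; the table, the function name norm and the choice of which spelling counts as the common one are assumptions made for illustration, not spaCy's internal data.

```python
# Illustrative norm exceptions, based on the example pairs in the text above.
NORM_EXCEPTIONS = {"realise": "realize", "thx": "thanks"}


def norm(token, exceptions=NORM_EXCEPTIONS):
    """Return a normalised form of the token; the token itself is unchanged."""
    lowered = token.lower()
    return exceptions.get(lowered, lowered)


for word in ["thx", "Realise", "thanks"]:
    print(word, "->", norm(word))
```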
Named Entities dataset.
For the Greek language, there was no available dataset for Named Entities, so we had to create our own annotated dataset using Prodigy. The annotated dataset is available here. You can learn more about NER and Prodigy in the following links: Link 1, Link 2.
Lexical attributes functions.
Each token of a spaCy doc is checked against a set of potential attributes. In this way, URLs, numbers and other types of special tokens can be separated from the normal tokens.
Η ιστοσελίδα για το demo μας είναι: https://nlp.wordames.gr
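A simplified sketch of such checks follows. The function names mirror spaCy's like_url, like_email and like_num token attributes, but the regular expressions are deliberately naive stand-ins written for this example and are not the ones used by spaCy or by this project.

```python
import re

# Deliberately simple patterns; real URL and email matching is more involved.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def like_url(text):
    """True if the token looks like a URL."""
    return bool(URL_RE.match(text))


def like_email(text):
    """True if the token looks like an email address."""
    return bool(EMAIL_RE.match(text))


def like_num(text):
    """True for digit strings, allowing '.' and ',' as separators."""
    stripped = text.replace(",", "").replace(".", "")
    return stripped.isdigit()


sentence = "Η ιστοσελίδα για το demo μας είναι: https://nlp.wordames.gr"
for token in sentence.split():
    if like_url(token):
        print("URL:", token)
```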
- Part of Speech Tagger.
You can use this submodule with one of the produced Greek models to get part of speech tags for your tokens, independently of the other spaCy modules.
Η δημοκρατία είναι το πιο ανθρώπινο πολίτευμα.
Token: Η Tag: DET
Token: δημοκρατία Tag: NOUN
Token: είναι Tag: AUX
Token: το Tag: DET
Token: πιο Tag: ADV
Token: ανθρώπινο Tag: ADJ
Token: πολίτευμα Tag: NOUN
Token: . Tag: PUNCT
Visualized output using displaCy:
- DEP Tagger.
You can use this submodule with one of the produced Greek models to analyze the syntax of your text, independently of the other spaCy modules.
Get DEP tags.
Η δημοκρατία είναι το πιο ανθρώπινο πολίτευμα.
Token:η, DEP tag: det
Token:δημοκρατία, DEP tag: nsubj
Token:είναι, DEP tag: cop
Token:το, DEP tag: det
Token:πιο, DEP tag: advmod
Token:ανθρώπινο, DEP tag: amod
Token:πολίτευμα, DEP tag: ROOT
Token:., DEP tag: punct
Navigate/Visualize the DEP tree.
Ο Κώστας αγόρασε πατάτες και τις άφησε πάνω στο ψυγείο.
                  αγόρασε
   __________________|______
   |     |     |         άφησε
   |     |     |     ______|__________
   |     |  Κώστας   |    |    |  ψυγείο
   |     |     |     |    |    |     |
πατάτες  .     Ο    και  τις  πάνω  στο
Visualization code source.
- NER Tagger.
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
The Greek language models support the following NER tags: ORG, PERSON, LOC, GPE, EVENT, PRODUCT. With one of the Greek models, you can use the NER tagger:
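A hedged sketch of typical usage (not the submodule's own code): the helper below takes a loaded spaCy pipeline and returns each recognized entity with its label, using the standard doc.ents, ent.text and ent.label_ attributes. The helper name extract_entities is hypothetical, and the commented usage assumes that spaCy and the el_core_web_sm model named earlier are installed.

```python
def extract_entities(nlp, text):
    """Return (entity text, entity label) pairs found by the pipeline."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


# Usage, assuming spaCy and a Greek model are installed:
#   import spacy
#   nlp = spacy.load("el_core_web_sm")
#   print(extract_entities(nlp, "Ο Κώστας αγόρασε πατάτες και τις άφησε πάνω στο ψυγείο."))
```

Passing the pipeline in as an argument means the same helper works with either the small or the large model.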
Visualization using displaCy:
Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, "the lavish green grass" or "the world's largest tech fund".
Noun chunks for the Greek language are supported as of the latest pull request.
You can view the submodule here.
This submodule gives you a subjectivity score and an emotion analysis for your text.
Έχω μείνει έκπληκτος! Πώς γίνεται αυτό; Η έκπληξη είναι τόσο μεγάλη! Α, τώρα εξηγούνται όλα.
Subjectivity: 16.666666666666664%
Main emotion: surprise.
Emotion score: 33.333333333333336%
Currently available only for the Greek language. Submodule link.
- Topic classifier.
This submodule is for text classification. It can assign text to one of the following categories: Sports, Science, World News, Greek News, Environment, Politics, Art, Health. Currently available only for the Greek language.
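The demo section notes that the Greek classifier is built with FastText. As a rough sketch of how labeled articles could be prepared for it, the helper below writes one example per line in fastText's supervised format, where each label carries the `__label__` prefix. The helper name and the way category names are turned into labels are assumptions; the project's actual preprocessing is not described here.

```python
def to_fasttext_line(label, text):
    """Format one labeled article as a fastText supervised training line."""
    # Labels cannot contain spaces, so "World News" becomes "world_news".
    tag = "__label__" + label.strip().lower().replace(" ", "_")
    # fastText reads one example per line, so collapse internal newlines.
    body = " ".join(text.split())
    return f"{tag} {body}"


print(to_fasttext_line("World News", "Πρώτη γραμμή\nδεύτερη γραμμή"))
```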
In this section, some suggestions for future work are listed. There are difficulty labels assigned to each task and some guidelines to start with. For more info on contribution, you can always have a look at the contribute page of the project wiki.
Add more rules to lemmatizer (Difficulty: easy)
The Greek language follows a rule-based lemmatization technique. It is highly recommended to read the lemmatizer wiki page to understand the approach followed. If you do, you will see how scalable Greek lemmatization is. Adding rules should be as easy as completing some lines in this file. For more info, check the contribute wiki page.
Overwrite the spaCy tokenizer (Difficulty: hard)
Each language modifies the spaCy tokenization procedure by adding tokenizer exceptions. The tokenizer exceptions approach is not scalable for languages such as Greek, for much the same reasons as with the lemmatizer. A new approach, rule-based tokenization, is proposed. The suggested steps are the following:
- Rewrite the spaCy tokenizer in pure Python, following the pseudo-code provided here. This is already done; you can find the code here.
- Write regular expressions to catch the following phenomena of the Greek language: "εκθλίψεις" (elisions), "αφαιρέσεις" (aphaereses), "αποκοπές" (apocopes).
- Transform the tokens that match one of the phenomena mentioned above into other token(s) using transformation rules.
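The last two steps can be sketched as follows. This is only an illustration of the proposed direction, not existing project code: the regular expression, the three example forms and the transformation table are all hypothetical, and only the ASCII apostrophe is handled.

```python
import re

# Tokens: a word with a trailing apostrophe (elision or apocope), a word with
# a leading apostrophe (aphaeresis), a plain word, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+'|'\w+|\w+|[^\w\s]")

# Hypothetical transformation rules: clipped form -> full form.
TRANSFORMATIONS = {
    "απ'": "από",    # elision: απ' όλα -> από όλα
    "'ρθω": "έρθω",  # aphaeresis: θα 'ρθω -> θα έρθω
    "φέρ'": "φέρε",  # apocope: φέρ' το -> φέρε το
}


def tokenize(text):
    """Split text with TOKEN_RE, then expand clipped forms found in the table."""
    return [TRANSFORMATIONS.get(tok, tok) for tok in TOKEN_RE.findall(text)]


print(tokenize("απ' όλα"))
print(tokenize("θα 'ρθω."))
print(tokenize("φέρ' το"))
```

A full solution would need rules that generalize beyond a fixed table, since the same clipped form can stand for more than one full word depending on context.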
Improve model accuracy (Difficulty: medium)
Implement topic classifier for other languages as well (Difficulty: medium)
Implement sentiment analyzer for other languages as well (Difficulty: medium)
Implement attitude detector and integrate it into the demo (Difficulty: hard)
- Google Summer of Code 2018 Student: Ioannis Daras
- Mentor: Markos Gogoulos
- Mentor: Panos Louridas