# NLTK with non-Latin scripts (Greek)

I downloaded this notebook [from here](https://github.com/hb20007/hands-on-nltk-tutorial/blob/master/7-1-NLTK-with-the-Greek-Script.ipynb). It is a good starting point for getting a first list of Greek stop words and doing some basic Greek text processing.

It requires NLTK, so run `pip install nltk` first. Some NLTK features need extra data downloaded with `nltk.download()`, but the `WhitespaceTokenizer` used below needs no extra data and should work straight after `pip install`.

## 1. Cleaning text

Also known as basic **text normalisation**.

In [1]:
sentence = "ΑΥΤΟΣ είναι ο χορός της βροχής της φυλής, ό,τι περίεργο."
sentence = sentence.lower()
sentence

'αυτος είναι ο χορός της βροχής της φυλής, ό,τι περίεργο.'
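Lowercasing Greek has one subtlety worth seeing in isolation: uppercase has a single sigma (Σ), while lowercase has two forms, σ inside a word and ς at the end. The sketch below, which uses only built-in string methods and illustrative example words, shows how `str.lower()` and `str.casefold()` treat it.

```python
# Uppercase Greek has one sigma (Σ); lowercase has σ (medial) and ς (final).
word = "ΑΥΤΟΣ"

lowered = word.lower()       # applies the final-sigma rule, so the word ends in ς
folded = lowered.casefold()  # case folding maps ς to σ, which suits comparisons

print(lowered)
print(folded)

# casefold() makes differently cased spellings compare equal.
same_word = "ΟΔΟΣ".casefold() == "οδος".casefold()
print(same_word)
```

If the goal is case-insensitive matching rather than display, `casefold()` is probably the safer choice; for text that will be shown to a reader, `lower()` keeps the expected final ς.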

A package called [`unidecode`](https://pypi.org/project/Unidecode) can be used to transliterate any Unicode string into the “closest possible representation” in ASCII text:

In [2]:
!pip install unidecode

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting unidecode
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.3.6


In [3]:
from unidecode import unidecode

# Specific to Greek and other non-Latin scripts: get the closest Latin (ASCII) equivalent of the text.
# This step may be useful when implementing software that pronounces
# or reads Greek sentences out loud.
sentence_latin = unidecode(sentence)
sentence_latin

'autos einai o khoros tes brokhes tes phules, o,ti periergo.'

Now some basic **word normalisation**:

In [4]:
import unicodedata

def strip_accents(s):
    # NFD (Normalization Form Canonical Decomposition) is one of the four Unicode
    # normalization forms; it splits each accented letter into a base letter plus marks.
    # The character category "Mn" stands for Nonspacing_Mark; those marks are dropped.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

sentence_no_accents = strip_accents(sentence)
sentence_no_accents

'αυτος ειναι ο χορος της βροχης της φυλης, ο,τι περιεργο.'
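To make the decomposition step concrete, here is a minimal sketch that inspects what NFD does to a single accented letter. It uses only the standard `unicodedata` module; the two sample characters are chosen for illustration.

```python
import unicodedata

accented = "\u03cc"  # ό, GREEK SMALL LETTER OMICRON WITH TONOS
decomposed = unicodedata.normalize("NFD", accented)

# After NFD the single character becomes a base letter followed by a combining mark.
for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Caveat: the dialytika (diaeresis) is also a nonspacing mark, so filtering on
# category "Mn" removes it together with the tonos.
with_dialytika = "\u03ca"  # ϊ, GREEK SMALL LETTER IOTA WITH DIALYTIKA
stripped = "".join(
    c for c in unicodedata.normalize("NFD", with_dialytika)
    if unicodedata.category(c) != "Mn"
)
print(stripped)
```

Dropping the dialytika may merge words that differ only by that mark, so it is worth checking whether that matters for your text before stripping everything in category `Mn`.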

**Tokenisation:**

In [5]:
from nltk.tokenize import WhitespaceTokenizer

tokens = WhitespaceTokenizer().tokenize(sentence_no_accents)
tokens

['αυτος',
 'ειναι',
 'ο',
 'χορος',
 'της',
 'βροχης',
 'της',
 'φυλης,',
 'ο,τι',
 'περιεργο.']

**Removing punctuation:**

In [6]:
from string import punctuation

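new_tokens = []
# Build the translation table once: it deletes every ASCII punctuation character.
remove_punctuation = str.maketrans('', '', punctuation)

for token in tokens:
    if token == 'ο,τι':
        # The comma is part of the word "ο,τι", so keep this token unchanged.
        new_tokens.append(token)
    else:
        new_tokens.append(token.translate(remove_punctuation))

# Keep a copy of the tokens before the stopwords are removed.
new_tokens_with_stopwords = list(new_tokens)
new_tokens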

['αυτος',
 'ειναι',
 'ο',
 'χορος',
 'της',
 'βροχης',
 'της',
 'φυλης',
 'ο,τι',
 'περιεργο']
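The tokenisation and punctuation-removal steps above can also be sketched with the standard library alone. The pattern and the helper name `tokenize_words` below are assumptions made for illustration, not part of NLTK. One possible advantage is that `\w` is Unicode-aware, so Greek-specific punctuation such as `«»` and `·`, which `string.punctuation` does not contain, is skipped as well.

```python
import re

# Hypothetical pattern: keep the word "ο,τι" whole, otherwise take runs of
# word characters. Anything else (spaces, punctuation) is skipped.
TOKEN_PATTERN = re.compile(r"ο,τι|\w+")

def tokenize_words(text):
    """Return the word tokens of `text`, without punctuation."""
    return TOKEN_PATTERN.findall(text)

tokens_alt = tokenize_words("αυτος ειναι ο χορος της βροχης της φυλης, ο,τι περιεργο.")
print(tokens_alt)

# Greek quotation marks and the ano teleia are not in string.punctuation,
# but they are not word characters either, so they are dropped here.
quoted = tokenize_words("«χορος» βροχης· φυλης;")
print(quoted)
```

This sketch expects accent-stripped (or at least precomposed) text: combining marks left over from NFD decomposition are not word characters and would split tokens.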

## 2. Removing stopwords

In [7]:
# Greek stopwords adapted from https://github.com/6/stopwords-json
# Better lists with more stopwords are available, e.g. https://www.translatum.gr/forum/index.php?topic=3550.0
greek_stopwords = ["αλλα","αν","αντι","απο","αυτα","αυτες","αυτη","αυτο","αυτοι","αυτος","αυτους","αυτων","για","δε","δεν","εαν","ειμαι","ειμαστε","ειναι","εισαι","ειστε","εκεινα","εκεινες","εκεινη","εκεινο","εκεινοι","εκεινος","εκεινους","εκεινων","ενω","επι","η","θα","ισως","κ","και","κατα","κι","μα","με","μετα","μη","μην","να","ο","οι","ομως","οπως","οσο","οτι","ο,τι","παρα","ποια","ποιες","ποιο","ποιοι","ποιος","ποιους","ποιων","που","προς","πως","σε","στη","στην","στο","στον","στης","στου","στους","στις","στα","τα","την","της","το","τον","τοτε","του","των","τις","τους","ως"]
len(greek_stopwords)

83

In [8]:
greek_stopwords_set = set(greek_stopwords)

# Keep only the tokens that are not stopwords, preserving their original order.
new_tokens = [token for token in new_tokens if token not in greek_stopwords_set]
new_tokens

['χορος', 'βροχης', 'φυλης', 'περιεργο']
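The filtering above works because both the tokens and the stopword list are lowercase and accent-free. Stopword lists from other sources often keep their accents, in which case they would need the same normalisation before comparing. A minimal sketch, assuming a small hypothetical accented list and example tokens:

```python
import unicodedata

def strip_accents(s):
    # Decompose, then drop nonspacing marks (category "Mn").
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )

# Hypothetical accented stopword list, as it might come from another source.
accented_stopwords = ["αυτός", "είναι", "ο", "της", "ό,τι"]

# Normalise the stopwords the same way as the tokens: lowercase, no accents.
normalised_stopwords = {strip_accents(w.lower()) for w in accented_stopwords}

example_tokens = ["αυτος", "ειναι", "ο", "χορος", "της", "βροχης",
                  "της", "φυλης", "ο,τι", "περιεργο"]
content_tokens = [t for t in example_tokens if t not in normalised_stopwords]
print(content_tokens)
```

Without the normalisation step, accented entries such as `είναι` would never match the accent-stripped token `ειναι`, and the stopword would silently stay in the output.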

The next step would be either lemmatisation or stemming, but both require external packages.

## 3. Other packages

Other interesting packages include [`polyglot`](https://pypi.org/project/polyglot/) and [`greek-stemmer`](https://pypi.org/project/greek-stemmer/). However, these require [`PyICU`](https://pypi.org/project/PyICU/) in order to work, and installing it on Windows is a pain.