In [1]:
import pandas as pd


import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


True

In [2]:
df_articles = pd.read_csv('https://raw.githubusercontent.com/tblock/10kGNAD/master/articles.csv',
                 sep=';',       # fields in this file are separated by ";", not ","
                 on_bad_lines='skip',
                 header=None,   # There is no header line for this CSV...
                 # .. so we define the column names here:
                 names=['article_category', 'content'],
                 # And by specifying the column as a Categorical type,
                 # we can save computer memory! Yay!
                 dtype={'article_category': 'category'})

In [3]:
df_articles

Unnamed: 0,article_category,content
0,Etat,"Die ARD-Tochter Degeto hat sich verpflichtet, ..."
1,Etat,App sei nicht so angenommen worden wie geplant...
2,Etat,Mitarbeiter überreichten Eigentümervertretern ...
3,Etat,Service: Jobwechsel in der Kommunikationsbranc...
4,Etat,Was Sie über diese Woche wissen sollten - und ...
...,...,...
9571,Wissenschaft,Die Fundstelle in Südengland ist Unesco-Weltku...
9572,Wissenschaft,Im Team arbeitet auch ein Inspektor der sudane...
9573,Wissenschaft,Die zentrale Frage des Projekts: Siedelten Ägy...
9574,Wissenschaft,Klimatische Verschlechterungen dürften zur Auf...


In [4]:
df_articles['article_category'].cat.categories

Index(['Etat', 'Inland', 'International', 'Kultur', 'Panorama', 'Sport', 'Web',
       'Wirtschaft', 'Wissenschaft'],
      dtype='object')

# Clustering with Latent Dirichlet Allocation (LDA)

Now, let's examine our German data set with LDA:

In previous exercises, you got to know NLTK.

### Stemming
Here we will also use NLTK's methods for **stemming** words. By trimming each word down to its stem, we reduce the dimensionality: the number of distinct words in the vocabulary decreases. For example, instead of keeping separate entries for the singular and plural forms ('word' and 'words'), or for 'Kanzler' and 'Kanzlers', we trim those words to a single stem such as 'Kanzl'. How much the vocabulary shrinks depends on the language and on the corpus, but the reduction is often substantial.
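To make the vocabulary reduction concrete, here is a minimal sketch with a deliberately naive suffix stripper. It is **not** the Snowball algorithm that NLTK implements: the suffix list and the `toy_stem` helper are made up for illustration only.

```python
# Toy suffix stripper: NOT the Snowball algorithm, only an illustration.
# The suffix list is a hypothetical rule set, ordered from longest to shortest.
SUFFIXES = ("ern", "ers", "er", "en", "es", "e", "s")


def toy_stem(word, min_stem_len=3):
    """Lowercase a word and strip the first matching suffix, if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


words = ["Kanzler", "Kanzlers", "Kanzlern", "Wort", "Worte", "Wortes"]
stems = [toy_stem(w) for w in words]

print("distinct words:", sorted(set(w.lower() for w in words)))
print("distinct stems:", sorted(set(stems)))
```

Six surface forms collapse into two stems here. A real stemmer applies many more rules, so its output on the same words may differ.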

### Stop Words
We will also remove `stopwords` from our text. In English, these are words such as `a`, `an`, and `the`, which add little to the meaning of a sentence. Each language has its own curated list of such words, and NLTK is a great source for those.
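As a small illustration of stop-word removal, the sketch below filters a German sentence against a tiny hand-picked set. The set `toy_stop_words` is an assumption made for this example; NLTK's German list is much longer.

```python
# A tiny hand-picked stop-word set (assumption: a stand-in for NLTK's much longer German list).
toy_stop_words = {"die", "der", "das", "und", "ist", "ein", "eine", "in"}

sentence = "Die Regierung ist in der Krise und das Parlament tagt"
tokens = sentence.lower().split()

# Keep only the tokens that are not stop words.
content_tokens = [t for t in tokens if t not in toy_stop_words]
print(content_tokens)
```

Only the content-bearing words survive, which is what we want to feed into a topic model.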

### GenSim
In this exercise, you'll be introduced to another package, specialized in topic modeling, called `gensim`:
https://radimrehurek.com/gensim/

In [129]:
%pip install -U gensim --quiet



In [130]:
from pprint import pprint # for printing objects nicely

from gensim import corpora, models
from gensim.utils import simple_preprocess

## Instead of the gensim English stopwords...
# from gensim.parsing.preprocessing import STOPWORDS
## ...we use nltk's German stopwords:
from nltk.corpus import stopwords

from nltk.stem.snowball import SnowballStemmer

import numpy as np

from random import choice

np.random.seed(1234)



In [131]:
# Initialize the Stemmers
stemmer = SnowballStemmer('german')
german_stop_words = set(stopwords.words('german'))


def lemmatize_stemming(text):
  """Stem a single token (despite the name, no lemmatization happens here)."""
  return stemmer.stem(text)


def preprocess(text):
  """Tokenize, drop stop words and short tokens, then stem what remains."""
  result = [lemmatize_stemming(token)
            for token in simple_preprocess(text)
            if token not in german_stop_words and len(token) > 3]
  return result



In [132]:
all_articles = df_articles['content'].to_list()
all_articles[:5]



['Die ARD-Tochter Degeto hat sich verpflichtet, ab August einer Quotenregelung zu folgen, die für die Gleichstellung von Regisseurinnen sorgen soll. In mindestens 20 Prozent der Filme, die die ARD-Tochter Degeto produziert oder mitfinanziert, sollen ab Mitte August Frauen Regie führen. Degeto-Chefin Christine Strobl folgt mit dieser Selbstverpflichtung der Forderung von Pro Quote Regie. Die Vereinigung von Regisseurinnen hatte im vergangenen Jahr eine Quotenregelung gefordert, um den weiblichen Filmschaffenden mehr Gehör und ökonomische Gleichstellung zu verschaffen. Pro Quote Regie kritisiert, dass, während rund 50 Prozent der Regie-Studierenden weiblich seien, der Anteil der Regisseurinnen bei Fernsehfilmen nur bei 13 bis 15 Prozent liege. In Österreich sieht die Situation ähnlich aus, auch hier wird von unterschiedlichen Seiten Handlungsbedarf angemahnt. Aber wie soll dieser aussehen? Ist die Einführung der Quotenregelung auch für die österreichische Film- und Fernsehlandschaft sinn

## Preprocessing

Let's look at an example of what happens when we preprocess a document.

Look at the output of this cell, and compare the tokenized original document with the stemmed document:

In [133]:
print('original document: ')
article = choice(all_articles)
print(article, "\n")

# This time, we don't care about punctuation as tokens (Can you think why?):
print('original document, broken into words: ')
words = article.split(' ')
print(words, "\n")
print("Vocabulary size of the original article:", len(set(words)))

# now let's see what happens when we pass the article into our preprocessing
# method:
print('\n\n tokenized and stemmed document: ')
preprocessed_article = preprocess(article)
print(preprocessed_article, '\n')
print("Vocabulary size after preprocessing:", len(set(preprocessed_article)))


original document: 
Ronaldo: "Es ist ein Spiel, das wir gewinnen müssen" – Barca vs. Sevilla. Madrid – In der spanischen Fußball-Liga steht die 26. Runde am Samstag (16.00 Uhr) ganz im Zeichen des Stadtderbys Real gegen Atletico Madrid. Die Königlichen, in der Tabelle neun Punkte hinter Leader FC Barcelona und einen hinter Atletico auf Rang drei, stehen vor eigener Kulisse vor einem Pflichtsieg. Das weiß auch Cristiano Ronaldo. Es ist ein Spiel, dass wir gewinnen müssen, stellte der Real-Superstar klar. Wir spielen gegen einen harten Gegner, der gut verteidigt. Atletico hat in der Liga in 25 Spielen erst elf Gegentore kassiert. Das Team von Diego Simeone will auch Reals Offensivkraft Einhalt gebieten. Während Simeone in seiner fünften Saison bei Atletico schon ein alter Derby-Hase ist, hat Reals Trainer Zinedine Zidane fast zwei Monate nach seinem Amtsantritt sein erstes Stadtduell als Trainer noch vor sich. Der Franzose verteidigt seit seiner Amtsübernahme im Jänner einen Lauf von ach



Now let's pre-process all the documents.  
This is a heavy procedure, and may take a bit ;)

In [134]:
processed_docs = list(map(preprocess, all_articles))
processed_docs[:10]



[['tocht',
  'degeto',
  'verpflichtet',
  'august',
  'quotenregel',
  'folg',
  'gleichstell',
  'regisseurinn',
  'sorg',
  'mindest',
  'prozent',
  'film',
  'tocht',
  'degeto',
  'produziert',
  'mitfinanziert',
  'soll',
  'mitt',
  'august',
  'frau',
  'regi',
  'fuhr',
  'degeto',
  'chefin',
  'christin',
  'strobl',
  'folgt',
  'forder',
  'quot',
  'regi',
  'verein',
  'regisseurinn',
  'vergang',
  'jahr',
  'quotenregel',
  'gefordert',
  'weiblich',
  'filmschaff',
  'mehr',
  'gehor',
  'okonom',
  'gleichstell',
  'verschaff',
  'quot',
  'regi',
  'kritisiert',
  'rund',
  'prozent',
  'regi',
  'studier',
  'weiblich',
  'seien',
  'anteil',
  'regisseurinn',
  'fernsehfilm',
  'prozent',
  'lieg',
  'osterreich',
  'sieht',
  'situation',
  'ahnlich',
  'seit',
  'handlungsbedarf',
  'angemahnt',
  'ausseh',
  'einfuhr',
  'quotenregel',
  'osterreich',
  'film',
  'sinnvoll',
  'diskuti',
  'forum'],
 ['angenomm',
  'word',
  'geplant',
  'weg',
  'gering',
  '

## Setting Up The Dictionary

Our preprocessing is complete.

We now need to calculate the occurrence frequency of each of our stemmed words. But first, we will create a vocabulary dictionary where every word appears once. Every article will then be represented as a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model): an unordered collection of the words that the article contains, together with how often each word occurs.
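A minimal sketch of the idea, using only the standard library: two token lists that contain the same stems in a different order produce the same bag. The token lists are made up for illustration.

```python
from collections import Counter

# Two toy documents with the same tokens in a different order (made-up data).
doc_a = ["film", "regi", "quot", "film", "regi", "film"]
doc_b = ["regi", "film", "film", "quot", "film", "regi"]

bag_a = Counter(doc_a)
bag_b = Counter(doc_b)

print(bag_a)
print("Same bag regardless of word order:", bag_a == bag_b)
```

Only the counts matter. The position of a word inside the article is thrown away.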

---

Q: Why is it called bag-of-words?

Hint: Think back to your probability lessons, where you randomly picked white or black balls out of a bag...

In [11]:
dictionary = corpora.Dictionary(processed_docs)


Let's take a look:

In [12]:
for idx, (k, v) in enumerate(dictionary.iteritems()):
    print(k, v)
    if idx >= 10:
        break


### BTW: `enumerate` is a great python function!
### It automatically creates an index, an auto-incremented counter variable,
### that represents the position of every object in the collection.

### Read more about it here: https://realpython.com/python-enumerate/

0 ahnlich
1 angemahnt
2 anteil
3 august
4 ausseh
5 chefin
6 christin
7 degeto
8 diskuti
9 einfuhr
10 fernsehfilm


Second, we filter out the tokens that appear too often or too rarely.

We have full control over the process.
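The filtering idea can be sketched in plain Python. This is a simplified illustration of document-frequency filtering, not gensim's actual implementation. The helper `filter_by_document_frequency` and the toy documents are hypothetical, and the parameter names only mirror `no_below` and `no_above` for readability.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and in at most a fraction `no_above` of all docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    n_docs = len(docs)
    return {tok for tok, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}


toy_docs = [
    ["euro", "bank", "prozent"],
    ["euro", "spiel", "prozent"],
    ["euro", "forsch", "prozent"],
    ["euro", "bank", "spiel"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

In this toy corpus, `euro` is dropped because it appears in every document, and `forsch` is dropped because it appears in only one.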

### Model Hyperparameter tuning

### Your Turn:
#### Exercise 1 - Hyperparameter effect on the model output:
**Q:** How would changing these parameters influence the result?  
After running this example, please return here to change them and try them out.

**A:** I lowered both the document frequency parameter and the percentage threshold, and reduced the number of topics to 5. I think the initial set was more specific, whereas mine was more general and diverse, which I believe has something to do with the percentage parameter.

In [13]:
## Model hyper parameters:

## These are the dictionary preparation parameters:
filter_tokens_if_container_documents_are_less_than = 10 #changed from 15 to 10
filter_tokens_if_appeared_percentage_more_than = 0.3 #changed from 0.5 to 0.3
keep_the_first_n_tokens=100000

## and the LDA Parameters:
num_of_topics = 5 #changed from 10 to 5

In [14]:
dictionary.filter_extremes(
    no_below=filter_tokens_if_container_documents_are_less_than,
    no_above=filter_tokens_if_appeared_percentage_more_than,
    keep_n=keep_the_first_n_tokens)


We now create a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) (BOW) representation of each document, using [gensim's dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) tool.

Each document becomes a list of pairs in the format:

`[(word_id, count), ...]`
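gensim's `doc2bow` returns each document as a list of `(word_id, count)` pairs. The sketch below imitates that shape in plain Python. The `token2id` mapping is a toy vocabulary built on the spot, so the ids will not match the ones gensim assigns.

```python
from collections import Counter

tokens = ["film", "regi", "quot", "film", "regi", "film"]

# Assign each distinct token an integer id (hypothetical toy vocabulary).
token2id = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# Count how often each id occurs and list the pairs sorted by id.
counts = Counter(token2id[tok] for tok in tokens)
bow = sorted(counts.items())

print(token2id)
print(bow)
```

To read such a vector, the ids have to be mapped back to words through the vocabulary.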


In [15]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

Let's take a look at the result.

Our corpus now contains only word_ids, not the words themselves, so we have to peek into the dictionary to see which word each id represents:

In [16]:
# randomly choose an article from the corpus:
sample_bow_doc = choice(bow_corpus)

print('The processed bag-of-word document is just pairs of (word_id, # of occurrences) and looks like this:')
print(sample_bow_doc, '\n\n')

print('We peek in the dictionary: for each word_id, we get its assigned word:')
for word_id, word_freq in sample_bow_doc:
  real_word = dictionary[word_id]
  print(f'Word #{word_id} ("{real_word}") appears {word_freq} time(s).')


The processed bag-of-word document is just pairs of (word_id, # of occurrences) and looks like this:
[(2, 1), (14, 2), (36, 1), (41, 1), (42, 1), (53, 1), (61, 1), (96, 1), (107, 3), (140, 1), (149, 1), (154, 1), (187, 1), (215, 1), (219, 1), (229, 1), (230, 1), (232, 2), (278, 1), (292, 1), (314, 1), (332, 1), (347, 1), (359, 1), (373, 3), (375, 1), (382, 1), (402, 2), (429, 1), (445, 1), (452, 1), (462, 1), (480, 1), (490, 1), (492, 1), (498, 1), (550, 1), (553, 1), (581, 1), (586, 1), (600, 1), (610, 2), (616, 1), (617, 1), (624, 1), (633, 1), (643, 1), (646, 1), (683, 1), (685, 5), (710, 1), (725, 3), (804, 2), (816, 1), (830, 1), (836, 1), (842, 1), (856, 1), (900, 3), (912, 1), (928, 2), (938, 1), (965, 2), (993, 1), (995, 1), (997, 1), (1000, 1), (1011, 1), (1058, 3), (1129, 1), (1149, 1), (1259, 1), (1285, 1), (1308, 1), (1355, 1), (1388, 2), (1410, 1), (1456, 1), (1475, 2), (1518, 1), (1546, 1), (1581, 1), (1632, 1), (1660, 1), (1713, 2), (1850, 1), (1865, 1), (1876, 1), (1880, 

## LDA model using Bag-of-words

Let's start by applying the LDA model using the bag-of-words (Warning: this could take a while):

In [17]:
lda_model = models.LdaMulticore(bow_corpus,
                                num_topics=num_of_topics,
                                id2word=dictionary,
                                passes=5,
                                workers=2)

It is done!

Now we can observe which topics the model has extracted from the documents.

- A *topic* is a distribution over words: each word carries a weight that says how strongly it belongs to that topic.
- Every document may be composed of multiple topics, each with a weight that says how strongly the document relates to that topic.
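The two points above can be combined into one small calculation: the probability of seeing a word in a document is the topic-weighted sum of that word's weight in each topic. The numbers below are hypothetical toy values, not taken from the trained model.

```python
# Hypothetical toy numbers, not taken from the trained model.
topic_word = {
    "economy": {"euro": 0.6, "bank": 0.3, "spiel": 0.1},
    "sport":   {"euro": 0.1, "bank": 0.1, "spiel": 0.8},
}
doc_topic = {"economy": 0.7, "sport": 0.3}  # topic weights of one document


def word_probability(word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(weight * topic_word[topic][word]
               for topic, weight in doc_topic.items())


for w in ("euro", "bank", "spiel"):
    print(w, round(word_probability(w), 3))
```

Because both the topic weights and the word weights within each topic sum to one, the resulting word probabilities sum to one as well.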

We will loop over the extracted topics and examine the words that construct them.

In [18]:
for idx, topic in lda_model.print_topics(num_of_topics):
    print(f'Topic: {idx} \t Words: {topic}')


Topic: 0 	 Words: 0.019*"prozent" + 0.012*"euro" + 0.011*"osterreich" + 0.005*"million" + 0.004*"rund" + 0.004*"hoh" + 0.004*"wenig" + 0.004*"zahl" + 0.003*"unternehm" + 0.003*"land"
Topic: 1 	 Words: 0.007*"fluchtling" + 0.007*"land" + 0.005*"regier" + 0.005*"osterreich" + 0.005*"europa" + 0.004*"griechenland" + 0.004*"polit" + 0.004*"weit" + 0.004*"mensch" + 0.004*"staat"
Topic: 2 	 Words: 0.005*"unternehm" + 0.004*"appl" + 0.004*"prozent" + 0.004*"million" + 0.004*"euro" + 0.004*"nutz" + 0.004*"gross" + 0.003*"gerat" + 0.003*"googl" + 0.003*"allerding"
Topic: 3 	 Words: 0.005*"mensch" + 0.005*"gross" + 0.004*"forsch" + 0.004*"standard" + 0.004*"etwa" + 0.004*"gibt" + 0.004*"imm" + 0.004*"viel" + 0.004*"schon" + 0.003*"dabei"
Topic: 4 	 Words: 0.006*"word" + 0.006*"euro" + 0.005*"weg" + 0.004*"jahrig" + 0.004*"rund" + 0.003*"laut" + 0.003*"million" + 0.003*"weit" + 0.003*"deutsch" + 0.003*"drei"


## TF / IDF

Let's take it one step further. We will cluster our documents by running LDA on [TF/IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) weights.

We start with the TF/IDF calculation on our bag-of-words.
TF/IDF accepts the bag-of-words corpus as input, and it calculates the term frequency and the inverse document frequency accordingly.

Its output is the same corpus, with re-weighted term frequencies for each document:
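To show what the re-weighting does, here is a hand-rolled sketch on a toy corpus. It assumes a base-2 logarithm for the inverse document frequency and scales every document vector to unit length. These choices are assumptions for illustration, and gensim's `TfidfModel` may differ in details.

```python
import math

# Toy BOW corpus of (word_id, count) pairs (made-up data).
toy_bow_corpus = [
    [(0, 2), (1, 1)],          # doc 0
    [(0, 1), (2, 3)],          # doc 1
    [(0, 1), (1, 1), (2, 1)],  # doc 2
]

n_docs = len(toy_bow_corpus)

# Document frequency: in how many documents does each word id occur?
doc_freq = {}
for doc in toy_bow_corpus:
    for word_id, _ in doc:
        doc_freq[word_id] = doc_freq.get(word_id, 0) + 1


def tfidf(doc):
    """Weight each count by log2(N / df), then scale the vector to unit length."""
    weighted = [(wid, tf * math.log2(n_docs / doc_freq[wid])) for wid, tf in doc]
    weighted = [(wid, w) for wid, w in weighted if w > 0]  # words found in every doc get weight 0
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(wid, w / norm) for wid, w in weighted] if norm else []


print(tfidf(toy_bow_corpus[2]))
```

Word 0 occurs in every document, so its weight drops to zero and it vanishes from the vector. Rarer words keep a positive weight.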

In [19]:
# initialize a tfidf from our corpus
tfidf = models.TfidfModel(bow_corpus)

# apply it on our corpus
tfidf_corpus = tfidf[bow_corpus]

pprint(tfidf_corpus[0][:10])

[(0, 0.06672748351825836),
 (1, 0.08546380154246495),
 (2, 0.15583788486381192),
 (3, 0.10987442929344443),
 (4, 0.10602815249607306),
 (5, 0.13055353039827378),
 (6, 0.10457193427428682),
 (7, 0.09465538456786138),
 (8, 0.16822564381009375),
 (9, 0.05475315131893539)]


In [20]:
# the new tfidf corpus is just our corpus, but transformed. It has the same number of documents:
assert len(bow_corpus) == len(tfidf_corpus)

Now let's apply LDA to the tfidf corpus, with the same number of topics.

You can play with the number of passes if the model doesn't converge properly.

In [29]:
lda_model_tfidf = models.LdaMulticore(tfidf_corpus,
                                      num_topics=num_of_topics,
                                      id2word=dictionary,
                                      passes=10,
                                      workers=4)

In [30]:
for idx, topic in lda_model_tfidf.print_topics(num_of_topics):
    print(f'Topic: {idx} \t Word: {topic}')

Topic: 0 	 Word: 0.002*"fluchtling" + 0.002*"griechenland" + 0.002*"griechisch" + 0.002*"regier" + 0.002*"land" + 0.002*"osterreich" + 0.002*"euro" + 0.002*"ath" + 0.002*"europa" + 0.002*"polit"
Topic: 1 	 Word: 0.002*"forsch" + 0.002*"wissenschaft" + 0.002*"mensch" + 0.001*"tier" + 0.001*"alt" + 0.001*"archaolog" + 0.001*"israel" + 0.001*"studi" + 0.001*"mann" + 0.001*"jahrig"
Topic: 2 	 Word: 0.006*"prozent" + 0.004*"euro" + 0.003*"unternehm" + 0.003*"million" + 0.002*"dollar" + 0.002*"bank" + 0.002*"osterreich" + 0.002*"milliard" + 0.002*"appl" + 0.002*"mitarbeit"
Topic: 3 	 Word: 0.004*"prozent" + 0.004*"forsch" + 0.002*"wissenschaft" + 0.002*"euro" + 0.002*"studi" + 0.002*"osterreich" + 0.002*"million" + 0.002*"tier" + 0.002*"energi" + 0.002*"milliard"
Topic: 4 	 Word: 0.011*"volltext" + 0.010*"basier" + 0.009*"artikel" + 0.008*"rechtlich" + 0.007*"verfug" + 0.006*"grund" + 0.006*"steht" + 0.002*"spiel" + 0.002*"train" + 0.002*"panama"


## Inference

Now that we have a topic-modeler, let's use it on one of the articles.

In [31]:
# randomly pick an article:
test_doc = choice(range(len(processed_docs)))
processed_docs[test_doc][:50]

['heino',
 'tatort',
 'verhohnt',
 'kolleg',
 'viel',
 'fan',
 'berlin',
 'neu',
 'sachs',
 'tatort',
 'mord',
 'volksmusiksz',
 'schlagersang',
 'verargert',
 'drehbuch',
 'mann',
 'geschrieb',
 'banal',
 'vorurteil',
 'gegenub',
 'volksmus',
 'gepragt',
 'wirklich',
 'beschaftigt',
 'sagt',
 'musik',
 'heino',
 'bild',
 'zeitung',
 'montag',
 'tatort',
 'uberfluss',
 'verhohnt',
 'kolleg',
 'viel',
 'fan',
 'erklart',
 'heino',
 'sonntag',
 'ausgestrahlt',
 'krimi',
 'schlag',
 'erst',
 'fall',
 'dresdn',
 'kommissarinn',
 'karin',
 'gorniak',
 'karin',
 'hanczewski']

Using the original BOW model:

In [32]:
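# Sort the topics of this document by score, best match first.
# The second argument of print_topic is the number of words to show per topic.
for index, score in sorted(lda_model[bow_corpus[test_doc]], key=lambda tup: tup[1], reverse=True):
    print(f"Topic match score: {score} \nTopic: {lda_model.print_topic(index, topn=5)}")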


Topic match score: 0.5789118409156799 
Topic: 0.005*"mensch" + 0.005*"gross" + 0.004*"forsch" + 0.004*"standard" + 0.004*"etwa"
Topic match score: 0.4121437072753906 
Topic: 0.006*"word" + 0.006*"euro" + 0.005*"weg" + 0.004*"jahrig" + 0.004*"rund"


And with the TF/IDF model:

In [33]:
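# Note: lda_model_tfidf was trained on TF/IDF weights, while this cell passes raw
# BOW counts; tfidf_corpus[test_doc] would match the training input.
for index, score in sorted(lda_model_tfidf[bow_corpus[test_doc]], key=lambda tup: tup[1], reverse=True):
    print("Topic match score: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, topn=5)))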

Topic match score: 0.9880240559577942	 
Topic: 0.002*"forsch" + 0.002*"wissenschaft" + 0.002*"mensch" + 0.001*"tier" + 0.001*"alt"


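Calculating the [perplexity score](https://towardsdatascience.com/perplexity-in-language-models-87a196019a94) (lower is better). Note that gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself: the perplexity is `2^(-bound)`, so a bound closer to zero means a lower perplexity: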

In [34]:
print('Perplexity: ', lda_model.log_perplexity(bow_corpus))
print('Perplexity TFIDF: ', lda_model_tfidf.log_perplexity(bow_corpus))

Perplexity:  -8.30461021963961
Perplexity TFIDF:  -8.606757283986147
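gensim documents `log_perplexity` as returning a per-word likelihood bound, with the perplexity given by `2^(-bound)`. The sketch below converts the two values printed above. The helper `perplexity_from_bound` is a name made up for this example.

```python
# Per-word bounds copied from the output above.
bound_bow = -8.30461021963961
bound_tfidf = -8.606757283986147


def perplexity_from_bound(bound):
    """Convert a per-word log2 likelihood bound into a perplexity."""
    return 2 ** (-bound)


print("Perplexity (BOW model):   ", perplexity_from_bound(bound_bow))
print("Perplexity (TF/IDF model):", perplexity_from_bound(bound_tfidf))
```

Both models were scored on the raw BOW corpus here, although the TF/IDF model was trained on re-weighted input, so the comparison should be read with caution.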


### Exercise - inference

Now please try it on a new document!

Go to a news website, such as [orf.at](https://orf.at/) and copy an article of your choice here:

In [35]:
unseen_document = """Während die Wirtschaft vieler anderer EU-Mitgliedsländer schwächelt, hat Griechenland ein Luxusproblem: Für den Haushalt 2025 steht mehr als doppelt so viel Geld zur Verfügung als vorhergesehen.
Entsprechend musste die Planung der Ausgaben nach oben angepasst werden. Am Sonntagabend wurde der Haushalt vom Parlament verabschiedet. Wichtig sei nun, dass der wirtschaftliche Erfolg auch stärker bei den Menschen ankomme, sagte der konservative Ministerpräsident Kyriakos Mitsotakis vor den Abgeordneten.
Finanzminister Kostis Hatzidakis hatte bei seinem Entwurf für das Budget mit einem Haushaltsüberschuss von 6,1 Mrd. Euro gerechnet, nun sind es 13,5 Mrd. Euro. Das liegt durchaus daran, dass Hatzidakis sparsam gewirtschaftet habe, sagen griechische Finanzexperten. Aber es kommen weitere wichtige Gründe für den Geldsegen zum Tragen.
Zum einen macht sich die harte Bekämpfung der Steuerhinterziehung bezahlt. Mit der Digitalisierung der Finanzbehörden ist es unter anderem gelungen, den Betrug bei der Mehrwertsteuer etwa durch Schwarzarbeit zu verringern. Die Verluste, die dadurch entstehen, wurden in den vergangen fünf Jahren auf 3,2 Mrd. Euro halbiert. Hinzu kommt, dass die konservative Regierung weiter privatisiert.
2024 sollen so 5,8 Mrd. Euro eingenommen werden, allein 3,3 Mrd. Euro brachte die Konzession für die Stadtautobahn von Athen ein. Und dann ist da noch die Konjunktur, bei der Griechenland vielen anderen EU-Ländern den Rang abläuft. Während der Durchschnitt in der Europäischen Union bei 0,9 Prozent Wachstum liegt, rechnet die Kommission für Griechenland im Jahr 2025 mit 2,3 Prozent Wachstum nach 2,1 Prozent in diesem Jahr.
Das liegt nicht nur am boomenden Tourismusgeschäft. Vielmehr hat die Regierung es geschafft, das Vertrauen der Märkte zurückzugewinnen. Internationale Rating-Agentur stufen das Land wieder als investitionswürdig ein. Die US-Konzerne Microsoft, Google, Pfizer haben sich in den vergangenen Jahren angesiedelt, auch deutsche Unternehmen wie Fraport, RWE, Boehringer Ingelheim und Teamviewer sind in Griechenland aktiv.
Trotz der guten Entwicklung mahnt Mitsotakis dazu, den Ball flach zu halten. Grund dafür ist die anhaltende relative Armut der Griechen, deren Renten und Löhne während der Finanzkrise des Landes von 2010 bis 2018 stark zusammengestrichen wurden.
Der Aufschwung kommt bei den Menschen nur langsam an, obwohl die Regierung Renten und Mindestlohn immer wieder leicht erhöht hat. Für das kommende Jahr ist eine Anhebung der Renten um 2,4 Prozent geplant, der Mindestlohn von 830 Euro im Monat soll bis 2027 schrittweise auf 950 Euro steigen. Und Arbeitnehmer und -geber müssen künftig jeweils 0,5 Prozentpunkte weniger Sozialabgaben zahlen. Diese und weitere Maßnahmen sollen den Menschen finanziell auf die Beine helfen.
Die Arbeitslosigkeit soll im kommenden Jahr unter die Marke von 10 Prozent sinken, in den Hochzeiten der Krise erreichte sie mehr als 40 Prozent. Auch beim Schuldendienst verhält sich das Land musterschülerhaft: Die Kredite an internationale Gläubiger werden bedient, den Krisen-Kredit beim Internationalen Währungsfonds hat Athen sogar vorzeitig getilgt. Die Staatsschuldenquote soll 2025 auf 147 Prozent sinken - vor zwei Jahren waren es noch 164 Prozent."""


bow_vector = dictionary.doc2bow(preprocess(unseen_document))

print("Simply printing the lda_model output would look like this:")
pprint(lda_model[bow_vector])

print("\n\nSo let's make it nicer, by printing the topic contents:")
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))


Simply printing the lda_model output would look like this:
[(0, 0.6673369), (1, 0.2819162), (2, 0.048624884)]


So let's make it nicer, by printing the topic contents:
Score: 0.6673377752304077	 Topic: 0.019*"prozent" + 0.012*"euro" + 0.011*"osterreich" + 0.005*"million" + 0.004*"rund"
Score: 0.2819160223007202	 Topic: 0.007*"fluchtling" + 0.007*"land" + 0.005*"regier" + 0.005*"osterreich" + 0.005*"europa"
Score: 0.04862413555383682	 Topic: 0.005*"unternehm" + 0.004*"appl" + 0.004*"prozent" + 0.004*"million" + 0.004*"euro"


In [None]:
# Source of the article used above: https://www.sn.at/wirtschaft/welt/griechenland-budgetueberschuss-170218414



## Visualization

Finally, there are packages that can visualize the results, such as [pyLDAvis](https://pypi.org/project/pyLDAvis/) and [tmplot](https://pypi.org/project/tmplot/).

Let's take a look at pyLDAvis visualization result.

**Please note:** this is an old and unmaintained package. It is easier to run it in Google Colab than on your laptop. If you still want to run it locally, please try **lowering your Python version** (3.6 to 3.8) when you create the poetry environment for this exercise.

In [36]:
%pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-2.0 pyLDAvis-3.4.1


In [37]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

bow_lda_data = gensimvis.prepare(lda_model, bow_corpus, dictionary)

pyLDAvis.display(bow_lda_data)

# Your turn - Do it yourself:

Replace the given corpus with one in another language - maybe your own native language? You can find corpora online, for example:
- https://www.corpusdata.org/intro.asp
- or here: https://www.clarin.eu/resource-families/newspaper-corpora
- or even in nltk: https://www.nltk.org/nltk_data/
- In this github, there are many datasets that can be loaded through their `raw` url: https://github.com/selva86/datasets

Careful: You will need to change the [Stemming](https://snowballstem.org/algorithms/) and the [Stopwords](https://www.kaggle.com/rtatman/stopword-lists-for-19-languages) to support your language. Search the web for the appropriate ones (if they exist...).

Use the notebook to reproduce the result.  
Try changing the parameters to get a *satisfying level of clustering*.  
Which parameters worked best for the language you chose?




**Source**: CGL Modern Greek Texts Corpora: newspaper corpus "Ta Nea" (2015). Version 1.0.0 (automatically assigned). [Dataset (Text corpus)]. CLARIN:EL. http://hdl.handle.net/11500/KEG-0000-0000-24F9-F

Because the .txt file exceeded the maximum call stack size and I couldn't upload it to Colab, I reduced it to the first 10000 lines using PyCharm and then uploaded it to my Google Drive.

In [97]:
from google.colab import drive
drive.mount('/content/drive')

file_path = '/content/drive/MyDrive/ta_nea_gr.txt'

# Read the whole corpus file. The encoding is stated explicitly because the text
# is Greek (assumption: the file is UTF-8 encoded).
with open(file_path, 'r', encoding='utf-8') as f:
  file_content_gr = f.read()



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [98]:
def split_articles(content):
    # The articles in my file weren't categorized; a row of asterisks marks
    # the end of one article and the beginning of the next.
    articles = content.split('**************')

    # Trim whitespace and drop empty chunks
    articles = [article.strip() for article in articles if article.strip()]

    return articles
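A quick way to check the splitting logic without the real file is to run the same idea on a short made-up string. The function `split_on_separator` below is a standalone copy written for this example, and `toy_content` is invented data.

```python
def split_on_separator(content, separator="**************"):
    """Split raw text on a separator line and drop chunks that are empty after stripping."""
    return [chunk.strip() for chunk in content.split(separator) if chunk.strip()]


# Made-up content with a trailing separator, as often found at the end of a file.
toy_content = "First article\n**************\nSecond article\n**************\n"
print(split_on_separator(toy_content))
```

The trailing separator leaves only whitespace behind, and the `if chunk.strip()` filter removes that empty chunk.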



In [99]:
articles_gr = split_articles(file_content_gr)
print(f"Number of articles: {len(articles_gr)}")
print("First article:", articles_gr[0])

Number of articles: 344
First article: Κοκτέιλ θανάτου για 3 κρατούμενες Κύκλωμα διακινούσε ανενόχλητα ναρκωτικά και ψυχοτρόπα στις Γυναικείες Φυλακές Κορυδαλλού 

ΛΙΑ ΝΕΣΦΥΓΕ  ΠΡΟΚΟΠΗΣ ΓΙΟΓΙΑΚΑΣ

Ανενόχλητα και κάτω από τη μύτη των σωφρονιστικών υπαλλήλων, κύκλωμα από κρατούμενες διακινούσε ηρεμιστικά και αντικαταθλιπτικά χάπια, έναντι αδρών ανταλλαγμάτων, στις Γυναικείες Φυλακές του Κορυδαλλού.

Ο θάνατος τριών και η μεταφορά άλλης μίας κρατούμενης, σε κώμα, στο νοσοκομείο, ύστερα από χρήση θανατηφόρου κοκτέιλ με ηρωίνη και χάπια, έφερε με τραγικό τρόπο ξανά στο προσκήνιο αυτό που είναι κοινό μυστικό, ότι δηλαδή στις φυλακές γίνεται διακίνηση ναρκωτικών και ψυχοτρόπων...

Επικεφαλής του κυκλώματος, σύμφωνα με την Ασφάλεια, είναι η Μαριάνθη Πατσέλη, γνωστή για την υπόθεση ναρκωτικών με τον τραγουδιστή των Νοστράδαμος Ιπποκράτη Εξαρχόπουλο (είχαν συλληφθεί το 1996 με 10 κιλά ηρωίνη). Στο κύκλωμα φέρεται ότι συμμετείχε και άλλη μία κρατούμενη, η οποία αποφυλακίστηκε πρόσφατα.

Η ίδια με



In [100]:
print(type(articles_gr))

<class 'list'>




In [48]:
!pip install spacy
!python -m spacy download el_core_news_sm



Collecting el-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/el_core_news_sm-3.7.0/el_core_news_sm-3.7.0-py3-none-any.whl (12.6 MB)
Installing collected packages: el-core-news-sm
Successfully installed el-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('el_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [135]:
# In the end I decided to use spaCy because the stemmer I used didn't work on the Greek words
import spacy
from spacy.lang.el.stop_words import STOP_WORDS as greek_stopwords

# Load Greek language model
nlp = spacy.load('el_core_news_sm')

def lemm_stemm_gr_spacy(text):
    """Return the lowercased lemma of the first token in `text`."""
    doc = nlp(text)
    return doc[0].lemma_.lower()

def preprocess_gr_spacy(text):
    """Lemmatize `text`, keeping alphabetic non-stop-word tokens longer than 3 characters."""
    doc = nlp(text)
    result = [token.lemma_.lower() for token in doc if not token.is_stop and token.is_alpha and len(token) > 3]

    return result




In [136]:
print('Original article: ')
article_gr = choice(articles_gr)
print(article_gr, "\n")

# This time, we don't care about punctuation as tokens (Can you think why?):
print('Original article, broken into words: ')
words = article_gr.split(' ')
print(words, "\n")
print("Vocabulary size of the original article:", len(set(words)))

# now let's see what happens when we pass the article into our preprocessing
# method:
print('\n\n tokenized and lemmatized document: ')
preprocessed_article_gr = preprocess_gr_spacy(article_gr)
print(preprocessed_article_gr, '\n')
print("Vocabulary size after preprocessing:", len(set(preprocessed_article_gr)))


Original article: 
ΑΙΧΜΕΣ Με κανόνες αγοράς...

ΒΑΣΩ ΑΡΤΙΝΟΠΟΥΛΟΥ

Γνωρίζουμε πολύ καλά ότι "φυλακή" σημαίνει ιδρυματισμός, απομόνωση, υποκουλτούρα, βία, στίγμα, αποκλεισμός...

Γνωρίζουμε, επίσης, ότι το "ιδεώδες" της επανακοινωνικοποίησης είναι αδύνατο να επιτευχθεί μέσα από τη στερητική της ελευθερίας ποινή. Αντίθετα, οι φυλακές είναι μάλλον χώροι περιστασιακής απομάκρυνσης (ή "αποθήκες ανθρώπων") των εγκληματιών από την υπόλοιπη κοινωνία.

Ο μικρόκοσμος της φυλακής έχει τις δικές του αξίες, τους δικούς του κανόνες, τη δική του γλώσσα, τη δική του νοοτροπία. Οι κανόνες της αγοράς λειτουργούν και μέσα στη φυλακή: το εμπόριο υπακούει στους κλασικούς νόμους της προσφοράς και της ζήτησης. Το απαγορευμένο και το σπάνιο προϊόν (σύμφωνα με τους κανόνες της φυλακής) είναι εμπορεύσιμο και ακριβό.

Στις γυναικείες φυλακές η χρήση και η χορήγηση των ψυχοφαρμάκων είναι ευρύτατα διαδεδομένη, σε σύγκριση με τις ανδρικές. Ευνοείται το εμπόριο και η συναλλαγή στον βαθμό που αυξάνεται η ζήτηση και π



In [137]:
processed_docs_gr = list(map(preprocess_gr_spacy, articles_gr))
processed_docs_gr[:10]



[['κοκτέιλ',
  'θανάτου',
  'κρατούμενος',
  'κύκλωμα',
  'διακινώ',
  'ανενόχλητα',
  'ναρκωτικός',
  'ψυχοτρόπα',
  'γυναικείες',
  'φυλακές',
  'κορυδαλλού',
  'νεσφυγε',
  'προκοπης',
  'γιογιακας',
  'ανενόχλητα',
  'μύτη',
  'σωφρονιστικός',
  'υπαλλήλων',
  'κύκλωμα',
  'κρατούμενος',
  'διακινώ',
  'ηρεμιστικά',
  'αντικαταθλιπτικός',
  'χάπια',
  'έναντι',
  'αδρών',
  'ανταλλαγμάτων',
  'γυναικείες',
  'φυλακός',
  'κορυδαλλού',
  'θάνατος',
  'τρία',
  'μεταφορά',
  'ένας',
  'κρατούμενη',
  'κώμα',
  'νοσοκομείο',
  'χρήση',
  'θανατηφόρος',
  'κοκτέιλ',
  'ηρωίνη',
  'χάπια',
  'έφερε',
  'τραγικός',
  'τρόπος',
  'προσκήνιος',
  'κοινός',
  'μυστικός',
  'φυλακός',
  'γίνομαι',
  'διακίνηση',
  'ναρκωτικός',
  'ψυχοτρόπων',
  'επικεφαλής',
  'κυκλώμα',
  'σύμφωνα',
  'ασφάλεια',
  'μαριάνθη',
  'πατσέλη',
  'γνωστός',
  'υπόθεση',
  'ναρκωτικός',
  'τραγουδιστής',
  'νοστράδαμος',
  'ιπποκράτη',
  'εξαρχόπουλο',
  'συλλαμβάνω',
  'κιλάς',
  'ηρωίνος',
  'κύκλωμα',
  'φέρω

In [138]:
dictionary_gr = corpora.Dictionary(processed_docs_gr)
print(len(dictionary_gr))

25570




In [139]:
# The dictionary maps every unique token to an integer id; print the first few (id, token) pairs.
# Each article is later represented as a bag-of-words: token ids with their counts, ignoring word order.
for idx, (k, v) in enumerate(dictionary_gr.iteritems()):
    print(k, v)
    if idx >= 10:
        break

0 έβαλε
1 έδιναν
2 έκαναν
3 έναντι
4 ένας
5 έπαθε
6 έρευνα
7 έσπασε
8 έτριβαν
9 έφερε
10 αδρών




In [140]:
## These are the dictionary preparation parameters:
filter_tokens_if_container_documents_are_less_than = 15
filter_tokens_if_appeared_percentage_more_than = 0.3
keep_the_first_n_tokens=100000

## and the LDA Parameters:
num_of_topics = 5



In [141]:
dictionary_gr.filter_extremes(
    no_below=filter_tokens_if_container_documents_are_less_than,
    no_above=filter_tokens_if_appeared_percentage_more_than,
    keep_n=keep_the_first_n_tokens)
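
To make the effect of `no_below` and `no_above` more concrete, here is a simplified pure-Python sketch of the filtering rule as we understand it from the gensim documentation. It is not gensim's actual implementation: the function name `filter_extremes_sketch` and the toy documents are made up for illustration only.

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering (illustrative only)."""
    num_docs = len(docs)
    # document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # no_above is a fraction of the corpus, so turn it into an absolute document count
    max_docs = int(no_above * num_docs)
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # among the survivors, keep only the keep_n most frequent tokens
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]


# hypothetical toy corpus: "ευρώ" appears everywhere, several words appear only once
toy_docs = [
    ["ευρώ", "αγορά", "τιμή"],
    ["ευρώ", "ομάδα", "νίκη"],
    ["ευρώ", "αγορά", "κόστος"],
    ["ευρώ", "ομάδα", "αγορά"],
]

surviving = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75, keep_n=100000)
print(surviving)
```

In this toy corpus the word that appears in every document is dropped as too common, and the words that appear in only one document are dropped as too rare. The values used in the notebook (15 documents and 30%) apply the same idea at the scale of the real corpus.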



In [142]:
bow_corpus_gr = [dictionary_gr.doc2bow(doc) for doc in processed_docs_gr]
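
As a rough illustration of what `doc2bow` does for a single document, the following pure-Python sketch counts the tokens that are known to the dictionary and returns sorted `(token_id, count)` pairs. The helper `doc2bow_sketch` and the tiny `toy_token2id` mapping are assumptions made for this example, not part of gensim.

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs (illustrative only)."""
    # tokens missing from the dictionary (e.g. removed by filtering) are skipped
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


# hypothetical mini-dictionary and document
toy_token2id = {"κύκλωμα": 0, "φυλακή": 1, "χάπια": 2}
toy_doc = ["κύκλωμα", "χάπια", "κύκλωμα", "ηρωίνη"]

toy_bow = doc2bow_sketch(toy_doc, toy_token2id)
print(toy_bow)
```

Note that the word order is lost and that "ηρωίνη", which is not in the toy dictionary, simply disappears from the representation.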



In [143]:
# randomly choose an article from the corpus:
sample_bow_doc_gr = choice(bow_corpus_gr)

print('The processed bag-of-word document is just pairs of (word_id, # of occurrences) and looks like this:')
print(sample_bow_doc_gr, '\n\n')

print('We peek in the dictionary: for each word_id, we get its assigned word:')
for word_id, word_freq in sample_bow_doc_gr:
    real_word = dictionary_gr[word_id]
    print(f'Word #{word_id} ("{real_word}") appears {word_freq} time(s).')

The processed bag-of-word document is just pairs of (word_id, # of occurrences) and looks like this:
[(4, 1), (19, 4), (39, 1), (50, 1), (60, 1), (61, 1), (99, 1), (101, 1), (120, 6), (126, 1), (130, 2), (134, 2), (151, 1), (152, 1), (160, 1), (163, 1), (165, 4), (166, 3), (178, 2), (186, 1), (187, 1), (196, 1), (205, 5), (209, 1), (218, 1), (229, 1), (242, 1), (249, 1), (253, 1), (271, 1), (279, 2), (283, 1), (304, 1), (365, 1), (366, 4), (373, 1), (392, 1), (395, 1), (399, 1), (413, 1), (419, 1), (422, 1), (427, 1), (429, 2), (442, 1), (443, 1), (446, 1), (461, 1), (463, 1), (465, 2), (475, 1), (488, 3), (503, 1), (510, 1), (511, 2), (513, 1), (520, 1), (539, 1), (542, 2), (545, 2), (549, 1), (559, 1), (567, 1), (586, 1), (593, 1), (625, 2), (633, 1), (662, 1), (688, 1), (710, 1)] 


We peek in the dictionary: for each word_id, we get its assigned word:
Word #4 ("αποτέλεσμα") appears 1 time(s).
Word #19 ("ελλάδα") appears 4 time(s).
Word #39 ("πέντε") appears 1 time(s).
Word #50 ("πρόβλημα") a



In [144]:
lda_model_gr = models.LdaMulticore(bow_corpus_gr,
                                num_topics=num_of_topics,
                                id2word=dictionary_gr,
                                passes=10,
                                workers=2)



In [145]:
for idx, topic in lda_model_gr.print_topics(num_of_topics):
    print(f'Topic: {idx} \t Words: {topic}')


Topic: 0 	 Words: 0.022*"ομάδα" + 0.015*"ολυμπιακός" + 0.014*"ντέρμπι" + 0.013*"τουρκία" + 0.013*"νίκη" + 0.012*"παναθηναϊκός" + 0.011*"σχέδιο" + 0.011*"θέλω" + 0.010*"καλός" + 0.010*"παιχνίδι"
Topic: 1 	 Words: 0.015*"κυβέρνηση" + 0.011*"υπουργός" + 0.011*"υπουργείο" + 0.009*"προεδρία" + 0.009*"έργο" + 0.009*"περιοχή" + 0.009*"ελληνικός" + 0.008*"χθες" + 0.008*"πολιτικός" + 0.008*"πόλεμος"
Topic: 2 	 Words: 0.020*"οργάνωση" + 0.012*"αρχή" + 0.011*"μήνας" + 0.011*"συμμετοχή" + 0.010*"σύμφωνα" + 0.010*"τμήμα" + 0.009*"νοσοκομείο" + 0.009*"μέλος" + 0.009*"ημέρα" + 0.008*"βάρος"
Topic: 3 	 Words: 0.016*"μουσικός" + 0.013*"ταινία" + 0.012*"άνθρωπος" + 0.012*"λέει" + 0.011*"ελληνικός" + 0.011*"παιδί" + 0.010*"έργο" + 0.010*"κόσμος" + 0.009*"ιστορία" + 0.008*"θέλω"
Topic: 4 	 Words: 0.032*"ευρώ" + 0.022*"τιμή" + 0.017*"οικονομικός" + 0.016*"αγορά" + 0.015*"χώρα" + 0.014*"ευρωπαϊκός" + 0.011*"ευρώπη" + 0.011*"ανάπτυξη" + 0.009*"κόστος" + 0.009*"μέσος"




In [146]:
# initialize a tfidf from our corpus
tfidf_new = models.TfidfModel(bow_corpus_gr)

# apply it on our corpus
tfidf_corpus_gr = tfidf_new[bow_corpus_gr]

pprint(tfidf_corpus_gr[0][:10])

[(0, 0.06649159776328495),
 (1, 0.07567305129784985),
 (2, 0.12410482618894257),
 (3, 0.0531445575432159),
 (4, 0.1143336931497729),
 (5, 0.056222036121048036),
 (6, 0.0381112310499243),
 (7, 0.06554137054899746),
 (8, 0.058259044018074566),
 (9, 0.05971099357557421)]
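
To show where such weights come from, here is a small pure-Python sketch of TF-IDF weighting. It assumes the scheme we believe gensim's `TfidfModel` uses by default (raw term count times `log2(N / document_frequency)`, followed by scaling each document to unit length); treat it as an approximation for intuition, not as the library's code. The function `tfidf_sketch` and the toy corpus are made up.

```python
import math


def tfidf_sketch(bow_corpus):
    """Weight each (word_id, count) pair by count * log2(N / df), then L2-normalize."""
    num_docs = len(bow_corpus)
    # document frequency of every word id
    doc_freq = {}
    for doc in bow_corpus:
        for word_id, _count in doc:
            doc_freq[word_id] = doc_freq.get(word_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(wid, count * math.log2(num_docs / doc_freq[wid])) for wid, count in doc]
        # words that occur in every document get weight 0 and are dropped
        weights = [(wid, w) for wid, w in weights if w > 0]
        norm = math.sqrt(sum(w * w for _wid, w in weights))
        weighted_corpus.append([(wid, w / norm) for wid, w in weights] if norm > 0 else [])
    return weighted_corpus


# hypothetical bag-of-words corpus with three word ids
toy_bow_corpus = [
    [(0, 1), (1, 2)],
    [(0, 1), (2, 1)],
    [(1, 1), (2, 1)],
    [(2, 3)],
]

toy_tfidf = tfidf_sketch(toy_bow_corpus)
for doc in toy_tfidf:
    print([(wid, round(w, 3)) for wid, w in doc])
```

Words that are frequent inside a document but rare across the corpus receive the largest weights, which is why the TF-IDF values are fractions rather than the integer counts of the bag-of-words corpus.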




In [147]:
assert len(bow_corpus_gr) == len(tfidf_corpus_gr)



In [148]:
lda_model_tfidf_gr = models.LdaMulticore(tfidf_corpus_gr,
                                      num_topics=num_of_topics,
                                      id2word=dictionary_gr,
                                      passes=10,
                                      workers=4)



In [149]:
for idx, topic in lda_model_tfidf_gr.print_topics(num_of_topics):
    print(f'Topic: {idx} \t Word: {topic}')

Topic: 0 	 Word: 0.002*"φυλακή" + 0.002*"τρομοκρατία" + 0.002*"δίκη" + 0.002*"δικαιοσύνη" + 0.002*"ουσία" + 0.002*"τιμή" + 0.002*"σπίτι" + 0.002*"δικαστικός" + 0.002*"παιδί" + 0.002*"φυλακός"
Topic: 1 	 Word: 0.005*"ταινία" + 0.005*"βουλευτής" + 0.004*"κινηματογράφος" + 0.004*"εμπορικός" + 0.004*"λεωφόρο" + 0.004*"απουσία" + 0.004*"ενημέρωση" + 0.003*"ενίσχυση" + 0.003*"εξακολουθώ" + 0.003*"επιστημονικός"
Topic: 2 	 Word: 0.006*"μορφή" + 0.005*"υποστηρίζω" + 0.004*"μεταφορά" + 0.004*"απόσταση" + 0.004*"επτά" + 0.004*"τύπος" + 0.004*"τέσσερις" + 0.003*"αποφασίζω" + 0.003*"αφορώ" + 0.003*"μουσικός"
Topic: 3 	 Word: 0.005*"ταξίδι" + 0.004*"καλλιτεχνικός" + 0.004*"συνήθως" + 0.004*"ιανουάριος" + 0.003*"μέρος" + 0.003*"κατηγορούμενος" + 0.003*"οργάνωση" + 0.003*"έργο" + 0.003*"χθες" + 0.002*"δίκη"
Topic: 4 	 Word: 0.004*"έργο" + 0.004*"μουσικός" + 0.004*"ταινία" + 0.004*"περιοχή" + 0.004*"ελληνικός" + 0.004*"κυβέρνηση" + 0.004*"παιδί" + 0.004*"ομάδα" + 0.003*"άνθρωπος" + 0.003*"ευρώ"




In [150]:
# randomly pick an article:
test_article = choice(range(len(processed_docs_gr)))
processed_docs_gr[test_article][:25]



['γιαννης',
 'μπεζος',
 'πετρούκιο',
 'συναντά',
 'θείος',
 'βάνια',
 'ιλαροτραγωδία',
 'απάντηση',
 'γιάννης',
 'μπέζου',
 'καλός',
 'θεατρικός',
 'στιγμή',
 'επόμενος',
 'χειμώνας',
 'σκηνοθετώ',
 'θείος',
 'βάνια',
 'παίζω',
 'βάνια',
 'ελενα',
 'χατζηιωαννου',
 'τηλεοπτικός',
 'επιτυχία',
 'μεγάλος']

In [151]:
for index, score in sorted(lda_model_gr[bow_corpus_gr[test_article]], key=lambda tup: -1*tup[1]):
    # topn is the number of words shown per topic (it is unrelated to the number of topics)
    print(f"Topic match score: {score} \nTopic: {lda_model_gr.print_topic(index, topn=5)}")

Topic match score: 0.9949427247047424 
Topic: 0.016*"μουσικός" + 0.013*"ταινία" + 0.012*"άνθρωπος" + 0.012*"λέει" + 0.011*"ελληνικός"




In [152]:
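# Note: lda_model_tfidf_gr was trained on TF-IDF weights but is queried here with raw counts;
# tfidf_corpus_gr[test_article] would match the training input more closely.
for index, score in sorted(lda_model_tfidf_gr[bow_corpus_gr[test_article]], key=lambda tup: -1*tup[1]):
    # topn is the number of words shown per topic (it is unrelated to the number of topics)
    print("Topic match score: {}\t \nTopic: {}".format(score, lda_model_tfidf_gr.print_topic(index, topn=5)))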

Topic match score: 0.9950549006462097	 
Topic: 0.004*"έργο" + 0.004*"μουσικός" + 0.004*"ταινία" + 0.004*"περιοχή" + 0.004*"ελληνικός"




In [153]:
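# log_perplexity returns the per-word likelihood bound (closer to 0 is better),
# not the perplexity itself; the perplexity is 2 ** (-bound).
print('Perplexity: ', lda_model_gr.log_perplexity(bow_corpus_gr))
print('Perplexity TFIDF: ', lda_model_tfidf_gr.log_perplexity(bow_corpus_gr))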



Perplexity:  -6.319465859070335
Perplexity TFIDF:  -6.646296344890441
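
The two numbers above are per-word likelihood bounds rather than perplexities. According to the gensim documentation, the perplexity is obtained as `2 ** (-bound)`. The short sketch below applies that formula to the values copied from the output above; the variable names are our own.

```python
# per-word likelihood bounds, copied from the output of the cell above
bound_bow = -6.319465859070335
bound_tfidf = -6.646296344890441

# perplexity = 2 ** (-bound); lower perplexity means the model is less "surprised" by the corpus
perplexity_bow = 2 ** (-bound_bow)
perplexity_tfidf = 2 ** (-bound_tfidf)

print(f"Perplexity (bag-of-words model): {perplexity_bow:.1f}")
print(f"Perplexity (TF-IDF model):       {perplexity_tfidf:.1f}")
```

Keep in mind that both models are scored here on the raw bag-of-words corpus, while the second model was trained on TF-IDF weights, so this comparison is only a rough indication.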


In [154]:
#https://www.tanea.gr/2024/12/17/greece/symmoria-ekviaston-o-arxigos-pou-evlepe-apo-to-keli-tou-live-ksylodarmous-kai-oi-dyo-20xronoi-yparxigoi/
unseen_document_gr = """Συμμορία εκβιαστών: Ο αρχηγός που έβλεπε από το κελί του live ξυλοδαρμούς και οι δύο 20χρονοι υπαρχηγοί
Μια από τις πιο σκληρές συμμορίες διακίνησης ναρκωτικών και εκβιαστών εξάρθρωσε η ΕΛ.ΑΣ. – Είχαν απλώσει τα πλοκάμια τους και σε σχολεία
Χωρίς κανένα όριο δρούσε η συμμορία κακοποιών που εξάρθρωσε η αστυνομία συλλαμβάνοντας 25 μέλη ενώ έχουν ταυτοποιηθεί και περιλαμβάνονται στη δικογραφία άλλα 14.

Κεντρικό ρόλο στη δράση της συμμορίας ως αρχηγός φέρεται να είχε ένας γνωστός στις Αρχές Αλβανός κακοποιός που συντόνιζε και έδινε εντολές στη συμμορία μέσα από τις φυλακές Διαβατών,
όπου είναι έγκλειστος. Σύμφωνα με τη δικογραφία είχε το παρατσούκλι «Θείος» και «Μόντι». Επικοινωνούσε με κινητό τηλέφωνο με τα μέλη της σπείρας μέσα από το κελί του.
Μάλιστα, ο «Μόντι» σύμφωνα με όσα αποκάλυψε η αστυνομική έρευνα φέρεται να έβλεπε σε live streaming μέσα από το κινητό του αρπαγές και βασανισμούς.

Συγκεκριμένα όσα άτομα δεν συνεργάζονταν για την εξυπηρέτηση των σκοπών της συμμορίας ή δημιουργούσαν προβλήματα, με εντολή του αρχηγού, οδηγούνταν βίαια σε ερημικές περιοχές στον Υμηττό και τα βασάνιζαν.

Μάλιστα ο Μόντι έβλεπε σε ζωντανή μετάδοση στο κινητό του τηλέφωνο τα βασανιστήρια μέσα στη φυλακή όπου κρατείται και έδινε εντολές πότε θα σταματήσουν.
Κυρίως μέσα από τα διακίνηση ναρκωτικών και τις άλλες παράνομες δραστηριότητες η συμμορία αποκόμιζε μεγάλα κέρδη και τα μέλη της έκαναν πολυτελή ζωή.
Οι δύο υπαρχηγοί

Εκτός από τον «Μόντι» που θεωρείται αρχηγός της συμμορίας αν και έγκλειστος στη φυλακή, κεντρικό ρόλο διαδραμάτιζαν και δύο νεαροί 20 και 21 ετών. Μάλιστα ο 21χρονος με το προσωνύμιο «Χοντρός» είναι ανιψιός του αρχηγού.

Οι δύο νεαροί μετέφεραν στα υπόλοιπα μέλη της συμμορίας τις εντολές του αρχηγού και ουσιαστικά είχαν στον έλεγχό τους τη λειτουργία της συμμορίας.

Ναρκωτικά και στα σχολεία

Σχεδόν αποκλειστική τους ενασχόληση ήταν η προμήθεια και διακίνηση ναρκωτικών σε περιοχές των νοτίων προαστείων ενώ διαχειρίζονταν μεγάλα χρηματικά ποσά.

Μάλιστα δεν δίσταζαν να στρατολογούν και μαθητές προκειμένου να επεκτείνουν τη διακίνηση και μέσα σε σχολεία όπως προκύπτει από τις συνομιλίες που κατέγραψε η Αστυνομία."""

bow_vector_gr = dictionary_gr.doc2bow(preprocess_gr_spacy(unseen_document_gr))

print("Simply printing the lda_model output would look like this:")
pprint(lda_model_gr[bow_vector_gr])

print("\n\nSo let's make it nicer, by printing the topic contents:")
for index, score in sorted(lda_model_gr[bow_vector_gr], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model_gr.print_topic(index, 5)))

Simply printing the lda_model output would look like this:
[(1, 0.2207529), (2, 0.76615167)]


So let's make it nicer, by printing the topic contents:
Score: 0.7661354541778564	 Topic: 0.020*"οργάνωση" + 0.012*"αρχή" + 0.011*"μήνας" + 0.011*"συμμετοχή" + 0.010*"σύμφωνα"
Score: 0.22076913714408875	 Topic: 0.015*"κυβέρνηση" + 0.011*"υπουργός" + 0.011*"υπουργείο" + 0.009*"προεδρία" + 0.009*"έργο"
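
Only two of the five topics are listed for the unseen document. As far as we know, this is because gensim omits topics whose probability falls below the model's `minimum_probability` threshold (0.01 by default). The sketch below, which uses the pairs copied from the output above and variable names of our own, picks the dominant topic and fills in the omitted topics with 0.0.

```python
# (topic_id, probability) pairs copied from the lda_model_gr output above
topic_distribution = [(1, 0.2207529), (2, 0.76615167)]
num_topics = 5  # same value as num_of_topics in this notebook

# the dominant topic is simply the pair with the highest probability
dominant_topic, dominant_score = max(topic_distribution, key=lambda pair: pair[1])
print(f"Dominant topic: {dominant_topic} (score {dominant_score:.3f})")

# build a dense vector; topics that were not reported are treated as (approximately) zero
dense = dict.fromkeys(range(num_topics), 0.0)
dense.update(topic_distribution)
print(dense)
```

A dense vector like this is convenient if you later want to compare documents by their topic mixtures.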




In [155]:
bow_lda_data_gr = gensimvis.prepare(lda_model_gr, bow_corpus_gr, dictionary_gr)

pyLDAvis.display(bow_lda_data_gr)



##### Help note

If your corpus is a CSV file, [pandas' read_csv method](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) is usually the easiest way to load it.

If, however, it is a plain text file or a compressed archive of text files, you will need another way to load it. Luckily, Python's standard library has built-in support for both plain text and compressed files.

###### Example with text
For the sake of this example, let's download two files: one text file, and one gzip file, from this website:

https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-6260-A

In [None]:
!curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11858/00-097C-0000-0023-6260-A{/README.txt,/hindmonocorp05.plaintext.gz}




  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4472  100  4472    0     0   3258      0  0:00:01  0:00:01 --:--:--  3259
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2353M  100 2353M    0     0  19.3M      0  0:02:01  0:02:01 --:--:-- 20.0M


Files are accessed using the `Path` class from Python's `pathlib` module:

In [None]:
from pathlib import Path



To refer to a file, we pass its path to `Path`, like so:

`Path('Folder/filename.extension')`

`Path` has many methods for working with files and folders, including looping over the files in a folder, checking whether a file exists, etc.

Read more about it here: https://docs.python.org/3/library/pathlib.html
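
As a minimal sketch of looping over the files in a folder, the example below creates a temporary folder with a few made-up files and reads every `.txt` file into a list of documents. The folder and file names are hypothetical; with a real corpus you would point `corpus_dir` at your own folder instead.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # hypothetical corpus folder, created only for this demonstration
    corpus_dir = Path(tmp) / "corpus"
    corpus_dir.mkdir()
    (corpus_dir / "a.txt").write_text("πρώτο κείμενο", encoding="utf-8")
    (corpus_dir / "b.txt").write_text("δεύτερο κείμενο", encoding="utf-8")
    (corpus_dir / "notes.md").write_text("not part of the corpus", encoding="utf-8")

    # glob() yields only the paths matching the pattern; sorting gives a stable order
    text_files = sorted(corpus_dir.glob("*.txt"))
    file_names = [path.name for path in text_files]
    documents = [path.read_text(encoding="utf-8") for path in text_files]

print(file_names)
print(documents)
```

Passing `encoding="utf-8"` explicitly avoids surprises on systems whose default encoding is something else, which matters for non-English corpora.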

In [None]:
readme_file = Path('README.txt')

if readme_file.exists():
  # read the text content into a variable
  file_content = readme_file.read_text()
  print(file_content)
else:
  print("README.txt was not found...")

HindEnCorp 0.5 and HindMonoCorp 0.5 File Formats

This file describes the file formats of the Hindi-English and Hindi-only
corpora released in 2014 under the names HindEnCorp 0.5 and HindMonoCorp 0.5.

More details about the preparation of the corpora can be found in the paper:

  Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Aleš Tamchyna
  and Dan Zeman. HindEnCorp - Hindi-English and Hindi-only Corpus for
  Machine Translation. In Proc. of LREC 2014. Reykjavik, Iceland. ISBN
  978-2-9517408-8-4. ELRA. 2014.

or on the corpora web page:
  http://ufal.mff.cuni.cz/hindencorp

Please cite this paper if you make any use of the corpora. BibTeX citation
format below.


Common Properties
-----------------

All the files are plain text:

- compressed with gzip
- encoded in UTF-8
- with unix line breaks (LF)
- with tab-delimited columns

The monolingual and parallel corpora have different columns.

The actual corpus text is stored in one (monolingual corpus) or two (parallel
corp



If the file is gzip-compressed (`.gz`), you can open it and read its data without first extracting it to disk.

https://docs.python.org/3/library/gzip.html

Also, [requests](https://requests.readthedocs.io/en/latest/) is a great package for retrieving content from a URL.

###### Example with gzipped files

In [None]:
import gzip




In [None]:

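# this would download the file (but it's about 2.3 GB, so go easy on your internet provider...):
# import io
# import requests
# response = requests.get('https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11858/00-097C-0000-0023-6260-A/hindmonocorp05.plaintext.gz')
# # gzip.open expects a filename or a file-like object, so wrap the downloaded bytes in BytesIO
# with gzip.open(io.BytesIO(response.content), 'rb') as gz:
#  ...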



In this example, we open the gzip file and read the first few lines.
The file is opened in binary mode (`'rb'`), so each line is a `bytes` object that we decode from UTF-8 (the encoding of the Hindi text) into a string:

In [None]:
with gzip.open('hindmonocorp05.plaintext.gz', 'rb') as gz:
  for i in range(10):
    line = gz.readline()
    words = line.decode('utf8').split()
    print(words)


['hwt2013', '<s>', 'लेकिन', 'गांव', 'के', 'जगदीश', 'मेघवाल,', 'मोहन...']
['spiderling', '<s>', 'विटामिन', 'सी', 'शरीर', 'में', 'रोग', 'पैदा', 'करने', 'वाले', 'विषाणुओं', 'से', 'लड़ने', 'की', 'ताकत', 'पैदा', 'करता', 'है', 'और', 'शरीर', 'में', 'इसकी', 'संतुलित', 'मात्रा', 'बने', 'रहने', 'से', 'रोग', 'प्रतिरोधक', 'क्षमता', 'मजबूत', 'रहती', 'है।']
['spiderling', '<s>', 'इन', 'बोतलों', 'के', 'बहुत', 'कम', 'पैसे', 'मिलते', 'हैं।']
['commoncrawl', '<a>', 'कार्टून', ':-', 'रे', 'लोकपाल', 'आ', 'गया', 'तू', '?', 'शाबाश....', '19', '0']
['spiderling', '<s>', 'प्रखर', 'बुद्धि', 'तेजस्वी', 'बालक', 'राजेन्द्र', 'बाल्यावस्था', 'में', 'ही', 'फारसी', 'में', 'शिक्षा', 'ग्रहण', 'करने', 'लगा', 'और', 'उसके', 'पश्चात', 'प्राथमिक', 'शिक्षा', 'के', 'लिए', 'छपरा', 'के', 'जिला', 'स्कूल', 'में', 'नामांकित', 'हो', 'गया।']
['commoncrawl', '<a>', 'निदेशक', 'स्तर', 'का', 'एक', 'वैज्ञानिक', 'संस्थान', 'या', 'सहोदर', 'संस्थान', 'से', '(']
['commoncrawl', '<a>', 'गज़ब', 'का', 'बतंगड़', 'है!', ':)', 'हिट', 'तो', 'वैसे',



**IMPORTANT NOTE**: If your corpus, like this example file, is very large (2.3 GB compressed), please don't load all of the text: it may not fit in memory and will only cause you trouble.  
Instead, use only the first 10k to 20k sentences or so. For this exercise we just want you to get a feel for the steps and the process involved in using LDA.
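
One possible way to read only the first lines of a large gzip file is sketched below, using `itertools.islice` so that the rest of the file is never loaded. To keep the example self-contained it writes a small made-up gzip file first; the helper name `read_first_lines` and the sample content are assumptions for illustration.

```python
import gzip
import tempfile
from itertools import islice
from pathlib import Path


def read_first_lines(gz_path, max_lines):
    """Return at most max_lines decoded lines from a gzip-compressed text file."""
    # 'rt' opens the file in text mode, so lines are decoded from UTF-8 for us
    with gzip.open(gz_path, "rt", encoding="utf-8") as gz:
        return [line.rstrip("\n") for line in islice(gz, max_lines)]


with tempfile.TemporaryDirectory() as tmp:
    # small stand-in for a large corpus file (tab-delimited, one sentence per line)
    sample_path = Path(tmp) / "sample.txt.gz"
    with gzip.open(sample_path, "wt", encoding="utf-8") as gz:
        for i in range(1000):
            gz.write(f"source\t<s>\tsentence number {i}\n")

    first_lines = read_first_lines(sample_path, max_lines=20)

print(len(first_lines))
print(first_lines[0].split("\t"))
```

For the real corpus you would call the helper with the downloaded file and a limit such as 20000, then pass the sentences through the same preprocessing steps as before.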

# Afterword

Gensim is not the only library that implements the LDA algorithm.
Another package that does LDA is [tomotopy](https://bab2min.github.io/tomotopy/v0.12.3/en/), which is sometimes even faster than gensim. Additionally, LDA is implemented in [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html), which we use often in the course.

Here's a jupyter example using Scikit-learn and spaCy: https://nbviewer.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb

Since we won't dive deeper into the LDA topic in this course, if you wish to know more about the statistics behind it, [this video](https://www.youtube.com/watch?v=0jQo8lVRHRY) of a lesson by the researcher [Nando de Freitas](https://linkedin.com/in/nandodefreitas) gives a good overview.

