# Latent Dirichlet Allocation (LDA)

#### Wikifetcher
Fetches raw text from Wikipedia using search terms.
#### LDAbuilder
Runs LDA on the given document list (the raw-text list produced by Wikifetcher).

## Execution
### Configuration
- Access to Wikipedia for the raw text
- Natural Language Toolkit (NLTK) for tokenization and stemming
- stop_words to remove uninformative words
- Gensim for the implementation of Latent Dirichlet Allocation (LDA)

In [1]:
import wikipedia
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
import re
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim import corpora, models

sentence_pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
tokenizer = RegexpTokenizer(r'\w+')

# Create the English stop word list
en_stop = get_stop_words('en')
# Create p_stemmer, an instance of PorterStemmer
p_stemmer = PorterStemmer()

doc_list = []
wikipedia.set_lang('en')

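### Wikipedia Content
We fetch the raw content from Wikipedia using search terms.
The content of each page, up to its References section, is added to the document list as one document. Splitting the content into sentences and adding each sentence as its own document is prepared in the code but commented out.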

In [2]:
def get_page(name):
    # Use the first search hit for the given term
    first_found = wikipedia.search(name)[0]
    try:
        return wikipedia.page(first_found).content
    except wikipedia.exceptions.DisambiguationError as e:
        # Ambiguous title: fall back to the first suggested option
        return wikipedia.page(e.options[0]).content
    
search_terms = ['Stuttgart', 'Masters degree', 'University']
separator = '== References =='
for term in search_terms:
    full_content = get_page(term).split(separator, 1)[0]
    # sentence_list = sentence_pat.findall(full_content)
    #for sentence in sentence_list:
    doc_list.append(full_content)

    print(full_content[0:1500] + '...')

Stuttgart ( SHTUUT-gart; German: [ˈʃtʊtɡaʁt] ( listen); Swabian: Schduagert, pronounced [ˈʒ̊d̥ua̯ɡ̊ɛʕd̥]; names in other languages) is the capital and largest city of the German state of Baden-Württemberg.
Stuttgart is located on the Neckar river in a fertile valley locally known as the "Stuttgart Cauldron" an hour from the Swabian Jura and the Black Forest, and its urban area has a population of 609,219, making it the sixth largest city in Germany. 2.7 million people live in the city's administrative region and another 5.3 million people in its metropolitan area, making it the fourth largest metropolitan area in Germany.
The city and metropolitan area are consistently ranked among the top 20 European metropolitan areas by GDP; Mercer listed Stuttgart as 21st on its 2015 list of cities by quality of living, innovation agency 2thinknow ranked the city 24th globally out of 442 cities  and the Globalization and World Cities Research Network ranked the city as a Beta-status world city in t
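
The commented-out sentence splitting can be tried in isolation. The following is a minimal sketch that reuses the `sentence_pat` regular expression from the configuration cell on two made-up sample strings (the sample texts are assumptions, not Wikipedia output). It also shows a limitation of this simple pattern: a decimal point is treated as a sentence end.

```python
import re

# Same pattern as in the configuration cell
sentence_pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)

# Made-up sample text for illustration
sample = "Stuttgart is a city. It lies on the Neckar! Is it large?"
sentences = sentence_pat.findall(sample)
print(sentences)

# A decimal number cuts the sentence short, and the remainder is dropped
# because it does not start with a capital letter
truncated = sentence_pat.findall("The area has 2.7 million people.")
print(truncated)
```

If sentence-level documents are wanted, a dedicated sentence tokenizer is probably a more robust choice than this pattern.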

### Preprocessing
The text is now tokenized and uninformative words (stop words) are removed. Stemming is optional and commented out in the code below.

In [3]:
num_topics = 5
num_words_per_topic = 20
texts = []

In [4]:
import pandas as pd

for doc in doc_list:
    raw = doc.lower()
    # Create tokens
    tokens = tokenizer.tokenize(raw)
    # Remove uninformative words (stop words)
    stopped_tokens = [i for i in tokens if i not in en_stop]
    # Stem tokens: reduce words to their stem so that inflected forms are merged (optional)
    # stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stopped_tokens)
output_preprocessed = pd.Series(texts)
print(output_preprocessed)

0    [stuttgart, shtuut, gart, german, ˈʃtʊtɡaʁt, l...
1    [master, s, degree, latin, magister, usually, ...
2    [university, latin, universitas, whole, instit...
dtype: object
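
To make the preprocessing step concrete, here is a minimal standard-library sketch of the same pipeline on a single made-up sentence. It rests on two assumptions: `re.findall(r'\w+', ...)` stands in for `RegexpTokenizer(r'\w+')`, and `toy_stop_words` is a hypothetical, much shorter stand-in for the list returned by `get_stop_words('en')`.

```python
import re

# Hypothetical, tiny stop word set for illustration only
toy_stop_words = {"a", "is", "by", "the"}

raw = "A Master's degree is awarded by the University.".lower()

# \w+ matches runs of word characters, so the apostrophe splits "master's"
tokens = re.findall(r"\w+", raw)

# Drop stop words, keep everything else in order
stopped_tokens = [t for t in tokens if t not in toy_stop_words]
print(stopped_tokens)
```

The stray `s` token matches the `s` in the second document of the output above. Filtering out single-character tokens would be one possible refinement.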


### Dictionary and Vectors
This section builds the bag-of-words corpus. The vectors are needed later for the LDA model.

In [5]:
# Create a dictionary that maps each token to an integer id
dictionary = corpora.Dictionary(texts)
# Convert the tokenized documents into bag-of-words vectors
# corpus is a list of vectors; each document vector is a list of (token_id, count) tuples
corpus = [dictionary.doc2bow(text) for text in texts]

output_vectors = pd.Series(corpus)
print(output_vectors)

0    [(0, 299), (1, 1), (2, 1), (3, 45), (4, 1), (5...
1    [(3, 1), (10, 11), (14, 1), (19, 1), (28, 1), ...
2    [(3, 7), (15, 1), (18, 3), (19, 15), (22, 1), ...
dtype: object
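
The bag-of-words representation can be illustrated without Gensim. The sketch below is a simplified stand-in, not Gensim's implementation: `build_token_ids` and `to_bow` are hypothetical helper names, and assigning ids in first-seen order is an assumption that may differ from how `corpora.Dictionary` numbers its tokens.

```python
from collections import Counter

def build_token_ids(texts):
    """Map each distinct token to an integer id (first-seen order, an assumption)."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(token2id, text):
    """Return a sorted list of (token_id, count) tuples for one document."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

# Made-up toy documents for illustration
toy_texts = [["stuttgart", "city", "stuttgart"], ["university", "city"]]
token2id = build_token_ids(toy_texts)
toy_corpus = [to_bow(token2id, text) for text in toy_texts]
print(token2id)
print(toy_corpus)
```

Read this way, the tuple `(0, 299)` in the output above means that the token with id 0 occurs 299 times in the first document.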


### LDA Model
Finally, the LDA model can be trained. Its parameters are the list of vectors (the corpus), the number of topics, the dictionary, and the number of training passes over the corpus (`passes`).
For training, a higher number of passes (>= 20) should be chosen.

In [6]:
# Train the LDA model
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=20)
lda = ldamodel.print_topics(num_topics=num_topics, num_words=num_words_per_topic)

# Each entry is a (topic_id, topic_string) tuple
for topic_id, topic_terms in lda:
    print(topic_id)
    print(topic_terms)

0
0.000*"s" + 0.000*"stuttgart" + 0.000*"city" + 0.000*"master" + 0.000*"universities" + 0.000*"university" + 0.000*"degrees" + 0.000*"degree" + 0.000*"württemberg" + 0.000*"also" + 0.000*"research" + 0.000*"m" + 0.000*"first" + 0.000*"germany" + 0.000*"world" + 0.000*"year" + 0.000*"higher" + 0.000*"century" + 0.000*"years" + 0.000*"since"
1
0.034*"stuttgart" + 0.015*"city" + 0.010*"s" + 0.008*"württemberg" + 0.006*"germany" + 0.005*"also" + 0.005*"german" + 0.004*"area" + 0.004*"war" + 0.004*"state" + 0.004*"world" + 0.003*"baden" + 0.003*"one" + 0.003*"castle" + 0.003*"since" + 0.003*"first" + 0.003*"center" + 0.003*"year" + 0.003*"many" + 0.003*"000"
2
0.026*"university" + 0.026*"universities" + 0.007*"education" + 0.005*"higher" + 0.005*"also" + 0.004*"european" + 0.004*"new" + 0.004*"europe" + 0.004*"students" + 0.004*"research" + 0.004*"century" + 0.004*"state" + 0.004*"scholars" + 0.004*"many" + 0.003*"knowledge" + 0.003*"institution" + 0.003*"countries" + 0.003*"texts" + 0.003
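
The printed topic strings are convenient to read but awkward to process further. As a rough sketch, the format shown above can be parsed back into (word, weight) pairs with a regular expression. `TERM_PAT` and `parse_topic` are hypothetical names introduced here, and the sample string is shortened from the output above. Gensim's `show_topic` method returns such pairs directly, which is usually the simpler route when the model object is still available.

```python
import re

# Matches one term of the form 0.034*"stuttgart"
TERM_PAT = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) tuples."""
    return [(word, float(weight)) for weight, word in TERM_PAT.findall(topic_string)]

# Shortened copy of the topic 1 output above
topic_1 = '0.034*"stuttgart" + 0.015*"city" + 0.010*"s"'
pairs = parse_topic(topic_1)
print(pairs)
```

A trailing weight without a quoted word, as in a truncated output line, does not match the pattern and is skipped.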

## Visualization
Using pyLDAvis

In [7]:
# Note: newer pyLDAvis releases renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

vis_data = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(vis_data)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  topic_term_dists = topic_term_dists.ix[topic_order]
