# Latent Dirichlet Allocation (LDA)

#### Wikifetcher
Fetches raw text from Wikipedia using search terms.
#### LDAbuilder
Runs LDA on the given document list (the raw text list from Wikifetcher).

## Execution
The execution time of each code block is also measured.
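The start/end timing that each cell repeats could be factored into a small helper. The following is a minimal sketch, not part of the original notebook: the name `timed` is hypothetical, and it uses `time.perf_counter()` rather than the `time.time()` calls used in the cells.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label='Execution Time'):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print('%s: %f s' % (label, elapsed))

# Usage: wrap any block that should be timed
with timed():
    total = sum(range(1000))
```

Wrapping a cell body in `with timed():` would replace the paired `start`/`end` assignments and the final `print` call.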
### Configuration 
- Access to Wikipedia (via the `wikipedia` package) for the raw text
- Natural Language Toolkit (NLTK) for tokenization and stemming
- `stop_words` to remove common words that carry little meaning
- Gensim for the Latent Dirichlet Allocation (LDA) implementation

In [None]:
import wikipedia
import time
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
import re
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim import corpora, models

start = time.time()

sentence_pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
tokenizer = RegexpTokenizer(r'\w+')

# Create english stop words list
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()

doc_list = []
wikipedia.set_lang('en')

end = time.time()
print('Execution Time: %f' %(end-start) + ' s')

Execution Time: 0.001001 s


### Wikipedia Content
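Using search terms, we retrieve the raw content from Wikipedia. Everything from the `== References ==` heading onward is cut off, and the remaining text of each article is added to the document list as one document. Splitting the content into sentences is prepared in the code but currently commented out.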

In [2]:
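def get_page(name):
    """Return the content of the first Wikipedia search hit for `name`."""
    first_found = wikipedia.search(name)[0]
    try:
        return wikipedia.page(first_found).content
    except wikipedia.exceptions.DisambiguationError as e:
        # Ambiguous title: fall back to the first suggested option
        return wikipedia.page(e.options[0]).content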
    
start = time.time()

search_terms = ['Nature', 'Volcano', 'Ocean', 'Landscape', 'Earth', 'Animals']
separator = '== References =='
for term in search_terms:
    full_content = get_page(term).split(separator, 1)[0]
    # sentence_list = sentence_pat.findall(full_content)
    #for sentence in sentence_list:
    doc_list.append(full_content)

    print(full_content[0:1000] + '...')
    print('---')

end = time.time()
print('Execution Time: %f' %(end-start) + ' s')

Nature, in the broadest sense, is the natural, physical, or material world or universe. "Nature" can refer to the phenomena of the physical world, and also to life in general. The study of nature is a large part of science. Although humans are part of nature, human activity is often understood as a separate category from other natural phenomena.
The word nature is derived from the Latin word natura, or "essential qualities, innate disposition", and in ancient times, literally meant "birth". Natura is a Latin translation of the Greek word physis (φύσις), which originally related to the intrinsic characteristics that plants, animals, and other features of the world develop of their own accord. The concept of nature as a whole, the physical universe, is one of several expansions of the original notion; it began with certain core applications of the word φύσις by pre-Socratic philosophers, and has steadily gained currency ever since. This usage continued during the advent of modern scienti
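
The commented-out sentence split in the loop above can be tried on its own. This is a small sketch that reuses the `sentence_pat` pattern from the configuration cell; the sample text and the name `doc_list_by_sentence` are made up for illustration.

```python
import re

# Same pattern as in the configuration cell
sentence_pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)

# Hypothetical sample text standing in for a Wikipedia article
sample = 'Nature is vast. It includes life! Does it end?'
sentences = sentence_pat.findall(sample)

# With sentence splitting enabled, each sentence becomes its own document
doc_list_by_sentence = []
for sentence in sentences:
    doc_list_by_sentence.append(sentence)

print(doc_list_by_sentence)
```

Note that the pattern only matches sentences that start with an ASCII capital letter, and it also splits at periods inside abbreviations or decimal numbers, so the result on real article text is likely to be rougher than on this sample.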

### Preprocessing
The text is now lowercased and tokenized, and stop words are removed. Stemming is available as an optional step but is commented out in the code below.

In [3]:
num_topics = 5
num_words_per_topic = 20
texts = []

In [4]:
import pandas as pd

start = time.time()

for doc in doc_list:
    raw = doc.lower()
    # Create tokens
    tokens = tokenizer.tokenize(raw)
    # Remove stop words
    stopped_tokens = [i for i in tokens if i not in en_stop]
    # Optional: reduce each token to its word stem
    # stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stopped_tokens)
output_preprocessed = pd.Series(texts)

print(output_preprocessed)

end = time.time()
print('Execution Time: %f' %(end-start) + ' s')

0    [nature, broadest, sense, natural, physical, m...
1    [volcano, rupture, crust, planetary, mass, obj...
2    [ocean, ancient, greek, ὠκεανός, transc, okean...
3    [landscape, visible, features, area, land, lan...
4    [earth, third, planet, sun, object, universe, ...
5    [animals, eukaryotic, multicellular, organisms...
dtype: object
Execution Time: 0.062492 s
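
The topic output further down contains tokens such as `"s"`, `"1"` and `"5"`, which survive the stop word filter. One possible extension is to also drop very short and purely numeric tokens. The sketch below is dependency-free and only illustrates the idea: `re.findall` stands in for the `RegexpTokenizer(r'\w+')` used above, and `demo_stop`, `preprocess` and `min_len` are hypothetical names, not part of the notebook.

```python
import re

# Hypothetical mini stop word list; the notebook uses get_stop_words('en')
demo_stop = {'the', 'is', 'a', 'of', 'in', 'and', 'to'}

def preprocess(doc, stop_words, min_len=2):
    """Lowercase, tokenize, and filter stop words, short tokens and numbers."""
    # r'\w+' is the same pattern the notebook passes to RegexpTokenizer
    tokens = re.findall(r'\w+', doc.lower())
    return [t for t in tokens
            if t not in stop_words and len(t) >= min_len and not t.isdigit()]

tokens = preprocess("The Earth's crust is 5 to 70 km thick", demo_stop)
print(tokens)  # ['earth', 'crust', 'km', 'thick']
```

Whether numbers should be removed depends on the corpus; for articles where years or quantities matter, keeping them may be the better choice.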


### Dictionary and vectors
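In this section, we create the bag-of-words corpus. The dictionary maps each unique token to an integer ID, and each document is converted into a vector of `(token_id, count)` tuples. These vectors are needed later for the LDA model.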

In [5]:
start = time.time()

dictionary = corpora.Dictionary(texts)
# convert dictionary to bag-of-words
# corpus is a list of vectors - each document vector is a series of tuples
corpus = [dictionary.doc2bow(text) for text in texts]

output_vectors = pd.Series(corpus)

print(dictionary)
print('---')
print(output_vectors)

end = time.time()
print('Execution Time: %f' %(end-start) + ' s')

Dictionary(5354 unique tokens: ['nature', 'broadest', 'sense', 'natural', 'physical']...)
---
0    [(0, 51), (1, 2), (2, 1), (3, 32), (4, 9), (5,...
1    [(3, 2), (5, 6), (6, 1), (8, 28), (9, 2), (11,...
2    [(3, 4), (4, 2), (5, 1), (6, 15), (8, 12), (11...
3    [(0, 10), (2, 4), (3, 15), (4, 10), (5, 2), (6...
4    [(0, 2), (2, 1), (3, 7), (4, 3), (5, 6), (6, 1...
5    [(5, 2), (6, 2), (8, 5), (9, 1), (10, 1), (11,...
dtype: object
Execution Time: 0.062440 s
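
To make the shape of these vectors concrete, the following plain-Python sketch builds a token-to-ID mapping and `(token_id, count)` vectors for two tiny documents. It only imitates the output format of `corpora.Dictionary` and `doc2bow`; it is not Gensim's implementation, IDs are assigned in first-seen order here as an assumption, and `build_vocab` and `to_bow` are hypothetical helper names.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer ID to each unique token, in first-seen order."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(text, token2id):
    """Return a sorted list of (token_id, count) tuples for one document."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

demo_texts = [['earth', 'sun', 'earth'], ['ocean', 'earth']]
token2id = build_vocab(demo_texts)
demo_corpus = [to_bow(text, token2id) for text in demo_texts]

print(token2id)     # {'earth': 0, 'sun': 1, 'ocean': 2}
print(demo_corpus)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Tokens that do not occur in a document are simply absent from its vector, which is why the vectors printed above skip some IDs.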


### LDA model
Finally, the LDA model can be trained. The arguments are the list of bag-of-words vectors, the number of topics, the dictionary, and the number of passes over the corpus (`passes`).
For training, a higher number of passes (`>= 20`) should be selected.

In [6]:
start = time.time()

# Apply LDA model
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=50)
lda = ldamodel.print_topics(num_topics=num_topics, num_words=num_words_per_topic)
    
for topic in lda:
    for entry in topic:
        print(entry)
        print('---')

end = time.time()
print('Execution Time: %f' %(end-start) + ' s')

0
---
0.032*"earth" + 0.018*"s" + 0.008*"sun" + 0.008*"surface" + 0.005*"solar" + 0.005*"atmosphere" + 0.005*"moon" + 0.005*"1" + 0.005*"life" + 0.004*"water" + 0.004*"years" + 0.004*"land" + 0.004*"million" + 0.004*"5" + 0.003*"oceans" + 0.003*"year" + 0.003*"3" + 0.003*"energy" + 0.003*"field" + 0.003*"crust"
---
1
---
0.011*"water" + 0.010*"ocean" + 0.009*"animals" + 0.007*"earth" + 0.007*"surface" + 0.006*"life" + 0.005*"nature" + 0.005*"also" + 0.005*"zone" + 0.005*"oceans" + 0.005*"s" + 0.004*"species" + 0.004*"can" + 0.004*"natural" + 0.004*"human" + 0.004*"animal" + 0.004*"may" + 0.003*"world" + 0.003*"called" + 0.003*"within"
---
2
---
0.036*"landscape" + 0.009*"landscapes" + 0.007*"s" + 0.006*"painting" + 0.006*"poetry" + 0.006*"century" + 0.005*"human" + 0.004*"chinese" + 0.004*"cultural" + 0.004*"english" + 0.004*"land" + 0.004*"also" + 0.004*"natural" + 0.004*"garden" + 0.004*"art" + 0.003*"people" + 0.003*"can" + 0.003*"gardens" + 0.003*"term" + 0.003*"many"
---
3
---
0.0
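
Each topic printed above is a single string of `weight*"word"` terms. If the weights are needed as numbers, such a string can be parsed with a regular expression. This is a minimal sketch: `term_pat` and `parse_topic` are hypothetical names, and the sample string is shortened from the first topic in the output above.

```python
import re

# Shortened from the first topic printed above
topic_str = '0.032*"earth" + 0.018*"s" + 0.008*"sun" + 0.008*"surface"'

# One term looks like: 0.032*"earth"
term_pat = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in term_pat.findall(topic)]

pairs = parse_topic(topic_str)
print(pairs)
# [('earth', 0.032), ('s', 0.018), ('sun', 0.008), ('surface', 0.008)]
```

When the trained model is still in memory, parsing is not necessary: Gensim's `show_topic` method on the model returns `(word, probability)` pairs for one topic directly.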

## Visualization
The trained model is visualized interactively with `pyLDAvis`.

In [7]:
# Note: newer pyLDAvis releases provide this module as pyLDAvis.gensim_models
import pyLDAvis.gensim
# Ignore deprecation warnings for pyLDAvis
warnings.simplefilter("ignore", DeprecationWarning)
    
start = time.time()
pyLDAvis.enable_notebook()

vis_data = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)

end = time.time()
print('Execution Time: %f' %(end-start) + ' s')

Execution Time: 10.726074 s


In [8]:
pyLDAvis.display(vis_data)