# Latent Dirichlet Allocation (LDA)

#### Wikifetcher
Fetches raw text from Wikipedia using search terms.
#### LDAbuilder
Runs LDA on the given document list (the raw text list from Wikifetcher).

## Execution
The execution time of each block is also measured.
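
The cells in this notebook time themselves with a `start = time.time()` / `end = time.time()` pair. As a minimal sketch of an alternative, the same measurement can be wrapped in a context manager. The name `timed` is hypothetical and not part of the notebook, and `time.perf_counter()` is swapped in for `time.time()` because it is intended for measuring short durations.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label="Execution Time"):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so the time is always reported
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:f} s")


# Usage: wrap the body of any block
with timed():
    total = sum(range(1_000_000))
```

This keeps the timing code in one place, so each cell only needs a single `with timed():` line.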
### Configuration 
- Access to Wikipedia for the raw text
- Natural Language Toolkit (NLTK) for tokenization and stemming
- `stop_words` to remove common words that carry little meaning
- Gensim for the Latent Dirichlet Allocation (LDA) implementation

In [11]:
%pip install -r ../requirements.txt
%pip list

Collecting pyldavis==3.4.1 (from -r ../requirements.txt (line 5))
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Collecting numexpr (from pyldavis==3.4.1->-r ../requirements.txt (line 5))
  Downloading numexpr-2.8.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (383 kB)
Collecting funcy (from pyldavis==3.4.1->-r ../requirements.txt (line 5))
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, numexpr, pyldavis
Successfully installed funcy-2.0 numexpr-2.8.5 pyldavis-3.4.1

In [3]:
import wikipedia
import time
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
import re
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

start = time.time()

sentence_pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
tokenizer = RegexpTokenizer(r'\w+')

# Create English stop word list
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()

doc_list = []
wikipedia.set_lang('en')

end = time.time()
print('Execution Time: %f' %(end-start) + ' s')

Execution Time: 0.003447 s


### Wikipedia Content
Using search terms, we retrieve the raw content from Wikipedia. Each page's content, up to its References section, is added to the document list as a single document. Splitting the content into sentences is available but currently commented out.

In [4]:
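def get_page(name):
    """Return the content of the first Wikipedia search hit for `name`."""
    first_found = wikipedia.search(name)[0]
    try:
        # wikipedia.page() auto-suggests titles by default, so the page
        # returned can differ from the search hit.
        return wikipedia.page(first_found).content
    except wikipedia.exceptions.DisambiguationError as e:
        # Fall back to the first option of the disambiguation page
        return wikipedia.page(e.options[0]).content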
    
start = time.time()

search_terms = ['Nature', 'Volcano', 'Ocean', 'Landscape', 'Earth', 'Animals']
separator = '== References =='
for term in search_terms:
    full_content = get_page(term).split(separator, 1)[0]
    # sentence_list = sentence_pat.findall(full_content)
    #for sentence in sentence_list:
    doc_list.append(full_content)

    print(full_content[0:1000] + '...')
    print('---')

end = time.time()
print('Execution Time: %f' %(end-start) + ' s')

Nature, in the broadest sense, is the physical world or universe. "Nature" can refer to the phenomena of the physical world, and also to life in general. The study of nature is a large, if not the only, part of science. Although humans are part of nature, human activity is often understood as a separate category from other natural phenomena.The word nature is borrowed from the Old French nature and is derived from the Latin word natura, or "essential qualities, innate disposition", and in ancient times, literally meant "birth". In ancient philosophy, natura is mostly used as the Latin translation of the Greek word physis (φύσις), which originally related to the intrinsic characteristics of plants, animals, and other features of the world to develop of their own accord.
The concept of nature as a whole, the physical universe, is one of several expansions of the original notion; it began with certain core applications of the word φύσις by pre-Socratic philosophers (though this word had a

### Preprocessing
The text is now lowercased and tokenized, and stop words are removed. Stemming is available as an optional step but is commented out here.

In [5]:
num_topics = 5
num_words_per_topic = 20
texts = []

In [6]:
import pandas as pd

start = time.time()

for doc in doc_list:
    raw = doc.lower()
    # Split the text into word tokens
    tokens = tokenizer.tokenize(raw)
    # Remove stop words
    stopped_tokens = [i for i in tokens if i not in en_stop]
    # Optional: reduce each token to its stem with the Porter stemmer
    # stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stopped_tokens)
output_preprocessed = pd.Series(texts)

print(output_preprocessed)

end = time.time()
print('Execution Time: %f' %(end-start) + ' s')

0    [nature, broadest, sense, physical, world, uni...
1    [volcano, rupture, crust, planetary, mass, obj...
2    [ocean, also, known, sea, world, ocean, body, ...
3    [landscape, visible, features, area, land, lan...
4    [death, irreversible, cessation, biological, f...
5    [animals, multicellular, eukaryotic, organisms...
dtype: object
Execution Time: 0.106811 s
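
To make the preprocessing step concrete, here is a minimal standard-library sketch of the same idea on a single sentence. It assumes that `re.findall(r"\w+", ...)` behaves like `RegexpTokenizer(r'\w+')` for this input, and `SAMPLE_STOP_WORDS` is a hypothetical, tiny stand-in for the list returned by `get_stop_words('en')`.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# the much larger list returned by get_stop_words('en').
SAMPLE_STOP_WORDS = {"the", "is", "or", "in", "a", "of"}


def preprocess(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]


sample = "Nature, in the broadest sense, is the physical world or universe."
print(preprocess(sample))
# ['nature', 'broadest', 'sense', 'physical', 'world', 'universe']
```

Because `\w+` matches only runs of word characters, a possessive such as `earth's` is split into `earth` and `s`. Unless the stop-word list contains `s`, that token survives preprocessing and can show up among the topic words.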


### Dictionary and vectors
In this section, we build the dictionary, which maps each unique token to an integer ID, and the bag-of-words corpus. The resulting vectors are the input to the LDA model.

In [7]:
from gensim.corpora import Dictionary

start = time.time()

dictionary = Dictionary(texts)
# convert each document to a bag-of-words vector
# corpus is a list of vectors - each document vector is a list of (token_id, count) tuples
corpus = [dictionary.doc2bow(text) for text in texts]

output_vectors = pd.Series(corpus)

print(dictionary)
print('---')
print(output_vectors)

end = time.time()
print('Execution Time: %f' %(end-start) + ' s')

Dictionary(6141 unique tokens: ['0', '000', '001', '01186', '020']...)
---
0    [(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1...
1    [(0, 1), (1, 6), (7, 8), (8, 2), (9, 1), (22, ...
2    [(0, 2), (1, 29), (7, 13), (8, 7), (9, 6), (14...
3    [(24, 1), (35, 1), (66, 1), (69, 2), (70, 1), ...
4    [(1, 9), (7, 4), (9, 2), (14, 2), (16, 3), (22...
5    [(0, 2), (1, 7), (7, 6), (8, 3), (9, 1), (14, ...
dtype: object
Execution Time: 0.094048 s
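
The bag-of-words representation can be illustrated without gensim. The sketch below is a simplified stand-in, not gensim's implementation: the helper names `build_vocabulary` and `to_bow` are hypothetical, and IDs are assigned in order of first appearance, which may differ from the IDs that `Dictionary` assigns.

```python
from collections import Counter


def build_vocabulary(texts):
    """Assign an integer ID to each unique token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


toy_texts = [["ocean", "water", "ocean"], ["volcano", "lava", "water"]]
vocab = build_vocabulary(toy_texts)
print(vocab)
# {'ocean': 0, 'water': 1, 'volcano': 2, 'lava': 3}
print([to_bow(tokens, vocab) for tokens in toy_texts])
# [[(0, 2), (1, 1)], [(1, 1), (2, 1), (3, 1)]]
```

Each tuple reads as "token with this ID occurs this many times in the document". Word order is discarded, which is the defining property of a bag-of-words model.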


### LDA model
Finally, the LDA model can be trained. The arguments passed are the corpus (the list of vectors), the number of topics, the dictionary, and the number of passes over the corpus.
For training, a higher number of passes (`>= 20`) should be selected.

In [10]:
from gensim.models.ldamodel import LdaModel

start = time.time()

# Apply LDA model
ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=50)
lda = ldamodel.print_topics(num_topics=num_topics, num_words=num_words_per_topic)

# print_topics returns (topic_id, topic_string) tuples
for topic_id, topic_words in lda:
    print(topic_id)
    print('---')
    print(topic_words)
    print('---')

end = time.time()
print('Execution Time: %f' %(end-start) + ' s')

0
---
0.012*"nature" + 0.010*"life" + 0.010*"earth" + 0.008*"water" + 0.008*"human" + 0.007*"natural" + 0.005*"surface" + 0.005*"can" + 0.004*"also" + 0.004*"s" + 0.004*"species" + 0.004*"animals" + 0.004*"ocean" + 0.003*"years" + 0.003*"million" + 0.003*"within" + 0.003*"world" + 0.003*"several" + 0.003*"part" + 0.003*"generally"
---
1
---
0.022*"ocean" + 0.013*"water" + 0.008*"surface" + 0.008*"volcanoes" + 0.008*"volcanic" + 0.008*"s" + 0.008*"earth" + 0.006*"can" + 0.006*"eruptions" + 0.005*"oceans" + 0.004*"volcano" + 0.004*"also" + 0.004*"lava" + 0.004*"sea" + 0.004*"carbon" + 0.004*"may" + 0.004*"high" + 0.004*"temperature" + 0.003*"000" + 0.003*"zone"
---
2
---
0.023*"death" + 0.013*"animals" + 0.006*"life" + 0.006*"may" + 0.005*"can" + 0.005*"animal" + 0.005*"body" + 0.005*"people" + 0.005*"one" + 0.004*"brain" + 0.004*"s" + 0.004*"dead" + 0.004*"many" + 0.004*"also" + 0.003*"species" + 0.003*"person" + 0.003*"million" + 0.003*"aging" + 0.003*"medical" + 0.003*"definition"
---
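
Each topic above is printed as a sum of `weight*"word"` terms, where the weight is the word's probability within that topic. As a minimal sketch, such a string can be parsed into `(word, weight)` pairs with the standard library. The helper `parse_topic` is hypothetical; gensim's `LdaModel.show_topic` returns such pairs directly and is the better choice when the model object is at hand.

```python
import re


def parse_topic(topic_string):
    """Turn '0.022*"ocean" + 0.013*"water"' into [('ocean', 0.022), ('water', 0.013)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


# Shortened copy of the second topic printed above
topic = '0.022*"ocean" + 0.013*"water" + 0.008*"surface"'
for word, weight in parse_topic(topic):
    print(f"{word:10s} {weight:.3f}")
```

Having the weights as numbers makes it easy to filter out low-weight words or to compare the same word across topics.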

## Visualization
The trained model is visualized with `pyLDAvis`.

In [12]:
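# In pyLDAvis 3.4.1 (the version installed above), the gensim helpers
# live in pyLDAvis.gensim_models
import pyLDAvis.gensim_models
# ignore deprecation warnings for pyLDAvis
warnings.simplefilter("ignore", DeprecationWarning)

start = time.time()
pyLDAvis.enable_notebook()

vis_data = pyLDAvis.gensim_models.prepare(ldamodel, corpus, dictionary)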

end = time.time()
print('Execution Time: %f' %(end-start) + ' s')

Execution Time: 3.139632 s


In [13]:
pyLDAvis.display(vis_data)