## Step 0: Latent Dirichlet Allocation ##

LDA is used to assign the text in a document to one or more topics. It builds a topic-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

* Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
* LDA assumes that every chunk of text we feed into it contains words that are somehow related. Choosing the right corpus of data is therefore crucial.
* It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. 
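
The generative story in the bullets above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: `toy_topics`, `sample_dirichlet` and `generate_document` are hypothetical names, the topic-word probabilities are hand-written rather than learned, and none of this is how gensim trains a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, doc_length, rng):
    """Generate one toy document following LDA's generative story."""
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(doc_length):
        # 2. Pick a topic for this word position.
        z = rng.choices(topic_ids, weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab, probs = zip(*topic_word[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical toy topics (hand-written, not learned from the headlines).
toy_topics = [
    {"police": 0.5, "court": 0.3, "charge": 0.2},
    {"rain": 0.4, "drought": 0.4, "water": 0.2},
]

rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], doc_length=8, rng=rng)
print(theta)
print(words)
```

Training LDA runs this story in reverse: given only the words, it infers topic mixtures and topic-word distributions that could plausibly have produced them.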

## Step 1: Load the dataset

The dataset we'll use is a list of over one million news headlines published over a period of 15 years. We'll start by loading it from the `abcnews-date-text.csv` file.

In [1]:
'''
Load the dataset from the CSV and save it to 'data_text'
'''
import pandas as pd
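# The original call used `error_bad_lines=False`; pandas 1.3+ replaces that flag with `on_bad_lines='skip'`
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')
# We only need the headline text column from the data
data_text = data[:300000][['headline_text']].copy()  # copy to avoid modifying a view of `data`
data_text['index'] = data_text.index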

documents = data_text

Let's glance at the dataset:

In [9]:
'''
Get the total number of documents
'''
print(len(documents))

300000


In [10]:
documents[:5]

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


## Step 2: Data Preprocessing ##

We will perform the following steps:

* **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
* Words that have 3 or fewer characters are removed.
* All **stopwords** are removed.
* Words are **lemmatized** - words in third person are changed to first person and verbs in past and future tenses are changed into present.
* Words are **stemmed** - words are reduced to their root form.


In [2]:
'''
Loading Gensim and nltk libraries
'''
# pip install gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)

unable to import 'smart_open.gcs', disabling that module


In [3]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Marco\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Lemmatizer Example
Before preprocessing our dataset, let's first look at a lemmatization example. What would the output be if we lemmatized the word 'went'?

In [13]:
print(WordNetLemmatizer().lemmatize('went', pos = 'v')) # past tense to present tense

go


### Stemmer Example
Let's also look at a stemming example. Let's throw a number of words at the stemmer and see how it deals with each one:

In [4]:
stemmer = SnowballStemmer("english")
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]

pd.DataFrame(data={'original word':original_words, 'stemmed':singles })

| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |


In [5]:
'''
Write a function to perform the pre processing steps on the entire dataset
'''
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        # Skip stopwords and tokens of 3 characters or fewer
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result



In [6]:
'''
Preview a document after preprocessing
'''
document_num = 4310
doc_sample = documents[documents['index'] == document_num].values[0][0]

print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['ratepayers', 'group', 'wants', 'compulsory', 'local', 'govt', 'voting']


Tokenized and lemmatized document: 
['ratepay', 'group', 'want', 'compulsori', 'local', 'govt', 'vote']


In [17]:
documents

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |
| ... | ... | ... |
| 299995 | commuters evacuated from circular quay train | 299995 |
| 299996 | companies invest billions in png gas industry | 299996 |
| 299997 | company looks to expand basalt quarry | 299997 |
| 299998 | cooloola mayor flags support for super council | 299998 |


Let's now preprocess all the news headlines. To do that, we use the [map](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) function from pandas to apply `preprocess()` to the `headline_text` column.

**Note**: This may take a few minutes

In [7]:
# TODO: preprocess all the headlines, saving the list of results as 'processed_docs'
processed_docs = documents['headline_text'].map(preprocess)

In [15]:
print(type(processed_docs))

<class 'pandas.core.series.Series'>


In [13]:
# Save processed documents to speed up making dictionary in webscraping scripts
proc_docs_df = pd.DataFrame(processed_docs)
#print(proc_docs_df)
proc_docs_df.to_csv(r'C:\Users\Marco\Desktop\Gits\hmm-tagger\lda-docs\proc_docs_df.csv')

In [19]:
'''
Preview 'processed_docs'
'''
processed_docs[:10]

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

## Step 3.1: Bag of words on the dataset

Now let's create a dictionary from `processed_docs` that maps each unique word to an integer id and records how often it appears in the training set. To do that, let's pass `processed_docs` to [`gensim.corpora.Dictionary()`](https://radimrehurek.com/gensim/corpora/dictionary.html) and call the result `dictionary`.

In [14]:
'''
Create a dictionary from 'processed_docs' containing the number of times a word appears 
in the training set using gensim.corpora.Dictionary and call it 'dictionary'
'''
dictionary = gensim.corpora.Dictionary(processed_docs)

In [21]:
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.items():  # `iteritems()` is not available in newer gensim releases
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


**Gensim filter_extremes**

[`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes)

Filter out tokens that appear in

1. fewer than `no_below` documents (absolute number), or
2. more than `no_above` documents (fraction of total corpus size, not absolute number).
3. After (1) and (2), keep only the first `keep_n` most frequent tokens (or keep all if `None`).
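
To make the three rules concrete, here is a minimal pure-Python sketch of the same filtering logic on a toy corpus. It is an illustration only: `filter_extremes_sketch` and `toy_docs` are hypothetical names, and gensim's own implementation may differ in details such as rounding and id reassignment.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that would survive the three filtering rules."""
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # turn the fraction into a document count
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep only the keep_n tokens with the highest document frequency.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

toy_docs = [
    ["govt", "plan", "water"],
    ["govt", "plan", "council"],
    ["govt", "polic", "court"],
    ["govt", "polic", "rain"],
]
survivors = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75)
print(sorted(survivors))  # ['plan', 'polic']
```

Here "govt" is dropped for being too common (4 of 4 documents), and the words that occur in a single document are dropped for being too rare.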

In [22]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents
'''
# Apply dictionary.filter_extremes() with the parameters described above
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)

**Gensim doc2bow**

[`doc2bow(document)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow)

* Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
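
The conversion described above can be sketched without gensim to show what the output format means. This is an illustration under assumptions: `doc2bow_sketch` and `toy_token2id` are hypothetical names, and the sketch mirrors only the behaviour described here (count known tokens, return pairs sorted by id).

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_token2id = {"govt": 0, "group": 1, "vote": 2}  # hypothetical toy vocabulary
bow = doc2bow_sketch(["group", "vote", "govt", "vote", "unknown"], toy_token2id)
print(bow)  # [(0, 1), (1, 1), (2, 2)]
```

Word order is discarded; only which words occur and how often survives, which is exactly the information LDA consumes.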

In [23]:
'''
Create the bag-of-words model for each document, i.e. for each document we create a list recording which
words appear and how many times each of them appears. Save this to 'bow_corpus'
'''
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [24]:
'''
Checking Bag of Words corpus for our sample document --> (token_id, token_count)
'''
bow_corpus[document_num]

[(154, 1), (228, 1), (276, 1), (563, 1), (806, 1), (3175, 1), (3176, 1)]

In [25]:
'''
Preview BOW for our sample preprocessed document
'''
# Here document_num is document number 4310 which we have checked in Step 2
bow_doc_4310 = bow_corpus[document_num]

for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0], 
                                                     dictionary[bow_doc_4310[i][0]], 
                                                     bow_doc_4310[i][1]))

Word 154 ("govt") appears 1 time.
Word 228 ("group") appears 1 time.
Word 276 ("vote") appears 1 time.
Word 563 ("local") appears 1 time.
Word 806 ("want") appears 1 time.
Word 3175 ("compulsori") appears 1 time.
Word 3176 ("ratepay") appears 1 time.


## Step 3.2: TF-IDF on our document set ##

While performing TF-IDF on the corpus is not necessary for an LDA implementation using the gensim model, it is recommended. TF-IDF expects a bag-of-words (integer values) training corpus during initialization. During transformation, it will take a vector and return another vector of the same dimensionality.

*Please note: The author of Gensim dictates the standard procedure for LDA to be using the Bag of Words model.*

**TF-IDF stands for "Term Frequency, Inverse Document Frequency".**

* It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
* If a word appears frequently in a document, it's important. Give the word a high score. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
* Therefore, common words like "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

In other words:

* TF(w) = `(Number of times term w appears in a document) / (Total number of terms in the document)`.
* IDF(w) = `log(Total number of documents / Number of documents with term w in it)`. The base of the logarithm is a matter of convention; the example that follows uses base 10.

**For example**

* Consider a document containing `100` words wherein the word 'tiger' appears 3 times. 
* The term frequency (i.e., tf) for 'tiger' is then: 
    - `TF = (3 / 100) = 0.03`. 

* Now, assume we have `10 million` documents and the word 'tiger' appears in `1000` of these. Then, the inverse document frequency (i.e., idf), using a base-10 logarithm, is calculated as:
    - `IDF = log10(10,000,000 / 1,000) = 4`. 

* Thus, the TF-IDF weight is the product of these quantities: 
    - `TF-IDF = 0.03 * 4 = 0.12`.
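
The 'tiger' arithmetic can be reproduced directly. The sketch below assumes a base-10 logarithm, which is what yields an IDF of 4; the helper names `tf` and `idf` are hypothetical. As far as we are aware, gensim's `TfidfModel` uses a different default (a base-2 logarithm plus vector normalization), so its scores will not match these hand-computed numbers.

```python
import math

def tf(term_count, doc_length):
    """Term frequency: share of the document taken up by the term."""
    return term_count / doc_length

def idf(total_docs, docs_with_term, base=10):
    """Inverse document frequency with a configurable logarithm base."""
    return math.log(total_docs / docs_with_term, base)

tf_tiger = tf(3, 100)
idf_tiger = idf(10_000_000, 1_000)          # base 10, as in the worked example
idf_tiger_ln = idf(10_000_000, 1_000, math.e)  # natural log, for comparison

print(tf_tiger, idf_tiger, tf_tiger * idf_tiger)
print(idf_tiger_ln)
```

Changing the base only rescales every IDF value by the same constant, so the relative ranking of terms is unaffected.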

In [26]:
'''
Create tf-idf model object using models.TfidfModel on 'bow_corpus' and save it to 'tfidf'
'''
from gensim import corpora, models
#tfidf = # TODO
tfidf = models.TfidfModel(bow_corpus)

In [27]:
'''
Apply transformation to the entire corpus and call it 'corpus_tfidf'
'''
#corpus_tfidf = # TODO
corpus_tfidf = tfidf[bow_corpus]

In [28]:
'''
Preview TF-IDF scores for our first document --> (token_id, tfidf score)
'''
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5959919082495837),
 (1, 0.3920069955308767),
 (2, 0.48532280284497653),
 (3, 0.5055550788930631)]
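
One way to sanity-check these scores is to see whether the vector has unit Euclidean length. We assume here that `TfidfModel` normalizes each document vector by default (`normalize=True`); treat that as an assumption to confirm against the gensim documentation. The values below are copied from the output above.

```python
import math

# TF-IDF scores of the first document, copied from the notebook output.
first_doc_tfidf = [
    (0, 0.5959919082495837),
    (1, 0.3920069955308767),
    (2, 0.48532280284497653),
    (3, 0.5055550788930631),
]

# Euclidean (L2) norm of the score vector.
norm = math.sqrt(sum(weight * weight for _, weight in first_doc_tfidf))
print(round(norm, 6))
```

A norm of about 1 explains why the scores are comparable across headlines of different lengths and why they differ from a raw TF × IDF product.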


## Step 4.1: Running LDA using Bag of Words ##

We are going for 10 topics in the document corpus.

**We will be running LDA using all CPU cores to parallelize and speed up model training.**

Some of the parameters we will be tweaking are:

* **num_topics** is the number of requested latent topics to be extracted from the training corpus.
* **id2word** is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **workers** is the number of extra processes to use for parallelization. Uses all available cores by default.
* **alpha** and **eta** are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave these at their default values for now (the default is `1/num_topics`).
    - Alpha is the prior on the per-document topic distribution.
        * High alpha: Every document is a mixture of all topics (documents appear similar to each other).
        * Low alpha: Every document is a mixture of very few topics.

    - Eta is the prior on the per-topic word distribution.
        * High eta: Each topic is a mixture of most words (topics appear similar to each other).
        * Low eta: Each topic is a mixture of few words.

* **passes** is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, chunksize is 10,000, and passes is 2, then online training is done in 10 updates:
    * `#1 documents 0-9,999 `
    * `#2 documents 10,000-19,999 `
    * `#3 documents 20,000-29,999 `
    * `#4 documents 30,000-39,999 `
    * `#5 documents 40,000-49,999 `
    * `#6 documents 0-9,999 `
    * `#7 documents 10,000-19,999 `
    * `#8 documents 20,000-29,999 `
    * `#9 documents 30,000-39,999 `
    * `#10 documents 40,000-49,999` 
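
The effect of alpha described in the parameter list above can be seen by sampling from a symmetric Dirichlet directly. This is a standard-library sketch, not gensim code: `sample_symmetric_dirichlet` and `mean_largest_share` are hypothetical helpers, and the simulated "documents" are just random topic mixtures.

```python
import random

def sample_symmetric_dirichlet(alpha, num_topics, rng):
    """Draw a topic mixture from a symmetric Dirichlet via normalized gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, num_topics=10, num_docs=2000, seed=0):
    """Average weight of the dominant topic across simulated documents."""
    rng = random.Random(seed)
    return sum(
        max(sample_symmetric_dirichlet(alpha, num_topics, rng))
        for _ in range(num_docs)
    ) / num_docs

# Low alpha -> one topic dominates; high alpha -> weight spreads over all topics.
for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(mean_largest_share(alpha), 3))
```

With 10 topics, a perfectly even mixture would give every topic a share of 0.1, so the closer the printed value is to 0.1, the more similar the simulated documents look to each other. The same reasoning applies to eta and the per-topic word distributions.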

In [29]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# LDA mono-core
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# LDA multicore 
'''
Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model'
'''
# TODO
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics=10, 
                                       id2word = dictionary, 
                                       passes = 2, 
                                       workers=2)

2020-04-13 13:48:05,294 : INFO : using symmetric alpha at 0.1
2020-04-13 13:48:05,296 : INFO : using symmetric eta at 0.1
2020-04-13 13:48:05,299 : INFO : using serial LDA version on this node
2020-04-13 13:48:05,309 : INFO : running online LDA training, 10 topics, 2 passes over the supplied corpus of 300000 documents, updating every 4000 documents, evaluating every ~40000 documents, iterating 50x with a convergence threshold of 0.001000
2020-04-13 13:48:05,316 : INFO : training LDA model using 2 processes
2020-04-13 13:48:05,341 : INFO : PROGRESS: pass 0, dispatched chunk #0 = documents up to #2000/300000, outstanding queue size 1
2020-04-13 13:48:05,359 : INFO : PROGRESS: pass 0, dispatched chunk #1 = documents up to #4000/300000, outstanding queue size 2
2020-04-13 13:48:05,360 : INFO : PROGRESS: pass 0, dispatched chunk #2 = documents up to #6000/300000, outstanding queue size 3
2020-04-13 13:48:05,363 : INFO : PROGRESS: pass 0, dispatched chunk #3 = documents up to #8000/300000, o

2020-04-13 13:48:11,017 : INFO : topic #0 (0.100): 0.018*"govt" + 0.009*"forc" + 0.007*"budget" + 0.006*"lead" + 0.006*"leav" + 0.005*"fish" + 0.005*"dead" + 0.005*"drought" + 0.005*"union" + 0.005*"head"
2020-04-13 13:48:11,019 : INFO : topic diff=0.168411, rho=0.333333
2020-04-13 13:48:11,019 : INFO : PROGRESS: pass 0, dispatched chunk #15 = documents up to #32000/300000, outstanding queue size 6
2020-04-13 13:48:11,558 : INFO : PROGRESS: pass 0, dispatched chunk #16 = documents up to #34000/300000, outstanding queue size 6
2020-04-13 13:48:11,642 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:48:11,648 : INFO : topic #0 (0.100): 0.018*"govt" + 0.007*"forc" + 0.007*"budget" + 0.006*"lead" + 0.006*"union" + 0.006*"leav" + 0.005*"region" + 0.005*"fish" + 0.005*"win" + 0.005*"dead"
2020-04-13 13:48:11,649 : INFO : topic #6 (0.100): 0.012*"council" + 0.009*"continu" + 0.007*"water" + 0.006*"south" + 0.006*"protest" + 0.006*"aceh" + 0.005*"worl

2020-04-13 13:48:15,605 : INFO : topic #6 (0.100): 0.013*"council" + 0.011*"south" + 0.011*"continu" + 0.009*"water" + 0.009*"world" + 0.008*"sydney" + 0.007*"rain" + 0.006*"give" + 0.006*"boost" + 0.006*"black"
2020-04-13 13:48:15,606 : INFO : topic #5 (0.100): 0.050*"plan" + 0.019*"council" + 0.012*"decis" + 0.008*"concern" + 0.007*"govt" + 0.007*"chang" + 0.007*"delay" + 0.006*"beckham" + 0.006*"vote" + 0.006*"australia"
2020-04-13 13:48:15,608 : INFO : topic #4 (0.100): 0.029*"polic" + 0.013*"probe" + 0.011*"investig" + 0.010*"year" + 0.009*"council" + 0.008*"group" + 0.008*"profit" + 0.007*"death" + 0.007*"arrest" + 0.007*"make"
2020-04-13 13:48:15,609 : INFO : topic #9 (0.100): 0.012*"say" + 0.011*"report" + 0.010*"busi" + 0.008*"highlight" + 0.008*"iraq" + 0.008*"govt" + 0.007*"claim" + 0.007*"inquiri" + 0.007*"boost" + 0.007*"health"
2020-04-13 13:48:15,611 : INFO : topic diff=0.117241, rho=0.218218
2020-04-13 13:48:15,616 : INFO : PROGRESS: pass 0, dispatched chunk #26 = docum

2020-04-13 13:48:18,902 : INFO : topic diff=0.098622, rho=0.179605
2020-04-13 13:48:18,903 : INFO : PROGRESS: pass 0, dispatched chunk #37 = documents up to #76000/300000, outstanding queue size 6
2020-04-13 13:48:19,314 : INFO : PROGRESS: pass 0, dispatched chunk #38 = documents up to #78000/300000, outstanding queue size 6
2020-04-13 13:48:19,559 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:48:19,565 : INFO : topic #6 (0.100): 0.017*"south" + 0.013*"sydney" + 0.012*"continu" + 0.011*"black" + 0.010*"council" + 0.010*"world" + 0.010*"water" + 0.010*"rain" + 0.008*"england" + 0.006*"darwin"
2020-04-13 13:48:19,566 : INFO : topic #7 (0.100): 0.025*"govt" + 0.018*"face" + 0.014*"fund" + 0.014*"reject" + 0.013*"claim" + 0.012*"rise" + 0.012*"court" + 0.010*"clear" + 0.010*"power" + 0.009*"battl"
2020-04-13 13:48:19,567 : INFO : topic #3 (0.100): 0.019*"kill" + 0.014*"urg" + 0.011*"jail" + 0.009*"strike" + 0.008*"polic" + 0.008*"communiti" + 0

2020-04-13 13:48:22,977 : INFO : topic #0 (0.100): 0.015*"govt" + 0.010*"region" + 0.009*"union" + 0.008*"leav" + 0.008*"back" + 0.008*"titl" + 0.008*"head" + 0.008*"australian" + 0.008*"fish" + 0.007*"lead"
2020-04-13 13:48:22,979 : INFO : topic #5 (0.100): 0.054*"plan" + 0.031*"council" + 0.011*"govt" + 0.010*"decis" + 0.010*"mayor" + 0.010*"chang" + 0.009*"delay" + 0.009*"vote" + 0.008*"elect" + 0.007*"seek"
2020-04-13 13:48:22,980 : INFO : topic #4 (0.100): 0.036*"polic" + 0.022*"probe" + 0.016*"investig" + 0.014*"death" + 0.010*"year" + 0.009*"poll" + 0.008*"crash" + 0.008*"make" + 0.008*"road" + 0.008*"award"
2020-04-13 13:48:22,980 : INFO : topic diff=0.085560, rho=0.152499
2020-04-13 13:48:22,982 : INFO : PROGRESS: pass 0, dispatched chunk #49 = documents up to #100000/300000, outstanding queue size 6
2020-04-13 13:48:23,465 : INFO : PROGRESS: pass 0, dispatched chunk #50 = documents up to #102000/300000, outstanding queue size 6
2020-04-13 13:48:23,734 : INFO : merging changes

2020-04-13 13:48:26,892 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:48:26,899 : INFO : topic #9 (0.100): 0.012*"busi" + 0.012*"market" + 0.011*"high" + 0.011*"say" + 0.010*"report" + 0.009*"hous" + 0.009*"aussi" + 0.009*"highlight" + 0.009*"health" + 0.009*"servic"
2020-04-13 13:48:26,901 : INFO : topic #8 (0.100): 0.024*"warn" + 0.020*"iraq" + 0.017*"olymp" + 0.016*"gold" + 0.015*"deal" + 0.013*"trade" + 0.012*"australia" + 0.011*"latham" + 0.010*"doubt" + 0.009*"sign"
2020-04-13 13:48:26,902 : INFO : topic #5 (0.100): 0.060*"plan" + 0.035*"council" + 0.013*"chang" + 0.011*"govt" + 0.011*"decis" + 0.009*"mayor" + 0.009*"delay" + 0.008*"seek" + 0.008*"water" + 0.007*"vote"
2020-04-13 13:48:26,904 : INFO : topic #6 (0.100): 0.017*"south" + 0.014*"continu" + 0.009*"rain" + 0.009*"sydney" + 0.009*"water" + 0.008*"black" + 0.007*"world" + 0.007*"quit" + 0.006*"dump" + 0.006*"darwin"
2020-04-13 13:48:26,906 : INFO : topic #7 (0.100): 0.028*"go

2020-04-13 13:48:30,311 : INFO : topic #2 (0.100): 0.019*"kill" + 0.018*"report" + 0.017*"test" + 0.015*"iraq" + 0.014*"releas" + 0.012*"iraqi" + 0.011*"hospit" + 0.011*"dead" + 0.010*"bomb" + 0.010*"hostag"
2020-04-13 13:48:30,312 : INFO : topic #5 (0.100): 0.063*"plan" + 0.036*"council" + 0.012*"chang" + 0.011*"govt" + 0.010*"delay" + 0.009*"water" + 0.009*"vote" + 0.009*"decis" + 0.009*"seek" + 0.009*"mayor"
2020-04-13 13:48:30,314 : INFO : topic diff=0.068761, rho=0.123091
2020-04-13 13:48:30,314 : INFO : PROGRESS: pass 0, dispatched chunk #72 = documents up to #146000/300000, outstanding queue size 6
2020-04-13 13:48:30,673 : INFO : PROGRESS: pass 0, dispatched chunk #73 = documents up to #148000/300000, outstanding queue size 6
2020-04-13 13:48:30,845 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:48:30,852 : INFO : topic #3 (0.100): 0.017*"jail" + 0.016*"urg" + 0.014*"child" + 0.013*"strike" + 0.012*"fight" + 0.012*"public" + 0.011*"c

2020-04-13 13:48:33,807 : INFO : topic #5 (0.100): 0.060*"plan" + 0.041*"council" + 0.014*"chang" + 0.013*"govt" + 0.010*"delay" + 0.009*"decis" + 0.009*"seek" + 0.009*"water" + 0.009*"develop" + 0.009*"park"
2020-04-13 13:48:33,809 : INFO : topic #8 (0.100): 0.029*"warn" + 0.016*"australia" + 0.014*"deal" + 0.013*"coast" + 0.013*"iraq" + 0.013*"gold" + 0.010*"trade" + 0.010*"price" + 0.010*"threat" + 0.010*"island"
2020-04-13 13:48:33,811 : INFO : topic #1 (0.100): 0.048*"polic" + 0.033*"charg" + 0.022*"court" + 0.020*"miss" + 0.019*"murder" + 0.013*"face" + 0.013*"woman" + 0.013*"search" + 0.012*"trial" + 0.012*"drug"
2020-04-13 13:48:33,812 : INFO : topic #4 (0.100): 0.036*"polic" + 0.025*"death" + 0.022*"investig" + 0.020*"probe" + 0.013*"crash" + 0.012*"road" + 0.012*"year" + 0.010*"toll" + 0.010*"award" + 0.009*"profit"
2020-04-13 13:48:33,813 : INFO : topic #3 (0.100): 0.017*"jail" + 0.016*"urg" + 0.014*"child" + 0.012*"communiti" + 0.012*"help" + 0.012*"strike" + 0.012*"public"

2020-04-13 13:48:36,724 : INFO : topic #8 (0.100): 0.028*"warn" + 0.016*"deal" + 0.015*"australia" + 0.015*"coast" + 0.012*"gold" + 0.011*"sign" + 0.011*"price" + 0.011*"trade" + 0.011*"iraq" + 0.010*"threat"
2020-04-13 13:48:36,725 : INFO : topic diff=0.061274, rho=0.106600
2020-04-13 13:48:36,726 : INFO : PROGRESS: pass 0, dispatched chunk #94 = documents up to #190000/300000, outstanding queue size 6
2020-04-13 13:48:37,198 : INFO : PROGRESS: pass 0, dispatched chunk #95 = documents up to #192000/300000, outstanding queue size 6
2020-04-13 13:48:37,283 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:48:37,289 : INFO : topic #3 (0.100): 0.017*"urg" + 0.017*"jail" + 0.014*"help" + 0.013*"communiti" + 0.013*"public" + 0.013*"child" + 0.013*"fight" + 0.012*"worker" + 0.012*"stand" + 0.011*"safeti"
2020-04-13 13:48:37,291 : INFO : topic #4 (0.100): 0.034*"polic" + 0.025*"death" + 0.021*"probe" + 0.020*"investig" + 0.017*"crash" + 0.013*"road" +

2020-04-13 13:48:40,477 : INFO : topic #7 (0.100): 0.040*"govt" + 0.020*"fund" + 0.017*"reject" + 0.015*"rise" + 0.014*"claim" + 0.013*"boost" + 0.012*"face" + 0.012*"power" + 0.011*"urg" + 0.011*"law"
2020-04-13 13:48:40,482 : INFO : topic #0 (0.100): 0.015*"govt" + 0.014*"union" + 0.013*"head" + 0.012*"region" + 0.012*"sale" + 0.010*"telstra" + 0.010*"back" + 0.010*"fish" + 0.010*"titl" + 0.009*"leav"
2020-04-13 13:48:40,484 : INFO : topic #8 (0.100): 0.028*"warn" + 0.016*"coast" + 0.016*"price" + 0.016*"deal" + 0.015*"australia" + 0.014*"bird" + 0.012*"gold" + 0.012*"threat" + 0.011*"fuel" + 0.010*"terror"
2020-04-13 13:48:40,485 : INFO : topic #9 (0.100): 0.015*"busi" + 0.014*"market" + 0.014*"say" + 0.013*"high" + 0.011*"highlight" + 0.011*"inquiri" + 0.011*"campaign" + 0.011*"servic" + 0.010*"health" + 0.009*"hous"
2020-04-13 13:48:40,487 : INFO : topic diff=0.055717, rho=0.100000
2020-04-13 13:48:40,488 : INFO : PROGRESS: pass 0, dispatched chunk #106 = documents up to #214000/3

2020-04-13 13:48:43,582 : INFO : topic diff=0.049445, rho=0.095346
2020-04-13 13:48:43,583 : INFO : PROGRESS: pass 0, dispatched chunk #116 = documents up to #234000/300000, outstanding queue size 6
2020-04-13 13:48:44,003 : INFO : PROGRESS: pass 0, dispatched chunk #117 = documents up to #236000/300000, outstanding queue size 6
2020-04-13 13:48:44,230 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:48:44,237 : INFO : topic #0 (0.100): 0.014*"govt" + 0.013*"sale" + 0.013*"union" + 0.013*"fish" + 0.013*"head" + 0.011*"region" + 0.010*"illeg" + 0.010*"titl" + 0.010*"back" + 0.010*"parti"
2020-04-13 13:48:44,239 : INFO : topic #2 (0.100): 0.029*"kill" + 0.020*"test" + 0.016*"report" + 0.016*"hospit" + 0.015*"open" + 0.013*"iraq" + 0.013*"dead" + 0.012*"attack" + 0.012*"bomb" + 0.012*"forc"
2020-04-13 13:48:44,240 : INFO : topic #6 (0.100): 0.014*"south" + 0.014*"continu" + 0.013*"rain" + 0.012*"world" + 0.012*"game" + 0.012*"sydney" + 0.010*"fin

2020-04-13 13:48:47,233 : INFO : topic #9 (0.100): 0.016*"market" + 0.014*"inquiri" + 0.014*"busi" + 0.013*"say" + 0.013*"nuclear" + 0.012*"high" + 0.011*"campaign" + 0.010*"highlight" + 0.010*"servic" + 0.009*"confid"
2020-04-13 13:48:47,235 : INFO : topic #5 (0.100): 0.054*"plan" + 0.040*"council" + 0.019*"chang" + 0.015*"govt" + 0.014*"water" + 0.011*"nation" + 0.011*"park" + 0.011*"urg" + 0.010*"delay" + 0.010*"mayor"
2020-04-13 13:48:47,237 : INFO : topic #2 (0.100): 0.033*"kill" + 0.020*"test" + 0.018*"open" + 0.016*"report" + 0.014*"hospit" + 0.013*"iraq" + 0.012*"forc" + 0.012*"dead" + 0.012*"attack" + 0.011*"releas"
2020-04-13 13:48:47,239 : INFO : topic diff=0.048958, rho=0.090536
2020-04-13 13:48:47,240 : INFO : PROGRESS: pass 0, dispatched chunk #128 = documents up to #258000/300000, outstanding queue size 6
2020-04-13 13:48:47,636 : INFO : PROGRESS: pass 0, dispatched chunk #129 = documents up to #260000/300000, outstanding queue size 6
2020-04-13 13:48:47,750 : INFO : mer

2020-04-13 13:48:50,649 : INFO : PROGRESS: pass 0, dispatched chunk #139 = documents up to #280000/300000, outstanding queue size 6
2020-04-13 13:48:50,762 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:48:50,769 : INFO : topic #0 (0.100): 0.016*"sale" + 0.014*"union" + 0.013*"govt" + 0.013*"head" + 0.012*"fish" + 0.011*"back" + 0.010*"criticis" + 0.010*"titl" + 0.010*"region" + 0.009*"critic"
2020-04-13 13:48:50,771 : INFO : topic #1 (0.100): 0.053*"polic" + 0.035*"charg" + 0.028*"court" + 0.024*"face" + 0.019*"miss" + 0.018*"murder" + 0.015*"accus" + 0.015*"drug" + 0.013*"woman" + 0.013*"arrest"
2020-04-13 13:48:50,772 : INFO : topic #9 (0.100): 0.015*"market" + 0.015*"say" + 0.015*"nuclear" + 0.014*"busi" + 0.013*"inquiri" + 0.012*"campaign" + 0.011*"high" + 0.010*"howard" + 0.009*"servic" + 0.009*"talk"
2020-04-13 13:48:50,774 : INFO : topic #8 (0.100): 0.028*"warn" + 0.016*"deal" + 0.015*"price" + 0.014*"coast" + 0.014*"australia" + 0.0

2020-04-13 13:48:53,722 : INFO : topic #3 (0.100): 0.021*"urg" + 0.019*"help" + 0.015*"jail" + 0.015*"communiti" + 0.014*"worker" + 0.013*"child" + 0.012*"fight" + 0.012*"strike" + 0.011*"stand" + 0.011*"action"
2020-04-13 13:48:53,725 : INFO : topic #0 (0.100): 0.015*"sale" + 0.014*"govt" + 0.014*"union" + 0.013*"head" + 0.011*"back" + 0.011*"region" + 0.011*"fish" + 0.010*"titl" + 0.010*"condit" + 0.010*"drought"
2020-04-13 13:48:53,726 : INFO : topic diff=0.041355, rho=0.083333
2020-04-13 13:48:54,570 : INFO : -8.274 per-word bound, 309.6 perplexity estimate based on a held-out corpus of 2000 documents with 8648 words
2020-04-13 13:48:54,578 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:48:54,585 : INFO : topic #4 (0.100): 0.033*"death" + 0.030*"polic" + 0.029*"crash" + 0.028*"investig" + 0.017*"road" + 0.017*"probe" + 0.014*"year" + 0.011*"close" + 0.011*"break" + 0.011*"fatal"
2020-04-13 13:48:54,587 : INFO : topic #3 (0.100): 0.020*"u

2020-04-13 13:48:57,690 : INFO : topic diff=0.039194, rho=0.081111
2020-04-13 13:48:57,691 : INFO : PROGRESS: pass 1, dispatched chunk #9 = documents up to #20000/300000, outstanding queue size 6
2020-04-13 13:48:58,279 : INFO : PROGRESS: pass 1, dispatched chunk #10 = documents up to #22000/300000, outstanding queue size 6
2020-04-13 13:48:58,303 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:48:58,312 : INFO : topic #4 (0.100): 0.032*"death" + 0.028*"polic" + 0.026*"crash" + 0.026*"investig" + 0.017*"road" + 0.017*"probe" + 0.013*"year" + 0.012*"break" + 0.012*"close" + 0.010*"fatal"
2020-04-13 13:48:58,314 : INFO : topic #2 (0.100): 0.031*"kill" + 0.029*"iraq" + 0.018*"report" + 0.017*"test" + 0.017*"iraqi" + 0.017*"open" + 0.016*"forc" + 0.013*"baghdad" + 0.012*"dead" + 0.012*"hospit"
2020-04-13 13:48:58,316 : INFO : topic #5 (0.100): 0.051*"plan" + 0.041*"council" + 0.024*"water" + 0.017*"govt" + 0.016*"chang" + 0.013*"closer" + 0.012*"

2020-04-13 13:49:02,154 : INFO : topic #3 (0.100): 0.018*"help" + 0.017*"urg" + 0.015*"sar" + 0.013*"communiti" + 0.013*"fight" + 0.013*"worker" + 0.012*"jail" + 0.012*"strike" + 0.011*"child" + 0.011*"action"
2020-04-13 13:49:02,156 : INFO : topic #2 (0.100): 0.032*"kill" + 0.027*"iraq" + 0.019*"report" + 0.018*"test" + 0.018*"open" + 0.016*"iraqi" + 0.014*"forc" + 0.013*"dead" + 0.012*"attack" + 0.011*"hospit"
2020-04-13 13:49:02,157 : INFO : topic #1 (0.100): 0.056*"polic" + 0.038*"charg" + 0.033*"court" + 0.029*"face" + 0.021*"miss" + 0.019*"murder" + 0.016*"drug" + 0.014*"arrest" + 0.014*"woman" + 0.014*"trial"
2020-04-13 13:49:02,160 : INFO : topic diff=0.032568, rho=0.081111
2020-04-13 13:49:02,161 : INFO : PROGRESS: pass 1, dispatched chunk #21 = documents up to #44000/300000, outstanding queue size 6
2020-04-13 13:49:02,688 : INFO : PROGRESS: pass 1, dispatched chunk #22 = documents up to #46000/300000, outstanding queue size 6
2020-04-13 13:49:02,924 : INFO : merging changes 

2020-04-13 13:49:06,130 : INFO : PROGRESS: pass 1, dispatched chunk #32 = documents up to #66000/300000, outstanding queue size 6
2020-04-13 13:49:06,336 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:49:06,345 : INFO : topic #9 (0.100): 0.014*"busi" + 0.014*"high" + 0.013*"say" + 0.012*"market" + 0.012*"inquiri" + 0.011*"highlight" + 0.010*"nuclear" + 0.010*"confid" + 0.009*"hous" + 0.009*"talk"
2020-04-13 13:49:06,347 : INFO : topic #6 (0.100): 0.023*"world" + 0.015*"continu" + 0.014*"south" + 0.014*"final" + 0.012*"england" + 0.011*"rain" + 0.009*"sydney" + 0.009*"black" + 0.007*"club" + 0.007*"blue"
2020-04-13 13:49:06,349 : INFO : topic #3 (0.100): 0.019*"help" + 0.016*"strike" + 0.015*"urg" + 0.014*"worker" + 0.014*"fight" + 0.012*"communiti" + 0.012*"child" + 0.012*"jail" + 0.012*"public" + 0.011*"action"
2020-04-13 13:49:06,350 : INFO : topic #4 (0.100): 0.031*"death" + 0.027*"polic" + 0.023*"crash" + 0.023*"probe" + 0.022*"investig"

2020-04-13 13:49:09,666 : INFO : topic #3 (0.100): 0.019*"help" + 0.016*"strike" + 0.015*"urg" + 0.013*"fight" + 0.013*"worker" + 0.013*"child" + 0.013*"public" + 0.012*"action" + 0.012*"communiti" + 0.012*"jail"
2020-04-13 13:49:09,667 : INFO : topic #2 (0.100): 0.038*"kill" + 0.026*"iraq" + 0.019*"test" + 0.019*"report" + 0.017*"open" + 0.015*"dead" + 0.014*"iraqi" + 0.014*"attack" + 0.012*"leader" + 0.012*"blast"
2020-04-13 13:49:09,669 : INFO : topic diff=0.029346, rho=0.081111
2020-04-13 13:49:09,670 : INFO : PROGRESS: pass 1, dispatched chunk #43 = documents up to #88000/300000, outstanding queue size 6
2020-04-13 13:49:10,047 : INFO : PROGRESS: pass 1, dispatched chunk #44 = documents up to #90000/300000, outstanding queue size 6
2020-04-13 13:49:10,282 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:49:10,290 : INFO : topic #4 (0.100): 0.031*"death" + 0.029*"probe" + 0.026*"polic" + 0.022*"crash" + 0.021*"investig" + 0.018*"road" + 0.

2020-04-13 13:49:14,243 : INFO : topic #2 (0.100): 0.039*"kill" + 0.031*"iraq" + 0.019*"test" + 0.019*"report" + 0.016*"open" + 0.016*"iraqi" + 0.015*"attack" + 0.014*"dead" + 0.012*"releas" + 0.012*"bomb"
2020-04-13 13:49:14,246 : INFO : topic #0 (0.100): 0.016*"head" + 0.013*"lead" + 0.012*"union" + 0.012*"titl" + 0.011*"sale" + 0.011*"leav" + 0.011*"back" + 0.011*"secur" + 0.010*"fish" + 0.010*"region"
2020-04-13 13:49:14,247 : INFO : topic #1 (0.100): 0.060*"polic" + 0.038*"charg" + 0.035*"court" + 0.029*"face" + 0.020*"miss" + 0.019*"murder" + 0.018*"drug" + 0.015*"arrest" + 0.013*"trial" + 0.013*"woman"
2020-04-13 13:49:14,249 : INFO : topic #8 (0.100): 0.030*"warn" + 0.019*"deal" + 0.015*"trade" + 0.015*"australia" + 0.014*"coast" + 0.011*"north" + 0.011*"iraq" + 0.011*"sign" + 0.010*"threat" + 0.010*"price"
2020-04-13 13:49:14,251 : INFO : topic #4 (0.100): 0.029*"death" + 0.027*"probe" + 0.026*"polic" + 0.022*"investig" + 0.020*"crash" + 0.019*"road" + 0.014*"year" + 0.012*"cl

2020-04-13 13:49:17,650 : INFO : topic #2 (0.100): 0.039*"kill" + 0.031*"iraq" + 0.019*"test" + 0.018*"report" + 0.016*"open" + 0.015*"iraqi" + 0.015*"attack" + 0.014*"dead" + 0.013*"bomb" + 0.013*"releas"
2020-04-13 13:49:17,652 : INFO : topic diff=0.028190, rho=0.081111
2020-04-13 13:49:17,653 : INFO : PROGRESS: pass 1, dispatched chunk #65 = documents up to #132000/300000, outstanding queue size 6
2020-04-13 13:49:18,116 : INFO : PROGRESS: pass 1, dispatched chunk #66 = documents up to #134000/300000, outstanding queue size 6
2020-04-13 13:49:18,364 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:49:18,372 : INFO : topic #7 (0.100): 0.042*"govt" + 0.026*"fund" + 0.022*"boost" + 0.017*"claim" + 0.016*"reject" + 0.015*"rise" + 0.015*"power" + 0.013*"hospit" + 0.011*"industri" + 0.011*"health"
2020-04-13 13:49:18,374 : INFO : topic #6 (0.100): 0.018*"final" + 0.016*"continu" + 0.015*"world" + 0.015*"south" + 0.011*"rain" + 0.010*"sydney" + 0.

2020-04-13 13:49:22,095 : INFO : topic #6 (0.100): 0.016*"continu" + 0.015*"world" + 0.015*"final" + 0.014*"south" + 0.011*"rain" + 0.011*"sydney" + 0.009*"storm" + 0.009*"effort" + 0.008*"england" + 0.008*"blue"
2020-04-13 13:49:22,097 : INFO : topic #8 (0.100): 0.031*"warn" + 0.017*"deal" + 0.017*"australia" + 0.016*"coast" + 0.015*"gold" + 0.013*"latham" + 0.013*"trade" + 0.012*"price" + 0.010*"die" + 0.010*"threat"
2020-04-13 13:49:22,099 : INFO : topic #0 (0.100): 0.015*"head" + 0.013*"union" + 0.013*"sale" + 0.013*"lead" + 0.012*"secur" + 0.011*"titl" + 0.011*"back" + 0.011*"leav" + 0.010*"fish" + 0.010*"parti"
2020-04-13 13:49:22,102 : INFO : topic #7 (0.100): 0.043*"govt" + 0.025*"fund" + 0.021*"boost" + 0.017*"claim" + 0.015*"reject" + 0.015*"power" + 0.014*"hospit" + 0.014*"rise" + 0.012*"urg" + 0.011*"health"
2020-04-13 13:49:22,103 : INFO : topic diff=0.026266, rho=0.081111
2020-04-13 13:49:22,104 : INFO : PROGRESS: pass 1, dispatched chunk #77 = documents up to #156000/300

2020-04-13 13:49:26,136 : INFO : topic diff=0.026767, rho=0.081111
2020-04-13 13:49:26,139 : INFO : PROGRESS: pass 1, dispatched chunk #87 = documents up to #176000/300000, outstanding queue size 6
2020-04-13 13:49:26,918 : INFO : PROGRESS: pass 1, dispatched chunk #88 = documents up to #178000/300000, outstanding queue size 6
2020-04-13 13:49:27,225 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:49:27,235 : INFO : topic #7 (0.100): 0.045*"govt" + 0.026*"fund" + 0.019*"boost" + 0.018*"claim" + 0.016*"hospit" + 0.015*"reject" + 0.015*"rise" + 0.014*"power" + 0.011*"urg" + 0.011*"health"
2020-04-13 13:49:27,238 : INFO : topic #5 (0.100): 0.057*"plan" + 0.046*"council" + 0.018*"chang" + 0.016*"water" + 0.014*"govt" + 0.013*"group" + 0.012*"consid" + 0.011*"nation" + 0.010*"look" + 0.010*"mayor"
2020-04-13 13:49:27,240 : INFO : topic #6 (0.100): 0.015*"continu" + 0.015*"south" + 0.015*"world" + 0.014*"final" + 0.011*"rain" + 0.010*"sydney" + 0.0

2020-04-13 13:49:31,485 : INFO : topic #5 (0.100): 0.056*"plan" + 0.045*"council" + 0.020*"chang" + 0.015*"govt" + 0.015*"water" + 0.014*"group" + 0.013*"nation" + 0.012*"consid" + 0.010*"develop" + 0.010*"look"
2020-04-13 13:49:31,487 : INFO : topic #1 (0.100): 0.064*"polic" + 0.035*"charg" + 0.030*"court" + 0.028*"face" + 0.021*"miss" + 0.019*"drug" + 0.016*"murder" + 0.014*"accus" + 0.014*"trial" + 0.014*"search"
2020-04-13 13:49:31,489 : INFO : topic #6 (0.100): 0.015*"final" + 0.015*"continu" + 0.014*"world" + 0.013*"south" + 0.012*"rain" + 0.010*"sydney" + 0.009*"england" + 0.008*"storm" + 0.008*"effort" + 0.008*"black"
2020-04-13 13:49:31,492 : INFO : topic diff=0.025866, rho=0.081111
2020-04-13 13:49:31,493 : INFO : PROGRESS: pass 1, dispatched chunk #99 = documents up to #200000/300000, outstanding queue size 6
2020-04-13 13:49:32,062 : INFO : PROGRESS: pass 1, dispatched chunk #100 = documents up to #202000/300000, outstanding queue size 6
2020-04-13 13:49:32,236 : INFO : mer

2020-04-13 13:49:35,664 : INFO : PROGRESS: pass 1, dispatched chunk #110 = documents up to #222000/300000, outstanding queue size 6
2020-04-13 13:49:35,843 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:49:35,854 : INFO : topic #8 (0.100): 0.031*"warn" + 0.019*"coast" + 0.017*"deal" + 0.016*"price" + 0.014*"australia" + 0.013*"bird" + 0.013*"gold" + 0.012*"threat" + 0.011*"trade" + 0.011*"north"
2020-04-13 13:49:35,855 : INFO : topic #6 (0.100): 0.016*"continu" + 0.016*"world" + 0.014*"final" + 0.013*"south" + 0.012*"rain" + 0.012*"sydney" + 0.011*"storm" + 0.008*"dump" + 0.008*"effort" + 0.008*"england"
2020-04-13 13:49:35,859 : INFO : topic #0 (0.100): 0.015*"secur" + 0.015*"sale" + 0.015*"head" + 0.014*"union" + 0.013*"fish" + 0.013*"telstra" + 0.012*"back" + 0.011*"region" + 0.011*"leav" + 0.011*"lead"
2020-04-13 13:49:35,861 : INFO : topic #5 (0.100): 0.054*"plan" + 0.043*"council" + 0.022*"chang" + 0.016*"water" + 0.015*"govt" + 0.014*

2020-04-13 13:49:39,525 : INFO : topic #2 (0.100): 0.041*"kill" + 0.022*"test" + 0.019*"iraq" + 0.018*"open" + 0.016*"attack" + 0.015*"forc" + 0.015*"report" + 0.013*"dead" + 0.013*"leader" + 0.013*"bomb"
2020-04-13 13:49:39,527 : INFO : topic #5 (0.100): 0.053*"plan" + 0.042*"council" + 0.023*"chang" + 0.016*"water" + 0.015*"govt" + 0.014*"group" + 0.013*"nation" + 0.011*"concern" + 0.011*"mayor" + 0.011*"park"
2020-04-13 13:49:39,529 : INFO : topic diff=0.023141, rho=0.081111
2020-04-13 13:49:39,530 : INFO : PROGRESS: pass 1, dispatched chunk #121 = documents up to #244000/300000, outstanding queue size 6
2020-04-13 13:49:39,968 : INFO : PROGRESS: pass 1, dispatched chunk #122 = documents up to #246000/300000, outstanding queue size 6
2020-04-13 13:49:40,228 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:49:40,238 : INFO : topic #6 (0.100): 0.018*"continu" + 0.015*"world" + 0.014*"final" + 0.013*"south" + 0.011*"sydney" + 0.011*"rain" + 0.

2020-04-13 13:49:43,715 : INFO : topic #5 (0.100): 0.052*"plan" + 0.042*"council" + 0.021*"chang" + 0.019*"water" + 0.016*"govt" + 0.013*"nation" + 0.012*"group" + 0.011*"closer" + 0.011*"concern" + 0.011*"urg"
2020-04-13 13:49:43,718 : INFO : topic #4 (0.100): 0.035*"death" + 0.027*"investig" + 0.026*"crash" + 0.025*"polic" + 0.021*"probe" + 0.021*"road" + 0.014*"break" + 0.013*"close" + 0.013*"year" + 0.012*"liber"
2020-04-13 13:49:43,720 : INFO : topic #6 (0.100): 0.019*"continu" + 0.017*"world" + 0.014*"final" + 0.013*"south" + 0.010*"rain" + 0.010*"sydney" + 0.009*"england" + 0.009*"flood" + 0.008*"game" + 0.008*"tiger"
2020-04-13 13:49:43,723 : INFO : topic #1 (0.100): 0.060*"polic" + 0.036*"charg" + 0.031*"face" + 0.031*"court" + 0.020*"miss" + 0.018*"accus" + 0.018*"murder" + 0.018*"drug" + 0.014*"arrest" + 0.013*"woman"
2020-04-13 13:49:43,725 : INFO : topic #0 (0.100): 0.017*"sale" + 0.015*"union" + 0.015*"head" + 0.014*"secur" + 0.014*"fish" + 0.013*"back" + 0.011*"leav" + 0

2020-04-13 13:49:46,974 : INFO : topic #7 (0.100): 0.048*"govt" + 0.023*"fund" + 0.019*"reject" + 0.018*"boost" + 0.016*"hospit" + 0.016*"rise" + 0.015*"claim" + 0.014*"urg" + 0.014*"power" + 0.013*"health"
2020-04-13 13:49:46,977 : INFO : topic diff=0.023085, rho=0.081111
2020-04-13 13:49:46,979 : INFO : PROGRESS: pass 1, dispatched chunk #143 = documents up to #288000/300000, outstanding queue size 6
2020-04-13 13:49:47,235 : INFO : PROGRESS: pass 1, dispatched chunk #144 = documents up to #290000/300000, outstanding queue size 6
2020-04-13 13:49:47,533 : INFO : merging changes from 4000 documents into a model of 300000 documents
2020-04-13 13:49:47,542 : INFO : topic #6 (0.100): 0.018*"continu" + 0.015*"world" + 0.014*"south" + 0.013*"final" + 0.012*"sydney" + 0.012*"england" + 0.011*"rain" + 0.009*"storm" + 0.009*"blue" + 0.008*"victori"
2020-04-13 13:49:47,543 : INFO : topic #3 (0.100): 0.026*"help" + 0.020*"urg" + 0.019*"servic" + 0.015*"worker" + 0.015*"communiti" + 0.014*"child

2020-04-13 13:49:50,795 : INFO : topic #8 (0.100): 0.029*"warn" + 0.022*"aust" + 0.015*"coast" + 0.015*"deal" + 0.014*"north" + 0.013*"price" + 0.012*"australia" + 0.012*"bodi" + 0.012*"threat" + 0.011*"gold"
2020-04-13 13:49:50,797 : INFO : topic #4 (0.100): 0.034*"death" + 0.031*"crash" + 0.028*"investig" + 0.027*"polic" + 0.021*"road" + 0.017*"probe" + 0.015*"break" + 0.015*"year" + 0.012*"close" + 0.011*"fatal"
2020-04-13 13:49:50,797 : INFO : topic diff=0.024443, rho=0.081111
2020-04-13 13:49:51,284 : INFO : -8.040 per-word bound, 263.2 perplexity estimate based on a held-out corpus of 2000 documents with 8648 words
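
The last log line reports a per-word bound and a perplexity estimate. The two numbers are consistent with perplexity being 2 raised to the negative per-word bound. A quick check, using the value from the log above (the variable names are only illustrative):

```python
# Per-word bound copied from the final log line of the training run above.
per_word_bound = -8.040

# Perplexity as 2 to the power of the negative per-word bound.
perplexity = 2 ** (-per_word_bound)

print(round(perplexity, 1))
```

Lower perplexity on held-out documents generally indicates a better fit, so this number can be used to compare runs with different settings on the same corpus.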


In [30]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

2020-04-13 13:49:51,394 : INFO : topic #0 (0.100): 0.015*"union" + 0.015*"sale" + 0.014*"head" + 0.014*"secur" + 0.014*"back" + 0.013*"lead" + 0.012*"leav" + 0.012*"increas" + 0.011*"fish" + 0.011*"region"
2020-04-13 13:49:51,395 : INFO : topic #1 (0.100): 0.063*"polic" + 0.036*"charg" + 0.032*"court" + 0.029*"face" + 0.022*"miss" + 0.019*"accus" + 0.019*"murder" + 0.018*"drug" + 0.015*"search" + 0.014*"woman"
2020-04-13 13:49:51,397 : INFO : topic #2 (0.100): 0.039*"kill" + 0.022*"iraq" + 0.021*"test" + 0.019*"open" + 0.019*"forc" + 0.016*"attack" + 0.015*"report" + 0.013*"leader" + 0.013*"dead" + 0.011*"releas"
2020-04-13 13:49:51,399 : INFO : topic #3 (0.100): 0.026*"help" + 0.019*"servic" + 0.018*"urg" + 0.014*"worker" + 0.014*"communiti" + 0.013*"jail" + 0.012*"child" + 0.012*"safeti" + 0.011*"stand" + 0.011*"fight"
2020-04-13 13:49:51,400 : INFO : topic #4 (0.100): 0.034*"death" + 0.031*"crash" + 0.028*"investig" + 0.027*"polic" + 0.021*"road" + 0.017*"probe" + 0.015*"break" + 0.

Topic: 0 
Words: 0.015*"union" + 0.015*"sale" + 0.014*"head" + 0.014*"secur" + 0.014*"back" + 0.013*"lead" + 0.012*"leav" + 0.012*"increas" + 0.011*"fish" + 0.011*"region"


Topic: 1 
Words: 0.063*"polic" + 0.036*"charg" + 0.032*"court" + 0.029*"face" + 0.022*"miss" + 0.019*"accus" + 0.019*"murder" + 0.018*"drug" + 0.015*"search" + 0.014*"woman"


Topic: 2 
Words: 0.039*"kill" + 0.022*"iraq" + 0.021*"test" + 0.019*"open" + 0.019*"forc" + 0.016*"attack" + 0.015*"report" + 0.013*"leader" + 0.013*"dead" + 0.011*"releas"


Topic: 3 
Words: 0.026*"help" + 0.019*"servic" + 0.018*"urg" + 0.014*"worker" + 0.014*"communiti" + 0.013*"jail" + 0.012*"child" + 0.012*"safeti" + 0.011*"stand" + 0.011*"fight"


Topic: 4 
Words: 0.034*"death" + 0.031*"crash" + 0.028*"investig" + 0.027*"polic" + 0.021*"road" + 0.017*"probe" + 0.015*"break" + 0.015*"year" + 0.012*"close" + 0.011*"fatal"


Topic: 5 
Words: 0.050*"plan" + 0.041*"council" + 0.031*"water" + 0.019*"chang" + 0.018*"govt" + 0.015*"closer" + 0.0

### Classification of the topics ###

Using the words in each topic and their corresponding weights, what categories were you able to infer?

* 0: 
* 1: 
* 2: 
* 3: 
* 4: 
* 5: 
* 6: 
* 7:  
* 8: 
* 9: 
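
To make the labeling exercise easier, each topic string printed by `print_topics` can be split into (word, weight) pairs so the strongest words are easy to scan. The sketch below uses plain Python on one of the strings printed above. `parse_topic` is a hypothetical helper written for this example, not part of gensim.

```python
import re

def parse_topic(topic_str):
    """Split a string like '0.063*"polic" + 0.036*"charg"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

# First five terms of Topic 1, copied from the output above.
topic_1 = '0.063*"polic" + 0.036*"charg" + 0.032*"court" + 0.029*"face" + 0.022*"miss"'

top_words = parse_topic(topic_1)
print([word for word, _ in top_words[:3]])
```

Words such as `polic`, `charg`, and `court` appearing together suggest a crime or justice category, which is the kind of inference the list above asks for.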

## Step 4.2: Running LDA using TF-IDF ##

In [31]:
'''
Define the LDA model using corpus_tfidf
'''
# TODO
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf,
                                             num_topics=10,
                                             id2word=dictionary,
                                             passes=2,
                                             workers=4)

2020-04-13 13:49:51,425 : INFO : using symmetric alpha at 0.1
2020-04-13 13:49:51,426 : INFO : using symmetric eta at 0.1
2020-04-13 13:49:51,429 : INFO : using serial LDA version on this node
2020-04-13 13:49:51,441 : INFO : running online LDA training, 10 topics, 2 passes over the supplied corpus of 300000 documents, updating every 8000 documents, evaluating every ~80000 documents, iterating 50x with a convergence threshold of 0.001000
2020-04-13 13:49:51,443 : INFO : training LDA model using 4 processes
2020-04-13 13:49:51,834 : INFO : PROGRESS: pass 0, dispatched chunk #0 = documents up to #2000/300000, outstanding queue size 1
2020-04-13 13:49:52,233 : INFO : PROGRESS: pass 0, dispatched chunk #1 = documents up to #4000/300000, outstanding queue size 2
2020-04-13 13:49:52,447 : INFO : PROGRESS: pass 0, dispatched chunk #2 = documents up to #6000/300000, outstanding queue size 3
2020-04-13 13:49:52,718 : INFO : PROGRESS: pass 0, dispatched chunk #3 = documents up to #8000/300000, o

2020-04-13 13:49:59,322 : INFO : topic #7 (0.100): 0.006*"face" + 0.006*"charg" + 0.005*"budget" + 0.005*"polic" + 0.005*"test" + 0.004*"appeal" + 0.004*"govt" + 0.004*"stab" + 0.004*"reject" + 0.004*"plan"
2020-04-13 13:49:59,324 : INFO : topic diff=0.136362, rho=0.277350
2020-04-13 13:49:59,587 : INFO : PROGRESS: pass 0, dispatched chunk #25 = documents up to #52000/300000, outstanding queue size 10
2020-04-13 13:49:59,848 : INFO : PROGRESS: pass 0, dispatched chunk #26 = documents up to #54000/300000, outstanding queue size 11
2020-04-13 13:50:00,202 : INFO : PROGRESS: pass 0, dispatched chunk #27 = documents up to #56000/300000, outstanding queue size 12
2020-04-13 13:50:00,521 : INFO : PROGRESS: pass 0, dispatched chunk #28 = documents up to #58000/300000, outstanding queue size 12
2020-04-13 13:50:00,702 : INFO : merging changes from 8000 documents into a model of 300000 documents
2020-04-13 13:50:00,712 : INFO : topic #5 (0.100): 0.008*"water" + 0.007*"iraq" + 0.007*"govt" + 0.0

2020-04-13 13:50:05,632 : INFO : topic #9 (0.100): 0.007*"court" + 0.006*"face" + 0.005*"teacher" + 0.005*"black" + 0.005*"strike" + 0.005*"plan" + 0.005*"polic" + 0.005*"solomon" + 0.005*"hospit" + 0.004*"home"
2020-04-13 13:50:05,647 : INFO : topic #2 (0.100): 0.007*"iraq" + 0.006*"melbourn" + 0.006*"kill" + 0.005*"dollar" + 0.005*"win" + 0.004*"world" + 0.004*"secur" + 0.004*"report" + 0.004*"final" + 0.004*"high"
2020-04-13 13:50:05,649 : INFO : topic #8 (0.100): 0.008*"kill" + 0.007*"polic" + 0.007*"miss" + 0.006*"murder" + 0.006*"bird" + 0.006*"attack" + 0.006*"palestinian" + 0.006*"baghdad" + 0.005*"iraqi" + 0.005*"suspect"
2020-04-13 13:50:05,650 : INFO : topic diff=0.089985, rho=0.174078
2020-04-13 13:50:05,853 : INFO : PROGRESS: pass 0, dispatched chunk #46 = documents up to #94000/300000, outstanding queue size 11
2020-04-13 13:50:06,095 : INFO : PROGRESS: pass 0, dispatched chunk #47 = documents up to #96000/300000, outstanding queue size 12
2020-04-13 13:50:06,284 : INFO :

2020-04-13 13:50:11,773 : INFO : topic #4 (0.100): 0.009*"latham" + 0.007*"polic" + 0.006*"face" + 0.006*"charg" + 0.006*"plan" + 0.006*"drug" + 0.006*"court" + 0.005*"death" + 0.005*"council" + 0.004*"govt"
2020-04-13 13:50:11,804 : INFO : topic #5 (0.100): 0.009*"water" + 0.008*"fund" + 0.008*"govt" + 0.008*"boost" + 0.007*"plan" + 0.007*"council" + 0.005*"athen" + 0.005*"restrict" + 0.005*"iraq" + 0.005*"want"
2020-04-13 13:50:11,806 : INFO : topic #0 (0.100): 0.012*"polic" + 0.009*"investig" + 0.007*"probe" + 0.006*"accid" + 0.005*"olymp" + 0.005*"doubt" + 0.005*"govt" + 0.005*"kill" + 0.005*"claim" + 0.005*"inquiri"
2020-04-13 13:50:11,807 : INFO : topic diff=0.072522, rho=0.134840
2020-04-13 13:50:11,972 : INFO : PROGRESS: pass 0, dispatched chunk #62 = documents up to #126000/300000, outstanding queue size 5
2020-04-13 13:50:12,217 : INFO : PROGRESS: pass 0, dispatched chunk #63 = documents up to #128000/300000, outstanding queue size 5
2020-04-13 13:50:12,460 : INFO : PROGRESS:

2020-04-13 13:50:17,378 : INFO : merging changes from 8000 documents into a model of 300000 documents
2020-04-13 13:50:17,391 : INFO : topic #6 (0.100): 0.008*"crash" + 0.007*"polic" + 0.006*"warn" + 0.005*"victim" + 0.005*"fear" + 0.005*"doctor" + 0.005*"highway" + 0.005*"plan" + 0.004*"govt" + 0.004*"council"
2020-04-13 13:50:17,392 : INFO : topic #9 (0.100): 0.007*"teacher" + 0.006*"strike" + 0.005*"black" + 0.005*"court" + 0.005*"chelsea" + 0.004*"face" + 0.004*"polic" + 0.004*"titl" + 0.004*"action" + 0.004*"hospit"
2020-04-13 13:50:17,413 : INFO : topic #1 (0.100): 0.006*"plan" + 0.006*"council" + 0.004*"merger" + 0.004*"confid" + 0.004*"land" + 0.004*"govt" + 0.004*"pope" + 0.004*"growth" + 0.004*"profit" + 0.004*"lead"
2020-04-13 13:50:17,415 : INFO : topic #2 (0.100): 0.007*"iraq" + 0.006*"kill" + 0.005*"high" + 0.005*"win" + 0.005*"lead" + 0.005*"melbourn" + 0.005*"final" + 0.005*"record" + 0.004*"hit" + 0.004*"pakistan"
2020-04-13 13:50:17,436 : INFO : topic #4 (0.100): 0.00

2020-04-13 13:50:21,986 : INFO : PROGRESS: pass 0, dispatched chunk #99 = documents up to #200000/300000, outstanding queue size 5
2020-04-13 13:50:22,239 : INFO : PROGRESS: pass 0, dispatched chunk #100 = documents up to #202000/300000, outstanding queue size 5
2020-04-13 13:50:22,458 : INFO : PROGRESS: pass 0, dispatched chunk #101 = documents up to #204000/300000, outstanding queue size 5
2020-04-13 13:50:22,556 : INFO : merging changes from 8000 documents into a model of 300000 documents
2020-04-13 13:50:22,566 : INFO : topic #0 (0.100): 0.013*"polic" + 0.010*"investig" + 0.008*"probe" + 0.006*"fatal" + 0.005*"accid" + 0.005*"plan" + 0.005*"doubt" + 0.005*"inquiri" + 0.005*"kill" + 0.005*"airport"
2020-04-13 13:50:22,568 : INFO : topic #2 (0.100): 0.005*"iraq" + 0.005*"high" + 0.005*"price" + 0.005*"win" + 0.005*"kill" + 0.005*"final" + 0.005*"london" + 0.004*"hit" + 0.004*"lead" + 0.004*"record"
2020-04-13 13:50:22,569 : INFO : topic #9 (0.100): 0.007*"teacher" + 0.006*"black" + 0

2020-04-13 13:50:26,673 : INFO : topic #9 (0.100): 0.007*"teacher" + 0.005*"black" + 0.005*"strike" + 0.005*"action" + 0.005*"titl" + 0.004*"court" + 0.004*"chelsea" + 0.004*"hurrican" + 0.004*"legal" + 0.004*"illeg"
2020-04-13 13:50:26,677 : INFO : topic diff=0.057868, rho=0.094916
2020-04-13 13:50:26,910 : INFO : PROGRESS: pass 0, dispatched chunk #118 = documents up to #238000/300000, outstanding queue size 5
2020-04-13 13:50:27,134 : INFO : PROGRESS: pass 0, dispatched chunk #119 = documents up to #240000/300000, outstanding queue size 4
2020-04-13 13:50:27,359 : INFO : PROGRESS: pass 0, dispatched chunk #120 = documents up to #242000/300000, outstanding queue size 5
2020-04-13 13:50:27,594 : INFO : PROGRESS: pass 0, dispatched chunk #121 = documents up to #244000/300000, outstanding queue size 5
2020-04-13 13:50:27,674 : INFO : merging changes from 8000 documents into a model of 300000 documents
2020-04-13 13:50:27,682 : INFO : topic #4 (0.100): 0.007*"polic" + 0.007*"charg" + 0.0

2020-04-13 13:50:32,184 : INFO : topic #0 (0.100): 0.013*"polic" + 0.012*"investig" + 0.007*"fatal" + 0.007*"probe" + 0.005*"accid" + 0.005*"doubt" + 0.005*"crash" + 0.005*"inquiri" + 0.005*"dump" + 0.005*"govt"
2020-04-13 13:50:32,186 : INFO : topic #8 (0.100): 0.014*"kill" + 0.010*"murder" + 0.010*"miss" + 0.010*"polic" + 0.009*"bomb" + 0.008*"attack" + 0.008*"search" + 0.008*"bird" + 0.007*"suspect" + 0.007*"woman"
2020-04-13 13:50:32,187 : INFO : topic #2 (0.100): 0.007*"lebanon" + 0.005*"market" + 0.005*"open" + 0.005*"win" + 0.005*"uranium" + 0.005*"victori" + 0.005*"aussi" + 0.005*"record" + 0.005*"final" + 0.004*"high"
2020-04-13 13:50:32,189 : INFO : topic diff=0.056607, rho=0.087370
2020-04-13 13:50:32,398 : INFO : PROGRESS: pass 0, dispatched chunk #138 = documents up to #278000/300000, outstanding queue size 5
2020-04-13 13:50:32,638 : INFO : PROGRESS: pass 0, dispatched chunk #139 = documents up to #280000/300000, outstanding queue size 5
2020-04-13 13:50:32,866 : INFO : P

2020-04-13 13:50:38,166 : INFO : PROGRESS: pass 1, dispatched chunk #7 = documents up to #16000/300000, outstanding queue size 5
2020-04-13 13:50:38,246 : INFO : merging changes from 8000 documents into a model of 300000 documents
2020-04-13 13:50:38,254 : INFO : topic #6 (0.100): 0.012*"crash" + 0.006*"highway" + 0.006*"plane" + 0.006*"polic" + 0.006*"warn" + 0.006*"cyclon" + 0.005*"govt" + 0.005*"victim" + 0.005*"criticis" + 0.004*"green"
2020-04-13 13:50:38,256 : INFO : topic #0 (0.100): 0.014*"polic" + 0.013*"investig" + 0.008*"fatal" + 0.007*"probe" + 0.005*"crash" + 0.005*"doubt" + 0.005*"accid" + 0.005*"climat" + 0.005*"govt" + 0.005*"death"
2020-04-13 13:50:38,259 : INFO : topic #8 (0.100): 0.014*"kill" + 0.011*"miss" + 0.011*"murder" + 0.011*"polic" + 0.009*"search" + 0.009*"bomb" + 0.008*"attack" + 0.007*"woman" + 0.007*"iraq" + 0.007*"suspect"
2020-04-13 13:50:38,284 : INFO : topic #1 (0.100): 0.027*"closer" + 0.005*"plan" + 0.005*"council" + 0.005*"profit" + 0.004*"confid" 

2020-04-13 13:50:42,595 : INFO : topic diff=0.049279, rho=0.081111
2020-04-13 13:50:42,804 : INFO : PROGRESS: pass 1, dispatched chunk #24 = documents up to #50000/300000, outstanding queue size 5
2020-04-13 13:50:43,060 : INFO : PROGRESS: pass 1, dispatched chunk #25 = documents up to #52000/300000, outstanding queue size 5
2020-04-13 13:50:43,326 : INFO : PROGRESS: pass 1, dispatched chunk #26 = documents up to #54000/300000, outstanding queue size 5
2020-04-13 13:50:43,556 : INFO : PROGRESS: pass 1, dispatched chunk #27 = documents up to #56000/300000, outstanding queue size 5
2020-04-13 13:50:43,638 : INFO : merging changes from 8000 documents into a model of 300000 documents
2020-04-13 13:50:43,679 : INFO : topic #0 (0.100): 0.014*"polic" + 0.012*"investig" + 0.008*"fatal" + 0.008*"probe" + 0.006*"accid" + 0.006*"doubt" + 0.005*"crash" + 0.005*"inquiri" + 0.005*"death" + 0.005*"govt"
2020-04-13 13:50:43,681 : INFO : topic #2 (0.100): 0.006*"final" + 0.006*"open" + 0.006*"lead" + 0

2020-04-13 13:50:48,347 : INFO : topic #9 (0.100): 0.008*"teacher" + 0.007*"black" + 0.007*"solomon" + 0.006*"strike" + 0.006*"titl" + 0.005*"wallabi" + 0.005*"action" + 0.004*"asylum" + 0.004*"forc" + 0.004*"legal"
2020-04-13 13:50:48,348 : INFO : topic diff=0.047957, rho=0.081111
2020-04-13 13:50:48,551 : INFO : PROGRESS: pass 1, dispatched chunk #43 = documents up to #88000/300000, outstanding queue size 4
2020-04-13 13:50:48,828 : INFO : PROGRESS: pass 1, dispatched chunk #44 = documents up to #90000/300000, outstanding queue size 5
2020-04-13 13:50:49,123 : INFO : PROGRESS: pass 1, dispatched chunk #45 = documents up to #92000/300000, outstanding queue size 5
2020-04-13 13:50:49,378 : INFO : PROGRESS: pass 1, dispatched chunk #46 = documents up to #94000/300000, outstanding queue size 4
2020-04-13 13:50:49,621 : INFO : PROGRESS: pass 1, dispatched chunk #47 = documents up to #96000/300000, outstanding queue size 5
2020-04-13 13:50:49,707 : INFO : merging changes from 8000 document

2020-04-13 13:50:54,171 : INFO : topic #9 (0.100): 0.009*"teacher" + 0.007*"strike" + 0.006*"black" + 0.006*"titl" + 0.005*"solomon" + 0.005*"wallabi" + 0.005*"action" + 0.004*"seri" + 0.004*"jone" + 0.004*"legal"
2020-04-13 13:50:54,172 : INFO : topic #5 (0.100): 0.012*"govt" + 0.012*"fund" + 0.011*"water" + 0.011*"plan" + 0.009*"council" + 0.009*"boost" + 0.009*"health" + 0.007*"urg" + 0.006*"group" + 0.006*"servic"
2020-04-13 13:50:54,177 : INFO : topic #1 (0.100): 0.010*"closer" + 0.006*"profit" + 0.006*"merger" + 0.006*"council" + 0.005*"plan" + 0.005*"highlight" + 0.005*"hill" + 0.004*"confid" + 0.004*"growth" + 0.004*"young"
2020-04-13 13:50:54,205 : INFO : topic diff=0.047784, rho=0.081111
2020-04-13 13:50:54,399 : INFO : PROGRESS: pass 1, dispatched chunk #63 = documents up to #128000/300000, outstanding queue size 4
2020-04-13 13:50:54,660 : INFO : PROGRESS: pass 1, dispatched chunk #64 = documents up to #130000/300000, outstanding queue size 4
2020-04-13 13:50:54,952 : INFO 

2020-04-13 13:51:00,355 : INFO : PROGRESS: pass 1, dispatched chunk #83 = documents up to #168000/300000, outstanding queue size 5
2020-04-13 13:51:00,449 : INFO : merging changes from 8000 documents into a model of 300000 documents
2020-04-13 13:51:00,458 : INFO : topic #1 (0.100): 0.008*"closer" + 0.006*"profit" + 0.005*"council" + 0.005*"merger" + 0.005*"plan" + 0.005*"hill" + 0.005*"highlight" + 0.005*"confid" + 0.005*"growth" + 0.005*"pope"
2020-04-13 13:51:00,461 : INFO : topic #8 (0.100): 0.019*"kill" + 0.012*"polic" + 0.012*"murder" + 0.011*"miss" + 0.010*"attack" + 0.010*"bomb" + 0.010*"iraq" + 0.009*"search" + 0.009*"iraqi" + 0.007*"woman"
2020-04-13 13:51:00,481 : INFO : topic #3 (0.100): 0.007*"celebr" + 0.006*"french" + 0.006*"clean" + 0.005*"shark" + 0.005*"injuri" + 0.005*"tree" + 0.005*"hobart" + 0.005*"hunter" + 0.005*"storm" + 0.005*"warrior"
2020-04-13 13:51:00,483 : INFO : topic #7 (0.100): 0.009*"charg" + 0.008*"appeal" + 0.007*"jail" + 0.007*"drive" + 0.007*"sente

2020-04-13 13:51:05,814 : INFO : topic diff=0.044159, rho=0.081111
2020-04-13 13:51:06,019 : INFO : PROGRESS: pass 1, dispatched chunk #100 = documents up to #202000/300000, outstanding queue size 5
2020-04-13 13:51:06,295 : INFO : PROGRESS: pass 1, dispatched chunk #101 = documents up to #204000/300000, outstanding queue size 4
2020-04-13 13:51:06,598 : INFO : PROGRESS: pass 1, dispatched chunk #102 = documents up to #206000/300000, outstanding queue size 5
2020-04-13 13:51:06,876 : INFO : PROGRESS: pass 1, dispatched chunk #103 = documents up to #208000/300000, outstanding queue size 5
2020-04-13 13:51:06,973 : INFO : merging changes from 8000 documents into a model of 300000 documents
2020-04-13 13:51:06,982 : INFO : topic #4 (0.100): 0.008*"polic" + 0.007*"drug" + 0.006*"latham" + 0.006*"charg" + 0.006*"care" + 0.006*"face" + 0.006*"toll" + 0.006*"death" + 0.005*"govt" + 0.005*"road"
2020-04-13 13:51:06,984 : INFO : topic #7 (0.100): 0.008*"charg" + 0.008*"jail" + 0.007*"drive" + 0

2020-04-13 13:51:11,908 : INFO : topic #1 (0.100): 0.010*"closer" + 0.007*"profit" + 0.006*"highlight" + 0.005*"hill" + 0.005*"council" + 0.005*"plan" + 0.004*"confid" + 0.004*"merger" + 0.004*"growth" + 0.004*"young"
2020-04-13 13:51:11,911 : INFO : topic #8 (0.100): 0.018*"kill" + 0.012*"polic" + 0.012*"miss" + 0.011*"bomb" + 0.011*"murder" + 0.010*"search" + 0.010*"attack" + 0.008*"bird" + 0.008*"iraq" + 0.008*"charg"
2020-04-13 13:51:11,922 : INFO : topic diff=0.043288, rho=0.081111
2020-04-13 13:51:12,117 : INFO : PROGRESS: pass 1, dispatched chunk #120 = documents up to #242000/300000, outstanding queue size 5
2020-04-13 13:51:12,391 : INFO : PROGRESS: pass 1, dispatched chunk #121 = documents up to #244000/300000, outstanding queue size 4
2020-04-13 13:51:12,668 : INFO : PROGRESS: pass 1, dispatched chunk #122 = documents up to #246000/300000, outstanding queue size 5
2020-04-13 13:51:12,753 : INFO : merging changes from 8000 documents into a model of 300000 documents
2020-04-13

2020-04-13 13:51:17,489 : INFO : topic #3 (0.100): 0.007*"celebr" + 0.006*"clean" + 0.005*"injuri" + 0.005*"shark" + 0.005*"storm" + 0.005*"hobart" + 0.005*"french" + 0.005*"hunter" + 0.005*"tree" + 0.005*"warrior"
2020-04-13 13:51:17,492 : INFO : topic #0 (0.100): 0.013*"investig" + 0.013*"polic" + 0.008*"fatal" + 0.008*"probe" + 0.006*"accid" + 0.006*"doubt" + 0.006*"dump" + 0.005*"death" + 0.005*"indonesian" + 0.005*"crash"
2020-04-13 13:51:17,494 : INFO : topic #6 (0.100): 0.013*"crash" + 0.008*"highway" + 0.006*"polic" + 0.006*"plane" + 0.006*"warn" + 0.006*"cyclon" + 0.006*"whale" + 0.005*"victim" + 0.005*"truck" + 0.005*"criticis"
2020-04-13 13:51:17,508 : INFO : topic #7 (0.100): 0.009*"jail" + 0.008*"charg" + 0.008*"drive" + 0.008*"drink" + 0.007*"appeal" + 0.007*"sentenc" + 0.006*"polic" + 0.005*"term" + 0.005*"news" + 0.005*"wine"
2020-04-13 13:51:17,510 : INFO : topic diff=0.040676, rho=0.081111
2020-04-13 13:51:17,681 : INFO : PROGRESS: pass 1, dispatched chunk #140 = docu

In [32]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

2020-04-13 13:51:22,461 : INFO : topic #0 (0.100): 0.014*"investig" + 0.013*"polic" + 0.009*"fatal" + 0.007*"probe" + 0.006*"climat" + 0.006*"doubt" + 0.006*"accid" + 0.005*"alic" + 0.005*"crash" + 0.005*"death"
2020-04-13 13:51:22,463 : INFO : topic #1 (0.100): 0.030*"closer" + 0.006*"profit" + 0.006*"hill" + 0.005*"highlight" + 0.005*"council" + 0.005*"young" + 0.005*"plan" + 0.005*"merger" + 0.004*"growth" + 0.004*"confid"
2020-04-13 13:51:22,465 : INFO : topic #2 (0.100): 0.008*"open" + 0.007*"lead" + 0.007*"aussi" + 0.006*"market" + 0.006*"victori" + 0.006*"final" + 0.006*"australia" + 0.005*"win" + 0.005*"lebanon" + 0.005*"record"
2020-04-13 13:51:22,466 : INFO : topic #3 (0.100): 0.007*"celebr" + 0.006*"clean" + 0.006*"storm" + 0.006*"hobart" + 0.006*"injuri" + 0.006*"shark" + 0.006*"recycl" + 0.005*"pipelin" + 0.005*"hunter" + 0.005*"warrior"
2020-04-13 13:51:22,468 : INFO : topic #4 (0.100): 0.007*"polic" + 0.007*"drug" + 0.006*"toll" + 0.006*"road" + 0.006*"death" + 0.006*"as

Topic: 0 Word: 0.014*"investig" + 0.013*"polic" + 0.009*"fatal" + 0.007*"probe" + 0.006*"climat" + 0.006*"doubt" + 0.006*"accid" + 0.005*"alic" + 0.005*"crash" + 0.005*"death"


Topic: 1 Word: 0.030*"closer" + 0.006*"profit" + 0.006*"hill" + 0.005*"highlight" + 0.005*"council" + 0.005*"young" + 0.005*"plan" + 0.005*"merger" + 0.004*"growth" + 0.004*"confid"


Topic: 2 Word: 0.008*"open" + 0.007*"lead" + 0.007*"aussi" + 0.006*"market" + 0.006*"victori" + 0.006*"final" + 0.006*"australia" + 0.005*"win" + 0.005*"lebanon" + 0.005*"record"


Topic: 3 Word: 0.007*"celebr" + 0.006*"clean" + 0.006*"storm" + 0.006*"hobart" + 0.006*"injuri" + 0.006*"shark" + 0.006*"recycl" + 0.005*"pipelin" + 0.005*"hunter" + 0.005*"warrior"


Topic: 4 Word: 0.007*"polic" + 0.007*"drug" + 0.006*"toll" + 0.006*"road" + 0.006*"death" + 0.006*"assault" + 0.006*"charg" + 0.005*"face" + 0.005*"timor" + 0.005*"govt"


Topic: 5 Word: 0.015*"govt" + 0.015*"water" + 0.012*"fund" + 0.012*"plan" + 0.010*"council" + 0.009*"

### Classification of the topics ###

As we can see, TF-IDF gives heavier weights to words that occur less frequently across the corpus, so rarer terms (often specific nouns) are factored into the topics. That makes the categories harder to infer, because such nouns can be hard to categorize. This shows that the right model depends on the type of text corpus we are dealing with.
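
The effect described above can be seen on a toy corpus. This is a minimal sketch in plain Python, assuming a simplified weighting of term count times log2(N / document frequency) with no normalization. The three stemmed headlines are made up for illustration, and gensim's `TfidfModel` may differ in detail.

```python
import math
from collections import Counter

# Hypothetical stemmed headlines, invented for this example.
docs = [
    ["polic", "charg", "man"],
    ["polic", "investig", "crash"],
    ["polic", "search", "shark"],
]

n_docs = len(docs)
# Number of documents each word appears in.
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Weight each word by its count in the document times log2(N / document frequency)."""
    counts = Counter(doc)
    return {word: count * math.log2(n_docs / doc_freq[word])
            for word, count in counts.items()}

weights = tfidf(docs[2])
print(weights)
```

Here `polic` appears in every document, so its weight drops to zero, while the rare noun `shark` gets the largest weight. The same mechanism pushes words like `shark` and `hobart` into the TF-IDF topics above.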

Using the words in each topic and their corresponding weights, what categories could you find?

* 0: 
* 1:  
* 2: 
* 3: 
* 4:  
* 5: 
* 6: 
* 7: 
* 8: 
* 9: 

## Step 5.1: Performance evaluation by classifying sample document using LDA Bag of Words model ##

We will check to see where our test document would be classified. 

In [33]:
'''
Text of sample document 4310
'''
processed_docs[4310]

['ratepay', 'group', 'want', 'compulsori', 'local', 'govt', 'vote']

In [34]:
'''
Check which topic our test document belongs to using the LDA Bag of Words model.
'''

# Our test document is document number 4310
document_num = 4310
for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.3726736009120941	 
Topic: 0.050*"plan" + 0.041*"council" + 0.031*"water" + 0.019*"chang" + 0.018*"govt" + 0.015*"closer" + 0.014*"nation" + 0.012*"group" + 0.012*"push" + 0.012*"urg"

Score: 0.2769027650356293	 
Topic: 0.051*"govt" + 0.022*"fund" + 0.019*"reject" + 0.018*"boost" + 0.017*"hospit" + 0.015*"power" + 0.015*"rise" + 0.014*"claim" + 0.014*"urg" + 0.012*"opposit"

Score: 0.13820751011371613	 
Topic: 0.019*"continu" + 0.016*"world" + 0.013*"sydney" + 0.013*"south" + 0.013*"rain" + 0.012*"final" + 0.011*"england" + 0.010*"flood" + 0.009*"blue" + 0.008*"storm"

Score: 0.13718006014823914	 
Topic: 0.017*"say" + 0.016*"high" + 0.016*"market" + 0.015*"nuclear" + 0.014*"busi" + 0.012*"howard" + 0.012*"inquiri" + 0.011*"campaign" + 0.011*"rudd" + 0.011*"hous"

Score: 0.012506648898124695	 
Topic: 0.026*"help" + 0.019*"servic" + 0.018*"urg" + 0.014*"worker" + 0.014*"communiti" + 0.013*"jail" + 0.012*"child" + 0.012*"safeti" + 0.011*"stand" + 0.011*"fight"

Score: 0.012506356

### The document has the highest probability (about `0.37`) of belonging to the topic led by "plan", "council", and "water", which fits the headline's subject of local government voting. ###
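The model returns a list of (topic_id, score) pairs for a document, and the loop above sorts them by score. If only the best topic is needed, a minimal sketch is to take the maximum by score. The scores below are rounded from the output above, but the topic ids and the helper name `top_topic` are hypothetical.

```python
# Hypothetical (topic_id, score) pairs; the scores are rounded from the output
# above, but the topic ids are made up for illustration.
doc_topics = [(0, 0.1382), (3, 0.2769), (5, 0.3727), (7, 0.1372), (8, 0.0125)]

def top_topic(doc_topics):
    """Return the (topic_id, score) pair with the highest score."""
    return max(doc_topics, key=lambda pair: pair[1])

best_id, best_score = top_topic(doc_topics)
print("Top topic: {} (score {:.2f})".format(best_id, best_score))
```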

## Step 5.2: Performance evaluation by classifying sample document using LDA TF-IDF model ##

In [35]:
'''
Check which topic our test document belongs to using the LDA TF-IDF model.
'''
for index, score in sorted(lda_model_tfidf[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.5525990724563599	 
Topic: 0.015*"govt" + 0.015*"water" + 0.012*"fund" + 0.012*"plan" + 0.010*"council" + 0.009*"urg" + 0.008*"health" + 0.007*"boost" + 0.007*"group" + 0.007*"indigen"

Score: 0.18644870817661285	 
Topic: 0.007*"polic" + 0.007*"drug" + 0.006*"toll" + 0.006*"road" + 0.006*"death" + 0.006*"assault" + 0.006*"charg" + 0.005*"face" + 0.005*"timor" + 0.005*"govt"

Score: 0.1734178364276886	 
Topic: 0.008*"open" + 0.007*"lead" + 0.007*"aussi" + 0.006*"market" + 0.006*"victori" + 0.006*"final" + 0.006*"australia" + 0.005*"win" + 0.005*"lebanon" + 0.005*"record"

Score: 0.012505277059972286	 
Topic: 0.014*"crash" + 0.007*"highway" + 0.007*"cyclon" + 0.006*"plane" + 0.006*"polic" + 0.006*"warn" + 0.005*"die" + 0.005*"whale" + 0.005*"govt" + 0.005*"green"

Score: 0.012505244463682175	 
Topic: 0.030*"closer" + 0.006*"profit" + 0.006*"hill" + 0.005*"highlight" + 0.005*"council" + 0.005*"young" + 0.005*"plan" + 0.005*"merger" + 0.004*"growth" + 0.004*"confid"

Score: 0.0125

### The document has the highest probability (about `55%`) of belonging to the topic led by "govt", "water", and "fund". ###
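The repeated scores near 0.0125 in the outputs above have a plausible explanation, sketched here as an assumption about standard variational inference rather than a statement of gensim's exact code path: a topic that explains none of the document's words keeps only its prior weight alpha, and the weights are normalized by the sum of all priors plus the number of tokens. The values alpha = 0.1 and 10 topics are read from the log lines (`topic #2 (0.100)`); the helper name `floor_score` is hypothetical.

```python
def floor_score(alpha, num_topics, num_tokens):
    """Score of a topic that is assigned none of the document's tokens.

    Assumes a symmetric Dirichlet prior `alpha` over `num_topics` topics and a
    document with `num_tokens` in-dictionary tokens.
    """
    return alpha / (alpha * num_topics + num_tokens)

# Document 4310 has 7 tokens, all assumed to be in the dictionary.
print(floor_score(alpha=0.1, num_topics=10, num_tokens=7))
```

Under these assumptions the result is about 0.0125, which is close to the smallest scores reported for document 4310 by both models.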

## Step 6: Testing model on unseen document ##

In [36]:
unseen_document = "My favorite sports activities are running and swimming."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.22002790868282318	 Topic: 0.015*"union" + 0.015*"sale" + 0.014*"head" + 0.014*"secur" + 0.014*"back"
Score: 0.2200108766555786	 Topic: 0.026*"help" + 0.019*"servic" + 0.018*"urg" + 0.014*"worker" + 0.014*"communiti"
Score: 0.22000348567962646	 Topic: 0.019*"continu" + 0.016*"world" + 0.013*"sydney" + 0.013*"south" + 0.013*"rain"
Score: 0.21991842985153198	 Topic: 0.017*"say" + 0.016*"high" + 0.016*"market" + 0.015*"nuclear" + 0.014*"busi"
Score: 0.020008614286780357	 Topic: 0.051*"govt" + 0.022*"fund" + 0.019*"reject" + 0.018*"boost" + 0.017*"hospit"
Score: 0.02000614069402218	 Topic: 0.063*"polic" + 0.036*"charg" + 0.032*"court" + 0.029*"face" + 0.022*"miss"
Score: 0.02000614069402218	 Topic: 0.039*"kill" + 0.022*"iraq" + 0.021*"test" + 0.019*"open" + 0.019*"forc"
Score: 0.02000614069402218	 Topic: 0.034*"death" + 0.031*"crash" + 0.028*"investig" + 0.027*"polic" + 0.021*"road"
Score: 0.02000614069402218	 Topic: 0.050*"plan" + 0.041*"council" + 0.031*"water" + 0.019*"chang" + 
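
The `doc2bow` step above converts the preprocessed tokens into (token_id, count) pairs and skips tokens that are not in the dictionary, which matters for unseen documents. The following is a plain-Python stand-in to illustrate that behaviour, not gensim's implementation; the vocabulary, the token list, and the helper name `simple_doc2bow` are hypothetical.

```python
from collections import Counter

# Hypothetical vocabulary; a gensim Dictionary exposes a similar mapping
# through its `token2id` attribute.
token2id = {"sport": 0, "activ": 1, "run": 2, "swim": 3}

def simple_doc2bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# "favorit" is not in the vocabulary, so it is dropped.
tokens = ["favorit", "sport", "activ", "run", "swim", "run"]
print(simple_doc2bow(tokens, token2id))
```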

In [37]:
from gensim.test.utils import datapath
    
# Save model to disk.
temp_file = datapath(r'C:\Users\Marco\Desktop\Gits\hmm-tagger\lda-models\model_save')
lda_model_tfidf.save(temp_file)


2020-04-13 13:51:22,578 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-04-13 13:51:22,583 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
2020-04-13 13:51:22,585 : INFO : saving LdaState object under C:\Users\Marco\Desktop\Gits\hmm-tagger\lda-models\model_save.state, separately None
2020-04-13 13:51:22,590 : INFO : saved C:\Users\Marco\Desktop\Gits\hmm-tagger\lda-models\model_save.state
2020-04-13 13:51:22,596 : INFO : saving LdaMulticore object under C:\Users\Marco\Desktop\Gits\hmm-tagger\lda-models\model_save, separately ['expElogbeta', 'sstats']
2020-04-13 13:51:22,597 : INFO : storing np array 'expElogbeta' to C:\Users\Marco\Desktop\Gits\hmm-tagger\lda-models\model_save.expElogbeta.npy
2020-04-13 13:51:22,599 : INFO : not storing attribute dispatcher
2020-04-13 13:51:22,601 : INFO : not storing attribute state
2020-04-13 13:51:22,601 : INFO : not storing attrib

In [38]:
# Load a potentially pretrained model from disk.
new_lda = gensim.models.LdaMulticore.load(temp_file)

2020-04-13 13:51:22,613 : INFO : loading LdaMulticore object from C:\Users\Marco\Desktop\Gits\hmm-tagger\lda-models\model_save
2020-04-13 13:51:22,619 : INFO : loading expElogbeta from C:\Users\Marco\Desktop\Gits\hmm-tagger\lda-models\model_save.expElogbeta.npy with mmap=None
2020-04-13 13:51:22,621 : INFO : setting ignored attribute dispatcher to None
2020-04-13 13:51:22,622 : INFO : setting ignored attribute state to None
2020-04-13 13:51:22,623 : INFO : setting ignored attribute id2word to None
2020-04-13 13:51:22,624 : INFO : loaded C:\Users\Marco\Desktop\Gits\hmm-tagger\lda-models\model_save
2020-04-13 13:51:22,625 : INFO : loading LdaState object from C:\Users\Marco\Desktop\Gits\hmm-tagger\lda-models\model_save.state
2020-04-13 13:51:22,629 : INFO : loaded C:\Users\Marco\Desktop\Gits\hmm-tagger\lda-models\model_save.state
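
The save and load cells above use a hard-coded Windows path, which will not exist on another machine. A minimal sketch of a more portable approach is to build the path with `pathlib` and create the folder if it is missing. The helper name `model_path` and the folder name are hypothetical, and the gensim calls are shown only as comments because they need the trained model from this notebook.

```python
import tempfile
from pathlib import Path

def model_path(base_dir, name="model_save"):
    """Create `base_dir` if needed and return the model file path as a string."""
    directory = Path(base_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return str(directory / name)

# Demonstration with a throwaway directory; in the notebook one would pass a
# project folder instead and then call, for example:
#   lda_model_tfidf.save(save_path)
#   new_lda = gensim.models.LdaMulticore.load(save_path)
with tempfile.TemporaryDirectory() as tmp:
    save_path = model_path(Path(tmp) / "lda-models")
    print(Path(save_path).name, Path(save_path).parent.is_dir())
```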


In [39]:
unseen_document = "My favorite sports activities are biking and eating."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

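# Note: the scores come from new_lda (the reloaded TF-IDF model), but the topic
# strings are printed from lda_model (the Bag of Words model), so the two may not
# correspond. Use new_lda.print_topic(index, 5) to print the matching topics.
for index, score in sorted(new_lda[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))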

Score: 0.3405349552631378	 Topic: 0.063*"polic" + 0.036*"charg" + 0.032*"court" + 0.029*"face" + 0.022*"miss"
Score: 0.2704864740371704	 Topic: 0.017*"say" + 0.016*"high" + 0.016*"market" + 0.015*"nuclear" + 0.014*"busi"
Score: 0.24889309704303741	 Topic: 0.026*"help" + 0.019*"servic" + 0.018*"urg" + 0.014*"worker" + 0.014*"communiti"
Score: 0.020019548013806343	 Topic: 0.019*"continu" + 0.016*"world" + 0.013*"sydney" + 0.013*"south" + 0.013*"rain"
Score: 0.020014025270938873	 Topic: 0.034*"death" + 0.031*"crash" + 0.028*"investig" + 0.027*"polic" + 0.021*"road"
Score: 0.020013127475976944	 Topic: 0.051*"govt" + 0.022*"fund" + 0.019*"reject" + 0.018*"boost" + 0.017*"hospit"
Score: 0.020010529085993767	 Topic: 0.029*"warn" + 0.022*"aust" + 0.015*"coast" + 0.015*"deal" + 0.014*"north"
Score: 0.020009785890579224	 Topic: 0.015*"union" + 0.015*"sale" + 0.014*"head" + 0.014*"secur" + 0.014*"back"
Score: 0.020009400323033333	 Topic: 0.039*"kill" + 0.022*"iraq" + 0.021*"test" + 0.019*"open" +

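The top-scoring topic for this unseen document has a probability of about 0.34 (34%), with two other topics close behind at about 0.27 and 0.25. None of the listed topics is clearly about sports, so this result is better read as a low-confidence assignment than as a correct classification.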
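
One way to judge how decisive such a result is, sketched here under the assumption that a small gap between the two best scores signals an uncertain assignment, is to compute that gap directly. The scores are rounded from the output above; the helper name `top_margin` and the threshold `MIN_MARGIN` are hypothetical and not a recommended setting.

```python
def top_margin(scores):
    """Return the gap between the highest and second-highest score."""
    ordered = sorted(scores, reverse=True)
    return ordered[0] - ordered[1]

# Top three scores from the output above, rounded.
scores = [0.3405, 0.2705, 0.2489]
MIN_MARGIN = 0.10  # hypothetical threshold, not a recommended value

margin = top_margin(scores)
print("margin={:.3f} decisive={}".format(margin, margin >= MIN_MARGIN))
```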