## Step 0: Latent Dirichlet Allocation ##

Latent Dirichlet Allocation (LDA) is used to assign the text in a document to one or more topics. It builds a topic-per-document model and a words-per-topic model, both with Dirichlet priors.

* Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
* LDA assumes that every chunk of text we feed into it contains words that are somehow related. Choosing the right corpus of data is therefore crucial.
* It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. 
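The generative story in the bullets above can be sketched in a few lines of plain Python. This is only a toy illustration: the vocabulary, the two topics, their word probabilities and the helper names (`sample_dirichlet`, `generate_document`) are all made up for the example and are not part of gensim.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate a toy document: pick a topic for each word, then a word from that topic."""
    theta = sample_dirichlet(alpha, rng)  # this document's mixture of topics
    n_topics = len(topic_word_probs)
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(n_topics), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
vocab = ["rain", "drought", "farmer", "court", "police", "jail"]
# Two hypothetical topics: one about weather and farming, one about crime
topic_word_probs = [
    [0.35, 0.35, 0.25, 0.02, 0.02, 0.01],
    [0.02, 0.02, 0.01, 0.35, 0.35, 0.25],
]
theta, doc = generate_document(topic_word_probs, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(t, 2) for t in theta], doc)
```

Training LDA runs this story in reverse: given only the documents, it estimates the topic mixtures and the per-topic word distributions that could plausibly have produced them.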

## Step 1: Load the dataset

The dataset we'll use is a list of over one million news headlines published over a period of 15 years. We'll start by loading it from the `abcnews-date-text.csv` file.

In [1]:
'''
Load the dataset from the CSV and save it to 'data_text'
'''
import pandas as pd
# on_bad_lines='skip' requires pandas >= 1.3; older versions use error_bad_lines=False
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')
# We only need the headline text column from the data
data_text = data[:300000][['headline_text']].copy()
data_text['index'] = data_text.index

documents = data_text

Let's glance at the dataset:

In [2]:
'''
Get the total number of documents
'''
print(len(documents))

300000


In [3]:
documents[:5]

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


## Step 2: Data Preprocessing ##

We will perform the following steps:

* **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
* Words that have 3 or fewer characters are removed.
* All **stopwords** are removed.
* Words are **lemmatized** - words in third person are changed to first person and verbs in past and future tenses are changed into present.
* Words are **stemmed** - words are reduced to their root form.


In [5]:
'''
Loading Gensim and nltk libraries
'''
# pip install gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)



In [7]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\wc5257\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Lemmatizer Example
Before preprocessing our dataset, let's first look at a lemmatizing example. What would the output be if we lemmatized the word 'went'?

In [8]:
print(WordNetLemmatizer().lemmatize('went', pos = 'v')) # past tense to present tense

go


### Stemmer Example
Let's also look at a stemming example. Let's throw a number of words at the stemmer and see how it deals with each one:

In [9]:
stemmer = SnowballStemmer("english")
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]

pd.DataFrame(data={'original word':original_words, 'stemmed':singles })

| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |


In [10]:
'''
Write a function to perform the preprocessing steps on the entire dataset
'''
def lemmatize_stemming(text):
    # Lemmatize as a verb first, then reduce the lemma to its stem
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocess(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        # Keep only non-stopword tokens longer than 3 characters
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            # TODO: Apply lemmatize_stemming() on the token, then add to the results list
            result.append(lemmatize_stemming(token))
            
    return result



In [11]:
'''
Preview a document after preprocessing
'''
document_num = 4310
doc_sample = documents[documents['index'] == document_num].values[0][0]

print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['rain', 'helps', 'dampen', 'bushfires']


Tokenized and lemmatized document: 
['rain', 'help', 'dampen', 'bushfir']


In [12]:
documents

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |
| ... | ... | ... |
| 299995 | broughton hall audit reveals serious breaches | 299995 |
| 299996 | broughton hall fails key standards | 299996 |
| 299997 | broughton hall safe for residents govt says | 299997 |
| 299998 | burn off at conservation park aims to prevent | 299998 |


Let's now preprocess all the news headlines we have. To do that, let's use the [map](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) function from pandas to apply `preprocess()` to the `headline_text` column

**Note**: This may take a few minutes (it takes about 6 minutes on my laptop).

In [18]:
# TODO: preprocess all the headlines, saving the list of results as 'processed_docs'
# when using map, send the function but without the ()
processed_docs = documents['headline_text'].map(preprocess)



In [19]:
'''
Preview 'processed_docs'
'''
processed_docs[:10]

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

## Step 3.1: Bag of words on the dataset

Now let's create a dictionary from 'processed_docs' containing the number of times a word appears in the training set. To do that, let's pass `processed_docs` to [`gensim.corpora.Dictionary()`](https://radimrehurek.com/gensim/corpora/dictionary.html) and call it '`dictionary`'.

In [20]:
'''
Create a dictionary from 'processed_docs' containing the number of times a word appears 
in the training set using gensim.corpora.Dictionary and call it 'dictionary'
'''
# This module implements the concept of a Dictionary – a mapping between words and their integer ids.
dictionary = gensim.corpora.Dictionary(processed_docs)

In [60]:
'''
Checking dictionary created
'''
# dict of (int, str) -> (token id, word)
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit

In [65]:
# dictionary.cfs is a dict of (int, int) -> (token id, # occurrences)

# this code shows (token id, word, # occurrences)

count = 0
for k, v in dictionary.iteritems():
    print(k,v,dictionary.cfs[k])
    count += 1
    if count > 10:
        break

0 broadcast 112
1 communiti 1668
2 decid 485
3 licenc 371
4 awar 162
5 defam 92
6 wit 531
7 call 2769
8 infrastructur 273
9 protect 882
10 summit 413


**Gensim filter_extremes**

[`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes)

Filter out tokens that appear in

* less than no_below documents (absolute number) or
* more than no_above documents (fraction of total corpus size, not absolute number).
* after (1) and (2), keep only the first keep_n most frequent tokens (or keep all if None).
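To make the three rules concrete, here is a minimal pure-Python sketch of the same filtering logic on a toy corpus. It is not gensim's implementation: the function name `filter_extremes_sketch` and the toy documents are assumptions made for illustration, and only the parameter names mirror the gensim signature.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive the three filtering rules."""
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep only the keep_n most frequent survivors (all of them if keep_n is None)
    kept.sort(key=lambda t: (-doc_freq[t], t))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

docs = [
    ["rain", "help", "bushfir"],
    ["rain", "flood", "town"],
    ["rain", "polic", "charg"],
    ["polic", "charg", "court"],
]
# "rain" is in 3 of 4 documents (too common); most other tokens are in only 1 (too rare)
print(sorted(filter_extremes_sketch(docs, no_below=2, no_above=0.5)))
```

Note that both thresholds count documents, not total occurrences: a word repeated ten times in a single headline still has a document frequency of 1.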

In [22]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents
'''
# TODO: apply dictionary.filter_extremes() with the parameters mentioned above
# https://stackoverflow.com/questions/66621708/filter-extreme-in-gensim
# filter_extremes() modifies the dictionary in place and returns None,
# so its result should not be assigned to a variable
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)


**Gensim doc2bow**

[`doc2bow(document)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow)

* Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
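As a rough illustration of what the bag-of-words conversion does, the sketch below builds a token-to-id mapping and counts tokens by hand. The helpers `build_token2id` and `doc2bow_sketch` are hypothetical, and the id assignment (order of first appearance) is a simplification, so the ids will not match those gensim assigns.

```python
from collections import Counter

def build_token2id(docs):
    """Assign consecutive integer ids to tokens in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow_sketch(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from token2id are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["rain", "help", "dampen", "bushfir"], ["rain", "rain", "flood"]]
token2id = build_token2id(docs)
# "swim" never appeared in the training documents, so it is silently dropped
print(doc2bow_sketch(["rain", "rain", "flood", "swim"], token2id))
# → [(0, 2), (4, 1)]
```

The dropped token matters later on: when a new, unseen document is converted with an existing dictionary, any word outside the vocabulary contributes nothing to the result.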

In [27]:
'''
Create the bag-of-words model for each document, i.e. for each document we create a list of
(token_id, token_count) pairs reporting which words appear and how many times. Save this to 'bow_corpus'
'''
# TODO
# https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow
# doc2bow(document, allow_update=False, return_missing=False)
# iterate over processed_docs using a list comprehension
bow_corpus = [dictionary.doc2bow(processed_doc, allow_update=False, return_missing=False)
              for processed_doc in processed_docs]

In [30]:
'''
Checking Bag of Words corpus for our sample document --> (token_id, token_count)
'''
print(document_num)
bow_corpus[document_num]

4310


[(71, 1), (107, 1), (462, 1), (3530, 1)]

In [29]:
'''
Preview BOW for our sample preprocessed document
'''
# Here document_num is document number 4310 which we have checked in Step 2
bow_doc_4310 = bow_corpus[document_num]

for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0], 
                                                     dictionary[bow_doc_4310[i][0]], 
                                                     bow_doc_4310[i][1]))

Word 71 ("bushfir") appears 1 time.
Word 107 ("help") appears 1 time.
Word 462 ("rain") appears 1 time.
Word 3530 ("dampen") appears 1 time.


## Step 3.2: TF-IDF on our document set ##

While performing TF-IDF on the corpus is not necessary for an LDA implementation using the gensim model, it is recommended. TF-IDF expects a bag-of-words (integer values) training corpus during initialization. During transformation, it will take a vector and return another vector of the same dimensionality.

*Please note: The author of Gensim dictates the standard procedure for LDA to be using the Bag of Words model.*

**TF-IDF stands for "Term Frequency, Inverse Document Frequency".**

* It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
* If a word appears frequently in a document, it's important. Give the word a high score. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
* Therefore, common words like "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

In other words:

* TF(w) = `(Number of times term w appears in a document) / (Total number of terms in the document)`.
* IDF(w) = `log(Total number of documents / Number of documents with term w in it)`. The base of the logarithm only rescales the weights; the example below uses base 10.

**For example**

* Consider a document containing `100` words wherein the word 'tiger' appears 3 times. 
* The term frequency (i.e., tf) for 'tiger' is then: 
    - `TF = (3 / 100) = 0.03`. 

* Now, assume we have `10 million` documents and the word 'tiger' appears in `1000` of these. Then, the inverse document frequency (i.e., idf) is calculated as:
    - `IDF = log_10(10,000,000 / 1,000) = 4`. 

* Thus, the TF-IDF weight is the product of these quantities: 
    - `TF-IDF = 0.03 * 4 = 0.12`.
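The arithmetic of the 'tiger' example can be checked with a few lines of Python. The result of 4 for the IDF corresponds to a base-10 logarithm, so the sketch uses `math.log10`. The helper names `tf` and `idf` are made up for this illustration; gensim's `TfidfModel` applies its own weighting and normalization, so its scores will not match these numbers.

```python
import math

def tf(term_count, doc_length):
    """Term frequency: share of the document's words taken up by the term."""
    return term_count / doc_length

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(n_docs / n_docs_with_term)

tf_tiger = tf(3, 100)               # 3 occurrences in a 100-word document
idf_tiger = idf(10_000_000, 1_000)  # 'tiger' appears in 1,000 of 10 million documents
print(tf_tiger, idf_tiger, tf_tiger * idf_tiger)
```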

In [34]:
'''
Create tf-idf model object using models.TfidfModel on 'bow_corpus' and save it to 'tfidf'
'''
from gensim import corpora, models

# TODO
# https://radimrehurek.com/gensim/models/tfidfmodel.html
# model = TfidfModel(corpus)  # fit model
tfidf = models.TfidfModel(bow_corpus) 

In [37]:
'''
Apply transformation to the entire corpus and call it 'corpus_tfidf'
'''
# TODO
# https://radimrehurek.com/gensim/models/tfidfmodel.html
# vector = model[corpus[0]]  # apply model to the first corpus document
# this is weird syntax
# Calling model[corpus] only creates a wrapper around the old corpus document stream – 
#  actual conversions are done on-the-fly, during document iteration. We cannot convert 
#  the entire corpus at the time of calling corpus_transformed = model[corpus], 
#  because that would mean storing the result in main memory, and that contradicts 
#  gensim’s objective of memory-independence.

corpus_tfidf = tfidf[bow_corpus] #apply model to entire corpus
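The lazy behaviour described in the comments above can be mimicked with a small wrapper class. This is a hypothetical sketch of the idea only, not gensim's actual class: `LazyTransformedCorpus` and `double_counts` are names invented for the example.

```python
class LazyTransformedCorpus:
    """Hypothetical stand-in for the wrapper returned by model[corpus]."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        # Each document is transformed only when the caller asks for it
        for doc in self.corpus:
            yield self.transform(doc)

def double_counts(doc):
    """Toy transformation: double every count in a bag-of-words vector."""
    return [(token_id, 2 * count) for token_id, count in doc]

toy_corpus = [[(0, 1), (1, 1)], [(0, 2)]]
lazy = LazyTransformedCorpus(double_counts, toy_corpus)  # nothing computed yet
print(list(lazy))  # iterating triggers the transformation
```

Because the transformation is recomputed on every pass, a corpus that will be iterated many times (as during LDA training with several passes) trades extra computation for a smaller memory footprint.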

In [38]:
'''
Preview TF-IDF scores for our first document --> (token_id, tfidf score)
'''
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5959813347777092),
 (1, 0.39204529549491984),
 (2, 0.48531419274988147),
 (3, 0.5055461098578569)]


In [44]:
# Here we look at document 0, the first headline in the corpus
bow_doc_0 = bow_corpus[0]

for i in range(len(bow_doc_0)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_0[i][0], 
                                                     dictionary[bow_doc_0[i][0]], 
                                                     bow_doc_0[i][1]))

Word 0 ("broadcast") appears 1 time.
Word 1 ("communiti") appears 1 time.
Word 2 ("decid") appears 1 time.
Word 3 ("licenc") appears 1 time.


In [42]:
bow_corpus[0]

[(0, 1), (1, 1), (2, 1), (3, 1)]

## Step 4.1: Running LDA using Bag of Words ##

We are going for 10 topics in the document corpus.

**We will be running LDA using all CPU cores to parallelize and speed up model training.**

Some of the parameters we will be tweaking are:

* **num_topics** is the number of requested latent topics to be extracted from the training corpus.
* **id2word** is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **workers** is the number of worker processes to use for parallelization. By default (`None`), gensim uses all available cores minus one.
* **alpha** and **eta** are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave both at their default values for now (the default is `1/num_topics`).
    - Alpha is the prior on the per-document topic distribution.
        * High alpha: Every document has a mixture of all topics (documents appear similar to each other).
        * Low alpha: Every document has a mixture of very few topics.

    - Eta is the prior on the per-topic word distribution.
        * High eta: Each topic has a mixture of most words (topics appear similar to each other).
        * Low eta: Each topic has a mixture of few words.

* **passes** is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, chunksize is 10,000, and passes is 2, then online training is done in 10 updates:
    * `#1 documents 0-9,999 `
    * `#2 documents 10,000-19,999 `
    * `#3 documents 20,000-29,999 `
    * `#4 documents 30,000-39,999 `
    * `#5 documents 40,000-49,999 `
    * `#6 documents 0-9,999 `
    * `#7 documents 10,000-19,999 `
    * `#8 documents 20,000-29,999 `
    * `#9 documents 30,000-39,999 `
    * `#10 documents 40,000-49,999` 
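The update schedule listed above can be reproduced with a short generator. This sketch only reproduces the counting of chunks and passes; `update_schedule` is a hypothetical helper, and it does not model how gensim distributes chunks across worker processes.

```python
def update_schedule(n_docs, chunksize, passes):
    """Yield (update_number, first_doc, last_doc) for each online update."""
    update = 0
    for _ in range(passes):
        for start in range(0, n_docs, chunksize):
            update += 1
            # The last chunk of a pass may be shorter than chunksize
            yield update, start, min(start + chunksize, n_docs) - 1

for number, first, last in update_schedule(50_000, 10_000, 2):
    print(f"#{number} documents {first:,}-{last:,}")
```

In general the number of updates is `passes * ceil(n_docs / chunksize)`, so raising `passes` or lowering `chunksize` both increase how often the model is updated.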

In [46]:
# LDA mono-core -- fallback code in case LdaMulticore throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# LDA multicore 
'''
Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model'
'''
# TODO
# lda = LdaMulticore(common_corpus, id2word=common_dictionary, num_topics=10)
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics = 10, id2word = dictionary, passes = 15, workers = 3)

In [47]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.026*"open" + 0.022*"test" + 0.019*"world" + 0.016*"take" + 0.015*"lead" + 0.014*"south" + 0.013*"aussi" + 0.013*"strike" + 0.012*"target" + 0.012*"action"


Topic: 1 
Words: 0.037*"report" + 0.020*"resid" + 0.019*"deal" + 0.017*"inquiri" + 0.014*"releas" + 0.013*"trade" + 0.012*"blaze" + 0.012*"firefight" + 0.011*"bushfir" + 0.011*"compani"


Topic: 2 
Words: 0.039*"kill" + 0.033*"crash" + 0.019*"attack" + 0.016*"closer" + 0.016*"die" + 0.015*"coast" + 0.015*"bomb" + 0.015*"continu" + 0.013*"road" + 0.013*"dead"


Topic: 3 
Words: 0.040*"govt" + 0.039*"plan" + 0.032*"council" + 0.023*"urg" + 0.021*"fund" + 0.020*"water" + 0.015*"group" + 0.012*"chang" + 0.012*"servic" + 0.012*"seek"


Topic: 4 
Words: 0.026*"say" + 0.018*"labor" + 0.017*"support" + 0.017*"elect" + 0.017*"defend" + 0.016*"govt" + 0.015*"protest" + 0.014*"minist" + 0.013*"chief" + 0.011*"howard"


Topic: 5 
Words: 0.045*"warn" + 0.016*"nuclear" + 0.015*"safeti" + 0.014*"threat" + 0.013*"health" + 0.013*"win" + 0

### Classification of the topics ###

Using the words in each topic and their corresponding weights, what categories were you able to infer?

* 0: maybe nuclear?
* 1: some kind of report
* 2: war 
* 3: government
* 4: politics / government 
* 5: nuclear stuff
* 6: farm
* 7: Iraq war involving Australia
* 8: Crime
* 9: 

## Step 4.2: Running LDA using TF-IDF ##

In [48]:
'''
Define lda model using corpus_tfidf, again using gensim.models.LdaMulticore()
'''
# TODO
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics = 10, id2word = dictionary, passes = 15, workers = 3)

In [49]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.009*"teacher" + 0.009*"strike" + 0.009*"control" + 0.007*"mine" + 0.007*"right" + 0.007*"extend" + 0.007*"action" + 0.007*"legal" + 0.006*"govt" + 0.006*"worker"


Topic: 1 Word: 0.014*"price" + 0.013*"market" + 0.012*"rise" + 0.008*"tiger" + 0.008*"toll" + 0.007*"profit" + 0.007*"retir" + 0.007*"blue" + 0.007*"share" + 0.007*"record"


Topic: 2 Word: 0.007*"export" + 0.007*"govt" + 0.007*"respons" + 0.006*"wait" + 0.005*"growth" + 0.005*"live" + 0.005*"recycl" + 0.005*"list" + 0.005*"safe" + 0.005*"wheat"


Topic: 3 Word: 0.008*"murray" + 0.007*"climat" + 0.007*"lebanon" + 0.007*"escap" + 0.007*"open" + 0.006*"alic" + 0.006*"emerg" + 0.006*"kangaroo" + 0.006*"festiv" + 0.005*"crop"


Topic: 4 Word: 0.013*"council" + 0.013*"plan" + 0.013*"govt" + 0.012*"fund" + 0.010*"health" + 0.009*"urg" + 0.008*"water" + 0.007*"group" + 0.007*"boost" + 0.007*"concern"


Topic: 5 Word: 0.025*"closer" + 0.019*"crash" + 0.011*"search" + 0.010*"miss" + 0.009*"die" + 0.009*"polic" + 0.00

### Classification of the topics ###

As we can see, TF-IDF gives heavier weights to less frequent words, so more nouns appear among the top words of each topic. That makes the categories harder to infer, because such nouns can be hard to group under a single label. This goes to show that the right model depends on the type of corpus we are dealing with.

Using the words in each topic and their corresponding weights, what categories could you find?

* 0: teacher strike
* 1: economics 
* 2: exports
* 3: climate
* 4: government council meeting regarding health funding 
* 5: maybe a car crash
* 6: some kind of crime and journey through the justice system
* 7: war or terrorism involving iran and maybe iraq
* 8: some kind of crime, perhaps involving water
* 9: government; maybe labor party election

## Step 5.1: Performance evaluation by classifying sample document using LDA Bag of Words model ##

We will check to see where our test document would be classified. 

In [50]:
'''
Text of sample document 4310
'''
processed_docs[4310]

['rain', 'help', 'dampen', 'bushfir']

In [51]:
'''
Check which topic our test document belongs to using the LDA Bag of Words model.
'''
document_num = 4310
# Our test document is document number 4310

for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.6043445467948914	 
Topic: 0.018*"rise" + 0.017*"farmer" + 0.017*"price" + 0.015*"drought" + 0.014*"high" + 0.013*"market" + 0.013*"look" + 0.012*"rain" + 0.011*"feder" + 0.011*"studi"

Score: 0.2356332540512085	 
Topic: 0.037*"report" + 0.020*"resid" + 0.019*"deal" + 0.017*"inquiri" + 0.014*"releas" + 0.013*"trade" + 0.012*"blaze" + 0.012*"firefight" + 0.011*"bushfir" + 0.011*"compani"

Score: 0.020003611221909523	 
Topic: 0.040*"govt" + 0.039*"plan" + 0.032*"council" + 0.023*"urg" + 0.021*"fund" + 0.020*"water" + 0.015*"group" + 0.012*"chang" + 0.012*"servic" + 0.012*"seek"

Score: 0.020003236830234528	 
Topic: 0.073*"polic" + 0.031*"charg" + 0.027*"court" + 0.027*"face" + 0.019*"miss" + 0.019*"investig" + 0.019*"death" + 0.017*"jail" + 0.016*"drug" + 0.016*"murder"

Score: 0.02000255510210991	 
Topic: 0.026*"open" + 0.022*"test" + 0.019*"world" + 0.016*"take" + 0.015*"lead" + 0.014*"south" + 0.013*"aussi" + 0.013*"strike" + 0.012*"target" + 0.012*"action"

Score: 0.02000255

### It has the highest probability (`0.6`) of belonging to the topic that we labeled 6 (issues with climate and farms), which is an accurate classification. ###

## Step 5.2: Performance evaluation by classifying sample document using LDA TF-IDF model ##

In [52]:
'''
Check which topic our test document belongs to using the LDA TF-IDF model.
'''
# Our test document is document number 4310
for index, score in sorted(lda_model_tfidf[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.8199763298034668	 
Topic: 0.025*"closer" + 0.019*"crash" + 0.011*"search" + 0.010*"miss" + 0.009*"die" + 0.009*"polic" + 0.009*"blaze" + 0.009*"coast" + 0.009*"rain" + 0.008*"fatal"

Score: 0.02000398375093937	 
Topic: 0.013*"council" + 0.013*"plan" + 0.013*"govt" + 0.012*"fund" + 0.010*"health" + 0.009*"urg" + 0.008*"water" + 0.007*"group" + 0.007*"boost" + 0.007*"concern"

Score: 0.020002974197268486	 
Topic: 0.011*"hick" + 0.009*"plead" + 0.008*"illeg" + 0.007*"rescu" + 0.007*"news" + 0.007*"fish" + 0.006*"guilti" + 0.006*"coal" + 0.006*"fiji" + 0.006*"rate"

Score: 0.020002782344818115	 
Topic: 0.009*"teacher" + 0.009*"strike" + 0.009*"control" + 0.007*"mine" + 0.007*"right" + 0.007*"extend" + 0.007*"action" + 0.007*"legal" + 0.006*"govt" + 0.006*"worker"

Score: 0.020002612844109535	 
Topic: 0.008*"murray" + 0.007*"climat" + 0.007*"lebanon" + 0.007*"escap" + 0.007*"open" + 0.006*"alic" + 0.006*"emerg" + 0.006*"kangaroo" + 0.006*"festiv" + 0.005*"crop"

Score: 0.020002488

### It has the highest probability (`82%`) of belonging to the topic that we labeled 5, which seems to involve a car crash. ###

## Step 6: Testing model on unseen document ##

In [53]:
unseen_document = "My favorite sports activities are running and swimming."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.4199610948562622	 Topic: 0.026*"open" + 0.022*"test" + 0.019*"world" + 0.016*"take" + 0.015*"lead"
Score: 0.22001239657402039	 Topic: 0.045*"warn" + 0.016*"nuclear" + 0.015*"safeti" + 0.014*"threat" + 0.013*"health"
Score: 0.21997405588626862	 Topic: 0.019*"iraq" + 0.018*"talk" + 0.015*"australia" + 0.014*"play" + 0.013*"hold"
Score: 0.020011387765407562	 Topic: 0.039*"kill" + 0.033*"crash" + 0.019*"attack" + 0.016*"closer" + 0.016*"die"
Score: 0.02001088298857212	 Topic: 0.026*"say" + 0.018*"labor" + 0.017*"support" + 0.017*"elect" + 0.017*"defend"
Score: 0.02000684104859829	 Topic: 0.040*"govt" + 0.039*"plan" + 0.032*"council" + 0.023*"urg" + 0.021*"fund"
Score: 0.020006731152534485	 Topic: 0.037*"report" + 0.020*"resid" + 0.019*"deal" + 0.017*"inquiri" + 0.014*"releas"
Score: 0.02000553533434868	 Topic: 0.018*"rise" + 0.017*"farmer" + 0.017*"price" + 0.015*"drought" + 0.014*"high"
Score: 0.02000553533434868	 Topic: 0.073*"polic" + 0.031*"charg" + 0.027*"court" + 0.027*"face

The model assigns the unseen document to topic 0 (top words "open", "test", "world", "take" and "lead") with a probability of about 42%.