## Step 0: Latent Dirichlet Allocation ##

LDA is used to classify the text in a document to a particular topic. It builds a topic-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

* Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
* LDA assumes that every chunk of text we feed into it contains words that are somehow related. Choosing the right corpus of data is therefore crucial.
* It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. 

* LDA tackles ambiguity by comparing a document to two topics and determining which topic is closer to the document, across all combinations of topics which seem broadly relevant.  In doing so, LDA helps an information retrieval system (such as a search engine) to determine which documents are most relevant to which topics.
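The generative assumption in the bullets above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the two toy topics, their word probabilities, and the helper names `sample_dirichlet` and `generate_document` are hypothetical and do not come from the dataset or from gensim.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Pick a topic mixture for the document, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    topics = list(topic_word)
    words = []
    for _ in range(length):
        # Choose a topic according to the document's mixture.
        topic = rng.choices(topics, weights=theta)[0]
        # Choose a word according to that topic's word distribution.
        vocab = list(topic_word[topic])
        weights = list(topic_word[topic].values())
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

# Hypothetical toy topics (word probabilities per topic), for illustration only.
toy_topics = {
    "sports": {"game": 0.5, "team": 0.3, "season": 0.2},
    "space": {"orbit": 0.4, "launch": 0.4, "moon": 0.2},
}
rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Training LDA runs this story in reverse: given only the words, it estimates topic mixtures and topic-word distributions that could plausibly have produced them.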

### LDA on a real world dataset ###

We will be using the 20 Newsgroups dataset, which is a collection of emails exchanged in 20 newsgroups. You can read more about the dataset [here](http://qwone.com/~jason/20Newsgroups/).

## Step 1: Load the dataset ##

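The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics. Some of these topics are similar, so we will try to classify them into 10 topics. By default, `fetch_20newsgroups` loads only the training subset, which is why the document count printed below is 11,314.

We remove all email headers, footers and quoted replies from the posts.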

In [1]:
'''
Dataset from http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
'''
from sklearn.datasets import fetch_20newsgroups
documents = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), random_state=101).data

In [2]:
'''
Get the total number of documents
'''
print(len(documents))

11314


In [5]:
# Previewing a document
document_num = 4310
print("\n**Printing out a sample document:**")
print(documents[document_num])


**Printing out a sample document:**


Fret not, you made it.


Not while we still have our guns.  <evil grin>  

Hey, gang, it's not about duck hunting, or about dark alleys,
it's about black-clad, helmeted and booted troops storming
houses and violating civil rights under color of law. 

Are YOU ready to defend YOUR Constitution?


**Looks like this document will come under `Politics`.**

## Step 2: Data Preprocessing ##

We will perform the following steps:

* **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
* Words that have 3 or fewer characters are removed.
* All **stopwords** are removed.
* Words are **lemmatized** - words in third person are changed to first person and verbs in past and future tenses are changed into present.
* Words are **stemmed** - words are reduced to their root form.

In [6]:
'''
Loading Gensim library
'''
# pip install gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)

In [None]:
# import nltk
# nltk.download('wordnet')  # WordNet data is required by WordNetLemmatizer

In [7]:
'''
Lemmatizing example for a verb, noun.
'''
print(WordNetLemmatizer().lemmatize('went', pos = 'v')) # past tense to present tense
print(WordNetLemmatizer().lemmatize('Churches', pos = 'n')) # noun

go
Churches


In [10]:
'''
Stemming example
'''
stemmer = SnowballStemmer("english")
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer colon plot


In [8]:
'''
Data preprocessing
'''
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize, drop stop words and short tokens, then lemmatize and stem
def preprocess(text):
    result = [lemmatize_stemming(token) for token in gensim.utils.simple_preprocess(text)
              # Remove stop words and words with 3 or fewer characters
              if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3]
    return result

In [11]:
'''
Preview a document after preprocessing
'''
print("Original document: ")
words = documents[document_num].split(' ')
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(documents[document_num]))

Original document: 
[u'\n\nFret', u'not,', u'you', u'made', u'it.\n\n\nNot', u'while', u'we', u'still', u'have', u'our', u'guns.', u'', u'<evil', u'grin>', u'', u'\n\nHey,', u'gang,', u"it's", u'not', u'about', u'duck', u'hunting,', u'or', u'about', u'dark', u"alleys,\nit's", u'about', u'black-clad,', u'helmeted', u'and', u'booted', u'troops', u'storming\nhouses', u'and', u'violating', u'civil', u'rights', u'under', u'color', u'of', u'law.', u'\n\nAre', u'YOU', u'ready', u'to', u'defend', u'YOUR', u'Constitution?']


Tokenized and lemmatized document: 
[u'fret', u'gun', u'evil', u'grin', u'gang', u'duck', u'hunt', u'dark', u'alley', u'black', u'cloth', u'helmet', u'boot', u'troop', u'storm', u'hous', u'violat', u'civil', u'right', u'color', u'readi', u'defend', u'constitut']


In [12]:
'''
Perform preprocessing on entire dataset
'''
processed_docs = [preprocess(doc) for doc in documents]

## Step 3.1: Bag of words on the dataset ##

In [13]:
'''
Create a dictionary that maps each unique word in the training set to an integer id
(gensim also records how many documents each word appears in)
'''
dictionary = gensim.corpora.Dictionary(processed_docs)

In [14]:
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

(33015, u'fawn')
(12744, u'circuitri')
(2123, u'woodi')
(41769, u'darrylo')
(41054, u'poplar')
(29617, u'spideri')
(17723, u'polytechniqu')
(13930, u'suzann')
(6954, u'phenomenologist')
(33722, u'francesca')
(22109, u'honorari')


**Gensim filter_extremes**

`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`

Filter out tokens that appear in

* fewer than `no_below` documents (absolute number) or
* more than `no_above` documents (fraction of total corpus size, not absolute number).

After these two filters, keep only the `keep_n` most frequent tokens (or keep all if `keep_n` is `None`).
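The three rules above can be sketched in plain Python. This is a simplified illustration of the filtering logic, not gensim's actual implementation: the function name `filter_extremes_sketch` and the toy corpus are hypothetical, "most frequent" is assumed to mean highest document frequency, and gensim's rounding at the `no_above` boundary may differ.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive the three filtering rules."""
    # Document frequency: number of documents containing each token at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    # Rules 1 and 2: drop tokens that are too rare or too common.
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Rule 3: keep only the keep_n tokens with the highest document frequency.
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

toy_docs = [
    ["space", "launch", "orbit"],
    ["space", "moon", "orbit"],
    ["space", "game", "team"],
    ["space", "game", "season"],
]
# "space" is in every document (too common); "launch", "moon", "team", "season" are too rare.
print(sorted(filter_extremes_sketch(toy_docs, no_below=2, no_above=0.5)))  # ['game', 'orbit']
```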

In [15]:
'''
Remove very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents
'''
dictionary.filter_extremes(no_below=15, no_above=0.10)

**Gensim doc2bow**

`doc2bow(document)`

* Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
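As a rough illustration of the bag-of-words format described above, the conversion can be written in a few lines of plain Python. The function `doc2bow_sketch` and the tiny `toy_token2id` vocabulary are hypothetical stand-ins for the fitted gensim dictionary. Tokens missing from the vocabulary are simply skipped here.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Convert a token list into sorted (token_id, token_count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary mapping; a real one would come from the fitted dictionary.
toy_token2id = {"civil": 0, "gun": 1, "right": 2}
toy_doc = ["gun", "civil", "gun", "troop", "right"]
print(doc2bow_sketch(toy_doc, toy_token2id))  # [(0, 1), (1, 2), (2, 1)]
```

Word order is discarded: only which vocabulary words occur, and how often, is kept.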

In [16]:
'''
Create the bag-of-words model for each document, i.e. for each document we create a list of
(token_id, token_count) pairs reporting which words appear and how many times.
'''
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [17]:
'''
Checking Bag of Words corpus for our sample document --> (token_id, token_count)
'''
# document_num = 4310
bow_corpus[document_num]

[(630, 1),
 (881, 1),
 (918, 1),
 (999, 1),
 (1206, 1),
 (1770, 1),
 (1874, 1),
 (1962, 1),
 (2100, 1),
 (2501, 1),
 (2503, 1),
 (2765, 1),
 (2829, 1),
 (3084, 1),
 (3131, 1),
 (3335, 1),
 (3442, 1),
 (3591, 1),
 (3642, 1),
 (3790, 1)]

In [18]:
'''
Preview BOW for our sample preprocessed document
'''
# Here document_num is document number 4310 which we have checked in Step 2
bow_doc_4310 = bow_corpus[document_num]

for token_id, token_count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                     dictionary[token_id],
                                                     token_count))

Word 630 ("troop") appears 1 time.
Word 881 ("boot") appears 1 time.
Word 918 ("hunt") appears 1 time.
Word 999 ("civil") appears 1 time.
Word 1206 ("gun") appears 1 time.
Word 1770 ("black") appears 1 time.
Word 1874 ("color") appears 1 time.
Word 1962 ("helmet") appears 1 time.
Word 2100 ("constitut") appears 1 time.
Word 2501 ("evil") appears 1 time.
Word 2503 ("hous") appears 1 time.
Word 2765 ("storm") appears 1 time.
Word 2829 ("violat") appears 1 time.
Word 3084 ("readi") appears 1 time.
Word 3131 ("dark") appears 1 time.
Word 3335 ("cloth") appears 1 time.
Word 3442 ("gang") appears 1 time.
Word 3591 ("duck") appears 1 time.
Word 3642 ("grin") appears 1 time.
Word 3790 ("defend") appears 1 time.


## Step 3.2: TF-IDF on our document set ##

While performing TF-IDF on the corpus is not necessary for an LDA implementation using the gensim model, it is recommended. TF-IDF expects a bag-of-words (integer values) training corpus during initialization. During transformation, it will take a vector and return another vector of the same dimensionality.

*Please note: the author of Gensim states that the standard procedure for LDA is to use the bag-of-words model.*

**TF-IDF stands for "Term Frequency, Inverse Document Frequency".**

* It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
* If a word appears frequently in a document, it's important. Give the word a high score. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
* Therefore, common words like "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

In other words:

* TF(w) = `(Number of times term w appears in a document) / (Total number of terms in the document)`.
* IDF(w) = `log_e(Total number of documents / Number of documents with term w in it)`.

**For example**

* Consider a document containing `100` words wherein the word 'tiger' appears 3 times.
* The term frequency (i.e., tf) for 'tiger' is then:
    - `TF = (3 / 100) = 0.03`.

* Now, assume we have `10 million` documents and the word 'tiger' appears in `1000` of these. Then, the inverse document frequency (i.e., idf) is calculated as:
    - `IDF = log_e(10,000,000 / 1,000) ≈ 9.21`.

* Thus, the tf-idf weight is the product of these quantities:
    - `TF-IDF = 0.03 * 9.21 ≈ 0.28`.

(With a base-10 logarithm the IDF would be exactly `4` and the weight `0.12`; the choice of base only rescales the scores.)
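The 'tiger' arithmetic above can be checked with a short script. The helper names below are hypothetical, and the script computes the IDF with both the natural logarithm (as in the `log_e` definition) and a base-10 logarithm, since the base changes the number. Gensim's `TfidfModel` applies its own weighting and normalization, so its scores need not match this hand computation.

```python
import math

def term_frequency(term_count, total_terms):
    """TF: share of the document's terms that are this term."""
    return term_count / total_terms

def inverse_document_frequency(total_docs, docs_with_term, base=math.e):
    """IDF: log of (total documents / documents containing the term)."""
    return math.log(total_docs / docs_with_term, base)

tf = term_frequency(3, 100)
idf_natural = inverse_document_frequency(10_000_000, 1_000)          # natural log
idf_base10 = inverse_document_frequency(10_000_000, 1_000, base=10)  # base-10 log

print(round(tf, 2))                # 0.03
print(round(idf_natural, 2))       # 9.21
print(round(idf_base10, 2))        # 4.0
print(round(tf * idf_natural, 3))  # 0.276
print(round(tf * idf_base10, 2))   # 0.12
```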

In [17]:
'''
Create tf-idf model object on the bag-of-words corpus
'''
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)

In [18]:
'''
Apply transformation to the entire corpus
'''
corpus_tfidf = tfidf[bow_corpus]

In [19]:
'''
Preview TF-IDF scores for our first document --> (token_id, tfidf score)
'''
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(37, 0.02924493305598579),
 (70, 0.041178768527452336),
 (113, 0.04008568657768459),
 (114, 0.04043688209578033),
 (221, 0.04080091785715794),
 (224, 0.04543886857260671),
 (385, 0.036416350095079326),
 (408, 0.04535483542132858),
 (416, 0.040907420148641224),
 (455, 0.08325783671969088),
 (485, 0.032378781530203696),
 (570, 0.026838732092688673),
 (576, 0.19081274763367403),
 (610, 0.03318341752642401),
 (645, 0.08957131881027089),
 (650, 0.03796273634604967),
 (692, 0.03522833696525684),
 (808, 0.23695174265722618),
 (832, 0.02982583110923724),
 (866, 0.15107472089110963),
 (909, 0.06533564647804836),
 (930, 0.032772974509255474),
 (994, 0.03730763062218163),
 (1018, 0.027738412506140866),
 (1021, 0.04043688209578033),
 (1039, 0.06489444152815979),
 (1060, 0.21558039614235047),
 (1082, 0.32768881211945),
 (1115, 0.04432241916816081),
 (1154, 0.04259528764733274),
 (1172, 0.10017080978452829),
 (1182, 0.15413015913574957),
 (1214, 0.03832783151158076),
 (1217, 0.06473195441575083),
 

## Step 4.1: Running LDA using Bag of Words ##

We are going for 10 topics in the document corpus.

**We will be running LDA on multiple CPU cores to parallelize and speed up model training.**

Some of the parameters we will be tweaking are:

* **num_topics** is the number of requested latent topics to be extracted from the training corpus.
* **id2word** is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **workers** is the number of worker processes to use for parallelization. By default, all available cores minus one are used.
* **alpha** and **eta** are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave these at their default values for now (the default is `1/num_topics`).
    - Alpha is the prior on the per-document topic distribution.
        * High alpha: every document has a mixture of all topics (documents appear similar to each other).
        * Low alpha: every document has a mixture of very few topics.

    - Eta is the prior on the per-topic word distribution.
        * High eta: each topic has a mixture of most words (topics appear similar to each other).
        * Low eta: each topic has a mixture of few words.

* **passes** is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, chunksize is 10,000, and passes is 2, then online training is done in 10 updates:
    * `#1 documents 0-9,999 `
    * `#2 documents 10,000-19,999 `
    * `#3 documents 20,000-29,999 `
    * `#4 documents 30,000-39,999 `
    * `#5 documents 40,000-49,999 `
    * `#6 documents 0-9,999 `
    * `#7 documents 10,000-19,999 `
    * `#8 documents 20,000-29,999 `
    * `#9 documents 30,000-39,999 `
    * `#10 documents 40,000-49,999`
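The high-versus-low behaviour described for **alpha** (and, by analogy, **eta**) in the list above can be seen by sampling topic mixtures from a symmetric Dirichlet. This is a standard-library sketch. The helper names and the values `0.1` and `10.0` are arbitrary choices for illustration, not gensim defaults.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw a topic mixture from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def average_largest_share(alpha, num_topics=10, trials=2000, seed=0):
    """Average weight of the dominant topic across many sampled mixtures."""
    rng = random.Random(seed)
    total = sum(max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(trials))
    return total / trials

low = average_largest_share(alpha=0.1)    # sparse mixtures: one topic tends to dominate
high = average_largest_share(alpha=10.0)  # dense mixtures: weight spread over all topics
print("low alpha :", round(low, 2))
print("high alpha:", round(high, 2))
```

With a low alpha the dominant topic takes most of each document's weight. With a high alpha every topic gets a share close to `1/num_topics`.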
    
The default values are:

`def __init__(self, corpus=None, 
                    num_topics=100, 
                    id2word=None,
                    distributed=False, 
                    chunksize=2000, 
                    passes=1, 
                    update_every=1,
                    alpha='symmetric', 
                    eta=None, 
                    decay=0.5, 
                    offset=1.0,
                    eval_every=10, 
                    iterations=50, 
                    gamma_threshold=0.001)`
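To make the interaction of `chunksize` and `passes` concrete, the update schedule from the 50,000-document example can be reproduced with a small helper. This is a sketch that assumes one model update per chunk (the `update_every=1` setting shown above) in a single process. The helper `update_schedule` is hypothetical, and a multicore run may group chunks differently.

```python
def update_schedule(num_docs, chunksize, passes):
    """List (update number, first doc index, last doc index) for each chunk in each pass."""
    schedule = []
    update = 0
    for _ in range(passes):
        for start in range(0, num_docs, chunksize):
            update += 1
            end = min(start + chunksize, num_docs) - 1
            schedule.append((update, start, end))
    return schedule

# The example from the text: 50,000 documents, chunksize 10,000, 2 passes -> 10 updates.
for update, start, end in update_schedule(50_000, 10_000, 2):
    print("#{} documents {:,}-{:,}".format(update, start, end))

# With this notebook's 11,314 documents and the default chunksize=2000, passes=1:
print(len(update_schedule(11_314, 2_000, 1)))  # 6 chunks, the last one smaller
```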

In [19]:
# LDA mono-core
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 1)

# LDA multicore 
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics=10, 
                                       id2word = dictionary, 
                                       passes = 1, 
                                       workers=2)

In [26]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
# Set num_topics=-1 to print all topics.
for idx, topic in lda_model.print_topics(num_topics=10, num_words=5):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.007*drive + 0.006*game + 0.003*includ + 0.003*inform + 0.003*control


Topic: 1 Word: 0.005*read + 0.004*line + 0.004*game + 0.004*team + 0.004*believ


Topic: 2 Word: 0.004*number + 0.004*includ + 0.003*power + 0.003*file + 0.003*program


Topic: 3 Word: 0.005*christian + 0.004*believ + 0.004*question + 0.004*reason + 0.003*differ


Topic: 4 Word: 0.004*file + 0.003*govern + 0.003*hear + 0.003*state + 0.003*support


Topic: 5 Word: 0.006*drive + 0.004*question + 0.004*case + 0.004*believ + 0.003*mean


Topic: 6 Word: 0.006*mail + 0.005*space + 0.004*includ + 0.004*list + 0.004*state


Topic: 7 Word: 0.006*file + 0.004*state + 0.004*program + 0.004*group + 0.004*window


Topic: 8 Word: 0.008*window + 0.005*program + 0.004*mean + 0.004*file + 0.004*version


Topic: 9 Word: 0.007*file + 0.007*program + 0.006*drive + 0.004*data + 0.004*window




In [28]:
lda_model.get_topic_terms(1)

[(2055, 0.0047381153101580574),
 (3342, 0.0043584901210308872),
 (3204, 0.0041168804188232381),
 (2618, 0.0036317371681088414),
 (2710, 0.0036186413104838198),
 (1470, 0.003436890433743134),
 (2688, 0.0033314202137994134),
 (165, 0.0032676598250588815),
 (2839, 0.0029013560053822887),
 (2743, 0.0028312477739049005)]

### Classification of the topics ###

Using the words in each topic and their corresponding weights, we have the following categories:

* 0: Computer Software
* 1: Space
* 2: Government
* 3: Energy
* 4: Sports
* 5: Applications
* 6: Religion
* 7: Computer Hardware
* 8: Automotives
* 9: Politics

## Step 4.2: Running LDA using TF-IDF ##

In [39]:
'''
Define lda model using tfidf corpus
'''
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, 
                                             num_topics=10, 
                                             id2word = dictionary, 
                                             passes = 50, 
                                             workers=4)

In [40]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.009*"detector" + 0.008*"adam" + 0.008*"captain" + 0.008*"radar" + 0.007*"pen" + 0.007*"delet" + 0.006*"devil" + 0.006*"espn" + 0.005*"ranger" + 0.005*"lamp"


Topic: 1 Word: 0.009*"window" + 0.007*"file" + 0.006*"drive" + 0.006*"card" + 0.005*"program" + 0.004*"mail" + 0.004*"softwar" + 0.004*"disk" + 0.004*"monitor" + 0.004*"help"


Topic: 2 Word: 0.007*"christian" + 0.005*"jesus" + 0.005*"believ" + 0.004*"bibl" + 0.004*"mean" + 0.004*"church" + 0.004*"religion" + 0.003*"moral" + 0.003*"life" + 0.003*"exist"


Topic: 3 Word: 0.017*"space" + 0.012*"nasa" + 0.012*"orbit" + 0.010*"launch" + 0.009*"moon" + 0.007*"satellit" + 0.007*"address" + 0.006*"earth" + 0.006*"shuttl" + 0.005*"test"


Topic: 4 Word: 0.012*"armenian" + 0.011*"israel" + 0.010*"isra" + 0.009*"arab" + 0.007*"jew" + 0.007*"kill" + 0.005*"turkish" + 0.005*"turkey" + 0.004*"greek" + 0.004*"turk"


Topic: 5 Word: 0.013*"gordon" + 0.013*"diseas" + 0.012*"surrend" + 0.012*"pitt" + 0.012*"skeptic" + 0.011*"sham

### Classification of the topics ###

As we can see, when using tf-idf, heavier weights are given to less frequent words, which results in rarer nouns being factored in. That makes it harder to figure out the categories, as such nouns can be hard to categorize. This goes to show that the models we apply depend on the type of text corpus we are dealing with.

Using the words in each topic and their corresponding weights, we have the following categories:

* 0: Unclear
* 1: Computer Software
* 2: Space
* 3: Religion
* 4: Middle East
* 5: Unclear
* 6: Unclear
* 7: Politics
* 8: Automotives
* 9: Sports

## Step 5.1: Performance evaluation by classifying sample document using LDA Bag of Words model ##

We will check to see where our test document would be classified. 

In [32]:
'''
Text of sample document 4310
'''
print(documents[document_num])



Fret not, you made it.


Not while we still have our guns.  <evil grin>  

Hey, gang, it's not about duck hunting, or about dark alleys,
it's about black-clad, helmeted and booted troops storming
houses and violating civil rights under color of law. 

Are YOU ready to defend YOUR Constitution?


In [33]:
'''
Check which topic our test document belongs to using the LDA Bag of Words model.
'''

# Our test document is document number 4310
for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.6458886951521011	 
Topic: 0.016*"armenian" + 0.009*"state" + 0.007*"kill" + 0.007*"turkish" + 0.007*"weapon" + 0.005*"govern" + 0.005*"firearm" + 0.005*"report" + 0.005*"attack" + 0.004*"children"

Score: 0.1464546793671742	 
Topic: 0.011*"power" + 0.010*"wire" + 0.007*"light" + 0.007*"grind" + 0.006*"current" + 0.005*"appear" + 0.005*"circuit" + 0.005*"cover" + 0.005*"water" + 0.004*"pictur"

Score: 0.09064204432828646	 
Topic: 0.006*"state" + 0.006*"believ" + 0.006*"mean" + 0.006*"tell" + 0.005*"israel" + 0.005*"question" + 0.005*"live" + 0.005*"govern" + 0.005*"case" + 0.004*"fact"

Score: 0.08843793487079113	 
Topic: 0.023*"drive" + 0.017*"card" + 0.012*"disk" + 0.010*"scsi" + 0.010*"control" + 0.010*"driver" + 0.008*"price" + 0.008*"hard" + 0.007*"monitor" + 0.007*"chip"


### It has the highest probability (`64.6%`) of belonging to the topic that we labeled Politics, which is the accurate classification. ###

## Step 5.2: Performance evaluation by classifying sample document using LDA TF-IDF model ##

In [41]:
'''
Check which topic our test document belongs to using the LDA TF-IDF model.
'''

# Our test document is document number 4310
for index, score in sorted(lda_model_tfidf[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.46183573112950044	 
Topic: 0.012*"armenian" + 0.011*"israel" + 0.010*"isra" + 0.009*"arab" + 0.007*"jew" + 0.007*"kill" + 0.005*"turkish" + 0.005*"turkey" + 0.004*"greek" + 0.004*"turk"

Score: 0.3419984058519753	 
Topic: 0.003*"govern" + 0.003*"encrypt" + 0.002*"state" + 0.002*"player" + 0.002*"better" + 0.002*"team" + 0.002*"number" + 0.002*"play" + 0.002*"game" + 0.002*"chip"

Score: 0.08992600402663535	 
Topic: 0.009*"detector" + 0.008*"adam" + 0.008*"captain" + 0.008*"radar" + 0.007*"pen" + 0.007*"delet" + 0.006*"devil" + 0.006*"espn" + 0.005*"ranger" + 0.005*"lamp"

Score: 0.07766221966445937	 
Topic: 0.023*"bike" + 0.010*"motorcycl" + 0.009*"ride" + 0.007*"car" + 0.007*"rid" + 0.007*"rear" + 0.007*"mile" + 0.007*"honda" + 0.006*"tire" + 0.006*"wheel"


### The topic that we labeled Politics receives a probability of `34.2%`, second to the topic we labeled Middle East (`46.2%`). ###

## Step 5.3: Performance evaluation of model as a whole ##

We use the log perplexity for our evaluation. Perplexity is connected to the log likelihood that the model is able to **generate the documents, given the distribution of topics for those documents**. The lower the perplexity, the better the model, as it signifies that the model can regenerate the text quite well. Note that gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself; the perplexity is `2^(-bound)`, so a bound closer to zero corresponds to a lower perplexity.

The perplexity of a discrete probability distribution $p$ is defined as

$$2^{H(p)} = 2^{-\sum_{x} p(x) \log_{2} p(x)}$$

where $H(p)$ is the entropy of the distribution and $x$ ranges over events.

Perplexity of a random variable $X$ may be defined as the perplexity of the distribution over its possible values $x$.

In [35]:
'''
Calculate log perplexity of the model to check how well the model can generate the text.
'''
print("Log perplexity of the model is", lda_model.log_perplexity(bow_corpus))

Log perplexity of the model is -7.37815975223
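To connect the definition above with the printed number, here is a small sketch that computes the perplexity of a discrete distribution and converts the reported value into a perplexity. It assumes the value returned by `log_perplexity` is a per-word likelihood bound in base 2, so that the perplexity is `2 ** -bound`. The helper names are hypothetical.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(p) in bits; zero-probability events contribute nothing."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def perplexity(p):
    """Perplexity 2^{H(p)} of a discrete distribution."""
    return 2 ** entropy_bits(p)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 events -> 4.0
print(perplexity([1.0, 0.0, 0.0, 0.0]))      # no uncertainty -> 1.0

# Converting a per-word log2 likelihood bound into a perplexity (2 ** -bound).
bound = -7.37815975223  # value printed by this notebook's run
print(round(2 ** -bound, 1))
```

Intuitively, a perplexity of N means the model is about as uncertain about each word as if it were choosing uniformly among N options.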


## Step 6: Testing model on unseen document ##

In [38]:
unseen_document = "My favorite sports activities are running and swimming."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.6169210224607728	 Topic: 0.026*"game" + 0.019*"team" + 0.017*"play" + 0.012*"player" + 0.010*"season"
Score: 0.22304756274928622	 Topic: 0.012*"space" + 0.009*"encrypt" + 0.007*"secur" + 0.007*"program" + 0.006*"inform"
Score: 0.020006336204154672	 Topic: 0.022*"file" + 0.020*"window" + 0.013*"imag" + 0.012*"program" + 0.011*"mail"
Score: 0.02000512555690829	 Topic: 0.023*"drive" + 0.017*"card" + 0.012*"disk" + 0.010*"scsi" + 0.010*"control"
Score: 0.020004612382756047	 Topic: 0.006*"bike" + 0.005*"tell" + 0.005*"get" + 0.005*"take" + 0.004*"differ"
Score: 0.020004361925737575	 Topic: 0.016*"armenian" + 0.009*"state" + 0.007*"kill" + 0.007*"turkish" + 0.007*"weapon"
Score: 0.020003530223374595	 Topic: 0.016*"christian" + 0.011*"believ" + 0.010*"exist" + 0.009*"book" + 0.008*"bibl"
Score: 0.020003387682446236	 Topic: 0.011*"power" + 0.010*"wire" + 0.007*"light" + 0.007*"grind" + 0.006*"current"
Score: 0.020003036486462827	 Topic: 0.006*"state" + 0.006*"believ" + 0.006*"mean" + 

The model assigns the unseen document to the Sports topic with 61.7% probability, which is the correct classification.

The following explanation of how LDA learns the topics is by Edwin Chen:

* Go through each document, and randomly assign each word in the document to one of the K topics. Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
* So to improve on them, for each document d:
    - Go through each word w in d:
        - And for each topic t, compute two things: 
            - 1) A - the proportion of words in document `d` that are currently assigned to topic `t`, and 
            - 2) B - the proportion of assignments to topic `t` over all documents that come from this word `w`. 
        - Reassign `w` a new topic, where we choose topic `t` with probability `A * B` (according to our generative model, this is essentially the probability that topic `t` generated word `w`, so it makes sense that we resample the current word’s topic with this probability). 
        
        
* After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document:
    - by counting the proportion of words assigned to each topic within that document and 
    - the words associated to each topic (by counting the proportion of words assigned to each topic overall).
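The procedure above is a form of collapsed Gibbs sampling, and a toy version fits in plain Python. This is a minimal sketch under stated assumptions: the function `gibbs_lda`, the smoothing constants `alpha` and `beta`, and the four toy documents are hypothetical choices for illustration. Gensim's `LdaModel` is trained with a different inference procedure (online variational Bayes), so this is not what the notebook's models run internally.

```python
import random

def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns per-document topic mixtures."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    # Count tables: topic counts per document, word counts per topic, totals per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Step 1: randomly assign every word occurrence to a topic.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(num_topics)
            zs.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(zs)

    # Step 2: repeatedly resample each word's topic with probability proportional to A * B.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                # Remove the current assignment before computing the proportions.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                weights = []
                for k in range(num_topics):
                    a = doc_topic[d][k] + alpha  # A: topic k's share in document d (unnormalized, smoothed)
                    b = (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)  # B: word w's share in topic k
                    weights.append(a * b)
                t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    # Step 3: estimate each document's topic mixture from the final counts.
    return [[(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
            for counts, doc in zip(doc_topic, docs)]

toy_docs = [
    ["game", "team", "season", "game"],
    ["orbit", "launch", "moon", "orbit"],
    ["team", "game", "season", "team"],
    ["moon", "launch", "orbit", "moon"],
]
for mixture in gibbs_lda(toy_docs, num_topics=2):
    print([round(p, 2) for p in mixture])
```

The smoothing terms keep every probability strictly positive, so a topic that currently has no words assigned can still be chosen.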