# Introduction to LDA and data cleaning
In this notebook, we introduce LDA and what we need for our model. We then proceed to load and clean a sample of the NOW corpus to fulfill our needs.

## What is LDA
[Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA) is a statistical model which we will use for topic modelling and discovery. Given the list of words belonging to a text, LDA outputs the topics present in it and their probabilities. Here, a topic is represented as a probability distribution over words, and each text (document) is in turn represented as a distribution over topics. In short, texts have an associated topic distribution and topics have a word distribution.

The image below is the plate notation for LDA, where:
* θ<sub>m</sub> is the topic distribution for document m,
* φ<sub>k</sub> is the word distribution for topic k,
* z<sub>mn</sub> is the topic for the n-th word in document m,
* w<sub>mn</sub> is the specific word,
* α is the parameter of the Dirichlet prior on the per-document topic distributions, and
* β is the parameter of the Dirichlet prior on the per-topic word distributions.

![](LDA.png)
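
To make the notation concrete, the generative story that LDA assumes can be sketched in plain Python. This is only an illustration: the vocabulary, the number of topics and the values of α and β are made up, and `sample_dirichlet` and `generate_document` are hypothetical helpers, not part of any library.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(theta_m, phi, vocabulary, n_words, rng):
    """For each position n, pick a topic z_mn from theta_m, then a word w_mn from phi[z_mn]."""
    words = []
    for _ in range(n_words):
        z_mn = rng.choices(range(len(phi)), weights=theta_m)[0]
        w_mn = rng.choices(vocabulary, weights=phi[z_mn])[0]
        words.append(w_mn)
    return words

rng = random.Random(0)
vocabulary = ["court", "game", "market", "school", "team", "price"]  # toy vocabulary
n_topics, alpha, beta = 3, 0.5, 0.5  # illustrative values only

# phi[k] is the word distribution of topic k, theta_m the topic distribution of one document
phi = [sample_dirichlet(beta, len(vocabulary), rng) for _ in range(n_topics)]
theta_m = sample_dirichlet(alpha, n_topics, rng)
document = generate_document(theta_m, phi, vocabulary, 8, rng)
print(theta_m, document)
```

Fitting LDA goes the other way: only the words are observed, and θ and φ are estimated from them.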

α and β are the hyperparameters of the model. A large α means that documents are likely to be represented by a large number of topics, and a small α by only a few. The same goes for β: a high value means that topics are represented by a large number of words. The number of topics that LDA outputs is something we choose as an input, and it works a bit like the number of clusters in clustering. If we allow too many topics, we might end up splitting topics uselessly, while too few will make us group distinct topics together unnecessarily.
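
The effect of α can be checked empirically with a small, self-contained simulation. The sketch below assumes a symmetric Dirichlet prior over 10 topics and counts how many topics receive more than 5% of a document's weight; the threshold, the sample size and the helper names are arbitrary choices made for illustration.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_active_topics(alpha, n_topics=10, n_docs=500, threshold=0.05, seed=0):
    """Average number of topics whose weight exceeds `threshold` in a simulated document."""
    rng = random.Random(seed)
    active = 0
    for _ in range(n_docs):
        theta = sample_dirichlet(alpha, n_topics, rng)
        active += sum(1 for weight in theta if weight > threshold)
    return active / n_docs

small_alpha = mean_active_topics(0.1)   # sparse documents: few dominant topics
large_alpha = mean_active_topics(10.0)  # dense documents: weight spread over most topics
print(small_alpha, large_alpha)
```

The same experiment applies to β if "topics per document" is read as "words per topic".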

## The NOW corpus
This notebook shows the cleaning process that will be used for the ADA project. Here, only a sample of the data is used (from [here](https://www.corpusdata.org/now_corpus.asp)), but the methods should be the same once scaled to the full database available on the cluster.

The NOW database is composed of billions of words from online newspapers and magazines from 20 different countries. The data we downloaded comes in different files which can be used together or independently. These files are:

1. **now-samples-lexicon.txt**: this is the full dictionary of the English language, a lexicon. It contains four columns: `wID`, the word id; `word`, the actual word; `lemma`, the base form of the word (e.g. if the word is "walked", the lemma is "walk"); and `PoS`, the part of speech.
2. **now-samples-sources.txt**: this is the source of every text. In order, it contains the text id, the number of words, the date, the country, the website, the URL and the title of the article.
3. **text.txt**: this file has the complete texts of the articles. The first column is the `textID` in the format @@textID and the second column is the full text, complete with HTML paragraph and header tags. It is important to note that, to prevent plagiarism, every 200 words, 10 words are replaced by the string "@ @ @ @ @ @ @ @ @ @". Contracted words are also split (for example, "can't" is written as "ca n't") and punctuation is surrounded by spaces.
4. **wordLemPoS.txt**: finally, this file contains the `word`, `lemma` and `PoS` for each word in the texts, one per row, so one could read the texts by reading down the columns. Each row also carries the `textID` of the text the word comes from and an `ID (seq)`, a unique identifier for each word in the database, which is incremented each time a word is added.
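
Although we do not use `text.txt` in the rest of this notebook, its conventions are easy to handle. The sketch below assumes that the `@@textID` marker and the text are separated by whitespace and that HTML markers look like `<p>` or `<h>`; `parse_text_line` is a hypothetical helper and the sample line is made up.

```python
import re

MASK = "@ @ @ @ @ @ @ @ @ @"          # the ten-token placeholder described above
TAG_PATTERN = re.compile(r"<[^>]+>")  # assumption: HTML markers look like <p> or <h>

def parse_text_line(line):
    """Split one line of text.txt into (textID, text without masks and tags)."""
    parts = line.split(None, 1)
    text_id = int(parts[0].lstrip("@"))
    body = parts[1] if len(parts) > 1 else ""
    body = body.replace(MASK, " ")
    body = TAG_PATTERN.sub(" ", body)
    return text_id, " ".join(body.split())  # collapse repeated whitespace

sample = "@@11241 <p> Sol Yurick , @ @ @ @ @ @ @ @ @ @ ca n't ."  # made-up line
print(parse_text_line(sample))
```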

## What we need from the NOW corpus for LDA
The model will take two inputs: a matrix with all the important words for each text, and a list of all the important words. By important, we mean the words which will give us good topic modelling. For example, names, locations and simple, very common English words like "but", "I" or "and" (so-called stopwords) will not give meaningful results. Other common words present in our database should be removed too. We should also use lemmas instead of words.

Therefore, the file `wordLemPoS.txt` (henceforth referred to as wlp) is the most important here, as it lists all the lemmas with their associated `textID`. With it we can list all the lemmas and remove those we do not want in order to make our word list, but also group them by text to create our text-word matrix.

We will also need `now-samples-sources.txt` (henceforth referred to as sources) to link the texts with the information we deem useful, for example country, date or website.

These are thus the two files we will import and process here with the sample data, and also the ones we will use with the data on the cluster.
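
As a toy illustration of the two inputs described above, the sketch below builds a word list and a text-word count matrix from per-text lemma lists. It uses plain Python rather than Spark, the documents are made up, and `build_lda_inputs` is a hypothetical helper name.

```python
from collections import Counter

def build_lda_inputs(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in document i."""
    vocabulary = sorted({lemma for lemmas in documents for lemma in lemmas})
    index = {lemma: j for j, lemma in enumerate(vocabulary)}
    matrix = []
    for lemmas in documents:
        row = [0] * len(vocabulary)
        for lemma, count in Counter(lemmas).items():
            row[index[lemma]] = count
        matrix.append(row)
    return vocabulary, matrix

documents = [["film", "film", "adapt"], ["court", "film"]]  # toy lemma lists
vocabulary, matrix = build_lda_inputs(documents)
print(vocabulary)  # ['adapt', 'court', 'film']
print(matrix)      # [[1, 0, 2], [0, 1, 1]]
```

On the real corpus this matrix is very sparse, which is why libraries usually store it as (word id, count) pairs instead.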

## Cleaning

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import findspark
findspark.init()
import re
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import DateType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Wlp processing


The goal of this part is to extract the useful data from the wlp text files, since they contain all the words of all the articles together with the lemmas to replace them with.

In [2]:
#first read the text file
wlp_rdd = sc.textFile('sample_data/wordLem_poS.txt')

In [3]:
#the first 3 lines are header lines that carry no data
header = wlp_rdd.take(3)

#so let's remove those header lines
noheaders = wlp_rdd.filter(lambda r: r != header[0])\
            .filter(lambda r: r != header[1])\
            .filter(lambda r: r != header[2])

In [4]:
#we split the elements separated by tabs
lines = noheaders.map(lambda r: r.split('\t'))

#identify the columns
wlp_schema = lines.map(lambda r: Row(textID=int(r[0]),idseq=int(r[1]),word=r[2],lemma=r[3],pos=r[4]))
wlp = spark.createDataFrame(wlp_schema)
wlp.show(5)

+----------+------+----+------+-------+
|     idseq| lemma| pos|textID|   word|
+----------+------+----+------+-------+
|1095362496|      |  fo| 11241|@@11241|
|1095362497|      |null| 11241|    <p>|
|1095362498|   sol| np1| 11241|    Sol|
|1095362499|yurick| np1| 11241| Yurick|
|1095362500|      |   ,| 11241|      ,|
+----------+------+----+------+-------+
only showing top 5 rows



### Word selection
It is very important to select the right words and the right number of them. The concept of "garbage in, garbage out" has never been more true than with LDA. When we analyse a text, we focus on certain words to extract its meaning and topic. The same is true here: words like "if" or "for", numbers and common names are not that useful.

Here, we provide an example of the process we will go through. However, this is not really a data cleaning step, as it will directly influence our model. It is more of a model preprocessing step. We will surely go through many iterations of this next part for our model to give the best results.

First of all, we can remove all the words whose PoS does not interest us, for example numbers (`mc`, `mc1`, `m#`) or punctuation (`.`, `'`).

Details:
1. `.`, `,`, `'` and `"` are punctuation marks
2. `null` marks HTML tags from the websites
3. `mc`, `mc1` and `m#` are various kinds of numbers
4. `fo` marks the text IDs and other useless beginnings of texts

In [5]:
pos_remove = ['.',',',"\'",'\"','null','mc','mc1','m#','fo']
wlp_nopos = wlp.filter(~wlp['pos'].isin(pos_remove)).drop('idseq','pos','word')

Now, we load our list of stopwords, the words that we are not going to use in LDA as they are too common or are common names. We can also remove the rows with no lemmas or those with lemmas that don't make sense or are not common enough.

In [6]:
#np.save('our_stopwords',stopwords)
stopwords = np.load('our_stopwords.npy').tolist()
len(stopwords)

5639

In [7]:
#filter out stopwords, then look at the frequency of the remaining lemmas
wlp_nostop = wlp_nopos.filter(~wlp_nopos['lemma'].isin(stopwords))
lemma_freq = wlp_nostop.groupBy('lemma').count().sort('count', ascending=False)
lemma_freq.show()

+----------+-----+
|     lemma|count|
+----------+-----+
|      year| 4272|
|      time| 3169|
|    people| 2913|
|      take| 2667|
|       use| 2244|
|      work| 2137|
|       day| 1819|
|     state| 1713|
|   company| 1698|
|   comment| 1667|
|      need| 1654|
|      want| 1579|
|      look| 1564|
|     world| 1553|
|government| 1551|
|      show| 1480|
|      give| 1480|
|   country| 1465|
|      find| 1464|
|     right| 1408|
+----------+-----+
only showing top 20 rows



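We will also remove the most common and least common lemmas, since they do not provide enough information for our LDA analysis. Here, we filter out the lemmas whose count lies above the 99th percentile (roughly the top 1%) and those that appear 5 times or fewer. The 10th percentile is also computed, but it is overridden by this fixed threshold.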

<div class="alert alert-success">
We may want to switch from a percentile to an absolute count for the bottom filter, depending on which one is the harsher (here it is the absolute count).
</div>

In [8]:
#compute the percentile cut-offs and filter out the lemmas above and below them
[bottom,top] = lemma_freq.approxQuantile('count', [0.1,0.99], 0.01)
bottom = 5 #override the 10th percentile with a fixed minimum count
lemma_tokeep = lemma_freq.filter(lemma_freq['count']>bottom).filter(lemma_freq['count']<top)
print('Percentage of lemmas left: %.2f'%(lemma_tokeep.count()/lemma_freq.count()*100))

Percentage of lemmas left: 28.47
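
The choice between a percentile and an absolute count for the bottom cut-off can be automated by taking whichever is harsher. The sketch below shows the idea on toy data in plain Python. `frequency_bounds` is a hypothetical helper, and its nearest-rank quantile is a simplification that will not match Spark's `approxQuantile` exactly.

```python
from collections import Counter

def frequency_bounds(counts, bottom_q, top_q, min_count):
    """Return (bottom, top) count cut-offs; bottom is the harsher of quantile and min_count."""
    values = sorted(counts.values())

    def quantile(q):
        # simple nearest-rank quantile over the sorted counts
        return values[min(int(q * len(values)), len(values) - 1)]

    bottom = max(quantile(bottom_q), min_count)
    return bottom, quantile(top_q)

lemmas = ["year"] * 10 + ["court"] * 4 + ["film"] * 3 + ["zyx"]  # toy data
counts = Counter(lemmas)
bottom, top = frequency_bounds(counts, bottom_q=0.1, top_q=0.99, min_count=2)

# same strict inequalities as in the cell above
kept = sorted(lemma for lemma, c in counts.items() if bottom < c < top)
print(bottom, top, kept)  # 2 10 ['court', 'film']
```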


Making an inner join, we keep only the words which are in both lists. In the end, we can group the lemmas by text to create our text-word matrix.

In [9]:
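#register both DataFrames as temporary views, then perform the inner join in SQL
#(createOrReplaceTempView replaces the deprecated registerTempTable)
wlp_nostop.createOrReplaceTempView('wlp_nostop')
lemma_tokeep.createOrReplaceTempView('lemma_tokeep')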

query = """
SELECT wlp_nostop.lemma, wlp_nostop.textID
FROM wlp_nostop
INNER JOIN lemma_tokeep ON wlp_nostop.lemma = lemma_tokeep.lemma
"""

wlp_kept = spark.sql(query)
wlp_bytext = wlp_kept.groupBy('textID').agg(collect_list('lemma'))\
                    .sort('textID')\
                    .withColumnRenamed('collect_list(lemma)','document lemmas')
wlp_bytext.show()

+------+--------------------+
|textID|     document lemmas|
+------+--------------------+
| 11241|[1970s, film, fil...|
| 11242|[online, happen, ...|
| 11243|[dough, dough, do...|
| 11244|[trail, launch, o...|
| 21242|[online, launch, ...|
| 21243|[recognize, indic...|
| 31240|[recognize, inten...|
| 31241|[online, online, ...|
| 31242|[settlement, sett...|
| 41240|[explain, hometow...|
| 41241|[scale, lack, pre...|
| 41244|[everyday, trail,...|
| 51243|[australia, austr...|
| 61240|[frustrate, inten...|
| 61242|[editor-in-chief,...|
| 71240|[indicator, requi...|
| 71241|[likelihood, requ...|
| 71242|[1970s, character...|
| 71243|[online, staff, s...|
| 71244|[bone, archaeolog...|
+------+--------------------+
only showing top 20 rows
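
For readers less familiar with SQL, the join-then-group step can be pictured with the plain-Python sketch below. It assumes that each lemma appears only once in the list of lemmas to keep, which holds here because that list comes from a `groupBy`. The rows are made up and `group_kept_lemmas` is a hypothetical helper.

```python
from collections import defaultdict

def group_kept_lemmas(rows, lemmas_to_keep):
    """rows: iterable of (lemma, textID). Returns {textID: [lemma, ...]} for kept lemmas only."""
    keep = set(lemmas_to_keep)
    by_text = defaultdict(list)
    for lemma, text_id in rows:
        if lemma in keep:  # same effect as the inner join on lemma
            by_text[text_id].append(lemma)
    return dict(sorted(by_text.items()))  # sorted by textID, like .sort('textID')

rows = [("film", 11241), ("year", 11241), ("film", 11241), ("online", 11242)]
grouped = group_kept_lemmas(rows, ["film", "online"])
print(grouped)  # {11241: ['film', 'film'], 11242: ['online']}
```

Repeated lemmas are kept on purpose: LDA needs to know how often each lemma occurs in a text.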



## Sources
This file contains all the additional information about each text.

In [11]:
sources_rdd = sc.textFile('sample_data/now-samples-sources.txt')\
                .map(lambda r: r.split('\t'))

header = sources_rdd.take(3)
sources_rdd = sources_rdd.filter(lambda l: l != header[0])\
                .filter(lambda l: l != header[1])\
                .filter(lambda l: l != header[2])

In [12]:
#create schema and change data type for date
sources_schema = sources_rdd.map(lambda r: Row(textID=int(r[0]),nwords=int(r[1]),date=r[2],country=r[3],website=r[4],url=r[5],title=r[6],)) 
sources = spark.createDataFrame(sources_schema)
sources = sources.withColumn('date',to_date(sources.date, 'yy-MM-dd'))

In [13]:
sources.printSchema()

root
 |-- country: string (nullable = true)
 |-- date: date (nullable = true)
 |-- nwords: long (nullable = true)
 |-- textID: long (nullable = true)
 |-- title: string (nullable = true)
 |-- url: string (nullable = true)
 |-- website: string (nullable = true)



In [14]:
sources.show(5)

+-------+----------+------+------+--------------------+--------------------+-------------------+
|country|      date|nwords|textID|               title|                 url|            website|
+-------+----------+------+------+--------------------+--------------------+-------------------+
|     US|2013-01-06|   397| 11241|Author of The War...|http://kotaku.com...|             Kotaku|
|     US|2013-01-06|   757| 11242|That's What They ...|http://michiganra...|     Michigan Radio|
|     US|2013-01-06|   755| 11243|Best of New York:...|http://www.nydail...|New York Daily News|
|     US|2013-01-06|  1677| 11244|Reflecting on a q...|http://www.oregon...|     OregonLive.com|
|     US|2013-01-11|   794| 21242|Ask Ars: Does Fac...|http://arstechnic...|       Ars Technica|
+-------+----------+------+------+--------------------+--------------------+-------------------+
only showing top 5 rows
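
The date conversion can be checked outside Spark. The sketch below assumes that the raw strings look like `13-01-06` (two-digit year, month, day), which is what the `'yy-MM-dd'` pattern and the output above suggest; `parse_now_date` is a hypothetical helper.

```python
from datetime import date, datetime

def parse_now_date(raw):
    """Parse a two-digit-year date string such as '13-01-06' into a date object."""
    return datetime.strptime(raw, "%y-%m-%d").date()

parsed = parse_now_date("13-01-06")
print(parsed)  # 2013-01-06
```

Python maps two-digit years 00 to 68 to the 2000s and 69 to 99 to the 1900s, which is harmless here since the sample dates are recent.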



## First tries with LDA
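First attempt: the goal is to use Spark's LDA with gensim corpus processing. The cells below build the gensim dictionary and corpus from the lemmas collected with Spark and train a gensim `LdaModel` on them.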

In [16]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import SparseVector
from gensim.models.ldamodel import LdaModel

In [11]:
class MyCorpus(object):
    #yields one bag-of-words vector per document, using the dictionary defined in a later cell
    def __iter__(self):
        for line in wlp_bytext.rdd.map(lambda r: r[1]).collect():
            yield dictionary.doc2bow(line)

In [25]:
c = 0
for line in wlp_bytext.rdd.map(lambda r: r[1]).collect():
    c+=len(line)
    
print(c)

631610


In [14]:
dictionary = corpora.Dictionary(line for line in wlp_bytext.rdd.map(lambda r: r[1]).collect())
corpus = MyCorpus()

2018-12-03 10:57:32,650 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-12-03 10:57:33,119 : INFO : built Dictionary(11399 unique tokens: ['1970s', 'adapt', 'adaptation', 'advertisement', 'afternoon']...) from 2914 documents (total 631610 corpus positions)
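
To see what the dictionary and the bag-of-words conversion produce, here is a plain-Python sketch that mimics the idea: each lemma gets an integer id and each document becomes a sorted list of (id, count) pairs. It is not gensim's implementation; `build_token_ids` and `to_bow` are hypothetical helpers and the documents are made up.

```python
from collections import Counter

def build_token_ids(documents):
    """Assign an integer id to every distinct lemma, in order of first appearance by document."""
    token_ids = {}
    for lemmas in documents:
        for lemma in sorted(set(lemmas)):
            token_ids.setdefault(lemma, len(token_ids))
    return token_ids

def to_bow(lemmas, token_ids):
    """Convert a list of lemmas into sorted (token id, count) pairs, skipping unknown lemmas."""
    counts = Counter(token_ids[lemma] for lemma in lemmas if lemma in token_ids)
    return sorted(counts.items())

documents = [["film", "adapt", "film"], ["court", "film"]]  # toy lemma lists
token_ids = build_token_ids(documents)
print(token_ids)                        # {'adapt': 0, 'film': 1, 'court': 2}
print(to_bow(documents[0], token_ids))  # [(0, 1), (1, 2)]
print(to_bow(documents[1], token_ids))  # [(1, 1), (2, 1)]
```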


In [23]:
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1, chunksize=100, passes=5)

2018-12-03 11:04:07,310 : INFO : using symmetric alpha at 0.1
2018-12-03 11:04:07,311 : INFO : using symmetric eta at 0.1
2018-12-03 11:04:07,314 : INFO : using serial LDA version on this node
2018-12-03 11:04:09,358 : INFO : running online (multi-pass) LDA training, 10 topics, 5 passes over the supplied corpus of 2914 documents, updating model once every 100 documents, evaluating perplexity every 1000 documents, iterating 50x with a convergence threshold of 0.001000
2018-12-03 11:04:11,401 : INFO : PROGRESS: pass 0, at document #100/2914
2018-12-03 11:04:11,482 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:11,501 : INFO : topic #2 (0.100): 0.007*"people" + 0.005*"work" + 0.004*"use" + 0.004*"look" + 0.004*"time" + 0.004*"company" + 0.004*"want" + 0.004*"high" + 0.004*"big" + 0.003*"portland"
2018-12-03 11:04:11,502 : INFO : topic #5 (0.100): 0.006*"time" + 0.006*"people" + 0.005*"price" + 0.005*"company" + 0.005*"take" + 0.004*"find" + 0.0

2018-12-03 11:04:12,162 : INFO : topic diff=0.819720, rho=0.408248
2018-12-03 11:04:12,184 : INFO : PROGRESS: pass 0, at document #700/2914
2018-12-03 11:04:12,271 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:12,285 : INFO : topic #2 (0.100): 0.010*"people" + 0.007*"city" + 0.007*"work" + 0.006*"want" + 0.006*"look" + 0.005*"show" + 0.005*"time" + 0.005*"day" + 0.004*"need" + 0.004*"week"
2018-12-03 11:04:12,286 : INFO : topic #1 (0.100): 0.008*"film" + 0.007*"time" + 0.007*"take" + 0.007*"play" + 0.006*"win" + 0.005*"team" + 0.005*"school" + 0.005*"show" + 0.005*"game" + 0.004*"start"
2018-12-03 11:04:12,287 : INFO : topic #6 (0.100): 0.011*"khan" + 0.008*"court" + 0.005*"city" + 0.005*"official" + 0.005*"member" + 0.005*"state" + 0.004*"plan" + 0.004*"security" + 0.004*"council" + 0.004*"time"
2018-12-03 11:04:12,288 : INFO : topic #4 (0.100): 0.011*"market" + 0.011*"power" + 0.010*"company" + 0.009*"country" + 0.006*"industry" + 0.006*"

2018-12-03 11:04:13,074 : INFO : topic #4 (0.100): 0.011*"company" + 0.011*"market" + 0.009*"business" + 0.008*"cost" + 0.006*"price" + 0.006*"system" + 0.005*"rate" + 0.005*"product" + 0.005*"country" + 0.005*"industry"
2018-12-03 11:04:13,074 : INFO : topic #5 (0.100): 0.012*"woman" + 0.011*"people" + 0.010*"family" + 0.009*"child" + 0.008*"life" + 0.007*"police" + 0.006*"time" + 0.006*"health" + 0.005*"death" + 0.005*"take"
2018-12-03 11:04:13,075 : INFO : topic #0 (0.100): 0.023*"comment" + 0.009*"share" + 0.007*"time" + 0.006*"post" + 0.006*"government" + 0.006*"recommend" + 0.006*"work" + 0.006*"help" + 0.005*"name" + 0.005*"twitter"
2018-12-03 11:04:13,076 : INFO : topic #2 (0.100): 0.012*"people" + 0.009*"vegas" + 0.008*"work" + 0.007*"want" + 0.007*"city" + 0.007*"time" + 0.006*"look" + 0.005*"show" + 0.005*"day" + 0.004*"funny"
2018-12-03 11:04:13,076 : INFO : topic diff=0.464808, rho=0.277350
2018-12-03 11:04:13,095 : INFO : PROGRESS: pass 0, at document #1400/2914

2018-12-03 11:04:13,655 : INFO : topic #1 (0.100): 0.012*"game" + 0.011*"team" + 0.010*"time" + 0.010*"play" + 0.010*"win" + 0.007*"take" + 0.006*"world" + 0.006*"player" + 0.005*"film" + 0.005*"show"
2018-12-03 11:04:13,656 : INFO : topic diff=0.301679, rho=0.229416
2018-12-03 11:04:13,778 : INFO : -8.537 per-word bound, 371.4 perplexity estimate based on a held-out corpus of 100 documents with 19093 words
2018-12-03 11:04:13,778 : INFO : PROGRESS: pass 0, at document #2000/2914
2018-12-03 11:04:13,833 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:13,849 : INFO : topic #3 (0.100): 0.015*"use" + 0.010*"car" + 0.009*"police" + 0.007*"cookie" + 0.006*"vehicle" + 0.006*"service" + 0.006*"office" + 0.006*"press" + 0.005*"website" + 0.005*"county"
2018-12-03 11:04:13,849 : INFO : topic #4 (0.100): 0.019*"company" + 0.010*"market" + 0.009*"price" + 0.009*"business" + 0.006*"rate" + 0.006*"cost" + 0.006*"quarter" + 0.005*"cent" + 0.005*"billion" +

2018-12-03 11:04:14,279 : INFO : topic diff=0.220038, rho=0.200000
2018-12-03 11:04:14,293 : INFO : PROGRESS: pass 0, at document #2600/2914
2018-12-03 11:04:14,347 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:14,360 : INFO : topic #8 (0.100): 0.016*"country" + 0.013*"state" + 0.011*"government" + 0.009*"nation" + 0.008*"president" + 0.007*"national" + 0.007*"minister" + 0.007*"nigeria" + 0.006*"force" + 0.006*"attack"
2018-12-03 11:04:14,361 : INFO : topic #3 (0.100): 0.017*"use" + 0.012*"police" + 0.010*"cookie" + 0.008*"site" + 0.008*"car" + 0.008*"website" + 0.008*"service" + 0.007*"press" + 0.007*"office" + 0.006*"provide"
2018-12-03 11:04:14,362 : INFO : topic #2 (0.100): 0.014*"people" + 0.008*"city" + 0.007*"work" + 0.007*"time" + 0.007*"want" + 0.006*"show" + 0.006*"party" + 0.005*"election" + 0.005*"look" + 0.005*"day"
2018-12-03 11:04:14,363 : INFO : topic #7 (0.100): 0.018*"school" + 0.011*"university" + 0.011*"student" + 0.008

2018-12-03 11:04:16,673 : INFO : topic #3 (0.100): 0.020*"use" + 0.011*"user" + 0.011*"site" + 0.009*"website" + 0.009*"cookie" + 0.008*"police" + 0.007*"content" + 0.007*"press" + 0.007*"service" + 0.007*"provide"
2018-12-03 11:04:16,674 : INFO : topic #4 (0.100): 0.020*"market" + 0.017*"company" + 0.010*"price" + 0.008*"project" + 0.008*"business" + 0.007*"industry" + 0.007*"report" + 0.006*"cost" + 0.006*"rate" + 0.006*"investment"
2018-12-03 11:04:16,675 : INFO : topic #6 (0.100): 0.018*"court" + 0.008*"official" + 0.008*"state" + 0.007*"party" + 0.007*"indian" + 0.007*"case" + 0.006*"council" + 0.006*"order" + 0.006*"member" + 0.006*"house"
2018-12-03 11:04:16,676 : INFO : topic #0 (0.100): 0.031*"comment" + 0.010*"post" + 0.009*"government" + 0.008*"share" + 0.007*"time" + 0.006*"minister" + 0.006*"right" + 0.005*"please" + 0.005*"community" + 0.005*"facebook"
2018-12-03 11:04:16,676 : INFO : topic #7 (0.100): 0.015*"school" + 0.012*"student" + 0.009*"need" + 0.008*"university" +

2018-12-03 11:04:17,261 : INFO : topic #1 (0.100): 0.010*"play" + 0.010*"time" + 0.009*"game" + 0.009*"take" + 0.009*"team" + 0.008*"win" + 0.006*"world" + 0.006*"film" + 0.005*"point" + 0.005*"music"
2018-12-03 11:04:17,262 : INFO : topic #7 (0.100): 0.021*"school" + 0.014*"student" + 0.013*"university" + 0.008*"program" + 0.008*"education" + 0.008*"study" + 0.008*"need" + 0.007*"work" + 0.007*"high" + 0.007*"use"
2018-12-03 11:04:17,263 : INFO : topic #6 (0.100): 0.016*"court" + 0.009*"khan" + 0.009*"official" + 0.008*"state" + 0.007*"indian" + 0.007*"member" + 0.007*"city" + 0.006*"case" + 0.006*"committee" + 0.006*"council"
2018-12-03 11:04:17,264 : INFO : topic diff=0.156346, rho=0.179201
2018-12-03 11:04:17,284 : INFO : PROGRESS: pass 1, at document #900/2914
2018-12-03 11:04:17,345 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:17,357 : INFO : topic #4 (0.100): 0.017*"company" + 0.014*"market" + 0.008*"business" + 0.008*"price" + 0.00

2018-12-03 11:04:17,938 : INFO : topic #8 (0.100): 0.013*"country" + 0.012*"state" + 0.009*"government" + 0.007*"president" + 0.007*"national" + 0.007*"war" + 0.007*"nation" + 0.006*"leader" + 0.006*"force" + 0.006*"attack"
2018-12-03 11:04:17,939 : INFO : topic diff=0.137303, rho=0.179201
2018-12-03 11:04:17,964 : INFO : PROGRESS: pass 1, at document #1500/2914
2018-12-03 11:04:18,019 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:18,033 : INFO : topic #0 (0.100): 0.027*"comment" + 0.010*"post" + 0.008*"government" + 0.007*"time" + 0.007*"right" + 0.006*"share" + 0.006*"canada" + 0.006*"work" + 0.006*"facebook" + 0.006*"community"
2018-12-03 11:04:18,035 : INFO : topic #4 (0.100): 0.014*"company" + 0.011*"market" + 0.010*"business" + 0.007*"cost" + 0.006*"price" + 0.006*"cent" + 0.006*"industry" + 0.006*"rate" + 0.006*"consumer" + 0.006*"system"
2018-12-03 11:04:18,036 : INFO : topic #8 (0.100): 0.013*"country" + 0.011*"state" + 0.009*"gove

2018-12-03 11:04:18,556 : INFO : topic diff=0.113544, rho=0.179201
2018-12-03 11:04:18,575 : INFO : PROGRESS: pass 1, at document #2100/2914
2018-12-03 11:04:18,619 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:18,632 : INFO : topic #9 (0.100): 0.011*"water" + 0.007*"food" + 0.006*"open" + 0.006*"*" + 0.005*"hotel" + 0.005*"area" + 0.005*"park" + 0.004*"day" + 0.004*"place" + 0.004*"use"
2018-12-03 11:04:18,633 : INFO : topic #2 (0.100): 0.013*"people" + 0.009*"want" + 0.008*"work" + 0.008*"time" + 0.008*"city" + 0.007*"show" + 0.006*"day" + 0.006*"look" + 0.004*"party" + 0.004*"bad"
2018-12-03 11:04:18,633 : INFO : topic #8 (0.100): 0.014*"country" + 0.012*"state" + 0.010*"government" + 0.008*"president" + 0.008*"war" + 0.007*"leader" + 0.006*"minister" + 0.006*"nation" + 0.006*"national" + 0.005*"take"
2018-12-03 11:04:18,634 : INFO : topic #7 (0.100): 0.022*"school" + 0.013*"student" + 0.012*"university" + 0.010*"study" + 0.008*"use" + 0

2018-12-03 11:04:19,131 : INFO : topic #2 (0.100): 0.014*"people" + 0.009*"want" + 0.008*"time" + 0.008*"work" + 0.007*"show" + 0.007*"city" + 0.006*"day" + 0.006*"look" + 0.004*"party" + 0.004*"trump"
2018-12-03 11:04:19,132 : INFO : topic #1 (0.100): 0.014*"game" + 0.013*"team" + 0.013*"play" + 0.011*"win" + 0.010*"time" + 0.007*"take" + 0.007*"player" + 0.007*"world" + 0.006*"league" + 0.006*"sport"
2018-12-03 11:04:19,133 : INFO : topic #7 (0.100): 0.016*"school" + 0.011*"university" + 0.010*"student" + 0.009*"work" + 0.008*"need" + 0.007*"study" + 0.007*"use" + 0.007*"community" + 0.007*"help" + 0.007*"education"
2018-12-03 11:04:19,134 : INFO : topic #0 (0.100): 0.028*"comment" + 0.011*"post" + 0.011*"government" + 0.008*"time" + 0.007*"right" + 0.006*"colombia" + 0.005*"minister" + 0.005*"share" + 0.005*"community" + 0.005*"personal"
2018-12-03 11:04:19,134 : INFO : topic diff=0.102807, rho=0.179201
2018-12-03 11:04:19,152 : INFO : PROGRESS: pass 1, at document #2800/2914

2018-12-03 11:04:21,564 : INFO : topic #3 (0.100): 0.023*"use" + 0.012*"user" + 0.012*"police" + 0.011*"site" + 0.010*"website" + 0.009*"car" + 0.008*"content" + 0.007*"cookie" + 0.007*"service" + 0.007*"press"
2018-12-03 11:04:21,566 : INFO : topic #7 (0.100): 0.014*"school" + 0.012*"student" + 0.010*"university" + 0.010*"need" + 0.009*"work" + 0.009*"study" + 0.008*"community" + 0.007*"provide" + 0.007*"program" + 0.007*"help"
2018-12-03 11:04:21,568 : INFO : topic #9 (0.100): 0.014*"water" + 0.008*"food" + 0.006*"park" + 0.006*"farm" + 0.005*"*" + 0.005*"use" + 0.005*"area" + 0.005*"coffee" + 0.005*"plant" + 0.005*"tree"
2018-12-03 11:04:21,569 : INFO : topic diff=0.119779, rho=0.176391
2018-12-03 11:04:21,589 : INFO : PROGRESS: pass 2, at document #400/2914
2018-12-03 11:04:21,646 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:21,659 : INFO : topic #6 (0.100): 0.020*"court" + 0.009*"case" + 0.008*"state" + 0.008*"official" + 0.007*"membe

2018-12-03 11:04:22,140 : INFO : topic #4 (0.100): 0.018*"company" + 0.014*"market" + 0.009*"business" + 0.008*"price" + 0.007*"cent" + 0.007*"bank" + 0.006*"industry" + 0.006*"power" + 0.006*"energy" + 0.005*"cost"
2018-12-03 11:04:22,141 : INFO : topic diff=0.097619, rho=0.176391
2018-12-03 11:04:22,272 : INFO : -8.289 per-word bound, 312.8 perplexity estimate based on a held-out corpus of 100 documents with 22700 words
2018-12-03 11:04:22,273 : INFO : PROGRESS: pass 2, at document #1000/2914
2018-12-03 11:04:22,326 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:22,340 : INFO : topic #3 (0.100): 0.021*"use" + 0.012*"police" + 0.011*"site" + 0.010*"car" + 0.009*"user" + 0.009*"website" + 0.007*"service" + 0.007*"road" + 0.007*"area" + 0.006*"traffic"
2018-12-03 11:04:22,341 : INFO : topic #5 (0.100): 0.018*"child" + 0.014*"family" + 0.011*"life" + 0.010*"woman" + 0.009*"people" + 0.008*"health" + 0.008*"death" + 0.007*"police" + 0.007*"take

2018-12-03 11:04:22,815 : INFO : PROGRESS: pass 2, at document #1600/2914
2018-12-03 11:04:22,866 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:22,887 : INFO : topic #0 (0.100): 0.032*"comment" + 0.011*"post" + 0.009*"government" + 0.008*"right" + 0.008*"time" + 0.006*"facebook" + 0.006*"please" + 0.006*"canada" + 0.006*"view" + 0.006*"community"
2018-12-03 11:04:22,888 : INFO : topic #6 (0.100): 0.017*"court" + 0.009*"city" + 0.008*"indian" + 0.008*"case" + 0.007*"council" + 0.007*"official" + 0.006*"public" + 0.006*"khan" + 0.006*"claim" + 0.006*"plan"
2018-12-03 11:04:22,889 : INFO : topic #8 (0.100): 0.015*"country" + 0.013*"state" + 0.009*"government" + 0.008*"pakistan" + 0.008*"president" + 0.007*"oct" + 0.007*"national" + 0.007*"nation" + 0.006*"world" + 0.006*"force"
2018-12-03 11:04:22,890 : INFO : topic #2 (0.100): 0.014*"people" + 0.010*"time" + 0.009*"want" + 0.008*"work" + 0.008*"show" + 0.007*"city" + 0.007*"day" + 0.007*"look

2018-12-03 11:04:23,461 : INFO : topic #8 (0.100): 0.015*"country" + 0.014*"state" + 0.010*"government" + 0.009*"president" + 0.007*"leader" + 0.007*"national" + 0.007*"minister" + 0.006*"war" + 0.006*"group" + 0.006*"nation"
2018-12-03 11:04:23,462 : INFO : topic #9 (0.100): 0.012*"water" + 0.009*"food" + 0.006*"area" + 0.006*"open" + 0.005*"park" + 0.005*"hotel" + 0.005*"*" + 0.004*"use" + 0.004*"green" + 0.004*"place"
2018-12-03 11:04:23,463 : INFO : topic #5 (0.100): 0.013*"child" + 0.012*"family" + 0.012*"woman" + 0.009*"life" + 0.008*"take" + 0.008*"health" + 0.008*"people" + 0.007*"death" + 0.006*"find" + 0.006*"police"
2018-12-03 11:04:23,464 : INFO : topic #6 (0.100): 0.017*"court" + 0.009*"committee" + 0.009*"city" + 0.008*"official" + 0.008*"member" + 0.008*"state" + 0.007*"indian" + 0.007*"statement" + 0.007*"board" + 0.007*"issue"
2018-12-03 11:04:23,464 : INFO : topic diff=0.079076, rho=0.176391
2018-12-03 11:04:23,481 : INFO : PROGRESS: pass 2, at document #2300/2914

2018-12-03 11:04:23,935 : INFO : topic #1 (0.100): 0.014*"play" + 0.013*"game" + 0.013*"team" + 0.011*"win" + 0.009*"time" + 0.008*"player" + 0.008*"take" + 0.007*"world" + 0.006*"second" + 0.006*"sport"
2018-12-03 11:04:23,936 : INFO : topic #5 (0.100): 0.016*"child" + 0.013*"woman" + 0.012*"family" + 0.011*"health" + 0.009*"girl" + 0.008*"life" + 0.008*"drug" + 0.008*"people" + 0.007*"take" + 0.007*"case"
2018-12-03 11:04:23,937 : INFO : topic diff=0.087456, rho=0.176391
2018-12-03 11:04:23,952 : INFO : PROGRESS: pass 2, at document #2900/2914
2018-12-03 11:04:24,004 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:24,019 : INFO : topic #4 (0.100): 0.018*"company" + 0.012*"market" + 0.012*"business" + 0.011*"price" + 0.007*"cost" + 0.007*"billion" + 0.006*"share" + 0.006*"industry" + 0.006*"tax" + 0.005*"growth"
2018-12-03 11:04:24,022 : INFO : topic #6 (0.100): 0.019*"court" + 0.011*"state" + 0.008*"official" + 0.008*"case" + 0.008*"indian"

2018-12-03 11:04:26,226 : INFO : topic #2 (0.100): 0.014*"people" + 0.012*"want" + 0.008*"work" + 0.008*"show" + 0.008*"time" + 0.007*"city" + 0.007*"look" + 0.006*"day" + 0.006*"beer" + 0.005*"really"
2018-12-03 11:04:26,227 : INFO : topic #7 (0.100): 0.020*"school" + 0.013*"student" + 0.011*"need" + 0.010*"work" + 0.009*"university" + 0.008*"provide" + 0.008*"study" + 0.007*"project" + 0.007*"community" + 0.007*"education"
2018-12-03 11:04:26,228 : INFO : topic diff=0.096585, rho=0.173710
2018-12-03 11:04:26,245 : INFO : PROGRESS: pass 3, at document #500/2914
2018-12-03 11:04:26,290 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:26,303 : INFO : topic #1 (0.100): 0.013*"play" + 0.013*"game" + 0.010*"team" + 0.009*"take" + 0.009*"time" + 0.008*"win" + 0.007*"score" + 0.006*"point" + 0.006*"player" + 0.006*"look"
2018-12-03 11:04:26,304 : INFO : topic #5 (0.100): 0.016*"child" + 0.011*"family" + 0.011*"life" + 0.010*"woman" + 0.008*"people" 

2018-12-03 11:04:26,832 : INFO : topic #3 (0.100): 0.022*"use" + 0.012*"site" + 0.011*"police" + 0.010*"car" + 0.010*"user" + 0.010*"website" + 0.007*"service" + 0.007*"road" + 0.006*"area" + 0.006*"contact"
2018-12-03 11:04:26,833 : INFO : topic diff=0.071545, rho=0.173710
2018-12-03 11:04:26,854 : INFO : PROGRESS: pass 3, at document #1100/2914
2018-12-03 11:04:26,898 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:26,911 : INFO : topic #6 (0.100): 0.016*"court" + 0.009*"official" + 0.008*"case" + 0.008*"state" + 0.007*"member" + 0.007*"indian" + 0.007*"khan" + 0.006*"claim" + 0.006*"committee" + 0.006*"council"
2018-12-03 11:04:26,912 : INFO : topic #0 (0.100): 0.029*"comment" + 0.010*"government" + 0.008*"right" + 0.008*"post" + 0.008*"time" + 0.007*"facebook" + 0.006*"canada" + 0.006*"minister" + 0.006*"work" + 0.006*"community"
2018-12-03 11:04:26,912 : INFO : topic #8 (0.100): 0.016*"country" + 0.016*"state" + 0.011*"government" + 0.00

2018-12-03 11:04:27,426 : INFO : topic #6 (0.100): 0.016*"court" + 0.010*"indian" + 0.009*"city" + 0.008*"official" + 0.008*"case" + 0.007*"public" + 0.006*"council" + 0.006*"claim" + 0.006*"state" + 0.006*"law"
2018-12-03 11:04:27,427 : INFO : topic #5 (0.100): 0.016*"child" + 0.012*"family" + 0.012*"woman" + 0.009*"life" + 0.009*"death" + 0.008*"people" + 0.007*"health" + 0.007*"take" + 0.007*"police" + 0.006*"find"
2018-12-03 11:04:27,427 : INFO : topic #1 (0.100): 0.015*"game" + 0.012*"team" + 0.011*"play" + 0.010*"win" + 0.010*"time" + 0.008*"take" + 0.008*"world" + 0.006*"player" + 0.006*"second" + 0.005*"big"
2018-12-03 11:04:27,428 : INFO : topic #7 (0.100): 0.014*"school" + 0.011*"university" + 0.011*"study" + 0.011*"student" + 0.008*"work" + 0.008*"need" + 0.008*"program" + 0.007*"use" + 0.006*"people" + 0.006*"help"
2018-12-03 11:04:27,429 : INFO : topic #8 (0.100): 0.016*"country" + 0.013*"state" + 0.009*"government" + 0.008*"pakistan" + 0.008*"president" + 0.007*"group" + 

2018-12-03 11:04:27,990 : INFO : topic #7 (0.100): 0.020*"school" + 0.013*"student" + 0.010*"work" + 0.010*"university" + 0.009*"study" + 0.008*"need" + 0.008*"help" + 0.007*"program" + 0.006*"use" + 0.006*"project"
2018-12-03 11:04:27,991 : INFO : topic #4 (0.100): 0.023*"company" + 0.011*"business" + 0.010*"market" + 0.009*"price" + 0.008*"share" + 0.007*"financial" + 0.006*"growth" + 0.006*"cost" + 0.006*"value" + 0.006*"cent"
2018-12-03 11:04:27,991 : INFO : topic #3 (0.100): 0.025*"use" + 0.013*"cookie" + 0.013*"website" + 0.012*"site" + 0.011*"car" + 0.010*"user" + 0.009*"content" + 0.008*"press" + 0.008*"provide" + 0.008*"police"
2018-12-03 11:04:27,992 : INFO : topic #2 (0.100): 0.014*"people" + 0.010*"want" + 0.010*"time" + 0.009*"work" + 0.008*"show" + 0.007*"day" + 0.006*"look" + 0.006*"city" + 0.005*"take" + 0.005*"really"
2018-12-03 11:04:27,993 : INFO : topic diff=0.068178, rho=0.173710
2018-12-03 11:04:28,007 : INFO : PROGRESS: pass 3, at document #2400/2914
2018-12-03 1

2018-12-03 11:04:28,450 : INFO : topic #1 (0.100): 0.014*"game" + 0.013*"play" + 0.013*"team" + 0.011*"win" + 0.009*"time" + 0.008*"player" + 0.008*"take" + 0.007*"world" + 0.006*"point" + 0.006*"second"
2018-12-03 11:04:28,451 : INFO : topic #6 (0.100): 0.019*"court" + 0.011*"state" + 0.008*"case" + 0.008*"official" + 0.007*"indian" + 0.007*"house" + 0.007*"order" + 0.007*"issue" + 0.007*"statement" + 0.006*"committee"
2018-12-03 11:04:28,452 : INFO : topic diff=0.063040, rho=0.173710
2018-12-03 11:04:28,491 : INFO : -8.110 per-word bound, 276.2 perplexity estimate based on a held-out corpus of 14 documents with 2867 words
2018-12-03 11:04:28,492 : INFO : PROGRESS: pass 3, at document #2914/2914
2018-12-03 11:04:28,508 : INFO : merging changes from 14 documents into a model of 2914 documents
2018-12-03 11:04:28,522 : INFO : topic #4 (0.100): 0.026*"market" + 0.018*"company" + 0.011*"report" + 0.010*"price" + 0.010*"business" + 0.007*"energy" + 0.007*"investment" + 0.007*"share" + 0.00

2018-12-03 11:04:30,705 : INFO : topic #7 (0.100): 0.019*"school" + 0.012*"student" + 0.011*"need" + 0.011*"work" + 0.009*"university" + 0.008*"provide" + 0.008*"program" + 0.008*"project" + 0.007*"study" + 0.007*"system"
2018-12-03 11:04:30,705 : INFO : topic diff=0.067714, rho=0.171147
2018-12-03 11:04:30,730 : INFO : PROGRESS: pass 4, at document #600/2914
2018-12-03 11:04:30,783 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:30,800 : INFO : topic #1 (0.100): 0.013*"play" + 0.012*"game" + 0.010*"team" + 0.009*"win" + 0.009*"time" + 0.009*"take" + 0.007*"player" + 0.007*"score" + 0.006*"point" + 0.006*"world"
2018-12-03 11:04:30,801 : INFO : topic #3 (0.100): 0.024*"use" + 0.014*"site" + 0.012*"user" + 0.011*"website" + 0.010*"police" + 0.010*"car" + 0.007*"content" + 0.007*"app" + 0.007*"press" + 0.007*"service"
2018-12-03 11:04:30,801 : INFO : topic #8 (0.100): 0.018*"state" + 0.017*"country" + 0.013*"government" + 0.009*"president" + 0.

2018-12-03 11:04:31,383 : INFO : PROGRESS: pass 4, at document #1200/2914
2018-12-03 11:04:31,424 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:31,441 : INFO : topic #9 (0.100): 0.009*"food" + 0.008*"water" + 0.006*"park" + 0.006*"area" + 0.005*"use" + 0.005*"open" + 0.005*"plant" + 0.004*"restaurant" + 0.004*"city" + 0.004*"place"
2018-12-03 11:04:31,442 : INFO : topic #4 (0.100): 0.018*"company" + 0.013*"market" + 0.013*"business" + 0.007*"cost" + 0.007*"price" + 0.006*"bank" + 0.006*"industry" + 0.006*"cent" + 0.006*"product" + 0.006*"rate"
2018-12-03 11:04:31,443 : INFO : topic #2 (0.100): 0.015*"people" + 0.010*"want" + 0.009*"work" + 0.009*"time" + 0.009*"show" + 0.008*"look" + 0.007*"day" + 0.006*"city" + 0.005*"really" + 0.005*"story"
2018-12-03 11:04:31,443 : INFO : topic #6 (0.100): 0.016*"court" + 0.009*"official" + 0.008*"case" + 0.008*"state" + 0.007*"indian" + 0.007*"claim" + 0.007*"member" + 0.006*"khan" + 0.006*"city" + 0.00

2018-12-03 11:04:31,945 : INFO : topic #2 (0.100): 0.014*"people" + 0.011*"time" + 0.009*"want" + 0.009*"work" + 0.009*"show" + 0.007*"day" + 0.007*"look" + 0.005*"city" + 0.005*"film" + 0.005*"really"
2018-12-03 11:04:31,946 : INFO : topic #8 (0.100): 0.016*"country" + 0.014*"state" + 0.010*"government" + 0.008*"president" + 0.007*"pakistan" + 0.007*"national" + 0.007*"nation" + 0.007*"group" + 0.007*"war" + 0.007*"leader"
2018-12-03 11:04:31,946 : INFO : topic #4 (0.100): 0.015*"company" + 0.013*"market" + 0.010*"business" + 0.007*"cost" + 0.007*"price" + 0.006*"cent" + 0.006*"product" + 0.006*"bank" + 0.005*"rate" + 0.005*"low"
2018-12-03 11:04:31,947 : INFO : topic diff=0.063773, rho=0.171147
2018-12-03 11:04:31,966 : INFO : PROGRESS: pass 4, at document #1900/2914
2018-12-03 11:04:32,006 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:32,018 : INFO : topic #4 (0.100): 0.017*"company" + 0.012*"market" + 0.010*"business" + 0.008*"price" + 

2018-12-03 11:04:32,517 : INFO : topic #6 (0.100): 0.019*"court" + 0.010*"state" + 0.008*"official" + 0.008*"issue" + 0.008*"city" + 0.007*"indian" + 0.007*"case" + 0.007*"committee" + 0.007*"member" + 0.007*"charge"
2018-12-03 11:04:32,518 : INFO : topic diff=0.075512, rho=0.171147
2018-12-03 11:04:32,536 : INFO : PROGRESS: pass 4, at document #2500/2914
2018-12-03 11:04:32,575 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 11:04:32,588 : INFO : topic #4 (0.100): 0.022*"company" + 0.011*"market" + 0.010*"business" + 0.010*"price" + 0.007*"share" + 0.006*"cost" + 0.006*"financial" + 0.006*"service" + 0.006*"growth" + 0.006*"cent"
2018-12-03 11:04:32,589 : INFO : topic #6 (0.100): 0.019*"court" + 0.010*"state" + 0.008*"city" + 0.008*"official" + 0.007*"case" + 0.007*"issue" + 0.007*"indian" + 0.007*"committee" + 0.007*"member" + 0.007*"charge"
2018-12-03 11:04:32,590 : INFO : topic #8 (0.100): 0.018*"country" + 0.017*"state" + 0.012*"government" + 

2018-12-03 11:04:32,956 : INFO : topic diff=0.104971, rho=0.171147
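
The training log above reports both a "per-word bound" and a "perplexity estimate" for each evaluated chunk. As far as we can tell, the two are related by perplexity = 2^(-bound), so the perplexity column adds no new information. The sketch below checks this on one logged line. The helper name `perplexity_from_bound` is ours for illustration and is not part of gensim. The logged bound is rounded to three decimals, so the values only match approximately.

```python
import re

# One line copied verbatim from the training log above.
log_line = (
    "2018-12-03 11:04:28,491 : INFO : -8.110 per-word bound, 276.2 perplexity "
    "estimate based on a held-out corpus of 14 documents with 2867 words"
)


def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (log base 2) into a perplexity.

    Hypothetical helper, assuming perplexity = 2 ** (-bound).
    """
    return 2.0 ** (-per_word_bound)


# Extract the two numbers from the log line.
match = re.search(r"(-?\d+\.\d+) per-word bound, (\d+\.\d+) perplexity", log_line)
bound = float(match.group(1))
logged_perplexity = float(match.group(2))

recomputed = perplexity_from_bound(bound)
print(f"bound={bound}, logged={logged_perplexity}, recomputed={recomputed:.1f}")
```

A lower perplexity (a bound closer to zero) means the model assigns a higher likelihood to the evaluated documents. Here each estimate is computed on a single chunk, so the values are noisy and are best compared across passes, not read as absolute scores.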


In [24]:
lda.print_topics()

2018-12-03 11:04:32,961 : INFO : topic #0 (0.100): 0.033*"comment" + 0.010*"post" + 0.010*"government" + 0.009*"minister" + 0.008*"time" + 0.008*"right" + 0.008*"election" + 0.007*"please" + 0.007*"community" + 0.007*"vote"
2018-12-03 11:04:32,962 : INFO : topic #1 (0.100): 0.016*"game" + 0.014*"play" + 0.013*"win" + 0.013*"team" + 0.009*"time" + 0.008*"take" + 0.008*"player" + 0.008*"goal" + 0.008*"second" + 0.008*"score"
2018-12-03 11:04:32,963 : INFO : topic #2 (0.100): 0.012*"people" + 0.011*"beer" + 0.010*"want" + 0.010*"time" + 0.009*"show" + 0.008*"day" + 0.007*"work" + 0.006*"look" + 0.006*"city" + 0.005*"happen"
2018-12-03 11:04:32,964 : INFO : topic #3 (0.100): 0.022*"use" + 0.012*"website" + 0.012*"user" + 0.012*"site" + 0.011*"content" + 0.010*"cookie" + 0.009*"press" + 0.009*"police" + 0.009*"newspaper" + 0.008*"contact"
2018-12-03 11:04:32,964 : INFO : topic #4 (0.100): 0.026*"market" + 0.018*"company" + 0.011*"report" + 0.010*"business" + 0.010*"price" + 0.007*"energy" +

[(0,
  '0.033*"comment" + 0.010*"post" + 0.010*"government" + 0.009*"minister" + 0.008*"time" + 0.008*"right" + 0.008*"election" + 0.007*"please" + 0.007*"community" + 0.007*"vote"'),
 (1,
  '0.016*"game" + 0.014*"play" + 0.013*"win" + 0.013*"team" + 0.009*"time" + 0.008*"take" + 0.008*"player" + 0.008*"goal" + 0.008*"second" + 0.008*"score"'),
 (2,
  '0.012*"people" + 0.011*"beer" + 0.010*"want" + 0.010*"time" + 0.009*"show" + 0.008*"day" + 0.007*"work" + 0.006*"look" + 0.006*"city" + 0.005*"happen"'),
 (3,
  '0.022*"use" + 0.012*"website" + 0.012*"user" + 0.012*"site" + 0.011*"content" + 0.010*"cookie" + 0.009*"press" + 0.009*"police" + 0.009*"newspaper" + 0.008*"contact"'),
 (4,
  '0.026*"market" + 0.018*"company" + 0.011*"report" + 0.010*"business" + 0.010*"price" + 0.007*"energy" + 0.007*"share" + 0.007*"investment" + 0.007*"service" + 0.006*"cost"'),
 (5,
  '0.014*"family" + 0.014*"woman" + 0.013*"child" + 0.012*"medical" + 0.010*"health" + 0.009*"police" + 0.009*"drug" + 0.008*"