# Introduction to LDA and data cleaning
In this notebook, we introduce LDA and what we need for our model. We then proceed to load and clean a sample of the NOW corpus to fulfill our needs.

## What is LDA
[Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA) is a statistical model which we will use for topic modelling/discovery. Given the list of words belonging to each text, LDA outputs the topics present in that text and their probabilities. Here, a topic is represented as a probability distribution over words, and each text/document is in turn a distribution over topics. In short, texts have an associated topic distribution and topics have a word distribution.

The image below is the plate notation for LDA, where:
* θ<sub>m</sub> is the topic distribution for document m,
* φ<sub>k</sub> is the word distribution for topic k,
* z<sub>mn</sub> is the topic for the n-th word in document m,
* w<sub>mn</sub> is the specific word,
* α is the parameter of the Dirichlet prior on the per-document topic distributions, and
* β is the parameter of the Dirichlet prior on the per-topic word distributions.

![](LDA.png)

α and β are the hyperparameters of the model. A large α means that documents are likely to be represented by a high number of topics, and a small α by only a few. The same goes for β: a high value means that topics are represented by a high number of words. The number of topics is not learned by LDA; we have to choose it ourselves, much like the number of clusters in clustering. If we allow too many topics we might end up splitting topics uselessly, and too few will make us group them unnecessarily.
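
To build some intuition for α, the sketch below draws topic distributions from a symmetric Dirichlet prior using only the Python standard library. The helper names `sample_dirichlet` and `mean_largest_share`, the seed and the α values are illustrative assumptions and not part of the project code.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics.

    Standard construction: normalise k independent Gamma(alpha, 1) draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=10, n_docs=2000, seed=0):
    """Average weight of the dominant topic over n_docs sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs)) / n_docs

# Small alpha: most of the mass sits on one topic; large alpha: mass is spread out.
for alpha in (0.1, 1.0, 10.0):
    print("alpha=%.1f  mean share of dominant topic: %.2f" % (alpha, mean_largest_share(alpha)))
```

The share of the dominant topic shrinks as α grows, which is the "documents use more topics" behaviour described above. The same reasoning applies to β and the words of a topic.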

## The NOW corpus
This notebook shows the cleaning process that will be used for the ADA project. Here, only a sample of the data is used (from [here](https://www.corpusdata.org/now_corpus.asp)), but the methods should be the same once scaled to the full database available on the cluster.

The NOW database is composed of billions of words from online newspapers and magazines from 20 different countries. The data we downloaded comes in different files which can be used together or independently. These files are:

1. **now-samples-lexicon.txt**: this is the full dictionary of the English language, a lexicon. It contains four columns: `wID`, the word id; `word`, the actual word; `lemma`, the base form of the word (e.g. if the word is "walked", the lemma is "walk"); and `PoS`, the part of speech.
2. **now-samples-sources.txt**: this is the source of every text. In order, it contains the text id, the number of words, the date, the country, the website, the url and the title of the article.
3. **text.txt**: this file has the complete texts of the articles. The first column is the `textID` in the format @@textID, and the second column is the full text, complete with html paragraphs and headers. It is important to note that to prevent plagiarism, every 200 words, 10 words are replaced by the string "@ @ @ @ @ @ @ @ @ @". Contractions are also split (for example, "can't" is written as "ca n't") and punctuation is surrounded by spaces.
4. **wordLemPoS.txt**: finally, this file contains the `word`, `lemma` and `PoS` for each word in the texts, one by one, so one could read the texts by reading down the columns. Along with that come the `textID` of the text the word belongs to and an `ID (seq)`, which is a unique identifier for each word in the database. Each time a word is added this number is incremented.

## What we need from the NOW corpus for LDA
The model will take two inputs: a matrix with all the important words for each text, and a list of all the important words. By important, we mean the words which will give us good topic modelling. For example, names, locations and simple words like "but", "I" or "and" will not give meaningful results and are quite common in English (so-called stopwords). Other common words present in our database should be removed too. We should also use lemmas instead of words.

Therefore, the file `wordLemPoS.txt` (henceforth referred to as wlp) is the most important here, as it lists all the lemmas with their associated `textID`. With it we can list all the lemmas, remove those we do not want in our word list, and also group them by text to create our text-word matrix.

We will also need `now-samples-sources.txt` (henceforth referred to as sources) to link the texts with the information we deem useful, for example country, date or website.

These are thus the two files we will import and process here with the sample data, and also those we will use with the data on the cluster.
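
As a rough illustration of the two model inputs, the sketch below builds a vocabulary and a text-by-word count matrix from a handful of `(textID, lemma)` pairs in plain Python. The toy pairs and the helper name `build_inputs` are made up for the example; the notebook itself does this work with Spark.

```python
from collections import Counter, defaultdict

def build_inputs(pairs):
    """Return (vocabulary, text_ids, matrix) from an iterable of (textID, lemma) pairs.

    matrix[i][j] is the number of times vocabulary[j] occurs in text_ids[i].
    """
    counts = defaultdict(Counter)
    for text_id, lemma in pairs:
        counts[text_id][lemma] += 1

    vocabulary = sorted({lemma for c in counts.values() for lemma in c})
    text_ids = sorted(counts)
    matrix = [[counts[t][w] for w in vocabulary] for t in text_ids]
    return vocabulary, text_ids, matrix

# Toy data, invented for illustration only
pairs = [(11241, "film"), (11241, "writer"), (11241, "film"),
         (11242, "dialect"), (11242, "writer")]
vocabulary, text_ids, matrix = build_inputs(pairs)
print(vocabulary)  # ['dialect', 'film', 'writer']
print(matrix)      # [[0, 2, 1], [1, 0, 1]]
```

Each row of the matrix is one text and each column one lemma of the vocabulary, which is the shape of input a bag-of-words model such as LDA works with.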

## Cleaning

In [171]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyspark

import re
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import DateType

from scipy.sparse import csr_matrix
from pyspark.ml.feature import *

from pyspark.sql.types import *

from pyspark.mllib.clustering import LDA, LDAModel

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Wlp processing


The goal of this part is to extract the useful data from the wlp text files, since they contain all the words of all the articles and the lemmas to replace them with.

In [3]:
#first read the text file
wlp_rdd = sc.textFile('sample_data/wordLem_poS.txt')

In [4]:
#the first 3 lines are header lines
header = wlp_rdd.take(3)

#so let's remove those header lines
noheaders = wlp_rdd.filter(lambda r: r not in header)

In [5]:
#we split the elements separated by tabs
lines = noheaders.map(lambda r: r.split('\t'))

#identify the columns
wlp_schema = lines.map(lambda r: Row(textID=int(r[0]),idseq=int(r[1]),word=r[2],lemma=r[3],pos=r[4]))
wlp = spark.createDataFrame(wlp_schema)
wlp.show(5)

+----------+------+----+------+-------+
|     idseq| lemma| pos|textID|   word|
+----------+------+----+------+-------+
|1095362496|      |  fo| 11241|@@11241|
|1095362497|      |null| 11241|    <p>|
|1095362498|   sol| np1| 11241|    Sol|
|1095362499|yurick| np1| 11241| Yurick|
|1095362500|      |   ,| 11241|      ,|
+----------+------+----+------+-------+
only showing top 5 rows



### Word selection
It is very important to select the right words and the right number of them. The concept of "garbage in, garbage out" has never been more true than with LDA. When we analyse a text we focus on certain words to extract its meaning and topic. The same is true here, since words like "if" and "for", numbers and common names are not that useful.

Here, we provide an example of the process we will go through. However, this is not really a data cleaning step, as it will directly influence our model; it is more of a model preprocessing step. We will surely go through many iterations of this next part for our model to give the best results.

First of all, we can remove all the words whose PoS does not interest us, for example numbers (`mc`, `mc1`, `m#`) or punctuation (`.`, `'`).

Details:
1. `.`, `,`, `'` and `"` are punctuation marks
2. `null` marks html tags from the websites
3. `mc`, `mc1` and `m#` are various kinds of numbers
4. `fo` marks the text ids and other useless beginnings of texts

In [11]:
pos_remove = ['.',',',"\'",'\"','null','mc','mc1','m#','fo']
wlp_nopos = wlp.filter(~wlp['pos'].isin(pos_remove)).drop('idseq','pos','word')

Now, we load our list of stopwords, the words that we are not going to use in LDA as they are too common or are common names. We can also remove the rows with no lemmas or those with lemmas that don't make sense or are not common enough.

In [6]:
#np.save('our_stopwords',stopwords)
stopwords = np.load('our_stopwords.npy').tolist()
len(stopwords)

5639

In [7]:
#filter out stopwords and look at the frequency of the remaining lemmas
wlp_nostop = wlp_nopos.filter(~wlp_nopos['lemma'].isin(stopwords))
lemma_freq = wlp_nostop.groupBy('lemma').count().sort('count', ascending=False)
lemma_freq.show()

+----------+-----+
|     lemma|count|
+----------+-----+
|      year| 4272|
|      time| 3169|
|    people| 2913|
|      take| 2667|
|       use| 2244|
|      work| 2137|
|       day| 1819|
|     state| 1713|
|   company| 1698|
|   comment| 1667|
|      need| 1654|
|      want| 1579|
|      look| 1564|
|     world| 1553|
|government| 1551|
|      give| 1480|
|      show| 1480|
|   country| 1465|
|      find| 1464|
|     right| 1408|
+----------+-----+
only showing top 20 rows



We will also remove the most common and least common lemmas. These will be useless since they won't provide enough information for our LDA analysis. In the filtering cell further below, we drop the lemmas whose count lies at or above the 99th percentile and those that appear 5 times or fewer.
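
The idea can be sketched in plain Python on a toy frequency table. The function names, the simple index-based percentile rule and the toy counts are assumptions made for illustration; Spark's `approxQuantile`, used further down, computes approximate quantiles and may give slightly different cut-offs.

```python
from collections import Counter

def percentile(sorted_values, q):
    """Simple index-based percentile of an ascending list, with q in [0, 1]."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]

def keep_by_frequency(lemmas, low_q, high_q):
    """Keep lemmas whose count lies strictly between the two percentile cut-offs."""
    freq = Counter(lemmas)
    ordered = sorted(freq.values())
    low, high = percentile(ordered, low_q), percentile(ordered, high_q)
    return {lemma for lemma, n in freq.items() if low < n < high}

# Toy corpus: 'year' is very frequent, 'croissant' is very rare
lemmas = ["year"] * 50 + ["film"] * 8 + ["writer"] * 6 + ["dough"] * 4 + ["croissant"]
print(sorted(keep_by_frequency(lemmas, 0.1, 0.9)))  # ['dough', 'film', 'writer']
```

Both extremes are dropped: the very frequent lemma says little about any specific topic, and the very rare one appears in too few texts to be associated with a topic reliably.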

In [21]:
wlp_nostop.count()

712608

In [74]:
test = wlp_nostop.groupBy('lemma').count().sort('count',ascending=False).drop('count')

In [75]:
test2 = wlp_bytext

In [76]:
test.show(5)

+-----+
|lemma|
+-----+
|   's|
|    '|
| year|
|  n't|
| time|
+-----+
only showing top 5 rows



In [66]:
test2.show(5)

+------+--------------------+
|textID|               lemma|
+------+--------------------+
| 11241|[yurick, writer, ...|
| 11242|[dialect, society...|
| 11243|[sublime, croissa...|
| 11244|[reflect, quarter...|
| 21242|[ars, facebook, c...|
+------+--------------------+
only showing top 5 rows



In [70]:
cv = CountVectorizer(inputCol="lemma", outputCol="vectors")
model = cv.fit(test2)
model.transform(test2).show()

+------+--------------------+--------------------+
|textID|               lemma|             vectors|
+------+--------------------+--------------------+
| 11241|[yurick, writer, ...|(40843,[0,1,2,8,1...|
| 11242|[dialect, society...|(40843,[0,2,3,4,5...|
| 11243|[sublime, croissa...|(40843,[0,1,2,3,4...|
| 11244|[reflect, quarter...|(40843,[0,1,2,3,4...|
| 21242|[ars, facebook, c...|(40843,[0,2,3,4,5...|
| 21243|[york, associate,...|(40843,[0,1,2,7,8...|
| 31240|[ireland, 's, oly...|(40843,[0,1,2,3,4...|
| 31241|[launch, online, ...|(40843,[0,1,2,6,1...|
| 31242|[entrepreneur, po...|(40843,[0,10,12,4...|
| 41240|[syrian, woman, o...|(40843,[0,3,4,5,6...|
| 41241|[published, medic...|(40843,[0,5,7,8,1...|
| 41244|[bay, bridge, jar...|(40843,[0,1,2,4,5...|
| 51243|[mpaa, lobby, arm...|(40843,[0,3,4,5,7...|
| 61240|[mum, 's, fight, ...|(40843,[0,1,2,5,1...|
| 61242|[investigate, cas...|(40843,[0,2,3,4,5...|
| 71240|[north, 's, popul...|(40843,[0,1,2,13,...|
| 71241|[fergusson, air, ...|(4

In [79]:
df = model.transform(test2).select('vectors')

In [81]:
df.show(5)

+--------------------+
|             vectors|
+--------------------+
|(40843,[0,1,2,8,1...|
|(40843,[0,2,3,4,5...|
|(40843,[0,1,2,3,4...|
|(40843,[0,1,2,3,4...|
|(40843,[0,2,3,4,5...|
+--------------------+
only showing top 5 rows



In [82]:
df.printSchema()

root
 |-- vectors: vector (nullable = true)



In [88]:
#DataFrames have no map method: go through the underlying RDD
df.rdd.map(lambda x: x.vectors.toArray())

In [174]:
def vect2Array(vector):
    #convert a Spark vector of counts into a plain list of ints
    return [int(v) for v in vector.toArray()]

udfvect2Array = udf(vect2Array, ArrayType(IntegerType()))
word_count = df.withColumn("iterate", udfvect2Array("vectors")).drop("vectors")

In [175]:
word_count.show()


+--------------------+
|             iterate|
+--------------------+
|[8, 2, 4, 0, 0, 0...|
|[3, 0, 12, 3, 1, ...|
|[2, 1, 2, 2, 2, 0...|
|[10, 12, 5, 7, 5,...|
|[6, 0, 1, 1, 2, 3...|
|[6, 1, 2, 0, 0, 0...|
|[10, 1, 1, 4, 7, ...|
|[2, 24, 2, 0, 0, ...|
|[1, 0, 0, 0, 0, 0...|
|[3, 0, 0, 2, 1, 1...|
|[1, 0, 0, 0, 0, 1...|
|[6, 2, 2, 0, 1, 2...|
|[4, 0, 0, 2, 1, 1...|
|[4, 1, 1, 0, 0, 2...|
|[6, 0, 2, 3, 1, 1...|
|[11, 7, 1, 0, 0, ...|
|[2, 0, 1, 0, 0, 0...|
|[8, 0, 1, 1, 3, 0...|
|[9, 4, 0, 2, 2, 1...|
|[4, 4, 3, 0, 2, 1...|
+--------------------+
only showing top 20 rows



In [169]:
#Vectors was never imported, which caused "NameError: name 'Vectors' is not defined"
from pyspark.mllib.linalg import Vectors

#unwrap the count list of each row, index the documents and build [doc_id, vector] pairs
word_count = word_count.rdd.map(lambda row: row['iterate'])\
                       .zipWithIndex()\
                       .map(lambda x: [x[1], Vectors.dense(x[0])]).cache()

In [173]:
word_count.foreach(print)

In [172]:
ldaModel = LDA.train(word_count, k=10)


In [94]:
df.show(5)

+--------------------+
|             vectors|
+--------------------+
|(40843,[0,1,2,8,1...|
|(40843,[0,2,3,4,5...|
|(40843,[0,1,2,3,4...|
|(40843,[0,1,2,3,4...|
|(40843,[0,2,3,4,5...|
+--------------------+
only showing top 5 rows



In [37]:
test=csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])


In [38]:
#a SciPy matrix has no zipWithIndex: distribute its rows as an RDD first
from pyspark.mllib.linalg import Vectors
corpus = sc.parallelize(test.toarray().tolist()).zipWithIndex().map(lambda x: [x[1], Vectors.dense(x[0])]).cache()

In [None]:
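def storeMatrix(wlp_bytext, wlp_nostop):
    #collects everything on the driver, so only suitable for the sample data
    #vocabulary: the distinct lemmas
    all_words = [r['lemma'] for r in wlp_nostop.select('lemma').distinct().collect()]
    #documents: one list of lemmas per text (second column of wlp_bytext)
    all_descriptions = [r[1] for r in wlp_bytext.collect()]

    #mappings between a lemma and its line in the matrix
    word2line = {word: index for index, word in enumerate(all_words)}
    line2word = {index: word for word, index in word2line.items()}
    matrix = np.zeros((len(all_words), len(all_descriptions)))

    #count the occurrences of each lemma in each document
    for index_doc, words_list in enumerate(all_descriptions):
        for word in words_list:
            if word in word2line:
                matrix[word2line[word], index_doc] += 1

    return matrix, word2line, line2word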

<div class="alert alert-success">
We may want to switch the bottom filter from a percentile to an absolute count, depending on which one is the harshest (here it is the absolute count).
</div>

In [8]:
#compute the 10th and 99th percentiles of the lemma counts
[bottom,top] = lemma_freq.approxQuantile('count', [0.1,0.99], 0.01)
#override the bottom percentile with an absolute count threshold
bottom = 5
#keep the lemmas whose count lies strictly between the two thresholds
lemma_tokeep = lemma_freq.filter(lemma_freq['count']>bottom).filter(lemma_freq['count']<top)
print('Percentage of lemmas left: %.2f'%(lemma_tokeep.count()/lemma_freq.count()*100))

Percentage of lemmas left: 28.47


By making an inner join, we keep only the words which are in both lists. In the end, we can group the lemmas by text to create our text-word matrix.
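
In plain Python terms, the join acts as a filter on the lemma column and the grouping collects what is left for each text. The sketch below mimics this on toy rows; the data and the function name `join_and_group` are hypothetical, it assumes each lemma appears only once in the list of lemmas to keep, and the real work is done by the Spark SQL query that follows.

```python
from collections import defaultdict

def join_and_group(rows, lemmas_to_keep):
    """Keep (lemma, textID) rows whose lemma is in lemmas_to_keep, then group by textID."""
    keep = set(lemmas_to_keep)
    by_text = defaultdict(list)
    for lemma, text_id in rows:
        if lemma in keep:  # inner join on lemma: unmatched rows are dropped
            by_text[text_id].append(lemma)
    return dict(sorted(by_text.items()))

# Toy rows, invented for illustration only
rows = [("film", 11241), ("year", 11241), ("film", 11241), ("dough", 11243)]
print(join_and_group(rows, ["film", "dough"]))
# {11241: ['film', 'film'], 11243: ['dough']}
```

Repeated lemmas are kept as repeats within a text, since their counts are what the text-word matrix is built from.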

In [9]:
#register both DataFrames as temporary views, then perform the sql query and inner join
wlp_nostop.createOrReplaceTempView('wlp_nostop')
lemma_tokeep.createOrReplaceTempView('lemma_tokeep')

query = """
SELECT wlp_nostop.lemma, wlp_nostop.textID
FROM wlp_nostop
INNER JOIN lemma_tokeep ON wlp_nostop.lemma = lemma_tokeep.lemma
"""

wlp_kept = spark.sql(query)
wlp_bytext = wlp_kept.groupBy('textID').agg(collect_list('lemma'))\
                    .sort('textID')\
                    .withColumnRenamed('collect_list(lemma)','document lemmas')
wlp_bytext.show()

+------+--------------------+
|textID|     document lemmas|
+------+--------------------+
| 11241|[1970s, film, fil...|
| 11242|[online, happen, ...|
| 11243|[dough, dough, do...|
| 11244|[trail, launch, o...|
| 21242|[online, launch, ...|
| 21243|[recognize, indic...|
| 31240|[recognize, inten...|
| 31241|[online, online, ...|
| 31242|[settlement, sett...|
| 41240|[explain, hometow...|
| 41241|[scale, lack, pre...|
| 41244|[everyday, trail,...|
| 51243|[australia, austr...|
| 61240|[frustrate, inten...|
| 61242|[editor-in-chief,...|
| 71240|[indicator, requi...|
| 71241|[likelihood, requ...|
| 71242|[1970s, character...|
| 71243|[online, staff, s...|
| 71244|[bone, archaeolog...|
+------+--------------------+
only showing top 20 rows
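
As a sanity check on what the join and the grouping do, the sketch below reproduces both steps in plain Python on a few made-up rows. The names `rows`, `tokeep` and `by_text` are illustrative stand-ins for `wlp_nostop`, `lemma_tokeep` and `wlp_bytext`, not objects from the notebook.

```python
from collections import defaultdict

# (lemma, textID) pairs standing in for wlp_nostop (made-up rows)
rows = [("film", 11241), ("zzz", 11241), ("online", 11242),
        ("film", 11242), ("online", 11242)]
# lemmas that survived the frequency filter (stand-in for lemma_tokeep)
tokeep = {"film", "online"}

# inner join: a row survives only if its lemma is present on both sides
kept_rows = [(lemma, text_id) for lemma, text_id in rows if lemma in tokeep]

# group by textID and collect the lemmas of each text into a list
by_text = defaultdict(list)
for lemma, text_id in kept_rows:
    by_text[text_id].append(lemma)

for text_id in sorted(by_text):
    print(text_id, by_text[text_id])
```

Repeated lemmas are kept on purpose, since their counts are what the bag-of-words representation relies on. Unlike this sketch, Spark's `collect_list` does not guarantee the order of the collected lemmas, which is harmless for a bag-of-words model.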



## Sources
This file contains additional information about each text: word count, date, country, website, URL and title.

In [10]:
sources_rdd = sc.textFile('sample_data/now-samples-sources.txt')\
                .map(lambda r: r.split('\t'))

#the first three lines are treated as header lines and dropped
header = sources_rdd.take(3)
sources_rdd = sources_rdd.filter(lambda l: l not in header)

In [11]:
#create schema and change data type for date
sources_schema = sources_rdd.map(lambda r: Row(textID=int(r[0]), nwords=int(r[1]), date=r[2], country=r[3],
                                               website=r[4], url=r[5], title=r[6]))
sources = spark.createDataFrame(sources_schema)
sources = sources.withColumn('date',to_date(sources.date, 'yy-MM-dd'))
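
The `'yy-MM-dd'` pattern implies that the raw dates carry a two-digit year. The short sketch below shows the same conversion with the standard library, assuming a raw value of the form `13-01-06` (an assumption inferred from the pattern and from the parsed dates displayed by `sources.show`).

```python
from datetime import datetime

raw_date = "13-01-06"   # assumed raw form of the date field (two-digit year)

# %y-%m-%d is the strptime counterpart of the 'yy-MM-dd' pattern used above
parsed = datetime.strptime(raw_date, "%y-%m-%d").date()

print(parsed.isoformat())
```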

In [12]:
sources.printSchema()

root
 |-- country: string (nullable = true)
 |-- date: date (nullable = true)
 |-- nwords: long (nullable = true)
 |-- textID: long (nullable = true)
 |-- title: string (nullable = true)
 |-- url: string (nullable = true)
 |-- website: string (nullable = true)



In [13]:
sources.show(5)

+-------+----------+------+------+--------------------+--------------------+-------------------+
|country|      date|nwords|textID|               title|                 url|            website|
+-------+----------+------+------+--------------------+--------------------+-------------------+
|     US|2013-01-06|   397| 11241|Author of The War...|http://kotaku.com...|             Kotaku|
|     US|2013-01-06|   757| 11242|That's What They ...|http://michiganra...|     Michigan Radio|
|     US|2013-01-06|   755| 11243|Best of New York:...|http://www.nydail...|New York Daily News|
|     US|2013-01-06|  1677| 11244|Reflecting on a q...|http://www.oregon...|     OregonLive.com|
|     US|2013-01-11|   794| 21242|Ask Ars: Does Fac...|http://arstechnic...|       Ars Technica|
+-------+----------+------+------+--------------------+--------------------+-------------------+
only showing top 5 rows



## First tries with LDA
We first try Spark's LDA implementation, with gensim handling the corpus processing.

### Spark and gensim mixed

In [14]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora
from gensim.matutils import corpus2dense
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

2018-12-03 22:12:44,799 : INFO : 'pattern' package not found; tag filters are not available for English


In [15]:
#create the dictionary from gensim, each lemma will be assigned a number
dictionary = corpora.Dictionary(line for line in wlp_bytext.rdd.map(lambda r: r[1]).collect())

2018-12-03 22:13:12,557 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-12-03 22:13:13,027 : INFO : built Dictionary(11399 unique tokens: ['1970s', 'adapt', 'adaptation', 'advertisement', 'afternoon']...) from 2914 documents (total 631610 corpus positions)


In [16]:
#the dictionary object also stores some useful information
print('Number of documents in corpus: \t', dictionary.num_docs)
print('Number of words in corpus: \t', dictionary.num_pos)
print('Number of tokens in dictionary: ', len(dictionary.token2id))

Number of documents in corpus: 	 2914
Number of words in corpus: 	 631610
Number of tokens in dictionary:  11399


In [17]:
#class that makes the gensim corpus object,
#for now this is the only way I found to go from sparse to dense vector form (using the gensim corpus2dense function)
class MyCorpus(object):
    def __iter__(self):
        for line in wlp_bytext.rdd.map(lambda r: r[1]).collect():
            yield dictionary.doc2bow(line)

In [19]:
#create the corpus and turn it into a format that spark will like
corpus = MyCorpus()
#changing from sparse to dense representation
data = sc.parallelize(corpus2dense(corpus,num_terms=len(dictionary.token2id),num_docs=dictionary.num_docs).T)
#not sure this is entirely necessary but the data is transformed into spark dense vectors (maybe faster)
parsedData = data.map(lambda line: Vectors.dense(line))
#index documents with unique IDs
corpus_rdd = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
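
The sparse-to-dense conversion above can be pictured with the following plain-Python sketch. It mimics the roles of `doc2bow`, `corpus2dense` and `zipWithIndex` on a toy vocabulary; the helper names `to_bow` and `to_dense` are hypothetical and are not gensim or Spark functions.

```python
# token -> integer id, in the spirit of dictionary.token2id (toy vocabulary)
token2id = {"film": 0, "online": 1, "dough": 2}

docs = [["film", "film", "online"], ["dough", "dough", "dough"]]

def to_bow(doc):
    """Sparse form: sorted list of (token_id, count) pairs."""
    counts = {}
    for token in doc:
        token_id = token2id[token]
        counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())

def to_dense(bow, num_terms):
    """Dense form: one slot per vocabulary entry, zero where a token is absent."""
    vector = [0.0] * num_terms
    for token_id, count in bow:
        vector[token_id] = float(count)
    return vector

bows = [to_bow(doc) for doc in docs]
dense = [to_dense(bow, len(token2id)) for bow in bows]

# pair every vector with a document index, as zipWithIndex does above
indexed = [[index, vector] for index, vector in enumerate(dense)]
print(indexed)
```

The dense form stores one value per vocabulary entry for every document, so its size grows with the number of documents times the number of tokens in the dictionary.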

In [None]:
#train model; this crashes here although it should work, probably just because of a lack of resources
ldaModel = LDA.train(corpus_rdd, k=10)

#output topics; sadly there are no strings here, so the indices need to be mapped back through the dictionary (even harder without gensim)
'''print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize())
      + " words):")
topics = ldaModel.topicsMatrix()
for topic in range(10):
    print("Topic " + str(topic) + ":")
    for word in range(0, ldaModel.vocabSize()):
        print(" " + str(topics[word][topic]))'''
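
Since the topics matrix only contains weights indexed by vocabulary position, the lemmas have to be recovered through an id-to-token lookup. The sketch below shows one way this could look on a made-up matrix; `id2token`, `topics` and `top_words` are illustrative names, and the matrix is indexed as `topics[word][topic]` like in the cell above.

```python
# id -> lemma lookup, playing the role of the gensim dictionary (toy vocabulary)
id2token = {0: "game", 1: "team", 2: "market", 3: "company"}

# made-up vocabSize x k matrix, indexed as topics[word][topic]
topics = [
    [0.50, 0.05],
    [0.40, 0.05],
    [0.05, 0.50],
    [0.05, 0.40],
]

def top_words(topics, topic, n=2):
    """Return the n highest-weighted (lemma, weight) pairs for one topic."""
    column = [(id2token[word], row[topic]) for word, row in enumerate(topics)]
    return sorted(column, key=lambda pair: pair[1], reverse=True)[:n]

for topic in range(2):
    print("Topic", topic, top_words(topics, topic))
```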

### Full gensim
Same thing, but done entirely with gensim, which is very practical and concise. The word selection could even be done here, see the dictionary methods `filter_extremes` and `filter_n_most_frequent` [here](https://radimrehurek.com/gensim/corpora/dictionary.html).
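
As a rough picture of what a document-frequency filter does, the sketch below drops lemmas that appear in too few or too many documents. It is a plain-Python illustration in the spirit of `filter_extremes`, not gensim's implementation; the thresholds `no_below` and `no_above` are hypothetical values chosen for the toy data.

```python
docs = [
    ["game", "team", "the"],
    ["market", "company", "the"],
    ["game", "market", "the"],
    ["game", "rare", "the"],
]

no_below = 2     # hypothetical: keep lemmas found in at least 2 documents
no_above = 0.8   # hypothetical: drop lemmas found in more than 80% of documents

# document frequency: number of documents containing each lemma at least once
doc_freq = {}
for doc in docs:
    for lemma in set(doc):
        doc_freq[lemma] = doc_freq.get(lemma, 0) + 1

kept = sorted(lemma for lemma, df in doc_freq.items()
              if df >= no_below and df / len(docs) <= no_above)
print(kept)
```

Note that this filter counts documents, whereas the percentile filter used earlier in this notebook counts total occurrences, so the two would not necessarily keep the same lemmas.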

In [20]:
from gensim.models.ldamodel import LdaModel

In [21]:
corpus = MyCorpus()
ldag = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1, chunksize=100, passes=5)

2018-12-03 22:16:39,166 : INFO : using symmetric alpha at 0.1
2018-12-03 22:16:39,167 : INFO : using symmetric eta at 0.1
2018-12-03 22:16:39,172 : INFO : using serial LDA version on this node
2018-12-03 22:16:41,611 : INFO : running online (multi-pass) LDA training, 10 topics, 5 passes over the supplied corpus of 2914 documents, updating model once every 100 documents, evaluating perplexity every 1000 documents, iterating 50x with a convergence threshold of 0.001000
2018-12-03 22:16:43,489 : INFO : PROGRESS: pass 0, at document #100/2914
2018-12-03 22:16:43,601 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:16:43,633 : INFO : topic #8 (0.100): 0.012*"use" + 0.008*"interest" + 0.007*"site" + 0.005*"mission" + 0.005*"work" + 0.005*"rate" + 0.005*"people" + 0.004*"user" + 0.004*"court" + 0.004*"time"
2018-12-03 22:16:43,637 : INFO : topic #6 (0.100): 0.006*"look" + 0.005*"use" + 0.005*"work" + 0.005*"take" + 0.004*"find" + 0.004*"company" + 0.004

2018-12-03 22:16:44,303 : INFO : topic diff=0.825588, rho=0.408248
2018-12-03 22:16:44,322 : INFO : PROGRESS: pass 0, at document #700/2914
2018-12-03 22:16:44,409 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:16:44,424 : INFO : topic #9 (0.100): 0.008*"play" + 0.008*"take" + 0.007*"time" + 0.007*"want" + 0.007*"show" + 0.007*"team" + 0.006*"game" + 0.006*"really" + 0.005*"city" + 0.005*"work"
2018-12-03 22:16:44,425 : INFO : topic #7 (0.100): 0.007*"area" + 0.006*"brand" + 0.006*"share" + 0.005*"comment" + 0.005*"people" + 0.005*"product" + 0.005*"time" + 0.005*"child" + 0.005*"company" + 0.004*"mayor"
2018-12-03 22:16:44,426 : INFO : topic #6 (0.100): 0.006*"country" + 0.005*"power" + 0.005*"work" + 0.005*"number" + 0.005*"state" + 0.005*"child" + 0.005*"look" + 0.004*"percent" + 0.004*"use" + 0.004*"take"
2018-12-03 22:16:44,427 : INFO : topic #3 (0.100): 0.010*"state" + 0.006*"take" + 0.006*"governor" + 0.006*"member" + 0.005*"film" + 0.00

2018-12-03 22:16:45,245 : INFO : topic #7 (0.100): 0.012*"*" + 0.007*"comment" + 0.006*"share" + 0.005*"product" + 0.005*"sept" + 0.005*"area" + 0.005*"people" + 0.004*"local" + 0.004*"consumer" + 0.004*"campaign"
2018-12-03 22:16:45,247 : INFO : topic #9 (0.100): 0.013*"game" + 0.010*"time" + 0.009*"team" + 0.009*"play" + 0.008*"take" + 0.007*"show" + 0.007*"want" + 0.007*"look" + 0.006*"win" + 0.006*"player"
2018-12-03 22:16:45,248 : INFO : topic #4 (0.100): 0.011*"company" + 0.011*"business" + 0.007*"time" + 0.007*"market" + 0.007*"comment" + 0.006*"technology" + 0.006*"share" + 0.006*"delhi" + 0.005*"use" + 0.005*"web"
2018-12-03 22:16:45,250 : INFO : topic #3 (0.100): 0.007*"woman" + 0.007*"take" + 0.006*"state" + 0.006*"people" + 0.005*"war" + 0.005*"member" + 0.005*"force" + 0.005*"country" + 0.004*"attack" + 0.004*"kill"
2018-12-03 22:16:45,252 : INFO : topic #0 (0.100): 0.008*"world" + 0.005*"book" + 0.005*"change" + 0.005*"use" + 0.005*"university" + 0.004*"write" + 0.004*"pe

2018-12-03 22:16:45,867 : INFO : topic #2 (0.100): 0.015*"police" + 0.010*"people" + 0.008*"comment" + 0.008*"health" + 0.008*"family" + 0.007*"find" + 0.006*"life" + 0.006*"use" + 0.006*"child" + 0.005*"hospital"
2018-12-03 22:16:45,867 : INFO : topic #8 (0.100): 0.013*"government" + 0.009*"minister" + 0.008*"party" + 0.006*"comment" + 0.006*"court" + 0.006*"pay" + 0.006*"right" + 0.006*"public" + 0.006*"state" + 0.005*"use"
2018-12-03 22:16:45,868 : INFO : topic diff=0.298404, rho=0.229416
2018-12-03 22:16:46,000 : INFO : -8.543 per-word bound, 372.9 perplexity estimate based on a held-out corpus of 100 documents with 19093 words
2018-12-03 22:16:46,001 : INFO : PROGRESS: pass 0, at document #2000/2914
2018-12-03 22:16:46,059 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:16:46,074 : INFO : topic #9 (0.100): 0.011*"game" + 0.011*"play" + 0.011*"time" + 0.009*"team" + 0.009*"win" + 0.008*"show" + 0.007*"take" + 0.006*"look" + 0.006*"want" + 0.

2018-12-03 22:16:46,520 : INFO : topic diff=0.215177, rho=0.200000
2018-12-03 22:16:46,535 : INFO : PROGRESS: pass 0, at document #2600/2914
2018-12-03 22:16:46,601 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:16:46,614 : INFO : topic #6 (0.100): 0.007*"tax" + 0.007*"work" + 0.007*"need" + 0.006*"water" + 0.006*"number" + 0.005*"child" + 0.005*"area" + 0.005*"plan" + 0.005*"people" + 0.005*"power"
2018-12-03 22:16:46,615 : INFO : topic #9 (0.100): 0.013*"game" + 0.011*"play" + 0.011*"time" + 0.010*"team" + 0.010*"win" + 0.007*"look" + 0.007*"take" + 0.006*"show" + 0.006*"player" + 0.006*"want"
2018-12-03 22:16:46,616 : INFO : topic #0 (0.100): 0.010*"world" + 0.008*"book" + 0.006*"university" + 0.006*"change" + 0.005*"dr" + 0.004*"professor" + 0.004*"republican" + 0.004*"people" + 0.004*"use" + 0.004*"write"
2018-12-03 22:16:46,617 : INFO : topic #7 (0.100): 0.010*"*" + 0.009*"local" + 0.007*"kashmir" + 0.007*"editor" + 0.006*"area" + 0.006*"

2018-12-03 22:16:49,049 : INFO : topic #5 (0.100): 0.015*"school" + 0.011*"student" + 0.009*"city" + 0.008*"community" + 0.006*"day" + 0.006*"space" + 0.006*"open" + 0.006*"people" + 0.005*"time" + 0.005*"education"
2018-12-03 22:16:49,050 : INFO : topic #9 (0.100): 0.013*"game" + 0.012*"play" + 0.011*"time" + 0.010*"team" + 0.009*"win" + 0.009*"score" + 0.008*"take" + 0.008*"look" + 0.006*"want" + 0.006*"big"
2018-12-03 22:16:49,051 : INFO : topic #0 (0.100): 0.010*"book" + 0.010*"world" + 0.006*"change" + 0.006*"dr" + 0.005*"supplement" + 0.005*"use" + 0.005*"human" + 0.005*"university" + 0.005*"write" + 0.004*"people"
2018-12-03 22:16:49,051 : INFO : topic #7 (0.100): 0.022*"beer" + 0.008*"*" + 0.008*"newspaper" + 0.007*"website" + 0.007*"kashmir" + 0.007*"use" + 0.006*"visit" + 0.006*"local" + 0.006*"editor" + 0.006*"area"
2018-12-03 22:16:49,052 : INFO : topic #3 (0.100): 0.009*"president" + 0.009*"state" + 0.007*"country" + 0.007*"force" + 0.006*"take" + 0.006*"attack" + 0.006*"p

2018-12-03 22:16:49,673 : INFO : topic #0 (0.100): 0.010*"world" + 0.009*"book" + 0.008*"dr" + 0.007*"university" + 0.007*"professor" + 0.006*"write" + 0.006*"history" + 0.005*"people" + 0.005*"use" + 0.005*"change"
2018-12-03 22:16:49,674 : INFO : topic #7 (0.100): 0.010*"brand" + 0.008*"product" + 0.007*"beer" + 0.007*"local" + 0.007*"website" + 0.006*"online" + 0.006*"editor" + 0.006*"area" + 0.006*"use" + 0.006*"site"
2018-12-03 22:16:49,675 : INFO : topic #4 (0.100): 0.021*"company" + 0.011*"business" + 0.010*"market" + 0.009*"share" + 0.007*"time" + 0.006*"service" + 0.006*"price" + 0.006*"technology" + 0.006*"industry" + 0.005*"investment"
2018-12-03 22:16:49,675 : INFO : topic diff=0.147317, rho=0.179201
2018-12-03 22:16:49,693 : INFO : PROGRESS: pass 1, at document #900/2914
2018-12-03 22:16:49,753 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:16:49,767 : INFO : topic #8 (0.100): 0.012*"government" + 0.010*"comment" + 0.009*"state" + 

2018-12-03 22:16:50,400 : INFO : topic #4 (0.100): 0.017*"company" + 0.014*"business" + 0.008*"market" + 0.008*"technology" + 0.008*"share" + 0.008*"time" + 0.005*"cent" + 0.005*"price" + 0.005*"service" + 0.005*"industry"
2018-12-03 22:16:50,402 : INFO : topic #2 (0.100): 0.013*"police" + 0.012*"family" + 0.010*"child" + 0.009*"health" + 0.009*"life" + 0.008*"people" + 0.008*"comment" + 0.008*"drug" + 0.007*"woman" + 0.007*"use"
2018-12-03 22:16:50,404 : INFO : topic diff=0.139187, rho=0.179201
2018-12-03 22:16:50,427 : INFO : PROGRESS: pass 1, at document #1500/2914
2018-12-03 22:16:50,487 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:16:50,502 : INFO : topic #6 (0.100): 0.008*"need" + 0.008*"child" + 0.008*"work" + 0.007*"plan" + 0.007*"cost" + 0.006*"people" + 0.006*"number" + 0.006*"percent" + 0.005*"area" + 0.005*"water"
2018-12-03 22:16:50,503 : INFO : topic #7 (0.100): 0.016*"*" + 0.009*"food" + 0.008*"use" + 0.007*"dog" + 0.007*"onlin

2018-12-03 22:16:51,025 : INFO : topic diff=0.113808, rho=0.179201
2018-12-03 22:16:51,042 : INFO : PROGRESS: pass 1, at document #2100/2914
2018-12-03 22:16:51,088 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:16:51,101 : INFO : topic #2 (0.100): 0.015*"police" + 0.011*"family" + 0.008*"child" + 0.008*"people" + 0.008*"life" + 0.008*"health" + 0.007*"find" + 0.007*"take" + 0.006*"woman" + 0.006*"comment"
2018-12-03 22:16:51,102 : INFO : topic #5 (0.100): 0.017*"school" + 0.012*"city" + 0.009*"student" + 0.007*"day" + 0.007*"people" + 0.006*"open" + 0.006*"community" + 0.005*"street" + 0.005*"work" + 0.005*"time"
2018-12-03 22:16:51,104 : INFO : topic #4 (0.100): 0.024*"company" + 0.014*"share" + 0.013*"business" + 0.008*"market" + 0.008*"price" + 0.007*"service" + 0.007*"time" + 0.007*"technology" + 0.005*"car" + 0.005*"customer"
2018-12-03 22:16:51,105 : INFO : topic #9 (0.100): 0.014*"game" + 0.012*"play" + 0.011*"time" + 0.011*"team" + 0.0

2018-12-03 22:16:51,617 : INFO : topic #2 (0.100): 0.017*"police" + 0.011*"family" + 0.010*"drug" + 0.010*"health" + 0.009*"child" + 0.008*"woman" + 0.008*"medical" + 0.007*"life" + 0.007*"arrest" + 0.007*"find"
2018-12-03 22:16:51,618 : INFO : topic #7 (0.100): 0.013*"use" + 0.011*"website" + 0.010*"*" + 0.009*"local" + 0.009*"site" + 0.008*"editor" + 0.008*"food" + 0.007*"contact" + 0.007*"online" + 0.007*"content"
2018-12-03 22:16:51,619 : INFO : topic #6 (0.100): 0.010*"tax" + 0.010*"need" + 0.009*"work" + 0.009*"water" + 0.007*"plan" + 0.007*"people" + 0.007*"area" + 0.006*"cost" + 0.006*"number" + 0.006*"budget"
2018-12-03 22:16:51,620 : INFO : topic #0 (0.100): 0.010*"world" + 0.008*"book" + 0.007*"change" + 0.006*"university" + 0.006*"people" + 0.005*"write" + 0.005*"use" + 0.005*"life" + 0.005*"republican" + 0.005*"dr"
2018-12-03 22:16:51,620 : INFO : topic diff=0.098573, rho=0.179201
2018-12-03 22:16:51,637 : INFO : PROGRESS: pass 1, at document #2800/2914
2018-12-03 22:16:51

2018-12-03 22:16:53,916 : INFO : topic #6 (0.100): 0.011*"water" + 0.011*"need" + 0.010*"work" + 0.008*"project" + 0.008*"tax" + 0.007*"funding" + 0.007*"plan" + 0.007*"people" + 0.006*"cost" + 0.006*"area"
2018-12-03 22:16:53,918 : INFO : topic #2 (0.100): 0.016*"police" + 0.011*"child" + 0.010*"family" + 0.009*"medical" + 0.009*"life" + 0.008*"woman" + 0.008*"case" + 0.008*"court" + 0.008*"death" + 0.008*"drug"
2018-12-03 22:16:53,920 : INFO : topic diff=0.128700, rho=0.176391
2018-12-03 22:16:53,937 : INFO : PROGRESS: pass 2, at document #400/2914
2018-12-03 22:16:53,990 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:16:54,003 : INFO : topic #3 (0.100): 0.009*"state" + 0.008*"president" + 0.008*"country" + 0.007*"force" + 0.006*"government" + 0.006*"take" + 0.006*"people" + 0.005*"attack" + 0.005*"military" + 0.005*"security"
2018-12-03 22:16:54,004 : INFO : topic #9 (0.100): 0.011*"game" + 0.011*"play" + 0.010*"time" + 0.010*"team" + 0.009*

2018-12-03 22:16:54,478 : INFO : topic diff=0.095386, rho=0.176391
2018-12-03 22:16:54,616 : INFO : -8.316 per-word bound, 318.8 perplexity estimate based on a held-out corpus of 100 documents with 22700 words
2018-12-03 22:16:54,617 : INFO : PROGRESS: pass 2, at document #1000/2914
2018-12-03 22:16:54,669 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:16:54,683 : INFO : topic #1 (0.100): 0.012*"market" + 0.010*"country" + 0.009*"oil" + 0.008*"energy" + 0.008*"economic" + 0.008*"project" + 0.008*"rate" + 0.007*"growth" + 0.007*"bank" + 0.007*"power"
2018-12-03 22:16:54,684 : INFO : topic #8 (0.100): 0.013*"comment" + 0.012*"government" + 0.009*"public" + 0.008*"party" + 0.008*"state" + 0.007*"right" + 0.007*"member" + 0.007*"service" + 0.006*"minister" + 0.006*"court"
2018-12-03 22:16:54,685 : INFO : topic #0 (0.100): 0.011*"world" + 0.009*"book" + 0.007*"people" + 0.007*"change" + 0.007*"university" + 0.006*"write" + 0.005*"dr" + 0.005*"human"

2018-12-03 22:16:55,234 : INFO : topic #2 (0.100): 0.014*"police" + 0.012*"child" + 0.012*"family" + 0.008*"life" + 0.008*"woman" + 0.007*"find" + 0.007*"people" + 0.007*"drug" + 0.007*"time" + 0.007*"death"
2018-12-03 22:16:55,235 : INFO : topic #3 (0.100): 0.008*"state" + 0.008*"pakistan" + 0.008*"country" + 0.006*"people" + 0.006*"take" + 0.006*"president" + 0.006*"force" + 0.006*"oct" + 0.005*"muslim" + 0.005*"war"
2018-12-03 22:16:55,236 : INFO : topic #9 (0.100): 0.012*"game" + 0.012*"time" + 0.010*"team" + 0.009*"play" + 0.008*"take" + 0.008*"film" + 0.008*"look" + 0.007*"win" + 0.007*"show" + 0.007*"want"
2018-12-03 22:16:55,240 : INFO : topic #7 (0.100): 0.015*"*" + 0.013*"use" + 0.010*"food" + 0.009*"website" + 0.008*"site" + 0.007*"online" + 0.007*"facebook" + 0.007*"local" + 0.006*"dog" + 0.006*"e-mail"
2018-12-03 22:16:55,240 : INFO : topic #5 (0.100): 0.014*"city" + 0.012*"school" + 0.008*"student" + 0.008*"people" + 0.008*"day" + 0.007*"open" + 0.006*"place" + 0.005*"wor

2018-12-03 22:16:55,839 : INFO : topic #4 (0.100): 0.023*"company" + 0.015*"business" + 0.013*"share" + 0.008*"market" + 0.008*"price" + 0.007*"service" + 0.007*"time" + 0.006*"technology" + 0.005*"car" + 0.005*"customer"
2018-12-03 22:16:55,839 : INFO : topic #0 (0.100): 0.010*"world" + 0.008*"book" + 0.008*"people" + 0.007*"write" + 0.006*"university" + 0.006*"change" + 0.005*"dr" + 0.005*"use" + 0.005*"life" + 0.005*"study"
2018-12-03 22:16:55,840 : INFO : topic #6 (0.100): 0.010*"need" + 0.009*"tax" + 0.009*"work" + 0.008*"water" + 0.008*"plan" + 0.008*"people" + 0.007*"area" + 0.006*"cost" + 0.006*"job" + 0.006*"worker"
2018-12-03 22:16:55,841 : INFO : topic diff=0.081183, rho=0.176391
2018-12-03 22:16:55,858 : INFO : PROGRESS: pass 2, at document #2300/2914
2018-12-03 22:16:55,902 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:16:55,914 : INFO : topic #5 (0.100): 0.018*"school" + 0.014*"city" + 0.011*"student" + 0.008*"people" + 0.008*"da

2018-12-03 22:16:56,313 : INFO : topic #3 (0.100): 0.012*"state" + 0.011*"country" + 0.011*"president" + 0.007*"pakistan" + 0.007*"people" + 0.006*"group" + 0.006*"attack" + 0.006*"nigeria" + 0.006*"indian" + 0.006*"government"
2018-12-03 22:16:56,314 : INFO : topic diff=0.093184, rho=0.176391
2018-12-03 22:16:56,329 : INFO : PROGRESS: pass 2, at document #2900/2914
2018-12-03 22:16:56,371 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:16:56,387 : INFO : topic #9 (0.100): 0.012*"game" + 0.011*"play" + 0.011*"time" + 0.011*"team" + 0.009*"win" + 0.008*"take" + 0.008*"look" + 0.006*"show" + 0.006*"player" + 0.006*"want"
2018-12-03 22:16:56,388 : INFO : topic #4 (0.100): 0.023*"company" + 0.016*"business" + 0.011*"share" + 0.009*"price" + 0.008*"service" + 0.008*"time" + 0.008*"market" + 0.007*"technology" + 0.006*"app" + 0.006*"value"
2018-12-03 22:16:56,389 : INFO : topic #7 (0.100): 0.019*"use" + 0.013*"website" + 0.012*"site" + 0.011*"kashmir"

2018-12-03 22:16:58,804 : INFO : PROGRESS: pass 3, at document #500/2914
2018-12-03 22:16:58,848 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:16:58,861 : INFO : topic #5 (0.100): 0.017*"school" + 0.015*"city" + 0.011*"student" + 0.009*"people" + 0.007*"day" + 0.006*"community" + 0.006*"work" + 0.006*"canada" + 0.005*"park" + 0.005*"open"
2018-12-03 22:16:58,862 : INFO : topic #4 (0.100): 0.025*"company" + 0.013*"business" + 0.010*"share" + 0.009*"market" + 0.009*"service" + 0.008*"time" + 0.007*"price" + 0.006*"technology" + 0.005*"customer" + 0.005*"sales"
2018-12-03 22:16:58,862 : INFO : topic #0 (0.100): 0.010*"world" + 0.009*"people" + 0.009*"book" + 0.007*"change" + 0.006*"life" + 0.006*"use" + 0.006*"write" + 0.005*"university" + 0.005*"find" + 0.005*"human"
2018-12-03 22:16:58,863 : INFO : topic #1 (0.100): 0.018*"market" + 0.011*"report" + 0.010*"project" + 0.009*"country" + 0.008*"energy" + 0.008*"high" + 0.007*"plant" + 0.007*"rate"

2018-12-03 22:16:59,490 : INFO : topic #3 (0.100): 0.011*"state" + 0.009*"country" + 0.007*"president" + 0.006*"take" + 0.006*"force" + 0.006*"war" + 0.005*"people" + 0.005*"government" + 0.005*"attack" + 0.005*"leader"
2018-12-03 22:16:59,491 : INFO : topic #9 (0.100): 0.011*"game" + 0.010*"time" + 0.010*"play" + 0.009*"take" + 0.009*"team" + 0.008*"look" + 0.007*"show" + 0.007*"win" + 0.007*"want" + 0.006*"big"
2018-12-03 22:16:59,492 : INFO : topic #0 (0.100): 0.010*"world" + 0.010*"book" + 0.009*"people" + 0.007*"change" + 0.007*"university" + 0.007*"write" + 0.006*"life" + 0.005*"use" + 0.005*"dr" + 0.005*"work"
2018-12-03 22:16:59,493 : INFO : topic #8 (0.100): 0.014*"comment" + 0.012*"government" + 0.009*"public" + 0.009*"party" + 0.008*"state" + 0.008*"member" + 0.007*"minister" + 0.007*"right" + 0.006*"service" + 0.006*"issue"
2018-12-03 22:16:59,493 : INFO : topic diff=0.066986, rho=0.173710
2018-12-03 22:16:59,511 : INFO : PROGRESS: pass 3, at document #1200/2914
2018-12-03 

2018-12-03 22:17:00,038 : INFO : topic #0 (0.100): 0.011*"world" + 0.009*"people" + 0.008*"book" + 0.006*"change" + 0.006*"write" + 0.006*"find" + 0.006*"study" + 0.006*"university" + 0.006*"use" + 0.005*"life"
2018-12-03 22:17:00,039 : INFO : topic #2 (0.100): 0.016*"police" + 0.013*"child" + 0.012*"family" + 0.009*"woman" + 0.008*"death" + 0.008*"life" + 0.007*"take" + 0.007*"people" + 0.007*"health" + 0.007*"case"
2018-12-03 22:17:00,040 : INFO : topic diff=0.069410, rho=0.173710
2018-12-03 22:17:00,055 : INFO : PROGRESS: pass 3, at document #1800/2914
2018-12-03 22:17:00,101 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:17:00,116 : INFO : topic #6 (0.100): 0.012*"need" + 0.010*"people" + 0.009*"work" + 0.009*"plan" + 0.008*"cost" + 0.007*"area" + 0.007*"tax" + 0.006*"job" + 0.006*"water" + 0.006*"child"
2018-12-03 22:17:00,117 : INFO : topic #1 (0.100): 0.012*"market" + 0.009*"energy" + 0.008*"country" + 0.008*"high" + 0.008*"low" + 0.008*

2018-12-03 22:17:00,620 : INFO : topic diff=0.070345, rho=0.173710
2018-12-03 22:17:00,634 : INFO : PROGRESS: pass 3, at document #2400/2914
2018-12-03 22:17:00,683 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:17:00,698 : INFO : topic #2 (0.100): 0.015*"police" + 0.012*"child" + 0.011*"family" + 0.009*"woman" + 0.008*"take" + 0.008*"life" + 0.007*"court" + 0.007*"health" + 0.006*"find" + 0.006*"use"
2018-12-03 22:17:00,700 : INFO : topic #8 (0.100): 0.012*"government" + 0.012*"comment" + 0.011*"party" + 0.008*"state" + 0.008*"court" + 0.007*"issue" + 0.007*"minister" + 0.007*"member" + 0.006*"public" + 0.006*"election"
2018-12-03 22:17:00,700 : INFO : topic #3 (0.100): 0.012*"state" + 0.009*"country" + 0.008*"president" + 0.007*"people" + 0.006*"attack" + 0.005*"leader" + 0.005*"take" + 0.005*"pakistan" + 0.005*"indian" + 0.005*"group"
2018-12-03 22:17:00,701 : INFO : topic #0 (0.100): 0.010*"world" + 0.009*"people" + 0.007*"book" + 0.007*"ch

2018-12-03 22:17:01,194 : INFO : topic #2 (0.100): 0.019*"police" + 0.014*"medical" + 0.013*"family" + 0.012*"child" + 0.011*"woman" + 0.009*"drug" + 0.009*"mother" + 0.008*"health" + 0.008*"case" + 0.007*"girl"
2018-12-03 22:17:01,195 : INFO : topic #3 (0.100): 0.012*"state" + 0.012*"president" + 0.011*"country" + 0.007*"pakistan" + 0.007*"government" + 0.007*"attack" + 0.006*"force" + 0.006*"people" + 0.006*"modi" + 0.006*"group"
2018-12-03 22:17:01,196 : INFO : topic #6 (0.100): 0.018*"need" + 0.012*"funding" + 0.011*"water" + 0.010*"tax" + 0.009*"work" + 0.009*"project" + 0.008*"people" + 0.008*"job" + 0.007*"development" + 0.007*"budget"
2018-12-03 22:17:01,197 : INFO : topic #7 (0.100): 0.039*"beer" + 0.017*"use" + 0.012*"website" + 0.010*"site" + 0.010*"kashmir" + 0.010*"newspaper" + 0.009*"visit" + 0.009*"click" + 0.009*"contact" + 0.008*"user"
2018-12-03 22:17:01,198 : INFO : topic #5 (0.100): 0.018*"city" + 0.013*"school" + 0.010*"community" + 0.009*"student" + 0.008*"space" 

2018-12-03 22:17:03,616 : INFO : topic #5 (0.100): 0.016*"school" + 0.015*"city" + 0.014*"student" + 0.009*"people" + 0.008*"day" + 0.006*"community" + 0.006*"work" + 0.006*"park" + 0.005*"canada" + 0.005*"street"
2018-12-03 22:17:03,617 : INFO : topic #3 (0.100): 0.013*"state" + 0.009*"country" + 0.008*"president" + 0.006*"government" + 0.006*"force" + 0.006*"take" + 0.005*"attack" + 0.005*"people" + 0.005*"group" + 0.005*"security"
2018-12-03 22:17:03,618 : INFO : topic #2 (0.100): 0.016*"police" + 0.015*"child" + 0.010*"family" + 0.009*"drug" + 0.009*"life" + 0.008*"death" + 0.008*"case" + 0.007*"woman" + 0.007*"medical" + 0.007*"hospital"
2018-12-03 22:17:03,620 : INFO : topic diff=0.088460, rho=0.171147
2018-12-03 22:17:03,640 : INFO : PROGRESS: pass 4, at document #700/2914
2018-12-03 22:17:03,692 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:17:03,709 : INFO : topic #8 (0.100): 0.015*"comment" + 0.012*"government" + 0.011*"state" + 0.00

2018-12-03 22:17:04,253 : INFO : topic #5 (0.100): 0.015*"city" + 0.015*"school" + 0.010*"people" + 0.009*"student" + 0.008*"day" + 0.007*"community" + 0.006*"work" + 0.006*"street" + 0.006*"local" + 0.005*"place"
2018-12-03 22:17:04,254 : INFO : topic #7 (0.100): 0.015*"use" + 0.013*"*" + 0.011*"food" + 0.010*"website" + 0.010*"facebook" + 0.009*"site" + 0.008*"user" + 0.008*"e-mail" + 0.007*"online" + 0.007*"click"
2018-12-03 22:17:04,255 : INFO : topic diff=0.067556, rho=0.171147
2018-12-03 22:17:04,273 : INFO : PROGRESS: pass 4, at document #1300/2914
2018-12-03 22:17:04,326 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:17:04,339 : INFO : topic #8 (0.100): 0.016*"comment" + 0.011*"government" + 0.009*"public" + 0.008*"state" + 0.007*"right" + 0.007*"member" + 0.007*"party" + 0.007*"minister" + 0.006*"service" + 0.006*"court"
2018-12-03 22:17:04,340 : INFO : topic #5 (0.100): 0.014*"city" + 0.012*"school" + 0.009*"day" + 0.009*"people" + 0.

2018-12-03 22:17:04,748 : INFO : topic diff=0.069917, rho=0.171147
2018-12-03 22:17:04,765 : INFO : PROGRESS: pass 4, at document #1900/2914
2018-12-03 22:17:04,805 : INFO : merging changes from 100 documents into a model of 2914 documents
2018-12-03 22:17:04,818 : INFO : topic #4 (0.100): 0.021*"company" + 0.014*"business" + 0.011*"share" + 0.008*"time" + 0.007*"market" + 0.007*"car" + 0.006*"price" + 0.006*"technology" + 0.006*"service" + 0.006*"sales"
2018-12-03 22:17:04,819 : INFO : topic #3 (0.100): 0.009*"state" + 0.008*"country" + 0.007*"war" + 0.006*"president" + 0.006*"pakistan" + 0.006*"people" + 0.006*"take" + 0.006*"attack" + 0.006*"force" + 0.005*"group"
2018-12-03 22:17:04,819 : INFO : topic #1 (0.100): 0.012*"market" + 0.009*"high" + 0.008*"country" + 0.008*"energy" + 0.008*"rate" + 0.008*"low" + 0.008*"growth" + 0.007*"report" + 0.006*"price" + 0.006*"economic"
2018-12-03 22:17:04,820 : INFO : topic #2 (0.100): 0.017*"police" + 0.012*"child" + 0.012*"family" + 0.009*"wo

2018-12-03 22:17:05,376 : INFO : topic #0 (0.100): 0.011*"people" + 0.010*"world" + 0.008*"book" + 0.007*"change" + 0.007*"life" + 0.006*"university" + 0.006*"time" + 0.005*"write" + 0.005*"find" + 0.005*"use"
2018-12-03 22:17:05,377 : INFO : topic #1 (0.100): 0.011*"market" + 0.010*"high" + 0.009*"growth" + 0.009*"country" + 0.008*"economic" + 0.008*"low" + 0.008*"project" + 0.007*"bank" + 0.007*"rate" + 0.007*"gas"
2018-12-03 22:17:05,377 : INFO : topic #8 (0.100): 0.012*"comment" + 0.012*"government" + 0.011*"party" + 0.008*"minister" + 0.008*"state" + 0.007*"court" + 0.007*"issue" + 0.007*"member" + 0.006*"public" + 0.006*"election"
2018-12-03 22:17:05,378 : INFO : topic #7 (0.100): 0.019*"use" + 0.015*"*" + 0.014*"website" + 0.012*"site" + 0.009*"cookie" + 0.009*"online" + 0.008*"user" + 0.008*"food" + 0.008*"facebook" + 0.007*"editor"
2018-12-03 22:17:05,379 : INFO : topic diff=0.067578, rho=0.171147
2018-12-03 22:17:05,393 : INFO : PROGRESS: pass 4, at document #2600/2914
2018-1
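
The evaluation lines in the log above report both a per-word bound and a perplexity estimate. The two are related by perplexity = 2^(-bound), which the short sketch below checks against the two values copied from the log (small differences come from the rounding of the logged bound).

```python
# per-word bounds copied from the two evaluation lines in the log above
per_word_bounds = [-8.543, -8.316]

# perplexity is 2 raised to minus the per-word bound
perplexities = [2 ** (-bound) for bound in per_word_bounds]

for bound, perplexity in zip(per_word_bounds, perplexities):
    print('%.3f per-word bound -> perplexity of about %.0f' % (bound, perplexity))
```

A decreasing perplexity between the two evaluations suggests that the model fits the held-out chunk better as training progresses.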

In [22]:
ldag.print_topics()

[(0,
  '0.010*"people" + 0.009*"world" + 0.008*"supplement" + 0.007*"change" + 0.006*"book" + 0.006*"life" + 0.006*"time" + 0.006*"use" + 0.005*"human" + 0.005*"find"'),
 (1,
  '0.029*"market" + 0.017*"report" + 0.012*"energy" + 0.012*"region" + 0.010*"country" + 0.010*"project" + 0.009*"gas" + 0.009*"growth" + 0.008*"economic" + 0.007*"rise"'),
 (2,
  '0.020*"police" + 0.014*"family" + 0.013*"medical" + 0.012*"child" + 0.011*"woman" + 0.009*"drug" + 0.009*"mother" + 0.008*"health" + 0.008*"case" + 0.008*"girl"'),
 (3,
  '0.012*"state" + 0.012*"country" + 0.012*"president" + 0.007*"government" + 0.007*"pakistan" + 0.007*"attack" + 0.006*"force" + 0.006*"people" + 0.006*"modi" + 0.006*"group"'),
 (4,
  '0.024*"company" + 0.015*"business" + 0.012*"share" + 0.012*"market" + 0.011*"service" + 0.009*"price" + 0.008*"singapore" + 0.008*"sales" + 0.008*"technology" + 0.008*"time"'),
 (5,
  '0.019*"city" + 0.013*"school" + 0.009*"community" + 0.009*"student" + 0.008*"space" + 0.008*"day" + 0.0