# Introduction to LDA and data cleaning
In this notebook, we introduce LDA and what we need for our model. We then proceed to load and clean a sample of the NOW corpus to fulfill our needs.

## What is LDA
[Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA) is a statistical model that we will use for topic modelling and discovery. Given the list of words that make up a text, LDA outputs the topics present in that text and their probabilities. Here, a topic is represented as a probability distribution over words, and each text (document) is represented as a probability distribution over topics. In short, texts have an associated topic distribution and topics have an associated word distribution.

The image below is the plate notation for LDA, where:
* θ<sub>m</sub> is the topic distribution for document m,
* φ<sub>k</sub> is the word distribution for topic k,
* z<sub>mn</sub> is the topic for the n-th word in document m,
* w<sub>mn</sub> is the specific word,
* α is the parameter of the Dirichlet prior on the per-document topic distributions, and
* β is the parameter of the Dirichlet prior on the per-topic word distributions.

![](LDA.png)
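
For reference, the generative process that the plate notation encodes can be written with the same symbols as the list above. The symbols K (number of topics), M (number of documents) and N<sub>m</sub> (number of words in document m) are introduced here only for this sketch.

```latex
\begin{aligned}
\varphi_k &\sim \operatorname{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_m &\sim \operatorname{Dirichlet}(\alpha), & m &= 1,\dots,M \\
z_{mn} &\sim \operatorname{Categorical}(\theta_m), & n &= 1,\dots,N_m \\
w_{mn} &\sim \operatorname{Categorical}(\varphi_{z_{mn}})
\end{aligned}
```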

α and β are the hyperparameters of the model. A large α means that each document is likely to be a mixture of many topics, while a small α means that it is likely to be dominated by a few. The same goes for β: a high value means that each topic is spread over a large number of words. The number of topics is not learned by LDA; we choose it as an input, much like the number of clusters in a clustering algorithm. If we allow too many topics we may split topics needlessly, and if we allow too few we will merge topics that should stay separate.
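
The effect of α can be illustrated with a small simulation. The sketch below is not part of the project code: it draws topic distributions from a symmetric Dirichlet using only the standard library, and the values `alpha=0.1`, `alpha=10.0`, `k=10` and `n_docs=500` are arbitrary choices made for the illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    # Normalising independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=10, n_docs=500, seed=0):
    """Average weight of the single most probable topic across documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs)) / n_docs

sparse = mean_largest_share(alpha=0.1)   # small alpha: one topic tends to dominate
dense = mean_largest_share(alpha=10.0)   # large alpha: topics share the weight
print(f"alpha=0.1  -> mean largest topic share {sparse:.2f}")
print(f"alpha=10.0 -> mean largest topic share {dense:.2f}")
```

With a small α the largest topic takes most of the probability mass of a document, whereas with a large α the mass is spread almost evenly over the topics.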

## The NOW corpus
This notebook shows the cleaning process that will be used for the ADA project. Here, only a sample of the data is used (from [here](https://www.corpusdata.org/now_corpus.asp)), but the methods should be the same once scaled to the full database available on the cluster.

The NOW database is composed of billions of words from online newspapers and magazines from 20 different countries. The data we downloaded comes in different files which can be used together or independently. These files are:

1. **now-samples-lexicon.txt**: this is the full dictionary of the English language, a lexicon. It contains four columns: `wID`, the word id; `word`, the actual word; `lemma`, the base form of the word (e.g. if the word is "walked", the lemma is "walk"); and `PoS`, the part of speech.
2. **now-samples-sources.txt**: this is the source of every text. In order, it contains the text id, the number of words, the date, the country, the website, the URL and the title of the article.
3. **text.txt**: this file has the complete texts of the articles. The first column is the `textID` in the format @@textID and the second column is the full text, complete with HTML paragraphs and headers. It is important to note that, to prevent plagiarism, 10 words out of every 200 are replaced by the string "@ @ @ @ @ @ @ @ @ @". Contractions are also split (for example, "can't" is written as "ca n't") and punctuation is surrounded by spaces.
4. **wordLemPoS.txt**: finally, this file contains the `word`, `lemma` and `PoS` of each word in the texts, one word per row, so one could read the texts by reading down the columns. Each row also carries the `textID` of the text the word comes from and an `ID (seq)`, a unique identifier for each word in the database, which is incremented each time a word is added.
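
To make the format of **text.txt** concrete, here is a minimal sketch of how one line could be parsed. It is only an illustration: the helper `parse_text_line` and the sample line are hypothetical, and the sketch assumes that the `@@textID` marker and the text are separated by whitespace.

```python
import re

# Ten '@' signs separated by single spaces: the anti-plagiarism placeholder.
MASK = re.compile(r"(?:@ ){9}@")

def parse_text_line(line):
    """Split '@@<textID> <text>' into (textID, text with placeholders removed)."""
    match = re.match(r"@@(\d+)\s+(.*)", line, flags=re.DOTALL)
    if match is None:
        raise ValueError("line does not start with @@textID")
    text_id = int(match.group(1))
    text = MASK.sub(" ", match.group(2))
    # Collapse the whitespace left behind by the removed placeholder.
    text = re.sub(r"\s+", " ", text).strip()
    return text_id, text

# Hypothetical line, written in the style described above.
sample = "@@11241 <p> He ca n't stop . @ @ @ @ @ @ @ @ @ @ The end ."
text_id, text = parse_text_line(sample)
print(text_id, text)
```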

## What we need from the NOW corpus for LDA
The model will take two inputs: a matrix with the important words of each text, and a list of all the important words. By important, we mean the words that lead to good topic modelling. For example, names, locations and very common simple words such as "but", "I" or "and" (so-called stopwords) will not give meaningful results. Other words that are very common in our database should be removed too. We should also use lemmas instead of words.

Therefore, the file `wordLemPoS.txt` (henceforth referred to as wlp) is the most important one here, as it lists all the lemmas together with their associated `textID`. With it we can list all the lemmas and remove those we do not want in order to build our word list, but also group them by text to create our text-word matrix.

We will also need `now-samples-sources.txt` (henceforth referred to as sources) to link the texts with the information we deem useful, for example country, date or website.

These are thus the two files we will import and process here with the sample data, and also the ones we will use with the data on the cluster.
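
As a small illustration of the two inputs described above, the sketch below builds a word list and a text-word count matrix in plain Python. The function name `build_text_word_matrix` and the toy lemma lists are hypothetical; the notebook itself builds the equivalent structure with Spark further down.

```python
from collections import Counter

def build_text_word_matrix(lemmas_by_text):
    """Return (vocabulary, text_ids, matrix) where matrix[i][j] counts
    how often vocabulary[j] occurs in the text text_ids[i]."""
    vocabulary = sorted({lemma for lemmas in lemmas_by_text.values() for lemma in lemmas})
    column = {lemma: j for j, lemma in enumerate(vocabulary)}
    text_ids = sorted(lemmas_by_text)
    matrix = []
    for text_id in text_ids:
        row = [0] * len(vocabulary)
        for lemma, count in Counter(lemmas_by_text[text_id]).items():
            row[column[lemma]] = count
        matrix.append(row)
    return vocabulary, text_ids, matrix

# Hypothetical lemma lists keyed by textID (not taken from the corpus).
lemmas_by_text = {
    1: ["election", "vote", "election"],
    2: ["match", "goal", "vote"],
}
vocabulary, text_ids, matrix = build_text_word_matrix(lemmas_by_text)
print(vocabulary)
print(matrix)
```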

## Cleaning

In [22]:
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix

import pyspark
#explicit imports instead of wildcards, so that Python builtins such as sum, min and max are not shadowed
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import collect_list, monotonically_increasing_id, to_date
from pyspark.sql.types import DateType
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.linalg import Vectors, VectorUDT

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Wlp processing


The goal of this part is to extract the useful data from the wlp text file, since it contains every word of every article together with the lemma to replace it with.

In [2]:
#first read the text file
wlp_rdd = sc.textFile('sample_data/wordLem_poS.txt')

In [3]:
#the first 3 lines are header lines that carry no data
header = wlp_rdd.take(3)

#remove every line that is identical to one of those header lines
noheaders = wlp_rdd.filter(lambda r: r not in header)

In [4]:
#we split the elements separated by tabs
lines = noheaders.map(lambda r: r.split('\t'))

#name the columns and cast the two ids to integers
wlp_schema = lines.map(lambda r: Row(textID=int(r[0]), idseq=int(r[1]),
                                     word=r[2], lemma=r[3], pos=r[4]))
wlp = spark.createDataFrame(wlp_schema)


In [None]:
wlp.show(5)

### Word selection
It is very important to select the right words, and the right number of them. The concept of "garbage in, garbage out" applies strongly to LDA. When we analyse a text, we focus on certain words to extract its meaning and topic. The same is true here: words like "if" and "for", numbers and common names are not that useful.

Here, we provide an example of the process we will go through. Strictly speaking, this is not a data cleaning step, since it directly influences our model; it is rather a model preprocessing step. We expect to go through many iterations of this part to get the best results from our model.

First of all, we can remove all the words whose PoS does not interest us, for example numbers (`mc`, `mc1`) or punctuation (`.`, `'`).

In [5]:
pos_remove = ['.',',','\"',"\'",'null','mc','mc1']
wlp_nopos = wlp.filter(~wlp['pos'].isin(pos_remove)).drop('idseq', 'pos', 'word')

Now, we load our list of stopwords: the words that we are not going to use in LDA because they are too common or are common names. We can also remove the rows with no lemma, and those whose lemma does not make sense or is not common enough.

In [6]:
#np.save('our_stopwords',stopwords)
stopwords = np.load('our_stopwords.npy').tolist()
len(stopwords)

5634

In [7]:
#filter out stopwords and look at the frequency of the remaining words
wlp_nostop = wlp_nopos.filter(~wlp_nopos['lemma'].isin(stopwords)) #.filter(~wlp_nopos['lemma'].rlike('\W'))
#wlp_nostop.groupBy('lemma').count().sort('count', ascending=False).show()

In [23]:
wlp_nostop.groupBy('lemma').count().sort('count', ascending=False).show(5000)

+------------------+-----+------------+
|             lemma|count|          id|
+------------------+-----+------------+
|                's| 9878|           0|
|                 '| 5968|           1|
|              year| 4272|           2|
|               n't| 3841|           3|
|              time| 3169|           4|
|            people| 2913|           5|
|              take| 2667|           6|
|               use| 2244|           7|
|              work| 2137|           8|
|               day| 1819|           9|
|             state| 1713|          10|
|           company| 1698|          11|
|           comment| 1667|          12|
|              need| 1654|          13|
|              want| 1579|          14|
|              look| 1564|          15|
|             world| 1553|          16|
|        government| 1551|          17|
|                 -| 1524|          18|
|              show| 1480|          19|
|              give| 1480|          20|
|           country| 1465|          21|
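
The counts above still contain tokens such as `'s`, `'`, `n't` and `-`. The commented-out `rlike('\W')` filter in the previous cell is one way to drop them. The sketch below shows, in plain Python, which lemmas such a pattern would remove; the helper `is_clean_lemma` is hypothetical, and it assumes that `\W` behaves the same way in Spark's regular expressions for these ASCII examples.

```python
import re

# Any character that is not a letter, a digit or an underscore.
NON_WORD = re.compile(r"\W")

def is_clean_lemma(lemma):
    """True when the lemma is non-empty and contains only word characters."""
    return bool(lemma) and NON_WORD.search(lemma) is None

lemmas = ["'s", "'", "year", "n't", "time", "-", "people"]
kept = [lemma for lemma in lemmas if is_clean_lemma(lemma)]
dropped = [lemma for lemma in lemmas if not is_clean_lemma(lemma)]
print(kept)
print(dropped)
```

Note that this pattern would also drop hyphenated lemmas, so it trades some recall for a cleaner vocabulary.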


In the end, we can group the lemmas in their texts to create our text-word matrix.

In [8]:
#grouping the selected words by text
wlp_bytext = wlp_nostop.groupBy('textID')\
                .agg(collect_list('lemma').alias('lemma'))\
                .sort('textID')


In [9]:
wlp_bytext.show(5)

+------+--------------------+
|textID|               lemma|
+------+--------------------+
| 11241|[yurick, writer, ...|
| 11242|[dialect, society...|
| 11243|[sublime, croissa...|
| 11244|[reflect, quarter...|
| 21242|[ars, facebook, c...|
+------+--------------------+
only showing top 5 rows



We will also remove the most common and the least common lemmas, since neither provides enough information for our LDA analysis. This can be done in SQL, for example. The cells below only apply the upper threshold.

In [15]:
all_lemmas = wlp_nostop.drop('textID')
all_lemmas.show(5)

+-------+
|  lemma|
+-------+
| yurick|
| writer|
|  novel|
|warrior|
|  adapt|
+-------+
only showing top 5 rows



In [16]:
#we calculate the frequencies and filter out the top
lemmas_freq = all_lemmas.groupby('lemma').count().sort('count', ascending=False)
lemmas_tokeep = lemmas_freq.where('count<7000')
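
The cell above only removes the most frequent lemmas. As a sketch of how both ends could be trimmed, the plain-Python example below keeps the lemmas whose frequency lies between two bounds. The function `lemmas_within_bounds`, the toy corpus and the thresholds are hypothetical and not tuned values.

```python
from collections import Counter

def lemmas_within_bounds(lemmas, min_count, max_count):
    """Keep lemmas whose corpus frequency n satisfies min_count <= n < max_count."""
    counts = Counter(lemmas)
    return {lemma for lemma, n in counts.items() if min_count <= n < max_count}

# Hypothetical toy corpus: one very common, one medium and one rare lemma.
lemmas = ["year"] * 6 + ["company"] * 3 + ["croissant"]
kept = lemmas_within_bounds(lemmas, min_count=2, max_count=5)
print(kept)
```

In Spark, the same idea would amount to adding a lower bound to the `where` condition on `count`.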

With an inner join, we keep only the words that are in both tables.

In [17]:
#register both DataFrames as temporary views, then run the SQL query with an inner join
wlp_nostop.createOrReplaceTempView('wlp_nostop')
lemmas_tokeep.createOrReplaceTempView('lemma_tokeep')

query = """
SELECT wlp_nostop.lemma, wlp_nostop.textID
FROM wlp_nostop
INNER JOIN lemma_tokeep ON wlp_nostop.lemma = lemma_tokeep.lemma
ORDER BY textID
"""

wlp_kept = spark.sql(query)
wlp_kept_bytext = wlp_kept.groupBy('textID')\
                    .agg(collect_list('lemma').alias('lemma'))\
                    .sort('textID')
wlp_kept_bytext.show(5)

+------+--------------------+
|textID|               lemma|
+------+--------------------+
| 11241|[1970s, decay, fi...|
| 11242|[online, happen, ...|
| 11243|[airiest, crust, ...|
| 11244|[trail, launch, o...|
| 21242|[online, launch, ...|
+------+--------------------+
only showing top 5 rows



## Sources
The sources file contains the additional information about each text.

In [17]:
sources_rdd = sc.textFile('sample_data/now-samples-sources.txt')\
                .map(lambda r: r.split('\t'))

#the first 3 rows are header rows, so remove every row identical to one of them
header = sources_rdd.take(3)
sources_rdd = sources_rdd.filter(lambda l: l not in header)

In [18]:
#name the columns and cast the numeric ones; the date is converted in the next step
sources_schema = sources_rdd.map(lambda r: Row(textID=int(r[0]), nwords=int(r[1]), date=r[2],
                                               country=r[3], website=r[4], url=r[5], title=r[6]))
sources = spark.createDataFrame(sources_schema)
sources = sources.withColumn('date',to_date(sources.date, 'yy-MM-dd'))
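
The `'yy-MM-dd'` pattern used above expects a two-digit year. As a quick check of what that conversion does, here is the equivalent with Python's standard library. The raw value `"13-01-06"` is an assumption inferred from the pattern and from the dates displayed further down, and `parse_now_date` is a hypothetical helper.

```python
from datetime import datetime

def parse_now_date(raw):
    """Parse a two-digit-year date such as '13-01-06' into a date object."""
    # %y-%m-%d is the strptime counterpart of the 'yy-MM-dd' pattern.
    return datetime.strptime(raw, "%y-%m-%d").date()

parsed = parse_now_date("13-01-06")
print(parsed.isoformat())
```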

In [19]:
sources.printSchema()

root
 |-- country: string (nullable = true)
 |-- date: date (nullable = true)
 |-- nwords: long (nullable = true)
 |-- textID: long (nullable = true)
 |-- title: string (nullable = true)
 |-- url: string (nullable = true)
 |-- website: string (nullable = true)



In [20]:
sources.show(5)

+-------+----------+------+------+--------------------+--------------------+-------------------+
|country|      date|nwords|textID|               title|                 url|            website|
+-------+----------+------+------+--------------------+--------------------+-------------------+
|     US|2013-01-06|   397| 11241|Author of The War...|http://kotaku.com...|             Kotaku|
|     US|2013-01-06|   757| 11242|That's What They ...|http://michiganra...|     Michigan Radio|
|     US|2013-01-06|   755| 11243|Best of New York:...|http://www.nydail...|New York Daily News|
|     US|2013-01-06|  1677| 11244|Reflecting on a q...|http://www.oregon...|     OregonLive.com|
|     US|2013-01-11|   794| 21242|Ask Ars: Does Fac...|http://arstechnic...|       Ars Technica|
+-------+----------+------+------+--------------------+--------------------+-------------------+
only showing top 5 rows



## LDA

In [10]:
cv = CountVectorizer(inputCol="lemma", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cvmodel = cv.fit(wlp_bytext)
result_cv = cvmodel.transform(wlp_bytext)
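
In the cell above, `vocabSize=5000` caps the vocabulary at the most frequent terms, and `minDF=10.0` requires a term to appear in at least 10 different documents. The plain-Python sketch below mimics that selection on a toy example. The function `select_vocabulary` is hypothetical, and the alphabetical tie-breaking is a choice made for this sketch, not a documented Spark behaviour.

```python
from collections import Counter

def select_vocabulary(documents, vocab_size, min_df):
    """Pick at most vocab_size terms that occur in at least min_df documents,
    ordered by total term frequency (ties broken alphabetically here)."""
    term_freq = Counter(term for doc in documents for term in doc)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    eligible = [t for t in term_freq if doc_freq[t] >= min_df]
    eligible.sort(key=lambda t: (-term_freq[t], t))
    return eligible[:vocab_size]

# Hypothetical documents, each given as a list of lemmas.
docs = [["vote", "vote", "party"], ["vote", "goal"], ["goal", "party", "rare"]]
vocabulary = select_vocabulary(docs, vocab_size=2, min_df=2)
print(vocabulary)
```

Here "rare" is excluded by the document-frequency threshold and "party" by the size cap.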

In [10]:
result_cv.show()

+------+--------------------+--------------------+
|textID|               lemma|        raw_features|
+------+--------------------+--------------------+
| 11241|[yurick, writer, ...|(5000,[0,1,2,8,19...|
| 11242|[dialect, society...|(5000,[0,2,3,4,5,...|
| 11243|[sublime, croissa...|(5000,[0,1,2,3,4,...|
| 11244|[reflect, quarter...|(5000,[0,1,2,3,4,...|
| 21242|[ars, facebook, c...|(5000,[0,2,3,4,5,...|
| 21243|[york, associate,...|(5000,[0,1,2,7,8,...|
| 31240|[ireland, 's, oly...|(5000,[0,1,2,3,4,...|
| 31241|[launch, online, ...|(5000,[0,1,2,6,13...|
| 31242|[entrepreneur, po...|(5000,[0,10,12,41...|
| 41240|[syrian, woman, o...|(5000,[0,3,4,5,6,...|
| 41241|[published, medic...|(5000,[0,5,7,8,15...|
| 41244|[bay, bridge, jar...|(5000,[0,1,2,4,5,...|
| 51243|[mpaa, lobby, arm...|(5000,[0,3,4,5,7,...|
| 61240|[mum, 's, fight, ...|(5000,[0,1,2,5,14...|
| 61242|[investigate, cas...|(5000,[0,2,3,4,5,...|
| 71240|[north, 's, popul...|(5000,[0,1,2,13,1...|
| 71241|[fergusson, air, ...|(5

In [11]:
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_cv)
result_tfidf = idfModel.transform(result_cv) 
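
The `IDF` step rescales the raw counts so that terms present in most documents weigh less. As documented for Spark ML, the weight is the smoothed form log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term. The sketch below applies that formula in plain Python; the function names and the numbers are hypothetical.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: log((n_docs + 1) / (doc_freq + 1))."""
    return math.log((n_docs + 1) / (doc_freq + 1))

def tfidf(term_counts, doc_freqs, n_docs):
    """Scale the raw term counts {term: count} by the IDF of each term."""
    return {t: c * smoothed_idf(n_docs, doc_freqs[t]) for t, c in term_counts.items()}

# Hypothetical numbers: 9 documents, "vote" appears in all of them, "croissant" in 1.
weights = tfidf({"vote": 3, "croissant": 2}, {"vote": 9, "croissant": 1}, n_docs=9)
print(weights)
```

A term that appears in every document gets a weight of zero, however often it occurs in a given text.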

In [12]:
result_tfidf.show()

+------+--------------------+--------------------+--------------------+
|textID|               lemma|        raw_features|            features|
+------+--------------------+--------------------+--------------------+
| 11241|[yurick, writer, ...|(5000,[0,1,2,8,19...|(5000,[0,1,2,8,19...|
| 11242|[dialect, society...|(5000,[0,2,3,4,5,...|(5000,[0,2,3,4,5,...|
| 11243|[sublime, croissa...|(5000,[0,1,2,3,4,...|(5000,[0,1,2,3,4,...|
| 11244|[reflect, quarter...|(5000,[0,1,2,3,4,...|(5000,[0,1,2,3,4,...|
| 21242|[ars, facebook, c...|(5000,[0,2,3,4,5,...|(5000,[0,2,3,4,5,...|
| 21243|[york, associate,...|(5000,[0,1,2,7,8,...|(5000,[0,1,2,7,8,...|
| 31240|[ireland, 's, oly...|(5000,[0,1,2,3,4,...|(5000,[0,1,2,3,4,...|
| 31241|[launch, online, ...|(5000,[0,1,2,6,13...|(5000,[0,1,2,6,13...|
| 31242|[entrepreneur, po...|(5000,[0,10,12,41...|(5000,[0,10,12,41...|
| 41240|[syrian, woman, o...|(5000,[0,3,4,5,6,...|(5000,[0,3,4,5,6,...|
| 41241|[published, medic...|(5000,[0,5,7,8,15...|(5000,[0,5,7,8

In [24]:
%%time
lda_model=LDA(k=10, maxIter=10).fit(result_tfidf)

CPU times: user 1.41 s, sys: 662 ms, total: 2.07 s
Wall time: 1min 32s


In [25]:
# Describe topics.
topics = lda_model.describeTopics(10)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

The topics described by their top-weighted terms:
+-----+-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|topic|termIndices                                          |termWeights                                                                                                                                                                                                                    |
+-----+-----------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0    |[283, 222, 148, 246, 201, 302, 173, 141, 117, 17]    |[0.006435370648676985, 0.005083
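
The `termIndices` column refers to positions in the vocabulary learned by the `CountVectorizer` model, which PySpark exposes as `cvmodel.vocabulary`. A minimal sketch of the lookup is given below; the vocabulary and the topic row are hypothetical stand-ins, not values from the run above.

```python
def topic_terms(term_indices, term_weights, vocabulary):
    """Pair each term index of a topic with its word and its weight."""
    return [(vocabulary[i], w) for i, w in zip(term_indices, term_weights)]

# Hypothetical vocabulary and topic row, standing in for cvmodel.vocabulary
# and for one row of lda_model.describeTopics().
vocabulary = ["year", "time", "people", "election", "goal"]
row = {"termIndices": [3, 0, 2], "termWeights": [0.4, 0.35, 0.25]}
top_terms = topic_terms(row["termIndices"], row["termWeights"], vocabulary)
print(top_terms)
```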

In [18]:
transformed = lda_model.transform(result_tfidf)
transformed.show()

+------+--------------------+--------------------+--------------------+--------------------+
|textID|               lemma|        raw_features|            features|   topicDistribution|
+------+--------------------+--------------------+--------------------+--------------------+
| 11241|[yurick, writer, ...|(5000,[0,1,2,8,19...|(5000,[0,1,2,8,19...|[2.38390381960235...|
| 11242|[dialect, society...|(5000,[0,2,3,4,5,...|(5000,[0,2,3,4,5,...|[0.06818866686863...|
| 11243|[sublime, croissa...|(5000,[0,1,2,3,4,...|(5000,[0,1,2,3,4,...|[1.70141802838506...|
| 11244|[reflect, quarter...|(5000,[0,1,2,3,4,...|(5000,[0,1,2,3,4,...|[0.11144120731673...|
| 21242|[ars, facebook, c...|(5000,[0,2,3,4,5,...|(5000,[0,2,3,4,5,...|[1.57219970100479...|
| 21243|[york, associate,...|(5000,[0,1,2,7,8,...|(5000,[0,1,2,7,8,...|[0.05339419306571...|
| 31240|[ireland, 's, oly...|(5000,[0,1,2,3,4,...|(5000,[0,1,2,3,4,...|[1.69562838087262...|
| 31241|[launch, online, ...|(5000,[0,1,2,6,13...|(5000,[0,1,2,6,13...
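
Each `topicDistribution` vector holds one probability per topic. A simple way to summarise a document is to take its most probable topic, as in the sketch below. The helper `dominant_topic` and the example vector are hypothetical.

```python
def dominant_topic(topic_distribution):
    """Return (topic index, probability) of the most probable topic."""
    best = max(range(len(topic_distribution)), key=topic_distribution.__getitem__)
    return best, topic_distribution[best]

# Hypothetical topicDistribution vector for one document (4 topics here).
distribution = [0.05, 0.70, 0.15, 0.10]
topic, probability = dominant_topic(distribution)
print(topic, probability)
```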