# SI 618 Homework 8 - Improving LDA

## Objectives
* to gain practical experience with NLP techniques
* to be exposed to loading large datasets via AWS S3 and parquet format

## Submission Instructions:
Please turn in your completed Databricks notebook in HTML format as well as the URL to the published version of your completed notebook.

## Assignment Instructions:
In this week's lab, we investigated the use of latent Dirichlet allocation (LDA) to analyze text.
In particular, we applied LDA to the Enron Corp. email data.  In this homework assignment we
are going to ask you to revisit the Enron LDA analysis and try to improve it.  

Recall that LDA seeks to extract a number of topics (the number is supplied by you) from a 
collection of documents. Each one of those topics can be described by the words that are most closely
associated with it and thereby facilitate the interpretation of the topic.  
For example, a topic that is most closely associated with the words blue, green, 
yellow, red, and purple might be interpreted as being about "colors".  That's an ideal
example.  In practice, the words that are associated with the topics often don't lead us to 
an easy interpretation of the topic.  In some cases, we can improve the interpretability of
the topics.  

For example, we can manipulate the model parameters (e.g. changing the number of topics) or
we can try to do a better job of cleaning the data before analyzing it.  We can experiment
with the inclusion or exclusion of stopwords.  Or we can get very creative and use bigrams or
trigrams instead of unigrams (words) in our analysis.
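To make the difference between unigrams, bigrams, and trigrams concrete, here is a minimal plain-Python sketch. It is not Spark's `NGram` transformer; the function name `ngrams` and the sample tokens are made up for illustration.

```python
def ngrams(tokens, n):
    """Return the space-joined n-grams of a list of tokens, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A window of width n slides over the tokens; short inputs yield no n-grams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical tokens from an email body
tokens = ["please", "review", "the", "attached", "contract"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Note that each step up in n makes individual terms rarer, so a document-frequency cutoff such as `minDF=2` is likely to discard a larger share of the vocabulary.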

This homework assignment provides you with an opportunity to improve the LDA we performed on
the Enron data, which is reproduced below.  To start this lab, run the cells below and examine the
output. Describe the topics and comment on the quality and/or interpretability of the topics. Then, follow the steps below to apply some of the techniques mentioned above.
 
One measure of the "goodness" of a topic model is the interpretability of the topics.  That is,
do the words associated with the topic form a coherent set (like the colors example above) or
are they seemingly random words?

You will notice that we're using a 1% sample of the email corpus (note the ```sample(0.01)``` function). Another measure of the "goodness" of a topic model is the stability of the model over different random samples.  
What happens to your topics when you re-run the analysis (thereby sampling a different 1%)?  What happens when you
run your analysis on the complete email corpus?

There are also two numerical measures of model goodness that are available:  log(perplexity) and log(likelihood).
Lower values of log(perplexity) are better, whereas higher values of log(likelihood) are generally
considered better.   You can use these "objective" measures in combination with the "subjective" assessments of 
the interpretability of topics when assessing your model.
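The two measures are closely related. As a rough illustration, assume that log perplexity is the negative log likelihood divided by the number of tokens in the corpus. This is our understanding of how Spark computes it, not something stated in the lab. The function name and the numbers below are hypothetical.

```python
def log_perplexity(log_likelihood, token_count):
    """Per-token negative log likelihood (assumed definition; lower is better)."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return -log_likelihood / token_count


# Hypothetical scores for two models evaluated on the same 10,000-token corpus
model_a = log_perplexity(-82000.0, 10000)
model_b = log_perplexity(-79000.0, 10000)
print(model_a, model_b)  # model_b has the higher likelihood and the lower perplexity
```

Under this assumption, log likelihood grows in magnitude with the size of the corpus, so it is only comparable between models scored on the same documents, while log perplexity is normalized per token.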

This assignment is worth a total of 80 points.  You will receive up to 16 points for each of the following 4 improvements:
1. Vary the number of topics from 6 to 12 (i.e. 6, 7, 8, 9, 10, 11, and 12).  Which value(s) gives you the "best" solution?  What criteria did you use for determining how good each solution is? 
2. How does the topic model change if you include or exclude stopwords? What's the best way to deal with non-alpha characters (e.g. numbers)? Is it better to include or exclude stopwords?  Use the "better" version in subsequent steps.
3. Clean the text from the body of each email message by excluding the "quoted replies" (i.e. the copy of the original message
that is often included in a reply).  How do the results of your topic model change? (Note: you might want to use RDDs and regular expressions for part of this analysis.)
4. Given the model from the "best" number of topics from Step 1, the best choice of including or excluding stopwords, 
and using cleaned email bodies, how consistent/stable are the topics from 
multiple runs (i.e. using different 1% samples)?  How do you define consistency and stability?

For each improvement, you will be assessed on:

1. the clarity of your code (both in terms of programming aspects such as variable names and in terms of Markdown cells explaining what you did),
2. the completeness of your interpretations, and
3. the quality of the presentation of your results (e.g. using tables and/or visualizations as appropriate).

### Above and Beyond
Select one of the following options for up to 16 points:
1. Use LDA to create two additional topic models based on (1) bigrams and (2) trigrams.  Find the best number of topics, determine whether to include
stopwords, and clean the email bodies.  How do these topics compare with the ones from the unigram analysis above in terms of interpretability and stability?
2. In the cell below, the LDA model of the enron DataFrame is stored in a DataFrame called ```enron_lda```.  If you examine that DataFrame you will notice a column called ```topicDistribution```, which tells you the proportion of each topic that makes up each document.  For each document (i.e. row in ```enron_lda```), figure out which topic is the dominant one and label that document as belonging to that topic.  So, for example, if you 
have a 6-topic model and see
```
topicDistribution=DenseVector([0.011, 0.0114, 0.9446, 0.0111, 0.011, 0.0109])
```
for a document, you would label that document as topic "3" because the largest number (0.9446) is associated with topic #3.  Based on this approach, report the number of documents that are labelled with each topic number.
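The labelling rule described in option 2 can be sketched in plain Python as follows. The distributions and the helper name `dominant_topic` are hypothetical; in Spark, each `topicDistribution` value would first need to be converted to a list (for example with `.toArray()`), and the helper would typically be wrapped in a UDF.

```python
from collections import Counter

# Hypothetical topic distributions for three documents in a 6-topic model
topic_distributions = [
    [0.011, 0.0114, 0.9446, 0.0111, 0.011, 0.0109],
    [0.70, 0.10, 0.05, 0.05, 0.05, 0.05],
    [0.02, 0.02, 0.90, 0.02, 0.02, 0.02],
]


def dominant_topic(distribution):
    """Return the 1-based number of the topic with the largest proportion."""
    values = list(distribution)
    return max(range(len(values)), key=values.__getitem__) + 1


labels = [dominant_topic(d) for d in topic_distributions]
counts = Counter(labels)  # number of documents labelled with each topic
print(labels)
print(sorted(counts.items()))
```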

### End of instructions... code follows

In [2]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, RegexTokenizer 
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.clustering import LDA
from pyspark.ml.pipeline import Pipeline
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords as nltkstopwords
nltk.download("book")

In [3]:
ACCESS_KEY = 
SECRET_KEY = 
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "umsi-data-science-west"
MOUNT_NAME = "umsi-data-science"
try:
  dbutils.fs.unmount("/mnt/%s/" % MOUNT_NAME)
except Exception:
  print("Could not unmount %s, but that's ok." % MOUNT_NAME)
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
#display(dbutils.fs.ls("/mnt/umsi-data-science/si618wn2017"))

In [4]:
# This is a helper function that looks up the words associated with indices.  
# It's used below.
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def indices_to_terms(vocabulary):
    def indices_to_terms(xs):
        return [vocabulary[int(x)] for x in xs]
    return udf(indices_to_terms, ArrayType(StringType()))
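
The lookup performed by this helper can be illustrated without Spark. The vocabulary and indices below are hypothetical stand-ins for `CountVectorizerModel.vocabulary` and the `termIndices` column.

```python
# Hypothetical vocabulary, ordered as a fitted CountVectorizer might store it
vocabulary = ["enron", "energy", "contract", "meeting"]

# Hypothetical term indices describing one topic
term_indices = [2, 0, 3]

# Each index is simply a position in the vocabulary list
topic_words = [vocabulary[int(i)] for i in term_indices]
print(topic_words)
```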

In [5]:
# The next line loads the Enron email dataset from parquet format.  For details, see
# https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
# Note: the following line samples rows without replacement using a fraction of 0.00001
# (about 0.001% of the rows); use 0.01 for an approximately 1% sample
enron = spark.read.parquet("/mnt/umsi-data-science/si618wn2017/mail.parquet").sample(False,0.00001)

In [6]:
# This cell is a complete machine learning pipeline to run LDA on a dataset
# Note that you might want to split this up into individual cells for
# your assignment.  

k = 6 # set the number of topics to extract

tokenizer = Tokenizer(inputCol="body", outputCol="words")

stopWordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered")
stopWordsRemover.loadDefaultStopWords("english")

#minDF=2 means word has to occur at least twice
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features", minDF=2) 

print("k = ", k)
lda = LDA(k=k, maxIter=10)  # use the k set above rather than a hard-coded value

# We've defined all of the transformers and estimators in our pipeline. Now set up the pipeline and fit it.
pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, vectorizer, lda])
pipelineModel = pipeline.fit(enron)

countVectorModel = pipelineModel.stages[-2]
cmv = countVectorModel.vocabulary
print("Vocab length is",len(cmv))

ldaModel = pipelineModel.stages[-1]

# Assess the model using .transform()
enron_lda = pipelineModel.transform(enron)

lp = ldaModel.logPerplexity(enron_lda)
print("Log perplexity  (lower is better): ",lp)
ll = ldaModel.logLikelihood(enron_lda)
print("Log likelihood (higher is better): ",ll)

# Describe topics.
topics = ldaModel.describeTopics(8)
topics = topics.withColumn(
    "topicWords", indices_to_terms(countVectorModel.vocabulary)("termIndices"))
topics.select("topicWords").show(10,truncate=False)


## 1. Determine the optimal number of topics

In [8]:
k = []
log_perplexity = []
log_likelihood = []

for kay in range(6,13,1):
  print ("k = ",kay)
  lda = LDA(k=kay, maxIter=10)
  pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, vectorizer, lda])
  pipelineModel = pipeline.fit(enron)
  
  countVectorModel = pipelineModel.stages[-2]
  cmv = countVectorModel.vocabulary

  ldaModel = pipelineModel.stages[-1]
  enron_lda = pipelineModel.transform(enron)
  
  k.append(kay)
  lp = ldaModel.logPerplexity(enron_lda)
  lp = round(lp, 4)
  log_perplexity.append(lp)
  ll = ldaModel.logLikelihood(enron_lda)
  ll = round(ll, 4)
  log_likelihood.append(ll)

df_lda_scores = pd.DataFrame({'perplexity': log_perplexity, 'likelihood': log_likelihood}, index=k)
df_lda_scores.sort_values(by=['likelihood','perplexity'], ascending=[False,True])

After running LDA analyses with the number of topics ranging from 6 to 12, we conclude that the 6-topic model has the highest likelihood and lowest perplexity. Let's use this model below:

In [10]:
lda = LDA(k=6, maxIter=10)
pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, vectorizer, lda])
pipelineModel = pipeline.fit(enron)

countVectorModel = pipelineModel.stages[-2]
cmv = countVectorModel.vocabulary

ldaModel = pipelineModel.stages[-1]
enron_lda = pipelineModel.transform(enron)

lp_no_stop = ldaModel.logPerplexity(enron_lda)
print("Log perplexity  (no stopwords): ",lp_no_stop)
ll_no_stop = ldaModel.logLikelihood(enron_lda)
print("Log likelihood (no stopwords): ",ll_no_stop)

topics = ldaModel.describeTopics(6)
topics = topics.withColumn(
    "topicWords", indices_to_terms(countVectorModel.vocabulary)("termIndices"))
topics.select("topicWords").show(6,truncate=False)

## 2. Now tweak the pipeline, building it without a stop words remover and cleaning the text of numbers

In [12]:
# Re-run analysis without a stop words remover in our pipeline
tokenizer = Tokenizer(inputCol="body", outputCol="words")
vectorizer = CountVectorizer(inputCol="words", outputCol="features", minDF=2)
lda = LDA(k=6, maxIter=10)
pipeline = Pipeline(stages=[tokenizer, vectorizer, lda])
pipelineModel = pipeline.fit(enron)

countVectorModel = pipelineModel.stages[-2]
cmv = countVectorModel.vocabulary

ldaModel = pipelineModel.stages[-1]
enron_lda = pipelineModel.transform(enron)

lp_w_stop = ldaModel.logPerplexity(enron_lda)
print("Log perplexity  (w/ stopwords): ",lp_w_stop)
ll_w_stop = ldaModel.logLikelihood(enron_lda)
print("Log likelihood (w/ stopwords): ",ll_w_stop)

In [13]:
topics = ldaModel.describeTopics(6)
topics = topics.withColumn(
    "topicWords", indices_to_terms(countVectorModel.vocabulary)("termIndices"))
topics.select("topicWords").show(6,truncate=False)

It's clear that stop words shouldn't be included. Many of the topics include stop words, which are not that informative. In fact, one of our topics consists only of stop words ("of", "the", "to", "and", "you"). Below, we re-run our pipeline after removing numerical characters from the text:

In [15]:
# Keep only runs of letters; "+" avoids the empty matches that "*" allows
tokenizer = RegexTokenizer(inputCol="body", outputCol="words", pattern="[a-zA-Z]+", gaps=False)
stopWordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered")
stopWordsRemover.loadDefaultStopWords("english")
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features", minDF=2)
lda = LDA(k=6, maxIter=10)
pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, vectorizer, lda])
pipelineModel = pipeline.fit(enron)

countVectorModel = pipelineModel.stages[-2]
cmv = countVectorModel.vocabulary

ldaModel = pipelineModel.stages[-1]
enron_lda = pipelineModel.transform(enron)

lp_no_nums = ldaModel.logPerplexity(enron_lda)
print("Log perplexity  (w/out numbers): ",lp_no_nums)
ll_no_nums = ldaModel.logLikelihood(enron_lda)
print("Log likelihood (w/out numbers): ",ll_no_nums)

topics = ldaModel.describeTopics(6)
topics = topics.withColumn(
    "topicWords", indices_to_terms(countVectorModel.vocabulary)("termIndices"))
topics.select("topicWords").show(6,truncate=False)

Removing numeric characters from the text improves our analysis. The topics are still not very informative, but they are better than before. What are the topics?
- The first has to do with agreements and contracts.
- The second, third, and fourth aren't that informative. They look like a smattering of email header parts such as com, subject, and cc.
- The fifth seems to have to do with management (mind, role, everyone, ...).
- The sixth topic isn't that informative either.

Many emails contain the word Monday, which suggests that employees may have been emailing over the weekend to set up in-person meetings to discuss problems on Monday.
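
The letters-only tokenization can be checked outside Spark with Python's `re` module. This is a small sketch on a made-up sentence, not the Spark tokenizer itself. It also shows why a pattern ending in `+` is safer than one ending in `*`: the latter matches the empty string. Spark's `RegexTokenizer` appears to filter such empty tokens through its `minTokenLength` setting, but we have not verified that here.

```python
import re

# Hypothetical email text containing digits and punctuation
text = "Meet at 10am on 5/14 re: contract #42"

# "[a-zA-Z]*" can match the empty string, so findall returns empty tokens too
star_tokens = re.findall(r"[a-zA-Z]*", text)

# "[a-zA-Z]+" requires at least one letter per token
plus_tokens = re.findall(r"[a-zA-Z]+", text.lower())

print("" in star_tokens)
print(plus_tokens)
```

One side effect is visible in the output: a mixed token such as "10am" is reduced to "am", so dropping digits can create short fragments that may be worth filtering as well.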

## 3
Clean the text from the body of each email message by excluding the "quoted replies" (i.e. the copy of the original message that is often included in a reply). How do the results of your topic model change? (Note: you might want to use RDDs and regular expressions for part of this analysis.)
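
Before applying the cleaning in Spark, the idea can be tried on a single string. The sketch below assumes two markers of quoted text: an "Original Message" separator line and lines starting with ">". Real emails use other forms as well, and the function name and sample body are hypothetical.

```python
import re

# Assumed separator for a quoted reply, matched regardless of case
ORIGINAL_MESSAGE = re.compile(r"-+\s*original message\s*-+", re.IGNORECASE)


def strip_quoted_reply(body):
    """Keep the text before the first separator and drop lines starting with '>'."""
    head = ORIGINAL_MESSAGE.split(body, maxsplit=1)[0]
    kept = [line for line in head.splitlines() if not line.lstrip().startswith(">")]
    return "\n".join(kept).strip()


# Hypothetical email body with both kinds of quoted text
body = (
    "Sounds good, see you Monday.\n"
    "> Can we meet next week?\n"
    " -----Original Message-----\n"
    "From: someone\n"
    "Can we meet next week?"
)
print(strip_quoted_reply(body))
```

Matching the separator case-insensitively matters if the text is lowercased at any point before the split.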

In [18]:
from pyspark.sql import Row
import re

enron_data = sqlContext.read.parquet("/mnt/umsi-data-science/si618wn2017/mail.parquet").sample(False, 0.0001)
enron_rdd = enron_data.select("body").rdd.flatMap(list)
# Lowercase first, then cut everything from the (now lowercase) quoted-reply marker onward
enron_rdd_parsed = enron_rdd.map(lambda document: re.split(r"\s*-----original message-----", document.strip().lower())[0])
enron_df = enron_rdd_parsed.map(lambda x: Row(body= x))
enron_df = spark.createDataFrame(enron_df)

In [19]:
# These are the same tokenizer, stop words remover, vectorizer, and LDA as before,
# but the pipeline is now fit on the cleaned email bodies in enron_df.
tokenizer = RegexTokenizer(inputCol="body", outputCol="words", pattern="[a-zA-Z]+", gaps=False)
stopWordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered")
stopWordsRemover.loadDefaultStopWords("english")
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features", minDF=2)
lda = LDA(k=6, maxIter=10)
pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, vectorizer, lda])
pipelineModel = pipeline.fit(enron_df)

countVectorModel = pipelineModel.stages[-2]
cmv = countVectorModel.vocabulary
print("Vocab length is",len(cmv))

ldaModel = pipelineModel.stages[-1]
enron_lda = pipelineModel.transform(enron_df)
lp = ldaModel.logPerplexity(enron_lda)
print("Log perplexity: ",lp)
ll = ldaModel.logLikelihood(enron_lda)
print("Log likelihood: ",ll)

topics = ldaModel.describeTopics(8)
topics = topics.withColumn(
    "topicWords", indices_to_terms(countVectorModel.vocabulary)("termIndices"))
topics.select("topicWords").show(10,truncate=False)

Our topic model is slightly improved. Many of the topics are still about meetings and documentation, but each topic seems more coherent. For example, the first topic appears to involve one employee asking another about questions and changes to an agreement. Monday, April, and May all remain important topic words. This likely reflects when our data is from (April and May, right before Enron failed as a company). One possible explanation for why Monday is such a common topic word is that there were many changes in energy transfer pricing markets over the weekend.

## 4
Given the model from the "best" number of topics from Step 1, the best choice of including or excluding stopwords, and using cleaned email bodies, how consistent/stable are the topics from multiple runs (i.e. using different 1% samples)? How do you define consistency and stability?

In [22]:
iteration = []
log_perplexity = []
log_likelihood = []

for b in range(5):
  boot = spark.read.parquet("/mnt/umsi-data-science/si618wn2017/mail.parquet").sample(False,0.00001)
  boot_rdd = boot.select("body").rdd.flatMap(list)
  # Lowercase first, then cut everything from the (now lowercase) quoted-reply marker onward
  boot_rdd_parsed = boot_rdd.map(lambda document: re.split(r"\s*-----original message-----", document.strip().lower())[0])
  boot_df = boot_rdd_parsed.map(lambda x: Row(body= x))
  boot_df = spark.createDataFrame(boot_df)
  
  tokenizer = RegexTokenizer(inputCol="body", outputCol="words", pattern="[a-zA-Z]+", gaps=False)
  stopWordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered")
  stopWordsRemover.loadDefaultStopWords("english")
  vectorizer = CountVectorizer(inputCol="filtered", outputCol="features", minDF=2)
  lda = LDA(k=6, maxIter=10)
  pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, vectorizer, lda])
  
  pipelineModel = pipeline.fit(boot_df)
  
  #countVectorModel = pipelineModel.stages[-2]
  #cmv = countVectorModel.vocabulary

  ldaModel = pipelineModel.stages[-1]
  enron_lda = pipelineModel.transform(boot_df)
  
  iteration.append(b)
  lp = ldaModel.logPerplexity(enron_lda)
  lp = round(lp, 4)
  log_perplexity.append(lp)
  ll = ldaModel.logLikelihood(enron_lda)
  ll = round(ll, 4)
  log_likelihood.append(ll)

In [23]:
# Index the scores by iteration number (boot is a Spark DataFrame, not a list of labels)
boot_scores = pd.DataFrame({'perplexity': log_perplexity, 'likelihood': log_likelihood}, index=iteration)

In [24]:
print(boot_scores.var(axis=0))
print(boot_scores.mean(axis=0))

We see tiny variation in likelihood relative to the mean, but huge variation in the perplexity relative to the mean.
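
Variance in the scores is only one way to look at stability. Another possible definition, sketched here with made-up topic words, compares the top words of each topic across two runs using Jaccard overlap. Because topic order is arbitrary from run to run, each topic is matched to its most similar counterpart. The function names and word lists are hypothetical, and this was not part of the analysis above.

```python
def jaccard(a, b):
    """Jaccard similarity of two word collections (1.0 means identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match_stability(run_a, run_b):
    """Average, over topics in run_a, of the best overlap with any topic in run_b."""
    return sum(max(jaccard(t, u) for u in run_b) for t in run_a) / len(run_a)


# Hypothetical top words per topic from two runs on different samples
run_1 = [["agreement", "contract", "legal"], ["meeting", "monday", "call"]]
run_2 = [["monday", "meeting", "schedule"], ["contract", "agreement", "legal"]]

score = best_match_stability(run_1, run_2)
print(score)
```

A score near 1 would suggest the same topics reappear across samples, while a score near 0 would suggest the topics are mostly artifacts of the particular sample.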

## Above and Beyond
First, calculate optimal number of topics when using bigrams instead of single tokens

In [27]:
from pyspark.ml.feature import NGram
enron = spark.read.parquet("/mnt/umsi-data-science/si618wn2017/mail.parquet").sample(False,0.00001)

k = []
log_perplexity = []
log_likelihood = []

In [28]:
for kay in range(6,13,1):
  print ("k = ",kay)
  tokenizer = Tokenizer(inputCol="body", outputCol="words")
  stopWordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered")
  stopWordsRemover.loadDefaultStopWords("english")
  bigram = NGram(n=2, inputCol="filtered", outputCol="bigrams")
  vectorizer = CountVectorizer(inputCol="bigrams", outputCol="features", minDF=2) 
  
  lda = LDA(k=kay, maxIter=10)
  pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, bigram, vectorizer, lda])
  pipelineModel = pipeline.fit(enron)
  
  countVectorModel = pipelineModel.stages[-2]
  cmv = countVectorModel.vocabulary

  ldaModel = pipelineModel.stages[-1]
  enron_lda = pipelineModel.transform(enron)
  
  k.append(kay)
  lp = ldaModel.logPerplexity(enron_lda)
  lp = round(lp, 4)
  log_perplexity.append(lp)
  ll = ldaModel.logLikelihood(enron_lda)
  ll = round(ll, 4)
  log_likelihood.append(ll)

df_lda_scores = pd.DataFrame({'perplexity': log_perplexity, 'likelihood': log_likelihood}, index=k)
df_lda_scores.sort_values(by=['likelihood','perplexity'], ascending=[False,True])

In [29]:
# Now clean the text of numbers, and get rid of quoted replies, and print
optimal_k = 6

enron_data = sqlContext.read.parquet("/mnt/umsi-data-science/si618wn2017/mail.parquet").sample(False, 0.0001)
enron_rdd = enron_data.select("body").rdd.flatMap(list)
# Lowercase first, then cut everything from the (now lowercase) quoted-reply marker onward
enron_rdd_parsed = enron_rdd.map(lambda document: re.split(r"\s*-----original message-----", document.strip().lower())[0])
enron_df = enron_rdd_parsed.map(lambda x: Row(body= x))
enron_DataFrame = spark.createDataFrame(enron_df)

tokenizer = RegexTokenizer(inputCol="body", outputCol="words", pattern="[a-zA-Z]+", gaps=False)
stopWordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered")
stopWordsRemover.loadDefaultStopWords("english")
bigram = NGram(n=2, inputCol="filtered", outputCol="bigrams")
vectorizer = CountVectorizer(inputCol="bigrams", outputCol="features", minDF=2) 
  
lda = LDA(k=optimal_k, maxIter=10)
pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, bigram, vectorizer, lda])
pipelineModel = pipeline.fit(enron_DataFrame)
  
countVectorModel = pipelineModel.stages[-2]
cmv = countVectorModel.vocabulary

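ldaModel = pipelineModel.stages[-1]
enron_lda = pipelineModel.transform(enron_DataFrame)

# Show the bigrams produced by the fitted pipeline. The NGram stage needs the
# "filtered" column and writes to "bigrams", so read them from the transformed DataFrame.
enron_lda.select("bigrams").show(truncate=False)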
  
lp = ldaModel.logPerplexity(enron_lda)
print(round(lp, 4))
ll = ldaModel.logLikelihood(enron_lda)
print(round(ll, 4))

Now do the same for trigrams

In [31]:
k = []
log_perplexity = []
log_likelihood = []

for kay in range(6,13,1):
  print ("k = ",kay)
  tokenizer = Tokenizer(inputCol="body", outputCol="words")
  stopWordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered")
  stopWordsRemover.loadDefaultStopWords("english")
  trigram = NGram(n=3, inputCol="filtered", outputCol="trigrams")
  vectorizer = CountVectorizer(inputCol="trigrams", outputCol="features", minDF=2) 
  
  lda = LDA(k=kay, maxIter=10)
  pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, trigram, vectorizer, lda])
  pipelineModel = pipeline.fit(enron)
  
  countVectorModel = pipelineModel.stages[-2]
  cmv = countVectorModel.vocabulary

  ldaModel = pipelineModel.stages[-1]
  enron_lda = pipelineModel.transform(enron)
  
  k.append(kay)
  lp = ldaModel.logPerplexity(enron_lda)
  lp = round(lp, 4)
  log_perplexity.append(lp)
  ll = ldaModel.logLikelihood(enron_lda)
  ll = round(ll, 4)
  log_likelihood.append(ll)

df_lda_scores = pd.DataFrame({'perplexity': log_perplexity, 'likelihood': log_likelihood}, index=k)
df_lda_scores.sort_values(by=['likelihood','perplexity'], ascending=[False,True])

As with bigrams, the trigram scores favor 6 topics.

In [33]:
# Now clean the text of numbers, and get rid of quoted replies, and print
optimal_k = 6

enron_data = sqlContext.read.parquet("/mnt/umsi-data-science/si618wn2017/mail.parquet").sample(False, 0.0001)
enron_rdd = enron_data.select("body").rdd.flatMap(list)
# Lowercase first, then cut everything from the (now lowercase) quoted-reply marker onward
enron_rdd_parsed = enron_rdd.map(lambda document: re.split(r"\s*-----original message-----", document.strip().lower())[0])
enron_df = enron_rdd_parsed.map(lambda x: Row(body= x))
enron_DataFrame = spark.createDataFrame(enron_df)

tokenizer = RegexTokenizer(inputCol="body", outputCol="words", pattern="[a-zA-Z]+", gaps=False)
stopWordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered")
stopWordsRemover.loadDefaultStopWords("english")
trigram = NGram(n=3, inputCol="filtered", outputCol="trigrams")
vectorizer = CountVectorizer(inputCol="trigrams", outputCol="features", minDF=2) 
  
lda = LDA(k=optimal_k, maxIter=10)
pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, trigram, vectorizer, lda])
pipelineModel = pipeline.fit(enron_DataFrame)
  
countVectorModel = pipelineModel.stages[-2]
cmv = countVectorModel.vocabulary

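ldaModel = pipelineModel.stages[-1]
enron_lda = pipelineModel.transform(enron_DataFrame)

# Show the trigrams produced by the fitted pipeline. The NGram stage needs the
# "filtered" column and writes to "trigrams", so read them from the transformed DataFrame.
enron_lda.select("trigrams").show(truncate=False)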
  
lp = ldaModel.logPerplexity(enron_lda)
print(round(lp, 4))
ll = ldaModel.logLikelihood(enron_lda)
print(round(ll, 4))