## Topic Modeling with Latent Dirichlet Allocation

"_In machine learning and natural language processing, a topic model is a type of statistical <br>
model for discovering the abstract “topics” that occur in a collection of documents. (from Wikipedia)_"

In other words, a topic is a group of words that best represents the information in a collection of documents.

Latent Dirichlet Allocation (LDA) represents each document as a mixture of topics, and each topic generates words with certain probabilities. <br>
<font color=blue>More precisely, a topic is a probability distribution over the entire vocabulary.</font>

<img src="./images/lda.png"  width="800" height="400" />
<img src="./images/lda_graphical.png" width="800" height="400"/>
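
To make the two ideas above concrete, here is a minimal sketch in plain Python. The vocabulary, the two topics, and the mixture weights are made-up illustrative numbers, not values learned from the tweet data. It shows that each topic is a distribution over the whole vocabulary, and that a document's word probabilities follow from mixing the topics.

```python
# Toy vocabulary and two hypothetical topics. Each topic is a probability
# distribution over the whole vocabulary, so each row sums to 1.
vocabulary = ["economy", "jobs", "border", "wall"]
topics = [
    [0.5, 0.4, 0.05, 0.05],  # topic 0
    [0.05, 0.05, 0.4, 0.5],  # topic 1
]

# A document is a mixture of topics (the weights sum to 1).
doc_topic_mixture = [0.75, 0.25]


def word_probabilities(mixture, topics):
    """Return p(word | doc) = sum over k of p(topic k | doc) * p(word | topic k)."""
    vocab_size = len(topics[0])
    return [
        sum(weight * topic[w] for weight, topic in zip(mixture, topics))
        for w in range(vocab_size)
    ]


probs = word_probabilities(doc_topic_mixture, topics)
for word, p in zip(vocabulary, probs):
    print("{0}\t{1:.4f}".format(word, p))
```

LDA works in the opposite direction: it only observes the words in each document and infers the topics and the per-document mixtures.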

### Overall Procedure Using PySpark
<img src="./images/overall.png" />

***

### Overall Description of the Assignment

You are to apply LDA to Mr. Trump's tweet data and infer topics that represent the entire tweet data. From a topic modeling perspective, each tweet is a document, and the whole collection of tweets is the corpus. For the assignment, you need to process the original input and produce a sparse vector representation, which will be passed to Spark's LDA package.

Data is located at: _data/trump.csv_ and its format is described in _data/trump_header.txt_

Show your steps below and submit your notebook.

## Warm-up exercise

#### Compute the total number of tweets per year, and per month.
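
One possible way to approach the counting is sketched below in plain Python. The timestamps and their format are hypothetical; the real column layout is described in _data/trump_header.txt_ and may differ, so the parsing format would need to be adapted.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps. The real column and its format are described in
# data/trump_header.txt and may not match this assumed format.
sample_timestamps = [
    "2016-01-15 08:30:00",
    "2016-01-20 21:05:10",
    "2016-03-02 12:00:00",
    "2017-03-09 07:45:30",
]


def year_and_month(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string and return (year, month)."""
    parsed = datetime.strptime(timestamp, fmt)
    return parsed.year, parsed.month


# Count by year, and by (year, month) pair.
tweets_per_year = Counter(year_and_month(ts)[0] for ts in sample_timestamps)
tweets_per_month = Counter(year_and_month(ts) for ts in sample_timestamps)

print(sorted(tweets_per_year.items()))
print(sorted(tweets_per_month.items()))
```

On an RDD, the same idea could be expressed by mapping each tweet to its year (or its year and month) and then calling `countByValue()`.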

## LDA Assignment

### Step 0: Setting up SparkContext

In [None]:
from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf())

In [None]:
from collections import defaultdict,OrderedDict
from pyspark.mllib.linalg import Vector, Vectors
from pyspark.mllib.clustering import LDA, LDAModel
import re

### Step 1: Create an RDD of Tweets

### Step 2: Filter out unnecessary characters, numbers, etc.
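
A minimal sketch of one way to clean a single tweet is shown here. The function name `clean_tweet` and the choice to keep only ASCII letters are assumptions made for illustration; other cleaning rules may suit the data better.

```python
import re


def clean_tweet(text):
    """Lowercase a tweet and keep only letters separated by single spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # drop digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace


print(clean_tweet("Jobs are UP 25%!! Read more: https://example.com #MAGA"))
```

Because the function takes one string and returns one string, it could be passed directly to an RDD's `map`.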

### Step 3: Remove stop words. Each RDD element is now a bag of words.

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
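
As a rough illustration of the stop-word step, the sketch below turns one cleaned tweet into a bag of words. It uses a tiny hypothetical stop-word set so that it runs without the NLTK data; in the notebook, the `stop_words` set defined above would be passed in instead. Dropping one-letter tokens is an extra assumption, not a requirement of the assignment.

```python
# Small hypothetical stop-word set so the sketch runs without NLTK data.
example_stop_words = {"the", "is", "a", "are", "and", "to"}


def to_bag_of_words(cleaned_text, stop_words):
    """Split cleaned text on whitespace, dropping stop words and one-letter tokens."""
    return [w for w in cleaned_text.split()
            if w not in stop_words and len(w) > 1]


bag = to_bag_of_words("the economy is strong and jobs are up", example_stop_words)
print(bag)
```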

### Step 4: Create a dictionary of unique words in the corpus. Also create an inverse dictionary
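
The two mappings can be sketched on a toy corpus as follows. The bags of words here are invented, and sorting the vocabulary is only one way to get a stable assignment of IDs.

```python
# Toy corpus: each inner list is the bag of words of one document.
bags_of_words = [
    ["economy", "jobs", "economy"],
    ["border", "wall", "jobs"],
]

# Collect the unique words; sorting makes the ID assignment reproducible.
unique_words = sorted({w for bag in bags_of_words for w in bag})

# dictionary: word -> integer ID; inverse_dictionary: integer ID -> word.
dictionary = {word: idx for idx, word in enumerate(unique_words)}
inverse_dictionary = {idx: word for word, idx in dictionary.items()}

print(dictionary)
print(inverse_dictionary)
```

With an RDD of bags of words, the unique words could be gathered with `flatMap`, `distinct`, and `collect` before building the two dictionaries on the driver.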

### Step 5: Convert each bag of words into a sparse vector with ID

### Step 6: Apply LDA, create a model, and print out topics.

In [None]:
# Hint for Step 5.
dict_of_doc = { 4: 3, 2: 4, 5: 5}
sorted_keys, sorted_values = zip(*sorted(dict_of_doc.items()))
print(sorted_keys, sorted_values)

In [None]:
# Hint for Step 5.
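def to_sparseVector(doc):
    # create doc_id, dict_of_doc, sorted_keys, sorted_values here...
    # The vector size must be the vocabulary size, len(dictionary), not
    # len(dict_of_doc): the indices are word IDs drawn from the whole vocabulary.
    return [doc_id, Vectors.sparse(len(dictionary),
                                   sorted_keys,
                                   sorted_values)]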
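
The pieces that `Vectors.sparse` expects (size, sorted indices, values) can be computed in plain Python, as in this sketch. The helper name `sparse_components` and the toy dictionary are hypothetical, and the Spark call itself is left out so the example runs without a Spark installation.

```python
from collections import Counter


def sparse_components(bag_of_words, dictionary):
    """Return (vector size, sorted word IDs, term counts) for one document."""
    # Count how often each known word ID occurs in the document.
    counts = Counter(dictionary[w] for w in bag_of_words if w in dictionary)
    if not counts:
        return len(dictionary), (), ()
    # Sparse vectors need indices in increasing order.
    ids, values = zip(*sorted(counts.items()))
    return len(dictionary), ids, tuple(float(v) for v in values)


toy_dictionary = {"border": 0, "economy": 1, "jobs": 2, "wall": 3}
size, ids, values = sparse_components(["jobs", "economy", "jobs"], toy_dictionary)
print(size, ids, values)
```

The vector size is the vocabulary size rather than the number of distinct words in the document, since every document must live in the same vector space.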

In [None]:
corpus = raw_rdd.map(to_sparseVector)

In [None]:
num_topics = 5
num_words_in_topic = 10

lda_model = LDA.train(corpus, k=num_topics, maxIterations=50)
topics = lda_model.describeTopics(maxTermsPerTopic=num_words_in_topic)

for idx in range(num_topics):
    print("Topic #{0}".format(idx))
    for i in range(num_words_in_topic):
        print("  {0}\t{1}".format(inverse_dictionary[topics[idx][0][i]], topics[idx][1][i]))

    print("")

print("Vocabulary size = {0}".format(len(dictionary)))

***Note***: The topic results are not very meaningful, because many uninformative words remain in the vocabulary. How can you improve them? For example, how can you remove the 200 most frequent words?
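
One possible approach to the question above is sketched here on a toy corpus: count word frequencies across all documents, then filter the most frequent words out of every bag. The helper name `remove_most_frequent` is hypothetical, and the cutoff is 1 instead of 200 only because the toy corpus is tiny.

```python
from collections import Counter


def remove_most_frequent(bags_of_words, top_n):
    """Drop the top_n most frequent words (by corpus count) from every bag."""
    frequencies = Counter(w for bag in bags_of_words for w in bag)
    too_common = {w for w, _ in frequencies.most_common(top_n)}
    return [[w for w in bag if w not in too_common] for bag in bags_of_words]


toy_bags = [["great", "jobs", "great"], ["great", "wall"], ["jobs", "border"]]
filtered = remove_most_frequent(toy_bags, top_n=1)
print(filtered)
```

The dictionary and the sparse vectors would then need to be rebuilt from the filtered bags, since the vocabulary has changed.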