# Latent Dirichlet Allocation

This is the main chunk of the code.

The eventual goal is to treat each user's hashtag list as document 1 and the cleaned full-text words as document 2, so that each user has two documents. I then run topic modeling on each document to find, for each user, a list of topics and the words that belong to each topic. The result for each user is a dictionary with topics as keys and the words associated with each topic as values. From there, I hope to build some sort of visualization that surfaces the most relevant topics, i.e. those containing the words I am interested in.

## Imports

In [1]:
# General imports
import json
import glob
import pickle
import collections
import random
from tqdm import tqdm
import config
import time
import os
dirpath = os.path.dirname(os.path.realpath('__file__'))
from pprint import pprint

# import logging
# logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
# logging.root.level = logging.INFO

# NLP imports
import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['https', 'http'])
import re
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# To check later whether a word is English or not
with open('./words_dictionary.json') as filehandle:
    words_dictionary = json.load(filehandle)
english_words = words_dictionary.keys()

# Visualization imports
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
import matplotlib.pyplot as plt

# Other imports
import pandas as pd
import numpy as np
import tweepy

## Creating the cleaned and simplified tweet dictionary

### Note on the format of input and output dictionaries

Here, we first load the dictionaries that were dumped as pickle files and then run a series of text processing and cleaning tasks. I start with a dictionary of the form:

```
{
    market_1: {
                screen_name_1: [{tweet1, ..., tweetn}],
                .
                .
                .
                screen_name_m: [{tweet1, ..., tweetn}]
             }
    .
    .
    .
    market_k: {
                screen_name_1: [{tweet1, ..., tweetn}],
                .
                .
                .
                screen_name_m: [{tweet1, ..., tweetn}]
             }
}
```

This section of the code then processes it into a dictionary of the form:

```
{
    market_1: {
                screen_name_1: 
                    {
                        hashtags: [list of hashtags from each tweet], 
                        fulltext: [list of all cleaned/depunkt words across all tweets]
                    },
                .
                .
                screen_name_m: 
                    {
                        hashtags: [list of hashtags from each tweet], 
                        fulltext: [list of all cleaned/depunkt words across all tweets]
                    }
              }
    .
    .
    .
    market_k: {
                screen_name_1: 
                    {
                        hashtags: [list of hashtags from each tweet], 
                        fulltext: [list of all cleaned/depunkt words across all tweets]
                    },
                .
                .
                screen_name_m: 
                    {
                        hashtags: [list of hashtags from each tweet], 
                        fulltext: [list of all cleaned/depunkt words across all tweets]
                    }
              }
}
```

Then I can turn this into a pandas dataframe and do some pretty nice data manipulation.
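A minimal sketch of that conversion, assuming the output format shown above. The `toy_dict` data and the `n_words` column are made up for illustration and are not part of the real dataset.

```python
import pandas as pd

# Hypothetical toy data in the output format described above
toy_dict = {
    "market_1": {
        "alice": {"hashtags": ["nba", "pdx"], "fulltext": ["player", "coach", "game"]},
        "bob": {"hashtags": [], "fulltext": ["wine", "farm"]},
    },
    "market_2": {
        "carol": {"hashtags": ["tech"], "fulltext": ["startup", "customer"]},
    },
}

# One row per (market, user); the inner dict supplies the remaining columns
rows = [
    {"market": market, "screen_name": user, **fields}
    for market, users in toy_dict.items()
    for user, fields in users.items()
]
df = pd.DataFrame(rows)

# Example manipulation: number of cleaned words per user
df["n_words"] = df["fulltext"].apply(len)
print(df[["market", "screen_name", "n_words"]])
```

Flattening to one row per user keeps the market as an ordinary column, so grouping or filtering by market becomes a regular `groupby` or boolean mask.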

We will call this dictionary the `master_dict`.

To do this, we first define some helper functions

### Defining some utility functions

In [2]:
def get_user(tweet):
    """
    input: tweet dictionary
    returns: return the username
    """
    return tweet['user']['screen_name']


def get_hashtag_list(tweet):
    """
    input: tweet dictionary
    returns: list of all hashtags in both the direct tweet and the
    retweet 
    """

    l = []
    for d in tweet['entities']['hashtags']:
        l += [d['text']]

    if 'retweeted_status' in tweet.keys():
        for d in tweet['retweeted_status']['entities']['hashtags']:
            l += [d['text']]
    return l


def tokenizer_cleaner_nostop_lemmatizer(text):
    """
    This function tokenizes the text of a tweet, strips punctuation,
    removes stop words and non-English words, and lemmatizes the rest
    (i.e. reduces words to their root form to remove noise).
    I am largely using the gensim and spacy packages.

    Input: Some text
    Output: List of tokenized, cleaned, lemmatized words
    """

    tokenized_depunkt = gensim.utils.simple_preprocess(text, min_len=4, deacc=True)
    tokenized_depunkt_nostop = ([word for word in tokenized_depunkt 
                                 if (word not in stop_words and word in english_words)])
    
    # Lemmatizer while also only allowing certain parts of speech.
    # See here: https://spacy.io/api/annotation
    allowed_pos = ['ADJ', 'ADV', 'NOUN', 'PROPN','VERB']
    doc = nlp(' '.join(tokenized_depunkt_nostop))
    words_final = [token.lemma_ for token in doc if token.pos_ in allowed_pos]
    return words_final

    
def get_tweet_words_list(tweet):
    """
    This function takes in a tweet and, if it is a retweet, also includes the retweeted text
    input: tweet dictionary
    output: list of cleaned, tokenized, lemmatized words
    """

    text = tweet['full_text']
    clean_words = tokenizer_cleaner_nostop_lemmatizer(text)
    
    if 'retweeted_status' in tweet.keys():
        retweet_text = tweet['retweeted_status']['full_text']
        retweet_clean_words = tokenizer_cleaner_nostop_lemmatizer(retweet_text)
        clean_words += retweet_clean_words
    return clean_words

# load the classifier model and the commercial filter LDA model
with open('./models/commercial-filter-classifier.model', 'rb') as filehandle:
    clf = pickle.load(filehandle)

lda_model_path = './ldamodels/random_users/model.model'
lda_model_random_users = gensim.models.ldamodel.LdaModel.load(
                                                        lda_model_path)

with open('./ldamodels/random_users/corpus.corpus', 'rb') as filehandle:
    random_users_corpus = pickle.load(filehandle)

def get_augmented_feature_vectors(feature_vectors):
    """
    Takes in the list of feature vectors and augments it. gensim omits
    topics whose probability falls below its minimum-probability threshold,
    so I need to add them manually as zeros to build a fixed-length feature vector.
    input: the feature vectors output by gensim, as a list with one entry per
    document, where each entry is a list of (topic, probability) tuples.
    Assumes a 10-topic model and appends the missing topics to the input
    lists in place.
    returns: augmented feature vectors as a list of lists. Each entry
    corresponds to one document, with the i-th element of the inner list
    being the probability that the document was generated by topic i.
    """
    augmented = []
    for i, vector in enumerate(feature_vectors): # each vector is a list of tuples
        topics = [tup[0] for tup in vector]
        for t in range(10):
            if t not in topics:
                feature_vectors[i].append((t, 0))
        new_feature_vector = sorted(feature_vectors[i], key=lambda tup: tup[0])
        augmented.append([tup[1] for tup in new_feature_vector])
    return augmented

def feature_vector_commercial_model(doc):
    """
    This function accepts a document and then makes an inference with the
    lda model trained on the commercial/random user dataset. It then
    returns the feature vector which consists of the probabilities that document
    was generated by topic i. Note that this feature vector is unaugmented, and we
    will call the augmentation function to add zeros appropriately.
    
    input: document consisting of the full text of all tweets of a particular user
    returns: probability feature vector. 
    """
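    # NOTE: this builds a fresh dictionary from `doc` alone, so the token ids
    # are not guaranteed to match the ids the LDA model was trained with.
    # Mapping `doc` through the training dictionary is probably what is intended.
    id2word = corpora.Dictionary([doc]) # Dictionary expects a list of token lists
    corpus = id2word.doc2bow(doc)
    topics = lda_model_random_users.get_document_topics(corpus)
    return get_augmented_feature_vectors([topics]) # for consistency in dimensions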

def commercial_Q(feature_vector):
    """
    This function uses the pre-trained binary classifier to predict whether
    the full-text document of a given user is commercial or not.

    input: feature vector
    returns: scalar prediction on commercial label (1 if yes)
    """

    return clf.predict(feature_vector)[0]

# The workflow above, up to and including commercial_Q, passed some basic unit tests
# showing that it functions correctly.
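The zero-padding that `get_augmented_feature_vectors` performs can also be written without modifying its input and without hard-coding the number of topics. This is only a sketch of the idea: the name `densify` and the example vector are hypothetical.

```python
def densify(sparse_vector, num_topics):
    """Turn (topic_id, probability) pairs into a dense list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_vector:
        dense[topic_id] = prob
    return dense


# Hypothetical sparse output for one document under a 10-topic model
sparse = [(2, 0.61), (7, 0.38)]
dense = densify(sparse, num_topics=10)
print(dense)
```

Starting from a list of zeros avoids both the sort and the in-place `append` on the caller's data.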



## Creating the `master_dict`

In [3]:
# with open('./data/all_tweets_dict.data', 'rb') as filehandle:
#     all_tweets_data = pickle.load(filehandle)

# master_dict = {}

# for market in all_tweets_data:
#     followers = all_tweets_data[market]
#     master_dict[market] = {}

#     for follower in tqdm(followers):
#         tweets = all_tweets_data[market][follower] # list of tweet_.json
#         master_dict[market][follower] = {}
#         master_dict[market][follower]['hashtags'] = []
#         master_dict[market][follower]['fulltext'] = []
#         for tweet in tweets:
#             hashtags = get_hashtag_list(tweet)
#             words = get_tweet_words_list(tweet)
            
#             master_dict[market][follower]['hashtags'].extend(hashtags)
#             master_dict[market][follower]['fulltext'].extend(words)

# with open('./data/master_dict.data', 'wb') as filehandle:
#     pickle.dump(master_dict, filehandle, protocol=pickle.HIGHEST_PROTOCOL)

## Computing the Latent Dirichlet Allocation

Now we apply the LDA algorithm to identify themes (topics) in the documents. In my case, a single document corresponds to the set of all words in a single user's tweets. Note that the list of words that comprises a document has already been cleaned, tokenized, and lemmatized.

One other thought is to build one massive document containing all tweets of all users and find the topics there. In the comparison step, I could then compare these top topics against the tweets of each individual user and return the top-k users by similarity. See [this](https://stats.stackexchange.com/questions/269031/how-to-find-similar-documents-after-a-latent-dirichlet-allocation-model-is-bui) Stack Exchange post for ideas.
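A rough sketch of that top-k comparison, assuming each user has already been reduced to a dense topic-probability vector. Cosine similarity is my choice for the sketch and is not fixed by anything in this notebook (measures designed for probability distributions, such as Hellinger distance, are another option). The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_users(target, user_vectors, k=2):
    """Return the k (user, similarity) pairs most similar to the target vector."""
    scored = [(user, cosine_similarity(target, vec)) for user, vec in user_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical dense topic distributions over three topics
user_vectors = {
    "alice": [0.9, 0.1, 0.0],
    "bob": [0.1, 0.8, 0.1],
    "carol": [0.7, 0.2, 0.1],
}
target = [1.0, 0.0, 0.0]  # a profile concentrated entirely on topic 0
ranking = top_k_users(target, user_vectors, k=2)
print(ranking)
```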

We first write some helper functions

In [4]:
with open('./data/master_dict.data', 'rb') as filehandle:
    master_dict = pickle.load(filehandle)

In [5]:
def get_docs(d, market):
    """
    Accepts a market and returns the documents for that market as a list of lists.
    Each outer entry is a follower in the market city, and the inner list is the cleaned,
    tokenized, depunkt, lemmatized set of words for that follower.
    """
    docs = []
    for user in d[market]:
        text_list = d[market][user]['fulltext']
        docs.append(text_list)
    return docs

In [6]:
markets = list(master_dict.keys())
market_index = 1
docs = get_docs(master_dict, markets[market_index])

# use the commercial_Q filter to see if a document needs to be included. 
# For now, I am using a for loop to keep track of the count of how many documents are
# rejected, but can probably change to list comprehension

docs_filtered = []
for doc in docs:
    f = feature_vector_commercial_model(doc)
    if commercial_Q(f) == 0: # 0 means it is not a commercial doc
        docs_filtered.append(doc)
print('Total documents:', len(docs), ', Documents accepted:', len(docs_filtered))
id2word = corpora.Dictionary(docs_filtered)

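# Idea: keep only those tokens that appear in at least 10% of the documents.
# Note that filter_extremes also applies its defaults no_above=0.5 and keep_n=100000,
# so tokens that appear in more than half of the documents are dropped as well.
id2word.filter_extremes(no_below=int(0.1*len(docs_filtered)))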
corpus = [id2word.doc2bow(doc) for doc in docs_filtered]
print('Length of corpus:', len(corpus))

Total documents:419, Documents accepted:246
Length of corpus:246
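To make the `no_below` threshold concrete, here is a pure-Python stand-in for just that part of the filter. It is a simplification: gensim's `filter_extremes` also applies `no_above` and `keep_n`, which this sketch ignores, and the function name and toy documents are hypothetical. With 246 accepted documents, `int(0.1 * 246)` gives a threshold of 24 documents.

```python
from collections import Counter


def filter_by_doc_frequency(docs, no_below):
    """Keep only tokens that occur in at least `no_below` documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    keep = {token for token, n_docs in doc_freq.items() if n_docs >= no_below}
    return [[token for token in doc if token in keep] for doc in docs]


toy_docs = [
    ["coach", "player", "game"],
    ["player", "wine"],
    ["player", "game", "farm"],
    ["coach", "game"],
]
filtered = filter_by_doc_frequency(toy_docs, no_below=2)
print(filtered)
```

Here `wine` and `farm` each appear in a single document, so they are the only tokens removed.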


In [7]:
def compute_lda(corpus, id2word, k=10, alpha=0.001):
    """
    Fits an LDA model and returns it.
    Input: corpus, dictionary, and the hyperparameters to tune
           (number of topics k, document-topic prior alpha)
    Output: the fitted LDA model
    """
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, 
                                                id2word=id2word,
                                                num_topics=k,
                                                random_state=100,
                                                update_every=1,
                                                chunksize=5,
                                                passes=100,
                                                alpha=alpha,  # pass alpha='auto' to learn it from the data
                                                iterations=100,
                                                per_word_topics=True)
    return lda_model
t1 = time.time()
lda_model = compute_lda(corpus, id2word)
t2 = time.time()
print('time:', t2-t1)
# save the model
filename_model = './ldamodels/market' + str(market_index) + '/model.model'
lda_model.save(filename_model)
# save the corpus
filename_corpus = './ldamodels/market' + str(market_index) + '/corpus.corpus'
with open(filename_corpus, 'wb') as filehandle:
    pickle.dump(corpus, filehandle, protocol=pickle.HIGHEST_PROTOCOL)
pprint(lda_model.print_topics())

time:83.89632606506348
[(0,
'0.000*"bullet" + 0.000*"bowl" + 0.000*"cartoon" + 0.000*"captain" + '
'0.000*"bush" + 0.000*"ceremony" + 0.000*"buddy" + 0.000*"blanket" + '
'0.000*"blessing" + 0.000*"chris"'),
(1,
'0.029*"entrepreneur" + 0.011*"teacher" + 0.010*"tech" + 0.010*"success" + '
'0.009*"register" + 0.008*"customer" + 0.007*"expert" + 0.007*"training" + '
'0.007*"technology" + 0.007*"client"'),
(2,
'0.026*"impeachment" + 0.018*"trial" + 0.015*"republican" + 0.014*"iran" + '
'0.013*"democrats" + 0.012*"republicans" + 0.012*"ukraine" + 0.010*"russia" '
'+ 0.009*"impeach" + 0.008*"investigation"'),
(3,
'0.021*"farmer" + 0.013*"wine" + 0.010*"spice" + 0.010*"fresh" + '
'0.009*"taste" + 0.009*"beach" + 0.009*"cake" + 0.009*"square" + '
'0.008*"farm" + 0.008*"chicken"'),
(4,
'0.097*"player" + 0.095*"blazer" + 0.083*"trailblazer" + 0.075*"coach" + '
'0.0
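The printed topic strings can be turned into the topic-to-words dictionary described at the top of this notebook. The sketch below parses the string for topic 1 exactly as printed above; the helper name `parse_topic` is hypothetical. With the fitted model at hand, `lda_model.show_topic(topic_id)` returns the (word, probability) pairs directly, which avoids parsing altogether.

```python
import re

# Topic 1 as printed above
topic_1 = ('0.029*"entrepreneur" + 0.011*"teacher" + 0.010*"tech" + 0.010*"success" + '
           '0.009*"register" + 0.008*"customer" + 0.007*"expert" + 0.007*"training" + '
           '0.007*"technology" + 0.007*"client"')


def parse_topic(topic_string):
    """Parse 'weight*"word" + ...' into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


# Dictionary with topics as keys and their words as values
topic_words = {1: [word for word, _ in parse_topic(topic_1)]}
print(topic_words)
```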

### Some comments on the hyperparameter tuning:

1. A `chunksize` of 1 is slow (although it would be worth timing this more carefully). Either `chunksize=5` or `chunksize=10` seems to work well.
2. `passes` is the number of full sweeps over the corpus, similar to the number of epochs.
3. `alpha='auto'` seems to work pretty well.
4. Keep `random_state=100` so that results can be reproduced.
5. Keep `update_every` small, ideally equal to 1.
6. The optimal number of topics has to be found by looping over candidate values; there is no way around this. Since the loop is slow, it might be better to run it in the background from a Python terminal.
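A back-of-the-envelope count of how many model updates each `chunksize` implies, which is one plausible reason a chunk size of 1 feels slow. This assumes single-core online training in which the model is updated once every `update_every` chunks; it counts updates only and says nothing about the cost of each one. The helper name is hypothetical, and the 246 documents and 100 passes are the values used above.

```python
import math


def updates_per_run(num_docs, chunksize, passes, update_every=1):
    """Approximate number of model updates over a full training run."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    updates_per_pass = math.ceil(chunks_per_pass / update_every)
    return passes * updates_per_pass


for chunksize in (1, 5, 10):
    print(chunksize, updates_per_run(num_docs=246, chunksize=chunksize, passes=100))
```

Under these assumptions, going from `chunksize=1` to `chunksize=5` cuts the number of updates by roughly a factor of five.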

## Calculating the optimal number of topics

In [9]:
def optimal_topics():
    coherence_scores = []
    for k in tqdm(range(5, 20)):
        lda_model = compute_lda(corpus, id2word, k=k)
        coherence_model_lda = CoherenceModel(model=lda_model,
                                        texts=docs_filtered,
                                        dictionary=id2word,
                                        coherence='c_v')
        coherence_lda = coherence_model_lda.get_coherence()
        print((k, coherence_lda))
        coherence_scores.append((k, coherence_lda))
    return coherence_scores
    
coherence_scores = optimal_topics()
# coherence_scores is a list of (k, coherence) pairs, so unpack it before plotting
ks, scores = zip(*coherence_scores)
plt.plot(ks, scores)
plt.xlabel('number of topics k')
plt.ylabel('c_v coherence')
plt.show()


  0%|          | 0/15 [00:00<?, ?it/s](5, 0.5189726908534633)

  7%|▋         | 1/15 [01:20<18:41, 80.13s/it](6, 0.519356341248931)

 13%|█▎        | 2/15 [03:00<18:41, 86.30s/it](7, 0.47789018284809354)

 20%|██        | 3/15 [04:48<18:31, 92.60s/it](8, 0.469150491505505)

 27%|██▋       | 4/15 [06:41<18:05, 98.72s/it](9, 0.48546409778958455)

 33%|███▎      | 5/15 [08:35<17:13, 103.40s/it](10, 0.4606212152153395)

 40%|████      | 6/15 [10:21<15:38, 104.23s/it](11, 0.4329650553357137)

 47%|████▋     | 7/15 [12:06<13:56, 104.55s/it](12, 0.43535675757035736)

 53%|█████▎    | 8/15 [13:41<11:51, 101.62s/it](13, 0.4217650294141849)

 60%|██████    | 9/15 [15:24<10:11, 101.89s/it](14, 0.4541297801261875)

 67%|██████▋   | 10/15 [17:14<08:42, 104.42s/it](15, 0.4334219414278646)

 73%|███████▎  | 11/15 [19:12<07:13, 108.42s/it](16, 0.4723659369842674)

 80%|████████  | 12/15 [48:11<29:52, 597.64s/it](17, 0.4893456672625304)

 87%|████████▋ | 13/15 [50:04<15:04, 452.33s/it](18, 0.474282379
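Once the scores are in, picking the number of topics is a matter of taking the maximum. The sketch below uses the `(k, coherence)` pairs copied from the printed output above; the value for k=18 is cut off in the output and k=19 does not appear, so both are left out.

```python
# (k, c_v coherence) pairs copied from the output above (k=18 and k=19 omitted)
coherence_scores = [
    (5, 0.5189726908534633),
    (6, 0.519356341248931),
    (7, 0.47789018284809354),
    (8, 0.469150491505505),
    (9, 0.48546409778958455),
    (10, 0.4606212152153395),
    (11, 0.4329650553357137),
    (12, 0.43535675757035736),
    (13, 0.4217650294141849),
    (14, 0.4541297801261875),
    (15, 0.4334219414278646),
    (16, 0.4723659369842674),
    (17, 0.4893456672625304),
]

# Pick the k with the highest coherence
best_k, best_score = max(coherence_scores, key=lambda pair: pair[1])
print(best_k, round(best_score, 3))
```

The scores for k=5 and k=6 are nearly tied, and everything here comes from a single `random_state`, so I would treat the exact winner with some caution.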

In [75]:
t1 = time.time()
LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary=lda_model.id2word, mds='tsne')
t2 = time.time()
print('LDAvis prep time:', t2-t1)
# pyLDAvis.show() starts a blocking local web server; display() renders inline in the notebook
pyLDAvis.display(LDAvis_prepared)
