# Latent Dirichlet Allocation

This is the main chunk of the code.

The eventual goal is to treat each user's hashtag list as document 1 and the cleaned full-text words as document 2, so each user has two documents. I then run topic modeling on each document and, for each user, find a list of topics and the words that fall within each topic. This gives, for each user, a dictionary with topics as keys and the words associated with each topic as values. From there, I hope to build a visualization that surfaces the most relevant topics containing the words I am interested in.

## Imports

In [5]:
# General imports
import json
import glob
import pickle
import collections
import random
from tqdm import tqdm
import config
import time
import os
dirpath = os.path.dirname(os.path.realpath('__file__'))
from pprint import pprint

# import logging
# logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
# logging.root.level = logging.INFO

# NLP imports
import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['https', 'http', 'shit', 'shitting',
                    'london', 'para', 'fuck', 'fucking', 'bitch'])
import re
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# To check later whether a word is in English or not
with open('./words_dictionary.json') as filehandle:
    words_dictionary = json.load(filehandle)
english_words = words_dictionary.keys()

# Visualization imports
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
import matplotlib.pyplot as plt

# Other imports
import pandas as pd
import numpy as np
import tweepy

## Creating the cleaned and simplified tweet dictionary

### Note on the format of input and output dictionaries

Here, we first load the dictionaries that were dumped as pickle files and then run a series of text processing and cleaning tasks. I start with a dictionary of the form:

```
{
    market_1: {
                screen_name_1: [{tweet1, ..., tweetn}],
                .
                .
                .
                screen_name_m: [{tweet1, ..., tweetn}]
             }
    .
    .
    .
    market_k: {
                screen_name_1: [{tweet1, ..., tweetn}],
                .
                .
                .
                screen_name_m: [{tweet1, ..., tweetn}]
             }
}
```

This section of the code then processes it into a dictionary of the form:

```
{
    market_1: {
                screen_name_1: 
                    {
                        hashtags: [list of hashtags from each tweet], 
                        fulltext: [list of all cleaned/depunkt words across all tweets]
                    },
                .
                .
                screen_name_m: 
                    {
                        hashtags: [list of hashtags from each tweet], 
                        fulltext: [list of all cleaned/depunkt words across all tweets]
                    }
              }
    .
    .
    .
    market_k: {
                screen_name_1: 
                    {
                        hashtags: [list of hashtags from each tweet], 
                        fulltext: [list of all cleaned/depunkt words across all tweets]
                    },
                .
                .
                screen_name_m: 
                    {
                        hashtags: [list of hashtags from each tweet], 
                        fulltext: [list of all cleaned/depunkt words across all tweets]
                    }
              }
}
```

Then I can turn this into a pandas dataframe and do some pretty nice data manipulation.

We will call this dictionary the `master_dict`.
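
As a rough sketch of how the nested structure above could be flattened before handing it to pandas, the snippet below turns a toy stand-in for `master_dict` into one record per user. The market names, user names, and the `n_words` field are made up for illustration, and `master_dict_to_records` is a hypothetical helper, not part of the notebook.

```python
# Toy stand-in for master_dict; market and user names are made up.
toy_master_dict = {
    "market_1": {
        "user_a": {"hashtags": ["organic", "soil"],
                   "fulltext": ["farm", "garden", "soil"]},
        "user_b": {"hashtags": [],
                   "fulltext": ["climate", "ocean"]},
    },
    "market_2": {
        "user_c": {"hashtags": ["tech"],
                   "fulltext": ["technology", "education"]},
    },
}


def master_dict_to_records(master_dict):
    """Flatten {market: {user: {...}}} into one flat record per user."""
    records = []
    for market, users in master_dict.items():
        for screen_name, user_docs in users.items():
            records.append({
                "market": market,
                "screen_name": screen_name,
                "hashtags": user_docs["hashtags"],
                "fulltext": user_docs["fulltext"],
                # Hypothetical convenience column: document length in words
                "n_words": len(user_docs["fulltext"]),
            })
    return records


records = master_dict_to_records(toy_master_dict)
for row in records:
    print(row["market"], row["screen_name"], row["n_words"])
```

Passing `records` to `pd.DataFrame(records)` should then give one row per user, with `market` and `screen_name` available for grouping and filtering.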

To build `master_dict`, we first define some helper functions.

### Defining some utility functions

In [7]:
def get_user(tweet):
    """
    input: tweet dictionary
    returns: return the username
    """
    return tweet['user']['screen_name']


def get_hashtag_list(tweet):
    """
    input: tweet dictionary
    returns: list of all hashtags in both the direct tweet and the
    retweet
    """

    hashtags = [d['text'] for d in tweet['entities']['hashtags']]

    if 'retweeted_status' in tweet:
        retweet_tags = tweet['retweeted_status']['entities']['hashtags']
        hashtags += [d['text'] for d in retweet_tags]
    return hashtags


def tokenizer_cleaner_nostop_lemmatizer(text):
    """
    This function tokenizes the text of a tweet, strips punctuation,
    removes stop words, and lemmatizes the words (i.e. reduces them to their roots to remove noise).
    It relies mostly on the gensim and spacy packages.

    Input: Some text
    Output: List of tokenized, cleaned, lemmatized words
    """

    tokenized_depunkt = gensim.utils.simple_preprocess(text, min_len=4, deacc=True)
    tokenized_depunkt_nostop = ([word for word in tokenized_depunkt 
                                 if (word not in stop_words and word in english_words)])
    
    # Lemmatizer while also only allowing certain parts of speech.
    # See here: https://spacy.io/api/annotation
    allowed_pos = ['ADJ', 'ADV', 'NOUN', 'PROPN','VERB']
    doc = nlp(' '.join(tokenized_depunkt_nostop))
    words_final = [token.lemma_ for token in doc if token.pos_ in allowed_pos]
    return words_final

    
def get_tweet_words_list(tweet):
    """
    This function takes in a tweet and checks if there is a retweet associated with it
    input: tweet
    output: list of tokenized words without punctuation
    """

    text = tweet['full_text']
    clean_words = tokenizer_cleaner_nostop_lemmatizer(text)
    
    if 'retweeted_status' in tweet.keys():
        retweet_text = tweet['retweeted_status']['full_text']
        retweet_clean_words = tokenizer_cleaner_nostop_lemmatizer(retweet_text)
        clean_words += retweet_clean_words
    return clean_words
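
To make the expected input concrete, here is a small sketch that runs the hashtag helper on a hypothetical, heavily trimmed tweet dictionary. The tweet content is invented, and the helper is restated in compact form only so the block runs on its own. Note that a tag appearing in both the tweet and the retweeted status is kept twice.

```python
def get_hashtag_list(tweet):
    """Collect hashtags from a tweet and, if present, the tweet it retweets."""
    tags = [d["text"] for d in tweet["entities"]["hashtags"]]
    if "retweeted_status" in tweet:
        retweet_tags = tweet["retweeted_status"]["entities"]["hashtags"]
        tags += [d["text"] for d in retweet_tags]
    return tags


# Hypothetical tweet; real API payloads carry many more fields.
toy_tweet = {
    "user": {"screen_name": "example_user"},
    "full_text": "RT @someone: Composting tips #organic #soil",
    "entities": {"hashtags": [{"text": "organic"}]},
    "retweeted_status": {
        "full_text": "Composting tips #organic #soil",
        "entities": {"hashtags": [{"text": "organic"}, {"text": "soil"}]},
    },
}

tags = get_hashtag_list(toy_tweet)
print(tags)  # ['organic', 'organic', 'soil']
```

If repeated tags should count only once per tweet, wrapping the result in `set(...)` before extending the per-user list would be one option.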

## Creating the `master_dict`

In [8]:
# with open('./data/all_tweets_dict.data', 'rb') as filehandle:
#     all_tweets_data = pickle.load(filehandle)

# master_dict = {}

# for market in all_tweets_data:
#     followers = all_tweets_data[market]
#     master_dict[market] = {}

#     for follower in tqdm(followers):
#         tweets = all_tweets_data[market][follower] # list of tweet_.json
#         master_dict[market][follower] = {}
#         master_dict[market][follower]['hashtags'] = []
#         master_dict[market][follower]['fulltext'] = []
#         for tweet in tweets:
#             hashtags = get_hashtag_list(tweet)
#             words = get_tweet_words_list(tweet)
            
#             master_dict[market][follower]['hashtags'].extend(hashtags)
#             master_dict[market][follower]['fulltext'].extend(words)

# with open('./data/master_dict.data', 'wb') as filehandle:
#     pickle.dump(master_dict, filehandle, protocol=pickle.HIGHEST_PROTOCOL)

## Computing the Latent Dirichlet Allocation

Now we apply the LDA algorithm to identify themes (topics) in the documents. In my case, a single document corresponds to the set of all words in a single user's tweets. Note that the list of words that comprises a document has already been cleaned, tokenized, and lemmatized.

One other thought is to have one MASSIVE document containing all tweets of all users, and then find the topics there. In the comparison step, I could take these top topics, compare them to the tweets of each individual user, and return the top-k users based on similarity. See [this](https://stats.stackexchange.com/questions/269031/how-to-find-similar-documents-after-a-latent-dirichlet-allocation-model-is-bui) stack-exchange post for ideas.
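
A minimal sketch of that comparison step, assuming each user has already been reduced to a topic-probability vector. The vectors below are made up, cosine similarity is just one possible choice of measure, and `top_k_users` is a hypothetical helper.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def top_k_users(reference, user_topics, k=2):
    """Rank users by similarity of their topic vector to a reference vector."""
    scored = [(user, cosine_similarity(reference, dist))
              for user, dist in user_topics.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up topic distributions over 3 topics
reference = [0.7, 0.2, 0.1]
user_topics = {
    "user_a": [0.6, 0.3, 0.1],
    "user_b": [0.1, 0.1, 0.8],
    "user_c": [0.7, 0.2, 0.1],
}

ranked = top_k_users(reference, user_topics, k=2)
for user, similarity in ranked:
    print(user, round(similarity, 3))
```

With a fitted gensim model, the per-user vectors could plausibly come from `lda_model.get_document_topics(bow, minimum_probability=0)`, which returns `(topic_id, probability)` pairs for a bag-of-words document.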

We first write some helper functions:

In [9]:
with open('./data/master_dict.data', 'rb') as filehandle:
    master_dict = pickle.load(filehandle)

In [10]:
def get_docs(d, market):
    """
    Accepts a market and returns the documents for that market. The documents
    form a list of word lists, one per user in the market city, i.e. a list of lists.
    Each outer element is a follower and each inner list is the cleaned, tokenized, depunkt,
    lemmatized set of words for that follower.
    """
    docs = []
    for user in d[market]:
        text_list = d[market][user]['fulltext']
        docs.append(text_list)
    return docs

In [55]:
markets = list(master_dict.keys())
market_index = 2
docs = get_docs(master_dict, markets[market_index])
id2word = corpora.Dictionary(docs)

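# Idea: keep only those tokens that appear in at least 10% of the documents.
# Note: filter_extremes also applies its defaults (no_above=0.5, keep_n=100000),
# so tokens that appear in more than half of the documents are dropped as well.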
id2word.filter_extremes(no_below=int(0.1*len(docs)))
corpus = [id2word.doc2bow(doc) for doc in docs]
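
The document-frequency filtering idea can be illustrated without gensim. The sketch below is a simplified pure-Python stand-in, not gensim's implementation: `filter_by_document_frequency` is a hypothetical helper and the toy documents are invented. It keeps tokens whose document frequency lies between a lower count and an upper fraction of the corpus.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above=0.5):
    """Return the set of tokens found in at least `no_below` documents
    and in at most `no_above` (a fraction) of all documents."""
    n_docs = len(docs)
    # Count each token once per document it appears in
    doc_freq = Counter(token for doc in docs for token in set(doc))
    upper = int(no_above * n_docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= upper}


toy_docs = [
    ["farm", "soil", "climate"],
    ["farm", "garden"],
    ["farm", "climate", "ocean"],
    ["farm", "tech"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(kept)  # {'climate'}
```

Here "farm" is dropped for being in every document and the one-off words are dropped for being too rare. One thing to watch with a lower bound like `int(0.1*len(docs))`: for fewer than 10 documents it evaluates to 0, so no rare tokens would be removed.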

In [56]:
def compute_lda(corpus, id2word, k=10, alpha=0.01):
    """
    Fits an LDA model and returns it.
    Input: corpus, dictionary, and the hyperparameters to tune
           (number of topics k and document-topic prior alpha)
    Output: the fitted LDA model
    """
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, 
                                                id2word=id2word,
                                                num_topics=k,
                                                random_state=100,
                                                # update_every=1,
                                                chunksize=5,
                                                passes=100,
                                                alpha=alpha,
                                                iterations=100,
                                                per_word_topics=True)
    return lda_model
t1 = time.time()
lda_model = compute_lda(corpus, id2word)
t2 = time.time()
print('time:', t2-t1)
# save the model (create the output directory first if it does not exist)
model_dir = './ldamodels/market' + str(market_index)
os.makedirs(model_dir, exist_ok=True)
filename_model = model_dir + '/model.model'
lda_model.save(filename_model)
# save the corpus
filename_corpus = model_dir + '/corpus.corpus'
with open(filename_corpus, 'wb') as filehandle:
    pickle.dump(corpus, filehandle, protocol=pickle.HIGHEST_PROTOCOL)
pprint(lda_model.print_topics())

time:119.46721744537354
[(0,
'0.108*"climate" + 0.050*"sustainable" + 0.044*"plastic" + 0.042*"para" + '
'0.039*"planet" + 0.022*"fuel" + 0.022*"waste" + 0.021*"global" + '
'0.019*"ocean" + 0.018*"fossil"'),
(1,
'0.007*"literally" + 0.007*"pain" + 0.006*"shit" + 0.006*"fuck" + '
'0.005*"brain" + 0.005*"burn" + 0.005*"doctor" + 0.005*"sleep" + '
'0.005*"character" + 0.005*"wake"'),
(2,
'0.009*"senate" + 0.009*"donald" + 0.009*"election" + 0.009*"elizabeth" + '
'0.008*"americans" + 0.008*"attack" + 0.008*"impeachment" + 0.008*"congress" '
'+ 0.008*"senator" + 0.008*"justice"'),
(3,
'0.012*"tech" + 0.008*"technology" + 0.008*"patient" + 0.007*"education" + '
'0.006*"hospital" + 0.005*"insurance" + 0.005*"marketing" + 0.005*"vision" + '
'0.005*"apply" + 0.005*"college"'),
(4,
'0.091*"farm" + 0.086*"farmer" + 0.085*"garden" + 0.049*"organic" + '
'0.039*"soil"

### Some comments on the hyperparameter tuning:

1. A `chunksize` of 1 is slow (although it might be worthwhile to time this more accurately). Either `chunksize=5` or `chunksize=10` seems to work well.
2. `passes` is similar to the number of epochs: it is the number of full passes over the corpus during training.
3. `alpha='auto'` seems to work pretty well.
4. Keep `random_state=100` so that results can be reproduced.
5. Keep `update_every` small, ideally equal to 1.
6. Use a for loop over candidate topic counts to find the optimal number of topics. This just has to be done. It might be better to run it in the background from the Python terminal.
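
The loop over topic counts mentioned in the last item could look roughly like the sketch below. It is deliberately library-agnostic: `sweep_num_topics` is a hypothetical helper that takes a fitting function and a scoring function, and the stand-in `fake_fit` and `fake_score` exist only so the example runs without training anything.

```python
def sweep_num_topics(fit, score, candidate_ks):
    """Fit one model per candidate topic count and return the best k
    together with all (k, score) pairs."""
    results = []
    for k in candidate_ks:
        model = fit(k)
        results.append((k, score(model)))
    best_k, _ = max(results, key=lambda pair: pair[1])
    return best_k, results


# Stand-ins for illustration only; no real model is trained here.
def fake_fit(k):
    return {"num_topics": k}


def fake_score(model):
    # Made-up score curve that peaks at 8 topics
    return 0.6 - 0.01 * abs(model["num_topics"] - 8)


best_k, results = sweep_num_topics(fake_fit, fake_score, [4, 6, 8, 10, 12])
print("best k:", best_k)
for k, value in results:
    print(k, round(value, 3))
```

With this notebook's objects, `fit` could be something like `lambda k: compute_lda(corpus, id2word, k=k)` and `score` could wrap `CoherenceModel(model=m, texts=docs, dictionary=id2word, coherence='c_v').get_coherence()`. Since each fit took around two minutes here, running the sweep outside the notebook seems sensible.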

## Calculating the Coherence score

In [57]:
coherence_model_lda = CoherenceModel(model=lda_model,
                                     texts=docs,
                                     dictionary=id2word,
                                     coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

print('Coherence Score:', coherence_lda)


Coherence Score:0.5412639729797816


In [51]:
t1 = time.time()
LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary=lda_model.id2word, mds='tsne')
t2 = time.time()
print('LDAvis prep time:', t2-t1)
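# pyLDAvis.show() starts a local web server and blocks the kernel until interrupted;
# inside a notebook, pyLDAvis.display(LDAvis_prepared) renders the figure inline instead.
pyLDAvis.show(LDAvis_prepared)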

LDAvis prep time:63.65331959724426

Note: if you're in the IPython notebook, pyLDAvis.show() is not the best command
      to use. Consider using pyLDAvis.display(), or pyLDAvis.enable_notebook().
      See more information at http://pyLDAvis.github.io/quickstart.html .

You must interrupt the kernel to end this command

