# Latent Dirichlet Allocation

This is the main chunk of the code.

The eventual goal is to treat each user's hashtag list as document 1 and the cleaned full-text words as document 2, so each user has two documents. I then run topic modeling on each document to find, for each user, a list of topics and the words that lie within each topic. The result is, for each user, a dictionary with topics as keys and the words associated with each topic as values. After that, I hope to build some sort of visualization to extract the most relevant topics, i.e. the ones that contain the words I am interested in.

In [1]:
# General imports
import json
import glob
import pickle
import collections
import random
from tqdm import tqdm
import config
import os
dirpath = os.path.dirname(os.path.realpath('__file__'))

# NLP imports
import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'http', 'https'])
import re
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
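# To check later if a word is in English or not.
# Read in text mode (assuming UTF-8, one word per line) and keep a set:
# bytes lines with trailing newlines never compare equal to the str tokens,
# and membership tests on a set are much faster than on a list.
with open('./english-words.txt', 'r', encoding='utf-8') as filehandle:
    english_words = {line.strip().lower() for line in filehandle}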

# Visualization imports
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt

# Other imports
import pandas as pd
import numpy as np
import tweepy

## Creating the cleaned and simplified tweet dictionary

Here, we first load the dictionaries that were dumped as pickle files and then run a series of text processing and cleaning tasks. I initially start with a dictionary of the form:

```
{
    user1: [{tweet1_json}, {tweet2_json}, ..., {tweetn_json}]
    .
    .
    .
    usern: [{tweet1_json}, {tweet2_json}, ..., {tweetn_json}]
}
```

This section of the code then processes it into a dictionary of the form:

```
{
    user1: 
        {
            hashtags: [list of hashtags from each tweet], 
            tweet_words: [list of all cleaned/depunkt words across all tweets]
        },
    .
    .
    .
    usern: 
        {
            hashtags: [list of hashtags from each tweet], 
            tweet_words: [list of all cleaned/depunkt words across all tweets]
        }
}
```

Then I can turn this into a pandas dataframe and do some pretty nice data manipulation.

We will call this dictionary the `master_dict`.
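As a minimal sketch of the DataFrame step, assuming a dictionary with the shape described above: the user names and words below are made-up placeholders, and `toy_master_dict` is a hypothetical stand-in for the real `master_dict`.

```python
import pandas as pd

# Hypothetical toy data with the same shape as master_dict
toy_master_dict = {
    "user_a": {"hashtags": ["climate", "solar"],
               "tweet_words": ["energy", "policy", "climate"]},
    "user_b": {"hashtags": ["football"],
               "tweet_words": ["goal", "match"]},
}

# One row per user, one column per inner key; each cell holds a list
df = pd.DataFrame.from_dict(toy_master_dict, orient="index")

# Example manipulation: count hashtags and words per user
df["n_hashtags"] = df["hashtags"].apply(len)
df["n_words"] = df["tweet_words"].apply(len)
```

Keeping the lists inside the cells is convenient for quick per-user summaries like these; whether that layout suits the later modeling steps is a judgment call.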

To build `master_dict`, we first define some helper functions:

In [2]:
def get_user(tweet):
    """
    input: tweet dictionary
    returns: the user's screen name
    """
    return tweet['user']['screen_name']


def get_hashtag_list(tweet):
    """
    input: tweet dictionary
    returns: list of all hashtags in both the direct tweet and the
    retweet 
    """

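    # Hashtags attached directly to the tweet
    hashtags = [d['text'] for d in tweet['entities']['hashtags']]

    # Hashtags of the original tweet, if this one is a retweet
    if 'retweeted_status' in tweet:
        hashtags += [d['text'] for d in tweet['retweeted_status']['entities']['hashtags']]
    return hashtags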


def tokenizer_cleaner_nostop_lemmatizer(text):
    """
    This function tokenizes the text of a tweet, cleans it of punctuation,
    removes stop words, and lemmatizes the words (i.e. finds word roots to remove noise).
    I am largely using the gensim and spacy packages.

    Input: Some text
    Output: List of tokenized, cleaned, lemmatized words
    """

    tokenized_depunkt = gensim.utils.simple_preprocess(text, min_len=4, deacc=True)
    tokenized_depunkt_nostop = ([word for word in tokenized_depunkt 
                                 if word not in stop_words and word in english_words])
    
    # Lemmatizer while also only allowing certain parts of speech.
    # See here: https://spacy.io/api/annotation
    allowed_pos = ['ADJ', 'ADV', 'NOUN', 'PROPN','VERB']
    doc = nlp(' '.join(tokenized_depunkt_nostop))
    words_final = [token.lemma_ for token in doc if token.pos_ in allowed_pos]
    return words_final

    
def get_tweet_words_list(tweet):
    """
    This function cleans the text of a tweet and, if the tweet is a retweet,
    also cleans and appends the text of the retweeted status
    input: tweet
    output: list of tokenized, cleaned, lemmatized words
    """

    text = tweet['full_text']
    clean_words = tokenizer_cleaner_nostop_lemmatizer(text)
    
    if 'retweeted_status' in tweet.keys():
        retweet_text = tweet['retweeted_status']['full_text']
        retweet_clean_words = tokenizer_cleaner_nostop_lemmatizer(retweet_text)
        clean_words += retweet_clean_words
    return clean_words

In [3]:
with open('./all_tweets_dict.data', 'rb') as filehandle:
    all_tweets_data = pickle.load(filehandle)

master_dict = collections.defaultdict(dict)
users = list(all_tweets_data.keys())[:200]
for user in tqdm(users):
    user_tweets = all_tweets_data[user]
    for tweet in user_tweets:  # tweet is a JSON object (dict)
        # Extend the lists in place instead of rebuilding them for every tweet
        master_dict[user].setdefault('hashtags', []).extend(get_hashtag_list(tweet))
        master_dict[user].setdefault('tweet_words', []).extend(get_tweet_words_list(tweet))

# TODO: go back and do this later for all 1000 documents
# (only the first 200 users are processed above)
with open('./master_dict.data', 'wb') as filehandle:
    pickle.dump(dict(master_dict), filehandle)

 52%|█████▎    | 105/200 [3:09:22<1:23:57, 53.03s/it]

## Latent Dirichlet Allocation

Now we apply the LDA algorithm to identify themes (topics) in the documents. In my case, a single document corresponds to the set of all words in a single user's tweets. Note that the list of words that comprise a document has already been cleaned, tokenized, and lemmatized.

Another thought is to build one massive document containing all tweets of all users and find the topics there. In the comparison step, I could then compare these top topics against the tweets of individual users and return the top-k users based on similarity. See [this](https://stats.stackexchange.com/questions/269031/how-to-find-similar-documents-after-a-latent-dirichlet-allocation-model-is-bui) Stack Exchange post for ideas.
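One possible way to sketch the comparison step, assuming each user and the reference document have already been reduced to a topic-probability vector. The Hellinger distance is my choice here as a common measure for comparing probability distributions, not something this notebook has settled on, and the vectors and the `top_k_users` helper are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


def top_k_users(query, user_topics, k):
    """Return the k users whose topic mixture is closest to the query."""
    ranked = sorted(user_topics,
                    key=lambda user: hellinger(query, user_topics[user]))
    return ranked[:k]


# Hypothetical topic mixtures over two topics
user_topics = {
    "user_a": [0.9, 0.1],
    "user_b": [0.1, 0.9],
    "user_c": [0.8, 0.2],
}
query = [1.0, 0.0]  # reference mixture concentrated on topic 0

closest = top_k_users(query, user_topics, k=2)
```

The distance is 0 for identical distributions and 1 for distributions with no overlap, so smaller values mean more similar users.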