### This is a quick and dirty project using Latent Dirichlet Allocation (LDA) to explore Donald Trump's tweets.
Project objectives include:
- observe topics in Donald Trump's tweets
- identify differences (if any) between the tweets
- see if LDA models can effectively categorize tweets based on their topics

### 1. import, select, preview tweets data

In [4]:
import pandas as pd

# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
data = pd.read_csv('DT.csv', on_bad_lines='skip')
# .copy() makes an independent DataFrame, which avoids pandas' SettingWithCopyWarning
texts = data[['Tweet_Text']].copy()
texts['index'] = texts.index
documents = texts
print(len(documents))
print(documents[:10])


### 2. pre-process tweets data to get clean corpora

In [28]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np

np.random.seed(1357)  # seed NumPy's global random generator
stemmer = SnowballStemmer("english")
nltk.download('wordnet')  # corpus required by WordNetLemmatizer

[nltk_data] Downloading package wordnet to /Users/Eddie/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

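def lemmatize_stemming(text):
    # lemmatize a single token as a verb, then stem it
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    # tokenize, drop stopwords and tokens of 3 characters or fewer,
    # then lemmatize and stem each remaining token
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result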

In [31]:
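def preprocess(x):
    # NOTE: this redefinition replaces the preprocess() defined above.
    # The lemmatizer and stemmer are applied to the whole tweet string, not to
    # each token, so individual words are left largely unstemmed
    # (compare the outputs below, e.g. 'served', 'making').
    output = []
    lemStem = stemmer.stem(WordNetLemmatizer().lemmatize(x, pos='v'))
    # tokenize and drop stopwords; no minimum token length is enforced here
    for wd in gensim.utils.simple_preprocess(lemStem):
        if wd not in gensim.parsing.preprocessing.STOPWORDS:
            output.append(wd)
    return output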


In [33]:
# pick an example and test how preprocess() function is doing

test_doc = documents[documents['index'] == 5000].values[0][0]
print('original document: ')
words = test_doc.split(' ')
print(words)
print('\n\n pre-processed document: ')
print(preprocess(test_doc))

original document: 
['"@nbcsnl:', 'Were', 'live', 'from', 'Studio', '8H', 'tonight.', '#SNL', 'https://t.co/Uiq9llzwEC"']


 pre-processed document: 
['nbcsnl', 'live', 'studio', 'tonight', 'snl', 'https', 'uiq', 'llzwec']
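
The pre-processed output above still contains URL fragments ('https', 'uiq', 'llzwec'). One possible refinement, which is not part of the original pipeline, is to strip links before tokenizing. The sketch below uses only the standard library. `strip_urls` and `simple_tokens` are hypothetical helper names, and the regex tokenizer is only a rough stand-in for `simple_preprocess`, not its actual implementation.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters
URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(text):
    """Replace every http(s) link with a space so its pieces never become tokens."""
    return URL_PATTERN.sub(" ", text)

def simple_tokens(text):
    """Lowercase the text and keep alphabetic runs of at least 2 characters."""
    return re.findall(r"[a-z]{2,}", strip_urls(text).lower())

tweet = '"@nbcsnl: Were live from Studio 8H tonight. #SNL https://t.co/Uiq9llzwEC"'
print(simple_tokens(tweet))
```

In the notebook's pipeline, the same idea would amount to calling a function like `strip_urls` on each tweet before it is passed to `preprocess()`.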


In [36]:
# dump all the tweets into the preprocess() function and store the results

processed_docs = documents['Tweet_Text'].map(preprocess)
processed_docs[:50]

0     [today, express, deepest, gratitude, served, a...
1     [busy, day, planned, new, york, soon, making, ...
2     [love, fact, small, groups, protesters, night,...
3     [open, successful, presidential, election, pro...
4     [fantastic, day, met, president, obama, time, ...
5     [happy, st, birthday, marine, corps, thank, se...
6     [beautiful, important, evening, forgotten, man...
7     [watching, returns, pm, electionnight, maga__,...
8     [rt, ivankatrump, surreal, moment, vote, fathe...
9     [rt, erictrump, join, family, incredible, move...
10    [rt, donaldjtrumpjr, final, push, eric, dozens...
11    [time, votetrump, ivoted, electionnight, https...
12    [dont, let, getting, vote, election, far, time...
13    [according, cnn, utah, officials, report, voti...
14    [watching, election, results, trump, tower, ma...
15    [electionday, https, mxraxyntjy, https, fzhoncih]
16    [need, vote, polls, lets, continue, movement, ...
17    [vote, today, https, mxraxyntjy, polling, 

In [38]:
# use gensim to create a token dictionary

dictionary = gensim.corpora.Dictionary(processed_docs)
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 20:
        break

0 armed
1 deepest
2 express
3 forces
4 gratitude
5 https
6 qwpk
7 served
8 thankavet
9 today
10 wpk
11 busy
12 day
13 decisions
14 government
15 important
16 making
17 new
18 people
19 planned
20 running
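
To make the id assignment concrete, here is a minimal pure-Python sketch of a token-to-id mapping. It assumes that new tokens are numbered in alphabetical order within each document, which is consistent with the ids printed above, but it is an illustration and not gensim's implementation. `build_token_ids` is a hypothetical name.

```python
def build_token_ids(docs):
    """Assign an integer id to each unique token, document by document."""
    token2id = {}
    for doc in docs:
        # New tokens of a document are numbered in alphabetical order
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

toy_docs = [["today", "express", "armed"], ["busy", "day", "today"]]
toy_ids = build_token_ids(toy_docs)
for token, token_id in sorted(toy_ids.items(), key=lambda kv: kv[1]):
    print(token_id, token)
```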


### 3. set thresholds to filter tokens

In [None]:
# filter extreme tokens: drop tokens that appear in fewer than 10 documents (tweets)
# or in more than 60% of all documents, then keep at most the 10,000 most frequent of the rest
dictionary.filter_extremes(no_below=10, no_above=0.6, keep_n=10000)
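
The effect of the three thresholds can be illustrated on a toy corpus. The following is a simplified sketch based on document frequency (the number of documents containing a token). The function name `filter_extremes_sketch` and the toy data are assumptions made for illustration, and gensim's own handling of edge cases may differ.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent tokens first; ties broken alphabetically
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]

toy_docs = [
    ["trump", "great", "vote"],
    ["trump", "great"],
    ["trump", "poll"],
    ["trump", "vote"],
]
# 'trump' is in every document (too common) and 'poll' is in only one (too rare)
surviving = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75, keep_n=10)
print(surviving)
```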

### 4. get bag of words

In [49]:
# use doc2bow() to get the frequencies of each unique token in each document (tweet)

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# use tweet no. 168 for a sanity check; Python indexing starts at 0, so its index is 167
doc168bow = bow_corpus[167]
for token_id, freq in doc168bow:
    print("Word {} (\"{}\") occurs {} time.".format(token_id, dictionary[token_id], freq))

Word 5 ("https") occurs 1 time.
Word 153 ("hillary") occurs 1 time.
Word 469 ("total") occurs 1 time.
Word 769 ("to_") occurs 1 time.
Word 785 ("said") occurs 1 time.
Word 922 ("fit") occurs 1 time.
Word 923 ("hbirgj") occurs 1 time.
Word 924 ("lie") occurs 1 time.
Word 925 ("sniper") occurs 1 time.
Word 926 ("surrounded") occurs 1 time.
Word 927 ("turned") occurs 1 time.
Word 928 ("usss") occurs 1 time.
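
For readers unfamiliar with the bag-of-words format, the sketch below shows the kind of structure printed above: a sorted list of `(token_id, count)` pairs, with tokens missing from the dictionary ignored. `doc_to_bow` and the toy ids are hypothetical and only illustrate the idea.

```python
from collections import Counter

def doc_to_bow(doc, token2id):
    """Count how often each known token occurs and return (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_token2id = {"hillary": 0, "lie": 1, "total": 2}
toy_doc = ["total", "lie", "hillary", "lie", "unknown"]
toy_bow = doc_to_bow(toy_doc, toy_token2id)
print(toy_bow)
```

The token 'unknown' is dropped because it has no id, which mirrors what happens to tokens removed from the dictionary by filtering.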


### 5. get TF-IDF

In [55]:
# build TF-IDF object using bow_corpus

from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
from pprint import pprint
count = 0

# print out the first 3 documents to see their TF-IDF
for doc in corpus_tfidf:
    pprint(doc)
    count += 1
    if count >= 3:
        break

[(0, 0.30855648236230654),
 (1, 0.30855648236230654),
 (2, 0.32036118979895833),
 (3, 0.32036118979895833),
 (4, 0.33699899046618453),
 (5, 0.048314725767360775),
 (6, 0.3654414985700625),
 (7, 0.2919186816950804),
 (8, 0.33699899046618453),
 (9, 0.15013581908492357),
 (10, 0.3654414985700625)]
[(11, 0.4384947196000289),
 (12, 0.22124023855411182),
 (13, 0.37160688989712864),
 (14, 0.3155545278400487),
 (15, 0.30581997026715957),
 (16, 0.2605808337198211),
 (17, 0.15600760174699135),
 (18, 0.15274698884137305),
 (19, 0.3474743885816719),
 (20, 0.2710836383151232),
 (21, 0.22856571470692552),
 (22, 0.2560146127214273)]
[(23, 0.2955464303737287),
 (24, 0.21500065508005778),
 (25, 0.32195874482687165),
 (26, 0.11920656293060786),
 (27, 0.3629808160169991),
 (28, 0.22745288696870547),
 (29, 0.22277342683399368),
 (30, 0.4040028872071265),
 (31, 0.37526939922514985),
 (32, 0.286762587364972),
 (33, 0.3593929091167368)]
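
The weights above can be understood with a small from-scratch sketch. It assumes the common scheme of raw term frequency times `log2(N / df)`, followed by scaling each document to unit length. This is believed to match gensim's defaults, and the first document above does have unit length, but treat it as an illustration rather than the library's code. `tfidf_weights` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_weights(bow_corpus):
    """Weight each (token_id, count) pair by tf * log2(N / df), then L2-normalize."""
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents does each token id occur?
    doc_freq = Counter(tid for doc in bow_corpus for tid, _ in doc)
    weighted = []
    for doc in bow_corpus:
        vec = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in doc]
        vec = [(tid, w) for tid, w in vec if w > 0]  # tokens in every document get weight 0
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

# Token 0 occurs in three of four documents, every other token in exactly one
toy_corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 1)], [(0, 1), (3, 1)], [(4, 1)]]
toy_tfidf = tfidf_weights(toy_corpus)
print(toy_tfidf[0])
```

In the toy corpus the widespread token 0 receives a much smaller weight than the rare token 1, which is the same pattern as 'https' (id 5) in the first tweet above.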


### 6. LDA model using bag of words

In [57]:
# LDA w/ bag of words. The number of topics (5) is an arbitrary starting point, not based on a strong assumption

lda_model_bagwd = gensim.models.LdaMulticore(bow_corpus, num_topics=5, id2word=dictionary, passes=2, workers=2)

for idx, topic in lda_model_bagwd.show_topics(formatted=False, num_words= 10):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: [('https', 0.022474874), ('trump', 0.013916188), ('http', 0.012599999), ('thank', 0.011988601), ('great', 0.009230257), ('realdonaldtrump', 0.008889767), ('rt', 0.007360429), ('cnn', 0.0068106838), ('america', 0.006691999), ('vote', 0.006301517)]
Topic: 1 
Words: [('trump', 0.02320403), ('https', 0.022528851), ('new', 0.010681709), ('thank', 0.00826124), ('realdonaldtrump', 0.007611265), ('great', 0.0066496427), ('poll', 0.006541915), ('people', 0.0063205766), ('tonight', 0.006231952), ('amp', 0.0055969083)]
Topic: 2 
Words: [('realdonaldtrump', 0.04254456), ('https', 0.033187404), ('trump', 0.01799937), ('amp', 0.013525866), ('great', 0.012807073), ('thank', 0.008255979), ('america', 0.0063346014), ('rt', 0.0061293403), ('hillary', 0.0061226757), ('http', 0.004500832)]
Topic: 3 
Words: [('trump', 0.034351584), ('realdonaldtrump', 0.027367981), ('great', 0.023072526), ('https', 0.017378042), ('thank', 0.013730368), ('america', 0.009406475), ('cnn', 0.008115207), ('http

### 7. LDA model using TF-IDF

In [59]:
# LDA w/ TF-IDF

lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=5, id2word=dictionary, passes=2, workers=4)
for idx, topic in lda_model_tfidf.show_topics(formatted=False, num_words= 10):
    print('Topic: {}\n Word: {}'.format(idx, topic))

Topic: 0
 Word: [('trump', 0.0055621434), ('https', 0.0054579587), ('realdonaldtrump', 0.0049422355), ('great', 0.0043380414), ('thank', 0.0043010963), ('poll', 0.0030632033), ('http', 0.0028839065), ('rt', 0.0028728626), ('america', 0.0026202595), ('new', 0.0024704987)]
Topic: 1
 Word: [('https', 0.005192176), ('trump', 0.0044930703), ('realdonaldtrump', 0.00406542), ('thank', 0.003941524), ('great', 0.0038001095), ('amp', 0.002685054), ('america', 0.0025206378), ('people', 0.00249638), ('tonight', 0.002177804), ('cnn', 0.0021590334)]
Topic: 2
 Word: [('https', 0.0052454895), ('trump', 0.004629123), ('thank', 0.003722043), ('realdonaldtrump', 0.0034826875), ('great', 0.0030130579), ('amp', 0.0027811544), ('america', 0.0023733608), ('people', 0.0018757227), ('hillary', 0.0018369337), ('donald', 0.0016012391)]
Topic: 3
 Word: [('interviewed', 0.0049579735), ('https', 0.0041542416), ('enjoy', 0.0033767347), ('realdonaldtrump', 0.003281778), ('trump', 0.0032197528), ('foxandfriends', 0.00

- Both the bag-of-words and TF-IDF models produce highly homogeneous topics: the same top words (e.g. 'https', 'trump', 'realdonaldtrump', 'great', 'thank') recur across topics. This indicates that a single person's tweets may not be well suited to LDA, since the tweets inevitably contain very similar information rather than distinct topics.
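
One way to put a number on this homogeneity is the Jaccard similarity between the top-word sets of two topics (size of the intersection divided by size of the union). The sketch below applies it to the top 10 words of bag-of-words topics 0 and 1 as printed above. `jaccard` is a hypothetical helper, and this measure is an added suggestion rather than part of the original analysis.

```python
def jaccard(words_a, words_b):
    """Share of words that two lists have in common (0 = disjoint, 1 = identical)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Top 10 words of the bag-of-words topics 0 and 1, copied from the output above
bow_topic_0 = ["https", "trump", "http", "thank", "great",
               "realdonaldtrump", "rt", "cnn", "america", "vote"]
bow_topic_1 = ["trump", "https", "new", "thank", "realdonaldtrump",
               "great", "poll", "people", "tonight", "amp"]

overlap = jaccard(bow_topic_0, bow_topic_1)
print(round(overlap, 3))
```

Well-separated topics would score close to 0. Here the two topics share 5 of their 15 distinct top words.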

### 8. Test TF-IDF LDA model using an example tweet document

In [67]:
# use the same example document no.168 to test tf-idf model performance

print(processed_docs[167])

for index, score in sorted(lda_model_tfidf[bow_corpus[167]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))

['hillary', 'said', 'sniper', 'surrounded', 'usss', 'turned', 'total', 'lie', 'fit', 'to_', 'https', 'hbirgj']

Score: 0.9373569488525391	 
Topic: 0.005*"great" + 0.005*"https" + 0.004*"trump" + 0.004*"realdonaldtrump" + 0.004*"america" + 0.003*"hillary" + 0.003*"thank" + 0.003*"clinton" + 0.002*"enjoy" + 0.002*"rt"

Score: 0.01570645347237587	 
Topic: 0.005*"interviewed" + 0.004*"https" + 0.003*"enjoy" + 0.003*"realdonaldtrump" + 0.003*"trump" + 0.003*"foxandfriends" + 0.003*"hillary" + 0.003*"president" + 0.003*"foxnews" + 0.003*"thank"

Score: 0.01568310707807541	 
Topic: 0.005*"https" + 0.004*"trump" + 0.004*"realdonaldtrump" + 0.004*"thank" + 0.004*"great" + 0.003*"amp" + 0.003*"america" + 0.002*"people" + 0.002*"tonight" + 0.002*"cnn"

Score: 0.015653494745492935	 
Topic: 0.005*"https" + 0.005*"trump" + 0.004*"thank" + 0.003*"realdonaldtrump" + 0.003*"great" + 0.003*"amp" + 0.002*"america" + 0.002*"people" + 0.002*"hillary" + 0.002*"donald"

Score: 0.015599987469613552	 
Topic: 0

- In general, the model performs acceptably on the example tweet. However, this may not result directly from model tuning; it may instead be due to the homogeneity of the tweets.