# Analyzing Dynamic Front '18 Campaign 

Historical data that was crawled over the months of February to March 2018 by IST Pulse was utilized in this notebook to generate a topic model and draw themes in the english language. We narrowed our focus on English data for the purposes of initial analysis and pointers for conversation.

This notebook desires to discover conversations around mentions of US Forces and like media syndicates. The `nats_data_query.py` program was utilized as a wrapper to quickly query the data. The `es_data_processor.py` program was used to extract the fields from the JSON formatted data that are most necessary for linguistic and time series analyses. The `tweet_processor.py` program was utilized to preprocess the text data in preparation for the topic modeling task. The latest version separates hashtags into terms (best guess).

The Python package `gensim` was used to perform the Latent Dirichlet Allocation algorithm. A single core LDA model was used, in order to allow for guaranteed reproducibility. This is much slower than utilizing gensim's multi-core option but is only worthwhile if reproducibility is necessary.

This analysis was re-processed to provide the ability to save/load models and data associated with each part of the process.

## Query Data from Elasticsearch (es)

In [None]:
from nats_data_query import TweetGathererNats

In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
nats = TweetGathererNats()

### User enters query of interest in `q_s`

In [None]:
#query_string = 'meta.rule_matcher.results.rule_tag: UNIT-SM AND (doc.quoted_status.user.screen_name: (1stAirCavBDE  OR 2dCavalryRegt  OR 7thATC  OR DaggerBDE  OR Eucom  OR hqarrc OR NATO OR USArmy OR USArmyEurope )  OR doc.in_reply_to_screen_name: (1stAirCavBDE  OR 2dCavalryRegt  OR 7thATC  OR DaggerBDE  OR Eucom  OR hqarrc OR NATO OR USArmy OR USArmyEurope ))' + lang

q_s= 'doc.lang: en AND meta.rule_matcher.results.rule_tag: UNIT-SM AND (doc.quoted_status.user.screen_name: (1stAirCavBDE  OR 2dCavalryRegt  OR 7thATC  OR DaggerBDE  OR Eucom  OR hqarrc OR NATO OR USArmy OR USArmyEurope )  OR doc.in_reply_to_screen_name: (1stAirCavBDE  OR 2dCavalryRegt  OR 7thATC  OR DaggerBDE  OR Eucom  OR hqarrc OR NATO OR USArmy OR USArmyEurope ))'
        

### Print number of tweets in the English language, from February 16th - March 19th 

In [None]:
#print(nats.get_n_items(begin='2018-02-16', end='2018-03-19', lang=None))
print(nats.get_n_items(begin='2018-02-15', end='2018-03-19', query_str=q_s, lang='en'))

In [None]:
#Estimated time of processing ~ 5 mins  

In [None]:
en_unit_data = nats.get_data(begin='2018-02-15', end='2018-03-19',query_str=q_s, lang='en') 


In [None]:
en_unit_data[0]["_source"]

## Extract Necessary Fields

In [None]:
from es_data_processor import ESDataProcessor

In [None]:
esdp = ESDataProcessor(en_unit_data)

In [None]:
df = esdp.format_df()

In [None]:
df.head()

## Clean Text Data

In [3]:
from tweet_processor import TweetProcessor

In [4]:
tp = TweetProcessor()

In [5]:
texts = list(df.text)
cleaned_texts = []
for t in texts:
    cleaned_text = tp.clean_text(t)
    cleaned_texts.append(cleaned_text)

In [6]:
cleaned_texts[0]

['rodger',
 'drink',
 'cups',
 'water',
 'get',
 'good',
 'kind',
 'dip',
 'meditate',
 'minutes']

In [7]:
sparse = tp.make_sparse(texts=cleaned_texts)

In [8]:
vecs = [tp.stem_text(word_list=text) for text in sparse]

In [9]:
strings = [tp.re_string(text_list=text).strip() for text in vecs]

In [None]:
strings[0]

In [10]:
#append the preprocessed text as a column to the dataframe to keep track of original tweets
df['final_string'] = strings

In [11]:
df.head(n=5)

Unnamed: 0.1,Unnamed: 0,date,text,tweet_id,final_string
0,0,2018-03-05T16:12:04+00:00,"@USArmy Rodger that, drink 7 cups of water, ge...",051b3537b50ddb65737071e35c3813e71203721288308e...,rodger drink water get good kind minut
1,1,2018-03-14T09:19:04+00:00,RT @SafetyCenter: Safety Shout Out @USArmyEuro...,b0369b47b9e337112c285e623d1ec4bfb17c7a488b9404...,safeti shout readi safeti
2,2,2018-02-26T00:06:13+00:00,@USArmy @USArmyOldGuard KEY OF DAVID https://t...,28d297f4e4dfe097ff956e6508ef9aa0ed439d5b8d8b38...,key david read sin remov
3,3,2018-03-12T15:34:29+00:00,"@USArmy fuck you, Child-Fuckers. https://t.co/...",be78a7523bd9f53dd93943b2d0a6ec3bec275dd5b00d8b...,fuck child fucker
4,4,2018-03-12T15:20:46+00:00,"@USArmy ""Ides of March"" Free Amazon ebook 4.5⭐...",fa3b8e26e465bf2bf9652156efe48ed404fd8f9aa4e8a9...,ide march free amazon ebook promot militari po...


## Topic Modeling Analysis

In [12]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
!pip install numpy==1.14.0

In [None]:
df.to_csv('~/repos/nats/082016_espull.csv')

In [1]:
import pandas as pd
df=pd.read_csv('082016_espull.csv')

In [None]:
print(len(corpus))
print(len(dictionary))

In [13]:
from gensim import corpora

dictionary = corpora.Dictionary(vecs)

2018-03-21 03:35:43,385 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-03-21 03:35:43,513 : INFO : built Dictionary(4393 unique tokens: ['drink', 'get', 'good', 'kind', 'minut']...) from 8289 documents (total 71240 corpus positions)


In [14]:
corpus = [dictionary.doc2bow(item) for item in vecs]

In [15]:
#Save + pickle
dictionary.save('~/repos/nats/032018.dict')
corpora.MmCorpus.serialize('~/repos/nats/032018.mm', corpus)

2018-03-21 03:35:48,409 : INFO : saving Dictionary object under ~/repos/nats/032018.dict, separately None
2018-03-21 03:35:48,413 : INFO : saved ~/repos/nats/032018.dict
2018-03-21 03:35:48,415 : INFO : storing corpus in Matrix Market format to ~/repos/nats/032018.mm
2018-03-21 03:35:48,415 : INFO : saving sparse matrix to ~/repos/nats/032018.mm
2018-03-21 03:35:48,416 : INFO : PROGRESS: saving document #0
2018-03-21 03:35:48,435 : INFO : PROGRESS: saving document #1000
2018-03-21 03:35:48,455 : INFO : PROGRESS: saving document #2000
2018-03-21 03:35:48,479 : INFO : PROGRESS: saving document #3000
2018-03-21 03:35:48,503 : INFO : PROGRESS: saving document #4000
2018-03-21 03:35:48,525 : INFO : PROGRESS: saving document #5000
2018-03-21 03:35:48,546 : INFO : PROGRESS: saving document #6000
2018-03-21 03:35:48,566 : INFO : PROGRESS: saving document #7000
2018-03-21 03:35:48,587 : INFO : PROGRESS: saving document #8000
2018-03-21 03:35:48,594 : INFO : saved 8289x4393 matrix, density=0.182

In [None]:
def evaluate_graph(dictionary, corpus, texts, limit):
    """
    Function to display num_topics - LDA graph using c_v coherence
    
    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    limit : topic limit
    
    Returns:
    -------
    lm_list : List of LDA topic models
    c_v : Coherence values corresponding to the LDA model with respective number of topics
    """
    c_v = []
    lm_list = []
    for num_topics in range(1, limit):
        lm = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
        lm_list.append(lm)
        cm = CoherenceModel(model=lm, texts=texts, dictionary=dictionary, coherence='c_v')
        c_v.append(cm.get_coherence())
        
    # Show graph
    x = range(1, limit)
    plt.plot(x, c_v)
    plt.xlabel("num_topics")
    plt.ylabel("Coherence score")
    plt.legend(("c_v"), loc='best')
    plt.show()
    
    return lm_list, c_v

In [None]:
!pip install matplotlib

In [31]:
import pyLDAvis.gensim
import matplotlib.pyplot as plt
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel

In [None]:
## Runs for about 20 mins

In [None]:
%timeit lmlist, c_v = evaluate_graph(dictionary=dictionary, corpus=corpus, texts=vecs, limit=10)

In [16]:
import numpy
v1 = numpy.asarray([0., 2.], dtype='f')
v2 = numpy.asarray([0., 1.], dtype='f')
print(numpy.dot(v1, v2))

2.0


In [None]:
"""
!pip show numpy

Display 

"""

In [17]:
import gensim.models.ldamodel 
#ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word = dictionary, passes=20)
model = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=150, iterations=50, alpha='asymmetric')

2018-03-21 03:36:01,074 : INFO : using asymmetric alpha [0.031139033, 0.028788464, 0.02676786, 0.025012296, 0.023472836, 0.02211189, 0.020900112, 0.019814245, 0.018835641, 0.017949153, 0.017142355, 0.01640497, 0.015728405, 0.015105434, 0.014529934, 0.013996676, 0.013501173, 0.0130395545, 0.012608458, 0.012204954, 0.011826476, 0.011470766, 0.011135828, 0.010819895, 0.010521393, 0.010238921, 0.009971219, 0.009717159, 0.009475723, 0.0092459945, 0.009027141, 0.008818409, 0.008619112, 0.008428624, 0.008246372, 0.008071837, 0.007904536, 0.0077440296, 0.0075899116, 0.0074418085, 0.0072993743, 0.0071622906, 0.0070302603, 0.00690301, 0.0067802845, 0.0066618463, 0.006547475, 0.006436964, 0.006330122, 0.0062267683, 0.0061267363, 0.0060298666, 0.005936013, 0.0058450364, 0.0057568057, 0.0056711994, 0.0055881017, 0.005507404, 0.005429004, 0.005352805, 0.005278715, 0.0052066483, 0.0051365225, 0.0050682607, 0.0050017894, 0.0049370397, 0.0048739444, 0.0048124413, 0.004752471, 0.0046939775, 0.004636906,

2018-03-21 03:36:20,241 : INFO : topic #145 (0.002): 0.262*"america" + 0.192*"protect" + 0.092*"gen" + 0.051*"russian" + 0.043*"s" + 0.035*"macarthur" + 0.033*"dougla" + 0.032*"mind" + 0.022*"obvious" + 0.021*"make"
2018-03-21 03:36:20,242 : INFO : topic #2 (0.027): 0.088*"oh" + 0.045*"inf" + 0.036*"rather" + 0.036*"took" + 0.032*"ma" + 0.031*"live" + 0.029*"eat" + 0.025*"shout" + 0.017*"model" + 0.015*"search"
2018-03-21 03:36:20,242 : INFO : topic #1 (0.029): 0.055*"bn" + 0.042*"empir" + 0.042*"fli" + 0.026*"va" + 0.026*"amen" + 0.023*"well" + 0.022*"rid" + 0.020*"doubl" + 0.019*"need" + 0.018*"s"
2018-03-21 03:36:20,243 : INFO : topic #0 (0.031): 0.029*"s" + 0.019*"armi" + 0.018*"speed" + 0.016*"impress" + 0.016*"weak" + 0.016*"cool" + 0.015*"ambassador" + 0.015*"store" + 0.012*"boot" + 0.012*"go"
2018-03-21 03:36:20,245 : INFO : topic diff=inf, rho=0.447214


In [18]:
model.save('032018lda.model')

2018-03-21 03:37:10,649 : INFO : saving LdaState object under 032018lda.model.state, separately None
2018-03-21 03:37:10,660 : INFO : saved 032018lda.model.state
2018-03-21 03:37:10,663 : INFO : saving LdaModel object under 032018lda.model, separately ['expElogbeta', 'sstats']
2018-03-21 03:37:10,664 : INFO : storing np array 'expElogbeta' to 032018lda.model.expElogbeta.npy
2018-03-21 03:37:10,668 : INFO : not storing attribute id2word
2018-03-21 03:37:10,669 : INFO : not storing attribute dispatcher
2018-03-21 03:37:10,669 : INFO : not storing attribute state
2018-03-21 03:37:10,671 : INFO : saved 032018lda.model


In [None]:
for i in range(0, model.num_topics):
    print(str(i),':',model.print_topic(i))

## Get Top Topic for Each Tweet

In future it would probably be best to have it return the list of topics with their respective adherences for each tweet; for now it is just the topic most adherent to each tweet.

In [19]:
#assign topics to tweets
doc_top_scores = []
for i in range(len(cleaned_texts)):
    doc_top_scores.append(model.get_document_topics(bow=dictionary.doc2bow(cleaned_texts[i])))

In [20]:
topic_scores = {}

for i in range(len(doc_top_scores)):
    topic_scores[i] = {}
    topics = [topic[0] for topic in doc_top_scores[i]]
    scores = [topic[1] for topic in doc_top_scores[i]]
    for topic_n in range(500):
        
        if topic_n in topics:
            topic_scores[i][topic_n] = scores[topics.index(topic_n)]

In [21]:
import pandas as pd

top_Score_df = pd.DataFrame.from_dict(topic_scores)
top_Score_df = top_Score_df.fillna(0)
top_Score_df = top_Score_df.transpose()

In [22]:
top_Score_df['text'] = list(df.text)
top_Score_df['processed_text'] = list(strings)

In [24]:
import numpy as np

In [25]:
maxes = [] 
for row in range(top_Score_df.shape[0]):
    topic_adherence = list(top_Score_df.iloc[row,:top_Score_df.shape[1] - 2])
    max_score = topic_adherence.index(np.max(topic_adherence))
    maxes.append(max_score)

In [26]:
top_Score_df['max_topic'] = maxes
df['max_topic'] = maxes

In [27]:
top_Score_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,142,143,145,146,147,148,149,text,processed_text,max_topic
0,0.159077,0.0,0.14122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"@USArmy Rodger that, drink 7 cups of water, ge...",rodger drink water get good kind minut,0
1,0.01557,0.014394,0.513384,0.012506,0.011736,0.011056,0.01045,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,RT @SafetyCenter: Safety Shout Out @USArmyEuro...,safeti shout readi safeti,2
2,0.0,0.0,0.0,0.41273,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,@USArmy @USArmyOldGuard KEY OF DAVID https://t...,key david read sin remov,3
3,0.01038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"@USArmy fuck you, Child-Fuckers. https://t.co/...",fuck child fucker,30
4,0.056279,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.052761,0.0,0.0,0.0,0.0,0.0,0.0,"@USArmy ""Ides of March"" Free Amazon ebook 4.5⭐...",ide march free amazon ebook promot militari po...,48


In [50]:
print(df.groupby('max_topic').count().sort_index(by=['text'],ascending=False)['text'].loc[[143]].sum())

46


  if __name__ == '__main__':


## Display Top 20 Tweets Per Topic

In [51]:
#remove duplicates so you get the most out of the top 20 tweets
df_no_dups = df.drop_duplicates(subset='final_string')
print(df_no_dups.shape)

(6974, 6)


In [59]:
from IPython.display import display
from ipywidgets import widgets
from IPython.display import clear_output

text = widgets.Text()
display(text)

def handle_submit(sender):
    clear_output()
    print('Showing top 20 tweets in Topic',text.value)
    try:
        for t in df_no_dups.loc[df_no_dups.max_topic == int(text.value)].sample(frac=1)['text'][:70]:
            print(t)
            print()
    except KeyError:
        print('Invalid Topic Number (try anything from 0 to 199).')
    
text.on_submit(handle_submit)

Showing top 20 tweets in Topic 23
@NATO @Interpol @INTERPOL_HQ @Poland @PolandMFA @PutinRF_Eng meeting w Hillary. @sherylsandberg Mark to Beverly Hills. Arrangements to Embassy being made, for Hillary house. Bday tomorrow. @Lagarde stay out until redeem Suzanne +Plastic reconstructive-get Troopers off Soldiers https://t.co/0KrQsPv3Bv

Hey siri? Define plastic paddies. https://t.co/njwCv3vISP

@USArmy @FtBraggNC @fayobserver Prayer breakfast. Its 2018 not 1918

@USArmy @FtBraggNC @fayobserver Ironic Armed Force prayer breakfast, yet prayer is not allowed in our schools - go figure!

@NATO @jensstoltenberg Joint forces ISISTurkish dictator in Afrin they making sure testing how people prayer after test killing them in same place,what you feel ?!

@USArmy My father,  grandfather,  son, and myself . 4 generations, 3 Army, 1 Marine. All for the same cause

@NATO they should  help everyone and stop other countries bulling smaller nations if they want ot adapt and survive.

@USArmy @future_sol

In [53]:
df_no_dups[df_no_dups.text.str.contains('maternal')][['text', 'max_topic']]

Unnamed: 0,text,max_topic


In [35]:
dictionary = corpora.Dictionary.load('~/repos/nats/032018.dict')
corpus = corpora.MmCorpus('~/repos/nats/032018.mm')
lda = LdaModel.load('032018lda.model')
#print dictionary
#print corpus
#print lda

2018-03-21 04:40:14,348 : INFO : loading Dictionary object from ~/repos/nats/032018.dict
2018-03-21 04:40:14,352 : INFO : loaded ~/repos/nats/032018.dict
2018-03-21 04:40:14,355 : INFO : loaded corpus index from ~/repos/nats/032018.mm.index
2018-03-21 04:40:14,356 : INFO : initializing corpus reader from ~/repos/nats/032018.mm
2018-03-21 04:40:14,359 : INFO : accepted corpus with 8289 documents, 4393 features, 66353 non-zero entries
2018-03-21 04:40:14,360 : INFO : loading LdaModel object from 032018lda.model
2018-03-21 04:40:14,362 : INFO : loading expElogbeta from 032018lda.model.expElogbeta.npy with mmap=None
2018-03-21 04:40:14,367 : INFO : setting ignored attribute id2word to None
2018-03-21 04:40:14,367 : INFO : setting ignored attribute dispatcher to None
2018-03-21 04:40:14,368 : INFO : setting ignored attribute state to None
2018-03-21 04:40:14,368 : INFO : loaded 032018lda.model
2018-03-21 04:40:14,369 : INFO : loading LdaState object from 032018lda.model.state
2018-03-21 04:

In [54]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

In [55]:
pyLDAvis.gensim.prepare(lda, corpus, dictionary,mds='mmds')

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  topic_term_dists = topic_term_dists.ix[topic_order]


In [None]:
#### moving manually to data folder
import re

save_text = list(kenya_geo_df.text)
save_text = [re.sub('\\n|\n|,|\s|\t', ' ', str(save_text[i])) for i in range(len(save_text))]
kenya_geo_df.text = save_text


kenya_geo_df.to_csv('~/repos/validate/data/model_persist/month01/August 2016.csv')
top_Score_df.to_csv('~/repos/validate/data/model_persist/month01/august2016_extended.csv')

In [None]:
import pandas as pd

kenya_geo_df = pd.read_csv('~/repos/validate/data/model_persist/month01/August 2016.csv', encoding='iso-8859-1')

In [None]:
kenya_geo_df[kenya_geo_df.text.str.contains('health')][['text', 'max_topic']]

In [None]:
[text for text in kenya_geo_df['text'] if 'health' in text.lower() ]

In [None]:
[i for i in kenya_geo_df.loc[i,'text'] if 'health' in i.lower()]

In [None]:
# remove duplicates so you get the most out of the top 20 tweets
# kenya_tweet_df_no_dups = top_Score_df.drop_duplicates(subset='processed_text')
lda_save_path = "./saved-lda-model"
ldaModel.save(lda_save_path) 

#moving manually to data folder
kenya_geo_df.to_csv('kenya_data_full_all.csv', encoding='utf-8')  

In [None]:
oup = open("topic_summary.txt", "wb")
for x in topics_final:
    oup.write("%s\n" % (x))
oup.close()

sc.stop()

In [None]:
#Free up some memory 
clear()