# Capstone 2 Project - Russian Troll Tweets

### Initial tweet data exploration
First, I'll read the 13 tweet data files and join them into a single data set.

I will use the Tweets to explore questions about the nature of the disinformation campaign, such as:
* Did the tweets increase in frequency or volume around the time of major events? 
* Did other trolls retweet and amplify troll tweets?
* Can Twitter handles ('users') with similar features be grouped into clusters?
* Can common topics or themes be identified?
* What were the most-used hashtags?
* Did the tweets predominantly support one candidate or political party, or seek to undermine the other?
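
As a starting point for the hashtag question, here is a minimal sketch of counting hashtags with a regular expression. The sample tweets, the pattern `HASHTAG_RE`, and the helper name `count_hashtags` are made up for illustration; on the real data the input would be the `content` column.

```python
import re
from collections import Counter

# Assumed pattern: '#' followed by one or more word characters
HASHTAG_RE = re.compile(r"#(\w+)")

def count_hashtags(tweets):
    """Count case-insensitive hashtag occurrences across an iterable of tweet texts."""
    counts = Counter()
    for text in tweets:
        # lowercase so '#News' and '#news' are counted together
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts

# made-up sample tweets, not taken from the data set
sample_tweets = [
    "Breaking #News from the city #local",
    "More #news tonight",
    "No hashtags here",
]
top_hashtags = count_hashtags(sample_tweets).most_common(2)
print(top_hashtags)
```

On the full data this could be called as `count_hashtags(df['content'].dropna())`, assuming the tweets are loaded into `df` as in the cells that follow.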


## Data Dictionary

Header | Definition
-------|---------
`external_author_id` | An author account ID from Twitter 
`author` | The handle sending the tweet
`content` | The text of the tweet
`region` | A region classification, as [determined by Social Studio](https://help.salesforce.com/articleView?id=000199367&type=1)
`language` | The language of the tweet
`publish_date` | The date and time the tweet was sent
`harvested_date` | The date and time the tweet was collected by Social Studio
`following` | The number of accounts the handle was following at the time of the tweet
`followers` | The number of followers the handle had at the time of the tweet
`updates` | The number of “update actions” on the account that authored the tweet, including tweets, retweets and likes
`post_type` | Indicates if the tweet was a retweet or a quote-tweet
`account_type` | Specific account theme, as coded by Linvill and Warren
`retweet` | A binary indicator of whether or not the tweet is a retweet
`account_category` | General account theme, as coded by Linvill and Warren
`new_june_2018` | A binary indicator of whether the handle was newly listed in June 2018

In [2]:
import pandas as pd
import numpy as np
import datetime
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

To do: consider dropping the label columns (`account_type`, `account_category`, `new_june_2018`) so I can perform my own classification.

In [3]:
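# Read in the cleaned tweet data
# NOTE: iso-8859-1 never raises a decode error; if the file is really UTF-8, non-ASCII text comes out garbled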
df = pd.read_csv('../data/cleaned_tweets.csv', encoding = "iso-8859-1", parse_dates = ['publish_date', 'publish_date_short'])
print(df.shape)

(2365552, 22)


## Text Preprocessing

In [None]:

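import gensim
from gensim import utils
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

# Assumption: the original stopword list is not shown; NLTK's English list stands in for it
ignore_words = frozenset(stopwords.words('english'))

def preprocess_text(lemma, document):
    """Read a text file and return a list of cleaned tokens."""
    with open(document, 'r') as infile:
        # collapse the document into one string
        text = ' '.join(line.rstrip('\n') for line in infile)
    # make sure the text is a unicode string
    text = gensim.utils.any2unicode(text)

    # remove URLs
    text = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', text)

    # remove every symbol except @, # and whitespace
    text = re.sub(r'[^\w@#\s]', '', text)

    if lemma:
        # utils.lemmatize requires gensim < 4.0 and the Pattern package
        return utils.lemmatize(text, stopwords=ignore_words, min_length=3)

    # tokenize words using the NLTK TweetTokenizer
    tknzr = TweetTokenizer()
    tokens = [word.lower() for word in tknzr.tokenize(text)]

    # drop tokens of length 2 or less, pure numbers and stopwords
    return [word for word in tokens if len(word) > 2 and not word.isdigit() and word not in ignore_words]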


In [4]:
# https://stackoverflow.com/questions/44173624/how-to-apply-nltk-word-tokenize-library-on-a-pandas-dataframe-for-twitter-data
#df['tokenized_text'] = df['Text'].apply(word_tokenize) 

from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
df['tokenized_text'] = df['content'].apply(tt.tokenize)

In [40]:
#df.loc[df['tokenized_text'] = ]
tail = df.tail(25)
#tail.isin({'content': ['https']})
#df.isin([0, 2])
#tail

In [25]:
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
from gensim.models import TfidfModel
from gensim.models.ldamodel import LdaModel
from gensim import corpora
from gensim import matutils
# import lda

In [16]:
# function to tokenize and preprocess text
def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

In [17]:
documents = df['content'].tolist()
texts = [tokenize(document) for document in documents]

In [18]:
from collections import defaultdict

# count how often each token appears across the whole corpus
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

In [19]:
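# keep only tokens that occur more than 10 times in the corpus
texts = [[token for token in text if frequency[token] > 10] for text in texts]
# map each remaining token to an integer id
dictionary = corpora.Dictionary(texts)
# convert each tweet to a bag-of-words list of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]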
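
To make the last two cells concrete, here is a small pure-Python sketch of the same idea: drop rare tokens, give each remaining token an integer id, and represent each document as (id, count) pairs. It only imitates the kind of output `corpora.Dictionary` and `doc2bow` produce and is not gensim's implementation; the toy documents, the threshold of 1, and the names `toy_texts`, `token2id`, and `to_bow` are assumptions for illustration.

```python
from collections import Counter, defaultdict

# toy tokenized documents (made up for illustration)
toy_texts = [
    ["trump", "rally", "ohio"],
    ["trump", "vote", "ohio", "ohio"],
    ["music", "video"],
]

# corpus-wide token frequencies, as in the frequency cell above
frequency = defaultdict(int)
for text in toy_texts:
    for token in text:
        frequency[token] += 1

# keep tokens seen more than once (the notebook uses a threshold of 10)
min_count = 1
filtered = [[t for t in text if frequency[t] > min_count] for text in toy_texts]

# assign ids in order of first appearance (gensim's id order may differ)
token2id = {}
for text in filtered:
    for token in text:
        token2id.setdefault(token, len(token2id))

def to_bow(tokens):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in tokens)
    return sorted(counts.items())

toy_corpus = [to_bow(text) for text in filtered]
print(token2id)
print(toy_corpus)
```

In this toy run the third document ends up empty, because none of its tokens pass the frequency filter. Short tweets can disappear the same way in the real corpus.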

In [21]:
len(corpus)

2365552

## LDA Model
Fit an LDA model with 20 topics to the tweet corpus.

In [27]:
# fit LDA model
tweet_topics = LdaModel(corpus=corpus,
                        id2word=dictionary,
                        num_topics=20,
                        passes=10)

In [29]:
# print 10 of the 20 topics; the first number in each tuple is the topic id, `i` is only a counter
for i, topic in enumerate(tweet_topics.print_topics(10)):
    print('{} --- {}'.format(i, topic))

0 --- (19, '0.217*"https" + 0.063*"new" + 0.055*"local" + 0.019*"video" + 0.016*"watch" + 0.014*"missing" + 0.012*"north" + 0.009*"music" + 0.006*"wall" + 0.006*"facebook"')
1 --- (0, '0.124*"https" + 0.114*"trump" + 0.031*"hillary" + 0.026*"clinton" + 0.024*"president" + 0.022*"state" + 0.018*"donald" + 0.013*"think" + 0.012*"gop" + 0.011*"way"')
2 --- (1, '0.107*"https" + 0.036*"youtube" + 0.027*"â½" + 0.026*"day" + 0.024*"home" + 0.021*"business" + 0.020*"know" + 0.012*"great" + 0.012*"best" + 0.011*"want"')
3 --- (9, '0.199*"https" + 0.016*"world" + 0.012*"history" + 0.012*"guns" + 0.012*"jersey" + 0.010*"makes" + 0.010*"mccain" + 0.009*"team" + 0.009*"scottsdale" + 0.009*"ok"')
4 --- (13, '0.038*"https" + 0.036*"school" + 0.034*"ohio" + 0.023*"high" + 0.022*"vote" + 0.022*"life" + 0.019*"love" + 0.016*"god" + 0.013*"road" + 0.011*"mayor"')
5 --- (4, '0.042*"https" + 0.034*"md" + 0.025*"like" + 0.021*"car" + 0.020*"right" + 0.018*"health" + 0.018*"years" + 0.016*"america" + 0.011*"

### Try using pyLDAvis to Visualize Topics

In [32]:
## Try the pyLDAvis visualization
# NOTE: pyLDAvis 3.3+ renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim as gensimvis
import pyLDAvis

### pyLDAvis is great. Still need to:
* filter out non-English words (accents, etc.)
* filter out http(s) tokens
* learn how to interpret the visualization
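
A minimal sketch of the first two to-do items, assuming it is acceptable to strip URLs from the raw text and to drop any token containing non-ASCII characters (which would also discard legitimate accented or non-English words). The helper name `clean_tokens` and the sample tweet are hypothetical, and the tokenizer is a simple regex stand-in, not gensim's `simple_preprocess`. `str.isascii` needs Python 3.7 or later.

```python
import re

URL_RE = re.compile(r"https?://\S+")  # http:// or https:// up to the next whitespace
WORD_RE = re.compile(r"\w+")          # runs of letters, digits and underscores

def clean_tokens(text):
    """Lowercase, strip URLs, and keep only ASCII alphabetic tokens."""
    text = URL_RE.sub(" ", text.lower())
    tokens = WORD_RE.findall(text)
    # drop tokens with non-ASCII characters, digits or underscores
    return [t for t in tokens if t.isascii() and t.isalpha()]

# made-up sample tweet containing a garbled token and a link
sample = "Watch this â½ video https://t.co/abc123 NOW"
cleaned = clean_tokens(sample)
print(cleaned)
```

A function like this could replace the text passed to `tokenize` before the dictionary and corpus are rebuilt, so that "https" and garbled tokens no longer dominate the topics.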

In [33]:
# http://tlfvincent.github.io/2015/10/23/presidential-speech-topics/#topic=1&lambda=1&term=
vis_data = gensimvis.prepare(tweet_topics, corpus, dictionary)
pyLDAvis.display(vis_data)


In [34]:
# save viz output to an HTML file (vis_data was prepared in the previous cell)
pyLDAvis.save_html(vis_data, 'pyLDAviz_file.html')

In [35]:
type(vis_data)

pyLDAvis._prepare.PreparedData

## SpaCy

In [None]:
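import spacy
# 'en' is the spaCy 2.x shortcut; spaCy 3+ requires a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')
# run the pipeline on a sample sentence so that `doc` is defined
doc = nlp("""Berlin is the capital of Germany;
and the residence of Chancellor Angela Merkel.""")
# named entities found in the text
doc.ents
print(doc.ents[0], doc.ents[0].label_)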