# Information Warfare
## Russia’s use of Twitter during the 2016 US Presidential Election
---

### Import libraries

In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
get_ipython().config.get('IPKernelApp', {})['parent_appname'] = ""

import spacy
import os
import pickle

from collections import Counter

from plotly import tools
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.io as pio

from IPython.display import Image

init_notebook_mode(connected=True)

### Import data

In [2]:
# All Tweets
df = pd.read_pickle('data/raw/tweets.pkl')
df.reset_index(drop = True, inplace = True)

# Only English language Tweets
dfEng = pd.read_pickle('data/raw/tweetsEng.pkl')
dfEng.reset_index(drop = True, inplace = True)

# Only non-English language Tweets
dfOth = pd.read_pickle('data/raw/tweetsOth.pkl')
dfOth.reset_index(drop = True, inplace = True)


# Sampling and Pre-Processing

As you may know, natural language processing can be computationally expensive. Since we are dealing with a large number of tweets, it makes sense to take a sample of the data that we want to work with. 

Here, I take a 30% sample of tweets from each account. 

Note: the 30% sample size is arbitrary.

In [9]:
df.content = df.content.astype(str)

# Create the groupby object of author and category. Note: every author is assigned one category
df_sampled = df.groupby(['author', 'account_category'])['content']

# Take a 30% sample from each group
df_sampled = df_sampled.apply(lambda x: x.sample(frac=0.3, replace = False)).reset_index()
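
To see what the per-group sampling does, here is a minimal sketch on toy data. The account names, categories, and the `random_state` value are made up for illustration; only the `groupby` / `sample` pattern mirrors the cell above.

```python
import pandas as pd

# Toy data: two hypothetical accounts with ten tweets each
toy = pd.DataFrame({
    'author': ['account_a'] * 10 + ['account_b'] * 10,
    'account_category': ['CategoryA'] * 10 + ['CategoryB'] * 10,
    'content': ['tweet {}'.format(i) for i in range(20)],
})

# Sample 30% of the tweets within each (author, category) group.
# random_state is an addition here so that the sketch is reproducible.
toy_sampled = (
    toy.groupby(['author', 'account_category'])['content']
       .apply(lambda x: x.sample(frac=0.3, replace=False, random_state=42))
       .reset_index()
)

# Every account keeps roughly 30% of its own tweets
tweets_per_author = toy_sampled.groupby('author')['content'].count()
print(tweets_per_author)
```

Sampling within each group means that every account stays represented in proportion to its own activity, whereas a sample of the overall dataset could leave low-volume accounts with few or no tweets.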

You might be inclined to ask: why take a sample per account rather than just a sample of the overall dataset? When I first began experimenting with this data, I tried to embed each individual tweet in a vector space. However, this procedure produced nonsensical results. It seems that an individual tweet does not contain enough information for us to glean any useful insights about the author, or to discriminate between authors.

However, if we were to combine tweets per author such that we have a single document for each author that is representative of everything that a particular author has ever tweeted, we will have more than enough information to give doc2vec a chance to work. 

There is just one problem: spaCy struggles with very long documents (by default it refuses texts longer than `nlp.max_length`, which is 1,000,000 characters). As such, we will need to group our tweets by author after running the data through spaCy.

Before we get to spaCy, we need to do some minor preprocessing. Here, I remove hyperlinks and strip redundant whitespace from the text. I also experimented with removing the RT symbol, but this did not seem to have any meaningful impact on the analysis.

In [10]:
# SAMPLE AND PRE-CLEANING

# REMOVE RT SYMBOL
#from processing_functions import rt_remover
#df_sampled.content = df_sampled.content.apply(rt_remover)

# REMOVE ANY HYPERLINKS
from processing_functions import link_remover
df_sampled.content = df_sampled.content.apply(link_remover)


# STRIP ANY WHITESPACE ON EITHER SIDE OF THE TEXT
df_sampled.content = df_sampled.content.str.strip()
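
The `link_remover` helper lives in `processing_functions.py` and is not shown here. As a rough idea of what such a function might do, here is a minimal sketch; the name `remove_links` and the regular expression are my own assumptions, not the helper actually used above.

```python
import re

# Hypothetical pattern: anything starting with http://, https://, or www.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_links(text):
    """Replace hyperlinks with a space, then collapse and strip whitespace."""
    without_links = URL_PATTERN.sub(' ', text)
    return re.sub(r'\s+', ' ', without_links).strip()

cleaned = remove_links('RT check this https://t.co/abc123  now ')
print(cleaned)
```

A function like this can be passed to `Series.apply` in the same way as `link_remover` above.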

Now that we have done some basic pre-processing, we can begin to parse and clean our text with spaCy.

# Tokenization, cleaning, and lemmatization with spaCy

- Note: this takes approximately 5 - 10 minutes to run.

In [11]:
from spacy.tokens import Token

# This allows us to add custom attributes to tokens, in this case, hashtags and accounts
Token.set_extension('is_hashtag', default = False, force = True)
Token.set_extension('is_account', default = False, force = True)

# These functions tell spaCy what should be considered a hashtag or account
from processing_functions import hashtag_pipe
from processing_functions import is_account_pipe

# We can disable pipeline components that we don't need to save time
nlp = spacy.load("en", disable = ['parser', 'ner'])

# Here I add the two custom functions for hashtags and accounts to the pipeline
nlp.add_pipe(hashtag_pipe)
nlp.add_pipe(is_account_pipe)

# And we're off!
parsed_tweets = list(nlp.pipe(df_sampled.content))

# Gensim


Before we train our gensim doc2vec model, we need to do some more cleaning. The clean_doc function is located in the processing_functions.py file in this repository. 

Note: I also tried running the following without lemmatizing or removing stop words, but this made no meaningful difference to the clusters that do or do not form.

In [12]:
from processing_functions import clean_doc
from processing_functions import clean_doc_no_lemma

lemma = list(map(clean_doc, parsed_tweets))
#no_lemma = list(map(clean_doc_no_lemma, parsed_tweets))

The next step is perhaps the most critical to this analysis. We need to group the parsed and cleaned tweets by author, so that for each author, we have a 30% sample of everything that they have ever tweeted in a single list (as discussed above). We can easily do this by applying our own function to a Pandas groupby object. 

In [13]:
# each row in parsed content is a list of lists
df_sampled['parsed_content'] = lemma

# let's flatten these so that each row only has one list
from processing_functions import group_lists
        
df_grouped = df_sampled.groupby(['author', 
                    'account_category'])['parsed_content'].apply(group_lists).reset_index()

# Convert the parsed content Series to a list for gensim
parsed_content = list(df_grouped.parsed_content)
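
The `group_lists` helper is defined in `processing_functions.py` and is not shown here. The sketch below illustrates the general idea on toy data, assuming that each tweet has been reduced to a flat list of tokens; the function name `flatten_token_lists` and the data are hypothetical.

```python
from itertools import chain

import pandas as pd

def flatten_token_lists(token_lists):
    """Concatenate an iterable of token lists into a single flat list."""
    return list(chain.from_iterable(token_lists))

# Toy data: one row per tweet, each already reduced to a list of tokens
toy = pd.DataFrame({
    'author': ['account_a', 'account_a', 'account_b'],
    'account_category': ['CategoryA', 'CategoryA', 'CategoryB'],
    'parsed_content': [['vote', 'today'], ['news', 'update'], ['hello', 'world']],
})

# One row per author, holding every token that author produced
toy_grouped = (
    toy.groupby(['author', 'account_category'])['parsed_content']
       .apply(flatten_token_lists)
       .reset_index()
)

author_docs = dict(zip(toy_grouped['author'], toy_grouped['parsed_content']))
print(author_docs)
```

The result is one long "document" per author, which is the shape doc2vec needs in order to learn a single vector for each account.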

Now that our data is clean and in the proper format, we can go ahead and create our gensim doc2vec model. It never ceases to amaze me that we can implement such a powerful algorithm in just a few lines of code.

In [14]:
# create a doc2vec model
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# parsed_content is the list of parsed text
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(parsed_content)]

# Vector size of 300
model = Doc2Vec(documents, vector_size=300, window=5, min_count=3, workers=6)

# Stack the per-author document vectors into one (n_authors, 300) array
vec_array = np.stack([model.docvecs[i] for i in range(len(model.docvecs))])