# Information Warfare
## Russia’s use of Twitter during the 2016 US Presidential Election
---

Last updated by Benjamin Forleo 06/14/19

### Import libraries

In [1]:
import os
import sys
sys.path.insert(0, os.getcwd() + "/scripts")

import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
get_ipython().config.get('IPKernelApp', {})['parent_appname'] = ""

import spacy
import pickle

from collections import Counter

from plotly import tools
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.io as pio

from IPython.display import Image

init_notebook_mode(connected=True)

### Import data

In [2]:
# All Tweets
df = pd.read_pickle('data/raw/tweets.pkl')
df.reset_index(drop = True, inplace = True)

# Only English language Tweets
dfEng = pd.read_pickle('data/raw/tweetsEng.pkl')
dfEng.reset_index(drop = True, inplace = True)

# Only non-English language Tweets
dfOth = pd.read_pickle('data/raw/tweetsOth.pkl')
dfOth.reset_index(drop = True, inplace = True)

#### NOTE: For the time being, I am using only English tweets to replicate what I did originally. - BF 06/01/19

In [3]:
# Count the number of tweets by each author
counts_by_author = dfEng[['author', 'content']].groupby('author').count()

counts_by_author.reset_index(inplace = True)

# Number of authors with more than 400 tweets
print(sum(counts_by_author.content > 400))

# Keep only the tweets written by those authors
author_series = counts_by_author.author[counts_by_author.content > 400]

dfEng = dfEng[dfEng.author.isin(author_series)]

463


# Sampling and Pre-Processing

Natural language processing can be computationally expensive. Because we are dealing with a large number of tweets, it makes sense to work with a sample of the data.

Here, we take a 30% sample of the tweets from each account. Later, we group the tweets by author so that each row of our data frame contains one author and a 30% sample of everything that author has tweeted.

Note: the 30% sample size is arbitrary.

In [4]:
dfEng.content = dfEng.content.astype(str)

# Create the groupby object of author and category. Note: every author is assigned one category
df_sampled = dfEng.groupby(['author', 'account_category'])['content']

# Take a 30% sample from each group
df_sampled = df_sampled.apply(lambda x: x.sample(frac=0.3, replace = False)).reset_index()
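
The cell above draws a different random sample on every run, so downstream results can shift slightly between runs. Below is a minimal sketch of one way to make the sample reproducible, assuming a fixed seed is acceptable. The toy data frame, the helper name `sample_per_author`, and the seed value are illustrative and not part of the original analysis:

```python
import pandas as pd

# Toy stand-in for dfEng: two authors with ten tweets each (made-up data)
toy = pd.DataFrame({
    "author": ["a"] * 10 + ["b"] * 10,
    "account_category": ["X"] * 10 + ["Y"] * 10,
    "content": [f"tweet {i}" for i in range(20)],
})

def sample_per_author(frame, frac=0.3, seed=42):
    """Draw the same fraction of tweets from every (author, category) group."""
    grouped = frame.groupby(["author", "account_category"])["content"]
    # random_state fixes the draw, so repeated calls return identical rows
    sampled = grouped.apply(
        lambda s: s.sample(frac=frac, replace=False, random_state=seed)
    )
    return sampled.reset_index()

first = sample_per_author(toy)
second = sample_per_author(toy)
```

With a fixed `random_state`, `first` and `second` contain the same rows, which makes later clustering results easier to compare across runs.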

In [5]:
# SAMPLE AND PRE-CLEANING

# REMOVE RT SYMBOL
#from processing_functions import rt_remover
#df_sampled.content = df_sampled.content.apply(rt_remover)

# REMOVE ANY HYPERLINKS
from processing_functions import link_remover
df_sampled.content = df_sampled.content.apply(link_remover)


# STRIP ANY WHITESPACE ON EITHER SIDE OF THE TEXT
df_sampled.content = df_sampled.content.str.strip()

Now that we have done some basic pre-processing, we can begin to parse and clean our text with spaCy.
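
The `link_remover` helper used above lives in `processing_functions.py` and is not reproduced in this notebook. As a rough sketch of what such a step might look like, assuming links always start with `http://` or `https://` (the function name `remove_links` and the pattern are hypothetical, not the repository's implementation):

```python
import re

# Hypothetical pattern: a scheme followed by any run of non-whitespace characters
URL_PATTERN = re.compile(r"https?://\S+")

def remove_links(text):
    """Delete anything that looks like a URL and trim surrounding whitespace."""
    return URL_PATTERN.sub("", text).strip()

cleaned = remove_links("Breaking news https://t.co/abc123")
```

A pattern this simple will miss bare domains such as `example.com`; whether that matters depends on how links appear in the data.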

# Tokenization, cleaning, and lemmatization with spaCy

- Note: this takes approximately 5 - 10 minutes to run.

In [6]:
from spacy.tokens import Token

# This allows us to add custom attributes to tokens, in this case, hashtags and accounts
Token.set_extension('is_hashtag', default = False, force = True)
Token.set_extension('is_account', default = False, force = True)

# These functions tell spaCy what should be considered a hashtag or account
from processing_functions import hashtag_pipe
from processing_functions import is_account_pipe

# We can disable pipeline objects to save time: disable = ['parser', 'etc']
nlp = spacy.load("en", disable = ['parser', 'ner'])

# Here I add the two custom functions for hashtags and accounts to the pipeline
nlp.add_pipe(hashtag_pipe)
nlp.add_pipe(is_account_pipe)

# And we're off!
parsed_tweets = list(nlp.pipe(df_sampled.content))
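
`hashtag_pipe` and `is_account_pipe` are defined in `processing_functions.py` and are not shown here. Independent of spaCy, the kind of rule such components might apply can be sketched in plain Python. The patterns and the `classify_token` helper below are assumptions for illustration only:

```python
import re

HASHTAG = re.compile(r"^#\w+$")  # assumed rule: '#' followed by word characters
ACCOUNT = re.compile(r"^@\w+$")  # assumed rule: '@' followed by word characters

def classify_token(token_text):
    """Return 'hashtag', 'account', or 'word' for one whitespace-delimited token."""
    if HASHTAG.match(token_text):
        return "hashtag"
    if ACCOUNT.match(token_text):
        return "account"
    return "word"

labels = [classify_token(t) for t in "Thanks @someone for the #news".split()]
```

In the notebook itself, the equivalent information is stored on the custom token attributes `is_hashtag` and `is_account` registered with `Token.set_extension`.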

# Gensim


Before we train our gensim doc2vec model, we need to do some more cleaning. The clean_doc function is located in the processing_functions.py file in this repository. 

Note: We also tried running the following without lemmatizing or removing stop words. It made no meaningful difference to the clusters that do or do not form.

In [7]:
from processing_functions import clean_doc
from processing_functions import clean_doc_no_lemma

lemma = list(map(clean_doc, parsed_tweets))
#no_lemma = list(map(clean_doc_no_lemma, parsed_tweets))

The next step is perhaps the most critical to this analysis. We need to group the parsed and cleaned tweets by author, so that for each author, we have a 30% sample of everything that they have ever tweeted in a single list (as discussed above). We can easily do this by applying our own function to a Pandas groupby object. 

In [8]:
# each row in parsed content is a list of lists
df_sampled['parsed_content'] = lemma

# Let's flatten these so that each row has only one list
from processing_functions import group_lists
        
df_grouped = df_sampled.groupby(['author', 
                    'account_category'])['parsed_content'].apply(group_lists).reset_index()

# Convert the parsed content Series to a list for gensim
parsed_content = list(df_grouped.parsed_content)
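
`group_lists` is imported from `processing_functions.py` and is not shown here. Conceptually, the grouping step concatenates every token list belonging to one author into a single list. Here is a plain-Python sketch of that idea, with made-up rows and a hypothetical `group_tokens_by_author` helper:

```python
from collections import defaultdict

def group_tokens_by_author(rows):
    """rows: iterable of (author, token_list). Returns {author: flat token list}."""
    grouped = defaultdict(list)
    for author, tokens in rows:
        # Append this tweet's tokens to the author's running list
        grouped[author].extend(tokens)
    return dict(grouped)

rows = [
    ("author_a", ["vote", "today"]),
    ("author_b", ["news", "update"]),
    ("author_a", ["election", "day"]),
]
documents_by_author = group_tokens_by_author(rows)
```

Each value then plays the role of one document, which is why the doc2vec model produces one vector per author.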

Now that our data is clean and in the proper format, we can go ahead and create our gensim doc2vec model.

In [9]:
# create a doc2vec model
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# parsed_content is the list of parsed text
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(parsed_content)]

# Vector size of 300
model = Doc2Vec(documents, vector_size=300, window=5, min_count=3, workers = 6)

# save the model to disk
model.save('./saved_models/eng_doc2vec/eng_gensim_model')

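# Collect one document vector per author and stack them into a 2-D array.
# Note: gensim 4.0 renamed `model.docvecs` to `model.dv`.
arr_list = [model.docvecs[index] for index in range(len(model.docvecs))]

vec_array = np.stack(arr_list)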

In [10]:
# Merge docVec labels and save as csv
vec_df = pd.DataFrame(vec_array)

vec_df.insert(0, 'account_category', df_grouped.account_category)
vec_df.insert(0, 'author', df_grouped.author)

vec_df.to_csv('./data/eng_labeled_docvecs.csv', index = False)