# Information Warfare
## Russia’s use of Twitter during the 2016 US Presidential Election
---

### Import libraries

In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
get_ipython().config.get('IPKernelApp', {})['parent_appname'] = ""

import spacy
import os
import pickle

from collections import Counter

from plotly import tools
import plotly.graph_objs as go
# from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as py  # later cells call py.iplot

from plotly.offline import init_notebook_mode, plot, iplot
import plotly.io as pio

from IPython.display import Image

init_notebook_mode(connected=True)

### Import data

In [2]:
# All Tweets
df = pd.read_pickle('data/processed/tweets.pkl')
df.reset_index(drop = True, inplace = True)

# Only English language Tweets
dfEng = pd.read_pickle('data/processed/tweetsEng.pkl')
dfEng.reset_index(drop = True, inplace = True)

# Only non-English language Tweets
dfOth = pd.read_pickle('data/processed/tweetsOth.pkl')
dfOth.reset_index(drop = True, inplace = True)

# Visualization and clustering

Great! We've cleaned, tokenized, and embedded our textual data in a vector space. Let's visualize the data and see where we stand.

Since we are working with a 300-dimensional dataset, we need a dimensionality reduction technique if we want to visualize some of the patterns in our data. t-SNE is well suited to this purpose, and we will use it in the space below.

By reducing the number of dimensions with t-SNE, we can visualize our data on a two-dimensional canvas. In the following chart the color of each individual datapoint will correspond to a label manually assigned by Professors Warren and Linville. 

In [15]:
# Instantiate and fit the t-SNE model.
from sklearn.manifold import TSNE
tsne_35 = TSNE(n_components = 2, perplexity = 35, verbose = 0 , n_iter = 1000).fit_transform(vec_array)

# rgb 15 26 48
color_list = ["#003f5c", "#ffa600", "#f95d6a", "#a05195", "#ff7c43", "#2f4b7c", "#665191", "#d45087"]

data = []
for idx, i in enumerate(df_grouped.account_category.unique()):
    x = tsne_35[df_grouped.account_category == i, 0]
    y = tsne_35[df_grouped.account_category == i, 1]
    
    data.append(go.Scatter(x = x, y = y, mode = 'markers', name = i, marker = dict(color = color_list[idx])))

layout = go.Layout(title = dict(text = 't-SNE Scatter: Original Labels', font = dict(size = 30)), legend = dict(font = dict(size = 15)))

fig = go.Figure(data = data, layout = layout)

py.iplot(fig)

Amazing! Our accounts separate into distinct clusters that are more or less in line with the labels assigned by Warren and Linville! This suggests that we will be able to use KMeans or GMM to cluster accounts in an unsupervised way.

One particularly interesting result is that there appear to be two distinct groups of Right Trolls that Warren and Linville did not discriminate between. By analyzing the behavior of these groups and others, we may be able to gain additional insight into Russian tradecraft and the overall objective of the Russian Twitter campaign.

Let's move on to our clustering algorithms!

### KMeans

One of the challenges of using KMeans, or any clustering algorithm, is choosing an appropriate k. We can use a metric called inertia to help choose how many clusters to specify for our data.

But what is inertia and how does it work? Inertia is essentially a measure of clustering quality: a good clustering has tight clusters, with the samples in each cluster bunched together. Inertia measures how spread out the clusters are by summing the squared distance from each sample to the centroid of its cluster, so lower values indicate tighter clusters.

A good rule of thumb is to choose the "elbow" in the inertia plot, or the inflection point where inertia begins to decrease at a slower rate.
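
To make the definition concrete, here is a minimal sketch that computes inertia by hand with NumPy. The toy points, the centroids, and the helper name `inertia` are illustrative assumptions and are not part of this notebook's pipeline; scikit-learn reports the same quantity as `model.inertia_` after fitting.

```python
import numpy as np

def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: two tight groups on a line.
points = np.array([[0.0], [1.0], [10.0], [11.0]])

# k = 1: a single centroid at the overall mean.
one_centroid = np.array([[5.5]])
one_label = np.zeros(4, dtype=int)

# k = 2: one centroid per group.
two_centroids = np.array([[0.5], [10.5]])
two_labels = np.array([0, 0, 1, 1])

inertia_k1 = inertia(points, one_centroid, one_label)
inertia_k2 = inertia(points, two_centroids, two_labels)
print(inertia_k1, inertia_k2)  # 101.0 1.0
```

Going from one centroid to two drops inertia from 101.0 to 1.0. Any further clusters could only remove the remaining 1.0, which is the kind of flattening the elbow rule looks for.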

Special shout-out to Hugo from DataCamp for teaching me how inertia can be used to choose k, as well as laying the foundation for nearly everything that I know in Python.

In [16]:
from sklearn.cluster import KMeans

inertias = []
clusters = list(range(1,11))

for k in clusters:
    
    model = KMeans(n_clusters = k)
    
    model.fit(vec_array)
    
    inertias.append(model.inertia_)

In [17]:
data = [go.Scatter(x = clusters, y = inertias, mode = 'lines', marker = dict(color = "#003f5c")),
        go.Scatter(x = clusters, y = inertias, mode = 'markers', marker = dict(color = "#2f4b7c", size = 9))]

layout = go.Layout(title = dict(text = 'Inertia Plot', font = dict(size = 30)), showlegend = False)

fig = go.Figure(data = data, layout = layout)

py.iplot(fig)

The inertia plot suggests that a k of five may be appropriate for our data. Let's go ahead and instantiate a KMeans model with k equal to five.

In [18]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)

kmeans_labels = kmeans.fit_predict(vec_array)

In [19]:
color_list = ["#003f5c", "#ffa600", "#f95d6a", "#a05195", "#ff7c43", "#2f4b7c", "#665191", "#d45087"]

data = []
for idx, i in enumerate(np.unique(kmeans_labels)):
    x = tsne_35[kmeans_labels == i, 0]
    y = tsne_35[kmeans_labels == i, 1]
    
    data.append(go.Scatter(x = x, y = y, mode = 'markers', name = str(i), marker = dict(color = color_list[idx])))

layout = go.Layout(title = dict(text = 't-SNE Scatter: KMeans Labels', font = dict(size = 30)), legend = dict(font = dict(size = 15)))

fig = go.Figure(data = data, layout = layout)

py.iplot(fig)

That looks great! I am actually surprised at how well KMeans seems to be performing on our data. 

KMeans is a discriminative algorithm and can be thought of as a hard clustering technique: each element is assigned to exactly one cluster. This may be problematic when analyzing text, as we are likely to see overlap between clusters and would naturally expect more ambiguity in our cluster assignments. For example, the ideological bent of a particular Twitter account may be unclear, and it may be difficult to place the account into one category or another based on the text alone.

To account for this uncertainty, we could turn to a soft clustering method. For example, Gaussian Mixture Models (GMMs) are generative algorithms that provide a probabilistic way of doing soft clustering.

GMMs can loosely be thought of as an extension of KMeans, except that instead of placing centroids at random locations in space, we place probability distributions. We then use the Expectation-Maximization algorithm to estimate the parameters of each of the K distributions, updating them until convergence.

GMMs allow us to calculate the probability that a sample belongs to a cluster. As such, each sample is assigned a probability that it belongs in each cluster. 
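
As a rough illustration of what "a probability for each cluster" means, the sketch below computes these membership probabilities by hand for a one-dimensional mixture of two Gaussians. The weights, means, standard deviations, and the helper names `gaussian_pdf` and `responsibilities` are all made up for the example. This is only the probability calculation for fixed parameters, not a full Expectation-Maximization fit.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def responsibilities(x, weights, means, stds):
    """Probability that each point belongs to each component.

    Returns an array of shape (n_points, n_components) whose rows sum to 1.
    """
    weighted = np.stack(
        [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)],
        axis=1,
    )
    return weighted / weighted.sum(axis=1, keepdims=True)

# Hypothetical mixture: two equally weighted components centered at -1 and +1.
x = np.array([-2.0, 0.1, 2.0])
resp = responsibilities(x, weights=[0.5, 0.5], means=[-1.0, 1.0], stds=[1.0, 1.0])

# A hard assignment keeps only the most likely component for each point.
hard_labels = resp.argmax(axis=1)

print(resp.round(3))
print(hard_labels)
```

The middle point sits almost exactly between the two components, so its probabilities are close to 50/50. A hard assignment would still place it in a single cluster and discard that ambiguity.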

Let's go ahead and experiment with mixture models in the space below!

# Gaussian Mixture Models (Work in Progress)

I will be using Gaussian Mixture Models in the space below, although it is possible to use different kinds of probability distributions in mixture models.

In [20]:
from sklearn.mixture import GaussianMixture as GMM

gmm = GMM(n_components=5).fit(vec_array)

gmm_labels = gmm.predict(vec_array)

In [21]:
color_list = ["#003f5c", "#ffa600", "#f95d6a", "#a05195", "#ff7c43", "#2f4b7c", "#665191", "#d45087"]

data = []
for idx, i in enumerate(np.unique(gmm_labels)):
    x = tsne_35[gmm_labels == i, 0]
    y = tsne_35[gmm_labels == i, 1]
    
    data.append(go.Scatter(x = x, y = y, mode = 'markers', name = str(i), marker = dict(color = color_list[idx])))

layout = go.Layout(title = '2D representation of doc2vec clusters using GMM')

fig = go.Figure(data = data, layout = layout)

py.iplot(fig)

In [22]:
n_components = np.arange(1, 20)
models = [GMM(n, covariance_type='full', random_state=0).fit(vec_array)
          for n in n_components]

data = [go.Scatter(x = n_components, y = [m.bic(vec_array) for m in models], mode = 'lines', name = 'BIC'),
       go.Scatter(x = n_components, y = [m.aic(vec_array) for m in models], mode = 'lines', name = 'AIC')]

layout = go.Layout(xaxis = dict(title = 'n_components'))

py.iplot(go.Figure(data = data, layout = layout))

print('n_components associated with minimum AIC: ', 
      np.argmin([m.aic(vec_array) for m in models]) + 1)

n_components associated with minimum AIC:  5


In [23]:
gmm.predict_proba(vec_array)

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

Weird. I'm getting some results that I didn't expect, particularly how the class labels look exactly the same as those from KMeans, and how the AIC/BIC chart looks.

After thinking it over, I suspect that we may be falling victim to the curse of dimensionality. Since I am working with a relatively short and wide dataset (few samples relative to the number of dimensions), we might expect individual datapoints or even clusters of datapoints to be far apart in terms of distance. If this is the case, then the probabilities that our model generates will be extremely low or extremely high, and this appears to be what we see.
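
The sketch below illustrates one way this effect can arise; it is not a diagnosis of the data in this notebook. It assumes two equally weighted, unit-variance Gaussian clusters whose means differ by the same small `offset` in every dimension, and it uses the exact cluster probabilities rather than a fitted GMM. The function name and all parameter values are hypothetical.

```python
import numpy as np

def share_of_saturated_posteriors(n_dims, n_samples=2000, offset=0.5,
                                  threshold=0.99, seed=0):
    """Fraction of points whose largest cluster probability exceeds `threshold`."""
    rng = np.random.default_rng(seed)
    # Cluster 0 is centered at the origin; cluster 1 is shifted by `offset`
    # along every axis.
    delta = np.full(n_dims, offset)
    x = rng.normal(size=(n_samples, n_dims))  # samples drawn from cluster 0
    # Exact log-odds of cluster 1 versus cluster 0 for equal weights and
    # identity covariance: x . delta - |delta|^2 / 2.
    log_odds = x @ delta - 0.5 * delta @ delta
    p1 = 1.0 / (1.0 + np.exp(-log_odds))
    max_prob = np.maximum(p1, 1.0 - p1)
    return float(np.mean(max_prob > threshold))

for d in (2, 30, 300):
    print(d, share_of_saturated_posteriors(d))
```

With the same per-dimension offset, the clusters overlap heavily in 2 dimensions but are far apart in 300, so nearly every point ends up with a probability above 0.99 for one cluster. Under these assumptions, probabilities that look like hard 0/1 assignments are what we would expect from wide data.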

At any rate, I will be sticking with KMeans as I move forward. 

# Analysis

Now that we have completed our KMeans cluster assignments, it's time to see how they stack up against Linville and Warren's labeling scheme.

In [24]:
df_grouped['kmeans_labels'] = kmeans_labels

df_grouped.head()

| | author | account_category | parsed_content | kmeans_labels |
|---|---|---|---|---|
| 0 | 4MYSQUAD | LeftTroll | [there, need, stiff, penalty, officer, #LIE, #... | 1 |
| 1 | AANTIRACIST | LeftTroll | [this, alabama, mayor, say, her, account, hack... | 1 |
| 2 | ABIGAILSSILK | HashtagGamer | [#MyAchillesHeel, chuck, e., cheese, 's, @TagU... | 2 |
| 3 | ABIISSROSB | RightTroll | [#abi, break, disgraceful, liberal, mag, featu... | 3 |
| 4 | ABMERRLINS | RightTroll | [sick, planned, parenthood, want, parents, lie... | 3 |


In [25]:
label_dict = {}

for _, row in df_grouped.iterrows():
    # Collect the KMeans labels observed for each Warren/Linville category.
    label_dict.setdefault(row.account_category, []).append(row.kmeans_labels)

I want to compute the conditional probability of a cluster label given the label assigned by Linville and Warren. This is trickier than it seems, and I could not think of a better way to do this. 

In [26]:
choices = [0, 1, 2, 3, 4]

prob_df = pd.DataFrame()

for i in label_dict.keys():
    count = dict(Counter(label_dict[i]))
    length = sum(list(count.values()))

    prob_dict = {}

    for key in choices:
        # Clusters that never occur for this category get probability 0.
        prob_dict[key] = count.get(key, 0) / length
    
    row = pd.DataFrame(pd.Series(prob_dict)).T
    
    row.index = [i]
    
    prob_df = pd.concat([prob_df, row])
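
A more compact way to build this kind of table is `pd.crosstab` with `normalize='index'`, which divides each row by its total. The sketch below uses a small made-up dataframe that only borrows the column names `account_category` and `kmeans_labels`; it is an alternative under those assumptions, not the code used to produce the results in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_grouped with the two relevant columns.
toy = pd.DataFrame({
    "account_category": ["LeftTroll", "LeftTroll", "RightTroll",
                         "RightTroll", "RightTroll", "NewsFeed"],
    "kmeans_labels": [1, 1, 0, 3, 3, 4],
})

# Rows: original category. Columns: cluster label.
# normalize="index" turns counts into P(cluster | category).
prob_table = pd.crosstab(toy["account_category"], toy["kmeans_labels"],
                         normalize="index")

# Clusters that never appear (cluster 2 here) would otherwise be missing columns.
prob_table = prob_table.reindex(columns=range(5), fill_value=0.0)

print(prob_table)
```

Each row sums to 1, so every entry can be read as the share of accounts in that category that landed in the given cluster.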

That was a lot of work! We now have a dataframe of conditional probabilities. That is, given a label assigned by Warren and Linville, we have a conditional probability for each KMeans cluster label. This is clearer if we view the dataframe.

In [27]:
prob_df

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| LeftTroll | 0.059322 | 0.940678 | 0.0 | 0.0 | 0.0 |
| HashtagGamer | 0.031746 | 0.0 | 0.968254 | 0.0 | 0.0 |
| RightTroll | 0.524664 | 0.004484 | 0.0 | 0.470852 | 0.0 |
| NewsFeed | 0.066667 | 0.022222 | 0.0 | 0.0 | 0.911111 |
| NonEnglish | 0.857143 | 0.142857 | 0.0 | 0.0 | 0.0 |
| Commercial | 0.0 | 0.0 | 0.8 | 0.0 | 0.2 |
| Unknown | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


The table above indicates that for the most part, our clusters align pretty well with Warren and Linville's labels. Let's go ahead and visualize this. 

In [28]:
prob_df_t = prob_df.T

colors = ["#003f5c", "#58508d", "#bc5090", "#ff6361", "#ffa600"]
data = []
for idx, label in enumerate(list(prob_df_t.columns)[:6]):  
    data.append(go.Bar(x = list(prob_df_t.index), y = list(prob_df_t[label]), 
                       marker =  dict(color = colors), opacity = 0.9))
    
fig = tools.make_subplots(rows=2, cols=3, subplot_titles=('LeftTroll', 'HashtagGamer','RightTroll', 
                                                          'NewsFeed', 'NonEnglish', 'Commercial'))
fig.append_trace(data[0], 1, 1)
fig.append_trace(data[1], 1, 2)
fig.append_trace(data[2], 1, 3)
fig.append_trace(data[3], 2, 1)
fig.append_trace(data[4], 2, 2)
fig.append_trace(data[5], 2, 3)

fig['layout'].update(height=700, 
                     title= dict(text = 'Conditional Probability of KMeans Assignment Given a Label', 
                                 font = dict(size = 27)),
                     showlegend = False)

for i in range(1, 7):
    fig['layout']['yaxis' + str(i)].update(range = [0, 1], nticks = 10)


py.iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]  [ (1,3) x3,y3 ]
[ (2,1) x4,y4 ]  [ (2,2) x5,y5 ]  [ (2,3) x6,y6 ]



# Modeling (Work in Progress)

In [29]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

First, we need to check our class balance.

In [30]:
y = df_grouped.account_category.copy()
x = np.copy(vec_array)

print(Counter(y))

Counter({'RightTroll': 223, 'LeftTroll': 118, 'HashtagGamer': 63, 'NewsFeed': 45, 'NonEnglish': 7, 'Commercial': 5, 'Unknown': 2})


Due to the limited number of samples, I am going to classify NonEnglish, Commercial, and Unknown as Other.

In [31]:
other_list = ['NonEnglish','Commercial', 'Unknown']

y = y.apply(lambda x: 'Other' if x in other_list else x)

In [32]:
from sklearn.model_selection import train_test_split

from imblearn.over_sampling import SMOTE

x_train, x_test, y_train, y_test = train_test_split(x, y, stratify = y)

sm = SMOTE(random_state = 123, k_neighbors = 6)

x_res, y_res = sm.fit_resample(x_train, y_train)

### Logistic Regression

In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

logreg = OneVsRestClassifier(LogisticRegression(solver = 'liblinear'))

logreg.fit(x_res, y_res)

y_pred = logreg.predict(x_test)

In [34]:
print("Accuracy Score: ",accuracy_score(y_test, y_pred), "\n", "\n")
print(classification_report(y_test, y_pred))

Accuracy Score:  0.9396551724137931 
 

              precision    recall  f1-score   support

HashtagGamer       1.00      1.00      1.00        16
   LeftTroll       0.93      0.93      0.93        30
    NewsFeed       0.85      1.00      0.92        11
       Other       0.50      0.33      0.40         3
  RightTroll       0.96      0.95      0.95        56

   micro avg       0.94      0.94      0.94       116
   macro avg       0.85      0.84      0.84       116
weighted avg       0.94      0.94      0.94       116



### SVM

In [35]:
from sklearn.svm import SVC

svc = OneVsRestClassifier(SVC(gamma = 'auto'))

param_grid = {'estimator__C': [1, 3, 5, 7]}

grid_svc = GridSearchCV(estimator = svc, param_grid = param_grid, 
                        scoring = 'accuracy', n_jobs = -1, verbose = 0,
                       cv = 4)

grid_svc.fit(x_res, y_res)

y_pred_svc = grid_svc.predict(x_test)

print('SVM Accuracy Score: ',accuracy_score(y_test, y_pred_svc), "\n", "\n")

print(classification_report(y_test, y_pred_svc))

SVM Accuracy Score:  0.9396551724137931 
 

              precision    recall  f1-score   support

HashtagGamer       1.00      1.00      1.00        16
   LeftTroll       0.93      0.93      0.93        30
    NewsFeed       0.92      1.00      0.96        11
       Other       0.40      0.67      0.50         3
  RightTroll       0.98      0.93      0.95        56

   micro avg       0.94      0.94      0.94       116
   macro avg       0.85      0.91      0.87       116
weighted avg       0.95      0.94      0.94       116



### Random Forest

In [36]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state = 123, n_estimators = 10)


params_rf = {'max_depth':[2,3,4,5,6],
             'min_samples_leaf': [0.04, 0.06, 0.08],
             'max_features': [0.2, 0.4, 0.6, 0.8],
             'criterion':['gini', 'entropy']}
             
grid_rf = GridSearchCV(estimator = rf, param_grid = params_rf, 
                       scoring = 'accuracy', cv = 10, n_jobs = -1)


grid_rf.fit(x_res, y_res)

y_pred_rf = grid_rf.predict(x_test)

print("Accuracy Score: ", accuracy_score(y_test, y_pred_rf), "\n", "\n")
print(classification_report(y_test, y_pred_rf))

Accuracy Score:  0.9396551724137931 
 

              precision    recall  f1-score   support

HashtagGamer       1.00      0.94      0.97        16
   LeftTroll       0.96      0.90      0.93        30
    NewsFeed       0.92      1.00      0.96        11
       Other       0.50      0.67      0.57         3
  RightTroll       0.95      0.96      0.96        56

   micro avg       0.94      0.94      0.94       116
   macro avg       0.87      0.89      0.88       116
weighted avg       0.94      0.94      0.94       116



### XGBoost

In [37]:
import xgboost as xgb

param_grid = {
  'learning_rate': np.arange(0.05, 1.05, .10),
  'n_estimators' : [50],
  'subsample' : np.arange(0.05, 1.05, .05),
  'max_depth': [2,4,6]
  }

xg_cl = xgb.XGBClassifier(objective = 'multi:softmax')

randomized_xg_cl = RandomizedSearchCV(estimator = xg_cl, 
                                      param_distributions = param_grid,
                                      n_iter = 5, 
                                      cv = 5,
                                      scoring = 'accuracy',
                                      n_jobs = -1,
                                      verbose = 0)

randomized_xg_cl.fit(x_res, y_res)

print("Best Accuracy Score Train CV: ", randomized_xg_cl.best_score_)

Best Accuracy Score Train CV:  0.9928143712574851


In [38]:
y_pred = randomized_xg_cl.predict(x_test)

print('XGBoost Test Accuracy: ', accuracy_score(y_test, y_pred), "\n", "\n")

print(classification_report(y_test, y_pred))

XGBoost Test Accuracy:  0.9310344827586207 
 

              precision    recall  f1-score   support

HashtagGamer       1.00      0.94      0.97        16
   LeftTroll       0.93      0.93      0.93        30
    NewsFeed       0.85      1.00      0.92        11
       Other       0.50      0.33      0.40         3
  RightTroll       0.95      0.95      0.95        56

   micro avg       0.93      0.93      0.93       116
   macro avg       0.85      0.83      0.83       116
weighted avg       0.93      0.93      0.93       116

