# Twitter Trolls
### Classifying and analyzing Russian Troll Tweets using Deep Learning
#### by Christopher DeCarolis

In [46]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras
import re
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Conv1D
from keras.preprocessing.text import Tokenizer
from nltk.tokenize import word_tokenize
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

## Data Pre-processing
We have two datasets to work with. The first is a set of tweets scraped from Russian troll accounts (link __[here](https://www.kaggle.com/vikasg/russian-troll-tweets)__), and the second is a set of tweets scraped from random accounts on election day (link __[here](https://www.kaggle.com/kinguistics/election-day-tweets)__).

In [3]:
df_russian = pd.read_csv('~/datasets/russian-troll-tweets/tweets.csv')

In [4]:
print(df_russian[1:10])

        user_id         user_key    created_at          created_str  \
1  2.571870e+09  detroitdailynew  1.476133e+12  2016-10-10 20:57:00   
2  1.710805e+09       cookncooks  1.487767e+12  2017-02-22 12:43:43   
3  2.584153e+09     queenofthewo  1.482765e+12  2016-12-26 15:06:41   
4  1.768260e+09     mrclydepratt  1.501987e+12  2017-08-06 02:36:24   
5  2.882014e+09      giselleevns  1.477496e+12  2016-10-26 15:33:58   
6  1.658421e+09        baobaeham  1.488910e+12  2017-03-07 18:11:44   
7  2.587101e+09   judelambertusa  1.483102e+12  2016-12-30 12:49:30   
8  1.679279e+09    ameliebaldwin  1.477792e+12  2016-10-30 01:48:19   
9  1.649488e+09        hiimkhloe  1.458155e+12  2016-03-16 19:07:39   

   retweet_count retweeted  favorite_count  \
1            0.0     False             0.0   
2            NaN       NaN             NaN   
3            NaN       NaN             NaN   
4            NaN       NaN             NaN   
5            NaN       NaN             NaN   
6            

In [5]:
df_election = pd.read_csv('~/datasets/election-day-tweets/election_day_tweets.csv')

In [6]:
print(df_election[1:10])

                                                text           created_at  \
1  My @latimesopinion op-ed on historic #Californ...  2016-11-08 04:08:10   
2  #Senate Wisconsin Senate Preview: Johnson vs. ...  2016-11-08 04:11:35   
3  If Rubio Wins and #Trump Loses in #Florida... ...  2016-11-08 04:12:16   
4  #Senate Wisconsin Senate Preview: Johnson vs. ...  2016-11-08 04:16:20   
5  bob day is an "honest  person "  #senate patte...  2016-11-08 04:18:55   
6  Make Republicans #PayAPrice!\n 💙🇺🇸#VoteBLUE🔃th...  2016-11-08 04:20:09   
7  She's done America!! Please vote for @realDona...  2016-11-08 04:20:43   
8  #Illinois #Senate #StrongerTogether https://t....  2016-11-08 04:26:36   
9  #Senate Sen. Mark Warner to speak at ODU for V...  2016-11-08 04:41:04   

   geo lang place coordinates  user.favourites_count  user.statuses_count  \
1  NaN   en   NaN         NaN                      8                 4841   
2  NaN   en   NaN         NaN                    728               160390  

Checking the first few rows of each dataset, we can see that both were imported in the expected format. One important caveat about the election dataset is that we have no way of knowing whether the data collected comes from legitimate accounts or not. It is possible (likely, even) that some of the users in this dataset were actually troll accounts. However, as we have no way of telling definitively, we will proceed under the assumption that the election day tweets dataset represents tweets from real individuals.
Now we need to split the datasets into training and testing sets.

In [7]:
df_russian = shuffle(df_russian)
df_election = shuffle(df_election)

In [8]:
df_russian.loc[20, 'text']

"Obama on Trump winning: 'Anything's possible' https://t.co/MjVMZ5TR8Y #politics"

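__Text Cleaning__: Below we lowercase each tweet and strip the filler that would otherwise hinder classification: @-mentions, URLs, a leading `rt` marker, and every character that is not a letter, digit, space, or tab. Note that this removes the `#` symbol but keeps the hashtag's word. We then perform a train-test split on both datasets.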

In [9]:
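# Strips mentions, non-alphanumeric characters, URLs, and a leading "rt".
# A raw string keeps the regex escapes (\w, \S, \/) from being treated as string escapes.
for r in range(0, df_russian.shape[0]):
    # Skip missing or non-string entries (e.g. NaN).
    if not isinstance(df_russian.loc[r, 'text'], str): continue
    e = re.sub(
        r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?",
        '',
        df_russian.loc[r, 'text'].lower()
    )
    # DataFrame.set_value was removed in pandas 1.0; .at is the replacement.
    df_russian.at[r, 'text'] = e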

  


In [10]:
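# Same cleaning as above, applied to the election day tweets.
for r in range(0, df_election.shape[0]):
    # Skip missing or non-string entries (e.g. NaN).
    if not isinstance(df_election.loc[r, 'text'], str): continue
    e = re.sub(
        r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?",
        '',
        df_election.loc[r, 'text'].lower()
    )
    # DataFrame.set_value was removed in pandas 1.0; .at is the replacement.
    df_election.at[r, 'text'] = e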
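
To make the effect of the cleaning pattern concrete, here is a minimal sketch that applies the same regular expression to the sample tweet printed earlier. The helper name `clean_tweet` is hypothetical and not part of the original notebook; it only wraps the `re.sub` call used in the loops above.

```python
import re

# Same pattern as in the cleaning loops, written as a raw string.
CLEAN_PATTERN = re.compile(
    r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"
)

def clean_tweet(text):
    """Hypothetical helper: lowercase a tweet and strip mentions, URLs and punctuation."""
    return CLEAN_PATTERN.sub('', text.lower())

sample = "Obama on Trump winning: 'Anything's possible' https://t.co/MjVMZ5TR8Y #politics"
cleaned = clean_tweet(sample)
print(cleaned.split())
```

The URL and punctuation disappear, while the word `politics` from the hashtag survives. With pandas, a function like this could presumably be applied to a whole column with `Series.map`, as long as non-string entries such as `NaN` are skipped first.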

  


In [11]:
df_russian.head()

Unnamed: 0,user_id,user_key,created_at,created_str,retweet_count,retweeted,favorite_count,text,tweet_id,source,hashtags,expanded_urls,posted,mentions,retweeted_status_id,in_reply_to_status_id
73081,1658421000.0,baobaeham,1493807000000.0,2017-05-03 10:26:23,,,,hooray wfoods windmill wholefoods revamp is ...,8.597158e+17,,[],[],POSTED,[],,
202986,3083087000.0,aldrich420,1478644000000.0,2016-11-08 22:34:01,0.0,False,0.0,maga make america great again thanks trumpf...,7.961186e+17,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...","[""maga"",""TrumpForPresident"",""maga"",""TrumpForPr...",[],POSTED,"[""realdonaldtrump"",""psysamurai33317""]",,
22104,7.506563e+17,klara_sauber,1500302000000.0,2017-07-17 14:27:55,,,,merkel macht europa besser merkelserfolge i...,8.869556e+17,,"[""Merkel""]",[],POSTED,[],8.869492e+17,
145024,2620870000.0,puredavie,1481303000000.0,2016-12-09 16:57:46,,,,wake me up before you ho ho christmasapopsong,8.07268e+17,,[],[],POSTED,[],,
130617,1657754000.0,johnbranchh,1429742000000.0,2015-04-22 22:38:15,,,,news uva dean bashes rolling stone article i...,5.910081e+17,,"[""news""]",[],POSTED,[],,


In [12]:
train_russian_df, test_russian_df = train_test_split(df_russian.loc[:, 'text'], test_size=0.2)
train_election_df, test_election_df = train_test_split(df_election.loc[:, 'text'], test_size=0.2)

In [33]:
docs = []
num_valid_russia = 0
num_valid_election = 0
for chunk in train_russian_df:
    if isinstance(chunk, str): 
        num_valid_russia += 1
        docs.append(chunk)
for chunk in train_election_df:
    if isinstance(chunk, str): 
        num_valid_election += 1
        docs.append(chunk)

In [34]:
vocab_size = 5000
tokenize = Tokenizer(num_words=vocab_size)
tokenize.fit_on_texts(docs)

We have now fit a tokenizer that indexes the most frequent words in the training tweets (capped by `vocab_size`). This step does not learn embeddings; it only gives us a vocabulary with which each tweet can be encoded as a bag-of-words vector. We now need to transform our data and give it corresponding labels. A label of 0 means that the tweet is a troll tweet, while a label of 1 means that the tweet is a normal/nonmalicious tweet.

In [35]:
x_train = tokenize.texts_to_matrix(docs)
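
As a rough illustration of what the bag-of-words encoding produces, the sketch below builds a binary document-term matrix in plain Python. It is a simplified stand-in, not the Keras implementation: the helpers `fit_vocab` and `to_binary_matrix` and the toy documents are assumptions made for this example, and only the general behaviour (one row per tweet, one column per word index, index 0 left unused) is meant to mirror the tokenizer.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Hypothetical helper: map the most frequent words to indices 1..num_words-1."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(num_words - 1)
    return {word: i for i, (word, _) in enumerate(most_common, start=1)}

def to_binary_matrix(texts, word_index, num_words):
    """Hypothetical helper: one 0/1 row per text; column j is 1 if word j occurs."""
    matrix = []
    for text in texts:
        row = [0] * num_words
        for word in text.split():
            idx = word_index.get(word)
            if idx is not None:  # out-of-vocabulary words are ignored
                row[idx] = 1
        matrix.append(row)
    return matrix

toy_docs = ["maga make america great again", "make sure to vote", "vote vote vote"]
toy_index = fit_vocab(toy_docs, num_words=6)
toy_matrix = to_binary_matrix(toy_docs, toy_index, num_words=6)
for row in toy_matrix:
    print(row)
```

Word order and repetition are lost in this representation: the third toy document, which repeats one word three times, has a single 1 in its row.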

In [36]:
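# Labels follow the order in which docs was built: troll tweets (0) first,
# then election day tweets (1).
y_train = np.concatenate(
    (np.full(num_valid_russia, 0), np.full(num_valid_election, 1))
)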

In [38]:
print(x_train.shape)
print(y_train.shape)

(480868, 5000)
(480868,)


We now move to create a model that will learn word embeddings, and then use that model to extract semantic information.

In [47]:
model = Sequential()
model.add(Embedding(vocab_size, 32, input_length=5000))
model.add(Conv1D(1, 32))
model.o

Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead


(?, 5000, 32)
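
One point worth keeping in mind here: an embedding layer looks up a vector for each integer word index, so it is usually fed fixed-length sequences of word indices rather than a 0/1 bag-of-words matrix. The sketch below shows that kind of input in plain Python, together with the length arithmetic for a 1D convolution with no padding and stride 1. The helpers `texts_to_padded_sequences` and `conv1d_output_length`, the toy vocabulary, and the choice to pad at the end with 0 are all assumptions for illustration, not part of the notebook.

```python
def texts_to_padded_sequences(texts, word_index, max_len, pad_value=0):
    """Hypothetical helper: encode each text as word indices, cut or padded to max_len."""
    padded = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        ids = ids[:max_len]  # truncate long texts
        padded.append(ids + [pad_value] * (max_len - len(ids)))  # pad short ones
    return padded

def conv1d_output_length(input_length, kernel_size):
    """Output length of a 1D convolution with no padding and stride 1."""
    return input_length - kernel_size + 1

toy_word_index = {"vote": 1, "make": 2, "america": 3}
toy_sequences = texts_to_padded_sequences(
    ["make america great again", "vote"], toy_word_index, max_len=4
)
print(toy_sequences)
print(conv1d_output_length(5000, 32))
```

Under these assumptions, a kernel of size 32 sliding over 5000 positions yields 4969 output positions, one per valid window.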
