# Twitter Trolls
### Classifying and analyzing Russian Troll Tweets using Deep Learning
#### by Christopher DeCarolis

In [57]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras
import re
import gensim
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Conv1D, MaxPooling1D
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from nltk.tokenize import word_tokenize
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

## Data Pre-processing
We have two different data sets to work with. The first is a set of Tweets scraped from Russian Troll accounts (link __[here](https://www.kaggle.com/vikasg/russian-troll-tweets)__), and the second is a set of tweets scraped from random accounts during election day (link __[here](https://www.kaggle.com/kinguistics/election-day-tweets)__). 

In [48]:
df_russian = pd.read_csv('~/datasets/russian-troll-tweets/tweets.csv')

In [49]:
print(df_russian[1:10])

        user_id         user_key    created_at          created_str  \
1  2.571870e+09  detroitdailynew  1.476133e+12  2016-10-10 20:57:00   
2  1.710805e+09       cookncooks  1.487767e+12  2017-02-22 12:43:43   
3  2.584153e+09     queenofthewo  1.482765e+12  2016-12-26 15:06:41   
4  1.768260e+09     mrclydepratt  1.501987e+12  2017-08-06 02:36:24   
5  2.882014e+09      giselleevns  1.477496e+12  2016-10-26 15:33:58   
6  1.658421e+09        baobaeham  1.488910e+12  2017-03-07 18:11:44   
7  2.587101e+09   judelambertusa  1.483102e+12  2016-12-30 12:49:30   
8  1.679279e+09    ameliebaldwin  1.477792e+12  2016-10-30 01:48:19   
9  1.649488e+09        hiimkhloe  1.458155e+12  2016-03-16 19:07:39   

   retweet_count retweeted  favorite_count  \
1            0.0     False             0.0   
2            NaN       NaN             NaN   
3            NaN       NaN             NaN   
4            NaN       NaN             NaN   
5            NaN       NaN             NaN   
6            

In [50]:
df_election = pd.read_csv('~/datasets/election-day-tweets/election_day_tweets.csv')

In [51]:
print(df_election[1:10])

                                                text           created_at  \
1  My @latimesopinion op-ed on historic #Californ...  2016-11-08 04:08:10   
2  #Senate Wisconsin Senate Preview: Johnson vs. ...  2016-11-08 04:11:35   
3  If Rubio Wins and #Trump Loses in #Florida... ...  2016-11-08 04:12:16   
4  #Senate Wisconsin Senate Preview: Johnson vs. ...  2016-11-08 04:16:20   
5  bob day is an "honest  person "  #senate patte...  2016-11-08 04:18:55   
6  Make Republicans #PayAPrice!\n 💙🇺🇸#VoteBLUE🔃th...  2016-11-08 04:20:09   
7  She's done America!! Please vote for @realDona...  2016-11-08 04:20:43   
8  #Illinois #Senate #StrongerTogether https://t....  2016-11-08 04:26:36   
9  #Senate Sen. Mark Warner to speak at ODU for V...  2016-11-08 04:41:04   

   geo lang place coordinates  user.favourites_count  user.statuses_count  \
1  NaN   en   NaN         NaN                      8                 4841   
2  NaN   en   NaN         NaN                    728               160390  

Checking the first few rows of each dataset, we can see that they were imported in the expected format. An important caveat about the election dataset is that we have no way of knowing whether its tweets come from legitimate accounts or not. It is possible (likely, even) that some of the users in this dataset were actually troll accounts. However, as we have no way of definitively telling, we will proceed under the assumption that the election day tweets dataset represents tweets from real individuals.
Before splitting the datasets into training and testing sets, we shuffle them.

In [52]:
df_russian = shuffle(df_russian)
df_election = shuffle(df_election)

In [53]:
df_russian.loc[20, 'text']

"Obama on Trump winning: 'Anything's possible' https://t.co/MjVMZ5TR8Y #politics"

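__Text Cleaning__: Below we lowercase each tweet and strip @-mentions, URLs, a leading `rt` marker, and every remaining non-alphanumeric character, since this filler would otherwise hinder classification. Note that only the `#` symbol is removed; the hashtag text itself is kept. We also perform a train-test split on both datasets.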

In [54]:
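# Lowercase each tweet and strip mentions, URLs, a leading 'rt', and punctuation.
for r in range(0, df_russian.shape[0]):
    if not isinstance(df_russian.loc[r, 'text'], str): continue
    e = re.sub(
        r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?",
        '',
        df_russian.loc[r, 'text'].lower()
    )
    df_russian.at[r, 'text'] = e  # set_value is deprecated and was removed in pandas 1.0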

  


In [55]:
# Apply the same cleaning to the election-day tweets.
for r in range(0, df_election.shape[0]):
    if not isinstance(df_election.loc[r, 'text'], str): continue
    e = re.sub(
        r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?",
        '',
        df_election.loc[r, 'text'].lower()
    )
    df_election.at[r, 'text'] = e  # set_value is deprecated and was removed in pandas 1.0
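The two loops above apply the same substitution to each dataset. As a minimal sketch, the step can be factored into a single helper and checked on the sample tweet shown earlier. The names `TWEET_NOISE` and `clean_tweet` are hypothetical and do not appear in the original notebook; the pattern itself is the one used above.

```python
import re

# Same pattern as the loops above, written as a raw string and compiled once.
TWEET_NOISE = re.compile(
    r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"
)


def clean_tweet(text):
    """Lowercase a tweet and strip mentions, URLs, a leading 'rt', and punctuation.

    Hypothetical helper; non-string values (e.g. NaN) are returned unchanged.
    """
    if not isinstance(text, str):
        return text
    return TWEET_NOISE.sub("", text.lower())


sample = "Obama on Trump winning: 'Anything's possible' https://t.co/MjVMZ5TR8Y #politics"
# Splitting on whitespace hides the double spaces left behind by removed tokens.
print(clean_tweet(sample).split())
```

With pandas, such a helper could be applied to a whole column with `df['text'].map(clean_tweet)`, which avoids the explicit row loop.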

  


In [56]:
df_russian.head()

| | user_id | user_key | created_at | created_str | retweet_count | retweeted | favorite_count | text | tweet_id | source | hashtags | expanded_urls | posted | mentions | retweeted_status_id | in_reply_to_status_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 161473 | 2533002000.0 | jasper_fly | 1478535000000.0 | 2016-11-07 16:04:10 | 0.0 | False | 0.0 | fact check trump 2016electionin3words | 7.956581e+17 | `<a href="http://twitter.com" rel="nofollow">Tw...` | `["2016ElectionIn3Words"]` | `[]` | POSTED | `["meyer2311"]` | 7.956377e+17 | |
| 158513 | 4508631000.0 | thefoundingson | 1478798000000.0 | 2016-11-10 17:15:31 | | | | so people have spoken whats your problem | 7.967632e+17 | | `[]` | `[]` | POSTED | `[]` | | |
| 18102 | 1680366000.0 | willisbonnerr | 1481345000000.0 | 2016-12-10 04:37:29 | | | | marketing all unsigned artist can get airplay... | 8.074441e+17 | | `["unsigned"]` | `[]` | POSTED | `[]` | | |
| 171868 | 2532612000.0 | kathiemrr | 1484833000000.0 | 2017-01-19 13:43:01 | | | | the best jennycraig ad in the world renamemill... | 8.220769e+17 | | `["JennyCraig"]` | `[]` | POSTED | `[]` | | |
| 151595 | 1617939000.0 | paulinett | 1499938000000.0 | 2017-07-13 09:30:29 | | | | safari in kenya discover its unspoilt magic | 8.854312e+17 | | `[]` | `[]` | POSTED | `[]` | | |


In [12]:
train_russian_df, test_russian_df = train_test_split(df_russian.loc[:, 'text'], test_size=0.2)
train_election_df, test_election_df = train_test_split(df_election.loc[:, 'text'], test_size=0.2)

In [13]:
docs = []
num_valid_russia = 0
num_valid_election = 0
for chunk in train_russian_df:
    if isinstance(chunk, str): 
        num_valid_russia += 1
        docs.append(chunk)
for chunk in train_election_df:
    if isinstance(chunk, str): 
        num_valid_election += 1
        docs.append(chunk)

In [58]:
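# Word2Vec expects a list of token lists; raw strings would be read character by character.
sentences = [doc.split() for doc in docs]
embeds = gensim.models.Word2Vec(sentences)  # builds the vocabulary and trains once
embeds.train(sentences, total_examples=len(sentences), epochs=10)  # 10 further epochs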

KeyboardInterrupt: 

In [14]:
vocab_size = 50
tokenize = Tokenizer(num_words=vocab_size)
tokenize.fit_on_texts(docs)

In [25]:
print(len(tokenize.word_index))

217516


We have now fitted a tokenizer that maps each word to an integer index; when texts are transformed, only the most frequent words (as set by `num_words`) are kept. This is not yet an embedding: the embedding vectors are learned later by the model. We now need to transform our data into fixed-length numeric inputs and give it corresponding labels. A label of 0 means that the tweet is a troll tweet, while a label of 1 means that the tweet is a normal/non-malicious tweet.

In [38]:
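# Note: texts_to_matrix yields a binary bag-of-words matrix (n_docs, num_words), so word order is lost.
x = tokenize.texts_to_matrix(docs)
x = pad_sequences(x, maxlen=50)  # rows already have 50 columns; no padding occurs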
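An `Embedding` layer is usually fed ordered integer word indices rather than a bag-of-words matrix. The plain-Python sketch below mimics, under simplifying assumptions (whitespace tokenization, no punctuation filtering), how a frequency-ranked word index, integer sequences, and left-padding fit together. The function names are hypothetical, and this is an illustration of the idea rather than the Keras implementation.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words whose index is >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen entries."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


toy_docs = ["trump wins vote", "vote vote now"]
word_index = fit_word_index(toy_docs)
sequences = to_sequences(toy_docs, word_index, num_words=4)
padded = [pad_pre(s, maxlen=5) for s in sequences]
print(word_index)
print(padded)
```

In this toy run, the word with index 4 is dropped because `num_words=4` keeps only indices 1 to 3, which mirrors why a small `num_words` discards most of a large vocabulary.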

In [39]:
print(x.shape)
frac = int(0.8*(x.shape[0]))
x_train = x[:frac]
x_valid = x[frac:]

(480872, 50)


In [40]:
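# Label 0 = troll tweet, label 1 = election-day (assumed genuine) tweet.
y = np.concatenate(
    (np.full(num_valid_russia, 0), np.full(num_valid_election, 1))
)
# docs lists all troll tweets first, so shuffle x and y together before the
# positional split; otherwise the tail of the array holds a single class.
x, y = shuffle(x, y)
x_train, x_valid = x[:frac], x[frac:]
y_train, y_valid = y[:frac], y[frac:]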
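Because `docs` is built with all troll tweets first and all election tweets second, row order matters for any positional split. The toy sketch below (plain Python, hypothetical names and made-up sizes) shows that slicing class-ordered labels leaves only one class in the validation part, and that shuffling features and labels with one shared permutation keeps each pair aligned.

```python
import random


def positional_split(items, train_fraction=0.8):
    """Split a list by position, as done with x[:frac] and x[frac:]."""
    cut = int(train_fraction * len(items))
    return items[:cut], items[cut:]


# Toy labels in the same layout as docs: one class first, then the other.
labels = [0] * 4 + [1] * 6
features = [f"tweet_{i}" for i in range(len(labels))]

_, valid_ordered = positional_split(labels)
print(sorted(set(valid_ordered)))  # classes present in the unshuffled validation part

# One shared permutation keeps every feature paired with its own label.
order = list(range(len(labels)))
random.Random(0).shuffle(order)
shuffled_features = [features[i] for i in order]
shuffled_labels = [labels[i] for i in order]
train_labels, valid_labels = positional_split(shuffled_labels)
```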

In [41]:
print(x_train.shape)
print(y_train.shape)

(384697, 50)
(384697,)


In [42]:
print(x_train[0].size)

50


We now move to create a model that will learn word embeddings, and then use that model to extract semantic information.

In [43]:
model = Sequential()
model.add(Embedding(32, 32, input_length=vocab_size))
model.add(Conv1D(3, 16))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))  # a single-unit softmax would always output 1.0
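With a single output unit, the activation determines whether the network can express both classes. Softmax normalizes across the units of a layer, so over one unit it returns 1.0 regardless of the input, whereas a sigmoid maps the logit into the interval (0, 1). A small plain-Python check of this, with illustrative function names:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic function, mapping a real logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for logit in (-3.0, 0.0, 3.0):
    # The softmax column is constant; the sigmoid column varies with the logit.
    print(logit, softmax([logit])[0], round(sigmoid(logit), 3))
```

For binary classification with `binary_crossentropy`, a sigmoid on one unit is the usual pairing; a softmax would need two output units and one-hot labels.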

In [44]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [45]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 50, 32)            1024      
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 35, 3)             1539      
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 17, 3)             0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 51)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 256)               13312     
_________________________________________________________________
dense_8 (Dense)              (None, 256)               65792     
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 257       
Total para
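The shapes and parameter counts in the summary can be reproduced by hand, which is a useful check on how the layer arguments are interpreted. Note that `Conv1D(3, 16)` means 3 filters with a kernel size of 16. The sketch below assumes the Keras defaults of `'valid'` padding and stride 1 for the convolution, and a pooling stride equal to `pool_size`; the helper name is hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1


seq_len, embed_dim = 50, 32        # input_length and embedding output size
filters, kernel_size = 3, 16       # Conv1D(3, 16)

conv_len = conv1d_out_len(seq_len, kernel_size)   # 50 - 16 + 1
pool_len = conv_len // 2                          # MaxPooling1D(pool_size=2)
flat = pool_len * filters                         # Flatten

params = {
    "embedding": 32 * embed_dim,                             # input_dim * output_dim
    "conv1d": kernel_size * embed_dim * filters + filters,   # weights + biases
    "dense_1": flat * 256 + 256,
    "dense_2": 256 * 256 + 256,
    "dense_out": 256 * 1 + 1,
}
print(conv_len, pool_len, flat)
print(params)
```

These values agree with the output shapes `(None, 35, 3)`, `(None, 17, 3)`, and `(None, 51)` and with the per-layer parameter counts listed in the summary.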

In [46]:
model.fit(x_train, y_train, epochs=20, batch_size=32)

Epoch 1/20

KeyboardInterrupt: 