# Twitter Trolls
### Classifying and analyzing Russian Troll Tweets using Deep Learning
#### by Christopher DeCarolis

In [24]:
import string
import pickle
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from gensim.models import KeyedVectors
import matplotlib.pyplot as plt
import keras
import re
import nltk

from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from nltk.tokenize.casual import casual_tokenize
from nltk.corpus import stopwords

from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout
from keras.models import model_from_json

from sklearn.utils import shuffle
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

## Data Pre-processing
We have two different data sets to work with. The first is a set of Tweets scraped from Russian Troll accounts (link __[here](https://www.kaggle.com/vikasg/russian-troll-tweets)__), and the second is a set of tweets scraped from random accounts during election day (link __[here](https://www.kaggle.com/kinguistics/election-day-tweets)__). 

In [2]:
df_russian = pd.read_csv('~/.kaggle/datasets/vikasg/russian-troll-tweets/tweets.csv')

In [3]:
print(df_russian[1:10])

        user_id         user_key    created_at          created_str  \
1  2.571870e+09  detroitdailynew  1.476133e+12  2016-10-10 20:57:00   
2  1.710805e+09       cookncooks  1.487767e+12  2017-02-22 12:43:43   
3  2.584153e+09     queenofthewo  1.482765e+12  2016-12-26 15:06:41   
4  1.768260e+09     mrclydepratt  1.501987e+12  2017-08-06 02:36:24   
5  2.882014e+09      giselleevns  1.477496e+12  2016-10-26 15:33:58   
6  1.658421e+09        baobaeham  1.488910e+12  2017-03-07 18:11:44   
7  2.587101e+09   judelambertusa  1.483102e+12  2016-12-30 12:49:30   
8  1.679279e+09    ameliebaldwin  1.477792e+12  2016-10-30 01:48:19   
9  1.649488e+09        hiimkhloe  1.458155e+12  2016-03-16 19:07:39   

   retweet_count retweeted  favorite_count  \
1            0.0     False             0.0   
2            NaN       NaN             NaN   
3            NaN       NaN             NaN   
4            NaN       NaN             NaN   
5            NaN       NaN             NaN   
6            

In [4]:
df_election = pd.read_csv('~/.kaggle/datasets/kinguistics/election-day-tweets/election_day_tweets.csv')

In [5]:
print(df_election[1:10])

                                                text           created_at  \
1  My @latimesopinion op-ed on historic #Californ...  2016-11-08 04:08:10   
2  #Senate Wisconsin Senate Preview: Johnson vs. ...  2016-11-08 04:11:35   
3  If Rubio Wins and #Trump Loses in #Florida... ...  2016-11-08 04:12:16   
4  #Senate Wisconsin Senate Preview: Johnson vs. ...  2016-11-08 04:16:20   
5  bob day is an "honest  person "  #senate patte...  2016-11-08 04:18:55   
6  Make Republicans #PayAPrice!\n 💙🇺🇸#VoteBLUE🔃th...  2016-11-08 04:20:09   
7  She's done America!! Please vote for @realDona...  2016-11-08 04:20:43   
8  #Illinois #Senate #StrongerTogether https://t....  2016-11-08 04:26:36   
9  #Senate Sen. Mark Warner to speak at ODU for V...  2016-11-08 04:41:04   

   geo lang place coordinates  user.favourites_count  user.statuses_count  \
1  NaN   en   NaN         NaN                      8                 4841   
2  NaN   en   NaN         NaN                    728               160390  

Checking the first few rows of each dataset, we can see that they were properly imported in terms of format. Something important to note about the election dataset is that we technically have know way of knowing whether the data collected comes from legitimate accounts or note. It is possible (likely even) that some of the users in this dataset were actually troll accounts. However, as we have know way of definitively telling that, we will proceed under the assumption that the election day tweets dataset represents tweets from real individuals.
Now we need to split the datasets into training and testing sets.

In [6]:
df_russian = shuffle(df_russian)
df_election = shuffle(df_election)

In [7]:
df_russian.loc[20, 'text']

"Obama on Trump winning: 'Anything's possible' https://t.co/MjVMZ5TR8Y #politics"

In [8]:
train_russian_df, test_russian_df = train_test_split(df_russian.loc[:, 'text'], test_size=0.2)
train_election_df, test_election_df = train_test_split(df_election.loc[:, 'text'], test_size=0.2)

__Text Cleaning__: Below we remove all hashtags and unnecessary filler from the text that would otherwise hinder classification. We also perform a train-test split on both datasets.

In [9]:
# dictionary of English language contractions
contractions = { 
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "I had / I would",
"I'd've": "I would have",
"I'll": "I shall / I will",
"I'll've": "I shall have / I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

In [10]:
def clean_text(text):
    # make sure that text is in string format
    text = str(text)
    new_text = []
    tokenized_text = casual_tokenize(text)
    for token in tokenized_text:
        # get rid of contractions
        if token in contractions:
            new_text.extend(token.split())
        # if string is not empty then add it
        elif not token == u'' and len(token) >= 3:
            new_text.append(token)
    text_wo_stops = []
    for token in new_text:
        if not token in stopwords.words('english'):
            text_wo_stops.append(token.lower())
    return ' '.join(text_wo_stops)
    
clean_text(df_russian.loc[20, 'text']) 

"obama trump winning anything's possible https://t.co/mjvmz5tr8y #politics"

In [11]:
X_russian = [clean_text(x) for x in train_russian_df]

In [12]:
X_elect = [clean_text(x) for x in train_election_df]

In [13]:
X_train = X_russian + X_elect
with open("X_train.txt", "wb") as fp:
    pickle.dump(X_train, fp)

In [14]:
X_russian = [clean_text(x) for x in test_russian_df]
X_elect = [clean_text(x) for x in test_election_df]
X_test = X_russian + X_elect
with open("X_test.txt", "wb") as fp:
    pickle.dump(X_test, fp)

In [15]:
# represent a troll tweet with a 0, and a non troll tweet with a 1
labels_train = ([0] * len(train_russian_df)) + ([1] * len(train_election_df))
labels_test = ([0] * len(test_russian_df)) + ([1] * len(test_election_df))
with open("labels_train.txt", "wb") as fp:
    pickle.dump(labels_train, fp)
with open("labels_test.txt", "wb") as fp:
    pickle.dump(labels_test, fp)

In [16]:
with open("X_train.txt", "rb") as fp:
    X_train = pickle.load(fp)
with open("X_test.txt", "rb") as fp:
    X_test = pickle.load(fp)
with open("labels_train.txt", "rb") as fp:
    labels_train = pickle.load(fp)
with open("labels_test.txt", "rb") as fp:
    labels_test = pickle.load(fp)

## Tokenization

Now, we move to the task of assigning indices to words, and filtering out infrequent words. For ease of use, we will include a parameter specifying the maximum number of words.

In [17]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train + X_test)

From these index encodings, we now obtain sequences from each tweet.

In [18]:
train_seq = tokenizer.texts_to_sequences(X_train)
test_seq = tokenizer.texts_to_sequences(X_test)

In [19]:
word_index = tokenizer.word_index

Next, we pad all of the tweets so that they have a length of 50 words. This will make it far easier to train embeddings.

In [20]:
MAX_SEQ_LENGTH = 50
train_data = pad_sequences(train_seq, maxlen=MAX_SEQ_LENGTH)
test_data = pad_sequences(test_seq, maxlen=MAX_SEQ_LENGTH)
labels_train = np.array(labels_train)
labels_test = np.array(labels_test)

In [21]:
print('shape of data:', train_data.shape)
print('shape of labels:', labels_train.shape)

shape of data: (480888, 50)
shape of labels: (480888,)


## Model Creation

In order to extract semantic information from our tweets, we will use a word embedding matrix. However, word embeddings take an extremely long time to train. Therefore, we will use pretrained embeddings from GloVe, which are specifically trained on a set of 2 billion tweets (link [here](https://nlp.stanford.edu/projects/glove/)).

Before gathering the embeddings, they need to be transferred into word2vec format, as they are currently in GloVe format.

In [45]:
EMBEDDING_FILE = '~/word2vec_twitter_model/word2vec_twitter_model.bin'
RESULT_FILE = '~/word2vec_twitter_model/word2vec_twitter_model_formatted.txt'

from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(EMBEDDING_FILE, RESULT_FILE)

(18796148, 1)

In [46]:
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, 
                                             binary=True,
                                            unicode_errors='ignore')

In [47]:
# tells us our embeddings are of size 400
print(word2vec.word_vec('obama').shape)

(400,)


In [48]:
max_words = 20000
num_words = min(len(word_index), max_words)
EMBEDDING_DIM = 400

t = 0
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, index in word_index.items():
    t += 1
    if word in word2vec.vocab and t < max_words:
        embedding_matrix[index - 1] = word2vec.word_vec(word)

In [49]:
print(embedding_matrix)

[[-0.34817043  0.04459979  0.10225909 ... -0.61556971 -0.1482826
   0.03635837]
 [-0.31657833  0.4608396   0.10638465 ...  0.05027082  0.02656249
   0.15543474]
 [-0.21830973 -0.07237566 -0.24116141 ... -0.29306951 -0.23784301
   0.0479187 ]
 ...
 [-0.26647705 -0.42137793  0.12377683 ... -0.1401113   0.31293917
  -0.16521734]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]]


## Data Formatting

Next, we need to format our data into training and validation sets.

In [50]:
VALIDATION_SPLIT = 0.2
perm = np.random.permutation(len(train_data))
idx_train = perm[:int(len(train_data)*(1-VALIDATION_SPLIT))]
idx_val = perm[int(len(train_data)*(1 - VALIDATION_SPLIT)):]
data_train = train_data[idx_train]
data_val = train_data[idx_val]
train_labels = labels_train[idx_train]
val_labels = labels_train[idx_val]

In [51]:
print(data_train.shape)
print(train_labels.shape)
print(data_val.shape)
print(val_labels.shape)

(384710, 50)
(384710,)
(96178, 50)
(96178,)


## Building Model

Now that the data is properly formatted, it is time to build the actual model. We will be using Keras' sequential model, as it represents a linear stack of layers. While our model will not be exactly linear due to the recurrent structure of LSTMs, once LSTMs are unfolded, the structure actually will be "linear" in terms of the connections from one layer to the next. Then, the structure of our model is as follows. 

The first layer is an embedding layer which serves to find an embedding for a given tokenized tweet. This embedding is then fed into a 1-dimensional convolutional layer and 1D max pooling layer, which serves to reduce dimensionality (and therefore training time). We will also use a dropout layer here for regularization. After this, the network feeds into an LSTM layer, followed by a few fully connected layers. 

In [29]:
MAX_SEQUENCE_LENGTH = 50

model = Sequential()
embedding_layer = Embedding(20000,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
model.add(embedding_layer)
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.4))
model.add(LSTM(100))
model.add(Dropout(0.4))
model.add(Dense(50,activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 400)           8000000   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 48, 64)            76864     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 24, 64)            0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 24, 64)            0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               66000     
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
___________________________________________________________

In [30]:
model.fit(data_train, train_labels, validation_data=(data_val, val_labels), epochs=10, batch_size=64)

Train on 384710 samples, validate on 96178 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f00b981a908>

Here we save the model and its weights so that we will not have to retrain the data again.

In [None]:
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")

In [26]:
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("model.h5")
print("Loaded model from disk")

Loaded model from disk


## Analyzing Results

We trained the model on the dataset for 10 epochs. From the very first epoch, we found a very high accuracy, which was very good; however, since we continued to train the model for 10 epochs, and there was not much change in accuracy, there might be signs of overfitting. Below we show a variety of tests and analyses that test the results of our data. The first is a classification report, where we can see how our test data performed with the model. Following that, we showed a few of the actual tweets that we classified as "Troll" and a few we classified as "Not Troll" and compared it to the actual classification. Since we have a 95% accuracy, most of these were classified correctly. Below these, we see a few of the misclassifications.

In [27]:
y_pred = loaded_model.predict(test_data)
y_pred = [0 if x[0] < 0.5 else 1 for x in y_pred]

accuracy = accuracy_score(labels_test, y_pred)
print('accuracy: ', round(accuracy, 4))
print(classification_report(labels_test,y_pred))

accuracy:  0.9545
             precision    recall  f1-score   support

          0       0.93      0.94      0.93     40697
          1       0.97      0.96      0.97     79526

avg / total       0.95      0.95      0.95    120223



In [54]:
y_pred_yes = list(filter(lambda x: x[1] == 0, enumerate(y_pred)))
y_pred_no = list(filter(lambda x: x[1] != 0, enumerate(y_pred)))

In [62]:
print("What we classified as 'troll':")
for i, x in shuffle(y_pred_yes)[:10]:
    print('Predicted: {:} -> Actual: {:}'.format('Not Troll' if x else 'Troll',\
                                                 'Not Troll' if labels_test[i] else 'Troll'))
    print('Tweet: {:}\n'.format(X_test[i]))

What we classified as 'troll':
Predicted: Troll -> Actual: Troll
Tweet: @chriseddings5 @strhon2016 @dianeshamlin @charliekirk11 @right2liberty and let's forget bill clinton rapist ...

Predicted: Troll -> Actual: Troll
Tweet: @chadstanton defensively yell mama say i'm handsome https://t.co/k6jpuqdrlt

Predicted: Troll -> Actual: Troll
Tweet: i've told cheater always cheater got pride got

Predicted: Troll -> Actual: Not Troll
Tweet: unfortunately russia's players union left ... @fifpro prefers sponsor agents ... right https://t.co/vo4747cphw

Predicted: Troll -> Actual: Troll
Tweet: @realdonaldtrump i'm work hard never let make america great again https://t.co/cwfeceussq

Predicted: Troll -> Actual: Troll
Tweet: @ramcoban the truth behind oral anal http://t.co/l2imljzn8p

Predicted: Troll -> Actual: Troll
Tweet: funny cnn attack matthew mcconaughey immediately refused campaign trump hypocrit https://t.co/zydefzbgd1

Predicted: Troll -> Actual: Troll
Tweet: @nine_oh #nowplaying @asimsuj

In [63]:
print("What we classified as 'Not Troll':")
for i, x in shuffle(y_pred_no)[:10]:
    print('Predicted: {:} -> Actual: {:}'.format('Not Troll' if x else 'Troll',\
                                                 'Not Troll' if labels_test[i] else 'Troll'))
    print('Tweet: {:}\n'.format(X_test[i]))

What we classified as 'Not Troll':
Predicted: Not Troll -> Actual: Not Troll
Tweet: republican democrat electioneers look voters south river high school @capgaznews #annearundelvotes https://t.co/6xopwgjzzo

Predicted: Not Troll -> Actual: Not Troll
Tweet: congress think detract modiji minority caste orrss hindutva othr weapons bcoz need https://t.co/n7gkhzussd

Predicted: Not Troll -> Actual: Not Troll
Tweet: grandma alive she'd excited vote hillary i'd hate hear vile things guy #election2016

Predicted: Not Troll -> Actual: Not Troll
Tweet: boi turning party https://t.co/dzmhgqoelu

Predicted: Not Troll -> Actual: Not Troll
Tweet: hope represents lot people check results https://t.co/oa7oto2hwe

Predicted: Not Troll -> Actual: Not Troll
Tweet: michigan sends republicans democrats congress https://t.co/nfcfho1zxq

Predicted: Not Troll -> Actual: Not Troll
Tweet: still relevant #election2016 https://t.co/map3g41edj

Predicted: Not Troll -> Actual: Not Troll
Tweet: #nation republicans k

In [64]:
y_pred_mis = list(filter(lambda x: x[1] != labels_test[x[0]], enumerate(y_pred)))
print("What we misclassified:")
for i, x in shuffle(y_pred_mis)[:10]:
    print('Predicted: {:} -> Actual: {:}'.format('Not Troll' if x else 'Troll',\
                                                 'Not Troll' if labels_test[i] else 'Troll'))
    print('Tweet: {:}\n'.format(X_test[i]))

What we misclassified:
Predicted: Troll -> Actual: Not Troll
Tweet: pull florida https://t.co/jyozsz5p1w

Predicted: Troll -> Actual: Not Troll
Tweet: still amazed newspapers allowed think acceptable endorse candidates https://t.co/bwonyukhyz

Predicted: Troll -> Actual: Not Troll
Tweet: soros really angry https://t.co/yvgyuqn67k

Predicted: Not Troll -> Actual: Troll
Tweet: computer programmer testifies oath coded computers rig elections you voice important #riggedsystem https://t.co/x8wjwv8end

Predicted: Not Troll -> Actual: Troll
Tweet: @emenogu_phil https://t.co/kxo8kqu9vv

Predicted: Troll -> Actual: Not Troll
Tweet: الجمهوريون يسيطرون على الكونغرس بالكامل بعد إعصار ترامب https://t.co/cfvwmjpcwj

Predicted: Not Troll -> Actual: Troll
Tweet: mass media president https://t.co/fp9o8h1xut

Predicted: Not Troll -> Actual: Troll
Tweet: cop-hater shot two cops four civilians philadelphia much #blacklivesmatter movement https://t.co/dfqravsfyf

Predicted: Troll -> Actual: Not Troll
Tweet