<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# TROLL HUNTING: DETECTING STATE-BACKED DISINFORMATION CAMPAIGNS ON TWITTER

# NOTEOOK 1.2: HARVESTING FAKE TWEETS BY RUSSIAN STATE OPERATORS
Here, I'll harvest several sets of tweets from the main corpus of Russian state-backed tweets, as identified and released by Twitter in October 2018.

The CSV file with tweets by Russia's infamous Internet Research Agency is about 5.4Gb, and I've not included it in this repo due to the file size. You can download it [here](https://about.twitter.com/en_us/values/elections-integrity.html#data)

If you are re-running this file, do note that it would take a long time to run as I'm filtering out the non-English tweets, as well as users who described their location in Russian or other languages.

While the use of foreign languages is defintely one of the traits of these state-backed twitter accounts, including these tweets in the dataset would hurt the model's accuracy as it would simply classify any non-English tweets as likely state-tweets.

I'm also filtering out retweets by these state-backed accounts, as they don't reflect actual writing by the operators handling the accounts. Clearly, RTs are a feature of state-bot account behaviour. 

But like the language issue, including RTs would likely skew the model's predictions as the retweets are in fairly high number. RTs are also likely to muddle the countvectorizer's results.

A more complex project investigating this subject would find a way to address the language and RT issues. 

#### But in the context of my project, I'll limit the analysis and predictions to English-language tweets that are non-retweets.

In [2]:
import pandas as pd

from langdetect import detect
from textblob import TextBlob

In [3]:
# REPEAT: This CSV file is not included in this repo due to the huge file size. Download using link above
fake = pd.read_csv('../data/ira_tweets.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [4]:
# The large sample size used here is due to the large number of RTs and non-English tweets in the original dataset
# Reducing the sample size here would likely get you less than the required 50K tweets.
fake_tweets = fake.sample(n=850000, random_state=42)

In [5]:
# Dropping these columns for consistency with the real tweets dataset
fake_tweets = fake_tweets.drop(
    columns=[
        "userid",
        "user_display_name",
        "user_profile_url",
        "tweet_client_name",
        "in_reply_to_tweetid",
        "in_reply_to_userid",
        "quoted_tweet_tweetid",
        "is_retweet",
        "retweet_userid",
        "retweet_tweetid",
        "latitude",
        "longitude",
        "quote_count",
        "reply_count",
        "like_count",
        "retweet_count",
        "urls",
        "user_mentions",
        "poll_choices",
        "hashtags",
        "account_language",
        "tweet_language"
    ]
)

In [8]:
# I'm writing two functions to filter out non-English tweets. 
# In earlier tests, one filter alone was not enough to catch all the Russian-language tweets
def detect_language_langdetect(text):
    try:
        return detect(text)
    except:
        return 'unk'

In [9]:
def detect_language_textblob(text):
    try:
        return TextBlob(text).detect_language
    except:
        return 'unk'

In [10]:
fake_tweets['lang_textblob'] = fake_tweets['tweet_text'].apply(detect_language_textblob)
fake_tweets['lang_textblob_loc'] = fake_tweets['user_reported_location'].apply(detect_language_textblob)

In [11]:
fake_tweets['langdetect'] = fake_tweets['tweet_text'].apply(detect_language_langdetect)
fake_tweets['langdetect_loc'] = fake_tweets['user_reported_location'].apply(detect_language_langdetect)

In [13]:
fake_tweets = fake_tweets[(fake_tweets['langdetect'] == 'en') & (fake_tweets['langdetect_loc'] == 'en')].copy()

In [14]:
fake_tweets.head()

Unnamed: 0,tweetid,user_screen_name,user_reported_location,user_profile_description,follower_count,following_count,account_creation_date,tweet_text,tweet_time,lang_textblob,lang_textblob_loc,langdetect,langdetect_loc
1415203,631482168695353345,NewOrleansON,"New Orleans, LA","Breaking news, weather, traffic and more for N...",35988,11010,2014-05-05,Former LSU RB Alfred Blue atop Houston Texans'...,2015-08-12 15:07,<bound method BaseBlob.detect_language of Text...,<bound method BaseBlob.detect_language of Text...,en,en
911493,825950748055728129,a95a911dd6ae864c48ed062cdbe75e5c28dbe0cf57c6db...,United States,No more #HappyHolidays shit!!! It's #MerryChri...,2748,265,2016-06-15,RT @GrrrGraphics: #NEWGameinTown #PresidentTru...,2017-01-30 06:16,<bound method BaseBlob.detect_language of Text...,<bound method BaseBlob.detect_language of Text...,en,en
3288399,855607672103514114,a95a911dd6ae864c48ed062cdbe75e5c28dbe0cf57c6db...,United States,No more #HappyHolidays shit!!! It's #MerryChri...,2748,265,2016-06-15,RT @charliekirk11: The power of a movement! #...,2017-04-22 02:22,<bound method BaseBlob.detect_language of Text...,<bound method BaseBlob.detect_language of Text...,en,en
1265519,639019849369321472,005b6c0f7e3371b1cacced2890fead3d5543694ab21372...,"New York, NY",,112,153,2014-08-05,harapova pulls out of U.S. GOPsen,2015-09-02 10:19,<bound method BaseBlob.detect_language of Text...,<bound method BaseBlob.detect_language of Text...,en,en
345699,870918783614935040,a95a911dd6ae864c48ed062cdbe75e5c28dbe0cf57c6db...,United States,No more #HappyHolidays shit!!! It's #MerryChri...,2748,265,2016-06-15,RT @KamVTV: You see? Liberal limousine celebri...,2017-06-03 08:23,<bound method BaseBlob.detect_language of Text...,<bound method BaseBlob.detect_language of Text...,en,en


In [15]:
fake_tweets['langdetect'].value_counts()

en    93164
Name: langdetect, dtype: int64

In [16]:
fake_tweets = fake_tweets[~fake_tweets['tweet_text'].str.startswith("RT @")].copy()

In [17]:
# Almost one third of the English-language fake tweets by the IRA in our dataset are RTs
# Not removing them would skew our analysis. Likewise, the language detectors also filtered out
# some 89% of the tweets in the original sample which contained some Russian language 
fake_tweets.shape

(64038, 13)

In [18]:
fake_50K = fake_tweets[:50000]

In [19]:
fake_50K = fake_50K.drop(columns=['langdetect', 'langdetect_loc', 'lang_textblob', 'lang_textblob_loc'])

In [20]:
fake_50K = fake_50K[
    [
        "tweetid",
        "user_screen_name",
        "user_reported_location",
        "user_profile_description",
        "follower_count",
        "following_count",
        "account_creation_date",
        "tweet_time",
        "tweet_text",
    ]
]

In [21]:
# Outputting a training set of 50K fake tweets
# NOTE: This CSV file is included in the repo

#fake_sample = fake_50K.to_csv('../data/bot_50k.csv', index=False)

In [25]:
# Outputting an unseen test set of 1K fake tweets
# NOTE: This CSV file is also included in the repo

#fake_unseen_sample = fake_tweets[:-14046].drop(columns=['langdetect', 'langdetect_loc', 'lang_textblob', 'lang_textblob_loc'])
#test_fake = fake_unseen_sample.sample(n=1000, random_state=42)
#test_fake = test_fake.to_csv('../data/fake_test.csv', index=False)