# UNCOVERING STATE-BACKED TROLLS ON TWITTER

# NOTEBOOK 1.4:
In this notebook, I'll prepare several unseen test sets that will be used to check whether the model(s) can be generalised, i.e., whether they can detect state-backed tweets from different countries. I'll create the following test sets for this purpose:

1. Unseen test set 1: Russian/IRA state-backed tweets + American real users 
2. Unseen test set 2: Iranian state-backed tweets + American real users
3. Unseen test set 3: Venezuelan state-backed tweets + American real users

You can download the state-backed datasets [here](https://about.twitter.com/en_us/values/elections-integrity.html#data). I'm not including the raw datasets in this repo as they run to several GB.

In [29]:
from langdetect import detect
from textblob import TextBlob

import pandas as pd
import re

## 1. CREATING RUSSIAN STATE-BACKED/AMERICAN REAL USERS UNSEEN TEST SET
This is part of the Russian/Internet Research Agency corpus of tweets released by Twitter in Oct 2018. It comprises about 9 million tweets from 3613 accounts.

In [4]:
# Calling up the full unseen test sets created in the earlier notebooks
fake_test = pd.read_csv('../data/fake_test.csv')
real_test = pd.read_csv('../data/real_test.csv')

In [5]:
fake_test.shape, real_test.shape

((1000, 9), (1000, 9))

In [11]:
# This is the same function I used in notebook 1.3 to clean up the tweet text
def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)  # drop URLs
    text = re.sub(r"\W", " ", text)  # replace non-word characters with spaces
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    text = text.strip(" ")
    text = text.strip("\n")
    text = re.sub(r"[^\w\s]", "", text)
    return text
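
To make the effect of each cleaning step concrete, here is a small sketch that runs the same steps on a made-up tweet. The sample text is an assumption for illustration and is not taken from any of the datasets.

```python
import re


def clean_tweet(text):
    """Same sequence of steps as the notebook's clean_tweet."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)  # drop URLs
    text = re.sub(r"\W", " ", text)      # '#', '@' and punctuation become spaces
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    text = text.strip(" ")
    text = text.strip("\n")
    text = re.sub(r"[^\w\s]", "", text)
    return text


# Hypothetical sample tweet, for illustration only
sample = "Check THIS out: https://t.co/abc #MAGA!!\n"
cleaned = clean_tweet(sample)
print(cleaned)
```

Hashtags and mentions lose their `#` and `@` but keep their text, and underscores survive because `\w` matches them.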

In [12]:
real_test['clean_tweet_text'] = real_test['tweet_text'].map(lambda tweet: clean_tweet(tweet))

In [14]:
fake_test['clean_tweet_text'] = fake_test['tweet_text'].map(lambda tweet: clean_tweet(tweet))

In [17]:
# Creating a new column to label the real vs. state-backed tweets
fake_test['bot_or_not'] = 1
real_test['bot_or_not'] = 0

In [22]:
russian_unseen = pd.concat((fake_test, real_test), axis=0, sort=True)
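
Before saving a combined set, it can help to confirm that the labels are balanced and that both frames lined up by column name. The following is a minimal sketch on toy frames standing in for `fake_test` and `real_test`; the toy data and the `ignore_index=True` argument are assumptions and are not used in the notebook.

```python
import pandas as pd

# Toy stand-ins for fake_test and real_test (the real frames have 1,000 rows each).
# The column order differs on purpose to show the effect of sort=True.
fake = pd.DataFrame({"tweet_text": ["A!", "B!"], "clean_tweet_text": ["a", "b"]})
real = pd.DataFrame({"clean_tweet_text": ["c", "d"], "tweet_text": ["C?", "D?"]})
fake["bot_or_not"] = 1
real["bot_or_not"] = 0

# sort=True sorts the column labels so both frames align by name.
# ignore_index=True (an addition for this sketch) gives a clean 0..n-1 index.
combined = pd.concat((fake, real), axis=0, sort=True, ignore_index=True)

label_counts = combined["bot_or_not"].value_counts().to_dict()
print(combined.columns.tolist())
print(label_counts)
```

Without `ignore_index=True` the combined frame keeps each source frame's original index labels, which does no harm here because the CSV is written with `index=False`.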

In [24]:
# NOTE: This CSV file is included in the repo
#russian_set = russian_unseen.to_csv('../data/russian_unseen.csv', index=False)

### NOTE:
I won't do any further feature engineering on this particular set, as I only need the "bot_or_not" and "clean_tweet_text" columns for testing the model. I've also written the unseen Russian test set out as a separate CSV file that I'll use when testing the model in the 3.0 series of notebooks.
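
Since only two columns are needed downstream, the saved CSV can be read back with just those columns. This is a hedged sketch using a toy frame and a temporary file; the file name `unseen_demo.csv` and the variable names `X` and `y` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the combined unseen test set
unseen = pd.DataFrame(
    {
        "tweetid": [1, 2],
        "tweet_text": ["Hello!", "World?"],
        "clean_tweet_text": ["hello", "world"],
        "bot_or_not": [1, 0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "unseen_demo.csv")  # hypothetical file name
    unseen.to_csv(path, index=False)
    # usecols loads only the two columns needed for testing the model
    model_input = pd.read_csv(path, usecols=["clean_tweet_text", "bot_or_not"])

X = model_input["clean_tweet_text"]  # text passed to the model
y = model_input["bot_or_not"]        # 1 = state-backed, 0 = real user
print(model_input)
```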

## 2. CREATING IRANIAN STATE-BACKED/AMERICAN REAL USERS UNSEEN TEST SET
This is part of the Iranian corpus of tweets released by Twitter in October 2018, comprising 1.1 million tweets from 770 accounts. I won't do extensive work on this dataset beyond getting it into the same shape as the other unseen test sets and, most importantly, filtering out the non-English tweets and retweets.

## 2.1 PRE-PROCESSING THE IRANIAN STATE-BACKED TWEETS

In [26]:
# NOTE: This CSV file is not included in the repo due to its file size. Download it with the link above
iran = pd.read_csv('../data/iranian_tweets.csv')

In [28]:
iran = iran.drop(
    columns=[
        "userid",
        "user_display_name",
        "user_profile_url",
        "tweet_client_name",
        "in_reply_to_tweetid",
        "in_reply_to_userid",
        "quoted_tweet_tweetid",
        "is_retweet",
        "retweet_userid",
        "retweet_tweetid",
        "latitude",
        "longitude",
        "quote_count",
        "reply_count",
        "like_count",
        "retweet_count",
        "urls",
        "user_mentions",
        "poll_choices",
        "hashtags",
        "account_language",
        "tweet_language"
    ]
)

In [42]:
# The goal here is to get 1,000 English tweets from this set, but we'll have to sample a larger
# number because many of the tweets are likely to be non-English
iran_tweets = iran.sample(n=50000, random_state=42)

In [43]:
def detect_language_langdetect(text):
    try:
        return detect(text)
    except Exception:
        # detect() fails on missing values and on text with no usable features
        return 'unk'
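
The fallback to 'unk' matters because missing values (which `pd.read_csv` loads as NaN floats) and empty strings make a language detector raise instead of returning a code. The sketch below shows the same pattern with the detector passed in as a parameter. `safe_detect` and `stub_detector` are hypothetical names, and the stub only mimics the failure behaviour of a real detector as an assumption.

```python
def safe_detect(text, detector):
    """Return detector(text), or 'unk' if the detector raises for any reason."""
    try:
        return detector(text)
    except Exception:
        return "unk"


def stub_detector(text):
    """Hypothetical stand-in for langdetect.detect, for illustration only."""
    if not isinstance(text, str):
        raise TypeError("expected a string")
    if not text.strip():
        raise ValueError("no features in text")
    # Crude rule for the demo: pure-ASCII text counts as English
    return "en" if text.isascii() else "other"


print(safe_detect("hello world", stub_detector))
print(safe_detect(float("nan"), stub_detector))  # missing value
print(safe_detect("", stub_detector))            # empty string
```

In the notebook the detector would be `langdetect.detect`. The langdetect README notes that its algorithm is non-deterministic and suggests setting `DetectorFactory.seed = 0` before detecting if reproducible results are needed.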

In [44]:
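def detect_language_textblob(text):
    try:
        # detect_language is a method and must be called; it is deprecated in newer TextBlob releases
        return TextBlob(text).detect_language()
    except Exception:
        return 'unk'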

In [45]:
iran_tweets['lang_textblob'] = iran_tweets['tweet_text'].apply(detect_language_textblob)
iran_tweets['lang_textblob_loc'] = iran_tweets['user_reported_location'].apply(detect_language_textblob)

In [46]:
iran_tweets['langdetect'] = iran_tweets['tweet_text'].apply(detect_language_langdetect)
iran_tweets['langdetect_loc'] = iran_tweets['user_reported_location'].apply(detect_language_langdetect)

In [47]:
iran_tweets = iran_tweets[(iran_tweets['langdetect'] == 'en') & (iran_tweets['langdetect_loc'] == 'en')].copy()

In [48]:
iran_tweets['langdetect'].value_counts()

en    3134
Name: langdetect, dtype: int64

In [49]:
iran_tweets = iran_tweets[~iran_tweets['tweet_text'].str.startswith("RT @")].copy()
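
The retweet filter above relies on `tweet_text` having no missing values, which should hold at this point because rows with missing text are labelled 'unk' by the language step and filtered out. As a more defensive sketch on toy data (an assumption, not the notebook's code), `na=False` makes the mask explicit about missing values, since depending on the pandas version they can otherwise propagate into the mask.

```python
import pandas as pd

# Toy tweets, for illustration only
tweets = pd.DataFrame(
    {
        "tweet_text": [
            "RT @someone: hello",
            "An original tweet",
            None,
            "rt is not a retweet marker",
        ]
    }
)

# na=False treats missing text as "not a retweet" so the mask is purely boolean
is_retweet = tweets["tweet_text"].str.startswith("RT @", na=False)
originals = tweets[~is_retweet].copy()

# Missing text is removed separately, mirroring the later dropna() step
originals = originals.dropna(subset=["tweet_text"])
print(originals["tweet_text"].tolist())
```

The match is case-sensitive and anchored at the start of the text, so a tweet that merely mentions "rt" is kept.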

In [50]:
iran_tweets = iran_tweets.drop(columns=['langdetect', 'langdetect_loc', 'lang_textblob', 'lang_textblob_loc'])

In [51]:
iran_tweets = iran_tweets[
    [
        "tweetid",
        "user_screen_name",
        "user_reported_location",
        "user_profile_description",
        "follower_count",
        "following_count",
        "account_creation_date",
        "tweet_time",
        "tweet_text",
    ]
]

In [52]:
iran_tweets = iran_tweets.dropna()

In [60]:
iran_test = iran_tweets[:1000].copy()

## 2.2 COMBINING THE IRANIAN STATE-BACKED/AMERICAN REAL TWEETS

In [55]:
# This is the second set of real tweets set aside for the unseen test sets
real_test2 = pd.read_csv('../data/real_test2.csv')

In [56]:
iran_test.shape, real_test2.shape

((1000, 9), (1000, 9))

In [None]:
# This is the same function I used in notebook 1.3 to clean up the tweet text
def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)  # drop URLs
    text = re.sub(r"\W", " ", text)  # replace non-word characters with spaces
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    text = text.strip(" ")
    text = text.strip("\n")
    text = re.sub(r"[^\w\s]", "", text)
    return text

In [57]:
real_test2['clean_tweet_text'] = real_test2['tweet_text'].map(lambda tweet: clean_tweet(tweet))

In [61]:
iran_test['clean_tweet_text'] = iran_test['tweet_text'].map(lambda tweet: clean_tweet(tweet))

In [62]:
# Creating a new column to label the real vs. state-backed tweets
iran_test['bot_or_not'] = 1
real_test2['bot_or_not'] = 0

In [64]:
iranian_unseen = pd.concat((iran_test, real_test2), axis=0, sort=True)

In [65]:
# Outputting the csv file for use in notebook 3.1
# NOTE: This CSV is included in the repo

#iranian_set = iranian_unseen.to_csv('../data/iranian_unseen.csv', index=False)

## 3. CREATING VENEZUELAN STATE-BACKED/AMERICAN REAL USERS UNSEEN TEST SET
This is part of the Venezuela (Set 2) corpus of tweets released by Twitter in January 2019, comprising about 1 million tweets from 764 accounts. I won't do extensive work on this dataset beyond getting it into the same shape as the other unseen test sets and, most importantly, filtering out the non-English tweets and retweets.

## 3.1 PRE-PROCESSING THE VENEZUELAN STATE-BACKED TWEETS

In [84]:
# NOTE: This CSV file is not included in the repo due to its file size. Download it with the link above
vz = pd.read_csv('../data/venezuelian_tweets.csv')

In [87]:
vz = vz.drop(
    columns=[
        "userid",
        "user_display_name",
        "user_profile_url",
        "tweet_client_name",
        "in_reply_to_tweetid",
        "in_reply_to_userid",
        "quoted_tweet_tweetid",
        "is_retweet",
        "retweet_userid",
        "retweet_tweetid",
        "latitude",
        "longitude",
        "quote_count",
        "reply_count",
        "like_count",
        "retweet_count",
        "urls",
        "user_mentions",
        "poll_choices",
        "hashtags",
        "account_language",
        #"tweet_language"
    ]
)

In [97]:
# The goal here is to get 1,000 English tweets from this set, but we'll have to sample a larger
# number because many of the tweets are likely to be non-English
vz_tweets = vz.sample(n=250000, random_state=42)

In [98]:
def detect_language_langdetect(text):
    try:
        return detect(text)
    except Exception:
        # detect() fails on missing values and on text with no usable features
        return 'unk'

In [99]:
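def detect_language_textblob(text):
    try:
        # detect_language is a method and must be called; it is deprecated in newer TextBlob releases
        return TextBlob(text).detect_language()
    except Exception:
        return 'unk'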

In [100]:
vz_tweets['lang_textblob'] = vz_tweets['tweet_text'].apply(detect_language_textblob)
vz_tweets['lang_textblob_loc'] = vz_tweets['user_reported_location'].apply(detect_language_textblob)

In [101]:
vz_tweets['langdetect'] = vz_tweets['tweet_text'].apply(detect_language_langdetect)
vz_tweets['langdetect_loc'] = vz_tweets['user_reported_location'].apply(detect_language_langdetect) 

In [102]:
vz_tweets = vz_tweets[(vz_tweets['langdetect'] == 'en') & (vz_tweets['langdetect_loc'] == 'en')].copy()

In [103]:
vz_tweets['langdetect'].value_counts()

en    1521
Name: langdetect, dtype: int64
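
The two `value_counts()` outputs explain why the Venezuelan sample had to be so much larger than the Iranian one. As a rough worked calculation using only the figures shown in this notebook (50,000 sampled rows yielding 3,134 English rows, and 250,000 yielding 1,521), the sketch below estimates the pass rate of the English filter and the minimum sample size needed to reach 1,000 rows.

```python
import math

target = 1000  # English tweets wanted per test set

# (rows sampled, rows left after the English filter), taken from the cell outputs
runs = {"iran": (50_000, 3_134), "venezuela": (250_000, 1_521)}

needed_rows = {}
for name, (sampled, english) in runs.items():
    rate = english / sampled
    # Smallest sample size expected to yield `target` rows at this pass rate
    needed_rows[name] = math.ceil(target * sampled / english)
    print(f"{name}: {rate:.2%} pass the filter, need at least {needed_rows[name]:,} rows")
```

These are lower bounds: the retweet filter and `dropna()` that follow remove more rows, so the sample has to be comfortably larger than the estimate.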

In [104]:
vz_tweets = vz_tweets[~vz_tweets['tweet_text'].str.startswith("RT @")].copy()

In [106]:
vz_tweets = vz_tweets.drop(columns=['langdetect', 'langdetect_loc', 'lang_textblob', 'lang_textblob_loc'])

In [107]:
vz_tweets = vz_tweets[
    [
        "tweetid",
        "user_screen_name",
        "user_reported_location",
        "user_profile_description",
        "follower_count",
        "following_count",
        "account_creation_date",
        "tweet_time",
        "tweet_text",
    ]
]

In [108]:
vz_tweets = vz_tweets.dropna()

In [109]:
vz_test = vz_tweets[:1000].copy()
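
The Iranian and Venezuelan sets go through the same sequence of steps: sample, detect language, keep English rows, drop retweets, select columns, drop missing values, and take the first 1,000 rows. One possible way to express that once is sketched below. The function name `build_state_backed_test`, its parameters, and the stub detector are hypothetical; in the notebook the detector would be `langdetect.detect`.

```python
import pandas as pd

KEEP_COLS = [
    "tweetid", "user_screen_name", "user_reported_location",
    "user_profile_description", "follower_count", "following_count",
    "account_creation_date", "tweet_time", "tweet_text",
]


def build_state_backed_test(df, detector, n_sample, n_keep=1000, seed=42):
    """Sample, keep English non-retweets, and return up to n_keep complete rows."""
    sample = df.sample(n=min(n_sample, len(df)), random_state=seed)

    def safe(text):
        try:
            return detector(text)
        except Exception:
            return "unk"

    # Both the tweet text and the reported location must be detected as English
    is_english = (sample["tweet_text"].apply(safe) == "en") & (
        sample["user_reported_location"].apply(safe) == "en"
    )
    sample = sample[is_english]
    sample = sample[~sample["tweet_text"].str.startswith("RT @", na=False)]
    return sample[KEEP_COLS].dropna().head(n_keep).copy()


def stub_detector(text):
    """Hypothetical stand-in for langdetect.detect: ASCII-only text is 'en'."""
    if not isinstance(text, str):
        raise TypeError("expected a string")
    return "en" if text.isascii() else "other"


# Toy data, for illustration only
toy = pd.DataFrame(
    {
        "tweetid": [1, 2, 3, 4],
        "user_screen_name": ["a", "b", "c", "d"],
        "user_reported_location": ["USA", "USA", None, "USA"],
        "user_profile_description": ["w", "x", "y", "z"],
        "follower_count": [10, 20, 30, 40],
        "following_count": [1, 2, 3, 4],
        "account_creation_date": ["2015-01-01"] * 4,
        "tweet_time": ["2018-01-01 00:00"] * 4,
        "tweet_text": [
            "An original tweet",
            "RT @x: a retweet",
            "No location here",
            "Otro tuit en español",
        ],
    }
)

result = build_state_backed_test(toy, stub_detector, n_sample=4)
print(result["tweetid"].tolist())
```

Passing the detector in as an argument keeps the sketch runnable without langdetect installed and makes the filter easy to test on small frames.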

## 3.2 COMBINING THE VENEZUELAN STATE-BACKED/AMERICAN REAL TWEETS

In [110]:
# Recalling the third test-set of real tweets created earlier for this purpose 
real_test3 = pd.read_csv('../data/real_test3.csv')

In [111]:
vz_test.shape, real_test3.shape

((1000, 9), (1000, 9))

In [112]:
real_test3['clean_tweet_text'] = real_test3['tweet_text'].map(lambda tweet: clean_tweet(tweet))

In [113]:
vz_test['clean_tweet_text'] = vz_test['tweet_text'].map(lambda tweet: clean_tweet(tweet))

In [114]:
# Creating a new column to label the real vs. state-backed tweets
vz_test['bot_or_not'] = 1
real_test3['bot_or_not'] = 0

In [120]:
vz_unseen = pd.concat((vz_test, real_test3), axis=0, sort=True)

In [121]:
# Outputting the csv file for use in notebook 3.1
# NOTE: This CSV is included in the repo
#vz_set = vz_unseen.to_csv('../data/vz_unseen.csv', index=False)