# STATE TWITTER TROLL DETECTION USING TRANSFORMERS

## REPO STRUCTURE

### 1. DATA FOLDER

* Five CSV files used by the notebooks in this series. The raw troll tweet files from Twitter are not included here.

### 2. NOTEBOOKS FOLDER

* Notebooks 1.0 - 1.2: Data collection, cleaning and preparation. Optional if you just want to experiment with the final dataset.

* Notebooks 2.0 - 2.1: Fine-tuning DistilBERT on a custom dataset, with detailed testing on an unseen validation dataset as well as a fresh dataset of state troll tweets from Iran.

* Notebooks 3.0 - 3.1: Create and test optimised logistic regression and XGB models against the datasets used to assess the fine-tuned DistilBERT model.


### 3. APP FOLDER

* app.py + folders for "static" and "template": a simple app for use on a local machine to demonstrate how a state troll tweet detector can be used in deployment. Unfortunately, free hosting accounts can't accommodate the disk space required for PyTorch and the fine-tuned model, so I've not deployed this online.


### 4. TROLL_DETECT FOLDER

* Fine-tuned DistilBERT model from Colab notebook 2.0. It is too big for GitHub; download it [here](https://www.dropbox.com/sh/90h7ymog2oi5yn7/AACTuxmMTcso6aMxSmSiD8AVa) from Dropbox instead.

### 5. PKL FOLDER

* Pickled logistic regression model from notebook3.0

# PART 1B: REAL TWEETS COLLECTION, CLEANING AND PREPARATION

In this notebook, we'll scrape real tweets using Tweepy. You'll need your own auth keys to run it on your local machine. All 175 accounts scraped are listed below. I don't recommend running the full list in one go because of Twitter's rate limits; you are better off splitting your list of real users into smaller chunks.
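
One way to split an account list into smaller batches is sketched below. The `chunked` helper and the short `accounts` list are illustrative assumptions and are not part of the original notebook; each batch could then be scraped in its own cell or session.

```python
def chunked(items, size):
    """Yield successive batches of at most `size` items from a list."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical subset of the accounts scraped in this notebook
accounts = [
    "nytimes",
    "washingtonpost",
    "Reuters",
    "ChannelNewsAsia",
    "STcom",
    "CNN",
    "WSJ",
]

# Batches of five, matching the five-accounts-per-cell pattern used below
batches = list(chunked(accounts, 5))
for batch in batches:
    print(batch)
```

Keeping the batches small means a failed or rate-limited run only needs the affected batch to be repeated.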

In [23]:
import csv
import os
import pandas as pd
import re

from dotenv import load_dotenv, find_dotenv
import tweepy

## 1.0: SCRAPE TWEETS WITH TWEEPY

In [3]:
load_dotenv(find_dotenv(), override=True)

True

In [4]:
CONSUMER_KEY = os.getenv('CONSUMER_KEY')
CONSUMER_SECRET = os.getenv('CONSUMER_SECRET')
ACCESS_KEY = os.getenv('ACCESS_KEY')
ACCESS_SECRET = os.getenv('ACCESS_SECRET')

In [5]:
# Scrape a user's timeline and append one row per tweet to the CSV file

def get_tweets(username):
    # Authenticate with the consumer key and consumer secret
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)

    # Attach the user's access key and access secret
    auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)

    # Create the API client; wait_on_rate_limit pauses when the rate limit is hit
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # newline="" avoids blank rows on Windows; the with block closes the
    # file even if the scrape fails part-way through
    with open("../data/real_tweets.csv", "a", newline="", encoding="utf-8") as csv_file:
        csv_writer = csv.writer(csv_file)

        # Page through the user's timeline and write one row per tweet
        for tweet in tweepy.Cursor(api.user_timeline, screen_name=username).items():
            csv_writer.writerow(
                [
                    tweet.id,
                    tweet.author.screen_name,
                    tweet.created_at,
                    tweet.lang,
                    tweet.source,
                    tweet.retweet_count,
                    tweet.favorited,
                    tweet.retweeted,
                    tweet.text,
                ]
            )

In [6]:
get_tweets('nytimes')
get_tweets('washingtonpost')
get_tweets('Reuters')
get_tweets('ChannelNewsAsia')
get_tweets('STcom')

In [7]:
get_tweets('FoxFriendsFirst')
get_tweets('TheEconomist')
get_tweets('politico')
get_tweets('CNN')
get_tweets('WSJ')

In [8]:
get_tweets('realDonaldTrump')
get_tweets('newtgingrich')
get_tweets('RichardGrenell')
get_tweets('FrankLuntz')
get_tweets('AmbJohnBolton')

In [9]:
get_tweets('JoeBiden')
get_tweets('KamalaHarris')
get_tweets('SenSanders')
get_tweets('PeteButtigieg')
get_tweets('AOC')

In [11]:
get_tweets('maggieNYT')
get_tweets('JeffreyGoldberg')
get_tweets('maddow')
get_tweets('jaketapper')
get_tweets('ezraklein')

In [12]:
get_tweets('BillKristol')
get_tweets('Peggynoonannyc')
get_tweets('IngrahamAngle')
get_tweets('TuckerCarlson')
get_tweets('megynkelly')

In [13]:
get_tweets('CaseyNewton')
get_tweets('dandrezner')
get_tweets('kevinroose')
get_tweets('karaswisher')
get_tweets('gtconway3d')

In [None]:
get_tweets('axios')
get_tweets('voxdotcom')
get_tweets('TheAtlantic')
get_tweets('latimes')
get_tweets('DMRegister')

In [None]:
get_tweets('CNBC')
get_tweets('guardian')
get_tweets('NewYorker')
get_tweets('MSNBC')
get_tweets('business')

In [None]:
get_tweets('EricTrump')
get_tweets('IvankaTrump')
get_tweets('Liz_Cheney')
get_tweets('DonaldJTrumpJr')
get_tweets('seanhannity')

In [None]:
get_tweets('HillaryClinton')
get_tweets('ewarren')
get_tweets('NYGovCuomo')
get_tweets('AndrewYang')
get_tweets('davidaxelrod')

In [None]:
get_tweets('daveweigel')
get_tweets('ThePlumLineGS')
get_tweets('JamesFallows')
get_tweets('morningmoneyben')
get_tweets('weijia')

In [None]:
get_tweets('DineshDSouza')
get_tweets('ByronYork')
get_tweets('soledadobrien')
get_tweets('RonBrownstein')
get_tweets('alexwagner')

In [None]:
get_tweets('billmaher')
get_tweets('NormOrnstein')
get_tweets('jayrosen_nyu')
get_tweets('Toure')
get_tweets('brhodes')

In [None]:
get_tweets('SCMPNews')
get_tweets('HongKongFP')
get_tweets('ReutersChina')
get_tweets('CDT')
get_tweets('ChinaRealTime')

In [None]:
get_tweets('wongmjane')
get_tweets('mranti')
get_tweets('prchovanec')
get_tweets('BonnieGlaser')
get_tweets('niubi')

In [None]:
get_tweets('JKynge')
get_tweets('BeijingPalmer')
get_tweets('suilee')
get_tweets('meifongwriter')
get_tweets('PekingMike')

In [None]:
get_tweets('damienics')
get_tweets('GregPoling')
get_tweets('yangyang_cheng')
get_tweets('limlouisa')
get_tweets('vshih2')

In [None]:
get_tweets('BaldingsWorld')
get_tweets('klustout')
get_tweets('RealSexyCyborg')
get_tweets('laurelchor')
get_tweets('hebeipangzai')

In [None]:
get_tweets('thewirechina')
get_tweets('HongKongFP')
get_tweets('ReutersChina')
get_tweets('CDT')
get_tweets('ChinaRealTime')

In [None]:
get_tweets('joshchin')
get_tweets('gillianwong')
get_tweets('beijingscribe')
get_tweets('stegersaurus')
get_tweets('ulywang')

In [None]:
get_tweets('WeiDuCNA')
get_tweets('davidpaulk')
get_tweets('dakekang')
get_tweets('tmitchpk')
get_tweets('sharonchenhm')

In [None]:
get_tweets('SophieMak1')
get_tweets('melissakchan')
get_tweets('aliceysu')
get_tweets('lilkuo')
get_tweets('vshih2')

In [None]:
get_tweets('You_Shu_China')
get_tweets('jmulvenon')
get_tweets('fravel')
get_tweets('YuanfenYang')
get_tweets('humarisaac')

In [None]:
get_tweets('teamlipei')
get_tweets('EmilyZFeng')
get_tweets('ByChunHan')
get_tweets('JChengWSJ')
get_tweets('IlariaMariaSala')

In [None]:
get_tweets('supchinanews')
get_tweets('TechBuzzChina')
get_tweets('cnmediaproject')
get_tweets('The_ChinaStory')
get_tweets('CNStorytellers')

In [None]:
get_tweets('ccni')
get_tweets('JiayangFan')
get_tweets('CarlMinzner')
get_tweets('michaelxpettis')
get_tweets('onlyyoontv')

In [None]:
get_tweets('jeromeacohen')
get_tweets('lokmantsui')
get_tweets('rzhongnotes')
get_tweets('vwang3')
get_tweets('evadou')

In [None]:
get_tweets('CaiweiC')
get_tweets('DSORennie')
get_tweets('sophia_yan')
get_tweets('wang_maya')
get_tweets('kaifulee')

In [None]:
get_tweets('yananw')
get_tweets('DGTam86')
get_tweets('ruima')
get_tweets('yiqinfu')
get_tweets('chenchenzh')

In [None]:
get_tweets('Dali_Yang')
get_tweets('Yaqiu')
get_tweets('xinwenxiaojie')
get_tweets('ericfish85')
get_tweets('KaiserKuo')

In [None]:
get_tweets('AbacusNews')
get_tweets('MacroPoloChina')
get_tweets('ChinaFile')
get_tweets('chinaquarterly')
get_tweets('LaszloCHP')

In [None]:
get_tweets('XinqiSu')
get_tweets('gadyepstein')
get_tweets('QiZHAI')
get_tweets('Chao_Deng')
get_tweets('anthonytao')

In [None]:
get_tweets('DRechts')
get_tweets('akaDashan')
get_tweets('claydube')
get_tweets('S_Rabinovitch')
get_tweets('FuDaoge')

In [None]:
get_tweets('adam_ni')
get_tweets('ritacyliao')
get_tweets('Junmai1103')
get_tweets('JeromeTaylor')
get_tweets('austinramzy')

In [16]:
real_tweets = pd.read_csv(
    "../data/real_tweets.csv",
    names=[
        "tweetid",
        "user_screen_name",
        "tweet_time",
        "tweet_language",
        "source",
        "retweet_count",
        "favorited",
        "retweeted",
        "tweet_text",
    ],
)
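
Because `get_tweets` appends to the same CSV file on every call, re-running a cell or listing an account more than once (`HongKongFP` and `vshih2`, for example, appear in more than one cell above) writes the same tweets again. A minimal sketch of dropping such repeats by tweet ID follows; the toy `scraped` frame is invented for illustration and this step is not in the original notebook.

```python
import pandas as pd

# Toy stand-in for the scraped data; the values are invented for illustration
scraped = pd.DataFrame(
    {
        "tweetid": [101, 102, 101, 103],
        "user_screen_name": ["HongKongFP", "vshih2", "HongKongFP", "vshih2"],
        "tweet_text": ["first tweet", "second tweet", "first tweet", "third tweet"],
    }
)

# Keep the first occurrence of each tweet ID and renumber the rows
deduped = scraped.drop_duplicates(subset="tweetid", keep="first").reset_index(drop=True)

print(len(scraped), "->", len(deduped))
```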


## 1.1: CLEAN + FILTER TWEET TEXT

The same cleaning and filtering rules apply as for the troll tweets: keep only English tweets, drop retweets, and drop tweets with 3 or fewer words after cleaning.

In [24]:
# text cleaning function. adjust according to your use case

def clean_text(text):
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"\n", " ", text)
    text = re.sub(r"\'t", " not", text)  # Change 't to 'not'
    text = re.sub(r"(@.*?)[\s]", " ", text)  # Remove @name
    text = re.sub(r"$\d+\W+|\b\d+\b|\W+\d+$", " ", text)  # remove digits
    text = re.sub(r"[^\w\s\#]", "", text)  # remove special characters except hashtags
    text = text.strip(" ")
    text = re.sub(
        " +", " ", text
    ).strip()  # get rid of multiple spaces and replace with a single
    return text


real_tweets["clean_text"] = real_tweets["tweet_text"].map(clean_text)
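
To see what the cleaning rules do in practice, the sketch below applies the same regular expressions to two made-up tweets. The function body is repeated from the cell above so the snippet runs on its own; the sample strings are invented for illustration.

```python
import re


def clean_text(text):
    # Same substitutions, in the same order, as the notebook cell above
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = re.sub(r"\n", " ", text)  # newlines to spaces
    text = re.sub(r"\'t", " not", text)  # replace 't with " not"
    text = re.sub(r"(@.*?)[\s]", " ", text)  # remove @name followed by whitespace
    text = re.sub(r"$\d+\W+|\b\d+\b|\W+\d+$", " ", text)  # remove digits
    text = re.sub(r"[^\w\s\#]", "", text)  # drop special characters except #
    text = text.strip(" ")
    return re.sub(" +", " ", text).strip()  # collapse repeated spaces


# Invented sample tweets
with_mention = "@BBCWorld Markets don't like this https://t.co/xyz #economy"
with_digits = "Breaking: 3 dead in storm"

cleaned_mention = clean_text(with_mention)
cleaned_digits = clean_text(with_digits)

print(cleaned_mention)  # Markets don not like this #economy
print(cleaned_digits)  # Breaking dead in storm
```

Note that the `'t` rule replaces only the apostrophe and the `t`, so "don't" becomes "don not"; adjust the rule if your use case needs proper contraction handling.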


In [25]:
real_tweets['word_count'] = real_tweets['clean_text'].str.count(' ') + 1

In [27]:
crit1 = real_tweets["tweet_language"] == "en"
crit2 = ~real_tweets["tweet_text"].str.startswith("RT @")
crit6 = ~real_tweets["tweet_text"].str.startswith("RT@")
crit3 = ~real_tweets["clean_text"].isnull()
crit4 = real_tweets["clean_text"] != ""
crit5 = real_tweets["word_count"] > 3

real_tweets = real_tweets[crit1 & crit2 & crit3 & crit4 & crit5 & crit6].copy()
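
The combined effect of these criteria can be checked on a small example. The sketch below builds a toy frame with invented rows and applies the language, retweet, and word-count filters; the variable names are illustrative and differ from the `crit1` to `crit6` names used above.

```python
import pandas as pd

# Invented rows: a retweet, a normal English tweet, a French tweet, a short tweet
toy = pd.DataFrame(
    {
        "tweet_language": ["en", "en", "fr", "en"],
        "tweet_text": [
            "RT @user some retweeted text here now",
            "Markets fall sharply on trade fears",
            "Bonjour tout le monde ici",
            "Too short",
        ],
        "clean_text": [
            "RT some retweeted text here now",
            "Markets fall sharply on trade fears",
            "Bonjour tout le monde ici",
            "Too short",
        ],
    }
)

# Word count as in the notebook: number of spaces plus one
toy["word_count"] = toy["clean_text"].str.count(" ") + 1

is_english = toy["tweet_language"] == "en"
not_retweet = ~toy["tweet_text"].str.startswith("RT @") & ~toy["tweet_text"].str.startswith("RT@")
long_enough = toy["word_count"] > 3

kept = toy[is_english & not_retweet & long_enough].copy()
print(kept["tweet_text"].tolist())
```

Only the second row survives: the first is a retweet, the third is not English, and the fourth has too few words.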


In [31]:
cols = ["tweetid", "user_screen_name", "tweet_text", "clean_text"]

real_tweets = real_tweets[cols].copy()


In [35]:
real_tweets = real_tweets.rename(
    columns={
        "tweetid": "tweetid",
        "user_screen_name": "user_display_name",
        "tweet_text": "tweet_text",
        "clean_text": "clean_text",
    }
)


In [None]:
# real tweets are labelled 0; troll tweets are labelled 1

real_tweets["troll_or_not"] = 0

## 1.2 SLICE SMALLER SAMPLE OF REAL TWEETS

Again, this is to make the fine tuning process more manageable. If you have access to better compute, feel free to run on a bigger slice of the data.

In [None]:
real_sample = real_tweets.sample(n=50000, random_state=42, replace=False)
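
With `replace=False`, `sample` raises a `ValueError` when `n` is larger than the number of rows, which can happen if you scrape fewer accounts than listed here. The sketch below caps the sample size and shows that a fixed `random_state` gives a repeatable draw; the toy frame and the `n_wanted` name are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for the filtered real tweets
toy = pd.DataFrame({"clean_text": [f"tweet number {i}" for i in range(10)]})

n_wanted = 50000
n = min(n_wanted, len(toy))  # never ask for more rows than exist

sample_a = toy.sample(n=n, random_state=42, replace=False)
sample_b = toy.sample(n=n, random_state=42, replace=False)

# The same seed returns the same rows in the same order
print(sample_a.index.equals(sample_b.index))
```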

In [38]:
# this dataset is available in the repo in case you want an even smaller slice
# Download here: https://github.com/chuachinhon/transformers_state_trolls_cch/blob/master/data/real_50k.csv

# real_sample.to_csv('../data/real_50k.csv', index=False)