# STATE TWITTER TROLL DETECTION USING TRANSFORMERS

## REPO STRUCTURE

### 1. DATA FOLDER

* Five CSV files used by the notebooks in this series. The raw troll tweet files from Twitter are not included here.

### 2. NOTEBOOKS FOLDER

* Notebooks 1.0 - 1.2: Data collection, cleaning and preparation. Optional if you just want to experiment with the final dataset.

* Notebooks 2.0 - 2.1: Fine-tuning DistilBERT on a custom dataset, with detailed testing on an unseen validation dataset as well as a fresh dataset of state troll tweets from Iran.

* Notebooks 3.0 - 3.1: Create and test optimised logistic regression and XGBoost models against the datasets used to assess the fine-tuned DistilBERT model.


### 3. APP FOLDER

* app.py + folders for "static" and "template": a simple app for use on a local machine to demonstrate how a state troll tweet detector could be used in deployment. Unfortunately, free hosting accounts can't accommodate the disk space required for PyTorch and the fine-tuned model, so I've not deployed this online.


### 4. TROLL_DETECT FOLDER

* Fine-tuned DistilBERT model from Colab notebook2.0. It is too big for GitHub, so download it from Dropbox [here](https://www.dropbox.com/sh/90h7ymog2oi5yn7/AACTuxmMTcso6aMxSmSiD8AVa) instead.

### 5. PKL FOLDER

* Pickled logistic regression model from notebook3.0

# PART 1C: SPLIT DATASETS FOR TRAINING AND VALIDATION

Most tutorials lump data preparation together with fine-tuning. I prefer to keep them separate, so that it's clear what is in each portion of the dataset.

In [2]:
import numpy as np
import pandas as pd
import csv

In [3]:
# troll tweets, prepared in notebook1.0 and available in the repo
troll = pd.read_csv("../data/troll_50k.csv")

# real tweets, prepared in notebook1.1 and available in the repo
real = pd.read_csv("../data/real_50k.csv")

# stack the two 50k-row sets into one dataset
tweets = pd.concat([troll, real])

In [4]:
tweets.shape

(100000, 5)

In [5]:
tweets.isnull().sum()

tweetid              0
user_display_name    0
tweet_text           0
clean_text           0
troll_or_not         0
dtype: int64

In [6]:
tweets.head()

|   | tweetid | user_display_name | tweet_text | clean_text | troll_or_not |
|---|---|---|---|---|---|
| 0 | 1245883557362282497 | 85c9M6CDZxgBwoEye0rF12ZBgGl3xvz6Bnbvhp7MUKI= | having each tiny wish come true, or having som... | having each tiny wish come true or having some... | 1 |
| 1 | 961577921461866496 | 曲剑明 | ＠null It is 12:25 UTC now | null It is UTC now | 1 |
| 2 | 941616158075211776 | IFL1E0m0SRX2cdOtuLFV7xKtnBgxagKzNgkuGFvNtvs= | British number two Bedene to switch back to Sl... | British number two Bedene to switch back to Sl... | 1 |
| 3 | 850414479976345600 | Klausv | kalamitykait Thanks for bearing with us - you ... | kalamitykait Thanks for bearing with us you sh... | 1 |
| 4 | 960784360071925760 | 曲剑明 | ＠null It is 08:56 CET now | null It is CET now | 1 |


## 1.1: CREATE VALIDATION SET

We'll keep to a 70:20:10 split for training, testing and validation. Here we split off 10% of the dataset (10k rows) and set it aside as the validation set for testing the fine-tuned model.
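The arithmetic behind the 70:20:10 split can be sketched as follows. This is a minimal illustration only: `split_sizes` is a hypothetical helper, not part of the notebooks, and it assumes the full dataset has 100,000 rows as shown by `tweets.shape`.

```python
from fractions import Fraction


def split_sizes(total_rows, train=70, test=20, validation=10):
    """Return train, test and validation row counts for a percentage split,
    plus the share of the post-validation remainder that goes to the test set.

    Hypothetical helper for illustration only.
    """
    assert train + test + validation == 100
    # Validation rows are taken off the full dataset first.
    n_validation = total_rows * validation // 100
    remainder = total_rows - n_validation
    # Share of the remainder that must go to the test set so that it
    # ends up as `test` percent of the full dataset.
    test_share_of_remainder = Fraction(test, train + test)
    n_test = int(remainder * test_share_of_remainder)
    n_train = remainder - n_test
    return n_train, n_test, n_validation, test_share_of_remainder


n_train, n_test, n_validation, test_share = split_sizes(100_000)
print(n_train, n_test, n_validation, test_share)
# prints: 70000 20000 10000 2/9
```

In other words, once the 10% validation set is removed, a test fraction of 2/9 (about 0.22) of the remaining rows would reproduce the overall 70:20:10 ratio.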

In [8]:
validate = tweets.sample(n=10000, random_state=42, replace=False)

In [9]:
# not an exact 50-50 split, but good enough

validate["troll_or_not"].value_counts()

1    5061
0    4939
Name: troll_or_not, dtype: int64
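If an exact 50-50 class balance were needed, one alternative would be to sample the same number of rows from each class. The sketch below shows this on a toy DataFrame; `toy` and `balanced` are illustrative names, and this is not the approach used in the notebook, which keeps the simple random sample above. It assumes pandas 1.1 or later, where `DataFrameGroupBy.sample` is available.

```python
import pandas as pd

# Toy stand-in for the `tweets` DataFrame: 20 rows, 10 per class.
toy = pd.DataFrame(
    {
        "tweetid": range(20),
        "troll_or_not": [1] * 10 + [0] * 10,
    }
)

# Draw the same number of rows from each class without replacement.
balanced = toy.groupby("troll_or_not").sample(n=3, random_state=42)

# Each class contributes exactly 3 rows.
counts = balanced["troll_or_not"].value_counts().to_dict()
print(counts)
```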

In [None]:
# this is included in the repo
# avail here: https://github.com/chuachinhon/transformers_state_trolls_cch/blob/master/data/validate.csv

'''
validate.to_csv(
    "../data/validate.csv",
    index=False,
    encoding="utf-8",
    quoting=csv.QUOTE_NONNUMERIC,
)
'''

## 1.2: CREATE MAIN TRAINING SET

In [10]:
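# Removing every tweetid that appears in the validation set from the main dataset
# gives us the raw training dataset, which will then be further split into train-test sets prior to fine tuning.
# Note: this assumes the tweetids are unique; any other row sharing a tweetid with a validation row is dropped too.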

validate_id = validate['tweetid'].values

train_raw = tweets[~tweets['tweetid'].isin(validate_id)]

In [11]:
train_raw['troll_or_not'].value_counts()

0    45009
1    44939
Name: troll_or_not, dtype: int64
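The two counts above sum to 89,948 rather than the 90,000 rows one would expect after removing 10,000 from 100,000. This suggests that a small number of tweetids are shared by more than one row, since the `isin` filter drops every row carrying a held-out id. The toy example below, using made-up ids, sketches that behaviour along with a simple overlap check.

```python
import pandas as pd

# Toy data: tweetid 3 appears twice, mimicking a duplicated id.
toy = pd.DataFrame(
    {
        "tweetid": [1, 2, 3, 3, 4, 5],
        "troll_or_not": [1, 0, 1, 1, 0, 0],
    }
)

# Hold out two rows by position; one of them carries the duplicated id.
holdout = toy.iloc[[0, 2]]

# Same style of filter as in the cell above: drop every row whose id is held out.
remaining = toy[~toy["tweetid"].isin(holdout["tweetid"])]

# 6 rows minus 2 held-out rows would be 4, but only 3 remain,
# because the second row with tweetid 3 is removed as well.
print(len(toy), len(holdout), len(remaining))
# prints: 6 2 3

# No id is shared between the two sets.
overlap = set(holdout["tweetid"]) & set(remaining["tweetid"])

# Number of rows whose tweetid occurs more than once in the full data.
n_duplicated = int(toy["tweetid"].duplicated(keep=False).sum())
```

The upside of filtering by id is that the training and validation sets cannot share a tweetid, at the cost of losing a few extra rows.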

In [14]:
# this is included in the repo
# avail here: https://github.com/chuachinhon/transformers_state_trolls_cch/blob/master/data/train_raw.csv

'''
train_raw.to_csv(
    "../data/train_raw.csv",
    index=False,
    encoding="utf-8",
    quoting=csv.QUOTE_NONNUMERIC,
)
'''