In [1]:
# Cyberthon Data Science Training Materials
# Author: Ragul Balaji <ragulbalaji@ctf.sg>
# Dataset: Public Domain
# ALT-TAB LABS LLP (C) 2019-2022

If you're opening this locally, make sure your environment has the following package versions installed. To install them, uncomment the cell below and run it.

In [2]:
#! pip install pandas==1.1.5 scikit-learn==1.0.2 matplotlib==3.5.1

# Part I: The Problem

Can we predict who sent a tweet? 🤔

This is a binary classification problem: the target variable belongs to one of two categories. In our example, the categories are `Donald Trump` and `Justin Trudeau`.

Let's dive in and explore!

## Loading the Dataset

We will load the file `train.csv` into a `DataFrame` object using pandas `read_csv()`.

In [1]:
import pandas as pd

data = pd.read_csv("train.csv")
data.head()

Unnamed: 0,trainid,author,status
0,0,Justin Trudeau,RT @CQualtro: Félicitations #Amazon pour l'exp...
1,1,Justin Trudeau,Nous cherchons à résoudre les enjeux qui compt...
2,2,Donald J. Trump,Heading into the 12 days with great negotiatin...
3,3,Donald J. Trump,The long anticipated release of the #JFKFiles ...
4,4,Donald J. Trump,....for the Middle Class. The House and Senate...


## Target variable and Predictor(s)

We first identify that our target variable is `"author"`, and our predictor variable is `"status"`.

You can also apply stemming/lemmatization (covered earlier), or engineer new features that could be useful in model prediction (not covered).

In [2]:
# identify target and predictor
y = data['author']
X = data['status']

## Training set and validation set

### Training set

- Data used to train our model
- Data and labels are provided to the model, and the model tunes its parameters to fit them

### Validation set

- Data used to test our model after it has been trained
- Predicted labels are compared against our true labels to compute the accuracy, to determine the model's performance.
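
As a minimal illustration of that comparison, accuracy is the fraction of predictions that match the true labels. The labels below are made up for the example, and `accuracy` is a hypothetical helper, not part of the notebook.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Made-up labels, for illustration only
y_true = ["Donald J. Trump", "Justin Trudeau", "Justin Trudeau", "Donald J. Trump"]
y_guess = ["Donald J. Trump", "Justin Trudeau", "Donald J. Trump", "Donald J. Trump"]

score = accuracy(y_true, y_guess)
print(score)  # 3 of the 4 predictions match
```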

### Why shouldn't the model be trained using the validation set too?

- We want our model to generalize well on unseen data.
- A model trained on the validation set has already seen that data, so its predictions will be biased towards the validation set, resulting in an overestimate of the model's performance.

When splitting the data, we specify some parameters:
- `random_state=42` : for reproducible results
- `test_size=0.25` : dataset will be split into 75% training and 25% validation
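
The cell that performs the split is not shown here. A minimal sketch of such a split, assuming scikit-learn's `train_test_split` and a toy dataset standing in for `X` and `y`, could look like this. The names `X_validation` and `y_validation` are chosen to match the ones used in Part III; the toy lists are made up.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the notebook's X (tweets) and y (authors)
toy_X = [f"tweet number {i}" for i in range(8)]
toy_y = ["Donald J. Trump", "Justin Trudeau"] * 4

# 75% of the rows go to training, 25% to validation;
# random_state fixes the shuffle so the split is reproducible
X_train, X_validation, y_train, y_validation = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42
)

print(len(X_train), len(X_validation))
```

In the notebook, the same call would be made with `X` and `y` in place of the toy lists.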

Great! Now that we have split our data, can we train the model using the training set yet? No!

## Text Vectorization

As covered earlier, machine learning models require numerical inputs. Text data unfortunately doesn't work here.😞

Let's use `CountVectorizer` to convert the tweets into numerical vectors!

We first create an instance of `CountVectorizer`.

We want the model to learn the vocabulary from the training set, so we fit the `CountVectorizer` using only the training set, not the validation set.

We then transform both the training set and the validation set to encode each tweet as a vector.

In [49]:
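from sklearn.feature_extraction.text import CountVectorizer
import re

def clean(x):
    # Optional cleaning step: uncomment to strip t.co links from the tweet
    # x = re.sub(r'https://t.co/\S{0,10}', '', x)
    return x



# Create an instance of CountVectorizer
# Note: passing `preprocessor` replaces the default preprocessing
# (including lowercasing), so tokens keep their original case here
vectorizer = CountVectorizer(preprocessor=clean)

# Fit and transform/vectorize the training data (all of `X` in this cell)
vec_x = vectorizer.fit_transform(X)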

Fitting and transforming generates the document-term matrix.

For example, with 2 documents and 10 unique terms, the result is a 2x10 matrix.

Each cell specifies the count of a term in a document.
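
To make the 2x10 case concrete, here is a plain-Python sketch of how such a matrix is built. The two documents are made up for illustration, and the tokenization rule (lowercase, tokens of two or more word characters) is meant to mirror `CountVectorizer`'s defaults rather than reproduce its implementation.

```python
import re
from collections import Counter

# Two made-up documents with 10 unique terms between them
docs = [
    "great jobs great economy great again",
    "merci to our great friends in Canada",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every unique term, sorted alphabetically
vocabulary = sorted({term for doc in docs for term in tokenize(doc)})

# One row per document, one column per term, each cell a count
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The term `great` appears three times in the first document and once in the second, so its column holds 3 and 1; terms that a document does not contain get a 0.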

## Machine Learning Model

We will use the library `sklearn` due to its wide selection of models. Machine learning models in `sklearn` are objects which need to be initialized first.

Let's use what we learnt earlier - Decision 🌲!

Here, we specify some parameters
- `random_state=42` : for reproducible results
- `max_depth=8` : to limit tree depth to 8

For detailed documentation, refer here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Some useful methods
- `.fit()` : to pass in our training set to train a machine learning model
- `.predict()` : to pass in our validation set to make predictions using our model
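
The parameters and methods above fit together roughly as follows. This is a sketch on made-up word counts rather than the tweet data, and it uses `DecisionTreeClassifier` as described in this section; the names prefixed with `toy_` are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up count vectors: columns are counts of two hypothetical terms
toy_features = [[2, 0], [1, 0], [3, 0], [0, 1], [0, 2], [0, 3]]
toy_labels = ["Donald J. Trump"] * 3 + ["Justin Trudeau"] * 3

# Initialize the model with the parameters discussed above
toy_model = DecisionTreeClassifier(random_state=42, max_depth=8)

# .fit() learns from the features and their labels
toy_model.fit(toy_features, toy_labels)

# .predict() returns one predicted label per input row
toy_pred = toy_model.predict([[4, 0], [0, 4]])
print(toy_pred)
```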

In [37]:
from sklearn import ensemble

# Initialize our machine learning models
# (tree ensembles are used here rather than a single decision tree)
model = ensemble.ExtraTreesClassifier(n_estimators=3000, verbose=1, n_jobs=-1)
model2 = ensemble.RandomForestClassifier(n_estimators=3000, verbose=1, n_jobs=-1)

In [50]:
# Train the model
model.fit(vec_x.toarray(),y)
# model2.fit(vec_x.toarray(),y)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 768 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done 1218 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done 2418 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 3000 out of 3000 | elapsed:    1.2s finished


ExtraTreesClassifier(n_estimators=3000, n_jobs=-1, verbose=1)

In [51]:
# Make predictions on test set
df_test = pd.read_csv('test.csv')
X_test = df_test['status']
vec_x_test = vectorizer.transform(X_test)
preds = model.predict(vec_x_test.toarray())

df_ans = pd.read_csv('submission.csv')
df_ans['author'] = preds
df_ans.to_csv('submission_ans.csv', index=False)

[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 168 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 418 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 768 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 1218 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 1768 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 2418 tasks      | elapsed:    0.1s
[Parallel(n_jobs=16)]: Done 3000 out of 3000 | elapsed:    0.1s finished


In [52]:
# Connect to graders
import sys
# insert at 1, 0 is the script path (or '' in REPL)
# sys.path entries must be directories, so point at the folder containing pyctfsglib.py
sys.path.insert(1, 'C:/Users/alien/Documents/PyCharm Projects/Cyberthon 2021')
import pyctfsglib as ctfsg
import random

USER_TOKEN = "WrlLCkymxwtgFwRHZsdmKfSwcdqIpnqoXEtRkciVRZJfBJUgcEJoxVZjNTQRdqkR" # You need to fill this up
GRADER_URL = random.choice([
  "http://chals.cyberthon22t.ctf.sg:50501/",
  "http://chals.cyberthon22t.ctf.sg:50502/"
])

grader = ctfsg.DSGraderClient(GRADER_URL, USER_TOKEN)
grader.submitFile('submission_ans.csv')

DSGraderClient: Successfully Connected!
[SERVER] MOTD: CHECK your USER_TOKEN and GRADER_URL HTTP address! I'm TWEET_CLASSIFY @dca2c8397283
ProofOfWork Challenge =>  ('CTFSGRB7008cbad939aca5f41b7fe09932484cd', 22)
ProofOfWork Answer Found! =>  3718578


'{"challenge":{"name":"WhoTweetThis (challenge)"},"id":"cl22q7i9zcskh08272nx5klbx","status":"PARTIALLY_CORRECT","multiplier":0.6815,"submittedBy":{"username":"hci-69"},"createdAt":"2022-04-17T03:26:03Z"}'

# Part II: Model Interpretability

How does the model make its predictions? Let's visualize the 🌲!

### Example: Model predicts Justin Trudeau sent the tweet
Starting at the root node (the node where the 🌲 begins), <br>
if `de` is absent in the tweet, go to the left branch, <br>
if `rt` is present in the tweet, go to the right branch, <br>
if `thank` and `hannity` and `seanhannity` are absent in the tweet, <br>
the model predicts the tweet sender as `Justin Trudeau`.

In [None]:
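# RUN THIS PART AFTER YOU ARE
# DONE WITH PART 1

from sklearn import tree
import matplotlib.pyplot as plt

# plot_tree draws a single fitted decision tree, so `model` must be a
# DecisionTreeClassifier here. If `model` is an ensemble such as the
# ExtraTreesClassifier above, pass one member instead, e.g. model.estimators_[0].
plt.figure(figsize=(16,16))
tree.plot_tree(model, fontsize=10, filled=True, class_names=['Donald', 'Justin'], feature_names=list(vectorizer.get_feature_names_out()))
plt.show()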

The _absence_ of `de`, _presence_ of `rt`, _absence_ of `thank` and `hannity` and `seanhannity` result in the model predicting `Justin Trudeau` as the tweet sender.
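
The path described above can also be written out as a plain rule. The sketch below is a hand-written check of that single path, not code extracted from the trained tree. `on_trudeau_path` is a hypothetical helper, the tweets are made up, and the tokenization assumes lowercased tokens of two or more word characters.

```python
import re

def on_trudeau_path(tweet):
    """Return True if the tweet satisfies the path described above."""
    tokens = set(re.findall(r"\b\w\w+\b", tweet.lower()))
    blockers = {"thank", "hannity", "seanhannity"}
    return (
        "de" not in tokens           # `de` is absent
        and "rt" in tokens           # `rt` is present
        and not (tokens & blockers)  # none of the three terms is present
    )

# Made-up tweets, for illustration only
plain = on_trudeau_path("RT @example: great meeting today")
with_thank = on_trudeau_path("RT @example: great meeting today thank")
with_de = on_trudeau_path("Une belle journée de travail")

print(plain, with_thank, with_de)
```

Only the first tweet stays on the path: the second contains `thank`, and the third contains `de` and lacks `rt`.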

# Part III: Model Attack

Now, you are able to identify what features the model uses to make its predictions. 

Are you able to make minimal modifications to a tweet to fool the model?

We have established in Part II that `de`, `rt`, `thank`, `hannity` and `seanhannity` are features used by the model to make its predictions.

Specifically, the absence of `de`, presence of `rt`, absence of `thank` and `hannity` and `seanhannity` result in the model predicting `Justin Trudeau` as the tweet sender.

Using the same features, what would fool the model into classifying `Donald Trump` as the tweet sender? The presence of `thank` or `hannity` in the tweet!

Let's explore modifying a sample tweet from the validation set!

In [None]:
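# Assumes X_validation and y_validation (from the train/validation split)
# and y_pred (the model's predictions on the validation set) are already defined
sampleid = 1
truth = y_validation.iloc[sampleid]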

print('tweet:', repr(X_validation.iloc[sampleid]))
print('\n   classifies as:', y_pred[sampleid])
print(' ground truth is:', truth)

In [None]:
# we want to modify an input so that the model misclassifies

original = X_validation.iloc[sampleid]
attack1 = X_validation.iloc[sampleid] + ' thank'
attack2 = X_validation.iloc[sampleid] + ' hannity'
attack3 = X_validation.iloc[sampleid] + ' seanhannity'

print('\n original:', repr(original))
print('\n attack1:', repr(attack1))
print('\n attack2:', repr(attack2))
print('\n attack3:', repr(attack3))

Again, the same concept as before applies: text must be vectorized before it is fed to a machine learning model.

In [None]:
# vectorize the attacks
vec_attack = vectorizer.transform([original, attack1, attack2, attack3])

# feed attacks into model for prediction
attack_pred = model.predict(vec_attack)

print('\n   truth:', truth)
print('\n   original:', attack_pred[0])
print('\n   attack1:', attack_pred[1])
print('\n   attack2:', attack_pred[2])
print('\n   attack3:', attack_pred[3])

Modifying the sample tweet by adding `thank` or `hannity` fooled the model into predicting `Donald Trump` instead of `Justin Trudeau` as the tweet sender.

Can you identify another modification to fool the model? 😁