## 5. Transformer Finetuning
To create our state operator classifier, we will use the HuggingFace Transformers library to finetune the DistilBERT model. DistilBERT is a smaller, distilled version of BERT, a large Transformer language model released by Google in 2018. Larger and more capable language models now exist, but we want the smallest model that still succeeds on our task.

### 5.1 Setup

In [None]:
import pandas as pd
import numpy as np
import torch
from transformers import AutoModel
from transformers import AutoTokenizer

### 5.2 Prepare Datasets
We will use the HuggingFace Datasets library here; it has a range of features that make training easier.

#### 5.2.1 Load Data

In [None]:
from datasets import load_dataset

data_files = {"train": "../data/train.csv", "test": "../data/test.csv"}
tweet_seqs = load_dataset("csv", data_files=data_files, lineterminator='\n')
tweet_seqs

Let's take a look at an example of the data.

In [None]:
seq_sample = tweet_seqs["train"].shuffle(seed=42).select(range(1000))
seq_sample[:3]

From this, we can see that the dataset probably contains a number of shorter tweet sequences, i.e., sequences with fewer than 10 tweets. For consistency, and because most real-world users have more than 10 tweets in their feed, we will mark these sequences with a custom function and then exclude them.

#### 5.2.2 Remove Shorter Sequences

In [None]:
def compute_tweet_seq_len(example):
    """Compute number of tweets in tweet sequence
    
    Args:
        example (Dataset item): A single row item of the dataset

    Returns:
        dictionary: maps "tweet_count" to the number of "|" separators in the sequence

    """
    return {"tweet_count": example["recent_tweets"].count("|")}

In [None]:
def compute_tweet_word_count(example):
    """Approximate the number of words in a tweet sequence by counting spaces
    
    Args:
        example (Dataset item): A single row item of the dataset

    Returns:
        dictionary: maps "word_count" to the number of spaces in the sequence

    """
    return {"word_count": example["recent_tweets"].count(" ")}
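
To see what these two helpers compute, here is a minimal standalone sketch on a made-up row. It assumes, as in the inference example later in this section, that every tweet in `recent_tweets` is followed by a `|` separator, so the number of separators equals the number of tweets. Counting spaces is only a rough proxy for word count, and the whitespace-split comparison is an illustration that is not part of the original notebook.

```python
# Hypothetical row; real rows come from train.csv / test.csv.
example = {"recent_tweets": "First tweet here | Second tweet | Third one | "}

# Each tweet is followed by "|", so counting separators counts tweets.
tweet_count = example["recent_tweets"].count("|")

# Counting spaces is a cheap proxy for word count; it also counts the
# spaces around each "|" separator, so it overestimates somewhat.
space_count = example["recent_tweets"].count(" ")

# A whitespace split that ignores the separators, for comparison.
word_count = len([tok for tok in example["recent_tweets"].split() if tok != "|"])

print(tweet_count, space_count, word_count)  # 3 10 7
```

Because the space count is used only as an upper bound on sequence length, the overestimate is harmless for filtering purposes.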

In [None]:
# compute number of tweets and words across entire dataset
tweet_seqs = tweet_seqs.map(compute_tweet_seq_len)
tweet_seqs = tweet_seqs.map(compute_tweet_word_count)
tweet_seqs["train"][0]

In [None]:
# use the Datasets filter function to keep only sequences with more than 9 tweets
# and fewer than 400 words (approximated by the space count)
tweet_seqs = tweet_seqs.filter(lambda x: x["tweet_count"] > 9)
tweet_seqs = tweet_seqs.filter(lambda x: x["word_count"] < 400)

In [None]:
print(tweet_seqs.num_rows)
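
The two filter conditions can be read as a single predicate. The sketch below applies it to a few made-up rows in plain Python, without the Datasets library, to show which rows survive. The notebook does not say why 400 was chosen as the limit; one plausible reason, stated here only as an assumption, is to keep most sequences within DistilBERT's 512-token input limit.

```python
# Hypothetical rows with precomputed counts (the real ones come from .map()).
rows = [
    {"seq_id": "a", "tweet_count": 10, "word_count": 180},  # kept
    {"seq_id": "b", "tweet_count": 9, "word_count": 120},   # too few tweets
    {"seq_id": "c", "tweet_count": 10, "word_count": 400},  # too long
]


def keep(row):
    """Same two conditions as the filter calls, combined into one predicate."""
    return row["tweet_count"] > 9 and row["word_count"] < 400


kept = [row["seq_id"] for row in rows if keep(row)]
print(kept)  # ['a']
```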

#### 5.2.3 Train-Validation-Test Split
In addition to using the train-test split we made earlier, we will also set aside 20% of the training set for validation during model training.

In [None]:
# map train, validation and test splits appropriately.
tweet_seqs_clean = tweet_seqs["train"].train_test_split(train_size=0.8, seed=42)

tweet_seqs_clean["validation"] = tweet_seqs_clean.pop("test")

tweet_seqs_clean["test"] = tweet_seqs["test"]
tweet_seqs_clean
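
For intuition about the proportions, here is a plain-Python sketch of a seeded 80/20 split over row indices. It is not the Datasets implementation, and the shuffled order will differ from what `train_test_split` produces; `split_indices` is a hypothetical helper used only for illustration.

```python
import random


def split_indices(n_rows, train_size=0.8, seed=42):
    """Shuffle row indices and cut them into train and validation parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    cut = int(n_rows * train_size)
    return indices[:cut], indices[cut:]


train_idx, val_idx = split_indices(1000)
print(len(train_idx), len(val_idx))  # 800 200
```

The original test split is left untouched, so the validation rows come out of the training data only.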

In [None]:
# save the prepared dataset
tweet_seqs_clean.save_to_disk("../tweet-seqs")

### 5.3 Load Model & Train
Next, we load the saved dataset, download the DistilBERT model, set training arguments, and finally finetune the model.

#### 5.3.1 Load Dataset From Disk

In [None]:
from datasets import load_from_disk

tweet_seqs_reloaded = load_from_disk("../tweet-seqs")
tweet_seqs_reloaded

#### 5.3.2 Download Tokenizer & Tokenize Sequences
HuggingFace language models generally come with their own tokenizers. We will need to use the tokenizer that comes with DistilBERT to tokenize our tweet sequences.

In [None]:
from transformers import DistilBertTokenizerFast, DataCollatorWithPadding

checkpoint = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(checkpoint)


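def tokenize_function(example):
    # Truncate to the model's maximum length; padding is left to the data
    # collator below, which pads each batch dynamically.
    return tokenizer(example["recent_tweets"], truncation=True)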


tokenized_datasets = tweet_seqs_reloaded.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["userid", "tweet_text", "tweet_time", "clean_tweets","seq_id", "tweet_count"])
tokenized_datasets = tokenized_datasets.rename_column("operator", "labels")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
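
`DataCollatorWithPadding` pads each batch, at collation time, to the length of its longest sequence. The following plain-Python sketch illustrates that idea on made-up token ids; `pad_batch` is a hypothetical helper, not the library's code, and the pad id of 0 is an assumption made for the example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two sequences of different lengths.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to a global maximum keeps short batches small, which tends to save memory and compute.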

#### 5.3.3 Download Model & Selected Metrics

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    "../trainer-checkpoints",
    evaluation_strategy="epoch",
    num_train_epochs=1,
)

In [None]:
from transformers import DistilBertForSequenceClassification

# Note the label count below: currently binary, 1/0 for state operator or not.
model = DistilBertForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

In [None]:
from datasets import load_metric

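def compute_metrics(eval_preds):
    # The GLUE MRPC metric reports accuracy and F1 for binary classification.
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    # Pick the highest-scoring class for each example.
    predictions = np.argmax(logits, axis=-1)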
    return metric.compute(predictions=predictions, references=labels)
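
To make the two reported numbers concrete, the sketch below computes accuracy and binary F1 directly with NumPy on made-up logits. It is an illustration under the assumption that class 1 is the positive class, not a replacement for the metric loaded above; `accuracy_and_f1` is a hypothetical helper.

```python
import numpy as np


def accuracy_and_f1(logits, labels):
    """Accuracy and binary F1 (positive class = 1) from raw logits."""
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))  # true positives
    fp = int(np.sum((preds == 1) & (labels == 0)))  # false positives
    fn = int(np.sum((preds == 0) & (labels == 1)))  # false negatives
    accuracy = float(np.mean(preds == labels))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Made-up logits for four examples; the third one is misclassified.
logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]])
labels = [0, 1, 1, 1]
scores = accuracy_and_f1(logits, labels)
print(scores)  # {'accuracy': 0.75, 'f1': 0.8}
```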

### 5.4 Set Arguments & Train

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Below is the final training command. Be cautious with the default settings: the `Trainer` saves a checkpoint every 500 steps (batches), and each checkpoint is almost 1 GB in size. It is therefore easy to accidentally fill up a smaller hard drive with checkpoints.

In [None]:
trainer.train()
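
To gauge how much disk space the checkpoints from a run like this could take, a back-of-the-envelope estimate helps. The sketch below assumes a single device, the default batch size of 8, one checkpoint every 500 steps, and roughly 1 GB per checkpoint; the training-set size is hypothetical and should be replaced with the real row count.

```python
import math


def estimate_checkpoint_disk_gb(n_train_rows, batch_size=8, epochs=1,
                                save_steps=500, checkpoint_gb=1.0):
    """Rough disk use if every saved checkpoint is kept."""
    steps_per_epoch = math.ceil(n_train_rows / batch_size)
    total_steps = steps_per_epoch * epochs
    n_checkpoints = total_steps // save_steps
    return n_checkpoints, n_checkpoints * checkpoint_gb


# Hypothetical training-set size; substitute the real row count.
n_ckpt, disk_gb = estimate_checkpoint_disk_gb(n_train_rows=200_000)
print(n_ckpt, disk_gb)  # 50 50.0
```

If the estimate is too large, `TrainingArguments` accepts `save_total_limit` to cap the number of checkpoints kept, and `save_steps` to save less often.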

### 5.5 Results

#### 5.5.1 Evaluation
Let's evaluate the model on the test set.

In [None]:
trainer.evaluate(tokenized_datasets["test"])

The final model achieves the following metrics on the test set:
- **Accuracy:** 0.999
- **F1:** 0.999

#### 5.5.2 Discussion

These values are unusually high, but they have been reproduced repeatedly with different sampling and data combinations.

#### 5.5.3 Saving Model

In [None]:
ft_model = "../finetuned/troll_detect_distilbert"
trainer.save_model(ft_model)
tokenizer.save_pretrained(ft_model)

### 5.6 Use Model For Inference

#### 5.6.1 Load Local Model
The saved model can be loaded with the following standalone code.

In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import TextClassificationPipeline

ft_model = "../finetuned/troll_detect_distilbert"
model = AutoModelForSequenceClassification.from_pretrained(ft_model)
tokenizer = AutoTokenizer.from_pretrained(ft_model)

classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)

Let's classify!

In [None]:
classifier("Washington Gives Ankara An Ultimatum | Trump Makes BigTime Overture To Iran | Authoritarian Spirits Congress The Espionage Act And Punishing WikiLeaks | Analysis Of The European Parliamentary Elections | One Mans Quest To Expose A Fake BBC Video About Syria | China Holds Three Trump Cards In War Against US | Maldives Affirms Fealty To Diego Garcia | The End Of Theresa May | Within The Church People Can Become Truly Free | China Hails Modi Victory This Is Why | ")
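
The pipeline returns a label and a score for each input. Roughly, the model's two output logits are passed through a softmax and the most probable class is reported. The sketch below reproduces that step on made-up logits; `logits_to_prediction` is a hypothetical helper, and the `LABEL_0` / `LABEL_1` names assume no custom label mapping was configured before finetuning. Per the label setup earlier, class 1 corresponds to a state operator.

```python
import math


def logits_to_prediction(logits, id2label):
    """Softmax the logits and report the top label with its probability."""
    shifted = [x - max(logits) for x in logits]  # for numerical stability
    exps = [math.exp(x) for x in shifted]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Made-up logits, not actual model output.
result = logits_to_prediction([-2.0, 3.0], {0: "LABEL_0", 1: "LABEL_1"})
print(result["label"], round(result["score"], 4))  # LABEL_1 0.9933
```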

#### 5.6.2 Download Latest Model
The following assumes you are using the uploaded model for inference and test set evaluation. The model is available on HuggingFace and GitHub.

In [None]:
from transformers import pipeline

classifier = pipeline(model="lingwave-admin/state-op-detector")

In [None]:
classifier("Washington Gives Ankara An Ultimatum | Trump Makes BigTime Overture To Iran | Authoritarian Spirits Congress The Espionage Act And Punishing WikiLeaks | Analysis Of The European Parliamentary Elections | One Mans Quest To Expose A Fake BBC Video About Syria | China Holds Three Trump Cards In War Against US | Maldives Affirms Fealty To Diego Garcia | The End Of Theresa May | Within The Church People Can Become Truly Free | China Hails Modi Victory This Is Why | ")