# Transformer Architecture

## Skills

1. Understand the basic vocabulary of machine learning.
2. Explain the importance of training and testing data.
3. Train and evaluate a Support Vector Machine.
4. Build a classification pipeline.
6. Use a multilabel classifier.
7. **Train and evaluate a transformer classifier.**

## Additional Resources

* [Watch an A.I. Learn to Write](https://www.nytimes.com/interactive/2023/04/26/upshot/gpt-from-scratch.html) by NYTimes
* [What is a transformer architecture?](https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/) by NVIDIA
* [Transformers, Explained](https://www.youtube.com/watch?v=SZorAJ4I-sA) by Google Cloud Tech
* [Natural Language Processing](https://course.fast.ai/Lessons/lesson4.html) Chapter of the fast.ai deep learning course.
* [Getting started with NLP for absolute beginners](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners) notebook from that course.

## Motivation

So far we've used the SVM and NBC models for text classification. As I've stated before, they're fairly simple: SVM is really just finding the best line to split the word counts of different documents. Despite that, they're often pretty effective. The NBC has been used for e-mail spam filters since the late 90s, when hardware had a tiny fraction of the power we have today.

The transformer model is a substantial step up in accuracy from simple bag-of-words models like the SVM or Naive Bayes Classifier, at the expense of complexity, time, computing power, and cost. Training a deep learning language model from scratch would require hardware and monetary resources [beyond what is reasonable for us](https://www.cnbc.com/2023/03/13/chatgpt-and-generative-ai-are-booming-but-at-a-very-expensive-price.html). The website behind [ChatGPT](https://www.businessinsider.com/how-much-chatgpt-costs-openai-to-run-estimate-report-2023-4) is estimated to take almost a million dollars a day to just *run the already trained model*. We'll be doing something called fine-tuning, where we take a pre-trained model and tweak it just a bit to do a new task.

Rather than have me explain how a transformer model works, please take a look at the first three links above. They've done a much better job than I could hope to.

**BEFORE RUNNING ANY CODE** You need to change your runtime type to GPU, under Runtime > Change runtime type > Hardware Accelerator > GPU.

## Install and Load Packages

We need the `transformers` package, unsurprisingly. The `[torch]` in brackets asks pip to also install the extra dependencies needed to use it with PyTorch. PyTorch is one of the most popular deep learning frameworks in Python (the other is TensorFlow).

The `datasets` package is used to load data into the transformer. It's a bit like Pandas in the way it stores data.

In [None]:
! pip install -q datasets
! pip install transformers[torch]

And then of course we load everything in that we'll be using. I'll explain each as we use it.

In [None]:
import pandas as pd

from datasets import Dataset,DatasetDict

from transformers import TrainingArguments,Trainer
from transformers import AutoModelForSequenceClassification,AutoTokenizer

from sklearn import metrics

## Load and Preprocess Data

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/Greg-Hallenbeck/class-datasets/main/datasets/SMSSpamCollection.tsv", sep="\t")

df.head(2)

Unnamed: 0,class,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...


In this example, we're going to predict whether an SMS message is spam or legitimate ("ham") based on its text.

The transformer models provided by `transformers` want the y variable that we're trying to fit to be called `"labels"` and also for it to be a float value. So we'll do a bit of simple data manipulation, changing the `class` column from "ham"/"spam" to a simple "is it spam?" binary.

In [None]:
df["labels"] = (df["class"] == "spam").astype("float")
df = df[["labels", "text"]]

df.head(5)

Unnamed: 0,labels,text
0,0.0,"Go until jurong point, crazy.. Available only ..."
1,0.0,Ok lar... Joking wif u oni...
2,1.0,Free entry in 2 a wkly comp to win FA Cup fina...
3,0.0,U dun say so early hor... U c already then say...
4,0.0,"Nah I don't think he goes to usf, he lives aro..."


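Additionally, all transformer models have a maximum input length. The one we're working with accepts at most 512 tokens. As a rough safeguard, let's chop each text to at most 1024 characters (`512*2`); note that this counts characters rather than tokens, so it is only an approximate limit: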

In [None]:
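# Cap measured in characters (not tokens): a rough pre-trim before tokenization
MAX_LENGTH = 512*2
def chop(text):
    # Keep at most the first MAX_LENGTH characters
    return text[:MAX_LENGTH]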

df["text"] = df["text"].apply(chop)
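As a quick sanity check, here is a minimal sketch of what `chop` does. The long sample string is made up for illustration. The slice counts characters, not tokens, so this is only a rough pre-trim.

```python
MAX_LENGTH = 512 * 2  # same character cap as above

def chop(text):
    # Keep at most the first MAX_LENGTH characters
    return text[:MAX_LENGTH]

short_text = "Ok lar... Joking wif u oni..."   # shorter than the cap, so unchanged
long_text = "win a prize now! " * 200          # made-up message, 3400 characters

print(chop(short_text) == short_text)
print(len(long_text), "->", len(chop(long_text)))
```

Short messages pass through untouched, and only unusually long ones are cut.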

## Conversion to a Dataset

Not much to say here, just converting to a new format:

In [None]:
ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['labels', 'text'],
    num_rows: 5572
})

You can still access the columns by name as you could in Pandas, but a column is no longer a `Series`. Slicing it gives you the base `list` type (which means you can't use `.head()`, for example).

In [None]:
ds["text"][0:3]

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
 'Ok lar... Joking wif u oni...',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"]

### Tokenization

Instead of using the punkt or scikit-learn CountVectorizer tokenizer, we're going to be using a tokenizer specifically built for the model we're working with.

Most modern tokenizers for sequence-based (that is, not bag-of-words) models work in the same way, so we'll take a look at it.

In [None]:
# Load the tokenizer
# We store the name in a variable because 'bert-base-uncased' is also the name of the transformer model, not just the tokenizer
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

Let's run it on a few sentences. There's nothing too surprising with these. We're using an uncased model, meaning that it ignores the distinction between upper and lowercase, but there are plenty which include case.

In [None]:
tokenizer.tokenize("The quick brown fox jumps over the lazy dogs.")

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs', '.']

In [None]:
tokenizer.tokenize("The five boxing wizards jump quickly.")

['the', 'five', 'boxing', 'wizards', 'jump', 'quickly', '.']

But we'll see something new here in this sentence with proper nouns: the tokens are no longer words, but pieces of words. Any token which isn't the first token in a word gets a special marker `##` at the start indicating that it's the middle (or end) of a word. In this way, "ob" in the middle of a word is allowed to gain a different meaning than "ob" at the start.



In [None]:
# The vocabulary probably doesn't contain many of these names as whole words, but the
# tokenizer can still split them into multiple pieces, marking each piece that continues
# a word by putting ## before it.
# In theory, if you remove the ## markers and join the pieces, you get the full word back,
# so ken, ##ob, ##i put together is "kenobi".
# This split form is what the model understands, so there's no need to restructure it.
tokenizer.tokenize("""Obi-Wan Kenobi:   Hello, there!
                      General Grievous: General Kenobi!""")

['ob',
 '##i',
 '-',
 'wan',
 'ken',
 '##ob',
 '##i',
 ':',
 'hello',
 ',',
 'there',
 '!',
 'general',
 'gr',
 '##ie',
 '##vo',
 '##us',
 ':',
 'general',
 'ken',
 '##ob',
 '##i',
 '!']

How does the tokenizer know how to split up these proper nouns? Well, it doesn't. Instead of simply tokenizing on spaces or punctuation as we might do by hand, the tokenizer's vocabulary is itself learned from a large body of text, subject to a fixed vocabulary size. "general" isn't split up because it is a common enough string of letters to earn its own token, as are shorter bits like "gr" and "##ie". Anything that isn't in the vocabulary gets assembled from these smaller pieces. The vocabulary of allowed tokens has a maximum size, just like the `CountVectorizer` had. In this case, it is:

In [None]:
len(tokenizer.vocab)

30522
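To make the splitting concrete, here is a toy sketch of greedy longest-match-first subword splitting, in the style of the WordPiece scheme that BERT tokenizers use. The function name `wordpiece` and the tiny `toy_vocab` are made up for illustration. The real tokenizer has 30522 entries and also handles lowercasing and punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical miniature vocabulary, just large enough for the examples
toy_vocab = {"general", "ken", "##ob", "##i", "gr", "##ie", "##vo", "##us"}

print(wordpiece("general", toy_vocab))   # ['general']
print(wordpiece("kenobi", toy_vocab))    # ['ken', '##ob', '##i']
print(wordpiece("grievous", toy_vocab))  # ['gr', '##ie', '##vo', '##us']
```

With this toy vocabulary the splits match what the real tokenizer produced above, because "ob" as a start-of-word piece and "##ob" as a continuation piece are separate vocabulary entries.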

Here's an example which may be more intuitive: an adverb that isn't in the vocabulary gets built from a stem the tokenizer already knows plus suffix pieces, while the same stem on its own stays a single token.

In [None]:
tokenizer.tokenize("i shittily took a shit")

['i', 'shit', '##ti', '##ly', 'took', 'a', 'shit']

This `.tokenize()` is actually just provided for us humans to read. What the model actually receives is a list of IDs as `int`, each of which can be used to look up the original token. `int` is a much simpler (and thus faster) data type to do operations on than `str`.

In [None]:
tokenizer("The quick brown fox jumps over the lazy dogs.") #converts each token to an id number

{'input_ids': [101, 1996, 4248, 2829, 4419, 14523, 2058, 1996, 13971, 6077, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
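Notice that there are 12 IDs for only 10 tokens. For `bert-base-uncased`, the tokenizer wraps every input in two special tokens, `[CLS]` (ID 101) at the start and `[SEP]` (ID 102) at the end. The sketch below rebuilds the lookup by hand from the two outputs shown above. The variable names are made up for illustration, and in practice `tokenizer.convert_ids_to_tokens` does this for you.

```python
# Outputs copied from the two cells above
tokens = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs', '.']
input_ids = [101, 1996, 4248, 2829, 4419, 14523, 2058, 1996, 13971, 6077, 1012, 102]

# Special tokens added around every input by this tokenizer
id_to_token = {101: "[CLS]", 102: "[SEP]"}
# The IDs in between line up one-to-one with the readable tokens
id_to_token.update(zip(input_ids[1:-1], tokens))

decoded = [id_to_token[i] for i in input_ids]
print(decoded)
print(id_to_token[1996])  # both occurrences of 'the' share one ID
```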

We then apply this tokenizer to the text column. Because this is a `Dataset`, rather than a `DataFrame`, we have to use `.map()` on the entire dataset, using a defined function:

In [None]:
def tok_func(x): return tokenizer(x["text"])
ds = ds.map(tok_func, batched=True)


And you can see that there are some new columns, produced by the tokenizer.

In [None]:
ds

Dataset({
    features: ['labels', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 5572
})

### Train/Test Split

We've done this before! We're just using the syntax of the `datasets` library, instead of scikit-learn.

In [None]:
dataset = ds.train_test_split(0.2, seed=42)
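As a quick check on the split sizes, here is the arithmetic for a 20% test split of our 5572 rows. This sketch assumes the test size is rounded up to a whole row, which is consistent with the support counts (4457 and 1115) in the classification reports at the end.

```python
import math

n_rows = 5572        # rows in the full dataset
test_fraction = 0.2  # the value passed to train_test_split

# Assumption: the test set size is rounded up to a whole number of rows
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test

print(n_train, n_test)  # 4457 1115
```

The result is stored under the keys `"train"` and `"test"`, which is how the later cells refer to each part.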

### Set up the Model

Here we're doing three things:

1. Set up hyperparameters
2. Load a pretrained model
3. Set up the fine-tuning

#### Set up Hyperparameters

These are hyperparameters: settings that affect how we go about doing the fit, but aren't parameters fit by the model itself. The vocabulary size of our SVM models was another example of a hyperparameter.

When we fine-tune the model, it goes through every row in our dataset and makes little changes to its own weights (analogous to moving the SVM line around) to get a better fit.

* **batch size** (`bs`): How many rows the model looks at simultaneously. Shouldn't have much of an effect on model training, but if your GPU doesn't have a lot of memory, this needs to be smaller.
* **epochs** (`epochs`): How many times the model goes through the entire dataset, making predictions for each input. Think of it as how many times you go through your stack of flashcards. More epochs means longer training time, and potentially a better fit (or more overfitting).
* **learning rate** (`lr`): How much the model adjusts itself each time it gets a wrong answer while training.
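Batch size and epochs together determine how many update steps the training takes. A minimal sketch of that arithmetic, assuming 4457 training rows (80% of 5572), a batch size of 32, 4 epochs, a single GPU, and no gradient accumulation:

```python
import math

n_train = 4457  # training rows after the 80/20 split
bs = 32         # rows per step
epochs = 4      # passes over the training data

# The last batch of each epoch is smaller than bs, hence the ceiling
steps_per_epoch = math.ceil(n_train / bs)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 140 560
```

The total agrees with the `global_step=560` reported in the training output.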


In [None]:
bs = 32  # Rows per step; generally as large as fits in GPU memory without an out-of-memory error

epochs = 4  # Passes over the training data; too many is a common cause of overfitting

lr = 8e-5  # Step size for weight updates; hard to build intuition for, so we leave it as is

# Note: newer transformers releases may expect eval_strategy instead of evaluation_strategy
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

# Hyperparameters are chosen by the person setting up the training and are NOT learned by the model.
# Here they include the batch size, the number of epochs, and the learning rate.
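The `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` arguments mean the learning rate is not constant: it ramps up linearly over the first 10% of steps and then decays along half a cosine curve. Below is a rough sketch of that shape, not the library's own code. The function name `lr_at` is made up, and the 560 total steps are taken from the `global_step` in the training output.

```python
import math

def lr_at(step, total_steps=560, warmup_steps=56, peak_lr=8e-5):
    """Approximate learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule completed, from 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half a cosine wave: starts at peak_lr and ends at 0
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 28, 56, 308, 560):
    print(step, f"{lr_at(step):.2e}")
```

So the `lr = 8e-5` we set is the peak value, reached at the end of warmup, rather than the rate used at every step.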

#### Load Pre-Trained Model

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
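Setting `num_labels=1` makes the model produce a single number per message. To my understanding, the sequence classification models in `transformers` treat that case as a regression problem and train with mean squared error against the float `labels` column, which is why we converted the labels to floats earlier. A minimal sketch of that loss, with made-up predictions:

```python
def mse(preds, labels):
    """Mean squared error between predicted values and target labels."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

labels = [0.0, 0.0, 1.0, 0.0]        # ham, ham, spam, ham
preds = [0.02, -0.01, 0.93, 0.10]    # made-up model outputs, not real results

print(mse(preds, labels))
```

Because the outputs are plain numbers rather than probabilities, they can fall slightly below 0 or above 1, and we turn them into classes later by thresholding at 0.5.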


#### Set up the Model for Fine-Tuning

In [None]:
def accuracy(x, y):
    # x: raw model outputs; y: 0/1 labels. Outputs above 0.5 count as spam.
    return ((x > 0.5).reshape(-1) == y).mean()

def acc_metric(eval_pred):
    # eval_pred unpacks into (predictions, labels)
    return {"accuracy": accuracy(*eval_pred)}

trainer = Trainer(model, args, train_dataset=dataset['train'], eval_dataset=dataset['test'],
                  tokenizer=tokenizer, compute_metrics=acc_metric)
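The accuracy metric above thresholds the raw outputs at 0.5 and counts how often the result matches the label. Here is the same logic as a plain-Python sketch without NumPy. The sample outputs are made up, and the `threshold` parameter is added only for illustration.

```python
def accuracy(outputs, labels, threshold=0.5):
    """Fraction of outputs that land on the same side of the threshold as their label."""
    predicted = [1.0 if o > threshold else 0.0 for o in outputs]
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels)

outputs = [0.97, 0.03, 0.41, 0.88, -0.02]  # made-up raw model outputs
labels = [1.0, 0.0, 1.0, 1.0, 0.0]         # 1.0 = spam, 0.0 = ham

print(accuracy(outputs, labels))  # 0.8
```

The third message is spam but scores 0.41, so it is counted as a miss, giving 4 correct out of 5.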


### Train the Model

This is just like the `.fit()` method in scikit-learn.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.010942,0.986547
2,No log,0.006442,0.992825
3,No log,0.006448,0.992825
4,0.026000,0.005729,0.992825


TrainOutput(global_step=560, training_loss=0.02352738425667797, metrics={'train_runtime': 112.2099, 'train_samples_per_second': 158.881, 'train_steps_per_second': 4.991, 'total_flos': 705648778573722.0, 'train_loss': 0.02352738425667797, 'epoch': 4.0})

### Evaluate the Model

There are many ways we can do this, but it turns out that the `metrics` from scikit-learn will work fine! So there's nothing much to learn once we coerce the data into the same type.

In [None]:
y_pred_train = trainer.predict(dataset["train"]).predictions
y_pred_test  = trainer.predict(dataset["test"]).predictions

In [None]:
y_pred_train

array([1., 0., 0., ..., 1., 0., 0.])

In [None]:
y_pred_train = (y_pred_train.reshape(-1) > 0.5).astype("float")
y_pred_test  = (y_pred_test.reshape(-1) > 0.5).astype("float")

In [None]:
y_pred_train

array([1., 0., 0., ..., 1., 0., 0.])

In [None]:
print(metrics.classification_report(dataset["train"]["labels"], y_pred_train, zero_division=0))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      3860
         1.0       1.00      0.99      1.00       597

    accuracy                           1.00      4457
   macro avg       1.00      1.00      1.00      4457
weighted avg       1.00      1.00      1.00      4457



In [None]:
print(metrics.classification_report(dataset["test"]["labels"], y_pred_test, zero_division=0))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       965
         1.0       0.97      0.97      0.97       150

    accuracy                           0.99      1115
   macro avg       0.98      0.98      0.98      1115
weighted avg       0.99      0.99      0.99      1115
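To put the 0.99 test accuracy in context, it helps to compare it with a baseline that always predicts "ham". A small sketch using the support counts from the test report above:

```python
ham_test, spam_test = 965, 150  # support counts from the test report

# Accuracy of a trivial model that labels every message as ham
baseline = ham_test / (ham_test + spam_test)

print(f"{baseline:.3f}")  # 0.865
```

Since the classes are imbalanced, accuracy alone flatters any model. The precision and recall on the spam class (0.97 each) are the more telling numbers.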

