# Fine-tune a Small Transformer Model
A purpose-built small model that is fine-tuned for one job can be cheaper and faster to run than a general-purpose LLM.

Because the model is small, we can train the entire model (all the weights) on a relatively small machine.

In this notebook we will fine-tune a small model called DistilBERT to classify the sentiment of a text as either positive or negative.
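
To make the "relatively small machine" claim concrete, here is a rough back-of-the-envelope sketch. It assumes about 66 million parameters for DistilBERT (an approximate figure), 32-bit floats, and an Adam-style optimizer that keeps two extra values per weight. It ignores activations, batch data, and framework overhead, so treat the result as a lower bound rather than a measurement.

```python
def estimate_training_memory_gb(num_params, bytes_per_value=4):
    """Rough lower bound on memory for full fine-tuning with an Adam-style optimizer."""
    weights = num_params * bytes_per_value          # the model itself
    gradients = num_params * bytes_per_value        # one gradient per weight
    adam_states = 2 * num_params * bytes_per_value  # two moment estimates per weight
    return (weights + gradients + adam_states) / 1e9

approx_params = 66_000_000  # assumption: approximate DistilBERT parameter count
print(f"{estimate_training_memory_gb(approx_params):.2f} GB before activations")
```

The same arithmetic applied to a model with billions of parameters gives tens of gigabytes, which is why full fine-tuning is practical here but often not for a large LLM.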

In [None]:
from datasets import load_dataset
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

## Load the Dataset
We will use the [IMDB](https://huggingface.co/datasets/stanfordnlp/imdb) dataset. It provides movie review text labelled as either positive (1) or negative (0).

In [None]:
imdb = load_dataset("imdb")

## Inspect the Dataset
Each sample looks something like this.

```json
{
  "text": "I loved the movie.",
  "label": 1
}
```

In [None]:
imdb["test"][0]

## Prepare the Data
For training, the text data needs to be tokenized. We will do that now.

In [None]:
base_model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

In [None]:
def preprocess_function(examples):
    # Truncate each review to the maximum number of tokens accepted by the model (512 for DistilBERT)
    return tokenizer(examples["text"], truncation=True)

tokenized_imdb = imdb.map(preprocess_function, batched=True)

This mapping adds two features to each sample: ``input_ids`` (the token IDs) and ``attention_mask`` (which marks real tokens as opposed to padding). These names are significant: the model's forward pass expects arguments with exactly these names. The ``text`` feature is not used by the model and will be ignored.
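
Because the preprocessing function truncates but does not pad, the tokenized reviews have different lengths. They are padded later, one batch at a time. When the `Trainer` is given a tokenizer it should fall back to `DataCollatorWithPadding` for this. The dependency-free sketch below shows the idea only. The token IDs and the padding ID are made up for illustration, and `pad_batch` is a hypothetical helper, not a library function.

```python
PAD_ID = 0  # assumption: padding ID used for this toy example

def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up token sequences of different lengths
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps short reviews cheap to process.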

In [None]:
tokenized_imdb

## Load the Base Model
Since we plan to tune the base model for a classification task, we should load it using ``AutoModelForSequenceClassification``. This adds a classification head on top of the base model, so we must supply the correct number of classes using ``num_labels``.

In [None]:
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name, 
    num_labels=2,
    device_map="auto")

## Train the Model

In [None]:
training_args = TrainingArguments(
    output_dir="./sentiment-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    processing_class=tokenizer,
)

trainer.train()
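
To get a feel for how long training will run, the number of optimizer steps can be estimated from the settings above. This is a sketch under assumptions: the IMDB training split has 25,000 reviews, training runs on a single device, and there is no gradient accumulation. `total_optimizer_steps` is a hypothetical helper, and with accumulation the `Trainer` may round slightly differently.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                          num_devices=1, gradient_accumulation_steps=1):
    """Approximate number of weight updates for a full training run."""
    effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    # The last, smaller batch of each epoch still counts as a step
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs

steps = total_optimizer_steps(num_examples=25_000, per_device_batch_size=16, num_epochs=5)
print(steps)
```

Halving the number of epochs or doubling the batch size roughly halves this number, which is a quick way to budget training time on a small machine.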

## Save the Model

In [None]:
trainer.save_model()

In [None]:
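# Unload the training objects to free memory.
# The trainer also holds a reference to the model, so it must be deleted too.
del trainer
del base_model
del tokenizer
torch.cuda.empty_cache()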

## Run Inference

In [None]:
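def classify_sentiment(model, tokenizer, text):
    # Tokenize, truncating overly long text, and move the tensors to the model's device
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)

    # Run inference without tracking gradients
    with torch.no_grad():
        logits = model(**inputs).logits

    # Pick the class with the highest logit (0 = negative, 1 = positive)
    predicted_class_id = logits.argmax(dim=-1).item()

    print(predicted_class_id)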

In [None]:
#Load the trained model
model_name = "./sentiment-classifier"

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
classify_sentiment(model, tokenizer, "The movie was awesome!")

In [None]:
classify_sentiment(model, tokenizer, "The food was terrible!")
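
The function above prints only a class ID. The model's raw outputs are logits, and a softmax turns them into probabilities that can be read as a confidence score. The sketch below is dependency-free and uses made-up logits. `ID2LABEL` and `describe_prediction` are hypothetical names chosen for this illustration, following the dataset's convention of 0 for negative and 1 for positive.

```python
import math

ID2LABEL = {0: "negative", 1: "positive"}  # label convention of the IMDB dataset

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def describe_prediction(logits):
    """Return the predicted label and its probability."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[class_id], probs[class_id]

label, confidence = describe_prediction([-1.2, 2.3])  # made-up logits for one review
print(label, round(confidence, 3))
```

Inside the notebook, `torch.softmax(logits, dim=-1)` performs the same computation directly on the logits tensor.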

## Summary
In this notebook we fully fine-tuned a small transformer model called DistilBERT, updating all of its weights. The model learned to classify the sentiment of a text as either positive or negative.