# Hugging Face Transformers Example

The following example is adapted from [this tutorial](https://huggingface.co/docs/transformers/training) to run with a smaller model. In the example, you will fine-tune a small BERT model on the Yelp Review dataset.

For most NLP applications, the Hugging Face Transformers library is a good default. It provides many pre-trained models that you can fine-tune on specific tasks, and you can search for them on the [Hugging Face Hub](https://huggingface.co/docs/hub/index). The Hugging Face ecosystem also supports tokenizers, data processing, NLP metrics, and more.

The typical NLP workflow looks something like this:
1. Obtain a text dataset
2. Convert the text dataset into tokens (integer ids) using a tokenizer
3. Obtain a pre-trained model for your task (sequence classification, token classification, question answering, etc.)
4. Run fine-tuning using your model and tokenized dataset
5. Evaluate your fine-tuned model
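
To make step 2 concrete, here is a toy sketch of turning text into fixed-length integer ids with truncation and padding. It is only an illustration: the whitespace splitting, the `build_vocab` and `encode` helpers, and the `PAD_ID`/`UNK_ID` values are assumptions made for this example, whereas a real BERT tokenizer splits text into subwords and adds special tokens.

```python
# Toy illustration only: real tokenizers split text into subwords and add
# special tokens; this sketch just splits on whitespace.
PAD_ID = 0  # hypothetical id reserved for padding
UNK_ID = 1  # hypothetical id for words missing from the vocabulary


def build_vocab(texts):
    """Assign an integer id to every distinct lowercase word."""
    vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, max_length):
    """Convert text to ids, truncate to max_length, then pad on the right."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]
    return ids + [PAD_ID] * (max_length - len(ids))


vocab = build_vocab(["the food was great", "the service was slow"])

# Short input: padded with PAD_ID up to max_length
print(encode("the food was slow", vocab, max_length=6))

# Long input with unknown words: mapped to UNK_ID and truncated to max_length
print(encode("the pizza was great and the service was slow", vocab, max_length=6))
```

Every encoded sequence has the same length, which is what lets a batch of reviews be stacked into a single tensor for the model.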

In [None]:
# If this cell says "False", you need to switch to a GPU (T4) runtime
# Do this under Runtime > Change Runtime Type > T4
# This will restart the runtime for a short while and you can try again
import torch
torch.cuda.is_available()

In [None]:
# Install HuggingFace packages into Google Colab
!pip install transformers evaluate datasets accelerate

## Pre-process dataset

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# TODO: Load the yelp_review_full dataset
dataset = None
#####

In [None]:
# View an example
dataset["train"][101]

In [None]:
# Create smaller train/test datasets for demo
train_dataset = dataset["train"].shuffle(seed=42).select(range(10_000))
test_dataset = dataset["test"].shuffle(seed=42).select(range(1_000))

In [None]:
# TODO: Get the model path for bert-small from huggingface.co/models
hf_model_path = ""  
#####

In [None]:
# Load tokenizer and process the text data
tokenizer = AutoTokenizer.from_pretrained(hf_model_path)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

## Set up the model

In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import evaluate
import numpy as np
import torch

In [None]:
# TODO: Load the model from a pre-trained checkpoint and set it up for text classification
model = None
#####


Note that you can specify other hyperparameters, such as the optimizer, learning rate, batch size, and number of epochs, in the `TrainingArguments`. See the [documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) for more information.
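
As a rough sketch of how batch size and epoch count translate into training steps, the arithmetic below uses the values from this notebook (10,000 training examples, batch size 128, 3 epochs, evaluation every 25 steps). It assumes a single GPU, no gradient accumulation, and that the last partial batch of each epoch is kept; the variable names are illustrative only.

```python
import math

# Values mirror the settings used in this notebook; single-GPU training and
# no gradient accumulation are assumptions.
num_examples = 10_000
batch_size = 128
num_epochs = 3
eval_every = 25

# The last batch of each epoch is smaller than batch_size, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Evaluations scheduled at steps 25, 50, 75, ...
num_evaluations = total_steps // eval_every

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
print(f"scheduled evaluations: {num_evaluations}")
```

Under these assumptions, doubling the batch size roughly halves the number of steps, so `eval_steps` and `logging_steps` may need adjusting to keep the same number of evaluation points.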

In [None]:
BATCH_SIZE = 128

# Set up training arguments
training_args = TrainingArguments(
    output_dir="test_trainer",
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    eval_strategy="steps",
    eval_steps=25,
    logging_strategy="steps",
    logging_steps=25,
    num_train_epochs=3
)

In [None]:
# Define how metrics will be computed
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
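
To see what `compute_metrics` calculates, the sketch below reproduces the same argmax-then-accuracy logic in plain Python, without NumPy or the `evaluate` library. The logits and labels are made up for illustration, and the `argmax` and `accuracy` helpers are hypothetical names, not part of any library.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four reviews and five classes (0-4 correspond to 1-5 stars)
logits = [
    [2.1, 0.3, -1.0, -0.5, -2.0],  # highest score at class 0
    [-1.2, 0.1, 1.9, 0.4, -0.3],   # highest score at class 2
    [-2.0, -1.1, 0.2, 1.5, 1.4],   # highest score at class 3
    [-0.7, -0.2, 0.0, 0.9, 2.3],   # highest score at class 4
]
labels = [0, 2, 4, 4]

print(accuracy(logits, labels))  # three of four predictions match
```

Note that accuracy treats a prediction of 4 stars for a 5-star review the same as a prediction of 1 star, so it can understate how close the model is on an ordinal task like star ratings.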

In [None]:
# TODO: Set up the trainer for the PyTorch model
trainer = Trainer(
)
#####

In [None]:
# TODO: Run the training

#####

Note that the model starts to overfit with further training: validation loss and accuracy are best around epoch 2-3. This happens because our training set is small (10,000 examples).
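
One common way to act on this observation is to keep the checkpoint with the lowest validation loss and stop once the loss stops improving. The sketch below shows that idea in plain Python. The loss values are made up, and `best_step` and `patience` are hypothetical names chosen for this example rather than the `Trainer` API.

```python
def best_step(eval_losses, patience=2):
    """Return (index, loss) of the best evaluation, stopping once the loss
    has failed to improve for `patience` consecutive evaluations."""
    best_idx, best_loss, bad_evals = 0, float("inf"), 0
    for idx, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_idx, best_loss, bad_evals = idx, loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break  # the loss has stopped improving
    return best_idx, best_loss


# Made-up validation losses: they fall at first, then rise as the model overfits
eval_losses = [1.42, 1.10, 0.98, 0.95, 0.97, 1.03, 1.12]
idx, loss = best_step(eval_losses)
print(f"Best evaluation: index {idx}, loss {loss}")
```

A larger `patience` tolerates noisy validation curves at the cost of training for longer before stopping.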

## Test on an example

In [None]:
# Same example as before
test_text = dataset["train"][101]["text"]
test_label = dataset["train"][101]["label"]

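# Tokenize into PyTorch tensors and move them to the same device as the model
# (named `inputs` to avoid shadowing Python's built-in `input`)
inputs = tokenizer(test_text, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
inputs = inputs.to(trainer.model.device)

In [None]:
model = trainer.model
model.eval()  # disable dropout for inference
with torch.no_grad():  # gradients are not needed for prediction
    output = model(**inputs)
prediction = torch.argmax(output.logits, dim=-1)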
print(f"Model rating is {prediction.item() + 1} star(s), actual rating is {test_label + 1} star(s)")
test_text
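
The model returns one raw score (logit) per class, and `argmax` picks the highest. To read those scores as probabilities, you can apply a softmax. The sketch below does this in plain Python on made-up logits. The `softmax` helper is written out for illustration, and it assumes label `i` corresponds to `i + 1` stars, as in the print statement above.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a single review; index i corresponds to i + 1 stars
logits = [-1.5, -0.4, 0.3, 2.2, 1.1]
probs = softmax(logits)

label = max(range(len(probs)), key=probs.__getitem__)
stars = label + 1
print(f"Predicted {stars} star(s) with probability {probs[label]:.2f}")
```

Softmax does not change which class wins, so it is only needed when you want a confidence estimate alongside the prediction.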

Go to [Yelp Waterloo Restaurants](https://www.yelp.com/search?find_desc=Food&find_loc=Waterloo%2C+ON), pick one review, and see how the model does on that!

In [None]:
# TODO: Find a review online and paste it here, with the number of actual stars
test_text = """
<Paste text here>
"""
actual_stars = None  # Just take the stars from an online review of your choice
#####

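# Tokenize into PyTorch tensors and move them to the same device as the model
inputs = tokenizer(test_text, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
inputs = inputs.to(model.device)
with torch.no_grad():  # gradients are not needed for prediction
    output = model(**inputs)
prediction = torch.argmax(output.logits, dim=-1)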
print(f"Model rating is {prediction.item() + 1} star(s), actual rating is {actual_stars} star(s)")