# 🧠 Mini Workshop 3: Fine-Tuning DistilBERT for Sentiment Classification

Welcome to your first language model fine-tuning task! In this notebook, you'll take a pretrained model that already knows English and teach it to recognize **sentiment** from short user reviews.

We'll use **DistilBERT**, a smaller and faster version of BERT, and fine-tune it on the **Yelp Polarity** dataset using Hugging Face's `transformers` and `datasets` libraries.

## 📚 Part 1: What Is Fine-Tuning?
Pretrained language models like BERT have already been trained on a huge amount of text, so they start out with a broad, general-purpose grasp of English vocabulary, grammar, and context.

> In this workshop, we’ll use a pretrained model and **fine-tune** it: we continue training it on a small labeled dataset so that it picks up a new skill, recognizing sentiment.

This is the same idea behind many state-of-the-art models you hear about today — adapting a general-purpose model to a specific task!

## 🧪 Part 2: Load and Tokenize Dataset
Before we can train our model, we need to prepare the text data so it can be understood by a neural network.

- We're using a small subset of the **Yelp Polarity** dataset, which contains Yelp reviews labeled as negative (`0`) or positive (`1`).
- The **tokenizer** splits each review into tokens (word pieces), maps them to numerical input IDs, truncates reviews that are too long, adds padding where needed, and creates an **attention mask** that tells the model which positions hold real tokens and which hold padding.
- We'll use the tokenizer that matches our pretrained model (DistilBERT).

This step turns raw text into model-ready inputs.
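
To make input IDs, padding, and the attention mask concrete, here is a minimal pure-Python sketch. It is a toy illustration only: `TOY_VOCAB` and `toy_encode` are hypothetical names, the vocabulary is made up, and the real DistilBERT tokenizer works on subword pieces and adds special tokens such as `[CLS]` and `[SEP]`.

```python
# Toy illustration only: a tiny made-up vocabulary, not DistilBERT's real one.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "great": 2, "food": 3, "terrible": 4, "service": 5}

def toy_encode(text, max_length):
    """Map words to IDs, truncate, pad, and build the attention mask."""
    # Unknown words fall back to the [UNK] ID
    ids = [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in text.lower().split()]
    ids = ids[:max_length]                       # truncation
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [TOY_VOCAB["[PAD]"]] * n_pad,    # pad up to max_length
        "attention_mask": [1] * n_real + [0] * n_pad,       # 1 = real token, 0 = padding
    }

encoded = toy_encode("Great food", max_length=4)
print(encoded)
# {'input_ids': [2, 3, 0, 0], 'attention_mask': [1, 1, 0, 0]}
```

The two zeros at the end of `attention_mask` line up with the two padding IDs, which is how the model knows to ignore them.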

In [None]:
# Load a small portion of the Yelp Polarity dataset for quick experimentation
from datasets import load_dataset
from transformers import DistilBertTokenizerFast

# Load the first 1,000 training examples from the Yelp Polarity dataset
# and split them 80/20 into training and testing sets
dataset = load_dataset("yelp_polarity", split="train[:1000]").train_test_split(test_size=0.2)

# Load the tokenizer that matches our model (DistilBERT)
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def tokenize(example):
    # Truncate long reviews and pad every review to the model's maximum length
    return tokenizer(example['text'], truncation=True, padding='max_length')

# Tokenize the dataset
# and remove the original text column
tokenized_dataset = dataset.map(tokenize, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(['text'])
tokenized_dataset.set_format('torch')

## 🧠 Part 3: Load Pretrained Model
Now we load a **DistilBERT model** that’s already been trained on a huge collection of English text.

- This model already understands grammar, vocabulary, and context to a decent degree.
- We'll fine-tune it for our specific task: **classifying reviews as positive or negative**.
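- We use a version of DistilBERT that includes a **classification head**: a small stack of final layers that turns the model's internal representation into class predictions. The head is newly initialized rather than pretrained, so expect a warning about uninitialized weights when the model loads; fine-tuning is what trains it.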

Think of this as giving the model a final coaching session focused only on Yelp reviews.
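
As a rough picture of what a classification head computes, the sketch below maps a vector to one score (logit) per class with a single linear layer. This is a simplification under stated assumptions: the real head has more than one layer, the 4-dimensional vector and all weights are made-up numbers, and `linear_head` is a hypothetical helper rather than a library function.

```python
# Simplified sketch: one linear layer mapping a hidden vector to two logits.
# The vector and all weights below are made-up illustrative numbers.
def linear_head(hidden, weights, bias):
    """Return one logit per class: logit[c] = sum_i weights[c][i] * hidden[i] + bias[c]."""
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weights, bias)]

hidden = [0.5, -1.0, 2.0, 0.0]          # stand-in for the model's representation of a review
weights = [[0.2, 0.4, -0.1, 0.3],       # row 0: "negative" class
           [-0.3, 0.1, 0.5, 0.2]]       # row 1: "positive" class
bias = [0.0, 0.1]

logits = linear_head(hidden, weights, bias)
predicted = logits.index(max(logits))   # index of the highest-scoring class
print([round(x, 2) for x in logits], predicted)
```

Fine-tuning adjusts weights like these (along with the rest of the model) so that the higher logit lands on the correct class.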

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)

## 🔁 Part 4: Fine-Tune the Model

We're now ready to train our model using Hugging Face's `Trainer` class — a powerful tool that takes care of most of the training loop for us.

Here’s what each new component does:

- **`TrainingArguments`**: This is where we configure how training should work. We set:
  - `output_dir`: where to save the model
  - `eval_strategy="epoch"`: run evaluation at the end of every epoch
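  - `learning_rate`: the step size of each weight update
  - `per_device_train_batch_size` / `per_device_eval_batch_size`: batch size during training and evaluation
  - `num_train_epochs`: how many passes through the training set
  - `weight_decay`: strength of the weight-decay regularization applied by the optimizer
  - `logging_dir`: where to write training logs
  - `logging_steps`: how often (in training steps) to log metrics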

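- **`DataCollatorWithPadding`**: This pads each batch so its examples can be stacked into tensors: every sequence is padded to the length of the longest one in the batch. In this notebook the reviews were already padded to the maximum length during tokenization, so the collator has no extra padding to add.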

- **`Trainer`**: This is the high-level training loop. It handles:
  - Feeding batches to the model
  - Computing loss
  - Running backpropagation
  - Evaluating on the test set
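
The per-batch padding idea behind the data collator can be sketched in a few lines of plain Python. This is an illustration, not the library's implementation: `pad_batch` is a hypothetical helper, the token ID lists are made up, and 512 is assumed as the fixed maximum length for comparison.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in one batch to the length of that batch's longest sequence."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

# Made-up token ID lists of different lengths, grouped into two batches of two.
sequences = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 102], [101, 5, 6, 102]]
batches = [sequences[0:2], sequences[2:4]]

padded = [pad_batch(b) for b in batches]
dynamic_total = sum(len(seq) for b in padded for seq in b)   # positions after per-batch padding
fixed_total = len(sequences) * 512                           # positions if all were padded to 512

print(padded[0])                    # [[101, 7, 102, 0, 0], [101, 7, 8, 9, 102]]
print(dynamic_total, fixed_total)   # 18 2048
```

Padding per batch keeps batches only as wide as they need to be, which is why it is usually cheaper than padding everything to one fixed length.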

Once it’s set up, we simply call `.train()` and the whole training process runs for us!
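
To get a feel for how long training will run, you can estimate the number of optimizer steps from the values used in this notebook's code cells. The sketch below is a back-of-the-envelope calculation that assumes a single training device and no gradient accumulation.

```python
import math

# Values taken from this notebook's code cells; a single training device is assumed.
num_examples = 1000
test_size = 0.2
batch_size = 16
num_epochs = 5
logging_steps = 10

train_examples = round(num_examples * (1 - test_size))      # 800 reviews left for training
steps_per_epoch = math.ceil(train_examples / batch_size)    # 50 batches per epoch
total_steps = steps_per_epoch * num_epochs                  # 250 weight updates in total
log_events = total_steps // logging_steps                   # about 25 training-loss log lines

print(train_examples, steps_per_epoch, total_steps, log_events)
# 800 50 250 25
```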


In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
)

# Data collator for padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer) 

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator
)

trainer.train()

## 🔍 Part 5: Make Predictions!

Now that we have our fine-tuned model, we can see its performance on some sample statements.

Try out your own statements below!
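
The prediction function in the next cell turns the model's raw scores (logits) into probabilities with softmax and then picks the most likely label. Here is a minimal pure-Python sketch of that step; the logits are made-up numbers and `softmax` is written by hand for illustration rather than taken from a library.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Negative", "Positive"]
logits = [-1.2, 2.3]                                 # made-up example logits
probs = softmax(logits)
predicted = labels[probs.index(max(probs))]          # label with the highest probability

print(predicted, [round(p, 4) for p in probs])
```

The larger the gap between the two logits, the closer the winning probability gets to 1, which is what the "confidence" figure reports.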

In [None]:
import torch

def predict_sentiment(text):
    # Tokenize and move to model's device
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # Run model in evaluation mode (dropout off) without tracking gradients
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits.softmax(dim=-1).squeeze().tolist()

    # Format results
    labels = ["Negative", "Positive"]
    predicted_class = labels[probs.index(max(probs))]
    confidence = max(probs)

    print(f"\n📝 Input: {text}")
    print(f"🤖 Prediction: {predicted_class} ({confidence*100:.2f}% confidence)")
    print(f"📊 Raw probabilities: {dict(zip(labels, [f'{p:.4f}' for p in probs]))}")


predict_sentiment("This place was amazing! The food was delicious and the service was great.")
predict_sentiment("I had a terrible experience. The food was cold and the staff was rude.")
predict_sentiment("The ambiance was nice, but the food was just ok.")

In [None]:
# Try your own text here!
predict_sentiment("")

### 🎓 Recap
Congratulations! You just fine-tuned a real language model on a real task.

Now you can classify text using your customized DistilBERT model. In the next notebook, we’ll try something more expressive — emotion classification!