# Fine-Tuning Hugging Face DistilBERT for Emotion Classification

<img src="https://www.developer-tech.com/wp-content/uploads/sites/3/2023/08/intel-pytorch-foundation-ai-development-artificial-intelligence-coding-programming-machine-learning.jpg" alt="Alt Text" style="width: 400px;"/>

This notebook demonstrates how to fine-tune the `distilbert-base-uncased` model for multi-class emotion classification using Hugging Face's Transformers library. It covers data preprocessing, training, evaluation, and inference.

---

## Why This is Important

Fine-tuning a pre-trained model on a specific task such as emotion classification lets developers build NLP applications without training a model from scratch. This hands-on guide will help you:

- Load and preprocess a dataset with the Hugging Face Datasets library.
- Fine-tune a model for a specific classification task.
- Evaluate and test the model's performance on new data.

---


#### Environment Setup and Dependencies Installation

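The install commands in this cell are commented out; uncomment them to install the pinned versions of `transformers`, `torch`, `datasets`, and `accelerate` used to fine-tune the model. NumPy and scikit-learn, which are imported in the next cell, must also be available in the environment.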


In [2]:
# !pip install transformers==4.35.2
# !pip install torch==2.1.0
# !pip install datasets==2.16.1
# !pip install accelerate==0.26.0


#### Loading Libraries and Packages

Import the core libraries required for the task. These include Hugging Face's Transformers for the model and tokenizer, the Datasets library for data handling, NumPy and scikit-learn for computing accuracy, and PyTorch for inference.


In [3]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score
import torch


#### Dataset Loading

Load the `emotion` dataset from Hugging Face's Datasets library. The notebook trains on its `train` split and evaluates on its `validation` split.


In [4]:
dataset = load_dataset("emotion")


#### Model and Tokenizer Initialization

Initialize the pre-trained `distilbert-base-uncased` model and tokenizer. The model is configured for sequence classification with six labels.


In [5]:
model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)
tokenizer = AutoTokenizer.from_pretrained(model_name)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
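
The six class ids can also be given readable names when the model is loaded. The sketch below builds the two mapping dictionaries from the label order used in this notebook's inference cell. The variable name `label_names` is an assumption made for illustration, and the `from_pretrained` call is shown only as a comment, not executed.

```python
# Label order taken from the label_map used in the inference cell of this notebook.
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Forward and reverse mappings between class ids and names.
id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in id2label.items()}

print(id2label)
print(label2id)

# These could then be passed when loading the model (not executed here):
# model = AutoModelForSequenceClassification.from_pretrained(
#     model_name, num_labels=len(label_names), id2label=id2label, label2id=label2id
# )
```

With the mappings stored in the model config, predictions can be reported by name without keeping a separate dictionary alongside the model.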


#### Data Preprocessing

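Preprocess the dataset by tokenizing the `text` column so the inputs are compatible with the model. With `padding='max_length'`, every example is padded to the tokenizer's maximum length (512 tokens for this checkpoint), and `truncation=True` cuts off any input that is longer than that.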


In [6]:
def preprocess_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

# Apply preprocessing
encoded_dataset = dataset.map(preprocess_function, batched=True)
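
To make the two tokenizer options concrete, here is a toy sketch of what fixed-length padding and truncation do to a sequence of token ids. The helper `pad_and_truncate`, the token ids, the pad id of 0, and the tiny `max_length` of 8 are all assumptions for illustration. The real tokenizer additionally preserves special tokens when it truncates, which this toy version does not.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Toy version of fixed-length padding with truncation (hypothetical helper)."""
    # Truncate sequences that are too long.
    ids = list(token_ids[:max_length])
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(ids)
    # Pad short sequences on the right up to max_length.
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }

short = pad_and_truncate([101, 7, 8, 102], max_length=8)   # gets padded
long = pad_and_truncate(list(range(1, 13)), max_length=8)  # gets truncated
print(short)
print(long)
```

Every example comes out at the same length, which is what lets the `Trainer` stack them into batches. The cost is that short sentences carry many padding positions.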


#### Training the Model

Define training arguments and set up the Hugging Face `Trainer` for fine-tuning the model.


In [7]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
)

# Define the compute metrics function
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=-1)
    return {'accuracy': accuracy_score(labels, preds)}

# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.2766        | 0.212747        | 0.921    |


TrainOutput(global_step=1000, training_loss=0.5033166809082031, metrics={'train_runtime': 1354.9642, 'train_samples_per_second': 11.808, 'train_steps_per_second': 0.738, 'total_flos': 2119629570048000.0, 'train_loss': 0.5033166809082031, 'epoch': 1.0})
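
The numbers in `TrainOutput` can be sanity-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and a training split of 16,000 examples (an assumption, though one consistent with 1,000 steps at batch size 16). The variable names are illustrative.

```python
import math

num_train_examples = 16_000   # assumed size of the training split
batch_size = 16               # per_device_train_batch_size
num_epochs = 1                # num_train_epochs
train_runtime = 1354.9642     # seconds, from the logged metrics

# One optimizer step per batch, so steps per epoch is the number of batches.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Throughput figures derived from the runtime.
samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)
print(round(samples_per_second, 3))
print(round(steps_per_second, 3))
```

Under these assumptions the computed step count and throughput line up with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output above.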

#### Model Evaluation

Evaluate the model's performance on the validation dataset to check its accuracy in emotion classification.


In [8]:
trainer.evaluate()


{'eval_loss': 0.21274743974208832,
 'eval_accuracy': 0.921,
 'eval_runtime': 49.8103,
 'eval_samples_per_second': 40.152,
 'eval_steps_per_second': 2.51,
 'epoch': 1.0}
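
To see what `compute_metrics` does with the model outputs, here is a toy example with made-up logits for four examples and six classes. The arrays are invented for illustration, and accuracy is computed directly with NumPy rather than with `accuracy_score`; for plain class labels the two give the same fraction.

```python
import numpy as np

# Made-up logits: 4 examples x 6 classes (not real model output).
logits = np.array([
    [2.0, 0.1, 0.3, 0.2, 0.1, 0.0],   # highest score at class 0
    [0.1, 3.0, 0.2, 0.1, 0.0, 0.3],   # highest score at class 1
    [0.2, 0.1, 0.3, 0.1, 2.5, 0.0],   # highest score at class 4
    [0.1, 1.5, 0.2, 2.0, 0.1, 0.3],   # highest score at class 3
])
labels = np.array([0, 1, 4, 1])       # made-up ground truth

# Predicted class is the index of the largest logit in each row.
preds = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = float((preds == labels).mean())
print(preds.tolist(), accuracy)
```

The `eval_accuracy` of 0.921 reported above is this same fraction, computed over the whole validation split.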

#### Inference and Testing

Test the fine-tuned model on new sentences to predict their associated emotions.


In [10]:
# Define test sentences
test_sentences = [
    "I am feeling incredibly happy and joyful today!",
    "I am so sad and down.",
    "I have mixed feelings about this.",
    "This is absolutely terrifying!",
]

# Preprocess the test sentences
encoded_input = tokenizer(test_sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

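# Predict using the fine-tuned model. After training, the model may sit on an
# accelerator (for example MPS on Apple Silicon) while the tokenizer returns CPU
# tensors. Mixing the two raises
# "RuntimeError: Placeholder storage has not been allocated on MPS device!",
# so move the inputs to the model's device first.
encoded_input = {name: tensor.to(model.device) for name, tensor in encoded_input.items()}
model.eval()
with torch.no_grad():
    predictions = model(**encoded_input)

# Convert predictions to human-readable labels (move logits back to CPU for NumPy)
predicted_labels = np.argmax(predictions.logits.cpu().numpy(), axis=1)

# Mapping for the 'emotion' dataset labels
label_map = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

# Print predictions
for sentence, label_id in zip(test_sentences, predicted_labels):
    print(f"Sentence: '{sentence}' - Emotion Prediction: {label_map[label_id]}")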
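
Taking the argmax of the logits discards how confident the model is in each prediction. A minimal sketch of turning one row of logits into probabilities with a softmax follows. The logits are made up, and the `softmax` helper is written out by hand for illustration rather than taken from any library.

```python
import math

label_map = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                            # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a single sentence (not real model output).
logits = [0.2, 3.1, 0.5, -0.4, -1.0, 0.1]
probs = softmax(logits)

# Index of the most probable class, equivalent to argmax over the logits.
best = max(range(len(probs)), key=probs.__getitem__)
print(label_map[best], round(probs[best], 3))
```

Reporting the probability next to the label helps with inputs such as "I have mixed feelings about this.", where the top class may win by only a small margin.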

# Conclusion and Discussion

### Conclusion

This notebook demonstrated how to fine-tune a pre-trained Hugging Face model for emotion classification. It walked through data preprocessing, model training, evaluation, and inference, and the fine-tuned `distilbert-base-uncased` model reached 92.1% accuracy on the validation split after a single epoch.

### Discussion

Hugging Face's `Trainer`, tokenizer, and Datasets APIs handle most of the training loop and data plumbing, so developers can focus on the task rather than on configuration. As the single-epoch result here suggests, fine-tuning a compact model like DistilBERT can give strong results on a text classification task, and the same workflow carries over to other NLP tasks.
