# Multi-Class Text Classification - AI model for sentiment analysis

## Overview

**Sentiment Analysis** is a common Natural Language Processing (NLP) technique used to determine the emotional tone behind a body of text. It involves classifying text into predefined categories, which helps teams gain insights and monitor various aspects of business intelligence.

This notebook showcases a practical application: building a model that classifies short pieces of text with a label describing their overall emotion or tone. <br />

The implementation uses libraries and utilities from [Hugging Face](https://huggingface.co/). For more in-depth information, please refer to the official curriculum.

![Hugging Face](./media/hf-logo-with-title.png)

## Problem

The evaluation system that your company uses to gather performance reviews from colleagues is getting old and outdated. Most likely it is written in a language you haven't even heard of. The employees, including yourself, are getting annoyed at having to deal with it, and your managers are getting frustrated sorting the good reviews from the bad ones. <br />

But... Luckily, you have been assigned to lead an initiative to revamp the architecture and, the selling point is, you are **REQUIRED** to use AI for classifying reviews. Some would even go so far as to say that this is the only reason the project was approved by stakeholders. Anyway! This is your task and you **MUST** complete it by the end of the sprint.

See you in Stand Up!

## Approach

The scope of this workshop is to gain insights and hands-on experience with the latest, state-of-the-art tooling for building and fine-tuning AI models. <br />
For this purpose, the platform of choice is **Hugging Face**, a company that maintains a series of open-source, performant libraries that are very popular in the ML space. <br />

The model's architecture will start with a pre-trained model (*distilbert-base-uncased*), which will be fine-tuned on a specific dataset. <br />
In the following sections, the necessary steps and other basic workflows will be addressed and discussed:
- Dependency management and GPU availability 
- Data discovery and pre-processing 
- Model architecture and parametrization
- Training, analysis and evaluation
- Inference

## Dependency management

In this step, the required dependencies are installed (if not already satisfied) in the Python environment. Then, the availability of GPU cores is checked and torch is set to use the corresponding backend. <br/>
Training and using models on GPU cores is much faster, as they are optimized for parallel compute.

In this notebook, the AI model will run on an architecture built on top of PyTorch. Libraries and tooling from Hugging Face also support TensorFlow. <br />
It is important to understand that Hugging Face provides only a unified API for multiple AI [Tasks](https://huggingface.co/tasks), so further customization and optimization requires knowledge of at least one of the two popular ML libraries.

References:
- Check out: [PyTorch](https://pytorch.org/)
- Check out: [TensorFlow](https://www.tensorflow.org/)
- Check out the Tasks supported by Hugging Face: [Tasks](https://huggingface.co/tasks)

In [None]:
%%capture
%pip install matplotlib seaborn
%pip install datasets transformers torch evaluate scikit-learn sentencepiece protobuf nltk
%pip install accelerate --upgrade

In [None]:
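import torch

# Pick the best available backend: CUDA (NVIDIA GPUs), then MPS (Apple Silicon), else CPU
if torch.cuda.is_available():
  device = torch.device('cuda')
  print('training available on GPU cores (CUDA)')
elif torch.backends.mps.is_available():
  device = torch.device('mps')
  print('training available on GPU cores (MPS)')
else:
  device = torch.device('cpu')
  print('training available on CPU cores only')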

## Data discovery and pre-processing

One of the most crucial steps in building AI systems is selecting and processing the right data, in the right amount. Datasets are constructed from records
consisting of pairs of input (in this case, the *text* or the *query*) and the corresponding label(s). In some cases, datasets are already split into subsets
for **training**, **testing**, and **evaluation**.

`Hugging Face` provides, besides model architectures, a multitude of datasets contributed and maintained by the community. For the given problem, we will use the `mteb/emotion` dataset.

References:
- Datasets available on Hugging Face, based on tasks: [Datasets](https://huggingface.co/datasets)
- Dataset used in this notebook: [mteb/emotion](https://huggingface.co/datasets/mteb/emotion)

### Loading and quick inspection of the data

First, let's check the dataset in order to get familiar with the available data. Datasets from Hugging Face come in different formats and may or may not be ready for further pre-processing steps. <br />

Loading, querying and altering the dataset can be easily done using the `datasets` library.

In [None]:
from datasets import load_dataset

# Declare the official path of the dataset from Hugging Face Datasets
dataset_path = 'mteb/emotion'

# Load the dataset and check its inner structure
# TODO

Feel free to inspect the records from the **training**, **testing** or **evaluation** dataset. You can do so by modifying the next line of code.

In [None]:
# Inspect different records in the dataset by changing the split type or the index
dataset['train'][5]

### Data discovery through visualisation techniques

Let's plot some charts in order to visualize the distribution of the labels in the dataset.

But first, let's define some mappings between the ids and the corresponding labels:

In [None]:
# Declare the number of labels that exist in the dataset
# TODO

# Declare a mapping between the ID of the label and the corresponding emotion label
# TODO

# Declare a mapping between the emotion label and the corresponding ID, programmatically
# TODO
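
As a hint for the cell above, the two mappings can be sketched as a pair of dictionaries. The label names and their order below are an assumption (the six emotions commonly listed for this dataset) and should be confirmed against the `label` / `label_text` pairs seen during inspection.

```python
# Assumed label order; confirm it against the `label` / `label_text` pairs in the dataset
id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

# The number of labels follows directly from the mapping
num_labels = len(id2label)

# Build the reverse mapping programmatically instead of typing it twice
label2id = {label: idx for idx, label in id2label.items()}

print(num_labels, label2id)
```

Deriving `label2id` from `id2label` keeps the two mappings in sync if a label is ever renamed.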

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Convert the Hugging Face dataset to a Pandas DataFrame
# TODO

# Create a Bar Plot to visualize the distribution of the emotion labels
# TODO

In [None]:
# Declare a variable to store the occurrences of every emotion label
# TODO

# Create a Pie Chart to visualize the distribution of the emotion labels in percentage
# TODO
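
The percentages behind a pie chart are simple counts divided by the total. The sketch below shows that arithmetic on a small, made-up list of label ids (an assumption standing in for the dataset's `label` column), using only the standard library.

```python
from collections import Counter

# Hypothetical label ids standing in for the dataset's `label` column
labels = [0, 1, 1, 3, 1, 0, 4, 1, 2, 0]

# Count how often each label occurs
counts = Counter(labels)
total = sum(counts.values())

# Share of each label as a percentage, ordered by label id
percentages = {label: 100 * count / total for label, count in sorted(counts.items())}

for label, pct in percentages.items():
    print(f"label {label}: {counts[label]} samples ({pct:.1f}%)")
```

A heavily skewed distribution at this stage is a signal to look beyond accuracy when evaluating the model.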

In [None]:
# Add a new column in the data frame to store the length of the text
# TODO

plt.figure(figsize=(10, 6))
for emotion in df['label'].unique():
    sns.histplot(df[df['label'] == emotion]['text_length'], kde=True, label=f"Emotion {emotion}", bins=10)

plt.title("Histogram of Text Length by Emotion")
plt.xlabel("Text Length")
plt.ylabel("Frequency")

# Add the emotion ids as a legend in the visualization
# TODO

plt.show()

In [None]:
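import string
import nltk

# Tokenizer and POS-tagger resources; recent NLTK releases look for the
# `punkt_tab` and `averaged_perceptron_tagger_eng` variants instead
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

def pos_count(text, pos_type):
    # Count the tokens whose Penn Treebank POS tag starts with `pos_type` (e.g. 'NN' for nouns)
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    return len([word for word, pos in pos_tags if pos.startswith(pos_type)])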

df['noun_count'] = df['text'].apply(lambda x: pos_count(x, 'NN'))  # Nouns
df['verb_count'] = df['text'].apply(lambda x: pos_count(x, 'VB'))  # Verbs
df['adj_count'] = df['text'].apply(lambda x: pos_count(x, 'JJ'))  # Adjectives
df['adv_count'] = df['text'].apply(lambda x: pos_count(x, 'RB'))  # Adverbs
df['punctuation_count'] = df['text'].apply(lambda x: len([char for char in x if char in string.punctuation]))
correlation_matrix = df[['text_length', 'label', 'adj_count', 'adv_count']].corr()

# Create a Heat Map to visualize the correlation between the features defined above
plt.figure(figsize=(6, 5))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap of Text Features")
plt.show()

### Data pre-processing

Next, a series of data pre-processing techniques are required in order to prepare the dataset for training.

We start by removing the `label_text` column because it is irrelevant to the model (which is only interested in the label ids) and because we aim to load as little data as possible into memory, which keeps processing fast and memory usage low.

In [None]:
# Remove the label_text column from the Hugging Face dataset
# TODO

Then, we will define a function that will process a corpus of text in order to simplify the input and to get rid of tokens irrelevant to learning.

In [None]:
# Implement a `preprocess_text` function that pre-processes the raw text,
# by mapping the text to lowercase, removing punctuation and numbers,
# removing stopwords and applying the lemmatization process.

def preprocess_text(text):
  raise NotImplementedError("Function not implemented")

preprocess_text("John did a great job in this quarter, he definetely needs a raise.")
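
One possible starting point for the function above is sketched below with the standard library only. The stopword set is a tiny illustrative assumption and lemmatization is deliberately left out; in the notebook, NLTK's stopword corpus and `WordNetLemmatizer` could fill those gaps. The function is named `preprocess_text_sketch` (a hypothetical name) so it does not shadow the exercise.

```python
import string

# Tiny illustrative stopword set (assumption); NLTK's stopwords corpus is far larger
STOPWORDS = {"a", "an", "the", "in", "this", "he", "she", "it", "did", "is", "and", "to", "of"}

# Translation table that deletes punctuation characters and digits
_STRIP_TABLE = str.maketrans("", "", string.punctuation + string.digits)

def preprocess_text_sketch(text: str) -> str:
    """Lowercase, drop punctuation/digits, and remove stopwords (no lemmatization)."""
    cleaned = text.lower().translate(_STRIP_TABLE)
    tokens = [token for token in cleaned.split() if token not in STOPWORDS]
    return " ".join(tokens)

result = preprocess_text_sketch("John did a great job in this quarter, he definitely needs a raise.")
print(result)
```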

At this point, the fun part *technically* (pun intended) begins. We will start to use different abstractions provided by `Hugging Face` to deal with tokenization and batching techniques.

For our given task, the model of choice and the starting point for the fine-tuning process is the `distilbert/distilbert-base-uncased` model - a smaller, faster version of Google's BERT model.

In [None]:
# Define the model path from the Hugging Face Models, and initialize the tokenizer
# TODO

# Implement a `preprocess_function` method needed for the tokenization process
# TODO

With the tokenizer in place and configured, we can start processing the dataset and prepare it for training.

In [None]:
# Define the train and test subsets and apply the tokenization process
# Data should be shuffled before applying the tokenization process
# TODO

### Data Collator

A Data Collator is a function or class that processes individual samples from a dataset and transforms them into a batch. This involves several tasks, such as:

1. **Padding**: Ensuring that all sequences in a batch have the same length by adding padding tokens to shorter sequences. This is essential because models typically require inputs of the same length for batch processing.

2. **Tensor Conversion**: Converting lists of inputs (e.g., token IDs, attention masks) into tensors that can be fed into a deep learning model.

3. **Mask Creation**: Creating attention masks to distinguish between real tokens and padding tokens.

Why? There are multiple benefits of using a Data Collator but, at a high level, **efficiency** and **compatibility** are the key ones.

- **Efficiency**: Batching samples together is computationally efficient and takes advantage of parallel processing on GPUs.
- **Compatibility**: Many models require fixed-size inputs, making padding and other preprocessing steps essential.
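
To make the padding and mask-creation steps concrete, here is a minimal pure-Python sketch. The helper `pad_batch` and the token ids are hypothetical; in practice the `transformers` library ships `DataCollatorWithPadding`, which performs this kind of dynamic padding and also converts the result to tensors.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad token-id lists to the longest sequence and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        padding = max_len - len(ids)
        # Real tokens keep their ids; the tail is filled with the pad token
        input_ids.append(ids + [pad_token_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two hypothetical tokenized samples of different lengths
padded = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2307, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding to the longest sequence in each batch, rather than to a global maximum, avoids spending compute on padding tokens that no sample in the batch needs.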


In [None]:
# Define a Data Collator 
# TODO

## Evaluation metrics

### Accuracy

**Accuracy** is the proportion of correct predictions among the total number of cases processed. It can be computed with: 

`Accuracy = (TP + TN) / (TP + TN + FP + FN)`

Where: 
- `TP: True positive`
- `TN: True negative` 
- `FP: False positive` 
- `FN: False negative`

**Limitations and Bias**
This metric can be easily misleading, especially in the case of unbalanced classes. 

For example, a high accuracy might be because a model is doing well, but if the data is unbalanced, it might also be because the model is only accurately labeling the high-frequency class. In such cases, a more detailed analysis of the model’s behavior, or the use of a different metric entirely, is necessary to determine how well the model is actually performing.
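
The effect is easy to reproduce with made-up numbers. In the sketch below (hypothetical data, plain Python), a "model" that always predicts the majority class reaches 90% accuracy while never identifying a single minority-class example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical unbalanced data: 90 samples of class 0, 10 samples of class 1
y_true = [0] * 90 + [1] * 10

# A degenerate model that always predicts the majority class
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)

# How many minority-class samples were actually found
minority_hits = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy: {acc:.2f}, minority samples found: {minority_hits} of 10")
```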

Reference:
- sklearn documentation: [accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

In [None]:
# This metric, among many others, is available and can be loaded using the `evaluate` library.
# This library is a layer of abstraction on top of sklearn
import evaluate

# Also, for custom approaches, the metric can be used directly from the sklearn library
from sklearn.metrics import accuracy_score

# Usage with Hugging Face API
accuracy_metric = evaluate.load('accuracy')

### Precision

**Precision** is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation: 

`Precision = TP / (TP + FP)` 

Where:
- `TP: True positive`
- `FP: False positive`

Reference:
- sklearn documentation: [precision_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html)

In [None]:
# This metric, among many others, is available and can be loaded using the `evaluate` library.
# This library is a layer of abstraction on top of sklearn
import evaluate

# Also, for custom approaches, the metric can be used directly from the sklearn library
from sklearn.metrics import precision_score

# Usage with Hugging Face API
precision_metric = evaluate.load('precision')


### Recall

**Recall** is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation: 

`Recall = TP / (TP + FN)`

Where:
- `TP: True positive`
- `FN: False negative`
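
Precision and recall are defined for one "positive" class, so in a multi-class setting they are typically computed per class (one-vs-rest) and then averaged. The sketch below works this through on made-up labels in plain Python; the helper name is hypothetical, and the unweighted mean shown here corresponds to what sklearn calls `average='macro'`.

```python
def precision_recall(y_true, y_pred, positive):
    """One-vs-rest precision and recall for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels for a three-class problem
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

classes = sorted(set(y_true))
per_class = {c: precision_recall(y_true, y_pred, c) for c in classes}

# Macro average: unweighted mean over the classes
macro_precision = sum(p for p, _ in per_class.values()) / len(classes)
macro_recall = sum(r for _, r in per_class.values()) / len(classes)

print(per_class)
print(f"macro precision: {macro_precision:.3f}, macro recall: {macro_recall:.3f}")
```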

Reference:
-  sklearn documentation: [recall_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)


In [None]:
# This metric, among many others, is available and can be loaded using the `evaluate` library.
# This library is a layer of abstraction on top of sklearn
import evaluate

# Also, for custom approaches, the metric can be used directly from the sklearn library
from sklearn.metrics import recall_score

# Usage with Hugging Face API
recall_metric = evaluate.load('recall')

### ROC AUC

This metric computes the area under the curve (AUC) for the **Receiver Operating Characteristic Curve (ROC)**. 

The return values represent how well the model used is predicting the correct classes, based on the input data. 

A **score of 0.5** means that the model is predicting exactly at chance, i.e. the model’s predictions are correct at the same rate as if the predictions were being decided by the flip of a fair coin or the roll of a fair die. 

A **score above 0.5** indicates that the model is doing better than chance, while a **score below 0.5** indicates that the model is doing worse than chance.

This metric has three separate use cases:

- **binary**: The case in which there are only two different label classes, and each example gets only one label. This is the default implementation.
- **multiclass**: The case in which there can be more than two different label classes, but each example still gets only one label.
- **multilabel**: The case in which there can be more than two different label classes, and each example can have more than one label.
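
For the binary case, the score can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below (hypothetical helper and data) computes it that way and reproduces the 1.0, 0.5 and 0.0 reference points; for the multiclass case, sklearn's `roc_auc_score` exposes a `multi_class` parameter.

```python
def binary_roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]

perfect = binary_roc_auc(y_true, [0.1, 0.2, 0.8, 0.9])   # positives always ranked higher
constant = binary_roc_auc(y_true, [0.5, 0.5, 0.5, 0.5])  # scores carry no information
inverted = binary_roc_auc(y_true, [0.9, 0.8, 0.2, 0.1])  # positives always ranked lower

print(perfect, constant, inverted)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration but not how library implementations compute the score.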

Reference:
- sklearn documentation: [roc_auc_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)

In [None]:
# This metric, among many others, is available and can be loaded using the `evaluate` library.
# This library is a layer of abstraction on top of sklearn
import evaluate

# Also, for custom approaches, the metric can be used directly from the sklearn library
from sklearn.metrics import roc_auc_score

# Usage with Hugging Face API
roc_auc_metric = evaluate.load('roc_auc')

In [None]:
# Implement a custom `compute_metrics` function for computing the accuracy of the model
# TODO

def compute_metrics(example):
  raise NotImplementedError("Function not implemented")
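
As a hint, the core of such a function is turning per-class scores (logits) into predicted label ids and comparing them with the references. The sketch below assumes the argument can be unpacked into `(logits, labels)` and uses plain Python so it runs anywhere; the name `compute_metrics_sketch` and the sample values are hypothetical, and in the notebook `numpy.argmax` together with `accuracy_metric.compute(...)` would be the more natural tools.

```python
def compute_metrics_sketch(eval_pred):
    """Return accuracy given a (logits, labels) pair."""
    logits, labels = eval_pred
    # Predicted class = index of the highest score in each row
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == t) for p, t in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Hypothetical logits for four samples and three classes
logits = [
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [0.0, 0.4, 0.9],
    [1.1, 1.0, 0.2],
]
labels = [0, 1, 1, 0]

result = compute_metrics_sketch((logits, labels))
print(result)
```

Returning a dictionary keyed by metric name makes it straightforward to report several metrics from the same function.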

## Model architecture and hyperparameters

Below are some basic hyperparameters, together with a brief description and explanation. Of course they are generated with ChatGPT :)

`learning_rate`

The `learning_rate` parameter specifies the initial learning rate for the optimizer. It controls how much to change the model weights at each step based on the gradient descent update.

A higher learning rate might lead to faster convergence but risks overshooting the optimal point, while a lower learning rate might result in slower convergence and a higher chance of getting stuck in local minima.
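
The trade-off can be observed on a toy problem. The sketch below runs plain gradient descent on `f(w) = w**2` (a made-up objective with its minimum at 0) for three hypothetical learning rates: a very small one crawls, a moderate one converges quickly, and an overly large one overshoots further on every step. Because this toy function is convex, it does not illustrate the local-minima aspect.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimise f(w) = w**2 (gradient 2*w) and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        w -= learning_rate * gradient
    return w

small = gradient_descent(0.01)    # slow progress towards 0
moderate = gradient_descent(0.1)  # much closer to 0 after the same number of steps
large = gradient_descent(1.1)     # each step overshoots, so |w| keeps growing

print(f"small lr: {small:.4f}, moderate lr: {moderate:.4f}, large lr: {large:.4f}")
```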

`num_train_epochs`
 
This parameter indicates the number of complete passes through the training dataset.

A higher number might improve learning but could also lead to overfitting.

`per_device_train_batch_size`

This parameter sets the batch size for training on each device (such as a GPU or CPU).

It controls how many samples are processed at once per device during training. A larger batch size might speed up training but requires more memory.

`per_device_eval_batch_size`

This parameter sets the batch size for evaluation on each device (such as a GPU or CPU).

It is similar to the training batch size, but is used during evaluation (validation) to balance speed and memory usage.
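
Epochs and batch size together determine how many weight updates a training run performs. The arithmetic is sketched below with a hypothetical helper and a hypothetical dataset of 16,000 samples, assuming no gradient accumulation and that the last, smaller batch of each epoch is kept.

```python
import math

def optimizer_steps(num_samples, per_device_batch_size, num_epochs, num_devices=1):
    """Total number of weight updates for a training run."""
    effective_batch_size = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch_size)
    return steps_per_epoch * num_epochs

# Hypothetical dataset size; the batch sizes and epoch count are illustrative only
small_batches = optimizer_steps(16_000, per_device_batch_size=16, num_epochs=3)
large_batches = optimizer_steps(16_000, per_device_batch_size=64, num_epochs=3)

print(small_batches, large_batches)
```

Quadrupling the batch size cuts the number of updates to a quarter, which is one reason the learning rate and the batch size are often tuned together.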

In [None]:
# Declare the pre-trained model used for fine-tuning
# TODO

# Declare the hyperparameters for the fine-tuning process
# TODO

# Declare the Trainer and finish the model architecture
# TODO

Start training the model. Be aware that it may take a loooong time, depending on the architecture, the amount of data and the resources available on your machine.

In [None]:
# Start training the model
# TODO

It is time to evaluate the model. By running the `evaluate` method, the metrics of the best checkpoint will be computed and displayed (the metric used will be the one specified in the training arguments).

In [None]:
# Evaluate the model
# TODO

## Inference

The final step is running inference with the trained model. This is definitely the easiest and the most fulfilling aspect so far.

We are going to use the `pipeline` function from the `transformers` library and use one of the models (checkpoint) saved in the specified directory.

It is also possible to save models on the Hugging Face platform (which can be a strategic option for a real life application), but this concern is out of the scope of this workshop.

Feel free to play with and test your classifier, and be creative in your prompt engineering.

Make that model sweat! ;)

In [None]:
# Run inference on the model and check the output
# TODO