### Text Classification: Sentiment Analysis

Hello there! This notebook is an example of sentiment analysis based on the second chapter of the Natural Language Processing with Transformers [book](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/) (see also its [GitHub](https://github.com/nlp-with-transformers)). It aims to extract the most relevant parts of the book's first end-to-end, hands-on chapter with the 🤗 Hugging Face ecosystem. I may change a thing or two for my own experiments, since I am using this material for personal learning.

#### Load Data

We start by loading the data from the 🤗 Datasets library. We are using the `emotion` dataset, which contains tweets written in English. This dataset was proposed in a [paper](https://aclanthology.org/D18-1404/) by E. Saravia <i>et al.</i>, who not only collected the data but also labeled each tweet with the sentiment inferred from its hashtags.

In [None]:
from datasets import load_dataset

dataset = load_dataset("emotion")

#### EDA

We now run a short exploratory data analysis (EDA) to examine the corpus.

In [None]:
print("This is the dataset object:\n", dataset)
print("__________________________________________________________")
print("This is the dataset object type:\n", type(dataset))
print("__________________________________________________________")
print("The dataset is split into train, validation and test. Inside each partition - taking train for instance -, we have:\n", dataset["train"])
print("__________________________________________________________")
print("And inside the first element of the train partition:\n", dataset["train"][0])
print("__________________________________________________________")
print(f"Splitting from text and label, we get: {dataset['train'][0]['text']} and {dataset['train'][0]['label']}")

We can inspect the dataset's `features` attribute to see the column types and label names, and use the `set_format` method to expose the data as a `pandas` DataFrame, which makes corpus statistics easier to compute.

In [None]:
dataset["train"].features

In [None]:
dataset.set_format("pandas")
df = dataset["train"][:]   
df.head()

We can map a label integer to its respective class name by using the `.int2str` method of the label feature:

In [None]:
dataset["train"].features["label"].int2str(0)

Applying to the whole dataset, we get:

In [None]:
def label_int2str(label):
    return dataset["train"].features["label"].int2str(label)

df["label_name"] = df["label"].apply(label_int2str)
df.head()
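
To make the mapping explicit, here is a minimal sketch of what `int2str` does, written in plain Python. The list `LABEL_NAMES` is an assumption for illustration: it is meant to mirror `dataset["train"].features["label"].names`, so check it against the output of the `features` cell above.

```python
# Assumed label order; verify against dataset["train"].features["label"].names.
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]

def int2str(label_id: int) -> str:
    """Map an integer class ID to its class name."""
    return LABEL_NAMES[label_id]

def str2int(label_name: str) -> int:
    """Map a class name back to its integer class ID."""
    return LABEL_NAMES.index(label_name)

ids = [0, 1, 3]
names = [int2str(i) for i in ids]
print(names)
print([str2int(n) for n in names])
```

The round trip through `str2int` returns the original IDs, which is a quick sanity check that both directions use the same ordering.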

Moving on to some visualizations, we can plot how balanced (or not) the classes are and the distribution of words per tweet for each sentiment:

In [None]:
import matplotlib.pyplot as plt

df["label_name"].value_counts(ascending=True).plot(kind="barh")

In [None]:
df["Words per tweet"] = df["text"].str.split().apply(len)
df.boxplot("Words per tweet", by="label_name", grid=False)
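
The two plots above boil down to two simple quantities: the number of examples per class and the number of whitespace-separated words per tweet. A small sketch with hypothetical toy rows (not taken from the dataset) shows the same computation without `pandas`:

```python
from collections import Counter

# Hypothetical toy rows standing in for (text, label_name) pairs.
rows = [
    ("i feel so happy today", "joy"),
    ("i am feeling a bit down", "sadness"),
    ("this makes me furious", "anger"),
    ("i feel wonderful", "joy"),
]

# Class balance: number of examples per label.
class_counts = Counter(label for _, label in rows)

# Words per tweet, using the same whitespace split as str.split().
words_per_tweet = [len(text.split()) for text, _ in rows]

print(class_counts.most_common())
print(words_per_tweet)
```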

#### Tokenization

Now that we have visualized the data and have a grasp of its contents, we can prepare it for training. To do that, we need a tokenizer, which splits text into tokens and maps each token to an integer ID that the model can process. From 🤗 Transformers, we use `AutoTokenizer`, which loads the tokenizer that matches the checkpoint we train with. In this tutorial, we follow the book and use `DistilBERT`.

In [None]:
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

*Important*: On page 35, the authors stress that a pretrained model must be used with the same tokenizer it was trained with. A different tokenizer maps text to different token IDs, so the model would receive inputs that no longer match the vocabulary it learned.

Now, we test the tokenizer on one sentence to see its behavior:

In [None]:
text = "Tokenizing text is a core task of NLP."
encoded_text = tokenizer(text)
print(encoded_text)
print('__________________________________________________________')
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)
print('__________________________________________________________')
print(tokenizer.convert_tokens_to_string(tokens))
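
The `##` prefixes in the token list come from subword tokenization. As a rough illustration, the sketch below implements a greedy longest-match-first split in the style of WordPiece. The vocabulary `toy_vocab` is made up for this example and is not the real DistilBERT vocabulary, and the real tokenizer also handles normalization, punctuation, and special tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary for illustration only.
toy_vocab = {"token", "##izing", "text", "nl", "##p", "is"}
print(wordpiece("tokenizing", toy_vocab))
print(wordpiece("nlp", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```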

Now we proceed with applying tokenization to the whole dataset:

In [None]:
def tokenize(batch):
    # padding=True pads each batch to its longest sequence so the tensors are rectangular
    # truncation=True cuts sequences to the model's maximum context size
    return tokenizer(batch["text"], padding=True, truncation=True)

dataset.reset_format() # Resetting the format from pandas to the default one
print(tokenize(dataset["train"][:2]))

dataset_encoded = dataset.map(tokenize, batched=True, batch_size=None)

print(dataset_encoded["train"].column_names) # We got two more "columns" in our dataset: input_ids and attention_mask
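
To see what `padding=True` and `truncation=True` produce, here is a simplified sketch in plain Python. The padding ID `PAD_ID = 0` and the helper `pad_and_truncate` are assumptions for illustration. The real tokenizer also keeps the final `[SEP]` token when truncating, which this sketch does not.

```python
PAD_ID = 0  # assumed padding ID for this sketch

def pad_and_truncate(batch_ids, max_length):
    """Truncate to max_length, then pad every sequence to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token IDs for three sequences of different lengths.
batch = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 5, 102]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets the model skip the padded positions, which is why it appears as a new column next to `input_ids`.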

#### Training

Two possibilities:
- Feature Extraction
- Fine Tuning

##### Feature Extraction

First, we load our pretrained model

In [None]:
from transformers import AutoModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)

Now, the next step is to extract the last hidden states.

In [None]:
# With TensorFlow, use TFAutoModel instead of AutoModel.
# Some checkpoints ship weights for only one framework; in that case pass
# from_pt=True (to TFAutoModel) or from_tf=True (to AutoModel) to convert them.

In [None]:
text = "this is a test"
inputs = tokenizer(text, return_tensors="pt")
# return_tensors="pt" returns PyTorch tensors instead of Python lists
inputs['input_ids'].size()

In [None]:
# Inspect the encoded inputs (input_ids and attention_mask)
inputs

In [None]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])


In [None]:
# Add all the inputs to the device
inputs = {key: tensor.to(device) for key, tensor in inputs.items()}

In [None]:
with torch.no_grad():
    outputs = model(**inputs)
outputs

In [None]:
outputs.last_hidden_state.size()

# Shape: [batch_size, n_tokens, hidden_dim]

In [None]:
def extract_hidden_states(batch):
    inputs = {key: tensor.to(device) for key, tensor in batch.items() if key in tokenizer.model_input_names}
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    return {"hidden_states": last_hidden_state[:, 0].cpu().numpy()} 
    
""" 
In transformer-based models like BERT, the [CLS] (classification) token is a special token added to the beginning of every input sequence. Its purpose is to provide a summary of the entire sequence. 
The hidden state associated with this token at the output of the final transformer layer is often used as a condensed representation of the whole sequence, which is why it’s used as an input feature for classification tasks.

Here's why it's common practice:

Global Representation: The [CLS] token's hidden state is designed to capture information from all tokens in the input sequence. 
During the self-attention mechanism, the [CLS] token interacts with every other token, allowing it to aggregate context from the entire sequence.

Efficient for Classification: For many tasks, like text classification or sentiment analysis, the task only requires a single output that summarizes the entire input. 
The [CLS] token is trained to carry this summary information, making it a convenient choice for tasks that involve sequence-level prediction.

Pretrained Model Design: In the original BERT, the [CLS] hidden state feeds the next-sentence prediction (NSP) objective during pretraining, 
which pushes it toward sequence-level information. DistilBERT drops NSP, so its first-token state is not directly trained as a summary, but it is still commonly used as one.

Reduced Dimensionality: Instead of working with the hidden states of all tokens in the sequence, using just the [CLS] token reduces the dimensionality of the input 
to the final layer, making the model more efficient for tasks where a single vector is sufficient for decision-making.

This approach simplifies the use of transformer models for many downstream tasks by leveraging the [CLS] token’s summary of the sequence, 
which is robust enough for tasks requiring a high-level understanding of the input.
"""

In [None]:
dataset_encoded.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
dataset_hidden = dataset_encoded.map(extract_hidden_states, batched=True, batch_size=1000)
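
The indexing `last_hidden_state[:, 0]` inside `extract_hidden_states` keeps only the vector at token position 0 (the `[CLS]` position) for every sequence in the batch. A minimal sketch with nested lists, using made-up numbers and a hypothetical helper `select_first_token`, shows how the shape goes from `[batch_size, n_tokens, hidden_dim]` to `[batch_size, hidden_dim]`:

```python
# Toy "last hidden state" with shape [batch_size=2, n_tokens=3, hidden_dim=4].
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

def select_first_token(hidden):
    """Plain-Python equivalent of hidden[:, 0] on a 3D tensor."""
    return [sequence[0] for sequence in hidden]

cls_vectors = select_first_token(last_hidden_state)
shape = (len(cls_vectors), len(cls_vectors[0]))
print(shape)
print(cls_vectors)
```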

In [None]:
dataset_hidden

In [None]:
dataset_hidden["train"].column_names

In [None]:
dataset_hidden["train"]["hidden_states"][0]

Now we proceed to create a feature matrix. We will use the hidden states as input features and the labels as targets.

In [None]:
import numpy as np

X_train = np.array(dataset_hidden["train"]["hidden_states"])
y_train = np.array(dataset_hidden["train"]["label"])
X_valid = np.array(dataset_hidden["validation"]["hidden_states"])
y_valid = np.array(dataset_hidden["validation"]["label"])

print(X_train.shape, y_train.shape, X_valid.shape, y_valid.shape)

Before we train a classifier on the hidden states, we should check whether they provide a useful representation of the emotions we want to classify. Let's do some dataviz.

In [None]:

from umap import UMAP
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Scale features to [0,1] range
X_scaled = MinMaxScaler().fit_transform(X_train)
# Initialize and fit UMAP
mapper = UMAP(n_components=2, metric="cosine").fit(X_scaled)
# Create a DataFrame of 2D embeddings
df_emb = pd.DataFrame(mapper.embedding_, columns=["X", "Y"])
df_emb["label"] = y_train
df_emb.head()
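
The scaling step rescales every feature column to the [0, 1] range with the formula `(x - min) / (max - min)`. Below is a plain-Python sketch of that formula on toy data. The helper `min_max_scale` is hypothetical, and mapping constant columns to 0.0 is a choice made for this sketch.

```python
def min_max_scale(rows):
    """Rescale each column of a 2D list to [0, 1] using (x - min) / (max - min)."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for x, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

# Toy feature matrix: 3 samples, 2 features on very different scales.
X_toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]]
X_toy_scaled = min_max_scale(X_toy)
print(X_toy_scaled)
```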

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(7,5))
axes = axes.flatten()
cmaps = ["Greys", "Blues", "Oranges", "Reds", "Purples", "Greens"]
labels = dataset["train"].features["label"].names

for i, (label, cmap) in enumerate(zip(labels, cmaps)):
    df_emb_sub = df_emb.query(f"label == {i}")
    axes[i].hexbin(df_emb_sub["X"], df_emb_sub["Y"], cmap=cmap,
                   gridsize=20, linewidths=(0,))
    axes[i].set_title(label)
    axes[i].set_xticks([]), axes[i].set_yticks([])

plt.tight_layout()
plt.show()

In [None]:
# Although we may have hoped for some separation, it is in no way guaranteed, since the model was not trained to tell these emotions apart.
# It only learned them implicitly by predicting the masked words in texts.

Training a simple classifier 

In [None]:
# We increase `max_iter` so the solver has enough iterations to converge
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(max_iter=3000)
lr_clf.fit(X_train, y_train)

In [None]:
lr_clf.score(X_valid, y_valid)


In [None]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_valid, y_valid)

In [None]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="stratified")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_valid, y_valid)
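
The `most_frequent` baseline always predicts the majority class of the training labels, so its accuracy equals the share of that class in the evaluation set. A small sketch with made-up labels (the helper `most_frequent_baseline` is hypothetical) makes this concrete:

```python
from collections import Counter

def most_frequent_baseline(y_train, y_eval):
    """Predict the majority training class for every example and score it."""
    majority = Counter(y_train).most_common(1)[0][0]
    correct = sum(1 for y in y_eval if y == majority)
    return majority, correct / len(y_eval)

# Made-up labels for illustration.
y_train_toy = [1, 1, 1, 0, 0, 3]
y_valid_toy = [1, 0, 1, 3, 1]
majority, baseline_acc = most_frequent_baseline(y_train_toy, y_valid_toy)
print(majority, baseline_acc)
```

Any classifier worth keeping should beat this number, which is why it is a useful reference point for the logistic regression score.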

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_confusion_matrix(y_preds, y_true, labels):
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()
    
y_preds = lr_clf.predict(X_valid)
plot_confusion_matrix(y_preds, y_valid, labels)
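
With `normalize="true"`, each row of the confusion matrix is divided by the number of true examples of that class, so every row sums to 1 and the diagonal reads as per-class recall. A minimal sketch on made-up labels, with a hypothetical helper `normalized_confusion_matrix`:

```python
def normalized_confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predictions; each row sums to 1."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Made-up labels for three classes.
y_true_toy = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 0, 1, 1, 0, 2, 2]
cm_toy = normalized_confusion_matrix(y_true_toy, y_pred_toy, n_classes=3)
for row in cm_toy:
    print(row)
```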

In [None]:
""" 
We can see that anger and fear are most often confused with sadness, which agrees with the observation we made when visualizing the embeddings. Also, love and surprise are frequently mistaken for joy.
"""

#### Fine-Tuning Transformers

In [None]:
from transformers import AutoModelForSequenceClassification

num_labels = 6
model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt, num_labels=num_labels)
         .to(device))

Let's reload the dataset so that we can store the labels in our model

In [None]:
from datasets import load_dataset

dataset = load_dataset("emotion")
labels = dataset["train"].features["label"].names


# Create id2label and label2id dictionaries
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

# Update the model configuration
model.config.id2label = id2label
model.config.label2id = label2id

In [None]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)  # index of the highest logit per example = predicted class
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
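
To unpack `compute_metrics`, the sketch below reproduces its two steps in plain Python: an argmax over each row of logits, followed by a support-weighted average of per-class F1 scores, which is our understanding of `average="weighted"`. The logits and labels are made up, and the helpers `argmax_rows` and `weighted_f1` are hypothetical.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    classes = sorted(set(y_true) | set(y_pred))
    total = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * (tp + fn)  # weight by support (number of true examples)
    return total / len(y_true)

# Made-up logits for 4 examples and 3 classes.
toy_logits = [[2.0, 0.1, -1.0], [0.2, 1.5, 0.3], [0.9, 0.8, 0.1], [-0.5, 0.0, 2.2]]
toy_labels = [0, 1, 1, 2]
toy_preds = argmax_rows(toy_logits)
toy_acc = sum(t == p for t, p in zip(toy_labels, toy_preds)) / len(toy_labels)
print(toy_preds, toy_acc, weighted_f1(toy_labels, toy_preds))
```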

To complete the tutorial, we define the training hyperparameters, train the model, and upload it to the 🤗 Hugging Face Hub. Since uploading requires authentication, we first log in from the terminal with `huggingface-cli login`.

In [None]:
from transformers import Trainer, TrainingArguments

batch_size = 64
logging_steps = len(dataset_encoded["train"]) // batch_size
output_dir = f"../../data/{model_ckpt}-finetuned-emotion"
training_args = TrainingArguments(output_dir=output_dir,
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  push_to_hub=True, 
                                  log_level="error")
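
Setting `logging_steps` to the number of training examples divided by the batch size means the training loss is logged roughly once per epoch. The arithmetic below is a sketch that assumes the training split has 16,000 examples (check `len(dataset_encoded["train"])`), a single device, and no gradient accumulation.

```python
import math

n_train = 16_000        # assumed size of the training split
train_batch_size = 64   # same value as batch_size above
n_epochs = 2            # same value as num_train_epochs above

logging_every = n_train // train_batch_size            # value passed as logging_steps
steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * n_epochs

print(logging_every, steps_per_epoch, total_steps)
```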

In [None]:

from transformers import Trainer

trainer = Trainer(model=model, args=training_args, 
                  compute_metrics=compute_metrics,
                  train_dataset=dataset_encoded["train"],
                  eval_dataset=dataset_encoded["validation"],
                  tokenizer=tokenizer)
trainer.train()

In [None]:
preds_output = trainer.predict(dataset_encoded["validation"])

In [None]:
preds_output.metrics

In [None]:
y_preds = preds_output.predictions.argmax(-1) # or np.argmax(preds_output.predictions, axis=1)
y_preds

In [None]:
plot_confusion_matrix(y_preds, y_valid, labels)

In [None]:
## Add error analysis here

#### Saving and sharing the model

In [None]:
trainer.push_to_hub(commit_message="Added labels to the model")

In [None]:
from transformers import pipeline

model_id = "gfbarros/distilbert-base-uncased-finetuned-emotion"
classifier = pipeline("text-classification", model=model_id, tokenizer=model_id)

In [None]:
def visualize_predictions(custom_tweet):
    preds = classifier(custom_tweet, return_all_scores=True)
    preds_df = pd.DataFrame(preds[0])
    plt.bar(preds_df.label, 100*preds_df.score, color="skyblue")
    plt.title(custom_tweet)
    plt.ylabel("Score (%)")
    plt.show()
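
The scores plotted by `visualize_predictions` sum to 100%. As far as we understand, for single-label classification the pipeline obtains them by applying a softmax to the model's logits. The sketch below shows that step with made-up logits; the label order in `toy_names` is assumed to match the `id2label` mapping stored in the model config.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed label order and made-up logits for one tweet.
toy_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
toy_tweet_logits = [-1.2, 4.0, 0.5, -0.8, -1.5, 0.1]

scores = softmax(toy_tweet_logits)
best_label = toy_names[scores.index(max(scores))]
for name, score in zip(toy_names, scores):
    print(f"{name}: {100 * score:.1f}%")
```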

In [None]:
visualize_predictions("I'm so happy today!")

In [None]:
visualize_predictions("What?! Four dollars for a cup of coffee?")

In [None]:
visualize_predictions("We are so scared of the future")