# Logistic regression for a multi-class problem

In this notebook, we work on a classification problem that takes raw text as input. To do so, we fine-tune a pretrained language model so that it learns to recognize the topic of a text from its semantic representation. The classifier is based on the DistilBERT model and is trained on a small subset of the dataset to make training faster.

## Load libraries

In [None]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ['HF_HOME'] = os.getcwd() + "/cache/"

from datasets import load_dataset, Dataset, DatasetDict
from evaluate import load
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from transformers.utils import logging
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
import torch

## Identifying the best device to run the model

Since we are going to perform a computing-intensive task, we must identify the most efficient device available to perform it. We do so using PyTorch, which is the back-end that we will use in this lab. We prioritize NVIDIA GPUs with CUDA installed, then Apple Silicon GPUs, and finally CPUs if none of the above is found.

If you need help installing the relevant version of PyTorch: https://pytorch.org/get-started/locally/

If you have an NVIDIA GPU but don't know whether CUDA is installed, run the following command:

```bash
nvcc --version
```

If CUDA is installed, the command prints its version. Otherwise, install a PyTorch-compatible version (as listed [here](https://pytorch.org/get-started/locally/), row "Stable CUDA").

In [None]:
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device('cpu')

print(device)

## Load data

In [None]:
newsgroups_dst = load_dataset("SetFit/20_newsgroups")
df: pd.DataFrame = pd.concat([newsgroups_dst["train"].to_pandas(), newsgroups_dst["test"].to_pandas()])
df.head()

## Data preparation

Let's check whether the dataset contains empty texts and, if so, remove them.

In [None]:
num_empty_texts = (df.text == "").sum()

print(f"There are {num_empty_texts} empty rows in the dataset.")

df = df.loc[df.text != ""]
print("Now, we only keep non-empty rows.")

In [None]:
df.label_text.value_counts()

In [None]:
num_rows = len(df.index)
print(f"The dataset is {num_rows} rows long.")

The dataset is too large to train on during this class, so we are going to produce a subsample. First, we narrow our scope by keeping only the documents whose topic label starts with `sci`, i.e. `sci.crypt`, `sci.electronics`, `sci.med` and `sci.space`.

In [None]:
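# Work on an explicit copy so the column assignment below does not trigger a SettingWithCopyWarning
df_filtered = df.loc[df.label_text.str.startswith("sci")].copy()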
normalized_labels = {label: i for i, label in enumerate(df_filtered.label_text.unique())}
df_filtered.loc[:, "label"] = df_filtered.label_text.map(normalized_labels)
print(f"The dataset is {len(df_filtered.index)} rows long.")
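The mapping from topic names to integer ids can also be inverted, which is handy when turning predicted ids back into topic names. A minimal sketch in plain Python, using a hypothetical order of appearance (the real order depends on the order in which labels first appear in `df_filtered`); the names `seen_labels`, `label2id` and `id2label` are illustrative and not part of the notebook:

```python
# Hypothetical sequence of labels; the real order depends on the data.
seen_labels = ["sci.med", "sci.space", "sci.crypt", "sci.electronics", "sci.med"]

# One id per distinct label, assigned in order of first appearance
# (dict.fromkeys removes duplicates while keeping that order).
label2id = {label: i for i, label in enumerate(dict.fromkeys(seen_labels))}

# Inverse mapping: from integer id back to topic name.
id2label = {i: label for label, i in label2id.items()}

print(label2id)
print(id2label[2])
```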

This is already a good improvement, but we will further reduce the number of rows by randomly sampling the dataset.

In [None]:
df_sample = df_filtered.groupby('label_text', group_keys=True).sample(n=400, random_state=1234)
print(f"The dataset is {len(df_sample.index)} rows long.")
df_sample

We now have a 1,600-row dataset covering only 4 topics, perfectly balanced across the labels. Now let's split it into train, validation and test dataframes.

In [None]:
df_train, df_temp = train_test_split(
    df_sample, 
    test_size=0.3,       # 30% goes to valid+test
    stratify=df_sample['label_text'],
    random_state=1234
)

df_valid, df_test = train_test_split(
    df_temp,
    test_size=0.5,       # 50% of the 30% -> 15% test, 15% valid
    stratify=df_temp['label_text'],
    random_state=1234
)
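As a sanity check on the two-step split, the expected sizes can be worked out by hand. The sketch below is plain arithmetic, assuming the sample contains exactly 1,600 rows (4 topics with 400 rows each); the variable names are illustrative only:

```python
total_rows = 1600   # assumed: 4 topics x 400 sampled rows
num_topics = 4

# First split: 30% is held out for validation + test.
temp_rows = total_rows * 30 // 100
train_rows = total_rows - temp_rows

# Second split: the held-out part is divided in half.
test_rows = temp_rows // 2
valid_rows = temp_rows - test_rows

# Stratification keeps the topics balanced inside every split.
per_topic = {
    "train": train_rows // num_topics,
    "validation": valid_rows // num_topics,
    "test": test_rows // num_topics,
}

print(train_rows, valid_rows, test_rows)
print(per_topic)
```

Comparing these numbers with `len(df_train)`, `len(df_valid)` and `len(df_test)` is a quick way to confirm the split behaved as intended.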

Now we convert our dataframes back into `Dataset` objects, as they are optimized for use with our training framework.

In [None]:
dst = DatasetDict({
    "train": Dataset.from_pandas(df_train, preserve_index=False),
    "validation": Dataset.from_pandas(df_valid, preserve_index=False),
    "test": Dataset.from_pandas(df_test, preserve_index=False),
})
dst

## Model training

Now, we will set up a trainer for a DistilBERT model, which will fine-tune it on our new dataset.

### Model hyperparameters

In [None]:
training_batch_size = 8
num_epochs = 5
lr = 2e-5
weight_decay = 0.01
num_labels = len(normalized_labels)

### Training setup

In [None]:
model_name = "distilbert/distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels).to(device)

In [None]:
encoded_dataset = dst.map(lambda example: tokenizer(example["text"], max_length=512, truncation=True), batched=True)

In [None]:
metric_name = "f1"
metric = load(metric_name)


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {
        "f1_micro": metric.compute(predictions=predictions, references=labels, average="micro")["f1"],
        "f1_macro": metric.compute(predictions=predictions, references=labels, average="macro")["f1"]
    }
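To make the difference between the two averages concrete, here is a minimal pure-Python sketch on made-up labels (these are not results from this notebook). It is not the implementation used by `evaluate`; the helper `f1_scores` and the `toy_*` names are hypothetical. Micro-averaging pools the counts of all classes before computing F1, while macro-averaging computes one F1 per class and takes their unweighted mean:

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted but is wrong
            fn[t] += 1  # class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: unweighted mean of the per-class F1 scores.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro: F1 computed on the counts pooled over all classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
micro, macro = f1_scores(toy_true, toy_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.667 macro=0.656
```

In a single-label setting such as this one, micro F1 equals accuracy (here 4 correct out of 6). On a perfectly balanced dataset the two averages tend to be close, and they diverge when some classes are much harder than others.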

In [None]:
args = TrainingArguments(
    f"./cache/newsgroups_classifier",
    eval_strategy="steps",
    save_strategy="steps",
    eval_steps=100,
    save_steps=100,  # checkpoint at every evaluation so the best-scoring model can be reloaded
    logging_steps=100,
    learning_rate=lr,
    per_device_train_batch_size=training_batch_size,
    per_device_eval_batch_size=training_batch_size,
    num_train_epochs=num_epochs,
    weight_decay=weight_decay,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    save_safetensors=True,
    save_total_limit=3,
    seed=1234
)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    processing_class=tokenizer,
    compute_metrics=compute_metrics
)
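The step-based settings are easier to read once the number of optimizer steps is known. The sketch below is plain arithmetic under stated assumptions: a training split of 1,120 rows, a single device, no gradient accumulation, and the last incomplete batch kept. The variable names are illustrative only:

```python
import math

train_rows = 1120  # assumed size of the training split (70% of 1,600 rows)
batch_size = 8     # per_device_train_batch_size, on a single device
num_epochs = 5
eval_steps = 100

# One optimizer step per batch; the last partial batch still counts as a step.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which an evaluation on the validation set is scheduled.
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

print(steps_per_epoch, total_steps)
print(eval_points)
```

Under these assumptions, the model is evaluated seven times over the run, which is less than twice per epoch. Lowering `eval_steps` gives a finer view of the validation curve at the cost of more evaluation time.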

We are now ready to fine-tune our model on the dataset we created!

In [None]:
logging.set_verbosity_info()

trainer.train()

## Assessing the model

Now we can assess the performance of our fine-tuned model using the methods we have already seen.

### Classification report

In [None]:
predictions = trainer.predict(encoded_dataset["test"])
y_pred = predictions.predictions.argmax(-1)
y_true = predictions.label_ids

label_list = sorted(normalized_labels.keys(), key=normalized_labels.get)

print(classification_report(y_true, y_pred, target_names=label_list))
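`predictions.predictions` holds one row of raw scores (logits) per test document, and `argmax` simply picks the highest-scoring class. As a hedged illustration, the sketch below turns one made-up row of logits into probabilities with a softmax and maps the winning index to a topic name. The logits, the label order in `toy_labels`, and the `softmax` helper are all hypothetical:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label order and made-up logits for a single document.
toy_labels = ["sci.crypt", "sci.electronics", "sci.med", "sci.space"]
toy_logits = [0.2, 2.1, -0.5, 0.4]

probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # same result as argmax

print(toy_labels[best], round(probs[best], 3))
```

Softmax preserves the ordering of the scores, so taking `argmax` on the logits or on the probabilities yields the same predicted class.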

### Confusion matrix

In [None]:
disp = ConfusionMatrixDisplay.from_predictions(
    y_true, 
    y_pred, 
    display_labels=label_list,
    cmap="Blues",
    xticks_rotation="vertical"
)

plt.show()