# Tutorial 3

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Transformers, Huggingface, Prompting, LangChain

# Contact

For any doubts, questions, or issues, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

## PART 0 ($\sim$5 mins)
*   Downloading a **dataset**.
*   Encoding a **dataset**.

## PART I ($\sim$30 mins)

*   Text encoding with transformers.
*   Model definition.
*   Model training and evaluation with huggingface APIs.

## PART II ($\sim$30 mins)

*   TODO

## Preliminaries

First of all, we need to import some useful packages that we will use during this hands-on session.

In [1]:
# system packages
from pathlib import Path
import shutil
import urllib.request  # the submodule must be imported explicitly to use urlretrieve
import tarfile
import sys

# data and numerical management packages
import pandas as pd
import numpy as np

# useful during debugging (progress bars)
from tqdm import tqdm

# typing
from typing import List, Callable, Dict, Iterable

In [2]:
import torch
torch.cuda.is_available()

False

In [4]:
!pip list | grep torch

torch                             1.13.0+cu116
torchdata                         0.5.0
torchtext                         0.14.0


In [5]:
!pip install torch==1.13.0+cu116
!pip install transformers
!pip install datasets
!pip install accelerate -U

In [None]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'width': 2560,
        'height': 1440,
        'scroll': True,
})

In [6]:
!nvidia-smi

Fri Nov  3 14:19:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   36C    P0    N/A /  80W |      6MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

# Data

We will use the IMDB dataset first introduced in tutorial 1.

* [**Stats**] A dataset of 50k movie reviews used for sentiment analysis: 25k with positive sentiment and 25k with negative sentiment.
* [**Sentiment**] We consider sentiment labels for classification.

We start by **downloading** the dataset and **extracting** it to a folder.

In [None]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(download_path: Path, url: str):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=download_path, reporthook=t.update_to)

        
def download_dataset(download_path: Path, url: str):
    print("Downloading dataset...")
    download_url(url=url, download_path=download_path)
    print("Download complete!")

def extract_dataset(download_path: Path, extract_path: Path):
    print("Extracting dataset... (it may take a while...)")
    with tarfile.open(download_path) as loaded_tar:
        loaded_tar.extractall(extract_path)
    print("Extraction completed!")

In [None]:
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset_name = "aclImdb"

print(f"Current work directory: {Path.cwd()}")
dataset_folder = Path.cwd().joinpath("Datasets")

if not dataset_folder.exists():
    dataset_folder.mkdir(parents=True)

dataset_tar_path = dataset_folder.joinpath("Movies.tar.gz")
dataset_path = dataset_folder.joinpath(dataset_name)

if not dataset_tar_path.exists():
    download_dataset(dataset_tar_path, url)

if not dataset_path.exists():
    extract_dataset(dataset_tar_path, dataset_folder)

#### Data Format

Just like in the first assignment, we need a **high-level view** of the dataset that suits our needs.

We encode the dataset into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

In [None]:
dataframe_rows = []

for split in ['train', 'test']:
    for sentiment in ['pos', 'neg']:
        folder = dataset_folder.joinpath(dataset_name, split, sentiment)
        for file_path in folder.glob('*.txt'):            
            with file_path.open(mode='r', encoding='utf-8') as text_file:
                text = text_file.read()
                score = file_path.stem.split("_")[1]
                score = int(score)
                file_id = file_path.stem.split("_")[0]

                num_sentiment = 1 if sentiment == 'pos' else 0

                dataframe_row = {
                    "file_id": file_id,
                    "score": score,
                    "sentiment": num_sentiment,
                    "split": split,
                    "text": text
                }

                dataframe_rows.append(dataframe_row)
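
The loop above relies on the IMDB file naming scheme `<id>_<rating>.txt`. The following minimal sketch isolates that parsing step. The helper name `parse_review_filename` and the example path are hypothetical and only serve as an illustration.

```python
from pathlib import Path
from typing import Tuple


def parse_review_filename(file_path: Path) -> Tuple[str, int]:
    """Split a file name such as '123_8.txt' into (file_id, score)."""
    # Path.stem drops the '.txt' extension, leaving '<id>_<rating>'
    file_id, score = file_path.stem.split("_")
    return file_id, int(score)


# hypothetical path that follows the '<id>_<rating>.txt' naming scheme
example_path = Path("aclImdb/train/pos/123_8.txt")
file_id, score = parse_review_filename(example_path)
print(file_id, score)
```

Note that `file_id` stays a string, while `score` is converted to an integer, exactly as in the cell above.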

In [None]:
folder = Path.cwd().joinpath("Datasets", "Dataframes", dataset_name)
if not folder.exists():
    folder.mkdir(parents=True)

# transform the list of rows in a proper dataframe
df = pd.DataFrame(dataframe_rows)
df = df[["file_id", 
         "score",
         "sentiment",
         "split",
         "text"]
       ]
df_path = folder.with_name(dataset_name + ".pkl")
df.to_pickle(df_path)

# PART I

*   Text encoding with Transformers.
*   Model definition.
*   Model training and evaluation with huggingface APIs.

## 1. Text encoding with Transformers.

In tutorial 1, we have seen how to define standard machine learning models to address sentiment classification.

However, we know that Transformer-based models are among the strongest baselines when assessing a task or benchmarking on a novel corpus.

Before defining our transformer-based classifier, we need to encode text inputs into numerical format.

As in Tutorial 1, we are going to **tokenize** input texts to perform token indexing.

### 1.1 Encoding the dataset

First, we are going to use the ``datasets`` library to encode our dataset into a handy wrapper for computational speedup.

In [None]:
from datasets import Dataset

train_df = df.loc[df['split'] == "train"].sample(frac=1.0)[:5000]
test_df = df.loc[df['split'] == "test"].sample(frac=1.0)[:1000]

train_data = Dataset.from_pandas(train_df)
test_data = Dataset.from_pandas(test_df)

Let's inspect the newly defined `Dataset` instances.

In [None]:
print(train_data)
print(test_data)

### 1.2 Transformer Tokenizer

Transformers typically rely on sub-word tokenization, implemented with algorithms such as WordPiece (used by BERT and DistilBERT), byte-pair encoding, or [SentencePiece](https://github.com/google/sentencepiece).

In particular, the `transformers` library offers the `AutoTokenizer` class to quickly retrieve our chosen transformer's ad-hoc tokenizer.

In [None]:
from transformers import AutoTokenizer

model_card = 'distilbert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_card)

The `model_card` variable identifies the pre-trained model to load: either a model name on the hub or a local *path*.

You can check [huggingface's model hub](https://huggingface.co/models) to pick the model card according to your preference.
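
To build some intuition for sub-word tokenization, here is a toy greedy longest-match-first splitter in the style of WordPiece. It is only a sketch: the function name `greedy_subword_split` and the tiny vocabulary are made up for illustration, and the real tokenizer loaded above also performs normalization and handles special tokens.

```python
from typing import List, Set


def greedy_subword_split(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # shrink the candidate span until it matches a vocabulary entry
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# made-up vocabulary, for illustration only
toy_vocab = {"un", "##believ", "##able", "movie", "##s"}
print(greedy_subword_split("unbelievable", toy_vocab))
print(greedy_subword_split("movies", toy_vocab))
print(greedy_subword_split("xyz", toy_vocab))
```

Rare words are thus decomposed into frequent pieces instead of being mapped to a single unknown token, which keeps the vocabulary small.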

We proceed to tokenize the movie reviews with our tokenizer.

In [None]:
def preprocess_text(batch: Dict[str, List[str]]):
    # with batched=True, `batch` maps each column name to a list of values
    return tokenizer(batch['text'], truncation=True)

train_data = train_data.map(preprocess_text, batched=True)
test_data = test_data.map(preprocess_text, batched=True)

Let's inspect the preprocessed `Dataset` instances.

In [None]:
print(train_data)
print(test_data)

In [None]:
print(train_data['input_ids'][50])

In [None]:
print(train_data['attention_mask'][50])

We can perform a quick 'sanity check' to evaluate the tokenization process.

In [None]:
original_text = train_data['text'][50]
decoded_text = tokenizer.decode(train_data['input_ids'][50])

print(original_text)
print()
print()
print(decoded_text)

### Vocabulary

We **do not** necessarily need to build a vocabulary since transformers already come with their own! 

**However**, it is still possible to add new tokens to the vocabulary to adapt the model to the given use case.

```
tokenizer.add_tokens(new_tokens=new_tokens)
```

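Newly provided tokens are appended to the tokenizer vocabulary with new indexes. If you add tokens, remember to resize the model embedding matrix accordingly, e.g. with `model.resize_token_embeddings(len(tokenizer))`.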
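
Conceptually, adding tokens means assigning them vocabulary indexes and giving the model one embedding row per new token. The following toy sketch mimics that bookkeeping with a plain dictionary and a NumPy matrix. It is not the library's implementation: the helper names, the vocabulary, and the embedding size are made up, and this toy version simply appends new entries at the end.

```python
import numpy as np
from typing import Dict, List


def add_tokens(vocab: Dict[str, int], new_tokens: List[str]) -> int:
    """Append unseen tokens at the end of the vocabulary; return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # next free index
            added += 1
    return added


def resize_embeddings(embeddings: np.ndarray, new_size: int, seed: int = 0) -> np.ndarray:
    """Grow the embedding matrix with randomly initialised rows for the new tokens."""
    rng = np.random.default_rng(seed)
    num_new = new_size - embeddings.shape[0]
    extra = rng.normal(scale=0.02, size=(num_new, embeddings.shape[1]))
    return np.vstack([embeddings, extra])


# made-up vocabulary and 4-dimensional embeddings
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "good": 2, "bad": 3}
toy_embeddings = np.zeros((len(toy_vocab), 4))

num_added = add_tokens(toy_vocab, ["cinematography", "good", "blockbuster"])
toy_embeddings = resize_embeddings(toy_embeddings, len(toy_vocab))
print(num_added, toy_vocab["cinematography"], toy_vocab["blockbuster"], toy_embeddings.shape)
```

The new embedding rows start from random values, so the model needs fine-tuning before the added tokens become meaningful.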

### Special tokens

**Pay attention** to used special tokens and their corresponding token ids.

Each transformer model has its own special tokens ([CLS], [SEP], [PAD], [EOS], etc.).

Thus, the same special token may be mapped to different token ids in distinct transformer models.
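
As an illustration, the sketch below wraps the same sequence with the start and end tokens of two different checkpoints. The ids are hard-coded from the published vocabularies of `bert-base-uncased` and `roberta-base`; treat them as illustrative and verify them on the actual tokenizer (e.g. via `tokenizer.cls_token_id` and `tokenizer.sep_token_id`). The helper `wrap_with_special_tokens` and the content ids are hypothetical.

```python
from typing import Dict, List

# special-token ids hard-coded for illustration; verify them on the real tokenizers
SPECIAL_IDS: Dict[str, Dict[str, int]] = {
    "bert-base-uncased": {"start": 101, "end": 102, "pad": 0},  # [CLS], [SEP], [PAD]
    "roberta-base": {"start": 0, "end": 2, "pad": 1},           # <s>, </s>, <pad>
}


def wrap_with_special_tokens(token_ids: List[int], model_name: str) -> List[int]:
    """Add the model-specific start and end tokens around a sequence of ids."""
    special = SPECIAL_IDS[model_name]
    return [special["start"]] + token_ids + [special["end"]]


content_ids = [7, 8, 9]  # placeholder ids; real content ids also differ across models
print(wrap_with_special_tokens(content_ids, "bert-base-uncased"))
print(wrap_with_special_tokens(content_ids, "roberta-base"))
```

Hard-coding ids like this is exactly what to avoid in practice: always read them from the tokenizer of the model in use.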

### Text cleaning

We didn't perform any kind of text cleaning before performing text encoding.

This is usually because transformer tokenizers have their own text cleaning process to perform tokenization and models **may be sensitive** to custom operations!

### Example

`bert-base-uncased` is trained on lowercased text.

**Check model cards** on huggingface to know more about the models you use and inspect their text encoding pipeline to understand how they behave.
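
The kind of normalization performed by uncased tokenizers can be approximated in a few lines. This is a rough sketch using only the standard library, assuming lowercasing plus accent stripping. The function name `uncased_normalize` is hypothetical and the real tokenizer pipeline includes further steps.

```python
import unicodedata


def uncased_normalize(text: str) -> str:
    """Lowercase text and strip accents, roughly what uncased checkpoints expect."""
    text = text.lower()
    # NFD splits accented characters into base character + combining mark
    decomposed = unicodedata.normalize("NFD", text)
    # drop combining marks (Unicode category 'Mn')
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("A Truly AMAZING Café scene!"))
```

Since the tokenizer already applies such steps, additional custom cleaning may be redundant or may even remove information the model was trained to use.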

#### Homework 📖

Experiment with different model cards.

Experiment with text cleaning and evaluate its impact on classification.

## 2. Model definition

We are now ready to define our transformer-based classifier.

### 2.1 Data Formatting

We first need to format input data to be fed as mini-batches in a training/evaluation procedure.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

The ``DataCollatorWithPadding`` receives a batch of tokenized examples, each with the fields

```
(input_ids, attention_mask, token_type_ids, label)
```

and **dynamically pads** ``input_ids``, ``attention_mask`` and ``token_type_ids`` to the maximum sequence length in the batch. Some models, such as DistilBERT, do not use ``token_type_ids``.

Intuitively, this operation saves a lot of memory compared to padding to the global maximum sequence length, while it introduces a reasonable computational overhead.

### Note

The above example is just one way out of many to perform dynamic batch padding: it really depends on which data structures you are using.
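
To make the dynamic padding idea concrete, here is a minimal pure-Python sketch that pads each mini-batch to its own maximum length and compares the number of stored positions with global padding. The helper `pad_batch` and the token ids are made up for illustration. This is not how `DataCollatorWithPadding` is implemented.

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in the batch and build the attention mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# two toy mini-batches with made-up token ids
batch_a = [[101, 7, 102], [101, 7, 8, 9, 102]]
batch_b = [[101, 102], [101, 5, 102]]

padded_a = pad_batch(batch_a)
padded_b = pad_batch(batch_b)

# dynamic padding: each batch is padded to its own maximum length
dynamic_cells = sum(len(row) for p in (padded_a, padded_b) for row in p["input_ids"])
# global padding: every sequence is padded to the longest sequence overall
global_max = max(len(seq) for seq in batch_a + batch_b)
global_cells = global_max * (len(batch_a) + len(batch_b))

print(padded_a["input_ids"], padded_a["attention_mask"])
print(dynamic_cells, global_cells)
```

The attention mask marks padded positions with 0 so that the model can ignore them. On real data with highly variable review lengths, the gap between the two padding strategies is typically much larger than in this toy case.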

### 2.2 Model definition

Defining a transformer-based model with huggingface is pretty straightforward!

Since we are dealing with text classification, we can use the off-the-shelf `AutoModelForSequenceClassification`.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_card,
                                                           num_labels=2,
                                                           id2label={0: 'NEG', 1: 'POS'},
                                                           label2id={'NEG': 0, 'POS': 1})

Let's first check the loaded model architecture.

In [None]:
print(model)

**That's it!**

That's the simplicity of huggingface's APIs.

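The model is ready to be fine-tuned for classification: the pre-trained encoder is loaded, while the classification head on top is newly initialized.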
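
A sequence classification model outputs one logit per label for each input. The sketch below shows how such logits can be turned into probabilities and label names using the `id2label` mapping defined above. The logits are made-up values and the helper `logits_to_predictions` is hypothetical.

```python
import numpy as np

id2label = {0: "NEG", 1: "POS"}


def logits_to_predictions(logits: np.ndarray):
    """Turn raw classification logits into (probabilities, label names)."""
    # subtract the row-wise maximum for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    labels = [id2label[int(i)] for i in probs.argmax(axis=-1)]
    return probs, labels


# made-up logits with shape (batch_size, num_labels)
toy_logits = np.array([[2.0, -1.0], [0.3, 1.2]])
probs, labels = logits_to_predictions(toy_logits)
print(labels)
```

The softmax only rescales the logits, so the predicted label is the same as the one obtained by taking the `argmax` of the raw logits.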

### Custom architectures

There are plenty of pre-defined model architectures $\rightarrow$ [auto classes](https://huggingface.co/docs/transformers/model_doc/auto)

In more complex scenarios, we may want to define a custom architecture where the pre-trained model is part of it.

In these cases, the way you do it strongly depends on the underlying neural library.

However, there exist several high-level APIs depending on your needs.

## 3. Model training and evaluation

We are now ready to define the training and evaluation procedures to test our model on the IMDB dataset.

In particular, we are going to use the ``Trainer`` API to efficiently perform training.

### 3.1 Metrics

First, we define classification metrics for evaluation.

In [None]:
from sklearn.metrics import f1_score, accuracy_score

def compute_metrics(output_info):
    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)
    
    f1 = f1_score(y_pred=predictions, y_true=labels, average='macro')
    acc = accuracy_score(y_pred=predictions, y_true=labels)
    return {'f1': f1, 'acc': acc}
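
To see what these metrics compute, here is a small worked example that reimplements accuracy and macro-averaged F1 with NumPy only. The function name `macro_f1_and_accuracy`, the logits, and the labels are made up for illustration. In the tutorial itself the scikit-learn functions above are used.

```python
import numpy as np


def macro_f1_and_accuracy(logits: np.ndarray, labels: np.ndarray, num_labels: int = 2):
    """Compute macro-F1 and accuracy from raw logits and gold labels."""
    predictions = np.argmax(logits, axis=-1)
    f1_scores = []
    for cls in range(num_labels):
        tp = np.sum((predictions == cls) & (labels == cls))
        fp = np.sum((predictions == cls) & (labels != cls))
        fn = np.sum((predictions != cls) & (labels == cls))
        denom = 2 * tp + fp + fn
        # per-class F1 = 2TP / (2TP + FP + FN); 0 if the class never occurs
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)
    # macro averaging: unweighted mean of the per-class F1 scores
    return {"f1": float(np.mean(f1_scores)), "acc": float(np.mean(predictions == labels))}


toy_logits = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 0.9], [0.3, 0.8]])
toy_labels = np.array([0, 1, 1, 1])
metrics = macro_f1_and_accuracy(toy_logits, toy_labels)
print(metrics)
```

Here the predictions are `[0, 1, 0, 1]`: class 0 has F1 = 2/3, class 1 has F1 = 4/5, so the macro-F1 is 11/15 (about 0.733) while the accuracy is 3/4.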

### 3.2 Training Arguments

The ``Trainer`` object can be extensively customized.

Feel free to check the [documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) on training arguments.

We first rename the `sentiment` column to `label`, the column name under which the `Trainer` pipeline looks for classification targets.

In [None]:
train_data = train_data.rename_column('sentiment', 'label')
test_data = test_data.rename_column('sentiment', 'label')

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test_dir",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to='none'
)
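
As a quick sanity check on these settings, the number of optimization steps can be estimated by hand. The sketch below assumes a single device and no gradient accumulation (both are assumptions, not stated in the configuration above) and uses the 5000-example training subset sampled earlier.

```python
import math

num_train_samples = 5000  # size of the training subset sampled earlier
batch_size = 8            # per_device_train_batch_size
num_epochs = 3            # num_train_epochs

# assumption: one device, no gradient accumulation, last partial batch kept
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions there are 625 steps per epoch and 1875 steps overall. With the "epoch" strategies, evaluation and checkpointing would then happen once every 625 steps.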

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
!pip list | grep torch

In [None]:
import torch
torch.cuda.is_available()

In [None]:
trainer.train()

# The End!