# Tutorial 3

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Transformers, Huggingface, Prompting

# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

## PART 0 ($\sim$5 mins)
*   Downloading a **dataset**.
*   Encoding a **dataset**.

## PART I ($\sim$30 mins)

*   Text encoding with transformers.
*   Model definition.
*   Model training and evaluation with huggingface APIs.

## PART II ($\sim$30 mins)

*   Prompting 101
*   Sentiment analysis with prompting
*   LangChain intro

## Preliminaries

First of all, we need to import some useful packages that we will use during this hands-on session.

In [2]:
# system packages
from pathlib import Path
import shutil
import urllib.request  # urlretrieve lives in the urllib.request submodule
import tarfile
import sys

# data and numerical management packages
import pandas as pd
import numpy as np

# useful during debugging (progress bars)
from tqdm import tqdm

# typing
from typing import List, Callable, Dict, Iterable

In [35]:
import torch
torch.cuda.is_available()

True

In [10]:
!pip list | grep datasets

datasets                          2.13.2


In [34]:
!pip install torch==1.13.0+cu116
!pip install transformers==4.18.0
!pip install datasets==2.13.2
!pip install accelerate -U
!pip install evaluate

Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0


In [12]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'width': 2560,
        'height': 1440,
        'scroll': True,
})

{'width': 2560, 'height': 1440, 'scroll': True}

In [4]:
!nvidia-smi

Mon Nov  6 14:33:24 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:0B:00.0  On |                  N/A |
| 30%   33C    P8    25W / 320W |    659MiB / 10240MiB |     32%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

# Data

We will use the IMDB dataset first introduced in tutorial 1.

* [**Stats**] A dataset of 50k movie reviews used for sentiment analysis: 25k with positive sentiment and 25k with negative sentiment.
* [**Sentiment**] We consider sentiment labels for classification.

We start by **downloading** the dataset and **extracting** it to a folder.

In [5]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(download_path: Path, url: str):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=download_path, reporthook=t.update_to)

        
def download_dataset(download_path: Path, url: str):
    print("Downloading dataset...")
    download_url(url=url, download_path=download_path)
    print("Download complete!")

def extract_dataset(download_path: Path, extract_path: Path):
    print("Extracting dataset... (it may take a while...)")
    with tarfile.open(download_path) as loaded_tar:
        loaded_tar.extractall(extract_path)
    print("Extraction completed!")

In [6]:
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset_name = "aclImdb"

print(f"Current work directory: {Path.cwd()}")
dataset_folder = Path.cwd().joinpath("Datasets")

if not dataset_folder.exists():
    dataset_folder.mkdir(parents=True)

dataset_tar_path = dataset_folder.joinpath("Movies.tar.gz")
dataset_path = dataset_folder.joinpath(dataset_name)

if not dataset_tar_path.exists():
    download_dataset(dataset_tar_path, url)

if not dataset_path.exists():
    extract_dataset(dataset_tar_path, dataset_folder)

Current work directory: /home/frgg/Repositories/nlp-course-material/2023-2024/Tutorial 3


#### Data Format

Just like in the first assignment, we need a **high level view** of the dataset that is helpful to our needs. 

We encode the dataset into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

In [7]:
dataframe_rows = []

for split in ['train', 'test']:
    for sentiment in ['pos', 'neg']:
        folder = dataset_folder.joinpath(dataset_name, split, sentiment)
        for file_path in folder.glob('*.txt'):            
            with file_path.open(mode='r', encoding='utf-8') as text_file:
                text = text_file.read()
                score = file_path.stem.split("_")[1]
                score = int(score)
                file_id = file_path.stem.split("_")[0]

                num_sentiment = 1 if sentiment == 'pos' else 0

                dataframe_row = {
                    "file_id": file_id,
                    "score": score,
                    "sentiment": num_sentiment,
                    "split": split,
                    "text": text
                }

                dataframe_rows.append(dataframe_row)

In [8]:
folder = Path.cwd().joinpath("Datasets", "Dataframes", dataset_name)
if not folder.exists():
    folder.mkdir(parents=True)

# transform the list of rows in a proper dataframe
df = pd.DataFrame(dataframe_rows)
df = df[["file_id", 
         "score",
         "sentiment",
         "split",
         "text"]
       ]
df_path = folder.with_name(dataset_name + ".pkl")
df.to_pickle(df_path)

# PART I

*   Text encoding with Transformers.
*   Model definition.
*   Model training and evaluation with huggingface APIs.

## 1. Text encoding with Transformers.

In tutorial 1, we have seen how to define standard machine learning models to address sentiment classification.

However, we know that Transformer-based models are one of the strongest baselines when assessing a task or benchmarking on a novel corpus.

Before defining our transformer-based classifier, we need to encode text inputs into numerical format.

As in Tutorial 1, we are going to **tokenize** input texts to perform token indexing.

### 1.1 Encoding the dataset

First, we are going to use the ``datasets`` library to wrap our dataset into a handy structure that speeds up computation.

In [9]:
from datasets import Dataset

# Slicing for showcasing purposes only!
train_df = df.loc[df['split'] == "train"].sample(frac=1.0)[:5000]
test_df = df.loc[df['split'] == "test"].sample(frac=1.0)[:1000]

train_data = Dataset.from_pandas(train_df)
test_data = Dataset.from_pandas(test_df)

Let's inspect the newly defined `Dataset` instances

In [10]:
print(train_data)
print(test_data)

Dataset({
    features: ['file_id', 'score', 'sentiment', 'split', 'text', '__index_level_0__'],
    num_rows: 5000
})
Dataset({
    features: ['file_id', 'score', 'sentiment', 'split', 'text', '__index_level_0__'],
    num_rows: 1000
})


### 1.2 Tokenization

Transformers typically rely on sub-word tokenizers, such as WordPiece (used by BERT and DistilBERT), Byte-Pair Encoding, or [SentencePiece](https://github.com/google/sentencepiece).

In particular, the `transformers` library offers the `AutoTokenizer` class to quickly retrieve our chosen transformer's ad-hoc tokenizer.

In [11]:
from transformers import AutoTokenizer

model_card = 'distilbert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_card)

The `model_card` variable defines the *path* where to look for our pre-trained model.

You can browse [huggingface's model hub](https://huggingface.co/models) to pick a model card according to your preference.

We proceed by tokenizing the movie reviews with our tokenizer.

In [12]:
def preprocess_text(texts):
    return tokenizer(texts['text'], truncation=True)

train_data = train_data.map(preprocess_text, batched=True)
test_data = test_data.map(preprocess_text, batched=True)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Let's inspect the preprocessed `Dataset` instances.

In [13]:
print(train_data)
print(test_data)

Dataset({
    features: ['file_id', 'score', 'sentiment', 'split', 'text', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 5000
})
Dataset({
    features: ['file_id', 'score', 'sentiment', 'split', 'text', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 1000
})


In [14]:
print(train_data['input_ids'][50])

[101, 2630, 10608, 2003, 2941, 3855, 1999, 1996, 2143, 2021, 2025, 1999, 2151, 2126, 2008, 3084, 2151, 2825, 3168, 1012, 2012, 2028, 2391, 1010, 2070, 4268, 2024, 13071, 27046, 1996, 6748, 3798, 1010, 11131, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2027, 4088, 2000, 6848, 2054, 2027, 1005, 2222, 2424, 2091, 2045, 1998, 2028, 1997, 2068, 1006, 1037, 2611, 1007, 2758, 2016, 29475, 2027, 1005, 2222, 2424, 1037, 2630, 10608, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2748, 1010, 2008, 2015, 2009, 1012, 6135, 4297, 5644, 2063, 15417, 4818, 2000, 1996, 2466, 1010, 1996, 2069, 6517, 4434, 2000, 1996, 2516, 1010, 1998, 2053, 2801, 2339, 2016, 2052, 6814, 2016, 1005, 1040, 2424, 1037, 2630, 10608, 1999, 1037, 2902, 1005, 1055, 8102, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1045, 1005, 1049, 10339, 2005, 2383, 4622, 2009, 2021, 8307, 2018, 2000, 3342, 1045, 6814, 999, 102]


In [15]:
print(train_data['attention_mask'][50])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


We can perform some quick 'sanity check' to evaluate the tokenization process

In [16]:
original_text = train_data['text'][50]
decoded_text = tokenizer.decode(train_data['input_ids'][50])

print(original_text)
print()
print()
print(decoded_text)

Blue monkey is actually mentioned in the film but not in any way that makes any possible sense. At one point,some kids are wandering thru the deeper levels, exploring. <br /><br />They begin to discuss what they'll find down there and one of them (a girl) says she bets they'll find a blue monkey.<br /><br />Yes, thats it. Totally inconsequential to the story, the only sad connection to the title, and no idea why she would suppose she'd find a blue monkey in a hospital's basement.<br /><br />I'm embarrassed for having remembered it but somebody had to remember I suppose!


[CLS] blue monkey is actually mentioned in the film but not in any way that makes any possible sense. at one point, some kids are wandering thru the deeper levels, exploring. < br / > < br / > they begin to discuss what they'll find down there and one of them ( a girl ) says she bets they'll find a blue monkey. < br / > < br / > yes, thats it. totally inconsequential to the story, the only sad connection to the title,

### 1.3 Vocabulary

We **do not** necessarily need to build a vocabulary since transformers already come with their own! 

**However**, it is still possible to add new tokens to the vocabulary to adapt the model to the given use case.

```
tokenizer.add_tokens(new_tokens=new_tokens)
```

The new tokens are appended to the tokenizer vocabulary with new indexes. Remember to resize the model's embedding matrix accordingly via `model.resize_token_embeddings(len(tokenizer))`.

### 1.4 Special tokens

**Pay attention** to used special tokens and their corresponding token ids.

Each transformer model has its own special tokens ([CLS], [SEP], [PAD], [EOS], etc.).

Thus, the same special token may be mapped to different token ids in distinct transformer models.

### 1.5 Text cleaning

We didn't perform any kind of text cleaning before performing text encoding.

This is usually because transformer tokenizers **have their own text cleaning process** to perform tokenization and models **may be sensitive** to custom operations!

In [22]:
example_text = "couldn't"
encoded_example = tokenizer.encode_plus(example_text, add_special_tokens=False)
print(encoded_example.tokens())

['couldn', "'", 't']


In [23]:
example_text = "At one point,some kids are wandering through the deeper levels, exploring."
encoded_example = tokenizer.encode_plus(example_text, add_special_tokens=False)
print(encoded_example.tokens())

['at', 'one', 'point', ',', 'some', 'kids', 'are', 'wandering', 'through', 'the', 'deeper', 'levels', ',', 'exploring', '.']
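The two outputs above can be roughly reproduced in plain Python. The sketch below only approximates the normalization and pre-tokenization step of an uncased BERT-style tokenizer (lowercasing, then splitting on whitespace and punctuation). `approx_pre_tokenize` is a hypothetical helper, not part of `transformers`, and the real tokenizer additionally splits out-of-vocabulary words into sub-word pieces.

```python
import re
from typing import List


def approx_pre_tokenize(text: str) -> List[str]:
    """Lowercase the text, then split it into word runs and single punctuation marks.

    Hypothetical helper: it approximates pre-tokenization only
    and does NOT perform WordPiece sub-word splitting.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


print(approx_pre_tokenize("couldn't"))
print(approx_pre_tokenize("At one point,some kids are wandering through the deeper levels, exploring."))
```

Since the tokenizer already lowercases and isolates punctuation, custom cleaning steps such as stripping punctuation change the input the pre-trained model was exposed to during training.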


#### Example

`bert-base-uncased` is trained on lowercased text.

**Check model cards** on huggingface to know more about the models you use and inspect their text encoding pipeline to understand how they behave.

#### Homework 📖

Experiment with different model cards.

Experiment with text cleaning and evaluate its impact on classification.

## 2. Model definition

We are now ready to define our transformer-based classifier.

### 2.1 Data Formatting

We first need to format input data to be fed as mini-batches in a training/evaluation procedure.

In [24]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

The ``DataCollatorWithPadding`` receives a batch of

```
(input_ids, attention_mask, token_type_ids, label)
```

tuples and **dynamically pads** ``input_ids``, ``attention_mask`` and ``token_type_ids`` to the length of the longest sequence in the batch.

Intuitively, this operation saves a lot of memory compared to padding every sequence to the global maximum length, at the cost of a small computational overhead.
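To make the memory argument concrete, here is a minimal pure-Python sketch of dynamic padding. It is not the actual `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and the padding id `0` and the global maximum of 512 are assumptions chosen for illustration.

```python
from typing import Dict, List, Optional


def pad_batch(batch: List[Dict[str, List[int]]],
              pad_id: int = 0,
              max_length: Optional[int] = None) -> Dict[str, List[List[int]]]:
    """Pad every sequence to max_length (default: the longest sequence in the batch)."""
    target = max_length or max(len(item["input_ids"]) for item in batch)
    padded = {"input_ids": [], "attention_mask": []}
    for item in batch:
        n_pad = target - len(item["input_ids"])
        padded["input_ids"].append(item["input_ids"] + [pad_id] * n_pad)
        # padded positions get a 0 in the attention mask so the model ignores them
        padded["attention_mask"].append(item["attention_mask"] + [0] * n_pad)
    return padded


# toy batch with made-up token ids
batch = [
    {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2023, 2143, 2003, 102], "attention_mask": [1, 1, 1, 1, 1]},
]

dynamic = pad_batch(batch)                  # padded to the longest sequence in the batch
static = pad_batch(batch, max_length=512)   # padded to an assumed global maximum

print(len(dynamic["input_ids"][0]), len(static["input_ids"][0]))
```

With dynamic padding the batch holds 2 x 5 token ids, while padding to the assumed global maximum stores 2 x 512.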

### Note

The above example is just one way out of many to perform dynamic batch padding: it really depends on which data structures you are using.

### 2.2 Model definition

Defining a transformer-based model with huggingface is pretty straightforward!

Since we are dealing with text classification, we can use off-the-shelf `AutoModelForSequenceClassification`.

In [25]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_card,
                                                           num_labels=2,
                                                           id2label={0: 'NEG', 1: 'POS'},
                                                           label2id={'NEG': 0, 'POS': 1})

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

Let's first check the loaded model architecture.

In [26]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

**That's it!**

That's the simplicity of huggingface's APIs.

The model is ready to use for classification.

### 2.3 Custom architectures

There are plenty of pre-defined model architectures $\rightarrow$ [auto classes](https://huggingface.co/docs/transformers/model_doc/auto)

In more complex scenarios, we may want to define a custom architecture where the pre-trained model is part of it.

In these cases, the way you do it strongly depends on the underlying neural library.

However, there exist several high-level APIs depending on your needs.

## 3. Model training and evaluation

We are now ready to define the training and evaluation procedures to test our model on the IMDB dataset.

In particular, we are going to use ``Trainer`` APIs to efficiently perform training.

### 3.1 Metrics

First, we define classification metrics for evaluation.

In [28]:
from sklearn.metrics import f1_score, accuracy_score

def compute_metrics(output_info):
    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)
    
    f1 = f1_score(y_pred=predictions, y_true=labels, average='macro')
    acc = accuracy_score(y_pred=predictions, y_true=labels)
    return {'f1': f1, 'acc': acc}
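To see what `compute_metrics` computes, the following sketch reproduces the same steps (argmax over logits, accuracy, macro-averaged F1) in plain Python on made-up values. The helper names `accuracy` and `macro_f1` and the toy logits are illustrative assumptions. In practice, use the `sklearn` functions above.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# made-up logits for four examples and two classes (0: NEG, 1: POS)
logits = [[2.0, -1.0], [0.3, 0.9], [1.5, 0.2], [-0.5, 1.1]]
labels = [0, 1, 1, 1]

# argmax over the class dimension, as np.argmax(predictions, axis=-1) does
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]

print(preds)
print({'f1': macro_f1(labels, preds), 'acc': accuracy(labels, preds)})
```

Macro averaging weights both classes equally, which is why it is a sensible companion to accuracy even on a balanced dataset such as this one.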

### Huggingface's metrics

Huggingface offers the **Evaluate** package, which contains several evaluation metrics (e.g., accuracy, f1, squad-f1, etc.).

### 3.2 Training Arguments

The ``Trainer`` object can be extensively customized.

Feel free to check the [documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) on training arguments.

In [36]:
import evaluate

acc_metric = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')

def compute_metrics(output_info):
    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)
    
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='macro')
    acc = acc_metric.compute(predictions=predictions, references=labels)
    return {**f1, **acc}
    

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]

We first rename the `sentiment` column to `label`, since the `Trainer` pipeline looks for gold labels under that name by default.

In [29]:
train_data = train_data.rename_column('sentiment', 'label')
test_data = test_data.rename_column('sentiment', 'label')

In [30]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test_dir",                 # where to save model
    learning_rate=2e-5,                   
    per_device_train_batch_size=8,         # batch size on each device (GPU/CPU)
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",           # when to report evaluation metrics/losses
    save_strategy="epoch",                 # when to save checkpoint
    load_best_model_at_end=True,
    report_to='none'                       # disabling wandb (default)
)

In [31]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

### Training schema with collator

<center>
    <img src="images/collator.png" alt="collator" />
</center>

In [32]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: score, split, file_id, __index_level_0__, text. If score, split, file_id, __index_level_0__, text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1875


Epoch,Training Loss,Validation Loss,F1,Acc
1,0.3592,0.253561,0.899994,0.9
2,0.2254,0.312255,0.916963,0.917
3,0.1468,0.353221,0.92287,0.923


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: score, split, file_id, __index_level_0__, text. If score, split, file_id, __index_level_0__, text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
Saving model checkpoint to test_dir/checkpoint-625
Configuration saved in test_dir/checkpoint-625/config.json
Model weights saved in test_dir/checkpoint-625/pytorch_model.bin
tokenizer config file saved in test_dir/checkpoint-625/tokenizer_config.json
Special tokens file saved in test_dir/checkpoint-625/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: score, split, file_id, __index_level_0__, text. If score, split, file_id, __index_level_0__, 

TrainOutput(global_step=1875, training_loss=0.21550222981770834, metrics={'train_runtime': 230.3159, 'train_samples_per_second': 65.128, 'train_steps_per_second': 8.141, 'total_flos': 1878284222786880.0, 'train_loss': 0.21550222981770834, 'epoch': 3.0})
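The step counts in the log above follow directly from the training arguments. The sketch below is a simplified version of that arithmetic, not the `Trainer` internals: `count_optimization_steps` is a hypothetical helper, and the single device and one gradient accumulation step are read off the log lines ("Total train batch size = 8", "Gradient Accumulation steps = 1").

```python
import math


def count_optimization_steps(num_examples: int,
                             per_device_batch_size: int,
                             num_epochs: int,
                             num_devices: int = 1,
                             grad_accum_steps: int = 1):
    """Return (steps per epoch, total optimization steps) for a simple training setup."""
    effective_batch_size = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


steps_per_epoch, total_steps = count_optimization_steps(
    num_examples=5000, per_device_batch_size=8, num_epochs=3
)
print(steps_per_epoch, total_steps)
```

With `save_strategy="epoch"`, the first checkpoint is written after 625 steps, which matches the `test_dir/checkpoint-625` folder in the log.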

### 3.3 Evaluation

We now evaluate the trained model on the test set.

In [37]:
test_prediction_info = trainer.predict(test_data)
test_predictions, test_labels = test_prediction_info.predictions, test_prediction_info.label_ids

print(test_predictions.shape)
print(test_labels.shape)

The following columns in the test set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: score, split, file_id, __index_level_0__, text. If score, split, file_id, __index_level_0__, text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1000
  Batch size = 8


(1000, 2)
(1000,)


In [39]:
test_metrics = compute_metrics([test_predictions, test_labels])
print(test_metrics)

{'f1': 0.8999935995903738, 'accuracy': 0.9}


### Some cleaning before PART II

Let's clean the memory and GPU before switching to instruction-tuned models.

In [42]:
import gc

del model
del trainer

with torch.no_grad():
    torch.cuda.empty_cache()

gc.collect()

53

# PART II

*   Prompting 101
*   Sentiment analysis with prompting
*   LangChain intro

## 1. Prompting 101

Prompting is a technique used to adapt a model to a variety of tasks without requiring fine-tuning.

```
Classify the text into neutral, negative or positive.
Text: {text}
Sentiment:
```

The model receives the above input prompt and performs text classification via completion.

```
Classify the text into neutral, negative or positive.
Text: {text}
Sentiment: {label}
```

In natural language, prompting is a very delicate process since natural language is **expressive**, **flexible**, and **ambiguous**.

A certain concept can be expressed in several ways:

* These ways are semantically **equivalent**.
* Yet, they may lead to **significant** **drifts** in model performance.

### 1.1 Sensitivity Factors

There are two main factors to consider when performing prompt-based learning.

#### [Prompt Engineering](https://huggingface.co/docs/transformers/main/tasks/prompting#basics-of-prompting)

Eventually we have to iteratively find the best performing prompt.

This can be done either

* Manually
* Automatically (via an ad-hoc model).


#### [Generation hyper-parameters](https://huggingface.co/docs/transformers/main/generation_strategies#text-generation-strategies)

Finding the optimal text generation strategy is a **critical point** for achieving satisfying performance.

These strategies affect how the model iteratively selects tokens during generation, and they help control phenomena such as repetitions, rare words, coherence with the input text, and style.

* [Deterministic] Greedy $\rightarrow$ the most preferred (i.e., highest likelihood) token wins
* [Deterministic] Beam search
* [Stochastic] Top-k sampling
* [Stochastic] Nucleus sampling
* [Contrastive search](https://huggingface.co/blog/introducing-csearch)  $\leftarrow$ **recommended**
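The difference between some of these strategies can be sketched on a single decoding step. The code below is a toy illustration in plain Python, not the `transformers` implementation: the next-token distribution is made up, and `greedy`, `top_k_filter`, `nucleus_filter` and `sample` are hypothetical helpers. Beam search and contrastive search are left out since they need more than one step of context.

```python
import random
from typing import Dict, List, Tuple


def greedy(probs: Dict[str, float]) -> str:
    """Deterministic: pick the highest-likelihood token."""
    return max(probs, key=probs.get)


def _renormalize(items: List[Tuple[str, float]]) -> Dict[str, float]:
    total = sum(p for _, p in items)
    return {tok: p / total for tok, p in items}


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most likely tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return _renormalize(ranked[:k])


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return _renormalize(kept)


def sample(probs: Dict[str, float], rng: random.Random) -> str:
    """Stochastic: draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# made-up next-token distribution after the prompt "Sentiment:"
next_token_probs = {"positive": 0.55, "negative": 0.25, "neutral": 0.15, "banana": 0.05}
rng = random.Random(0)

print(greedy(next_token_probs))
print(sample(top_k_filter(next_token_probs, k=2), rng))
print(sample(nucleus_filter(next_token_probs, top_p=0.9), rng))
```

Top-k always keeps a fixed number of candidates, whereas nucleus sampling adapts the candidate set to how peaked the distribution is.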

### 1.2 Model types

There are a lot of different large language models and it is quite easy to be confused.

Essentially, we have:

* **Base models** (either decoder-only or encoder-decoder): very good at text completion.
* **Chat-based models**: base models specifically fine-tuned to address instructions or to chat.

#### Example

In Huggingface, the distinction is usually visible in the model name:

* `llama2-7b`            $\rightarrow$ base model
* `llama2-7b-*-instruct`   $\rightarrow$ chat-based model

## 2. Sentiment analysis with prompting

Let's consider our task once again to evaluate prompt-based models.

### 2.1 Model pipeline

First, we have to define the model pipeline to digest input prompts.

We choose the `model_card` to build our model.

In [None]:
model_card = ""

Next, we build the corresponding tokenizer and text generation pipeline

In [None]:
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained(model_card)

pipe = pipeline('text-generation',
                model=model_card,
                tokenizer=tokenizer)

### 2.2 Inference

We are now ready to feed prompts to our model and evaluate its performance.

Let's start with an example.

In [None]:
prompt = """Classify the text into negative or positive. 
Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen.
Sentiment:
"""

sequences = pipe(
    prompt,
    max_new_tokens=10,
)[0]

print(sequences['generated_text'])

Now we try with the whole test set.

In [None]:
formatting_prompt = """Classify the text into negative or positive.
Text: {0}
Sentiment:
"""

def prepend_prompt(example):
    example['text'] = formatting_prompt.format(example['text'])
    return example


# map one example at a time: with batched=True, example['text'] would be a list of texts
test_data = test_data.map(prepend_prompt)

In [None]:
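def parse_generation(text):
    # keep only the completion that follows the 'Sentiment:' marker
    completion = text.split('Sentiment:')[1].strip().casefold()
    # the model may generate extra tokens after the label
    return 1 if completion.startswith('positive') else 0


sequences = pipe(
    test_data['text'],
    max_new_tokens=10
)

# for a list of prompts, the pipeline returns one list of candidates per prompt
predictions = [parse_generation(seq[0]['generated_text']) for seq in sequences]

# predictions are already class ids: compute_metrics would wrongly apply argmax to them
references = test_data['label']
metrics = {
    **f1_metric.compute(predictions=predictions, references=references, average='macro'),
    **acc_metric.compute(predictions=predictions, references=references),
}
print(metrics)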

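#### Homework 📖

Experiment with different [text generation strategies](https://huggingface.co/docs/transformers/main/generation_strategies#assisted-decoding) (e.g., assisted decoding) and check the [LLM tutorial](https://huggingface.co/docs/transformers/main/llm_tutorial).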

## 3. Advanced Prompting

There is no rule of thumb to perform well on prompting.

Some may argue it is *art*, some others might say it is just *engineering*.

However, here are some **general recommendations**:

* Check **how** the pre-trained model you are using was trained!

* Start **simple** and then refine.

* Instructions at the **start/end** of the prompt $\rightarrow$ based on how most attention layers work.

* **Separate** input text from instructions

* Provide **clear description** of the task: no ambiguity, text format, style, language, etc...

* **Evaluate** the prompt on several models

* Use advanced techniques: **few-shot prompting**, **Chain-of-thought (CoT)**, Least-to-Most (LtM)

### 3.1 Few-shot Prompting

https://huggingface.co/docs/transformers/main/tasks/prompting#best-practices-of-llm-prompting
https://huggingface.co/docs/transformers/custom_tools
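In few-shot prompting, a handful of solved examples (demonstrations) are placed in the prompt before the actual input. A minimal sketch of how such a prompt could be assembled for our sentiment task is shown below. `build_few_shot_prompt` is a hypothetical helper and the demonstrations are made up for illustration.

```python
from typing import List, Tuple


def build_few_shot_prompt(instruction: str,
                          demonstrations: List[Tuple[str, str]],
                          query: str) -> str:
    """Concatenate the instruction, the solved examples and the unsolved query."""
    blocks = [instruction]
    for text, label in demonstrations:
        blocks.append(f"Text: {text}\nSentiment: {label}")
    # the query uses the same format, with the label left for the model to complete
    blocks.append(f"Text: {query}\nSentiment:")
    return "\n\n".join(blocks)


# made-up demonstrations
demonstrations = [
    ("A wonderful movie with a touching story.", "positive"),
    ("Two hours of my life I will never get back.", "negative"),
]

prompt = build_few_shot_prompt(
    instruction="Classify the text into negative or positive.",
    demonstrations=demonstrations,
    query="The plot was predictable, but the cast was great.",
)
print(prompt)
```

The resulting string can be passed to a text generation pipeline exactly like the zero-shot prompt used earlier. The number and choice of demonstrations are additional factors the model may be sensitive to.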

### 3.2 Chain-of-thought (CoT) Prompting

### 3.3 Prompting vs Fine-tuning

## 4. LangChain

# The End!