# Tutorial 3

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Transformers, Huggingface, Prompting

# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

## PART 0 ($\sim$5 mins)
*   Downloading a **dataset**.
*   Encoding a a **dataset**.

## PART I ($\sim$30 mins)

*   Text encoding with transformers.
*   Model definition.
*   Model training and evaluation with huggingface APIs.

## PART II ($\sim$30 mins)

*   Prompting 101
*   Sentiment analysis with prompting
*   Advanced prompting

## Preliminaries

First of all, we need to import some useful packages that we will use during this hands-on session.

In [1]:
# system packages
from pathlib import Path
import shutil
import urllib
import tarfile
import sys

# data and numerical management packages
import pandas as pd
import numpy as np

# useful during debugging (progress bars)
from tqdm import tqdm

In [None]:
!pip install torch==1.13.0+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
!pip install transformers==4.30.0
!pip install datasets==2.13.2
!pip install accelerate -U
!pip install evaluate

In [2]:
import torch
torch.cuda.is_available()

  from .autonotebook import tqdm as notebook_tqdm


True

In [3]:
!nvidia-smi

Fri Nov 10 09:24:44 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   32C    P0    22W /  80W |      8MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [4]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'width': 2560,
        'height': 1440,
        'scroll': True,
})

{'width': 2560, 'height': 1440, 'scroll': True}

# Data

We will use the IMDB dataset first introduced in tutorial 1.

* [**Stats**] A dataset of 50k sentences used for sentiment analysis: 25k with positive sentiment, 25k with negative one.
* [**Sentiment**] We consider sentiment labels for classification.

We start by **downloading** the dataset and **extract** it to a folder.

In [6]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(download_path: Path, url: str):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=download_path, reporthook=t.update_to)

        
def download_dataset(download_path: Path, url: str):
    print("Downloading dataset...")
    download_url(url=url, download_path=download_path)
    print("Download complete!")

def extract_dataset(download_path: Path, extract_path: Path):
    print("Extracting dataset... (it may take a while...)")
    with tarfile.open(download_path) as loaded_tar:
        loaded_tar.extractall(extract_path)
    print("Extraction completed!")

In [7]:
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset_name = "aclImdb"

print(f"Current work directory: {Path.cwd()}")
dataset_folder = Path.cwd().joinpath("Datasets")

if not dataset_folder.exists():
    dataset_folder.mkdir(parents=True)

dataset_tar_path = dataset_folder.joinpath("Movies.tar.gz")
dataset_path = dataset_folder.joinpath(dataset_name)

if not dataset_tar_path.exists():
    download_dataset(dataset_tar_path, url)

if not dataset_path.exists():
    extract_dataset(dataset_tar_path, dataset_folder)

Current work directory: /home/frgg/Repositories/nlp-course-material/2023-2024/Tutorial 3


#### Data Format

Just like in the first assignment, we need a **high level view** of the dataset that is helpful to our needs. 

We encode the dataset into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

In [8]:
dataframe_rows = []

for split in ['train', 'test']:
    for sentiment in ['pos', 'neg']:
        folder = dataset_folder.joinpath(dataset_name, split, sentiment)
        for file_path in folder.glob('*.txt'):            
            with file_path.open(mode='r', encoding='utf-8') as text_file:
                text = text_file.read()
                score = file_path.stem.split("_")[1]
                score = int(score)
                file_id = file_path.stem.split("_")[0]

                num_sentiment = 1 if sentiment == 'pos' else 0

                dataframe_row = {
                    "file_id": file_id,
                    "score": score,
                    "sentiment": num_sentiment,
                    "split": split,
                    "text": text
                }

                dataframe_rows.append(dataframe_row)

In [9]:
folder = Path.cwd().joinpath("Datasets", "Dataframes", dataset_name)
if not folder.exists():
    folder.mkdir(parents=True)

# transform the list of rows in a proper dataframe
df = pd.DataFrame(dataframe_rows)
df = df[["file_id", 
         "score",
         "sentiment",
         "split",
         "text"]
       ]
df_path = folder.with_name(dataset_name + ".pkl")
df.to_pickle(df_path)

# PART I

*   Text encoding with Transformers.
*   Model definition.
*   Model training and evaluation with huggingface APIs.

## 1. Text encoding with Transformers.

In tutorial 1, we have seen how to define standard machine learning models to address sentiment classification.

However, we know that Transformer-based models are one of the strongest baselines when assessing a task or benchmarking on a novel corpus.

Before defining our transformer-based classifier, we need to encode text inputs into numerical format.

As in Tutorial 1, we are going to **tokenize** input texts to perform token indexing.

### 1.1 Encoding the dataset

First, we are going to use ``datasets`` library to encode our dataset into a handy wrapper for computational speedup.

In [10]:
from datasets import Dataset

# Slicing for showcasing purposes only!
train_df = df.loc[df['split'] == "train"].sample(frac=1.0)[:5000]
test_df = df.loc[df['split'] == "test"].sample(frac=1.0)[:1000]

train_data = Dataset.from_pandas(train_df)
test_data = Dataset.from_pandas(test_df)

Let's inspect the newly defined `Dataset` instances

In [11]:
print(train_data)
print(test_data)

Dataset({
    features: ['file_id', 'score', 'sentiment', 'split', 'text', '__index_level_0__'],
    num_rows: 5000
})
Dataset({
    features: ['file_id', 'score', 'sentiment', 'split', 'text', '__index_level_0__'],
    num_rows: 1000
})


### 1.2 Tokenization

Transformers typically use [SentencePiece tokenizer](https://github.com/google/sentencepiece) to perform sub-word level tokenization.

In particular, the `transformers` library offers the `AutoTokenizer` class to quickly retrieve our chosen transformer's ad-hoc tokenizer.

In [12]:
from transformers import AutoTokenizer

model_card = 'distilbert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_card)

The `model_card` variable defines the *path* where to look for our pre-trained model.

You can check [huggingface's hub](https://huggingface.co/models) model hub to pick the model card according to your preference.

We proceed on tokenizing movie reviews text with our tokenizer.

In [13]:
def preprocess_text(texts):
    return tokenizer(texts['text'], truncation=True)

train_data = train_data.map(preprocess_text, batched=True)
test_data = test_data.map(preprocess_text, batched=True)

                                                                                

Let's inspect the preprocess `Dataset` instances

In [14]:
print(train_data) 
print(test_data)

Dataset({
    features: ['file_id', 'score', 'sentiment', 'split', 'text', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 5000
})
Dataset({
    features: ['file_id', 'score', 'sentiment', 'split', 'text', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 1000
})


In [15]:
print(train_data['input_ids'][50])

[101, 1045, 2228, 2008, 2026, 5440, 2112, 1997, 2023, 3185, 1010, 1996, 2028, 2008, 4654, 6633, 24759, 14144, 1996, 11591, 23100, 1010, 28072, 1998, 27118, 22758, 1997, 1996, 8931, 1010, 3310, 2012, 1996, 14463, 1997, 1996, 2143, 1012, 3460, 6945, 5912, 1998, 2010, 17204, 2767, 1996, 6458, 2031, 2633, 25878, 1996, 13721, 2158, 2006, 1037, 4899, 2006, 2070, 5108, 1999, 2019, 5992, 11717, 3269, 1012, 2562, 1999, 2568, 2008, 5912, 2038, 2042, 2559, 2005, 1996, 3461, 2005, 3053, 1996, 2972, 2143, 1010, 1998, 2008, 1996, 3461, 2038, 2730, 1998, 8828, 2195, 2111, 2012, 2023, 2391, 1006, 2164, 2010, 5795, 1007, 1010, 1998, 5912, 2003, 2200, 5204, 2008, 3461, 2003, 14196, 9577, 1998, 7501, 2005, 2529, 5771, 1998, 2668, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2061, 1996, 6458, 2038, 2010, 3282, 4197, 2012, 3461, 1010, 2040, 2003, 1010, 1998, 1045, 2507, 1996, 3185, 1998, 6174, 6243, 24387, 2005, 2023, 1010, 1996, 2087, 19424, 1998, 17082, 4874, 1999, 2529, 2433, 2008, 2057, 2031, 

In [16]:
print(train_data['attention_mask'][50])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

We can perform some quick *sanity check* to evaluate the tokenization process

In [17]:
original_text = train_data['text'][50]
decoded_text = tokenizer.decode(train_data['input_ids'][50])

print(original_text)
print()
print()
print(decoded_text)

I think that my favorite part of this movie, the one that exemplifies the sheer pointless, stupidity and inanity of the proceedings, comes at the climax of the film. DOCTOR TED NELSON and his unmarried friend the Sheriff have finally cornered the Melting Man on a landing on some stairs in an electrical generating plant. Keep in mind that Nelson has been looking for the MM for nearly the entire film, and that the MM has killed and eaten several people at this point (including his boss), and Nelson is very aware that MM is violently insane and hungry for human flesh and blood.<br /><br />So the Sheriff has his gun pointed at MM, who is, and I give the movie and Rick Baker props for this, the most disgusting and terrifying object in human form that we have ever seen. And he yells a very important question to DOCTOR TED NELSON: "WHAT DO WE DO NOW?!?!?" <br /><br />The camera cuts over to DOCTOR TED NELSON, and it's obvious that Ted has no idea what to do next. Apparently Ted was so intent 

### 1.3 Vocabulary

We **do not** necessarily need to build a vocabulary since transformers already come with their own! 

**However**, it is still possible to add new tokens to the vocabulary to adapt the model to the given use case.

```
tokenizer.add_tokens(new_tokens=new_tokens)
```

The transformer vocabulary will update its **unusued** vocabulary indexes with newly provided tokens.

### 1.4 Special tokens

**Pay attention** to used special tokens and their corresponding token ids.

Each transformer models has its own special tokens ([CLS], [SEP], [PAD], [EOS], etc...).

Thus, the same special token may be mapped to different token ids in distinct transformer models.

### 1.5 Text cleaning

We didn't perform any kind of text cleaning before performing text encoding.

This is usually because transformer tokenizers **have their own text cleaning process** to perform tokenization.

Thus, models **may be sensitive** to custom operations!

In [18]:
example_text = "couldn't"
encoded_example = tokenizer.encode_plus(example_text, add_special_tokens=False)
print(encoded_example.tokens())

['couldn', "'", 't']


In [19]:
example_text = "At one point,some kids are wandering through the deeper levels, exploring."
encoded_example = tokenizer.encode_plus(example_text, add_special_tokens=False)
print(encoded_example.tokens())

['at', 'one', 'point', ',', 'some', 'kids', 'are', 'wandering', 'through', 'the', 'deeper', 'levels', ',', 'exploring', '.']


#### Example

`bert-base-uncased` is trained with text in lower format.

**Check model cards** on huggingface to know more about the models you use and inspect their text encoding pipeline to understand how they behave.

#### Homework 📖

Experiment with different model cards.

Experiment with text cleaning and evaluate its impact on classification.

## 2. Model definition

We are now ready to define our transformer-based classifier.

### 2.1 Data Formatting

We first need to format input data to be fed as mini-batches in a training/evaluation procedure.

In [20]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

The ``DataCollatorWithPadding`` receives a batch of

```
(input_ids, attention_mask, token_type_ids, label)
```

tuples and **dynamically pads** ``input_ids``, ``attention_mask`` and ``token_type_ids`` to maximum sequence in the batch. 

Intuitively, this operation saves a lot of memory compared to padding to global maximum sequence, while it introduces a reasonable computational overhead.

### Note

The above example is just one way out of many to perform dynamic batch padding: it really depends on which data structures you are using.

### 2.2 Model definition

Defining a transformer-based model with huggingface is pretty straightforward!

Since we are dealing with text classification, we can use off-the-shelf `AutoModelForSequenceClassification`.

In [22]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_card,
                                                           num_labels=2,
                                                           id2label={0: 'NEG', 1: 'POS'},
                                                           label2id={'NEG': 0, 'POS': 1})

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.we

Let's first check the loaded model architecture.

In [20]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

**That's it!**

That's the simplicity of huggingface's APIs.

The model is ready to use for classification.

### 2.3 Custom architectures

There are plenty of pre-defined model architectures $\rightarrow$ [auto classes](https://huggingface.co/docs/transformers/model_doc/auto)

In more complex scenarios, we may want to define a custom architecture where the pre-trained model is part of it.

In these cases, the way you do it strongly depends on the underlying neural library.

However, there exist several high-level APIs depending on your needs.

## 3. Model training and evaluation

We are now ready to define the training and evaluation procedures to test our model on the IMDB dataset.

In particular, we are going to use ``Trainer`` APIs to efficiently perform training.

### 3.1 Metrics

First, we define classification metrics for evaluation.

In [23]:
from sklearn.metrics import f1_score, accuracy_score

def compute_metrics(output_info):
    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)
    
    f1 = f1_score(y_pred=predictions, y_true=labels, average='macro')
    acc = accuracy_score(y_pred=predictions, y_true=labels)
    return {'f1': f1, 'acc': acc}

### Hugginface's metrics

Huggingface's offers the **Evaluate** package that contains several evaluation metrics (e.g., accuracy, f1, squad-f1, etc...)

In [24]:
import evaluate

acc_metric = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')

def compute_metrics(output_info):
    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)
    
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='macro')
    acc = acc_metric.compute(predictions=predictions, references=labels)
    return {**f1, **acc}
    

### 3.2 Training Arguments

The ``Trainer`` object can be extensively customized.

Feel free to check the [documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) on training arguments.

We first rename the `sentiment` column to `label` as the default input to `AutoModelForSequenceClassification`.

In [25]:
train_data = train_data.rename_column('sentiment', 'label')
test_data = test_data.rename_column('sentiment', 'label')

In [26]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test_dir",                 # where to save model
    learning_rate=2e-5,                   
    per_device_train_batch_size=8,         # accelerate defines distributed training
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",           # when to report evaluation metrics/losses
    save_strategy="epoch",                 # when to save checkpoint
    load_best_model_at_end=True,
    report_to='none'                       # disabling wandb (default)
)

In [27]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

### Training schema with collator

<center>
    <img src="images/collator.png" alt="collator" />
</center>

In [28]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.3428,0.266022,0.9139,0.914


TrainOutput(global_step=625, training_loss=0.333054443359375, metrics={'train_runtime': 169.1831, 'train_samples_per_second': 29.554, 'train_steps_per_second': 3.694, 'total_flos': 622869987692928.0, 'train_loss': 0.333054443359375, 'epoch': 1.0})

### 3.3 Evaluation

We now evaluate the trained model on the test set.

In [29]:
test_prediction_info = trainer.predict(test_data)
test_predictions, test_labels = test_prediction_info.predictions, test_prediction_info.label_ids

print(test_predictions.shape)
print(test_labels.shape)

(1000, 2)
(1000,)


In [30]:
test_metrics = compute_metrics([test_predictions, test_labels])
print(test_metrics)

{'f1': 0.9139004689420969, 'accuracy': 0.914}


### Some cleaning before PART II

Let's clean the memory and GPU before switching to instruction-tuned models.

In [31]:
import gc

model = None
del model
trainer = None
del trainer

with torch.no_grad():
    torch.cuda.empty_cache()

gc.collect()

112

# PART II

*   Prompting 101
*   Sentiment analysis with prompting
*   Advanced prompting

## 1. Prompting 101

Prompting is a technique used to adapt a model to a variety of tasks without requiring fine-tuning.

```
Classify the text into neutral, negative or positive.
Text: {text}
Sentiment:
```

The model receives the above input prompt and performs text classification via completion.

```
Classify the text into neutral, negative or positive.
Text: {text}
Sentiment: {label}
```

In natural language, prompting is a very delicate process since natural language is **expressive**, **flexible**, and, **ambiguous**.

A certain concept can be expressed in several ways:

* These ways are semantically **equivalent**
* May lead to **significant** model performance **drifts**

### 1.1 Sensitivity Factors

There are two main factors to consider when performing prompt-based learning.

#### [Prompt Engineering](https://www.promptingguide.ai/)

Eventually we have to iteratively find the best performing prompt.

This can either done

* Manually
* Automatically (via an ad-hoc model).


#### [Generation hyper-parameters](https://huggingface.co/docs/transformers/main/generation_strategies#text-generation-strategies)

Finding the optimal text generation strategy is a **critical point** for achieving satisfying performance.

These strategies affects how the model iteratively selects tokens during generation to avoid phenomena like repetitions, rare words, coherence with input text, and style.

* [Deterministic] Greedy $\rightarrow$ the most preferred (i.e., highest likelihood) token wins
* [Deterministic] Beam search
* [Stochastic] Top-k sampling
* [Stochastic] Nucleus sampling
* [Contrastive search](https://huggingface.co/blog/introducing-csearch)  $\leftarrow$ **recommended**

### 1.2 Model types

There are a lot of different large language models and it is quite easy to be confused.

Essentially, we have:

* **Base models** (either encoders or encode-decoders): very good at text completion.
* **Chat-based models**: base models specifically fine-tuned to address instructions or to chat.

#### Example

In Huggingface, the distinct is easily formatted as:

* `llama2-7b`            $\rightarrow$ base model
* `llama2-7b-*-instruct`   $\rightarrow$ chat-based model

## 2. Sentiment analysis with prompting

Let's consider our task once again to evaluate prompt-based models.

### Preliminaries

We first install some package(s) for efficient computation given our hardware limitations.

In [None]:
!git clone https://github.com/timdettmers/bitsandbytes.git
!cd bitsandbytes
!python setup.py install

# Alternatively (Colab) --> restart runtime afterwards! (re-run part I and the first code cell of Part I 3.2)
!pip install bitsandbytes

### 2.1 Model pipeline

First, we have to define the model pipeline to digest input prompts.

In [32]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

model_card = "danielhanchen/open_llama_3b_600bt_preview"
tokenizer = AutoTokenizer.from_pretrained(model_card)
model = AutoModelForCausalLM.from_pretrained(model_card, load_in_8bit=True, device_map='auto')

#### Homework 📖

Experiment with different model cards (either base or chat-base models)

### 2.2 Inference

We are now ready to feed prompts to our model and evaluate its performance.

Let's start with an example.

In [33]:
def complete_prompt(prompt, **kwargs):
    prompt = tokenizer(prompt, return_tensors='pt').to('cuda')
    generated = model.generate(input_ids=prompt['input_ids'],
                           attention_mask=prompt['attention_mask'],
                           **kwargs)
    generated = tokenizer.batch_decode(generated, skip_special_tokens=True)
    return generated[0]

prompt = """Classify the text into negative or positive. 
Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen.
Sentiment:
"""

generated = complete_prompt(prompt, max_new_tokens=10)
print(generated)

Classify the text into negative or positive. 
Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen.
Sentiment:


*

*Negative:


Now we try with the whole test set.

In [34]:
def prepend_prompt(example):
    example['prompt'] = formatting_prompt.format(example['text'])
    return example
    

formatting_prompt = """Classify the text into negative or positive. 
Text: {0}
Sentiment:
"""

test_data = test_data.map(prepend_prompt)

                                                                                

In [35]:
# first 100 examples for showcasing purposes only (especially given this lazy implementation)
generated = [complete_prompt(prompt, max_new_tokens=10) for prompt in tqdm(test_data['prompt'][:100])]

100%|█████████████████████████████████████████| 100/100 [00:57<00:00,  1.74it/s]


In [36]:
def parse_generation(text):
    label = text.split('Sentiment:')[1].strip()
    return [0, 1] if 'positive' in label.casefold() else [1, 0]

predictions = [parse_generation(seq) for seq in generated]

In [40]:
metrics = compute_metrics([np.array(predictions), np.array(test_data['label'][:100])])
print(metrics)
print(generated[4])

{'f1': 0.4542619542619543, 'accuracy': 0.58}
Classify the text into negative or positive. 
Text: Good cinematography, good acting good direction...cannot justify a story that is not and cannot be acceptable to any society. Amitabh has often used the media to make this junk sell able by saying that -- if such an incident happens...then what? I would like to ask him if such a thing happens for your own child or your grandchild (say girl child) then what will you do? I think every parents will have to take special care before interacting with any 60 year old neighbor if you have one -jia- with you. Such films should be banned and discouraged otherwise you inspire more more Nithari cases. Such acts are villainous and villains in films are punished..that should be the moral of the story and not glorify their act or them.
Sentiment:
I am not sure if this is a sentiment or


## 3. [Advanced Prompting](https://huggingface.co/docs/transformers/main/tasks/prompting#chain-of-thought)

There is no rule of thumb to perform well on prompting.

Some may argue it is *art*, some others might say it is just *engineering*.

However, here are some **general recommendations**:

* Check **how** the pre-trained model you are using was trained!

* Start **simple** and then refine.

* Instructions at the **start/end** of the prompt $\rightarrow$ based on how most attention layers work.

* **Separate** input text from instructions

* Provide **clear description** of the task: no ambiguity, text format, style, language, etc...

* **Evaluate** the prompt on several models

* Use advanced techniques: **few-shot prompting**, **Chain-of-thought (CoT)**, Least-to-Most (LtM)

### 3.1 From Zero- to Few-shot Prompting

In many situations, a prompt containing instructions is not sufficient for a model to behave properly.

We can improve the prompt by providing **a few** ground-truth examples showing how the model should behave.

```
Classify the text into negative or positive. 
Text: {example1}
Sentiment: {label1}
Text: {example2}
Sentiment: {label2}
Text: {example3}
Sentiment: {label3}
Text: {text}
Sentiment:
```

#### Examples may be insufficient

Depending on the task at hand, providing examples may be not sufficient for the model to *understand* the instructions.

Also, the model might ignore provided examples or it might still perform correctly despite using **intentionally wrong** examples!!

#### Lengthy prompts

Adding examples increases the level of detail of prompt, while it may considerably increases its length.

Pay attention to what ``model_card`` you choose since your model may **truncate** input prompts!

Additionally, a lengthy prompt **increases computation**!

#### Examples quality

Choosing the right set of examples has an impact on model performance.

Intuitively, we select examples to maximize (textual) diversity and cover the whole label distribution.

In practice, this may be harder than expected: models are sensitive to prompt formatting.

Let's try sentiment analysis again with Few-shot prompting!

In [41]:
def prepend_prompt(example):
    example['prompt'] = formatting_prompt.format(example['text'])
    return example
    

formatting_prompt = """Classify the text into negative or positive.
Text: Everything is so well done: acting, directing, visuals, settings, photography, casting. If you can enjoy a story of real people and real love - this is a winner.
Label: positive
Text: This is one of the dumbest films, I've ever seen. It rips off nearly ever type of thriller and manages to make a mess of them all.
Sentiment: negative
Text: {0}
Sentiment:
"""

test_data = test_data.map(prepend_prompt)

                                                                                

In [42]:
# first 100 examples for showcasing purposes only (especially given this lazy implementation)
generated = [complete_prompt(prompt, max_new_tokens=10) for prompt in tqdm(test_data['prompt'][:100])]

100%|█████████████████████████████████████████| 100/100 [01:03<00:00,  1.58it/s]


In [43]:
predictions = [parse_generation(seq) for seq in generated]

metrics = compute_metrics([np.array(predictions), np.array(test_data['label'][:100])])
print(metrics)

{'f1': 0.36305732484076436, 'accuracy': 0.57}


#### Homework 📖

Experiment with different few-shot examples and evaluate corresponding model performance.

### 3.2 Chain-of-thought (CoT) Prompting

Providing examples to improve task performance may fail in complex scenarios like reasoning tasks.

CoT prompting forces the model to generate intermediate reasoning steps before providing the final output.

CoT can either be achieved via

* Few-shot examples on how to perform *reasoning*
* Defining the prompt to force *reasoning* (e.g., *let's think step by step*)

Let's try our sentiment analysis task with CoT prompting

In [44]:
def prepend_prompt(example):
    example['prompt'] = formatting_prompt.format(example['text'])
    return example
    

formatting_prompt = """Classify the text into negative or positive.
Text: Everything is so well done: acting, directing, visuals, settings, photography, casting. If you can enjoy a story of real people and real love - this is a winner.
Label: positive
Text: This is one of the dumbest films, I've ever seen. It rips off nearly ever type of thriller and manages to make a mess of them all.
Sentiment: negative
Text: {0}
Let's think step by step.
Sentiment:
"""

test_data = test_data.map(prepend_prompt)

                                                                                

In [45]:
# first 100 examples for showcasing purposes only (especially given this lazy implementation)
generated = [complete_prompt(prompt, max_new_tokens=20) for prompt in tqdm(test_data['prompt'][:100])]

100%|█████████████████████████████████████████| 100/100 [01:53<00:00,  1.13s/it]


In [46]:
predictions = [parse_generation(seq) for seq in generated]

metrics = compute_metrics([np.array(predictions), np.array(test_data['label'][:100])])
print(metrics)

{'f1': 0.36305732484076436, 'accuracy': 0.57}


#### Homework 📖

Experiment with different CoT prompts to enforce intermediate reasoning steps.

For more details check this [page](https://www.promptingguide.ai/techniques/cot).

### 3.3 Prompting vs Fine-tuning

At last, we may wondering on which technique to use.

In short, prompting comes at hand when transferring a pre-trained model on a domain that has some affinities with those seen during training.

In other cases like:

* Different domain
* Sensitive data
* Low-resource language
* Domain-specific model constraints

Fine-tuning is the preferred choice (to maximize improvements)

# The End!