# Homework 2 Part 3

## Course Name: Large Language Models
#### Lecturers: Dr. Soleimani, Dr. Rohban, Dr. Asgari

---

#### Notebooks Supervised By: Omid Ghahroodi, MohammadAli SadraeiJavaheri
#### Notebook Prepared By: Omid Ghahroodi, MohammadAli SadraeiJavaheri

**Contact**: Ask your questions in Quera

---

### Instructions:
- Complete all exercises presented in this notebook.
- Ensure you run each cell after you've entered your solution.
- After completing the exercises, save the notebook and <font color='red'>follow the submission guidelines provided in the PDF.</font>


---

**Note**: Replace the placeholders (between <font color="green">`## Your code begins ##`</font> and <font color="green">`## Your code ends ##`</font>) with the appropriate details.


# 1. Introduction

This notebook is a practical exercise in prompt engineering and calibration for large language models. We will work with `phi-1.5` (loaded below as `microsoft/phi-1_5`), a relatively small causal language model, and the `IMDB sentiment dataset`, a widely used collection of movie reviews labeled as positive or negative. The goal is to explore how different prompts affect the model's accuracy at sentiment classification, and how well the model's confidence is calibrated.

In this exercise, you will try different prompt choices and examine their effect on the model's performance. Your task is to measure the calibration of the model for each of the given prompts and compare the results. To do this, first implement the Expected Calibration Error (ECE) metric, which measures how closely the confidence of the model's predictions matches their accuracy. Then calculate and report the ECE for the results obtained from each prompt. This shows how prompt design affects model calibration in sentiment analysis.
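
As a starting point for the ECE implementation, here is a minimal sketch using equal-width confidence bins. The function name `expected_calibration_error`, the default of 10 bins, and the toy inputs are assumptions made for illustration; they are not part of the assignment's reference solution.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    predictions: Sequence[int],
    labels: Sequence[int],
    n_bins: int = 10,
) -> float:
    """ECE = sum over bins of (bin size / N) * |accuracy(bin) - mean confidence(bin)|."""
    if not (len(confidences) == len(predictions) == len(labels)):
        raise ValueError("All inputs must have the same length.")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Per bin: [count, number of correct predictions, sum of confidences]
    bins = [[0, 0, 0.0] for _ in range(n_bins)]
    for conf, pred, label in zip(confidences, predictions, labels):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += int(pred == label)
        bins[idx][2] += conf

    ece = 0.0
    for count, correct, conf_sum in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        accuracy = correct / count
        avg_conf = conf_sum / count
        ece += (count / n) * abs(accuracy - avg_conf)
    return ece


# Toy example with made-up numbers (not results from this notebook)
toy_conf = [0.9, 0.9, 0.6, 0.6]
toy_pred = [1, 1, 0, 0]
toy_true = [1, 0, 0, 0]
print(expected_calibration_error(toy_conf, toy_pred, toy_true))  # roughly 0.4
```

Here `confidences` is expected to hold the probability the model assigns to its predicted label, so a two-class model would produce values between 0.5 and 1.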

In [None]:
%%capture

!pip install datasets
!pip install transformers
!pip install einops

In [None]:
# Note: Do NOT make changes to this block.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import classification_report
from tqdm import tqdm
import itertools
import torch
import random
import numpy as np
import pandas as pd


SEED=21

np.random.seed(SEED)
random.seed(SEED)

## 1.1 Load Dataset

Because the `IMDB sentiment dataset` is large, we evaluate on only 1000 samples of it. The important variables from the cell below are:
- `data`: the 1000 evaluation samples (500 from the start and 500 from the end of the IMDB test split, shuffled)
- `pos_samples`, `neg_samples`: 3 samples from each class that we will use in section `2.2 Few-shot settings`
- `calibration_context`: samples used for calibration in section `3. Calibration`

In [None]:
# Note: Do NOT make changes to this block.

dataset = load_dataset("imdb")

num_of_test_data = 1000

test_set = list(dataset['test'])

data = np.array(test_set[:num_of_test_data]+test_set[-num_of_test_data:])
data = [i for i in data if len(i['text'])<2000]
data = np.array(test_set[:num_of_test_data//2]+test_set[-num_of_test_data//2:])

np.random.shuffle(data)


pos_samples = []
neg_samples = []

for i in range(12400, 12600, 1):
    if len(test_set[i]['text'])<1000:
        if test_set[i]['label'] == 0:
            neg_samples.append(test_set[i]['text'])
        elif test_set[i]['label'] == 1:
            pos_samples.append(test_set[i]['text'])
pos_samples = pos_samples[:3]
neg_samples = neg_samples[:3]

calibration_context = []

for i in range(13000, 16000, 1):
    if len(test_set[i]['text'])<=4000:
        calibration_context.append(test_set[i]['text'])

data[0]

{'text': "This is one of those movies that you watch because it's bad. Such a movie that you watch just to see it's shitty craftsmanship. Supposedly a horror, I cannot imagine how anyone can be afraid of a claymation bug, especially one that is translucent in nature where you can see the actor's legs behind it.<br /><br />Even with no budget, a little bit of attention to detail and even an attempt at making this movie believable would have sucked the fun right out of it, as they would have had to replace all of the actors and the entire story with it. If I had nothing to make fun of while it was playing, I would have stopped it after 10 minutes, and put on some quality show like Spunge Bob Square Pants (HAR HAR HAR).<br /><br />I Strongly recommend that Brett Piper get with Quintin Terrantino and Really pump out some feces.<br /><br />:)",
 'label': 0}

## 1.2 Load Model and Tokenizer

In [None]:
# Note: Do NOT make changes to this block.

torch.set_default_device("cuda")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)


# 2. Classification (30 Points)



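In the next cell, you must complete the implementation of `classify`. This function classifies a text using language model generation: the model is forced to generate exactly one token, chosen between the positive and negative label tokens.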
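
Restricting generation to two label tokens reduces classification to a comparison of two next-token scores. The sketch below shows that comparison on plain Python numbers, with no model involved. The function `pick_label`, the token ids, and the logits are all hypothetical and only illustrate the idea.

```python
import math
from typing import Dict, Tuple


def pick_label(
    next_token_logits: Dict[int, float], pos_token_id: int, neg_token_id: int
) -> Tuple[int, float]:
    """Return (label, confidence) using only the two label-token logits."""
    pos_logit = next_token_logits[pos_token_id]
    neg_logit = next_token_logits[neg_token_id]
    # Softmax over just the two allowed tokens; every other token is ignored.
    m = max(pos_logit, neg_logit)  # subtract the max for numerical stability
    pos_exp = math.exp(pos_logit - m)
    neg_exp = math.exp(neg_logit - m)
    pos_prob = pos_exp / (pos_exp + neg_exp)
    if pos_prob >= 0.5:
        return 1, pos_prob
    return 0, 1.0 - pos_prob


# Made-up logits for a tiny vocabulary: token 33 scores highest overall,
# but it is not one of the two allowed label tokens.
toy_logits = {11: 2.0, 22: 0.5, 33: 5.0}
label, confidence = pick_label(toy_logits, pos_token_id=11, neg_token_id=22)
print(label, round(confidence, 3))
```

The confidence returned here is the kind of value that would later be fed into a calibration metric.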

In [None]:
from typing import List

def classify(texts: List[str], pos_token: str, neg_token: str) -> List[int]:
    predicted_labels = []
    pos_token_id = tokenizer.encode(pos_token)[0]
    neg_token_id = tokenizer.encode(neg_token)[0]
    decoding_tokens = [pos_token_id, neg_token_id]
    for text in tqdm(texts):
        ## Your code begins ##
        # Tokenize the prompt into a tensor of shape (1, sequence_length)
        input_ids = torch.tensor([tokenizer(text)['input_ids']])
        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=1,
            prefix_allowed_tokens_fn=lambda batch_id, context: decoding_tokens  # restrict generation to the two label tokens
        )
        # The last token of the single returned sequence is the generated label; convert it to a Python int
        last_output_id = outputs[0][-1].item()
        ## Your code ends ##
        if last_output_id == pos_token_id:
            predicted_labels.append(1)
        elif last_output_id == neg_token_id:
            predicted_labels.append(0)
        else:
            if not isinstance(last_output_id, int):
                raise ValueError("Convert last_output_id to normal python type (use item method in torch)!")
            raise ValueError(f"An unsupported label ({last_output_id}) occurred!!!")

    return predicted_labels

## 2.1 Zero-shot settings (effect of label names)

In this section, you will classify `data` using prompts alone, without any examples. The next two cells test performance with two different prompts that differ in their label names.

In [None]:
pos_token = 'positive'
neg_token = 'negative'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{text}
The sentiment of the above text is: '''

texts = [
    prompt_template.format(
        text=row['text'],
        pos_token=pos_token,
        neg_token=neg_token
    )
    for row in data
]
true_labels = [
    row['label']
    for row in data
]
## Your code begins ##
predicted_labels = classify(texts ,pos_token, neg_token)
## Your code ends ##
print(classification_report(y_true=true_labels, y_pred=predicted_labels))

100%|██████████| 1000/1000 [05:30<00:00,  3.03it/s]


              precision    recall  f1-score   support

           0       1.00      0.12      0.21       500
           1       0.53      1.00      0.69       500

    accuracy                           0.56      1000
   macro avg       0.77      0.56      0.45      1000
weighted avg       0.77      0.56      0.45      1000



In [None]:
pos_token = '1'
neg_token = '0'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} for positive or {neg_token} for negative.
{text}
The sentiment of the above text is: '''

## Your code begins ##
texts = None
true_labels = None
predicted_labels = None

texts = [
    prompt_template.format(
        text=row['text'],
        pos_token=pos_token,
        neg_token=neg_token
    )
    for row in data
]
true_labels = [
    row['label']
    for row in data
]
predicted_labels = classify(texts ,pos_token, neg_token)
## Your code ends ##
print(classification_report(y_true=true_labels, y_pred=predicted_labels))

100%|██████████| 1000/1000 [05:44<00:00,  2.90it/s]

              precision    recall  f1-score   support

           0       0.72      0.08      0.14       500
           1       0.51      0.97      0.67       500

    accuracy                           0.52      1000
   macro avg       0.61      0.52      0.40      1000
weighted avg       0.61      0.52      0.40      1000






## 2.2 Few-shot settings
### 2.2.1 Effect of different few-shot examples

In this section, you will add one positive and one negative example to your prompt. With 3 positive and 3 negative samples, there are 9 possible example pairs; you must compare all 9 results in your report!

In [None]:
pos_token = 'positive'
neg_token = 'negative'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{pos_sample}
The sentiment of the above text is: {pos_token}
{neg_sample}
The sentiment of the above text is: {neg_token}
{text}
The sentiment of the above text is: '''

for pos_sample in pos_samples:
    for neg_sample in neg_samples:
        print(f'Results with:\n{pos_sample=}\n{neg_sample=}')
        ## Your code begins ##
        texts = [
            prompt_template.format(
                text=row['text'],
                pos_sample=pos_sample,
                neg_sample=neg_sample,
                pos_token=pos_token,
                neg_token=neg_token
            )
            for row in data
            ]
        true_labels = [
            row['label']
            for row in data
        ]
        predicted_labels = classify(texts ,pos_token, neg_token)
        ## Your code ends ##
        print(classification_report(y_true=true_labels, y_pred=predicted_labels))
        print("=====================================")

Results with:
pos_sample="Previous reviewer Claudio Carvalho gave a much better recap of the film's plot details than I could. What I recall mostly is that it was just so beautiful, in every sense - emotionally, visually, editorially - just gorgeous.<br /><br />If you like movies that are wonderful to look at, and also have emotional content to which that beauty is relevant, I think you will be glad to have seen this extraordinary and unusual work of art.<br /><br />On a scale of 1 to 10, I'd give it about an 8.75. The only reason I shy away from 9 is that it is a mood piece. If you are in the mood for a really artistic, very romantic film, then it's a 10. I definitely think it's a must-see, but none of us can be in that mood all the time, so, overall, 8.75."
neg_sample='Shame Shame Shame on UA/DW for what you do! <br /><br />I was appalled. <br /><br />Do NOT take kids to see this movie. The humor is totally inappropriate for children - plus they\'ll be bored and disappointed. Certain

100%|██████████| 1000/1000 [12:36<00:00,  1.32it/s]


              precision    recall  f1-score   support

           0       0.93      0.49      0.64       500
           1       0.65      0.96      0.78       500

    accuracy                           0.73      1000
   macro avg       0.79      0.73      0.71      1000
weighted avg       0.79      0.73      0.71      1000

Results with:
pos_sample="Previous reviewer Claudio Carvalho gave a much better recap of the film's plot details than I could. What I recall mostly is that it was just so beautiful, in every sense - emotionally, visually, editorially - just gorgeous.<br /><br />If you like movies that are wonderful to look at, and also have emotional content to which that beauty is relevant, I think you will be glad to have seen this extraordinary and unusual work of art.<br /><br />On a scale of 1 to 10, I'd give it about an 8.75. The only reason I shy away from 9 is that it is a mood piece. If you are in the mood for a really artistic, very romantic film, then it's a 10. I defini

100%|██████████| 1000/1000 [11:35<00:00,  1.44it/s]


              precision    recall  f1-score   support

           0       0.92      0.42      0.57       500
           1       0.62      0.96      0.76       500

    accuracy                           0.69      1000
   macro avg       0.77      0.69      0.67      1000
weighted avg       0.77      0.69      0.67      1000

Results with:
pos_sample="Previous reviewer Claudio Carvalho gave a much better recap of the film's plot details than I could. What I recall mostly is that it was just so beautiful, in every sense - emotionally, visually, editorially - just gorgeous.<br /><br />If you like movies that are wonderful to look at, and also have emotional content to which that beauty is relevant, I think you will be glad to have seen this extraordinary and unusual work of art.<br /><br />On a scale of 1 to 10, I'd give it about an 8.75. The only reason I shy away from 9 is that it is a mood piece. If you are in the mood for a really artistic, very romantic film, then it's a 10. I defini

100%|██████████| 1000/1000 [12:14<00:00,  1.36it/s]


              precision    recall  f1-score   support

           0       0.88      0.67      0.76       500
           1       0.73      0.91      0.81       500

    accuracy                           0.79      1000
   macro avg       0.80      0.79      0.79      1000
weighted avg       0.80      0.79      0.79      1000

Results with:
pos_sample='A stunningly well-made film, with exceptional acting, directing, writing, and photography.<br /><br />A newlywed finds married life not what she expected, and starts to question her duty to herself versus her duty to society. Together with her sister -in-law, she makes some radical departures from conventional roles and mores.'
neg_sample='Shame Shame Shame on UA/DW for what you do! <br /><br />I was appalled. <br /><br />Do NOT take kids to see this movie. The humor is totally inappropriate for children - plus they\'ll be bored and disappointed. Certainly *we all* have read Theo\'s wonderful children book and certainly we have expectation

100%|██████████| 1000/1000 [10:57<00:00,  1.52it/s]


              precision    recall  f1-score   support

           0       0.92      0.36      0.52       500
           1       0.60      0.97      0.74       500

    accuracy                           0.67      1000
   macro avg       0.76      0.67      0.63      1000
weighted avg       0.76      0.67      0.63      1000

Results with:
pos_sample='A stunningly well-made film, with exceptional acting, directing, writing, and photography.<br /><br />A newlywed finds married life not what she expected, and starts to question her duty to herself versus her duty to society. Together with her sister -in-law, she makes some radical departures from conventional roles and mores.'
neg_sample="There's only one thing I'm going to say about cat in the hat...as a KIDS movie and a good comedy movie it sucks...I lost track of how many terrible jokes in the movie that not only sucked but weren't exactly kid appropriate. Oh and by the way the way the cat in the hat talked was annoying...as for the pl

100%|██████████| 1000/1000 [09:32<00:00,  1.75it/s]


              precision    recall  f1-score   support

           0       0.91      0.39      0.54       500
           1       0.61      0.96      0.75       500

    accuracy                           0.67      1000
   macro avg       0.76      0.67      0.64      1000
weighted avg       0.76      0.67      0.64      1000

Results with:
pos_sample='A stunningly well-made film, with exceptional acting, directing, writing, and photography.<br /><br />A newlywed finds married life not what she expected, and starts to question her duty to herself versus her duty to society. Together with her sister -in-law, she makes some radical departures from conventional roles and mores.'
neg_sample='A fine line up of actors and a seemingly nice plot -- though not original -- promised me a nice evening in front of the TV. I was disappointed. The actors delivered up to standard (Juliette Lewis cuddly as ever; William Hurt solid but in the background; Shelley Duvall convincing as ever) but the story wa

100%|██████████| 1000/1000 [10:14<00:00,  1.63it/s]


              precision    recall  f1-score   support

           0       0.86      0.61      0.71       500
           1       0.70      0.90      0.79       500

    accuracy                           0.75      1000
   macro avg       0.78      0.75      0.75      1000
weighted avg       0.78      0.75      0.75      1000

Results with:
pos_sample='one of best movies ever...Fire...it is not much about sociological description of India today...it is the mind blowing use of light that never stops, never becomes...normal...even when...in this sense the movie is almost unique...both leads are of very good quality...the origin of Das as a street performer are pretty obvious...her performance is a superb "cammeo"...but the use of the light...I have look at it and looked at it, again and again...still mind blowing after ages...nothing torrid in the story...rather "pure" way of facing the subject...in a way it is sad that in the bizarre world we live today, a major art work is usually known 

100%|██████████| 1000/1000 [13:24<00:00,  1.24it/s]


              precision    recall  f1-score   support

           0       0.91      0.43      0.58       500
           1       0.63      0.96      0.76       500

    accuracy                           0.69      1000
   macro avg       0.77      0.69      0.67      1000
weighted avg       0.77      0.69      0.67      1000

Results with:
pos_sample='one of best movies ever...Fire...it is not much about sociological description of India today...it is the mind blowing use of light that never stops, never becomes...normal...even when...in this sense the movie is almost unique...both leads are of very good quality...the origin of Das as a street performer are pretty obvious...her performance is a superb "cammeo"...but the use of the light...I have look at it and looked at it, again and again...still mind blowing after ages...nothing torrid in the story...rather "pure" way of facing the subject...in a way it is sad that in the bizarre world we live today, a major art work is usually known 

100%|██████████| 1000/1000 [11:46<00:00,  1.41it/s]


              precision    recall  f1-score   support

           0       0.92      0.26      0.40       500
           1       0.57      0.98      0.72       500

    accuracy                           0.62      1000
   macro avg       0.74      0.62      0.56      1000
weighted avg       0.74      0.62      0.56      1000

Results with:
pos_sample='one of best movies ever...Fire...it is not much about sociological description of India today...it is the mind blowing use of light that never stops, never becomes...normal...even when...in this sense the movie is almost unique...both leads are of very good quality...the origin of Das as a street performer are pretty obvious...her performance is a superb "cammeo"...but the use of the light...I have look at it and looked at it, again and again...still mind blowing after ages...nothing torrid in the story...rather "pure" way of facing the subject...in a way it is sad that in the bizarre world we live today, a major art work is usually known 

100%|██████████| 1000/1000 [12:34<00:00,  1.33it/s]

              precision    recall  f1-score   support

           0       0.89      0.42      0.57       500
           1       0.62      0.95      0.75       500

    accuracy                           0.69      1000
   macro avg       0.76      0.68      0.66      1000
weighted avg       0.76      0.69      0.66      1000






### 2.2.2 Effect of the order of few-shot examples

The order of the examples matters in in-context few-shot learning with Large Language Models (LLMs). In this section, we test this with three samples (two positive and one negative) and evaluate all six permutations of their order in the prompt.
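
To make the six orderings concrete, here is a small sketch that enumerates them with `itertools.permutations`. The placeholder strings in `demo_samples` stand in for the formatted few-shot examples and are assumptions for illustration only.

```python
import itertools

# Placeholders for two positive examples and one negative example
demo_samples = ["POS_A", "POS_B", "NEG_A"]

orderings = [
    " | ".join(demo_samples[idx] for idx in permutation)
    for permutation in itertools.permutations(range(len(demo_samples)))
]

# 3 samples give 3! = 6 orderings
for ordering in orderings:
    print(ordering)
```

The index tuples come out in the same order as the `Results with Permutation (...)` headers, so index `2` always refers to the negative example.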

In [None]:
pos_token = 'positive'
neg_token = 'negative'

sample_template = '''
{text}
The sentiment of the above text is: {label}'''

prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{samples}
{text}
The sentiment of the above text is: '''

samples_list = [
    sample_template.format(text=pos_samples[0], label=pos_token),
    sample_template.format(text=pos_samples[1], label=pos_token),
    sample_template.format(text=neg_samples[0], label=neg_token)
]

for permutation_indexes in itertools.permutations(range(len(samples_list))):
    print(f'Results with Permutation {permutation_indexes}')
    samples_permuted = [samples_list[idx] for idx in permutation_indexes]
    samples = ''.join(samples_permuted)
    ## Your code begins ##
    texts = [prompt_template.format(
                text=row['text'],
                samples = samples,
                pos_token=pos_token,
                neg_token=neg_token
            )
            for row in data
            ]
    true_labels = [
            row['label']
            for row in data
        ]
    predicted_labels = classify(texts ,pos_token, neg_token)
    ## Your code ends ##
    print(classification_report(y_true=true_labels, y_pred=predicted_labels))
    print("=====================================")

Results with Permutation (0, 1, 2)


100%|██████████| 1000/1000 [14:17<00:00,  1.17it/s]


              precision    recall  f1-score   support

           0       0.74      0.86      0.80       500
           1       0.84      0.70      0.76       500

    accuracy                           0.78      1000
   macro avg       0.79      0.78      0.78      1000
weighted avg       0.79      0.78      0.78      1000

Results with Permutation (0, 2, 1)


100%|██████████| 1000/1000 [14:19<00:00,  1.16it/s]


              precision    recall  f1-score   support

           0       0.86      0.62      0.72       500
           1       0.70      0.90      0.79       500

    accuracy                           0.76      1000
   macro avg       0.78      0.76      0.76      1000
weighted avg       0.78      0.76      0.76      1000

Results with Permutation (1, 0, 2)


100%|██████████| 1000/1000 [14:19<00:00,  1.16it/s]


              precision    recall  f1-score   support

           0       0.85      0.79      0.81       500
           1       0.80      0.86      0.83       500

    accuracy                           0.82      1000
   macro avg       0.82      0.82      0.82      1000
weighted avg       0.82      0.82      0.82      1000

Results with Permutation (1, 2, 0)


100%|██████████| 1000/1000 [14:18<00:00,  1.16it/s]


              precision    recall  f1-score   support

           0       0.80      0.86      0.83       500
           1       0.85      0.79      0.81       500

    accuracy                           0.82      1000
   macro avg       0.82      0.82      0.82      1000
weighted avg       0.82      0.82      0.82      1000

Results with Permutation (2, 0, 1)


100%|██████████| 1000/1000 [14:18<00:00,  1.16it/s]


              precision    recall  f1-score   support

           0       0.69      0.92      0.79       500
           1       0.88      0.58      0.70       500

    accuracy                           0.75      1000
   macro avg       0.79      0.75      0.75      1000
weighted avg       0.79      0.75      0.75      1000

Results with Permutation (2, 1, 0)


100%|██████████| 1000/1000 [14:19<00:00,  1.16it/s]

              precision    recall  f1-score   support

           0       0.58      0.98      0.73       500
           1       0.95      0.28      0.43       500

    accuracy                           0.63      1000
   macro avg       0.76      0.63      0.58      1000
weighted avg       0.76      0.63      0.58      1000






# 3. Calibration (50 Points)

In this section, you will calibrate the large language model using the methods that were reviewed in class.

For the prompt, use the zero-shot setting with the `positive` and `negative` labels.

### Calibrate before Use

In this part, you should use the method from the "Calibrate Before Use" paper, which was discussed in class, to obtain calibration coefficients for the positive and negative labels, then combine them with your model and report the metrics. You can read the paper at [this link](https://arxiv.org/abs/2102.09690).
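
The core arithmetic of this calibration can be sketched on plain numbers: divide each label probability by the probability the model assigns to that label for a content-free input, then renormalize with a softmax. The function name `contextual_calibration` and all numbers below are made up for illustration; they are not outputs of this notebook.

```python
import math
from typing import List


def contextual_calibration(probs: List[float], content_free_probs: List[float]) -> List[float]:
    """Rescale label probabilities by content-free priors, then apply a softmax."""
    scaled = [p / p_cf for p, p_cf in zip(probs, content_free_probs)]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Order of entries: [positive, negative]
content_free = [0.8, 0.2]  # hypothetical model strongly prefers "positive" on an empty input
raw = [0.7, 0.3]           # hypothetical uncalibrated prediction for some review
calibrated = contextual_calibration(raw, content_free)
print(calibrated)
```

In this toy case the raw prediction favors the positive label, but after dividing out the prior the negative label gets the higher score, which is the kind of correction the method aims for.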

In [None]:
pos_prob_calibration = 0
neg_prob_calibration = 0

## Your code begins ##
from torch.nn.functional import softmax
pos_token = "positive"
neg_token = "negative"

pos_token_id = tokenizer.encode(pos_token)[0]
neg_token_id = tokenizer.encode(neg_token)[0]

decoding_tokens = [pos_token_id, neg_token_id]

input_ids = tokenizer.encode("N/A", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
                    input_ids=input_ids,
                    max_new_tokens=1,
                    output_scores=True,
                    return_dict_in_generate=True,
                    prefix_allowed_tokens_fn=lambda batch_id, context: decoding_tokens  # we force the model to generate between these two tokens
                        )

    #print(torch.unique(outputs.scores[0]))
    probs = softmax(outputs.scores[0] , dim = -1)
    pos_prob_calibration = probs[0][pos_token_id]
    neg_prob_calibration = probs[0][neg_token_id]
## Your code ends ##

print(f'Positive prob: {pos_prob_calibration}')
print(f'Negative prob: {neg_prob_calibration}')

Positive prob: 0.8053299784660339
Negative prob: 0.19467003643512726


In [None]:
pos_token = 'positive'
neg_token = 'negative'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{text}
The sentiment of the above text is: '''

texts = [
    prompt_template.format(
        text=row['text'],  # use only the review text; formatting the whole row would put the label in the prompt
        pos_token=pos_token,
        neg_token=neg_token
    )
    for row in data
]
true_labels = [
    row['label']
    for row in data
]
## Your code begins ##
from typing import List, Tuple

def classify(texts: List[str], pos_token: str, neg_token: str) -> Tuple[List[int], List[torch.Tensor]]:
    predicted_labels = []
    predicted_confidences = []
    pos_token_id = tokenizer.encode(pos_token)[0]
    neg_token_id = tokenizer.encode(neg_token)[0]
    decoding_tokens = [pos_token_id, neg_token_id]
    for text in tqdm(texts):

        input_ids = torch.tensor([tokenizer(text)['input_ids']])# use tokenizer!
        outputs = model.generate(
                    input_ids=input_ids,
                    max_new_tokens=1,
                    output_scores=True,
                    return_dict_in_generate=True,
                    prefix_allowed_tokens_fn=lambda batch_id, context: decoding_tokens  # we force the model to generate between these two tokens
        )
        #print(outputs.scores)
        probs = softmax(outputs.scores[0] , dim = -1)
        pos_prob_before = probs[0][pos_token_id]
        neg_prob_before = probs[0][neg_token_id]

        # Divide by the content-free probabilities, then renormalize over the two labels
        probs_after = softmax(torch.tensor([[pos_prob_before/pos_prob_calibration , neg_prob_before/neg_prob_calibration]]), dim=-1)
        predicted_confidences.append(probs_after)
        pred = torch.argmax(probs_after)
        if pred == 0:
            predicted_labels.append(1)
        elif pred == 1:
            predicted_labels.append(0)
    return predicted_labels , predicted_confidences

predicted_labels_cc , predicted_confidences_cc = classify(texts, pos_token, neg_token)
## Your code ends ##
print(classification_report(y_true=true_labels, y_pred=predicted_labels_cc))

100%|██████████| 1000/1000 [05:04<00:00,  3.28it/s]

              precision    recall  f1-score   support

           0       0.88      0.43      0.58       500
           1       0.62      0.94      0.75       500

    accuracy                           0.69      1000
   macro avg       0.75      0.69      0.66      1000
weighted avg       0.75      0.69      0.66      1000






### Mitigating label biases for in-context learning

In this part, you should use the method from the "Mitigating label biases for in-context learning" paper, which was discussed in class, to obtain calibration coefficients for the positive and negative labels, then combine them with your model and report the metrics.

Use the `calibration_context` list for context and set `T = 1000`.
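
The overall procedure can be sketched without a model: build random in-domain texts, average the label probabilities over them to estimate a label prior, then divide predictions by that prior. Everything named below (`estimate_label_prior`, `debias`, `fake_score_fn`, the toy corpus) is hypothetical. The sketch also makes two simplifying assumptions: words are pooled across all documents, and the result is renormalized by a plain sum rather than a softmax.

```python
import random
from typing import Callable, List, Sequence


def estimate_label_prior(
    corpus: Sequence[str],
    score_fn: Callable[[str], List[float]],
    num_rounds: int,
    words_per_sample: int,
    rng: random.Random,
) -> List[float]:
    """Average label probabilities over texts built from random in-domain words."""
    if num_rounds <= 0:
        raise ValueError("num_rounds must be positive.")
    vocabulary = [word for doc in corpus for word in doc.split()]
    totals: List[float] = []
    for _ in range(num_rounds):
        random_text = " ".join(rng.choices(vocabulary, k=words_per_sample))
        probs = score_fn(random_text)
        if not totals:
            totals = [0.0] * len(probs)
        totals = [t + p for t, p in zip(totals, probs)]
    return [t / num_rounds for t in totals]


def debias(probs: List[float], prior: List[float]) -> List[float]:
    """Divide by the estimated prior and renormalize so the result sums to 1."""
    ratios = [p / q for p, q in zip(probs, prior)]
    total = sum(ratios)
    return [r / total for r in ratios]


def fake_score_fn(text: str) -> List[float]:
    # Stand-in for the language model: always leans negative, whatever the text.
    return [0.4, 0.6]  # [positive, negative]


toy_corpus = ["a fine movie with a dull plot", "great acting but weak story"]
prior = estimate_label_prior(
    toy_corpus, fake_score_fn, num_rounds=20, words_per_sample=8, rng=random.Random(0)
)
print(prior)
print(debias([0.5, 0.5], prior))
```

With the stand-in scorer, an undecided prediction of `[0.5, 0.5]` shifts toward the positive label after debiasing, because the estimated prior leans negative.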

In [None]:
T = 1000
pos_prob_calibration = 0
neg_prob_calibration = 0

## Your code begins ##
from torch.nn.functional import softmax
pos_token = "positive"
neg_token = "negative"

pos_token_id = tokenizer.encode(pos_token)[0]
neg_token_id = tokenizer.encode(neg_token)[0]
decoding_tokens = [pos_token_id, neg_token_id]

for t in tqdm(range(T)):
    selected_calibration_context = random.choice(calibration_context)
    #print(selected_calibration_context)
    random_words = selected_calibration_context.split(" ")
    eight_words = random.choices(random_words , k = 8)
    #print(eight_words)
    eight_context = " ".join(eight_words)

    input_ids = tokenizer.encode(eight_context, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
                        input_ids=input_ids,
                        max_new_tokens=1,
                        output_scores=True,
                        return_dict_in_generate=True,
                        prefix_allowed_tokens_fn=lambda batch_id, context: decoding_tokens  # we force the model to generate between these two tokens
                            )

    #print(torch.unique(outputs.scores[0]))
    probs = softmax(outputs.scores[0] , dim = -1)
    pos_prob_calibration += probs[0][pos_token_id]
    neg_prob_calibration += probs[0][neg_token_id]
    #break

## Your code ends ##
pos_prob_calibration/=T
neg_prob_calibration/=T
print()
print(f'Positive prob: {pos_prob_calibration}')
print(f'Negative prob: {neg_prob_calibration}')

100%|██████████| 1000/1000 [01:04<00:00, 15.49it/s]


Positive prob: 0.4270208775997162
Negative prob: 0.5729792714118958
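
One way to use these estimates is sketched below, assuming the correction divides each label probability by its estimated prior and then renormalizes. The function name `apply_prior_correction` and the raw probabilities `0.45` and `0.55` are made up for illustration; the priors are rounded from the output above.

```python
def apply_prior_correction(pos_prob, neg_prob, pos_prior, neg_prior):
    """Divide each label probability by its estimated prior, then renormalize to sum to 1."""
    pos_score = pos_prob / pos_prior
    neg_score = neg_prob / neg_prior
    total = pos_score + neg_score
    return pos_score / total, neg_score / total


# Priors rounded from the printed calibration estimates.
pos_prior, neg_prior = 0.4270, 0.5730

# Hypothetical raw model output that slightly favors "negative".
calibrated = apply_prior_correction(0.45, 0.55, pos_prior, neg_prior)
print(calibrated)
```

Because the estimated prior already favors "negative", a raw prediction that leans only slightly negative ends up leaning positive after the correction.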





In [None]:
pos_token = 'positive'
neg_token = 'negative'
prompt_template = '''
What is the sentiment of the following text? Choose between {pos_token} or {neg_token}.
{text}
The sentiment of the above text is: '''

texts = [
    prompt_template.format(
        text=row['text'],
        pos_token=pos_token,
        neg_token=neg_token
    )
    for row in data
]
true_labels = [
    row['label']
    for row in data
]

## Your code begins ##
from typing import List, Tuple

def classify(texts: List[str], pos_token: str, neg_token: str) -> Tuple[List[int], List[torch.Tensor]]:
    predicted_labels = []
    predicted_confidences = []
    pos_token_id = tokenizer.encode(pos_token)[0]
    neg_token_id = tokenizer.encode(neg_token)[0]
    decoding_tokens = [pos_token_id, neg_token_id]
    for text in tqdm(texts):
        input_ids = torch.tensor([tokenizer(text)['input_ids']])
        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=1,
            output_scores=True,
            return_dict_in_generate=True,
            # Restrict generation to the two label tokens.
            prefix_allowed_tokens_fn=lambda batch_id, context: decoding_tokens,
        )
        probs = softmax(outputs.scores[0], dim=-1)
        pos_prob_before = probs[0][pos_token_id]
        neg_prob_before = probs[0][neg_token_id]

        # Combine the raw probabilities with the calibration estimates, then renormalize.
        # The columns of probs_after are ordered [positive, negative].
        probs_after = softmax(
            torch.tensor([[pos_prob_before + pos_prob_calibration, neg_prob_before + neg_prob_calibration]]),
            dim=-1,
        )
        predicted_confidences.append(probs_after)
        # Index 0 (positive) maps to label 1; index 1 (negative) maps to label 0.
        pred = torch.argmax(probs_after)
        predicted_labels.append(1 if pred == 0 else 0)
    return predicted_labels, predicted_confidences

predicted_labels_mit , predicted_confidences_mit = classify(texts , pos_token, neg_token)
## Your code ends ##
print(classification_report(y_true=true_labels, y_pred=predicted_labels_mit))

100%|██████████| 1000/1000 [05:19<00:00,  3.13it/s]


              precision    recall  f1-score   support

           0       0.95      0.22      0.36       500
           1       0.56      0.99      0.71       500

    accuracy                           0.61      1000
   macro avg       0.75      0.61      0.54      1000
weighted avg       0.75      0.61      0.54      1000



## ECE (20 Points)

ECE stands for Expected Calibration Error. It is a metric for evaluating how well calibrated a model's probabilistic predictions are.

ECE measures the average gap between predicted confidence (probability) and observed accuracy across confidence levels. It is computed by dividing the confidence range [0, 1] into bins and taking the weighted average, over bins, of the absolute difference between the mean confidence and the accuracy within each bin. It quantifies how well a model's predicted probabilities align with the actual outcomes: lower ECE indicates better calibration, and higher ECE indicates greater miscalibration.

To calculate the ECE, follow these steps:

1- Divide the predictions into different confidence bins.

2- Calculate the average confidence and accuracy for each bin. Confidence can be defined as the mean predicted probability within each bin, and accuracy can be calculated as the proportion of correct predictions within each bin.

3- Compute the absolute difference between the average confidence and the accuracy for each bin.

4- Weight each bin's difference by the fraction of examples in that bin, then sum the weighted differences across all bins to get the ECE.

Here is the general formula for ECE with $M$ bins, where $N_i$ is the number of examples in bin $i$ and $N$ is the total number of examples:
$$
\text{ECE} = \sum_{i=1}^{M} \frac{N_i}{N} \left| \text{Accuracy}_i - \text{Confidence}_i \right|
$$
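
As a small worked example of the formula, the sketch below computes ECE by hand for four toy predictions. The function `toy_ece` and its inputs are illustrative only; they are not data from this notebook.

```python
def toy_ece(confidences, correct, num_bins=4):
    """ECE over equal-width bins (lower, upper] of the confidence range [0, 1]."""
    n = len(confidences)
    ece = 0.0
    for b in range(num_bins):
        lower, upper = b / num_bins, (b + 1) / num_bins
        idx = [i for i, c in enumerate(confidences) if lower < c <= upper]
        if not idx:
            continue  # empty bins contribute nothing
        bin_acc = sum(correct[i] for i in idx) / len(idx)
        bin_conf = sum(confidences[i] for i in idx) / len(idx)
        # Weight the gap by the fraction of examples that fall in this bin.
        ece += abs(bin_acc - bin_conf) * len(idx) / n
    return ece


confidences = [0.9, 0.8, 0.6, 0.55]  # confidence of the predicted label
correct = [1, 1, 0, 1]               # 1 if the prediction was right, else 0
toy_result = toy_ece(confidences, correct)
print(toy_result)
```

Here the bin (0.5, 0.75] holds two predictions with accuracy 0.5 and mean confidence 0.575, and the bin (0.75, 1] holds two predictions with accuracy 1.0 and mean confidence 0.85, so ECE = 0.5 × 0.075 + 0.5 × 0.15 = 0.1125.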
You should implement this metric in the following cell.

In [None]:
import numpy as np

def ECE(output, true_labels, bins=4):
    """Expected Calibration Error for an (N, num_classes) array of probabilities.

    Column j of `output` is treated as the probability of label j.
    """
    bin_boundaries = np.linspace(0, 1, bins + 1)
    bin_lowers = bin_boundaries[:-1]
    bin_uppers = bin_boundaries[1:]

    # Confidence is the highest class probability; the prediction is its column index.
    confidences = np.max(output, axis=1)
    predicted_labels = np.argmax(output, axis=1)

    # Per-example correctness (boolean array).
    accuracies = predicted_labels == np.asarray(true_labels)

    bin_accuracies, bin_confidences, bin_probabilities = [], [], []

    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        # Examples whose confidence falls in (bin_lower, bin_upper].
        in_bin = np.logical_and(confidences > bin_lower, confidences <= bin_upper)
        prob_in_bin = np.mean(in_bin)

        # Skip empty bins.
        if prob_in_bin > 0:
            bin_accuracies.append(np.mean(accuracies[in_bin]))
            bin_confidences.append(np.mean(confidences[in_bin]))
            bin_probabilities.append(prob_in_bin)

    bin_accuracies, bin_confidences, bin_probabilities = map(np.array, [bin_accuracies, bin_confidences, bin_probabilities])

    # Weighted sum of the per-bin |confidence - accuracy| gaps.
    ece = np.sum(bin_probabilities * np.abs(bin_confidences - bin_accuracies))

    return ece



In the following cell, calculate the ECE for the two calibration methods you implemented.
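
The `ECE` function above takes the argmax over columns as the predicted label, so column `j` of its input should hold the probability of label `j`. The sketch below shows one way to reorder rows stored as `[P(positive), P(negative)]` when positive is label 1 and negative is label 0. The helper `to_label_order` and the toy rows are hypothetical.

```python
def to_label_order(rows, column_labels):
    """Reorder each probability row so that position j holds P(label == j)."""
    reordered = []
    for row in rows:
        by_label = dict(zip(column_labels, row))
        reordered.append([by_label[label] for label in sorted(by_label)])
    return reordered


# Rows stored as [P(positive), P(negative)]; label ids: positive = 1, negative = 0.
rows = [[0.7, 0.3], [0.2, 0.8]]
aligned = to_label_order(rows, column_labels=[1, 0])
print(aligned)
```

After reordering, the argmax of each row is directly comparable with the entries of `true_labels`.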

In [None]:
## Your code begins ##
# Stack the per-example confidence tensors into (N, 2) arrays.
certainty1 = np.array([x[0].cpu().numpy() for x in predicted_confidences_cc])
gt1 = true_labels
ece = ECE(certainty1, gt1)
print(f'ECE for Calibrate before Use: {ece}')

# Columns of predicted_confidences_mit are ordered [positive, negative], as built in classify().
certainty2 = np.array([x[0].cpu().numpy() for x in predicted_confidences_mit])
gt2 = true_labels
ece = ECE(certainty2, gt2)
print(f'ECE for Mitigating label biases for in-context learning: {ece}')

## Your code ends ##

ECE for Calibrate before Use: [0.3048874]
ECE for Mitigating label biases for in-context learning: [0.19842499]
