In [3]:
import torch
import numpy as np
import random
from dotenv import load_dotenv
import os
from huggingface_hub import login

load_dotenv()

True


In [4]:
HF_TOKEN = os.getenv("HF_TOKEN")
login(HF_TOKEN)

In [5]:
SEED = 42
def set_seed(seed):
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
set_seed(SEED)
print(f"Seed set to {SEED}")

Seed set to 42


In [6]:
def get_device():
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device('cpu')
    return device
device = get_device()
print("Using device:", device)

Using device: cuda


# Working with Datasets

### 1. Load the tokenizer and the 4 Bit quantized model.

In [7]:
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

In [8]:
model_id = "microsoft/Phi-3.5-mini-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map="auto",
    trust_remote_code=True
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

### 2. Load the test split of the ```abisee/cnn``` dailymail dataset (Version 3.0.0) using the <i>datasets</i> library. Print the dataset structure and the first sample.

In [9]:
from datasets import load_dataset

In [10]:
dataset = load_dataset("abisee/cnn_dailymail", "3.0.0", split="test")

In [11]:
print("Dataset Structure:")
print(dataset)

Dataset Structure:
Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 11490
})


In [12]:
print("First Sample:")
print("Article: ", dataset[0]['article'])
print("Highlights: ", dataset[0]['highlights'])
print("ID: ", dataset[0]['id'])

First Sample:
Article:  (CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC's founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians' efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaking at 

### 3. Shuffle the dataset with a fixed seed, select only the first 50 samples, and remove the ```id``` column. Print the dataset structure and the first sample.

In [13]:
dataset = dataset.shuffle(seed=SEED)
dataset = dataset.select(range(50))
dataset = dataset.remove_columns('id')

In [14]:
print("Dataset Structure:")
print(dataset)

Dataset Structure:
Dataset({
    features: ['article', 'highlights'],
    num_rows: 50
})


In [15]:
print("First Sample:")
print("Article: ", dataset[0]['article'])
print("Highlights: ", dataset[0]['highlights'])

First Sample:
Article:  (CNN) I see signs of a revolution everywhere. I see it in the op-ed pages of the newspapers, and on the state ballots in nearly half the country. I see it in politicians who once preferred to play it safe with this explosive issue but are now willing to stake their political futures on it. I see the revolution in the eyes of sterling scientists, previously reluctant to dip a toe into this heavily stigmatized world, who are diving in head first. I see it in the new surgeon general who cites data showing just how helpful it can be. I see a revolution in the attitudes of everyday Americans. For the first time a majority, 53%, favor its legalization, with 77% supporting it for medical purposes. Support for legalization has risen 11 points in the past few years alone. In 1969, the first time Pew asked the question about legalization, only 12% of the nation was in favor. I see a revolution that is burning white hot among young people, but also shows up among the paren

### 4. Incorporate the system prompt <i>You are an AI summarizer. Write a short summary for the following CNN article:</i> into each sample using the ```apply_chat_template()``` method of the tokenizer. Print the first sample.

In [16]:
system_prompt = "You are an AI summarizer. Write a short summary for the following CNN article:"

def add_prompt(example):
    chat = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Below is an article from CNN. Summarize the main ideas:\n{example['article']}"}
    ]
    example["prompt"] = tokenizer.apply_chat_template(chat, tokenize=False,add_generation_prompt=True)
    return example

dataset = dataset.map(add_prompt)

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In [17]:
print("First sample:")
print(dataset[0]["prompt"])

First sample:
<|system|>
You are an AI summarizer. Write a short summary for the following CNN article:<|end|>
<|user|>
Below is an article from CNN. Summarize the main ideas:
(CNN) I see signs of a revolution everywhere. I see it in the op-ed pages of the newspapers, and on the state ballots in nearly half the country. I see it in politicians who once preferred to play it safe with this explosive issue but are now willing to stake their political futures on it. I see the revolution in the eyes of sterling scientists, previously reluctant to dip a toe into this heavily stigmatized world, who are diving in head first. I see it in the new surgeon general who cites data showing just how helpful it can be. I see a revolution in the attitudes of everyday Americans. For the first time a majority, 53%, favor its legalization, with 77% supporting it for medical purposes. Support for legalization has risen 11 points in the past few years alone. In 1969, the first time Pew asked the question abo

### 5. Generate a summary for each sample in the dataset. Use batched generation with a batch size of 4. Use nucleus sampling. Only save the generated summaries. Print the summary for the first sample.

In [18]:
batch_size = 4
all_summaries = []

model.eval()
device = model.device

for i in range(0, len(dataset), batch_size):
    batch = dataset[i : i + batch_size]["prompt"]
    inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, add_special_tokens=False).to(device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.9,
        temperature=0.1
    )
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    summaries = [d for d in decoded]
    all_summaries.extend(summaries)

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48


In [22]:
cleaned_summaries = []
for i in range(len(all_summaries)):
    cleaned_summaries.append(all_summaries[i][len(dataset["prompt"][i]):])

In [23]:
print("Summary of first sample:")
print(cleaned_summaries[0])

Summary of first sample:
> The article from CNN highlights a nationwide revolution surrounding the legalization of medical marijuana. Op-ed pieces, state ballots, and politicians are increasingly willing to engage with this issue, which was once stigmatized. Scientists, the new surgeon general, and parents are all part of this shift, with a majority of Americans now in favor of legalization for medical purposes.

The revolution is not limited to adults; it also involves young people


### 6. Calculate the ROUGE score of the generated summaries against the reference summaries of the dataset. Use the ```evaluate``` library. Print the average ROUGE scores.

In [25]:
import evaluate

rouge = evaluate.load("rouge")

references = dataset["highlights"]
results = rouge.compute(predictions=cleaned_summaries, references=references)

print("\nAverage ROUGE scores:")
print(results)


Average ROUGE scores:
{'rouge1': 0.3663098826204447, 'rouge2': 0.12697411004003278, 'rougeL': 0.23415596671510458, 'rougeLsum': 0.3046264028409189}
