**Task 2:**

**Problem Statement:**
Develop a prefix language model using Hugging Face and PyTorch. You can pick any dataset for a creative text generation task and you should report the perplexity metric. Hint: A subtle data preprocessing trick is required when setting the inputs and labels for implementing prefix LM.

**Model Selected:** t5-large

**Dataset Selected:** CNN/DailyMail (`cnn_dailymail`, version 3.0.0)

Installation of Required Packages

In [None]:
!pip install transformers==4.36.2
!pip install accelerate==0.25.0
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3
!pip install trl==0.7.7
!pip install tqdm==4.66.1
!pip install flash-attn==2.4.2

Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
import torch
# Only the tokenizer is needed from this cell; the seq2seq model and trainer
# classes are imported in later cells.
from transformers import AutoTokenizer

Authentication and Configuration
> Hugging Face Hub and Weights & Biases (wandb) integration

In [None]:
user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN_2"))

os.environ["WANDB_API_KEY"] = user_secrets.get_secret("WANDB_API_KEY_2")
os.environ["WANDB_PROJECT"] = "Prefix language modelling"
# The notebook trains with PEFT prefix tuning (PrefixTuningConfig), not LoRA.
os.environ["WANDB_NOTES"] = "Prefix language modelling using prefix tuning"
os.environ["WANDB_NAME"] = "Prefix tuning"
os.environ["MODEL_NAME"] = "t5-large"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Accelerate Memory Estimation
> This command estimates the memory requirements for the specified model using the Accelerate library.

In [None]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `t5-large` from `transformers`...
config.json: 100%|█████████████████████████| 1.21k/1.21k [00:00<00:00, 6.46MB/s]
┌────────────────────────────────────────────────────┐
│        Memory Usage for loading `t5-large`         │
├───────┬─────────────┬──────────┬───────────────────┤
│ dtype │Largest Layer│Total Size│Training using Adam│
├───────┼─────────────┼──────────┼───────────────────┤
│float32│   125.5 MB  │ 2.75 GB  │      10.99 GB     │
│float16│   62.75 MB  │ 1.37 GB  │       5.5 GB      │
│  int8 │   31.38 MB  │ 703.5 MB │      2.75 GB      │
│  int4 │   15.69 MB  │351.75 MB │      1.37 GB      │
└───────┴─────────────┴──────────┴───────────────────┘
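
The figures in the table can be approximated by hand: total size is the parameter count times the bytes per parameter, and the Adam column is four times the total size. The sketch below is a rough check, not how `accelerate` computes its estimate internally. It assumes the base parameter count is the "all params" figure minus the trainable prefix parameters, both printed by `print_trainable_parameters()` later in this notebook, and it assumes the table's "GB" are units of 2**30 bytes.

```python
# Rough check of the memory table: bytes per parameter times parameter count.
# ASSUMPTION: base parameter count = 738,651,136 (all params) - 983,040 (prefix),
# both taken from the print_trainable_parameters() output in this notebook.
NUM_PARAMS = 738_651_136 - 983_040

BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}
ADAM_FACTOR = 4  # rule of thumb: weights + gradients + two Adam moment buffers


def estimate_gib(num_params, bytes_per_param):
    """Return the size in GiB (2**30 bytes) of num_params stored values."""
    return num_params * bytes_per_param / 2**30


for dtype, nbytes in BYTES_PER_PARAM.items():
    total = estimate_gib(NUM_PARAMS, nbytes)
    print(f"{dtype:8s} total={total:5.2f} GiB  adam~{ADAM_FACTOR * total:6.2f} GiB")
```

The float32 row comes out at roughly 2.75 GiB for loading and 10.99 GiB for training, in line with the table.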


Model Quantization Configuration
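> Configures 4-bit quantization with bitsandbytes: the quantization type (NF4), double quantization, and the compute dtype. Note that the cell that loads the model later calls `from_pretrained` without `quantization_config`, `device_map`, or `torch_dtype`, so these settings are not applied in this run.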

In [None]:
from transformers import BitsAndBytesConfig
from accelerate import Accelerator
import torch

load_in_4bit = True

if load_in_4bit:
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=load_in_4bit,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,  # compute in fp16; weights stay 4-bit
    )
    # let Accelerate place the model's layers across the available devices
    device_map = "auto"
    torch_dtype = torch.float16  # dtype for the non-quantized modules
else:
    device_map = None
    quantization_config = None
    torch_dtype = None

Loading Dataset

In [None]:
from datasets import load_dataset
dataset = load_dataset('cnn_dailymail','3.0.0',split='train')

Generating train split: 287113 examples
Generating validation split: 13368 examples
Generating test split: 11490 examples

In [None]:
dataset

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 287113
})

In [None]:
dataset = dataset.shuffle(seed=42).select(range(85000))

In [None]:
dataset = dataset.train_test_split(test_size=0.1,seed=42)

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 76500
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 8500
    })
})

In [None]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

device = "cuda"
model_name_or_path = "t5-large"
tokenizer_name_or_path = "t5-large"

text_column = "article"
label_column = "highlights"
max_length = 256
lr = 1e-5
num_epochs = 1
batch_size = 8

Tokenization and Preprocessing

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)


def preprocess_function(examples):
    inputs = examples[text_column]    # articles are fed to the encoder
    targets = examples[label_column]  # highlights are the decoder targets
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    labels = tokenizer(targets, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    labels = labels["input_ids"]
    # Replace padding ids with -100 so the loss ignores padded positions.
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
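
The preprocessing above builds encoder inputs and decoder labels for an encoder-decoder model. For a decoder-only model, the preprocessing trick hinted at in the problem statement is commonly implemented by concatenating prefix and target into one sequence and masking the prefix positions in the labels, so the loss is computed only on the continuation. The following is a minimal sketch of that idea on plain token-id lists. The function name `build_prefix_lm_example`, the default `pad_id=0`, and the toy ids are illustrative assumptions, not part of this notebook's pipeline.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def build_prefix_lm_example(prefix_ids, target_ids, max_length, pad_id=0):
    """Concatenate prefix and target; mask prefix and padding in the labels.

    Hypothetical helper for illustration. Truncation is applied to the
    concatenated sequence, so a long prefix can cut off the target.
    """
    input_ids = (prefix_ids + target_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prefix_ids) + target_ids)[:max_length]
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
    }


example = build_prefix_lm_example([11, 12, 13], [21, 22], max_length=8)
print(example["input_ids"])  # [11, 12, 13, 21, 22, 0, 0, 0]
print(example["labels"])     # [-100, -100, -100, 21, 22, -100, -100, -100]
```

Labels are aligned with `input_ids` here because Hugging Face causal LM heads shift the labels by one position internally when computing the loss.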


In [None]:
processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

In [None]:
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import default_data_collator

In [None]:
train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["test"]

train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True
)
eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)

Prefix Language Model Configuration
> Because this is a text summarization task, the model is loaded with **AutoModelForSeq2SeqLM**.

In [None]:
from peft import get_peft_config, get_peft_model, get_peft_model_state_dict, PrefixTuningConfig, TaskType
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, default_data_collator, get_linear_schedule_with_warmup

In [None]:
peft_config = PrefixTuningConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, num_virtual_tokens=20)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 983,040 || all params: 738,651,136 || trainable%: 0.13308583065659835
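
The trainable parameter count can be reproduced with simple arithmetic. The sketch below shows one decomposition that matches the printed number. It assumes that `t5-large` has 24 layers per stack and a hidden size of 1024, and that the prefix stores one key vector and one value vector per layer for each virtual token. The variable names are illustrative.

```python
# ASSUMPTIONS: t5-large has 24 layers per stack and hidden size 1024, and the
# prefix holds one key and one value vector per layer per virtual token.
num_virtual_tokens = 20  # from PrefixTuningConfig above
num_layers = 24
hidden_size = 1024
kv = 2  # key + value

prefix_params = num_virtual_tokens * num_layers * kv * hidden_size
all_params = 738_651_136  # "all params" from the output above

print(prefix_params)                     # 983040
print(100 * prefix_params / all_params)  # about 0.133 (percent)
```

Only these prefix parameters receive gradients; the base model's weights stay frozen.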


In [None]:
from tqdm import tqdm
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [None]:
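# Note: this optimizer and scheduler are not passed to the Seq2SeqTrainer below,
# which builds its own from `learning_rate` unless given
# `optimizers=(optimizer, lr_scheduler)`.
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=(len(train_dataloader) * num_epochs),
)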
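
For reference, a linear schedule with warmup scales the base learning rate by a factor that rises linearly from 0 to 1 during warmup and then decays linearly to 0 at the last step. The sketch below re-implements that factor in plain Python to show the shape of the schedule. The function name `linear_lr_factor` is illustrative, and the step count assumes the 76,500-row training split and batch size of 8 used in this notebook.

```python
import math


def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# 76,500 training rows at batch size 8 give ceil(76500 / 8) batches per epoch.
steps = math.ceil(76_500 / 8) * 1  # one epoch
base_lr = 1e-5
for step in (0, steps // 2, steps):
    print(step, base_lr * linear_lr_factor(step, 0, steps))
```

With zero warmup steps, as configured above, the learning rate starts at its full value and falls linearly to zero by the final step.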

Trainer Initialization and Training

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_steps=len(train_dataloader),
    save_total_limit=5,
    num_train_epochs=1,
    learning_rate=lr,
    logging_dir="./logs",
    logging_steps=len(train_dataloader),
    evaluation_strategy="steps",
    eval_steps=len(train_dataloader),
    load_best_model_at_end=True,
    remove_unused_columns=False,
    push_to_hub=False,
)

# Create the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=default_data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Train the model
trainer.train()

# Evaluate the model
results = trainer.evaluate()

# Print the results
print(results)

[34m[1mwandb[0m: Currently logged in as: [33maravindsriraj[0m ([33maravindan[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.16.2
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240120_175633-cqc8xj40[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mPrefix tuning[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/aravindan/Prefix%20language%20modelling[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/aravindan/Prefix%20language%20modelling/runs/cqc8xj40[0m


{'eval_loss': 2.928058624267578, 'eval_runtime': 835.7248, 'eval_samples_per_second': 10.171, 'eval_steps_per_second': 0.637, 'epoch': 1.0}


Perplexity Calculation

In [None]:
import numpy as np
def perplexity(eval_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return np.exp(eval_loss)

In [None]:
perplexity(results['eval_loss'])

18.691308398129227
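
Perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below illustrates the definition on toy values that are not taken from the model, then applies `exp` to the `eval_loss` reported above. The function name `perplexity_from_token_nlls` is illustrative. Because the Trainer's `eval_loss` is an average over evaluation batches, the result should be read as an approximation of the token-level perplexity.

```python
import math


def perplexity_from_token_nlls(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Toy values, not taken from the model: four tokens, each with probability 1/4.
toy_nlls = [-math.log(0.25)] * 4
print(perplexity_from_token_nlls(toy_nlls))  # about 4: as uncertain as a uniform choice among 4 tokens

# The reported eval_loss is already a mean, so exp() of it gives the perplexity.
print(math.exp(2.928058624267578))  # about 18.69
```

A perplexity of about 18.7 means the model is, on average, roughly as uncertain as if it were choosing uniformly among 19 tokens at each position of the reference summaries.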

Model Upload to Hugging Face Model Hub

In [None]:
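peft_model_id = "t5-large_PREFIX_TUNING_SEQ2SEQ"
# The first argument of Trainer.push_to_hub is the commit message, not the repo name.
# The repo name defaults to the basename of output_dir ("output"), as the CommitInfo below shows.
trainer.push_to_hub(commit_message=peft_model_id)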

CommitInfo(commit_url='https://huggingface.co/tr-aravindan/output/commit/518f484be84a25b15c0cfa672f29093906661655', commit_message='t5-large_PREFIX_TUNING_SEQ2SEQ', commit_description='', oid='518f484be84a25b15c0cfa672f29093906661655', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
tokenizer.push_to_hub('t5-large_PREFIX_TUNING_SEQ')

CommitInfo(commit_url='https://huggingface.co/tr-aravindan/t5-large_PREFIX_TUNING_SEQ/commit/69f4ac8b1c070e022d5e1737188db2376c1000f5', commit_message='Upload tokenizer', commit_description='', oid='69f4ac8b1c070e022d5e1737188db2376c1000f5', pr_url=None, pr_revision=None, pr_num=None)

Loading PEFT Model for Text Generation

In [None]:
from peft import PeftConfig, PeftModel
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

peft_model_name = "tr-aravindan/output"

peft_config = PeftConfig.from_pretrained(peft_model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(peft_config.base_model_name_or_path)

peft_model = PeftModel.from_pretrained(base_model, peft_model_name)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path)

In [None]:
text = """
SAN FRANCISCO, California (CNN) -- A magnitude 4.2 earthquake shook the San Francisco area Friday at 4:42 a.m. PT (7:42 a.m. ET), the U.S. Geological Survey reported. The quake left about 2,000 customers without power, said David Eisenhower, a spokesman for Pacific Gas and Light. Under the USGS classification, a magnitude 4.2 earthquake is considered "light," which it says usually causes minimal damage. "We had quite a spike in calls, mostly calls of inquiry, none of any injury, none of any damage that was reported," said Capt. Al Casciato of the San Francisco police. "It was fairly mild." Watch police describe concerned calls immediately after the quake » . The quake was centered about two miles east-northeast of Oakland, at a depth of 3.6 miles, the USGS said. Oakland is just east of San Francisco, across San Francisco Bay. An Oakland police dispatcher told CNN the quake set off alarms at people's homes. The shaking lasted about 50 seconds, said CNN meteorologist Chad Myers. According to the USGS, magnitude 4.2 quakes are felt indoors and may break dishes and windows and overturn unstable objects. Pendulum clocks may stop. E-mail to a friend .
"""

In [None]:
inputs = tokenizer(text,return_tensors='pt')

In [None]:
device = "cuda"

In [None]:
peft_model.to(device)

with torch.no_grad():
    inputs = {k: v.to(device) for k, v in inputs.items()}
    outputs = peft_model.generate(input_ids=inputs["input_ids"], max_new_tokens=30)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

['. The quake was centered in the San Francisco Bay Area, the USGS says. about 2,000 customers without power, Pacific']
