**Task 1:**

**Problem Statement:**
Develop a Google Colab notebook to fine-tune LoRA adapters for a text generation task, using either a 3B model or a smaller model that fits in the available GPU RAM. Use Hugging Face and PyTorch for the implementation, and WandB for logging. Provide the notebook link and the WandB project link, and include a screenshot of the convergence graph. Any dataset suited to a creative text generation task may be used; report the perplexity metric.

**Model Selected:** bigscience/bloomz-560m

**Dataset Selected:** Amazon Polarity

Installing Required Libraries

In [None]:
!pip install transformers==4.36.2
!pip install accelerate==0.25.0
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3
!pip install trl==0.7.7
!pip install tqdm==4.66.1
!pip install flash-attn==2.4.2


Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
from datasets import load_dataset

Setting Up Secrets and Environment Variables

> Hugging Face and WandB Integration

In [None]:
user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Text generation using LORA"
os.environ["WANDB_NOTES"] = "Fine tuning text generation using LLM"
os.environ["WANDB_NAME"] = "Model-text-generation"
os.environ["MODEL_NAME"] = "bigscience/bloomz-560m"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Model Memory Estimation

- Estimates the memory requirements for the specified model using the accelerate library.

- Of the dtypes listed, int4 gives the lowest memory footprint for the model.

In [None]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `bigscience/bloomz-560m` from `transformers`...
config.json: 100%|█████████████████████████████| 715/715 [00:00<00:00, 4.92MB/s]
┌────────────────────────────────────────────────────┐
│ Memory Usage for loading `bigscience/bloomz-560m`  │
├───────┬─────────────┬──────────┬───────────────────┤
│ dtype │Largest Layer│Total Size│Training using Adam│
├───────┼─────────────┼──────────┼───────────────────┤
│float32│   980.0 MB  │ 2.08 GB  │      8.33 GB      │
│float16│   490.0 MB  │ 1.04 GB  │      4.17 GB      │
│  int8 │   245.0 MB  │533.31 MB │      2.08 GB      │
│  int4 │   122.5 MB  │266.65 MB │      1.04 GB      │
└───────┴─────────────┴──────────┴───────────────────┘
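The figures in the table appear consistent with a simple rule of thumb: weight memory is the parameter count times the bytes per parameter, and training with Adam needs roughly four times that (weights, gradients, and two optimizer moments). The sketch below reproduces the "Total Size" and "Training using Adam" columns under that assumption. The parameter count is inferred from the PEFT report later in this notebook (565,506,048 total minus 6,291,456 adapter parameters); the helper names and the 4x factor are illustrative assumptions, not accelerate's actual implementation.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}

# Base model size, inferred from the PEFT report: total minus LoRA adapter params.
N_PARAMS = 565_506_048 - 6_291_456  # 559,214,592


def estimate_bytes(n_params, dtype):
    """Return (weight_bytes, adam_training_bytes) under the 4x rule of thumb."""
    weights = n_params * BYTES_PER_PARAM[dtype]
    # Assumption: weights + gradients + two Adam moments, all in the same dtype.
    return weights, 4 * weights


def human(n_bytes):
    """Format a byte count using 1024-based units."""
    if n_bytes >= 2**30:
        return f"{n_bytes / 2**30:.2f} GB"
    return f"{n_bytes / 2**20:.2f} MB"


for dtype in BYTES_PER_PARAM:
    weights, training = estimate_bytes(N_PARAMS, dtype)
    print(f"{dtype:>7}: total {human(weights):>10} | Adam training {human(training):>10}")
```

This is only a back-of-the-envelope check: it ignores activations, batch size, and sequence length, which often dominate memory use during actual training.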


Loading and Preprocessing Dataset

> Loads a dataset, removes unnecessary columns, shuffles, and splits it into training and evaluation sets.

In [None]:
dataset = load_dataset("amazon_polarity",split='train')

Downloading readme:   0%|          | 0.00/6.81k [00:00<?, ?B/s]



Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/260M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/258M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/254M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/117M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/3600000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/400000 [00:00<?, ? examples/s]

In [None]:
dataset

Dataset({
    features: ['label', 'title', 'content'],
    num_rows: 3600000
})

In [None]:
dataset = dataset.remove_columns(['label', 'title'])

In [None]:
dataset

Dataset({
    features: ['content'],
    num_rows: 3600000
})

Selecting a Subset of Rows to Reduce Training Time

In [None]:
dataset = dataset.shuffle(seed=42).select(range(70000))

In [None]:
dataset = dataset.train_test_split(test_size=0.1,seed=42)

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['content'],
        num_rows: 63000
    })
    test: Dataset({
        features: ['content'],
        num_rows: 7000
    })
})

In [None]:
train_dataset = dataset['train']
eval_dataset = dataset['test']

In [None]:
train_dataset[1]

{'content': "I could never seem to get into this book; why? It is a tale of 'Bilbo Baggins' and his journey with Gandalf and the dwarves to find treasure, and defeat the 'evil dragon', Smaug. It seems pointless; adventure after adventure, many characters, and no main theme. It was a childish and boring book (as well a series)."}

In [None]:
eval_dataset[1]

{'content': 'I REALL LIKE THIS SONG....VERY CATCHY IN WORDS AS WELL AS MUSIC.MY FAVORITE VERSIONS ARE:THE FULL INTENTION MIXES!GRAB A COPY...WELL WORTH IT!'}

Tokenizer Initialization

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(os.getenv("MODEL_NAME"), use_fast=True, padding_side='right')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

tokenizer_config.json:   0%|          | 0.00/222 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

1

Quantization Configuration

- Training the model at full precision is slow and memory-intensive.
- Storing the model weights at a lower bit width (here 4-bit NF4) instead of 16- or 32-bit floating point shrinks the memory footprint.
- The smaller footprint lets the model fit on a smaller GPU and can make training faster.
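To make the idea of storing weights at a lower bit width concrete, here is a minimal sketch of symmetric absmax quantization in plain Python. It is a deliberately simplified stand-in, not the NF4 scheme that bitsandbytes applies in the next cell, and the function names `quantize_absmax` and `dequantize` are hypothetical.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integer codes in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed codes
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.35, 0.10, 0.00, -0.62]
codes, scale = quantize_absmax(weights, bits=4)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight now needs only 4 bits plus a share of the scale factor, at the cost of a rounding error of at most half a quantization step.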

In [None]:
from transformers import BitsAndBytesConfig
from accelerate import Accelerator
import torch

load_in_4bit = True

if load_in_4bit:
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=load_in_4bit,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16
    )
    # let accelerate place the model's layers across the available devices automatically
    device_map = "auto"
    torch_dtype = torch.float16
else:
    device_map = None
    quantization_config = None
    torch_dtype = None


Model Initialization

- Since our task is text generation, we select **AutoModelForCausalLM** (causal language modeling).

- LoRA decreases memory needs by lowering the number of parameters to update, aiding in the management of large-scale models.

In [None]:
from transformers import AutoModelForCausalLM

def print_trainable_parameters(model):
    trainable_params=0
    all_params=0
    for _, param in model.named_parameters():
        all_params+=param.numel()
        if param.requires_grad:
            trainable_params+=param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_params} || trainable%: {100 * trainable_params/all_params:.2f}")

model=AutoModelForCausalLM.from_pretrained(
    os.getenv("MODEL_NAME"),
    quantization_config=quantization_config,
    device_map=device_map,
    trust_remote_code=False,
    torch_dtype=torch_dtype,
    # Disabled: RuntimeError: FlashAttention only supports Ampere GPUs or newer.
    # attn_implementation="flash_attention_2"
)

print_trainable_parameters(model)

model.safetensors:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

trainable params: 257003520 || all params: 408219648 || trainable%: 62.96


In [None]:
model.get_memory_footprint()

665444352

In [None]:
from peft import LoraConfig, get_peft_model

use_peft=True

peft_config=LoraConfig(
    r=64,
    lora_alpha=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"]
)

peft_model=get_peft_model(model,peft_config)
peft_model.print_trainable_parameters()

trainable params: 6,291,456 || all params: 565,506,048 || trainable%: 1.112535581582321
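The trainable-parameter count reported above can be checked by hand. For each targeted linear layer, LoRA adds two low-rank matrices, A (r x in_features) and B (out_features x r). The sketch below assumes the bloomz-560m configuration has a hidden size of 1024 and 24 transformer layers, and that `query_key_value` is a fused projection from 1024 to 3 x 1024 features; these values are assumptions taken from the model's published config rather than from this notebook, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, r, n_layers):
    """LoRA adds A (r x in) and B (out x r) to each targeted linear layer."""
    per_layer = r * in_features + out_features * r
    return n_layers * per_layer


hidden_size = 1024  # assumed from the bloomz-560m config
n_layers = 24       # assumed from the bloomz-560m config
r = 64              # LoRA rank used in the cell above

trainable = lora_param_count(
    in_features=hidden_size,
    out_features=3 * hidden_size,  # fused query/key/value projection
    r=r,
    n_layers=n_layers,
)
total = 565_506_048  # "all params" reported by print_trainable_parameters()

print(f"trainable params: {trainable:,} || trainable%: {100 * trainable / total:.4f}")
```

Under these assumptions the result agrees with the PEFT report: 24 x 64 x (1024 + 3072) = 6,291,456 adapter parameters, about 1.11% of the total.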


In [None]:
peft_model.get_memory_footprint()

690610176

Model Training Configuration and SFTTrainer Initialization
> Configures training parameters, output directory, and other settings.

In [None]:
from transformers import TrainingArguments
from trl import SFTTrainer

training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    learning_rate=1.41e-5,
    num_train_epochs=5,
    max_steps=-1,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
    save_steps=100,
    logging_steps=50,
    save_total_limit=1,
    push_to_hub=False,
    gradient_checkpointing=False,
    evaluation_strategy="epoch",
    lr_scheduler_type = "cosine",
    fp16=True
)

sft_trainer=SFTTrainer(
    model=peft_model,
    args=training_args,
    max_seq_length=256,
    train_dataset=train_dataset,
    eval_dataset = eval_dataset,
    dataset_text_field="content",
    tokenizer=tokenizer
)

sft_trainer.train()

Map:   0%|          | 0/63000 [00:00<?, ? examples/s]

Map:   0%|          | 0/7000 [00:00<?, ? examples/s]

wandb: Currently logged in as: aravindsriraj (aravindan). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.2
wandb: Run data is saved locally in /kaggle/working/wandb/run-20240120_175543-shw325rv
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run Model-text-generation
wandb: ⭐️ View project at https://wandb.ai/aravindan/Text%20generation%20using%20LORA
wandb: 🚀 View run at https://wandb.ai/aravindan/Text%20generation%20using%20LORA/runs/shw325rv
You're using a BloomTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
0,3.6532,3.665671
1,3.6527,3.651834
2,3.6301,3.6462
3,3.6279,3.644216
4,3.6385,3.644014


TrainOutput(global_step=4920, training_loss=3.6541302332064, metrics={'train_runtime': 22266.8144, 'train_samples_per_second': 14.147, 'train_steps_per_second': 0.221, 'total_flos': 1.0327383925122662e+17, 'train_loss': 3.6541302332064, 'epoch': 5.0})
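The `global_step=4920` in the output above can be reproduced from the training arguments. The sketch below assumes a single training process and mirrors, in simplified form, how the optimizer steps seem to be counted: the number of batches per epoch is rounded up, then divided (rounding down) by the gradient accumulation steps. The helper name `total_optimizer_steps` is hypothetical.

```python
import math


def total_optimizer_steps(n_samples, per_device_batch, grad_accum, epochs):
    """Count optimizer updates over a full run, assuming one training process."""
    batches_per_epoch = math.ceil(n_samples / per_device_batch)
    # One optimizer update happens every `grad_accum` batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


steps = total_optimizer_steps(
    n_samples=63_000,      # size of the training split
    per_device_batch=8,    # per_device_train_batch_size
    grad_accum=8,          # gradient_accumulation_steps
    epochs=5,              # num_train_epochs
)
effective_batch = 8 * 8    # samples contributing to each optimizer update

print(steps, effective_batch)
```

With an effective batch of 64 samples, each epoch contributes 984 updates, so five epochs give the 4920 steps reported by the trainer.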

Model Evaluation

In [None]:
results = sft_trainer.evaluate()
print(results)

{'eval_loss': 3.6440136432647705, 'eval_runtime': 198.5037, 'eval_samples_per_second': 35.264, 'eval_steps_per_second': 4.408, 'epoch': 5.0}


Perplexity Calculation
> Calculates and prints the perplexity value from the evaluation loss

In [None]:
import numpy as np

def perplexity(eval_loss):
    # Perplexity is the exponential of the mean per-token cross-entropy loss (in nats).
    return np.exp(eval_loss)

In [None]:
perplexity(results['eval_loss'])

38.24503099723861
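As a sanity check on the value above, the following sketch computes perplexity directly from per-token negative log-likelihoods, using only the standard library. It assumes the reported `eval_loss` is the mean cross-entropy per token in nats; the helper name `perplexity_from_token_nlls` is hypothetical.

```python
import math


def perplexity_from_token_nlls(token_nlls):
    """Perplexity is exp of the mean negative log-likelihood per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that spreads probability uniformly over 4 choices has perplexity 4.
uniform_ppl = perplexity_from_token_nlls([math.log(4)] * 10)

# Applying the same formula to the evaluation loss reported above.
eval_loss = 3.6440136432647705
eval_ppl = math.exp(eval_loss)

print(uniform_ppl, eval_ppl)
```

Read this way, a perplexity of about 38 means the model is, on average, roughly as uncertain as if it were choosing uniformly among 38 tokens at each position.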

In [None]:
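# Note: Trainer.push_to_hub takes a commit message as its first argument; the repo name is derived from output_dir.
sft_trainer.push_to_hub(os.getenv("WANDB_NAME"))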
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/25.2M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.28k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/tr-aravindan/Model-text-generation/commit/63e641635eec8717b217e57ccd838b0722f7537a', commit_message='Upload tokenizer', commit_description='', oid='63e641635eec8717b217e57ccd838b0722f7537a', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
del sft_trainer, tokenizer
torch.cuda.empty_cache()

Model Inference

In [None]:
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM

peft_model_name="/kaggle/working/Model-text-generation"

peft_config=PeftConfig.from_pretrained(peft_model_name)
base_model=AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path)

peft_model=PeftModel.from_pretrained(base_model, peft_model_name)

In [None]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path)

In [None]:
prompt="I good in football but"
inputs=tokenizer(prompt, return_tensors="pt")

In [None]:
outputs=peft_model.generate(**inputs)



In [None]:
tokenizer.batch_decode(outputs, skip_special_tokens=True)

['I good in football but I am not a fan of the game. I like the game but I']