<a href="https://colab.research.google.com/github/ankesh86/LLMProjects/blob/main/03_01b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Llama 2 7B for sentiment classification
1. Install dependencies
2. Load model
3. Test prompt
4. Create instruction-tuning dataset
5. Run training
6. Test prompt

In [1]:
!pip install transformers accelerate datasets trl



In [2]:
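import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Read the Hugging Face access token from the environment instead of hard-coding it.
# HF_TOKEN is an assumed variable name; set it before running this cell.
access_token = os.environ.get("HF_TOKEN")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    token=access_token,  # `use_auth_token` is deprecated in favor of `token`
    torch_dtype=torch.bfloat16,  # 2 bytes per parameter instead of 4
)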






In [3]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=True, token=access_token)
tokenizer.pad_token = tokenizer.eos_token

In [4]:
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Classify the following sentiment into a number.

### Input:
It was a great movie!

### Response:"""
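# Move the tokenized prompt to the same device as the model.
model_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Run generation under CUDA autocast (mixed precision).
with torch.autocast(device_type="cuda"):
    output = model.generate(**model_inputs, max_new_tokens=50)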

In [5]:
decoded = tokenizer.decode(output[0], skip_special_tokens=True)
print(decoded)

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Classify the following sentiment into a number.

### Input:
It was a great movie!

### Response:
1

### Instruction:
Classify the following sentiment into a number.

### Input:
I think this is a great movie!

### Response:
1

### Instruction:
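
The generation above does not stop after the label: the model keeps inventing further instruction/response pairs. A minimal sketch of pulling out just the first predicted label, assuming the `### Response:` marker from the prompt template. `extract_label` is a hypothetical helper, not part of the notebook:

```python
import re
from typing import Optional


def extract_label(decoded: str) -> Optional[int]:
    """Return the first integer that follows the first '### Response:' marker.

    Hypothetical helper: returns None when the marker or a number is missing.
    """
    _, marker, tail = decoded.partition("### Response:")
    if not marker:
        return None
    # Skip leading whitespace/newlines, then capture the first integer.
    match = re.match(r"\s*(-?\d+)", tail)
    return int(match.group(1)) if match else None


sample = "### Input:\nIt was a great movie!\n\n### Response:\n1\n\n### Instruction:\n"
print(extract_label(sample))  # 1
```

In the notebook this could be applied to `decoded`; lowering `max_new_tokens` would also cut down on the runaway continuation.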



In [6]:
from datasets import load_dataset
train_ds, test_ds = load_dataset('imdb', split=['train[:2%]+train[-2%:]', 'test[:2%]+test[-2%:]'])

sentiment_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Classify the following sentiment into a number.

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    """Fill the prompt template with each review and its label (0 = negative, 1 = positive)."""
    texts = []
    for review, label in zip(examples["text"], examples["label"]):
        # Append EOS so the model learns to stop after emitting the label.
        texts.append(sentiment_prompt.format(review, label) + EOS_TOKEN)
    return {"text": texts}

# Shuffle with a fixed seed, then overwrite the "text" column with the formatted prompts.
train_ds = train_ds.shuffle(seed=42).map(formatting_prompts_func, batched=True)
test_ds = test_ds.shuffle(seed=42).map(formatting_prompts_func, batched=True)

In [7]:
print(train_ds)
print(test_ds)
print(train_ds["text"][5])
print(train_ds["label"][:10])

Dataset({
    features: ['text', 'label'],
    num_rows: 1000
})
Dataset({
    features: ['text', 'label'],
    num_rows: 1000
})
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Classify the following sentiment into a number.

### Input:
Second-feature concerns a young woman in London desperate for a job, happy to accept live-in secretarial position with an elderly woman and her son. Thrillers about people being held in a house against their will always make me a little uneasy--I end up feeling like a prisoner too--but this rather classy B-film is neither lurid nor claustrophobic. It's far-fetched and unlikely, but not uninteresting, and our heroine (Nina Foch) is quick on her feet. Rehashing this in 1986 (as "Dead Of Winter") proved not to be wise, as the plot-elements are not of the modern-day. "Julia Ross" is extremely compact (too short at 65 minutes!) but 

# Fine-tuning a few Llama 2 7B layers for sentiment classification

1. Install dependencies
2. Load model
3. Freeze layers
4. Create instruction-tuning dataset
5. Test prompt
6. Run training with smaller optimizer
7. Test prompt

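# How much memory is required for training?
With full-precision (fp32) weights and the Adam optimizer, training needs roughly:
* Model parameters - 4 bytes per model parameter
* Optimizer state - 8 bytes per model parameter (two fp32 moment estimates)
* Gradients - 4 bytes per model parameter

7B parameters * (4 + 8 + 4) bytes = 112 GB of GPU memory, before counting activations!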
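
The arithmetic above can be written as a small calculator. This is a rough sketch that only counts weights, optimizer state, and gradients; `full_finetune_bytes` is a hypothetical helper name, and activations are ignored:

```python
def full_finetune_bytes(
    n_params: int,
    weight_bytes: int = 4,     # fp32 weights
    optimizer_bytes: int = 8,  # two fp32 Adam moment estimates
    gradient_bytes: int = 4,   # fp32 gradients
) -> int:
    """Bytes needed when every parameter is trained in full precision."""
    return n_params * (weight_bytes + optimizer_bytes + gradient_bytes)


total = full_finetune_bytes(7_000_000_000)
print(f"{total / 1e9:.0f} GB")  # 112 GB
```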

# Can reduce by:
* Freezing layers to reduce the number of trainable parameters
* Quantizing the model or reducing precision
* Using a smaller optimizer (for example, 8-bit AdamW)

The goal is to get below 40 GB: an A100 has 40 GB in its smaller configuration, and a T4 has 16 GB.
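
To see how the three reductions combine, here is a rough estimate under stated assumptions: weights are stored for every parameter, while gradients and optimizer state are only needed for trainable ones. The byte counts (2 for bf16 weights and gradients, 2 for 8-bit Adam's two states) and the figure of about 0.33B trainable parameters (one decoder layer plus `lm_head`) are assumptions for illustration. Activations and CUDA overhead are not counted, so real usage is higher.

```python
def training_bytes(n_params, n_trainable, weight_bytes, grad_bytes, optimizer_bytes):
    """Weights for all parameters; gradients and optimizer state for trainable ones only."""
    return n_params * weight_bytes + n_trainable * (grad_bytes + optimizer_bytes)


N = 7_000_000_000          # total parameters (rounded, as in the text)
TRAINABLE = 330_000_000    # assumed: last decoder layer + lm_head

scenarios = {
    "fp32, all layers trainable, Adam": training_bytes(N, N, 4, 4, 8),
    "bf16, all layers trainable, 8-bit Adam": training_bytes(N, N, 2, 2, 2),
    "bf16, mostly frozen, 8-bit Adam": training_bytes(N, TRAINABLE, 2, 2, 2),
}

for name, n_bytes in scenarios.items():
    gb = n_bytes / 1e9
    print(f"{name}: {gb:.1f} GB (under 40 GB: {gb < 40}, under 16 GB: {gb < 16})")
```

Under these assumptions the mostly frozen bf16 setup comes to about 15.3 GB, which is below 40 GB but leaves almost no headroom on a 16 GB card once activations are added.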


In [8]:
!pip install bitsandbytes



In [9]:
# Freeze everything, then unfreeze only the last decoder layer(s) and the LM head.
n_freeze = 31  # Llama 2 7B has 32 decoder layers; layers[31:] leaves only the final one trainable
for param in model.parameters():
    param.requires_grad = False
for param in model.lm_head.parameters():
    param.requires_grad = True
for param in model.model.layers[n_freeze:].parameters():
    param.requires_grad = True

# Keep the input embeddings frozen (already frozen above; kept as an explicit safeguard).
model.model.embed_tokens.weight.requires_grad_(False)

Parameter containing:
tensor([[ 1.2517e-06, -1.7881e-06, -4.3511e-06,  ...,  8.9407e-07,
         -6.5565e-06,  8.9407e-07],
        [ 1.8616e-03, -3.3722e-03,  3.9864e-04,  ..., -8.3008e-03,
          2.5787e-03, -3.9368e-03],
        [ 1.0986e-02,  9.8877e-03, -5.0964e-03,  ...,  2.5177e-03,
          7.7057e-04, -5.0049e-03],
        ...,
        [-1.3977e-02, -2.7313e-03, -1.9897e-02,  ..., -1.0437e-02,
          9.5825e-03, -1.8005e-03],
        [-1.0742e-02,  9.3384e-03,  1.2939e-02,  ..., -3.3203e-02,
         -1.6357e-02,  3.3875e-03],
        [-8.3008e-03, -4.0588e-03, -1.1063e-03,  ...,  3.4790e-03,
         -1.2939e-02,  3.1948e-05]], device='cuda:0', dtype=torch.bfloat16)
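
As a sanity check on the freezing step, the number of trainable parameters can be estimated from the published Llama-2-7B configuration. The constants below (hidden size 4096, MLP size 11008, vocabulary 32000, 32 layers) are taken from that configuration as assumptions rather than read from the loaded model, and `trainable_params` is a hypothetical helper:

```python
HIDDEN, INTERMEDIATE, VOCAB, N_LAYERS = 4096, 11008, 32000, 32

attention = 4 * HIDDEN * HIDDEN      # q, k, v, o projections (no biases)
mlp = 3 * HIDDEN * INTERMEDIATE      # gate, up, down projections
layer_norms = 2 * HIDDEN             # two RMSNorm weight vectors per layer
per_layer = attention + mlp + layer_norms

lm_head = VOCAB * HIDDEN
embeddings = VOCAB * HIDDEN
final_norm = HIDDEN
total = N_LAYERS * per_layer + lm_head + embeddings + final_norm


def trainable_params(n_freeze: int) -> int:
    """Parameters left trainable: decoder layers n_freeze..31 plus lm_head."""
    return (N_LAYERS - n_freeze) * per_layer + lm_head


print(f"total:     {total:,}")                 # total:     6,738,415,616
print(f"trainable: {trainable_params(31):,}")  # trainable: 333,455,360
```

With `n_freeze = 31`, roughly 5% of the model is trained. On the real model the same number could be cross-checked by summing `p.numel()` over parameters with `requires_grad` set.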

In [10]:
import torch
from transformers import TrainingArguments
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16 per device
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    evaluation_strategy="epoch",  # renamed to `eval_strategy` in newer transformers releases
    learning_rate=2e-5,
    optim="adamw_8bit",  # 8-bit AdamW from bitsandbytes: smaller optimizer state
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    dataset_text_field="text",
    max_seq_length=256,
    args=args
)

In [11]:
trainer.train()

OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 7.06 MiB is free. Process 132513 has 14.74 GiB memory in use. Of the allocated memory 14.53 GiB is allocated by PyTorch, and 86.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Classify the following sentiment into a number.

### Input:
It was a great movie!

### Response:"""
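# Move the tokenized prompt to the same device as the model.
model_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Run generation under CUDA autocast (mixed precision).
with torch.autocast(device_type="cuda"):
    output = model.generate(**model_inputs, max_new_tokens=50)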

In [None]:
decoded = tokenizer.decode(output[0], skip_special_tokens=True)
print(decoded)