## NLP - Lesson 5 - LLM based Sentiment Analysis

Large language models (LLMs) are currently the most active area in the AI industry. This notebook shows how to load the pretrained `GPT2` model from the transformers library, use its matching tokenizer (the one the model was trained with), generate text from an instruction without any fine-tuning (zero-shot), and then fine-tune the same model with LoRA (Low-Rank Adaptation), which is a PEFT (Parameter-Efficient Fine-Tuning) method. Let's start with the notebook.

In [96]:
## Import Libraries

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

import torch
# from trl import SFTTrainer
from datasets import Dataset, load_dataset  # Dataset is needed later for Dataset.from_pandas
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from peft import PeftModel, PeftConfig
# from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

In [2]:
import transformers
import trl
import peft

print("Transformers version : ",transformers.__version__)
print("Transformers reinforcement learning version : ",trl.__version__)
print("Peft version : ",peft.__version__)

Transformers version :  4.50.3
Transformers reinforcement learning version :  0.16.0
Peft version :  0.15.1


## GPU initialization

I will be using my laptop's built-in GPU through `mps`, PyTorch's backend for Apple Silicon. On a Windows or Linux machine with an NVIDIA GPU, the device would be `cuda` instead of `mps`.

In [3]:
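# Prefer MPS (Metal Performance Shaders, Apple Silicon), then CUDA, then CPU
if torch.backends.mps.is_available():
    device = torch.device("mps")   # Apple GPU backend
elif torch.cuda.is_available():
    device = torch.device("cuda")  # NVIDIA GPU backend
else:
    device = torch.device("cpu")   # Fallback to CPU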

print(f"Using device: {device}")

Using device: mps


### Dataset

In [4]:
%%time
train_dataset = load_dataset("tatsu-lab/alpaca", split="train")
print(train_dataset)

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 52002
})
CPU times: user 157 ms, sys: 99.6 ms, total: 256 ms
Wall time: 9.39 s


In [5]:
pandas_format = train_dataset.to_pandas()
pandas_format.head()

| | instruction | input | output | text |
|---|---|---|---|---|
| 0 | Give three tips for staying healthy. | | 1.Eat a balanced diet and make sure to include... | Below is an instruction that describes a task.... |
| 1 | What are the three primary colors? | | The three primary colors are red, blue, and ye... | Below is an instruction that describes a task.... |
| 2 | Describe the structure of an atom. | | An atom is made up of a nucleus, which contain... | Below is an instruction that describes a task.... |
| 3 | How can we reduce air pollution? | | There are a number of ways to reduce air pollu... | Below is an instruction that describes a task.... |
| 4 | Describe a time when you had to make a difficu... | | I had to make a difficult decision when I was ... | Below is an instruction that describes a task.... |


In [6]:
pandas_format.isna().sum()

instruction    0
input          0
output         0
text           0
dtype: int64

In [7]:
pandas_format['input'].unique()

array(['', 'Twitter, Instagram, Telegram', '4/16', ...,
       'cake, me, eating', 'Michelle Obama',
       'The following is an excerpt from a contract between two parties, labeled "Company A" and "Company B": \n\n"Company A agrees to provide reasonable assistance to Company B in ensuring the accuracy of the financial statements it provides. This includes allowing Company A reasonable access to personnel and other documents which may be necessary for Company B’s review. Company B agrees to maintain the document provided by Company A in confidence, and will not disclose the information to any third parties without Company A’s explicit permission."'],
      dtype=object)

In [61]:
row = pandas_format.loc[0]

example = "Instruction : " + row['instruction'] + "\n" + row['input'] + "\n\n" + row['output'] + "\n\n" + row['text']
print(example)

Instruction : Give three tips for staying healthy.


1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.


In [104]:
def generate_model_output(prompt, model1, tokenizer1, device='mps'):
    # Tokenize the input
    inputs = tokenizer1(prompt, return_tensors="pt").to(device)
    
    # Generate text
    output = model1.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,         # Enables randomness for creativity
        top_k=50,               # Top-k sampling
        top_p=0.95,             # Nucleus sampling
        temperature=0.1         # Controls randomness (lower = less random)
    )
    
    # Decode and print the result (use the tokenizer passed in, not the global one)
    generated_text = tokenizer1.decode(output[0], skip_special_tokens=True)
    print("Instruction:", prompt)
    print("\nGenerated Text:\n", generated_text)

In [105]:
prompt = row['instruction']

generate_model_output(prompt, model, tokenizer)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Instruction: Give three tips for staying healthy.

Generated Text:
 Give three tips for staying healthy.

1. Don't drink too much.

2. Don't drink too much.

3. Don't drink too much.


4. Don't drink too much.

5. Don't drink too much.

6. Don't drink too much.


7. Don't drink too much.


8. Don't drink too much.

9.

10. Don't drink too much.

11. Don't drink too much.

12. Don't drink too much.

13. Don't drink too much.

14. Don't drink too much.

15. Don't drink too much.


16. Don't drink too much.

17. Don't drink too much.

18. Don't drink too much.

19. Don't drink too much.

20. Don't drink too much.




Even at a low temperature, the base model's output is not useful: it echoes the instruction and then repeats the same line for most of the generated text.
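To make the effect of `temperature` more concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. The logits are made-up values for three candidate tokens and `softmax_with_temperature` is a hypothetical helper, not the code path `generate` actually runs. It only illustrates why a temperature of 0.1 makes sampling behave almost like always picking the top token, which is one plausible reason the output above gets stuck in a loop.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens.
toy_logits = [2.0, 1.5, 0.5]

for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

As the temperature drops, almost all probability mass moves to the highest-scoring token. If repetition is the main problem, `generate` also accepts `repetition_penalty` and `no_repeat_ngram_size`, which may be worth trying.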

### Model Loading

In [None]:
# Uncomment to empty the MPS cache if model loading doesn't work on the first attempt:
# torch.mps.empty_cache()

In [45]:
%%time

# pretrained_model_name = "Salesforce/xgen-7b-8k-base"
# pretrained_model_name = "microsoft/phi-2"
# pretrained_model_name = "bigscience/bloom-560m"
# pretrained_model_name = "sbintuitions/modernbert-ja-30m"
pretrained_model_name = 'gpt2'

# A tokenizer holds no weights, so it needs no device placement; gpt2 also needs no remote code
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
print("Tokenizer initiated!")

# model = AutoModelForCausalLM.from_pretrained(pretrained_model_name, torch_dtype=torch.float16, trust_remote_code=True)
# model = AutoModelForCausalLM.from_pretrained(
#     pretrained_model_name,
#     torch_dtype=torch.float16,  # Use float16 for efficiency
#     trust_remote_code=True
# ).to(device)

model = AutoModelForCausalLM.from_pretrained(pretrained_model_name, torch_dtype=torch.bfloat16).to(device)

print("Model downloaded!")


Tokenizer initiated!
Model downloaded!
CPU times: user 2.21 s, sys: 374 ms, total: 2.58 s
Wall time: 1.63 s


In [48]:
def print_number_of_trainable_model_parameters(model1):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model1.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    val1 = f"Trainable model parameters : {trainable_model_params}, "
    val2 = f"All Model parameters : {all_model_params}, "
    val3 = f"Percentage trainable parameters : {np.round(100*trainable_model_params/all_model_params,3)}%"
    return val1 + val2 + val3

In [49]:
print_number_of_trainable_model_parameters(model)

'Trainable model parameters : 124439808, All Model parameters : 124439808, Percentage trainable parameters : 100.0%'

In [85]:
tokenizer.pad_token = tokenizer.eos_token

In [86]:
def combine_instruction_output(example):
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"

pandas_format['input_text'] = pandas_format.apply(combine_instruction_output, axis=1)
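Note that `combine_instruction_output` drops the `input` column, which is non-empty for some rows (see the `unique()` output earlier). A minimal sketch of a variant that keeps it is shown below. `build_prompt` and its argument names are hypothetical, and the `### Input:` header is an assumption chosen to match the style of the other two headers.

```python
def build_prompt(instruction, context="", response=""):
    """Join instruction, optional context and response into one training prompt."""
    parts = [f"### Instruction:\n{instruction}"]
    if context.strip():
        # Only add the extra section when the row actually has context.
        parts.append(f"### Input:\n{context}")
    parts.append(f"### Response:\n{response}")
    return "\n\n".join(parts)

# Row without context: same layout as combine_instruction_output.
print(build_prompt("What are the three primary colors?", "", "Red, blue, and yellow."))
print("-" * 40)
# Row with context: the extra section appears between instruction and response.
print(build_prompt("Rank these apps by age.", "Twitter, Instagram, Telegram", "..."))
```

With a DataFrame this could be applied row by row, for example `pandas_format.apply(lambda r: build_prompt(r['instruction'], r['input'], r['output']), axis=1)`.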

In [89]:
hf_dataset = Dataset.from_pandas(pandas_format)

In [91]:
# Tokenization function
def tokenize_function(example, tokenizer1=tokenizer):
    # Tokenize the combined instruction + response prompt
    model_input = tokenizer1(
        example['input_text'],
        truncation=True,
        padding="max_length",
        max_length=512  # adjust based on your model's context window
    )

    # For causal LM training the labels are the input tokens themselves
    # (the model shifts them internally), so they must come from the same text.
    # Padding positions are set to -100 so the loss ignores them; the attention
    # mask is used for this because pad_token is the same as eos_token here.
    model_input["labels"] = [
        token_id if mask == 1 else -100
        for token_id, mask in zip(model_input["input_ids"], model_input["attention_mask"])
    ]
    return model_input

# Apply tokenization
tokenized_dataset = hf_dataset.map(tokenize_function, batched=False)


Map: 100%|███████████████████████| 52002/52002 [00:16<00:00, 3217.60 examples/s]


### PEFT Model training

In [51]:
model_training_args = TrainingArguments(
    output_dir="gpt2-fine-tuned",
    per_device_train_batch_size=4,
    optim="adamw_torch",
    logging_steps=80,            # larger than max_steps, so no training loss is logged in this run
    learning_rate=2e-4,
    warmup_ratio=0.1,
    max_steps=3,                 # quick demo run; when set, this overrides num_train_epochs
    lr_scheduler_type="linear",
    num_train_epochs=1,
    save_strategy="epoch",
    dataloader_pin_memory=False,
    report_to="none"
)
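To put `max_steps = 3` in perspective, the following back-of-the-envelope sketch compares it with a full pass over the data. It assumes a single device and no gradient accumulation (the default); the variable names are only for illustration.

```python
import math

num_examples = 52002        # rows in the Alpaca train split (from the dataset output)
per_device_batch_size = 4   # per_device_train_batch_size
max_steps = 3               # max_steps

# Assumption: one device and gradient_accumulation_steps=1.
steps_per_epoch = math.ceil(num_examples / per_device_batch_size)
examples_seen = max_steps * per_device_batch_size
fraction_seen = examples_seen / num_examples

print(f"Steps in one full epoch : {steps_per_epoch}")
print(f"Examples seen in this run: {examples_seen}")
print(f"Fraction of the dataset  : {fraction_seen:.4%}")
```

So this configuration only touches 12 examples. It is enough to check that the pipeline runs end to end, but far too short to expect the model to learn the instruction format.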

In [30]:
# lora_peft_config = LoraConfig(
#     task_type=TaskType.CAUSAL_LM,  # Since we're working with a causal language model
#     inference_mode=False,           # Set to False for training
#     r=16,                           # Rank of LoRA (reducing parameter count)
#     lora_alpha=32,                  # Scaling factor for LoRA weights
#     lora_dropout=0.1,               # Dropout rate for LoRA layers
#     target_modules=["q_proj", "v_proj"]  # Target attention projection layers
#     # target_modules=['query_key_value']
# )

lora_peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["c_attn", "c_proj"],  # GPT-2: c_attn is the fused query/key/value projection; c_proj matches the attention and MLP output projections
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
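For intuition about what `r` and `lora_alpha` do, here is a minimal plain-Python sketch of a LoRA-adapted linear layer. It is not the PEFT implementation: `matmul` and `lora_forward` are hypothetical helpers, the matrix sizes are toy values, and dropout and bias are left out. The frozen weight `W` is combined with a trainable low-rank product `A B`, scaled by `lora_alpha / r` (which is 32 / 32 = 1.0 for the config above).

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, lora_alpha, r):
    """Return x W + (lora_alpha / r) * x A B for row-vector inputs x."""
    scaling = lora_alpha / r
    base = matmul(x, W)                 # frozen pretrained path
    update = matmul(matmul(x, A), B)    # trainable low-rank path
    return [[b + scaling * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

random.seed(0)
d_in, d_out, r, lora_alpha = 4, 3, 2, 2   # toy sizes, not the GPT-2 ones

W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]  # frozen
A = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d_in)]      # trainable
B = [[0.0] * d_out for _ in range(r)]                                  # trainable, starts at zero
x = [[1.0, 2.0, 3.0, 4.0]]

# With B at zero the adapter adds nothing, so the output equals the base layer.
print(lora_forward(x, W, A, B, lora_alpha, r) == matmul(x, W))

# After an (imagined) training step changes B, the output moves away from the base.
B_trained = [[0.1] * d_out for _ in range(r)]
print(lora_forward(x, W, A, B_trained, lora_alpha, r) == matmul(x, W))
```

Starting `B` at zero means fine-tuning begins from exactly the pretrained model's behavior, and only `A` and `B` receive gradient updates.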

In [50]:
peft_model = get_peft_model(model, lora_peft_config)
print(print_number_of_trainable_model_parameters(peft_model))

Trainable model parameters : 3244032, All Model parameters : 127683840, Percentage trainable parameters : 2.541%
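The trainable count can be reproduced by hand, which is a useful check on which layers LoRA actually attached to. The sketch below assumes the standard GPT-2 small dimensions (12 blocks, hidden size 768, MLP inner size 3072) and that the name `c_proj` matches both the attention and the MLP output projection in every block. The dictionary keys are descriptive labels, not exact module paths.

```python
N_LAYERS = 12    # transformer blocks in GPT-2 small
D_MODEL = 768    # hidden size
R = 32           # LoRA rank from the config

# (in_features, out_features) of each targeted layer inside one block
targeted = {
    "attn.c_attn": (D_MODEL, 3 * D_MODEL),   # fused query/key/value projection
    "attn.c_proj": (D_MODEL, D_MODEL),       # attention output projection
    "mlp.c_proj": (4 * D_MODEL, D_MODEL),    # MLP output projection
}

def lora_params(in_features, out_features, r):
    """One adapter adds an (in x r) matrix and an (r x out) matrix."""
    return r * (in_features + out_features)

per_block = sum(lora_params(i, o, R) for i, o in targeted.values())
total_lora = N_LAYERS * per_block
base_params = 124439808  # parameter count printed for the base model

print(f"LoRA parameters per block: {per_block}")
print(f"LoRA parameters in total : {total_lora}")
print(f"Base + LoRA              : {base_params + total_lora}")
print(f"Trainable share          : {100 * total_lora / (base_params + total_lora):.3f}%")
```

The totals agree with the printed output (3,244,032 trainable out of 127,683,840, about 2.541%), which supports the assumption that `c_proj` matched two layers per block.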


In [93]:
peft_trainer = Trainer(
    model=peft_model,
    train_dataset=tokenized_dataset,
    # dataset_text_field="text",
    # max_seq_length=1024,
    # tokenizer=tokenizer,
    args=model_training_args
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [94]:
%%time
peft_trainer.train()

print("PEFT complete!")

peft_model_path = "./gpt2-fine-tuned"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss


PEFT complete!
CPU times: user 2.29 s, sys: 910 ms, total: 3.2 s
Wall time: 4.26 s


('./gpt2-fine-tuned/tokenizer_config.json',
 './gpt2-fine-tuned/special_tokens_map.json',
 './gpt2-fine-tuned/vocab.json',
 './gpt2-fine-tuned/merges.txt',
 './gpt2-fine-tuned/added_tokens.json',
 './gpt2-fine-tuned/tokenizer.json')

In [99]:
%%time

peft_model_base = AutoModelForCausalLM.from_pretrained(pretrained_model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
print("Model and  tokenizer downloaded!")

peft_model_loaded = PeftModel.from_pretrained(peft_model_base,
                                      peft_model_path,
                                      torch_dtype=torch.bfloat16,
                                      is_trainable=False
                                      )

Model and  tokenizer downloaded!
CPU times: user 2.72 s, sys: 418 ms, total: 3.14 s
Wall time: 1.14 s


In [106]:
prompt = row['instruction']

generate_model_output(prompt, peft_model_loaded, tokenizer,device='cpu')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Instruction: Give three tips for staying healthy.

Generated Text:
 Give three tips for staying healthy.

1. Get a good diet.

If you're not eating enough, you're not going to be able to eat enough.

If you're not eating enough, you're not going to be able to eat enough.

2. Get a good sleep.

If you're not sleeping well, you're not going to be able to sleep well.

If you're not sleeping well, you're not going to be able to sleep well.

3. Get a good diet.

If you're not eating enough, you're not going to be able to eat enough.

If you're not eating enough, you're not going to be able to eat enough.

4. Get a good sleep.

If you're not sleeping well, you're not going to be able to sleep well.

If you're not sleeping well, you're not going to be able to sleep well.

5.


### Conclusion

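For this prompt, the PEFT model's generated text is more on-topic than what the original gpt2 produced: it mentions diet and sleep, although it still repeats itself. Keep in mind that this run trained for only 3 steps and compares a single sample, so the difference is suggestive rather than conclusive. Training the model for longer should give clearer and better results.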

In [None]:
# rm -rf ~/.cache/huggingface/transformers