<a href="https://colab.research.google.com/github/alexyoung13/llama_haiku_poet/blob/main/LLama_Haiku_Poet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>LLama Haiku Poet using QLORA</h1>

This project fine-tunes a 7B Llama model (Code Llama 7B) on haikus so that it can compose a haiku on request. To make it fit on the free tier of Colab, I use quantization and low-rank adaptation (QLoRA).
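
To make the motivation for quantization concrete, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It is only a sketch: the parameter count (about 6.7 billion) and the usable GPU memory on a free Colab T4 (about 15 GB) are assumptions rather than measurements from this notebook, and the estimate ignores activations, the CUDA context, and quantization metadata.

```python
# Rough weight-memory estimate; parameter count and GPU size are ASSUMPTIONS.
ASSUMED_PARAMS = 6.7e9   # approximate size of a 7B Llama-family model
ASSUMED_GPU_GB = 15.0    # approximate usable memory on a free Colab T4


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(ASSUMED_PARAMS, 16)
four_bit_gb = weight_memory_gb(ASSUMED_PARAMS, 4)
print(f"fp16 weights: {fp16_gb:.1f} GB, 4-bit weights: {four_bit_gb:.2f} GB")
```

Under these assumptions the half-precision weights alone would take most of the GPU, while the 4-bit weights leave room for activations and the adapter.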

<h2>Part 1: Downloading packages and model</h2>

In [2]:
!pip install accelerate "bitsandbytes>0.37.0" datasets peft transformers torch sentencepiece loralib
#!pip install flash-attn --no-build-isolation

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 16.1.0 which is incompatible.
google-colab 1.0.0 requires requests==2.31.0, but you have requests 2.32.3 which is incompatible.
ibis-framework 8.0.0 requires pyarrow<16,>=2, but you have pyarrow 16.1.0 which is incompatible.

In [3]:
#check if GPU is available
import torch
torch.cuda.is_available()

True

In [4]:
#check availability of libraries
import os
os.environ["CUDA_VISIBLE_DEVICES"] ="0"
import torch.nn as nn
import bitsandbytes as bnb
from transformers import LlamaForCausalLM, CodeLlamaTokenizer, AutoModelForCausalLM
from transformers.modeling_utils import is_accelerate_available, is_bitsandbytes_available
from transformers import BitsAndBytesConfig

print(f"Accelerate available: {is_accelerate_available()}")
print(f"B&B available: {is_bitsandbytes_available()}")

Accelerate available: True
B&B available: True


In [5]:
# Download the tokenizer and model
from transformers import BitsAndBytesConfig

tokenizer = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
#print(LlamaForCausalLM._supports_flash_attn_2)
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    #use_flash_attention=True,
    quantization_config=BitsAndBytesConfig( #quantization step
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.float16
    ),
)

# Add special tokens
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



`low_cpu_mem_usage` was None, now set to True since model is quantized.



In [6]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32017, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Lla
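
The `Linear4bit` modules in the printout above store their weights as 4-bit codes rather than 16-bit floats. As a rough illustration of the general idea only, the sketch below quantizes one small block of weights to 16 evenly spaced levels using a per-block absmax scale. This is a simplification: the `nf4` type selected in the config uses a non-uniform codebook, and the function names and sample values here are hypothetical.

```python
# Illustrative blockwise absmax quantization to 16 levels (4 bits).
# NOTE: evenly spaced levels are used for simplicity; this is NOT the nf4
# codebook used by bitsandbytes. All names and values are hypothetical.

def quantize_block(weights):
    """Map floats to integer codes in [0, 15] plus one scale for the block."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [round((w / scale + 1.0) / 2.0 * 15) for w in weights]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the 4-bit codes and the block scale."""
    return [(c / 15 * 2.0 - 1.0) * scale for c in codes]


block = [0.12, -0.40, 0.05, 0.33, -0.08, 0.00, 0.27, -0.19]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, scale, max_error)
```

Each weight now needs 4 bits plus a share of the per-block scale, and the reconstruction error is bounded by half the spacing between levels.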

<h2>Part 2: Preprocessing the model and data</h2>

In [7]:
#freezes the parameters of the model
for param in model.parameters():
  param.requires_grad = False
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

# Cast the lm_head output (the logits) to float32 on every forward pass
class CastOutputToFloat(nn.Sequential):
  def forward(self, x):
    return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)

In [8]:
# Report how many parameters the LoRA adapter trains relative to the full model
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [9]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model.add_adapter(config, adapter_name="lora_adapter_1")
print_trainable_parameters(model)

trainable params: 4194304 || all params: 3504746496 || trainable%: 0.11967496093617608
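
The printed numbers can be reproduced from the layer shapes shown by `print(model)`. The sketch below rests on two assumptions that are labeled in the code: that each 4-bit weight matrix is stored packed, two weights per element, so it reports half its logical size, and that the embedding, the final norm, and `lm_head` (the latter two are cut off in the printout) follow the standard Llama layout and stay unquantized. Under those assumptions the totals match, which also explains why a "7B" model reports roughly 3.5 billion parameters here.

```python
# Reproduce the output of print_trainable_parameters from the layer shapes.
HIDDEN, MLP, VOCAB, LAYERS, RANK = 4096, 11008, 32017, 32, 8

# LoRA adds A (rank x in_features) and B (out_features x rank) per target module.
lora_per_module = RANK * HIDDEN + HIDDEN * RANK
trainable = lora_per_module * 2 * LAYERS  # q_proj and v_proj in every layer

# Linear weights per decoder layer: 4 attention projections + 3 MLP projections.
linear_per_layer = 4 * HIDDEN * HIDDEN + 3 * HIDDEN * MLP
# ASSUMPTION: 4-bit weights are packed two per stored element, halving numel().
packed_linear = linear_per_layer * LAYERS // 2
# Two RMSNorm weights per layer plus the final norm.
norms = 2 * HIDDEN * LAYERS + HIDDEN
# ASSUMPTION: the token embedding and lm_head are not quantized.
embeddings = 2 * VOCAB * HIDDEN

all_params = packed_linear + norms + embeddings + trainable
print(trainable, all_params, 100 * trainable / all_params)
```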


In [10]:
from datasets import load_dataset

dataset = load_dataset("davanstrien/haiku_dpo")


In [11]:
from datasets import DatasetDict

#map the data to a prompt
def create_prompt(question, chosen):
  prompt_template = f"### QUESTION\n{question}\n\n### RESPONSE\n{chosen}</s>"
  return prompt_template


mapped_dataset = dataset.map(lambda samples: tokenizer(create_prompt(samples['question'], samples['chosen'])))


In [12]:
print(mapped_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'generation_model', 'generation_prompt', 'generations', 'scores', 'chosen', 'chosen_score', 'rejected', 'rejected_score', 'tie', 'difference_in_score', 'system', 'input_ids', 'attention_mask'],
        num_rows: 4123
    })
})


In [13]:
#Sample mapped datapoint
print(mapped_dataset["train"]['chosen'][430])

Flash of lightning's might,
Electric dance through the sky,
Nature's lightning show.


In [14]:
# Generate a completion for a question and render it as Markdown in the Colab output
from IPython.display import display, Markdown

def make_inference(question):
  batch = tokenizer(f"### QUESTION\n{question}\n\n### ANSWER\n", return_tensors='pt')

  with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=200)

  display(Markdown((tokenizer.decode(output_tokens[0], skip_special_tokens=True))))
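
Note that `create_prompt` labels the haiku with `### RESPONSE`, while `make_inference` prompts with `### ANSWER`. One way to keep the two from drifting apart is to build both strings from a single template. The helper below is a hypothetical sketch, not part of the original notebook; the header names and the `build_prompt` function are illustrative choices.

```python
# Hypothetical helper: one template shared by training and inference so the
# section headers cannot drift apart. Header names are illustrative choices.
QUESTION_HEADER = "### QUESTION"
RESPONSE_HEADER = "### RESPONSE"
EOS_TOKEN = "</s>"


def build_prompt(question, answer=None):
    """Return an inference prompt, or a full training example if answer is given."""
    prompt = f"{QUESTION_HEADER}\n{question}\n\n{RESPONSE_HEADER}\n"
    if answer is not None:
        prompt += f"{answer}{EOS_TOKEN}"
    return prompt


question = "Can you compose a haiku about the moon?"
train_text = build_prompt(question, "placeholder haiku text")
infer_text = build_prompt(question)
# The training example always begins with exactly the inference prompt.
print(train_text.startswith(infer_text))
```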

In [15]:
#Sample Questions
ocean_question = "Can you compose a haiku about the ocean?"

moon_question = "Can you compose a haiku about the moon?"

star_wars_question = "Can you compose a haiku about Star Wars?"

ice_cream_question = "Can you compose a haiku about ice cream?"

In [16]:
make_inference(ocean_question)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### QUESTION
Can you compose a haiku about the ocean?

### ANSWER

The ocean is a vast expanse of water.

The ocean is a vast expanse of water.

The ocean is a vast expanse of water.

### EXPLANATION

The ocean is a vast expanse of water.

The ocean is a vast expanse of water.

The ocean is a vast expanse of water.

### HINTS

The haiku form consists of three lines of 5, 7, and 5 syllables.

The first line has five syllables.

The second line has seven syllables.

The third line has five syllables.

### REFERENCES

* [Wikipedia](https://en.wikipedia.org/wiki/Haiku)

### TIPS

* The first line is the most important.
* The second line is the most important.

In [17]:
make_inference(moon_question)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### QUESTION
Can you compose a haiku about the moon?

### ANSWER

The moon is a beautiful sight

It is a beautiful sight

It is a beautiful sight

### EXPLANATION

The moon is a beautiful sight

It is a beautiful sight

It is a beautiful sight

### CODE

```python
print("The moon is a beautiful sight")
print("It is a beautiful sight")
print("It is a beautiful sight")
```

### OUTPUT

```
The moon is a beautiful sight
It is a beautiful sight
It is a beautiful sight
```

### NOTE

The haiku is a Japanese poem form consisting of three lines of 5, 7, and 5 syllables, respectively.

### REFERENCES

1. [Wikipedia](https://en.wikipedia.org/wiki/Haiku)

---

### QUESTION



In [18]:
make_inference(star_wars_question)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### QUESTION
Can you compose a haiku about Star Wars?

### ANSWER

I am a Jedi,

I am a Sith,

I am a Jedi.

### EXPLANATION

A haiku is a Japanese poem composed of three lines of 5, 7, and 5 syllables.

The first line is a statement of fact. The second line is a statement of emotion. The third line is a statement of conclusion.

The first line of this haiku is a statement of fact. The second line of this haiku is a statement of emotion. The third line of this haiku is a statement of conclusion.

### REFERENCES

- [Wikipedia](https://en.wikipedia.org/wiki/Haiku)
- [Haiku](https://www.poetryfoundation.org/poetrymagazine/poems/44909/haiku)
- [Haiku](https

In [19]:
make_inference(ice_cream_question)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### QUESTION
Can you compose a haiku about ice cream?

### ANSWER

Ice cream is a delicious treat.

### EXPLANATION

The first line of the haiku is the most important. It is the only line that is required. The second line is optional, but it is a good idea to include it. The third line is optional, but it is a good idea to include it.

The first line of the haiku is the only line that is required. It is the only line that is required. The second line is optional, but it is a good idea to include it. The third line is optional, but it is a good idea to include it.

The first line of the haiku is the only line that is required. It is the only line that is required. The second line is optional, but it is a good idea to include it. The third line is optional, but it is a good idea to include it.

The first line of the haiku is
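
Before fine-tuning, the base model keeps generating extra `###` sections until it reaches the `max_new_tokens` limit. When only the answer is of interest, the decoded text can be trimmed afterwards. The helper below is a hypothetical sketch (`extract_answer` is not part of the notebook) that keeps the text between the answer header and the next `###` section; the sample string is taken from the ice cream output above.

```python
# Hypothetical post-processing helper (not part of the original notebook).
def extract_answer(generated, answer_header="### ANSWER"):
    """Return the text after the answer header, cut at the next '###' section.

    Returns an empty string if the header is not present.
    """
    _, _, tail = generated.partition(answer_header)
    answer, _, _ = tail.partition("\n###")
    return answer.strip()


sample = (
    "### QUESTION\nCan you compose a haiku about ice cream?\n\n"
    "### ANSWER\n\nIce cream is a delicious treat.\n\n"
    "### EXPLANATION\n\nThe first line of the haiku is the most important."
)
print(extract_answer(sample))  # prints: Ice cream is a delicious treat.
```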

<h2>Part 3: Training</h2>

In [20]:
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=mapped_dataset["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=100,
        learning_rate=1e-3,
        fp16=True,
        logging_steps=1,
        output_dir='outputs',
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False
trainer.train()
model.config.use_cache = True

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss
1,2.6205
2,2.6099
3,2.6063
4,2.6089
5,2.5306
6,2.5708
7,2.6019
8,2.6141
9,2.5741
10,2.4531


TrainOutput(global_step=100, training_loss=1.4337179416418075, metrics={'train_runtime': 752.7164, 'train_samples_per_second': 2.126, 'train_steps_per_second': 0.133, 'total_flos': 4163567293366272.0, 'train_loss': 1.4337179416418075, 'epoch': 0.3879728419010669})
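
The reported metrics follow from the training arguments. The sketch below recomputes the number of sequences per optimizer step, the fraction of an epoch covered, and the throughput from values that appear in this notebook. It also shows the learning rate over time, assuming the linear warmup schedule (which I believe is the `Trainer` default, since no scheduler is set here). Because `warmup_steps` equals `max_steps`, under that assumption the learning rate is still ramping up when training stops.

```python
import math

# Values taken from the notebook: dataset size, TrainingArguments, train_runtime.
NUM_ROWS, BATCH_SIZE, GRAD_ACCUM, MAX_STEPS = 4123, 4, 4, 100
WARMUP_STEPS, PEAK_LR, RUNTIME_S = 100, 1e-3, 752.7164

samples_per_step = BATCH_SIZE * GRAD_ACCUM                # sequences per optimizer update
batches_per_epoch = math.ceil(NUM_ROWS / BATCH_SIZE)      # dataloader batches in one epoch
epochs_done = MAX_STEPS * GRAD_ACCUM / batches_per_epoch  # fraction of the data seen
samples_per_second = MAX_STEPS * samples_per_step / RUNTIME_S


def linear_warmup_lr(step):
    """Learning rate at an optimizer step, ASSUMING a linear warmup/decay schedule."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, MAX_STEPS - step)
    return PEAK_LR * remaining / max(1, MAX_STEPS - WARMUP_STEPS)


print(samples_per_step, round(epochs_done, 4), round(samples_per_second, 3))
print([linear_warmup_lr(s) for s in (0, 10, 50, 99)])
```

The epoch fraction and samples per second agree with the `TrainOutput` above, so the run saw a little under 40% of the dataset.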

<h2>Part 4: Results</h2>

In [25]:
make_inference(ocean_question)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### QUESTION
Can you compose a haiku about the ocean?

### ANSWER
Ocean's vast expanse,
Waves crashing on the shore,
Nature's symphony.

In [22]:
make_inference(moon_question)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### QUESTION
Can you compose a haiku about the moon?

### ANSWER
Moonlit night, so bright,
Silent whispers in the air,
Moon's gentle light.

In [23]:
make_inference(star_wars_question)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### QUESTION
Can you compose a haiku about Star Wars?

### ANSWER
Star Wars, a galaxy far,
A new hope, a galaxy wide,
A galaxy of light.

In [24]:
make_inference(ice_cream_question)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### QUESTION
Can you compose a haiku about ice cream?

### ANSWER
Ice cream melts so sweet,
Cooling the heat of the day,
Savoring the sweetness.

<h2>Part 5: Saving Model to Huggingface</h2>

In [None]:
from huggingface_hub import notebook_login

HUGGING_FACE_USER_NAME = "" #removed for privacy
notebook_login()



In [None]:
model_name = "code_llama_haiku_example"

model.push_to_hub(f"{HUGGING_FACE_USER_NAME}/{model_name}", token=True)
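
Since `HUGGING_FACE_USER_NAME` is left empty in this copy of the notebook, the repository id built above would start with a slash. A small guard can catch that before any upload is attempted. The helper below is a hypothetical sketch (`build_repo_id` is not part of the notebook or of `huggingface_hub`), and `example-user` is a placeholder.

```python
# Hypothetical guard (not in the original notebook): fail early when the
# username placeholder is still empty, before any upload is attempted.
def build_repo_id(user_name, model_name):
    """Join user and model name into a 'user/model' Hub repository id."""
    user_name, model_name = user_name.strip(), model_name.strip()
    if not user_name or not model_name:
        raise ValueError("Both user_name and model_name must be non-empty.")
    return f"{user_name}/{model_name}"


print(build_repo_id("example-user", "code_llama_haiku_example"))
```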

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "code_llama_haiku_example"
peft_model_id = f"{HUGGING_FACE_USER_NAME}/{model_name}"

# Read the adapter config to find which base model the adapter was trained on
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # load the base model in 8-bit
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Attach the LoRA adapter weights to the base model
qa_model = PeftModel.from_pretrained(model, peft_model_id)

In [None]:
tokenizer = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")