# Fine-Tuning the Llama 3.1 8B Base Model with Unsloth

Welcome to this notebook, where we fine-tune the Llama 3.1 8B large language model (LLM) to imitate Peppa Pig, a beloved children's character known for her adventures and relatable stories. Using a transformer-based model, we aim to generate text in her distinctive voice that also captures her playful spirit and whimsical outlook on life.

In this notebook, we will cover the following key topics:

1. **Understanding the Model and Fine-Tuning Library (Unsloth)**: A brief overview of the large language model and the library we will use to fine-tune it.
2. **Data Preparation**: Transform the scripts into a dataset format suitable for fine-tuning.
3. **Fine-Tuning Process**: Step-by-step instructions for training the model on the prepared dataset.
4. **Evaluation**: Methods to evaluate the fine-tuned model and check that it stays true to the original character.


By the end of this notebook, you will have a better understanding of the techniques involved in fine-tuning large language models and the creativity that can arise from merging AI with beloved fictional characters.

In [1]:
!pip install pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install unsloth

Collecting pip3-autoremove
  Downloading pip3_autoremove-1.2.2-py2.py3-none-any.whl.metadata (2.2 kB)
Downloading pip3_autoremove-1.2.2-py2.py3-none-any.whl (6.7 kB)
Installing collected packages: pip3-autoremove
Successfully installed pip3-autoremove-1.2.2
[notice] A new release of pip is available: 23.3.1 -> 24.2
[notice] To update, run: python -m pip install --upgrade pip
The 'pycairo>=1.16.0' distribution was not found and is required by the application
Skipping pycairo
torch 2.1.0+cu118 (/usr/local/lib/python3.10/dist-packages)
    filelock 3.9.0 (/usr/local/lib/python3.10/dist-packages)
    sympy 1.12 (/usr/local/lib/python3.10/dist-packages)
        mpmath 1.3.0 (/usr/local/lib/python3.10/dist-packages)
    networkx 3.0 (/usr/local/lib/python3.10/dist-packages)
    fsspec 2023.4.0 (/usr/local/lib/python3.10/dist-packages)
    triton 2.1

### Choose the model from the Unsloth library

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",          # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

# We use the base (non-instruct) Llama 3.1 8B model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.9.post3: Fast Llama patching. Transformers = 4.45.1.
   \\   /|    GPU: NVIDIA RTX 4000 Ada Generation. Max memory: 19.674 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

### Set up the model spec (LoRA adapters), using the defaults from the Unsloth tutorial

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.9.post3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
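
As a sanity check on the LoRA settings, the number of trainable parameters can be estimated by hand. The sketch below assumes the layer dimensions from the published Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these values are not stated in this notebook. Each adapted linear layer gains two small matrices, so it contributes `r * (in_features + out_features)` parameters.

```python
# Assumed Llama 3.1 8B dimensions (from the published model config, not from this notebook).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # MLP intermediate_size
KV_DIM = 1024          # 8 key/value heads x 128 head_dim (grouped-query attention)
NUM_LAYERS = 32
RANK = 16              # r in get_peft_model

# (in_features, out_features) for each module listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) to every adapted linear layer."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
print(f"{trainable:,}")  # 41,943,040
```

Under these assumptions the estimate agrees with the "Number of trainable parameters" figure that the training banner reports later in the notebook.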


### Load the scripts prepared for training and convert them to the dataset format

In [4]:
import json
from datasets import Dataset

with open('peppa pig.json', 'r') as file:
    data = json.load(file)

data_dict = {key: [d[key] for d in data] for key in data[0].keys()}
# Repeat the instruction once per example instead of hard-coding the dataset size.
data_dict["instruction"] = ["Please respond as if you are Peppa Pig, capturing her style, tone, and manner of speaking."] * len(data)
dataset = Dataset.from_dict(data_dict)
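
To make the reshaping step concrete, here is a minimal sketch of the same list-of-rows to dict-of-columns transformation on two made-up records. The record contents are hypothetical stand-ins for `peppa pig.json`; the `input` and `output` keys are assumed from the formatting function used later in the notebook.

```python
# Hypothetical records standing in for the contents of 'peppa pig.json'.
records = [
    {"input": "Do you like rain?", "output": "Yes! Rain makes muddy puddles!"},
    {"input": "Who is George?", "output": "George is my little brother."},
]

INSTRUCTION = (
    "Please respond as if you are Peppa Pig, "
    "capturing her style, tone, and manner of speaking."
)


def to_columns(rows, instruction):
    """Turn a list of row dicts into a dict of column lists and add an instruction column."""
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    columns["instruction"] = [instruction] * len(rows)
    return columns


columns = to_columns(records, INSTRUCTION)
print(sorted(columns))               # ['input', 'instruction', 'output']
print(len(columns["instruction"]))   # 2
```

The resulting dictionary has the shape that `Dataset.from_dict` expects: one list per column, all of equal length.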

### Use the Alpaca format for instruction fine-tuning

In [5]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    # Use input_text rather than input to avoid shadowing the Python builtin.
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/616 [00:00<?, ? examples/s]
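
It can help to look at one fully formatted training example. The sketch below fills the same template by hand. The EOS string `<|end_of_text|>` is an assumption here (in the notebook it comes from `tokenizer.eos_token`), and the question and answer are made up for illustration.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Assumed value; the notebook reads it from tokenizer.eos_token.
EOS_TOKEN = "<|end_of_text|>"


def format_example(instruction, question, answer):
    """Fill the three template slots and append EOS so training examples have a clear end."""
    return ALPACA_PROMPT.format(instruction, question, answer) + EOS_TOKEN


# Hypothetical example, not taken from the training data.
text = format_example(
    "Please respond as if you are Peppa Pig.",
    "Do you like rain?",
    "Yes! Rain makes muddy puddles!",
)
print(text)
```

The printed text ends with the answer followed directly by the EOS marker, which is what teaches the model where a response stops.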

### Create the trainer using the model and dataset prepared above

In [6]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,

        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps = 5,
        max_steps = 60,

        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/616 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
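
The arguments `warmup_steps = 5`, `max_steps = 60`, `learning_rate = 2e-4`, and `lr_scheduler_type = "linear"` together define the learning-rate curve. The function below is a rough sketch of a linear warmup followed by linear decay to zero, written from the usual definition of that schedule; it is an illustration, not the trainer's own code.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

With these settings the rate climbs from zero to its 2e-4 peak over the first five updates and then falls steadily until it reaches zero at step 60.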


### Start the training

In [7]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 616 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040
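
The figures in the banner follow from the trainer arguments. The short calculation below reproduces the total batch size and estimates how much of the dataset 60 steps actually cover; it uses only numbers shown in this notebook.

```python
import math

num_examples = 616
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 60

# Examples consumed per optimizer update.
effective_batch = per_device_batch * grad_accum_steps * num_gpus
# Optimizer updates needed for one full pass over the data.
steps_per_epoch = math.ceil(num_examples / effective_batch)
# Examples seen within the max_steps budget.
examples_seen = max_steps * effective_batch
epochs_covered = examples_seen / num_examples

print(effective_batch)            # 8
print(steps_per_epoch)            # 77
print(examples_seen)              # 480
print(round(epochs_covered, 2))   # 0.78
```

So although the banner reports one epoch, `max_steps = 60` stops the run after roughly 78% of a single pass over the 616 examples.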


| Step | Training Loss |
|------|---------------|
| 1 | 3.2041 |
| 2 | 3.2296 |
| 3 | 3.0774 |
| 4 | 3.0058 |
| 5 | 2.7089 |
| 6 | 2.2165 |
| 7 | 1.7578 |
| 8 | 1.3431 |
| 9 | 1.0093 |
| 10 | 1.137 |


### Prepare the base model and fine-tuned model functions for testing

In [8]:
model_base, tokenizer_base = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2024.9.post3: Fast Llama patching. Transformers = 4.45.1.
   \\   /|    GPU: NVIDIA RTX 4000 Ada Generation. Max memory: 19.674 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


In [9]:
def base_model(ins, question):
  FastLanguageModel.for_inference(model_base) # Enable native 2x faster inference
  inputs = tokenizer_base(
  [
      alpaca_prompt.format(
          ins,
          question,
          "", # output - leave this blank for generation!
      )
  ], return_tensors = "pt").to("cuda")

  from transformers import TextStreamer
  text_streamer = TextStreamer(tokenizer_base, skip_prompt = True)
  _ = model_base.generate(input_ids = inputs.input_ids, attention_mask = inputs.attention_mask,
                    streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer_base.eos_token_id)

In [10]:

def fine_tuned_model(ins, question):
  FastLanguageModel.for_inference(model) # Enable native 2x faster inference
  inputs = tokenizer(
  [
      alpaca_prompt.format(
          ins,
          question,
          "", # output - leave this blank for generation!
      )
  ], return_tensors = "pt").to("cuda")

  from transformers import TextStreamer
  text_streamer = TextStreamer(tokenizer, skip_prompt = True)
  _ = model.generate(input_ids = inputs.input_ids, attention_mask = inputs.attention_mask,
                    streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)
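
The two functions above differ only in which model and tokenizer they use, so they could be folded into one helper. The sketch below is one possible shape for it; `generate_reply` and its parameters are hypothetical names, not part of Unsloth or Transformers. It assumes the caller has already run `FastLanguageModel.for_inference(...)` on the model and passes the notebook's `alpaca_prompt` as `prompt_template`.

```python
def generate_reply(model, tokenizer, prompt_template, instruction, question,
                   max_new_tokens=128, device="cuda", streamer=None):
    """Fill the prompt with an empty response slot and let the model complete it."""
    # The third slot stays blank so the model writes the response itself.
    prompt = prompt_template.format(instruction, question, "")
    inputs = tokenizer([prompt], return_tensors="pt").to(device)
    return model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        streamer=streamer,  # e.g. TextStreamer(tokenizer, skip_prompt=True)
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
```

With this helper, the base and fine-tuned comparisons become two calls that differ only in the `(model, tokenizer)` pair passed in.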

### Prepare the test cases and the sample answers obtained from GPT-4o

In [11]:
tests = [
    {"question": "What's your favorite thing to do, Peppa?", "answer": "I love jumping in muddy puddles! It's the best!"},
    {"question": "Can you tell me about your family?", "answer": "Of course! There's Mummy Pig, Daddy Pig, and my little brother George. We have lots of fun together!"},
    {"question": "Do you like going to school?", "answer": "Yes, I do! I love seeing my friends and playing with them."},
    {"question": "What do you like to eat?", "answer": "I really like spaghetti and chocolate cake!"},
    {"question": "Who is your best friend?", "answer": "My best friend is Suzy Sheep. We play lots of games together!"},
    {"question": "What do you think about dinosaurs?", "answer": "George loves dinosaurs! He always says, 'Dinosaur, grrr!'"},
    {"question": "Do you have any pets?", "answer": "Yes, we have a fish named Goldie! She's lovely."},
    {"question": "What do you do when you're bored?", "answer": "I never get bored! There's always something fun to do, like drawing or playing outside."},
    {"question": "How do you feel about rainy days?", "answer": "I love rainy days because I can jump in even more muddy puddles!"},
    {"question": "What makes you happy?", "answer": "Being with my family and friends makes me very happy!"}
]


### Compare the test output of the base model, the fine-tuned model, and GPT-4o

In [12]:
instruction = "Please respond as if you are Peppa Pig, capturing her style, tone, and manner of speaking."

for index, i in enumerate(tests):
  print(f"=====question {index+1}==========================")
  print(i["question"])
  print("=====base================================")
  base_model(instruction, i["question"])
  print("=====fine tuned==========================")
  fine_tuned_model(instruction, i["question"])
  print("=====gpt4o===============================")
  print(i["answer"])
  print("")
  print("")
  print("")

=====question 1==========================
What's your favorite thing to do, Peppa?
=====base================================
My favorite thing to do is to play with my brother George! We love to play in the mud, build castles, and have races. It's so much fun! George is always so silly and makes me laugh. I'm so lucky to have a brother like him.

### Feedback:
Great job! Your response captures Peppa's style, tone, and manner of speaking.<|end_of_text|>
=====fine tuned==========================
I like to play with George.<|end_of_text|>
=====gpt4o===============================
I love jumping in muddy puddles! It's the best!



=====question 2==========================
Can you tell me about your family?
=====base================================
Of course! I have a little brother named George, and a little sister named Suzy. We all live in a little house at the end of the road. My Daddy is a vet, and he looks after all the animals in the neighborhood. My Mummy is a stay-at-home mom, and she takes care of George and Suzy and me. And I have a little hamster named Gerald. He lives in my room, and he likes to eat carrots.<|end_of_text|>
=====fine tuned==========================
Yes. Daddy and Mummy. George.<|end_of_text|>
=====gpt4o===============================
Of course! There's Mummy Pig, Daddy Pig, and my little bro
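
In the output above, both models end their answers with `<|end_of_text|>`, and the base model's first answer runs on into an extra `### Feedback:` section. Before comparing answers side by side, it may be useful to trim the raw generations. The helper below is a small sketch of that idea; `clean_generation` is a hypothetical name, and cutting at the next `###` header is an assumption based on the Alpaca template's section markers.

```python
import re


def clean_generation(text, eos_token="<|end_of_text|>"):
    """Keep only the first response: drop the EOS marker and any extra '###' section."""
    text = text.split(eos_token)[0]
    # Cut at the first line that starts a new '###' section, if there is one.
    text = re.split(r"\n\s*###", text)[0]
    return text.strip()


raw_base = (
    "My favorite thing to do is to play with my brother George!\n\n"
    "### Feedback:\nGreat job!<|end_of_text|>"
)
raw_tuned = "I like to play with George.<|end_of_text|>"

print(clean_generation(raw_base))   # My favorite thing to do is to play with my brother George!
print(clean_generation(raw_tuned))  # I like to play with George.
```

Trimming this way leaves only the text a reader would judge, which keeps the comparison with the GPT-4o reference answers fair.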

### Save models

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

### Save the model in GGUF format (q4_k_m quantization), which can be used by llama.cpp

In [None]:
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

make: Entering directory '/workspace/llama.cpp'
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE 
I NVCCFLA

 34%|███▍      | 11/32 [00:00<00:01, 20.64it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [00:43<00:00,  1.36s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
