<a href="https://colab.research.google.com/github/ernanhughes/fine-tuning-llm/blob/main/AI_Makerspace_Unsloth_Event.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI Makerspace - Unsloth Event

Unsloth is an open-source Python library for the fine-tuning (and now [continued pre-training](https://unsloth.ai/blog/contpretraining)) of Large Language Models.

With reported speedups of roughly 1.5-2x and VRAM reductions of roughly 50-60%, Unsloth's value is clear-cut.

Let's work through a simple example to demonstrate how Unsloth helps fine-tune models blazingly fast. 🔥

## Dependencies

First of all, we're going to need some dependencies, including Unsloth and [Xformers](https://github.com/facebookresearch/xformers).

> NOTE: This install block is taken from Unsloth's amazing [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks). Check them out for examples of fine-tuning the most popular models!

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# We have to check which Torch version for Xformers (2.3 -> 0.0.27)
from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton
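
The version check in the install cell can be read as a small pure function. The sketch below is a dependency-free illustration of that logic, not part of the Unsloth install block: `parse_version` and `pick_xformers_spec` are hypothetical names, and the parser only handles plain `major.minor.patch` strings with an optional local suffix such as `+cu121` (it does not treat pre-releases the way `packaging` does).

```python
import re

def parse_version(version: str) -> tuple:
    """Return (major, minor, patch) from strings like '2.4.0+cu121'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))

def pick_xformers_spec(torch_version: str) -> str:
    # Mirrors the install cell: pin Xformers for Torch < 2.4.0, otherwise take the latest.
    if parse_version(torch_version) < (2, 4, 0):
        return "xformers==0.0.27"
    return "xformers"

print(pick_xformers_spec("2.3.1+cu121"))
print(pick_xformers_spec("2.4.0+cu121"))
```

The first call selects the pinned release and the second selects the unpinned package, which is the same branch the install cell takes for those Torch versions.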

## Load Unsloth Model

Our first major step is to load the Unsloth model.

Unsloth has a number of artifacts to choose from as you can see in [this](https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp#scrollTo=QmUBVEnvCDJv&line=1&uniqifier=1) notebook.

Today, we'll be focused on:

`unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`, which is a 4-bit pre-quantized model (for faster downloading).

You'll notice that the only *key* difference from the typical `transformers` flow at this stage is that we're using `FastLanguageModel`, which is Unsloth's main feature. As the name suggests, it is a faster implementation of the `transformers` library's `AutoModelForCausalLM` suite.

We're also setting a few parameters:

- `max_seq_length`: because Unsloth supports RoPE scaling out of the box, this can technically be any value that fits into memory. We're going to use a lower `max_seq_length` to save some GPU RAM for this specific example.
- `dtype`: setting this to `None` allows Unsloth to auto-detect based on available hardware.
- `load_in_4bit`: setting this to `True` reduces GPU RAM overhead. Since we're already loading a pre-quantized 4-bit model, we can leave it out.

In [None]:
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=1024,
    dtype=None,
    # load_in_4bit=True, ## optional because we're using a pre-quantized model
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
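
To see why a 4-bit checkpoint matters on a small GPU, here is a back-of-the-envelope estimate of weight memory. It is a rough sketch under stated assumptions: about 8 billion parameters (a round figure for an "8B" model), weights only, and no allowance for quantization metadata, activations, or optimizer state, so real usage will be higher. `weight_memory_gb` is a hypothetical helper, not an Unsloth function.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 8e9  # assumption: round figure for an "8B" model

for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights alone (about 16 GB) would not fit in the 14.748 GB reported for the Tesla T4 in the log above, while 4-bit weights (about 4 GB) leave room for activations and LoRA training state.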


## Data and Preparation

Next, we're going to need some data!

Today we'll be using a "Natural Language to Common Text Abbreviations & Acronyms (and Initialisms)" dataset based on this ["complete list"](https://www.webopedia.com/definitions/text-abbreviations/).

A synthetically created dataset is available [here](https://huggingface.co/datasets/ai-maker-space/acronyms_and_initialisms_translated) and will serve as our data to create training examples from.

In [None]:
from datasets import load_dataset

# Load the dataset from Hugging Face
dataset = load_dataset("ai-maker-space/acronyms_and_initialisms_translated", split="train")

Let's take a look at our data, to see what we're dealing with.

In [None]:
print(f"Dataset size: {len(dataset)}")
print(dataset[1]["acronym_sentence"])
print(dataset[1]["english_translation"])

Dataset size: 1664
Yo, ? about the meetup deets. Can you fill me in?
Hey, I have a question about the details of the meetup. Can you provide information?


As you can see, each row pairs a sentence written with text abbreviations (`acronym_sentence`) with its plain-English translation (`english_translation`).

Now we need to create a prompt template that will deliver these example data points alongside an instruction to teach the model how to do what we're asking!

> NOTE: You can read more about Llama 3.1 Prompt Formatting at [this link](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/)!

In [None]:
def create_prompt_with_template(example, return_response=True):
  # Build a Llama 3.1 chat prompt: system turn, user turn, then the assistant header.
  prompt_template = "<|begin_of_text|>"
  prompt_template += "<|start_header_id|>system<|end_header_id|>\n\n"
  prompt_template += "You are provided an English sentence, and are expected to translate it into a 'text speak' sentence.<|eot_id|>"
  prompt_template += "<|start_header_id|>user<|end_header_id|>\n\n"
  prompt_template += f"Sentence: {example['english_translation']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
  if return_response:
    # For training, append the target "text speak" sentence as the assistant's reply.
    prompt_template += f"\n{example['acronym_sentence']}<|end_of_text|>"
  return {"text": prompt_template}

Let's look at an example of the formatted prompt template!

In [None]:
create_prompt_with_template(dataset[1])["text"]

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are provided an English sentence, and are expected to translate it into a 'text speak' sentence.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nSentence: Hey, I have a question about the details of the meetup. Can you provide information?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\nYo, ? about the meetup deets. Can you fill me in?<|end_of_text|>"

Now we can map this across our dataset.

In [None]:
dataset = dataset.map(create_prompt_with_template)
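
Because the prompt is assembled by hand, a typo in a special token would silently end up in every training example. As a sanity check, the sketch below flags any `<|name` that is not closed with `|>`. `find_malformed_special_tokens` is a hypothetical helper written for this illustration, and the regex assumes token names are made of letters, digits, and underscores.

```python
import re

# Matches the start of a special token and, optionally, its closing "|>".
SPECIAL_TOKEN = re.compile(r"<\|(\w+)(\|>)?")

def find_malformed_special_tokens(text: str) -> list:
    """Return every '<|name' fragment that is missing its closing '|>'."""
    return [m.group(0) for m in SPECIAL_TOKEN.finditer(text) if m.group(2) is None]

well_formed = "<|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
broken = "<|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id>"

print(find_malformed_special_tokens(well_formed))
print(find_malformed_special_tokens(broken))
```

On the mapped dataset, a call such as `find_malformed_special_tokens(dataset[1]["text"])` returns an empty list when every special token in the template is well formed.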

That's all we need to do for dataset preparation! You'll notice this is exactly the same as the process for `transformers` fine-tuning! Thanks, Unsloth!

## Creating a Trainable PEFT Model

As you might expect - Unsloth is compatible with PEFT, specifically LoRA in this case!

We can create a PEFT model (LoRA adapters) the same way we'd expect to with `transformers` by using the `get_peft_model` method of our `FastLanguageModel`!

> NOTE: Unsloth supports [all kinds](https://docs.unsloth.ai/basics/lora-parameters-encyclopedia) of LoRA variants/optimizations.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
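
To get a feel for how small the LoRA adapters are, we can count their parameters by hand. Each adapted linear layer gains two low-rank matrices, `A` (`r x in_features`) and `B` (`out_features x r`), so it adds `r * (in_features + out_features)` trainable parameters. The layer shapes below are assumptions taken from the public Llama 3.1 8B configuration; they are not stated anywhere in this notebook.

```python
# Assumed Llama 3.1 8B shapes (from the public model config, not from this notebook).
HIDDEN = 4096          # hidden size
INTERMEDIATE = 14336   # MLP intermediate size
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32
R = 16                 # LoRA rank used in the cell above

# (in_features, out_features) for each targeted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, r: int) -> int:
    # A is (r x in_features), B is (out_features x r)
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumed shapes the total comes to 41,943,040, which is the "Number of trainable parameters" value in the training log further down.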


## Training the Model

We have our model, we have our data - it's time to train!

Unsloth works seamlessly with Hugging Face's TRL library - so we can use the old familiar `SFTTrainer`.

First, let's set some `TrainingArguments`.

For the most part - these are standard "paper" parameters - however, Unsloth lets us pick the training precision dynamically based on the `dtypes` the hardware supports.

- `is_bfloat16_supported`: this helps us determine if we can use `bf16` or not!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    num_train_epochs=2,
    learning_rate=2e-4,
    fp16= not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=42,
    output_dir="llama3_1_8b_instruct_ft"
)
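
These arguments determine the effective batch size and the number of optimizer steps. The sketch below works through the arithmetic. It assumes updates per epoch are computed by floor-dividing the number of batches by `gradient_accumulation_steps`; that rule reproduces the 30 total steps in the training log further down, but other Trainer versions may round differently. `total_optimizer_steps` is a hypothetical helper, and the example count of 123 is the "Num examples" value from that same log.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                          num_epochs, num_devices=1):
    # Batches the dataloader yields per epoch (last partial batch kept).
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Assumption: one optimizer update per full group of accumulated batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)

effective_batch_size = 2 * 4 * 1  # per-device batch x accumulation steps x GPUs
print(effective_batch_size)                  # 8
print(total_optimizer_steps(123, 2, 4, 2))   # 30
```

Note that the log reports 123 examples rather than the 1664 rows in the dataset, presumably because `packing=True` concatenates short examples into sequences of up to `max_seq_length` tokens.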

Now, we can load our `SFTTrainer` as we always do!

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_args,
    max_seq_length=1024,
    dataset_num_proc=2,
    packing=True,
)

All that's left to do is call `.train()`!

In [None]:
training_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 123 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 30
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 1 | 2.4426 |
| 2 | 2.4055 |
| 3 | 2.2692 |
| 4 | 2.3156 |
| 5 | 1.9568 |
| 6 | 1.7641 |
| 7 | 1.6093 |
| 8 | 1.4802 |
| 9 | 1.3251 |
| 10 | 1.2604 |



## Trying It Out!

Unsloth not only provides excellent fine-tuning speeds, but also great inference!

Let's run our model and see how it did!

In [None]:
FastLanguageModel.for_inference(model)

prompt = create_prompt_with_template(dataset[1], return_response=False)["text"]

inputs = tokenizer(
    [prompt],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    use_cache=True,
)

tokenizer.batch_decode(outputs)[0]

"<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are provided an English sentence, and are expected to translate it into a 'text speak' sentence.<|eot_id><|start_header_id|>user<|end_header_id|>\n\nSentence: Hey, I have a question about the details of the meetup. Can you provide information?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\nYo, I got a Q about the meetup deets. Can you hit me with the 411?<|end_of_text|>"

Let's try a sentence that the model hasn't been trained on!

In [None]:
FastLanguageModel.for_inference(model)

example = {
    "english_translation" : "Nobody ever figures out what life is all about, and it doesn't matter. Explore the world. Nearly everything is really interesting if you go into it deeply enough.",
    "acronym_sentence" : ""
}

prompt = create_prompt_with_template(example, return_response=False)["text"]

inputs = tokenizer(
    [prompt],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    use_cache=True,
)

tokenizer.batch_decode(outputs)[0]

"<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are provided an English sentence, and are expected to translate it into a 'text speak' sentence.<|eot_id><|start_header_id|>user<|end_header_id|>\n\nSentence: Nobody ever figures out what life is all about, and it doesn't matter. Explore the world. Nearly everything is really interesting if you go into it deeply enough.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\nYo, nobody ever gets what life's about, and honestly, who cares? Just chill, and explore the world. Almost everything's lit if you dive deep enough, fam.<|end_of_text|>"

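Both decoded outputs above include the full prompt and the special tokens, because `batch_decode` returns everything in the generated sequence. If we only want the model's reply, one option is to slice it out of the decoded string. This is a minimal string-based sketch; `extract_assistant_reply` is a hypothetical helper, and it assumes the reply follows the last assistant header and ends at `<|eot_id|>` or `<|end_of_text|>`.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_TOKENS = ("<|eot_id|>", "<|end_of_text|>")

def extract_assistant_reply(decoded: str) -> str:
    """Return the text after the last assistant header, without end tokens."""
    # If the header is missing, rpartition leaves the whole string in `reply`.
    _, _, reply = decoded.rpartition(ASSISTANT_HEADER)
    for token in END_TOKENS:
        reply = reply.split(token)[0]
    return reply.strip()

# Shortened version of the first decoded output above.
decoded = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Sentence: Hey, I have a question about the details of the meetup. "
    "Can you provide information?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n"
    "Yo, I got a Q about the meetup deets. Can you hit me with the 411?<|end_of_text|>"
)
print(extract_assistant_reply(decoded))
```

This prints only the "text speak" sentence, which is easier to compare against the reference `acronym_sentence`.
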
## Export to Hugging Face

As always, we need to export our model to Hugging Face!

We can do this by logging in, and then exporting!

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
model.push_to_hub_merged("ai-maker-space/textified-llama-3-1-8b-instruct", tokenizer, save_method = "merged_4bit_forced")

Unsloth: Merging 4bit and LoRA weights to 4bit...
This might take 5 minutes...
Done.
Unsloth: Saving 4bit Bitsandbytes model. Please wait...


Saved merged_4bit model to https://huggingface.co/ai-maker-space/textified-llama-3-1-8b-instruct
