In this notebook, I fine-tune the Llama-3 8B (8-billion-parameter) model on my own data by training only LoRA adapters. Using Unsloth and the `SFTTrainer` from Hugging Face TRL, the base model is loaded in 4-bit precision for reduced memory usage and faster inference. First, we install the required packages.

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# Keep an eye on the xformers version: the newest release often causes errors, so pinning one version behind the latest usually works best
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", #Sometimes models are gated and a token is required for usage
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Llama patching release 2024.8
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
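
The `dtype = None` setting lets Unsloth choose the precision automatically. As a rough illustration of the rule in the code comment (float16 for T4/V100, bfloat16 for Ampere and newer), the sketch below maps a CUDA compute capability to a dtype name. The helper `pick_dtype` is hypothetical and not part of Unsloth, and the 8.0 threshold for Ampere is my assumption about how such a check would work.

```python
def pick_dtype(major: int, minor: int = 0) -> str:
    """Return the dtype name a GPU of the given compute capability would use.

    Hypothetical helper: Ampere GPUs (compute capability 8.0) and newer
    support bfloat16; older GPUs such as the T4 (7.5) fall back to float16.
    """
    return "bfloat16" if (major, minor) >= (8, 0) else "float16"


print(pick_dtype(7, 5))  # Tesla T4, the GPU shown in the log above
print(pick_dtype(8, 0))  # an Ampere-class GPU
```

This agrees with the log above, where the T4 reports `CUDA = 7.5` and `Bfloat16 = FALSE`.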


Next, we attach the LoRA adapters with `FastLanguageModel.get_peft_model`, using the parameters suggested by Unsloth for Parameter-Efficient Fine-Tuning (PEFT).

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
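
To see where the number of trainable parameters comes from, here is a small back-of-the-envelope calculation. It assumes the standard Llama-3 8B layer dimensions (hidden size 4096, key/value projection size 1024, MLP intermediate size 14336), which are not stated anywhere in this notebook, and uses the rank `r = 16` and the seven target modules from the cell above.

```python
# Assumed Llama-3 8B dimensions (not stated in this notebook)
HIDDEN = 4096         # model hidden size
KV_DIM = 1024         # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336  # MLP intermediate size
NUM_LAYERS = 32       # matches "patched 32 layers" in the log above
R = 16                # LoRA rank passed to get_peft_model

# (in_features, out_features) of every targeted projection in one layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    # LoRA adds two small matrices per adapted weight: A (r x in) and B (out x r)
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the result equals the `Number of trainable parameters = 41,943,040` line that Unsloth prints when training starts. Doubling `r` doubles this count.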


Next, we define a prompt template that matches the data. My dataset has two columns, `label` and `text`; other datasets may also have an `instruction` column or other fields, in which case the template needs a placeholder for each of them. We then load the dataset (in my case, from the Hugging Face Hub) and apply the template to every example.

In [None]:
alpaca_prompt = """

### label:
{}

### text:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    labels = examples["label"]
    texts  = examples["text"]
    outputs = []
    for label, text in zip(labels, texts):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        output = alpaca_prompt.format(label, text) + EOS_TOKEN
        outputs.append(output)
    return { "output" : outputs, }

from datasets import load_dataset
dataset = load_dataset("chrismontes/DogData", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
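
To make the result of the formatting step concrete, the sketch below applies the same template to a made-up batch without needing the tokenizer or the dataset. The `EOS_TOKEN` value is an assumed stand-in for `tokenizer.eos_token` (it is the end marker that appears in the generated outputs later in this notebook), and the example row is invented purely for illustration.

```python
# Same template as in the cell above
alpaca_prompt = """

### label:
{}

### text:
{}"""

EOS_TOKEN = "<|end_of_text|>"  # assumed stand-in for tokenizer.eos_token


def formatting_prompts_func(examples):
    outputs = []
    for label, text in zip(examples["label"], examples["text"]):
        # Fill both placeholders and terminate the sample with EOS
        outputs.append(alpaca_prompt.format(label, text) + EOS_TOKEN)
    return {"output": outputs}


# Invented batch in the column-wise layout that dataset.map(batched=True) passes in
batch = {
    "label": ["Example question about a dog breed"],
    "text": ["Example answer text."],
}
formatted = formatting_prompts_func(batch)["output"][0]
print(repr(formatted))
```

Each row becomes a single string in the new `output` column, which is the column the trainer reads through `dataset_text_field`.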


Now it is time to fine-tune the model. Using Hugging Face TRL's `SFTTrainer` (https://huggingface.co/docs/trl/sft_trainer), we set the parameters needed for training. To train for a full epoch instead of a fixed number of steps, set `num_train_epochs=1` and leave `max_steps=-1` (the default, which disables the step limit).

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

#establish parameters for SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "output",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = -1, # A positive value here overrides num_train_epochs and trains for that many steps instead. 100-200 steps works well
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "texts",
    ),
)

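# Record the GPU memory baseline before training (needed by the optional VRAM statistics cell)
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

# Start training
trainer_stats = trainer.train()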

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 28,492 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 3,561
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,2.6799
2,2.9919
3,2.7749
4,2.2831
5,2.5732
6,2.4108
7,2.6501
8,2.1133
9,2.0843
10,1.8553
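
The numbers in the training banner, and the learning rate behind each logged step, can be reproduced with a little arithmetic. The sketch below is my own re-implementation for illustration, not code taken from `transformers`: it assumes that a trailing partial gradient-accumulation group is not counted as a step, and that the `linear` scheduler ramps up linearly during warm-up and then decays linearly to zero.

```python
import math

num_examples = 28_492  # from the training banner above
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
warmup_steps = 5
learning_rate = 2e-4

# One optimizer step consumes batch_size * accumulation_steps examples
total_batch_size = per_device_train_batch_size * gradient_accumulation_steps
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
# Assumption: a trailing partial accumulation group is not counted as a step
total_steps = max(batches_per_epoch // gradient_accumulation_steps, 1)


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step: linear warm-up, then linear decay."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return learning_rate * remaining / max(1, total_steps - warmup_steps)


print(total_batch_size, total_steps)  # 8 3561, as in the banner
for step in (0, 1, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{linear_lr(step):.2e}")
```

With `warmup_steps = 5`, the first few losses in the table above are logged while the learning rate is still ramping up.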


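Optional: display VRAM usage statistics. Memory can become tight after training, so it is worth checking. This cell uses `start_gpu_memory` and `max_memory`, which must be recorded before calling `trainer.train()`.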

In [None]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
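
The statistics cell mixes unit conversion with two different quantities: the overall peak, and the part of the peak attributable to training (peak minus the baseline recorded before training). A minimal sketch of the same arithmetic as a pure function is shown below. `memory_report` is a hypothetical helper, and the peak and baseline values in the example call are invented; only the 14.748 GB maximum is taken from the T4 log earlier in the notebook.

```python
GIB = 1024 ** 3  # bytes per GiB, the unit the statistics cell reports as "GB"


def memory_report(peak_bytes: int, start_gb: float, max_gb: float) -> dict:
    """Summarize peak GPU memory and the share attributable to training."""
    peak_gb = round(peak_bytes / GIB, 3)
    training_gb = round(peak_gb - start_gb, 3)  # memory added on top of the baseline
    return {
        "peak_gb": peak_gb,
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / max_gb * 100, 3),
        "training_pct": round(training_gb / max_gb * 100, 3),
    }


# Invented peak and baseline, for illustration only
report = memory_report(peak_bytes=8 * GIB, start_gb=5.5, max_gb=14.748)
for key, value in report.items():
    print(key, value)
```

In the notebook itself, `peak_bytes` would be `torch.cuda.max_memory_reserved()`.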

After the model has been fine-tuned, it's time to use it for inference. Below is the code to run the model from the fine-tuning above. Note that the model has still not been saved at this point.

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
labels = tokenizer(
[
    alpaca_prompt.format(
        "Please tell me something interesting about the Labrador-Retriever Dog", # label
        "", # text - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)

# Stream the generated text as it is produced instead of printing it all at once
_ = model.generate(**labels, streamer=text_streamer, max_new_tokens=128)

The block below is probably the most practical way to save the model. It saves only the LoRA adapters (about 168 MB for this run) rather than the entire model, which is significantly larger and therefore much slower to save and load. The block after it uploads the adapters to a Hugging Face account for online storage. Other methods for saving the full model, either in 16-bit or in GGUF format for use with llama.cpp, are covered at the end of the notebook.

In [None]:
model.save_pretrained("lora_model2") # Local saving
tokenizer.save_pretrained("lora_model2")

In [None]:
#Push your trained adapters or model to Hugging Face
model.push_to_hub("HF_username/LoRA_adapter_name", token = "HF_Token") # Online saving; you can create a token under your HF account settings. LoRA_adapter_name is the name you want to use to upload to HF.
tokenizer.push_to_hub("HF_username/LoRA_adapter_name", token = "HF_Token") # Same as model.push_to_hub

Saved model to https://huggingface.co/chrismontes/Dog-LoRA
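
Rather than pasting the token into the notebook, where it is easy to leak when sharing the file, one option is to read it from an environment variable. The sketch below assumes the token is stored in a variable named `HF_TOKEN`; the helper `get_hf_token` is hypothetical, and the demo uses a dummy variable name and value.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read a Hugging Face token from the environment, failing loudly if absent."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before pushing to the Hub")
    return token


# Demo with a dummy variable so the snippet runs without a real token
os.environ["DEMO_HF_TOKEN"] = "hf_example_not_a_real_token"
print(get_hf_token("DEMO_HF_TOKEN"))
```

The push calls would then become `model.push_to_hub("HF_username/LoRA_adapter_name", token = get_hf_token())`, and likewise for the tokenizer.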


Now that the model has been saved, you can load it using the block below, as long as you saved it as LoRA adapters or in one of the formats shown in the code blocks at the very bottom. Simply pass the folder name (or path) where it is saved as the `model_name` parameter.

VERY IMPORTANT FOR COLAB USERS:
If you are using the free GPU from Google Colab (T4), restart the runtime session before loading the model. Restart only; do not disconnect and delete the runtime, which would also erase the locally saved adapters. The restart is needed because the model left over from training still occupies GPU memory. Although QLoRA makes the model very memory-efficient, the Llama 3 8B base model is still quite large! After the restart, you will need to define the prompt template again to match your data format, as was done a few cells back.
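
Before loading, it can save a confusing error to confirm that the adapter folder survived the restart. The sketch below checks for the two files that PEFT normally writes when saving LoRA adapters; the file names are my assumption about the usual layout (the upload log above only confirms `adapter_model.safetensors`), and `missing_adapter_files` is a hypothetical helper.

```python
from pathlib import Path

# Files PEFT normally writes when saving LoRA adapters (assumed names)
EXPECTED_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(folder) -> list:
    """Return the expected adapter files that are not present in the folder."""
    folder = Path(folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


missing = missing_adapter_files("lora_model2")
if missing:
    print("Missing adapter files:", missing)
else:
    print("Adapter folder looks complete")
```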

In [None]:

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model2", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) #Set the model for inference

alpaca_prompt = """

### label:
{}

### text:
{}"""

# alpaca_prompt = You MUST copy from above!

labels = tokenizer(
[
    alpaca_prompt.format(
        "Please tell me something interesting about the Rottweiler Dog", # label
        "", # text - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

texts = model.generate(**labels, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(texts)

# The code below shows some modifications you can make to control the randomness of the responses
"""
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
temperature = 10.0 # Must be a positive float; lower reduces randomness. This one gets wild if too high :)
top_k = 2  # Positive integer only. Lower reduces randomness.
top_p = 0.9  # Float between 0 and 1; 1.0 disables the filter. Lower reduces randomness.

# Modify the model.generate line as follows (adding a 'streamer' prints the text as it is generated, instead of all at once)
_ = model.generate(**labels, streamer=text_streamer, max_new_tokens=128, temperature=temperature, top_k=top_k, top_p=top_p, do_sample=True)
"""

# You can also use the Causal LM model from HF, though it is significantly slower, so it is NOT recommended
"""
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    "lora_model2", # YOUR MODEL YOU USED FOR TRAINING
    load_in_4bit = load_in_4bit,
)
tokenizer = AutoTokenizer.from_pretrained("lora_model2")
"""

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Llama patching release 2024.7
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


['<|begin_of_text|>\n\n### label:\nPlease tell me something interesting about the Bull-Terrier Dog\n\n### text:\nBull-Terrier: The Bull Terrier is a medium to large sized dog that stands 14 to 18 inches tall at the shoulder and weighs 30 to 50 pounds.<|end_of_text|>']
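
The three sampling knobs in the commented-out block of the cell above (`temperature`, `top_k`, `top_p`) are easier to reason about on a toy distribution. The sketch below is a simplified pure-Python illustration of the general technique, not the implementation used by `transformers`: it applies temperature scaling, then a top-k cut, then a nucleus (top-p) cut, renormalizing after each filter. The function name and the toy logits are invented.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: divide the logits before the softmax (higher = flatter)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = {i: w / total for i, w in enumerate(weights)}

    # Top-k: keep only the k most likely tokens (0 disables the filter)
    ranked = sorted(probs, key=probs.get, reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)
    probs = {i: probs[i] / mass for i in ranked}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0]
print(filter_probs(toy_logits))                     # plain softmax
print(filter_probs(toy_logits, top_k=2))            # least likely token dropped
print(filter_probs(toy_logits, top_p=0.5))          # only the top token survives
print(filter_probs(toy_logits, temperature=100.0))  # close to uniform
```

In this sketch, a very high temperature makes every surviving token almost equally likely, which is why outputs become erratic, while a small `top_k` or `top_p` limits how many candidates can be sampled at all.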

In [None]:
alpaca_prompt = """

### label:
{}

### text:
{}"""

# alpaca_prompt = You MUST copy from above!

labels = tokenizer(
[
    alpaca_prompt.format(
        "Please tell me something interesting about the Rottweiler Dog", # label
        "", # text - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

texts = model.generate(**labels, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(texts)

['<|begin_of_text|>\n\n### label:\nPlease tell me something interesting about the Rottweiler Dog\n\n### text:\nRottweiler: In 1931, the Rottweiler was officially recognised as a breed by the German Kennel Club (VDH). In 1936, the Rottweiler was recognised by the American Kennel Club (AKC).<|end_of_text|>']
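
The decoded output still contains the prompt and the special tokens. If only the generated answer is needed, a little string handling is enough. The sketch below assumes the template and end-of-text marker used in this notebook; `extract_text` is a hypothetical helper, and the sample string is the output shown above.

```python
def extract_text(decoded: str, marker: str = "### text:\n", eos: str = "<|end_of_text|>") -> str:
    """Return only the generated answer from a decoded prompt + completion."""
    # Keep what follows the last marker (the whole string if the marker is absent)
    _, _, tail = decoded.rpartition(marker)
    # Drop the end-of-text token and surrounding whitespace
    return tail.replace(eos, "").strip()


decoded = (
    "<|begin_of_text|>\n\n### label:\n"
    "Please tell me something interesting about the Rottweiler Dog\n\n### text:\n"
    "Rottweiler: In 1931, the Rottweiler was officially recognised as a breed by "
    "the German Kennel Club (VDH). In 1936, the Rottweiler was recognised by the "
    "American Kennel Club (AKC).<|end_of_text|>"
)
print(extract_text(decoded))
```

Passing `skip_special_tokens=True` to `tokenizer.batch_decode` removes the special tokens at decode time, but the prompt portion still has to be cut off as above.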

To run inference with a model saved using any of the methods below, follow the same steps as in the inference code above (the cells after the VRAM statistics block), making sure that the model name and directories are set correctly. If you only need to work with the LoRA adapters, you can stop here.

*Below are alternative methods for saving the model. I will leave the explanations as provided by Unsloth. In my experience, saving and loading the LoRA adapters as described above was the most memory-efficient approach relative to performance. However, if you require the full model for a specific reason, you can use one of the lines below to save the full model locally or to a Hugging Face account.*



### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
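
To get a feel for how much disk space the different formats need, here is a rough size estimate for two of them. Everything in it is an assumption made for illustration: the parameter count of about 8.03 billion, and the bits-per-weight figures, which follow from storing 16 bits per weight for `f16` and 32 eight-bit weights plus one 16-bit scale per block for `q8_0`. Real files also carry metadata and may keep some tensors in other precisions, so treat the output as a ballpark only.

```python
PARAMS = 8.03e9  # assumed approximate parameter count of Llama-3 8B

# Assumed bits stored per weight for each format
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": (32 * 8 + 16) / 32,  # 32 int8 weights + one 16-bit scale per block
}


def estimated_size_gb(method: str) -> float:
    """Rough file size in GB (10^9 bytes) for a given quantization method."""
    return PARAMS * BITS_PER_WEIGHT[method] / 8 / 1e9


for method in BITS_PER_WEIGHT:
    print(method, f"{estimated_size_gb(method):.2f} GB")
```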

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit (note: a plain merged_4bit save now raises an error; follow the instructions in the error message to force the save)
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

#GGUF format saving for running in llama.cpp
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or the `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or in a UI-based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).