# Fine-Tuning the Meta-Llama-3.1-8B Language Model with Unsloth: 2.1x Faster, 60% Less Memory


This tutorial walks through fine-tuning Meta's Llama-3.1-8B language model with LoRA (Low-Rank Adaptation), explaining each step along the way. It also suggests more challenging datasets for further fine-tuning. We use [Unsloth](https://unsloth.ai/), a fine-tuning library for LLMs built on highly optimized GPU kernels.
For more details on the API, see the [Unsloth Documentation](https://docs.unsloth.ai/).

## Prerequisites

- Basic knowledge of Python programming
- Familiarity with machine learning concepts
- An environment with a CUDA-capable GPU and the Unsloth library installed
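
To confirm the prerequisites above before going further, a quick check like the following can help. It is a sketch rather than part of the original notebook: the helper name `check_environment` is hypothetical, and it uses only the Python standard library to look for the `nvidia-smi` tool and for packages that are already installed.

```python
import importlib.util
import shutil


def check_environment(packages=("torch", "unsloth")):
    """Report whether a GPU driver tool and the given packages are visible.

    Hypothetical helper for illustration. It does not import the packages;
    it only checks whether Python can locate them.
    """
    report = {"nvidia_smi_found": shutil.which("nvidia-smi") is not None}
    for name in packages:
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```

A missing `nvidia-smi` usually means no NVIDIA driver is visible to the session, in which case the later cells that move tensors to `"cuda"` would fail.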


## Steps include:
1. Setting Up the Environment
2. Loading Packages
3. Setting Up the Configuration
4. Testing the Pre-Trained Model and Tokenizer
5. Preparing the Data
6. Using LoRA (Low-Rank Adaptation)
7. Training the Model with LoRA
8. Inference with the Fine-Tuned Model
9. Saving the Model







### 1. Setting Up the Environment

Install `unsloth` together with its companion libraries (`xformers`, `trl`, `peft`, `accelerate`, and `bitsandbytes`).


In [1]:
# !pip install torch transformers datasets trl unsloth
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

# Note: on Colab, you can ignore the error "pip's dependency resolver does not currently take into account all the packages that are installed".


### 2. Loading Packages

Import the modules used in the rest of the tutorial: Unsloth's model loader, PyTorch, the Hugging Face `datasets` loader, and the `trl` trainer.


In [2]:
from unsloth import FastLanguageModel
import torch
import os
from transformers import TextStreamer
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported



🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


### 3. Set up the Configuration
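Define the maximum sequence length, the data type, and whether to load the model in 4-bit precision. The prompt template follows the Alpaca format used by the [python_code_instructions_18k_alpaca](https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca) dataset.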



In [3]:
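# 1. Configuration
max_seq_length = 2048  # Maximum number of tokens per sequence
dtype = None           # None lets Unsloth auto-detect (float16 on a T4, bfloat16 on Ampere or newer)
load_in_4bit = True    # Load 4-bit quantized weights to reduce memory use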
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Test prompt: ask the model for a scatter plot using the plotly package
instruction = "Generate the scatter plot using ploty package for the given input range"
input = "[0,100]"
huggingface_model_name = "AYNaich/Llama-3.1-8B-bnb-4bit-python_update"
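
To see what the model actually receives, the sketch below fills the template with the test instruction and an empty response slot. It repeats the template from the cell above so that it runs on its own. The helper name `build_prompt` is an assumption introduced for illustration.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


def build_prompt(instruction, input_text, response=""):
    """Fill the three positional slots of the Alpaca-style template."""
    return ALPACA_PROMPT.format(instruction, input_text, response)


# Leave the response empty for generation; fill it in for training examples.
prompt = build_prompt(
    "Generate the scatter plot using ploty package for the given input range",
    "[0,100]",
)
print(prompt)
```

Because the response slot is the last thing in the template, an empty response leaves the prompt ending right after `### Response:`, which is where the model starts generating.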



### 4. Testing the Pre-Trained Model and Tokenizer

Load the model and tokenizer from a pre-quantized 4-bit checkpoint. The `FastLanguageModel` class from Unsloth handles the loading. Running the test prompt before training gives a baseline to compare against the fine-tuned model.


In [4]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    token=os.getenv("HF_TOKEN")
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        instruction, # instruction
        input, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)



==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.43.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Generate the scatter plot using ploty package for the given input range

### Input:
[0,100]

### Response:
```python
import plotly.express as px
fig = px.scatter(x=[0,100], y=[0,100])
fig.show()
```
<|end_of_text|>
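
Loading in 4-bit precision is what lets an 8B-parameter model fit on the 14.7 GB Tesla T4 shown in the log. As a rough, back-of-the-envelope illustration (not a measurement), the sketch below estimates the storage needed for the weights alone at different precisions. The parameter count of about 8.03 billion is an assumption based on the published model size. The estimate ignores quantization metadata, tensors kept in higher precision, activations, and the KV cache, which is likely why the checkpoint downloaded above is 5.70 GB rather than about 4 GB.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8.03e9  # assumed approximate parameter count of Llama-3.1-8B

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>18}: {weight_memory_gb(NUM_PARAMS, bits):6.2f} GB")
```

At 16 bits per parameter the weights alone would exceed the T4's memory, while the 4-bit version leaves room for the LoRA adapters, optimizer state, and activations.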


### 5. Preparing the Data

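Format every example with the prompt template (`alpaca_prompt`) and append the end-of-sequence token so the model learns where a response ends. The dataset is loaded with the `load_dataset` function from the `datasets` library and formatted in batches with `map`.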


In [5]:
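EOS_TOKEN = tokenizer.eos_token  # Appended to every example so the model learns to stop

def formatting_prompts_func(examples):
    """Format a batch of examples (dict of lists) into full prompt strings."""
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    # input_text avoids shadowing Python's built-in input()
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}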

dataset = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)


Downloading readme:   0%|          | 0.00/905 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/18612 [00:00<?, ? examples/s]

Map:   0%|          | 0/18612 [00:00<?, ? examples/s]
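
With `batched=True`, `dataset.map` passes the function a dictionary of columns (lists) rather than a single row, and expects a dictionary of new columns back. The sketch below mimics that contract on a two-row toy batch in plain Python, so it runs without downloading the dataset. The shortened template and the literal `<|end_of_text|>` string are stand-ins chosen for illustration; in the notebook the EOS token comes from `tokenizer.eos_token`.

```python
# Shortened stand-in for the tutorial's alpaca_prompt (assumption for brevity).
TEMPLATE = "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"
EOS_TOKEN = "<|end_of_text|>"  # assumed value; normally tokenizer.eos_token


def formatting_prompts_func(examples):
    """Turn a batch (dict of lists) into a dict with one 'text' list."""
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        # The EOS token marks where the response ends.
        texts.append(TEMPLATE.format(instruction, input_text, output) + EOS_TOKEN)
    return {"text": texts}


# Made-up rows for illustration only.
toy_batch = {
    "instruction": ["Add two numbers", "Reverse a string"],
    "input": ["2, 3", "abc"],
    "output": ["print(2 + 3)", "print('abc'[::-1])"],
}
formatted = formatting_prompts_func(toy_batch)
print(formatted["text"][0])
```

The returned `"text"` list has one entry per input row, which is what allows `map` to attach it to the dataset as a new column.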

### 6. Using LoRA (Low-Rank Adaptation)

LoRA fine-tunes large models efficiently by freezing the original weights and training small low-rank update matrices instead. The call below sets the LoRA configuration: the rank `r`, the target modules, the scaling factor `lora_alpha`, and related options.

In [6]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # Also support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)



Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
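
The adapter size can be checked by hand. For a frozen weight matrix with `d_in` inputs and `d_out` outputs, LoRA trains two small factors of shapes `d_out × r` and `r × d_in`, which adds `r * (d_in + d_out)` parameters. The sketch below applies this to the seven target modules configured above. The layer dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers) are taken from the published Llama-3.1-8B configuration and are assumptions of this example, not values the notebook prints.

```python
R = 16             # LoRA rank used in this tutorial
NUM_LAYERS = 32    # assumed from the Llama-3.1-8B configuration
HIDDEN = 4096      # assumed hidden size
KV_DIM = 8 * 128   # assumed: 8 key/value heads x head dimension 128
MLP = 14336        # assumed intermediate (MLP) size

# (input features, output features) of each adapted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(d_in, d_out, r=R):
    """Parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(*shape) for shape in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions the total is 41,943,040, which matches the number of trainable parameters that Unsloth reports when training starts, and is roughly 0.5% of the model's 8 billion parameters.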


### 7. Training the Model with LoRA

Configure the training run with `SFTTrainer`. The key hyperparameters (learning rate, batch size, and number of training steps) are set through `TrainingArguments`; training itself starts in the following cell.


In [7]:

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=100,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)


Map (num_proc=2):   0%|          | 0/18612 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
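
Two consequences of these settings are easy to work out by hand: how many examples the run actually sees, and how the learning rate evolves. The sketch below is a plain-Python approximation for illustration and does not call the trainer. The function `linear_lr` is a hypothetical re-implementation of linear warmup followed by linear decay to zero, which is how the `linear` scheduler is usually described, and the batch arithmetic assumes a single GPU.

```python
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
MAX_STEPS = 100
WARMUP_STEPS = 5
BASE_LR = 2e-4
NUM_EXAMPLES = 18_612  # size of the training split


def linear_lr(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, total=MAX_STEPS):
    """Learning rate after `step` optimizer updates (hypothetical helper)."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0, total - step) / (total - warmup)


effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # assumes one GPU
examples_seen = effective_batch * MAX_STEPS

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen: {examples_seen} "
      f"({examples_seen / NUM_EXAMPLES:.1%} of one epoch)")
for step in (0, 1, 5, 50, 100):
    print(f"step {step:>3}: lr = {linear_lr(step):.2e}")
```

With these values, 100 steps cover 800 examples, a small fraction of the 18,612 available. That is enough for a quick demonstration; raising `max_steps`, or setting `num_train_epochs` instead, trains on more of the data.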


In [8]:
# Before training: record initial GPU memory usage
start_memory = torch.cuda.memory_reserved(0)
start_memory_gb = start_memory / (1024 ** 3)  # Convert to GB

# Print initial GPU memory stats
gpu_properties = torch.cuda.get_device_properties(0)
total_memory_gb = gpu_properties.total_memory / (1024 ** 3)
print(f"GPU: {gpu_properties.name}, Total Memory: {total_memory_gb:.2f} GB")
print(f"Initial Memory Reserved: {start_memory_gb:.2f} GB")

# Train the model and record training stats
trainer_stats = trainer.train()

# After training: record the peak GPU memory reserved during the run
end_memory = torch.cuda.max_memory_reserved(0)
end_memory_gb = end_memory / (1024 ** 3)  # Convert to GB

# Calculate memory usage during training
memory_used_gb = end_memory_gb - start_memory_gb
memory_used_percentage = (end_memory / gpu_properties.total_memory) * 100

# Print training stats and memory usage
train_time_seconds = trainer_stats.metrics['train_runtime']
train_time_minutes = train_time_seconds / 60
print(f"Training Time: {train_time_seconds:.2f} seconds ({train_time_minutes:.2f} minutes)")
print(f"Peak Memory Reserved: {end_memory_gb:.2f} GB")
print(f"Memory Used for Training: {memory_used_gb:.2f} GB")
print(f"Memory Used Percentage: {memory_used_percentage:.2f}%")

GPU: Tesla T4, Total Memory: 14.75 GB
Initial Memory Reserved: 6.04 GB


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 18,612 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 100
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,1.5787
2,1.7224
3,1.2536
4,1.3967
5,1.322
6,1.1539
7,0.9535
8,0.7623
9,0.9171
10,0.5804



Training Time: 806.57 seconds (13.44 minutes)
Peak Memory Reserved: 10.02 GB
Memory Used for Training: 3.97 GB
Memory Used Percentage: 67.94%
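
The reported figures can be cross-checked with simple arithmetic. The sketch below recomputes them from the rounded values printed above, so small differences in the last digit are expected. The helper name `summarize_run` is introduced here for illustration only.

```python
def summarize_run(start_gb, peak_gb, total_gb, train_seconds, steps):
    """Derive memory and speed figures from the values the notebook prints."""
    return {
        "training_memory_gb": peak_gb - start_gb,
        "peak_percent_of_gpu": 100 * peak_gb / total_gb,
        "seconds_per_step": train_seconds / steps,
    }


stats = summarize_run(
    start_gb=6.04,
    peak_gb=10.02,
    total_gb=14.75,
    train_seconds=806.57,
    steps=100,
)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

In other words, training needed about 4 GB on top of the roughly 6 GB already reserved after loading the 4-bit model, and each optimizer step took about 8 seconds on the T4.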


### 8. Inference with the Fine-Tuned Model

In [12]:
# Run the same test prompt on the fine-tuned model
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        instruction, # instruction
        input, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)


Copy the generated code below into a new cell and run it to see the plot.

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Generate the scatter plot using ploty package for the given input range

### Input:
[0,100]

### Response:
import plotly.graph_objects as go

x = [i for i in range(0, 101)]
y = [i**2 for i in x]

fig = go.Figure(data=[go.Scatter(x=x, y=y, mode='markers')])

fig.update_layout(title='Scatter Plot', xaxis_title='X', yaxis_title='Y')
fig.show()<|end_of_text|>
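
Instead of copying the generated code by hand, the response can be cut out of the decoded text programmatically. The sketch below is one possible approach and assumes the output follows the prompt template exactly. `extract_response` is a hypothetical helper, and the special-token string is the one visible in the output above.

```python
def extract_response(decoded, marker="### Response:", eos="<|end_of_text|>"):
    """Return the text after the last response marker, without the EOS token."""
    _, _, tail = decoded.rpartition(marker)
    return tail.replace(eos, "").strip()


# Shortened sample in the same shape as the streamed output.
sample = (
    "<|begin_of_text|>Below is an instruction...\n\n"
    "### Instruction:\nGenerate the scatter plot\n\n"
    "### Input:\n[0,100]\n\n"
    "### Response:\n"
    "import plotly.graph_objects as go\n"
    "fig = go.Figure()\n"
    "fig.show()<|end_of_text|>"
)
print(extract_response(sample))
```

In the notebook, the decoded text could be obtained by calling `tokenizer.batch_decode` on the result of `model.generate` rather than streaming it to the console.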


### 9. Saving the Model

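After training, save the LoRA adapter and tokenizer locally and, optionally, upload them to the Hugging Face Hub. Because the model is wrapped with LoRA, `save_pretrained` stores only the adapter weights, not the full base model.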


In [13]:
# Save the fine-tuned LoRA adapter and tokenizer locally
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# Upload the adapter and tokenizer to the Hugging Face Hub (needs a write token in HF_TOKEN)
model.push_to_hub(huggingface_model_name, token=os.getenv("HF_TOKEN"))
tokenizer.push_to_hub(huggingface_model_name, token=os.getenv("HF_TOKEN"))


README.md:   0%|          | 0.00/588 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/AYNaich/Llama-3.1-8B-bnb-4bit-python_update
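
The upload above is only about 168 MB because it contains the adapter rather than the 8B base model. A quick sanity check after saving is to confirm that the adapter files exist in the output directory. The file names below (`adapter_config.json` and `adapter_model.safetensors`) follow PEFT's usual layout and are an assumption of this sketch, as is the helper name `missing_adapter_files`.

```python
from pathlib import Path

# Assumed PEFT adapter layout; adjust if your version writes different files.
EXPECTED_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(directory, expected=EXPECTED_FILES):
    """List the expected adapter files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_adapter_files("lora_model")
    print("Adapter looks complete" if not missing else f"Missing: {missing}")
```

An empty result suggests the adapter was written correctly; to use it later, the base model still has to be loaded alongside it.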


## Challenging Datasets for Further Fine-Tuning

For advanced users looking to challenge their fine-tuning setup, consider the following datasets:

1. **Common Crawl**: A diverse and extensive dataset from web pages, useful for training models on a wide variety of topics.
2. **Wikitext-103**: A large dataset from Wikipedia articles, suitable for training models with a deep understanding of structured and factual information.
3. **BooksCorpus**: A dataset containing a wide range of books, perfect for training on long-form content and narratives.
4. **CodeSearchNet**: A dataset specifically for code and programming-related tasks, ideal for fine-tuning models in understanding and generating code.
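
Whichever dataset is chosen, its records have to be mapped into the same prompt fields before training. As a hedged illustration, the sketch below converts a docstring-and-code record into the three Alpaca fields. The field names `func_documentation_string` and `func_code_string` are assumed to match the CodeSearchNet schema and should be verified against the dataset card; the record itself is made up for the example.

```python
def to_alpaca_fields(record,
                     doc_key="func_documentation_string",
                     code_key="func_code_string"):
    """Map a docstring/code record onto instruction, input, and output fields."""
    return {
        "instruction": "Write a function that matches the description.",
        "input": record[doc_key].strip(),
        "output": record[code_key].strip(),
    }


# Made-up record for illustration only.
toy_record = {
    "func_documentation_string": "Return the sum of two integers.\n",
    "func_code_string": "def add(a, b):\n    return a + b\n",
}
fields = to_alpaca_fields(toy_record)
print(fields["input"])
print(fields["output"])
```

The resulting fields can then go through the same `formatting_prompts_func` used earlier in the tutorial.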


## Conclusion

This tutorial covers the basics of fine-tuning the Meta-Llama-3.1-8B language model with Unsloth: loading a 4-bit checkpoint, attaching LoRA adapters, training with `SFTTrainer`, and saving the result. Experimenting with different datasets and training parameters is a natural next step toward more capable models. If a GPU is available, you can also fine-tune the model and run inference locally.
