## <b><span style='color:#9146ff'>|</span> Introduction </b>

Welcome to this notebook on fine-tuning the Meta LLaMA-3 model on an Arabic instruction dataset using free Kaggle resources! 🎉

In this notebook, you will learn how to:

* Set up the environment and install necessary dependencies.
* Prepare and preprocess the Arabic dataset for model training.
* Configure and fine-tune the Meta LLaMA-3 model.
* Quantize the model for efficiency.
* Use Parameter-Efficient Fine-Tuning (PEFT) with LoRA.
* Utilize the SFT Trainer for fine-tuning.
* Choose appropriate hyperparameters for training.
* Test the performance of the fine-tuned model.

Note: You can adapt this notebook to any other question-answering instruction dataset for chatbots.

![Llama-3](https://pc-tablet.co.in/wp-content/uploads/2024/04/Llama-3.webp)


## <b>1 <span style='color:#9146ff'>|</span> Installation and Logging </b>

In [5]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

secret_label = "HF Hub"
secret_value = UserSecretsClient().get_secret(secret_label)
login(token=secret_value)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [6]:
%pip install \
    datasets \
    evaluate \
    rouge_score \
    loralib \
    accelerate \
    bitsandbytes \
    trl \
    peft \
    -U --quiet

Note: you may need to restart the kernel to use updated packages.


In [7]:
# General-purpose imports (duplicates and imports unused in this notebook removed)
from transformers import AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import re
import string
import numpy as np
import pandas as pd
import evaluate



In [8]:
# !pip install -q -U git+https://github.com/huggingface/peft.git

In [9]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

In [10]:
import transformers

torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

## <b>2 <span style='color:#9146ff'>|</span> Model Configuration and Quantization </b>

* Load the model with `AutoModelForCausalLM` and its tokenizer with `AutoTokenizer`, both from the Hugging Face `transformers` library.
* Apply quantization to reduce the model's size and memory usage. This kind of compression is key to running large models on hardware with limited memory and compute.

**Detailed Code Explanation :**
- `AutoTokenizer`: This class loads a pre-trained tokenizer from the Hugging Face Hub.
- `from_pretrained`: This method loads the tokenizer for the "meta-llama/Meta-Llama-3-8B-Instruct" model. The tokenizer converts text into the token IDs that the model processes.
- `getattr`: This built-in function looks up an attribute by name. Here, `getattr(torch, "float16")` returns `torch.float16`, so computations use 16-bit floating-point precision, which reduces memory usage and usually speeds up computation compared with 32-bit.
- `BitsAndBytesConfig`: This class configures the quantization parameters.
> - `load_in_4bit=True`: Loads the model's linear-layer weights in 4-bit precision instead of the 16-bit format in which the checkpoint is stored, which cuts weight memory to roughly a quarter.
> - `bnb_4bit_quant_type="nf4"`: Selects the 4-bit NormalFloat (NF4) data type, introduced with QLoRA and designed for weights that are approximately normally distributed.
> - `bnb_4bit_compute_dtype=compute_dtype`: Sets the computation data type to `torch.float16`. The weights are stored in 4 bits, but they are dequantized and the matrix multiplications run in 16-bit floating point.
> - `bnb_4bit_use_double_quant=True`: Enables double quantization, which also quantizes the per-block quantization constants. This saves a little more memory (roughly 0.4 bits per parameter according to the QLoRA paper); it is a memory optimization and does not increase accuracy.
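To make the memory effect of these settings concrete, here is a back-of-the-envelope sketch in plain Python. It assumes roughly 8.03 billion parameters, a quantization block of 64 weights, and a second-level block of 256 constants (the values described in the QLoRA paper, taken here as assumptions). It also treats every parameter as quantized, which overstates the savings because embeddings and norms stay in 16-bit.

```python
# Rough weight-memory estimate; every constant below is an assumption for illustration.
N_PARAMS = 8_030_261_248   # approximate parameter count of an 8B Llama-3 model
BLOCK = 64                 # weights sharing one quantization constant (assumed)
SECOND_BLOCK = 256         # constants sharing one second-level constant (assumed)


def weight_gib(n_params, bits_per_param):
    """Memory in GiB needed to store n_params at the given bits per parameter."""
    return n_params * bits_per_param / 8 / 2**30


# Without double quantization: one fp32 constant per block of 64 weights.
overhead_plain = 32 / BLOCK
# With double quantization: 8-bit constants plus one fp32 constant per 256 of them.
overhead_double = 8 / BLOCK + 32 / (BLOCK * SECOND_BLOCK)

for label, bits in [
    ("fp32", 32),
    ("fp16", 16),
    ("nf4", 4 + overhead_plain),
    ("nf4 + double quant", 4 + overhead_double),
]:
    print(f"{label:>20}: {bits:6.3f} bits/param, {weight_gib(N_PARAMS, bits):6.2f} GiB")
```

The difference between the last two rows is small in absolute terms, which is why double quantization is best seen as a memory refinement on top of 4-bit loading.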

In [11]:
from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

compute_dtype = getattr(torch, "float16")

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=quant_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1


In [12]:
# Use the end-of-sequence (EOS) token as the padding token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [13]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 1050939392
all model parameters: 4540600320
percentage of trainable model parameters: 23.15%
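The totals above look odd for an "8B" model. One plausible reading, sketched below, is that the 4-bit weights are packed two per stored byte, so `numel()` reports half of the quantized weights, while the embeddings, output head, and norm weights stay unquantized and still report `requires_grad=True`. The architecture dimensions used here (hidden size 4096, intermediate size 14336, 32 layers, key/value width 1024, vocabulary 128256) are assumptions about the model configuration, not values printed in this notebook.

```python
# Assumed Llama-3-8B dimensions (not printed anywhere in this notebook).
HIDDEN, INTERMEDIATE, LAYERS = 4096, 14336, 32
KV_DIM = 1024      # 8 key/value heads x 128 dimensions
VOCAB = 128256

# Linear layers that get quantized, per transformer block.
linear_per_layer = (
    2 * HIDDEN * HIDDEN            # q_proj and o_proj
    + 2 * HIDDEN * KV_DIM          # k_proj and v_proj
    + 3 * HIDDEN * INTERMEDIATE    # gate_proj, up_proj, down_proj
)
quantized = LAYERS * linear_per_layer

# Parameters that stay unquantized: token embeddings, lm_head, and norm weights.
unquantized = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN

# If two 4-bit values are packed per stored element, numel() sees half of them.
reported_total = quantized // 2 + unquantized

print("unquantized (reported as trainable):", unquantized)
print("reported total:", reported_total)
print("nominal total:", quantized + unquantized)
```

Under these assumptions the arithmetic reproduces both numbers printed above, so the 23.15% figure reflects how parameters are counted before LoRA is applied rather than what will actually be trained.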


## <b>3 <span style='color:#9146ff'>|</span> Data Preparation </b>


In [14]:
import pandas as pd

# Load the Parquet file
df = pd.read_parquet('/kaggle/input/arabic-instruct-chatbot-dataset/train-00000-of-00001-10520e8228c2c104.parquet')

# Display the first few rows
df.head()


| | instruction | input | output |
|---|---|---|---|
| 0 | أعط ثلاث نصائح للبقاء بصحة جيدة. | | 1- تناول نظامًا غذائيًا متوازنًا وتأكد من تنا... |
| 1 | ما هي الألوان الثلاثة الأساسية؟ | | الألوان الثلاثة الأساسية هي الأحمر والأزرق وا... |
| 2 | صف بنية الذرة. | | تتكون الذرة من نواة تحتوي على البروتونات والن... |
| 3 | كيف يمكننا تقليل تلوث الهواء؟ | | هناك عدد من الطرق للحد من تلوث الهواء ، مثل ال... |
| 4 | صف وقتًا كان عليك فيه اتخاذ قرار صعب. | | كان علي أن أتخذ قرارًا صعبًا عندما كنت أعمل كم... |


## <b>4 <span style='color:#9146ff'>|</span> Data Preprocessing </b>


In [15]:
# Calculate the mean length (in characters) of text in the 'instruction' column
mean_length_instruction = df['instruction'].apply(len).mean()

# Calculate the mean length (in characters) of text in the 'output' column
mean_length_output = df['output'].apply(len).mean()

# Print the results
print(f"Mean length of 'instruction': {mean_length_instruction}")
print(f"Mean length of 'output': {mean_length_output}")

Mean length of 'instruction': 47.95117495480943
Mean length of 'output': 233.013211030345
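The two figures above come from `.mean()`, so they are average character counts, not maxima and not token counts. When choosing a `max_length` for tokenization, a high percentile and the true maximum are often more informative than the mean. The helper below is a minimal sketch in plain Python; `length_summary` and the sample strings are hypothetical and not part of the notebook.

```python
import statistics


def length_summary(texts):
    """Return the mean, 95th-percentile, and maximum character length of texts."""
    lengths = sorted(len(t) for t in texts)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    p95_index = max(0, (95 * len(lengths) + 99) // 100 - 1)
    return {
        "mean": statistics.mean(lengths),
        "p95": lengths[p95_index],
        "max": lengths[-1],
    }


# Hypothetical sample: one hundred strings of length 1..100.
sample = ["a" * n for n in range(1, 101)]
print(length_summary(sample))
```

The same summary could be computed on token counts instead of characters by passing lists of token IDs, which is closer to what `max_length` actually limits.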


**Detailed Code Explanation :**

- `tokenizer(question, ...)`: This uses the tokenizer to convert the question string into token IDs.
- `padding="max_length"`: Pads the sequences to the maximum length specified by `max_length`.
- `truncation=True`: Truncates the sequences if they exceed the `max_length`.
- `max_length`: Specifies the maximum length of the tokenized sequence.
- `return_tensors="pt"`: Returns the tokenized sequences as PyTorch tensors.
- `input_ids[0]`: The tokenizer returns a batch of shape `(1, max_length)`; indexing with `[0]` takes the single row of token IDs, which is then assigned to `row['input_ids']`.
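The effect of `padding="max_length"` and `truncation=True` can be sketched without a tokenizer. The function below is a simplified illustration on plain lists of token IDs, assuming right-side padding; `pad_or_truncate` is a hypothetical helper, not a `transformers` API.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Cut token_ids to max_length, then right-pad with pad_id up to max_length.

    Returns (input_ids, attention_mask), where the mask is 1 for real tokens
    and 0 for padding positions.
    """
    ids = list(token_ids[:max_length])      # truncation
    padding = max_length - len(ids)         # number of pad positions to add
    attention_mask = [1] * len(ids) + [0] * padding
    return ids + [pad_id] * padding, attention_mask


print(pad_or_truncate([11, 12, 13], max_length=5, pad_id=0))
print(pad_or_truncate([11, 12, 13, 14, 15, 16], max_length=4, pad_id=0))
```

The attention mask is what lets a model ignore the padded positions, which is why it is usually passed along with the token IDs.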

In [16]:
def tokenize_function(row):
    # Join list-valued instructions into one string; otherwise use the string as-is
    question = ' '.join(row["instruction"]) if isinstance(row["instruction"], list) else row["instruction"]

    # Token IDs of the instruction, padded or truncated to 128 tokens
    row['input_ids'] = tokenizer(question, padding="max_length", truncation=True, max_length=128, return_tensors="pt").input_ids[0]

    # The "output" column is already a string, so no conversion is needed
    row['labels'] = tokenizer(row["output"], padding="max_length", truncation=True, max_length=256, return_tensors="pt").input_ids[0]

    return row


# Tokenize the DataFrame
tokenized_df = df.apply(tokenize_function, axis=1)
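The cell above stores the question in `input_ids` and the answer in a separate `labels` column of a different length, which mirrors an encoder-decoder setup. For a decoder-only model, a common alternative is to place prompt and answer in one sequence and mask the prompt positions in the labels with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows that general technique on plain lists; it is not what this notebook does, and `build_causal_example` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def build_causal_example(prompt_ids, answer_ids, max_length, pad_id):
    """Concatenate prompt and answer; compute loss only on answer tokens."""
    input_ids = (list(prompt_ids) + list(answer_ids))[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(answer_ids))[:max_length]
    n_real = len(input_ids)
    pad = max_length - n_real
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,   # padding is ignored too
        "attention_mask": [1] * n_real + [0] * pad,
    }


example = build_causal_example([1, 2], [3, 4, 5], max_length=8, pad_id=0)
print(example)
```

With this layout `input_ids` and `labels` always have the same length, which is what a causal language-modeling loss expects.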

In [17]:
# Convert columns to list
tokenized_df['input_ids'] = tokenized_df['input_ids'].apply(lambda x: x.tolist())
tokenized_df['labels'] = tokenized_df['labels'].apply(lambda x: x.tolist())

In [18]:
tokenized_df

| | instruction | input | output | input_ids | labels |
|---|---|---|---|---|---|
| 0 | أعط ثلاث نصائح للبقاء بصحة جيدة. | | 1- تناول نظامًا غذائيًا متوازنًا وتأكد من تنا... | [128000, 106173, 44735, 117075, 118201, 100462... | [128000, 220, 16, 12, 40534, 101537, 73904, 10... |
| 1 | ما هي الألوان الثلاثة الأساسية؟ | | الألوان الثلاثة الأساسية هي الأحمر والأزرق وا... | [128000, 101237, 104380, 100461, 8700, 100539,... | [128000, 100461, 8700, 100539, 102432, 109413,... |
| 2 | صف بنية الذرة. | | تتكون الذرة من نواة تحتوي على البروتونات والن... | [128000, 104477, 100829, 74541, 102554, 101341... | [128000, 112077, 103967, 102554, 101341, 64337... |
| 3 | كيف يمكننا تقليل تلوث الهواء؟ | | هناك عدد من الطرق للحد من تلوث الهواء ، مثل ال... | [128000, 114804, 106666, 101537, 40534, 101471... | [128000, 108241, 101052, 105300, 64337, 101979... |
| 4 | صف وقتًا كان عليك فيه اتخاذ قرار صعب. | | كان علي أن أتخذ قرارًا صعبًا عندما كنت أعمل كم... | [128000, 104477, 110521, 101333, 102037, 12741... | [128000, 102087, 104537, 100822, 64515, 14628,... |
| ... | ... | ... | ... | ... | ... |
| 51997 | قم بإنشاء مثال لما يجب أن ترغب فيه السيرة الذ... | | جين تريمين \ n1234 Main Street، Anytown، CA 98... | [128000, 117659, 28946, 107078, 118712, 119979... | [128000, 34190, 100327, 40534, 113690, 100327,... |
| 51998 | رتب العناصر الواردة أدناه بالترتيب لإكمال الجم... | كعكة لي الأكل | أنا آكل الكعكة. | [128000, 11318, 100936, 119424, 110732, 105155... | [128000, 127389, 100281, 102812, 101100, 24102... |
| 51999 | اكتب فقرة تمهيدية عن شخص مشهور. | ميشيل أوباما | ميشيل أوباما امرأة ملهمة ارتقت إلى مستوى التح... | [128000, 110973, 100936, 119932, 101341, 10170... | [128000, 102606, 33890, 96298, 64515, 100708, ... |
| 52000 | قم بإنشاء قائمة من خمسة أشياء يجب على المرء أ... | | 1. ابحث عن الفرص المحتملة وفكر مليًا في الخيار... | [128000, 117659, 28946, 107078, 118712, 123797... | [128000, 16, 13, 101558, 116246, 100926, 10865... |


In [19]:
# import gc
# torch.cuda.empty_cache()
# gc.collect()
# torch.cuda.empty_cache()

In [20]:
from datasets import Dataset

# Assuming `tokenized_df` is your pandas DataFrame
dataset = Dataset.from_pandas(tokenized_df[:10000])

In [21]:
dataset

Dataset({
    features: ['instruction', 'input', 'output', 'input_ids', 'labels'],
    num_rows: 10000
})

In [22]:
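# Re-tokenize with `datasets.map` (options not used here: batched=True, batch_size=...).
# Note: the trainer below is given `dataset`, not `tokenized_datasets`.
tokenized_datasets = dataset.map(tokenize_function)
tokenized_datasets = tokenized_datasets.remove_columns(['instruction', 'input', 'output'])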

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]
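Trainers that read a single text column need each instruction and answer merged into one string. The sketch below shows one way to build such a string per row. The header and end-of-turn markers are an assumption modeled on the Llama-3 Instruct chat format, and `format_example` is a hypothetical helper; in practice `tokenizer.apply_chat_template` is the safer source of truth for the exact template.

```python
# Assumed Llama-3-style turn markers (an assumption, not taken from this notebook).
USER_HEADER = "<|start_header_id|>user<|end_header_id|>\n\n"
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"
END_OF_TURN = "<|eot_id|>"


def format_example(instruction, user_input, output):
    """Merge one dataset row into a single training string.

    The begin-of-text token is left out on purpose: the tokenizer in this
    notebook already prepends it (ID 128000 in the token lists above).
    """
    user_turn = instruction.strip()
    if user_input and user_input.strip():
        user_turn += "\n\n" + user_input.strip()
    return "".join([
        USER_HEADER, user_turn, END_OF_TURN,
        ASSISTANT_HEADER, output.strip(), END_OF_TURN,
    ])


print(format_example("Translate to French.", "Good morning", "Bonjour"))
```

Applied row by row, for example with `df.apply(..., axis=1)`, this would yield a `text` column that holds both the prompt and the expected answer.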

## <b>5 <span style='color:#9146ff'>|</span> Model Training and Fine-tuning </b>

### LoRA (Low-Rank Adaptation):
LoRA is a Parameter-Efficient Fine-Tuning (PEFT) technique that freezes the pretrained weights and adds small trainable low-rank matrices alongside them, so only those matrices are updated during training.

![LoRa](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/blog/133_trl_peft/step2.png)
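The low-rank update can be illustrated in a few lines of plain Python. In this minimal sketch the frozen weight `W` has shape `d_out x d_in`, the trainable factors are `A` (`r x d_in`) and `B` (`d_out x r`), and the output is `W x + (alpha / r) * B (A x)`. The tiny matrices are made up for illustration; real implementations use tensor libraries.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base projection plus the scaled low-rank update."""
    base = matvec(W, x)                 # W x, weights stay frozen
    update = matvec(B, matvec(A, x))    # B (A x), only A and B are trained
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1, 0], [0, 1]]        # frozen 2x2 weight
A = [[1, 1]]                # rank r = 1, shape 1x2
B_zero = [[0.0], [0.0]]     # B starts at zero, so the adapter is a no-op at first
B_trained = [[0.5], [0.0]]  # a made-up value after some training
x = [2, 3]

print(lora_forward(W, A, B_zero, x, alpha=2, r=1))
print(lora_forward(W, A, B_trained, x, alpha=2, r=1))
```

Starting `B` at zero means the adapted model initially behaves exactly like the base model, and training only has to learn the difference.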


In [38]:
# Load LoRA configuration
peft_args = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM",
)

### Training Arguments :

**Parameter Explanations**

1. `output_dir="./results"`:
Directory where the model checkpoints and other outputs will be saved.

2. `num_train_epochs=1`:
Number of epochs to train the model. An epoch is one full pass through the training dataset.

3. `per_device_train_batch_size=2`:
Batch size per GPU/TPU core/CPU for training. This means that each device will process 2 samples per forward/backward pass.

4. `gradient_accumulation_steps=1`:
Number of forward/backward passes whose gradients are accumulated before each optimizer update. Values above 1 increase the effective batch size without using more memory per pass; with 1, the weights are updated after every batch.

5. `optim="paged_adamw_32bit"`:
Specifies the optimizer to use. `paged_adamw_32bit` is the bitsandbytes AdamW variant that keeps its optimizer state in 32-bit precision and can page that state to CPU memory when GPU memory runs short.

6. `save_steps=100`:
Number of steps between model checkpoint saves. The model will be saved every 100 steps.

7. `logging_steps=100`:
Number of steps between logging outputs. Training progress will be logged every 100 steps.

8. `learning_rate=2e-5`:
Initial learning rate for the optimizer. This controls how much to adjust the model weights with respect to the loss gradient.

9. `weight_decay=0.001`:
Weight decay coefficient applied to the model parameters by the AdamW optimizer. It helps prevent overfitting by shrinking large weights.

10. `fp16=True`:
Enable 16-bit (half-precision) training to reduce memory usage and speed up training.

11. `bf16=False`:
Disable bfloat16 training. Bfloat16 is another 16-bit format with a wider exponent range than fp16; it is supported on TPUs and on newer NVIDIA GPUs (Ampere and later).

12. `max_grad_norm=0.3`:
Maximum norm for gradient clipping. This helps prevent exploding gradients by scaling gradients that exceed this norm.

13. `warmup_ratio=0.03`:
Ratio of total training steps used for linear learning rate warmup. This gradually increases the learning rate from 0 to the initial learning rate over the first 3% of the training steps.

14. `group_by_length=True`:
Whether to batch sequences of roughly the same length together. This reduces the amount of padding per batch, which can make training more efficient.

15. `lr_scheduler_type="cosine"`:
Type of learning rate scheduler to use. "cosine" refers to a cosine annealing schedule, which gradually decreases the learning rate following a cosine curve.

16. `report_to="tensorboard"`:
Specifies where to report training metrics. "tensorboard" will log metrics to TensorBoard, a visualization tool for monitoring training.
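The interaction of `learning_rate`, `warmup_ratio`, and `lr_scheduler_type="cosine"` can be sketched as a small function. This is a simplified stand-alone version of a linear-warmup, cosine-decay schedule, assuming 5,000 total steps (10,000 samples at batch size 2 for one epoch); `lr_at` is a hypothetical helper, not the library's scheduler.

```python
import math


def lr_at(step, total_steps, base_lr, warmup_ratio):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 5000   # assumed: 10,000 samples / batch size 2, one epoch
BASE_LR = 2e-5
WARMUP_RATIO = 0.03

for step in (0, 75, 150, 2575, 5000):
    print(step, lr_at(step, TOTAL_STEPS, BASE_LR, WARMUP_RATIO))
```

With these settings the rate climbs for the first 150 steps, peaks at `2e-5`, and then falls smoothly toward zero by the final step.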

In [47]:
# Set training parameters
training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    # per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
#     evaluation_strategy="epoch",
    optim="paged_adamw_32bit",
    save_steps=100,
    logging_steps=100,
    learning_rate=2e-5,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
#     max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="tensorboard"
)
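The `max_grad_norm=0.3` setting clips gradients by their global norm. A minimal sketch of that idea on a flat list of gradient values follows; `clip_by_global_norm` is a hypothetical helper written for illustration, and the small epsilon in the denominator is an assumption borrowed from common implementations.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale grads so their L2 norm does not exceed max_norm.

    Returns (clipped_grads, original_norm). Gradients already within the
    limit are left unchanged.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=0.3)
print("original norm:", norm)
print("clipped:", clipped)
```

Because every component is multiplied by the same factor, clipping changes the step size but not the direction of the update.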

In [48]:
# # Set up training arguments
# training_args = TrainingArguments(
#     output_dir="./results",
#     num_train_epochs=3,
#     per_device_train_batch_size=16,  # Adjust according to your device and global batch size
#     gradient_accumulation_steps=2,  # Adjust according to your device and global batch size
#     logging_dir='./logs',
#     logging_steps=10,
#     evaluation_strategy="steps",
#     save_steps=10,
#     # save_total_limit=2,
#     learning_rate=2e-5,
#     lr_scheduler_type="cosine",
#     warmup_ratio=0.1,
#     fp16=True,  # Use bf16 if your hardware supports it
#     optim="adamw_torch_fused",  # Use "adamw_torch_fused" for speedup
#     report_to="tensorboard"
# )

In [49]:
from peft import get_peft_model, TaskType

peft_model = get_peft_model(model, 
                            peft_args)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3407872
all model parameters: 4544008192
percentage of trainable model parameters: 0.07%
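The trainable-parameter count above can be reproduced by hand. A LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch assumes adapters are attached only to the query and value projections of each of 32 layers (the configuration above does not list target modules, so this is an assumption about the defaults) and uses assumed projection sizes of 4096 and 1024.

```python
R = 8                                   # LoRA rank from the configuration above
HIDDEN, KV_DIM, LAYERS = 4096, 1024, 32  # assumed model dimensions


def lora_params(d_in, d_out, r):
    """Parameters added by one adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


per_layer = (
    lora_params(HIDDEN, HIDDEN, R)      # query projection (assumed target)
    + lora_params(HIDDEN, KV_DIM, R)    # value projection (assumed target)
)
total_lora = LAYERS * per_layer

print("LoRA parameters:", total_lora)
print("new total:", 4540600320 + total_lora)
print(f"share: {100 * total_lora / (4540600320 + total_lora):.2f}%")
```

Under these assumptions the result matches the printed figures, which also shows that after wrapping only the adapter weights remain trainable.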


### SFTTrainer:

- Supervised fine-tuning (SFT): `SFTTrainer` from TRL is built on the `transformers` `Trainer` and is intended for fine-tuning pre-trained language models on supervised text or instruction datasets.
- Simpler interface: It handles dataset formatting and tokenization internally, so fewer pieces need to be wired up by hand.
- Efficient memory usage: It integrates with PEFT (for example, LoRA adapters) and supports sequence packing, both of which reduce memory use or wasted computation during training.
- Faster training: With PEFT and packing, training can take less time and memory than full fine-tuning with a plain `Trainer`, though the results depend on the dataset and settings.
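Sequence packing, mentioned above, means concatenating several short examples into fixed-length blocks so that little of each batch is spent on padding. The sketch below is a simplified illustration of that idea on lists of token IDs, not TRL's implementation; `pack_sequences` is a hypothetical helper, and this notebook itself trains with `packing=False`.

```python
def pack_sequences(sequences, block_size, eos_id):
    """Concatenate sequences (each followed by eos_id) and cut into full blocks.

    Any leftover tokens that do not fill a whole block are dropped.
    """
    stream = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)   # separator so examples stay distinguishable
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


blocks = pack_sequences([[1, 2], [3, 4, 5], [6]], block_size=4, eos_id=0)
print(blocks)
```

Every returned block is completely filled with real tokens, whereas padding each of the three examples to length 4 separately would have left half of the positions empty.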

In [50]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=dataset,
#     eval_dataset=test_dataset,
    peft_config=peft_args,
    dataset_text_field="text",
#     max_seq_length=256,
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [51]:
# # import torch_optimizer as optim
# from transformers import AdamW
# from transformers.optimization import get_cosine_schedule_with_warmup

# # trainer.args.fsdp = "full_shard auto_wrap"  # Configure FSDP if required

# # Initialize optimizer and scheduler
# optimizer = AdamW(model.parameters(), lr=training_args.learning_rate)
# num_training_steps = len(tokenized_datasets) // training_args.per_device_train_batch_size // training_args.gradient_accumulation_steps * training_args.num_train_epochs
# lr_scheduler = get_cosine_schedule_with_warmup(
#     optimizer,
#     num_warmup_steps=int(0.1 * num_training_steps),
#     num_training_steps=num_training_steps,
# )

# # Enable Flash Attention v2
# # flash_attention_v2_enabled = True  # Assume this is integrated in your model/library


### Training

In [52]:
trainer.train()

| Step | Training Loss |
|---|---|
| 100 | 3.8019 |
| 200 | 3.3599 |
| 300 | 3.0473 |
| 400 | 3.0143 |
| 500 | 2.8759 |
| 600 | 2.9524 |
| 700 | 2.9111 |
| 800 | 2.8108 |
| 900 | 2.8013 |
| 1000 | 2.7188 |


TrainOutput(global_step=5000, training_loss=2.700860397338867, metrics={'train_runtime': 7395.6575, 'train_samples_per_second': 1.352, 'train_steps_per_second': 0.676, 'total_flos': 5.766399393792e+16, 'train_loss': 2.700860397338867, 'epoch': 1.0})
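The numbers in `TrainOutput` can be cross-checked against the settings used earlier: 10,000 training rows, a per-device batch size of 2, and no gradient accumulation. The short calculation below assumes a single GPU.

```python
import math

N_SAMPLES = 10_000      # rows passed to the trainer
BATCH_SIZE = 2          # per_device_train_batch_size
ACCUMULATION = 1        # gradient_accumulation_steps
RUNTIME_S = 7395.6575   # train_runtime reported above

steps = math.ceil(N_SAMPLES / (BATCH_SIZE * ACCUMULATION))
samples_per_second = N_SAMPLES / RUNTIME_S
steps_per_second = steps / RUNTIME_S

print("optimizer steps:", steps)
print(f"samples/s: {samples_per_second:.3f}")
print(f"steps/s:   {steps_per_second:.3f}")
print(f"runtime:   {RUNTIME_S / 3600:.2f} hours")
```

The step count and both throughput figures agree with the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`.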

In [46]:
# import gc
# torch.cuda.empty_cache()
# gc.collect()
# torch.cuda.empty_cache()

### Save model & Publish

In [53]:
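# `trainer.model` is a PEFT model, so this writes the LoRA adapter weights and config,
# not a full copy of the base model.
trainer.model.save_pretrained("./llama-3-8B-Arabic")
tokenizer.save_pretrained("./llama-3-8B-Arabic")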

('./llama-3-8B-Arabic/tokenizer_config.json',
 './llama-3-8B-Arabic/special_tokens_map.json',
 './llama-3-8B-Arabic/tokenizer.json')

In [None]:
# model.push_to_hub("")
# tokenizer.push_to_hub("")

## <b>6 <span style='color:#9146ff'>|</span> Testing the model performance on a single inference </b>


In [56]:
def single_inference(question):
    messages = [
        # System prompt, in English: "Answer the following in Arabic only."
        {"role": "system", "content": "اجب علي الاتي بالعربي فقط."},
    ]

    messages.append({"role": "user", "content": question})


    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    outputs = model.generate(
        input_ids,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.4,
    #     top_p=0.9,
    )
    response = outputs[0][input_ids.shape[-1]:]
    output = tokenizer.decode(response, skip_special_tokens=True)
    return output
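The `temperature=0.4` argument used with `do_sample=True` rescales the model's logits before sampling. The sketch below shows the effect on a made-up set of three logits using a plain softmax; it illustrates the general technique and is not the library's sampling code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                          # subtract the max for stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.4):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the highest-scoring tokens, so a value such as 0.4 makes the sampled answers more focused and less varied.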

In [57]:
# In English: "How do you make pizza? Answer in steps."
question = """ما هي طريقة عمل البيتزا , اجب في خطوات"""

answer = single_inference(question)

print(f'INPUT QUESTION:\n{question}')
print(f'\n\nModel Answer:\n{answer}')

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


INPUT QUESTION:
ما هي طريقة عمل البيتزا , اجب في خطوات


Model Answer:
لإعداد البيتزا ، اتبع الخطوات التالية:

1. قم بإعداد العجينة: قم بجمع 2 كوب من الدقيق ، 1 كوب من الماء ، 1 ملعقة صغيرة من الخميرة ، 1 ملعقة صغيرة من الملح ، و 2 ملعقة صغيرة من الزيت. قم بترسيب العجينة في وعاء ، ثم قم بتغطيتها بالكفاف وتعطيتها وقتًا لترسيبها.
2. قم بترسيب العجينة: قم بترسيب العجينة في وعاء ، ثم قم بتغطيتها بالكفاف وتعطيتها وقتًا لترسيبها.
3. قم بترتيب المواد: قم بترتيب المواد التالية: 1 رطل من الجبن ، 1 رطل من لحم البقر ، 1 رطل من لحم الدجاج ، 1 ملعقة صغيرة من البصل ، 1 ملعقة صغيرة من الخضار ، 1 ملعقة صغيرة من الكمون ، 1 ملعقة صغيرة من الفلفل ، 1


In [None]:
question = """   """

answer = single_inference(question)

print(f'INPUT QUESTION:\n{question}')
print(f'\n\nModel Answer:\n{answer}')