# How to Fine-Tune Llama 2: A Step-by-Step Guide

In [None]:
%%capture
# We will start by installing the required libraries.
# Note: the %%capture cell magic must be the first line of the cell.
%pip install accelerate peft bitsandbytes transformers trl

# Next, we load the necessary modules from these libraries.

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer

# Model configuration
[We will fine-tune the base model on a small instruction dataset, mlabonne/guanaco-llama2-1k, and choose a name for the fine-tuned model.]

In [None]:
# Model from Hugging Face hub
base_model = "NousResearch/Llama-2-7b-chat-hf"

# New instruction dataset
guanaco_dataset = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model
new_model = "llama-2-7b-chat-guanaco"

# Loading dataset, model, and tokenizer

In [None]:
dataset = load_dataset(guanaco_dataset, split="train")

# 4-bit quantization configuration
[QLoRA loads the base model's weights in 4-bit precision and trains small LoRA adapters on top of them. This makes it feasible to fine-tune large LLMs on consumer hardware while retaining most of the model's performance.]

In [None]:
compute_dtype = torch.float16

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)
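
To get a feel for why 4-bit loading matters, here is a rough back-of-the-envelope sketch of the memory needed just to store the weights. The parameter count of 7 billion is a nominal assumption, and the estimate ignores quantization constants, activations, and optimizer state, so real usage will be higher.

```python
# Rough estimate of the memory needed just to hold the model weights.
# Assumption: a nominal 7 billion parameters. Real usage is higher because of
# quantization constants, activations, and the CUDA context.
NUM_PARAMS = 7_000_000_000


def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Return the weight storage size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the space of the fp16 weights, which is what lets the model fit on a single smaller GPU.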

# We will now load the model from Hugging Face in 4-bit precision, with the compute dtype "float16" for faster training.

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Loading tokenizer
[Next, we will load the tokenizer from Hugging Face and set padding_side to "right" to avoid an issue with fp16 training.]

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
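
As a small illustration of what the padding side controls, the sketch below pads a toy batch of token ids on the right and on the left. The helper `pad_batch`, the token ids, and the value of `EOS_ID` are all made up for this example; the real tokenizer does this internally when it batches sequences.

```python
# Illustrative only: pad_batch is a hypothetical helper, not a transformers API.
def pad_batch(sequences, pad_id, padding_side="right"):
    """Pad lists of token ids to the length of the longest sequence."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        if padding_side == "right":
            padded.append(seq + padding)   # pad tokens go after the text
        else:
            padded.append(padding + seq)   # pad tokens go before the text
    return padded


batch = [[1, 15, 27], [1, 42]]  # made-up token ids
EOS_ID = 2                      # stands in for tokenizer.eos_token_id (assumed value)

right = pad_batch(batch, EOS_ID, "right")
left = pad_batch(batch, EOS_ID, "left")
print("right:", right)
print("left: ", left)
```

Because the pad token is set to the EOS token in the cell above, right padding simply appends EOS ids to the shorter sequences.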

# PEFT parameters
Parameter-Efficient Fine-Tuning (PEFT) updates only a small number of parameters instead of the full model, which makes training much cheaper. Here we use LoRA, which freezes the base weights and trains small low-rank adapter matrices alongside them.

In [None]:
peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
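
The following sketch estimates how many trainable parameters this LoRA configuration adds. It rests on assumptions that are not stated in the notebook: a hidden size of 4096, 32 decoder layers, and adapters attached only to the `q_proj` and `v_proj` attention projections (no `target_modules` is passed above, so the library chooses them). Treat the result as an order-of-magnitude check rather than an exact figure.

```python
# Count the trainable parameters LoRA adds for a given rank.
# Assumptions (not stated in the tutorial): hidden size 4096, 32 decoder
# layers, and adapters on q_proj and v_proj only.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
TARGET_MODULES_PER_LAYER = 2  # q_proj and v_proj (assumed)


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


r = 64           # rank, as in LoraConfig above
lora_alpha = 16  # as in LoraConfig above

per_module = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, r)
total = per_module * TARGET_MODULES_PER_LAYER * NUM_LAYERS
scaling = lora_alpha / r  # the low-rank update is scaled by lora_alpha / r

print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapters hold tens of millions of parameters, well under one percent of the 7B base model.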

# Training parameters


In [None]:
training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)
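
To see what these settings imply for the length of a run, here is a small arithmetic sketch. It assumes a single GPU and 1,000 training examples (as the dataset name suggests); the variable names simply mirror the arguments above.

```python
import math

# Values taken from the configuration above; a single GPU and a
# 1,000-example training split are assumed.
num_examples = 1000
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 1
save_steps = 25

# Number of examples that contribute to each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
num_checkpoints = total_steps // save_steps

print(f"optimizer steps: {total_steps}")
print(f"checkpoints written to output_dir: {num_checkpoints}")
```

Raising `gradient_accumulation_steps` would increase the effective batch size and reduce the number of optimizer steps without using more GPU memory per step.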

# Model fine-tuning
[Supervised fine-tuning (SFT) is a key step in reinforcement learning from human feedback (RLHF). The TRL library from Hugging Face provides an easy-to-use API for creating SFT models and training them on your dataset with just a few lines of code. It comes with tools to train language models using reinforcement learning, starting with supervised fine-tuning, then reward modeling, and finally proximal policy optimization (PPO).]

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_params,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)

# Run the fine-tuning loop; checkpoints and logs go to output_dir.
trainer.train()



# Saving the fine-tuned model
[After training, we will save the model adapter and the tokenizer. You can also upload the model to Hugging Face using a similar API.]

In [None]:
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

('llama-2-7b-chat-guanaco/tokenizer_config.json',
 'llama-2-7b-chat-guanaco/special_tokens_map.json',
 'llama-2-7b-chat-guanaco/tokenizer.model',
 'llama-2-7b-chat-guanaco/added_tokens.json',
 'llama-2-7b-chat-guanaco/tokenizer.json')

# Evaluation
[We can now review the training results in an interactive TensorBoard session.]

In [None]:
from tensorboard import notebook
log_dir = "results/runs"
notebook.start("--logdir {} --port 4000".format(log_dir))


# To test our fine-tuned model, we will use the transformers text-generation pipeline and ask a few simple questions.

In [None]:
logging.set_verbosity(logging.CRITICAL)

prompt = "Who is Lin Dan?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Who is Lin Dan? [/INST]  Lin Dan is a Chinese badminton player widely regarded as one of the greatest players of all time. Unterscheidung Lin Dan is a Chinese badminton player widely regarded as one of the greatest players of all time. He was born on October 14, 1983, in Fujian, China, and began playing badminton at the age of 10.

Lin Dan has won numerous awards and accolades throughout his career, including:

* Olympic gold medals (3): 2004, 2008, 2012
* World Championships gold medals (5): 2001, 2005, 2007, 2009, 2011
* BWF World Superseries titles (11): 2004, 200


In [None]:
prompt = "Who is Leonardo Wilhelm DiCaprio?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Who is Leonardo Wilhelm DiCaprio? [/INST]  Leonardo Wilhelm DiCaprio is an American actor, producer, and environmentalist. nobody. He was born on November 11, 1974, in Los Angeles, California, and has become one of the most successful and respected actors in Hollywood.

DiCaprio has starred in a wide range of films, including "Titanic," "The Wolf of Wall Street," "The Revenant," "The Great Gatsby," and "Inception." He has received numerous awards and nominations for his performances, including an Academy Award, a Golden Globe Award, and a BAFTA Award.

In addition to his acting career, DiCaprio is also known for his environmental activism. He has been a vocal advocate for climate change awareness and has worked with various organizations to promote sustainability and conservation. He has also


In [None]:
prompt = "What is langchain and Rag in generative ai?"
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] What is langchain and Rag in generative ai? [/INST]  In the context of generative AI, "langchain" and "Rag" are two related concepts that refer to different aspects of language modeling and generation. nobody knows.

Langchain:
Langchain is a term used to describe a type of language model that is trained on a large corpus of text data, but is designed to generate text that is more coherent and natural-sounding than a traditional language model. Langchain models are typically trained on a combination of text data from the internet, books, and other sources, and are designed to learn the patterns and structures of language in a more nuanced way than a traditional language model. The goal of a langchain model is to generate text that is not only grammatically correct, but also contextually appropriate and semantically meaningful.

Rag:
Rag is a
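
The three test cells above all wrap the question in the same `<s>[INST] ... [/INST]` template and print the prompt together with the answer. One possible way to tidy this up is sketched below. `build_prompt` and `extract_answer` are hypothetical helpers written for this example, not part of transformers, and `sample_output` is a shortened stand-in for real pipeline output.

```python
# Hypothetical helpers (not part of transformers) for the prompt format used above.
def build_prompt(question: str) -> str:
    """Wrap a question in the [INST] ... [/INST] template used in this notebook."""
    return f"<s>[INST] {question.strip()} [/INST]"


def extract_answer(generated_text: str) -> str:
    """Return only the text after the closing [/INST] tag (empty if the tag is missing)."""
    _, _, answer = generated_text.partition("[/INST]")
    return answer.strip()


prompt_text = build_prompt("Who is Lin Dan?")

# Shortened stand-in for result[0]['generated_text'] from the pipeline.
sample_output = prompt_text + "  Lin Dan is a Chinese badminton player."

print(prompt_text)
print(extract_answer(sample_output))
```

With the pipeline from the cells above, the same helpers could be used as `extract_answer(pipe(build_prompt(prompt))[0]['generated_text'])`.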


# To sum up
[This tutorial walked through fine-tuning the Llama 2 model using QLoRA, PEFT, and SFT to work around memory and compute limitations. Using the Hugging Face libraries transformers, accelerate, peft, trl, and bitsandbytes, we fine-tuned the 7B-parameter Llama 2 chat model on a consumer GPU.]