## Instruct-tuning Llama2

This notebook follows [Maxime Labonne's](https://mlabonne.github.io/blog/posts/Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.html) notebook on fine-tuning a Llama 2 model with an instruction dataset.

In [1]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

In [2]:
# Imports
import os
import torch
from datasets import load_dataset

from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig
from transformers import HfArgumentParser, TrainingArguments
from transformers import pipeline, logging

In [4]:
# Parameter-efficient fine-tuning (LoRA) and the supervised fine-tuning trainer
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

In [5]:
# Base Llama2 model to be loaded
model_name = "NousResearch/Llama-2-7b-chat-hf"

In [6]:
# Instruction dataset to be used for fine-tuning the base model
dataset_name = "mlabonne/guanaco-llama2-1k"

In [7]:
# Name of model to be pushed to hub post fine-tuning
tuned_model = "llama-2-7b-guanaco-1k"

In [8]:
# Parameters for quantized training via bitsandbytes

# 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False
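
As a rough back-of-the-envelope sketch of why 4-bit loading matters, the snippet below estimates the memory taken by the weights alone at 16 and 4 bits per parameter. The helper `weight_memory_gib` is hypothetical, and the parameter count is an assumed round figure for a "7B" model rather than a value read from the checkpoint. The estimate ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 7_000_000_000  # assumed round figure for a "7B" model

fp16_gib = weight_memory_gib(num_params, 16)
nf4_gib = weight_memory_gib(num_params, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights need a quarter of the 16-bit footprint, which is what makes a 7B model practical on a single Colab GPU.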

In [9]:
# QLoRA - quantized LoRA training parameters

# LoRA rank
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1
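
To make the rank and alpha settings concrete: LoRA keeps the original weight matrix frozen and learns a low-rank update `B @ A`, which is scaled by `lora_alpha / r`. The sketch below computes that scaling factor and the adapter's parameter count for one matrix. The function names are hypothetical, and the 4096 x 4096 shape is an assumed layer size used only for illustration.

```python
def lora_scaling(lora_alpha: int, r: int) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / r


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A (r x d_in) plus parameters in B (d_out x r)."""
    return r * d_in + d_out * r


lora_r, lora_alpha = 64, 16  # values from the cell above
d_in = d_out = 4096          # assumed layer shape, for illustration only

scaling = lora_scaling(lora_alpha, lora_r)
adapter = lora_trainable_params(d_in, d_out, lora_r)
full = d_in * d_out

print(f"scaling = {scaling}")
print(f"adapter params = {adapter:,} vs full matrix = {full:,} ({adapter / full:.1%})")
```

With these numbers the adapter trains only a small fraction of the parameters in the full matrix, and its contribution is scaled down by a factor of 0.25.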

In [10]:
# Training arguments

# Output directory to store results
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (a positive value overrides num_train_epochs; -1 disables it)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25
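
The interaction between batch size, warmup ratio, and the cosine schedule is easier to see with numbers. The sketch below is a plain-Python approximation of a linear-warmup, half-cosine-decay schedule, not the library's own implementation. It assumes a single GPU and a 1,000-example dataset (as the "1k" in the dataset name suggests), and the function name `cosine_lr_with_warmup` is hypothetical.

```python
import math


def cosine_lr_with_warmup(step, total_steps, warmup_steps, base_lr):
    """Linear warmup from 0 to base_lr, then half-cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


num_examples = 1000              # assumed dataset size
per_device_train_batch_size = 4  # values from the cell above
gradient_accumulation_steps = 1
learning_rate = 2e-4
warmup_ratio = 0.03

# One optimizer step consumes batch_size * accumulation_steps examples.
steps_per_epoch = math.ceil(
    num_examples / (per_device_train_batch_size * gradient_accumulation_steps)
)
warmup_steps = math.ceil(steps_per_epoch * warmup_ratio)

for step in (0, warmup_steps, steps_per_epoch // 2, steps_per_epoch):
    lr = cosine_lr_with_warmup(step, steps_per_epoch, warmup_steps, learning_rate)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

Under these assumptions one epoch is 250 optimizer steps, of which the first 8 are warmup.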

In [11]:
# Supervised fine-tuning (SFT) parameters

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on GPU 0
device_map = {"": 0}

In [12]:
# Load dataset
dataset = load_dataset(dataset_name, split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [14]:
# Build the 4-bit quantization (bitsandbytes) configuration used for QLoRA

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

In [15]:
# Check GPU compatibility with bfloat16

if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

In [16]:
print(major)

7
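
The check above hinges on the CUDA compute capability: a major version of 8 or higher is treated as bfloat16-capable, and the printed value of 7 falls short, so `float16` stays in use. A minimal sketch of that decision as a pure function, with `pick_compute_dtype` being a hypothetical helper name:

```python
def pick_compute_dtype(major: int) -> str:
    """Return the name of a torch dtype suited to the given compute capability."""
    return "bfloat16" if major >= 8 else "float16"


# cc_major stands in for the value returned by torch.cuda.get_device_capability().
for cc_major in (7, 8):
    print(cc_major, "->", pick_compute_dtype(cc_major))
```

The returned string could be passed to `getattr(torch, ...)` in the same way the notebook resolves `bnb_4bit_compute_dtype`.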


In [17]:
# Load base model

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

In [18]:
# Load LLaMA tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [19]:
# Load LoRA configuration

peft_config = LoraConfig(
                  lora_alpha=lora_alpha,
                  lora_dropout=lora_dropout,
                  r=lora_r,
                  bias="none",
                  task_type="CAUSAL_LM",
              )

In [20]:
# Set training arguments

training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

In [21]:
# Set supervised fine-tuning parameters

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)



In [22]:
# Train model
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 25   | 1.4085        |
| 50   | 1.6598        |
| 75   | 1.214         |
| 100  | 1.439         |
| 125  | 1.177         |
| 150  | 1.3615        |
| 175  | 1.173         |
| 200  | 1.4597        |
| 225  | 1.1573        |
| 250  | 1.5332        |


TrainOutput(global_step=250, training_loss=1.3583098602294923, metrics={'train_runtime': 1505.6741, 'train_samples_per_second': 0.664, 'train_steps_per_second': 0.166, 'total_flos': 8755214190673920.0, 'train_loss': 1.3583098602294923, 'epoch': 1.0})
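
The reported metrics can be cross-checked with a little arithmetic. The sketch below copies the step count, runtime, and logged losses from the output above, and assumes a single GPU with a per-device batch size of 4. It recomputes the throughput figures and the mean of the ten logged loss values.

```python
global_step = 250
train_runtime = 1505.6741  # seconds, from TrainOutput
batch_size = 4             # per_device_train_batch_size, assuming one GPU

logged_losses = [
    1.4085, 1.6598, 1.214, 1.439, 1.177,
    1.3615, 1.173, 1.4597, 1.1573, 1.5332,
]

steps_per_second = global_step / train_runtime
samples_per_second = global_step * batch_size / train_runtime
mean_logged_loss = sum(logged_losses) / len(logged_losses)

print(f"steps/s:   {steps_per_second:.3f}")
print(f"samples/s: {samples_per_second:.3f}")
print(f"runtime:   {train_runtime / 60:.1f} min")
print(f"mean logged loss: {mean_logged_loss:.4f}")
```

The recomputed throughput agrees with `train_steps_per_second` and `train_samples_per_second`, and the mean of the logged values matches the reported `training_loss` to four decimal places. Note that the logged loss fluctuates between roughly 1.16 and 1.66 rather than decreasing steadily over the single epoch.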

In [23]:
# Save the trained LoRA adapter weights
trainer.model.save_pretrained(tuned_model)

In [24]:
# Suppress transformers log messages below CRITICAL severity
logging.set_verbosity(logging.CRITICAL)

### Text generation using the fine-tuned model

In [25]:
# Set text generation pipeline
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)

In [None]:
# result = pipe(f"<s>[INST] {prompt} [/INST]")
# print(result[0]['generated_text'])

In [26]:
def generate(prompt="", pipe=pipe):
    """Generate a completion for the prompt, wrapped in the Llama 2 [INST] template."""

    result = pipe(f"<s>[INST] {prompt} [/INST]")
    return result[0]['generated_text']
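
Because the pipeline returns the prompt together with the completion, it can help to separate the two. The sketch below shows the prompt template used by `generate` as a standalone function, plus a way to strip the prompt from the returned text. Both `build_prompt` and `extract_response` are hypothetical helper names, and the sample string stands in for real pipeline output.

```python
def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the [INST] template used by generate()."""
    return f"<s>[INST] {instruction} [/INST]"


def extract_response(generated_text: str) -> str:
    """Return the text after the closing [/INST] tag, or the whole text if absent."""
    _, sep, tail = generated_text.partition("[/INST]")
    return tail.strip() if sep else generated_text.strip()


# Stand-in for pipeline output: the prompt followed by a made-up completion.
sample = build_prompt("What is a large language model?") + " A large language model is ..."
print(extract_response(sample))
```

Splitting on the first `[/INST]` keeps the helper simple; it assumes a single-turn prompt like the ones used in this notebook.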

In [27]:
# Run text generation pipeline with instruct-tuned model
prompt = "What is a large language model?"
generated_text = generate(prompt=prompt)
print(generated_text)



<s>[INST] What is a large language model? [/INST] A large language model is a type of artificial intelligence (AI) model that is trained on a large dataset of text to generate human-like language outputs. These models are typically trained on vast amounts of text data, such as books, articles, and websites, and are designed to learn the patterns and structures of language.

Large language models are often used for natural language processing tasks such as text classification, sentiment analysis, and machine translation. They are also used for more creative tasks such as writing poetry, stories, and even entire books.

Some examples of large language models include:

* BERT (Bidirectional Encoder Representations from Transformers): A popular language model developed by Google that is trained on a large dataset of text and is known for its ability to generate human-like language outputs.
* LLaMA (LLaMA): A large language model developed


In [28]:
prompt = "Explain to me pythagoras theorem"
generated_text = generate(prompt=prompt)
print(generated_text)

<s>[INST] Explain to me pythagoras theorem [/INST] Pythagoras theorem is a mathematical formula that describes the relationship between the lengths of the sides of a right triangle. It states that the square of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the other two sides.

In mathematical notation, this formula can be written as:

a^2 + b^2 = c^2

Where a and b are the lengths of the two shorter sides of the triangle, and c is the length of the hypotenuse.

For example, if you have a right triangle with sides measuring 3, 4, and 5, you can use the Pythagoras theorem to find the length of the hypotenuse:

3^2 + 4^2 = 5^2

Solving for the length of the hyp


In [29]:
prompt = "What is grand unified theory?"
generated_text = generate(prompt=prompt)
print(generated_text)

<s>[INST] What is grand unified theory? [/INST] Grand Unified Theory (GUT) is a theoretical framework in physics that attempts to unify three of the four fundamental forces of nature: electromagnetism, the strong nuclear force, and the weak nuclear force. The fourth force, gravity, is not yet included in the GUT.

The GUT is based on the idea that all of these forces are manifestations of a single underlying field, known as the unified field. This field is thought to be a scalar field that permeates all of space and time, and it is the source of all of the fundamental forces.

The GUT is a very ambitious and complex theory, and it has been the subject of much research and debate in the physics community. While it has not yet been experimentally confirmed, it is one of the most promising approaches to understanding the fundamental nature of the universe.

One of the key


In [30]:
prompt = "List Maxwell's equations of electromagnetism"
generated_text = generate(prompt=prompt)
print(generated_text)

<s>[INST] List Maxwell's equations of electromagnetism [/INST] Maxwell's equations of electromagnetism are a set of four partial differential equations that describe the behavior of electric and magnetic fields in space and time. These equations are:

1. Gauss's law for electric fields: ∇⋅E = ρ/ε0
2. Gauss's law for magnetic fields: ∇⋅B = 0
3. Faraday's law of induction: ∇×E = -∂B/∂t
4. Ampere's law with Maxwell's correction: ∇×B = μ0J + μ0ε0∂E/∂t

These equations are known as Maxwell's equations, and they provide a complete and consistent description of the behavior of electric and magnetic fields in a wide range of physical situations, from the simplest electrical circuits to


In [31]:
prompt = "What are the system pathways playing a crucial role in lung cancer?"
generated_text = generate(prompt=prompt)
print(generated_text)

<s>[INST] What are the system pathways playing a crucial role in lung cancer? [/INST] The system pathways playing a crucial role in lung cancer are:

1. Epidermal Growth Factor Receptor (EGFR) signaling pathway: This pathway is involved in cellular proliferation, migration, and survival. Mutations in the EGFR gene are common in non-small cell lung cancer (NSCLC) and are associated with increased proliferation and poor prognosis.
2. PI3K/Akt signaling pathway: This pathway is involved in cellular survival and proliferation. Mutations in the PI3K gene are common in NSCLC and are associated with increased proliferation and poor prognosis.
3. MAPK signaling pathway: This pathway is involved in cellular proliferation,


In [32]:
prompt = "What is the difference between NP hard and NP complete?"
generated_text = generate(prompt=prompt)
print(generated_text)

<s>[INST] What is the difference between NP hard and NP complete? [/INST] NP-hard and NP-complete are related concepts in computational complexity theory.

NP-hard refers to a problem that is hard to solve in NP, meaning that it is computationally infeasible to solve it in polynomial time. In other words, if a problem is NP-hard, it means that there is no known algorithm that can solve it in polynomial time.

NP-complete, on the other hand, refers to a problem that is both NP-hard and has a known polynomial-time algorithm. In other words, if a problem is NP-complete, it means that there is a known algorithm that can solve it in polynomial time, but it is also computationally infeasible to solve it in polynomial time.

To illustrate the difference between NP-hard and NP-complete, consider the following example
