# Fine-tune Llama 2

### Install important libraries

1. **Accelerate:** This library simplifies running PyTorch training code across different hardware setups, including CPUs, single or multiple GPUs, and distributed systems. It abstracts away much of the boilerplate involved in parallel training, making it easier to scale your models.
  - Automatic device placement (CPU/GPU) and distributed training across multiple GPUs or nodes.
  - Mixed-precision training for memory efficiency.

2. **PEFT:** This library (Parameter-Efficient Fine-Tuning) fine-tunes pre-trained models by training only a small number of parameters instead of all of them. This makes fine-tuning faster, lighter on memory, and practical in resource-constrained environments.
  - Supports adapter tuning, LoRA (Low-Rank Adaptation), and prefix tuning.
  - Designed for large language models: training only a fraction of the parameters reduces both training time and resource usage.


3. **bitsandbytes:** This library provides quantization and memory optimizations for neural networks, with a focus on fitting large models (e.g., GPT, BERT) into limited GPU memory. It lets you load models in quantized formats, which reduces memory usage significantly.
  - 8-bit and 4-bit quantization.
  - Efficient memory allocation and optimization for large models, which enables users to train and deploy larger models on smaller GPUs.

4. **transformers:** Hugging Face's transformers library is an open-source library for natural language processing (NLP) and other tasks built on transformer models. It provides implementations of many pre-trained models (e.g., BERT, GPT, T5) and an easy-to-use API for both inference and fine-tuning.

5. **trl (Transformer Reinforcement Learning):** This is a specialized Hugging Face library that integrates transformer models with reinforcement learning (RL). It is designed to fine-tune language models using RL algorithms, which is useful where supervised fine-tuning alone isn't sufficient, such as dialogue generation or optimizing models based on human feedback. It also provides `SFTTrainer` for supervised fine-tuning, which is the trainer this notebook uses.
  - Supports popular RL algorithms like Proximal Policy Optimization (PPO).
  - Allows users to integrate rewards and RL into transformer-based models.
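
To make the "fewer parameters" claim for LoRA concrete, here is a minimal arithmetic sketch. It assumes a single 4096 x 4096 projection matrix (a typical attention projection size in a 7B Llama-style model; this shape is an assumption, not something the notebook states) and uses the rank and alpha values this notebook sets later (`lora_r = 64`, `lora_alpha = 16`). The helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumed shape: one 4096 x 4096 projection matrix.
d_in = d_out = 4096
r = 64           # LoRA rank (lora_r in this notebook)
lora_alpha = 16  # LoRA alpha (lora_alpha in this notebook)

full = d_in * d_out                         # parameters updated by full fine-tuning
adapter = lora_param_count(d_in, d_out, r)  # parameters trained by LoRA instead
scaling = lora_alpha / r                    # factor applied to the low-rank update B @ A

print(f"full matrix params : {full:,}")
print(f"LoRA adapter params: {adapter:,} ({adapter / full:.1%} of the full matrix)")
print(f"update scaling     : {scaling}")
```

The original weight matrix stays frozen; only the two small matrices are trained, which is why the optimizer state and gradients shrink along with the parameter count.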

In [1]:
!pip install -q accelerate peft bitsandbytes transformers trl datasets

In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

In [3]:
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "Llama-2-7b-chat-finetune"

################################################################################
# QLoRA parameters
################################################################################

# LoRA rank: sets the size of the low-rank update matrices and therefore the number of trainable parameters
lora_r = 64

# Scaling factor for the LoRA update (the update is scaled by lora_alpha / lora_r)
lora_alpha = 16

# Dropout probability applied in the LoRA layers; acts as regularization to help prevent overfitting
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Enables loading the model in 4-bit precision to save memory.
use_4bit = True

# Sets the precision for computations (despite 4-bit storage).
bnb_4bit_compute_dtype = "float16"

# Specifies the type of 4-bit quantization.
bnb_4bit_quant_type = "nf4"

# Enables or disables nested (double) quantization, which also quantizes the quantization constants for further memory savings
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate scheduler type (e.g., linear, cosine) that controls how the learning rate changes over time.
lr_scheduler_type = "cosine"

# Maximum number of training steps. If set, it overrides num_train_epochs
max_steps = -1

# Ratio of total steps used for learning rate warmup. During warmup, the learning rate gradually increases to the initial learning rate
warmup_ratio = 0.03

# Whether to group input sequences of similar length together for more efficient training.
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# The maximum sequence length for the input data during training.
max_seq_length = None

# Packs multiple short examples into a single input sequence to improve training efficiency. Useful when the dataset contains many short sequences.
packing = False

# Maps the whole model (the "" key) onto a single device; 0 refers to GPU 0.
device_map = {"": 0}
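
The batch size, accumulation, warmup, and scheduler settings above interact, so a small arithmetic sketch may help. It copies the values from the configuration cell and assumes a single GPU and 1,000 training examples (suggested by the `-1k` suffix of the dataset name, but still an assumption). The function `lr_at` is a hypothetical helper that reproduces the general shape of linear warmup followed by cosine decay; it is not the Trainer's own implementation.

```python
import math

# Values copied from the configuration cell above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 1
warmup_ratio = 0.03
learning_rate = 2e-4

# Assumptions for illustration: one GPU and 1,000 training examples.
num_gpus = 1
num_examples = 1000

# Number of examples that contribute to each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def lr_at(step: int) -> float:
    """Linear warmup to learning_rate, then half-cosine decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size: {effective_batch}")
print(f"total update steps  : {total_steps}")
print(f"warmup steps        : {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"lr at step {step:>3}: {lr_at(step):.2e}")
```

Under these assumptions, lowering `per_device_train_batch_size` while raising `gradient_accumulation_steps` by the same factor leaves the effective batch size and the schedule unchanged.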

In [4]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-2e821c83-e78f-44c3-abb4-e59ae99af48f)
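
A rough estimate shows why 4-bit loading matters on this GPU. The sketch below assumes about 6.74 billion parameters for a Llama-2-7B checkpoint and counts only the weights; activations, gradients, optimizer state, and quantization constants come on top. The helper `weight_memory_gib` is hypothetical.

```python
GIB = 1024 ** 3

# Assumption: roughly 6.74 billion parameters for a Llama-2-7B checkpoint.
num_params = 6.74e9

# Storage cost per parameter, in bytes, for common precisions.
bytes_per_param = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gib(n_params: float, n_bytes: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * n_bytes / GIB


for name, n_bytes in bytes_per_param.items():
    print(f"{name:>8}: {weight_memory_gib(num_params, n_bytes):5.2f} GiB")
```

On a T4 with roughly 15 GiB of memory, half-precision weights alone would leave very little headroom for training, while 4-bit weights leave most of the memory free for activations and the LoRA adapters.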


In [5]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 159.06 MiB is free. Process 46541 has 14.59 GiB memory in use. Of the allocated memory 13.84 GiB is allocated by PyTorch, and 644.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
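
The run above ends with an out-of-memory error on the T4. A few common mitigations can be sketched as settings to try before re-running. None of them is guaranteed to fix this particular failure. The helper `low_memory_settings` and the sequence-length cap of 512 are hypothetical; the dictionary keys are existing `TrainingArguments` names, and the environment variable is the one the error message itself suggests. If an earlier cell already loaded a model, restarting the runtime first frees that memory.

```python
import os

# Suggested by the error message; it must be set before PyTorch initializes CUDA,
# so in practice this belongs at the top of the notebook.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


def low_memory_settings(effective_batch: int, per_device_batch: int = 1) -> dict:
    """Trade per-device batch size for gradient accumulation.

    The product of the two stays equal to effective_batch, so each optimizer
    update still sees the same number of examples.
    """
    if effective_batch % per_device_batch != 0:
        raise ValueError("effective_batch must be divisible by per_device_batch")
    return {
        "per_device_train_batch_size": per_device_batch,
        "gradient_accumulation_steps": effective_batch // per_device_batch,
        "gradient_checkpointing": True,  # recompute activations instead of storing them
    }


# The notebook's original effective batch size is 4 (batch 4, accumulation 1).
overrides = low_memory_settings(effective_batch=4, per_device_batch=1)

# Hypothetical cap on sequence length; long examples dominate activation memory.
max_seq_length = 512

print(overrides)
```

These values would replace the corresponding variables in the configuration cell. Peak activation memory drops with the smaller per-device batch, at the cost of more forward and backward passes per update.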

In [None]:
# Save trained model
trainer.model.save_pretrained(new_model)

In [None]:
%load_ext tensorboard
%tensorboard --logdir results/runs

In [None]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])
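
The generation cell wraps the prompt in `<s>[INST] ... [/INST]` markers, and the pipeline returns the prompt together with the completion. The sketch below shows one way to build that prompt and strip it from the result. Both helpers (`build_prompt`, `extract_answer`) are hypothetical, and the sample output string is made up for illustration rather than produced by the model.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a single user turn in the [INST] markers used by the cell above."""
    return f"<s>[INST] {user_message.strip()} [/INST]"


def extract_answer(generated_text: str) -> str:
    """Return the text after the last [/INST] marker, stripped of whitespace."""
    return generated_text.rsplit("[/INST]", 1)[-1].strip()


prompt = build_prompt("What is a large language model?")

# Made-up completion standing in for result[0]['generated_text'].
sample = prompt + " A large language model is a neural network trained on text."

print(prompt)
print(extract_answer(sample))
```

With the real pipeline, `extract_answer(result[0]['generated_text'])` would be applied to the model's output instead of `sample`.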