<a href="https://colab.research.google.com/github/arjun7579/miiny-gpt/blob/main/fine_tune_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning a Pretrained LLM using QLoRA

## 🧠 Introduction
In this notebook, we will fine-tune a pretrained LLM using QLoRA (Quantized Low-Rank Adaptation) — an efficient method to adapt large language models even on limited hardware like Google Colab.

We’ll use:

- 📚 CodeAlpaca-20k dataset (programming instructions)

- 🤖 TinyLlama 1.1B model (fits in 4-bit mode)

- ⚙️ PEFT + bitsandbytes for efficient training

## 🔧 Step 1: Install Required Packages

We install:

- bitsandbytes for 4-bit quantization

- transformers and datasets for model and data handling

- peft for parameter-efficient fine-tuning (QLoRA)

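- accelerate for device placement (used by device_map="auto")

- trl for supervised fine-tuning utilities (installed here, but the cells below use the plain Trainer)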




In [None]:
!pip install -q bitsandbytes accelerate peft trl transformers datasets

# 📘 Cell 2: Load the Base Model (TinyLlama 1.1B)

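We use TinyLlama-1.1B-Chat, a compact chat-tuned, decoder-only model built on the Llama architecture.

- bitsandbytes quantizes the weights to 4 bits as they load, so the model fits easily in Colab GPU memory.

- We use the NF4 (4-bit NormalFloat) data type with double quantization, which also quantizes the quantization constants; matrix multiplications are computed in float16.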
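A back-of-the-envelope estimate shows why 4-bit loading helps. The sketch below assumes that all of roughly 1.1 billion weights are quantized (in practice embeddings and a few other layers stay in higher precision, so the real footprint is somewhat larger), that there is one 32-bit scale per block of 64 weights, and that double quantization stores those scales in 8 bits with one 32-bit constant per 256 scales, as described in the QLoRA paper. The function name and constants are illustrative and are not part of the bitsandbytes API.

```python
def estimate_weight_memory_gb(n_params, bits_per_weight, block_size=64,
                              double_quant=False, dq_block_size=256):
    """Rough weight-memory estimate in GB (10**9 bytes).

    Assumption: one 32-bit scale per `block_size` weights; with double
    quantization the scales are stored in 8 bits plus one 32-bit
    second-level constant per `dq_block_size` scales.
    """
    if bits_per_weight >= 16:
        overhead_bits = 0.0  # unquantized weights carry no scale constants
    elif double_quant:
        overhead_bits = 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        overhead_bits = 32 / block_size
    total_bits = n_params * (bits_per_weight + overhead_bits)
    return total_bits / 8 / 1e9

N_PARAMS = 1.1e9  # assumed parameter count for TinyLlama 1.1B

fp16_gb = estimate_weight_memory_gb(N_PARAMS, 16)
nf4_gb = estimate_weight_memory_gb(N_PARAMS, 4)
nf4_dq_gb = estimate_weight_memory_gb(N_PARAMS, 4, double_quant=True)

print(f"fp16:            {fp16_gb:.2f} GB")
print(f"nf4:             {nf4_gb:.2f} GB")
print(f"nf4 + double q.: {nf4_dq_gb:.2f} GB")
```

Under these assumptions the weights take about 2.20 GB in float16, 0.62 GB in NF4, and 0.57 GB in NF4 with double quantization.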

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)



# 📘 Cell 3: Load Dataset (CodeAlpaca)
We use the CodeAlpaca-20k dataset, which contains instruction-following prompts and code completions.

The format_example() function renders each example as a single prompt string with Instruction, Input, and Response sections that the model can learn from.

In [None]:
from datasets import load_dataset

dataset = load_dataset("sahil2801/CodeAlpaca-20k", split="train")

# Named format_example to avoid shadowing Python's built-in format()
def format_example(example):
    # Combine the three dataset fields into one training string
    return {
        "text": f"### Instruction:\n{example['instruction']}\n### Input:\n{example['input']}\n### Response:\n{example['output']}"
    }

dataset = dataset.map(format_example)
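To make the template concrete, here is a small self-contained sketch of what one rendered example looks like. `build_prompt` is a hypothetical helper name used only for this illustration, and the sample instruction is made up rather than taken from CodeAlpaca-20k.

```python
def build_prompt(instruction, input_text="", output=""):
    """Render one example with the same section headers used for training.

    `build_prompt` is a hypothetical helper name used only for this sketch.
    """
    return (
        f"### Instruction:\n{instruction}\n"
        f"### Input:\n{input_text}\n"
        f"### Response:\n{output}"
    )

# A made-up example, not a record from CodeAlpaca-20k
train_text = build_prompt(
    instruction="Reverse the given string.",
    input_text="hello",
    output="print('hello'[::-1])",
)
print(train_text)

# At inference time the response is left empty so the model completes it
inference_prompt = build_prompt("Reverse the given string.", "hello")
assert train_text.startswith(inference_prompt)
```

Keeping the inference prompt identical to the training template up to the Response header is what lets the fine-tuned model continue with an answer.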


# 📘 Cell 4: Tokenize the Data

We truncate or pad every example to exactly 512 tokens, well within TinyLlama's 2048-token context window. A fixed sequence length keeps per-batch memory use predictable and helps avoid out-of-memory (OOM) errors.

In [None]:
def tokenize(example):
    return tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=512
    )

tokenized = dataset.map(tokenize, batched=True)
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask"])
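The effect of truncation=True combined with padding="max_length" can be sketched in plain Python. This is a simplified illustration with toy token ids and right-side padding assumed; a real tokenizer pads on whichever side its `padding_side` setting specifies.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return fixed-length input_ids plus the matching attention mask.

    Assumes right-side padding; this is a sketch, not the tokenizer's
    actual implementation.
    """
    kept = token_ids[:max_length]          # truncation
    n_pad = max_length - len(kept)         # how many pad tokens are needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padding
    return input_ids, attention_mask

# Toy token ids and a toy pad id, chosen only for illustration
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)

print(short_ids, short_mask)  # [5, 6, 7, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

The attention mask is what tells the model to ignore the padded positions.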

# 📘 Cell 5: Apply LoRA Adapters

We apply LoRA only to the attention projection layers. This cuts the number of trainable parameters to a few million, while the frozen base weights retain the model's pretrained knowledge.

print_trainable_parameters() confirms that only the adapter weights will be updated.

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit model for training before attaching adapters
model = prepare_model_for_kbit_training(model)

# NOTE: r, lora_alpha, lora_dropout, and the exact target_modules below are
# assumed values; tune them for your own run.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

# Wrap the frozen base model so that only the LoRA weights are trainable
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
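The "few million trainable parameters" figure can be sanity-checked with simple arithmetic: a LoRA adapter on a linear layer adds two matrices with `rank * (in_features + out_features)` weights in total. The sketch below assumes TinyLlama-1.1B's published shapes (hidden size 2048, 22 decoder layers, key/value projections 256 wide because of grouped-query attention) and an assumed rank of 8 on all four attention projections; adjust the constants if your configuration differs.

```python
def lora_param_count(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)

# Assumed TinyLlama-1.1B shapes: hidden size 2048, 22 decoder layers, and
# grouped-query attention with 4 key/value heads of dimension 64, so the
# k_proj and v_proj outputs are 256 wide.
HIDDEN, KV_DIM, N_LAYERS = 2048, 256, 22
RANK = 8  # assumed LoRA rank

projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

per_layer = sum(lora_param_count(i, o, RANK) for i, o in projections.values())
total = per_layer * N_LAYERS

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
print(f"share of 1.1B: {100 * total / 1.1e9:.2f}%")
```

Under these assumptions the adapters add 102,400 parameters per layer and 2,252,800 in total, roughly 0.2% of the base model.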


# Cell 6: Set Up the Trainer

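We use Hugging Face's Trainer API to simplify training. A per-device batch size of 4 with 4 gradient-accumulation steps gives an effective batch size of 16, and fp16 mixed precision keeps this setup comfortable on Colab. The loss is logged every 25 steps and a checkpoint is saved at the end of each epoch.

In [None]:
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./qlora-tinyllama",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=2,
    logging_steps=25,
    save_strategy="epoch",
    # Evaluation is off by default, so no evaluation-strategy argument is
    # passed (its name differs between transformers versions).
    fp16=True
)

# The tokenized data has no "labels" column; this collator copies input_ids
# into labels (ignoring padding positions) so the Trainer can compute a loss.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator
)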
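The batch settings translate into a step count as follows. This is a rough sketch that assumes about 20,000 training examples on a single GPU; the Trainer's exact rounding of the last partial accumulation step can differ between versions, and `training_steps` is a hypothetical helper, not a transformers function.

```python
import math

def training_steps(n_examples, per_device_batch, grad_accum, epochs, n_devices=1):
    """Return (effective batch size, optimizer steps per epoch, total steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    # Number of forward/backward passes (micro-batches) in one epoch
    micro_batches = math.ceil(n_examples / (per_device_batch * n_devices))
    # One optimizer step is taken every `grad_accum` micro-batches
    steps_per_epoch = max(micro_batches // grad_accum, 1)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

# Assumption: about 20,000 training examples on a single GPU
effective, per_epoch, total = training_steps(20_000, 4, 4, epochs=2)
print(effective, per_epoch, total)  # 16 1250 2500
print("log lines at logging_steps=25:", total // 25)  # 100
```

With these settings each optimizer step sees 16 examples, so two epochs take about 2,500 steps and produce about 100 log entries.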


# Cell 7: Train the Model

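The Trainer now updates only the LoRA adapter weights; the quantized base model stays frozen. The logged training loss should decrease as training progresses.

- Training is memory-efficient because the base weights are stored in 4 bits and gradients are kept only for the adapters.

- To scale up, train for more epochs or use a larger dataset.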

In [None]:
trainer.train()

# Cell 8: Inference (Code Generation)

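After training, we test the model on a custom prompt that follows the training template, with the Response section left empty for the model to complete. We use nucleus sampling (top_p=0.9) with a temperature of 0.8, which adds mild randomness for more natural responses.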

In [None]:
model.eval()

prompt = "### Instruction:\nWrite a Python function to check if a number is a palindrome.\n### Input:\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.9,
    temperature=0.8
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
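For intuition about what top_p and temperature do, here is a simplified pure-Python sketch of nucleus sampling over a toy 5-token vocabulary. It is not the transformers implementation, which works on tensors and handles extra edge cases; the logits are illustrative values.

```python
import math
import random

def top_p_filter(logits, top_p=0.9, temperature=0.8):
    """Return {token_index: probability} for the nucleus of the distribution."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Toy logits over a 5-token vocabulary (illustrative values)
nucleus = top_p_filter([2.0, 1.0, 0.5, -1.0, -3.0])
print(nucleus)

# Draw the next token from the filtered distribution
rng = random.Random(0)
token = rng.choices(list(nucleus), weights=list(nucleus.values()))[0]
print("sampled token index:", token)
```

Here the two least likely tokens fall outside the 0.9 nucleus and can never be sampled, which is how top_p trims unlikely continuations while still allowing some variety.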


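# Cell 9: Save Adapter Weights

Calling save_pretrained() on a PEFT-wrapped model writes only the LoRA adapter weights, not the full base model. We save the tokenizer alongside them.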

In [None]:
model.save_pretrained("tinyllama-qlora-codealpaca")
tokenizer.save_pretrained("tinyllama-qlora-codealpaca")


---