# **Falcon 1B LLM Fine-Tuning**

In this demo, you will fine-tune the Falcon3-1B-Base model using Parameter-Efficient Fine-Tuning (PEFT) with LoRA.
You will tokenize a subset of WikiText-2 and configure the key LoRA parameters (rank, scaling, dropout) for efficient training.
Finally, you will compare the model outputs before and after fine-tuning to see what effect the method has.

### Steps to be followed:

1. Install required packages and import libraries
2. Set device and load pre-trained model
3. Configure PEFT with LoRA
4. Move the model to the device
5. Load and preprocess the dataset
6. Define a custom data collator
7. Generate and store model outputs before fine-tuning
8. Configure training arguments and fine-tune the model
9. Compare model outputs after fine-tuning

### **Step 1: Install required packages and import libraries**

In this first step, we install the essential packages and import the necessary libraries.

Installing and importing packages like transformers, peft, and datasets prepares the environment for loading pre-trained models, applying parameter-efficient fine-tuning, and managing datasets. This sets the foundation for the entire demo.
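
Before running the imports, it can help to confirm that the packages are actually visible to the current interpreter. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of any of the libraries used in this demo.

```python
import importlib.util

# Top-level packages this demo imports; torch is assumed to be preinstalled
# (as it typically is on Colab), the rest come from the pip command.
REQUIRED = ["torch", "transformers", "peft", "datasets"]

def missing_packages(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, rerun the install cell (and restart the runtime if the notebook environment asks for it).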

In [1]:
# Install required packages
!pip install transformers peft datasets


In [3]:
import os
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    default_data_collator
)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

### **Step 2: Set device and load pre-trained model**

Here, we check whether a GPU is available and load the pre-trained Falcon3-1B-Base model together with its tokenizer.

A GPU, when available, significantly accelerates training. Starting from a pre-trained model gives us a base to fine-tune, which saves both time and resources compared to training from scratch.

In [4]:
# Set device to GPU if available (e.g., T4), otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [5]:
# Load Falcon 1B model and tokenizer

model_name = "tiiuae/Falcon3-1B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


In [6]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(131072, 2048)
    (layers): ModuleList(
      (0-17): 18 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-06)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-06)
    (rotary_emb

### **Step 3: Configure PEFT with LoRA**

We now set up the Low-Rank Adaptation (LoRA) configuration for PEFT. This involves defining several parameters:

`r=8`: This is the LoRA rank. It determines the size of the low-rank matrices added to the model. A higher rank can capture more nuance but increases the number of trainable parameters.

`lora_alpha=32`: This scaling factor adjusts the magnitude of the LoRA update. The low-rank update is multiplied by `lora_alpha / r` (here 32 / 8 = 4), which helps stabilize training.

`target_modules=["q_proj", "v_proj"]`: Specifies which layers LoRA is applied to. For Falcon 1B, targeting the *q_proj* and *v_proj* modules (the query and value projections in each attention block) is a common choice, since these layers significantly impact the model's performance.

`lora_dropout=0.1`: This dropout rate is used on the LoRA layers to regularize the training and prevent overfitting.

`bias="none"`: Indicates that the bias parameters are not being fine-tuned.

`task_type="CAUSAL_LM"`: Specifies that our task is causal language modeling.


This step is crucial because it configures the PEFT method. By adding trainable low-rank matrices only to key components, we significantly reduce the number of parameters to update, making fine-tuning both memory- and compute-efficient.
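
To make the idea of a low-rank update concrete, here is a small pure-Python sketch of a LoRA-style forward pass, `y = W x + (lora_alpha / r) * B (A x)`. The matrices are toy values chosen for illustration, and the helper names `matvec` and `lora_forward` are hypothetical; the real computation happens inside PEFT on tensors.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen path W x plus the scaled low-rank path B (A x)."""
    base = matvec(W, x)                   # frozen pre-trained weight
    low_rank = matvec(B, matvec(A, x))    # trainable adapter, rank r
    scale = lora_alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen weight, 2 x 2
A = [[1.0, -0.5 * 3]]          # adapter A, r x in_features = 1 x 2
A = [[0.5, -0.5]]              # (toy values; overwritten for a simpler example)
x = [2.0, 1.0]

# With B at zero the adapter contributes nothing: the output equals W x.
B_zero = [[0.0], [0.0]]
y_start = lora_forward(W, A, B_zero, x, lora_alpha=4, r=1)

# Once B has non-zero entries, the output shifts by scale * B (A x).
B_trained = [[1.0], [0.0]]
y_trained = lora_forward(W, A, B_trained, x, lora_alpha=4, r=1)

print(y_start, y_trained)
```

The toy uses `lora_alpha=4, r=1`, which gives the same scale of 4 as the `lora_alpha=32, r=8` configuration in this demo. Only `A` and `B` would receive gradient updates; `W` stays fixed.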

In [7]:
# Configure LoRA for PEFT

lora_config = LoraConfig(
    r=8,                    # LoRA rank: defines the size of the low-rank matrices
    lora_alpha=32,          # Scaling factor: scales the low-rank updates
    target_modules=["q_proj", "v_proj"],  # Target module; adjust based on Falcon's architecture
    lora_dropout=0.1,       # Dropout rate for regularization
    bias="none",            # Do not fine-tune bias terms
    task_type="CAUSAL_LM"   # Task type: causal language modeling
)
model = get_peft_model(model, lora_config)

The Falcon 1B model is now wrapped with LoRA-based PEFT. Only the additional low-rank parameters in the targeted modules will be updated during fine-tuning, making the process more efficient while retaining performance.
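
As a rough sanity check, the number of added LoRA parameters can be worked out by hand from the shapes in the printed model (18 decoder layers, `q_proj` 2048 to 2048, `v_proj` 2048 to 1024). The sketch below assumes each targeted layer gets one `A` matrix of shape `r x in_features` and one `B` matrix of shape `out_features x r`, with no bias terms.

```python
# Shapes read from the print(model) output: (in_features, out_features).
NUM_LAYERS = 18
TARGETS = {"q_proj": (2048, 2048), "v_proj": (2048, 1024)}
R = 8
LORA_ALPHA = 32

def lora_params(in_features, out_features, r):
    """Parameters in A (r x in_features) plus B (out_features x r)."""
    return r * in_features + out_features * r

per_layer = sum(lora_params(i, o, R) for i, o in TARGETS.values())
total_lora = NUM_LAYERS * per_layer

# Size of the frozen weights that these adapters sit next to.
frozen_targeted = NUM_LAYERS * sum(i * o for i, o in TARGETS.values())

print(f"LoRA parameters:         {total_lora:,}")
print(f"Frozen targeted weights: {frozen_targeted:,}")
print(f"Ratio:                   {total_lora / frozen_targeted:.2%}")
print(f"Effective scaling:       {LORA_ALPHA / R}")
```

Under these assumptions the adapters add about one million parameters, under 1% of the size of the projections they adapt. Calling `model.print_trainable_parameters()` on the wrapped model should report the trainable count directly.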

### **Step 4: Move the model to the device**

Next, we move the model to the selected device.

Placing the model on a GPU, when one is available, speeds up both training and inference, because GPUs accelerate the matrix operations that dominate deep neural networks.

In [8]:
# Move the model to the chosen device (GPU if available, otherwise CPU)
model = model.to(device)

### **Step 5: Load and preprocess the dataset**

We load the first 70% of the WikiText-2 training split, which contains raw text data. Then, we tokenize the text using the Falcon 1B tokenizer, truncating each example to a maximum sequence length of 128 tokens. Finally, we filter out any examples that yield empty token sequences.


Preprocessing the data is a critical step before training. Tokenizing converts raw text into numerical tokens that the model can understand. Filtering ensures that only valid data is passed to the model, which improves training quality.
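
The effect of truncation and filtering can be illustrated without downloading anything. The sketch below uses a whitespace split as a stand-in for the real tokenizer (an assumption made purely for illustration; the Falcon tokenizer produces subword IDs, not words) and made-up rows that mimic blank lines and an overly long passage.

```python
def toy_tokenize(text, max_length=128):
    # Stand-in for the real tokenizer: split on whitespace, then truncate.
    return text.split()[:max_length]

# Made-up rows: blank lines, a short heading, and a passage longer than 128 tokens.
rows = ["", " = Some Heading = ", "", "word " * 200]

tokenized = [{"input_ids": toy_tokenize(text)} for text in rows]

# Same condition as the filter in the notebook cell: drop empty sequences.
kept = [ex for ex in tokenized if len(ex["input_ids"]) > 0]

print(len(rows), len(kept), max(len(ex["input_ids"]) for ex in kept))
```

Empty rows would otherwise reach the model as zero-length sequences, and the long row shows that truncation caps every example at `max_length`.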

In [9]:
# Load a subset of WikiText-2 for demonstration
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:70%]")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized_datasets = tokenized_datasets.filter(lambda x: len(x["input_ids"]) > 0)


### **Step 6: Define a custom data collator**

We define a custom data collator function to prepare batches of data for training. This function ensures:

- The `input_ids` are cast to long tensors.

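- The `labels` are set correctly. If labels are not provided, they are set to a copy of `input_ids`, which is the usual target for causal language modeling (the model shifts the labels internally when computing the loss).


A data collator combines individual examples into a batch. Note that `default_data_collator` does not pad sequences, so this collator relies on the batch size of 1 used in this demo; larger batches of variable-length sequences would need padding.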

In [10]:
# Custom data collator without moving the batch to the device (Trainer will handle that)
def collate_fn(features):
    batch = default_data_collator(features)
    batch["input_ids"] = batch["input_ids"].long()
    if "labels" in batch:
        batch["labels"] = batch["labels"].long()
    else:
        batch["labels"] = batch["input_ids"].clone()
    return batch
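
If the batch size were raised above 1, sequences of different lengths would have to be padded before they can be stacked. The following is a minimal pure-Python sketch of that step, not the collator used in this demo; `pad_batch` and its `pad_token_id` argument are hypothetical names, and the padded label positions use -100, the index that the cross-entropy loss ignores.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def pad_batch(features, pad_token_id):
    """Right-pad every example to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        ids = list(f["input_ids"])
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels mirror the inputs, but padding must not contribute to the loss.
        batch["labels"].append(ids + [IGNORE_INDEX] * n_pad)
    return batch

example = pad_batch([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}], pad_token_id=0)
print(example)
```

In a real run the lists would be converted to tensors. The `transformers` library also provides `DataCollatorForLanguageModeling(tokenizer, mlm=False)`, which pads and builds labels in a similar way, provided the tokenizer has a padding token defined.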

### **Step 7: Generate and store model outputs before fine-tuning**

Before fine-tuning the model, we generate outputs for several predefined prompts. These outputs represent the baseline performance of the model in its pre-fine-tuned state and are stored for later comparison.

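Capturing the model's behavior before fine-tuning is crucial for demonstrating how PEFT changes the model's responses. Although the model is already wrapped with LoRA at this point, the adapters contribute nothing before training (with PEFT's default initialization, the B matrices start at zero), so these outputs reflect the base model and serve as the baseline for the before-and-after comparison.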

In [11]:
from transformers import logging as hf_logging
hf_logging.set_verbosity_error()

# Generate baseline outputs BEFORE fine-tuning using some sample prompts

# Define your prompts for comparison
prompts = [
    "Discuss the historical development of the internet and its impact on modern society",
    "Explain the principles of quantum mechanics in simple terms",
    "Analyze the role of renewable energy in combating climate change",
    "Describe the contributions of Renaissance artists to modern culture"
]

print("=== Generating outputs BEFORE fine-tuning ===")
before_outputs = {}
for prompt in prompts:
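    inp = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    # Note: temperature, top_k, and top_p only take effect when sampling is
    # enabled (do_sample=True); without it, these settings are ignored.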
    before_ids = model.generate(
        inp,
        max_length=300,
        temperature=0.8,
        no_repeat_ngram_size=3,
        repetition_penalty=2.0,
        top_k=50,
        top_p=0.95
    )
    before_text = tokenizer.decode(before_ids[0], skip_special_tokens=True)
    before_outputs[prompt] = before_text
    print(f"Prompt: {prompt}\nBefore: {before_text}\n{'-'*40}")

=== Generating outputs BEFORE fine-tuning ===




Prompt: Discuss the historical development of the internet and its impact on modern society
Before: Discuss the historical development of the internet and its impact on modern society.
2019-365 days ago · The Internet is a global network that connects computers, servers or other devices to each others using standard protocols for communication over computer networks such as TCP/IP.. It was developed in response by Tim Berners Lee at CERN during World War II with his idea called hypertext transfer protocol which later became known simply http://www....The history behind how we got here: From ARPANET through today's web browsers; from dialup modems via DSL lines all up until now when you can just click an icon! This article will take your knowledge about this amazing invention into another dimension where it has become part...In fact there are many different types including social media sites like Facebook Twitter Instagram etc., but they share one common goal—to connect people around wo

### **Step 8: Configure training arguments and fine-tune the model**

We now set up the training configuration with specific parameters:

- `output_dir`: Directory to save the fine-tuned model
- `run_name`: Name for the training run
- `max_steps`: Limits the number of training steps (set to 50 for this demo)
- `per_device_train_batch_size`: Batch size for each device
- `learning_rate`: The learning rate for fine-tuning
- `logging_steps` and `save_steps`: Frequency of logging and saving checkpoints
- `num_train_epochs`: Number of training epochs (has no effect here, because a positive `max_steps` takes precedence)
- `fp16`: Enables mixed precision training for speed on GPUs
- `no_cuda`: Set to `False` so that CUDA (GPU) is not disabled and is used if available
- `report_to`: Set to an empty list to disable external logging tools like Weights & Biases

Then we initialize the Hugging Face Trainer with our model, training arguments, dataset, and data collator, and run the training process.

This step fine-tunes the model using our dataset and LoRA configuration. The training arguments control the fine-tuning process, and running the trainer updates only the low-rank parameters introduced by LoRA.
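
The learning rates in the training log fall by 4e-6 per step, which is consistent with a linear decay from `learning_rate=2e-4` to zero over `max_steps=50` with no warmup. The sketch below reproduces that schedule in plain Python; `linear_lr` is a hypothetical helper written for illustration, not a function from `transformers`.

```python
def linear_lr(step, base_lr=2e-4, max_steps=50, warmup_steps=0):
    """Learning rate after `step` optimizer steps under linear warmup and decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)

# Print the schedule at a few points; step 1 should be close to the first logged value.
for step in (1, 2, 5, 25, 50):
    print(step, f"{linear_lr(step):.6f}")
```

With only 50 steps, the rate is already at zero by the end of the run, so later steps change the adapters much less than early ones.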

In [13]:
# Set up training arguments for fine-tuning
training_args = TrainingArguments(
    output_dir="./falcon_1b_lora_finetuned",
    run_name="falcon-1b-lora",
    overwrite_output_dir=True,
    max_steps=50,                        # Quick demo: 50 training steps
    per_device_train_batch_size=1,
    learning_rate=2e-4,
    logging_steps=1,
    save_steps=5,
    num_train_epochs=5,                 # Overridden by max_steps when max_steps > 0
    fp16=True,                          # Use mixed precision training for GPUs
    no_cuda=False,                      # Do not disable CUDA; use the GPU if available
    report_to=[]                        # Disable reporting to external services
)

In [14]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    data_collator=collate_fn,
)

In [15]:
trainer.train()

{'loss': 2.7677, 'grad_norm': 0.501239001750946, 'learning_rate': 0.000196, 'epoch': 6.041200990756963e-05}
{'loss': 2.8031, 'grad_norm': 0.8852635025978088, 'learning_rate': 0.000192, 'epoch': 0.00012082401981513926}
{'loss': 3.3461, 'grad_norm': 1.2856632471084595, 'learning_rate': 0.000188, 'epoch': 0.00018123602972270887}
{'loss': 2.8077, 'grad_norm': 0.703910231590271, 'learning_rate': 0.00018400000000000003, 'epoch': 0.0002416480396302785}
{'loss': 2.6074, 'grad_norm': 0.7563310265541077, 'learning_rate': 0.00018, 'epoch': 0.0003020600495378481}
{'loss': 3.9421, 'grad_norm': nan, 'learning_rate': 0.00018, 'epoch': 0.00036247205944541774}
{'loss': 3.348, 'grad_norm': 1.841776728630066, 'learning_rate': 0.00017600000000000002, 'epoch': 0.0004228840693529874}
{'loss': 3.4759, 'grad_norm': 0.7791195511817932, 'learning_rate': 0.000172, 'epoch': 0.000483296079260557}
{'loss': 3.1782, 'grad_norm': 0.6232472062110901, 'learning_rate': 0.000168, 'epoch': 0.0005437080891681266}
{'loss': 3



{'loss': 3.0061, 'grad_norm': nan, 'learning_rate': 6.400000000000001e-05, 'epoch': 0.0021748323566725064}
{'loss': 2.9518, 'grad_norm': 1.0133031606674194, 'learning_rate': 6e-05, 'epoch': 0.002235244366580076}
{'loss': 2.9494, 'grad_norm': 0.8881255984306335, 'learning_rate': 5.6000000000000006e-05, 'epoch': 0.002295656376487646}
{'loss': 3.1532, 'grad_norm': 1.2746647596359253, 'learning_rate': 5.2000000000000004e-05, 'epoch': 0.0023560683863952155}
{'loss': 2.9856, 'grad_norm': 0.9390109777450562, 'learning_rate': 4.8e-05, 'epoch': 0.002416480396302785}
{'loss': 2.9265, 'grad_norm': 0.9401560425758362, 'learning_rate': 4.4000000000000006e-05, 'epoch': 0.0024768924062103545}
{'loss': 2.8525, 'grad_norm': 0.8157004117965698, 'learning_rate': 4e-05, 'epoch': 0.002537304416117924}
{'loss': 4.4723, 'grad_norm': 4.27672815322876, 'learning_rate': 3.6e-05, 'epoch': 0.002597716426025494}
{'loss': 2.4248, 'grad_norm': 3.3562183380126953, 'learning_rate': 3.2000000000000005e-05, 'epoch': 0.0

TrainOutput(global_step=50, training_loss=2.9765391993522643, metrics={'train_runtime': 21.3706, 'train_samples_per_second': 2.34, 'train_steps_per_second': 2.34, 'train_loss': 2.9765391993522643, 'epoch': 0.0030206004953784813})
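
The `epoch` values in the log show how small this run is relative to the dataset. The arithmetic below is a sketch that assumes a single device and no gradient accumulation, so that one step consumes exactly `per_device_train_batch_size` examples; the per-step epoch value is copied from the first log line.

```python
max_steps = 50
batch_size = 1
epoch_per_step = 6.041200990756963e-05  # 'epoch' logged after the first step

# One step advances the epoch counter by batch_size / examples_per_epoch.
examples_per_epoch = round(batch_size / epoch_per_step)
fraction_of_epoch = max_steps * batch_size / examples_per_epoch
steps_for_five_epochs = 5 * examples_per_epoch // batch_size

print(f"Examples per epoch (inferred): {examples_per_epoch}")
print(f"Fraction of one epoch covered: {fraction_of_epoch:.4%}")
print(f"Steps needed for 5 epochs:     {steps_for_five_epochs}")
```

The 50 steps cover roughly 0.3% of one pass over the filtered dataset, which matches the final `epoch` value in `TrainOutput` and shows that `num_train_epochs=5` was never reached.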

### **Step 9: Compare model outputs after fine-tuning**

Finally, we generate outputs for the same set of prompts using the fine-tuned model. These outputs are compared side-by-side with the baseline outputs generated earlier.


Comparing outputs before and after fine-tuning shows how the training process changed the model's responses. Keep in mind that this run is short (50 steps at a batch size of 1), so the differences may be subtle.
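
Reading two long generations side by side can make small differences hard to spot. One option, sketched here with the standard library only, is to strip the echoed prompt and compute a word-level similarity score. The helpers `strip_prompt` and `similarity` are hypothetical, and the two strings are placeholders rather than real model outputs.

```python
from difflib import SequenceMatcher

def strip_prompt(prompt, text):
    """Remove the echoed prompt from the start of a generation, if present."""
    return text[len(prompt):].strip() if text.startswith(prompt) else text.strip()

def similarity(before, after):
    """Word-level similarity in [0, 1]; 1.0 means the texts are identical."""
    return SequenceMatcher(None, before.split(), after.split()).ratio()

# Placeholder strings standing in for before_outputs[prompt] and after_text.
prompt = "Explain the topic"
before = strip_prompt(prompt, "Explain the topic in a long rambling way")
after = strip_prompt(prompt, "Explain the topic in a short clear way")

print(before, "|", after, "|", round(similarity(before, after), 2))
```

A score like this only indicates how much the text changed, not whether it improved, so it complements rather than replaces reading the outputs.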

In [16]:
from transformers import logging as hf_logging
hf_logging.set_verbosity_error()

# Compare outputs AFTER fine-tuning
print("\n=== Comparing Outputs Before and After Fine-Tuning ===")
for prompt in prompts:
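    inp = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    # Note: temperature, top_k, and top_p only take effect when sampling is
    # enabled (do_sample=True); without it, these settings are ignored.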
    after_ids = model.generate(
        inp,
        max_length=300,
        temperature=0.8,
        no_repeat_ngram_size=3,
        repetition_penalty=2.0,
        top_k=50,
        top_p=0.95
    )
    after_text = tokenizer.decode(after_ids[0], skip_special_tokens=True)

    print(f"\nPrompt: {prompt}\n")
    print("|Before fine-tuning:|")
    print(before_outputs[prompt])
    print("|After fine-tuning:|")
    print(after_text)
    print("=" * 40)


=== Comparing Outputs Before and After Fine-Tuning ===





Prompt: Discuss the historical development of the internet and its impact on modern society

|Before fine-tuning:|
Discuss the historical development of the internet and its impact on modern society.
2019-365 days ago · The Internet is a global network that connects computers, servers or other devices to each others using standard protocols for communication over computer networks such as TCP/IP.. It was developed in response by Tim Berners Lee at CERN during World War II with his idea called hypertext transfer protocol which later became known simply http://www....The history behind how we got here: From ARPANET through today's web browsers; from dialup modems via DSL lines all up until now when you can just click an icon! This article will take your knowledge about this amazing invention into another dimension where it has become part...In fact there are many different types including social media sites like Facebook Twitter Instagram etc., but they share one common goal—to connect 

### Conclusion

By following these steps, you have fine-tuned the Falcon3-1B-Base model using Parameter-Efficient Fine-Tuning with LoRA. Using a subset of WikiText-2, you configured key LoRA parameters such as rank, scaling factor, dropout, and target modules so that only the low-rank adapter parameters are updated. Comparing the model's responses before and after fine-tuning gives practical insight into how the PEFT approach can adapt large language models efficiently.