# Lab 1: LoRA & QLoRA - Fine-Tuning a Llama-2 Model
---
## Notebook 4: Merge for Deployment

**Goal:** In this final notebook, you will learn how to merge the trained LoRA adapter weights back into the base model. This creates a single, standalone model that can be easily deployed without needing the `peft` library for inference.

**You will learn to:**
-   Reload the base model and the trained PEFT adapter.
-   Use the `merge_and_unload()` method to combine the weights.
-   Save the merged model and its tokenizer for future use.


### Step 1: Reload Model and Adapter

As before, we load the base model with the same 4-bit quantization configuration used during training and then attach the PEFT adapter on top of it. The adapter stores only the LoRA matrices, so it needs a base model with the same architecture and module names before we attempt to merge the weights.
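
Before loading anything large, it can help to confirm that the adapter was trained on the base model you are about to load. The sketch below reads the `adapter_config.json` file that `peft` writes next to the adapter weights and compares its `base_model_name_or_path` field with a model ID. The helper names `read_adapter_config` and `adapter_matches_base` are hypothetical, and the comparison is a plain string match, so it would report a mismatch if training used a local path instead of the Hub ID.

```python
import json
import os


def read_adapter_config(adapter_dir):
    """Load adapter_config.json from a saved PEFT adapter directory."""
    config_path = os.path.join(adapter_dir, "adapter_config.json")
    with open(config_path, encoding="utf-8") as f:
        return json.load(f)


def adapter_matches_base(adapter_dir, model_id):
    """Return True if the adapter records `model_id` as its base model."""
    config = read_adapter_config(adapter_dir)
    return config.get("base_model_name_or_path") == model_id
```

In this notebook the call would look like `adapter_matches_base(latest_checkpoint, model_id)` once both variables are defined.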


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import os

# --- Base Model and Tokenizer Loading ---
model_id = "meta-llama/Llama-2-7b-chat-hf"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# --- Load PEFT Adapter ---
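output_dir = "./lora-llama2-7b-guanaco"

# Collect the checkpoint directories written during training
checkpoints = [
    os.path.join(output_dir, d)
    for d in os.listdir(output_dir)
    if d.startswith("checkpoint-") and os.path.isdir(os.path.join(output_dir, d))
]
if not checkpoints:
    raise FileNotFoundError(f"No checkpoint-* directories found in {output_dir}")

# Pick the most recently modified checkpoint
latest_checkpoint = max(checkpoints, key=os.path.getmtime)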

peft_model = PeftModel.from_pretrained(base_model, latest_checkpoint)

print("✅ Model and adapter loaded successfully!")
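
Selecting the checkpoint by modification time can pick the wrong directory if checkpoints were copied or synced after training. As an alternative, here is a minimal sketch that selects by the step number in the directory name instead. It assumes the `checkpoint-<step>` naming seen above, and the helper name `latest_checkpoint_by_step` is hypothetical.

```python
import os
import re

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint_by_step(output_dir):
    """Return the path of the checkpoint-<step> directory with the highest step."""
    candidates = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            # Compare steps as integers so that 1000 ranks above 90.
            candidates.append((int(match.group(1)), path))
    if not candidates:
        raise FileNotFoundError(f"No checkpoint-<step> directories in {output_dir!r}")
    return max(candidates)[1]
```

Comparing the steps as integers matters because a plain string sort would place `checkpoint-90` after `checkpoint-1000`.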


### Step 2: Merge the Adapter into the Base Model

This is the key step for deployment. Merging the adapter simplifies the inference process, as you no longer need to manage a separate set of adapter weights.

#### Key Hugging Face `peft` Component:

-   `peft_model.merge_and_unload()`: This method performs two actions:
    1.  **Merge**: It computes each LoRA weight update, the low-rank product `B @ A` scaled by `lora_alpha / r`, and adds it directly to the weights of the corresponding `target_modules` in the base model.
    2.  **Unload**: It removes the LoRA adapter layers from the model, leaving only the original module structure.

The result is a standard `transformers` model (e.g., `LlamaForCausalLM`) that has the fine-tuning "baked in."
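
To make the merge step concrete, here is a dependency-free sketch on a toy layer. It adds the LoRA update, the product of `B` and `A` scaled by `lora_alpha / r`, to a frozen weight matrix and then compares the output of the merged weight with the output of the base path plus the adapter path. All numbers and the helper names `matmul` and `merge_lora` are made up for illustration; this is not how `peft` implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Return the merged weight W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy layer with 3 inputs, 2 outputs, and LoRA rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen base weight, shape (out, in)
A = [[1.0, 2.0, 3.0]]                   # LoRA A, shape (r, in)
B = [[0.5], [-1.0]]                     # LoRA B, shape (out, r)
lora_alpha, r = 2, 1
x = [[1.0], [1.0], [1.0]]               # one input as a column vector

# Adapter attached: base path plus the scaled low-rank path.
scale = lora_alpha / r
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
unmerged_out = [[b[0] + scale * l[0]] for b, l in zip(base_out, lora_out)]

# Adapter merged: a single matrix multiply with the updated weight.
W_merged = merge_lora(W, A, B, lora_alpha, r)
merged_out = matmul(W_merged, x)

print(W_merged)
print(unmerged_out, merged_out)
```

Both paths give the same output, which is why the merged model no longer needs the separate adapter matrices at inference time.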

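**Note:** Merging adapters into a `bitsandbytes`-quantized model requires a recent `peft` release. The quantized weights are dequantized, updated with the LoRA delta, and quantized again, so the merged model's outputs can differ slightly from the unmerged one because of rounding. Saving and reloading a merged quantized model also depends on your `transformers` and `bitsandbytes` versions. The most reliable workflow is to reload the base model without quantization and merge the adapter into that. We demonstrate the general process here.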
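
The non-quantized workflow mentioned in the note could look roughly like the sketch below. It reloads the base model in 16-bit precision, attaches the adapter, merges, and saves. The function name `merge_adapter_in_half_precision` and its parameters are hypothetical, and the choice of `torch.float16` is an assumption. A 7B model in 16-bit precision needs roughly 14 GB of memory for the weights alone.

```python
def merge_adapter_in_half_precision(model_id, adapter_dir, save_dir):
    """Merge a LoRA adapter into a non-quantized fp16 base model and save the result."""
    # Imported lazily so that defining this helper does not load the heavy libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Load the base model without a quantization_config.
    base = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
    )

    # Attach the trained adapter and fold it into the base weights.
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()

    # Save the merged weights together with the tokenizer.
    merged.save_pretrained(save_dir)
    AutoTokenizer.from_pretrained(model_id).save_pretrained(save_dir)
    return merged
```

With the variables from this notebook, a call would be `merge_adapter_in_half_precision(model_id, latest_checkpoint, "./llama2-7b-guanaco-merged")`.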


In [None]:
# Merge the LoRA adapter into the base model
print("🚀 Merging adapter...")
merged_model = peft_model.merge_and_unload()
print("✅ Adapter merged!")

# The model is now a standard transformers model
print(f"\nType of merged model: {type(merged_model)}")
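
As a quick sanity check on the unload step, you can look for leftover LoRA parameters. The sketch below filters a list of parameter names for the `lora_` substring that `peft` uses in names such as `lora_A` and `lora_B`. The helper name `remaining_lora_parameters` is hypothetical, and the example names are illustrative rather than copied from a real run.

```python
def remaining_lora_parameters(parameter_names):
    """Return the parameter names that still belong to LoRA layers."""
    return [name for name in parameter_names if "lora_" in name]


# Illustrative names: one plain weight and one LoRA weight.
example_names = [
    "model.layers.0.self_attn.q_proj.weight",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight",
]
print(remaining_lora_parameters(example_names))

# With the merged model from this notebook, the list should come back empty:
# names = [name for name, _ in merged_model.named_parameters()]
# print(remaining_lora_parameters(names))
```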


### Step 3: Save the Merged Model for Deployment

Now that we have a single, merged model, we can save it to a directory using the standard Hugging Face method.

#### Key Hugging Face `transformers` Components:

-   `model.save_pretrained(directory)`: Saves the model's weights and configuration file (`config.json`) to the specified directory.
-   `tokenizer.save_pretrained(directory)`: Saves the tokenizer's vocabulary and configuration files.

The resulting directory contains a complete, self-contained model that can be easily shared or loaded elsewhere using `AutoModelForCausalLM.from_pretrained(directory)`.


In [None]:
# Define the directory to save the merged model
save_directory = "./llama2-7b-guanaco-merged"

# Save the merged model and tokenizer
print(f"💾 Saving merged model to: {save_directory}")
merged_model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
print("✅ Merged model saved successfully!")

# You can now load this model directly like any other Hugging Face model
# from transformers import AutoModelForCausalLM
# loaded_model = AutoModelForCausalLM.from_pretrained(save_directory)
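
Once the save completes, a quick look at the directory can confirm that it holds a full model rather than only an adapter. The sketch below checks for the usual file names: `config.json`, model weight files, tokenizer files, and the `adapter_config.json` that marks an adapter directory. The helper name `summarize_model_dir` is hypothetical, and the exact set of files can vary with library versions, so treat the result as a hint rather than a guarantee.

```python
import os


def summarize_model_dir(directory):
    """Report which kinds of Hugging Face files are present in a directory."""
    files = set(os.listdir(directory))
    weight_prefixes = ("model", "pytorch_model")
    weight_suffixes = (".safetensors", ".bin")
    return {
        "has_config": "config.json" in files,
        # Full-model weights, possibly sharded across several files.
        "has_model_weights": any(
            f.startswith(weight_prefixes) and f.endswith(weight_suffixes)
            for f in files
        ),
        "has_tokenizer": "tokenizer_config.json" in files,
        # An adapter_config.json means the directory holds a PEFT adapter.
        "is_adapter_only": "adapter_config.json" in files,
    }
```

For the directory saved above, `summarize_model_dir(save_directory)` would be expected to report a config, model weights, and a tokenizer, with `is_adapter_only` set to `False`.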
