# Lab 1: LoRA & QLoRA - Fine-Tuning a Llama-2 Model
---
## Notebook 3: Inference

**Goal:** In this notebook, you will learn how to load your fine-tuned PEFT adapter and use it for inference.

**You will learn to:**
-   Reload the quantized base model.
-   Use `peft.PeftModel` to load the LoRA adapter weights from your training checkpoint.
-   Prepare a prompt using the tokenizer.
-   Use the `generate()` method of the fine-tuned model to get a response.


### Step 1: Reload Model and Adapter

To perform inference, we first reload the base model with the same configuration used for training (i.e., with 4-bit quantization). We then load the LoRA adapter on top of it.
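As a rough sanity check on why 4-bit loading matters, the memory taken by the weights can be estimated from the parameter count and the bits stored per weight. The sketch below assumes roughly 7 billion parameters (an approximation) and uses a hypothetical helper, `approx_weight_memory_gib`. It ignores quantization metadata, activations, and the KV cache, so treat the figures as lower bounds.

```python
def approx_weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Rough weight-only memory estimate in GiB (ignores quantization overhead)."""
    return n_params * bits_per_param / 8 / 1024**3


n_params = 7e9  # assumption: approximate parameter count of a "7B" model

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: ~{approx_weight_memory_gib(n_params, bits):.1f} GiB")
```

Under these assumptions, the 4-bit weights need about a quarter of the memory of 16-bit weights.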

#### Key Hugging Face `peft` Components:

-   `peft.PeftModel`: This class is used to work with a model that has PEFT adapters.
    -   `from_pretrained()`: This is the key method. It takes the **base model** as the first argument and the **path to the saved adapter** as the second argument. It then correctly loads the adapter weights and attaches them to the target modules of the base model.
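Conceptually, attaching a LoRA adapter to a target module adds a scaled low-rank correction to that layer's output: `y = W x + (alpha / r) * B (A x)`. The pure-Python sketch below illustrates only this arithmetic on tiny made-up matrices. The helper names (`matvec`, `lora_forward`) and all values are illustrative and are not part of the `peft` API. Bias and dropout are omitted.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Output of a linear layer with a LoRA update: W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)                 # frozen base layer output
    low_rank = matvec(B, matvec(A, x))  # rank-r correction
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, low_rank)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, shape (d_out=2, d_in=2)
A = [[1.0, 1.0]]              # LoRA "down" matrix, shape (r=1, d_in=2)
B = [[0.5], [0.0]]            # LoRA "up" matrix, shape (d_out=2, r=1)
x = [2.0, 3.0]

y = lora_forward(W, A, B, x, alpha=2, r=1)
y_untrained = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2, r=1)
print(y, y_untrained)
```

With `B` set to zeros, which is how LoRA initializes it before training, the layer reproduces the base output exactly.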

We need to find the latest checkpoint saved by the `Trainer` in our output directory.
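The `Trainer` names its checkpoint folders `checkpoint-<step>`. The cell below picks the most recently modified folder. That usually works, but it can mislead if checkpoints were copied or touched after training. As an alternative, here is a minimal sketch that picks the folder with the highest step number. The helper `latest_checkpoint_by_step` is hypothetical, and the demo runs on a throwaway temporary directory rather than the real output directory.

```python
import os
import re
import tempfile


def latest_checkpoint_by_step(output_dir):
    """Return the `checkpoint-<step>` subdirectory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            candidates.append((int(match.group(1)), path))
    return max(candidates)[1] if candidates else None


# Demo on a temporary directory with fake checkpoint folders
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 90):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    demo_latest = os.path.basename(latest_checkpoint_by_step(tmp))
    print(demo_latest)
```

Parsing the step as an integer matters: a plain string sort would rank `checkpoint-90` above `checkpoint-1000`.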


In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import os

# --- Base Model and Tokenizer Loading ---
# Same configuration as in the training notebook
model_id = "NousResearch/Llama-2-7b-hf"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# --- Load PEFT Adapter ---
# Find the most recently modified checkpoint in the training output directory
output_dir = "./lora-llama2-7b-guanaco"
latest_checkpoint = max(
    [os.path.join(output_dir, d) for d in os.listdir(output_dir) if d.startswith("checkpoint-")],
    key=os.path.getmtime
)
print(f"Loading adapter from: {latest_checkpoint}")

# Load the PEFT model
inference_model = PeftModel.from_pretrained(base_model, latest_checkpoint)

print("✅ Inference model loaded successfully!")



### Step 2: Perform Inference

Now we can use our fine-tuned `inference_model` to generate text. The process is standard for Hugging Face `transformers` models.

#### Key Hugging Face `transformers` Components:

-   `tokenizer()`: The tokenizer converts our prompt string into a format the model can understand (i.e., a sequence of token IDs). `return_tensors="pt"` returns the result as PyTorch tensors, which must then be moved to the same device as the model (the GPU here).
-   `model.generate()`: This is the core method for text generation.
    -   It takes the `input_ids` from the tokenizer.
    -   `max_new_tokens`: Sets the maximum number of tokens to generate, not counting the prompt tokens.
    -   `do_sample=True`: Enables sampling-based generation (such as top-k or top-p), which usually produces more varied and less repetitive text than greedy decoding, at the cost of non-deterministic output.
    -   `top_k`: In top-k sampling, the model considers only the `k` most likely next tokens at each step.
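To make the `top_k` option concrete, here is a minimal pure-Python sketch of one top-k sampling step on a toy list of logits. It illustrates the idea only and is not how `transformers` implements it. The function name `top_k_sample` and the logit values are made up for this example.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample one token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax numerators over the kept logits (subtract the max for stability)
    max_logit = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - max_logit) for i in top]
    # random.choices normalizes the weights internally
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, 1.0, -1.0, 3.0]
rng = random.Random(0)  # seeded for reproducibility
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With `k=2`, only the two highest-scoring indices (4 and 0) can ever be drawn. With `k=1`, the procedure reduces to greedy decoding.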

We will use a prompt that is similar in style to the `guanaco` dataset to see how well our fine-tuning worked.


In [None]:
# Prepare the prompt
prompt = "Could you please tell me about the Yushan National Park in Taiwan?"
# Tokenize and move the tensors to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(inference_model.device)

# Generate text
# torch.no_grad() disables gradient tracking, which saves memory during inference
with torch.no_grad():
    outputs = inference_model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,  # explicit mask, since pad_token == eos_token
        max_new_tokens=256,
        do_sample=True,
        top_k=50,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print the generated text
# Note: outputs[0] holds the prompt tokens followed by the newly generated tokens
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("--- Prompt ---")
print(prompt)
print("\n--- Generated Response ---")
print(generated_text)


### (Optional) Compare with the Base Model

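To appreciate the effect of fine-tuning, run the same prompt without the LoRA adapter and compare the responses. Note that `PeftModel.from_pretrained()` injects the LoRA layers into `base_model` in place, so calling `base_model.generate()` directly would still apply the adapter. To get the true base-model output, wrap the call in the `with inference_model.disable_adapter():` context manager instead. You will likely see that the fine-tuned model's response is more detailed, better formatted, or more aligned with the instruction-following style of the `guanaco` dataset.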
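One way to structure the comparison is a small helper that runs the same generation routine twice on the PEFT-wrapped model: once with the adapter active and once inside `disable_adapter()`, the `PeftModel` context manager that temporarily bypasses the adapter. The helper name `compare_with_and_without_adapter` and the `generate_fn` callback are assumptions made for this sketch, not part of `peft`.

```python
def compare_with_and_without_adapter(peft_model, generate_fn, prompt):
    """Run `generate_fn(model, prompt)` with the adapter active, then disabled.

    `generate_fn` is a user-supplied callable (hypothetical here) that
    tokenizes the prompt, calls `model.generate()`, and returns decoded text.
    """
    tuned = generate_fn(peft_model, prompt)
    # Inside this block the adapter is bypassed, so only the base weights are used
    with peft_model.disable_adapter():
        base = generate_fn(peft_model, prompt)
    return {"fine_tuned": tuned, "base": base}
```

In this notebook, `peft_model` would be `inference_model`, and `generate_fn` could wrap the tokenization, `generate()`, and decoding steps from Step 2. Because sampling is random, comparing several samples per setting, or using `do_sample=False`, gives a fairer comparison.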

---
Next, in the final notebook `04-Merge_and_Deploy.ipynb`, we will see how to merge the adapter weights back into the model for easy deployment.
