Run 18 July

Fine-tune Llama on: "for this question, given this passage, this is the answer"

Save the fine-tuned Llama.

Run inference on the fine-tuned Llama: "Hey llama {better prompt}, given this passage, generate the answer to the question"

We will evaluate the answers generated above using a unit test and a Likert score.



1.   Inputs: a triplet file, phase_z3_generated_qa.jsonl, created using a Mistral model. Each record has passage_id, passage_text, question, and answer.
2.   Phase 2: fine-tune Llama on the training data above. Aspirational: use QLoRA and Unsloth.

3. Phase 3: toy inference on the model trained in Phase 2.
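
Before fine-tuning, it can help to confirm that every record in the input file actually carries the four fields listed above. The following is a minimal sketch using only the standard library; the helper name `find_invalid_records` and the sample records are hypothetical and are not part of the original notebook.

```python
import json

# Field names taken from the input description above.
REQUIRED_FIELDS = ("passage_id", "passage_text", "question", "answer")


def find_invalid_records(lines):
    """Return (line_number, reason) pairs for JSONL records that look unusable."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((line_number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "record is not a JSON object"))
            continue
        missing = []
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if value is None or not str(value).strip():
                missing.append(field)
        if missing:
            problems.append((line_number, "missing or empty: " + ", ".join(missing)))
    return problems


# Hypothetical sample records, for illustration only.
sample_lines = [
    '{"passage_id": 1, "passage_text": "The Assembly met on the Pnyx hill.", '
    '"question": "Where did the Assembly meet?", "answer": "On the Pnyx hill."}',
    '{"passage_id": 2, "passage_text": "Some text.", "question": "", "answer": "An answer."}',
    "not json",
]
problems = find_invalid_records(sample_lines)
for line_number, reason in problems:
    print(f"line {line_number}: {reason}")
```

To check the real file, pass an open file handle instead of the sample list, for example `find_invalid_records(open(FINE_TUNING_DATASET_PATH, encoding="utf-8"))`.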







# Part 2: LLM Fine tuning using QLoRA, No Unsloth.

Used components:

1.   LLM (Large Language Model): meta-llama/Meta-Llama-3-8B-Instruct
2.   Fine-Tuning Technique: QLoRA (Quantized Low-Rank Adaptation)
3.   Core Machine Learning Framework: PyTorch (imported as torch)
4.   Hugging Face Libraries:
    - transformers: For loading the model and tokenizer, and defining TrainingArguments.
    
    - trl: Specifically the SFTTrainer for efficient supervised fine-tuning.
    - peft: For configuring LoRA adapters (LoraConfig).
    - datasets: For loading the question/passage/answer triplet .jsonl file.
    - huggingface_hub: For authenticating with the Hugging Face Hub to access the Llama 3 model.

5. Quantization: 4-bit quantization (a core part of QLoRA, reducing memory footprint).
6. Hardware: A GPU (like a Colab T4 or A100) is essential for running this process.
7. Input Data Format: phase_z3_generated_qa.jsonl file, containing passage_id, passage_text, question, and answer fields.
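
To make the memory argument for 4-bit quantization concrete, here is a rough back-of-the-envelope sketch. It uses the parameter counts printed in this notebook's training log (8,072,204,288 total, of which 41,943,040 are LoRA adapter parameters). The helper `weight_memory_gib` is hypothetical. The 4-bit figure should be read as a lower bound: it counts weights only and assumes every weight is stored at 4 bits, whereas in practice some layers are typically kept at higher precision, and quantization constants, activations, and optimizer state add more.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Naive weight-only memory estimate: parameters * bits / 8, in GiB."""
    return num_params * bits_per_param / 8 / GIB


# Counts reported by print_trainable_parameters() in the training log.
ALL_PARAMS_WITH_ADAPTERS = 8_072_204_288
LORA_PARAMS = 41_943_040
BASE_PARAMS = ALL_PARAMS_WITH_ADAPTERS - LORA_PARAMS

bf16_gib = weight_memory_gib(BASE_PARAMS, 16)  # 16-bit weights
nf4_gib = weight_memory_gib(BASE_PARAMS, 4)    # idealized 4-bit weights

print(f"Base parameters:        {BASE_PARAMS:,}")
print(f"16-bit weights (naive): {bf16_gib:.2f} GiB")
print(f"4-bit weights (naive):  {nf4_gib:.2f} GiB")
```

Under these assumptions the quantized weights take about a quarter of the 16-bit footprint, which is what makes an 8B model plausible on a single Colab GPU.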

    







In [1]:
###AGGRESSIVE INSTALLATION IF YOU ENCOUNTER PERSISTENT ISSUES

# --- Installation Block ---
# Install standard Hugging Face libraries for QLoRA fine-tuning.
# It's crucial to run these installations first.

print("Installing standard Hugging Face libraries...")

# Aggressively uninstall to ensure a clean slate, especially for torch and torchvision
!pip uninstall -y torch torchvision torchaudio transformers accelerate bitsandbytes trl peft datasets xformers

# Clear relevant caches
print("Clearing bitsandbytes cache...")
!rm -rf ~/.cache/bitsandbytes
print("Clearing Hugging Face cache...")
!rm -rf ~/.cache/huggingface/hub/*

# Install PyTorch and Torchvision specifically for CUDA 12.1 (common in Colab)
# This often resolves 'operator torchvision::nms does not exist' errors
print("Installing PyTorch and Torchvision for CUDA 12.1...")
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install other core Hugging Face libraries
print("Installing transformers, accelerate, bitsandbytes, trl, peft, datasets...")
!pip install transformers accelerate bitsandbytes "trl==0.8.6" peft datasets # Pinned trl to a specific version
# xformers is often used for attention optimization, but can be optional if it causes issues.
# If you encounter CUDA errors related to Triton or memory, try removing xformers.
# !pip install xformers # Uncomment if you want to try xformers, but it's not strictly necessary for basic QLoRA

print("\nInstallation complete.")
print("IMPORTANT: Due to how Colab handles library installations, please follow these steps precisely:")
print("1. Go to 'Runtime' -> 'Disconnect and delete runtime'.")
print("2. After the runtime restarts, RUN THIS INSTALLATION BLOCK AGAIN.")
print("3. Once this block finishes its second run, you can safely proceed to the next sections (Utility Functions, Step 1, Step 2, Step 3).")



Installing standard Hugging Face libraries...
Found existing installation: torch 2.5.1+cu121
Uninstalling torch-2.5.1+cu121:
  Successfully uninstalled torch-2.5.1+cu121
Found existing installation: torchvision 0.20.1+cu121
Uninstalling torchvision-0.20.1+cu121:
  Successfully uninstalled torchvision-0.20.1+cu121
Found existing installation: torchaudio 2.5.1+cu121
Uninstalling torchaudio-2.5.1+cu121:
  Successfully uninstalled torchaudio-2.5.1+cu121
Found existing installation: transformers 4.53.3
Uninstalling transformers-4.53.3:
  Successfully uninstalled transformers-4.53.3
Found existing installation: accelerate 1.9.0
Uninstalling accelerate-1.9.0:
  Successfully uninstalled accelerate-1.9.0
Found existing installation: bitsandbytes 0.46.1
Uninstalling bitsandbytes-0.46.1:
  Successfully uninstalled bitsandbytes-0.46.1
Found existing installation: trl 0.8.6
Uninstalling trl-0.8.6:
  Successfully uninstalled trl-0.8.6
Found existing installation: peft 0.16.0
Uninstalling peft-0.16.0

In [2]:
#utility functions


import json
import os
import re

from google.colab import drive
from huggingface_hub import login
from tqdm import tqdm


# --- Helper function to replace Unicode punctuation with ASCII equivalents ---
def replace_unicode_punctuation_with_ascii(sentence: str) -> str:
    """
    Replaces common Unicode "smart" punctuation characters in a sentence
    with their ASCII equivalents.
    """
    unicode_to_ascii_map = {
        '\u201c': '"',   # Left double quotation mark
        '\u201d': '"',   # Right double quotation mark
        '\u2018': "'",   # Left single quotation mark (curly apostrophe)
        '\u2019': "'",   # Right single quotation mark (curly apostrophe)
        '\u2013': '-',   # En dash
        '\u2014': '--',  # Em dash (commonly replaced by two hyphens)
        '\u2026': '...', # Ellipsis
        '\u00A0': ' ',   # Non-breaking space
    }

    cleaned_sentence = sentence
    for unicode_char, ascii_char in unicode_to_ascii_map.items():
        cleaned_sentence = cleaned_sentence.replace(unicode_char, ascii_char)

    return cleaned_sentence

def mount_google_drive_conditionally():
    """
    Checks if Google Drive is already mounted and mounts it if not.
    Exits the program if mounting fails.
    """
    if not os.path.exists('/content/drive/MyDrive'):
        print("Google Drive not detected as mounted. Attempting to mount...")
        try:
            drive.mount('/content/drive')
            print("Google Drive mounted successfully!")
        except Exception as e:
            print(f"Error mounting Google Drive: {e}")
            print("Please ensure you are running this in a Google Colab environment and authorize Drive access.")
            exit() # Exit if Drive cannot be mounted
    else:
        print("Google Drive already mounted. Skipping mounting step.")

def check_file_exists_and_exit_if_not(file_path: str, file_description: str = "file"):
    """
    Checks for the existence of a specified file. If the file does not exist,
    it prints an error message and exits the program.

    Args:
        file_path (str): The full path to the file to check.
        file_description (str): A descriptive name for the file (e.g., "fine-tuning dataset").
    """
    print(f"\nChecking for existence of {file_description}: {file_path}")
    if not os.path.exists(file_path):
        print(f"Error: {file_description} not found at {file_path}")
        print("Please ensure previous steps were completed successfully and the file exists.")
        exit() # Exit if the file is not found
    else:
        print(f"{file_description.capitalize()} found. Proceeding.")

# Example usage (you would call these from your main fine-tuning script)
# mount_google_drive_conditionally()
# check_file_exists_and_exit_if_not(FINE_TUNING_DATASET_PATH, "fine-tuning dataset")

def login_to_huggingface_hub():
    """
    Logs into the Hugging Face Hub.
    Exits the program if login fails.
    """
    print("\nLogging into Hugging Face Hub...")
    try:
        # You will be prompted to enter your HF token in a pop-up or console
        login()
        print("Hugging Face login successful!")
    except Exception as e:
        print(f"Hugging Face login failed: {e}")
        print("Please ensure you have accepted the Llama 3 license and pasted a valid token.")
        exit() # Exit if login fails, as model loading will fail without it

def formatting_prompts_func(examples, tokenizer):
    """
    Formats the dataset examples into the Llama 3 chat template required by SFTTrainer.

    Args:
        examples (dict): A dictionary of lists, where each list corresponds to a column
                         (e.g., 'question', 'passage_text', 'answer') from the dataset.
        tokenizer: The Hugging Face tokenizer object for the Llama 3 model.

    Returns:
        dict: A dictionary containing a single key 'text', whose value is a list of
              formatted prompt strings suitable for SFTTrainer.
    """
    formatted_texts = []
    for i in range(len(examples["question"])):
        question = examples["question"][i]
        passage = examples["passage_text"][i]
        answer = examples["answer"][i]

        messages = [
            {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided passage. Ensure your answer is concise and directly addresses the question using information only from the passage."},
            {"role": "user", "content": f"Passage: {passage}\nQuestion: {question}"},
            {"role": "assistant", "content": answer}
        ]
        # Apply the tokenizer's chat template to convert messages to a single string.
        # add_generation_prompt=False because the messages already end with the complete
        # assistant turn (the target answer), so no empty assistant header should be appended.
        formatted_texts.append(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False))
    return {"text": formatted_texts}
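
To see roughly what `formatting_prompts_func` produces without downloading the gated model, the sketch below hand-renders the Llama 3 Instruct chat layout. This is an approximation based on our understanding of the published template; `render_llama3_chat` is a hypothetical helper, and the tokenizer's own `apply_chat_template` remains the authoritative source. The example passage and answer are made up for illustration.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Approximate the Llama 3 Instruct chat layout (illustration only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn: role header, blank line, trimmed content, end-of-turn token.
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


# Hypothetical example values.
prompt_messages = [
    {"role": "system", "content": "Answer using only the passage."},
    {"role": "user", "content": "Passage: The Assembly met on the Pnyx hill.\n"
                                "Question: Where did the Assembly meet?"},
]
answer_message = {"role": "assistant", "content": "On the Pnyx hill."}

# Training text: the full conversation, including the target answer.
training_text = render_llama3_chat(prompt_messages + [answer_message])
# Inference prompt: stops at an open assistant header.
inference_prompt = render_llama3_chat(prompt_messages, add_generation_prompt=True)

print(training_text)
print("-----")
print(inference_prompt)
```

The difference between the two strings is the reason training uses `add_generation_prompt=False` (the answer is already present) while inference uses `add_generation_prompt=True`.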




In [3]:
mount_google_drive_conditionally()

Google Drive not detected as mounted. Attempting to mount...
Mounted at /content/drive
Google Drive mounted successfully!


In [10]:
# from google.colab import files
# uploaded = files.upload()

# # Move to desired directory
# !mv phase_z3_generated_qa.jsonl /content/drive/MyDrive/266_fp_ph_z/


Saving phase_z3_generated_qa.jsonl to phase_z3_generated_qa (1).jsonl


In [4]:
import os
DATA_PATH = "/content/drive/MyDrive/266_fp_ph_z"
FINE_TUNING_DATASET_PATH = os.path.join(DATA_PATH, "phase_z3_generated_qa.jsonl")
check_file_exists_and_exit_if_not(FINE_TUNING_DATASET_PATH, "fine-tuning dataset")


Checking for existence of fine-tuning dataset: /content/drive/MyDrive/266_fp_ph_z/phase_z3_generated_qa.jsonl
Fine-tuning dataset found. Proceeding.


In [5]:
# --- Imports for Fine-tuning ---
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
# The huggingface_hub and google.colab.drive imports are now handled by their respective utility functions.

# --- Configuration ---
DATA_PATH = "/content/drive/MyDrive/266_fp_ph_z"
FINE_TUNING_DATASET_PATH = os.path.join(DATA_PATH, "phase_z3_generated_qa.jsonl")
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
MAX_SEQ_LENGTH = 2048
OUTPUT_DIR = os.path.join(DATA_PATH, "llama3_8b_qa_finetuned_adapters_standard_hf")
os.makedirs(OUTPUT_DIR, exist_ok=True)

# --- Call Utility Functions for Drive Access and File Checks ---
# Ensure these functions (mount_google_drive_conditionally, check_file_exists_and_exit_if_not)
# are defined in a cell executed BEFORE this one.
mount_google_drive_conditionally()
check_file_exists_and_exit_if_not(FINE_TUNING_DATASET_PATH, "fine-tuning dataset")

# --- Call Hugging Face Login Utility Function ---
# Ensure this function (login_to_huggingface_hub) is defined in a cell executed BEFORE this one.
login_to_huggingface_hub()

# --- Load Model with QLoRA (Standard Hugging Face way) ---
print(f"\nLoading model: {MODEL_NAME} with standard Hugging Face QLoRA...")

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4", # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bfloat16
    bnb_4bit_use_double_quant=True, # Double quantization for extra memory saving
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto", # Automatically maps model to available GPU
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# Llama 3 tokenizer doesn't have a default pad_token, set it to eos_token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # For training, right padding is generally better

print("Base model loaded successfully with 4-bit quantization.")

# --- Prepare Model for PEFT and Configure LoRA Adapters ---
# Prepare the model for k-bit training (handles some initializations for quantized models)
model = prepare_model_for_kbit_training(model)

# LoRA Configuration
lora_config = LoraConfig(
    r=16, # Rank of the update matrices
    lora_alpha=32, # Scaling factor
    lora_dropout=0.05, # Dropout for regularization (can be 0 if desired)
    bias="none",
    task_type="CAUSAL_LM",
    # Target modules for Llama 3 (these are the linear layers in attention/FFN blocks)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

# Get the PEFT model
model = get_peft_model(model, lora_config)
print("LoRA adapters configured and applied to the model.")
model.print_trainable_parameters() # Shows how many parameters are actually being trained

# --- Load Prepared Dataset ---
print(f"\nLoading fine-tuning dataset from: {FINE_TUNING_DATASET_PATH}")
train_dataset = load_dataset("json", data_files=FINE_TUNING_DATASET_PATH, split="train")
print(f"Dataset loaded with {len(train_dataset)} examples.")

# --- Apply Formatting Function for SFTTrainer ---
# Ensure 'formatting_prompts_func' is defined in a cell executed BEFORE this one.
train_dataset = train_dataset.map(
    lambda examples: formatting_prompts_func(examples, tokenizer), # Pass tokenizer to the function
    batched=True,
    remove_columns=["passage_id", "passage_text", "question", "answer"],
)
print("Dataset formatted for SFTTrainer.")
print(train_dataset)

# --- Configure Training Arguments ---
training_args = TrainingArguments(
    output_dir="./training_logs_standard_hf", # Different log dir for standard HF
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8, # Keep aggressive accumulation for memory
    warmup_steps=10,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=False, # Set to False if bfloat16 is true, or if GPU doesn't support fp16
    bf16=torch.cuda.is_bf16_supported(), # Use BF16 if supported
    logging_steps=10,
    optim="paged_adamw_8bit", # Use paged optimizer for memory efficiency
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=42,
    save_total_limit=1,
    push_to_hub=False,
    report_to="none",
    remove_unused_columns=False,
    dataloader_num_workers=0,
)

# --- Initialize SFTTrainer ---
print("\nInitializing SFTTrainer...")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    packing=True,
    args=training_args,
)
print("SFTTrainer initialized. Starting training...")

# --- Start Training ---
trainer.train()
print("\nFine-tuning complete!")

# --- Save Fine-tuned Adapters ---
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Fine-tuned LoRA adapters saved to: {OUTPUT_DIR}")
print("You can load these adapters with the base model for inference later.")


Google Drive already mounted. Skipping mounting step.

Checking for existence of fine-tuning dataset: /content/drive/MyDrive/266_fp_ph_z/phase_z3_generated_qa.jsonl
Fine-tuning dataset found. Proceeding.

Logging into Hugging Face Hub...


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Hugging Face login successful!

Loading model: meta-llama/Meta-Llama-3-8B-Instruct with standard Hugging Face QLoRA...


config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/187 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/51.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Base model loaded successfully with 4-bit quantization.
LoRA adapters configured and applied to the model.
trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196

Loading fine-tuning dataset from: /content/drive/MyDrive/266_fp_ph_z/phase_z3_generated_qa.jsonl


Generating train split: 0 examples [00:00, ? examples/s]

Dataset loaded with 4733 examples.


Map:   0%|          | 0/4733 [00:00<?, ? examples/s]

Dataset formatted for SFTTrainer.
Dataset({
    features: ['text'],
    num_rows: 4733
})

Initializing SFTTrainer...


Generating train split: 0 examples [00:00, ? examples/s]

  super().__init__(
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


SFTTrainer initialized. Starting training...


  return fn(*args, **kwargs)


Step,Training Loss
10,1.2442
20,0.8763
30,0.8054
40,0.7695
50,0.7359
60,0.7349
70,0.6916
80,0.6345
90,0.6423
100,0.6186



Fine-tuning complete!
Fine-tuned LoRA adapters saved to: /content/drive/MyDrive/266_fp_ph_z/llama3_8b_qa_finetuned_adapters_standard_hf
You can load these adapters with the base model for inference later.
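
As a sanity check on the log above, the reported `trainable params: 41,943,040` can be reproduced by hand. Each LoRA-adapted linear layer of shape (in, out) adds `r * (in + out)` parameters. The sketch below assumes the commonly published Llama 3 8B dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these values are not stated in this notebook and are labeled as assumptions in the code.

```python
# Assumed Llama 3 8B dimensions (not taken from this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads of dimension 128
NUM_LAYERS = 32

LORA_RANK = 16  # r=16, as set in LoraConfig above

# (in_features, out_features) for each targeted module in one decoder layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) per adapted linear layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(MODULE_SHAPES, LORA_RANK, NUM_LAYERS)
all_params = 8_072_204_288  # "all params" from the log above
print(f"trainable params: {trainable:,}")                  # 41,943,040
print(f"trainable%: {100 * trainable / all_params:.4f}")   # 0.5196
```

The match with the log suggests that adapters were attached to all seven projection types in every decoder layer, as configured in `target_modules`.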


# Part 3 - Inference, load saved LoRA adapters from persistent storage

In [11]:
DATA_PATH = '/content/drive/MyDrive/266_fp_ph_z'
# Path to the saved fine-tuned LoRA adapters from Step 2
FINE_TUNED_ADAPTERS_PATH = os.path.join(DATA_PATH, "llama3_8b_qa_finetuned_adapters_standard_hf")

In [12]:
FINE_TUNED_ADAPTERS_PATH

'/content/drive/MyDrive/266_fp_ph_z/llama3_8b_qa_finetuned_adapters_standard_hf'

In [13]:
# --- Imports for Inference ---
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel # For loading LoRA adapters
# The google.colab.drive and huggingface_hub imports are now handled by utility functions.

# --- Configuration ---
DATA_PATH = '/content/drive/MyDrive/266_fp_ph_z'
# Path to the saved fine-tuned LoRA adapters from Step 2
FINE_TUNED_ADAPTERS_PATH = os.path.join(DATA_PATH, "llama3_8b_qa_finetuned_adapters_standard_hf")
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
MAX_SEQ_LENGTH = 2048 # Max sequence length used during training

# --- Call Utility Functions for Drive Access and Hugging Face Login ---
# Ensure these functions are defined in cells executed BEFORE this one.
mount_google_drive_conditionally()
login_to_huggingface_hub()

# --- Check for existence of fine-tuned adapters ---
print(f"\nChecking for existence of fine-tuned adapters: {FINE_TUNED_ADAPTERS_PATH}")
if not os.path.exists(FINE_TUNED_ADAPTERS_PATH):
    print(f"Error: Fine-tuned adapters directory not found at {FINE_TUNED_ADAPTERS_PATH}")
    print("Please ensure Step 2 (LLM Fine-Tuning) was completed successfully and adapters were saved.")
    exit()
else:
    print("Fine-tuned adapters directory found. Proceeding.")

# --- Load Base Model with 4-bit Quantization ---
print(f"\nLoading base model: {MODEL_NAME} with 4-bit quantization...")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16, # Use float16 for compute during inference
)

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# Llama 3 tokenizer doesn't have a default pad_token, set it to eos_token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left" # For inference, left padding is generally preferred

print("Base model and tokenizer loaded.")

# --- Load Fine-tuned LoRA Adapters ---
print(f"Loading LoRA adapters from: {FINE_TUNED_ADAPTERS_PATH}...")
# Attach the LoRA adapters to the base model
model = PeftModel.from_pretrained(base_model, FINE_TUNED_ADAPTERS_PATH)
print("LoRA adapters loaded.")

# Optional: Merge adapters into the base model for faster inference (requires more VRAM)
# This creates a single, merged model. For A100 40GB, this might be feasible and faster.
print("Merging LoRA adapters into base model (optional, for faster inference)...")
model = model.merge_and_unload()
print("Adapters merged.")

# Set model to evaluation mode
model.eval()

# --- Inference Examples ---
print("\n--- Running Inference ---")

# Example 1: A question that should be answerable from a passage
example_passage_1 = """
In the ancient city of Athens, democracy flourished, allowing citizens to participate directly in governance.
However, only free-born men were considered citizens, excluding women, slaves, and foreign residents.
The Assembly, where laws were debated and passed, met regularly on the Pnyx hill.
Socrates, a prominent philosopher, was known for his method of questioning, which often challenged conventional wisdom.
"""
example_question_1 = "Where did the Assembly meet in ancient Athens?"

# Example 2: Another question
example_passage_2 = """
The process of photosynthesis in plants converts light energy into chemical energy, primarily in the form of glucose.
This complex process occurs mainly in the chloroplasts, which contain chlorophyll, the green pigment that absorbs sunlight.
Water and carbon dioxide are absorbed, and oxygen is released as a byproduct.
"""
example_question_2 = "What is released as a byproduct during photosynthesis?"


# Function to generate response
def generate_answer(passage: str, question: str, model, tokenizer, max_new_tokens=100):
    """
    Generates an answer to a question based on a provided passage using the fine-tuned model.
    """
    messages = [
        {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided passage. Be concise and directly answer the question."},
        {"role": "user", "content": f"Passage: {passage}\nQuestion: {question}"},
    ]
    # Apply chat template and tokenize
    input_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True, # Important: tells the model to generate the assistant's turn
        return_tensors="pt"
    ).to(model.device)

    # Generate response
    with torch.no_grad(): # No need to calculate gradients during inference
        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            temperature=0.7,
            pad_token_id=tokenizer.pad_token_id,
            # Adjust stopping criteria if needed (e.g., stop at <|eot_id|>)
        )

    # Decode the generated text
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=False)

    # Find the start of the assistant's response
    assistant_start_tag = "<|start_header_id|>assistant<|end_header_id|>\n"
    start_index = decoded_output.find(assistant_start_tag)

    if start_index != -1:
        generated_answer = decoded_output[start_index + len(assistant_start_tag):].strip()
        # Remove any trailing <|eot_id|> or other special tokens
        generated_answer = generated_answer.replace("<|eot_id|>", "").strip()
    else:
        generated_answer = "Could not parse assistant's response." # Fallback

    return generated_answer

# Run Example 1
print("Example 1:")
print(f"Passage: {example_passage_1.strip()}")
print(f"Question: {example_question_1}")
answer_1 = generate_answer(example_passage_1, example_question_1, model, tokenizer)
print(f"Generated Answer: {answer_1}\n")

# Run Example 2
print("Example 2:")
print(f"Passage: {example_passage_2.strip()}")
print(f"Question: {example_question_2}")
answer_2 = generate_answer(example_passage_2, example_question_2, model, tokenizer)
print(f"Generated Answer: {answer_2}\n")

print("Inference complete.")


Google Drive already mounted. Skipping mounting step.

Logging into Hugging Face Hub...


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Hugging Face login successful!

Checking for existence of fine-tuned adapters: /content/drive/MyDrive/266_fp_ph_z/llama3_8b_qa_finetuned_adapters_standard_hf
Fine-tuned adapters directory found. Proceeding.

Loading base model: meta-llama/Meta-Llama-3-8B-Instruct with 4-bit quantization...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Base model and tokenizer loaded.
Loading LoRA adapters from: /content/drive/MyDrive/266_fp_ph_z/llama3_8b_qa_finetuned_adapters_standard_hf...
LoRA adapters loaded.
Merging LoRA adapters into base model (optional, for faster inference)...
Adapters merged.

--- Running Inference ---
Example 1:
Passage: In the ancient city of Athens, democracy flourished, allowing citizens to participate directly in governance.
However, only free-born men were considered citizens, excluding women, slaves, and foreign residents.
The Assembly, where laws were debated and passed, met regularly on the Pnyx hill.
Socrates, a prominent philosopher, was known for his method of questioning, which often challenged conventional wisdom.
Question: Where did the Assembly meet in ancient Athens?
Generated Answer: The Assembly met regularly on the Pnyx hill in ancient Athens.

Example 2:
Passage: The process of photosynthesis in plants converts light energy into chemical energy, primarily in the form of glucose.
This c