**Run 25 July - INFERENCE | UNIT TEST | ON FINE TUNED LLAMA 8B MODEL QUANTIZED AND WITH QLORA**

Run the unit test on 1000+ question and passage pairs to generate answers from the Llama 3 8B model that was previously fine-tuned and saved for the domain document geetha vahini.pdf.

***
Important notes from the last run:

  - Hugging Face login successful!
  - Loading model: meta-llama/Meta-Llama-3-8B-Instruct with standard Hugging Face QLoRA...
  - Loading checkpoint shards: 100%
  - Base model loaded successfully with 4-bit quantization.
  - LoRA adapters configured and applied to the model.
  - trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196

  - Loading fine-tuning dataset from: /content/drive/MyDrive/fpdata/geetha_vahini/phase_4_question_passage_ans_triplet.jsonl
  - Dataset loaded with 2515 examples.
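
As a sanity check on the `trainable params` line above, the count can be reproduced by hand. The sketch below assumes LoRA rank 16 applied to all seven linear projections in each of the 32 decoder layers. The notes do not record the actual LoRA configuration, so this is only one setup that happens to match the reported number; `RANK`, `PROJECTIONS`, and `lora_param_count` are illustrative names, and the layer dimensions are those of Llama 3 8B.

```python
# Llama 3 8B dimensions: hidden size, key/value projection size, MLP size, layers.
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32
RANK = 16  # assumed; not stated in the run notes

# (in_features, out_features) of each projection assumed to carry an adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_param_count(rank=RANK):
    # Each adapter adds A (rank x in) and B (out x rank): rank * (in + out) weights.
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS

trainable = lora_param_count()
total = 8_072_204_288  # "all params" value copied from the log line above
print(f"trainable params: {trainable:,} || trainable%: {100 * trainable / total:.4f}")
```

Under these assumptions the computed count equals the 41,943,040 reported in the log.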






# Part 1: (Aggressive) Installation.

IMPORTANT: Follow these steps precisely:
1. Run this code block.
2. After it completes, go to Runtime -> Disconnect and delete runtime.
3. Once the runtime has restarted, run this code block again.
4. After the second run finishes, you can safely proceed to the next sections.

In [1]:
# --- Installation Block ---
print("Starting library installations and upgrades...")

# Aggressively uninstall to ensure a clean slate, especially for torch and torchvision
!pip uninstall -y torch torchvision torchaudio transformers accelerate bitsandbytes trl peft datasets xformers

# Clear relevant caches
print("Clearing bitsandbytes cache...")
!rm -rf ~/.cache/bitsandbytes
print("Clearing Hugging Face cache...")
!rm -rf ~/.cache/huggingface/hub/*

# Install PyTorch and Torchvision specifically for CUDA 12.1 (common in Colab)
print("Installing PyTorch and Torchvision for CUDA 12.1...")
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install other core Hugging Face libraries
print("Installing transformers, accelerate, bitsandbytes, trl, peft, datasets...")
# Pin trl to a known compatible version to avoid import errors
!pip install transformers accelerate bitsandbytes "trl==0.8.6" peft datasets
# xformers is optional, uncomment if you want to try it, but it's not strictly necessary
# !pip install xformers

print("\nLibrary installation complete.")
print("IMPORTANT: Please follow the instructions above about restarting the runtime.")




Starting library installations and upgrades...
Found existing installation: torch 2.5.1+cu121
Uninstalling torch-2.5.1+cu121:
  Successfully uninstalled torch-2.5.1+cu121
Found existing installation: torchvision 0.20.1+cu121
Uninstalling torchvision-0.20.1+cu121:
  Successfully uninstalled torchvision-0.20.1+cu121
Found existing installation: torchaudio 2.5.1+cu121
Uninstalling torchaudio-2.5.1+cu121:
  Successfully uninstalled torchaudio-2.5.1+cu121
Found existing installation: transformers 4.54.0
Uninstalling transformers-4.54.0:
  Successfully uninstalled transformers-4.54.0
Found existing installation: accelerate 1.9.0
Uninstalling accelerate-1.9.0:
  Successfully uninstalled accelerate-1.9.0
Found existing installation: bitsandbytes 0.46.1
Uninstalling bitsandbytes-0.46.1:
  Successfully uninstalled bitsandbytes-0.46.1
Found existing installation: trl 0.8.6
Uninstalling trl-0.8.6:
  Successfully uninstalled trl-0.8.6
Found existing installation: peft 0.16.0
Uninstalling peft-0.16.
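
After the second run of the installation block, it can help to confirm which versions were actually resolved, in particular that `trl` is at the pinned 0.8.6. The following is a minimal sketch using only the standard library; `PACKAGES` and `installed_versions` are illustrative names, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages touched by the installation block above.
PACKAGES = ["torch", "transformers", "accelerate", "bitsandbytes", "trl", "peft", "datasets"]

def installed_versions(packages=PACKAGES):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions().items():
    print(f"{name:>14}: {ver or 'NOT INSTALLED'}")
```

Reading the metadata does not import the packages, so the check is cheap and does not load CUDA libraries.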

# Part 2: Google Drive mount and unit test passage-question JSONL file upload

In [3]:
#######
## ACCESSIBILITY CHECKS
# Access to the unit test file (see UNIT_TEST_FILE_NAME below). It is in JSONL format
# Successful login to Hugging Face
# Access to fine tuned adapters
########################

import os
import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
from huggingface_hub import login
from google.colab import drive

# --- Configuration ---
ROOT_DIR_ADAPTERS = "/content/drive/MyDrive/266_fp_ph_z"
FINE_TUNED_ADAPTERS_FOLDER_NAME = "llama3_8b_qa_finetuned_adapters_standard_hf"
FINE_TUNED_ADAPTERS_PATH = os.path.join(ROOT_DIR_ADAPTERS, FINE_TUNED_ADAPTERS_FOLDER_NAME) #CORRECT. ADAPTERS ARE IN THE ORIGINAL ROOT DIR FOR PHASE Z
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"

CHROMA_DB_DIR = "/content/drive/MyDrive/266_fp_ph_z/ph_z_rag" # NEW RAG DIR FOR RAG SPECIFIC STORES AND OUTPUT

UNIT_TEST_FILE_NAME = "unit_test_question_passage_for_fine_tuned_llama8B.jsonl"
UNIT_TEST_FILE_PATH = os.path.join(CHROMA_DB_DIR, UNIT_TEST_FILE_NAME)

# New output file for generated answers
GENERATED_ANSWERS_FILE = os.path.join(CHROMA_DB_DIR, "ph_z_UT_generated_answers_fine_tuned_llama8bQLora.jsonl")

# --- Mount Google Drive ---
print("Mounting Google Drive...")
if not os.path.exists('/content/drive/MyDrive'):
    try:
        drive.mount('/content/drive')
        print("Google Drive mounted successfully!")
    except Exception as e:
        print(f"Error mounting Google Drive: {e}")
        print("Please ensure you are running this in a Google Colab environment and authorize Drive access.")
        exit()
else:
    print("Google Drive already mounted.")

# --- Check for Unit Test File Existence ---
print(f"\nChecking for existence of Unit Test File: {UNIT_TEST_FILE_PATH}")
if os.path.exists(UNIT_TEST_FILE_PATH):
    print(f"SUCCESS: Unit test file '{UNIT_TEST_FILE_NAME}' found at {UNIT_TEST_FILE_PATH}.")
    try:
        with open(UNIT_TEST_FILE_PATH, 'r', encoding='utf-8') as f:
            first_line = f.readline()
            json.loads(first_line) # Try to parse the first line to check if it's valid JSON
        print("SUCCESS: Unit test file appears to be valid JSONL format.")
    except Exception as e:
        print(f"WARNING: Unit test file found, but could not parse as JSONL. Error: {e}")
else:
    print(f"ERROR: Unit test file '{UNIT_TEST_FILE_NAME}' NOT found at {UNIT_TEST_FILE_PATH}.")
    print("Please ensure it is uploaded to your RAG data directory.")


# --- Hugging Face Login ---
print("\nAttempting Hugging Face Hub login...")
try:
    login()
    print("SUCCESS: Hugging Face login successful!")
except Exception as e:
    print(f"ERROR: Hugging Face login failed: {e}")
    print("Please ensure you have accepted the Llama 3 license and pasted a valid token.")


# --- Check for Fine-tuned Adapters Existence ---
print(f"\nChecking for existence of Fine-tuned Adapters: {FINE_TUNED_ADAPTERS_PATH}")
if os.path.exists(os.path.join(FINE_TUNED_ADAPTERS_PATH, "adapter_config.json")) and \
   os.path.exists(os.path.join(FINE_TUNED_ADAPTERS_PATH, "adapter_model.safetensors")):
    print("SUCCESS: Fine-tuned adapter files found.")
else:
    print(f"ERROR: Fine-tuned adapter files NOT found in {FINE_TUNED_ADAPTERS_PATH}.")
    print("Please ensure they were copied/extracted correctly.")


print("\n--- All accessibility checks complete. ---")


Mounting Google Drive...
Mounted at /content/drive
Google Drive mounted successfully!

Checking for existence of Unit Test File: /content/drive/MyDrive/266_fp_ph_z/ph_z_rag/unit_test_question_passage_for_fine_tuned_llama8B.jsonl
SUCCESS: Unit test file 'unit_test_question_passage_for_fine_tuned_llama8B.jsonl' found at /content/drive/MyDrive/266_fp_ph_z/ph_z_rag/unit_test_question_passage_for_fine_tuned_llama8B.jsonl.
SUCCESS: Unit test file appears to be valid JSONL format.

Attempting Hugging Face Hub login...



SUCCESS: Hugging Face login successful!

Checking for existence of Fine-tuned Adapters: /content/drive/MyDrive/266_fp_ph_z/llama3_8b_qa_finetuned_adapters_standard_hf
SUCCESS: Fine-tuned adapter files found.

--- All accessibility checks complete. ---
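
The check above parses only the first line of the unit test file. A fuller validation could scan every line and confirm the fields that the inference loop reads (`passage` and `question`). The sketch below is one way to do that; `validate_jsonl` is a hypothetical helper, and treating those two keys as required is an assumption based on how the records are used later.

```python
import json

def validate_jsonl(path, required_keys=("passage", "question")):
    """Return (number of valid records, list of (line_number, problem))."""
    valid, problems = 0, []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are ignored rather than reported
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((line_no, f"invalid JSON: {e.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_no, "record is not a JSON object"))
                continue
            missing = [k for k in required_keys if k not in record]
            if missing:
                problems.append((line_no, f"missing keys: {missing}"))
            else:
                valid += 1
    return valid, problems
```

Calling `validate_jsonl(UNIT_TEST_FILE_PATH)` before loading the model would surface malformed records early, before any GPU time is spent.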


# Part 4 - Inference on loaded passage, question pairs ("generate answer for the question given the context that is the passage")

 - Generated answers are saved to persistent storage as ph_z_UT_generated_answers_fine_tuned_llama8bQLora.jsonl (see GENERATED_ANSWERS_FILE).

In [4]:
## INFERENCE RUN (BATCH)
# Input: UNIT_TEST_FILE_PATH; output: GENERATED_ANSWERS_FILE (both defined in Part 2)


import os
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel # For loading LoRA adapters
from huggingface_hub import login # For Hugging Face authentication
from google.colab import drive # Keep drive import for mounting

MAX_SEQ_LENGTH = 2048 # Max sequence length used during training

# --- Hugging Face Login (REQUIRED for Llama 3) ---
print("\nLogging into Hugging Face Hub...")
try:
    login()
    print("Hugging Face login successful!")
except Exception as e:
    print(f"Hugging Face login failed: {e}")
    print("Please ensure you have accepted the Llama 3 license and pasted a valid token.")
    exit() # Exit if login fails

# --- Check for existence of fine-tuned adapters directory and unit test file ---
print(f"\nChecking for existence of fine-tuned adapters directory: {FINE_TUNED_ADAPTERS_PATH}")
if not os.path.exists(FINE_TUNED_ADAPTERS_PATH):
    print(f"Error: Fine-tuned adapters directory not found at {FINE_TUNED_ADAPTERS_PATH}")
    print("Please ensure Step 2 (LLM Fine-Tuning) was completed successfully and adapters were saved to this path,")
    print("or that you have manually uploaded the entire folder to your Google Drive.")
    exit()
else:
    print("Fine-tuned adapters directory found. Proceeding.")

print(f"\nChecking for existence of unit test file: {UNIT_TEST_FILE_PATH}")
if not os.path.exists(UNIT_TEST_FILE_PATH):
    print(f"Error: Unit test file not found at {UNIT_TEST_FILE_PATH}")
    print("Please ensure you ran Section 2 and successfully uploaded the file.")
    exit()
else:
    print("Unit test file found. Proceeding.")


# --- Load Base Model with 4-bit Quantization ---
print(f"\nLoading base model: {MODEL_NAME} with 4-bit quantization...")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16, # Use float16 for compute during inference
)

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# Llama 3 tokenizer doesn't have a default pad_token, set it to eos_token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left" # For inference, left padding is generally preferred

print("Base model and tokenizer loaded.")

# --- Load Fine-tuned LoRA Adapters ---
print(f"Loading LoRA adapters from: {FINE_TUNED_ADAPTERS_PATH}...")
# Attach the LoRA adapters to the base model
model = PeftModel.from_pretrained(base_model, FINE_TUNED_ADAPTERS_PATH)
print("LoRA adapters loaded.")

# Merge adapters into the base model for faster inference (requires more VRAM); comment out the merge lines below to keep the adapters separate
print("Merging LoRA adapters into base model (optional, for faster inference)...")
model = model.merge_and_unload()
print("Adapters merged.")

# Set model to evaluation mode
model.eval()

# --- Function to generate response ---
def generate_answer(passage: str, question: str, model, tokenizer, max_new_tokens=100):
    """
    Generates an answer to a question based on a provided passage using the fine-tuned model.
    """
    messages = [
        {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided passage. Be concise and directly answer the question."},
        {"role": "user", "content": f"Passage: {passage}\nQuestion: {question}"},
    ]
    # Apply chat template and tokenize
    input_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True, # Important: tells the model to generate the assistant's turn
        return_tensors="pt"
    ).to(model.device)

    # Generate response
    with torch.no_grad(): # No need to calculate gradients during inference
        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            temperature=0.7,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Decode the generated text
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=False)

    # Find the start of the assistant's response
    assistant_start_tag = "<|start_header_id|>assistant<|end_header_id|>\n"
    start_index = decoded_output.find(assistant_start_tag)

    if start_index != -1:
        generated_answer = decoded_output[start_index + len(assistant_start_tag):].strip()
        # Remove any trailing <|eot_id|> or other special tokens
        generated_answer = generated_answer.replace("<|eot_id|>", "").strip()
    else:
        generated_answer = "Could not parse assistant's response." # Fallback

    return generated_answer

# --- Load Unit Test Data and Run Inference ---
print(f"\n--- Running Inference on '{UNIT_TEST_FILE_NAME}' and saving results ---")

unit_test_data = []
try:
    with open(UNIT_TEST_FILE_PATH, 'r', encoding='utf-8') as f:
        for line in f:
            unit_test_data.append(json.loads(line))
    print(f"Loaded {len(unit_test_data)} examples from '{UNIT_TEST_FILE_NAME}'.")
except Exception as e:
    print(f"Error loading unit test data: {e}")
    exit()

# Open the output file for writing generated answers
with open(GENERATED_ANSWERS_FILE, 'w', encoding='utf-8') as f_out:
    for i, example in enumerate(unit_test_data):
        doc_id = example.get('id', f"unknown_id_{i+1}")
        passage = example.get('passage', 'No passage provided.')
        question = example.get('question', 'No question provided.')

        print(f"\n--- Unit Test Example {i+1} (Doc ID: {doc_id}) ---")
        print(f"Passage: {passage.strip()}")
        print(f"Question: {question}")

        generated_answer = generate_answer(passage, question, model, tokenizer)
        print(f"Generated Answer: {generated_answer}")

        # Save the generated answer to the new JSONL file
        output_entry = {
            "id": doc_id,
            "question": question,
            "passage": passage,
            "gen_answer": generated_answer
        }
        json.dump(output_entry, f_out)
        f_out.write('\n') # Add newline for JSONL format

print(f"\nAll unit test inference examples completed. Generated answers saved to: {GENERATED_ANSWERS_FILE}")
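
The loop above opens the output file in `'w'` mode, so a rerun after a Colab disconnect starts from scratch and overwrites earlier answers. With 1000+ examples, a resumable variant may be worth considering. The sketch below appends to the file and skips ids that are already present; `load_done_ids` and `run_resumable` are hypothetical helpers, and `answer_fn` stands in for a call to `generate_answer`.

```python
import json
import os

def load_done_ids(out_path):
    """Collect the ids already present in an existing output JSONL file."""
    done = set()
    if os.path.exists(out_path):
        with open(out_path, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip blank or partially written lines
                if isinstance(record, dict) and "id" in record:
                    done.add(record["id"])
    return done

def run_resumable(examples, out_path, answer_fn):
    """Append answers for examples not yet in out_path; return how many were written."""
    done = load_done_ids(out_path)
    # If a previous run died mid-line, start on a fresh line before appending.
    needs_newline = False
    if os.path.exists(out_path) and os.path.getsize(out_path) > 0:
        with open(out_path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b"\n"
    written = 0
    with open(out_path, "a", encoding="utf-8") as f_out:
        if needs_newline:
            f_out.write("\n")
        for i, example in enumerate(examples):
            doc_id = example.get("id", f"unknown_id_{i+1}")
            if doc_id in done:
                continue
            passage = example.get("passage", "No passage provided.")
            question = example.get("question", "No question provided.")
            entry = {
                "id": doc_id,
                "question": question,
                "passage": passage,
                "gen_answer": answer_fn(passage, question),
            }
            f_out.write(json.dumps(entry) + "\n")
            f_out.flush()  # keep completed answers on disk if the session dies
            written += 1
    return written
```

In this notebook the call would look like `run_resumable(unit_test_data, GENERATED_ANSWERS_FILE, lambda p, q: generate_answer(p, q, model, tokenizer))`. Resuming relies on ids being unique, so records that fall back to `unknown_id_…` are only skipped reliably if the input order is unchanged between runs.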



Logging into Hugging Face Hub...



Hugging Face login successful!

Checking for existence of fine-tuned adapters directory: /content/drive/MyDrive/266_fp_ph_z/llama3_8b_qa_finetuned_adapters_standard_hf
Fine-tuned adapters directory found. Proceeding.

Checking for existence of unit test file: /content/drive/MyDrive/266_fp_ph_z/ph_z_rag/unit_test_question_passage_for_fine_tuned_llama8B.jsonl
Unit test file found. Proceeding.

Loading base model: meta-llama/Meta-Llama-3-8B-Instruct with 4-bit quantization...



Base model and tokenizer loaded.
Loading LoRA adapters from: /content/drive/MyDrive/266_fp_ph_z/llama3_8b_qa_finetuned_adapters_standard_hf...
LoRA adapters loaded.
Merging LoRA adapters into base model (optional, for faster inference)...




Adapters merged.

--- Running Inference on 'unit_test_question_passage_for_fine_tuned_llama8B.jsonl' and saving results ---


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
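
The warning above appears because `generate` is called with `input_ids` only, while the pad token has been set equal to the EOS token, so the mask cannot be inferred. For a single prompt with no padding, every position is a real token, and an all-ones mask can be passed explicitly. The sketch below shows one way to do this; `generate_new_tokens` is a hypothetical helper, not part of the notebook. It also returns only the newly generated token ids, which would remove the need to search the decoded text for the assistant header.

```python
def generate_new_tokens(model, input_ids, pad_token_id, max_new_tokens=100, **gen_kwargs):
    """Generate from one unpadded prompt with an explicit attention mask.

    Returns only the token ids produced after the prompt.
    """
    # One prompt, no padding: every position is real, so the mask is all ones.
    # new_ones keeps the dtype and device of input_ids
    # (equivalent to torch.ones_like(input_ids) for a torch tensor).
    attention_mask = input_ids.new_ones(input_ids.shape)
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,
        pad_token_id=pad_token_id,
        **gen_kwargs,
    )
    # Decoder-only models return prompt + continuation; drop the prompt part.
    prompt_len = input_ids.shape[-1]
    return outputs[0][prompt_len:]
```

Inside `generate_answer`, this could be used as `new_ids = generate_new_tokens(model, input_ids, tokenizer.pad_token_id, max_new_tokens=max_new_tokens, do_sample=True, top_p=0.9, temperature=0.7)` followed by `tokenizer.decode(new_ids, skip_special_tokens=True).strip()`. The all-ones mask is only valid for a single unpadded sequence; batched, left-padded inputs would need the mask produced by the tokenizer.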


Streaming output truncated to the last 5000 lines.
Passage: Through attachments and affection, and even envy andhatred, one plunges into activity and gets immersed in the world. This leads to embodiment in the physical frame and further egoism. In order to become free from thetwin pulls of pleasure and pain, one must rid oneself of the body-consciousness, and keep clear of self-centred actions. This again involves the absence of attachmentand hatred. Desire is the number one enemy of Libera- tion, or Moksha. Desire binds one to the wheel of birth and death. It brings about endless worry and tribulations.
Question: What does the passage imply about the relationship between Guna qualities and human experience?
Generated Answer: The passage implies that the Guna qualities of attachment, affection, envy, and hatred lead to embodiment in the physical frame and further egoism, ultimately binding one to the cycle of birth and death.

--- Unit Test Example 101 (Doc ID: 100) ---
P