Run 24 July


*   Input: a file of passages, phase_2_passages.jsonl
*   Task: generate questions for each passage, then generate an answer to each question based only on that passage


*   Model used: Mistral (mistralai/Mistral-7B-Instruct-v0.2, loaded with 4-bit quantization)

Targeted output: question, passage, answer triplets
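
As a rough illustration of the targeted output, each triplet can be stored as one JSON object per line (JSONL). The field names below follow the records written later in this notebook; the values are placeholders, not real output.

```python
import json

# Placeholder record; field names match the output written by the generation cell.
record = {
    "passage_id": "geetha_vahini_0030",
    "passage_text": "<passage text>",
    "question": "<generated question>",
    "answer": "<generated answer>",
}

# One record per line: serialize without embedded newlines, then parse it back.
line = json.dumps(record, ensure_ascii=False)
restored = json.loads(line)
print(restored["passage_id"])
```

Keeping the passage text inside each record makes the output file self-contained, at the cost of repeating the passage when several questions share it.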





# File Mounting and Setup

In [1]:
# --- Installation Block ---
# Install/upgrade necessary libraries for the entire pipeline.
# It's crucial to run these installations first.

# Core libraries for Large Language Models (LLMs) from Hugging Face
# `transformers`: The main library for model loading, tokenization, and generation.
# `accelerate`: Helps with efficient model loading and inference, especially on GPUs.
# `bitsandbytes`: Essential for 4-bit quantization.
!pip install --upgrade transformers accelerate bitsandbytes

# For displaying progress bars during long loops (e.g., passage processing)
!pip install tqdm

# Optional: only needed for other phases, not for Q&A generation
# !pip install pdfplumber
# !pip install sentence-transformers

print("Installation block executed. Proceeding to imports.")

# --- Import Block ---
# Import all necessary Python modules and components from installed libraries.

import os        # For interacting with the operating system (e.g., creating directories, joining paths)
import json      # For working with JSON data (reading .jsonl files)
import torch     # PyTorch library, fundamental for deep learning models and GPU operations
from tqdm.auto import tqdm # For displaying smart progress bars in notebooks
import re        # For regular expressions, used to parse and clean generated questions

# Specific imports from the transformers library for LLM operations
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

print("Import block executed.")
print("IMPORTANT: If this is the first run of these installations, you may need to")
print("restart the Colab runtime now (Runtime > Disconnect and delete runtime).")
print("After restarting, run this *entire block* again, and then proceed to the next blocks.")

Installation block executed. Proceeding to imports.
Import block executed.
IMPORTANT: If this is the first run of these installations, you may need to
restart the Colab runtime now (Runtime > Disconnect and delete runtime).
After restarting, run this *entire block* again, and then proceed to the next blocks.


In [2]:
def free_memory():
    import gc, torch
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    print(" Memory cleaned (RAM + GPU cache).")

# Phase Z3 - Generate Questions - Load files



Originally 3 questions per passage; this run generates 1 question per passage, using a model other than Llama 8B (Mistral-7B-Instruct-v0.2).
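
When a model is asked for several questions, it usually returns them as a numbered or bulleted list, sometimes with a preamble line. The sketch below is one possible way to pull the questions out of such text. `parse_questions` and its `max_questions` parameter are hypothetical helpers and not part of this notebook's pipeline, which uses a simpler regex.

```python
import re

def parse_questions(raw_text: str, max_questions: int = 3) -> list:
    """Extract up to max_questions questions from numbered or bulleted text."""
    questions = []
    for line in raw_text.splitlines():
        # Strip list markers such as "1.", "2)", "-" or "*".
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        # Keep only lines that look like questions; this skips preamble text.
        if cleaned.endswith("?"):
            questions.append(cleaned)
    return questions[:max_questions]

sample = (
    "Here are the questions:\n"
    "1. Who was Arjuna's charioteer?\n"
    "2) Where was the battle fought?\n"
    "- What does the teaching confer?\n"
    "4. Extra question?"
)
print(parse_questions(sample))
print(parse_questions(sample, max_questions=1))
```

Requiring a trailing question mark is a deliberate trade-off: it drops preamble lines such as "Here are the questions:", but it would also drop a valid question that the model forgot to punctuate.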

In [3]:
DATA_PATH = "/content/drive/MyDrive/266_fp_ph_z/"
FILE_NAME_ORIGINAL_PDF = "geetha_vahini.pdf"
FILE_NAME_CLEAN_TEXT = "phase_1_clean.txt"
FILE_NAME_PASSAGES = "phase_2_passages.jsonl"
FILE_NAME_GENERATED_QA = "phase_z3_generated_qa.jsonl"

full_path_original_pdf = os.path.join(DATA_PATH, FILE_NAME_ORIGINAL_PDF)
full_path_clean_text = os.path.join(DATA_PATH, FILE_NAME_CLEAN_TEXT)
full_path_passages = os.path.join(DATA_PATH, FILE_NAME_PASSAGES)
full_path_generated_qa = os.path.join(DATA_PATH, FILE_NAME_GENERATED_QA) # Output file for generated Q&A pairs



In [4]:
from google.colab import drive

def verify_and_load_passages() -> list:
    drive.mount('/content/drive', force_remount=True)

    if not os.path.exists(full_path_passages):
        print(f"Error: Passages file not found at '{full_path_passages}'.")
        print("Please ensure the path and file name are correct.")
        return []

    passages = []
    try:
        with open(full_path_passages, 'r', encoding='utf-8') as f:
            for line_num, line in enumerate(f):
                try:
                    passages.append(json.loads(line.strip()))
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON on line {line_num + 1} in '{full_path_passages}': {e}")
                    continue

        print(f"Successfully loaded {len(passages)} passages from '{full_path_passages}'.")
        if passages:
            print(f"Sample of the first passage: {passages[0]}")
        return passages

    except Exception as e:
        print(f"An unexpected error occurred while reading '{full_path_passages}': {e}")
        return passages


loaded_passages = verify_and_load_passages()

if loaded_passages:
    print("\nReady to proceed with LLM processing of passages.")
else:
    print("\nFailed to load passages. Please check the errors above.")

Mounted at /content/drive
Successfully loaded 4682 passages from '/content/drive/MyDrive/266_fp_ph_z/phase_2_passages.jsonl'.
Sample of the first passage: {'doc_id': 'geetha_vahini_0030', 'text': 'We need not learn any new language or read any old text to imbibe the lesson that the Lord is eager to teach us now, for B hagawan Sri Sathya Sai Baba is the Sanathana victory in the battle we are now waging. This Geetha Vahini Sarathi, the Timeless Charioteer, who is the same stream, refreshing and revitalising, brought by communicated the Geetha Sastra to Adithya the same Divine Restorer to revivify man caught in the mesh and helped Manu and King Ikshvaku to know it. He was the of modern dialectics, in the pride of modern science, in the charioteer of Arjuna during the great battle between Good cynical scorn of modern superficiality. The teaching here and Evil fought out at Kurukshetra. When the rider, Arjuna, set forth will comfort, console, and confer strength and was overcome with grief 
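
The loader above accepts any line that parses as JSON, so a record without a usable `doc_id` or `text` would only surface later, during generation. A minimal sketch of an up-front check follows. `split_valid_passages` and `min_chars` are hypothetical names and are not used elsewhere in this notebook.

```python
def split_valid_passages(passages: list, min_chars: int = 1) -> tuple:
    """Split records into (valid, rejected) based on 'doc_id' and 'text'."""
    valid, rejected = [], []
    for record in passages:
        is_dict = isinstance(record, dict)
        text = record.get("text") if is_dict else None
        doc_id = record.get("doc_id") if is_dict else None
        # A record is usable if it has a string id and non-trivial string text.
        if (isinstance(doc_id, str) and isinstance(text, str)
                and len(text.strip()) >= min_chars):
            valid.append(record)
        else:
            rejected.append(record)
    return valid, rejected

demo = [
    {"doc_id": "a", "text": "Some passage."},
    {"doc_id": "b", "text": "   "},
    {"text": "No id here."},
    "not a dict",
]
valid, rejected = split_valid_passages(demo)
print(len(valid), len(rejected))
```

Reporting the rejected records separately, rather than silently dropping them, makes it easier to see whether the passage file itself needs repair.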

# Phase Z3b - generate questions from passage, generate answers for question - passage pairs
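
The generation loop in this phase writes to the output file in append mode, so a run restarted after a disconnection would reprocess every passage and append duplicates. One possible way to resume is to read the passage IDs already present in the output file and skip them. The sketch below assumes the `passage_id` and `doc_id` field names used in this notebook; `load_done_ids` and `pending_passages` are hypothetical helpers.

```python
import json
import os
import tempfile

def load_done_ids(output_path: str) -> set:
    """Collect passage_id values already present in a JSONL output file."""
    done = set()
    if not os.path.exists(output_path):
        return done
    with open(output_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["passage_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # Skip truncated or malformed lines left by an interrupted run.
                continue
    return done

def pending_passages(passages: list, done_ids: set) -> list:
    """Return the passages whose doc_id has no record in the output yet."""
    return [p for p in passages if p.get("doc_id") not in done_ids]

# Small demonstration with a temporary file instead of the real output path.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_qa.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"passage_id": "p1", "question": "q", "answer": "a"}) + "\n")
        f.write("{truncated line\n")
    done = load_done_ids(demo_path)
    todo = pending_passages([{"doc_id": "p1"}, {"doc_id": "p2"}], done)

print(done, todo)
```

Passages that produced no Q&A pair (for example, empty text or an unparseable question) leave no record, so under this scheme they would be retried on every resume.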

In [8]:
# --- LLM Model Loading (Mistral-7B-Instruct-v0.2 with 4-bit Quantization) ---

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4", # Use NF4 quantization type
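    bnb_4bit_use_double_quant=True, # Double quantization: also quantizes the quantization constants to save memory
    bnb_4bit_compute_dtype=torch.bfloat16 # bfloat16 compute dtype; T4 GPUs lack native bfloat16 support, so torch.float16 is the usual choice there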
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

print(f"Loading tokenizer for {model_id}...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
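tokenizer.pad_token = tokenizer.eos_token # No dedicated pad token is set, so reuse EOS
tokenizer.padding_side = "left" # Decoder-only models need left padding for batched generation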

print(f"Loading model {model_id} with 4-bit quantization. This may take a few minutes...")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto" # Automatically put model layers on GPU
)
print(f"Model {model_id} loaded successfully!")
print(f"Model on device: {model.device}")
print(f"GPU VRAM Usage after loading: {torch.cuda.memory_allocated() / (1024**3):.2f} GB")


# --- Q&A Generation Function for Batches (Optimized for 1 question per passage) ---
def generate_qa_from_batch(batch_passages: list, model, tokenizer) -> list:
    batch_q_prompts = []
    original_passage_info = []

    for passage_data in batch_passages:
        passage_id = passage_data.get("doc_id", "unknown_id")
        passage_text = passage_data.get("text", "")

        original_passage_info.append({"passage_id": passage_id, "passage_text": passage_text})

        if not passage_text:
            batch_q_prompts.append("")
            continue

        question_prompt = (
            f"[INST] Given the following text passage, generate 1 factual question "
            f"that can be answered directly and explicitly from the passage. "
            f"Ensure the question is clear and concise."
            f"\n\nPassage:\n{passage_text}\n[/INST]\n"
        )
        batch_q_prompts.append(question_prompt)

    valid_q_prompts = [p for p in batch_q_prompts if p]
    if not valid_q_prompts:
        return []

    # LLM Call 1 (Batched): Question Generation
    q_inputs = tokenizer(valid_q_prompts, return_tensors="pt", padding=True, truncation=True, max_length=model.config.max_position_embeddings).to(model.device)

    q_output_sequences = model.generate(
        **q_inputs,
        max_new_tokens=128, # Can reduce max_new_tokens if questions are consistently short
        num_return_sequences=1,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id
    )

    # Map each valid prompt back to its position in the batch. list.index() would
    # return the wrong passage if two prompts in the same batch were identical.
    valid_indices = [idx for idx, p in enumerate(batch_q_prompts) if p]

    all_questions_for_batch_flat = []
    for i, q_output in enumerate(q_output_sequences):
        # Slice off the (padded) prompt tokens so only the generated text is decoded
        questions_raw = tokenizer.decode(q_output[q_inputs.input_ids[i].shape[0]:], skip_special_tokens=True)
        questions_text = questions_raw.strip()

        # Keep only the first non-empty line, dropping any leading list number
        questions = re.findall(r'^\d*\.?\s*(.+)', questions_text, re.MULTILINE)
        questions = [q.strip() for q in questions][:1]

        for question in questions: # At most one iteration
            all_questions_for_batch_flat.append({"original_passage_idx": valid_indices[i], "question": question})


    # LLM Call 2 (Batched): Answer Extraction
    batch_a_prompts = []
    answer_prompt_map_idx = []

    for item in all_questions_for_batch_flat:
        passage_idx = item["original_passage_idx"]
        question = item["question"]

        passage_info = original_passage_info[passage_idx]
        passage_text = passage_info["passage_text"]

        answer_prompt = (
            f"[INST] Based *only* on the following passage, answer the question below. "
            f"If the answer is not directly and explicitly available in the passage, "
            f"respond with 'N/A'. Be concise and direct."
            f"\n\nPassage:\n{passage_text}\n\nQuestion: {question}\n[/INST]\n"
        )
        batch_a_prompts.append(answer_prompt)
        answer_prompt_map_idx.append({"passage_id": passage_info['passage_id'],
                                       "passage_text": passage_text,
                                       "question": question})

    if not batch_a_prompts:
        return []

    a_inputs = tokenizer(batch_a_prompts, return_tensors="pt", padding=True, truncation=True, max_length=model.config.max_position_embeddings).to(model.device)

    a_output_sequences = model.generate(
        **a_inputs,
        max_new_tokens=128,
        num_return_sequences=1,
        do_sample=False, # Greedy decoding; temperature is omitted because it is ignored (with a warning) when sampling is off
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id
    )

    final_qa_pairs = []
    for i, a_output in enumerate(a_output_sequences):
        answer_raw = tokenizer.decode(a_output[a_inputs.input_ids[i].shape[0]:], skip_special_tokens=True)
        answer_text = answer_raw.strip()

        qa_info = answer_prompt_map_idx[i]
        qa_info["answer"] = answer_text
        final_qa_pairs.append(qa_info)

    return final_qa_pairs


# --- Main Execution Block with TQDM Progress and Persistence ---

# Ensure passages are loaded and model is available before starting
if not loaded_passages:
    print("No passages loaded. Exiting Q&A generation.")
elif model is None or tokenizer is None:
    print("LLM model or tokenizer failed to load. Exiting Q&A generation.")
else:
    print(f"\nStarting Q&A generation for {len(loaded_passages)} passages...")

    # --- Crucial for speed: Define your batch size ---
    # Start with a conservative batch size (e.g., 4 or 8) and increase it
    # if you have VRAM headroom (check GPU VRAM usage in Colab).
    # Too large a batch size will lead to CUDA out of memory.
    batch_size = 8

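    # Open the output file in append mode ('a') so results are written incrementally
    # and survive a Colab disconnection. Re-running this cell appends to the existing
    # file, so remove it first (or skip finished passages) to avoid duplicate entries.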
    with open(full_path_generated_qa, 'a', encoding='utf-8') as outfile:
        # Create batches
        for i in tqdm(range(0, len(loaded_passages), batch_size),
                      desc="Processing Batches",
                      unit="batch"):

            current_batch = loaded_passages[i:i + batch_size]

            # Generate Q&A for the current batch
            qa_pairs = generate_qa_from_batch(current_batch, model, tokenizer)

            # Write generated Q&A pairs to the file immediately
            if qa_pairs:
                for qa_pair in qa_pairs:
                    outfile.write(json.dumps(qa_pair) + '\n')
                outfile.flush() # Flush Python's write buffer after each batch

    print("\nQ&A generation complete for all passages!")
    print(f"All generated Q&A pairs saved to: {full_path_generated_qa}")

    # Optional: Verify by loading and printing a few lines from the generated file
    print("\n--- Verifying a few lines from the output file ---")
    try:
        with open(full_path_generated_qa, 'r', encoding='utf-8') as f_verify:
            for j, line in enumerate(f_verify):
                print(json.loads(line.strip()))
                if j >= 4: # Print first 5 entries
                    break
    except Exception as e:
        print(f"Error reading generated Q&A file for verification: {e}")

Loading tokenizer for mistralai/Mistral-7B-Instruct-v0.2...
Loading model mistralai/Mistral-7B-Instruct-v0.2 with 4-bit quantization. This may take a few minutes...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Model mistralai/Mistral-7B-Instruct-v0.2 loaded successfully!
Model on device: cuda:0
GPU VRAM Usage after loading: 7.78 GB

Starting Q&A generation for 4682 passages...


Processing Batches:   0%|          | 0/586 [00:00<?, ?batch/s]

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Q&A generation complete for all passages!
All generated Q&A pairs saved to: /content/drive/MyDrive/266_fp_ph_z/phase_z3_generated_qa.jsonl

--- Verifying a few lines from the output file ---
{'passage_id': 'geetha_vahini_0030', 'passage_text': 'We need not learn any new language or read any old text to imbibe the lesson that the Lord is eager to teach us now, for B hagawan Sri Sathya Sai Baba is the Sanathana victory in the battle we are now waging. This Geetha Vahini Sarathi, the Timeless Charioteer, who is the same stream, refreshing and revitalising, brought by communicated the Geetha Sastra to Adithya the same Divine Restorer to revivify man caught in the mesh and helped Manu and King Ikshvaku to know it. He was the of modern dialectics, in the pride of modern science, in the charioteer of Arjuna during the great battle between Good cynical scorn of modern superficiality. The teaching here and Evil fought out at Kurukshetra. When the rider, Arjuna, set forth will comfort, console,