# Task 3: Model Inference & Test Generation

Generate predictions from both fine-tuned models on the test set.

**Models** (LoRA adapters loaded on top of `google/flan-t5-base`):
- Model A (Context-Only): `fdangolo/flan-t5-context-only`
- Model B (Explain-and-Answer): `fdangolo/flan-t5-exp-ans`

**Test Data:** ConflictQA test set (896 examples)

**Expected Runtime:** 15-30 minutes on Colab GPU

---

## Setup Instructions

1. Enable GPU runtime (Runtime > Change runtime type > T4 GPU)
2. Upload `test_context_only.jsonl` to `MyDrive/reproducing_project/data/splits/` on Google Drive (the `DATA_PATH` defined in Step 1)
3. Have your Hugging Face token ready for authentication
4. Execute cells in order from top to bottom

---

## Step 1: Mount Google Drive & Setup Paths

Mount Drive to access the test data and save predictions.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

BASE_PATH = "/content/drive/MyDrive/reproducing_project"
DATA_PATH = f"{BASE_PATH}/data/splits"

print(f"Drive mounted")
print(f"Base path: {BASE_PATH}")
print(f"Data path: {DATA_PATH}")

## Step 2: Install Dependencies

Install required libraries for model inference.

In [None]:
!pip install -q transformers datasets peft accelerate bitsandbytes torch
!pip install -q huggingface_hub tqdm

print("All dependencies installed")

## Step 3: Authenticate with Hugging Face

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Step 4: Load Test Dataset

Load the test set from `test_context_only.jsonl`. Both models use the same input format.

In [None]:
from datasets import load_dataset
import json
import os

test_file_path = f"{DATA_PATH}/test_context_only.jsonl"

print(f"Loading test data from: {test_file_path}")

if not os.path.exists(test_file_path):
    # Stop here: every later cell depends on test_dataset being defined
    raise FileNotFoundError(
        f"File not found at {test_file_path}. "
        f"Please ensure test_context_only.jsonl is uploaded to: {DATA_PATH}/"
    )

test_dataset = load_dataset('json', data_files={'test': test_file_path})['test']

print("\nExample Test Input:")
print(test_dataset[0]['input'][:300] + "...")
print(f"\nLoaded {len(test_dataset)} test examples")
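
Before spending GPU time, it can help to confirm that every record in the test file actually has the `input` field the generation loops read. The following is a minimal sketch using only the standard library; the helper name `check_jsonl_inputs` is hypothetical and not part of the project, and it assumes one JSON object per line.

```python
import json

def check_jsonl_inputs(lines):
    """Return (record_count, problems) for an iterable of JSONL lines.

    `problems` is a list of (line_number, reason) tuples for records that
    are not valid JSON or lack a non-empty string `input` field.
    """
    count = 0
    problems = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        count += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        value = record.get("input") if isinstance(record, dict) else None
        if not isinstance(value, str) or not value.strip():
            problems.append((line_no, "missing or empty 'input'"))
    return count, problems

# Small in-memory demo: the second record has no `input` field
sample = ['{"input": "Question text"}', '{"output": "x"}', ""]
print(check_jsonl_inputs(sample))
```

In the notebook this could be called as `check_jsonl_inputs(open(test_file_path, encoding="utf-8"))`, comparing the returned count against the expected 896 examples.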

## Step 5: Generate Predictions - Model A (Context-Only)

This model answers questions directly without providing explanations.

In [None]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel
from tqdm import tqdm

base_model_id = "google/flan-t5-base"
adapter_A_id = "fdangolo/flan-t5-context-only"
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Using device: {device}")

print(f"\nLoading base model: {base_model_id}")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id).to(device)

print(f"Loading LoRA adapter: {adapter_A_id}")
model = PeftModel.from_pretrained(model, adapter_A_id).to(device)
model.eval()

print("Model A loaded successfully\n")

predictions_A = []
print(f"Generating predictions for Model A...")
print(f"Processing {len(test_dataset)} examples...")

with torch.no_grad():
    for example in tqdm(test_dataset, desc="Model A"):
        inputs = tokenizer(example['input'], return_tensors="pt", max_length=1024, truncation=True).to(device)
        
        # Greedy decoding (num_beams defaults to 1). early_stopping only
        # affects beam search, so it is omitted here.
        outputs = model.generate(
            **inputs,
            max_new_tokens=50
        )
        
        prediction_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        predictions_A.append(prediction_text)

print(f"\nGenerated {len(predictions_A)} predictions")

output_file_A = f"{BASE_PATH}/predictions_A.json"
with open(output_file_A, 'w') as f:
    json.dump(predictions_A, f, indent=2)

print(f"Saved predictions to: {output_file_A}")

print("\n" + "="*70)
print("Sample Predictions (Model A)")
print("="*70)
for i in range(min(3, len(test_dataset))):
    print(f"\nExample {i+1}:")
    print(f"Input: {test_dataset[i]['input'][:200]}...")
    print(f"Prediction: {predictions_A[i]}")
print("="*70)

## Step 6: Generate Predictions - Model B (Explain-and-Answer)

This model explains the conflict between sources before providing an answer.

The output format is: `Explanation + "\n\n" + Answer`

We parse the output to extract only the final answer for evaluation.
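
One caveat with this parsing: if a decoded output contains no `"\n\n"`, `split("\n\n")[-1]` returns the whole string, so the "parsed answer" silently includes the explanation. This could happen if the model omits the separator or if the tokenizer does not preserve newlines when decoding, which is worth checking on a few raw outputs. The sketch below is one way to make that case visible; `extract_answer` is a hypothetical helper, not part of the original notebook.

```python
def extract_answer(raw_text, separator="\n\n"):
    """Return (answer, separator_found) for one decoded model output.

    When the separator is present, the answer is the text after its last
    occurrence. When it is absent, the whole text is returned and the flag
    is False so the caller can count or inspect those cases.
    """
    text = raw_text.strip()
    if separator in text:
        return text.split(separator)[-1].strip(), True
    return text, False

# Demo on two hand-written strings (not real model outputs)
print(extract_answer("The sources disagree on the date.\n\n1969"))
print(extract_answer("1969"))
```

Counting how many outputs come back with `separator_found == False` gives a quick signal of whether the parsed predictions really are answer-only.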

In [None]:
import torch
import gc

print("Cleaning up GPU memory...")
del model
gc.collect()
torch.cuda.empty_cache()
print("Memory cleared\n")

adapter_B_id = "fdangolo/flan-t5-exp-ans"

print(f"Loading base model: {base_model_id}")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id).to(device)

print(f"Loading LoRA adapter: {adapter_B_id}")
model = PeftModel.from_pretrained(model, adapter_B_id).to(device)
model.eval()

print("Model B loaded successfully\n")

predictions_B_raw = []
predictions_B_parsed = []
print(f"Generating predictions for Model B...")
print(f"Processing {len(test_dataset)} examples...")

with torch.no_grad():
    for example in tqdm(test_dataset, desc="Model B"):
        inputs = tokenizer(example['input'], return_tensors="pt", max_length=1024, truncation=True).to(device)
        
        # Greedy decoding (num_beams defaults to 1). early_stopping only
        # affects beam search, so it is omitted here.
        outputs = model.generate(
            **inputs,
            max_new_tokens=300
        )
        
        raw_prediction_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        predictions_B_raw.append(raw_prediction_text)
        
        # Parse output: split by "\n\n" and take the last part (the answer)
        parsed_answer = raw_prediction_text.split("\n\n")[-1].strip()
        predictions_B_parsed.append(parsed_answer)

print(f"\nGenerated {len(predictions_B_parsed)} predictions")

output_file_B_raw = f"{BASE_PATH}/predictions_B_raw.json"
with open(output_file_B_raw, 'w') as f:
    json.dump(predictions_B_raw, f, indent=2)
print(f"Saved raw predictions to: {output_file_B_raw}")

output_file_B = f"{BASE_PATH}/predictions_B.json"
with open(output_file_B, 'w') as f:
    json.dump(predictions_B_parsed, f, indent=2)
print(f"Saved parsed predictions to: {output_file_B}")

print("\n" + "="*70)
print("Sample Predictions (Model B)")
print("="*70)
for i in range(min(3, len(test_dataset))):
    print(f"\nExample {i+1}:")
    print(f"Input: {test_dataset[i]['input'][:200]}...")
    print(f"\nRaw Output (Explanation + Answer):")
    print(f"  {predictions_B_raw[i]}")
    print(f"\nParsed Answer Only:")
    print(f"  {predictions_B_parsed[i]}")
    print("-" * 70)
print("="*70)

---

## Step 7: Summary

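Generation is complete for both models. This cell reports prediction counts, output file locations and sizes, and average prediction lengths.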

In [None]:
print("\n" + "="*80)
print("PREDICTION GENERATION COMPLETE")
print("="*80)

print(f"\nSummary:")
print(f"  Test examples processed: {len(test_dataset)}")
print(f"  Model A predictions: {len(predictions_A)}")
print(f"  Model B predictions: {len(predictions_B_parsed)}")

print(f"\nOutput files (saved to Google Drive):")
print(f"  {BASE_PATH}/predictions_A.json")
print(f"  {BASE_PATH}/predictions_B.json")
print(f"  {BASE_PATH}/predictions_B_raw.json (with explanations)")

print(f"\nFile sizes:")
import os
if os.path.exists(f"{BASE_PATH}/predictions_A.json"):
    size_A = os.path.getsize(f"{BASE_PATH}/predictions_A.json") / 1024
    print(f"  predictions_A.json: {size_A:.1f} KB")
if os.path.exists(f"{BASE_PATH}/predictions_B.json"):
    size_B = os.path.getsize(f"{BASE_PATH}/predictions_B.json") / 1024
    print(f"  predictions_B.json: {size_B:.1f} KB")
if os.path.exists(f"{BASE_PATH}/predictions_B_raw.json"):
    size_B_raw = os.path.getsize(f"{BASE_PATH}/predictions_B_raw.json") / 1024
    print(f"  predictions_B_raw.json: {size_B_raw:.1f} KB")

print("\n" + "="*80)
print("\nNext steps:")
print("  1. Download prediction files from Google Drive")
print("  2. Run evaluation script: evaluate_conflictqa.py")
print("  3. Compare results with original paper metrics")
print("  4. Document findings in REPRODUCIBILITY_LOG.md")
print("\n" + "="*80)

print("\nQuick statistics:")
print(f"  Model A - Average prediction length: {sum(len(p.split()) for p in predictions_A) / len(predictions_A):.1f} words")
print(f"  Model B - Average prediction length: {sum(len(p.split()) for p in predictions_B_parsed) / len(predictions_B_parsed):.1f} words")
print(f"  Model B - Average raw output length: {sum(len(p.split()) for p in predictions_B_raw) / len(predictions_B_raw):.1f} words")

print("\nAll predictions generated successfully")
print("="*80 + "\n")

---

## Notes

### About the Predictions

**Model A (Context-Only)**
- Generates short, direct answers
- No explanation provided
- Optimized for concise responses

**Model B (Explain-and-Answer)**
- Generates explanation first, then answer
- Output format: `Explanation\n\nAnswer`
- Both raw (with explanation) and parsed (answer only) versions are saved
- The parsed version is used for evaluation

### File Locations

Prediction files saved to Google Drive:
```
{BASE_PATH}/
├── predictions_A.json          # Model A: Direct answers
├── predictions_B.json          # Model B: Parsed answers (for evaluation)
└── predictions_B_raw.json      # Model B: Full output (with explanations)
```

### Evaluation

To evaluate these predictions:
1. Download the prediction files from Google Drive
2. Place them in the `src/evaluation/` directory
3. Run the evaluation script (`evaluate_conflictqa.py`)
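
Before running the evaluation, a quick consistency check on the downloaded files can catch truncated or misaligned outputs. The sketch below assumes the three file names written by this notebook and that each file is a JSON list of strings; the function `load_predictions` is hypothetical and not part of the evaluation script.

```python
import json
import tempfile
from pathlib import Path

# File names as written by this notebook
PREDICTION_FILES = (
    "predictions_A.json",
    "predictions_B.json",
    "predictions_B_raw.json",
)

def load_predictions(directory, expected_count=None):
    """Load the prediction files and verify they are aligned lists of strings."""
    loaded = {}
    for name in PREDICTION_FILES:
        with open(Path(directory) / name, encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, list) or not all(isinstance(p, str) for p in data):
            raise ValueError(f"{name}: expected a JSON list of strings")
        loaded[name] = data

    lengths = {name: len(preds) for name, preds in loaded.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Prediction counts differ: {lengths}")
    if expected_count is not None and lengths[PREDICTION_FILES[0]] != expected_count:
        raise ValueError(f"Expected {expected_count} predictions, found {lengths}")
    return loaded

# Demo with throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in PREDICTION_FILES:
        (Path(tmp) / name).write_text(json.dumps(["yes", "no"]), encoding="utf-8")
    demo = load_predictions(tmp, expected_count=2)
    print({name: len(preds) for name, preds in demo.items()})
```

For the real files, `load_predictions("src/evaluation", expected_count=896)` would check them against the size of the ConflictQA test set.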

### GPU Usage

- Inference time: 15-30 minutes on Colab T4 GPU
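- Memory: Each model (FLAN-T5-base plus a LoRA adapter) fits comfortably in the T4's 16 GB of memory (roughly 15 GB usable on Colab); the notebook loads them one at a time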
- Cost: Free tier is sufficient

---

For reproducibility research:
- Document inference time in REPRODUCIBILITY_LOG.md
- Compare generated samples with original paper examples
- Note any differences in model behavior
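
To make the inference time easy to record, each generation loop could be wrapped in a small timer. This is a minimal sketch using only the standard library; the names `timed` and `format_log_line`, and the bullet format of the log line, are assumptions for illustration rather than a format that REPRODUCIBILITY_LOG.md prescribes.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

def format_log_line(label, seconds):
    """Format one duration as a Markdown bullet (hypothetical log format)."""
    return f"- {label}: {seconds / 60:.1f} min ({seconds:.0f} s)"

# Demo with a trivial workload standing in for a generation loop
timings = {}
with timed("demo", timings):
    sum(range(1000))
print(format_log_line("demo", timings["demo"]))
```

In the notebook, the `with timed("Model A inference", timings):` block would enclose the `for example in tqdm(...)` loop, and the formatted lines could then be pasted into the log.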