# 🔧 Environment Setup

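This notebook runs in **both Google Colab and a local VS Code environment**.

**Choose your environment:**
- **Google Colab:** Run the cell below as-is; it mounts Google Drive and sets `BASE_PATH` to a Drive folder
- **VS Code (local):** Update `BASE_PATH` in the cell below to your project directory before running it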

In [1]:
# Environment Detection and Setup
import os
import sys

# Detect if running in Colab
try:
    import google.colab
    IN_COLAB = True
    print("🌐 Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("💻 Running locally (VS Code)")

# Set base path based on environment
if IN_COLAB:
    # Colab: Mount Google Drive
    from google.colab import drive
    drive.mount('/content/drive')
    BASE_PATH = "/content/drive/MyDrive/reproducing_project"
    print(f"✓ Drive mounted. Base path: {BASE_PATH}")
else:
    # Local: Use project directory
    BASE_PATH = "/Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context"
    print(f"✓ Using local path: {BASE_PATH}")

# Verify path exists
if os.path.exists(BASE_PATH):
    print("✓ Path verified")
else:
    print(f"⚠️ WARNING: Path does not exist: {BASE_PATH}")
    print("   Please update BASE_PATH in this cell")

💻 Running locally (VS Code)
✓ Using local path: /Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context
✓ Path verified


# 🚀 Quick Start Guide

**Before running:**
1. ✅ Upload `data/splits/*.jsonl` files to your Google Drive
2. ✅ Set `BASE_PATH` in the environment setup cell to your Google Drive folder
3. ✅ Set `HF_USERNAME` in the Hugging Face authentication cell to your Hugging Face username
4. ✅ Make sure you have a GPU runtime: Runtime > Change runtime type > T4 or A100

**Then:** Run all cells (Runtime > Run all)

**This notebook trains TWO models:**
- Model A: Context-Only approach
- Model B: Explain-and-Answer approach

**Expected time:** 2-6 hours total (depends on GPU)

# Task 2: Finetuning with Google Colab

**Goal:** Reproduce the training from `train.sh` using modern transformers and PEFT libraries.

**Two Experiments:**
- **Experiment A (Context-Only):** Fine-tune flan-t5-base on `train_context_only.jsonl` / `dev_context_only.jsonl`
- **Experiment B (Explain-and-Answer):** Fine-tune flan-t5-base on `train_exp_ans.jsonl` / `dev_exp_ans.jsonl`
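
Before training, it can help to confirm that every record in a split file carries the `input` and `output` fields that the preprocessing step later in this notebook reads. The sketch below is one way to do that with the standard library only. The helper name `check_jsonl` and the two demo records are hypothetical and are not taken from the project data.

```python
import json
import tempfile
from pathlib import Path


def check_jsonl(path, required_keys=("input", "output")):
    """Return (record_count, problems) for a JSONL file."""
    problems = []
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            count += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append(f"line {line_number}: invalid JSON ({error.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_number}: not a JSON object")
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append(f"line {line_number}: missing {missing}")
    return count, problems


# Demo on a throwaway file; both records are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text(
        json.dumps({"input": "question + contexts", "output": "answer"}) + "\n"
        + json.dumps({"input": "no target here"}) + "\n",
        encoding="utf-8",
    )
    demo_count, demo_problems = check_jsonl(demo_path)

print(demo_count, demo_problems)
```

To check a real split, call `check_jsonl` on a file such as `train_context_only.jsonl` under `data/splits` and confirm the problem list is empty.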

**Note:** This notebook replaces the outdated `autotrain-advanced` approach with direct use of `transformers` and `peft` libraries, which work reliably in Google Colab.
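
As a rough sanity check on the LoRA settings used in the training function later in this notebook (`r=16` on the `q` and `v` projections), the number of trainable parameters can be estimated by hand. The sketch below assumes the standard flan-t5-base shapes (hidden size 768, 12 encoder blocks, 12 decoder blocks). These values are assumptions about the checkpoint, not something stated in this notebook, so compare the result with what `model.print_trainable_parameters()` reports.

```python
def lora_trainable_params(d_in, d_out, rank, num_modules):
    """Each adapted linear layer gains A (rank x d_in) and B (d_out x rank)."""
    return num_modules * rank * (d_in + d_out)


# Assumed flan-t5-base shapes (not read from the checkpoint).
D_MODEL = 768
ENCODER_BLOCKS = 12
DECODER_BLOCKS = 12

# Encoder blocks have one attention layer (q, v).
# Decoder blocks have self-attention and cross-attention (q, v in each).
adapted_modules = ENCODER_BLOCKS * 2 + DECODER_BLOCKS * 4

total_params = lora_trainable_params(
    D_MODEL, D_MODEL, rank=16, num_modules=adapted_modules
)
print(adapted_modules, total_params)
```

Under these assumptions the adapters add well under 1% of the base model's parameters, which is why LoRA fine-tuning fits on a single T4.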

---

## Time Tracking (for Reproducibility Log)
**Remember to log:**
- Start/End time for each experiment
- GPU type (Runtime > Change runtime type)
- GPU hours used
- Any errors or issues encountered
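
The items above can be collected into a log entry with a small helper. This is a minimal sketch: `format_log_entry` and its `num_gpus` parameter are hypothetical, the timestamps are made up, and GPU hours are taken to be wall-clock hours multiplied by the number of GPUs.

```python
from datetime import datetime


def format_log_entry(experiment, start, end, gpu_type, num_gpus=1):
    """Build a Markdown snippet for the reproducibility log.

    GPU hours are wall-clock hours multiplied by the number of GPUs.
    """
    wall_hours = (end - start).total_seconds() / 3600
    gpu_hours = wall_hours * num_gpus
    stamp = "%Y-%m-%d %H:%M:%S"
    return "\n".join([
        f"- Experiment: {experiment}",
        f"- Start: {start.strftime(stamp)}",
        f"- End: {end.strftime(stamp)}",
        f"- GPU type: {gpu_type}",
        f"- GPU Hours: {gpu_hours:.2f}",
    ])


# Made-up timestamps, only to show the shape of an entry.
entry = format_log_entry(
    "Context-Only (Experiment A)",
    start=datetime(2025, 1, 1, 10, 0, 0),
    end=datetime(2025, 1, 1, 12, 30, 0),
    gpu_type="T4",
)
print(entry)
```

The printed lines can be pasted into the reproducibility log alongside any notes on errors encountered.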

In [2]:
# Install required libraries
!pip install -q transformers datasets peft accelerate bitsandbytes huggingface_hub

print("✓ All dependencies installed")

✓ All dependencies installed


In [3]:
# Prepare data directories for both experiments
import os

if IN_COLAB:
    # Colab: Create directories and copy from Drive
    data_context_dir = f"{BASE_PATH}/data_context_only"
    data_exp_ans_dir = f"{BASE_PATH}/data_exp_ans"
    
    os.makedirs(data_context_dir, exist_ok=True)
    os.makedirs(data_exp_ans_dir, exist_ok=True)
    
    print("--- Preparing Experiment A (Context-Only) ---")
    !cp "{BASE_PATH}/data/splits/train_context_only.jsonl" "{data_context_dir}/train.jsonl"
    !cp "{BASE_PATH}/data/splits/dev_context_only.jsonl" "{data_context_dir}/valid.jsonl"
    !ls -lh "{data_context_dir}/"
    
    print("\n--- Preparing Experiment B (Explain-and-Answer) ---")
    !cp "{BASE_PATH}/data/splits/train_exp_ans.jsonl" "{data_exp_ans_dir}/train.jsonl"
    !cp "{BASE_PATH}/data/splits/dev_exp_ans.jsonl" "{data_exp_ans_dir}/valid.jsonl"
    !ls -lh "{data_exp_ans_dir}/"
else:
    # Local: Use existing data/splits directory directly
    data_context_dir = f"{BASE_PATH}/data/splits"
    data_exp_ans_dir = f"{BASE_PATH}/data/splits"
    
    print("--- Using Local Data ---")
    print(f"✓ Context-Only data: {data_context_dir}")
    print(f"✓ Explain-and-Answer data: {data_exp_ans_dir}")
    
    # Verify files exist
    required_files = [
        'train_context_only.jsonl', 'dev_context_only.jsonl',
        'train_exp_ans.jsonl', 'dev_exp_ans.jsonl'
    ]
    
    for file in required_files:
        path = os.path.join(data_context_dir, file)
        if os.path.exists(path):
            size = os.path.getsize(path) / 1024  # KB
            print(f"  ✓ {file} ({size:.1f} KB)")
        else:
            print(f"  ✗ {file} NOT FOUND")

print("\n✓ Data preparation complete")

--- Using Local Data ---
✓ Context-Only data: /Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context/data/splits
✓ Explain-and-Answer data: /Users/francescodangolo/Desktop/CS 421 - Natural Language Processing/Research Project/qa-with-conflicting-context/data/splits
  ✓ train_context_only.jsonl (591.8 KB)
  ✓ dev_context_only.jsonl (624.4 KB)
  ✓ train_exp_ans.jsonl (642.6 KB)
  ✓ dev_exp_ans.jsonl (699.8 KB)

✓ Data preparation complete


In [6]:
# Hugging Face Authentication (environment-aware)
import os

# Set your Hugging Face username here
HF_USERNAME = "dangolofrancesco"  # ⚠️ UPDATE THIS to your HF username

if IN_COLAB:
    # Colab: Use notebook_login with widget
    from huggingface_hub import notebook_login
    print("🔐 Please authenticate with Hugging Face (Colab):")
    notebook_login()
else:
    # Local: Use CLI-based authentication
    print("🔐 Authenticating with Hugging Face (Local)...")
    print("\n⚠️  IMPORTANT: You need to authenticate first!")
    print("\nOption 1 (Recommended): Run this in terminal:")
    print("   huggingface-cli login")
    print("\nOption 2: Set environment variable:")
    print("   export HF_TOKEN='your_token_here'")
    print("\nAfter authentication, this cell will verify the login.")
    print("\n" + "="*60)
    
    # Check for a stored token (this also picks up the HF_TOKEN environment variable)
    from huggingface_hub import get_token, login
    token = get_token()

    if token:
        print("✅ Already authenticated! Token found.")
        login(token=token)
    else:
        print("\n❌ Not authenticated yet!")
        print("\nPlease run in terminal: huggingface-cli login")
        print("Then re-run this cell.")
        raise ValueError("Hugging Face authentication required. Run 'huggingface-cli login' in terminal.")

print(f"\n✓ Authenticated as: {HF_USERNAME}")
print("✓ Ready to train and push models!")

🔐 Authenticating with Hugging Face (Local)...

⚠️  IMPORTANT: You need to authenticate first!

Option 1 (Recommended): Run this in terminal:
   huggingface-cli login

Option 2: Set environment variable:
   export HF_TOKEN='your_token_here'

After authentication, this cell will verify the login.

✅ Already authenticated! Token found.

✓ Authenticated as: dangolofrancesco
✓ Ready to train and push models!


In [None]:
# Training function that matches train.sh configuration
import os
import time
from datetime import datetime
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForSeq2Seq
)
from peft import LoraConfig, get_peft_model, TaskType

def train_model(experiment_name, data_path, model_name_suffix, your_hf_username, train_file, valid_file):
    """
    Train a model matching the train.sh configuration.

    Args:
        experiment_name: "Context-Only" or "Explain-and-Answer"
        data_path: Path to training data directory
        model_name_suffix: Suffix for model name (e.g., "context-only")
        your_hf_username: Your Hugging Face username
        train_file: Name of training file (e.g., "train_context_only.jsonl")
        valid_file: Name of validation file (e.g., "dev_context_only.jsonl")
    """
    print(f"\n{'='*60}")
    print(f"  EXPERIMENT: {experiment_name}")
    print(f"{'='*60}\n")

    start_time = time.time()
    start_datetime = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"⏱️  Start Time: {start_datetime}")

    # Configuration (matching train.sh)
    base_model_id = "google/flan-t5-base"
    new_model_repo = f"{your_hf_username}/flan-t5-{model_name_suffix}"

    # LoRA Config (from train.sh: use-peft with default LoRA settings)
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        r=16,              # LoRA rank
        lora_alpha=32,     # LoRA alpha
        lora_dropout=0.05, # LoRA dropout
        target_modules=["q", "v"],  # Target attention layers explicitly
        inference_mode=False
    )

    # Training Args (matching train.sh parameters)
    training_args = TrainingArguments(
        output_dir=f"{BASE_PATH}/models/{model_name_suffix}",
        learning_rate=2e-4,                    # from train.sh
        per_device_train_batch_size=4,         # from train.sh
        per_device_eval_batch_size=4,
        num_train_epochs=3,                    # from train.sh
        logging_steps=10,
        save_strategy="epoch",
        eval_strategy="epoch",                 # Changed from evaluation_strategy (deprecated)
        load_best_model_at_end=True,
        push_to_hub=True,
        hub_model_id=new_model_repo,
        report_to="none",  # Disable wandb/tensorboard
        warmup_steps=50,  # Add warmup
        weight_decay=0.01,  # Add weight decay
        logging_first_step=True,  # Log first step to see if training starts
        save_total_limit=2,  # Only keep 2 checkpoints
    )

    print(f"📦 Loading dataset from: {data_path}")
    
    # Determine file paths based on environment
    if IN_COLAB:
        train_path = os.path.join(data_path, 'train.jsonl')
        valid_path = os.path.join(data_path, 'valid.jsonl')
    else:
        # Local: use original filenames
        train_path = os.path.join(data_path, train_file)
        valid_path = os.path.join(data_path, valid_file)
    
    # Load dataset
    raw_datasets = load_dataset('json', data_files={
        'train': train_path,
        'validation': valid_path
    })

    print(f"   - Training examples: {len(raw_datasets['train'])}")
    print(f"   - Validation examples: {len(raw_datasets['validation'])}")

    print(f"\n🤖 Loading model and tokenizer: {base_model_id}")
    # Load model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(base_model_id)

    # Preprocessing function (matching train.sh: model_max_length=1024)
    def preprocess_function(examples):
        # Ensure inputs are strings (handle potential None or non-string values)
        inputs = [str(text) if text is not None else "" for text in examples['input']]
        outputs = [str(text) if text is not None else "" for text in examples['output']]
        
        model_inputs = tokenizer(
            inputs, 
            max_length=1024,  # from train.sh
            truncation=True,
            padding=False  # Will pad dynamically
        )
        
        # Tokenize targets - T5 requires this format
        labels = tokenizer(
            text_target=outputs,  # Use text_target parameter for T5
            max_length=256,
            truncation=True,
            padding=False
        )
        
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    print("🔄 Tokenizing dataset...")
    tokenized_datasets = raw_datasets.map(
        preprocess_function,
        batched=True,
        remove_columns=raw_datasets["train"].column_names
    )
    
    # Debug: Check tokenized data
    print(f"   Sample input IDs length: {len(tokenized_datasets['train'][0]['input_ids'])}")
    print(f"   Sample labels length: {len(tokenized_datasets['train'][0]['labels'])}")
    print(f"   First few input tokens: {tokenized_datasets['train'][0]['input_ids'][:10]}")
    print(f"   First few label tokens: {tokenized_datasets['train'][0]['labels'][:10]}")

    print("🔧 Applying LoRA (PEFT)...")
    # Apply LoRA
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

    # Ensure model is in training mode
    model.train()
    
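    # Enable gradient checkpointing for memory efficiency.
    # The base weights are frozen, so make the embedding outputs require grad;
    # otherwise checkpointed blocks may not propagate gradients to the LoRA weights.
    model.enable_input_require_grads()
    model.gradient_checkpointing_enable()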

    # Data collator for dynamic padding (with label padding)
    data_collator = DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        model=model,
        padding=True,
        label_pad_token_id=-100  # Ignore padding tokens in loss
    )

    # Initialize trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        tokenizer=tokenizer,
        data_collator=data_collator,
    )

    print(f"\n🚀 Starting training for: {experiment_name}")
    print(f"   Model will be saved to: {new_model_repo}")

    # Train
    trainer.train()

    print(f"\n💾 Pushing model to Hugging Face Hub: {new_model_repo}")
    # Push to hub
    trainer.push_to_hub()

    # Calculate time
    end_time = time.time()
    end_datetime = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    duration_seconds = end_time - start_time
    duration_hours = duration_seconds / 3600

    print(f"\n{'='*60}")
    print(f"  ✅ EXPERIMENT COMPLETE: {experiment_name}")
    print(f"{'='*60}")
    print(f"⏱️  End Time: {end_datetime}")
    print(f"⏱️  Duration: {duration_hours:.2f} hours ({duration_seconds/60:.1f} minutes)")
    print(f"🔗 Model Hub: https://huggingface.co/{new_model_repo}")
    print(f"\n📝 LOG THIS IN REPRODUCIBILITY_LOG.md:")
    print(f"   - Experiment: {experiment_name}")
    print(f"   - Start: {start_datetime}")
    print(f"   - End: {end_datetime}")
    print(f"   - GPU Hours: {duration_hours:.2f}")
    print(f"   - Model: {new_model_repo}")
    print(f"{'='*60}\n")

    return trainer, duration_hours

print("✓ Training function defined")

In [None]:
# Set up data paths for experiments (using variables from data preparation cell)
context_only_path = data_context_dir
exp_ans_path = data_exp_ans_dir

print(f"✓ Experiment A data path: {context_only_path}")
print(f"✓ Experiment B data path: {exp_ans_path}")
print(f"✓ Hugging Face username: {HF_USERNAME}")

---
## 🔬 Experiment A: Context-Only

Fine-tune on prompts that include only the conflicting contexts (no explanation requested).

In [None]:
# ============================================================================
# EXPERIMENT A: Context-Only
# ============================================================================
# This experiment trains on questions with only the context (no explanations)

trainer_a, duration_a = train_model(
    experiment_name="Context-Only (Experiment A)",
    data_path=context_only_path,  # Uses environment-aware path from earlier cell
    model_name_suffix="context-only",
    your_hf_username=HF_USERNAME,
    train_file="train_context_only.jsonl",
    valid_file="dev_context_only.jsonl"
)

---
## 🔬 Experiment B: Explain-and-Answer

Fine-tune on prompts that ask the model to explain the conflict AND provide an answer.

In [None]:
# ============================================================================
# EXPERIMENT B: Explain-and-Answer
# ============================================================================
# This experiment trains on questions with explanations before the answer

trainer_b, duration_b = train_model(
    experiment_name="Explain-and-Answer (Experiment B)",
    data_path=exp_ans_path,  # Uses environment-aware path from earlier cell
    model_name_suffix="exp-ans",
    your_hf_username=HF_USERNAME,
    train_file="train_exp_ans.jsonl",
    valid_file="dev_exp_ans.jsonl"
)

---
## 📊 Summary

Total GPU hours and summary of both experiments.

In [None]:
# ============================================================================
# SUMMARY
# ============================================================================

print("\n" + "="*80)
print(" 🎉 ALL EXPERIMENTS COMPLETED!")
print("="*80)
print(f"\n📊 TOTAL GPU HOURS: {duration_a + duration_b:.2f} hours")
print(f"\n   Experiment A (Context-Only): {duration_a:.2f} hours")
print(f"   Experiment B (Explain-and-Answer): {duration_b:.2f} hours")
print(f"\n🔗 MODELS:")
print(f"   - https://huggingface.co/{HF_USERNAME}/flan-t5-context-only")
print(f"   - https://huggingface.co/{HF_USERNAME}/flan-t5-exp-ans")
print("\n" + "="*80)
print("\n📝 NEXT STEPS:")
print("   1. Update REPRODUCIBILITY_LOG.md with the GPU hours above")
print("   2. Run evaluation on test sets")
print("   3. Document any issues encountered")
print("="*80 + "\n")