# LLM-PEFT-PPM Replication Study - Setup and Environment

## Setup Instructions for TU/e HPC Jupyter Environment

This notebook handles the initial setup and environment configuration for replicating the LLM-PEFT-PPM experiments.

### Prerequisites
- Jupyter launched with GPU access on TU/e HPC
- HuggingFace account and token
- Access to the original repository: https://github.com/raseidi/llm-peft-ppm

---

## 1. Environment Check and System Information

In [None]:
import os
import sys
import platform
import subprocess
import torch
import numpy as np
import pandas as pd
from pathlib import Path

print("=== System Information ===")
print(f"Python version: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"Current working directory: {os.getcwd()}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Check if running on SLURM
if 'SLURM_JOB_ID' in os.environ:
    print("\n=== SLURM Information ===")
    print(f"Job ID: {os.environ['SLURM_JOB_ID']}")
    print(f"Node: {os.environ.get('SLURMD_NODENAME', 'Unknown')}")
    print(f"Partition: {os.environ.get('SLURM_JOB_PARTITION', 'Unknown')}")
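
The SLURM check above can be wrapped in a small helper so that later notebooks log the same fields. This is a minimal sketch: `slurm_info` is a hypothetical helper name, not part of the original repository, and it reads only the environment variables already used above.

```python
import os


def slurm_info(env=None):
    """Return SLURM job details from an environment mapping, or {} outside a job.

    `slurm_info` is a hypothetical helper for this notebook.
    """
    env = os.environ if env is None else env
    if "SLURM_JOB_ID" not in env:
        return {}
    return {
        "job_id": env["SLURM_JOB_ID"],
        "node": env.get("SLURMD_NODENAME", "Unknown"),
        "partition": env.get("SLURM_JOB_PARTITION", "Unknown"),
    }


print(slurm_info() or "Not running inside a SLURM job")
```

Passing the mapping explicitly keeps the helper easy to check without an actual SLURM allocation.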

## 2. Repository Setup

In [None]:
# Define project structure
PROJECT_ROOT = Path.cwd() / "llm-peft-ppm-replication"
ORIGINAL_REPO = "https://github.com/raseidi/llm-peft-ppm.git"

print(f"Project root: {PROJECT_ROOT}")

# Clone repository if it doesn't exist
if not PROJECT_ROOT.exists():
    print("Cloning original repository...")
    subprocess.run(["git", "clone", ORIGINAL_REPO, str(PROJECT_ROOT)], check=True)
    print("Repository cloned successfully!")
else:
    print("Repository already exists")

# Change to project directory
os.chdir(PROJECT_ROOT)
print(f"Changed to directory: {os.getcwd()}")

# List repository contents
print("\nRepository structure:")
for item in sorted(PROJECT_ROOT.glob("*")):
    print(f"  {'📁' if item.is_dir() else '📄'} {item.name}")
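
The existence check above treats any directory with the expected name as a finished clone. For a replication study it may also help to confirm that the directory is a git checkout and to record which commit is checked out. The following is a sketch under those assumptions: `is_git_checkout` and `current_commit` are hypothetical helpers, and the example inspects the current working directory rather than `PROJECT_ROOT` so that it runs on its own.

```python
import subprocess
from pathlib import Path


def is_git_checkout(path):
    """Return True if `path` contains a .git entry (top level of a working tree)."""
    return (Path(path) / ".git").exists()


def current_commit(path):
    """Return the checked-out commit hash, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "-C", str(path), "rev-parse", "HEAD"],
            check=True,
            capture_output=True,
            text=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a repository, or git is not installed
        return None
    return result.stdout.strip()


repo = Path.cwd()
if is_git_checkout(repo):
    print(f"Commit: {current_commit(repo)}")
else:
    print(f"{repo} is not a git checkout")
```

Storing the commit hash next to the experiment results makes it easier to tell later which version of the original code produced them.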

## 3. Dependencies Installation

In [None]:
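# Install the uv package manager if it is not already available
try:
    subprocess.run(["uv", "--version"], check=True, capture_output=True)
    print("uv is already installed")
except (subprocess.CalledProcessError, FileNotFoundError):
    print("Installing uv...")
    # The installer script must be piped into a shell. With shell=True the command
    # has to be a single string; a list would run only its first element ("curl").
    subprocess.run(
        "curl -LsSf https://astral.sh/uv/install.sh | sh",
        shell=True,
        check=True,
    )

    # Add uv to PATH for the current session (the installer typically uses ~/.local/bin)
    uv_path = Path.home() / ".local" / "bin"
    if str(uv_path) not in os.environ.get("PATH", "").split(os.pathsep):
        os.environ["PATH"] = f"{uv_path}{os.pathsep}{os.environ.get('PATH', '')}"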
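
After the installation step it can be useful to confirm that `uv` actually resolves on `PATH` in the current session. This is a minimal sketch, assuming the installer placed the binary in `~/.local/bin`; `prepend_to_path` is a hypothetical helper, not something provided by uv.

```python
import os
import shutil
from pathlib import Path


def prepend_to_path(directory, env=None):
    """Prepend `directory` to PATH in `env` unless it is already listed; return the new PATH."""
    env = os.environ if env is None else env
    entries = [e for e in env.get("PATH", "").split(os.pathsep) if e]
    if str(directory) not in entries:
        entries.insert(0, str(directory))
    env["PATH"] = os.pathsep.join(entries)
    return env["PATH"]


# Assumed install location; adjust if the installer reported a different directory
prepend_to_path(Path.home() / ".local" / "bin")
print(f"uv resolves to: {shutil.which('uv')}")
```

`shutil.which` returns `None` when the executable cannot be found, which is a clearer signal than a failing `subprocess` call later on.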

In [None]:
# Create virtual environment and install dependencies
venv_path = PROJECT_ROOT / ".venv"

if not venv_path.exists():
    print("Creating virtual environment...")
    subprocess.run(["uv", "venv", ".venv", "--python", "3.12"], check=True)
    print("Virtual environment created!")

# Install requirements
print("Installing dependencies...")
subprocess.run(["uv", "pip", "install", "-r", "requirements.txt"], check=True)

# Install additional packages mentioned in methodology
additional_packages = [
    "pyyaml",
    "jupyter",
    "matplotlib",
    "seaborn",
    "plotly",
    "tqdm",
    "scikit-learn"
]

for package in additional_packages:
    try:
        subprocess.run(["uv", "pip", "install", package], check=True)
        print(f"✅ Installed {package}")
    except subprocess.CalledProcessError:
        print(f"❌ Failed to install {package}")

print("\nDependency installation completed!")
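
One caveat: packages that `uv` installs into `.venv` are importable from this notebook only if the Jupyter kernel itself runs from that environment. The sketch below checks this by comparing the interpreter prefix with the environment path. `kernel_uses_venv` is a hypothetical helper, and the example assumes the environment lives in `.venv` under the current working directory.

```python
import sys
from pathlib import Path


def kernel_uses_venv(venv_path):
    """Return True if the running interpreter's prefix is `venv_path`."""
    return Path(sys.prefix).resolve() == Path(venv_path).resolve()


venv = Path.cwd() / ".venv"  # assumed location of the project environment
if kernel_uses_venv(venv):
    print("Kernel runs inside the project environment")
else:
    print(f"Kernel interpreter: {sys.executable}")
    print(f"Packages installed into {venv} may not be importable from this kernel")
```

If the check fails, scripts can still be launched with the environment's own interpreter through `subprocess`, while imports inside the notebook continue to use the kernel's packages.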

## 4. HuggingFace Configuration

In [None]:
# HuggingFace token setup
import getpass

env_file = PROJECT_ROOT / ".env"

# Load the token from .env if it is not already set in the environment
if "HF_TOKEN" not in os.environ and env_file.exists():
    with open(env_file, "r") as f:
        for line in f:
            if line.startswith("HF_TOKEN="):
                os.environ["HF_TOKEN"] = line.split("=", 1)[1].strip()
                break

if os.environ.get("HF_TOKEN"):
    print("✅ HuggingFace token already configured")
else:
    print("HuggingFace token not found. Please provide your token:")
    print("You can get a token from: https://huggingface.co/docs/hub/en/security-tokens")

    hf_token = getpass.getpass("Enter your HuggingFace token: ").strip()

    # Append to the .env file and restrict it to the current user, since it holds a secret
    with open(env_file, "a") as f:
        f.write(f"HF_TOKEN={hf_token}\n")
    os.chmod(env_file, 0o600)

    print("✅ HuggingFace token saved to .env file")

    # Set environment variable for current session
    os.environ["HF_TOKEN"] = hf_token
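
The `.env` handling above reads and writes simple `KEY=VALUE` lines. A slightly more tolerant reader, which skips comments and strips surrounding quotes, could look like the sketch below. `read_env_value` and `write_secret` are hypothetical helpers, the token shown is a placeholder, and the example uses a temporary directory instead of the project's `.env`.

```python
import os
import tempfile
from pathlib import Path


def read_env_value(env_file, key):
    """Return the value for `key` in a KEY=VALUE file, or None if it is absent."""
    path = Path(env_file)
    if not path.exists():
        return None
    for raw in path.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip() == key:
            return value.strip().strip("'\"")
    return None


def write_secret(env_file, key, value):
    """Write KEY=VALUE to `env_file` (overwriting it) and restrict it to its owner."""
    path = Path(env_file)
    path.write_text(f"{key}={value}\n")
    os.chmod(path, 0o600)  # owner read/write only; limited effect on Windows


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / ".env"
    write_secret(demo, "HF_TOKEN", "hf_example_not_a_real_token")
    print(read_env_value(demo, "HF_TOKEN"))
```

Because the real `.env` sits inside the cloned repository, it is worth confirming that the file is ignored by git before committing anything from that directory.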

In [None]:
# Test HuggingFace connection
try:
    from transformers import AutoTokenizer

    # Test with a small public model. This confirms that the Hub is reachable;
    # it does not confirm that the token grants access to gated models.
    print("Testing HuggingFace connection...")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
    print("✅ HuggingFace connection successful!")

except Exception as e:
    print(f"❌ HuggingFace connection failed: {e}")
    print("Please check your token and internet connection")

## 5. Create Project Structure

In [None]:
# Create additional directories for replication study
directories = [
    "replication_results",
    "replication_results/experiments",
    "replication_results/logs",
    "replication_results/models",
    "replication_results/plots",
    "replication_notebooks",
    "data_analysis",
    "traffic_fines_data"  # For the additional dataset
]

for directory in directories:
    dir_path = PROJECT_ROOT / directory
    dir_path.mkdir(parents=True, exist_ok=True)  # parents=True so nested paths do not depend on list order
    print(f"📁 Created/verified: {directory}")

print("\n✅ Project structure setup completed!")

## 6. Quick Test of Main Components

In [None]:
# Test import of main script
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

try:
    print("Testing imports...")

    # Execute everything above the `if __name__` guard (imports and definitions)
    # without starting a training run
    script_source = (PROJECT_ROOT / "next_event_prediction.py").read_text()
    exec(script_source.split("if __name__")[0])
    print("✅ Main script imports successful")

except Exception as e:
    print(f"⚠️ Import test failed: {e}")
    print("This is normal if dependencies aren't fully compatible yet")
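
Executing the script body is a coarse test: it stops at the first failing import and runs any module-level side effects. An alternative that does not execute the script is to parse it and check whether each imported package can be located. This sketch shows the idea on an inline example string; `missing_imports` is a hypothetical helper and `not_a_real_module_xyz` is a deliberately nonexistent module name.

```python
import ast
import importlib.util


def missing_imports(source):
    """Return the top-level package names imported anywhere in `source` that cannot be found."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 excludes relative imports such as `from . import x`
            modules.add(node.module.split(".")[0])
    return sorted(m for m in modules if importlib.util.find_spec(m) is None)


example = "import os\nfrom json import dumps\nimport not_a_real_module_xyz\n"
print(missing_imports(example))  # ['not_a_real_module_xyz']
```

To apply it to the main script, the file contents can be passed in place of `example`. Note that this checks against the interpreter running the notebook, not against `.venv`.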

## 7. Dataset Verification

In [None]:
# Check if datasets exist or need to be downloaded
data_dir = PROJECT_ROOT / "data"

expected_datasets = [
    "BPI12",
    "BPI17",
    "BPI20RfP",
    "BPI20PTC",
    "BPI20PD",
    "BPI_Traffic_Fines",
]

print("Checking dataset availability...")
for dataset in expected_datasets:
    dataset_path = data_dir / dataset
    if dataset_path.exists():
        print(f"✅ {dataset}: Found")
    else:
        print(f"📥 {dataset}: Will be downloaded automatically on first run")

print("\nNote: Datasets will be automatically downloaded via SkPM when experiments run")

## 8. Configuration Summary

In [None]:
# Create configuration summary
config_summary = {
    "environment": {
        "python_version": sys.version,
        "pytorch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "gpu_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None",
        "project_root": str(PROJECT_ROOT)
    },
    "datasets": expected_datasets,
    "experiment_parameters": {
        "models": ["qwen25-05b", "llama32-1b", "pm-gpt2"],
        "peft_techniques": ["lora", "layer_freezing"],
        "baseline": "LSTM",
        "epochs_llm": 10,
        "epochs_rnn": 25,
        "lora_r": 256,
        "lora_alpha": 512
    }
}

# Save configuration
import json
config_file = PROJECT_ROOT / "replication_results" / "experiment_config.json"
with open(config_file, "w") as f:
    json.dump(config_summary, f, indent=2, default=str)

print("=== Experiment Configuration ===")
print(json.dumps(config_summary, indent=2, default=str))
print(f"\nConfiguration saved to: {config_file}")
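
For reference, the `lora_r` and `lora_alpha` values recorded above imply a scaling factor of `lora_alpha / lora_r`, the factor by which standard LoRA multiplies the low-rank update. The sketch below reloads a configuration and derives that value. The inline dictionary stands in for `experiment_config.json`, and `lora_scaling` is a hypothetical helper. Whether the original training script applies exactly this scaling depends on its PEFT settings, which this notebook does not show.

```python
import json


def lora_scaling(experiment_parameters):
    """Return lora_alpha / lora_r, the scaling of the low-rank update in standard LoRA."""
    return experiment_parameters["lora_alpha"] / experiment_parameters["lora_r"]


# Stand-in for the contents of experiment_config.json
saved = json.dumps({"experiment_parameters": {"lora_r": 256, "lora_alpha": 512}})
config = json.loads(saved)
print(lora_scaling(config["experiment_parameters"]))  # 2.0
```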

## 9. Next Steps

**Setup Complete!** 
<!-- 
### Ready to proceed with:

1. **Data Exploration** (`02_data_exploration.ipynb`)
   - Analyze the 5 original datasets
   - Prepare Traffic Fines dataset
   - Preprocessing and feature engineering

2. **Baseline Experiments** (`03_baseline_experiments.ipynb`)
   - RNN baseline with hyperparameter search
   - Traditional approaches comparison

3. **LLM-PEFT Experiments** (`04_llm_peft_experiments.ipynb`)
   - LoRA fine-tuning experiments
   - Layer freezing strategies
   - Multi-task vs single-task learning

4. **Results Analysis** (`05_results_analysis.ipynb`)
   - Performance comparison
   - Statistical significance testing
   - Visualization of results

5. **Competitor Baselines** (`06_competitor_baselines.ipynb`)
   - S-NAP narrative approach
   - Transfer learning baseline

 -->
