# HRM Training on Rfam Dataset - Kaggle GPU

This notebook trains the Hierarchical Recurrent Model (HRM) on RNA structure prediction.

**Before running:**
1. Click **Settings** (right sidebar) → **Accelerator** → Select **GPU T4 x2** (free)
2. Upload your `Rfam.csv` to **Input** (or use Kaggle Datasets)
3. Run all cells!

## 1. Setup - Clone Repo and Install Dependencies

In [None]:
# Clone your repository
!git clone https://github.com/alvin-banh/psifold.git
%cd psifold
!git checkout claude/rfam-dataset-01A21JDYSTy1d19n2U5WSUsw

# Check we're in the right place
!ls -la

In [None]:
# Install dependencies
!pip install -q pandas numpy

# Verify PyTorch and GPU
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
print(f"GPU count: {torch.cuda.device_count()}")

## 2. Upload or Link Data

**Option A: Upload Rfam.csv directly**
- Click **Add Data** (right sidebar)
- Upload your `Rfam.csv`
- It will appear in `/kaggle/input/`

**Option B: Use a dataset already attached to the notebook**
- Run the cell below to list `/kaggle/input/` and locate the CSV file

In [None]:
# Check what data is available
import os
print("Available input data:")
!ls -lh /kaggle/input/

# Set the data path (modify if needed)
DATA_PATH = "/kaggle/input/rfam/Rfam.csv"  # Adjust this path based on where your data is

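# If data not found, search /kaggle/input/ and stop at the first CSV file
if not os.path.exists(DATA_PATH):
    print(f"\nData not found at {DATA_PATH}")
    print("Looking for Rfam.csv...")
    found_path = None
    for root, dirs, files in os.walk("/kaggle/input/"):
        for file in files:
            if file.lower().endswith(".csv"):
                found_path = os.path.join(root, file)
                break
        if found_path:
            break  # also leave the outer loop, so later files cannot overwrite the match
    if found_path:
        print(f"Found: {found_path}")
        DATA_PATH = found_path
    else:
        print("No CSV file found under /kaggle/input/. Add the dataset and re-run.")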

## 3. Explore the Data (Optional)

In [None]:
# Quick exploration
!python examples/explore_rfam.py --data_path {DATA_PATH}

## 4. Train the Model

This trains for 10 epochs (roughly 30-40 minutes on a T4 GPU).

In [None]:
# Quick training run (10 epochs)
!python examples/train_rfam.py \
  --data_path {DATA_PATH} \
  --dim 128 \
  --n_epochs 10 \
  --batch_size 32 \
  --max_length 256 \
  --device cuda \
  --output_dir ./outputs/kaggle_run
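
The same script is called again with different flags in section 8, so it can help to keep the flags in one dictionary and build the command from it. The sketch below is only an illustration: `build_train_command` and `QUICK_RUN` are hypothetical names that do not exist in the repository, and the flags are limited to the ones that appear in this notebook.

```python
import shlex

# Flags copied from the quick-run cell above; nothing here is read from the repo.
QUICK_RUN = {
    "dim": 128,
    "n_epochs": 10,
    "batch_size": 32,
    "max_length": 256,
    "device": "cuda",
    "output_dir": "./outputs/kaggle_run",
}

def build_train_command(data_path, flags, script="examples/train_rfam.py"):
    """Return the training command as an argument list (hypothetical helper)."""
    cmd = ["python", script, "--data_path", str(data_path)]
    for name, value in flags.items():
        cmd += [f"--{name}", str(value)]
    return cmd

cmd = build_train_command("/kaggle/input/rfam/Rfam.csv", QUICK_RUN)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems when the data path contains spaces.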

## 5. Check Results

In [None]:
# View training results
import json

with open('./outputs/kaggle_run/results.json', 'r') as f:
    results = json.load(f)

print("=" * 60)
print("Final Test Results")
print("=" * 60)
print(f"Test F1:        {results['test_f1']:.4f}")
print(f"Test Precision: {results['test_precision']:.4f}")
print(f"Test Recall:    {results['test_recall']:.4f}")
print(f"Test MCC:       {results['test_mcc']:.4f}")
print(f"\nBest Epoch:     {results['best_epoch']}")
print(f"Best Val F1:    {results['best_val_f1']:.4f}")
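
For reference, the four reported metrics have standard definitions in terms of true and false positives and negatives. The sketch below computes them from raw counts. It assumes the training script scores predicted base pairs against reference pairs in the usual way, which this notebook does not confirm, and `pair_metrics` is a hypothetical helper rather than a function from the repository.

```python
import math

def pair_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and MCC from confusion counts (standard definitions)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Made-up counts, used only to show the arithmetic
m = pair_metrics(tp=8, fp=2, fn=2, tn=88)
print({name: round(value, 4) for name, value in m.items()})
```

Because unpaired positions vastly outnumber paired ones, the true-negative count is usually large, which is why MCC and F1 can differ noticeably on the same predictions.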

## 6. Visualize Training Progress

In [None]:
# Plot training progress (if you want to add visualization)
import matplotlib.pyplot as plt

# You can add plotting code here to visualize loss curves, F1 over epochs, etc.
print("Model saved at: ./outputs/kaggle_run/best_model.pt")
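
One possible way to fill in the plotting step is sketched below. It assumes `results.json` contains a per-epoch list under a `history` key with entries such as `{"val_f1": ...}`. Both the key and the entry layout are hypothetical, since the notebook only shows summary keys like `test_f1` and `best_epoch`. If no such list exists, the code reports that and plots nothing.

```python
import json
import os

def load_history(results_path, key="history"):
    """Return the per-epoch list stored under `key`, or [] if it is absent.

    ASSUMPTION: `history` is a hypothetical key, not confirmed by the repo.
    """
    if not os.path.exists(results_path):
        return []
    with open(results_path) as f:
        results = json.load(f)
    history = results.get(key, [])
    return history if isinstance(history, list) else []

def plot_history(history, metric="val_f1"):
    """Plot one metric per epoch; return False if there is nothing to plot."""
    values = [h[metric] for h in history if isinstance(h, dict) and metric in h]
    if not values:
        return False
    import matplotlib.pyplot as plt  # imported lazily, only when there is data
    epochs = range(1, len(values) + 1)
    plt.figure(figsize=(6, 4))
    plt.plot(epochs, values, marker="o")
    plt.xlabel("Epoch")
    plt.ylabel(metric)
    plt.title(f"{metric} per epoch")
    plt.grid(True)
    plt.show()
    return True

history = load_history("./outputs/kaggle_run/results.json")
if not plot_history(history):
    print("No per-epoch history found in results.json; nothing to plot.")
```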

## 7. Download Trained Model (Optional)

In [None]:
# Create a zip file of outputs
!zip -r outputs.zip ./outputs/kaggle_run/

# Download via Kaggle Output
# The outputs.zip will be available in the Output tab after the notebook finishes
print("Output saved! Check the Output tab on the right to download outputs.zip")
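
If the `zip` command is not available in the session, the same archive can be built with Python's standard library. The sketch below uses `shutil.make_archive`; `archive_outputs` is a hypothetical helper name, and the run directory is the one used in this notebook.

```python
import os
import shutil

def archive_outputs(run_dir, archive_name="outputs"):
    """Zip `run_dir` into `<archive_name>.zip` and return the archive path."""
    if not os.path.isdir(run_dir):
        raise FileNotFoundError(f"Run directory not found: {run_dir}")
    parent, folder = os.path.split(os.path.abspath(run_dir))
    # root_dir/base_dir keep the folder name as the top-level entry in the zip
    return shutil.make_archive(archive_name, "zip", root_dir=parent, base_dir=folder)

if os.path.isdir("./outputs/kaggle_run"):
    print("Wrote", archive_outputs("./outputs/kaggle_run"))
else:
    print("No run directory yet; train the model first.")
```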

## 8. Longer Training (Optional - Run in New Session)

For better results, train longer:

In [None]:
# Uncomment to run a longer training session (50 epochs, ~3 hours)
# !python examples/train_rfam.py \
#   --data_path {DATA_PATH} \
#   --dim 256 \
#   --n_heads 8 \
#   --n_cycles 3 \
#   --cycle_steps 3 \
#   --n_epochs 50 \
#   --batch_size 64 \
#   --max_length 256 \
#   --device cuda \
#   --output_dir ./outputs/kaggle_full

---

## Next Steps

1. **Download the trained model** from the Output tab
2. **Try different hyperparameters** (dim, n_cycles, batch_size)
3. **Test on RNAsolo** dataset for validation
4. **Compare with baselines** (RNAfold, UFold)

**Expected Performance:**
- Quick run (10 epochs): F1 ~ 0.70-0.75
- Full run (50 epochs): F1 ~ 0.80-0.85
- Target: F1 > 0.85 (competitive)

🧬 Happy training!