# Protein Task Vectors — Phase 1 Training (Colab)

Run on a free T4 GPU. Train one property at a time.

**Before starting:** Runtime → Change runtime type → **T4 GPU**

## Step 1: Mount Google Drive (for persistent storage)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Create persistent directory on Google Drive
!mkdir -p /content/drive/MyDrive/protein-task-vectors/checkpoints
!mkdir -p /content/drive/MyDrive/protein-task-vectors/zero_shot
!mkdir -p /content/drive/MyDrive/protein-task-vectors/phase1_metrics
!mkdir -p /content/drive/MyDrive/protein-task-vectors/task_vectors
print('Google Drive mounted. Checkpoints will persist between sessions.')

## Step 2: Clone repo and install dependencies

In [None]:
# CHANGE THIS to your GitHub repo URL
REPO_URL = "https://github.com/YOUR_USERNAME/task-arithmetic.git"

import os
if os.path.exists('/content/task-arithmetic'):
    %cd /content/task-arithmetic
    !git pull
else:
    !git clone {REPO_URL} /content/task-arithmetic
    %cd /content/task-arithmetic

!pip install -e . -q
print('\nDependencies installed.')

In [None]:
# Install MMseqs2
!cd /tmp && wget -q https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz && tar xzf mmseqs-linux-avx2.tar.gz && cp mmseqs/bin/mmseqs /usr/local/bin/
!mmseqs version

In [None]:
# Verify GPU
import torch
print(f'GPU: {torch.cuda.get_device_name(0)}')
print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
print(f'bfloat16: {torch.cuda.is_bf16_supported()}')

In [None]:
# Create merged config with T4-safe settings
# (base config assumes A100 80GB; T4 has only 16GB)
import yaml

with open('configs/train_config.yaml') as f:
    config = yaml.safe_load(f)

with open('configs/colab_overrides.yaml') as f:
    overrides = yaml.safe_load(f)

# Merge overrides into config, one level deep (key by key within each top-level section)
for section, values in overrides.items():
    if section in config and isinstance(config[section], dict):
        config[section].update(values)
    else:
        config[section] = values

# Write merged config
with open('configs/train_config_colab.yaml', 'w') as f:
    yaml.dump(config, f, default_flow_style=False, sort_keys=False)

print('Created configs/train_config_colab.yaml with T4-safe settings:')
print(f'  mixed_precision: {config["training"]["mixed_precision"]}')
print(f'  batch_size: {config["training"]["batch_size"]}')
print(f'  list_size: {config["training"]["list_size"]}')
print(f'  grad_accum: {config["training"]["gradient_accumulation_steps"]}')
print(f'  eval_batch_size: {config["evaluation"]["eval_batch_size"]}')
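
The merge loop above updates each top-level section with `dict.update`, so it only goes one level deep: a nested dict in the overrides replaces the whole nested dict in the base config. If the override file ever nests further, a recursive merge is one option. This is a minimal sketch; `deep_merge` and the example keys and values are hypothetical and not taken from the repo's configs.

```python
import copy


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a new dict with `overrides` merged recursively into `base`."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge key by key instead of replacing
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys: the override wins
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical values, for illustration only
base = {"training": {"batch_size": 32, "optimizer": {"name": "adamw", "lr": 1e-4}}}
overrides = {"training": {"batch_size": 4, "optimizer": {"lr": 5e-5}}}
merged = deep_merge(base, overrides)
print(merged)
```

With a one-level `update`, `optimizer` would lose its `name` key in this example; the recursive version keeps it and only changes `lr`.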

## Step 3: Symlink results to Google Drive

This way checkpoints survive Colab disconnects.

In [None]:
import os
import shutil

DRIVE_DIR = '/content/drive/MyDrive/protein-task-vectors'
REPO_DIR = '/content/task-arithmetic'

# Symlink results subdirs to Google Drive
for subdir in ['checkpoints', 'zero_shot', 'phase1_metrics', 'task_vectors']:
    local = os.path.join(REPO_DIR, 'results', subdir)
    remote = os.path.join(DRIVE_DIR, subdir)
    if os.path.islink(local):
        print(f'  {subdir}: already symlinked')
    else:
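        # Make sure both ends of the link exist before touching anything
        os.makedirs(remote, exist_ok=True)
        os.makedirs(os.path.dirname(local), exist_ok=True)
        if os.path.isdir(local):
            # Copy any existing files to Drive first, then remove the local directory
            for name in os.listdir(local):
                src = os.path.join(local, name)
                dst = os.path.join(remote, name)
                if not os.path.exists(dst):
                    if os.path.isfile(src):
                        shutil.copy2(src, dst)
                    else:
                        shutil.copytree(src, dst)
            shutil.rmtree(local)
        os.symlink(remote, local)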
        print(f'  {subdir}: symlinked to Drive')

print('\nResults will be saved to Google Drive automatically.')
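
Before starting a long run, it can be worth confirming that each `results/` subdirectory really resolves to its Drive folder. The check below is a sketch under the same directory layout as the cell above; `check_symlinks` is a hypothetical helper, not part of the repo.

```python
import os


def check_symlinks(repo_dir: str, drive_dir: str, subdirs: list[str]) -> dict[str, bool]:
    """Map each subdir to True if results/<subdir> is a symlink to <drive_dir>/<subdir>."""
    status = {}
    for subdir in subdirs:
        local = os.path.join(repo_dir, "results", subdir)
        remote = os.path.join(drive_dir, subdir)
        status[subdir] = (
            os.path.islink(local)
            and os.path.isdir(remote)
            and os.path.realpath(local) == os.path.realpath(remote)
        )
    return status
```

In the notebook this could be called as `check_symlinks(REPO_DIR, DRIVE_DIR, ['checkpoints', 'zero_shot', 'phase1_metrics', 'task_vectors'])`; any `False` entry means that subdirectory is still local and would be lost on disconnect.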

## Step 4: Download data

Downloads ProteinGym (~500MB). Only runs once — skips if already downloaded.

In [None]:
!python -m src.data.download --config configs/train_config.yaml

## Step 5: Categorize and split (if not already done)

In [None]:
import os

if not os.path.exists('data/processed/category_assignments.json'):
    !python -m src.data.categorize --config configs/train_config.yaml
else:
    print('Already categorized.')

if not os.path.exists('data/splits/train_assays.json'):
    !python -m src.data.splits --config configs/train_config.yaml
else:
    print('Splits already created.')

## Step 6: Zero-shot baseline

Scores every assay with the ESM-2 masked-marginal likelihood.
Expect roughly 2-4 hours for all 217 assays on a T4. Assays that are already scored are skipped, so the cell can be rerun after a disconnect.

In [None]:
!python scripts/04_zero_shot.py --config configs/train_config_colab.yaml
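
For orientation, masked-marginal scoring sums, over the mutated positions, the difference between the log-probability of the mutant residue and that of the wild-type residue, each taken with that position masked in the wild-type sequence. The sketch below shows only this arithmetic on precomputed log-probabilities. It makes several assumptions: mutants are written like `K2R:A3G` (1-based positions, colon-separated), `log_probs[pos]` is a dict from residue to log-probability at 0-based position `pos`, and the function names are hypothetical. How `scripts/04_zero_shot.py` actually does this may differ.

```python
import math


def parse_mutant(mutant: str) -> list[tuple[str, int, str]]:
    """Parse 'K2R:A3G' into [('K', 1, 'R'), ('A', 2, 'G')] with 0-based positions."""
    parsed = []
    for token in mutant.split(":"):
        wt, pos, mt = token[0], int(token[1:-1]) - 1, token[-1]
        parsed.append((wt, pos, mt))
    return parsed


def masked_marginal_score(mutant: str, wt_seq: str, log_probs: dict) -> float:
    """Sum of log p(mutant residue) - log p(wild-type residue) over mutated positions."""
    score = 0.0
    for wt, pos, mt in parse_mutant(mutant):
        if wt_seq[pos] != wt:
            raise ValueError(f"Expected {wt} at position {pos + 1}, found {wt_seq[pos]}")
        score += log_probs[pos][mt] - log_probs[pos][wt]
    return score


# Toy log-probabilities (made up for illustration, not model output)
wt_seq = "MKA"
log_probs = {
    1: {"K": math.log(0.5), "R": math.log(0.25)},
    2: {"A": math.log(0.5), "G": math.log(0.5)},
}
print(masked_marginal_score("K2R", wt_seq, log_probs))
```

A negative score means the model finds the mutant residue less likely than the wild type at that position.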

## Step 7: Train property models

Train ONE property per Colab session.
Change `PROPERTY` below and run a new session for each.

Order: stability → binding → expression → activity

Each takes ~2-4 hours on T4.

In [None]:
#########################################
# CHANGE THIS for each training session #
#########################################
PROPERTY = "stability"  # stability | binding | expression | activity

In [None]:
!python scripts/05_train_property_models.py \
    --config configs/train_config_colab.yaml \
    --property {PROPERTY} \
    --resume

## Step 8: Evaluate (after all 4 properties are trained)

In [None]:
# Per-property evaluation
for prop in ['stability', 'binding', 'expression', 'activity']:
    print(f'\n=== Evaluating {prop} ===')
    !python scripts/06_evaluate.py --config configs/train_config_colab.yaml --property {prop}

In [None]:
# Cross-property matrix (THE key result)
!python scripts/06_evaluate.py --config configs/train_config_colab.yaml --cross-property

In [None]:
# View the result
import pandas as pd
matrix = pd.read_csv('results/phase1_metrics/cross_property_matrix.csv', index_col=0)
print('Cross-Property Evaluation Matrix (Spearman correlation)')
print(matrix.to_string())
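
As a reminder of what each cell in this matrix measures, Spearman correlation is the Pearson correlation of the ranks of two score lists, so it rewards getting the ordering of variants right rather than matching the values. The following is a dependency-free sketch for intuition only; the evaluation script presumably uses a library routine, and the function names here are hypothetical.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (sx * sy)
```

Any monotonic relationship gives a value of 1, for example `spearman([1, 2, 3], [1, 4, 9])`, and a fully reversed ordering gives -1.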

## Step 9: Extract task vectors

In [None]:
!python scripts/07_extract_vectors.py --config configs/train_config_colab.yaml

In [None]:
# View cosine similarity between task vectors
import pandas as pd
sim = pd.read_csv('results/task_vectors/cosine_similarity_matrix.csv', index_col=0)
print('Task Vector Cosine Similarity')
print(sim.to_string())
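
In task arithmetic, a task vector is the element-wise difference between fine-tuned and base weights, and the cosine similarity is computed over all parameters flattened into one long vector. The sketch below illustrates this with plain Python lists standing in for weight tensors; the function names and toy weights are hypothetical, and `scripts/07_extract_vectors.py` may restrict or weight parameters differently.

```python
import math


def task_vector(base: dict, finetuned: dict) -> dict:
    """Per-parameter difference: finetuned weights minus base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine similarity over all parameters, treated as one flattened vector."""
    dot = sum(a * b for name in u for a, b in zip(u[name], v[name]))
    norm_u = math.sqrt(sum(a * a for name in u for a in u[name]))
    norm_v = math.sqrt(sum(b * b for name in v for b in v[name]))
    return dot / (norm_u * norm_v)


# Toy weights (made up for illustration)
base = {"layer.weight": [0.0, 0.0], "layer.bias": [1.0]}
stability = {"layer.weight": [1.0, 0.0], "layer.bias": [1.0]}
binding = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}

tv_stability = task_vector(base, stability)
tv_binding = task_vector(base, binding)
print(cosine_similarity(tv_stability, tv_binding))
```

Values near 0 mean two properties moved the weights in nearly orthogonal directions; values near 1 mean they moved them in nearly the same direction.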

## Done!

All results are saved to your Google Drive at:
- `My Drive/protein-task-vectors/checkpoints/`: trained models
- `My Drive/protein-task-vectors/zero_shot/`: zero-shot baseline scores
- `My Drive/protein-task-vectors/phase1_metrics/`: evaluation results
- `My Drive/protein-task-vectors/task_vectors/`: extracted vectors

You can close this notebook. Everything persists on Drive.