# ProteinMCP: Fitness Modeling Workflow

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/charlesxu90/ProteinMCP/blob/main/notebooks/ProteinMCP_Fitness_Modeling.ipynb)

Build and compare protein fitness prediction models using multiple backbone architectures:

| Model | Description |
|-------|-------------|
| **EV+OneHot** | Evolutionary couplings + one-hot encoding (PLMC) |
| **ESM2-650M / ESM2-3B** | Meta's protein language models |
| **ProtT5-XL / ProtAlbert** | ProtTrans transformer embeddings |

Each embedding backbone (ESM2, ProtTrans) is trained with SVR, XGBoost, and KNN heads; EV+OneHot is recorded with a ridge head. All models are then compared by 5-fold cross-validated Spearman correlation.

**Links:** [GitHub](https://github.com/charlesxu90/ProteinMCP) · [ESM](https://github.com/facebookresearch/esm) · [ProtTrans](https://github.com/agemagician/ProtTrans) · [PLMC](https://github.com/debbiemarkslab/plmc)

---

In [None]:
#@title 📋 User Configuration
#@markdown ### Protein Settings
PROTEIN_NAME = "TEVp_S219V" #@param {type:"string"}
use_example_data = True #@param {type:"boolean"}

#@markdown ### API Key (required)
ANTHROPIC_API_KEY = "" #@param {type:"string"}

#@markdown ---
#@markdown If `use_example_data` is **False**, upload your own `wt.fasta` and `data.csv` below.

import os

# ---------- Validate API key ----------
if not ANTHROPIC_API_KEY:
    raise ValueError("ANTHROPIC_API_KEY is required. Get one at https://console.anthropic.com/")
os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_API_KEY

# ---------- Paths ----------
REPO_DIR     = "/content/ProteinMCP"
DATA_DIR     = f"/content/data/{PROTEIN_NAME}"
RESULTS_DIR  = f"/content/results/{PROTEIN_NAME}"
WT_FASTA     = f"{DATA_DIR}/wt.fasta"
DATA_CSV     = f"{DATA_DIR}/data.csv"

os.makedirs(DATA_DIR, exist_ok=True)

if use_example_data:
    print(f"Will use bundled example data for {PROTEIN_NAME}")
else:
    from google.colab import files
    print("Upload wt.fasta and data.csv (must contain 'seq' and 'log_fitness' columns):")
    uploaded = files.upload()
    for fname, content in uploaded.items():
        with open(os.path.join(DATA_DIR, fname), "wb") as f:
            f.write(content)
    assert os.path.exists(WT_FASTA), f"Missing {WT_FASTA}: please upload wt.fasta"
    assert os.path.exists(DATA_CSV),  f"Missing {DATA_CSV}: please upload data.csv"

print(f"\nPROTEIN_NAME : {PROTEIN_NAME}")
print(f"DATA_DIR     : {DATA_DIR}")
print(f"RESULTS_DIR  : {RESULTS_DIR}")
print(f"WT_FASTA     : {WT_FASTA}")
print(f"DATA_CSV     : {DATA_CSV}")

In [None]:
#@title 🐍 Install Conda / Mamba
%%time
import os

CONDA_READY = "/content/.conda_ready"

if os.path.exists(CONDA_READY):
    print("Conda already installed; skipping.")
else:
    # Install Miniforge (same pattern as ColabFold)
    os.system("wget -qnc https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh")
    os.system("bash Miniforge3-Linux-x86_64.sh -bfp /usr/local")
    os.system("rm -f Miniforge3-Linux-x86_64.sh")

    # Make conda available in this session
    os.environ["PATH"] = "/usr/local/bin:" + os.environ["PATH"]
    os.environ["CONDA_PREFIX"] = "/usr/local"

    # Create protein-mcp conda env (in-place, reuse base)
    os.system("mamba install -y -n base python=3.12 pip nodejs=20")

    open(CONDA_READY, "w").close()
    print("Conda/Mamba installed successfully.")

# Ensure PATH is set for subsequent cells
os.environ["PATH"] = "/usr/local/bin:" + os.environ["PATH"]

In [None]:
#@title 📦 Install ProteinMCP & Claude Code
%%time
import os

PROTEINMCP_READY = "/content/.proteinmcp_ready"
REPO_DIR = "/content/ProteinMCP"

if os.path.exists(PROTEINMCP_READY):
    print("ProteinMCP & Claude Code already installed; skipping.")
else:
    # Clone repo
    if not os.path.isdir(REPO_DIR):
        os.system("git clone https://github.com/charlesxu90/ProteinMCP.git /content/ProteinMCP")

    # Install ProteinMCP as editable package
    os.system(f"pip install -e {REPO_DIR}")
    os.system(f"pip install -r {REPO_DIR}/requirements.txt")

    # Install Claude Code
    os.system("npm install -g @anthropic-ai/claude-code")

    open(PROTEINMCP_READY, "w").close()
    print("ProteinMCP & Claude Code installed successfully.")

# Set API key for Claude Code
os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_API_KEY
print(f"ANTHROPIC_API_KEY set (ends with ...{ANTHROPIC_API_KEY[-4:]})")

In [None]:
#@title 🔧 Install Fitness Modeling Skill & MCPs
%%time
import os

SKILL_READY = "/content/.skill_ready"
REPO_DIR = "/content/ProteinMCP"

if os.path.exists(SKILL_READY):
    print("Fitness modeling skill already installed; skipping.")
else:
    # Install the fitness_modeling skill (installs msa_mcp, plmc_mcp, ev_onehot_mcp, esm_mcp, prottrans_mcp)
    os.system(f"cd {REPO_DIR} && pskill install fitness_modeling")

    open(SKILL_READY, "w").close()
    print("Fitness modeling skill installed.")

# Verify MCP status
os.system(f"cd {REPO_DIR} && pmcp status")

In [None]:
#@title Step 0: Set Up Results Directory
import os, shutil

REPO_DIR    = "/content/ProteinMCP"
EXAMPLE_DIR = f"{REPO_DIR}/examples/case1_fitness_modeling"

os.makedirs(RESULTS_DIR, exist_ok=True)

if use_example_data:
    # Copy bundled example data into DATA_DIR
    for fname in ["wt.fasta", "data.csv"]:
        src = os.path.join(EXAMPLE_DIR, fname)
        dst = os.path.join(DATA_DIR, fname)
        if not os.path.exists(dst):
            shutil.copy2(src, dst)
            print(f"Copied {src} → {dst}")

# Copy input files to RESULTS_DIR (needed by training tools)
for fname in ["wt.fasta", "data.csv"]:
    src = os.path.join(DATA_DIR, fname)
    dst = os.path.join(RESULTS_DIR, fname)
    if not os.path.exists(dst):
        shutil.copy2(src, dst)
        print(f"Copied {src} → {dst}")

# Verify
assert os.path.exists(f"{RESULTS_DIR}/wt.fasta"), "wt.fasta missing in RESULTS_DIR"
assert os.path.exists(f"{RESULTS_DIR}/data.csv"),  "data.csv missing in RESULTS_DIR"
print(f"\nResults directory ready: {RESULTS_DIR}")
print(f"Files: {os.listdir(RESULTS_DIR)}")

In [None]:
#@title Step 1: Generate MSA
%%time
import os

REPO_DIR = "/content/ProteinMCP"

prompt = f"""\
Can you obtain the MSA for {PROTEIN_NAME} from {WT_FASTA} using msa mcp \
and save it to {RESULTS_DIR}/{PROTEIN_NAME}.a3m.
Please convert the relative path to absolute path before calling the MCP servers.
"""

cmd = (
    f'cd {REPO_DIR} && claude -p "{prompt}" '
    f'--allowedTools "mcp__msa_mcp__generate_msa,Bash,Read,Write"'
)
os.system(cmd)

# Verify output
msa_file = f"{RESULTS_DIR}/{PROTEIN_NAME}.a3m"
assert os.path.exists(msa_file), f"MSA file not found: {msa_file}"
print(f"\nMSA generated: {msa_file}")

In [None]:
#@title Step 2: Build PLMC Model
%%time
import os

REPO_DIR = "/content/ProteinMCP"

prompt = f"""\
I have created an a3m file in {RESULTS_DIR}/{PROTEIN_NAME}.a3m. \
Can you help build an EV model using plmc mcp and save it to {RESULTS_DIR}/plmc directory. \
The wild-type sequence is {WT_FASTA}.
Please convert the relative path to absolute path before calling the MCP servers.
After building the model, create symlinks in {RESULTS_DIR}/plmc/:
  ln -sf {PROTEIN_NAME}.model_params uniref100.model_params
  ln -sf {PROTEIN_NAME}.EC uniref100.EC
"""

cmd = (
    f'cd {REPO_DIR} && claude -p "{prompt}" '
    f'--allowedTools "mcp__plmc_mcp__plmc_convert_a3m_to_a2m,mcp__plmc_mcp__plmc_generate_model,Bash,Read,Write"'
)
os.system(cmd)

# Verify outputs
plmc_dir = f"{RESULTS_DIR}/plmc"
assert os.path.exists(f"{plmc_dir}/uniref100.model_params"), "PLMC model_params symlink missing"
assert os.path.exists(f"{plmc_dir}/uniref100.EC"), "PLMC EC symlink missing"
print(f"\nPLMC model built: {os.listdir(plmc_dir)}")

In [None]:
#@title Step 3: Build EV+OneHot Model
%%time
import os

REPO_DIR = "/content/ProteinMCP"

prompt = f"""\
I have created a plmc model in directory {RESULTS_DIR}/plmc. \
Can you help build an EV+OneHot model using ev_onehot_mcp and save it to {RESULTS_DIR}/ directory. \
The wild-type sequence is {RESULTS_DIR}/wt.fasta, and the dataset is {RESULTS_DIR}/data.csv.
Please convert the relative path to absolute path before calling the MCP servers.
"""

cmd = (
    f'cd {REPO_DIR} && claude -p "{prompt}" '
    f'--allowedTools "mcp__ev_onehot_mcp__ev_onehot_train_fitness_predictor,Bash,Read,Write"'
)
os.system(cmd)

# Verify output
assert os.path.exists(f"{RESULTS_DIR}/metrics_summary.csv"), "EV+OneHot metrics not found"
print(f"\nEV+OneHot model trained. Metrics:")
import pandas as pd
print(pd.read_csv(f"{RESULTS_DIR}/metrics_summary.csv").to_string(index=False))

In [None]:
#@title Step 4: Build ESM Models
%%time
import os

REPO_DIR = "/content/ProteinMCP"

# --- 4.1: ESM2-650M (extract embeddings + train svr/xgboost/knn) ---
prompt_650m = f"""\
Can you help train ESM models for data in {RESULTS_DIR}/ and save them to \
{RESULTS_DIR}/esm2_650M_{{head_model}} using the esm mcp server with svr, xgboost, \
and knn as the head models.
Please convert the relative path to absolute path before calling the MCP servers.
Obtain the embeddings if they are not created.
"""

cmd_650m = (
    f'cd {REPO_DIR} && claude -p "{prompt_650m}" '
    f'--allowedTools "mcp__esm_mcp__extract_protein_embeddings,mcp__esm_mcp__esm_train_fitness_model,Bash,Read,Write"'
)
os.system(cmd_650m)

# --- 4.2: ESM2-3B (extract embeddings + train svr/xgboost/knn) ---
prompt_3b = f"""\
Can you help train ESM models for data in {RESULTS_DIR}/ and save them to \
{RESULTS_DIR}/esm2_3B_{{head_model}} using the esm mcp server with svr, xgboost, \
and knn as the head models and esm2_t36_3B_UR50D as the backbone.
Please convert the relative path to absolute path before calling the MCP servers.
Obtain the embeddings if they are not created.
"""

cmd_3b = (
    f'cd {REPO_DIR} && claude -p "{prompt_3b}" '
    f'--allowedTools "mcp__esm_mcp__extract_protein_embeddings,mcp__esm_mcp__esm_train_fitness_model,Bash,Read,Write"'
)
os.system(cmd_3b)

# Verify
for backbone in ["esm2_650M", "esm2_3B"]:
    for head in ["svr", "xgboost", "knn"]:
        d = f"{RESULTS_DIR}/{backbone}_{head}"
        if os.path.isdir(d):
            print(f"  {backbone}_{head}: {os.listdir(d)}")
        else:
            print(f"  {backbone}_{head}: NOT FOUND")

In [None]:
#@title Step 5: Build ProtTrans Models
%%time
import os

REPO_DIR = "/content/ProteinMCP"

prompt = f"""\
Can you help train ProtTrans models for data in {RESULTS_DIR}/ and save them to \
{RESULTS_DIR}/{{backbone_model}}_{{head_model}} using the prottrans mcp server with \
ProtT5-XL and ProtAlbert as backbone_models and knn, xgboost, and svr as the head models.
Please convert the relative path to absolute path before calling the MCP servers.
Create the embeddings if they are not created.
"""

cmd = (
    f'cd {REPO_DIR} && claude -p "{prompt}" '
    f'--allowedTools "mcp__prottrans_mcp__prottrans_extract_embeddings,mcp__prottrans_mcp__prottrans_train_fitness_model,Bash,Read,Write"'
)
os.system(cmd)

# Verify
for backbone in ["ProtT5-XL", "ProtAlbert"]:
    for head in ["svr", "xgboost", "knn"]:
        d = f"{RESULTS_DIR}/{backbone}_{head}"
        if os.path.isdir(d):
            print(f"  {backbone}_{head}: {os.listdir(d)}")
        else:
            print(f"  {backbone}_{head}: NOT FOUND")

In [None]:
#@title Step 6: Aggregate Results & Visualize
%%time
import os
import pandas as pd

REPO_DIR = "/content/ProteinMCP"

# ---- 6.1 Collect and aggregate all model results ----
results = []

# EV+OneHot: metrics_summary.csv (stage/fold format)
ev_path = os.path.join(RESULTS_DIR, "metrics_summary.csv")
if os.path.exists(ev_path):
    ev = pd.read_csv(ev_path)
    cv_mean = ev[ev["fold"] == "mean"]["spearman_correlation"].values[0]
    cv_std  = ev[ev["fold"] == "std"]["spearman_correlation"].values[0]
    results.append({"backbone": "EV+OneHot", "head": "ridge",
                    "mean_cv_spearman": cv_mean, "std_cv_spearman": cv_std})

# ESM & ProtTrans: training_summary.csv in subdirectories
for dir_name in sorted(os.listdir(RESULTS_DIR)):
    summary = os.path.join(RESULTS_DIR, dir_name, "training_summary.csv")
    if not os.path.exists(summary):
        continue
    df = pd.read_csv(summary)
    if "mean_cv_spearman" in df.columns:
        mean_sp = df["mean_cv_spearman"].values[0]
        std_sp  = df["std_cv_spearman"].values[0]
    elif "cv_mean" in df.columns:
        mean_sp = df["cv_mean"].values[0]
        std_sp  = df["cv_std"].values[0]
    else:
        continue
    parts = dir_name.rsplit("_", 1)
    if len(parts) == 2:
        results.append({"backbone": parts[0], "head": parts[1],
                        "mean_cv_spearman": mean_sp, "std_cv_spearman": std_sp})

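# Explicit columns keep the sort below valid even when no results were found
all_models = pd.DataFrame(
    results, columns=["backbone", "head", "mean_cv_spearman", "std_cv_spearman"]
)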
all_models.to_csv(os.path.join(RESULTS_DIR, "all_models_comparison.csv"), index=False)
print(f"Saved {len(results)} model results to all_models_comparison.csv\n")
print(all_models.sort_values("mean_cv_spearman", ascending=False).to_string(index=False))

# ---- 6.2 Generate four-panel visualization ----
VIZ_SCRIPT = f"{REPO_DIR}/workflow-skills/scripts/fitness_modeling_viz.py"
VIZ_PYTHON = f"{REPO_DIR}/tool-mcps/ev_onehot_mcp/env/bin/python"

# Use ev_onehot_mcp env (has matplotlib, seaborn, scipy, Pillow)
if os.path.exists(VIZ_PYTHON):
    os.system(f"{VIZ_PYTHON} {VIZ_SCRIPT} {RESULTS_DIR}")
else:
    # Fallback: install deps in base env
    os.system("pip install -q matplotlib seaborn scipy Pillow")
    os.system(f"python {VIZ_SCRIPT} {RESULTS_DIR}")

# ---- 6.3 Display figure inline ----
from IPython.display import display, Image

summary_png = os.path.join(RESULTS_DIR, "figures", "fitness_modeling_summary.png")
if os.path.exists(summary_png):
    print("\nFour-panel summary:")
    display(Image(filename=summary_png, width=800))
else:
    # Try individual figures
    figs_dir = os.path.join(RESULTS_DIR, "figures")
    if os.path.isdir(figs_dir):
        for f in sorted(os.listdir(figs_dir)):
            if f.endswith(".png"):
                print(f"\n{f}:")
                display(Image(filename=os.path.join(figs_dir, f), width=500))
    else:
        print("No figures generated; check the logs above.")

In [None]:
#@title Step 7: Download Results
import os
import pandas as pd

# ---- Summary table ----
csv_path = os.path.join(RESULTS_DIR, "all_models_comparison.csv")
if os.path.exists(csv_path):
    df = pd.read_csv(csv_path).sort_values("mean_cv_spearman", ascending=False)
    print("=" * 60)
    print("MODEL PERFORMANCE SUMMARY (5-fold CV Spearman ρ)")
    print("=" * 60)
    for _, row in df.iterrows():
        print(f"  {row['backbone']:15} + {row['head']:8}: "
              f"{row['mean_cv_spearman']:.3f} ± {row['std_cv_spearman']:.3f}")
    print("=" * 60)
    best = df.iloc[0]
    print(f"  Best: {best['backbone']} ({best['head']}): ρ = {best['mean_cv_spearman']:.3f}")
    print("=" * 60)

# ---- Zip and download ----
zip_path = f"/content/{PROTEIN_NAME}_results.zip"
os.system(f'cd /content && zip -r "{zip_path}" "results/{PROTEIN_NAME}"')
print(f"\nResults zipped to {zip_path}")

from google.colab import files
files.download(zip_path)

In [None]:
#@title 🧹 Cleanup (optional)
#@markdown Run this cell to uninstall the fitness modeling skill and remove temp files.

import os

REPO_DIR = "/content/ProteinMCP"

# Uninstall skill and MCPs
os.system(f"cd {REPO_DIR} && pskill uninstall fitness_modeling")

# Remove flag files
for flag in ["/content/.conda_ready", "/content/.proteinmcp_ready", "/content/.skill_ready"]:
    if os.path.exists(flag):
        os.remove(flag)

print("Cleanup complete.")

---
## Instructions & Troubleshooting

### Data Format
Your `data.csv` must contain at minimum:
- **`seq`**: Full protein sequence
- **`log_fitness`**: Log-transformed fitness value (target)

Your `wt.fasta` should contain the wild-type reference sequence in standard FASTA format.
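
Both files can be checked before the pipeline starts, since a malformed input otherwise only surfaces several steps later. The following is a minimal sketch using only the standard library. The helper names `read_first_fasta_sequence` and `validate_inputs` are hypothetical and not part of ProteinMCP, and the sequences in the demo are toy data. It checks only what is stated above: the two required columns, numeric `log_fitness` values, and a non-empty wild-type sequence.

```python
import csv
import os
import tempfile

REQUIRED_COLUMNS = ("seq", "log_fitness")


def read_first_fasta_sequence(path):
    """Return the sequence of the first FASTA record, with line breaks joined."""
    chunks, in_record = [], False
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if in_record:
                    break  # only the first record is treated as the wild type
                in_record = True
            elif in_record and line:
                chunks.append(line)
    return "".join(chunks)


def validate_inputs(fasta_path, csv_path):
    """Return a list of problems found; an empty list means the files look usable."""
    problems = []
    if not read_first_fasta_sequence(fasta_path):
        problems.append("wt.fasta: no sequence found after a '>' header")
    with open(csv_path, newline="") as fh:
        reader = csv.DictReader(fh)
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return problems + [f"data.csv: missing column(s) {missing}"]
        for line_no, row in enumerate(reader, start=2):  # line 1 is the header
            if not (row["seq"] or "").strip():
                problems.append(f"data.csv line {line_no}: empty seq")
            try:
                float(row["log_fitness"])
            except (TypeError, ValueError):
                problems.append(f"data.csv line {line_no}: non-numeric log_fitness")
    return problems


# Demo on toy files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "wt.fasta")
    good_csv = os.path.join(tmp, "data.csv")
    bad_csv = os.path.join(tmp, "bad.csv")
    with open(fasta, "w") as fh:
        fh.write(">wt\nMKTAYIAK\nQRQISFVK\n")
    with open(good_csv, "w") as fh:
        fh.write("seq,log_fitness\nMKTAYIAKQRQISFVK,0.0\nMKTAYIAKQRQISFVR,-1.25\n")
    with open(bad_csv, "w") as fh:
        fh.write("seq,log_fitness\nMKTAYIAKQRQISFVK,high\n")
    wt_seq = read_first_fasta_sequence(fasta)
    good_problems = validate_inputs(fasta, good_csv)
    bad_problems = validate_inputs(fasta, bad_csv)

print(good_problems)
print(bad_problems)
```

In the notebook, the same check could be run as `validate_inputs(WT_FASTA, DATA_CSV)` right after the configuration cell.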

### Common Issues

| Problem | Solution |
|---------|----------|
| `uniref100.model_params not found` | Re-run Step 2; the symlinks may not have been created |
| `wt.fasta not found` in EV+OneHot | Ensure wt.fasta is in RESULTS_DIR (Step 0) |
| ESM embeddings extraction fails | The `claude -p` call will fall back to `esm-extract` CLI |
| GPU Out of Memory | Use **Runtime → Change runtime type → T4**, or skip ESM2-3B |
| Low Spearman correlation | Check data quality; ensure proper log-transformation |
| MCP not found | Re-run the skill install cell |
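
For the first row of the table, the missing links can also be recreated by hand when the PLMC outputs already exist. The sketch below mirrors the two `ln -sf` commands from the Step 2 prompt. The helper name `ensure_plmc_symlinks` is hypothetical, and the sketch assumes PLMC wrote `{PROTEIN_NAME}.model_params` and `{PROTEIN_NAME}.EC` into the `plmc` directory; the demo uses empty placeholder files in a temporary directory.

```python
import os
import tempfile


def ensure_plmc_symlinks(plmc_dir, protein_name):
    """Create uniref100.* links pointing at <protein_name>.* (like `ln -sf`)."""
    created = []
    for ext in ("model_params", "EC"):
        target = f"{protein_name}.{ext}"  # relative target, as in the Step 2 prompt
        link = os.path.join(plmc_dir, f"uniref100.{ext}")
        if not os.path.exists(os.path.join(plmc_dir, target)):
            raise FileNotFoundError(f"{target} not found in {plmc_dir}")
        if os.path.lexists(link):
            os.remove(link)  # replace a stale or broken link
        os.symlink(target, link)
        created.append(link)
    return created


# Demo with empty placeholder files standing in for real PLMC outputs.
with tempfile.TemporaryDirectory() as tmp:
    for ext in ("model_params", "EC"):
        open(os.path.join(tmp, f"TEVp_S219V.{ext}"), "w").close()
    links = ensure_plmc_symlinks(tmp, "TEVp_S219V")
    link_names = sorted(os.path.basename(p) for p in links)
    link_target = os.readlink(os.path.join(tmp, "uniref100.EC"))
    links_resolve = all(os.path.exists(p) for p in links)

print(link_names, link_target, links_resolve)
```

In the notebook this would be called as `ensure_plmc_symlinks(f"{RESULTS_DIR}/plmc", PROTEIN_NAME)`. If the PLMC output files themselves are missing, the helper raises `FileNotFoundError`, and re-running Step 2 is the right fix.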

### Model Performance Reference

| Model | Typical CV Spearman | Best Use |
|-------|-------------------|----------|
| EV+OneHot | 0.20–0.35 | Baseline, interpretable |
| ESM2-650M | 0.15–0.25 | Fast, good balance |
| ESM2-3B | 0.18–0.28 | Higher accuracy |
| ProtT5-XL | 0.15–0.25 | Alternative to ESM |
| ProtAlbert | 0.08–0.15 | Lightweight option |

**Recommended head models:** SVR (most stable), XGBoost (higher potential), KNN (simple baseline)
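
To make the "CV Spearman" column concrete, here is a minimal, dependency-free sketch of how a mean ± std Spearman score over 5 folds can be computed. It is an illustration, not the code the MCP tools run: the fold assignment, the `one_nn` stand-in head, and the toy data are all assumptions made for the example.

```python
import math
import random
import statistics


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else float("nan")


def kfold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_spearman(fit_predict, X, y, k=5, seed=0):
    """Score each held-out fold, then return (mean, std) of the fold scores."""
    scores = []
    for test in kfold_indices(len(X), k, seed):
        held_out = set(test)
        train = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        scores.append(spearman(preds, [y[i] for i in test]))
    return statistics.fmean(scores), statistics.stdev(scores)


def one_nn(train_X, train_y, test_X):
    """Hypothetical stand-in head: copy the label of the nearest training point."""
    return [train_y[min(range(len(train_X)), key=lambda i: abs(train_X[i] - x))]
            for x in test_X]


# Toy data: a noisy monotonic relationship between a scalar feature and fitness.
rng = random.Random(1)
X = [rng.uniform(0, 10) for _ in range(50)]
y = [x + rng.gauss(0, 1) for x in X]
mean_rho, std_rho = cv_spearman(one_nn, X, y)
print(f"5-fold CV Spearman: {mean_rho:.3f} ± {std_rho:.3f}")
```

Because Spearman correlation depends only on rank order, a head that orders variants correctly scores well even if its predicted values are on the wrong scale.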

### References
- [ESM](https://github.com/facebookresearch/esm): Meta's protein language models
- [ProtTrans](https://github.com/agemagician/ProtTrans): Protein transformer embeddings
- [PLMC](https://github.com/debbiemarkslab/plmc): Evolutionary coupling analysis
- [ProteinMCP](https://github.com/charlesxu90/ProteinMCP): This project