XAheli/AiXBio

FuncScreen: Contrastive PLM Embeddings for Evasion-Resistant Biosecurity Screening

License: MIT · Python 3.9+ · PyTorch · ESM-2 · ProteinMPNN · Hackathon

Important

We came in the top 25% 👀

Project Submission: View the official project page and hackathon submission on Apart Research

TL;DR: Current DNA screening checks whether a sequence looks like a known threat. AI tools like ProteinMPNN design sequences that function identically but look nothing alike, evading homology-based screening. FuncScreen detects threats by function, not sequence, closing this gap.


The Problem

[Figure: Detection rate vs. sequence identity]

At <20% sequence identity, homology screening achieves 0% detection. FuncScreen maintains detection signal in this critical regime.

AI protein design tools such as ProteinMPNN generate functional threat variants with as little as 7% sequence identity to known threats. Current screening infrastructure (SecureDNA, IBBIS Common Mechanism) relies on sequence homology and completely misses these low-identity variants.
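To illustrate why homology-style scores collapse under redesign, here is a minimal sketch of a k-mer overlap score next to a per-position identity measure. It is an assumption about how k-mer screening works in general, not the algorithm used by SecureDNA, IBBIS Common Mechanism, or `src/screening/baselines.py`; the function names and the toy sequences are invented for illustration.

```python
def kmers(seq: str, k: int = 5) -> set[str]:
    """Return the set of overlapping k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def percent_identity(a: str, b: str) -> float:
    """Fraction of identical positions for two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy sequences, invented for illustration only (not real toxins).
reference = "MKTLLVAGAVLSAPAFA"
close_variant = "MKTLLVAGAVLSAPAFG"  # one substitution at the last position
redesigned = "MRSIIVGAGLISTPGYA"     # most positions substituted

if __name__ == "__main__":
    for name, seq in [("close_variant", close_variant), ("redesigned", redesigned)]:
        print(name, percent_identity(reference, seq), kmer_jaccard(reference, seq))
```

In this toy case the redesigned sequence still matches the reference at 5 of 17 positions, yet shares no 5-mer with it, so an exact k-mer match score drops to zero well before identity does.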

Key Results

All metrics are AUROC, reported with 95% bootstrap confidence intervals (1,000 iterations).

| Evaluation Split | K-mer (Homology) | ESM-2 Cosine NN | ESM-2 Linear | KNN | FuncScreen |
|---|---|---|---|---|---|
| Standard | .990 [.973, 1.0] | .993 [.977, 1.0] | .995 [.988, .999] | 1.00 [.999, 1.0] | 1.00 [1.0, 1.0] |
| Hard Negative | .989 [.969, 1.0] | .994 [.980, 1.0] | .995 [.990, .999] | .995 [.988, 1.0] | 1.00 [1.0, 1.0] |
| Seq. Divergent | .821 [.744, .897] | .886 [.826, .940] | .943 [.905, .973] | .966 [.934, .992] | .974 [.947, .994] |
| MPNN Adversarial | .952 [.944, .959] | .965 [.958, .972] | .994 [.991, .996] | .997 [.996, .998] | .991 [.988, .993] |
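The intervals in the table are described as 95% bootstrap confidence intervals over 1,000 iterations. The following is a minimal sketch of one common way to compute such an interval, a percentile bootstrap over examples, using only the standard library. Whether `src/evaluation/metrics.py` uses the percentile method, stratified resampling, or something else is not stated here, so treat the details (function names, the rejection of single-class resamples) as assumptions.

```python
import random


def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point_estimate, lower, upper) using a percentile bootstrap."""
    point = auroc(labels, scores)  # also validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # resample lost one class; draw again
            continue
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lower, upper
```

A perfectly separated split yields a degenerate interval of [1.0, 1.0] under this procedure, because every resample that contains both classes is also perfectly separated.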

AUROC Heatmap (all methods × all splits)

[Figure: AUROC heatmap]

ROC Curves: ProteinMPNN Adversarial Split

[Figure: ROC curves, ProteinMPNN adversarial split]

Ablation Study

| Ablation | Best Config | MPNN AUROC | Seq. Div. AUROC |
|---|---|---|---|
| Projection dim | 512 | .997 | .983 |
| Temperature | 0.2 | .997 | .987 |
| Hard neg. ratio | k=3 (default) | .991 | .974 |
| Multi-scale | No improvement | .989 | .974 |
| Mixup | No improvement | .986 | .976 |

Temperature 0.2 and projection dim 512 each raise FuncScreen's AUROC on the MPNN adversarial split to .997, matching KNN, while maintaining strong sequence-divergent performance (.987 and .983 respectively).
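For readers unfamiliar with the temperature being ablated, here is a minimal sketch of a supervised contrastive loss in its commonly published form (per-anchor average over positives, similarities divided by a temperature). It is an assumption that FuncScreen's loss follows this form; the BCE term and hard-negative mining mentioned elsewhere in this README are not shown, and `supcon_loss` is a hypothetical name, not a function from `src/models/contrastive.py`.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.2):
    """Supervised contrastive loss averaged over anchors that have a positive."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled similarity of anchor i to every other sample.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

A lower temperature sharpens the softmax over similarities, so same-class pairs that are already close contribute very little loss while confusable pairs dominate the gradient.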

Generalization

Leave-One-Subcategory-Out Cross-Validation

| Held-out Subcategory | n | AUROC | 95% CI |
|---|---|---|---|
| Hemolysin | 169 | 0.963 | [0.935, 0.986] |
| Cytolysin family | 15 | 1.000 | [1.000, 1.000] |
| Pore-forming toxin | 17 | 0.615 | [0.452, 0.775] |
| Aerolysin family | 4 | 0.998 | [0.985, 1.000] |

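The model generalizes to cytolysins and aerolysins without seeing them in training (AUROC ≥ 0.998), although both held-out sets are small (n = 15 and n = 4). The pore-forming toxin subcategory (AUROC 0.615) is poorly detected when held out and is an honest limitation.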
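The leave-one-subcategory-out protocol can be sketched as a small split generator. This is an illustration under stated assumptions, not the logic of `scripts/08_loso_cv.py`: in particular, how benign proteins are divided between training and evaluation is not described in this README, so the sketch simply takes a fixed set of benign test indices as an input (`benign_test_idx` is a hypothetical parameter).

```python
from collections import defaultdict


def loso_splits(subcategories, benign_test_idx):
    """Yield (held_out, train_idx, test_idx) for each threat subcategory.

    subcategories[i] is the subcategory name of threat protein i, or None
    if protein i is benign. benign_test_idx lists the benign proteins that
    are reserved for evaluation in every fold (an assumption of this sketch).
    """
    benign_test = set(benign_test_idx)
    by_group = defaultdict(list)
    for i, sub in enumerate(subcategories):
        if sub is not None:
            by_group[sub].append(i)
    benign_train = [i for i, sub in enumerate(subcategories)
                    if sub is None and i not in benign_test]
    for held_out, test_threats in sorted(by_group.items()):
        # Train on every other threat subcategory plus the benign training pool.
        other_threats = [i for g, idx in by_group.items() if g != held_out for i in idx]
        train = sorted(benign_train + other_threats)
        test = sorted(list(benign_test) + test_threats)
        yield held_out, train, test
```

Each fold's test set contains only threats from the held-out subcategory, so its AUROC measures detection of a subcategory the model never saw during training.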

Second Threat Family: Ribosome-Inactivating Proteins

| Method | AUROC | 95% CI |
|---|---|---|
| K-mer | 0.992 | [0.973, 1.000] |
| FuncScreen | 0.962 | [0.879, 1.000] |

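FuncScreen generalizes to a completely different threat family (ricin, abrin) with >0.96 AUROC, although the K-mer baseline scores higher on this family (0.992) and FuncScreen's confidence interval is wide.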

Out-of-Distribution False Positive Rates

| Method | Kinases | GPCRs | Transcription Factors |
|---|---|---|---|
| K-mer | 0.000 | 0.000 | 0.000 |
| Cosine NN | 0.990 | 0.969 | 1.000 |
| KNN | 0.000 | 0.000 | 0.011 |
| FuncScreen | 0.041 | 0.031 | 0.242 |

Cosine NN is catastrophically unreliable on OOD proteins, flagging 97 to 100% of them. FuncScreen has low FPR on kinases and GPCRs (3 to 4%) but elevated FPR on transcription factors (24%), a calibration issue for future work.
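One simple form the calibration mentioned above could take is choosing the decision threshold from a held-out set of benign scores so that a target false positive rate is not exceeded. The sketch below shows that idea only; it is not a procedure from this repository, and the function names are hypothetical. It assumes higher scores mean "more threat-like" and that a sequence is flagged when its score is at or above the threshold.

```python
import math


def false_positive_rate(benign_scores, threshold):
    """Fraction of benign proteins flagged (score >= threshold)."""
    return sum(s >= threshold for s in benign_scores) / len(benign_scores)


def threshold_for_target_fpr(benign_scores, target_fpr):
    """Lowest threshold whose FPR on the calibration set is <= target_fpr."""
    ranked = sorted(benign_scores, reverse=True)
    allowed = math.floor(target_fpr * len(ranked))  # benign flags we may tolerate
    if allowed >= len(ranked):
        return ranked[-1]  # target permits flagging everything
    # Place the threshold just above the first score that must NOT be flagged.
    return math.nextafter(ranked[allowed], math.inf)
```

A threshold chosen this way only controls FPR on proteins that resemble the calibration set, so the benign scores would need to include the families of concern (for example transcription factors) for the guarantee to carry over.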

Method

[Figures: Raw ESM-2 Space (left), Contrastive Projection (right)]

Left: Raw ESM-2 embeddings (overlapping clusters). Right: After contrastive learning (clean separation).

  1. Data curation: 985 proteins from UniProt Swiss-Prot (335 pore-forming toxins + 650 benign homologs)
  2. Embedding extraction: Frozen ESM-2 (650M) mean-pooled representations (1280-dim)
  3. Contrastive learning: Supervised contrastive loss + BCE with hard-negative mining
  4. Mixup augmentation: Embedding-space interpolation for small-dataset regularization
  5. Adversarial training: High-temperature ProteinMPNN variants (T=0.8, 1.0) added to training
  6. Adversarial stress testing: 4,100 ProteinMPNN-designed variants at 5 sampling temperatures
  7. Certified robustness: Randomized smoothing with 1,000 MC samples, empirical-certified gap ≤1%
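Step 4 of the pipeline above, embedding-space mixup, can be sketched as follows. This is the standard mixup recipe applied to embedding vectors, offered as an assumption about what `MixupAugmenter` does rather than a copy of it; the Beta parameter `alpha=0.4` and the function name `mixup` are illustrative choices, not values taken from this repository.

```python
import random


def mixup(emb_a, label_a, emb_b, label_b, alpha=0.4, rng=None):
    """Interpolate two embeddings and their labels with a Beta-distributed weight.

    Returns (mixed_embedding, mixed_label, lam). The soft label can be used
    directly as the target of a binary cross-entropy loss.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(emb_a, emb_b)]
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed, mixed_label, lam
```

Because the interpolation happens on frozen embeddings rather than on sequences, it adds training points cheaply, though a mixed vector does not correspond to any real protein.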

ProteinMPNN Variant Diversity Distribution

[Figure: MPNN identity distribution]

Certified Robustness

[Figure: Certified robustness]

Empirical vs. certified accuracy under biological mutations. Gap ≤1%.
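The Monte Carlo part of randomized smoothing can be sketched as a majority vote over randomly mutated copies of a sequence, with a statistical lower bound on how often the winning label appears. Everything specific here is an assumption: the uniform substitution noise model (the repository mentions BLOSUM62-conservative mutations, which this sketch does not implement), the Hoeffding bound, and the names `mutate` and `smoothed_predict`. Converting the bound into a certified mutation radius needs a further result specific to the noise model and is not shown.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, rate, rng):
    """Substitute each residue independently with probability `rate`."""
    out = []
    for aa in seq:
        if rng.random() < rate:
            out.append(rng.choice([x for x in AMINO_ACIDS if x != aa]))
        else:
            out.append(aa)
    return "".join(out)


def smoothed_predict(classify, seq, rate=0.05, n_select=100,
                     n_samples=1000, alpha=0.001, seed=0):
    """Return (label, lower_bound) for the smoothed classifier at `seq`.

    `classify` maps a sequence to 0 or 1. A first batch of mutants picks the
    candidate label; a second, independent batch estimates how often that
    label wins. lower_bound is a one-sided Hoeffding bound on that
    probability, valid with confidence 1 - alpha.
    """
    rng = random.Random(seed)
    guess_votes = sum(classify(mutate(seq, rate, rng)) for _ in range(n_select))
    label = 1 if 2 * guess_votes >= n_select else 0
    hits = sum(classify(mutate(seq, rate, rng)) == label for _ in range(n_samples))
    slack = math.sqrt(math.log(1.0 / alpha) / (2.0 * n_samples))
    return label, max(hits / n_samples - slack, 0.0)
```

With this bound the slack shrinks as one over the square root of the sample count: roughly 0.06 at 1,000 samples with `alpha=0.001`, which is one reason far larger sample counts are preferred for tight certificates.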

Training Curves

[Figure: Training curves]

Repository Structure

src/
  config.py                     Central configuration
  data/
    curate.py                   UniProt data fetching and quality filtering
    splits.py                   Train/val/test splitting with 4 evaluation slices
    adversarial.py              BLOSUM62-conservative adversarial variant generation
    structures.py               AlphaFold DB structure download and PDB parsing
    proteinmpnn.py              ProteinMPNN inverse folding wrapper
  models/
    embeddings.py               ESM-2 embedding and hidden state extraction
    contrastive.py              AttentionPooling, MixupAugmenter, contrastive learning
  screening/
    baselines.py                K-mer, Cosine NN, Linear, KNN baseline screeners
    certified.py                Randomized smoothing for biological mutation spaces
  evaluation/
    metrics.py                  Bootstrap CIs, paired significance tests, detection-vs-divergence
    experiments.py              Full experiment runner
    visualize.py                Publication-quality figures with CI error bars

scripts/
  01_curate_data.py             Fetch and split data from UniProt
  02_extract_embeddings.py      Extract ESM-2 embeddings (GPU)
  03_train_contrastive.py       Train screener (--multi-scale, --mixup, --adversarial-augment)
  04_evaluate.py                Evaluate with bootstrap CIs and paired tests
  05_certify.py                 Certified robustness (1,000 MC samples)
  06_generate_mpnn_variants.py  ProteinMPNN adversarial generation (GPU)
  07_ablation.py                Projection dim, temperature, hard neg ratio ablations
  08_loso_cv.py                 Leave-one-subcategory-out cross-validation
  09_ood_evaluation.py          Out-of-distribution false positive rates
  10_second_family.py           Ribosome-inactivating protein generalization

data/                           Curated datasets, embeddings, splits
results/                        Checkpoints, figures, tables, ablation results

Reproducing Results

git clone https://github.com/XAheli/AiXBio.git aixbio
cd aixbio
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Core pipeline
python scripts/01_curate_data.py
python scripts/02_extract_embeddings.py --model esm2 --device cuda --batch-size 32
python scripts/03_train_contrastive.py --embedding esm2 --device cuda --epochs 50
python scripts/04_evaluate.py --embedding esm2

# ProteinMPNN adversarial variants
git clone https://github.com/dauparas/ProteinMPNN.git
export PYTHONPATH="$PYTHONPATH:$(pwd)/ProteinMPNN"
python scripts/06_generate_mpnn_variants.py --device cuda --structure-source alphafold
python scripts/04_evaluate.py --embedding esm2

# Additional experiments
python scripts/05_certify.py --embedding esm2 --device cuda
python scripts/07_ablation.py --device cuda --embedding esm2
python scripts/08_loso_cv.py --embedding esm2 --device cuda
python scripts/09_ood_evaluation.py --device cuda --embedding esm2
python scripts/10_second_family.py --device cuda

Citation

@misc{poddar2026funcscreen,
title={(HckPrj) FuncScreen: Contrastive PLM Embeddings for Evasion-Resistant Biosecurity Screening},
author={Aheli Poddar},
date={2026-04-26},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={\url{https://apartresearch.com}}
}

License

MIT License. See LICENSE.

Limitations and Dual-Use Considerations

  • The KNN baseline outperforms FuncScreen on the MPNN adversarial split in aggregate AUROC (0.997 vs 0.991); the ablation shows this gap closes with dim=512 or τ=0.2
  • Elevated OOD false positive rate on transcription factors (24%); calibration needed
  • LOSO CV: pore-forming toxin subcategory poorly detected when held out (0.615 AUROC)
  • Primarily evaluated on one threat family; RIP mini-experiment provides preliminary generalization evidence
  • Certified robustness uses 1,000 MC samples (production requires 100,000+)
  • ProteinMPNN variant generation demonstrates a known attack vector, disclosed to motivate defensive improvements
  • This work is intended to strengthen biosecurity screening infrastructure, not to enable evasion
