FuncScreen: Contrastive PLM Embeddings for Evasion-Resistant Biosecurity Screening

Important: We came in the top 25% 👀

Project Submission: View the official project page and hackathon submission on Apart Research

TL;DR: Current DNA screening checks if a sequence looks like a known threat. AI tools like ProteinMPNN design sequences that function identically but look nothing alike — evading all homology-based screening. FuncScreen detects threats by function, not sequence, closing this gap.

The Problem

At <20% sequence identity, homology screening achieves 0% detection. FuncScreen maintains detection signal in this critical regime.

AI protein design tools (ProteinMPNN) generate functional threat variants with as low as 7% sequence identity to known threats. Current screening infrastructure — SecureDNA, IBBIS Common Mechanism — relies on sequence homology and completely misses these variants.
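To make the failure mode concrete, here is a minimal sketch of a k-mer overlap score of the kind a homology screen depends on. It is not the repository's `baselines.py` implementation: the function names, the choice of k=3, and the toy sequences are all assumptions made up for illustration.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


# Toy sequences, invented for illustration only (not real threat proteins).
reference = "MKTLLVAGAVLLSACSA"
close_variant = "MKTLLVAGAVLLSACSG"  # a single substitution at the last position
redesigned = "MRSIIFSGSIVMTGCTS"     # most positions changed

print(kmer_jaccard(reference, close_variant))  # high overlap
print(kmer_jaccard(reference, redesigned))     # no shared 3-mers
```

A single substitution leaves most k-mers intact, while a heavily redesigned sequence shares none, so any score built on shared k-mers has nothing to match against.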

Key Results

All values are AUROC, reported with 95% bootstrap confidence intervals (1,000 iterations).

Evaluation Split	K-mer (Homology)	ESM-2 Cosine NN	ESM-2 Linear	KNN	FuncScreen
Standard	.990 [.973, 1.0]	.993 [.977, 1.0]	.995 [.988, .999]	1.00 [.999, 1.0]	1.00 [1.0, 1.0]
Hard Negative	.989 [.969, 1.0]	.994 [.980, 1.0]	.995 [.990, .999]	.995 [.988, 1.0]	1.00 [1.0, 1.0]
Seq. Divergent	.821 [.744, .897]	.886 [.826, .940]	.943 [.905, .973]	.966 [.934, .992]	.974 [.947, .994]
MPNN Adversarial	.952 [.944, .959]	.965 [.958, .972]	.994 [.991, .996]	.997 [.996, .998]	.991 [.988, .993]
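The bracketed intervals in the table can be approximated with a percentile bootstrap. The sketch below uses only the standard library and is an assumption about the procedure, not a copy of `src/evaluation/metrics.py`; the function names and the toy scores are hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(labels, scores, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample (label, score) pairs with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one of the two classes
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_iter)]
    hi = stats[int((1 - alpha / 2) * n_iter) - 1]
    return auroc(labels, scores), lo, hi


# Toy scores drawn from two overlapping Gaussians (illustration only).
rng = random.Random(1)
labels = [1] * 30 + [0] * 30
scores = [rng.gauss(1.0, 1.0) for _ in range(30)] + [rng.gauss(0.0, 1.0) for _ in range(30)]
point, lo, hi = bootstrap_auroc_ci(labels, scores)
print(f"AUROC {point:.3f} [{lo:.3f}, {hi:.3f}]")
```

Small test sets produce wide intervals under this procedure, which is worth keeping in mind when reading the rows with few samples.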

AUROC Heatmap (all methods x all splits)

ROC Curves — ProteinMPNN Adversarial Split

Ablation Study

Ablation	Best Config	MPNN AUROC	Seq. Div. AUROC
Projection dim	512	.997	.983
Temperature	0.2	.997	.987
Hard neg. ratio	k=3 (default)	.991	.974
Multi-scale	No improvement	.989	.974
Mixup	No improvement	.986	.976

With temperature 0.2 or projection dim 512, FuncScreen reaches .997 AUROC on the MPNN adversarial split, matching KNN, while sequence-divergent AUROC stays at .983 or higher.

Generalization

Leave-One-Subcategory-Out Cross-Validation

Held-out Subcategory	n	AUROC	95% CI
Hemolysin	169	0.963	[0.935, 0.986]
Cytolysin family	15	1.000	[1.000, 1.000]
Pore-forming toxin	17	0.615	[0.452, 0.775]
Aerolysin family	4	0.998	[0.985, 1.000]

The model generalizes to cytolysins and aerolysins without seeing them in training, although both held-out sets are small (n = 15 and n = 4). The pore-forming toxin subcategory (0.615 AUROC, n = 17) is an honest limitation.
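The leave-one-subcategory-out protocol can be sketched as a fold generator. How the repository's `08_loso_cv.py` handles benign proteins is not stated here, so the fixed benign hold-out (`benign_test_every`) is an assumption, as are the record layout and the function name.

```python
def loso_folds(records, benign_test_every=5):
    """Yield (held_out, train, test) once per threat subcategory.

    records: list of (protein_id, subcategory, label) tuples, where benign
    proteins have label 0 and subcategory None.
    Assumption: every `benign_test_every`-th benign protein is reserved for
    testing in all folds, so each test set contains negatives.
    """
    threats = [r for r in records if r[2] == 1]
    benign = [r for r in records if r[2] == 0]
    benign_test = benign[::benign_test_every]
    benign_train = [r for i, r in enumerate(benign) if i % benign_test_every != 0]
    for held_out in sorted({r[1] for r in threats}):
        # The held-out subcategory never appears among the training threats.
        train = [r for r in threats if r[1] != held_out] + benign_train
        test = [r for r in threats if r[1] == held_out] + benign_test
        yield held_out, train, test


# Toy records with hypothetical IDs (illustration only).
records = [
    ("T1", "Hemolysin", 1), ("T2", "Hemolysin", 1),
    ("T3", "Aerolysin family", 1), ("T4", "Cytolysin family", 1),
] + [(f"B{i}", None, 0) for i in range(10)]

for held_out, train, test in loso_folds(records):
    print(held_out, len(train), len(test))
```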

Second Threat Family: Ribosome-Inactivating Proteins

Method	AUROC	95% CI
K-mer	0.992	[0.973, 1.000]
FuncScreen	0.962	[0.879, 1.000]

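FuncScreen generalizes to a completely different threat family (ricin, abrin) with 0.962 AUROC, though its confidence interval is wide and the K-mer baseline scores higher on this family (0.992).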

Out-of-Distribution False Positive Rates

Method	Kinases	GPCRs	Transcription Factors
K-mer	0.000	0.000	0.000
Cosine NN	0.990	0.969	1.000
KNN	0.000	0.000	0.011
FuncScreen	0.041	0.031	0.242

Cosine NN is catastrophically unreliable on OOD proteins, flagging at least 96.9% of every benign set. FuncScreen has low FPR on kinases (0.041) and GPCRs (0.031) but elevated FPR on transcription factors (0.242), a calibration issue for future work.
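The false positive rates above are the fraction of benign out-of-distribution proteins that a screener flags at its decision threshold. The sketch below shows that computation plus one simple way a threshold could be recalibrated on benign scores; the 0.5 default, the function names, and the calibration helper are assumptions, not something the repository is known to implement.

```python
def false_positive_rate(benign_scores, threshold=0.5):
    """Fraction of benign proteins whose threat score meets the threshold."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    return sum(s >= threshold for s in benign_scores) / len(benign_scores)


def threshold_for_fpr(benign_scores, target_fpr):
    """Smallest observed score usable as a threshold with FPR <= target_fpr."""
    for t in sorted(set(benign_scores)):
        if false_positive_rate(benign_scores, t) <= target_fpr:
            return t
    return float("inf")  # no observed score satisfies the target


# Toy benign scores (illustration only).
toy_scores = [0.1, 0.2, 0.9, 0.6]
print(false_positive_rate(toy_scores))      # two of four scores are >= 0.5
print(threshold_for_fpr(toy_scores, 0.25))  # only the top score may be flagged
```

Raising the threshold this way lowers the false positive rate but also lowers sensitivity, so any recalibration would need to be checked against the threat splits as well.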

Method

Left: Raw ESM-2 embeddings (overlapping clusters). Right: After contrastive learning (clean separation).

Data curation: 985 proteins from UniProt Swiss-Prot (335 pore-forming toxins + 650 benign homologs)
Embedding extraction: Frozen ESM-2 (650M) mean-pooled representations (1280-dim)
Contrastive learning: Supervised contrastive loss + BCE with hard-negative mining
Mixup augmentation: Embedding-space interpolation for small-dataset regularization
Adversarial training: High-temperature ProteinMPNN variants (T=0.8, 1.0) added to training
Adversarial stress testing: 4,100 ProteinMPNN-designed variants at 5 sampling temperatures
Certified robustness: Randomized smoothing with 1,000 MC samples, empirical-certified gap ≤1%
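As an illustration of the contrastive step, here is a dependency-free sketch of a supervised contrastive loss over a small batch of embeddings. It follows the standard formulation (L2-normalized embeddings, temperature-scaled similarities, average log-probability of same-label pairs) and omits the BCE term and hard-negative mining; it is not taken from `src/models/contrastive.py`, and the function names and toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.2):
    """Supervised contrastive loss averaged over anchors with at least one positive."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        m = max(sims.values())  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-D embeddings: two tight clusters (illustration only).
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(supcon_loss(toy, [1, 1, 0, 0]))  # labels agree with the clusters: low loss
print(supcon_loss(toy, [1, 0, 1, 0]))  # labels cut across the clusters: high loss
```

The loss is small when same-label embeddings sit close together and far from the other class, which is the separation the projection head is trained to produce. Lower temperatures sharpen the penalty on the hardest negatives.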

ProteinMPNN Variant Diversity Distribution

Certified Robustness

Empirical vs certified accuracy under biological mutations. Gap ≤1%.
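A rough sketch of the Monte Carlo side of randomized smoothing over sequence mutations follows. Several choices here are assumptions rather than the repository's method: mutations are uniform random substitutions, the lower bound is a simple Hoeffding bound, and the same samples are used both to pick the majority class and to bound its probability. The `classify` callable and all function names are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, rate, rng):
    """Substitute each residue independently with probability `rate`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq)


def smoothed_vote(classify, seq, rate=0.05, n_samples=1000, alpha=0.001, seed=0):
    """Majority vote over mutated copies, with a lower bound on the vote share.

    Returns (top_class, p_hat, p_lower). The bound holds with probability at
    least 1 - alpha under Hoeffding's inequality; a caller would abstain when
    p_lower <= 0.5.
    """
    rng = random.Random(seed)
    flagged = sum(int(classify(mutate(seq, rate, rng))) for _ in range(n_samples))
    top = 1 if 2 * flagged >= n_samples else 0
    p_hat = max(flagged, n_samples - flagged) / n_samples
    p_lower = p_hat - math.sqrt(math.log(1 / alpha) / (2 * n_samples))
    return top, p_hat, p_lower


# Toy classifier: flags sequences with at least three lysines (illustration only).
def toy_classifier(seq):
    return seq.count("K") >= 3


top, p_hat, p_lower = smoothed_vote(toy_classifier, "KKKKKKAAAA")
print(top, p_hat, p_lower)
```

The width of the bound shrinks with the square root of the sample count, so going from 1,000 to 100,000 samples tightens it by a factor of ten.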

Training Curves

Repository Structure

src/
  config.py                     Central configuration
  data/
    curate.py                   UniProt data fetching and quality filtering
    splits.py                   Train/val/test splitting with 4 evaluation slices
    adversarial.py              BLOSUM62-conservative adversarial variant generation
    structures.py               AlphaFold DB structure download and PDB parsing
    proteinmpnn.py              ProteinMPNN inverse folding wrapper
  models/
    embeddings.py               ESM-2 embedding and hidden state extraction
    contrastive.py              AttentionPooling, MixupAugmenter, contrastive learning
  screening/
    baselines.py                K-mer, Cosine NN, Linear, KNN baseline screeners
    certified.py                Randomized smoothing for biological mutation spaces
  evaluation/
    metrics.py                  Bootstrap CIs, paired significance tests, detection-vs-divergence
    experiments.py              Full experiment runner
    visualize.py                Publication-quality figures with CI error bars

scripts/
  01_curate_data.py             Fetch and split data from UniProt
  02_extract_embeddings.py      Extract ESM-2 embeddings (GPU)
  03_train_contrastive.py       Train screener (--multi-scale, --mixup, --adversarial-augment)
  04_evaluate.py                Evaluate with bootstrap CIs and paired tests
  05_certify.py                 Certified robustness (1,000 MC samples)
  06_generate_mpnn_variants.py  ProteinMPNN adversarial generation (GPU)
  07_ablation.py                Projection dim, temperature, hard neg ratio ablations
  08_loso_cv.py                 Leave-one-subcategory-out cross-validation
  09_ood_evaluation.py          Out-of-distribution false positive rates
  10_second_family.py           Ribosome-inactivating protein generalization

data/                           Curated datasets, embeddings, splits
results/                        Checkpoints, figures, tables, ablation results

Reproducing Results

git clone https://github.com/XAheli/AiXBio.git aixbio
cd aixbio
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Core pipeline
python scripts/01_curate_data.py
python scripts/02_extract_embeddings.py --model esm2 --device cuda --batch-size 32
python scripts/03_train_contrastive.py --embedding esm2 --device cuda --epochs 50
python scripts/04_evaluate.py --embedding esm2

# ProteinMPNN adversarial variants
git clone https://github.com/dauparas/ProteinMPNN.git
export PYTHONPATH=$PYTHONPATH:$(pwd)/ProteinMPNN
python scripts/06_generate_mpnn_variants.py --device cuda --structure-source alphafold
python scripts/04_evaluate.py --embedding esm2

# Additional experiments
python scripts/05_certify.py --embedding esm2 --device cuda
python scripts/07_ablation.py --device cuda --embedding esm2
python scripts/08_loso_cv.py --embedding esm2 --device cuda
python scripts/09_ood_evaluation.py --device cuda --embedding esm2
python scripts/10_second_family.py --device cuda

Citation

@misc{poddar2026funcscreen,
title={(HckPrj) FuncScreen: Contrastive PLM Embeddings for Evasion-Resistant Biosecurity Screening},
author={Aheli Poddar},
date={2026-04-26},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={\url{https://apartresearch.com}}
}

License

MIT License. See LICENSE.

Limitations and Dual-Use Considerations

KNN baseline outperforms FuncScreen on MPNN adversarial split in aggregate AUROC (0.997 vs 0.991); ablation shows this gap closes with dim=512 or τ=0.2
Elevated OOD false positive rate on transcription factors (24%) — calibration needed
LOSO CV: pore-forming toxin subcategory poorly detected when held out (0.615 AUROC)
Primarily evaluated on one threat family; RIP mini-experiment provides preliminary generalization evidence
Certified robustness uses 1,000 MC samples (production requires 100,000+)
ProteinMPNN variant generation demonstrates a known attack vector — disclosed to motivate defensive improvements
This work is intended to strengthen biosecurity screening infrastructure, not to enable evasion
