Web app for protein sequence design using ProteinMPNN. Upload a PDB structure, pick chains, and get back designed sequences with mutations highlighted.
Prerequisites: Python 3, Docker, and Docker Compose.
```bash
# 1. Clone ProteinMPNN into vendor/ (not tracked in git; or add it as a git submodule)
git clone https://github.com/dauparas/ProteinMPNN vendor/ProteinMPNN

# 2. Download model weights (~6.4 MB, one-time, no pip install needed)
python3 scripts/download_weights.py

# 3. Start the service
docker compose up --build
```

Open http://localhost:8000 in a browser. The service refuses to start if weights are missing. If `/design` returns a 500 error, the `vendor/ProteinMPNN` clone is likely missing.
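For the curious, `scripts/download_weights.py` amounts to a small fetch-if-missing script. A sketch of the shape, with a placeholder URL and destination (the real script defines the actual ones):

```python
# scripts/download_weights.py -- sketch; URL and destination are placeholders
import urllib.request
from pathlib import Path

WEIGHTS_URL = "https://example.org/proteinmpnn/v_48_020.pt"  # hypothetical
DEST = Path("vendor/ProteinMPNN/vanilla_model_weights/v_48_020.pt")

def main() -> None:
    if DEST.exists():
        print(f"weights already present at {DEST}")
        return
    DEST.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(WEIGHTS_URL, str(DEST))
    print(f"downloaded {DEST.stat().st_size:,} bytes to {DEST}")

if __name__ == "__main__":
    main()
```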
- Drop a `.pdb` file onto the upload zone (or click to browse)
- Enter chain IDs (e.g. `A` or `A, B`) and the number of sequences (1-10)
- Click Design and wait ~30-60s for CPU inference
- Results show the native sequence and each design with mutations highlighted in red (a simple positional diff, sketched below)
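The mutation highlighting is just a positional comparison against the native sequence. The UI does this in vanilla JavaScript; the same idea in Python, with an illustrative function name:

```python
def mutated_positions(native: str, designed: str) -> list[int]:
    """Indices where a designed sequence differs from the native one."""
    return [i for i, (a, b) in enumerate(zip(native, designed)) if a != b]

# Example: two point mutations, at positions 1 and 3
assert mutated_positions("MQIF", "MKIV") == [1, 3]
```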
Project layout:

```
app/
  main.py              FastAPI app, routes, static file mount
  schemas.py           Pydantic request/response models
  dependencies.py      Upload handling, param parsing
  config.py            Paths and constants
  proteinmpnn/
    wrapper.py         Runs ProteinMPNN as a subprocess
    parser.py          Parses FASTA output (native + designed sequences)
  validation/
    pdb.py             PDB structure validation (BioPython)
  static/
    index.html         Single-page UI (vanilla HTML/JS)
tests/                 pytest suite
scripts/               Weight download, debug utilities
test_pdbs/             Sample PDB files (1UBQ, 1A3N, 2LZM, 6MRR)
vendor/ProteinMPNN/    Vendored ProteinMPNN repo
```
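Of these, `wrapper.py` is the architectural hinge: it shells out to the vendored repo rather than importing it. A minimal sketch of that pattern, assuming the upstream `protein_mpnn_run.py` entry point and its output conventions (the real wrapper handles more options and error cases):

```python
# app/proteinmpnn/wrapper.py -- sketch of the subprocess pattern
import subprocess
from pathlib import Path

MPNN_SCRIPT = Path("vendor/ProteinMPNN/protein_mpnn_run.py")

def run_design(pdb_path: Path, chains: list[str], num_sequences: int,
               out_dir: Path, timeout_s: int = 300) -> Path:
    """Invoke ProteinMPNN on one structure and return the output FASTA path."""
    cmd = [
        "python3", str(MPNN_SCRIPT),
        "--pdb_path", str(pdb_path),
        "--pdb_path_chains", " ".join(chains),
        "--num_seq_per_target", str(num_sequences),
        "--out_folder", str(out_dir),
    ]
    # check=True surfaces non-zero exits; the timeout maps to the API's 504
    subprocess.run(cmd, check=True, capture_output=True, timeout=timeout_s)
    # Upstream writes one FASTA per input under <out_folder>/seqs/
    return out_dir / "seqs" / f"{pdb_path.stem}.fa"
```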
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Web UI |
| `/health` | GET | Liveness probe (`{"status": "ok", ...}`) |
| `/design` | POST | Run sequence design (multipart form data) |
| `/docs` | GET | Interactive Swagger docs |
Request (`multipart/form-data`):

- `pdb_file`: PDB file upload
- `chains`: JSON array of chain IDs, e.g. `["A"]`
- `num_sequences`: integer, 1-10 (default 5)
Response:

```json
{
  "status": "success",
  "metadata": {
    "num_residues": 76,
    "chains": ["A"],
    "num_sequences": 5
  },
  "native_sequence": "MQIFVKTL...",
  "sequences": ["MQIFVKTL...", "..."]
}
```

Errors: `400` (bad PDB or params), `504` (timeout), `500` (internal).
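One wrinkle worth noting: `chains` arrives as a plain form string, so the server has to decode it from JSON itself. A sketch of what the `/design` contract implies (the real handler in `app/main.py` will differ in its details):

```python
# Sketch of the /design contract; the real route lives in app/main.py
import json
from fastapi import FastAPI, File, Form, HTTPException, UploadFile

app = FastAPI()

@app.post("/design")
async def design(
    pdb_file: UploadFile = File(...),
    chains: str = Form(...),       # JSON-encoded, e.g. '["A"]'
    num_sequences: int = Form(5),  # 1-10, default 5
):
    try:
        chain_ids = json.loads(chains)
    except json.JSONDecodeError:
        raise HTTPException(status_code=400, detail="chains must be a JSON array")
    if not 1 <= num_sequences <= 10:
        raise HTTPException(status_code=400, detail="num_sequences must be 1-10")
    ...  # validate the PDB, run ProteinMPNN, parse FASTA, build the response
```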
For reference:

```bash
# Health check
curl localhost:8000/health

# Design 3 sequences for chain A of ubiquitin
curl -X POST localhost:8000/design \
  -F "pdb_file=@test_pdbs/1UBQ.pdb" \
  -F 'chains=["A"]' \
  -F "num_sequences=3"

# Multi-chain design
curl -X POST localhost:8000/design \
  -F "pdb_file=@test_pdbs/1A3N.pdb" \
  -F 'chains=["A","B"]' \
  -F "num_sequences=2"
```

```bash
# Run tests
pytest tests/

# Lint
ruff check .
```

- CPU-only inference (no GPU required)
- Stateless — no database, no sessions
- Single-page vanilla HTML/JS frontend
If I circle back to the project, I'm excited about adding:
Long PDB submissions under CPU-only inference are going to drag, and users of the app currently have no visibility into what's going on. To tackle this we could run designs on background workers (coordinated with a combo like Redis + RQ, or Celery) and let the frontend poll a job-status endpoint; see the sketch below.
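A sketch of that flow with RQ, reusing the `run_design` wrapper sketched earlier (the `/jobs/{id}` polling endpoint is hypothetical):

```python
# Hypothetical async variant: enqueue designs instead of running them inline
from redis import Redis
from rq import Queue
from rq.job import Job

from app.proteinmpnn.wrapper import run_design  # the subprocess wrapper

redis = Redis()
queue = Queue(connection=redis)

def submit_design(pdb_path, chains, num_sequences, out_dir) -> str:
    """Enqueue a design job and return its id; the request returns immediately."""
    job = queue.enqueue(run_design, pdb_path, chains, num_sequences, out_dir,
                        job_timeout=600)
    return job.id

def job_status(job_id: str) -> dict:
    """What a GET /jobs/{id} endpoint would return for the frontend to poll."""
    job = Job.fetch(job_id, connection=redis)
    return {"id": job.id, "status": job.get_status()}
```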
Scientist users would appreciate feedback on the quality of their submission and on the model's confidence in its output. We could fairly easily surface warnings for inputs with steric clashes or with a fraction of incomplete residues above some threshold (sketched below), and then work in reporting the confidence ProteinMPNN has in different regions of each returned sequence.
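For the incomplete-residue check, a BioPython sketch (the 5% threshold is illustrative):

```python
# Sketch: flag structures where too many residues are missing backbone atoms
from Bio.PDB import PDBParser

BACKBONE = {"N", "CA", "C", "O"}

def incomplete_residue_fraction(pdb_path: str) -> float:
    """Fraction of standard residues missing at least one backbone atom."""
    structure = PDBParser(QUIET=True).get_structure("query", pdb_path)
    residues = [r for r in structure.get_residues() if r.id[0] == " "]
    incomplete = sum(
        1 for r in residues
        if not BACKBONE <= {atom.get_name() for atom in r}
    )
    return incomplete / len(residues) if residues else 0.0

# Illustrative threshold: warn above 5% incomplete residues
if incomplete_residue_fraction("test_pdbs/1UBQ.pdb") > 0.05:
    print("warning: structure has many incomplete residues")
```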
In a real-world setting, the value of this app would be multiplied many times over by bulk batch jobs. Scientists would likely submit large numbers of PDBs and expect to come back later to batched progress on them. To enable this, we'd revisit the single-subprocess decision and probably spread model invocations over many parallel processes per batch.
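Because each design is already an isolated subprocess, fanning a batch out over a pool is mostly bookkeeping. A sketch with `ThreadPoolExecutor` (threads suffice here since the heavy work happens in child processes), again reusing the hypothetical `run_design`:

```python
# Sketch: batch fan-out over the subprocess wrapper sketched earlier
from concurrent.futures import ThreadPoolExecutor, as_completed

from app.proteinmpnn.wrapper import run_design

def run_batch(jobs, max_workers=4):
    """jobs: iterable of (pdb_path, chains, num_sequences, out_dir) tuples."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_design, *job): job[0] for job in jobs}
        for fut in as_completed(futures):
            pdb = futures[fut]
            try:
                results[pdb] = fut.result()
            except Exception as exc:  # keep the batch going on per-job failure
                errors[pdb] = exc
    return results, errors
```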