Plasmid Host Range Predictor (PlasPredict)

A machine learning-powered web application for predicting the host range of plasmids using k-mer composition analysis and conjugation system detection.

Overview

PlasPredict analyzes plasmid DNA sequences to predict their potential bacterial host range based on genomic features:

K-mer composition analysis - Uses k-mer frequency distribution (k=1-6) extracted from plasmid sequences
Machine learning prediction - XGBoost-based model trained on known plasmid-host relationships
Conjugation system detection - Identifies conjugative elements using HMM models from ConjScan
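The k-mer step can be sketched as follows. This is a minimal illustration rather than the project's actual feature code: the function names are hypothetical, and it assumes plain (non-canonical) k-mers over A/C/G/T with windows containing other characters ignored.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k):
    """Return relative frequencies of all 4**k k-mers found in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Only windows made purely of A/C/G/T contribute to the total.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def feature_vector(seq, k_min=1, k_max=6):
    """Concatenate k-mer frequencies for k = k_min..k_max (hypothetical layout)."""
    features = []
    for k in range(k_min, k_max + 1):
        freqs = kmer_frequencies(seq, k)
        features.extend(freqs[m] for m in sorted(freqs))
    return features


vec = feature_vector("ATGCGTACGTTAGC")
print(len(vec))  # 4 + 16 + 64 + 256 + 1024 + 4096 = 5460
```

Under these assumptions, k=1 to k=6 gives 5460 features per sequence. The real model may order, normalize, or collapse k-mers differently.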

Features

🧬 Sequence Upload - Support for FASTA/FASTA-like formats (.fna, .fasta, .fa, .txt)
🤖 ML Prediction - XGBoost model for host range prediction
🔍 Conjugation Detection - HMM-based identification of conjugative systems
📊 Result Visualization - Interactive display of predictions and detected systems
🐳 Docker Deployment - Container-ready with production-grade Gunicorn server
☁️ Cloud-Ready - GitHub Actions CI/CD for automated DockerHub publishing

Quick Start

Local Development

Prerequisites

Python 3.11+
System dependencies: prodigal, hmmer

Installation

# Clone the repository
git clone https://github.com/davidgllund/plaspredict.git
cd plaspredict

# Install system dependencies (Ubuntu/Debian)
sudo apt-get install -y prodigal hmmer

# Install Python dependencies
pip install -r app/requirements.txt

Running the Application

cd app
python app.py

The web application will be available at http://localhost:5000

Docker Deployment

Build Locally

docker build -f app/Dockerfile -t plaspredict:latest .
docker run -p 8000:8000 plaspredict:latest

Access at http://localhost:8000

Using DockerHub Image

docker pull davidgllund/plaspredict:latest
docker run -p 8000:8000 davidgllund/plaspredict:latest

Docker Compose

cd app
docker-compose up

Project Structure

plaspredict/
├── app/
│   ├── app.py                    # Main Flask application
│   ├── config.py                 # Configuration management
│   ├── requirements.txt           # Python dependencies
│   ├── Dockerfile                 # Multi-stage Docker build
│   ├── docker-compose.yml        # Docker Compose configuration
│   ├── start-script.sh           # Container entrypoint (Gunicorn)
│   ├── models/
│   │   ├── plaspredict_model.pkl # XGBoost trained model
│   │   └── conjscan_models/      # HMM models for conjugation detection
│   ├── static/
│   │   ├── style.css            # Application styling
│   │   ├── script.js            # Frontend JavaScript
│   │   └── plaspredict_logo.png # Logo
│   ├── templates/
│   │   └── index.html           # Web interface HTML
│   ├── uploads/                 # User-uploaded sequence files
│   ├── logs/                    # Application logs
│   └── tmp/                     # Temporary analysis files
├── .github/
│   └── workflows/
│       └── docker-build-publish.yml  # GitHub Actions CI/CD
├── DOCKER_PUBLISH_SETUP.md      # Docker/DockerHub setup guide
└── README.md                    # This file

Configuration

Environment Variables

# Flask environment (development/production)
export FLASK_ENV=production

# Secret key (change in production!)
export SECRET_KEY=your-secret-key-here
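
One way to generate a value for SECRET_KEY is Python's standard secrets module. A minimal sketch; the helper name and the 32-byte length are assumptions, not project requirements:

```python
import secrets


def generate_secret_key(n_bytes=32):
    """Return a random hex string suitable for use as a secret key."""
    return secrets.token_hex(n_bytes)


# Print a line that can be pasted into a shell or an env file.
print(f"export SECRET_KEY={generate_secret_key()}")
```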

Application Settings

Edit app/config.py to modify:

Maximum file upload size (default: 50MB)
Maximum sequence length (default: 1MB, i.e. 1,000,000 bp)
Prediction timeout (default: 120 seconds)
Logging level and paths
Model and HMM paths
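
For orientation, a configuration module covering the settings above might look roughly like this. It is a hypothetical sketch, not the contents of app/config.py: only MAX_CONTENT_LENGTH and ALLOWED_EXTENSIONS are names taken from this README, every other attribute name is illustrative, and 50MB is assumed to mean 50 * 1024 * 1024 bytes.

```python
import os


class Config:
    """Illustrative settings container; attribute names are mostly assumed."""

    # Read from the environment, with an obviously unsafe fallback.
    SECRET_KEY = os.environ.get("SECRET_KEY", "change-me")

    # Upload limits (names from the Troubleshooting section).
    MAX_CONTENT_LENGTH = 50 * 1024 * 1024  # 50MB upload cap
    ALLOWED_EXTENSIONS = {"fna", "fasta", "fa", "txt"}

    # Analysis limits (hypothetical names).
    MAX_SEQUENCE_LENGTH = 1_000_000  # base pairs
    PREDICTION_TIMEOUT = 120  # seconds

    # Logging and model locations (hypothetical names).
    LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
    MODEL_PATH = os.path.join("models", "plaspredict_model.pkl")
    HMM_DIR = os.path.join("models", "conjscan_models")


print(Config.MAX_CONTENT_LENGTH)
```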

Usage

Web Interface

Upload a sequence file - Select a FASTA/nucleotide file from your computer
Submit for analysis - Click the "Predict" button
View results:
- Host range predictions with confidence scores
- Detected conjugative elements and systems
- Sequence statistics and analysis details

API Endpoints

POST `/predict`

Submit a sequence for analysis.

Request:

curl -X POST -F "file=@plasmid.fasta" http://localhost:8000/predict

Response:

{
  "success": true,
  "predictions": {
    "Escherichia": 0.92,
    "Bacillus": 0.45,
    "Pseudomonas": 0.78
  },
  "conjugation_systems": [
    {
      "system": "B_traE",
      "type": "Type B",
      "genes": ["traE", "traF"]
    }
  ],
  "sequence_length": 5280,
  "gc_content": 0.52
}
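
The same request can be made from Python with only the standard library. The sketch below builds the multipart body by hand, posts it, and ranks the returned genera. The helper names, the 130-second client timeout, and the 0.5 score cut-off are assumptions for illustration; the response shape is taken from the example above.

```python
import json
import os
import uuid
from urllib import request


def build_multipart(field, filename, payload):
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def predict(fasta_path, base_url="http://localhost:8000"):
    """POST a FASTA file to /predict and return the decoded JSON response."""
    with open(fasta_path, "rb") as fh:
        body, ctype = build_multipart("file", os.path.basename(fasta_path), fh.read())
    req = request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": ctype},
        method="POST",
    )
    # Client timeout chosen slightly above the 120 s prediction timeout.
    with request.urlopen(req, timeout=130) as resp:
        return json.load(resp)


def rank_hosts(result, min_score=0.5):
    """Return (genus, score) pairs at or above min_score, highest first."""
    preds = result.get("predictions", {})
    ranked = sorted(preds.items(), key=lambda kv: kv[1], reverse=True)
    return [(genus, score) for genus, score in ranked if score >= min_score]
```

With a running server, `rank_hosts(predict("plasmid.fasta"))` would list the genera scoring at least 0.5.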

GET `/get-inc-types`

Retrieve all supported incompatibility (Inc) types.

Response:

{
  "inc_types": ["IncA", "IncB", "IncC", ...]
}
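
A small Python client for this endpoint, as a sketch: the function name and the injectable `opener` parameter are hypothetical conveniences (the latter lets the function be exercised without a running server), and the response shape is assumed to match the example above.

```python
import io
import json
from urllib import request


def fetch_inc_types(base_url="http://localhost:8000", opener=request.urlopen):
    """GET /get-inc-types and return the list under the "inc_types" key."""
    with opener(f"{base_url}/get-inc-types") as resp:
        return json.load(resp)["inc_types"]


# Offline demonstration with a stand-in for the HTTP response.
def fake_opener(url):
    return io.BytesIO(b'{"inc_types": ["IncA", "IncB"]}')


print(fetch_inc_types(opener=fake_opener))
```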

Model Details

XGBoost Predictor

Training data: Curated plasmid sequences with known host ranges
Features: K-mer frequency distributions (k=1 to k=6)
Output: Probability scores for bacterial host genera

Conjugation System Detection

Uses HMM models from ConjScan database:

Type B conjugation systems (tra genes)
Type C/G/F/FA/FATA/I systems
Gene prediction via Prodigal
HMM matching via HMMER
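
The Prodigal and HMMER steps could be chained roughly as below. This is a sketch under stated assumptions, not the application's pipeline: the function names, the `.hmm` file extension, Prodigal's `-p meta` mode, and the 1e-5 E-value cut-off are all illustrative choices.

```python
import os
import shutil
import subprocess
from pathlib import Path


def prodigal_cmd(fasta, proteins):
    """Command to predict genes and write protein translations."""
    return ["prodigal", "-i", str(fasta), "-a", str(proteins), "-p", "meta", "-q"]


def hmmsearch_cmd(hmm, proteins, table):
    """Command to search one HMM file against the predicted proteins."""
    return ["hmmsearch", "--noali", "--tblout", str(table),
            "-o", os.devnull, str(hmm), str(proteins)]


def parse_tblout(text, max_evalue=1e-5):
    """Extract (protein, profile, evalue) from hmmsearch --tblout text."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split()
        # Columns: target name, accession, query name, accession, E-value, ...
        protein, profile, evalue = cols[0], cols[2], float(cols[4])
        if evalue <= max_evalue:
            hits.append((protein, profile, evalue))
    return hits


def detect(fasta, hmm_dir, workdir):
    """Run Prodigal, then hmmsearch for each HMM file; return all hits."""
    for tool in ("prodigal", "hmmsearch"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    workdir = Path(workdir)
    proteins = workdir / "proteins.faa"
    subprocess.run(prodigal_cmd(fasta, proteins), check=True,
                   stdout=subprocess.DEVNULL)
    hits = []
    for hmm in sorted(Path(hmm_dir).glob("*.hmm")):
        table = workdir / f"{hmm.stem}.tbl"
        subprocess.run(hmmsearch_cmd(hmm, proteins, table), check=True)
        hits.extend(parse_tblout(table.read_text()))
    return hits
```

Grouping hits into named systems (which genes together constitute a given type) is a separate step that this sketch does not attempt.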

Docker & CI/CD

GitHub Actions Workflow

Automatically builds and publishes Docker images on:

Push to main - Tags: latest, commit SHA
Version tags (e.g., v1.0.0) - Semantic versioning tags
Pull requests - Build only (no push)

See DOCKER_PUBLISH_SETUP.md for detailed setup instructions.

Build Configuration

The Dockerfile uses a multi-stage build for optimal image size:

Builder stage - Compiles Python dependencies and system tools
Runtime stage - Minimal image with only necessary components

Key features:

Non-root user (UID 1000) for security
Gunicorn production server
System dependencies: prodigal, hmmer
Caches Python packages

Troubleshooting

502 Bad Gateway Error

Symptoms: Application shows a 502 error when accessed.

Solution:

Ensure Gunicorn is running on port 8000
Check logs: docker logs <container-id>
Verify all dependencies are installed

Conjugation Systems Not Detected

Symptoms: Prediction results show empty conjugation systems.

Possible causes:

HMM models directory not mounted/copied correctly
HMMER or Prodigal not installed in container
Sequence too short or too low quality

Solutions:

Verify model files exist: /home/appuser/app/models/conjscan_models/
Check container logs for errors
Test with a longer plasmid sequence

Model Loading Fails

Error: FileNotFoundError: Model not found at...

Solution:

Ensure plaspredict_model.pkl exists in app/models/
Verify file permissions are readable
Check paths in app/config.py

File Upload Fails

Error: Request entity too large or file not accepted.

Solutions:

Increase MAX_CONTENT_LENGTH in config.py (default: 50MB)
Ensure file extension is in ALLOWED_EXTENSIONS (.fna, .fasta, .fa, .txt)
Verify file size is under limit: ls -lh your_file.fasta
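
Both checks can be run locally before uploading. A minimal sketch, assuming the extension list and 50MB limit stated above; the function names are hypothetical, and the server's own validation may differ:

```python
import os

ALLOWED_EXTENSIONS = {"fna", "fasta", "fa", "txt"}
MAX_CONTENT_LENGTH = 50 * 1024 * 1024  # assumes 50MB means 50 MiB


def allowed_file(filename):
    """True if the filename has an accepted extension (case-insensitive)."""
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS


def preflight(path):
    """Return a list of problems that would likely cause the upload to fail."""
    problems = []
    if not allowed_file(os.path.basename(path)):
        problems.append("extension not in ALLOWED_EXTENSIONS")
    if os.path.getsize(path) > MAX_CONTENT_LENGTH:
        problems.append("file larger than MAX_CONTENT_LENGTH")
    return problems
```

An empty list from `preflight("your_file.fasta")` means neither limit is the likely cause.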

Development

Setting Up Development Environment

# Create virtual environment
python3.11 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r app/requirements.txt

# Install development tools
pip install pytest pytest-cov flake8

# Run tests
pytest

# Run linting
flake8 app/

Building Docker Image for Testing

# Build with tag
docker build -f app/Dockerfile -t plaspredict:dev .

# Run with volume mount for development
docker run -v $(pwd)/app:/home/appuser/app -p 8000:8000 plaspredict:dev

Performance Considerations

Prediction timeout: 120 seconds (configurable)
Max sequence length: 1MB (1,000,000 bp)
Gunicorn workers: 2 (configurable for your hardware)
File upload size: 50MB maximum
Memory usage: ~500MB-1GB per worker

For high-throughput deployments, consider:

Increasing worker count based on CPU cores
Using Kubernetes for horizontal scaling
Adding Redis for result caching
Load balancing with Nginx
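
As a rough sizing aid, the worker count can be bounded by both CPU and memory. The sketch below combines the (2 x cores) + 1 starting point suggested in the Gunicorn documentation with the per-worker memory estimate above; the function name and the 1GB-per-worker default are assumptions, and real limits depend on the model and workload.

```python
import os


def suggest_workers(cpu_count, mem_gb, per_worker_gb=1.0):
    """Pick a worker count limited by CPU cores and available memory."""
    by_cpu = 2 * cpu_count + 1
    by_mem = int(mem_gb // per_worker_gb)
    return max(1, min(by_cpu, by_mem))


# Example: size for this machine, assuming 4 GB is available to the app.
print(suggest_workers(os.cpu_count() or 1, mem_gb=4))
```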

Dependencies

System Requirements

Python 3.11 or higher
Prodigal (gene prediction)
HMMER3 (HMM-based sequence matching)

Python Packages

Flask 2.3.3 - Web framework
Gunicorn 21.2.0 - WSGI HTTP server
XGBoost 2.0.0 - Machine learning model
BioPython 1.81 - Biological sequence handling
NumPy 1.24.3 - Numerical computing
Pandas 2.0.3 - Data analysis
joblib 1.3.2 - Model serialization
Werkzeug 2.3.7 - WSGI utilities

Security Notes

Production Deployment

Change SECRET_KEY - Update in environment variables
Enable HTTPS - Use reverse proxy (Nginx, Caddy)
File Uploads - Secure temporary directory handling
Model Protection - Keep trained models in private storage
Logging - Monitor for suspicious activities
Rate Limiting - Consider adding request throttling
CORS - Configure if serving API to other domains

Container Security

Non-root user (appuser, UID 1000)
Read-only filesystems where possible
No hardcoded credentials
Regular dependency updates via GitHub Actions

License

[Add your license here]

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Support

For issues, questions, or suggestions:

Open a GitHub issue
Check existing documentation in the repository
Review troubleshooting section above

Citation

If you use PlasPredict in your research, please cite:

[Add citation information here]

Authors

David G. Lund - Initial development and deployment

Acknowledgments

ConjScan database for HMM models
BioPython community
Flask and XGBoost teams

Changelog

Version 1.0.0 (2026-04-28)

Initial release with Docker support
GitHub Actions CI/CD pipeline
Multi-stage Dockerfile optimization
Gunicorn production server
Comprehensive documentation

Last Updated: April 28, 2026
Repository: https://github.com/davidgllund/plaspredict
DockerHub: https://hub.docker.com/r/davidgllund/plaspredict
