A machine learning-powered web application for predicting the host range of plasmids using k-mer composition analysis and conjugation system detection.
PlasPredict analyzes plasmid DNA sequences to predict their potential bacterial host range based on genomic features:
- K-mer composition analysis - Uses k-mer frequency distribution (k=1-6) extracted from plasmid sequences
- Machine learning prediction - XGBoost-based model trained on known plasmid-host relationships
- Conjugation system detection - Identifies conjugative elements using HMM models from ConjScan
- 🧬 Sequence Upload - Support for FASTA/FASTA-like formats (.fna, .fasta, .fa, .txt)
- 🤖 ML Prediction - XGBoost model for host range prediction
- 🔍 Conjugation Detection - HMM-based identification of conjugative systems
- 📊 Result Visualization - Interactive display of predictions and detected systems
- 🐳 Docker Deployment - Container-ready with production-grade Gunicorn server
- ☁️ Cloud-Ready - GitHub Actions CI/CD for automated DockerHub publishing
- Python 3.11+
- System dependencies: prodigal, hmmer
# Clone the repository
git clone https://github.com/davidgllund/plaspredict.git
cd plaspredict
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install -y prodigal hmmer
# Install Python dependencies
pip install -r app/requirements.txt
cd app
python app.py
The web application will be available at http://localhost:5000
docker build -f app/Dockerfile -t plaspredict:latest .
docker run -p 8000:8000 plaspredict:latest
Access at http://localhost:8000
docker pull davidgllund/plaspredict:latest
docker run -p 8000:8000 davidgllund/plaspredict:latest
cd app
docker-compose up
plaspredict/
├── app/
│ ├── app.py # Main Flask application
│ ├── config.py # Configuration management
│ ├── requirements.txt # Python dependencies
│ ├── Dockerfile # Multi-stage Docker build
│ ├── docker-compose.yml # Docker Compose configuration
│ ├── start-script.sh # Container entrypoint (Gunicorn)
│ ├── models/
│ │ ├── plaspredict_model.pkl # XGBoost trained model
│ │ └── conjscan_models/ # HMM models for conjugation detection
│ ├── static/
│ │ ├── style.css # Application styling
│ │ ├── script.js # Frontend JavaScript
│ │ └── plaspredict_logo.png # Logo
│ ├── templates/
│ │ └── index.html # Web interface HTML
│ ├── uploads/ # User-uploaded sequence files
│ ├── logs/ # Application logs
│ └── tmp/ # Temporary analysis files
├── .github/
│ └── workflows/
│ └── docker-build-publish.yml # GitHub Actions CI/CD
├── DOCKER_PUBLISH_SETUP.md # Docker/DockerHub setup guide
└── README.md # This file
# Flask environment (development/production)
export FLASK_ENV=production
# Secret key (change in production!)
export SECRET_KEY=your-secret-key-here
Edit app/config.py to modify:
- Maximum file upload size (default: 50MB)
- Maximum sequence length (default: 1MB, i.e. 1,000,000 bp)
- Prediction timeout (default: 120 seconds)
- Logging level and paths
- Model and HMM paths
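As an illustration of how these settings might be organized, here is a minimal sketch of a configuration class. Only `MAX_CONTENT_LENGTH`, `ALLOWED_EXTENSIONS`, `SECRET_KEY`, the default values, and the file locations come from this README. The class name and the remaining attribute names are assumptions and may differ from the actual app/config.py.

```python
import os
from pathlib import Path


class Config:
    """Hypothetical settings object; attribute names are illustrative only."""

    # Read the secret from the environment; the fallback is for local development only.
    SECRET_KEY = os.environ.get("SECRET_KEY", "dev-only-change-me")

    # Upload limits (MAX_CONTENT_LENGTH is the standard Flask setting for request size).
    MAX_CONTENT_LENGTH = 50 * 1024 * 1024  # 50MB
    ALLOWED_EXTENSIONS = {"fna", "fasta", "fa", "txt"}

    # Assumed names for the remaining documented defaults.
    MAX_SEQUENCE_LENGTH = 1_000_000  # base pairs
    PREDICTION_TIMEOUT = 120  # seconds

    # Assumed logging settings.
    LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
    LOG_DIR = Path("logs")

    # Paths taken from the project layout, relative to the app/ directory.
    MODEL_PATH = Path("models") / "plaspredict_model.pkl"
    HMM_DIR = Path("models") / "conjscan_models"
```

Reading `SECRET_KEY` from the environment keeps credentials out of the source tree, which matches the environment variables described above.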
- Upload a sequence file - Select a FASTA/nucleotide file from your computer
- Submit for analysis - Click the "Predict" button
- View results:
- Host range predictions with confidence scores
- Detected conjugative elements and systems
- Sequence statistics and analysis details
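The same results are returned as JSON by the /predict endpoint documented in this README. The following is a minimal sketch of ranking host genera from such a response using only the standard library. The helper name `rank_hosts` and the 0.5 threshold are assumptions, and the sample payload mirrors the README's example response rather than real model output.

```python
import json


def rank_hosts(response_text, threshold=0.5):
    """Return (genus, score) pairs at or above the threshold, best first."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("prediction request did not succeed")
    hits = [
        (genus, score)
        for genus, score in payload["predictions"].items()
        if score >= threshold
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Sample payload copied from the documented example response.
sample = (
    '{"success": true, "predictions": '
    '{"Escherichia": 0.92, "Bacillus": 0.45, "Pseudomonas": 0.78}}'
)
print(rank_hosts(sample))  # [('Escherichia', 0.92), ('Pseudomonas', 0.78)]
```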
Submit a sequence for analysis.
Request:
curl -X POST -F "file=@plasmid.fasta" http://localhost:8000/predict
Response:
{
"success": true,
"predictions": {
"Escherichia": 0.92,
"Bacillus": 0.45,
"Pseudomonas": 0.78
},
"conjugation_systems": [
{
"system": "B_traE",
"type": "Type B",
"genes": ["traE", "traF"]
}
],
"sequence_length": 5280,
"gc_content": 0.52
}
Retrieve all supported incompatibility (Inc) types.
Response:
{
"inc_types": ["IncA", "IncB", "IncC", ...]
}
- Training data: Curated plasmid sequences with known host ranges
- Features: K-mer frequency distributions (k=1 to k=6)
- Output: Probability scores for bacterial host genera
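To make the feature description concrete, here is a minimal sketch of building a k-mer frequency vector for k=1 to k=6. It is not the project's actual feature extraction: the function names, the per-k normalization, and the choice to skip windows containing non-ACGT symbols are all assumptions, and the real pipeline may differ (for example by counting both strands).

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k):
    """Return relative frequencies of all 4**k DNA k-mers in lexicographic order."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made of A, C, G, T are counted; windows with N etc. are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def feature_vector(sequence, k_values=range(1, 7)):
    """Concatenate the frequency vectors for every k in k_values."""
    features = []
    for k in k_values:
        features.extend(kmer_frequencies(sequence, k))
    return features


vec = feature_vector("ATGCGTACGTTAGC")
print(len(vec))  # 4 + 16 + 64 + 256 + 1024 + 4096 = 5460
```

A fixed lexicographic ordering keeps feature positions stable across sequences, which a tabular model such as XGBoost needs.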
Uses HMM models from ConjScan database:
- Type B conjugation systems (tra genes)
- Type C/G/F/FA/FATA/I systems
- Gene prediction via Prodigal
- HMM matching via HMMER
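The two-step flow above (gene prediction, then HMM matching) can be sketched as follows. This is an illustration under stated assumptions, not the application's code: the function names, the output file names, the use of Prodigal's `-p meta` procedure, and the 1e-5 E-value cutoff are all hypothetical. It also stops at per-protein hits and does not reproduce ConjScan's system-level classification.

```python
import os
import subprocess
from pathlib import Path


def prodigal_command(fasta, proteins):
    """Build a Prodigal call that writes predicted proteins to a FASTA file."""
    # -p meta selects the metagenomic procedure (an assumption here); -q silences logging.
    return ["prodigal", "-i", str(fasta), "-a", str(proteins),
            "-o", os.devnull, "-p", "meta", "-q"]


def hmmsearch_command(hmm, proteins, table):
    """Build an hmmsearch call that writes per-sequence hits in tabular form."""
    return ["hmmsearch", "--tblout", str(table), "-o", os.devnull,
            str(hmm), str(proteins)]


def parse_tblout(text, max_evalue=1e-5):
    """Parse HMMER --tblout text into a list of hits at or below max_evalue."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        # Columns: target name, accession, query name, accession, E-value, score, ...
        evalue = float(fields[4])
        if evalue <= max_evalue:
            hits.append({"protein": fields[0], "profile": fields[2],
                         "evalue": evalue, "score": float(fields[5])})
    return hits


def detect(fasta, hmm, workdir):
    """Run both tools and return the filtered hits (requires prodigal and hmmsearch)."""
    workdir = Path(workdir)
    proteins, table = workdir / "proteins.faa", workdir / "hits.tbl"
    subprocess.run(prodigal_command(fasta, proteins), check=True)
    subprocess.run(hmmsearch_command(hmm, proteins, table), check=True)
    return parse_tblout(table.read_text())
```

Passing argument lists to `subprocess.run` rather than a shell string avoids quoting problems with uploaded file names.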
Automatically builds and publishes Docker images on:
- Push to main - Tags: latest, commit SHA
- Version tags (e.g., v1.0.0) - Semantic versioning tags
- Pull requests - Build only (no push)
See DOCKER_PUBLISH_SETUP.md for detailed setup instructions.
The Dockerfile uses a multi-stage build for optimal image size:
- Builder stage - Compiles Python dependencies and system tools
- Runtime stage - Minimal image with only necessary components
Key features:
- Non-root user (UID 1000) for security
- Gunicorn production server
- System dependencies: prodigal, hmmer
- Caches Python packages
Symptoms: Application shows 502 error when accessed
Solution:
- Ensure Gunicorn is running on port 8000
- Check logs: docker logs <container-id>
- Verify all dependencies are installed
Symptoms: Prediction results show empty conjugation systems
Possible causes:
- HMM models directory not mounted/copied correctly
- HMMER or Prodigal not installed in container
- Sequence too short or too low quality
Solutions:
- Verify model files exist: /home/appuser/app/models/conjscan_models/
- Check container logs for errors
- Test with a longer plasmid sequence
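The first two causes can be checked from inside the container with a short script. This is a minimal sketch: the function name `check_environment` is hypothetical, and the `*.hmm` file pattern is an assumption about how the ConjScan models are stored.

```python
import shutil
from pathlib import Path


def check_environment(hmm_dir="/home/appuser/app/models/conjscan_models"):
    """Report whether the external tools and the HMM directory are available."""
    hmm_path = Path(hmm_dir)
    dir_exists = hmm_path.is_dir()
    return {
        # shutil.which returns None when the executable is not on PATH.
        "prodigal": shutil.which("prodigal") is not None,
        "hmmsearch": shutil.which("hmmsearch") is not None,
        "hmm_dir_exists": dir_exists,
        # Assumes model files use the .hmm extension.
        "hmm_file_count": len(list(hmm_path.glob("*.hmm"))) if dir_exists else 0,
    }


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

Any `False` value or a file count of zero points to the corresponding cause in the list above.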
Error: FileNotFoundError: Model not found at...
Solution:
- Ensure plaspredict_model.pkl exists in app/models/
- Verify file permissions are readable
- Check paths in app/config.py
Error: Request entity too large or file not accepted
Solutions:
- Increase MAX_CONTENT_LENGTH in config.py (max 50MB by default)
- Ensure file extension is in ALLOWED_EXTENSIONS (.fna, .fasta, .fa, .txt)
- Verify file size is under limit: ls -lh your_file.fasta
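To see why a file might be rejected, it can help to reproduce the two checks locally. The sketch below assumes the application validates uploads by extension and size in roughly this way; the helper names `allowed_file` and `within_size_limit` are hypothetical, and the values are the documented defaults.

```python
ALLOWED_EXTENSIONS = {"fna", "fasta", "fa", "txt"}
MAX_CONTENT_LENGTH = 50 * 1024 * 1024  # 50MB default


def allowed_file(filename):
    """Return True if the file name has an extension in ALLOWED_EXTENSIONS."""
    # rsplit keeps only the final extension and the comparison is case-insensitive.
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS


def within_size_limit(size_bytes):
    """Return True for a non-empty upload that fits within MAX_CONTENT_LENGTH."""
    return 0 < size_bytes <= MAX_CONTENT_LENGTH


print(allowed_file("plasmid.FASTA"))  # True
print(allowed_file("plasmid.gb"))  # False
```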
# Create virtual environment
python3.11 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r app/requirements.txt
# Install development tools
pip install pytest pytest-cov flake8
# Run tests
pytest
# Run linting
flake8 app/
# Build with tag
docker build -f app/Dockerfile -t plaspredict:dev .
# Run with volume mount for development
docker run -v $(pwd)/app:/home/appuser/app -p 8000:8000 plaspredict:dev
- Prediction timeout: 120 seconds (configurable)
- Max sequence length: 1MB (1,000,000 bp)
- Gunicorn workers: 2 (configurable for your hardware)
- File upload size: 50MB maximum
- Memory usage: ~500MB-1GB per worker
For high-throughput deployments, consider:
- Increasing worker count based on CPU cores
- Using Kubernetes for horizontal scaling
- Adding Redis for result caching
- Load balancing with Nginx
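For the first suggestion, worker count can be derived from both CPU cores and available memory. The following is a sketch of a Gunicorn configuration file (these are plain Python modules). The `worker_count` helper and the `MEMORY_GB` environment variable are hypothetical, the `2 * cores + 1` starting point is the rule of thumb from the Gunicorn documentation, and the 1GB-per-worker figure is the upper end of the memory estimate above.

```python
import multiprocessing
import os


def worker_count(cpu_cores, memory_gb, gb_per_worker=1.0):
    """Pick a worker count bounded by both CPU cores and available memory."""
    by_cpu = 2 * cpu_cores + 1
    by_memory = max(1, int(memory_gb // gb_per_worker))
    return min(by_cpu, by_memory)


# Gunicorn reads these module-level names as settings.
bind = "0.0.0.0:8000"
workers = worker_count(
    multiprocessing.cpu_count(),
    float(os.environ.get("MEMORY_GB", "2")),  # hypothetical variable, defaults to 2GB
)
timeout = 120  # seconds, matching the documented prediction timeout
```

Saved as, for example, gunicorn.conf.py, it could be loaded with `gunicorn -c gunicorn.conf.py app:app`, assuming the Flask object in app.py is named `app`.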
- Python 3.11 or higher
- Prodigal (gene prediction)
- HMMER3 (HMM-based sequence matching)
- Flask 2.3.3 - Web framework
- Gunicorn 21.2.0 - WSGI HTTP server
- XGBoost 2.0.0 - Machine learning model
- BioPython 1.81 - Biological sequence handling
- NumPy 1.24.3 - Numerical computing
- Pandas 2.0.3 - Data analysis
- joblib 1.3.2 - Model serialization
- Werkzeug 2.3.7 - WSGI utilities
- Change SECRET_KEY - Update in environment variables
- Enable HTTPS - Use reverse proxy (Nginx, Caddy)
- File Uploads - Secure temporary directory handling
- Model Protection - Keep trained models in private storage
- Logging - Monitor for suspicious activities
- Rate Limiting - Consider adding request throttling
- CORS - Configure if serving API to other domains
- Non-root user (appuser, UID 1000)
- Read-only filesystems where possible
- No hardcoded credentials
- Regular dependency updates via GitHub Actions
[Add your license here]
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
For issues, questions, or suggestions:
- Open a GitHub issue
- Check existing documentation in the repository
- Review troubleshooting section above
If you use PlasPredict in your research, please cite:
[Add citation information here]
- David G. Lund - Initial development and deployment
- ConjScan database for HMM models
- BioPython community
- Flask and XGBoost teams
- Initial release with Docker support
- GitHub Actions CI/CD pipeline
- Multi-stage Dockerfile optimization
- Gunicorn production server
- Comprehensive documentation
Last Updated: April 28, 2026
Repository: https://github.com/davidgllund/plaspredict
DockerHub: https://hub.docker.com/r/davidgllund/plaspredict