MaCPepDB 2.0 - Mass Centric Peptide Database

Description

Creates a peptide database by digesting proteins stored in FASTA or UniProt text files.

Ambiguous amino acids

Some UniProt entries contain one-letter codes that encode multiple amino acids. Usually the encoded amino acids have a similar or equal mass. The ambiguous one-letter codes are:

B encodes D & N
J encodes I & L
Z encodes E & Q

Because the amino acids encoded by B & Z have different masses and only a few hundred entries contain these codes, MaCPepDB resolves the ambiguity by creating all possible combinations of the peptide with the distinct amino acids, e.g.:

ambiguous peptide	distinct peptides
`PE_B_TIDE_Z_K`	`PE_D_TIDE_E_K`
	`PE_D_TIDE_Q_K`
	`PE_N_TIDE_E_K`
	`PE_N_TIDE_Q_K`
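
The expansion in the table above can be sketched in a few lines of Python. This is a minimal illustration, not MaCPepDB's actual implementation: the names AMBIGUOUS_CODES and resolve_ambiguous are hypothetical, and the sketch only handles B and Z.

```python
from itertools import product
from typing import List

# Ambiguous one-letter codes and the distinct amino acids they encode.
# J and X are deliberately left out: J is not resolved and peptides with X
# are discarded. (Hypothetical mapping name, used for illustration only.)
AMBIGUOUS_CODES = {"B": "DN", "Z": "EQ"}


def resolve_ambiguous(sequence: str) -> List[str]:
    """Return every distinct peptide obtained by replacing B and Z."""
    # For each position, collect the residues it may stand for.
    options = [AMBIGUOUS_CODES.get(residue, residue) for residue in sequence]
    # The Cartesian product enumerates all combinations.
    return ["".join(combination) for combination in product(*options)]


if __name__ == "__main__":
    for peptide in resolve_ambiguous("PEBTIDEZK"):
        print(peptide)
```

Running the sketch on PEBTIDEZK prints the four distinct peptides from the table, in the same order.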

J encodes leucine and isoleucine, which have the same mass. Resolving J would not make the peptides easier to distinguish by mass.

In theory, X is also ambiguous, as it encodes any amino acid. In practice, far more entries contain X, sometimes with a high abundance of X. Resolving it would increase the number of peptides significantly and slow down MaCPepDB's search functionality. Because X has no defined mass, peptides containing it are discarded entirely.
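
To get a feeling for why X is not resolved, the following sketch counts how many distinct peptides a sequence would expand to. It is an illustration only: the function names are hypothetical, and the value 20 for X assumes the 20 standard amino acids.

```python
# Number of residues each ambiguous code could stand for.
# Assumption: X is counted as the 20 standard amino acids.
CANDIDATES_PER_CODE = {"B": 2, "Z": 2, "X": 20}


def count_resolved_peptides(sequence: str) -> int:
    """Count the distinct peptides a full resolution would produce."""
    count = 1
    for residue in sequence:
        count *= CANDIDATES_PER_CODE.get(residue, 1)
    return count


def is_storable(sequence: str) -> bool:
    """Peptides containing X are discarded, all others are kept."""
    return "X" not in sequence


if __name__ == "__main__":
    print(count_resolved_peptides("PEBTIDEZK"))  # 4
    print(count_resolved_peptides("PEXXXK"))     # 8000
    print(is_storable("PEXXXK"))                 # False
```

Under this assumption, three X residues already turn one peptide into 8000 candidates, while B and Z only double the count per occurrence.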

Dependencies

Only necessary for development and non-Docker installation

Git
Build tools (Ubuntu: build-essential, Arch Linux: base-devel)
C/C++ headers for PostgreSQL (Ubuntu: libpq-dev, Arch Linux: postgresql-libs)
C/C++ headers for libev (Ubuntu: libev-dev, Arch Linux: libev)
Rust Compiler
Docker & Docker Compose
Python 3.x
pyenv
pipenv

Development

Make sure pipenv finds pyenv

Prepare development environment

# Install correct python version and create environment
pipenv install -d

# Change to environment
pipenv shell

# Start the database
docker-compose up

# Run migrations
MACPEPDB_DB_URL=postgresql://postgres:developer@127.0.0.1:5433/macpepdb_dev pipenv run db-migrate

Use pipenv to install or uninstall Python modules

Running tests

TEST_MACPEPDB_URL=postgresql://postgres:developer@127.0.0.1:5433/macpepdb_dev pipenv run tests

Run the module's CLI

Run python -m macpepdb --help in the root folder of the repository.

Usage

Native installation

Update pip with pip install --upgrade pip, then run pip install -e git+https://github.com/mpc-bioinformatics/macpepdb.git@<MACPEPDB_GIT_TAG>#egg=MaCPepDB to install MaCPepDB. You can then use MaCPepDB by running python -m macpepdb. Appending --help shows the available command-line parameters.

Docker installation

Build a Docker image with docker build --tag macpepdb-py . and use it to start a container with docker run -it --rm macpepdb-py --help. To access your files in the container, mount them to /usr/src/macpepdb/data with -v YOUR_DATA_FOLDER:/usr/src/macpepdb/data (add it before macpepdb-py). Keep in mind that you are working in a container, so all file paths refer to paths inside the container.
If you intend to create a protein/peptide database and your PostgreSQL server is running in a Docker container too, make sure that both the PostgreSQL server and the MaCPepDB container have access to the same Docker network by adding --network=YOUR_DOCKER_NETWORK (before macpepdb-py).

Building a database

Prepare the database

Follow the Citus documentation to set up a Citus cluster.
Run psql -h <CITUS_CONTROLLER> -U <DB_USER> -c "ALTER DATABASE <DB_NAME> SET citus.multi_shard_modify_mode = 'sequential';" and psql -h <CITUS_CONTROLLER> -U <DB_USER> -c "ALTER DATABASE <DB_NAME> SET citus.shard_count = 100;" to configure the database.
Run MACPEPDB_DB_URL=postgresql://<USER>:<PASSWORD>@<HOST>:<PORT>/<DATABASE> alembic upgrade head. If you use the Docker container, run the command in a temporary container started with docker run --rm -it macpepdb sh

Fill the database

First create a work folder with the following structure:

|_ work_dir
   |__ protein_data
   |__ taxonomy_data
   |__ logs

Place your protein data files, in UniProt text format, as .dat or .txt files in the protein_data folder. If you would like to use the web interface as well, download taxdump.zip from NCBI and put the contained .dmp files in the taxonomy_data folder.
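
The folder structure above can also be created programmatically. The following is a small sketch using only the Python standard library; the helper name create_work_dir is hypothetical and not part of MaCPepDB.

```python
from pathlib import Path

# Subfolders expected inside the work folder, as listed above.
SUBFOLDERS = ("protein_data", "taxonomy_data", "logs")


def create_work_dir(root) -> Path:
    """Create the work folder and its subfolders if they do not exist yet."""
    root = Path(root)
    for name in SUBFOLDERS:
        # parents=True also creates the work folder itself when missing.
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


if __name__ == "__main__":
    work_dir = create_work_dir("work_dir")
    print(sorted(path.name for path in work_dir.iterdir()))
```

Calling the helper a second time is harmless, because existing folders are left untouched.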

Then start the database maintenance job with python -m macpepdb database .... Run python -m macpepdb database --help to see the required arguments. Remember to use the container-internal paths when using a Docker container.

WebAPI

Create a new config file with the default config

python -m macpepdb web write-config-file <PATH_TO_CONFIG_YAML>

Adjust the YAML file to your needs. Then start the WebAPI with

python -m macpepdb web serve -e production -c <PATH_TO_CONFIG_YAML>

For high availability in production, start multiple WebAPI instances and combine them with NGINX (see nginx.example.conf).

Upgrading

1.x to 2.x

Due to changes to the database schema and the database engine, version 2.x is not compatible with version 1.x. You have to recreate the database.

Citation and Publication

MaCPepDB: A Database to Quickly Access All Tryptic Peptides of the UniProtKB
Julian Uszkoreit, Dirk Winkelhardt, Katalin Barkovits, Maximilian Wulf, Sascha Roocke, Katrin Marcus, and Martin Eisenacher
Journal of Proteome Research 2021 20 (4), 2145-2150
DOI: 10.1021/acs.jproteome.0c00967
