Deep feed-forward neural network for predicting amino acid sequences from protein conformations.
- Python 3.9
- PyTorch
- NumPy
- scikit-learn
- Matplotlib
- SciPy
- Biopython
We recommend using conda to install the required Python packages in a contained environment:
- Import the SeqPredNN environment using the SeqPredNN_environment.yml file
  conda env create -n SeqPredNN -f SeqPredNN_environment.yml
- Activate the conda environment before using SeqPredNN
  conda activate SeqPredNN
- Prepare input files
  - Predicting an amino acid sequence for a set of protein structures requires:
    - a directory containing the .pdb format files of your protein structures
    - a comma-separated list of protein names, PDB file paths within the above-mentioned directory, and protein chain IDs, with one row per protein chain. For example, the row for chain B of protein 1HST in the file /examples/example_pdb_directory/1hst.pdb.gz would read "1HST,1hst.pdb.gz,B"
    - the neural network parameters of the trained sequence prediction model
  - Examples of a chain list and PDB directory are given in /examples/
  - We validated SeqPredNN using the pretrained SeqPredNN model parameters and recommend that you use these parameters to generate protein sequences.
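A chain list in this format can also be written programmatically. The following is a minimal sketch, assuming the file has no header row and that every row holds exactly the three fields described above. The helper name format_chain_list is hypothetical and is not part of SeqPredNN.

```python
import csv
import io


def format_chain_list(chains):
    """Return chain-list CSV text with one "name,pdb_file,chain_id" row per chain.

    `chains` is an iterable of (protein_name, pdb_filename, chain_id) tuples.
    Assumption: no header row, as in the single-row example in the text.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for protein_name, pdb_filename, chain_id in chains:
        writer.writerow([protein_name, pdb_filename, chain_id])
    return buffer.getvalue()


if __name__ == "__main__":
    # Chain B of 1HST, stored as 1hst.pdb.gz in the PDB directory.
    print(format_chain_list([("1HST", "1hst.pdb.gz", "B")]), end="")
```

The returned text can be saved to a .csv file and passed to featurise.py as the chain list.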
- Generate structural features for your protein structures using featurise.py
  python SeqPredNN/featurise.py -gm -o example_features example/example_chain_list.csv example/example_pdb_directory
  - The -gm argument indicates that the structure files are gzipped and should be uncompressed before they are parsed (-g), and that modified amino acids should be converted to the appropriate unmodified standard amino acid (-m)
  - The -o argument indicates the directory where the structural features will be saved (in this case, example_features/)
  - There are two positional arguments:
    - the chain list
    - the PDB directory
  - For additional command line arguments run
    python SeqPredNN/featurise.py --help
- Predict amino acid sequences using predict.py
  python SeqPredNN/predict.py -p example_features example_features/chain_list.txt pretrained_model/pretrained_parameters.pth
  - Prediction-only mode (-p) only predicts sequences and does not evaluate the model by comparing predicted sequences with the original sequences
  - There are three positional arguments:
    - the directory where the features are saved (here example_features)
    - a newline-separated text file listing all the protein chains to be predicted (chain_list.txt lists all the featurised chains and is generated automatically in the feature directory)
    - the neural network parameters (here pretrained_model/pretrained_parameters.pth)
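The featurise and predict steps can be chained from a single script. The sketch below only assembles the two command lines shown above and runs them with subprocess. The function names are hypothetical, and the script assumes it is started from the repository root so that SeqPredNN/featurise.py and SeqPredNN/predict.py resolve. Only the flags documented in this README (-g, -m, -o, -p) are used.

```python
import subprocess
import sys


def featurise_command(chain_list, pdb_directory, feature_directory):
    """Build the featurise.py command for gzipped structures (-g) with
    modified residues converted to standard amino acids (-m)."""
    return [sys.executable, "SeqPredNN/featurise.py", "-gm",
            "-o", feature_directory, chain_list, pdb_directory]


def predict_command(feature_directory, parameters, prediction_only=True):
    """Build the predict.py command using the chain_list.txt file that
    featurise.py generates in the feature directory."""
    command = [sys.executable, "SeqPredNN/predict.py"]
    if prediction_only:
        command.append("-p")  # predict only, skip the evaluation
    command += [feature_directory,
                f"{feature_directory}/chain_list.txt",
                parameters]
    return command


def run_pipeline(chain_list, pdb_directory, feature_directory, parameters):
    """Run featurisation and then prediction, stopping at the first failure."""
    for command in (
        featurise_command(chain_list, pdb_directory, feature_directory),
        predict_command(feature_directory, parameters),
    ):
        subprocess.run(command, check=True)
```

For the example data, `run_pipeline("example/example_chain_list.csv", "example/example_pdb_directory", "example_features", "pretrained_model/pretrained_parameters.pth")` issues the same two commands as above.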
- Download the PDB files of the structures in your training dataset from https://www.wwpdb.org/ftp/pdb-ftp-sites
- Generate structural features for the proteins using featurise.py, e.g.
  python SeqPredNN/featurise.py -gm my_pdb_subset.csv my_pdb_directory
  - See Predicting protein sequences for more details
- Train the model using train_model.py
  python SeqPredNN/train_model.py -r 0.8 -t my_test_set -e 200 my_feature_directory unbalanced
  - The train ratio (-r) is the fraction of residues assigned to the training dataset. The remaining residues are assigned to a validation set used to evaluate the model during training
  - The test chain file (-t) is a newline-delimited text file listing chains that should be excluded from the training and validation datasets so that they can be used for independent evaluation of the model
  - The -e argument sets the number of epochs for training
  - The balanced/unbalanced keyword specifies the sampling mode. "unbalanced" sampling partitions all the residues in the features into the training and validation datasets. "balanced" sampling undersamples the residues so that each of the 20 amino acid classes occurs the same number of times in the dataset
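To illustrate the train ratio and the "balanced" sampling mode, here is a small standard-library sketch. It is not the code used in train_model.py: the function names are hypothetical, and undersampling every class to the size of the rarest class is an assumption about how "the same number of times" is achieved.

```python
import random
from collections import defaultdict


def balanced_undersample(labels, seed=0):
    """Return sorted residue indices in which every class occurs equally often.

    Each class is randomly undersampled to the size of the rarest class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    smallest = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, smallest))
    return sorted(kept)


def split_train_validation(indices, train_ratio, seed=0):
    """Shuffle the indices and split them into training and validation lists."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_train = round(train_ratio * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    residues = ["A"] * 5 + ["G"] * 3 + ["L"] * 2   # toy residue labels
    balanced = balanced_undersample(residues)      # two residues per class
    train, validation = split_train_validation(balanced, train_ratio=0.8)
    print(len(balanced), len(train), len(validation))
```

With "unbalanced" sampling the undersampling step would be skipped and all residue indices would go straight into the split.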
- Test your model using predict.py
  python SeqPredNN/predict.py my_feature_directory my_test_set pretrained_model/pretrained_parameters.pth
  - This predicts the sequences of all the protein chains in the test set and compares the predicted sequences with the native sequences to evaluate the model performance
  - Evaluation output:
    - A classification report with precision, recall and F1-score for each amino acid class
    - The top-K accuracy of the predictions for each amino acid class
    - Three confusion matrices (unnormalised, normalised by prediction and normalised by true residue)
    - For each chain in the test set:
      - The predicted sequence
      - The probabilities for each amino acid class produced by the model for each predicted residue
      - A classification report
      - Cross-entropy loss for each predicted residue
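Two of the quantities in the evaluation output can be sketched in a few lines. This sketch is for illustration only and is not the code in predict.py. It assumes that each residue has a list of class probabilities and that the true class is given as an integer index into that list. The function names are hypothetical.

```python
import math
from collections import defaultdict


def top_k_accuracy_by_class(probabilities, true_indices, k):
    """For each true class, return the fraction of its residues whose true
    class is among the k classes with the highest predicted probability."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for probs, true_index in zip(probabilities, true_indices):
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        totals[true_index] += 1
        if true_index in ranked[:k]:
            hits[true_index] += 1
    return {cls: hits[cls] / totals[cls] for cls in totals}


def cross_entropy_per_residue(probabilities, true_indices):
    """Return -log(p_true) for every residue (p_true must be greater than 0)."""
    return [-math.log(probs[true_index])
            for probs, true_index in zip(probabilities, true_indices)]


if __name__ == "__main__":
    # Three residues and three classes, for brevity (the model uses 20 classes).
    probabilities = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]
    true_indices = [0, 1, 0]
    print(top_k_accuracy_by_class(probabilities, true_indices, k=2))
    print(cross_entropy_per_residue(probabilities, true_indices))
```

A confident correct prediction gives a loss near zero, while a low probability for the true class gives a large loss, which is why the per-residue loss highlights positions where the model disagrees with the native sequence.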
- The pretrained model was trained using the chains in pretrained_model/SeqPredNN_pdb_subset.csv. The dataset consists of 38105 chains with less than 90% sequence identity, resolution < 2.5 Å, no chain breaks, lengths of 40-10000 residues, and only X-ray crystallography structures. It was generated by the PISCES server. We excluded a random test set of 10% of the chains from training.
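A random hold-out of this kind can be reproduced for your own dataset and written out as a test chain file for the -t argument of train_model.py. The sketch below is an assumption about the procedure, not the script used for the pretrained model. The function names are hypothetical, and the chain identifiers are placeholders that must match whatever identifiers appear in the chain_list.txt file of your feature directory.

```python
import random


def hold_out_test_chains(chain_ids, test_fraction=0.1, seed=0):
    """Randomly split chain identifiers into (training, test) lists."""
    rng = random.Random(seed)
    shuffled = list(chain_ids)
    rng.shuffle(shuffled)
    n_test = round(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


def to_test_chain_file(test_chains):
    """Format the test chains as newline-delimited text."""
    return "".join(f"{chain}\n" for chain in test_chains)


if __name__ == "__main__":
    # Placeholder identifiers; replace them with the entries of chain_list.txt.
    chains = [f"chain_{number}" for number in range(20)]
    training, test = hold_out_test_chains(chains, test_fraction=0.1)
    print(to_test_chain_file(test), end="")
```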
This software and code are distributed under the GNU General Public License v3.
Lategan, F.A., Schreiber, C. & Patterton, H.G. SeqPredNN: a neural network that generates protein sequences that fold into specified tertiary structures. BMC Bioinformatics 24, 373 (2023). https://doi.org/10.1186/s12859-023-05498-4