A deep learning model designed to screen anticancer peptides (ACPs) using peptide sequences only. A contrastive learning technique was applied to enhance model performance, yielding better results than a model trained solely on binary classification loss. Furthermore, two independent encoders were employed as a replacement for data augmentation, a technique commonly used in contrastive learning.
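To make the two-encoder idea concrete, the sketch below uses two hand-crafted feature maps as stand-ins for the model's two independent learned encoders. Each maps the same peptide to a different embedding ("view"); the two views of one peptide form a positive pair for contrastive learning, replacing augmentation-based view generation. The encoders and similarity here are purely illustrative, not the architecture from the paper.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encoder_a(seq):
    # View 1: amino-acid composition (illustrative stand-in for encoder 1).
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def encoder_b(seq):
    # View 2: position-weighted composition (illustrative stand-in for encoder 2).
    n = len(seq)
    vec = [0.0] * len(AMINO_ACIDS)
    for i, ch in enumerate(seq):
        vec[AMINO_ACIDS.index(ch)] += (i + 1) / n
    return vec

def cosine(u, v):
    # Similarity between the two views; in training, views of the same
    # peptide are pulled together and views of different peptides pushed apart.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
```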
- pytorch>=2.0.1
- numpy>=1.25.2
- biopython
Datasets for model training were obtained from ACPred-LAF. Six benchmark datasets were used for model training:
- ACP-Mixed-80
- ACP2.0 main
- ACP2.0 alternative
- ACP500+ACP164
- ACP500+ACP2710
- LEE+Independent
For more detailed information, refer to the research article cited at the bottom of this page.
To predict ACPs using only peptide sequences, prepare your peptide sequence list in the FASTA format. For more detailed information about the FASTA format, refer to this link.
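For example, a minimal input file looks like the following (the record names and sequences here are made up for illustration):

```
>peptide_1
FLPIVGKLLSGLL
>peptide_2
GLFDIIKKIAESF
```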
Use the following command to run the inference:
python inf.py --input {input_file} --batch_size {batch_size} --model_type {model_type}
--device {device} --output {output_file}
- input_file: The input file containing peptide sequences in FASTA format
- batch_size: The batch size used during inference
- model_type: Specifies the type of optimized model. There are six optimized models available for predicting ACPs, each trained on one of the six benchmark datasets. The recommended default is `ACP_Mixed_80`.
- Options
  - `ACP_Mixed_80`: the optimized model trained on the ACP-Mixed-80 benchmark dataset
  - `ACP2_main`: the optimized model trained on the ACP2.0 main benchmark dataset
  - `ACP2_alter`: the optimized model trained on the ACP2.0 alternative benchmark dataset
  - `ACP500_ACP164`: the optimized model trained on the ACP500+ACP164 benchmark dataset
  - `ACP500_ACP2710`: the optimized model trained on the ACP500+ACP2710 benchmark dataset
  - `LEE_Indep`: the optimized model trained on the LEE+Independent benchmark dataset
- device: The device used for predicting ACPs
  - Options
    - `cpu`
    - `gpu`
- output_file: The file where prediction results will be saved.
Alternatively, you can use the acppred Python package for predictions.
- Install acppred: First, install the acppred package using pip:
pip install acppred
- Predict Using acppred: Utilize the following Python script to perform ACP predictions:
import acppred as ap
ap.predict(fasta_file, model_type="ACP_Mixed_80", device="cpu", batch_size=64)
- fasta_file: The path to the FASTA file containing the peptide sequences you want to analyze.
- model_type: Specifies the machine learning model to use for the prediction.
- device: Indicates whether to use the CPU ("cpu") or GPU ("gpu") for computation.
- batch_size: Determines the number of sequences to process simultaneously. A larger batch size can expedite the prediction process but will require more memory. Adjust this parameter based on your system's capabilities and the size of your dataset.
This script facilitates ACP prediction by integrating the acppred package, allowing you to specify the model type, computing device, and batch size.
Note: Due to variability in the maximum peptide sequence length across each benchmark dataset, there are restrictions on the maximum input peptide sequence length for each model type.
| Model Type | Maximum Number of Amino Acid Residues |
|---|---|
| ACP2_main | 50 |
| ACP2_alter | 50 |
| LEE_Indep | 95 |
| ACP500_ACP164 | 206 |
| ACP500_ACP2710 | 206 |
| ACP_Mixed_80 | 207 |
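As a quick pre-check, you can flag sequences that exceed a model's limit before running inference. The sketch below is illustrative (the `MAX_LEN` table is copied from the limits above; the parsing helper is not part of the package):

```python
# Maximum sequence length per model type, from the table above.
MAX_LEN = {
    "ACP2_main": 50,
    "ACP2_alter": 50,
    "LEE_Indep": 95,
    "ACP500_ACP164": 206,
    "ACP500_ACP2710": 206,
    "ACP_Mixed_80": 207,
}

def read_fasta(text):
    """Parse FASTA text into (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def too_long(fasta_text, model_type):
    """Return names of sequences that exceed the chosen model's limit."""
    limit = MAX_LEN[model_type]
    return [name for name, seq in read_fasta(fasta_text) if len(seq) > limit]
```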
Use the following command to start model training:
python train.py --model_info {model_info} --batch_size {batch_size} --dropout_rate {dropout_rate}
--lr {learning_rate} --epoch {maximum_training_epochs} --dataset {benchmark_dataset}
--alpha {alpha} --beta {beta} --temp {temperature} --gpu {gpu_number}
- model_info: Choose an encoder architecture from the `./model/model_params` directory for model training. For example, `--model_info ./model/model_params/cnn1.json`.
- batch_size: Batch size used during model training
- dropout_rate: Dropout rate applied during model training
- learning_rate: Learning rate set for model training
- maximum_training_epochs: Maximum number of training epochs
- benchmark_dataset: Select one dataset from the six available benchmark datasets for model training.
  - Options
    - `ACP_Mixed-80`: ACP-Mixed-80 dataset
    - `ACP2_main`: ACP2.0 main dataset
    - `ACP2_alter`: ACP2.0 alternative dataset
    - `ACP500_ACP164`: ACP500+ACP164 dataset
    - `ACP500_ACP2710`: ACP500+ACP2710 dataset
    - `LEE_Indep`: LEE+Independent dataset
- alpha: Adjusts the balance between the cross-entropy and contrastive loss components. Range: 0.0 to 1.0.
- beta: Balances the two types of cross-entropy losses (cross-entropy loss 1 and 2).
  - Options
    - `0`: only cross-entropy loss 1 is used for model training
    - `0.5`: both cross-entropy losses 1 and 2 are used for model training
    - `1`: only cross-entropy loss 2 is used for model training
- temperature: Temperature parameter in the contrastive loss calculation.
- gpu: GPU number to be used for model training, as identified by the `nvidia-smi` command. Use `-1` for CPU training.
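The three hyperparameters above can be read as weights in a combined training objective. The sketch below shows one plausible formulation (the exact weighting direction in `train.py` may differ); the contrastive term is a generic temperature-scaled InfoNCE-style loss, not the repository's implementation, and the loss values in the usage are placeholders:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_term(pos_sim, neg_sims, temp):
    """InfoNCE-style loss: -log softmax(similarity / temp), with the
    positive pair as the target. Lower temp sharpens the distribution."""
    logits = [pos_sim / temp] + [s / temp for s in neg_sims]
    return -math.log(softmax(logits)[0])

def total_loss(ce1, ce2, con, alpha, beta):
    """Assumed combination: beta mixes the two cross-entropy terms,
    alpha balances cross-entropy against the contrastive term."""
    ce = (1 - beta) * ce1 + beta * ce2
    return alpha * ce + (1 - alpha) * con
```

With this formulation, `beta=0` keeps only cross-entropy loss 1, `beta=1` keeps only cross-entropy loss 2, and a lower temperature makes the contrastive term more sensitive to how strongly the positive pair dominates the negatives.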
Byungjo Lee, Dongkwan Shin, Contrastive learning for enhancing feature extraction in anticancer peptides, Briefings in Bioinformatics, Volume 25, Issue 3, May 2024, bbae220, https://doi.org/10.1093/bib/bbae220