A deep learning model designed to screen anticancer peptides (ACPs) using peptide sequences only. A contrastive learning technique was applied to enhance model performance, yielding better results than a model trained solely on binary classification loss. Furthermore, two independent encoders were employed as a replacement for data augmentation, a technique commonly used in contrastive learning.
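To make the two-encoder idea concrete, the sketch below uses two hand-crafted feature maps as stand-ins for the model's two independent learned encoders. Each maps the same peptide to a different embedding ("view"); the two views of one peptide form a positive pair for contrastive learning, replacing augmentation-based view generation. The encoders and similarity here are purely illustrative, not the architecture from the paper.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encoder_a(seq):
    # View 1: amino-acid composition (illustrative stand-in for encoder 1).
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def encoder_b(seq):
    # View 2: position-weighted composition (illustrative stand-in for encoder 2).
    n = len(seq)
    vec = [0.0] * len(AMINO_ACIDS)
    for i, ch in enumerate(seq):
        vec[AMINO_ACIDS.index(ch)] += (i + 1) / n
    return vec

def cosine(u, v):
    # Similarity between the two views; in training, views of the same
    # peptide are pulled together and views of different peptides pushed apart.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
```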
- pytorch>=2.0.1
- numpy>=1.25.2
- biopython
Datasets for model training were obtained from ACPred-LAF. Six benchmark datasets were used for model training:
- ACP-Mixed-80
- ACP2.0 main
- ACP2.0 alternative
- ACP500+ACP164
- ACP500+ACP2710
- LEE+Independent
For more detailed information, refer to the research article cited at the bottom of this page.
To predict ACPs using only peptide sequences, prepare your peptide sequence list in the FASTA format. For more detailed information about the FASTA format, refer to this link.
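For example, a minimal input file looks like the following (the record names and sequences here are made up for illustration):

```
>peptide_1
FLPIVGKLLSGLL
>peptide_2
GLFDIIKKIAESF
```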
Use the following command to run the inference:
python inf.py --input {input_file} --batch_size {batch_size} --model_type {model_type}
--device {device} --output {output_file}
- input_file: The input file containing peptide sequences in FASTA format
- batch_size: The batch size used during inference
- model_type: Specifies the type of optimized model. There are six optimized models available for predicting ACPs, each trained on one of the six benchmark datasets. The recommended default is `ACP_Mixed_80`.
- Options
  - `ACP_Mixed_80`: the optimized model trained on the ACP-Mixed-80 benchmark dataset
  - `ACP2_main`: the optimized model trained on the ACP2.0 main benchmark dataset
  - `ACP2_alter`: the optimized model trained on the ACP2.0 alternative benchmark dataset
  - `ACP500_ACP164`: the optimized model trained on the ACP500+ACP164 benchmark dataset
  - `ACP500_ACP2710`: the optimized model trained on the ACP500+ACP2710 benchmark dataset
  - `LEE_Indep`: the optimized model trained on the LEE+Independent benchmark dataset
- device: The device used for predicting ACPs
  - Options
    - `cpu`
    - `gpu`
- output_file: The file where prediction results will be saved.
Alternatively, you can use the acppred Python package for predictions.
- Install acppred: First, install the acppred package using pip:
pip install acppred
- Predict Using acppred: Utilize the following Python script to perform ACP predictions:
import acppred as ap
ap.predict(fasta_file, model_type="ACP_Mixed_80", device="cpu", batch_size=64)
- fasta_file: The path to the FASTA file containing the peptide sequences you want to analyze.
- model_type: Specifies the machine learning model to use for the prediction.
- device: Indicates whether to use the CPU ("cpu") or GPU ("gpu") for computation.
- batch_size: Determines the number of sequences to process simultaneously. A larger batch size can expedite the prediction process but will require more memory. Adjust this parameter based on your system's capabilities and the size of your dataset.
This script facilitates ACP prediction by integrating the acppred package, allowing you to specify the model type, computing device, and batch size.
Note: Due to variability in the maximum peptide sequence length across each benchmark dataset, there are restrictions on the maximum input peptide sequence length for each model type.
| Model Type | Maximum Number of Amino Acid Residues |
|---|---|
| ACP2_main | 50 |
| ACP2_alter | 50 |
| LEE_Indep | 95 |
| ACP500_ACP164 | 206 |
| ACP500_ACP2710 | 206 |
| ACP_Mixed_80 | 207 |
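As a quick pre-check, you can flag sequences that exceed a model's limit before running inference. The sketch below is illustrative (the `MAX_LEN` table is copied from the limits above; the parsing helper is not part of the package):

```python
# Maximum sequence length per model type, from the table above.
MAX_LEN = {
    "ACP2_main": 50,
    "ACP2_alter": 50,
    "LEE_Indep": 95,
    "ACP500_ACP164": 206,
    "ACP500_ACP2710": 206,
    "ACP_Mixed_80": 207,
}

def read_fasta(text):
    """Parse FASTA text into (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def too_long(fasta_text, model_type):
    """Return names of sequences that exceed the chosen model's limit."""
    limit = MAX_LEN[model_type]
    return [name for name, seq in read_fasta(fasta_text) if len(seq) > limit]
```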
Use the following command to start model training:
python train.py --model_info {model_info} --batch_size {batch_size} --dropout_rate {dropout_rate}
--lr {learning_rate} --epoch {maximum_training_epochs} --dataset {benchmark_dataset}
--alpha {alpha} --beta {beta} --temp {temperature} --gpu {gpu_number}
- model_info: Choose an encoder architecture from the `./model/model_params` directory for model training. For example, `--model_info ./model/model_params/cnn1.json`.
- batch_size: Batch size used during model training
- dropout_rate: Dropout rate applied during model training
- learning_rate: Learning rate set for model training
- maximum_training_epochs: Maximum number of training epochs
- benchmark_dataset: Select one dataset from the six available benchmark datasets for model training.
  - Options
    - `ACP_Mixed-80`: ACP-Mixed-80 dataset
    - `ACP2_main`: ACP2.0 main dataset
    - `ACP2_alter`: ACP2.0 alternative dataset
    - `ACP500_ACP164`: ACP500+ACP164 dataset
    - `ACP500_ACP2710`: ACP500+ACP2710 dataset
    - `LEE_Indep`: LEE+Independent dataset
- alpha: Adjusts the balance between the cross-entropy and contrastive loss components. Range: 0.0 to 1.0.
- beta: Balances the two types of cross-entropy losses (cross-entropy loss 1 and 2).
  - Options
    - `0`: only cross-entropy loss 1 is used for model training
    - `0.5`: both cross-entropy losses 1 and 2 are used for model training
    - `1`: only cross-entropy loss 2 is used for model training
- temperature: Temperature parameter in the contrastive loss calculation.
- gpu: GPU number to be used for model training, as identified by the `nvidia-smi` command. Use `-1` for CPU training.
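The three hyperparameters above can be read as weights in a combined training objective. The sketch below shows one plausible formulation (the exact weighting direction in `train.py` may differ); the contrastive term is a generic temperature-scaled InfoNCE-style loss, not the repository's implementation, and the loss values in the usage are placeholders:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_term(pos_sim, neg_sims, temp):
    """InfoNCE-style loss: -log softmax(similarity / temp), with the
    positive pair as the target. Lower temp sharpens the distribution."""
    logits = [pos_sim / temp] + [s / temp for s in neg_sims]
    return -math.log(softmax(logits)[0])

def total_loss(ce1, ce2, con, alpha, beta):
    """Assumed combination: beta mixes the two cross-entropy terms,
    alpha balances cross-entropy against the contrastive term."""
    ce = (1 - beta) * ce1 + beta * ce2
    return alpha * ce + (1 - alpha) * con
```

With this formulation, `beta=0` keeps only cross-entropy loss 1, `beta=1` keeps only cross-entropy loss 2, and a lower temperature makes the contrastive term more sensitive to how strongly the positive pair dominates the negatives.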
Byungjo Lee, Dongkwan Shin, Contrastive learning for enhancing feature extraction in anticancer peptides, Briefings in Bioinformatics, Volume 25, Issue 3, May 2024, bbae220, https://doi.org/10.1093/bib/bbae220