# BERT, Attention and LSTM
* The original tutorial can be accessed at this [git repository](https://github.com/BioGavin/c_AMPs-prediction)
* [Reference](https://www.nature.com/articles/s41587-022-01226-0)

## Model preparation 

* Clone the Git repository into a local directory
```bash
git clone https://github.com/BioGavin/c_AMPs-prediction.git 
cd c_AMPs-prediction/
```

* Download the BERT model file "bert.bin" from this [link](https://www.dropbox.com/sh/o58xdznyi6ulyc6/AABLckEnxP54j2X7BrGybhyea?dl=0) and move it into the "Models/" folder
```bash
mv bert.bin Models/ 
```

* Validate the model
```bash
cd Models
md5sum -c md5.txt
```
If the verification succeeds, the output looks like the following. A `FAILED` line means that the corresponding file is incomplete or corrupted and should be downloaded again.
```
lstm.h5: OK
att.h5: OK
bert.bin: OK
```
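On systems without `md5sum` (for example a default macOS install), the same check can be sketched in Python. This assumes that `md5.txt` uses the usual `md5sum` layout of one `<hex digest>  <file name>` pair per line. The helper names `file_md5` and `verify_checksums` are illustrative and are not part of the repository.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Check every file listed in an md5sum-style checksum file.

    File names are resolved relative to the checksum file's directory.
    Returns a dict mapping file name -> True (digest matches) or
    False (digest differs or the file is missing).
    """
    checksum_file = Path(checksum_file)
    results = {}
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
        target = checksum_file.parent / name
        results[name] = target.is_file() and file_md5(target) == expected.lower()
    return results
```

Calling `verify_checksums("Models/md5.txt")` from the repository root should then map `lstm.h5`, `att.h5` and `bert.bin` to `True` when all three files are intact.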

* Return to the root directory
```bash
cd ..
```

* Create the environment
```bash
conda create -y -n amp_prediction python=3.7 certifi=2022.12.7
conda activate amp_prediction

pip install -r requirement.txt

cd bert_sklearn
pip install .
cd ..

# Copy bert_sklearn into the site-packages directory of the environment.
# Replace the destination with the path of your own conda environment.
cp -r bert_sklearn/ ~/miniconda3/envs/amp_prediction/lib/python3.7/site-packages/
```
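The `site-packages` destination used in the `cp -r` command depends on where conda is installed. One way to find it, sketched here with the standard-library `sysconfig` module, is to ask the active interpreter directly. The function name `site_packages_dir` is illustrative.

```python
import sysconfig
from pathlib import Path


def site_packages_dir():
    """Return the site-packages directory of the interpreter running this code."""
    return Path(sysconfig.get_paths()["purelib"])


# Run this with the amp_prediction environment activated so that the
# printed path points inside that environment.
print(site_packages_dir())
```

The printed path can then be used as the destination of the `cp -r bert_sklearn/ ...` command.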


## Input preparation

* Move the input FASTA file into "Data/" and clean it up if needed
```bash
mv ripps_cystobacter.fasta Data/

# Optional clean-up steps, if needed

# seqkit seq -M 300 -g input.faa -o output.faa # Remove sequences longer than 300 residues (-g also strips gaps)
# seqkit seq -w 0 input.faa -o output.faa # Write each sequence on a single line
# seqkit grep -s -p "X" -v input.faa -o output.faa # Remove sequences containing the unknown amino acid "X"
```
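If `seqkit` is not available, the three optional filters can be approximated in plain Python. This is a sketch under stated assumptions, not a drop-in replacement for `seqkit`: it assumes a protein FASTA file, treats `-` and `.` as gap characters, and applies the 300-residue limit after the gaps have been removed. The function names are illustrative.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_records(records, max_len=300, gap_chars="-.", unknown="X"):
    """Strip gaps, then keep sequences of at most max_len residues without `unknown`."""
    for header, seq in records:
        for gap in gap_chars:
            seq = seq.replace(gap, "")
        if len(seq) <= max_len and unknown not in seq:
            yield header, seq


def write_fasta(records, handle):
    """Write each record with its whole sequence on a single line."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")


# Example usage (file names are placeholders):
# with open("input.faa") as src, open("output.faa", "w") as dst:
#     write_fasta(filter_records(read_fasta(src)), dst)
```

Because the three functions are generators chained together, the file is processed one record at a time and does not need to fit in memory.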

* Convert the FASTA file to the text format expected by the models
```bash
cd Data
perl ../script/format.pl ripps_cystobacter.fasta none > ripps_cystobacter.txt
```

## Prediction

* Run the three models (Attention, LSTM and BERT) on the formatted input
```bash
python3 ../script/prediction_attention.py ripps_cystobacter.txt ripps_cystobacter.att.txt # Attention
python3 ../script/prediction_lstm.py ripps_cystobacter.txt ripps_cystobacter.lstm.txt # LSTM
python3 ../script/prediction_bert.py ripps_cystobacter.txt ripps_cystobacter.bert.txt # BERT
```
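When several input files need to be scored, the three commands can be generated from the file stem instead of being typed by hand. The sketch below only rebuilds the command lines shown above and runs them with `subprocess`. The names `MODEL_SCRIPTS`, `build_commands` and `run_predictions` are illustrative, and the relative script paths assume that the code is run from inside `Data/`, as in this tutorial.

```python
import subprocess

# Output-file tag -> prediction script, matching the commands above.
MODEL_SCRIPTS = {
    "att": "../script/prediction_attention.py",
    "lstm": "../script/prediction_lstm.py",
    "bert": "../script/prediction_bert.py",
}


def build_commands(stem):
    """Return one command (as an argument list) per model for `<stem>.txt`."""
    return [
        ["python3", script, f"{stem}.txt", f"{stem}.{tag}.txt"]
        for tag, script in MODEL_SCRIPTS.items()
    ]


def run_predictions(stem):
    """Run the three prediction scripts in turn, stopping at the first failure."""
    for command in build_commands(stem):
        subprocess.run(command, check=True)
```

For the example in this tutorial, `run_predictions("ripps_cystobacter")` would launch the same three commands in the same order.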

* Combine the three prediction files into a single summary table
```bash
python3 ../script/result.py ripps_cystobacter.att.txt ripps_cystobacter.lstm.txt ripps_cystobacter.bert.txt ripps_cystobacter.fasta ripps_cystobacter.result.tsv
```

# Sequential Model Ensemble Pipeline (SMEP)
* [Reference](https://www.nature.com/articles/s41551-022-00991-2)