# Build your PLMSearch locally 🧪

**Notice: The experiments were run on a server with a `56-core Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz and 252 GB RAM`. The server's GPU is `1×GeForce GTX 1080 Ti 11GB`.**

## Quick links

* [SS-predictor pipeline](#1)
  * [Search against self](#1-1)
  * [Search against Swiss-Prot](#1-2)
* [PLMSearch pipeline](#2)
  * [Search against self](#2-1)
  * [Search against Swiss-Prot](#2-2)
* [TM-align compute with Spark](#3)
* [Start from Fasta (preprocessing)](#4)
* [Train your own SS-predictor](#5)

## SS-predictor pipeline
<span id="1"></span>
<div align=center><img src="scientist_figures/workflow_img/similarity.png" width="90%" height="90%" /></div>

### 1. Search against self
<span id="1-1"></span>

In [1]:
!CUDA_VISIBLE_DEVICES=0 python ./plmsearch/main_similarity.py \
-qer './example/query_mean_esm_result.pkl' \
-ter './example/query_mean_esm_result.pkl' \
-smp './plmsearch_data/esm_ss_predict/model_scop_tri.sav' \
-opr './example/ss_predictor_self'

We have 1 GPUs in total!, we will use as you selected
query protein list: 100%|██████████████████████| 5/5 [00:00<00:00, 46500.04it/s]
[I 230403 20:11:30 main_similarity:141] Sort end.
Esm embedding generate time cost: 14.138102769851685 s
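
Outside a notebook, the same call can be assembled in a plain Python script. The sketch below is a minimal, hypothetical wrapper: `build_similarity_cmd` and `run_similarity` are not part of PLMSearch. It only reuses the flags that appear in this document (`-qer`, `-ter`, `-smp`, `-opr`, and the optional `-ipr`) and leaves execution to `subprocess.run`.

```python
import os
import subprocess
import sys


def build_similarity_cmd(query_emb, target_emb, model_path, output_path,
                         prefilter_path=None):
    """Assemble the argument list for main_similarity.py (hypothetical helper)."""
    cmd = [
        sys.executable, "./plmsearch/main_similarity.py",
        "-qer", query_emb,
        "-ter", target_emb,
        "-smp", model_path,
        "-opr", output_path,
    ]
    # -ipr is only passed when a prefilter result is supplied.
    if prefilter_path is not None:
        cmd += ["-ipr", prefilter_path]
    return cmd


def run_similarity(cmd, gpu="0"):
    """Run the command with CUDA_VISIBLE_DEVICES set, as in the notebook cell."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    return subprocess.run(cmd, env=env, check=True)


cmd = build_similarity_cmd(
    "./example/query_mean_esm_result.pkl",
    "./example/query_mean_esm_result.pkl",
    "./plmsearch_data/esm_ss_predict/model_scop_tri.sav",
    "./example/ss_predictor_self",
)
print(" ".join(cmd))
# run_similarity(cmd)  # uncomment inside a PLMSearch checkout
```

Passing the arguments as a list avoids shell quoting issues when paths contain spaces.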


### 2. Search against Swiss-Prot
<span id="1-2"></span>

In [11]:
!CUDA_VISIBLE_DEVICES=0 python ./plmsearch/main_similarity.py \
-qer './example/query_mean_esm_result.pkl' \
-ter './plmsearch_data/swissprot_to_swissprot/target_mean_esm_result.pkl' \
-smp './plmsearch_data/esm_ss_predict/model_scop_tri.sav' \
-opr './example/ss_predictor_swissprot'

We have 1 GPUs in total!, we will use as you selected
query protein list: 100%|█████████████████████████| 5/5 [00:13<00:00,  2.74s/it]
[I 230403 20:25:07 main_similarity:141] Sort end.
Esm embedding generate time cost: 86.00837111473083 s
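
The two SS-predictor runs differ in target set size, which shows up in the reported wall time. As a small sketch, the regular expression below pulls the `time cost` values out of captured output so that runs can be compared. The helper name `extract_time_costs` is hypothetical, and the sample lines are copied from the outputs in this section.

```python
import re

# Matches both "time cost: 14.1 s" and "time cost 6.3 s" style log lines.
TIME_COST = re.compile(r"time cost:?\s+([0-9.]+)\s*s")


def extract_time_costs(log_text):
    """Return every reported time cost (in seconds) found in the text."""
    return [float(match.group(1)) for match in TIME_COST.finditer(log_text)]


self_log = "Esm embedding generate time cost: 14.138102769851685 s"
swissprot_log = "Esm embedding generate time cost: 86.00837111473083 s"

self_s = extract_time_costs(self_log)[0]
swissprot_s = extract_time_costs(swissprot_log)[0]
print(f"self: {self_s:.1f} s, Swiss-Prot: {swissprot_s:.1f} s, "
      f"ratio: {swissprot_s / self_s:.1f}x")
```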


## PLMSearch pipeline
<span id="2"></span>
<div align=center><img src="scientist_figures/workflow_img/main.png" width="90%" height="90%" /></div>

### 1. Search against self
<span id="2-1"></span>

In [3]:
# Step 1. Generate the Pfam clan prefilter result
!python ./plmsearch/main_pfam.py \
-qpr './example/query_pfam_result.json' \
-tpr './example/query_pfam_result.json' \
-c \
-opr './example/pfamclan_self'

[I 230403 20:12:57 main_pfam:13] query protein num = 5
[I 230403 20:12:57 main_pfam:14] target protein num = 5
query protein list: 100%|█████████████████████| 5/5 [00:00<00:00, 111550.64it/s]

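
Conceptually, a Pfam clan prefilter keeps only the query-target pairs that share at least one clan. The sketch below illustrates that idea on toy data. It is not the code in `main_pfam.py`: the input layout (a dict mapping protein ID to a list of clan IDs), the IDs themselves, and the function name `clan_prefilter` are all assumptions made for illustration.

```python
from collections import defaultdict


def clan_prefilter(query_clans, target_clans):
    """Return sorted (query, target) pairs that share at least one clan."""
    # Invert the target map once: clan -> set of targets carrying it.
    clan_to_targets = defaultdict(set)
    for target, clans in target_clans.items():
        for clan in clans:
            clan_to_targets[clan].add(target)

    pairs = set()
    for query, clans in query_clans.items():
        for clan in clans:
            # .get avoids creating empty entries for clans with no targets.
            for target in clan_to_targets.get(clan, ()):
                pairs.add((query, target))
    return sorted(pairs)


# Toy inputs (hypothetical IDs, not real Pfam clans).
queries = {"q1": ["clan_a"], "q2": ["clan_b", "clan_c"]}
targets = {"t1": ["clan_a"], "t2": ["clan_c"], "t3": ["clan_z"]}
pairs = clan_prefilter(queries, targets)
print(pairs)
```

Inverting the target map first keeps the lookup per query clan constant-time, which matters when the target set is much larger than the query set.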

In [4]:
# Step 2. Run the PLMSearch search
!CUDA_VISIBLE_DEVICES=0 python ./plmsearch/main_similarity.py \
-qer './example/query_mean_esm_result.pkl' \
-ter './example/query_mean_esm_result.pkl' \
-smp './plmsearch_data/esm_ss_predict/model_scop_tri.sav' \
-ipr './example/pfamclan_self' \
-opr './example/plmsearch_self'

We have 1 GPUs in total!, we will use as you selected
Get prefilter list: 5it [00:00, 27060.03it/s]
[I 230403 20:13:00 main_similarity:104] prefilter num = 5
query protein list: 100%|█████████████████████| 5/5 [00:00<00:00, 111550.64it/s]
[I 230403 20:13:01 main_similarity:141] Sort end.
Esm embedding generate time cost: 2.1746182441711426 s


### 2. Search against Swiss-Prot
<span id="2-2"></span>

In [5]:
# Step 1. Generate the Pfam clan prefilter result
!python ./plmsearch/main_pfam.py \
-qpr './example/query_pfam_result.json' \
-tpr './plmsearch_data/swissprot_to_swissprot/target_pfam_result.json' \
-c \
-opr './example/pfamclan_swissprot'

[I 230403 20:13:03 main_pfam:13] query protein num = 5
[I 230403 20:13:03 main_pfam:14] target protein num = 498654
query protein list: 100%|█████████████████████████| 5/5 [00:00<00:00,  5.13it/s]


In [6]:
# Step 2. Run the PLMSearch search
!CUDA_VISIBLE_DEVICES=0 python ./plmsearch/main_similarity.py \
-qer './example/query_mean_esm_result.pkl' \
-ter './plmsearch_data/swissprot_to_swissprot/target_mean_esm_result.pkl' \
-smp './plmsearch_data/esm_ss_predict/model_scop_tri.sav' \
-ipr './example/pfamclan_swissprot' \
-opr './example/plmsearch_swissprot'

We have 1 GPUs in total!, we will use as you selected
Get prefilter list: 19238it [00:00, 207177.44it/s]
[I 230403 20:14:08 main_similarity:104] prefilter num = 19238
query protein list: 100%|██████████████████████| 5/5 [00:00<00:00, 98457.84it/s]
[I 230403 20:14:08 main_similarity:141] Sort end.
Esm embedding generate time cost: 63.6106595993042 s
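
The logs for this search give a rough sense of how much work the prefilter removes. Assuming `prefilter num` counts retained query-target pairs (an interpretation of the log, not something the output states), the retained fraction of all possible pairs follows from the numbers printed in Step 1 and Step 2:

```python
num_queries = 5            # "query protein num" from Step 1
num_targets = 498_654      # "target protein num" from Step 1
prefilter_pairs = 19_238   # "prefilter num" from Step 2

# Every query could in principle be compared against every target.
all_pairs = num_queries * num_targets
retained = prefilter_pairs / all_pairs

print(f"{prefilter_pairs} of {all_pairs} pairs retained ({retained:.2%})")
```

Under that assumption, fewer than 1% of the 2,493,270 possible pairs are passed on to the similarity step.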


## TM-align compute with Spark
<span id="3"></span>
<div align=center><img src="scientist_figures/workflow_img/tmalign_compute.png" width="90%" height="90%" /></div>


In [12]:
# Build the pytmalign extension in place
%cd ./plmsearch/pytmalign/
!python setup.py build_ext --inplace
%cd ../..
# Compute TM-align scores with Spark
%cd ./plmsearch/
!python tmalign_compute.py \
-qsd '../plmsearch_data/swissprot_to_swissprot/query_structure/' \
-tsd '../plmsearch_data/swissprot_to_swissprot/query_structure/' \
-ipr '../example/tmalign_compute/test' \
-s
%cd ..

/home/lw/plmsearch/plmsearch/pytmalign
/home/lw/plmsearch
/home/lw/plmsearch/plmsearch
Get prefilter list: 6it [00:00, 553.95it/s]
100%|█████████████████████████████████████████| 6/6 [00:00<00:00, 213269.69it/s]
23/04/03 20:25:34 WARN Utils: Your hostname, ZzStudio-7048-4x1080 resolves to a loopback address: 127.0.1.1; using 10.176.64.2 instead (on interface eth0)
23/04/03 20:25:34 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/03 20:25:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Compute total time cost 6.368920087814331 s
/home/lw/plmsearch
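
The `%cd` magics only exist inside IPython. In a plain script, the same two steps can be run with `subprocess.run(..., cwd=...)`, which avoids changing the working directory of the calling process. This is a minimal sketch: the helper `run_in` is hypothetical, the paths and flags are copied from the cell in this section, and the demonstration at the end only prints a directory so that the snippet works without a PLMSearch checkout.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in(directory, args):
    """Run `python <args>` with `directory` as working directory; return stdout."""
    result = subprocess.run(
        [sys.executable, *args],
        cwd=directory,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# The two notebook steps, expressed as (working directory, arguments).
steps = [
    ("./plmsearch/pytmalign", ["setup.py", "build_ext", "--inplace"]),
    ("./plmsearch", [
        "tmalign_compute.py",
        "-qsd", "../plmsearch_data/swissprot_to_swissprot/query_structure/",
        "-tsd", "../plmsearch_data/swissprot_to_swissprot/query_structure/",
        "-ipr", "../example/tmalign_compute/test",
        "-s",
    ]),
]
# for directory, args in steps:   # uncomment inside a PLMSearch checkout
#     print(run_in(directory, args))

# Self-contained demonstration: the child process sees the requested directory.
with tempfile.TemporaryDirectory() as tmp:
    child_cwd = run_in(tmp, ["-c", "import os; print(os.getcwd())"]).strip()
    same_dir = Path(child_cwd).resolve() == Path(tmp).resolve()
print(same_dir)
```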


## Start from Fasta (preprocessing)
<span id="4"></span>

### 1. Generate ESM-1b embedding

In [2]:
# Generate ESM-1b embeddings from the FASTA file
!CUDA_VISIBLE_DEVICES=0 python ./plmsearch/esm_generate.py \
-f './plmsearch_data/swissprot_to_swissprot/query_protein.fasta' \
-m './example/query_mean_esm_result.pkl'

Transferred model to GPU
Read ./plmsearch_data/swissprot_to_swissprot/query_protein.fasta with 5 sequences
Processing 1 of 1 batches (5 sequences)
Esm embedding generate time cost: 24.65529227256775 s
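
Before generating embeddings it can help to check what the FASTA file contains, since the log reports the number of sequences read. The following is a minimal standard-library sketch of a FASTA reader. The function name `read_fasta` and the toy records are illustrative, and this is not how `esm_generate.py` parses its input.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input with two records; sequences may span several lines.
toy = io.StringIO(">protA toy record\nMKT\nAYI\n>protB\nGSH\n")
records = list(read_fasta(toy))
print(len(records), [len(seq) for _, seq in records])
```

Applied to `query_protein.fasta`, the record count should match the `5 sequences` line in the log.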


### 2. Generate Pfam result

In [9]:
# Generate Pfam results from the FASTA file
!python ./plmsearch/pfam_local_generate.py \
-f './plmsearch_data/swissprot_to_swissprot/query_protein.fasta' \
-o './example/query_pfam_result.json'

1680524087.9343655
perl ./plmsearch_data/PfamScan/pfam_scan.pl -fasta ./plmsearch_data/swissprot_to_swissprot/query_protein.fasta -dir ./plmsearch_data/Pfam_db -outfile ./tmp.txt
Pfam local generate time cost 3.77127742767334 s
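
The layout of `query_pfam_result.json` is not described here, so the sketch below assumes nothing beyond the file being valid JSON: it loads a file and reports the top-level type and size. The helper name `summarize_json` is hypothetical, and the demonstration writes a toy file instead of reading real output.

```python
import json
import os
import tempfile


def summarize_json(path):
    """Return (top-level type name, number of top-level entries)."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    size = len(data) if isinstance(data, (dict, list)) else 1
    return type(data).__name__, size


# Toy file standing in for a real result; its structure is invented.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.json")
    with open(toy_path, "w", encoding="utf-8") as handle:
        json.dump({"protA": ["x"], "protB": []}, handle)
    summary = summarize_json(toy_path)
print(summary)
```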


## Train your own SS-predictor
<span id="5"></span>
<div align=center><img src="scientist_figures/workflow_img/ss-predictor.png" width="90%" height="90%" /></div>

In [10]:
# Train the SS-predictor
!CUDA_VISIBLE_DEVICES=0 python ./plmsearch/esm_ss_predict_tri_train.py \
--save_model_path './example/ss_predictor/model_scop_tri.sav'

We have 1 GPUs in total! We will use as you selected
# training with esm_ss_predict_tri: ss_batch_size=100, epochs=20, lr=1e-05
# save model path: ./example/ss_predictor/model_scop_tri.sav
# loading esm result: ./plmsearch_data/esm_ss_predict/train/mean_esm_result.pkl
# loading protein list file: ./plmsearch_data/esm_ss_predict/train/protein_list.txt
# loading ss mat file: ./plmsearch_data/esm_ss_predict/train/ss_mat.npz
[I 230403 20:14:57 esm_ss_predict:44] (8953, 8953) 40082581
PPI: 100%|██████████████████████| 40082581/40082581 [02:12<00:00, 302322.24it/s]
# loaded 40082581 sequence pairs
# training model
Epoch 1
-------------------------------
Train_mse_loss_avg: 0.047866  [    0/36074322]
Train_mse_loss_avg: 0.004335  [10000/36074322]
Train_mse_loss_avg: 0.003556  [20000/36074322]
Train_mse_loss_avg: 0.003652  [30000/36074322]
Train_mse_loss_avg: 0.002784  [40000/36074322]
Train_mse_loss_avg: 0.003162  [50000/36074322]
Train_mse_loss_avg: 0.002963  [60000/36074322]
Train