# Build your PLMSearch locally 🧪

**Notice: The experiments were run on a server with a `56-core Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz` and `252 GB` of RAM.**

## Quick links

* [SS-predictor pipeline](#1)
  * [Search against self](#1-1)
  * [Search against Swiss-Prot](#1-2)
* [PLMSearch pipeline](#2)
  * [Search against self](#2-1)
  * [Search against Swiss-Prot](#2-2)
* [TM-align compute with Spark](#3)
* [Start from FASTA (preprocessing)](#4)
* [Train your own SS-predictor](#5)

## SS-predictor pipeline
<span id="1"></span>
<div align=center><img src="scientist_figures/workflow_img/similarity.png" width="90%" height="90%" /></div>

### 1. Search against self
<span id="1-1"></span>

In [1]:
!python ./plmsearch/main_similarity.py \
-qer './example/query_mean_esm_result_cpu.pkl' \
-ter './example/query_mean_esm_result_cpu.pkl' \
-smp './plmsearch_data/esm_ss_predict/model_scop_tri.sav' \
-d \
-opr './example/ss_predictor_self'

None of GPU is selected.
query protein list: 100%|██████████████████████| 5/5 [00:00<00:00, 90394.48it/s]
[I 230403 23:30:28 main_similarity:141] Sort end.
Esm embedding generate time cost: 0.4538588523864746 s


### 2. Search against Swiss-Prot
<span id="1-2"></span>

In [2]:
!python ./plmsearch/main_similarity.py \
-qer './example/query_mean_esm_result_cpu.pkl' \
-ter './plmsearch_data/swissprot_to_swissprot/target_mean_esm_result_cpu.pkl' \
-smp './plmsearch_data/esm_ss_predict/model_scop_tri.sav' \
-d \
-opr './example/ss_predictor_swissprot'

None of GPU is selected.
query protein list: 100%|█████████████████████████| 5/5 [01:05<00:00, 13.13s/it]
[I 230403 23:32:13 main_similarity:141] Sort end.
Esm embedding generate time cost: 108.95698642730713 s
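
The search step scores the query embeddings against the target embeddings and then sorts the hits (the `Sort end.` log line). As a rough illustration only, the sketch below ranks targets by plain cosine similarity between mean embeddings. Cosine similarity is a stand-in here, not the trained SS-predictor stored in `model_scop_tri.sav`, and the toy vectors, the IDs, and the `rank_targets` helper are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_targets(query_vec, targets):
    """Return (target_id, score) pairs sorted from most to least similar.

    `targets` maps a target ID to its embedding vector (assumed layout).
    """
    scored = [(tid, cosine(query_vec, vec)) for tid, vec in targets.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 3-dimensional "embeddings" (hypothetical data, not real ESM output).
targets = {
    "T1": [1.0, 0.0, 0.0],
    "T2": [0.6, 0.8, 0.0],
    "T3": [0.0, 0.0, 1.0],
}
ranking = rank_targets([1.0, 0.0, 0.0], targets)
for tid, score in ranking:
    print(f"{tid}\t{score:.3f}")
```

With real data the score would come from the SS-predictor rather than from cosine similarity; only the score-then-sort pattern carries over.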


## PLMSearch pipeline
<span id="2"></span>
<div align=center><img src="scientist_figures/workflow_img/main.png" width="90%" height="90%" /></div>

### 1. Search against self
<span id="2-1"></span>

In [3]:
# Step 1. Generate the PfamClan prefilter result
!python ./plmsearch/main_pfam.py \
-qpr './example/query_pfam_result.json' \
-tpr './example/query_pfam_result.json' \
-c \
-opr './example/pfamclan_self'

[I 230403 23:32:19 main_pfam:13] query protein num = 5
[I 230403 23:32:19 main_pfam:14] target protein num = 5
query protein list: 100%|█████████████████████| 5/5 [00:00<00:00, 109226.67it/s]
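
Step 1 narrows the candidate pairs before the similarity search. One way to picture a clan-based prefilter is to keep only the query-target pairs that share at least one Pfam clan. The sketch below assumes each protein maps to a set of clan identifiers; the real layout of `query_pfam_result.json` and the logic inside `main_pfam.py` may differ, and `prefilter_pairs` and the sample data are hypothetical.

```python
from collections import defaultdict


def prefilter_pairs(query_clans, target_clans):
    """Return sorted (query_id, target_id) pairs that share at least one clan.

    Both arguments map a protein ID to a set of clan IDs (assumed layout).
    """
    # Invert the target mapping once so each query clan is a direct lookup.
    clan_to_targets = defaultdict(set)
    for target_id, clans in target_clans.items():
        for clan in clans:
            clan_to_targets[clan].add(target_id)

    pairs = set()
    for query_id, clans in query_clans.items():
        for clan in clans:
            for target_id in clan_to_targets.get(clan, ()):
                pairs.add((query_id, target_id))
    return sorted(pairs)


# Hypothetical sample data for illustration only.
query_clans = {"Q1": {"CL0001"}, "Q2": {"CL0023", "CL0004"}}
target_clans = {"T1": {"CL0001"}, "T2": {"CL0004"}, "T3": {"CL0099"}}

pairs = prefilter_pairs(query_clans, target_clans)
print(pairs)
```

Building the clan-to-target index first avoids comparing every query with every target, which matters once the target set grows to the size of Swiss-Prot.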


In [4]:
# Step 2. Run PLMSearch on the prefiltered pairs
!python ./plmsearch/main_similarity.py \
-qer './example/query_mean_esm_result_cpu.pkl' \
-ter './example/query_mean_esm_result_cpu.pkl' \
-smp './plmsearch_data/esm_ss_predict/model_scop_tri.sav' \
-ipr './example/pfamclan_self' \
-d \
-opr './example/plmsearch_self'

None of GPU is selected.
Get prefilter list: 5it [00:00, 31254.13it/s]
[I 230403 23:32:20 main_similarity:104] prefilter num = 5
query protein list: 100%|██████████████████████| 5/5 [00:00<00:00, 97541.95it/s]
[I 230403 23:32:20 main_similarity:141] Sort end.
Esm embedding generate time cost: 0.08173346519470215 s


### 2. Search against Swiss-Prot
<span id="2-2"></span>

In [5]:
# Step 1. Generate the PfamClan prefilter result
!python ./plmsearch/main_pfam.py \
-qpr './example/query_pfam_result.json' \
-tpr './plmsearch_data/swissprot_to_swissprot/target_pfam_result.json' \
-c \
-opr './example/pfamclan_swissprot'

[I 230403 23:32:23 main_pfam:13] query protein num = 5
[I 230403 23:32:23 main_pfam:14] target protein num = 498654
query protein list: 100%|█████████████████████████| 5/5 [00:01<00:00,  4.94it/s]


In [6]:
# Step 2. Run PLMSearch on the prefiltered pairs
!python ./plmsearch/main_similarity.py \
-qer './example/query_mean_esm_result_cpu.pkl' \
-ter './plmsearch_data/swissprot_to_swissprot/target_mean_esm_result_cpu.pkl' \
-smp './plmsearch_data/esm_ss_predict/model_scop_tri.sav' \
-ipr './example/pfamclan_swissprot' \
-d \
-opr './example/plmsearch_swissprot'

None of GPU is selected.
Get prefilter list: 19238it [00:00, 161942.30it/s]
[I 230403 23:33:02 main_similarity:104] prefilter num = 19238
query protein list: 100%|██████████████████████| 5/5 [00:00<00:00, 71089.90it/s]
[I 230403 23:33:03 main_similarity:141] Sort end.
Esm embedding generate time cost: 38.78334379196167 s


## TM-align compute with Spark
<span id="3"></span>
<div align=center><img src="scientist_figures/workflow_img/tmalign_compute.png" width="90%" height="90%" /></div>

In [7]:
# Build the pytmalign extension in place
%cd ./plmsearch/pytmalign/
!python setup.py build_ext --inplace
%cd ../..
# Compute TM-align scores with Spark
%cd ./plmsearch/
!python tmalign_compute.py \
-qsd '../plmsearch_data/swissprot_to_swissprot/query_structure/' \
-tsd '../plmsearch_data/swissprot_to_swissprot/query_structure/' \
-ipr '../example/tmalign_compute/test' \
-s
%cd ..

/home/lw/plmsearch/plmsearch/pytmalign
/home/lw/plmsearch
/home/lw/plmsearch/plmsearch
Get prefilter list: 6it [00:00, 32725.39it/s]
100%|█████████████████████████████████████████| 6/6 [00:00<00:00, 206277.25it/s]
23/04/03 23:33:10 WARN Utils: Your hostname, ZzStudio-7048-4x1080 resolves to a loopback address: 127.0.1.1; using 10.176.64.2 instead (on interface eth0)
23/04/03 23:33:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/03 23:33:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Compute total time cost 11.454262018203735 s                                    
/home/lw/plmsearch


## Start from FASTA (preprocessing)
<span id="4"></span>

### 1. Generate ESM-1b embedding

In [8]:
# Generate ESM-1b embeddings on CPU (--nogpu)
!python ./plmsearch/esm_generate.py \
-f './plmsearch_data/swissprot_to_swissprot/query_protein.fasta' \
-m './example/query_mean_esm_result_cpu.pkl' \
--nogpu

Read ./plmsearch_data/swissprot_to_swissprot/query_protein.fasta with 5 sequences
Processing 1 of 1 batches (5 sequences)
Esm embedding generate time cost: 20.87261128425598 s
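
The output file name `query_mean_esm_result_cpu.pkl` suggests that each protein is stored as a single mean embedding. Assuming this means averaging the per-residue embeddings over the sequence length, the idea can be sketched in plain Python as follows. The `mean_pool` helper and the toy numbers are hypothetical, and `esm_generate.py` may pool differently.

```python
def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-size protein vector.

    `residue_embeddings` is a list of equal-length vectors, one per residue.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [
        sum(row[i] for row in residue_embeddings) / length
        for i in range(dim)
    ]


# Two residues with 2-dimensional toy embeddings (not real ESM-1b output).
residues = [[1.0, 2.0], [3.0, 4.0]]
pooled = mean_pool(residues)
print(pooled)
```

Pooling gives every protein a vector of the same size regardless of sequence length, which is what lets query and target embeddings be compared directly.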


### 2. Generate Pfam result

In [9]:
# Generate Pfam results locally with PfamScan
!python ./plmsearch/pfam_local_generate.py \
-f './plmsearch_data/swissprot_to_swissprot/query_protein.fasta' \
-o './example/query_pfam_result.json'

1680536020.3650196
perl ./plmsearch_data/PfamScan/pfam_scan.pl -fasta ./plmsearch_data/swissprot_to_swissprot/query_protein.fasta -dir ./plmsearch_data/Pfam_db -outfile ./tmp.txt
Pfam local generate time cost 12.641374349594116 s


## Train your own SS-predictor
<span id="5"></span>
<div align=center><img src="scientist_figures/workflow_img/ss-predictor.png" width="90%" height="90%" /></div>

In [10]:
# Train the SS-predictor
!python ./plmsearch/esm_ss_predict_tri_train.py \
-d \
-mer './plmsearch_data/esm_ss_predict/train/mean_esm_result_cpu.pkl' \
--save_model_path './example/ss_predictor/model_scop_tri_cpu.sav'

None of GPU is selected.
# training with esm_ss_predict_tri: ss_batch_size=100, epochs=20, lr=1e-05
# save model path: ./example/ss_predictor/model_scop_tri_cpu.sav
# loading esm result: ./plmsearch_data/esm_ss_predict/train/mean_esm_result_cpu.pkl
# loading protein list file: ./plmsearch_data/esm_ss_predict/train/protein_list.txt
# loading ss mat file: ./plmsearch_data/esm_ss_predict/train/ss_mat.npz
[I 230403 23:33:57 esm_ss_predict:44] (8953, 8953) 40082581
PPI: 100%|██████████████████████| 40082581/40082581 [02:05<00:00, 319947.98it/s]
# loaded 40082581 sequence pairs
# training model
Epoch 1
-------------------------------
Train_mse_loss_avg: 0.064362  [    0/36074322]
Train_mse_loss_avg: 0.005796  [10000/36074322]
Train_mse_loss_avg: 0.002918  [20000/36074322]
Train_mse_loss_avg: 0.004253  [30000/36074322]
Train_mse_loss_avg: 0.003699  [40000/36074322]
Train_mse_loss_avg: 0.002158  [50000/36074322]
Train_mse_loss_avg: 0.003505  [60000/36074322]
Train_mse_loss_avg: 0.003