# Search your own data 🧪

**Note: The experiments were run on a server with a `72-core Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz` and `503 GB` of RAM.**

## Quick links

* [SS-sort(COS) pipeline](#1)
  * [Search against self](#1-1)
  * [Search against Swiss-Prot](#1-2)
* [SS-filter pipeline](#2)
  * [Search against self](#2-1)
  * [Search against Swiss-Prot](#2-2)
* [TM-align compute with Spark](#3)
* [Start from FASTA (preprocessing)](#4)
* [Train your own SS-predictor](#5)

## SS-sort(COS) pipeline
<span id="1"></span>
<div align=center><img src="scientist_figures/workflow_img/ss-sort(cos).png" width="80%" height="80%" /></div>

### 1. Search against self
<span id="1-1"></span>

In [14]:
!python ./ss_filter/main_similarity.py \
-qer './example/query_mean_esm_result_cpu.pkl' \
-ter './example/query_mean_esm_result_cpu.pkl' \
-smp './ss_filter_data/esm_ss_predict/model_scop_tri.sav' \
-d \
-opr './example/ss_sort_cos_self'

None of GPU is selected.
query protein list: 100%|█████████████████████| 5/5 [00:00<00:00, 119837.26it/s]
[I 221208 21:39:33 main_similarity:145] Sort end.
Esm embedding generate time cost: 0.07173991203308105 s
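
The "COS" in the pipeline name presumably refers to cosine similarity between query and target embeddings. The sketch below illustrates that idea only: it ranks targets by cosine similarity to one query vector. The function names (`cosine`, `rank_targets`), the dictionary layout, and the toy 3-dimensional vectors are assumptions made for illustration; they are not taken from `main_similarity.py`, and the real `.pkl` files may be structured differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_targets(query_vec, targets):
    """Return (target_id, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in targets.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy data standing in for mean embeddings.
toy_targets = {
    "T1": [1.0, 0.0, 0.0],
    "T2": [0.0, 1.0, 0.0],
    "T3": [1.0, 1.0, 0.0],
}
ranking = rank_targets([2.0, 0.0, 0.0], toy_targets)
print([name for name, _ in ranking])  # ['T1', 'T3', 'T2']
```

Cosine similarity ignores vector length, so the query `[2.0, 0.0, 0.0]` still scores 1.0 against `T1`.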


### 2. Search against Swiss-Prot
<span id="1-2"></span>

In [15]:
!python ./ss_filter/main_similarity.py \
-qer './example/query_mean_esm_result_cpu.pkl' \
-ter './ss_filter_data/swissprot_to_swissprot/target_mean_esm_result_cpu.pkl' \
-smp './ss_filter_data/esm_ss_predict/model_scop_tri.sav' \
-d \
-opr './example/ss_sort_cos_swissprot'

None of GPU is selected.
query protein list: 100%|█████████████████████████| 5/5 [00:48<00:00,  9.65s/it]
[I 221208 21:40:48 main_similarity:145] Sort end.
Esm embedding generate time cost: 77.84056210517883 s


## SS-filter pipeline
<span id="2"></span>
<div align=center><img src="scientist_figures/workflow_img/main.png" width="80%" height="80%" /></div>

### 1. Search against self
<span id="2-1"></span>

In [16]:
# Step 1: generate the Pfam clan prefilter result
!python ./ss_filter/main_pfam.py \
-qpr './example/query_pfam_result.json' \
-tpr './example/query_pfam_result.json' \
-c \
-opr './example/pfamclan_self'

[I 221208 21:40:53 main_pfam:13] query protein num = 5
[I 221208 21:40:53 main_pfam:14] target protein num = 5
query protein list: 100%|█████████████████████| 5/5 [00:00<00:00, 134432.82it/s]
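
As a rough illustration of what a clan-based prefilter can look like, the sketch below keeps only the query-target pairs that share at least one Pfam clan. This is an assumption about the general idea, not a description of `main_pfam.py`: the function name `clan_prefilter`, the dictionary layout, and the clan IDs are hypothetical placeholders, and the real JSON files may be organized differently.

```python
def clan_prefilter(query_clans, target_clans):
    """Return (query_id, target_id) pairs whose clan sets overlap."""
    pairs = []
    for q_id, q_set in query_clans.items():
        for t_id, t_set in target_clans.items():
            if q_set & t_set:  # non-empty intersection: at least one shared clan
                pairs.append((q_id, t_id))
    return pairs

# Hypothetical toy annotations (placeholder clan IDs).
toy_query = {"Q1": {"CL0001"}, "Q2": {"CL0023", "CL0036"}}
toy_target = {"T1": {"CL0001"}, "T2": {"CL0036"}, "T3": {"CL0192"}}

prefilter_pairs = clan_prefilter(toy_query, toy_target)
print(prefilter_pairs)  # [('Q1', 'T1'), ('Q2', 'T2')]
```

Only the surviving pairs would then need to be scored in Step 2, which is what makes a prefilter worthwhile on a large target set.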


In [17]:
# Step 2: SS-filter search
!python ./ss_filter/main_similarity.py \
-qer './example/query_mean_esm_result_cpu.pkl' \
-ter './example/query_mean_esm_result_cpu.pkl' \
-smp './ss_filter_data/esm_ss_predict/model_scop_tri.sav' \
-ipr './example/pfamclan_self' \
-d \
-opr './example/ss_filter_self'

None of GPU is selected.
Get prefilter list: 5it [00:00, 32615.12it/s]
[I 221208 21:40:54 main_similarity:104] prefilter num = 5
query protein list: 100%|█████████████████████| 5/5 [00:00<00:00, 104335.92it/s]
[I 221208 21:40:54 main_similarity:145] Sort end.
Esm embedding generate time cost: 0.07499408721923828 s


### 2. Search against Swiss-Prot
<span id="2-2"></span>

In [18]:
# Step 1: generate the Pfam clan prefilter result
!python ./ss_filter/main_pfam.py \
-qpr './example/query_pfam_result.json' \
-tpr './ss_filter_data/swissprot_to_swissprot/target_pfam_result.json' \
-c \
-opr './example/pfamclan_swissprot'

[I 221208 21:40:56 main_pfam:13] query protein num = 5
[I 221208 21:40:56 main_pfam:14] target protein num = 498654
query protein list: 100%|█████████████████████████| 5/5 [00:00<00:00,  5.81it/s]


In [19]:
# Step 2: SS-filter search
!python ./ss_filter/main_similarity.py \
-qer './example/query_mean_esm_result_cpu.pkl' \
-ter './ss_filter_data/swissprot_to_swissprot/target_mean_esm_result_cpu.pkl' \
-smp './ss_filter_data/esm_ss_predict/model_scop_tri.sav' \
-ipr './example/pfamclan_swissprot' \
-d \
-opr './example/ss_filter_swissprot'

None of GPU is selected.
Get prefilter list: 19238it [00:00, 193275.07it/s]
[I 221208 21:41:23 main_similarity:104] prefilter num = 19238
query protein list: 100%|█████████████████████| 5/5 [00:00<00:00, 173318.35it/s]
[I 221208 21:41:23 main_similarity:145] Sort end.
Esm embedding generate time cost: 25.38034415245056 s


## TM-align compute with Spark
<span id="3"></span>
<div align=center><img src="scientist_figures/workflow_img/tmalign_compute.png" width="80%" height="80%" /></div>

In [20]:
# Build the pytmalign extension in place
%cd ./ss_filter/pytmalign/
!python setup.py build_ext --inplace
%cd ../..
# Run the TM-align computation with Spark
%cd ./ss_filter/
!python tmalign_compute.py \
-qsd '../ss_filter_data/swissprot_to_swissprot/query_structure/' \
-tsd '../ss_filter_data/swissprot_to_swissprot/query_structure/' \
-ipr '../example/tmalign_compute/test' \
-s
%cd ..

/data1/lw/git_ss_filter/git_ss_filter/ss_filter/pytmalign
running build_ext
/data1/lw/git_ss_filter/git_ss_filter
/data1/lw/git_ss_filter/git_ss_filter/ss_filter
Get prefilter list: 6it [00:00, 62757.67it/s]
100%|█████████████████████████████████████████| 6/6 [00:00<00:00, 285975.27it/s]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/08 21:41:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/12/08 21:41:28 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Compute total time cost 5.4679858684539795 s                                    
/data1/lw/git_ss_filter/git_ss_filter


## Start from FASTA (preprocessing)
<span id="4"></span>

### 1. Generate ESM-1b embedding

In [21]:
# Generate the ESM embeddings
!python ./ss_filter/esm_generate.py \
-f './ss_filter_data/swissprot_to_swissprot/query_protein.fasta' \
-m './example/query_mean_esm_result_cpu.pkl' \
--nogpu

Read ./ss_filter_data/swissprot_to_swissprot/query_protein.fasta with 5 sequences
Processing 1 of 1 batches (5 sequences)
Esm embedding generate time cost: 14.700751781463623 s
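
The output file name (`query_mean_esm_result_cpu.pkl`) suggests that each protein is stored as the mean of its per-residue embeddings. The sketch below shows that pooling step on toy data, assuming this reading is correct; `mean_pool` is a hypothetical helper, not a function from `esm_generate.py`, and the 2-dimensional vectors only stand in for real embeddings.

```python
def mean_pool(per_residue):
    """Average a list of per-residue vectors into one fixed-length vector."""
    length = len(per_residue)
    dim = len(per_residue[0])
    return [sum(row[j] for row in per_residue) / length for j in range(dim)]

# Hypothetical protein with 3 residues and 2-dimensional residue embeddings.
toy_residue_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(toy_residue_embeddings)
print(pooled)  # [3.0, 4.0]
```

Pooling this way gives every protein a vector of the same length regardless of sequence length, which is what allows direct vector comparisons between proteins.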


### 2. Generate Pfam result

In [22]:
# Generate the Pfam result
!python ./ss_filter/pfam_local_generate.py \
-f './ss_filter_data/swissprot_to_swissprot/query_protein.fasta' \
-o './example/query_pfam_result.json'

1670506908.0153909
perl ./ss_filter_data/PfamScan/pfam_scan.pl -fasta ./ss_filter_data/swissprot_to_swissprot/query_protein.fasta -dir ./ss_filter_data/Pfam_db -outfile ./tmp.txt
Pfam local generate time cost 1.7778055667877197 s


## Train your own SS-predictor
<span id="5"></span>
<div align=center><img src="scientist_figures/workflow_img/ss-predictor.png" width="80%" height="80%" /></div>
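
The training log in this section reports a `(8953, 8953)` matrix, 40082581 loaded sequence pairs, and 36074322 training samples per epoch. The arithmetic below checks one reading that is consistent with those numbers: the pairs are the upper triangle of the matrix including the diagonal, and 90% of them are used for training. This interpretation is inferred from the numbers alone and is not confirmed by the source.

```python
n_proteins = 8953

# Unordered pairs including self-pairs: n * (n + 1) / 2
total_pairs = n_proteins * (n_proteins + 1) // 2
print(total_pairs)  # 40082581

# 90% of the pairs, rounded down (integer arithmetic avoids float error)
train_pairs = total_pairs * 9 // 10
print(train_pairs)  # 36074322
```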

In [24]:
# Train the SS-predictor
!python ./ss_filter/esm_ss_predict_tri_train.py \
-d \
-mer './ss_filter_data/esm_ss_predict/train/mean_esm_result_cpu.pkl' \
--save_model_path './example/ss_predictor/model_scop_tri_cpu.sav'

None of GPU is selected.
# training with esm_ss_predict_tri: ss_batch_size=100, epochs=20, lr=1e-05
# save model path: ./example/ss_predictor/model_scop_tri_cpu.sav
# loading esm result: ./ss_filter_data/esm_ss_predict/train/mean_esm_result_cpu.pkl
# loading protein list file: ./ss_filter_data/esm_ss_predict/train/protein_list.txt
# loading ss mat file: ./ss_filter_data/esm_ss_predict/train/ss_mat.npz
[I 221208 21:43:12 esm_ss_predict:44] (8953, 8953) 40082581
PPI: 100%|██████████████████████| 40082581/40082581 [01:52<00:00, 357002.92it/s]
# loaded 40082581 sequence pairs
# training model
Epoch 1
-------------------------------
Train_mse_loss_avg: 0.049949  [    0/36074322]
Train_mse_loss_avg: 0.006371  [10000/36074322]
Train_mse_loss_avg: 0.005829  [20000/36074322]
Train_mse_loss_avg: 0.006528  [30000/36074322]
Train_mse_loss_avg: 0.005094  [40000/36074322]
Train_mse_loss_avg: 0.002771  [50000/36074322]
Train_mse_loss_avg: 0.001889  [60000/36074322]
Train_mse_loss_avg: 0.003