# Search your own data 🧪

**Note: The experiments were run on a server with a 72-core `Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz` and 503 GB of RAM. The server has one GPU: `1×Tesla V100 PCIe 32GB`.**

## Quick links

* [SS-sort(COS) pipeline](#1)
  * [Search against self](#1-1)
  * [Search against Swiss-Prot](#1-2)
* [SS-filter pipeline](#2)
  * [Search against self](#2-1)
  * [Search against Swiss-Prot](#2-2)
* [TM-align compute with Spark](#3)
* [Start from FASTA (preprocessing)](#4)
* [Train your own SS-predictor](#5)

## SS-sort(COS) pipeline
<span id="1"></span>
<div align=center><img src="scientist_figures/workflow_img/ss-sort(cos).png" width="80%" height="80%" /></div>

### 1. Search against self
<span id="1-1"></span>

In [12]:
!CUDA_VISIBLE_DEVICES=0 python ./ss_filter/main_similarity.py \
-qer './example/query_mean_esm_result.pkl' \
-ter './example/query_mean_esm_result.pkl' \
-smp './ss_filter_data/esm_ss_predict/model_scop_tri.sav' \
-opr './example/ss_sort_cos_self'

We have 1 GPUs in total!, we will use as you selected
query protein list: 100%|█████████████████████| 5/5 [00:00<00:00, 117159.33it/s]
[I 221208 21:11:06 main_similarity:148] Sort end.
Esm embedding generate time cost: 2.7625627517700195 s

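The "COS" in the pipeline name suggests that targets are ranked by cosine similarity between embeddings. The block below is a minimal sketch of that idea, not the repository's implementation: it assumes each protein is represented by one fixed-length vector, uses randomly generated 16-dimensional vectors in place of real ESM embeddings, and does not model the SS-predictor passed via `-smp`. The names `cosine`, `rank_targets`, and `protein_0` to `protein_4` are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_targets(query_vec, targets):
    """Return (target_id, score) pairs sorted from most to least similar."""
    scored = [(tid, cosine(query_vec, vec)) for tid, vec in targets.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy stand-in for a "search against self": 5 proteins, random vectors.
random.seed(0)
embeddings = {
    f"protein_{i}": [random.gauss(0.0, 1.0) for _ in range(16)]
    for i in range(5)
}

# Each protein is used both as a query and as a target.
ranking = {qid: rank_targets(vec, embeddings) for qid, vec in embeddings.items()}
```

In a self-search like this one, every query should rank itself first with a score of 1 (up to floating-point error), which makes a quick sanity check for the real output as well.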

### 2. Search against Swiss-Prot
<span id="1-2"></span>

In [13]:
!CUDA_VISIBLE_DEVICES=0 python ./ss_filter/main_similarity.py \
-qer './example/query_mean_esm_result.pkl' \
-ter './ss_filter_data/swissprot_to_swissprot/target_mean_esm_result.pkl' \
-smp './ss_filter_data/esm_ss_predict/model_scop_tri.sav' \
-opr './example/ss_sort_cos_swissprot'

We have 1 GPUs in total!, we will use as you selected
query protein list: 100%|█████████████████████████| 5/5 [00:12<00:00,  2.45s/it]
[I 221208 21:12:07 main_similarity:148] Sort end.
Esm embedding generate time cost: 64.81993365287781 s


## SS-filter pipeline
<span id="2"></span>
<div align=center><img src="scientist_figures/workflow_img/main.png" width="80%" height="80%" /></div>

### 1. Search against self
<span id="2-1"></span>

In [14]:
#Step 1. generate pfamclan prefilter result
!python ./ss_filter/main_pfam.py \
-qpr './example/query_pfam_result.json' \
-tpr './example/query_pfam_result.json' \
-c \
-opr './example/pfamclan_self'

[I 221208 21:12:14 main_pfam:13] query protein num = 5
[I 221208 21:12:14 main_pfam:14] target protein num = 5
query protein list: 100%|█████████████████████| 5/5 [00:00<00:00, 138884.24it/s]

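One way to picture the Pfam clan prefilter is as a join: a target stays a candidate for a query only if the two share at least one clan. The sketch below illustrates that idea under stated assumptions. The input layout (protein ID mapped to a set of clan accessions) is hypothetical and is not the schema of `query_pfam_result.json`, and the IDs `Q1` to `Q3`, `T1` to `T4`, and the clan accessions are illustrative values only.

```python
def clan_prefilter(query_clans, target_clans):
    """Map each query ID to the sorted target IDs sharing at least one clan."""
    # Invert the target map once: clan accession -> set of target IDs.
    clan_to_targets = {}
    for target_id, clans in target_clans.items():
        for clan in clans:
            clan_to_targets.setdefault(clan, set()).add(target_id)

    candidates = {}
    for query_id, clans in query_clans.items():
        hits = set()
        for clan in clans:
            hits |= clan_to_targets.get(clan, set())
        candidates[query_id] = sorted(hits)
    return candidates


# Hypothetical annotations; a protein without a clan gets an empty set.
query_clans = {"Q1": {"CL0023"}, "Q2": {"CL0016", "CL0063"}, "Q3": set()}
target_clans = {
    "T1": {"CL0023"},
    "T2": {"CL0063"},
    "T3": {"CL0023", "CL0016"},
    "T4": {"CL0192"},
}

pairs = clan_prefilter(query_clans, target_clans)
```

Inverting the target map first keeps the work proportional to the number of annotations rather than to the number of query-target pairs, which matters when the target set is the size of Swiss-Prot.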

In [15]:
#Step 2. ss-filter search
!CUDA_VISIBLE_DEVICES=0 python ./ss_filter/main_similarity.py \
-qer './example/query_mean_esm_result.pkl' \
-ter './example/query_mean_esm_result.pkl' \
-smp './ss_filter_data/esm_ss_predict/model_scop_tri.sav' \
-ipr './example/pfamclan_self' \
-opr './example/ss_filter_self'

We have 1 GPUs in total!, we will use as you selected
Get prefilter list: 5it [00:00, 41040.16it/s]
[I 221208 21:12:17 main_similarity:107] prefilter num = 5
query protein list: 100%|█████████████████████| 5/5 [00:00<00:00, 134432.82it/s]
[I 221208 21:12:18 main_similarity:148] Sort end.
Esm embedding generate time cost: 2.783625602722168 s

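The two steps above form a filter-then-rank design: the prefilter result passed via `-ipr` limits which query-target pairs are scored at all. A minimal sketch of that control flow follows, assuming the prefilter can be read as a mapping from query ID to candidate target IDs. The function `rank_within_prefilter`, the IDs, and the score table are hypothetical, and the scoring function is a stand-in rather than the repository's similarity model.

```python
def rank_within_prefilter(prefilter, score_fn):
    """Score only prefiltered pairs and sort each query's hits by score."""
    ranked = {}
    for query_id, target_ids in prefilter.items():
        scored = [(tid, score_fn(query_id, tid)) for tid in target_ids]
        ranked[query_id] = sorted(scored, key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical similarity scores for the pairs that survived the prefilter.
toy_scores = {("Q1", "T1"): 0.42, ("Q1", "T3"): 0.91, ("Q2", "T2"): 0.77}
prefilter = {"Q1": ["T1", "T3"], "Q2": ["T2"], "Q3": []}

ranked = rank_within_prefilter(prefilter, lambda q, t: toy_scores[(q, t)])
```

A query with no prefilter candidates, such as `Q3` here, simply ends up with an empty hit list.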

### 2. Search against Swiss-Prot
<span id="2-2"></span>

In [16]:
#Step 1. generate pfamclan prefilter result
!python ./ss_filter/main_pfam.py \
-qpr './example/query_pfam_result.json' \
-tpr './ss_filter_data/swissprot_to_swissprot/target_pfam_result.json' \
-c \
-opr './example/pfamclan_swissprot'

[I 221208 21:12:20 main_pfam:13] query protein num = 5
[I 221208 21:12:20 main_pfam:14] target protein num = 498654
query protein list: 100%|█████████████████████████| 5/5 [00:00<00:00,  5.23it/s]


In [17]:
#Step 2. ss-filter search
!CUDA_VISIBLE_DEVICES=0 python ./ss_filter/main_similarity.py \
-qer './example/query_mean_esm_result.pkl' \
-ter './ss_filter_data/swissprot_to_swissprot/target_mean_esm_result.pkl' \
-smp './ss_filter_data/esm_ss_predict/model_scop_tri.sav' \
-ipr './example/pfamclan_swissprot' \
-opr './example/ss_filter_swissprot'

We have 1 GPUs in total!, we will use as you selected
Get prefilter list: 19238it [00:00, 215552.18it/s]
[I 221208 21:13:11 main_similarity:107] prefilter num = 19238
query protein list: 100%|█████████████████████| 5/5 [00:00<00:00, 171897.70it/s]
[I 221208 21:13:12 main_similarity:148] Sort end.
Esm embedding generate time cost: 50.15082263946533 s

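The logs above give a rough sense of how much the prefilter narrows the search. The short calculation below assumes that each of the 19238 prefilter entries corresponds to one query-target pair; the logs do not state this, so treat the result as an estimate. The function name `candidate_fraction` is hypothetical.

```python
def candidate_fraction(prefilter_entries, n_queries, n_targets):
    """Fraction of all query-target pairs that remain after prefiltering."""
    return prefilter_entries / (n_queries * n_targets)


# Numbers taken from the log output of the Swiss-Prot run.
fraction = candidate_fraction(prefilter_entries=19238, n_queries=5, n_targets=498654)
print(f"{fraction:.2%} of all pairs remain")
```

Under that assumption, about 0.77% of the 2,493,270 possible pairs are passed on to the similarity step.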

## TM-align compute with Spark
<span id="3"></span>
<div align=center><img src="scientist_figures/workflow_img/tmalign_compute.png" width="80%" height="80%" /></div>


In [18]:
#install
%cd ./ss_filter/pytmalign/
!python setup.py build_ext --inplace
%cd ../..
#tmalign compute with spark
%cd ./ss_filter/
!python tmalign_compute.py \
-qsd '../ss_filter_data/swissprot_to_swissprot/query_structure/' \
-tsd '../ss_filter_data/swissprot_to_swissprot/query_structure/' \
-ipr '../example/tmalign_compute/test' \
-s
%cd ..

/data1/lw/git_ss_filter/git_ss_filter/ss_filter/pytmalign
running build_ext
/data1/lw/git_ss_filter/git_ss_filter
/data1/lw/git_ss_filter/git_ss_filter/ss_filter
Get prefilter list: 6it [00:00, 64527.75it/s]
100%|█████████████████████████████████████████| 6/6 [00:00<00:00, 282762.07it/s]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/08 21:13:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/12/08 21:13:17 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Compute total time cost 5.838896989822388 s                                     
/data1/lw/git_ss_filter/git_ss_filter


## Start from FASTA (preprocessing)
<span id="4"></span>

### 1. Generate ESM-1b embedding

In [19]:
#esm generate
!CUDA_VISIBLE_DEVICES=0 python ./ss_filter/esm_generate.py \
-f './ss_filter_data/swissprot_to_swissprot/query_protein.fasta' \
-m './example/query_mean_esm_result.pkl'

Transferred model to GPU
Read ./ss_filter_data/swissprot_to_swissprot/query_protein.fasta with 5 sequences
Processing 1 of 1 batches (5 sequences)
Esm embedding generate time cost: 15.655525207519531 s

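The output file name `query_mean_esm_result.pkl` suggests that each protein's per-residue embeddings are averaged into a single vector. The sketch below shows that mean-pooling step in plain Python under this assumption; it does not call the ESM library, and the 3-dimensional toy vectors stand in for real per-residue embeddings. The function name `mean_pool` is hypothetical.

```python
def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-length protein vector."""
    if not residue_embeddings:
        raise ValueError("at least one residue embedding is required")
    n_residues = len(residue_embeddings)
    # zip(*rows) walks the embedding dimensions column by column.
    return [sum(column) / n_residues for column in zip(*residue_embeddings)]


# Toy protein with 3 residues and 3-dimensional embeddings.
per_residue = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
]
pooled = mean_pool(per_residue)
```

Mean pooling gives every protein a vector of the same length regardless of sequence length, which is what allows direct vector comparison between any query and any target.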

### 2. Generate Pfam result

In [20]:
#pfam generate
!python ./ss_filter/pfam_local_generate.py \
-f './ss_filter_data/swissprot_to_swissprot/query_protein.fasta' \
-o './example/query_pfam_result.json'

1670505219.2397802
perl ./ss_filter_data/PfamScan/pfam_scan.pl -fasta ./ss_filter_data/swissprot_to_swissprot/query_protein.fasta -dir ./ss_filter_data/Pfam_db -outfile ./tmp.txt
Pfam local generate time cost 2.3816070556640625 s

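The log above shows that the script shells out to `pfam_scan.pl`. If you want to run that step on its own, the command can be assembled as an argument list, as sketched below. The flags and paths are copied from the logged command line; the helper `build_pfam_scan_command` is hypothetical and is not part of the repository.

```python
import shlex


def build_pfam_scan_command(fasta_path, pfam_db_dir, out_path,
                            script_path="./ss_filter_data/PfamScan/pfam_scan.pl"):
    """Build the pfam_scan.pl invocation as an argument list."""
    return [
        "perl", script_path,
        "-fasta", fasta_path,
        "-dir", pfam_db_dir,
        "-outfile", out_path,
    ]


command = build_pfam_scan_command(
    fasta_path="./ss_filter_data/swissprot_to_swissprot/query_protein.fasta",
    pfam_db_dir="./ss_filter_data/Pfam_db",
    out_path="./tmp.txt",
)

# Shell-quoted form, useful for logging or for pasting into a terminal.
command_line = shlex.join(command)
print(command_line)
```

Passing the list to `subprocess.run(command, check=True)` avoids shell-quoting problems with unusual file names; that call is left out here because it needs Perl, PfamScan, and the Pfam database to be installed.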

## Train your own SS-predictor
<span id="5"></span>
<div align=center><img src="scientist_figures/workflow_img/ss-predictor.png" width="80%" height="80%" /></div>

In [21]:
#Train SS-predictor
!CUDA_VISIBLE_DEVICES=0 python ./ss_filter/esm_ss_predict_tri_train.py \
--save_model_path './example/ss_predictor/model_scop_tri.sav'

We have 1 GPUs in total! We will use as you selected
# training with esm_ss_predict_tri: ss_batch_size=100, epochs=20, lr=1e-05
# save model path: ./example/ss_predictor/model_scop_tri.sav
# loading esm result: ./ss_filter_data/esm_ss_predict/train/mean_esm_result.pkl
# loading protein list file: ./ss_filter_data/esm_ss_predict/train/protein_list.txt
# loading ss mat file: ./ss_filter_data/esm_ss_predict/train/ss_mat.npz
[I 221208 21:13:48 esm_ss_predict:58] (8953, 8953) 40082581
PPI: 100%|██████████████████████| 40082581/40082581 [02:04<00:00, 322177.73it/s]
# loaded 40082581 sequence pairs
# training model
Epoch 1
-------------------------------
Train_mse_loss_avg: 0.273594  [    0/36074322]
Train_mse_loss_avg: 0.005572  [10000/36074322]
Train_mse_loss_avg: 0.004189  [20000/36074322]
Train_mse_loss_avg: 0.003107  [30000/36074322]
Train_mse_loss_avg: 0.006650  [40000/36074322]
Train_mse_loss_avg: 0.004287  [50000/36074322]
Train_mse_loss_avg: 0.004805  [60000/36074322]
Train
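
The numbers in the training log can be cross-checked with simple arithmetic. The check below rests on two inferences from the log that the source does not confirm: that the 40082581 sequence pairs are the upper triangle (diagonal included) of the 8953 × 8953 matrix, and that the 36074322 training pairs are a 90% split of them. The variable names are hypothetical.

```python
n_proteins = 8953          # matrix shape reported in the log: (8953, 8953)
logged_pairs = 40082581    # "loaded 40082581 sequence pairs"
logged_train = 36074322    # denominator in the Train_mse_loss_avg lines

# Unordered pairs of n proteins, counting each protein paired with itself.
upper_triangle_pairs = n_proteins * (n_proteins + 1) // 2

# 90% of the pairs, rounded down using integer arithmetic.
ninety_percent = logged_pairs * 9 // 10

print(upper_triangle_pairs == logged_pairs, ninety_percent == logged_train)
```

Both comparisons hold for the logged values, which is consistent with those two readings but does not prove them.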