# Build your PLMSearch locally 🧪

**Notice:**

**The experiments were run on a server with a `56-core Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40 GHz and 252 GB RAM`.**

**The GPU environment of the server is `1 × GeForce GTX 1080 Ti and 11 GB GPU Memory`.**

## Quick links
* [Start from Fasta (preprocessing)](#1)
  * [Generate ESM-1b embedding](#1-1)
  * [Generate Pfam result](#1-2)
* [SS-predictor pipeline](#2)
* [PLMSearch pipeline](#3)
* [Alignment (Sequence align & TM-align)](#4)

## Start from Fasta (preprocessing)
<span id="1"></span>

### 1. Generate ESM-1b embedding
<span id="1-1"></span>

In [9]:
#esm generate
!python ./plmsearch/embedding_generate.py \
-f './example/protein.fasta' \
-e './example/embedding.pkl' #--nogpu #for CPU-ONLY

Transferred model to GPU
Read ./example/protein.fasta with 5 sequences
Processing 1 of 1 batches (5 sequences)
Embedding generation time cost: 15.276249408721924 s
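
Before moving on, it can help to sanity-check the generated file. The sketch below assumes `embedding.pkl` unpickles to a dict that maps protein IDs to fixed-length vectors; that layout is an assumption, not something the output above confirms, and `summarize_embeddings` is a hypothetical helper. It writes a toy file first so the example runs without PLMSearch installed.

```python
import os
import pickle
import tempfile


def summarize_embeddings(path):
    """Return (number of proteins, set of vector lengths) for a pickled dict.

    Assumes the pickle holds {protein_id: vector}; this layout is an
    assumption and may differ from what embedding_generate.py writes.
    """
    with open(path, "rb") as handle:
        embeddings = pickle.load(handle)
    lengths = {len(vector) for vector in embeddings.values()}
    return len(embeddings), lengths


# Toy stand-in for ./example/embedding.pkl: two proteins, 4-dim vectors.
# "Q00001" is a made-up ID used only for illustration.
toy = {"P0AD96": [0.1, 0.2, 0.3, 0.4], "Q00001": [0.5, 0.6, 0.7, 0.8]}
toy_path = os.path.join(tempfile.mkdtemp(), "embedding.pkl")
with open(toy_path, "wb") as handle:
    pickle.dump(toy, handle)

count, lengths = summarize_embeddings(toy_path)
print(count, lengths)
```

More than one value in `lengths` would suggest that the vectors were not pooled to a common size. Only unpickle files you generated yourself, since `pickle.load` can execute arbitrary code.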


### 2. Generate Pfam result
PLMSearch requires this input; SS-predictor does not.
<span id="1-2"></span>

In [3]:
#pfam generate
!python ./plmsearch/pfam_local_generate.py \
-f './example/protein.fasta' \
-o './example/pfam_result.json'

1693054451.8638368
sudo perl ./plmsearch_data/PfamScan/pfam_scan.pl -fasta ./example/protein.fasta -dir ./plmsearch_data/Pfam_db -outfile ./tmp.txt

[sudo] password for lw: 
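
The log above shows that the script shells out to `pfam_scan.pl` under `sudo`. If you want to run that step by hand, the sketch below assembles the same command as an argument list. `build_pfam_scan_cmd` is a hypothetical helper, and whether `sudo` is actually needed depends on the file permissions of your Pfam database, so it is off by default here.

```python
import shlex


def build_pfam_scan_cmd(fasta, db_dir, outfile, use_sudo=False):
    """Assemble the pfam_scan.pl command shown in the log as an argv list."""
    cmd = [
        "perl",
        "./plmsearch_data/PfamScan/pfam_scan.pl",
        "-fasta", fasta,
        "-dir", db_dir,
        "-outfile", outfile,
    ]
    # Prepend sudo only when explicitly requested.
    return ["sudo"] + cmd if use_sudo else cmd


cmd = build_pfam_scan_cmd(
    "./example/protein.fasta", "./plmsearch_data/Pfam_db", "./tmp.txt"
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`; using a list rather than a single string avoids shell quoting problems with unusual paths.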


## SS-predictor pipeline
<span id="2"></span>
<div align=center><img src="scientist_figures/workflow_img/ss_predictor3.png" width="90%" height="90%" /></div>

Swiss-Prot is used as the target dataset.

In [10]:
!python ./plmsearch/main_similarity.py \
-iqe './example/embedding.pkl' \
-ite './plmsearch_data/swissprot/embedding.pkl' \
-smp './plmsearch_data/model/plmsearch.sav' #-d #for CPU-ONLY

Embedding load time cost: 31.034204721450806 s
We have 4 GPUs in total!, we will use as you selected
Search query proteins batch by batch: 100%|███████| 1/1 [00:08<00:00,  8.97s/it]
Search time cost: 10.833179712295532 s
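
To make embedding-based search concrete, the sketch below ranks targets by cosine similarity to a query vector. This is a stand-in only: it does not reproduce the trained model stored in `plmsearch.sav`, and the IDs and 3-dimensional vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_targets(query_vec, targets):
    """Return [(target_id, score)] sorted from most to least similar."""
    scored = [(tid, cosine(query_vec, vec)) for tid, vec in targets.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy embeddings, not real protein representations.
query = [1.0, 0.0, 0.0]
targets = {
    "T1": [1.0, 0.0, 0.0],   # same direction as the query
    "T2": [0.0, 1.0, 0.0],   # orthogonal to the query
    "T3": [1.0, 1.0, 0.0],   # 45 degrees from the query
}
ranking = rank_targets(query, targets)
print([tid for tid, _ in ranking])
```

Scoring every query against every target scales with the product of the two set sizes, which is why the PLMSearch pipeline below adds a prefilter step.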


## PLMSearch pipeline
<span id="3"></span>
<div align=center><img src="scientist_figures/workflow_img/framework.png" width="90%" height="90%" /></div>

Swiss-Prot is used as the target dataset.

In [11]:
#Step 1. generate pfamclan prefilter result
!python ./plmsearch/main_pfam.py \
-qpr './example/pfam_result.json' \
-tpr './plmsearch_data/swissprot/pfam_result.json' \
-c

[I 230826 21:00:01 main_pfam:13] query protein num = 5
[I 230826 21:00:01 main_pfam:14] target protein num = 430140
query protein list: 100%|█████████████████████████| 5/5 [00:00<00:00,  5.65it/s]
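
As the name suggests, the prefilter keeps query-target pairs that share a Pfam clan. A minimal sketch of that idea follows, assuming each protein maps to a set of clan identifiers. The real `pfam_result.json` schema is not shown in this document, so `shared_clan_pairs`, the protein IDs, and the clan IDs are illustrative only.

```python
def shared_clan_pairs(query_clans, target_clans):
    """Return sorted (query_id, target_id) pairs sharing at least one clan."""
    # Invert the target mapping once so each query needs only dict lookups.
    clan_to_targets = {}
    for target_id, clans in target_clans.items():
        for clan in clans:
            clan_to_targets.setdefault(clan, set()).add(target_id)

    pairs = set()
    for query_id, clans in query_clans.items():
        for clan in clans:
            for target_id in clan_to_targets.get(clan, ()):
                pairs.add((query_id, target_id))
    return sorted(pairs)


# Made-up proteins and clan labels for illustration.
queries = {"Q1": {"CL0001"}, "Q2": {"CL0002", "CL0003"}}
targets = {"T1": {"CL0001"}, "T2": {"CL0003"}, "T3": {"CL0004"}}
pairs = shared_clan_pairs(queries, targets)
print(pairs)
```

Inverting the target side first keeps the work proportional to the number of matching pairs rather than to the full query-by-target grid.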


In [12]:
#Step 2. PLMSearch search
!python ./plmsearch/main_similarity.py \
-iqe './example/embedding.pkl' \
-ite './plmsearch_data/swissprot/embedding.pkl' \
-smp './plmsearch_data/model/plmsearch.sav' \
-isr './example/search_result/pfamclan' #-d #for CPU-ONLY

Embedding load time cost: 30.055421352386475 s
We have 4 GPUs in total!, we will use as you selected
Get search list: 17355it [00:00, 194079.78it/s]
[I 230826 21:00:35 main_similarity:172] presearch num = 17355
Search query proteins batch by batch: 100%|███████| 1/1 [00:04<00:00,  4.51s/it]
Search time cost: 6.460866689682007 s
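
The effect of the prefilter can be estimated from the numbers in the two logs. The arithmetic below assumes that `presearch num` counts query-target pairs, which the log does not state explicitly.

```python
queries = 5            # "query protein num" from the Step 1 log
targets = 430140       # "target protein num" from the Step 1 log
presearch = 17355      # "presearch num" from the Step 2 log

all_pairs = queries * targets          # every query against every target
fraction = presearch / all_pairs       # share of pairs that survive the prefilter
print(f"{presearch} of {all_pairs} pairs scored ({fraction:.2%})")
```

Under that assumption, fewer than 1% of the 2,150,700 possible pairs are passed to the similarity model.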


## Alignment (Sequence align & TM-align)
<span id="4"></span>

In [13]:
!python ./plmsearch/sequence_align.py \
-qf './example/protein.fasta' \
-tf './example/protein.fasta' \
-ipr './example/alignment/test'

pairwise sequence align: 100%|█████████████████| 6/6 [00:00<00:00, 27962.03it/s]
sequence align output:   0%|                              | 0/6 [00:00<?, ?it/s]
P0AD96	P0AD96	1.0
>P0AD96	P0AD96
MNIKGKALLAGCIALAFSNMALAEDIKVAVVGAMSGPVAQYGDQEFTGAEQAVADINAKGGIKGNKLQIVKYDDACDPKQAVAVANKVVNDGIKYVIGHLCSSSTQPASDIYEDEGILMITPAATAPELTARGYQLILRTTGLDSDQGPTAAKYILEKVKPQRIAIVHDKQQYGEGLARAVQDGLKKGNANVVFFDGITAGEKDFSTLVARLKKENIDFVYYGGYHPEMGQILRQARAAGLKTQFMGPEGVANVSLSNIAGESAEGLLVTKPKNYDQVPANKPIVDAIKAKKQDPSGAFVWTTYAALQSLQAGLNQSDDPAEIAKYLKANSVDTVMGPLTWDEKGDLKGFEFGVFDWHANGTATDAK
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
MNIKGKALLAGCIALAFSNMALAEDIKVAVVGAMSGPVAQYGDQEFTGAEQAVADINAKGGIKGNKLQI
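
The middle line of each record marks matching columns with `|`. The sketch below shows how such a marker line and an identity fraction can be derived from two aligned strings. It assumes the common convention that `|` means identical residues and `-` means a gap; this is not taken from `sequence_align.py`, and the score printed by that script may be computed differently.

```python
def match_line(aligned_a, aligned_b, gap="-"):
    """Return a marker string: '|' where residues are identical, ' ' elsewhere."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "|" if a == b and a != gap else " "
        for a, b in zip(aligned_a, aligned_b)
    )


def identity(aligned_a, aligned_b, gap="-"):
    """Fraction of alignment columns holding identical residues."""
    markers = match_line(aligned_a, aligned_b, gap)
    return markers.count("|") / len(markers) if markers else 0.0


a = "MNIKGKALLA"   # first ten residues of P0AD96 from the output above
b = "MNIKG-ALLS"   # hypothetical partner with one gap and one mismatch
print(match_line(a, b))
print(identity(a, b))
```

For a self-alignment such as `P0AD96` against `P0AD96`, every column matches, which is consistent with the unbroken run of `|` in the output.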

In [14]:
!python ./plmsearch/tmalign.py \
-qsd './example/structure/' \
-tsd './example/structure/' \
-ipr './example/alignment/test'

pairwise tmalign: 100%|████████████████████████| 6/6 [00:00<00:00, 24648.21it/s]
tmalign output:   0%|                                     | 0/6 [00:00<?, ?it/s]
P0AD96	P0AD96	1.0
>P0AD96	P0AD96
MNIKGKALLAGCIALAFSNMALAEDIKVAVVGAMSGPVAQYGDQEFTGAEQAVADINAKGGIKGNKLQIVKYDDACDPKQAVAVANKVVNDGIKYVIGHLCSSSTQPASDIYEDEGILMITPAATAPELTARGYQLILRTTGLDSDQGPTAAKYILEKVKPQRIAIVHDKQQYGEGLARAVQDGLKKGNANVVFFDGITAGEKDFSTLVARLKKENIDFVYYGGYHPEMGQILRQARAAGLKTQFMGPEGVANVSLSNIAGESAEGLLVTKPKNYDQVPANKPIVDAIKAKKQDPSGAFVWTTYAALQSLQAGLNQSDDPAEIAKYLKANSVDTVMGPLTWDEKGDLKGFEFGVFDWHANGTATDAK
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
MNIKGKALLAGCIALAFSNMALAEDIKVAVVGAMSGPVAQYGDQEFTGAEQAVADINAKGGIKGNKLQI
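
In TM-align's own output, `:` marks aligned residue pairs closer than 5.0 Å and `.` marks other aligned pairs. Assuming `tmalign.py` passes that marker line through unchanged, a small hypothetical helper such as `summarize_markers` can tally the marker types:

```python
def summarize_markers(markers):
    """Count TM-align marker types in the middle line of an alignment."""
    close = markers.count(":")      # aligned pairs closer than 5.0 Angstrom
    other = markers.count(".")      # remaining aligned pairs
    unaligned = len(markers) - close - other
    return {"close": close, "other": other, "unaligned": unaligned}


# Made-up marker line: six close pairs, two other pairs, one unaligned column.
summary = summarize_markers(":::..: ::")
print(summary)
```

A self-alignment, as in the output above, consists entirely of `:` markers.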