This part of the tutorial shows how to obtain the ESM-2 embedding of a gene from its protein sequence.

#### Query and Download

We need to download the protein sequence of each gene from [UniProt](https://www.uniprot.org/) (here we use `P61922`, `Q3UJF9` and `Q91ZH7` as examples). The following script does this efficiently:

In [1]:
from urllib.request import urlopen

In [2]:
gene_id_list = ['Q3UJF9', 'P61922', 'Q91ZH7']

In [3]:
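# Write one two-line record per protein: a '>ID' header, then the sequence on a single line.
with open('gene_protein.txt', 'w') as f:
    for gene_id in gene_id_list:
        url = 'https://rest.uniprot.org/uniprotkb/' + gene_id + '.fasta'
        with urlopen(url) as response:
            fasta = response.read().decode('utf-8')
        # Drop the UniProt header line and join the wrapped sequence lines.
        sequence = ''.join(fasta.splitlines()[1:])
        f.write('>' + gene_id + '\n')
        f.write(sequence + '\n')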
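
The download loop above relies on each UniProT response being a single FASTA record: one header line followed by wrapped sequence lines. As a minimal sketch of that flattening step in isolation, the following can be used to check the logic without network access. The helper name `flatten_fasta` and the sample record are hypothetical and not part of the original script.

```python
def flatten_fasta(fasta_text, gene_id):
    """Return a two-line record: '>gene_id' and the sequence on one line."""
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith('>'):
        raise ValueError('expected a FASTA record starting with ">"')
    # Skip the original header and join the wrapped sequence lines.
    sequence = ''.join(line.strip() for line in lines[1:])
    return '>' + gene_id + '\n' + sequence + '\n'


# Made-up record for illustration only; not a real UniProt entry.
sample = '>sp|X00000|EXAMPLE_MOUSE Hypothetical protein\nMKTAYIAKQR\nQISFVKSHFS\n'
print(flatten_fasta(sample, 'X00000'))
```

Raising an error on a missing header is a design choice of this sketch: it makes a failed or empty download visible instead of silently writing an empty sequence.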

#### ESM-2 Embedding

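Install the ESM-2 model (you can refer to [ESM-2](https://github.com/facebookresearch/esm)), then **run the following command** from the repository root. It extracts the layer-36 representations (`--repr_layers 36`) and saves both the sequence-averaged and the per-token embeddings (`--include mean per_tok`):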

CUDA_VISIBLE_DEVICES=0 python scripts/extract.py esm2_t36_3B_UR50D gene_protein.txt examples/data/some_proteins_emb_esm2 --repr_layers 36 --include mean per_tok

After the command finishes successfully, one embedding file per gene (in this case, `P61922.pt`, `Q3UJF9.pt` and `Q91ZH7.pt`) is generated in the `esm-main/examples/data/some_proteins_emb_esm2/` directory. We combine them into a single table:

In [4]:
import os
import torch
import pickle
import pandas as pd

df = pd.DataFrame()

for path, dir_lst, file_lst in os.walk(r'examples/data/some_proteins_emb_esm2'):
    for file_name in file_lst:
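        # torch.load accepts a path directly, so no file handle is left open.
        data = torch.load(os.path.join(path, file_name))
        # Per-token layer-36 representations; [-1] keeps the vector at the last position.
        # (The sequence-averaged vector is stored under data['mean_representations'][36].)
        df.insert(df.shape[1], data['label'], data['representations'][36][-1].numpy())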

df

| | Q3UJF9 | P61922 | Q91ZH7 |
|---|---|---|---|
| 0 | 0.134797 | 0.025256 | -0.149445 |
| 1 | -0.053382 | -0.031979 | -0.187018 |
| 2 | -0.012659 | 0.004654 | -0.018185 |
| 3 | 0.043372 | -0.081000 | -0.261306 |
| 4 | -0.072901 | 0.091502 | -0.136699 |
| ... | ... | ... | ... |
| 2555 | -0.071132 | -0.050865 | 0.162222 |
| 2556 | -0.076116 | -0.019745 | -0.090902 |
| 2557 | 0.065875 | 0.103102 | 0.006431 |
| 2558 | -0.208240 | -0.061796 | -0.100855 |
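
If the combined table has fewer columns than expected, a quick check is to see which IDs lack a `.pt` file. The sketch below assumes the output directory used in the command above and file names of the form `<ID>.pt`, as listed earlier. The helper name `missing_embeddings` is hypothetical.

```python
from pathlib import Path


def missing_embeddings(gene_ids, out_dir):
    """Return the IDs that have no '<ID>.pt' file in out_dir."""
    out_dir = Path(out_dir)
    return [gene_id for gene_id in gene_ids
            if not (out_dir / (gene_id + '.pt')).is_file()]


# Directory name taken from the extraction command; adjust if yours differs.
missing = missing_embeddings(['Q3UJF9', 'P61922', 'Q91ZH7'],
                             'examples/data/some_proteins_emb_esm2')
if missing:
    print('No embedding file found for:', ', '.join(missing))
```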


In [5]:
# Save the combined table; the context manager closes the file after writing.
with open('emb.pkl', 'wb') as f:
    pickle.dump(df, f)