# Generate Protein Embeddings

As an example, here's how to process the human and mouse reference proteomes with ESM1b.

You can find proteome FASTA files on ENSEMBL: https://uswest.ensembl.org/info/data/ftp/index.html

We use a pretrained Transformer model from https://github.com/facebookresearch/esm. These models were trained on hundreds of millions of protein sequences from across the tree of life.

**NOTE:** These protein embedding scripts require an older version of the ESM repo. **Clone the ESM repo**, then check out commit
[`839c5b82c6cd9e18baa7a88dcbed3bd4b6d48e47`](https://github.com/facebookresearch/esm/commit/839c5b82c6cd9e18baa7a88dcbed3bd4b6d48e47).

## Step 1: Download reference proteome

In [1]:
!mkdir -p data

In [2]:
import os

NAME_HUMAN = "Homo_sapiens.GRCh38.pep.all"  # CHANGE THIS TO THE NAME OF THE REFERENCE PROTEOME YOU WANT
NAME_MOUSE = "Mus_musculus.GRCm39.pep.all"
DATA_PATH = os.path.join(os.path.abspath(os.getcwd()), "data")  # PATH TO DATA DIRECTORY (YOU CAN USE THE ONE IN THIS DIRECTORY)
ESM_PATH = "/nfs/research/irene/anaelle/CrossSpeciesIC/esm/"  # MAKE SURE TO CHANGE THIS TO THE PATH YOU CLONED THE ESM REPO TO
TORCH_HOME = "/nfs/research/irene/anaelle/CrossSpeciesIC/SATURN/torch_home"  # MAKE SURE TO CHANGE THIS TO YOUR DESIRED DIRECTORY
DEVICE = 2  # GPU NUMBER, CHANGE THIS

# URLS OF THE ENSEMBL PROTEOME FASTA FILES, CHANGE THESE
FASTA_URL_H = "http://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz"
FASTA_URL_M = "http://ftp.ensembl.org/pub/release-105/fasta/mus_musculus/pep/Mus_musculus.GRCm39.pep.all.fa.gz"

In [10]:
DATA_PATH

'/nfs/research/irene/anaelle/Scripts/SATURN/protein_embeddings/data'

In [3]:
!wget -r {FASTA_URL_H} \
        -O data/{NAME_HUMAN}.fa.gz

will be placed in the single file you specified.

--2023-07-20 11:32:41--  http://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.139
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.139|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14376754 (14M) [application/x-gzip]
Saving to: ‘data/Homo_sapiens.GRCh38.pep.all.fa.gz’


2023-07-20 11:32:42 (31.6 MB/s) - ‘data/Homo_sapiens.GRCh38.pep.all.fa.gz’ saved [14376754/14376754]

FINISHED --2023-07-20 11:32:42--
Total wall clock time: 0.8s
Downloaded: 1 files, 14M in 0.4s (31.6 MB/s)


In [4]:
!wget -r {FASTA_URL_M} \
        -O data/{NAME_MOUSE}.fa.gz

will be placed in the single file you specified.

--2023-07-20 11:33:01--  http://ftp.ensembl.org/pub/release-105/fasta/mus_musculus/pep/Mus_musculus.GRCm39.pep.all.fa.gz
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.139
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.139|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11327006 (11M) [application/x-gzip]
Saving to: ‘data/Mus_musculus.GRCm39.pep.all.fa.gz’


2023-07-20 11:33:01 (32.9 MB/s) - ‘data/Mus_musculus.GRCm39.pep.all.fa.gz’ saved [11327006/11327006]

FINISHED --2023-07-20 11:33:02--
Total wall clock time: 0.5s
Downloaded: 1 files, 11M in 0.3s (32.9 MB/s)


In [5]:
!gunzip data/{NAME_HUMAN}.fa.gz

In [6]:
!gunzip data/{NAME_MOUSE}.fa.gz
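
The download and decompression cells above shell out to `wget` and `gunzip`. Where those tools are not available, the same two steps could be sketched with the Python standard library. This is only an illustration: the helper names `download` and `gunzip_file` are hypothetical and not part of this repository.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def download(url, dest):
    """Fetch `url` and write the response body to `dest` (hypothetical helper)."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def gunzip_file(src, keep=False):
    """Decompress `src` (a .gz file) next to itself, similar to `gunzip`."""
    src = Path(src)
    dest = src.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    if not keep:
        src.unlink()  # gunzip removes the compressed file by default
    return dest
```

With the variables defined in Step 1, a call might look like `gunzip_file(download(FASTA_URL_H, f"data/{NAME_HUMAN}.fa.gz"))`.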

## Step 2: Clean FASTA

In [7]:
!python clean_fasta.py \
--data_path=./data/{NAME_HUMAN}.fa \
--save_path=./data/{NAME_HUMAN}.clean.fa

Number of original sequences = 117,909
100%|████████████████████████████████| 117909/117909 [00:01<00:00, 99681.95it/s]
Number of cleaned sequences = 117,779


In [8]:
!python clean_fasta.py \
--data_path=./data/{NAME_MOUSE}.fa \
--save_path=./data/{NAME_MOUSE}.clean.fa

Number of original sequences = 67,165
100%|█████████████████████████████████| 67165/67165 [00:00<00:00, 145578.88it/s]
Number of cleaned sequences = 67,071
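
The cleaning script reports how many sequences it read and how many it kept. To check those counts independently, one option is to count the header lines in each file. The sketch below assumes a standard FASTA layout where every record starts with `>`. `count_fasta_records` is a hypothetical helper, and it says nothing about which records `clean_fasta.py` filters out or why.

```python
def count_fasta_records(path):
    """Count FASTA records by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For example, `count_fasta_records(f"data/{NAME_MOUSE}.fa")` and `count_fasta_records(f"data/{NAME_MOUSE}.clean.fa")` can be compared with the numbers printed above.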


## Step 3: Run ESM

### Run separately with sh scripts
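
The shell scripts used for this step are not included here. As a rough idea of what such a run might look like, the sketch below assembles a call to the ESM extraction script from Python. Several details are assumptions rather than facts taken from this notebook: that `extract.py` sits at the root of the ESM checkout at the pinned commit, and that it accepts `--include mean` and `--truncate`. The output directory name `<fasta>_esm1b` matches what Step 4 reads. Check `python extract.py --help` in your checkout before relying on any of this.

```python
import os
import subprocess


def build_esm_command(esm_path, fasta_path, out_dir):
    """Assemble an ESM extraction command (script location and flags are assumptions)."""
    return [
        "python", os.path.join(esm_path, "extract.py"),
        "esm1b_t33_650M_UR50S",  # name of the pretrained ESM1b model
        fasta_path,              # cleaned FASTA from Step 2
        out_dir,                 # one .pt file per protein is written here
        "--include", "mean",     # assumed flag: mean-pooled representation
        "--truncate",            # assumed flag: truncate over-long sequences
    ]


def run_esm(esm_path, fasta_path, out_dir, device, torch_home):
    """Run the command on one GPU, with model weights cached under torch_home."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(device), TORCH_HOME=torch_home)
    subprocess.run(build_esm_command(esm_path, fasta_path, out_dir), env=env, check=True)
```

A call for the human proteome might be `run_esm(ESM_PATH, f"{DATA_PATH}/{NAME_HUMAN}.clean.fa", f"{DATA_PATH}/{NAME_HUMAN}.clean.fa_esm1b", DEVICE, TORCH_HOME)`.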

## Step 4: Convert to Embeddings File

In [11]:
!python map_gene_symbol_to_protein_ids.py \
    --fasta_path ./data/{NAME_HUMAN}.fa \
    --save_path ./data/{NAME_HUMAN}.gene_symbol_to_protein_ID.json


!python convert_protein_embeddings_to_gene_embeddings.py \
    --embedding_dir ./data/{NAME_HUMAN}.clean.fa_esm1b \
    --gene_symbol_to_protein_ids_path ./data/{NAME_HUMAN}.gene_symbol_to_protein_ID.json \
    --embedding_model ESM1b \
    --save_path ./data/{NAME_HUMAN}.gene_symbol_to_embedding_ESM1b.pt


100%|███████████████████████████████| 117909/117909 [00:00<00:00, 270859.54it/s]
117909
Number of gene symbols = 19,844
Number of protein IDs = 116,411
100%|██████████████████████████████████| 116411/116411 [05:50<00:00, 332.57it/s]
data/Homo_sapiens.GRCh38.pep.all.clean.fa_esm1b/ENSP00000489813.1.pt
100%|██████████████████████████████████| 19790/19790 [00:01<00:00, 15790.11it/s]
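
To make the first command more concrete, here is a minimal sketch of how gene symbols could be mapped to protein IDs from FASTA headers. It assumes ENSEMBL-style peptide headers, where the protein ID is the first token and the gene symbol appears as a `gene_symbol:` field. The function is illustrative only and is not the implementation in `map_gene_symbol_to_protein_ids.py`.

```python
from collections import defaultdict


def map_gene_symbols(header_lines):
    """Group protein IDs by the gene_symbol field of ENSEMBL-style FASTA headers."""
    mapping = defaultdict(list)
    for line in header_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        protein_id = fields[0]
        for field in fields[1:]:
            if field.startswith("gene_symbol:"):
                mapping[field.split(":", 1)[1]].append(protein_id)
                break  # headers without a gene_symbol field are skipped
    return dict(mapping)
```

In this sketch, records without a gene symbol are dropped. That would be consistent with the log above reporting fewer protein IDs than input sequences, although the script itself is the authority on the actual behavior.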


In [12]:
!python map_gene_symbol_to_protein_ids.py \
    --fasta_path ./data/{NAME_MOUSE}.fa \
    --save_path ./data/{NAME_MOUSE}.gene_symbol_to_protein_ID.json


!python convert_protein_embeddings_to_gene_embeddings.py \
    --embedding_dir ./data/{NAME_MOUSE}.clean.fa_esm1b \
    --gene_symbol_to_protein_ids_path ./data/{NAME_MOUSE}.gene_symbol_to_protein_ID.json \
    --embedding_model ESM1b \
    --save_path ./data/{NAME_MOUSE}.gene_symbol_to_embedding_ESM1b.pt


100%|█████████████████████████████████| 67165/67165 [00:00<00:00, 294723.35it/s]
67165
Number of gene symbols = 22,380
Number of protein IDs = 67,092
100%|████████████████████████████████████| 67092/67092 [03:17<00:00, 340.11it/s]
data/Mus_musculus.GRCm39.pep.all.clean.fa_esm1b/ENSMUSP00000152932.2.pt
100%|██████████████████████████████████| 22324/22324 [00:00<00:00, 27503.76it/s]
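
The conversion step collapses many protein embeddings into one embedding per gene symbol. One simple way to do that is to average the vectors of all proteins that map to the same gene. The sketch below shows that idea with plain Python lists. Averaging is an assumption here and is not confirmed by this notebook, and the function names are hypothetical.

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def gene_embeddings(gene_to_protein_ids, protein_embeddings):
    """Average the available protein embeddings for each gene symbol."""
    result = {}
    for gene, protein_ids in gene_to_protein_ids.items():
        found = [protein_embeddings[p] for p in protein_ids if p in protein_embeddings]
        if found:  # genes with no embedded protein are left out
            result[gene] = average_vectors(found)
    return result
```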


## Step 5: Running SPEAR

In [13]:
# Your final human embeddings will be located at:
os.path.abspath(f"./data/{NAME_HUMAN}.gene_symbol_to_embedding_ESM1b.pt")

'/nfs/research/irene/anaelle/Scripts/SATURN/protein_embeddings/data/Homo_sapiens.GRCh38.pep.all.gene_symbol_to_embedding_ESM1b.pt'

In [14]:
# Your final mouse embeddings will be located at:
os.path.abspath(f"./data/{NAME_MOUSE}.gene_symbol_to_embedding_ESM1b.pt")

'/nfs/research/irene/anaelle/Scripts/SATURN/protein_embeddings/data/Mus_musculus.GRCm39.pep.all.gene_symbol_to_embedding_ESM1b.pt'

In [15]:
import torch

In [None]:
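# Load the saved gene symbol -> embedding mapping; map_location="cpu" lets this run without a GPU
torch.load(os.path.abspath(f"./data/{NAME_MOUSE}.gene_symbol_to_embedding_ESM1b.pt"), map_location="cpu")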