# **Feature Generation**

Convert each protein sequence into a 1280-dimensional feature vector using ESM-1b.



In [None]:
'''PyTorch 1.7.1 with CUDA 10.1 is required for the package to run,
so install it using the cell below.'''

In [None]:
!pip install torch==1.7.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

In [None]:
import torch 
print(torch.__version__)

Install bio_embeddings
[More Info can be found on their GitHub page](https://github.com/sacdallago/bio_embeddings.git)

The cell below will take approximately 5-10 minutes.

In [None]:
!pip3 install -U pip > /dev/null
!pip3 install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git" > /dev/null

Mounting drive to read in the training data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np

*The cell below will raise an error the first time you run it. Ignore the error and re-run the same cell; the error will go away.*

In [None]:
from bio_embeddings.embed import ESM1bEmbedder

Reading the training data

In [None]:
d = pd.read_csv('/content/drive/MyDrive/Train.csv')
d.head()

*Apparently the ESM-1b embedder cannot process inputs longer than 1024 tokens, so each sequence needs to be truncated to at most 1022 residues (2 positions are reserved for the special start and end tokens). I chose 1000.*

In [None]:
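def truncator(s):
  # Sequences shorter than 1000 residues are returned unchanged;
  # longer ones are cut to their first 1000 residues.
  if len(str(s)) < 1000:
    return s
  return str(s)[:1000]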


d['SEQUENCE'] = d['SEQUENCE'].apply(truncator)
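Before committing to a cut-off, it can help to know how many sequences it affects. The sketch below is not part of the original notebook: `truncation_report` and the toy data are hypothetical names used for illustration, and the function would need to be run on the untruncated `SEQUENCE` column to give meaningful numbers.

```python
# Hypothetical helper, not part of the original notebook.
MAX_RESIDUES = 1000  # the cut-off chosen in this notebook


def truncation_report(sequences, max_len=MAX_RESIDUES):
    """Return (number of sequences longer than max_len, total residues dropped)."""
    n_truncated = 0
    residues_dropped = 0
    for seq in sequences:
        extra = len(str(seq)) - max_len
        if extra > 0:
            n_truncated += 1
            residues_dropped += extra
    return n_truncated, residues_dropped


# Toy data standing in for d['SEQUENCE']; real sequences come from Train.csv.
toy_sequences = ["MKV" * 10, "A" * 1000, "G" * 1500]
n_truncated, residues_dropped = truncation_report(toy_sequences)
print(n_truncated, residues_dropped)
```

If a large share of sequences turns out to be truncated, the tail of those proteins never reaches the embedder, which is worth keeping in mind when interpreting model errors.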


In [None]:
sequencez = d['SEQUENCE'].values

In [None]:
embedder = ESM1bEmbedder(device = 'cuda')

*Store the 1280-dimensional embedding of each protein sequence in a list called "embeddings", then dump the list into a pickle file so it does not have to be generated again and again.*
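To make the "one vector per protein" idea concrete: a model like ESM-1b produces one vector per residue, and these have to be collapsed into a single fixed-length vector. The sketch below assumes the reduction is a plain average over residues (mean pooling); that is an assumption about what `reduce_per_protein` does, not something this notebook states. `mean_pool` is a hypothetical plain-Python stand-in, using 4 features instead of 1280.

```python
def mean_pool(per_residue):
    """Average a list of per-residue vectors into one per-protein vector."""
    if not per_residue:
        raise ValueError("need at least one residue embedding")
    n = len(per_residue)
    # zip(*rows) walks the matrix column by column, i.e. one feature at a time.
    return [sum(column) / n for column in zip(*per_residue)]


# Toy example: 3 residues with 4 features each (ESM-1b would give 1280 features).
toy_per_residue = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
pooled = mean_pool(toy_per_residue)
print(pooled)
```

Whatever the sequence length, the pooled vector has the same number of features, which is what lets proteins of different lengths share one feature table.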

*The cell below will take around 24-30 hours to process all 858777 training sequences. (I provide download links further down so you can fetch the result with wget and skip this step.)*

In [None]:
embeddings = []
for sequence in sequencez:
  # Per-residue embeddings for this sequence
  per_amino = embedder.embed(sequence)
  # Reduce to a single fixed-length vector per protein
  embeddings.append(embedder.reduce_per_protein(per_amino))
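With a run time of a day or more, a dropped Colab session would lose everything held in the in-memory list. One possible safeguard, sketched here under assumptions, is to embed in chunks and pickle each chunk as soon as it is finished. `embed_in_chunks`, the chunk file naming, and the `embed_fn` parameter are all hypothetical; the notebook does not say how its own multi-part files were produced.

```python
import os
import pickle


def embed_in_chunks(sequences, embed_fn, out_dir, chunk_size=1000):
    """Embed sequences chunk by chunk, pickling each chunk to its own file.

    Chunks whose file already exists are skipped, so a restarted run
    resumes where the previous one stopped.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(sequences), chunk_size):
        path = os.path.join(out_dir, f"chunk_{start // chunk_size:05d}.pkl")
        paths.append(path)
        if os.path.exists(path):
            continue  # already computed in an earlier run
        chunk = [embed_fn(seq) for seq in sequences[start:start + chunk_size]]
        # Write to a temporary name first so an interrupted write
        # never leaves a half-finished chunk file behind.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(chunk, f)
        os.replace(tmp_path, path)
    return paths
```

In this notebook, `embed_fn` could be something like `lambda s: embedder.reduce_per_protein(embedder.embed(s))`, with `out_dir` pointing at a folder on the mounted drive so the chunk files outlive the session.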

In [None]:
import pickle
with open('ESM1b_train_embeddings.pkl', 'wb') as f:
  pickle.dump(embeddings, f)

To skip running the cells above, you can download the pickled list (divided into 8 parts, ESM1BT1 to ESM1BT8).

In [None]:
!wget https://storage.googleapis.com/enzymeclassification/ESM1b/ESM1BT1.pkl

In [None]:
!wget https://storage.googleapis.com/enzymeclassification/ESM1b/ESM1BT2.pkl

In [None]:
!wget https://storage.googleapis.com/enzymeclassification/ESM1b/ESM1BT3.pkl

In [None]:
!wget https://storage.googleapis.com/enzymeclassification/ESM1b/ESM1BT4.pkl

In [None]:
!wget https://storage.googleapis.com/enzymeclassification/ESM1b/ESM1BT5.pkl

In [None]:
!wget https://storage.googleapis.com/enzymeclassification/ESM1b/ESM1BT6.pkl

In [None]:
!wget https://storage.googleapis.com/enzymeclassification/ESM1b/ESM1BT7.pkl

In [None]:
!wget https://storage.googleapis.com/enzymeclassification/ESM1b/ESM1BT8.pkl
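The eight downloads above differ only in the part number, so they could also be driven from Python. This is a minimal sketch using the standard library instead of `wget`; `train_part_names` and `download_parts` are hypothetical helper names, and the base URL is copied from the cells above.

```python
import urllib.request

BASE_URL = "https://storage.googleapis.com/enzymeclassification/ESM1b/"


def train_part_names(n_parts=8):
    """File names ESM1BT1.pkl ... ESM1BT8.pkl, in order."""
    return [f"ESM1BT{i}.pkl" for i in range(1, n_parts + 1)]


def download_parts(names, base_url=BASE_URL):
    """Download each file into the current directory (needs network access)."""
    for name in names:
        urllib.request.urlretrieve(base_url + name, name)


# To fetch everything: download_parts(train_part_names())
```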

# The same code and technique can be used to extract the embeddings of the sequences from Test.csv

I'm providing the download links for those as well to save you time.

In [None]:
!wget https://storage.googleapis.com/enzymeclassification/ESM1b/ESMtest1a.pkl

In [None]:
!wget https://storage.googleapis.com/enzymeclassification/ESM1b/ESMtest1b.pkl

In [None]:
!wget https://storage.googleapis.com/enzymeclassification/ESM1b/ESMtest2.pkl

In [None]:
!wget https://storage.googleapis.com/enzymeclassification/ESM1b/ESMtest3a.pkl

In [None]:
!wget https://storage.googleapis.com/enzymeclassification/ESM1b/ESMtest3b.pkl
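Once the parts are downloaded they need to be stitched back together. The sketch below assumes each file holds a pickled list of per-protein embeddings and that concatenating the parts in file-name order reproduces the row order of the CSV; neither is confirmed here, so it is worth checking the combined length against the number of rows. `load_embedding_parts` is a hypothetical helper.

```python
import pickle


def load_embedding_parts(paths):
    """Load pickled lists from paths and concatenate them in the given order."""
    combined = []
    for path in paths:
        with open(path, "rb") as f:
            part = pickle.load(f)
        combined.extend(part)
    return combined


# File names taken from the download cells above.
TEST_PARTS = ["ESMtest1a.pkl", "ESMtest1b.pkl", "ESMtest2.pkl",
              "ESMtest3a.pkl", "ESMtest3b.pkl"]
# test_embeddings = load_embedding_parts(TEST_PARTS)
```

The same function would work for the training parts by passing the ESM1BT1 to ESM1BT8 file names. Since unpickling can execute arbitrary code, only load pickle files from a source you trust.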

Now that we have embeddings for both Train.csv and Test.csv, we can proceed to training a machine learning model on them.
Please see the Training notebook for that!



---

