<a href="https://colab.research.google.com/github/agemagician/ProtTrans/blob/master/Embedding/PyTorch/Advanced/ProtT5-XL-BFD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Important Notes:
1. ProtT5-XL-BFD has both an encoder and a decoder; for feature extraction we load and use only the encoder.
2. Loading only the encoder halves the inference time and the GPU memory requirements.
3. To use the ProtT5-XL-BFD encoder, you must install the latest Hugging Face transformers version from its GitHub repo, or at least the commit mentioned below.
4. If you are interested in both the encoder and the decoder, use T5Model rather than T5EncoderModel.

<h3>Extracting protein sequences' features using ProtT5-XL-BFD pretrained-model</h3>

**1. Load the necessary libraries, including Hugging Face transformers**

In [1]:
!pip install -q SentencePiece git+https://github.com/huggingface/transformers.git@40ecaf0c2b1c0b3894e9abf619f32472c5a3b3ca



In [2]:
import torch
from transformers import T5EncoderModel, T5Tokenizer
import re
import numpy as np
import gc

<b>2. Load the vocabulary and the ProtT5-XL-BFD model</b>

In [3]:
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_bfd", do_lower_case=False )

In [4]:
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_bfd")

Some weights of the model checkpoint at Rostlab/prot_t5_xl_bfd were not used when initializing T5EncoderModel: ['decoder.embed_tokens.weight', 'decoder.block.0.layer.0.SelfAttention.q.weight', 'decoder.block.0.layer.0.SelfAttention.k.weight', 'decoder.block.0.layer.0.SelfAttention.v.weight', 'decoder.block.0.layer.0.SelfAttention.o.weight', 'decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight', 'decoder.block.0.layer.0.layer_norm.weight', 'decoder.block.0.layer.1.EncDecAttention.q.weight', 'decoder.block.0.layer.1.EncDecAttention.k.weight', 'decoder.block.0.layer.1.EncDecAttention.v.weight', 'decoder.block.0.layer.1.EncDecAttention.o.weight', 'decoder.block.0.layer.1.layer_norm.weight', 'decoder.block.0.layer.2.DenseReluDense.wi.weight', 'decoder.block.0.layer.2.DenseReluDense.wo.weight', 'decoder.block.0.layer.2.layer_norm.weight', 'decoder.block.1.layer.0.SelfAttention.q.weight', 'decoder.block.1.layer.0.SelfAttention.k.weight', 'decoder.block.1.layer.0.SelfAttention

In [5]:
gc.collect()

1037

<b>3. Load the model onto the GPU if available and switch to inference mode</b>

In [6]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

In [7]:
model = model.to(device)
model = model.eval()

<b>4. Create or load sequences and map rarely occurring amino acids (U, Z, O, B) to X</b>

In [8]:
sequences_Example = ["A E T C Z A O","S K T Z P"]

In [9]:
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]
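The example sequences above are already written with a space between residues. For raw sequences stored without spaces, the same preparation could be sketched as follows. `prepare_sequence` is a hypothetical helper name, not part of ProtTrans or transformers, and the space-separated format is inferred from the example input above.

```python
import re

def prepare_sequence(raw_sequence):
    """Hypothetical helper: map rare residues to X and space-separate residues."""
    # Drop any whitespace so spaced and unspaced inputs are handled alike
    compact = "".join(raw_sequence.split())
    # Map the rarely occurring amino acids U, Z, O and B to X
    compact = re.sub(r"[UZOB]", "X", compact)
    # Put a single space between residues, matching the example format
    return " ".join(compact)

raw_sequences = ["AETCZAO", "SKTZP"]
prepared = [prepare_sequence(seq) for seq in raw_sequences]
print(prepared)  # ['A E T C X A X', 'S K T X P']
```

The resulting list can be passed to the tokenizer in place of `sequences_Example`.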

<b>5. Tokenize and encode the sequences, and load them onto the GPU if possible</b>

In [10]:
ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)

In [11]:
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)
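To make the role of `padding=True` and the attention mask concrete, here is a toy, pure-Python sketch of right-padding a batch. It is not the tokenizer's implementation: `pad_batch` is a hypothetical function, the token ids are made-up placeholders, and a padding id of 0 is an assumption.

```python
def pad_batch(token_id_lists, pad_id=0):
    """Toy illustration: right-pad id lists and build the matching attention masks."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        # Real tokens keep their ids; padding positions get pad_id
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token (including the end token), 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Placeholder ids: 7 residues + end token, and 5 residues + end token
toy_batch = pad_batch([[3, 9, 11, 22, 23, 3, 23, 1], [7, 14, 11, 23, 13, 1]])
print(toy_batch["attention_mask"])
```

The shorter sequence ends with two masked positions, which the model is told to ignore through `attention_mask`.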

<b>6. Extract the sequences' features and move them to the CPU if needed</b>

In [12]:
with torch.no_grad():
    embedding = model(input_ids=input_ids,attention_mask=attention_mask)

In [13]:
embedding = embedding.last_hidden_state.cpu().numpy()

<b>7. Remove the padding (\<pad\>) and special tokens (\</s\>) that are added by the ProtT5-XL-BFD model</b>

In [14]:
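features = []
for seq_num in range(len(embedding)):
    # Number of non-padding tokens, including the trailing </s> token
    seq_len = int((attention_mask[seq_num] == 1).sum())
    # Drop the trailing </s> token so only per-residue embeddings remain
    seq_emd = embedding[seq_num][:seq_len-1]
    features.append(seq_emd)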
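The slicing logic can be checked on toy data without loading the model. The sketch below uses nested lists in place of the real arrays; `strip_special_tokens` is a hypothetical name, and it assumes right-padding with exactly one end token per sequence, as in the loop above.

```python
def strip_special_tokens(embedding, attention_mask):
    """Toy version of the loop above: keep only per-residue vectors."""
    features = []
    for seq_emb, mask in zip(embedding, attention_mask):
        # Count the non-padding positions (residues plus the end token)
        seq_len = sum(mask)
        # Slice off the end token and everything after it (the padding)
        features.append(seq_emb[:seq_len - 1])
    return features

# Two sequences padded to length 8: 7 residues + end token, 5 residues + end token
toy_mask = [[1] * 8, [1] * 6 + [0] * 2]
# Each position gets a 2-dimensional placeholder vector [sequence index, position]
toy_embedding = [[[float(s), float(p)] for p in range(8)] for s in range(2)]

toy_features = strip_special_tokens(toy_embedding, toy_mask)
print([len(f) for f in toy_features])  # [7, 5]
```

Each entry of `toy_features` has one vector per residue, matching the lengths of the two example sequences.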

In [15]:
print(features)

[array([[ 0.2906924 , -0.20804922, -0.21934754, ...,  0.33846605,
         0.46031427, -0.16942444],
       [ 0.20362833, -0.12995909, -0.17139852, ...,  0.1064226 ,
        -0.37738243,  0.05401208],
       [ 0.03736684, -0.085066  , -0.23616493, ...,  0.25152582,
         0.13338618,  0.03063769],
       ...,
       [ 0.4778798 , -0.2302942 , -0.10682338, ...,  0.4057171 ,
         0.52513736,  0.1917527 ],
       [ 0.0799041 ,  0.07094701, -0.08554685, ...,  0.2494752 ,
         0.3653832 , -0.45506272],
       [ 0.09110155,  0.17046824,  0.42079163, ...,  0.25626677,
         0.02010923, -0.11016797]], dtype=float32), array([[ 0.28857493, -0.11107822, -0.13360327, ..., -0.06594554,
         0.00528727, -0.21770114],
       [ 0.13953964, -0.12703417,  0.06635736, ..., -0.02377463,
        -0.28750962,  0.09930848],
       [ 0.22406411,  0.02809858,  0.04245748, ...,  0.14432026,
         0.20209198, -0.22417574],
       [ 0.5806172 , -0.16730031, -0.1463695 , ...,  0.36016148,
     