
## Demo to use the ProtTrans-Glutar model for glutarylation prediction

In [1]:
# Path to your input sequence file in FASTA format. The model was trained on peptides of length 23.
filename="sample.fasta"
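
Because the model was trained on peptides of length 23, it can help to validate the input before extracting any features. The following is a minimal sketch and not part of the original notebook: `find_bad_lengths` and its `expected` default are hypothetical names, and it assumes each FASTA record keeps its sequence on a single line.

```python
def find_bad_lengths(fasta_lines, expected=23):
    """Return (header, length) pairs for records whose length differs from `expected`."""
    bad = []
    header = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            if len(line) != expected:
                bad.append((header, len(line)))
            header = None
    return bad


example = [
    ">pep1",
    "SLVNELTFSARKMMADEALDSGL",  # 23 residues
    ">pep2",
    "SLVNELTFSARK",             # 12 residues, too short
]
problems = find_bad_lengths(example)
print(problems)  # [('pep2', 12)]
```

For a real file, an open file handle can be passed instead of the list, for example `find_bad_lengths(open(filename))`.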

### Extracting EAAC and CTDD features from the protein sequences using the iFeature toolkit

In [2]:
!python iFeature.py --file {filename} --type EAAC --out EAAC.tsv
!python iFeature.py --file {filename} --type CTDD --out CTDD.tsv

Descriptor type: EAAC
Descriptor type: CTDD
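
For intuition about what the EAAC descriptor contains, here is a simplified sketch of a sliding-window amino acid composition. It is an illustration rather than iFeature's own code: the window size of 5 and the alphabetical residue order are assumptions about iFeature's defaults, and `eaac_sketch` is a hypothetical name. Under those assumptions a 23-residue peptide yields 19 windows of 20 frequencies each.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed residue order


def eaac_sketch(sequence, window=5):
    """Amino acid frequencies in each sliding window, concatenated into one list."""
    features = []
    for start in range(len(sequence) - window + 1):
        chunk = sequence[start:start + window]
        for aa in AMINO_ACIDS:
            features.append(chunk.count(aa) / window)
    return features


vec = eaac_sketch("SLVNELTFSARKMMADEALDSGL")
print(len(vec))  # 19 windows * 20 residues = 380
```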


<h3>Extracting protein sequence features using the pretrained ProtT5-XL-UniRef50 model</h3>

**1. Load the necessary libraries, including Hugging Face transformers**

In [3]:
!pip install -q SentencePiece transformers

In [4]:
from transformers import TFT5EncoderModel, T5Tokenizer
import numpy as np
import re
import gc

2022-04-25 09:59:47.557553: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-04-25 09:59:47.557600: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


**2. Load the vocabulary and the ProtT5-XL-UniRef50 model**

In [5]:
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False )
#tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_bfd", do_lower_case=False )


In [6]:
model = TFT5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50", from_pt=True)
#model = TFT5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_bfd", from_pt=True)

2022-04-25 10:00:08.112838: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-04-25 10:00:08.112895: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-04-25 10:00:08.112932: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (biows1): /proc/driver/nvidia/version does not exist
2022-04-25 10:00:08.118531: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-04-25 10:00:08.296264: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but t

All the weights of TFT5EncoderModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5EncoderModel for predictions without further training.


In [7]:
gc.collect()

9

**3. Create or load sequences and map rarely occurring amino acids (U, Z, O, B) to X**

In [8]:
protlist = []
with open(filename) as fin:
    for line in fin:
        if line.startswith('>'):
            protlist.append(next(fin).strip())

In [9]:
protlist

['SLVNELTFSARKMMADEALDSGL',
 'PSVSLQTVKLMKEGLEAARLKAY',
 'LMDCMNKLKNNKEYLEFRKERSK',
 'AATILTSPDLRKQWLQEVKGMAD',
 'GLKPEQVERLTKEFSVYMTKDGR',
 'GGVEFNIDLPNKKVCIDSEHSSD',
 'EQNRTHSASFFKFLTEELSLDQD',
 'AFWRELVECFQKISKDSDCRAVV',
 'KIFRQQLEVFMKKNVDFLIAEYF',
 'GSWGSGLDMHTKPWIRARARKEY']
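
The loop above takes only the line immediately after each header, so it assumes every sequence sits on a single line. If an input file wraps sequences across several lines, a reader along the following lines could be used instead. This is a sketch; `read_fasta_sequences` is a hypothetical helper, not part of the notebook.

```python
import io


def read_fasta_sequences(handle):
    """Return the sequences of a FASTA stream, joining lines wrapped across rows."""
    sequences = []
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if current is not None:
                sequences.append("".join(current))
            current = []
        elif line and current is not None:
            current.append(line)
    if current is not None:
        sequences.append("".join(current))
    return sequences


wrapped = io.StringIO(
    ">pep1\nSLVNELTFSARK\nMMADEALDSGL\n>pep2\nPSVSLQTVKLMKEGLEAARLKAY\n"
)
seqs = read_fasta_sequences(wrapped)
print(seqs)
```

With a file on disk, the same helper could be called as `with open(filename) as fin: protlist = read_fasta_sequences(fin)`.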

In [10]:
protlist = [" ".join(sequence) for sequence in protlist]

In [11]:
protlist

['S L V N E L T F S A R K M M A D E A L D S G L',
 'P S V S L Q T V K L M K E G L E A A R L K A Y',
 'L M D C M N K L K N N K E Y L E F R K E R S K',
 'A A T I L T S P D L R K Q W L Q E V K G M A D',
 'G L K P E Q V E R L T K E F S V Y M T K D G R',
 'G G V E F N I D L P N K K V C I D S E H S S D',
 'E Q N R T H S A S F F K F L T E E L S L D Q D',
 'A F W R E L V E C F Q K I S K D S D C R A V V',
 'K I F R Q Q L E V F M K K N V D F L I A E Y F',
 'G S W G S G L D M H T K P W I R A R A R K E Y']

In [12]:
protlist = [re.sub(r"[UZOB]", "X", sequence) for sequence in protlist]
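
To see the effect of the two preprocessing steps together, the following sketch applies them to a made-up sequence containing the rare residues. The input string is invented for illustration, and `prepare_for_prott5` is a hypothetical helper name.

```python
import re


def prepare_for_prott5(sequence):
    """Space-separate residues and map the rare amino acids U, Z, O, B to X."""
    spaced = " ".join(sequence)
    return re.sub(r"[UZOB]", "X", spaced)


print(prepare_for_prott5("MKUZAOB"))  # M K X X A X X
```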

**4. Tokenize and encode the sequences, and load them onto the GPU if possible**

In [13]:
ids = tokenizer.batch_encode_plus(protlist, add_special_tokens=True, padding=True, return_tensors="tf")

In [14]:
input_ids = ids['input_ids']
attention_mask = ids['attention_mask']
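
With `padding=True`, shorter sequences are padded to the length of the longest one in the batch, and `attention_mask` marks real tokens with 1 and padding with 0. The toy sketch below imitates that behaviour in plain NumPy. The token ids are invented, `pad_batch` is a hypothetical helper, and the pad id of 0 is an assumption about the tokenizer.

```python
import numpy as np


def pad_batch(token_lists, pad_id=0):
    """Right-pad lists of token ids and build the matching attention mask."""
    max_len = max(len(tokens) for tokens in token_lists)
    input_ids = np.full((len(token_lists), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(token_lists), max_len), dtype=np.int64)
    for row, tokens in enumerate(token_lists):
        input_ids[row, :len(tokens)] = tokens
        attention_mask[row, :len(tokens)] = 1
    return input_ids, attention_mask


# Two toy "sequences" of different lengths (ids are made up).
toy_ids, toy_mask = pad_batch([[7, 4, 6, 1], [7, 4, 1]])
print(toy_ids)
print(toy_mask)
```

In this notebook all peptides have the same length, so no padding is actually added, but the mask still matters for inputs of mixed length.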

**5. Extract the sequence features and load them onto the CPU if needed**

In [15]:
# Pass the attention mask so that padded positions are ignored.
embedding = model(input_ids, attention_mask=attention_mask)

In [16]:
embedding = np.asarray(embedding.last_hidden_state)

In [17]:
attention_mask = np.asarray(attention_mask)

**7. Remove padding (\<pad\>) and special tokens (\</s\>) added by the ProtT5-XL-UniRef50 tokenizer**

In [18]:
features = [] 
for seq_num in range(len(embedding)):
    seq_len = (attention_mask[seq_num] == 1).sum()
    seq_emd = embedding[seq_num][:seq_len-1]
    features.append(seq_emd)
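
The slicing logic can be checked on a toy array. In this sketch the shapes and values are invented: each row of `toy_mask` has ones for the residue tokens plus one end-of-sequence token, so `seq_len - 1` rows are kept.

```python
import numpy as np

# Toy batch: 2 sequences, padded to 5 tokens, 3-dimensional embeddings.
toy_embedding = np.arange(2 * 5 * 3, dtype=np.float32).reshape(2, 5, 3)
toy_mask = np.array([[1, 1, 1, 1, 0],   # 3 residues + </s> + 1 pad
                     [1, 1, 1, 0, 0]])  # 2 residues + </s> + 2 pads

toy_features = []
for seq_num in range(len(toy_embedding)):
    seq_len = (toy_mask[seq_num] == 1).sum()
    # Drop the trailing </s> token and everything after it.
    toy_features.append(toy_embedding[seq_num][:seq_len - 1])

print([f.shape for f in toy_features])  # [(3, 3), (2, 3)]
```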

In [19]:
print(features)

[array([[ 0.03629582, -0.25272217, -0.07906041, ...,  0.12213635,
         0.09413542,  0.08972404],
       [ 0.22051564, -0.262192  , -0.04013605, ...,  0.18325903,
         0.23735234,  0.22343889],
       [ 0.00964273, -0.04136721, -0.12887119, ..., -0.35683772,
        -0.03652622,  0.26135442],
       ...,
       [ 0.1128105 , -0.01389266, -0.19745976, ...,  0.22488484,
        -0.07750582, -0.17140193],
       [-0.05688144, -0.09415594, -0.66101795, ..., -0.24926527,
        -0.16212186,  0.09775349],
       [-0.03187493, -0.0810577 ,  0.08225247, ..., -0.06895956,
         0.2325522 , -0.00681481]], dtype=float32), array([[-0.15055022,  0.05364951, -0.01746642, ...,  0.16237238,
        -0.03544049,  0.0491399 ],
       [ 0.10659074, -0.17598663, -0.31884226, ...,  0.25361112,
         0.04247674, -0.01424738],
       [ 0.00238571,  0.0408157 ,  0.06145804, ...,  0.25682843,
         0.17341541,  0.13685896],
       ...,
       [ 0.199428  , -0.02964877,  0.15113924, ...,  0.108

In [20]:
features[0].shape

(23, 1024)

**8. Sum the features over the 23 residue rows to get one 1024-dimensional vector per sequence**

In [21]:
import pandas as pd
import numpy as np

sum_features = []
for element in features:
    sum_features.append(element.sum(axis=0))
prott5_xl_uniref50 = pd.DataFrame(sum_features)
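
Summing over `axis=0` collapses each per-residue matrix into a single vector. When every peptide has the same length, the same result can be obtained in one vectorized step, as in this sketch with small random stand-in arrays (the shape `(23, 8)` is a placeholder for `(23, 1024)`).

```python
import numpy as np

rng = np.random.default_rng(0)
# Four stand-in "embeddings", one matrix per peptide.
toy_features = [rng.normal(size=(23, 8)).astype(np.float32) for _ in range(4)]

# Loop version, as in the notebook.
loop_sum = np.array([element.sum(axis=0) for element in toy_features])

# Vectorized version: stack to (4, 23, 8), then sum over the residue axis.
stacked_sum = np.stack(toy_features).sum(axis=1)

print(loop_sum.shape, stacked_sum.shape)
```

The vectorized form only works when all matrices share the same number of rows; the loop also handles peptides of different lengths.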



### Combine all features: CTDD, ProtT5-XL-UniRef50, EAAC

In [23]:
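def strip_first_col(fname, delimiter=None):
    """Yield each line of `fname` without its first column (the row label)."""
    with open(fname, 'r') as fin:
        for line in fin:
            try:
                yield line.split(delimiter, 1)[1]
            except IndexError:
                # Skip lines that have no second column (e.g. blank lines).
                continue

# skiprows=1 drops the header line of each iFeature output file.
ctdd = np.loadtxt(strip_first_col('CTDD.tsv'), delimiter="\t", skiprows=1)

eaac = np.loadtxt(strip_first_col('EAAC.tsv'), delimiter="\t", skiprows=1)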

X=np.concatenate((ctdd,prott5_xl_uniref50),axis=1)
X=np.concatenate((X,eaac),axis=1)
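
Since all three feature blocks must describe the same peptides in the same order, a quick shape check before concatenation can catch a mismatch early. The sketch below uses zero-filled stand-in arrays with placeholder column counts; `combine_features` is a hypothetical helper that keeps the CTDD, ProtT5, EAAC column order used above.

```python
import numpy as np


def combine_features(ctdd, prott5, eaac):
    """Concatenate feature blocks column-wise after checking the row counts agree."""
    blocks = [np.asarray(ctdd), np.asarray(prott5), np.asarray(eaac)]
    n_rows = {block.shape[0] for block in blocks}
    if len(n_rows) != 1:
        raise ValueError(f"Row counts differ: {[block.shape for block in blocks]}")
    return np.concatenate(blocks, axis=1)


# Stand-in arrays: 10 peptides with 4, 6 and 5 placeholder columns.
X_toy = combine_features(np.zeros((10, 4)), np.zeros((10, 6)), np.zeros((10, 5)))
print(X_toy.shape)  # (10, 15)
```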


### Load and apply ProtTrans-Glutar model for classification

In [24]:
import pickle

# Load the model from file: choose either the model trained on the training data only
# or the model trained on the whole dataset.

with open("ProtTrans-Glutar.train.model", "rb") as model_file:
    loaded_model = pickle.load(model_file)
#with open("ProtTrans-Glutar.full.model", "rb") as model_file:
#    loaded_model = pickle.load(model_file)


# Make predictions for the supplied data.
y_pred = loaded_model.predict(X)
predictions = [round(value) for value in y_pred]

print(predictions)

[0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
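
To make the output easier to read, the predictions can be paired with their input peptides. This sketch reuses the first four peptides and predictions shown above and assumes that label 1 denotes a predicted glutarylation site, which the notebook does not state explicitly.

```python
peptides = [
    "SLVNELTFSARKMMADEALDSGL",
    "PSVSLQTVKLMKEGLEAARLKAY",
    "LMDCMNKLKNNKEYLEFRKERSK",
    "AATILTSPDLRKQWLQEVKGMAD",
]
predictions = [0, 1, 1, 1]

labelled = list(zip(peptides, predictions))
# Assumption: 1 is the positive class.
positives = [seq for seq, label in labelled if label == 1]

for seq, label in labelled:
    print(f"{seq}\t{label}")
```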
