<a href="https://colab.research.google.com/github/agemagician/ProtTrans/blob/master/Embedding/TensorFlow/Advanced/ProtXLNet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h3> Extracting protein sequence features using a pretrained XLNet model </h3>

<b>1. Load the necessary libraries, including Hugging Face transformers</b>

In [1]:
!pip install -q transformers sentencepiece

In [2]:
import tensorflow as tf
from transformers import TFXLNetModel, XLNetTokenizer, XLNetConfig
import re
import numpy as np

<b>2. Load the vocabulary and XLNet Model</b>

In [3]:
tokenizer = XLNetTokenizer.from_pretrained("Rostlab/prot_xlnet", do_lower_case=False)

In [4]:
xlnet_mem_len = 512  # number of tokens whose hidden states XLNet caches as memory

In [5]:
model = TFXLNetModel.from_pretrained("Rostlab/prot_xlnet", mem_len=xlnet_mem_len, from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFXLNetModel: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing TFXLNetModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFXLNetModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFXLNetModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLNetModel for predictions without further training.


<b>3. Create or load sequences and map rarely occurring amino acids (U, Z, O, B) to (X)</b>

In [6]:
sequences_Example = ["A E T C Z A O","S K T Z P"]

In [7]:
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]
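
The tokenizer expects residues separated by spaces, as in the example above, while raw sequences usually come without spaces. As a minimal sketch, the helper below combines the spacing step with the rare amino acid mapping from this section. The name `prepare_sequence` is hypothetical (it is not part of ProtTrans or transformers), and upper-casing the input is an assumption about how the raw data looks.

```python
import re

def prepare_sequence(raw_sequence):
    """Hypothetical helper: space-separate residues and map U, Z, O, B to X."""
    # Remove any existing whitespace and upper-case the residues (assumption)
    residues = "".join(raw_sequence.split()).upper()
    # Map the rarely occurring amino acids to the unknown residue X
    residues = re.sub(r"[UZOB]", "X", residues)
    # Re-insert a single space between residues
    return " ".join(residues)

raw_sequences = ["AETCZAO", "S K T Z P"]
prepared = [prepare_sequence(seq) for seq in raw_sequences]
print(prepared)  # ['A E T C X A X', 'S K T X P']
```

The helper accepts both spaced and unspaced input, so it can be applied to a mixed list before tokenization.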

<b>4. Tokenize and encode the sequences, and load them onto the GPU if possible</b>

In [8]:
ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True, return_tensors="tf")

In [9]:
input_ids = ids['input_ids']
attention_mask = ids['attention_mask']
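
To make the layout of `input_ids` and `attention_mask` easier to picture, here is a toy NumPy sketch of what the batch is expected to look like, assuming the default XLNet tokenizer behaviour of padding on the left and appending the two special tokens (`<sep>`, `<cls>`) at the end. The ids and the function `left_pad_batch` are placeholders made up for illustration; they are not the real vocabulary ids or a transformers API.

```python
import numpy as np

# Placeholder ids (assumption), not the real Rostlab/prot_xlnet vocabulary ids
PAD_ID, SEP_ID, CLS_ID = 0, 1, 2

def left_pad_batch(token_id_lists):
    """Illustrative only: mimic left padding with trailing <sep> <cls>."""
    with_special = [ids + [SEP_ID, CLS_ID] for ids in token_id_lists]
    max_len = max(len(ids) for ids in with_special)
    batch_ids = np.full((len(with_special), max_len), PAD_ID, dtype=np.int64)
    batch_mask = np.zeros((len(with_special), max_len), dtype=np.int64)
    for row, ids in enumerate(with_special):
        # Real tokens occupy the right-hand side; padding fills the left
        batch_ids[row, max_len - len(ids):] = ids
        batch_mask[row, max_len - len(ids):] = 1
    return batch_ids, batch_mask

# Two toy sequences with 7 and 5 residues, like the example sequences
toy_ids, toy_mask = left_pad_batch([[10, 11, 12, 13, 14, 15, 16],
                                    [20, 21, 22, 23, 24]])
print(toy_ids)
print(toy_mask)
```

In this toy batch the shorter sequence starts with two padding positions (mask value 0), and both rows end with the two special tokens.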

<b>5. Extract the sequences' features and load them onto the CPU if needed</b>

In [10]:
output = model(input_ids, attention_mask=attention_mask, mems=None)
embedding = output.last_hidden_state
memory = output.mems

In [11]:
embedding = np.asarray(embedding)

In [12]:
attention_mask = np.asarray(attention_mask)

<b>6. Remove the padding (&lt;pad&gt;) and the special tokens that are added by the XLNet model</b>

In [13]:
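features = []
for seq_num in range(len(embedding)):
    # Number of real (non-padding) tokens, including the two trailing special tokens
    seq_len = (attention_mask[seq_num] == 1).sum()
    padded_seq_len = len(attention_mask[seq_num])
    # XLNet pads on the left and appends <sep> and <cls> at the end,
    # so keep the positions from the first real token up to the last residue
    seq_emd = embedding[seq_num][padded_seq_len-seq_len:padded_seq_len-2]
    features.append(seq_emd)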
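
The slicing logic can be checked without loading the model. The sketch below wraps it in a function and runs it on synthetic data, assuming left padding and exactly two trailing special tokens per sequence. The function name `strip_padding_and_special_tokens`, the `num_special_tokens` parameter, and the toy hidden size of 4 are illustrative choices, not part of the original notebook.

```python
import numpy as np

def strip_padding_and_special_tokens(embedding, attention_mask, num_special_tokens=2):
    """Keep only per-residue embeddings (assumes left padding, trailing special tokens)."""
    features = []
    for seq_emb, mask in zip(embedding, attention_mask):
        seq_len = int(mask.sum())          # real tokens, special tokens included
        padded_len = len(mask)
        start = padded_len - seq_len       # skip the left padding
        end = padded_len - num_special_tokens  # drop the trailing special tokens
        features.append(seq_emb[start:end])
    return features

# Synthetic batch: 2 sequences, padded length 9, toy hidden size 4
toy_mask = np.array([[1] * 9, [0, 0] + [1] * 7])
toy_embedding = np.random.rand(2, 9, 4).astype(np.float32)

toy_features = strip_padding_and_special_tokens(toy_embedding, toy_mask)
print([f.shape for f in toy_features])  # [(7, 4), (5, 4)]
```

Each returned array has one row per residue, so the first dimension should match the residue count of the corresponding input sequence (7 and 5 for the example sequences).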

In [14]:
print(features)

[array([[ 0.48745072, -0.77087957,  0.99001706, ..., -0.37356192,
        -1.0589752 ,  0.95599777],
       [ 0.2132588 , -0.5478905 ,  0.6115487 , ..., -0.05471386,
        -0.87878954,  0.2464511 ],
       [ 0.37891257, -0.6396583 ,  0.67224425, ..., -0.1489127 ,
        -0.67695737,  0.34598166],
       ...,
       [ 0.09265009, -0.6810126 ,  0.51818657, ..., -0.32756305,
        -0.56731194, -0.16441321],
       [-0.08533202, -0.74382424,  0.29890722, ..., -0.24376419,
        -0.11153169, -0.72609997],
       [-0.4822583 , -0.83816975,  0.08214375, ..., -0.251965  ,
        -0.03577391, -0.5348946 ]], dtype=float32), array([[ 1.0403051 , -0.95049536,  0.3353436 , ..., -0.23747307,
        -0.275508  ,  0.47948724],
       [ 0.5085352 , -0.9508469 ,  1.023512  , ..., -0.05893063,
        -0.9752811 ,  0.08713093],
       [ 0.65626574, -0.75406295,  0.43234748, ...,  0.36707488,
        -0.70632553, -0.5553421 ],
       [ 0.5903387 ,  0.1415192 ,  0.29578805, ...,  0.20601025,
     