In [1]:
from transformers import T5Tokenizer, T5EncoderModel
import torch
import re
import csv

## Embedding

#### 01. Environment Settings
Select the GPU when one is available and fall back to the CPU otherwise; the model and the input tensors are later moved to this device.

In [2]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

print(torch.__version__)

2.2.1+cu121


In [3]:
print(device)
print(torch.cuda.is_available())
print(torch.cuda.device_count())
if torch.cuda.is_available():  # get_device_name raises an error on CPU-only machines
    print(torch.cuda.get_device_name(0))

cuda:0
True
1
NVIDIA GeForce RTX 4060 Laptop GPU


#### 02. Load Tokenizer and Model
The pre-trained tokenizer and the T5 encoder model are loaded with the `from_pretrained` method, and the model is moved to the device selected above (GPU or CPU).

In [4]:
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


#### 03. Preparing and Processing Protein Sequences
The protein sequences are read from the first column of `epitope_cdr3_pair.csv`, skipping the header row. Rare or ambiguous amino acids (U, Z, O, B) are then replaced with X by a regular-expression substitution, and a space is inserted between residues so that each amino acid becomes a separate token, which is the input format the model expects.

In [6]:
# Read the first column of the CSV as the list of sequences.
with open('epitope_cdr3_pair.csv', 'r', newline='') as file:
    csv_reader = csv.reader(file)
    next(csv_reader)  # skip the header row
    sequence_examples = [row[0] for row in csv_reader if row]  # ignore empty rows


In [7]:
# Map rare/ambiguous residues (U, Z, O, B) to X and put a space between residues.
sequence_examples = [" ".join(re.sub(r"[UZOB]", "X", sequence)) for sequence in sequence_examples]
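
To make the preprocessing step concrete, the sketch below applies the same substitution and spacing to two made-up sequences. The helper name `preprocess` and the example sequences are illustrative assumptions; they are not taken from `epitope_cdr3_pair.csv`.

```python
import re

def preprocess(sequence):
    """Replace rare/ambiguous residues (U, Z, O, B) with X and space-separate residues."""
    return " ".join(re.sub(r"[UZOB]", "X", sequence))

# Made-up example sequences (hypothetical, not from the dataset).
examples = ["CASSLGQAYEQYF", "GILGUVFTB"]
processed = [preprocess(s) for s in examples]
for p in processed:
    print(p)
# C A S S L G Q A Y E Q Y F
# G I L G X V F T X
```

The second sequence shows the substitution: `U` and `B` both become `X`, while standard residues are left unchanged.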

In [8]:
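# Tokenize the whole batch; padding="longest" pads every sequence to the longest one,
# and the attention mask marks real tokens (1) versus padding (0).
ids = tokenizer(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)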

In [9]:
# Inference only: disable gradient tracking to save memory.
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

In [10]:
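# Keep only the residue positions, dropping padding and the trailing special token.
# The slice lengths 13 and 14 are hard-coded and assumed to be the residue counts
# of the first two sequences.
emb_0 = embedding_repr.last_hidden_state[0, :13]
emb_1 = embedding_repr.last_hidden_state[1, :14]
# Average over residues to get one fixed-size vector per sequence.
emb_0_per_protein = emb_0.mean(dim=0)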

In [11]:
print(emb_0_per_protein)

tensor([ 0.1421,  0.0178, -0.1057,  ...,  0.1298,  0.0049,  0.0661],
       device='cuda:0')


In [13]:
emb_0_per_protein.shape

torch.Size([1024])

In [15]:
emb_1_per_protein = emb_1.mean(dim=0)

In [16]:
print(emb_1_per_protein)

tensor([ 0.1518,  0.1074, -0.0823,  ...,  0.1613, -0.0450, -0.0311],
       device='cuda:0')
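
The per-protein vectors above are computed one sequence at a time with hard-coded lengths. As a possible generalization, the sketch below derives each length from the attention mask and pools every sequence in the batch at once. It assumes right-padded inputs with exactly one special token (`</s>`) appended to each sequence; the function name `mean_pool_residues` and the dummy tensors are hypothetical and only stand in for the real model output.

```python
import torch

def mean_pool_residues(last_hidden_state, attention_mask):
    """Average residue embeddings per sequence, ignoring padding and the final special token.

    Assumption: inputs are right-padded and each sequence ends with one special token.
    """
    # Residues per sequence = real tokens minus the one trailing special token.
    lengths = attention_mask.sum(dim=1) - 1
    positions = torch.arange(attention_mask.shape[1], device=attention_mask.device)
    # True at residue positions only; shape (batch, seq_len, 1) for broadcasting.
    residue_mask = (positions.unsqueeze(0) < lengths.unsqueeze(1)).unsqueeze(-1)
    summed = (last_hidden_state * residue_mask).sum(dim=1)
    return summed / lengths.clamp(min=1).unsqueeze(1)

# Dummy tensors standing in for the model output: 2 sequences, 4 positions, 3 features.
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 0],    # 2 residues + special token + 1 pad
                     [1, 1, 1, 1]])   # 3 residues + special token
pooled = mean_pool_residues(hidden, mask)
print(pooled.shape)
```

Under the same assumptions, calling `mean_pool_residues(embedding_repr.last_hidden_state, attention_mask)` on the notebook's tensors would return one vector per sequence without needing the lengths to be written by hand.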
