<a href="https://colab.research.google.com/github/francescopatane96/ESM2_experiments/blob/main/ESM2_classification_FINETUNED_tf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install transformers datasets

# Load the fine-tuned model from Hugging Face

In [2]:
from transformers import TFEsmForSequenceClassification

model = TFEsmForSequenceClassification.from_pretrained("francescopatane/esm2m35finetunedcytmem", output_attentions=True)

All model checkpoint layers were used when initializing TFEsmForSequenceClassification.

All the layers of TFEsmForSequenceClassification were initialized from the model checkpoint at francescopatane/esm2m35finetunedcytmem.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFEsmForSequenceClassification for predictions without further training.


## Tokenizing the data

All inputs to neural networks must be numerical. The process of converting strings into numerical indices suitable for a neural network is called **tokenization**. For natural language this can be quite complex: the network's vocabulary usually does not contain every possible word, so the tokenizer must split rarer words into pieces and also handle capitalization, Unicode characters, and so on.

With proteins, however, things are much simpler: protein language models convert each amino acid to a single token. Every model in `transformers` comes with an associated `tokenizer` that handles tokenization for it, and protein language models are no different. Let's get our tokenizer!
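
To make the one-residue-one-token idea concrete, here is a minimal sketch of a per-residue tokenizer in plain Python. The vocabulary, the special-token names, and the `toy_tokenize` helper are all hypothetical and chosen only for illustration; the real tokenizer loaded in the next cell has its own vocabulary and indices.

```python
# Toy per-residue tokenizer. The vocabulary and special-token names below are
# hypothetical and do NOT match the real ESM-2 vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["<cls>", "<pad>", "<eos>", "<unk>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def toy_tokenize(sequence):
    """Map each residue to one index, wrapped in a start and an end token."""
    residue_ids = [VOCAB.get(residue, VOCAB["<unk>"]) for residue in sequence]
    return [VOCAB["<cls>"]] + residue_ids + [VOCAB["<eos>"]]


toy_ids = toy_tokenize("MKGKNRSLF")
print(len("MKGKNRSLF"), len(toy_ids))  # residue count vs. token count
```

In this sketch the token list is always two entries longer than the residue string, because of the start and end tokens. That matches the warning shown further down, where a 1024-residue sequence is reported as 1026 tokens.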

In [3]:
from transformers import AutoTokenizer

model_for_tokenizer = "francescopatane/esm2m35finetunedcytmem"
tokenizer = AutoTokenizer.from_pretrained(model_for_tokenizer)



In [4]:
sequence_to_predict = ["MKGKNRSLFVLLVLLLLHKVNNVLLERTIETLLECKNEYVKGENGYKLAKGHHCVEEDNL\
ERWLQGTNERRSEENIKYKYGVTELKIKYAQMNGKRSSRILKESIYGAHNFGGNSYMEGK\
DGGDKTGEEKDGEHKTDSKTDNGKGANNLVMLDYETSSNGQPAGTLDNVLEFVTGHEGNS\
RKNSSNGGNPYDIDHKKTISSAIINHAFLQNTVMKNCNYKRKRRERDWDCNTKKDVCIPD\
RRYQLCMKELTNLVNNTDTNFHRDITFRKLYLKRKLIYDAAVEGDLLLKLNNYRYNKDFC\
KDIRWSLGDFGDIIMGTDMEGIGYSKVVENNLRSIFGTDEKAQQRRKQWWNESKAQIWTA\
MMYSVKKRLKGNFIWICKLNVAVNIEPQIYRWIREWGRDYVSELPTEVQKLKEKCDGKIN\
YTDKKVCKVPPCQNACKSYDQWITRKKNQWDVLSNKFISVKNAEKVQTAGIVTPYDILKQ\
ELDEFNEVAFENEINKRDGAYIELCVCSVEEAKKNTQEVVTNVDNAAKSQATNSNPISQP\
VDSSKAEKVPGDSTHGNVNSGQDSSTTGKAVTGDGQNGNQTPAESDVQRSDIAESVSAKN\
VDPQKSVSKRSDDTASVTGIAEAGKENLGASNSRPSESTVEANSPGDDTVNSASIPVVSG\
ENPLVTPYNGLRHSKDNSDSDGPAESMANPDSNSKGETGKGQDNDMAKATKDSSNSSDGT\
SSATGDTTDAVDREINKGVPEDRDKTVGSKDGGGEDNSANKDAATVVGEDRIRENSAGGS\
TNDRSKNDTEKNGASTPDSKQSEDATALSKTESLESTESGDRTTNDTTNSLENKNGGKEK\
DLQKHDFKSNDTPNEEPNSDQTTDAEGHDRDSIKNDKAERRKHMNKDTFTKNTNSHHLNS\
NNNLSNGKLDIKEYKYRDVKATREDIILMSSVRKCNNNISLEYCNSVEDKISSNTCSREK\
SKNLCCSISDFCLNYFDVYSYEYLSCMKKEFEDPSYKCFTKGGFKDKTYFAAAGALLILL\
LLIA"]

In [5]:
tokenized_sequence = tokenizer(sequence_to_predict)

Token indices sequence length is longer than the specified maximum sequence length for this model (1026 > 1024). Running this sequence through the model will result in indexing errors
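
The warning above appears because the sequence has 1024 residues and the tokenizer adds two special tokens, giving 1026 tokens against a stated maximum of 1024. If the over-length input turns out to be a problem, one option is to shorten the sequence before tokenizing. The sketch below uses a hypothetical helper, `truncate_residues`, which is not part of `transformers`; the default budget of 1024 tokens and 2 special tokens is taken from the warning message.

```python
# Hypothetical helper, not part of `transformers`: trim a protein sequence so
# that residues plus special tokens fit within a token budget.
def truncate_residues(sequence, max_tokens=1024, num_special_tokens=2):
    """Keep only as many leading residues as the token budget allows."""
    max_residues = max_tokens - num_special_tokens
    return sequence[:max_residues]


example_sequence = "M" * 1024  # stand-in with the same length as the sequence above
trimmed = truncate_residues(example_sequence)
print(len(trimmed) + 2)  # token count once the two special tokens are added
```

The tokenizer can also do this itself when called with `truncation=True` and `max_length=1024`. Either way, truncation discards the end of the protein, which may change the prediction if the removed residues are informative.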


# Create the dataset

In [6]:
from datasets import Dataset

dataset = Dataset.from_dict(tokenized_sequence)

# Create tensors

In [7]:
tf_set = model.prepare_tf_dataset(
    dataset,
    batch_size=1,
    shuffle=False,
    tokenizer=tokenizer
)

# Predict

In [8]:
pred = model.predict(tf_set)



Let's inspect the output of the transformer:


In [None]:
pred

- loss: None, because no labels were passed to the model.
- logits: unnormalized output scores for the two possible labels (0 and 1). The label with the higher logit is the one chosen by the model.
- hidden states: None, because they were not requested when loading the model.
- attentions: a list of tensors, one per layer, containing the attention weight of every input token with respect to every other input token. They can be used to estimate which regions of the input were most relevant to the prediction.
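
As a minimal sketch of how the logits can be read, the snippet below turns a pair of logits into probabilities with a softmax and picks the label with the highest score. The values in `example_logits` are made up for illustration and are not output from this model; with the real prediction you would use one row of `pred.logits` instead.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-1.2, 2.3]  # made-up values, one per label (0 and 1)
probabilities = softmax(example_logits)
predicted_label = max(range(len(example_logits)), key=lambda i: example_logits[i])
print(predicted_label, probabilities)
```

The softmax does not change which label wins; it only rescales the scores so they can be read as probabilities. Which class each label index stands for depends on how the model was fine-tuned.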

# Save attentions from output

In [10]:
attentions = pred.attentions
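
One simple way to summarize attention weights is to average, for each token, the attention it receives across all heads and all query positions. The sketch below does this in plain Python on a tiny made-up matrix for a single layer, so it runs without the model. It assumes each entry of `attentions` is shaped `(batch, heads, tokens, tokens)`; the `attention_received` helper and the toy numbers are hypothetical.

```python
# Toy attention for one layer and one sequence: 2 heads, 3 tokens.
# Indexing is toy_attention[head][query][key]; each query row sums to 1.
# The numbers are made up and are not output from the model.
toy_attention = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]],
    [[0.2, 0.2, 0.6], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]],
]


def attention_received(layer_attention):
    """Mean attention each key token receives, over heads and query positions."""
    num_heads = len(layer_attention)
    num_tokens = len(layer_attention[0])
    scores = []
    for key in range(num_tokens):
        total = sum(
            layer_attention[head][query][key]
            for head in range(num_heads)
            for query in range(num_tokens)
        )
        scores.append(total / (num_heads * num_tokens))
    return scores


received = attention_received(toy_attention)
print(received)  # one score per token; higher means more attention received
```

With real outputs the same reduction is usually done with array operations rather than loops. Scores like these are a rough indication of relevance, not a complete explanation of the prediction.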

# Transformers interpret