<a href="https://colab.research.google.com/github/francescopatane96/ESM2_experiments/blob/main/ESM2_classification_FINETUNED_tf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install transformers evaluate datasets requests pandas scikit-learn

# Load the fine-tuned model from Hugging Face

In [50]:
from transformers import TFEsmForSequenceClassification

model = TFEsmForSequenceClassification.from_pretrained("francescopatane/esm2m35finetunedcytmem", output_attentions=True)

All model checkpoint layers were used when initializing TFEsmForSequenceClassification.

All the layers of TFEsmForSequenceClassification were initialized from the model checkpoint at francescopatane/esm2m35finetunedcytmem.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFEsmForSequenceClassification for predictions without further training.


## Tokenizing the data

All inputs to neural networks must be numerical. The process of converting strings into numerical indices suitable for a neural network is called **tokenization**. For natural language this can be quite complex: the network's vocabulary usually does not contain every possible word, so the tokenizer must split rarer words into pieces and also handle capitalization, Unicode characters, and so on.

With proteins, however, things are much simpler. In protein language models, each amino acid is converted to a single token. Every model in `transformers` comes with an associated tokenizer that handles tokenization for it, and protein language models are no different. Let's get our tokenizer!
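
To make the one-token-per-residue idea concrete, here is a minimal sketch of a character-level protein tokenizer. The vocabulary, its index order, and the `<cls>`/`<eos>`/`<unk>` handling are illustrative assumptions and not the actual ESM-2 vocabulary; the real mapping comes from the tokenizer loaded in the next cell.

```python
# Toy character-level tokenizer: one token per amino acid.
# NOTE: this vocabulary and its index order are hypothetical, for illustration only.
SPECIAL_TOKENS = ["<cls>", "<pad>", "<eos>", "<unk>"]
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues
TOY_VOCAB = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS + AMINO_ACIDS)}


def toy_tokenize(sequence):
    """Map each residue to one integer ID, wrapped in start/end tokens."""
    unk_id = TOY_VOCAB["<unk>"]
    residue_ids = [TOY_VOCAB.get(residue, unk_id) for residue in sequence]
    return [TOY_VOCAB["<cls>"]] + residue_ids + [TOY_VOCAB["<eos>"]]


toy_ids = toy_tokenize("MAVMAPRT")
print(len(toy_ids))  # 8 residues plus 2 special tokens
print(toy_ids)
```

Unlike a subword tokenizer for natural language, nothing here needs to be learned or split: the token count is always the sequence length plus the special tokens.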

In [51]:
from transformers import AutoTokenizer

model_for_tokenizer = "francescopatane/esm2m35finetunedcytmem"
tokenizer = AutoTokenizer.from_pretrained(model_for_tokenizer)



In [52]:
sequence_to_predict = ["MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIA"]

In [53]:
tokenized_sequence = tokenizer(sequence_to_predict)

In [54]:
from datasets import Dataset

dataset = Dataset.from_dict(tokenized_sequence)

In [55]:
tf_set = model.prepare_tf_dataset(
    dataset,
    batch_size=1,
    shuffle=False,
    tokenizer=tokenizer
)

In [58]:
pred = model.predict(tf_set)



In [64]:
pred

TFSequenceClassifierOutput(loss=None, logits=array([[ 0.33996692, -0.3757285 ]], dtype=float32), hidden_states=None, attentions=(array([[[[4.33955379e-02, 7.35736713e-02, 4.22732197e-02, ...,
          2.97519309e-03, 1.69191435e-02, 1.95562895e-02],
         ...,
[output truncated]


In [None]:
pred.attentions
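
Because the model was loaded with `output_attentions=True`, `pred.attentions` holds one array per transformer layer. The sketch below uses random stand-in data rather than real model output, and assumes the usual `transformers` layout of `(batch, heads, seq_len, seq_len)` per layer. The layer, head, and length values are placeholders, so check `len(pred.attentions)` and `pred.attentions[0].shape` on the real output before relying on them.

```python
import numpy as np

# Stand-in for pred.attentions: random data, NOT real model output.
# Sizes are placeholder assumptions for illustration.
num_layers, batch, num_heads, seq_len = 3, 1, 4, 10
rng = np.random.default_rng(0)
raw = rng.random((num_layers, batch, num_heads, seq_len, seq_len))
# Normalise each row so it sums to 1, as softmax attention weights do.
fake_attentions = tuple(layer / layer.sum(axis=-1, keepdims=True) for layer in raw)

last_layer = fake_attentions[-1]          # (batch, heads, seq_len, seq_len)
head_average = last_layer.mean(axis=1)    # (batch, seq_len, seq_len)
# Attention each position receives, averaged over all query positions.
received = head_average[0].mean(axis=0)   # (seq_len,)
top_positions = np.argsort(received)[::-1][:3]

print(head_average.shape)
print(top_positions)
```

Positions index tokens, not residues, so any special tokens added by the tokenizer shift the residue numbering.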

In [60]:
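# Store the label mapping on the model config, where transformers keeps it.
model.config.label2id = {"cytosol": 0, "membrane": 1}
model.config.id2label = {idx: label for label, idx in model.config.label2id.items()}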
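
The logits printed above are unnormalised scores. Here is a minimal sketch of turning them into class probabilities and a label name, using the logits copied from that output and the label mapping from the previous cell. The `softmax` helper is written here for illustration and is not part of the original notebook.

```python
import numpy as np

# Logits copied from the prediction output shown earlier in this notebook.
logits = np.array([[0.33996692, -0.3757285]], dtype=np.float32)
id2label = {0: "cytosol", 1: "membrane"}


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


probabilities = softmax(logits)
predicted_ids = probabilities.argmax(axis=-1)
predicted_labels = [id2label[int(i)] for i in predicted_ids]

print(probabilities)
print(predicted_labels)
```

In the notebook itself, `pred.logits` can take the place of the hard-coded array.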