<a href="https://colab.research.google.com/github/brunojaime/hugging_face_projects/blob/master/stages_of_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stages of a pipeline
1) Tokenizer -> Converts the raw text into tokens and their numeric IDs
2) Model -> Takes the token IDs as input and produces the logits
3) Post-processing -> Transforms the logits into labels and scores

## 1) Tokenizer
- The text is split into tokens
- The tokenizer adds special tokens if the model expects them. Example: [CLS] and [SEP], which mark the beginning and the end of the input
- The tokenizer maps each token to its unique ID in the vocabulary of the pretrained model
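The three steps above can be sketched without the library. The snippet below is a toy illustration, not the real tokenizer: `toy_vocab` and `toy_encode` are hypothetical names, the vocabulary holds only IDs that appear in the tokenizer output printed later in this notebook, and splitting on whitespace stands in for the subword splitting a real tokenizer performs.

```python
# Toy vocabulary (hypothetical): IDs copied from the tokenizer output in this notebook
toy_vocab = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
    "the": 1996, "service": 2326, "was": 2001, "incredible": 9788,
    "and": 1998, "food": 2833, "delicious": 12090,
}


def toy_encode(text):
    """Split, add special tokens, then look up IDs (simplified illustration)."""
    tokens = text.lower().split()              # step 1: split into tokens
    tokens = ["[CLS]"] + tokens + ["[SEP]"]    # step 2: add special tokens
    ids = [toy_vocab[token] for token in tokens]  # step 3: map tokens to IDs
    return tokens, ids


tokens, ids = toy_encode("The service was incredible and the food delicious")
print(tokens)
print(ids)
```

A word missing from `toy_vocab` raises a `KeyError` here; a real tokenizer would instead fall back to smaller subword pieces or an unknown token.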

In [None]:
from transformers import AutoTokenizer


checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # from_pretrained downloads and caches the tokenizer files for this checkpoint



In [None]:
raw_inputs = [
    "The service was incredible and the food delicious",
    "I won´t buy in this place, they are so rude!",
]
# padding=True: the two sentences have different lengths, so the shorter one is padded to build a rectangular tensor
# truncation=True: any sentence longer than the maximum length the model can handle is truncated
# return_tensors="tf": the result is returned as TensorFlow tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
print(inputs)

{'input_ids': <tf.Tensor: shape=(2, 16), dtype=int32, numpy=
array([[  101,  1996,  2326,  2001,  9788,  1998,  1996,  2833, 12090,
          102,     0,     0,     0,     0,     0,     0],
       [  101,  1045,  2180, 29658,  2102,  4965,  1999,  2023,  2173,
         1010,  2027,  2024,  2061, 12726,   999,   102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 16), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}
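To make the padding and the attention mask in the output above concrete, here is a minimal pure-Python sketch. It assumes the pad ID is 0, as the printed `input_ids` suggest; `PAD_ID` and `unpadded` are illustrative names, and the real tokenizer does this work internally.

```python
PAD_ID = 0  # assumption: padding ID, matching the zeros in the printed input_ids

# Unpadded token IDs of the two sentences, copied from the output above
unpadded = [
    [101, 1996, 2326, 2001, 9788, 1998, 1996, 2833, 12090, 102],
    [101, 1045, 2180, 29658, 2102, 4965, 1999, 2023, 2173,
     1010, 2027, 2024, 2061, 12726, 999, 102],
]

# Pad every sequence on the right up to the length of the longest one
max_len = max(len(ids) for ids in unpadded)
input_ids = [ids + [PAD_ID] * (max_len - len(ids)) for ids in unpadded]

# 1 marks a real token, 0 marks padding that the model should ignore
attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in unpadded]

for row in input_ids:
    print(row)
for row in attention_mask:
    print(row)
```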


## 2) Model

In [None]:
from transformers import TFAutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# from_pretrained downloads and caches the model configuration and the pretrained weights
# TFAutoModel loads the base model without its task-specific head
# Its output is a high-dimensional tensor (the hidden states) that represents the input sentences
model = TFAutoModel.from_pretrained(checkpoint)

outputs = model(inputs)

print(outputs.last_hidden_state.shape)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'classifier.bias']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


(2, 16, 768)


The output shape is (2, 16, 768):

shape[0] -> batch size (2 sentences)

shape[1] -> sequence length (16 tokens after padding)

shape[2] -> hidden size (768 values per token)

To get a result for our classification problem we need to use TFAutoModelForSequenceClassification.
It works like TFAutoModel except that it builds the model with a classification head on top.
The Transformers library provides one such Auto class for each common NLP task.
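As a rough picture of what a classification head does to the shapes, the NumPy sketch below goes from (2, 16, 768) to (2, 2). It assumes the head reads the hidden state of the first token and applies a dense layer, a ReLU, and a final dense layer, which is our reading of the `pre_classifier` and `classifier` weights named in the log above. All values are random and the weight names are hypothetical, so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden_size, num_labels = 2, 16, 768, 2

# Stand-in for outputs.last_hidden_state (random values, correct shape only)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Hypothetical head weights; the real ones are loaded from the checkpoint
w_pre = rng.normal(size=(hidden_size, hidden_size)) * 0.02
b_pre = np.zeros(hidden_size)
w_cls = rng.normal(size=(hidden_size, num_labels)) * 0.02
b_cls = np.zeros(num_labels)

first_token = last_hidden_state[:, 0, :]               # (2, 768): one vector per sentence
hidden = np.maximum(first_token @ w_pre + b_pre, 0.0)  # dense layer followed by ReLU
logits = hidden @ w_cls + b_cls                        # (2, 2): one score per label

print(first_token.shape, logits.shape)
```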

In [None]:
from transformers import TFAutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(inputs)
outputs.logits

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.373304 ,  4.7180037],
       [ 3.6227982, -2.9700565]], dtype=float32)>

These outputs are not probabilities yet: the models in the Transformers library return logits, the raw unnormalized scores of the last layer. To make sense of these logits, we move on to the next step.

## 3) Post-processing
To convert the logits into probabilities we apply a softmax to them.
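As a sanity check on what softmax does, the sketch below applies it by hand with NumPy to the logits printed in the previous section. The `softmax` helper and the hard-coded `id2label` dictionary are illustrative stand-ins (the mapping mirrors `model.config.id2label` as listed at the end of this notebook). The percentages it prints should agree with the ones the TensorFlow version reports.

```python
import numpy as np

# Logits copied from the TFAutoModelForSequenceClassification output above
logits = np.array([[-4.373304, 4.7180037],
                   [3.6227982, -2.9700565]])


def softmax(x):
    """Row-wise softmax: exponentiate, then normalize so each row sums to 1."""
    shifted = x - x.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Illustrative copy of the label mapping (0 -> NEGATIVE, 1 -> POSITIVE)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(logits)
for row in probs:
    idx = int(np.argmax(row))  # index of the most likely class
    print(id2label[idx], round(float(row[idx]) * 100, 2))
```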

In [None]:
import numpy as np
import tensorflow as tf

# Softmax over the last axis turns each row of logits into probabilities that sum to 1
predictions = tf.math.softmax(outputs.logits, axis=-1).numpy()
for sentence, prediction in zip(raw_inputs, predictions):
    # Map the index of the most likely class to its label name
    label = model.config.id2label[int(np.argmax(prediction))]
    confidence = float(np.max(prediction)) * 100
    print(f"{sentence} -> {label} ({round(confidence, 2)})%")


The service was incredible and the food delicious -> POSITIVE (99.99)%
I won´t buy in this place, they are so rude! -> NEGATIVE (99.86)%


In [None]:
model.config.id2label  # Shows which class index corresponds to which label
# Video: https://www.youtube.com/watch?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&v=wVN12smEvqg

We see that:

0 -> NEGATIVE

1 -> POSITIVE