## Understanding the Hugging Face Pipeline Blocks

Models in a Hugging Face pipeline are composed of three main blocks:
- **Tokenizer**: Maps text to token IDs.
- **Model**: Takes the token IDs, maps each one to a token vector, and transforms those vectors into high-dimensional vectors that capture contextual understanding.
- **Head**: The final layers for our task (classification, question answering, masked-word prediction, etc.).

![title](./HuggingBlockStructure.png)

## Each step expanded

### 1- Tokenizer

The first step is to use a tokenizer to transform our plain text into token IDs based on its vocabulary.

Some important details:
- **Padding**: Adds a special token so that all sentences in a batch have the same number of tokens.
- **Attention mask**: Tensor of 0s and 1s, where 1 marks tokens that should be attended to and 0 marks tokens (the padding) that should be ignored.
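
To make padding and the attention mask concrete, here is a minimal pure-Python sketch. The helper `pad_batch` is hypothetical (it is not part of `transformers`), and it assumes right-side padding with a pad ID of 0, which matches the tokenizer output shown in this section. The token IDs are the ones the tokenizer produces for the two example sentences.

```python
from typing import List, Tuple


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        # Real tokens keep their IDs; the tail is filled with the pad ID.
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding positions.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Token IDs of the two example sentences (special tokens included).
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]
input_ids, attention_mask = pad_batch(batch, pad_id=0)
print(input_ids[1])
print(attention_mask[1])
```

The shorter sentence is extended with eight pad IDs, and its mask has eight trailing zeros so the model can ignore those positions.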

In [2]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

print('special token padding ID:', tokenizer.pad_token_id)

raw_inputs = ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, max_length=512, return_tensors="tf")

print(inputs)

special token padding ID: 0
{'input_ids': <tf.Tensor: shape=(2, 16), dtype=int32, numpy=
array([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662,
        12172,  2607,  2026,  2878,  2166,  1012,   102],
       [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,
            0,     0,     0,     0,     0,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 16), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}


#### 1.1 Tokenization step by step
We can do the same thing step by step if we want more detail about what is happening behind the scenes.

**NOTE:** Be careful: this manual procedure does not add special tokens such as 101 -> [CLS] and 102 -> [SEP].

In [16]:
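import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# split the sentence into tokens using the model's vocabulary (DistilBERT)
tokens = tokenizer.tokenize(raw_inputs[0])
print(tokens)

# convert the tokens to IDs based on the vocabulary
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

# decode the token IDs back into a string
decoded_string = tokenizer.decode(ids)
print(decoded_string)

# add a batch dimension: shape (1, sequence_length)
model_inputs = tf.constant([ids])
print(model_inputs)

# `model` must be a sequence classification model, since `output.logits` is read below
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
output = model(model_inputs)
print("Logits:", output.logits)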

['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.']
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
i've been waiting for a huggingface course my whole life.
tf.Tensor(
[[ 1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026  2878
   2166  1012]], shape=(1, 14), dtype=int32)
Logits: tf.Tensor([[-2.7276194  2.8789368]], shape=(1, 2), dtype=float32)
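
The manual IDs above lack the special tokens. A minimal sketch of adding them by hand follows. The helper `add_special_tokens` is hypothetical and assumes the single-sentence layout `[CLS] ... [SEP]` with the IDs 101 and 102 mentioned in the note.

```python
CLS_ID = 101  # [CLS], as stated in the note above
SEP_ID = 102  # [SEP], as stated in the note above

# IDs produced by the manual tokenize + convert_tokens_to_ids steps.
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


def add_special_tokens(token_ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap a single sequence as [CLS] tokens... [SEP] (hypothetical helper)."""
    return [cls_id] + list(token_ids) + [sep_id]


ids_with_special = add_special_tokens(ids)
print(ids_with_special)
```

The result matches the first row of `input_ids` returned by calling the tokenizer directly. The missing special tokens are a plausible reason why the logits printed here differ from the ones the full pipeline gives for the same sentence. In practice, the tokenizer's own `build_inputs_with_special_tokens` method can do this wrapping.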


### 2- Transformer network

This block takes the token IDs, uses an embedding table to map each one to a representative token vector, and passes those vectors through the subsequent layers to generate the final high-dimensional vectors that represent the sentence.

In [3]:
from transformers import TFAutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertModel: ['classifier', 'dropout_19', 'pre_classifier']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


(2, 16, 768)
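
The shape `(2, 16, 768)` reads as (batch size, sequence length, hidden size): one 768-dimensional vector per token. The toy sketch below illustrates only the first stage, the embedding lookup, with made-up sizes and random vectors. It omits the transformer layers that a real model applies afterwards to mix information across positions.

```python
import random

random.seed(0)

VOCAB_SIZE = 10   # toy value; a real vocabulary is far larger
HIDDEN_SIZE = 4   # toy value; the output above has 768

# One vector per token ID (random here, learned in a real model).
embedding_table = [
    [random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
    for _ in range(VOCAB_SIZE)
]


def embed(batch_ids):
    """Replace every token ID with its row of the embedding table."""
    return [[embedding_table[token_id] for token_id in seq] for seq in batch_ids]


batch_ids = [[1, 5, 2, 0], [1, 7, 2, 0]]  # batch of 2 sequences, 4 tokens each
hidden = embed(batch_ids)
shape = (len(hidden), len(hidden[0]), len(hidden[0][0]))
print(shape)  # (2, 4, 4)
```

Identical token IDs map to identical vectors at this stage; it is the later layers that make each vector depend on its context.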


Note that with the `TFAutoModel` class we get the hidden states, i.e. the vector representation of each token in the sentence. From here, we can use these "features" to build a model for our specific task.

For example, Hugging Face provides plenty of model classes for well-known tasks, such as:

- Model (retrieve the hidden states)
- ForCausalLM
- ForMaskedLM
- ForMultipleChoice
- ForQuestionAnswering
- ForSequenceClassification
- ForTokenClassification

Hence, if we are interested in classifying a sentence as positive or negative, we can use a sequence classifier.

In [5]:
from transformers import TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

print(outputs)
print(outputs.logits)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_38']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-1.5606973,  1.612282 ],
       [ 4.1692305, -3.3464472]], dtype=float32)>, hidden_states=None, attentions=None)
tf.Tensor(
[[-1.5606973  1.612282 ]
 [ 4.1692305 -3.3464472]], shape=(2, 2), dtype=float32)
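
To see what the head adds on top of the hidden states, here is a toy pure-Python sketch. It assumes the head follows the layer names listed in the loading warning above (`pre_classifier`, dropout, `classifier`) and reads the vector at position 0 (the `[CLS]` token). All weights and sizes are made up, and dropout is left out because it is inactive at inference time.

```python
def dense(x, weights, bias):
    """Compute x @ W + b for one vector x; weights has shape (len(x), len(bias))."""
    return [
        sum(xi * w_row[j] for xi, w_row in zip(x, weights)) + bias[j]
        for j in range(len(bias))
    ]


def relu(x):
    return [max(0.0, v) for v in x]


def classification_head(hidden_states, w_pre, b_pre, w_cls, b_cls):
    first_token = hidden_states[0]                    # vector at position 0 ([CLS])
    pooled = relu(dense(first_token, w_pre, b_pre))   # stands in for pre_classifier
    return dense(pooled, w_cls, b_cls)                # stands in for classifier: one logit per label


# Toy hidden states for one sentence: 3 tokens, hidden size 2.
hidden_states = [[1.0, -2.0], [0.5, 0.5], [0.0, 1.0]]
w_pre = [[1.0, 0.0], [0.0, 1.0]]   # identity weights, for readability
b_pre = [0.0, 0.0]
w_cls = [[2.0, -1.0], [1.0, 3.0]]
b_cls = [0.5, 0.0]

logits = classification_head(hidden_states, w_pre, b_pre, w_cls, b_cls)
print(logits)  # [2.5, -1.0]
```

Only the first token's vector feeds the head in this sketch; the other positions influence the result indirectly, through the transformer layers that produced the hidden states.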


`TFAutoModelForSequenceClassification` includes both the Model and the Head for the specific task. As a result, it returns the logits; we just need to convert them to probabilities with the softmax function.

In [7]:
import tensorflow as tf

predictions = tf.math.softmax(outputs.logits, axis=-1)

print('model predictions:', predictions)
print('Model labels order:', model.config.id2label)

model predictions: tf.Tensor(
[[4.0195312e-02 9.5980465e-01]
 [9.9945587e-01 5.4418476e-04]], shape=(2, 2), dtype=float32)
Model labels order: {0: 'NEGATIVE', 1: 'POSITIVE'}
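
For reference, softmax can be reproduced by hand. The sketch below is a plain-Python version (not the TensorFlow implementation) applied to the logits printed earlier, followed by a lookup of the most likely label in the `id2label` mapping shown above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
batch_logits = [[-1.5606973, 1.612282], [4.1692305, -3.3464472]]

results = []
for logits in batch_logits:
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    results.append((id2label[best], probs[best]))
    print(id2label[best], round(probs[best], 4))
```

The first sentence comes out as POSITIVE and the second as NEGATIVE, with probabilities that agree with the TensorFlow output above up to floating-point precision.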
