## Pipelines: Underneath the Hood
A Hugging Face pipeline encapsulates several steps into a single callable object, including tokenization, running the model, and post-processing the model's outputs. Here's how you can perform each step individually:

Tokenization: This is responsible for converting the input text into a format that the model can understand.

Model: This is the Transformer model that performs the actual task. The model takes the tensors produced by the tokenizer and returns output tensors.

Post-processor: This takes the output tensors from the model and converts them into a more user-friendly format.
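The three stages above can be sketched as three small functions chained together. This is a toy illustration only: the vocabulary, the per-token weights, and every `toy_*` name are made up for this example and have nothing to do with the real DistilBERT model, which uses a learned neural network in the middle step.

```python
import math

# Hypothetical toy vocabulary and per-token weights, for illustration only.
TOY_VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "dislike": 3, "it": 4}
TOY_WEIGHTS = {0: 0.0, 1: 0.0, 2: 2.0, 3: -2.0, 4: 0.0}
TOY_LABELS = ["NEGATIVE", "POSITIVE"]


def toy_tokenize(text):
    """Step 1: text -> list of integer token IDs."""
    return [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in text.lower().split()]


def toy_model(token_ids):
    """Step 2: token IDs -> raw scores (logits) for [NEGATIVE, POSITIVE]."""
    score = sum(TOY_WEIGHTS[t] for t in token_ids)
    return [-score, score]


def toy_postprocess(logits):
    """Step 3: logits -> a readable label and probability."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": TOY_LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, as a pipeline object does internally."""
    return toy_postprocess(toy_model(toy_tokenize(text)))


print(toy_pipeline("I love it"))
print(toy_pipeline("I dislike it"))
```

The rest of this notebook performs the same three steps with the real tokenizer and model.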

### Table of Contents <a name="top"></a>
1. [Pipeline Review](#review)
2. [Tokenizer](#tokenizer)
3. [Create a model and inference the tokens](#model)
4. [Model Output](#output)
5. [Your Assignment](#assignment)

In [1]:
# Import everything we need up front
from transformers import pipeline
import torch
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import AutoModel



## Review <a name="review"></a>
Here is the same kind of pipeline we built in the last notebook.

In [2]:
from transformers import pipeline

# Recall: the same sentiment-analysis pipeline as in the last notebook
sa = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")



In [3]:
# Let's create a list of a couple sentences to be sentiment analyzed
raw_inputs = [
    "When my friends ask me about my experience in this class, my response is I really love it!",
    "When I am honest, the biggest thing I dislike about this class is the long, boring lectures.",
]

In [4]:
# Just like last notebook, let's analyze them with the pipeline
sa(raw_inputs)

[{'label': 'POSITIVE', 'score': 0.9998725652694702},
 {'label': 'NEGATIVE', 'score': 0.9992125034332275}]

In [5]:
# Look inside the Pipeline
print('Tokenizer:', sa.tokenizer)
print('\n\nModel Config:', sa.model.config)

Tokenizer: DistilBertTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)


Model Config: DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_

### What is going on inside?
![alt text](images/tokenizer.jpg)
[Top of Page](#top)

## Tokenizer <a name="tokenizer"></a>
A tokenizer in the context of AI is a crucial component that converts raw text data into a numeric format that can be processed by machine learning models. 

It achieves this by breaking down the input text into smaller, meaningful units called tokens, which can be individual words, sub-words, or characters. The tokenizer also maps each token to a unique identifier, typically an integer value. These identifiers index into the vocabulary the model was trained with, so the tokenizer and the model must come from the same checkpoint.

Hugging Face provides a wide range of pre-trained tokenizers tailored to specific language models, which ensures that your text data is properly formatted and aligned with the expectations of the corresponding model, enabling you to leverage the power of state-of-the-art language models for your NLP tasks.

In the context of Hugging Face transformer models, some of the common tokenizers used include:

1. **BertTokenizer**: This tokenizer was developed for the BERT (Bidirectional Encoder Representations from Transformers) model. It tokenizes the input text into wordpieces, which are sub-word units that help handle out-of-vocabulary words.

2. **GPT2Tokenizer**: This tokenizer was developed for the GPT-2 (Generative Pre-trained Transformer 2) model. It uses a byte-level Byte-Pair Encoding (BPE) algorithm to tokenize the input text.

3. **RobertaTokenizer**: This tokenizer was developed for the RoBERTa (Robustly Optimized BERT Pretraining Approach) model. It is based on the BPE algorithm and is similar to the GPT2Tokenizer.

4. **DistilBertTokenizer**: This tokenizer was developed for the DistilBERT model, which is a smaller and faster version of the BERT model. It uses the same tokenization approach as the BertTokenizer.

These are just a few examples of the common tokenizers used in Hugging Face transformer models. The choice of tokenizer often depends on the specific model and the task at hand, as different tokenizers may have different strengths and weaknesses.
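To make the wordpiece idea concrete, here is a minimal sketch of greedy longest-match sub-word splitting, the general strategy WordPiece-style tokenizers use. The mini-vocabulary and the `wordpiece` function are hypothetical and written for this example; the real tokenizer has a far larger vocabulary (the one printed earlier reports `vocab_size=30522`) and also handles casing, punctuation, and special tokens.

```python
# Hypothetical mini-vocabulary, for illustration only.
TOY_VOCAB = {"[UNK]": 0, "token": 1, "##ize": 2, "##r": 3, "##s": 4, "the": 5}


def wordpiece(word, vocab):
    """Split one lowercase word into sub-word pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


tokens = wordpiece("tokenizers", TOY_VOCAB)
ids = [TOY_VOCAB[t] for t in tokens]
print(tokens)
print(ids)
```

A word that is missing from the vocabulary as a whole can still be represented by pieces that are present, which is how sub-word tokenizers cope with rare words.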

In [6]:
# Let's create just a tokenizer
from transformers import AutoTokenizer

# Use the same model as we did in the pipeline above
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# But now, create  just the tokenizer
tok = AutoTokenizer.from_pretrained(checkpoint)

In [7]:
# Here is our original raw_inputs from above, just a list of sentences
print('Original raw_input:')
for s in raw_inputs:
    print(s)
# And here is how the tokenizer breaks each sentence into text tokens. These are still strings, not numbers.
print('\nHere are the tokens:')
for s in raw_inputs:
    print(tok.tokenize(s))

Original raw_input:
When my friends ask me about my experience in this class, my response is I really love it!
When I am honest, the biggest thing I dislike about this class is the long, boring lectures.

Here are the tokens:
['when', 'my', 'friends', 'ask', 'me', 'about', 'my', 'experience', 'in', 'this', 'class', ',', 'my', 'response', 'is', 'i', 'really', 'love', 'it', '!']
['when', 'i', 'am', 'honest', ',', 'the', 'biggest', 'thing', 'i', 'dislike', 'about', 'this', 'class', 'is', 'the', 'long', ',', 'boring', 'lectures', '.']


#### Convert words to numbers

In [8]:
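# Convert the words to a numeric representation. This is the primary function of the tokenizer.
# Both sentences happen to have the same number of tokens; for unequal lengths, also pass padding=True.
tok_inputs = tok(raw_inputs, return_tensors="pt") # use PyTorch tensors rather than TensorFlow tensors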
#
# Look at the sentences that are now tokenized as numbers
print("Tokenized sentences:")
for i,t in  enumerate(tok_inputs['input_ids']):
    print('Sentence',i+1, t, '\n size of this tensor:', t.shape)
#
# Look at the attention mask. It is part of the model input and marks real tokens (1) versus padding (0),
# so the model can ignore padding. Both sentences are the same length here, so every entry is 1.
print('Attention tensor:', tok_inputs['attention_mask'])

Tokenized sentences:
Sentence 1 tensor([ 101, 2043, 2026, 2814, 3198, 2033, 2055, 2026, 3325, 1999, 2023, 2465,
        1010, 2026, 3433, 2003, 1045, 2428, 2293, 2009,  999,  102]) 
 size of this tensor: torch.Size([22])
Sentence 2 tensor([  101,  2043,  1045,  2572,  7481,  1010,  1996,  5221,  2518,  1045,
        18959,  2055,  2023,  2465,  2003,  1996,  2146,  1010, 11771,  8921,
         1012,   102]) 
 size of this tensor: torch.Size([22])
Attention tensor: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
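The attention mask is all ones here only because both sentences happen to contain 22 tokens. The sketch below shows, in plain Python, what padding and the attention mask look like when the lengths differ. The `pad_batch` helper and the short batch are hypothetical and written for this example; with the real tokenizer, passing `padding=True` in the tokenizer call produces the padded IDs and mask for you.

```python
PAD_ID = 0  # matches "pad_token_id": 0 in the model config printed earlier


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad ID lists to equal length and build the matching attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = []
    attention_mask = []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical short batch. IDs 101 and 102 are the [CLS] and [SEP] tokens
# that open and close each sentence in the output above.
batch = [[101, 2293, 2009, 102], [101, 2293, 102]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

The zero in the second mask row tells the model that the final position is padding rather than a real token.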


[Top of Page](#top)
## Create a model and inference the tokens <a name="model"> </a>

In [9]:
from transformers import AutoModelForSequenceClassification

# Create the same model as used for both the pipeline and the tokenizer above
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,output_hidden_states=True,output_attentions=True)

In [10]:
# Run inference: send the tokenized sentences through the model
outputs = model(**tok_inputs) # Give the model all of our inputs (** just means unpack the dictionary)

# The output from the model is a little complex, let's look at a few details
print("What is the type of the outputs object?", type(outputs))
print("What is the size of the output object?", len(outputs))

What is the type of the outputs object? <class 'transformers.modeling_outputs.SequenceClassifierOutput'>
What is the size of the output object? 3


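When you call this model, it returns a SequenceClassifierOutput object whose fields can be accessed by name or by index, like a tuple. With the options set above, it has the following three elements: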

**Logits** (outputs.logits or outputs[0]): This is a tensor of shape (batch_size, num_labels). Since this is a binary classification model, num_labels is 2. The logits are the raw, unnormalized scores that the model assigns to each class. You can convert these into probabilities by applying the softmax function.

**Hidden states** (outputs.hidden_states or outputs[1]): This is a tuple of tensors representing the hidden states from all layers of the model. This is only returned if you set output_hidden_states=True when creating the model. Each tensor in the tuple has shape (batch_size, sequence_length, hidden_size), and outputs.hidden_states[0] gives the initial embeddings.

**Attention weights** (outputs.attentions or outputs[2]): This is a tuple of tensors representing the attention weights from all layers of the model. This is only returned if you set output_attentions=True when creating or calling the model (here it was set when creating the model). Each tensor in the tuple has shape (batch_size, num_heads, sequence_length, sequence_length).

In [11]:
# Let's look at the hidden_states, the 2nd item in the tuple
print('Size of the hidden_state object:', len(outputs.hidden_states))

Size of the hidden_state object: 7


For this model, there are 7 hidden states.

- outputs.hidden_states[0]: The initial embedding of the input tokens
- outputs.hidden_states[1] through outputs.hidden_states[6]: The outputs of the 6 Transformer layers in the model


In each of these tensors, every row corresponds to one token in your input, and the values in the row are the elements of the vector that represents that token. The length of these vectors is the model's hidden size, which is 768 for base models like BERT-base and DistilBERT-base (the "dim": 768 entry in the config above).

In [12]:
# Let's look at the first hidden state: the initial token embeddings
outputs.hidden_states[0].shape

torch.Size([2, 22, 768])

Here's what each dimension represents:

**First dimension** (2): This is typically the batch size, i.e., the number of samples that are processed together. In our case, we sent 2 sentences.

**Second dimension** (22): This is usually the sequence length, i.e., the number of tokens in each sample. Here, each sentence contains 22 tokens.

**Third dimension** (768): This is the size of the hidden states, i.e., the dimensionality of the vectors that represent each token. In models like BERT-base and DistilBERT-base, each token is represented by a vector of size 768.

So, a tensor of shape [2, 22, 768] represents a batch of 2 sentences, where each sentence is a sequence of 22 tokens, and each token is represented by a vector of size 768. This is a common shape for the output of a transformer model like BERT or DistilBERT.
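As a rough illustration of how the three dimensions nest, the sketch below builds a stand-in with the same shape out of plain Python lists and indexes into it. The zero values and the `shape` helper are placeholders written for this example; the real hidden state is a PyTorch tensor holding learned values.

```python
BATCH, SEQ_LEN, HIDDEN = 2, 22, 768  # dimensions taken from the shape printed above

# Stand-in filled with zeros; a real hidden state holds learned values.
hidden_state = [[[0.0] * HIDDEN for _ in range(SEQ_LEN)] for _ in range(BATCH)]


def shape(nested):
    """Return the dimensions of a regular nested list, outermost first."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims


first_sentence = hidden_state[0]        # all 22 token vectors of sentence 1
first_token_vector = hidden_state[0][0] # the 768 values for its first token

print(shape(hidden_state))
print(shape(first_sentence))
print(shape(first_token_vector))
```

Each index removes one dimension: choosing a sentence leaves a 22 x 768 block, and choosing a token within it leaves a single vector of 768 numbers.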

In [13]:
# Look at the initial embeddings. The first block printed below belongs to the first sentence.
print('Input sentence:', raw_inputs[0])
# hidden_states[0] holds the embeddings for the whole batch. They flow into the first Transformer layer of the trained model.
outputs.hidden_states[0]

Input sentence: When my friends ask me about my experience in this class, my response is I really love it!


tensor([[[ 0.3549, -0.1386, -0.2253,  ...,  0.1536,  0.0748,  0.1310],
         [-1.2662,  0.7665, -0.4574,  ..., -0.4349,  1.0169,  0.4746],
         [ 0.4449,  0.5868, -1.0603,  ..., -1.1192, -0.1726, -0.1478],
         ...,
         [-0.5035, -0.6060, -0.0490,  ...,  0.6768,  0.2647,  0.0453],
         [ 1.3130, -0.0888, -1.0304,  ...,  0.1210,  0.3533,  0.3376],
         [-0.2599,  0.0690, -0.3147,  ..., -0.3003,  0.2748, -0.1766]],

        [[ 0.3549, -0.1386, -0.2253,  ...,  0.1536,  0.0748,  0.1310],
         [-1.2662,  0.7665, -0.4574,  ..., -0.4349,  1.0169,  0.4746],
         [-0.2136,  0.3967, -0.3817,  ...,  0.4060,  0.8568,  0.3673],
         ...,
         [-0.1879,  0.2698, -0.6148,  ...,  0.2745, -0.0660, -0.0161],
         [ 0.0614,  0.1503, -0.3501,  ...,  0.4288,  0.3237,  0.1198],
         [-0.2599,  0.0690, -0.3147,  ..., -0.3003,  0.2748, -0.1766]]],
       grad_fn=<NativeLayerNormBackward0>)

![alt text](images/heads.jpg)
[Top of Page](#top)

## Model output <a name="output"> </a>

Recall from above that the logits are the output that matters most for predicting the class of each sentence (positive or negative).

**Logits** (outputs.logits or outputs[0]): This is a tensor of shape (batch_size, num_labels). Since this is a binary classification model, num_labels is 2. The logits are the raw, unnormalized scores that the model assigns to each class. You can convert these into probabilities by applying the softmax function.

The **softmax** function is commonly used in machine learning models to convert a vector of real numbers into a probability distribution. Given an input vector x, the softmax function computes the exponential (exp) of every element in the vector, and then normalizes the result by dividing each exp(x_i) by the sum of the exponentials of all elements.


In [14]:
# These are the raw class scores (logits) for NEGATIVE and POSITIVE; they are not yet normalized
outputs.logits

tensor([[-4.3012,  4.6666],
        [ 3.8892, -3.2567]], grad_fn=<AddmmBackward0>)
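To see what softmax does to these numbers, the calculation can be worked through in plain Python. This is a minimal sketch of the formula described above, not the library's implementation; the logits are copied from the printed output and rounded to four decimals, so the results agree with the model's probabilities only to about four digits.

```python
import math


def softmax(scores):
    """Exponentiate each score, then divide by the sum of the exponentials."""
    largest = max(scores)
    # Subtracting the largest score first avoids overflow and leaves the result unchanged.
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


logits = [[-4.3012, 4.6666], [3.8892, -3.2567]]  # copied from the output above

for row in logits:
    probs = softmax(row)
    print(probs, "sum =", sum(probs))
```

Each row now sums to 1, and the larger logit in each row receives nearly all of the probability.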

In [15]:
# Let's postprocess the model outputs (convert logits to probabilities that we can understand)
# Use the softmax function to normalize our logits
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
# Print results
print('In this model:', model.config.id2label)
#
# Print raw data
print('Raw Probability Predictions in the tensors:\n', predictions, '\n')
print('\n', raw_inputs,'\n')
#
# Format the output so it is easy to read
print("Normalized Classification Result:")
for i,t in enumerate(predictions.tolist()):
    print(raw_inputs[i])
    print("\tNegative Probability:", t[0], "Positive Probability:", t[1])
# Recall the result when we used the pipeline at the top
for p in sa(raw_inputs):
    print("\nPipeline result from above:\n",p)

In this model: {0: 'NEGATIVE', 1: 'POSITIVE'}
Raw Probability Predictions in the tensors:
 tensor([[1.2743e-04, 9.9987e-01],
        [9.9921e-01, 7.8742e-04]], grad_fn=<SoftmaxBackward0>) 


 ['When my friends ask me about my experience in this class, my response is I really love it!', 'When I am honest, the biggest thing I dislike about this class is the long, boring lectures.'] 

Normalized Classification Result:
When my friends ask me about my experience in this class, my response is I really love it!
	Negative Probability: 0.00012743103434331715 Positive Probability: 0.9998725652694702
When I am honest, the biggest thing I dislike about this class is the long, boring lectures.
	Negative Probability: 0.9992125034332275 Positive Probability: 0.0007874164148233831

Pipeline result from above:
 {'label': 'POSITIVE', 'score': 0.9998725652694702}

Pipeline result from above:
 {'label': 'NEGATIVE', 'score': 0.9992125034332275}
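The pipeline's output matches our manual result because the last step is small: pick the most probable class and look up its name. The sketch below reproduces that step in plain Python using the probabilities printed above. The `to_pipeline_format` helper is hypothetical and written for this example; the real pipeline's post-processing may differ in its details.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # from model.config.id2label above

# Probabilities copied from the normalized results printed above.
probabilities = [
    [0.00012743103434331715, 0.9998725652694702],
    [0.9992125034332275, 0.0007874164148233831],
]


def to_pipeline_format(probs, labels):
    """Keep only the most probable class and report its label and score."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": labels[best], "score": probs[best]}


results = [to_pipeline_format(p, id2label) for p in probabilities]
for r in results:
    print(r)
```

The labels and scores printed here are the same ones the pipeline returned for the two sentences.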


## So what did we do?
![alt text](images/tokenizer.jpg)
[Top of Page](#top)

## Your Assignment <a name="assignment"> </a>

This is pretty complex. Let this sink in a little bit....

Any questions? This is about the limit of my knowledge of the inner workings of these simple NLP models.

I think going deeper on this topic is not useful for us. Let's move on to more useful topics.
