<h2> Behind the pipeline </h2>

**Reference Material**


[HF Video](https://youtu.be/1pedAIvTWXk)

[HF Course](https://huggingface.co/course/chapter2/2?fw=pt)

[Embedding vs Token](https://vaclavkosar.com/ml/transformer-embeddings-and-tokenization)



In [19]:
# pip install transformers && pip install torch
# Source : https://vaclavkosar.com/ml/transformer-embeddings-and-tokenization

from transformers import DistilBertTokenizerFast, DistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
tokens = tokenizer.encode('This is a input.', return_tensors='pt')
print("These are tokens!", tokens)
for token in tokens[0]:
    print("This are decoded tokens!", tokenizer.decode([token]))

model = DistilBertModel.from_pretrained("distilbert-base-uncased")
print("#-=======================-#")
print(model.embeddings.word_embeddings(tokens))
# for e in model.embeddings.word_embeddings(tokens)[0]:
#     print("This is an embedding!", e)


These are tokens! tensor([[ 101, 2023, 2003, 1037, 7953, 1012,  102]])
This are decoded tokens! [CLS]
This are decoded tokens! this
This are decoded tokens! is
This are decoded tokens! a
This are decoded tokens! input
This are decoded tokens! .
This are decoded tokens! [SEP]


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tensor([[[ 0.0390, -0.0123, -0.0208,  ...,  0.0607,  0.0230,  0.0238],
         [-0.0558,  0.0151,  0.0031,  ..., -0.0140, -0.0277,  0.0139],
         [-0.0440, -0.0236, -0.0283,  ...,  0.0053, -0.0081,  0.0170],
         ...,
         [-0.0788,  0.0202, -0.0352,  ...,  0.0119, -0.0037, -0.0402],
         [-0.0244, -0.0138, -0.0078,  ...,  0.0069,  0.0057, -0.0016],
         [-0.0199, -0.0095, -0.0099,  ..., -0.0235,  0.0071, -0.0071]]],
       grad_fn=<EmbeddingBackward0>)


In [5]:
len(model.embeddings.word_embeddings(tokens)[0])

7

In [6]:
# !pip install transformers
# !pip install torch

In [10]:
# https://huggingface.co/course/chapter2/2?fw=pt

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# Load the tokenizer that matches the checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_inputs = ['This is a input.', "This is not"]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

In [11]:

inputs 

{'input_ids': tensor([[ 101, 2023, 2003, 1037, 7953, 1012,  102],
        [ 101, 2023, 2003, 2025,  102,    0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0]])}

In the tokenizer output above, 'input_ids' holds the token IDs of each sentence, and 'attention_mask' marks which positions are real tokens (1) and which are padding (0).

Both rows of 'input_ids' contain seven entries, because padding=True pads the shorter sentence to the length of the longest sentence in the batch.

In the first row, the first token [101] is the special token [CLS], which is added to the beginning of every sequence. The tokens [2023, 2003, 1037, 7953] correspond to the words 'this', 'is', 'a' and 'input', [1012] is the full stop '.', and the last token [102] is the special token [SEP], which closes the sequence.

In the second row, 'This is not' becomes [101, 2023, 2003, 2025, 102], where [2025] is 'not'. It is followed by two [PAD] tokens with ID 0.

The 'attention_mask' tensor is a binary tensor with the same shape as the 'input_ids' tensor. It has a value of 1 for every real token (including [CLS] and [SEP]) and a value of 0 for every padding token.
In this example the first row is all 1s because the first sentence needs no padding, while the second row ends in two 0s for its two padding tokens. The mask tells the Transformer model to ignore the padding positions during training and inference.
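As a small illustration of what the mask encodes, the sketch below rebuilds it from the padded token IDs in plain Python. It assumes that 0 is the [PAD] ID, as in the output above, and it is not how the tokenizer computes the mask internally; the names `PAD_ID` and `lengths` are only for this example.

```python
# Token IDs copied from the tokenizer output above.
# Assumption: 0 is the [PAD] id in this vocabulary, as the padded row suggests.
PAD_ID = 0
input_ids = [
    [101, 2023, 2003, 1037, 7953, 1012, 102],
    [101, 2023, 2003, 2025, 102, 0, 0],
]

# 1 for every real token (including [CLS] and [SEP]), 0 for every padding position.
attention_mask = [[0 if tok == PAD_ID else 1 for tok in row] for row in input_ids]

# Number of real tokens in each sentence.
lengths = [sum(row) for row in attention_mask]

print(attention_mask)
print(lengths)
```

The rebuilt mask should match the 'attention_mask' printed by the tokenizer, and the row sums give the unpadded length of each sentence.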

In [9]:
# Tokens from the previous output for comparison
tokens

tensor([[ 101, 2023, 2003, 1037, 7953, 1012,  102]])

The code below runs the tokenized inputs through a pre-trained Transformer model to obtain a contextualized representation of each token.

More specifically, the code uses the AutoModel class from the Hugging Face Transformers library to load the pre-trained model named by the checkpoint string. The model(**inputs) call unpacks the inputs dictionary ('input_ids' and 'attention_mask') into keyword arguments and runs a forward pass.

The outputs variable is a dictionary-like object. One of its fields is the last_hidden_state tensor, a 3-dimensional tensor that contains the contextualized embeddings of the input tokens.

The shape of the last_hidden_state tensor is (batch_size, sequence_length, hidden_size), where batch_size is the number of input sequences in the batch, sequence_length is the padded length of the sequences (the length of the longest input sequence in the batch), and hidden_size is the size of the hidden layer in the pre-trained Transformer model. The elements of the last_hidden_state tensor are the embeddings of the input tokens after passing through the Transformer layers.

To summarize, the code produces one vector per input token, and each vector reflects the token's semantic and syntactic role in the context of its sequence rather than the token in isolation.

In [13]:
from transformers import AutoModel
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([2, 7, 768])


In this tensor, torch.Size([2, 7, 768]):
The first element, 2, is the batch size, i.e. the number of input sentences.
The second element, 7, is the sequence length, i.e. the number of tokens per sentence after padding. Because padding=True, the shorter sentence [101, 2023, 2003, 2025, 102, 0, 0] is padded to the length of the longer one [101, 2023, 2003, 1037, 7953, 1012, 102].
The third element, 768, is the hidden size of the model used, distilbert-base-uncased-finetuned-sst-2-english.

In [15]:
outputs[0]

tensor([[[ 0.0911, -0.0235,  0.4332,  ...,  0.1353,  0.1840,  0.0500],
         [ 0.1566, -0.1957,  0.1535,  ...,  0.0468,  0.1863,  0.2464],
         [ 0.1405, -0.1565,  0.4472,  ..., -0.0150,  0.0844,  0.4713],
         ...,
         [-0.0295,  0.1492,  0.3712,  ...,  0.0202,  0.1438, -0.2039],
         [ 0.1072, -0.3295,  0.3198,  ...,  0.3632,  0.2085, -0.2717],
         [ 1.0473,  0.1609,  0.6181,  ...,  0.6987, -0.1660, -0.1045]],

        [[-0.3731,  0.4434, -0.3577,  ..., -0.1507, -0.0791,  0.5306],
         [-0.6799,  0.4216, -0.1543,  ..., -0.7253, -0.5644,  0.3843],
         [-0.5186,  0.4889, -0.1868,  ..., -0.4023, -0.4582,  0.3454],
         ...,
         [ 0.2535,  0.4698,  0.1167,  ..., -0.0409, -0.1703, -0.1317],
         [-0.0814,  0.3816, -0.3894,  ..., -0.4100, -0.1549,  0.4925],
         [-0.0900,  0.0998, -0.1883,  ..., -0.2751, -0.1801,  0.2704]]],
       grad_fn=<NativeLayerNormBackward0>)

<b>What’s the difference between the two code chunks below?</b>


```
# 1. 
from transformers import DistilBertTokenizerFast, DistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
tokens = tokenizer.encode('This is a input.', return_tensors='pt')
print("These are tokens!", tokens)
for token in tokens[0]:
    print("This are decoded tokens!", tokenizer.decode([token]))
 
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
print(model.embeddings.word_embeddings(tokens))
```




```
# 2
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# Load the tokenizer that matches the checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_inputs = ['This is a input.']
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
 
from transformers import AutoModel
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```

Both code chunks turn text into token IDs and then into vectors, but they return different kinds of vectors.

In the first code chunk, the tokenizer.encode() method encodes the input text as a sequence of token IDs, which are passed directly to model.embeddings.word_embeddings(). This is only the embedding lookup table at the input of DistilBertModel: each token ID is mapped to a fixed 768-dimensional vector, and neither position embeddings nor Transformer layers are applied. The result is therefore not contextualized. The same token gets the same vector in any sentence.

In the second code chunk, the AutoTokenizer class tokenizes and encodes the input text, returning 'input_ids' together with an 'attention_mask'. These are passed through the whole model loaded by the AutoModel class. AutoModel selects the architecture from the checkpoint, which in this case is a DistilBertModel whose weights come from a checkpoint fine-tuned for sentiment analysis on the SST-2 dataset (the classification head is not loaded, as the warning shows). The resulting last_hidden_state contains contextualized embeddings, where each token vector depends on the other tokens in the sequence.

So the two chunks differ in three ways: the first performs a static embedding lookup while the second runs a full forward pass; the first uses the base checkpoint while the second uses a fine-tuned one; and the second uses the AutoTokenizer and AutoModel classes, which allow more flexibility in selecting and loading pre-trained models.
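The first chunk reads vectors from an embedding lookup table, while the second runs the full model. The toy sketch below, in plain Python, illustrates what that distinction means. Everything in it is hypothetical: the 2-dimensional `EMBEDDING_TABLE`, and `toy_contextual_embeddings`, which simply adds the sentence average to each vector as a crude stand-in for the context mixing a Transformer does. It is not how DistilBERT computes its hidden states.

```python
# Hypothetical 2-dimensional embedding table; real models use hundreds of dimensions.
EMBEDDING_TABLE = {
    "this": [1.0, 0.0],
    "is": [0.0, 1.0],
    "not": [-1.0, -1.0],
}

def static_embeddings(tokens):
    """Look each token up independently, like an embedding layer does."""
    return [EMBEDDING_TABLE[tok] for tok in tokens]

def toy_contextual_embeddings(tokens):
    """Toy stand-in for context mixing: add the sentence average to every token vector."""
    vectors = static_embeddings(tokens)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return [[v[d] + mean[d] for d in range(dim)] for v in vectors]

sentence_a = ["this", "is"]
sentence_b = ["this", "is", "not"]

# Compare the vector for "this" (position 0) across the two sentences.
static_same = static_embeddings(sentence_a)[0] == static_embeddings(sentence_b)[0]
contextual_same = (
    toy_contextual_embeddings(sentence_a)[0] == toy_contextual_embeddings(sentence_b)[0]
)
print(static_same, contextual_same)
```

With the lookup table, "this" has the same vector in both sentences. Once the rest of the sentence is mixed in, the vector changes with the context.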




Classification

In [17]:
# Classification
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs =  model(**inputs)
print(outputs.logits)

tensor([[-2.7134,  2.7267],
        [ 2.8436, -2.2796]], grad_fn=<AddmmBackward0>)


<b>Q. What are logits in NLP?</b>

In natural language processing (NLP), the term "logits" refers to the raw, unnormalized outputs of a classification model before they are converted into probabilities using a softmax function.

Logits are real-valued numbers that indicate how strongly the model favors each possible output class. For example, in a binary classification problem where the two classes are "positive" and "negative", the model might output a single logit for the positive class, which is turned into a probability with a sigmoid. Alternatively, as in the output above, it can output one logit per class, here two per sentence.

Logits can be positive or negative, and their magnitude represents the strength of the evidence for or against a particular class. Larger logits indicate higher confidence in a particular class, while smaller logits indicate lower confidence. In multi-class classification problems, the model outputs a vector of logits, where each element of the vector corresponds to a possible output class.

After the logits are computed, they are typically passed through a softmax function, which normalizes them into a probability distribution over the output classes. This allows the model to make a probabilistic prediction about the most likely output class for a given input.
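To make the normalization concrete, here is a minimal sketch of softmax in plain Python, applied to the logits printed above. The helper name `softmax` and the list-based representation are choices made for this example. In practice the same computation is done by `torch.nn.functional.softmax`.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow in exp().
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above, one row per sentence.
logits = [[-2.7134, 2.7267], [2.8436, -2.2796]]
probabilities = [softmax(row) for row in logits]

for row in probabilities:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the class with the larger logit receives almost all of the probability mass.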

In [20]:
import torch 

predictions  = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[0.0043, 0.9957],
        [0.9941, 0.0059]], grad_fn=<SoftmaxBackward0>)


In [21]:
model.config.id2label

{0: 'LABEL_0', 1: 'LABEL_1'}

Based on the labels above:

 First input/sentence: 0.0043 * 100 = 0.43% LABEL_0 and 99.57% LABEL_1

 Second input/sentence: 99.41% LABEL_0 and 0.59% LABEL_1
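The last step, picking the most probable class and looking up its name, can be sketched in plain Python as follows. The probabilities and the label mapping are copied from the outputs above. The `results` list and its label/score layout are choices made for this example.

```python
# Probabilities and label mapping copied from the outputs above.
predictions = [[0.0043, 0.9957], [0.9941, 0.0059]]
id2label = {0: "LABEL_0", 1: "LABEL_1"}

results = []
for probs in predictions:
    # Index of the largest probability (argmax).
    best = max(range(len(probs)), key=lambda i: probs[i])
    results.append({"label": id2label[best], "score": probs[best]})

print(results)
```

The first sentence is assigned LABEL_1 and the second LABEL_0, each with the probability of the winning class as its score.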

In [22]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]