<a href="https://colab.research.google.com/github/amir-asari/Introduction_to_Huggingface/blob/main/2_UsingTransformers/2_Behind_the_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Outline

In [None]:
import torch
import transformers

classifier = transformers.pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

# Behind the Pipeline API

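3 Stage Pipeline  

0. dataset            -> the raw input sentences
1. language tokenizer -> maps each word (token) to its unique integer id
2. pretrained_model   -> maps each token id to its meaning vector / embedding vector (its length is the hidden layer size)
3. model output       -> turns the model's raw scores into predictions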
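
Stages 1 and 2 can be pictured with a toy example. This is only a sketch: the vocabulary `toy_vocab`, the vectors in `toy_embeddings`, and the helper `toy_tokenize` are invented for illustration and do not come from any real checkpoint. A real tokenizer also splits words into subword pieces and adds special tokens.

```python
# Toy illustration only: ids and vectors are made up, not taken from a real model.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "i": 2, "love": 3, "you": 4}

def toy_tokenize(sentence):
    """Stage 1: map each whitespace-separated word to its integer id."""
    unknown_id = toy_vocab["[UNK]"]
    return [toy_vocab.get(word, unknown_id) for word in sentence.lower().split()]

# Stage 2: an embedding table maps each id to a vector.
# BERT-base uses 768 numbers per token; 3 keeps the sketch readable.
toy_embeddings = {
    0: [0.0, 0.0, 0.0],
    1: [0.1, 0.1, 0.1],
    2: [0.2, -0.4, 0.7],
    3: [0.9, 0.3, -0.1],
    4: [-0.5, 0.8, 0.2],
}

ids = toy_tokenize("I love you")
vectors = [toy_embeddings[token_id] for token_id in ids]

print(ids)      # [2, 3, 4]
print(vectors)  # one 3-number vector per word
```

Words that are missing from the toy vocabulary, such as "hate", fall back to the `[UNK]` id.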

## 1. initialize tokenizer & model

In [None]:
raw_inputs = [
    "I love you so much",    # 5 Words
    "I hate you",            # 3 Words
    "you",                   # 1 Word
]

In [None]:
tokenizer        = transformers.AutoTokenizer.from_pretrained("bert-base-cased")  # BERT and GPT are the two foundational approaches in LLMs; this is a BERT checkpoint
tokenizer_output = tokenizer(raw_inputs, padding=True, return_tensors="pt")       # numeric ids, returned as PyTorch tensors

print(tokenizer_output['input_ids'])

tensor([[ 101,  146, 1567, 1128, 1177, 1277,  102],
        [ 101,  146, 4819, 1128,  102,    0,    0],
        [ 101, 1128,  102,    0,    0,    0,    0]])


In [None]:
print(f'tokenizer returns multiple things => {tokenizer_output.keys()}')

# Show each sentence next to its token ids and attention mask.
for index, sentence in enumerate(raw_inputs):
    print(f'Sentence Number => {index+1}, Sentence = {sentence.split()}, ')
    print(f"\t input_ids \t=> { tokenizer_output['input_ids'][index]}, \n\t attention_mask => {tokenizer_output['attention_mask'][index]}")

tokenizer returns multiple things => dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
Sentence Number => 1, Sentence = ['I', 'love', 'you', 'so', 'much'], 
	 input_ids 	=> tensor([ 101,  146, 1567, 1128, 1177, 1277,  102]), 
	 attention_mask => tensor([1, 1, 1, 1, 1, 1, 1])
Sentence Number => 2, Sentence = ['I', 'hate', 'you'], 
	 input_ids 	=> tensor([ 101,  146, 4819, 1128,  102,    0,    0]), 
	 attention_mask => tensor([1, 1, 1, 1, 1, 0, 0])
Sentence Number => 3, Sentence = ['you'], 
	 input_ids 	=> tensor([ 101, 1128,  102,    0,    0,    0,    0]), 
	 attention_mask => tensor([1, 1, 1, 0, 0, 0, 0])


```
Sentence Number => 1, Sentence = ['I', 'love', 'you', 'so', 'much']
    tokens         =>        [ [CLS] , I     , love  , you   , so    , much  , [SEP] ]
    input_ids      => tensor([ 101   , 146   , 1567  , 1128  , 1177  , 1277  , 102   ])
    attention_mask => tensor([ 1     , 1     , 1     , 1     , 1     , 1     , 1     ])

Sentence Number => 2, Sentence = ['I', 'hate', 'you']
    tokens         =>        [ [CLS] , I     , hate  , you   , [SEP] , [PAD] , [PAD] ]
    input_ids      => tensor([ 101   , 146   , 4819  , 1128  , 102   , 0     , 0     ])
    attention_mask => tensor([ 1     , 1     , 1     , 1     , 1     , 0     , 0     ])

Sentence Number => 3, Sentence = ['you']
    tokens         =>        [ [CLS] , you   , [SEP] , [PAD] , [PAD] , [PAD] , [PAD] ]
    input_ids      => tensor([ 101   , 1128  , 102   , 0     , 0     , 0     , 0     ])
    attention_mask => tensor([ 1     , 1     , 1     , 0     , 0     , 0     , 0     ])
```
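
The padding and attention-mask behaviour shown above can be reproduced by hand. The sketch below is not the tokenizer's own implementation. `pad_batch` is a hypothetical helper, and it assumes right-side padding with id 0, which matches the `input_ids` printed earlier.

```python
PAD_ID = 0  # assumption: pad id 0, as seen in the printed input_ids

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Token ids copied from the output above, with the padding removed.
unpadded = [
    [101, 146, 1567, 1128, 1177, 1277, 102],
    [101, 146, 4819, 1128, 102],
    [101, 1128, 102],
]

input_ids, attention_mask = pad_batch(unpadded)
for ids_row, mask_row in zip(input_ids, attention_mask):
    print(ids_row, mask_row)
```

Every row ends up with the length of the longest sentence, which is what lets the batch be stacked into a single tensor.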

## 2. model

In [None]:
model_general         = transformers.AutoModel.from_pretrained("bert-base-cased")                           # base model: returns hidden states
model_classification  = transformers.AutoModelForSequenceClassification.from_pretrained("bert-base-cased")  # base model plus a classification head

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
classification_output = model_classification( **tokenizer_output )

print(f'Model OUTPUT: {classification_output} ' )

Model OUTPUT: SequenceClassifierOutput(loss=None, logits=tensor([[ 0.5555, -0.3611],
        [ 0.6966, -0.3005],
        [ 0.6272, -0.2788]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None) 
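
The logits have one row per input sentence and one column per label. The earlier warning says `classifier.weight` and `classifier.bias` are newly initialized, so these scores carry no sentiment information yet. As a rough mental model, and not the exact `BertForSequenceClassification` code, the head can be pictured as a single linear layer applied to one summary vector per sentence. The toy sizes, the random values, and the names `sentence_vectors` and `linear_head` are assumptions made for this sketch.

```python
import random

random.seed(0)
HIDDEN_SIZE, NUM_LABELS = 4, 2  # toy sizes; bert-base-cased uses a hidden size of 768

# One made-up summary vector per sentence (a stand-in for what the base model produces).
sentence_vectors = [
    [random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(3)
]

# A randomly initialised head: weight of shape (NUM_LABELS, HIDDEN_SIZE) plus a bias.
weight = [[random.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS

def linear_head(vector):
    """Return one raw score (logit) per label for a single sentence vector."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weight, bias)
    ]

logits = [linear_head(vector) for vector in sentence_vectors]
print(logits)  # 3 rows (sentences) x 2 columns (labels)
```

Until such a head is trained on labelled data, its scores are essentially arbitrary.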


In [None]:
model_general( **tokenizer_output )

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 3.5627e-01,  2.1760e-01, -2.5326e-01,  ..., -1.2633e-01,
           7.3651e-01, -9.8767e-02],
         [ 2.1338e-01,  6.1095e-02, -1.4929e-01,  ...,  4.3693e-01,
           3.3071e-01,  3.0004e-01],
         [-8.9151e-03, -1.7283e-01, -6.5382e-01,  ...,  4.8534e-01,
           6.5441e-02,  3.9704e-01],
         ...,
         [ 6.3233e-01,  3.5286e-01, -4.8239e-01,  ...,  4.3351e-01,
           5.1942e-01,  5.0280e-01],
         [ 7.4279e-01, -3.6099e-02, -8.3125e-01,  ...,  1.7308e-01,
           2.2512e-01, -4.9078e-02],
         [ 3.3614e-01,  3.1221e-01, -1.1223e+00,  ...,  1.8383e-01,
           1.2339e+00,  4.4745e-03]],

        [[ 5.0739e-01,  2.7870e-01, -1.8380e-01,  ..., -2.7795e-01,
           4.1159e-01, -3.7609e-01],
         [ 4.3100e-01, -1.6551e-02,  1.1972e-01,  ...,  7.1019e-02,
           3.0337e-01,  4.7959e-02],
         [ 9.9891e-02,  1.4374e-01, -6.8824e-01,  ...,  5.1895e-02,
           1.

## 3. model output

In [None]:
# softmax: a , b -> exp(a) / (exp(a) + exp(b)) , exp(b) / (exp(a) + exp(b))
predictions = torch.nn.functional.softmax(classification_output.logits, dim=-1)
print(predictions)

tensor([[0.5254, 0.4746],
        [0.5081, 0.4919],
        [0.5043, 0.4957]], grad_fn=<SoftmaxBackward0>)
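
To make the softmax step concrete, here is a minimal pure-Python sketch of the same formula applied to a single row of logits. The input `[2.0, 0.0]` is a made-up example, not a value produced by the model above, and `softmax` is a local helper rather than the PyTorch function.

```python
import math

def softmax(logits):
    """Exponentiate each logit and normalise so the results sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.0])
print([round(p, 4) for p in probs])  # [0.8808, 0.1192]
```

Because of the exponential, the result depends only on the difference between the logits, so adding the same constant to every logit leaves the probabilities unchanged.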


# Complete Pipeline

In [None]:
import torch
import transformers

checkpoint = "bert-base-cased"
tokenizer  = transformers.AutoTokenizer.from_pretrained(checkpoint)
model      = transformers.AutoModelForSequenceClassification.from_pretrained(checkpoint)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
raw_inputs = [
    "I love you so much",   # 5 Words
    "screw you",            # 2 Words
]
numeric_ids = tokenizer(raw_inputs, padding=True, return_tensors="pt")    # numeric ids, returned as PyTorch tensors

outputs = model(**numeric_ids)

print(outputs.logits)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

print(model.config.id2label)

tensor([[-0.2870, -0.0723],
        [-0.2221, -0.0587]], grad_fn=<AddmmBackward0>)
tensor([[0.4465, 0.5535],
        [0.4592, 0.5408]], grad_fn=<SoftmaxBackward0>)
{0: 'LABEL_0', 1: 'LABEL_1'}
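
As a last step, each probability row can be turned into a label by taking the index of the largest value and looking it up in `id2label`. The sketch below hard-codes the probabilities and the mapping printed above rather than calling the model. Since the classification head of this checkpoint is untrained, the labels are generic placeholders and the predictions are not meaningful.

```python
# Values copied from the printed output above.
raw_inputs = ["I love you so much", "screw you"]
predictions = [[0.4465, 0.5535], [0.4592, 0.5408]]
id2label = {0: "LABEL_0", 1: "LABEL_1"}

labels = []
for sentence, probs in zip(raw_inputs, predictions):
    # argmax: index of the highest probability in this row
    best_index = max(range(len(probs)), key=probs.__getitem__)
    labels.append(id2label[best_index])
    print(f"{sentence!r} -> {id2label[best_index]} (score {probs[best_index]:.4f})")
```

A checkpoint fine-tuned for sentiment, such as the pipeline default shown at the top, maps the same indices to names like `NEGATIVE` and `POSITIVE` instead.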
