In [1]:
from transformers import pipeline

### Sentiment Analysis: High Level

In [2]:
classifier = pipeline("sentiment-analysis")
classifier([
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
])

[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558095932007}]

### Pipeline Overview

A pipeline groups together three steps:
- Preprocessing
- Passing the inputs through the model
- Postprocessing

![image.png](attachment:image.png)

### Preprocessing with the Tokenizer

Like other neural networks, Transformer models can’t process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a tokenizer, which will be responsible for:

- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model

![image.png](attachment:image.png)
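The three responsibilities listed above can be imitated by hand. This is only a toy sketch: it splits on whitespace and punctuation instead of using the checkpoint's real subword algorithm, and `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names. The ids in the vocabulary are copied from the tokenizer output shown later in this notebook, assuming 101 and 102 are the special start and end tokens.

```python
import re

# Toy vocabulary: ids copied from the real tokenizer output in this notebook
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "hate": 5223, "this": 2023,
    "so": 2061, "much": 2172, "!": 999,
}

def toy_tokenize(text):
    """Step 1: lowercase, then split into words and punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text):
    """Steps 2 and 3: add special tokens, then map every token to an integer."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    return [TOY_VOCAB[token] for token in tokens]

tokens = toy_tokenize("I hate this so much!")
ids = toy_encode("I hate this so much!")
print(tokens)
print(ids)
```

The resulting ids line up with the non-padded part of the second row of *input_ids* printed further down.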

In [3]:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [5]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!"
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

- *padding=True* pads the shorter sentences so that every sequence in the batch matches the length of the longest one
- *truncation=True* truncates any sentence that is longer than the maximum sequence length the model accepts
- *return_tensors="pt"* returns the result as PyTorch tensors
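What padding and truncation do to a batch of id lists can be imitated in plain Python. This is a sketch, not the library's implementation: `pad_and_truncate` is a hypothetical helper, the pad id of 0 and the id lists are taken from the tokenizer output in this notebook, and the simple slice used for truncation ignores that a real tokenizer keeps the closing special token.

```python
def pad_and_truncate(batch, pad_id=0, max_length=None):
    """Toy version of padding=True / truncation=True on lists of token ids."""
    if max_length is not None:
        # Truncation: cut every sequence down to at most max_length ids
        batch = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in batch)
    # Padding: extend shorter sequences with pad_id up to the longest length
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

long_sentence = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
                 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
short_sentence = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

padded = pad_and_truncate([long_sentence, short_sentence])
print(padded[1])  # the short sentence followed by eight pad ids

truncated = pad_and_truncate([long_sentence, short_sentence], max_length=10)
print([len(seq) for seq in truncated])
```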

In [6]:
inputs

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

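The tokenizer returns a dictionary with two entries:
- *input_ids*: the integer ids of the tokens in each sentence; the trailing 0s in the second row are padding
- *attention_mask*: shows where padding was applied; a 1 marks a real token, and a 0 tells the model not to pay attention to that position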
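One way to see what the mask encodes is to use it to strip the padding back off. The sketch below works on plain lists copied from the output above; `strip_padding` is a hypothetical helper, not part of the library.

```python
input_ids = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
     17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102,
     0, 0, 0, 0, 0, 0, 0, 0],
]
attention_mask = [
    [1] * 16,
    [1] * 8 + [0] * 8,
]

def strip_padding(ids, mask):
    """Keep only the ids whose mask entry is 1 (real tokens)."""
    return [token_id for token_id, keep in zip(ids, mask) if keep == 1]

# The sum of each mask row is the number of real tokens in that sentence
real_lengths = [sum(mask) for mask in attention_mask]
unpadded = [strip_padding(ids, mask) for ids, mask in zip(input_ids, attention_mask)]

print(real_lengths)
print(unpadded[1])
```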

### Going through the model

In [8]:
from transformers import AutoModel

In [9]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [11]:
output = model(**inputs)
print(output.last_hidden_state.shape)

torch.Size([2, 16, 768])


In [19]:
## The same output tensor can also be accessed by key or by index
output["last_hidden_state"].shape, output[0].shape

(torch.Size([2, 16, 768]), torch.Size([2, 16, 768]))

The model returns a high-dimensional representation of each input sentence, as a tensor of shape [batch size, sequence length, hidden size].
- **Batch size:** The number of sequences processed at a time (2 in our example).
- **Sequence length:** The length of the numerical representation of the sequence (16 in our example).
- **Hidden size:** The dimension of the vector the model produces for each token (768 in our example).
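To make the three axes concrete, the sketch below builds a stand-in of the same shape from nested Python lists and indexes into it. The values are zeros, not real model activations, and `shape_of` is a hypothetical helper.

```python
batch_size, sequence_length, hidden_size = 2, 16, 768

# Stand-in for last_hidden_state: zeros only, not real activations
hidden = [
    [[0.0] * hidden_size for _ in range(sequence_length)]
    for _ in range(batch_size)
]

def shape_of(nested):
    """Return the dimensions of a regular (non-ragged) nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims

hidden_shape = shape_of(hidden)
# First index picks the sentence, second the token, leaving one hidden vector
first_token_vector = hidden[0][0]

print(hidden_shape)
print(len(first_token_vector))
```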


#### AutoModel for Sequence Classification

In [20]:
from transformers import AutoModelForSequenceClassification

In [21]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
output = model(**inputs)

In [23]:
output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

In [24]:
output.logits.shape

torch.Size([2, 2])

The model returns one logit per class for each sentence, which is why the shape is [2, 2]. Logits are raw, unnormalized scores; they are not probabilities yet.

### Postprocessing the output

In [40]:
import torch
import numpy as np

In [47]:
## Converting logits to probabilities by applying softmax
prediction = torch.nn.functional.softmax(output.logits, dim=-1)
print(prediction)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)
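The softmax step can be checked by hand with the standard library. The sketch below applies the usual softmax formula to the logits printed earlier; `softmax` here is a local helper written for illustration, not the PyTorch function.

```python
import math

def softmax(row):
    """Exponentiate each score and normalize so the row sums to 1."""
    highest = max(row)
    # Subtracting the max does not change the result but avoids overflow
    exps = [math.exp(x - highest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probabilities = [softmax(row) for row in logits]

for row in probabilities:
    print(row)
```

Each row now sums to 1, and the values agree with the tensor printed above up to rounding of the logits.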


In [38]:
## Labels can be extracted from model.config
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
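Combining the probabilities with the label mapping gives the same kind of result the pipeline returned at the start of the notebook. This is a sketch on plain lists using the values printed above; `to_predictions` is a hypothetical helper.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above
probabilities = [
    [4.0195e-02, 9.5980e-01],
    [9.9946e-01, 5.4418e-04],
]

def to_predictions(rows, labels):
    """Pick the most probable class per row and attach its label."""
    predictions = []
    for row in rows:
        best = max(range(len(row)), key=lambda i: row[i])
        predictions.append({"label": labels[best], "score": row[best]})
    return predictions

predictions = to_predictions(probabilities, id2label)
print(predictions)
```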