<a href="https://colab.research.google.com/github/xcodesgit/sentiment_analysis_nlp/blob/main/transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [35]:
!pip install datasets evaluate transformers[sentencepiece]



Using the pipeline function

In [41]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(["I've been waiting for food my whole life", "Oh no! THe food is still not here", "Yay, food is here"])

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.6407083868980408},
 {'label': 'NEGATIVE', 'score': 0.9995132684707642},
 {'label': 'POSITIVE', 'score': 0.9975746273994446}]
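
The warning in the output above advises against relying on the default model. A minimal sketch of pinning both explicitly, using the model name and revision hash reported in that warning. The helper name `build_classifier` is hypothetical, and calling it downloads the checkpoint on first use:

```python
def build_classifier():
    """Build a sentiment pipeline pinned to an explicit model and revision."""
    # Imported here so that defining the helper does not itself load the library.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        # Model name and revision copied from the warning printed above.
        model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
        revision="af0f99b",
    )
```

The object returned by `build_classifier()` is called on a list of sentences in the same way as `classifier` above.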

Now let's build this pipeline step by step

The first step is tokenization, which converts the raw text into numbers the model can process. This preprocessing must be done in exactly the same way as when the model was pretrained. For that, we can use the AutoTokenizer class and its from_pretrained method. Given the checkpoint name of our model, it automatically fetches the data associated with the model's tokenizer and caches it.

In [46]:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)



In [47]:
raw_inputs = [
    "I've been waiting for food my whole life",
    "Oh no! THe food is still not here",
    "Yay, food is here",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 2833, 2026, 2878, 2166,  102],
        [ 101, 2821, 2053,  999, 1996, 2833, 2003, 2145, 2025, 2182,  102,    0],
        [ 101, 8038, 2100, 1010, 2833, 2003, 2182,  102,    0,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}
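
To make the output above concrete, here is a pure-Python sketch of what `padding=True` does: shorter sequences are filled with the pad id (0 here) up to the longest length in the batch, and the attention mask marks real tokens with 1 and padding with 0. The token ids are copied from the printed tensor. The helper `pad_batch` is hypothetical and is not the tokenizer's actual implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Unpadded token ids for the three sentences, taken from the printed tensor.
unpadded = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 2833, 2026, 2878, 2166, 102],
    [101, 2821, 2053, 999, 1996, 2833, 2003, 2145, 2025, 2182, 102],
    [101, 8038, 2100, 1010, 2833, 2003, 2182, 102],
]
input_ids, attention_mask = pad_batch(unpadded)
```

Every row of `input_ids` now has length 12, matching the shape of the tensor printed by the tokenizer.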


In [48]:
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [50]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([3, 12, 768])


The base model only returns hidden states. To classify sentences as positive or negative, we need a model with a sequence classification head, so we will use AutoModelForSequenceClassification.

In [51]:
from transformers import AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

The model head takes the high-dimensional vectors as input and outputs vectors containing two values, one per label.

In [54]:
print(outputs.logits.shape)

torch.Size([3, 2])
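
As a rough illustration of how a head reduces a `[3, 12, 768]` tensor of hidden states to a `[3, 2]` tensor of logits, here is a pure-Python toy sketch. It assumes a single linear layer applied to the first token's vector, with random stand-in values. The names `toy_head`, `weights`, and `bias` are hypothetical, and the real head in the checkpoint uses learned weights and may include additional layers.

```python
import random

random.seed(0)
batch_size, seq_len, hidden_size, num_labels = 3, 12, 768, 2

# Stand-in for last_hidden_state: one hidden vector per token per sentence.
hidden = [[[random.gauss(0, 1) for _ in range(hidden_size)]
           for _ in range(seq_len)]
          for _ in range(batch_size)]

# Hypothetical head parameters: one weight vector and one bias per label.
weights = [[random.gauss(0, 0.02) for _ in range(hidden_size)]
           for _ in range(num_labels)]
bias = [0.0] * num_labels

def toy_head(hidden_states):
    """Map each sentence's first-token vector to num_labels scores."""
    logits = []
    for sentence in hidden_states:
        first_token = sentence[0]  # vector at position 0 of the sequence
        logits.append([
            sum(w * x for w, x in zip(label_weights, first_token)) + b
            for label_weights, b in zip(weights, bias)
        ])
    return logits

toy_logits = toy_head(hidden)
print(len(toy_logits), len(toy_logits[0]))  # 3 2
```

The values themselves are meaningless here. Only the shapes mirror the real model: one row per sentence and one column per label.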


In [56]:
print(outputs.logits)

tensor([[ 0.3427, -0.2357],
        [ 4.2175, -3.4097],
        [-2.9073,  3.1120]], grad_fn=<AddmmBackward0>)


In [58]:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[6.4071e-01, 3.5929e-01],
        [9.9951e-01, 4.8678e-04],
        [2.4253e-03, 9.9757e-01]], grad_fn=<SoftmaxBackward0>)


In [59]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

So we can see that for "I've been waiting for food my whole life", the negative score is about 0.64 and the positive score is about 0.36.
For "Oh no! THe food is still not here", the negative score is about 0.9995 and the positive score is about 0.0005.
For "Yay, food is here", the negative score is about 0.0024 and the positive score is about 0.9976.
These match the scores returned by the pipeline at the start.
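
The post-processing step can also be sketched in pure Python: apply softmax to each row of logits, pick the highest-probability index, and look up its name in `id2label`. The logits and label mapping are copied from the outputs above. The `softmax` helper and the `results` structure are illustrative and are not the pipeline's actual code.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Logits and label mapping copied from the outputs above.
logits = [[0.3427, -0.2357], [4.2175, -3.4097], [-2.9073, 3.1120]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

results = []
for row in logits:
    probs = softmax(row)
    # Index of the most probable class for this sentence.
    best = max(range(len(probs)), key=probs.__getitem__)
    results.append({"label": id2label[best], "score": probs[best]})

for result in results:
    print(result["label"], round(result["score"], 4))
```

Because the logits above are rounded to four decimals, the scores agree with the pipeline's output only approximately.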





---

