<a href="https://colab.research.google.com/github/Zaheer-10/HuggingFace-Transformers_and_pipeline/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis Using Hugging Face Transformers Pipeline

In [None]:
!pip install transformers

In [None]:
!pip install xformers

In [None]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')

res = classifier("I love to travel")

print(res)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9997954964637756}]


In [None]:
from transformers import pipeline

generator = pipeline("text-generation" , model = 'distilgpt2')

res = generator('Today I will be learning about Hugging face and ',
                max_length= 100,
                num_return_sequences = 2)

print(res)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Today I will be learning about Hugging face and vernacular speech, and how to teach children how to understand the word and move on. And I will be sharing this book with you in the years to come.\nToday, I hope, in this important chapter of this book, there are some other questions this students have answered before.\nOn your part, I hope to explain in passing how I learned how to be funny, how to be funny, and how I taught that way to'}, {'generated_text': "Today I will be learning about Hugging face and \xa0 other styles, and I’ll be learning about it with the next update. In \xa0 a few weeks, I will also be getting new tools for creating and maintaining \xa0 style. I will also be using “I’ll be doing so” on\xa0 the first \xa0 update of my \xa0 styles. \xa0\nSome of the new \xa0 styles, including \xa0 the\xa0 'Saw�"}]


In [None]:
from transformers import pipeline

classifier = pipeline('zero-shot-classification')

res = classifier("I love to travel but ",
                 candidate_labels=['Tourism', 'politics', 'business'])

print(res)


No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'I love to travel but ', 'labels': ['Tourism', 'business', 'politics'], 'scores': [0.8348780274391174, 0.0958399623632431, 0.06928205490112305]}
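
The ranking above can be read as follows: each candidate label is turned into a hypothesis sentence, an NLI model scores how strongly the input entails it, and the entailment scores are normalized across the labels. The block below is a rough sketch of that last normalization step only. It assumes the pipeline's default hypothesis template `"This example is {}."`, and the logit values are made up for illustration, not produced by `facebook/bart-large-mnli`.

```python
import math

def zero_shot_scores(entailment_logits):
    """Softmax the entailment logits across candidate labels, best label first."""
    labels = list(entailment_logits)
    top = max(entailment_logits.values())  # subtract the max for numerical stability
    exps = [math.exp(entailment_logits[label] - top) for label in labels]
    total = sum(exps)
    ranked = sorted(zip(labels, (e / total for e in exps)),
                    key=lambda pair: pair[1], reverse=True)
    return {"labels": [label for label, _ in ranked],
            "scores": [score for _, score in ranked]}

candidate_labels = ["Tourism", "politics", "business"]

# One hypothesis per label (assumed default template).
hypotheses = [f"This example is {label}." for label in candidate_labels]

# Hypothetical entailment logits, one per hypothesis.
made_up_logits = {"Tourism": 2.0, "politics": -0.5, "business": -0.2}

ranking = zero_shot_scores(made_up_logits)
print(hypotheses)
print(ranking)
```

Because the scores are normalized across the candidates, they sum to 1, which is why adding or removing a label changes every score.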


In [None]:
# What pipeline() does under the hood: load the model and its tokenizer explicitly,
# then hand both to the pipeline
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline('sentiment-analysis' , model = model , tokenizer = tokenizer)

res = classifier('The animal could not cross the road because it was tired')
print(res)

[{'label': 'NEGATIVE', 'score': 0.9991163611412048}]
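
Between the model and the printed result there is one more step: the raw logits are converted into a `label` and a `score`. A minimal sketch of that post-processing, assuming a two-class head whose classes are ordered `NEGATIVE`, `POSITIVE`. The logit values are hypothetical and were not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label):
    """Pick the most probable class and report it the way the pipeline prints it."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Assumed class order for this checkpoint.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

# Hypothetical logits for a sentence the model reads as negative.
hypothetical_logits = [3.9, -3.1]

result = postprocess(hypothetical_logits, ID2LABEL)
print(result)
```

The score is the probability of the winning class, so a value close to 1 only says the model is confident, not that it is right.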


In [None]:
# A closer look at the tokenizer: full encoding, tokens, token ids, and decoding

text = 'I am  going to achieve my Dream career beautifully soon  ASAP'
output = tokenizer(text)
print(output)
print("\n")
tokens = tokenizer.tokenize(text)
print(tokens)
print("\n")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
print("\n")
decode_strings =  tokenizer.decode(ids)
print(decode_strings)

{'input_ids': [101, 1045, 2572, 2183, 2000, 6162, 2026, 3959, 2476, 17950, 2574, 17306, 2361, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


['i', 'am', 'going', 'to', 'achieve', 'my', 'dream', 'career', 'beautifully', 'soon', 'asa', '##p']


[1045, 2572, 2183, 2000, 6162, 2026, 3959, 2476, 17950, 2574, 17306, 2361]


i am going to achieve my dream career beautifully soon asap
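
The split of `ASAP` into `asa` and `##p` comes from WordPiece: a word that is not in the vocabulary is broken into the longest known pieces, and every piece after the first is marked with `##`. The sketch below illustrates that greedy longest-match idea on a tiny made-up vocabulary. It is not the library's implementation, and `TOY_VOCAB` is an assumption chosen to reproduce the split seen above.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into WordPiece tokens by greedy longest match."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece fits, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

def tokenize(text, vocab):
    """Lowercase, split on whitespace, then apply WordPiece to each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens

def detokenize(tokens):
    """Join tokens with spaces and glue ## pieces back onto the previous token."""
    return " ".join(tokens).replace(" ##", "")

# Hypothetical toy vocabulary, far smaller than a real one.
TOY_VOCAB = {"i", "am", "soon", "as", "asa", "##a", "##p"}

toy_tokens = tokenize("I am  soon  ASAP", TOY_VOCAB)
print(toy_tokens)
print(detokenize(toy_tokens))
```

Lowercasing and whitespace splitting also explain why the doubled spaces and the capital letters in the original text do not survive the round trip through an uncased tokenizer.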


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encode_input = tokenizer("The tokenizer returns a dictionary with three important items:")

print(encode_input)

tokenizer.decode(encode_input['input_ids'])

{'input_ids': [101, 1109, 22559, 17260, 5166, 170, 17085, 1114, 1210, 1696, 4454, 131, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


'[CLS] The tokenizer returns a dictionary with three important items : [SEP]'
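
The three items in the dictionary can be rebuilt by hand once the token ids are known: the ids are wrapped in `[CLS]` (101) and `[SEP]` (102), `token_type_ids` is all zeros for a single sentence, and `attention_mask` is all ones when nothing is padded. A small sketch under those assumptions, using the ids printed above; `build_inputs` is a hypothetical helper, not a library function.

```python
def build_inputs(token_ids, cls_id=101, sep_id=102):
    """Wrap token ids in [CLS] ... [SEP] and build the two companion lists."""
    input_ids = [cls_id] + list(token_ids) + [sep_id]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single sentence: segment 0 everywhere
        "attention_mask": [1] * len(input_ids),  # no padding: attend to every token
    }

# Token ids of the sentence above, without the special tokens.
sentence_ids = [1109, 22559, 17260, 5166, 170, 17085, 1114, 1210, 1696, 4454, 131]

rebuilt = build_inputs(sentence_ids)
print(rebuilt)
```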

In [None]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], [101, 1327, 1164, 5450, 23434, 136, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]}


**Set the `padding` parameter to `True` to pad the shorter sequences in the batch to the length of the longest one:**

In [None]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)
print(encoded_input)

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
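
What `padding=True` did to `input_ids` and `attention_mask` can be reproduced in a few lines of plain Python: every sequence is extended with the pad id (0 here) up to the longest length, and the mask marks real tokens with 1 and padding with 0. This is only a sketch of the idea, with `pad_batch` as a hypothetical helper name.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# The unpadded ids of the three sentences, as printed earlier.
unpadded = [
    [101, 1252, 1184, 1164, 1248, 6462, 136, 102],
    [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
    [101, 1327, 1164, 5450, 23434, 136, 102],
]

padded = pad_batch(unpadded)
print(padded)
```

The attention mask is what lets the model ignore the padded positions, so the zeros do not influence the prediction.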


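**Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model:**

All three sentences here are far shorter than that maximum, so nothing is cut and the output is identical to the padding-only example.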

In [None]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
print(encoded_input)

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
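
To see truncation actually change something, the limit has to be shorter than the sequence. The sketch below imitates the effect on a list of ids with a small, hypothetical limit of 8: the sequence is cut, but the closing `[SEP]` (102) is kept at the end. `truncate` is an illustrative helper, not the library's code; with the real tokenizer the limit comes from the model or from the `max_length` argument.

```python
def truncate(ids, max_length, sep_id=102):
    """Cut ids to max_length (assumed >= 2) while keeping the final [SEP] in place."""
    if len(ids) <= max_length:
        return list(ids)  # already short enough, nothing to do
    return list(ids[: max_length - 1]) + [sep_id]

# The longest sentence of the batch: 15 ids including [CLS] and [SEP].
long_ids = [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102]
# The shortest sentence: 7 ids, below the limit.
short_ids = [101, 1327, 1164, 5450, 23434, 136, 102]

truncated_long = truncate(long_ids, max_length=8)
truncated_short = truncate(short_ids, max_length=8)
print(truncated_long)
print(truncated_short)
```

Tokens beyond the limit are simply dropped, so any information they carried is lost to the model.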


In [None]:
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModel.from_pretrained("bert-base-uncased")

# Return TensorFlow tensors so the encoding can be passed straight to the model
inputs = tokenizer("Hello world!", return_tensors="tf")
outputs = model(**inputs)
outputs

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=<tf.Tensor: shape=(1, 5, 768), dtype=float32, numpy=
array([[[-0.14241204,  0.13353652, -0.12907045, ..., -0.35967988,
         -0.05622251,  0.3605013 ],
        [-0.35064834,  0.10419673,  0.6244456 , ..., -0.17610395,
          0.48340222,  0.06443401],
        [-0.24513134, -0.15731728,  0.6945201 , ..., -0.5654466 ,
         -0.08940079, -0.18564393],
        [-0.8247857 , -0.91192245, -0.6560709 , ...,  0.5074244 ,
         -0.19388738, -0.1658766 ],
        [ 0.8766518 ,  0.03524846, -0.12331428, ...,  0.27201617,
         -0.6369    , -0.15850045]]], dtype=float32)>, pooler_output=<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[-8.97564590e-01, -3.30401391e-01, -7.69419611e-01,
         7.57992506e-01,  4.66781229e-01, -1.20347537e-01,
         9.18352425e-01,  1.80867970e-01, -7.27160692e-01,
        -9.99908864e-01, -4.47232902e-01,  8.91044021e-01,
         9.66213226e-01,  5.49151361e-01,  9.43439186