# Sentiment Analysis Using the RoBERTa Model

[RoBERTa](https://ai.meta.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/) is an open-source model from Meta that builds on BERT, a self-supervised pretraining method for NLP, by optimizing its training procedure.
 
In this notebook, we perform sentiment analysis on a single hardcoded tweet using the RoBERTa-based model [twitter-roberta-base-sentiment-latest](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest), which was trained on ~124M tweets.

## Labels

| Index | Label    |
|:-----:|----------|
|   0   | Negative |
|   1   | Neutral  |
|   2   | Positive |
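
The table can be captured as a small lookup, sketched here in plain Python. `ID2LABEL` and `top_label` are names introduced for illustration, and the scores passed in are made-up values, not model output.

```python
# Index-to-label mapping copied from the table above.
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}


def top_label(scores):
    """Return the label whose score is highest (hypothetical helper)."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return ID2LABEL[best_index]


# Illustrative scores only; index 2 is the largest.
print(top_label([0.1, 0.2, 0.7]))  # Positive
```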

## Future Work

Feed the Amazon product review dataset in the data directory to the model.
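
One possible shape for that loop is sketched below. The dataset's file name and column layout are not described in this notebook, so the `review_text` column, the `classify_reviews` helper, and the stub classifier are all assumptions made for illustration. In practice `classify` would wrap the tokenizer and model calls from the cells that follow.

```python
import csv
import io


def classify_reviews(csv_file, classify, text_column="review_text"):
    """Yield (text, label) for each row of a CSV file-like object.

    `text_column` is an assumed column name; adjust it to the real dataset.
    """
    for row in csv.DictReader(csv_file):
        text = row[text_column]
        yield text, classify(text)


# Toy stand-ins: an in-memory CSV and a keyword-based stub classifier.
sample = io.StringIO("review_text\nGreat product\nBroke after a day\n")
stub = lambda text: "Positive" if "Great" in text else "Negative"

results = list(classify_reviews(sample, stub))
print(results)
```

Passing the classifier in as a callable keeps the file handling separate from the model, so the loop can be tried with a stub before the real model is downloaded.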


In [25]:
# Install the packages
# Note: PyTorch must be installed for transformers to work.

!pip install ipywidgets scipy torch transformers > /dev/null 2>&1
# `!export` runs in a throwaway subshell and does not affect the kernel, so use %env instead.
%env TOKENIZERS_PARALLELISM=false

In [26]:
# Import packages

import os
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    logging,
)
from scipy.special import softmax

# Suppress warnings (as it can get noisy sometimes).
logging.set_verbosity_error()

In [27]:
tweet = "@cybersam terrible day @ beach ☹️ https://latimes.com"

# A tweet can contain user mentions (prefixed with @) and links. Replace them with the
# placeholders @user and http, respectively, so that the model recognizes them.

def modify(token):
    if token.startswith("@") and len(token) > 1:
        return '@user'
    if token.startswith("http"):  # also matches https links
        return 'http'
    
    return token

clean_tweet = ' '.join([modify(token) for token in tweet.split()])
print(clean_tweet)
        

@user terrible day @ beach ☹️ http


> **Notes**
> 1. If you are using the model `twitter-roberta-base-sentiment-latest` for the first time, the operation will take 1-5 minutes because the model is downloaded from Hugging Face.
> 1. If you see the error `ImportError: AutoModelForSequenceClassification requires the PyTorch library but it was not found in your environment`, restart your notebook kernel and rerun the entire notebook so that the kernel can reload any packages that aren't explicitly imported by this notebook.
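
Before restarting, it can help to confirm that the packages are actually installed. The check below is a minimal sketch using only the standard library; `is_installed` is a name introduced here, not part of any of the packages above.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the current interpreter can locate the top-level package."""
    return importlib.util.find_spec(package_name) is not None


for name in ("torch", "transformers", "scipy"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If `torch` is reported as found but the error persists, the kernel is likely still holding state from before the installation, which is what the restart clears.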


In [28]:
# Load the model and tokenizer

# See the model home page on Hugging Face:
# https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest

# Download (or load from cache) the pretrained model and tokenizer by model name.
model_name = 'cardiffnlp/twitter-roberta-base-sentiment-latest'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)

# Label names in index order (see the label definitions on the model's Hugging Face page).
labels = ['Negative', 'Neutral', 'Positive']

In [29]:
# Tokenize/vectorize the text.

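# Encode the tweet as PyTorch (pt) tensors.
# Note: this encodes the raw `tweet`, so the output below includes the token IDs of the
# original mention and link. Pass `clean_tweet` instead to use the preprocessed text.
tokens = tokenizer.encode(tweet, return_tensors='pt')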
print(tokens)

tensor([[    0,  1039,  4469,  9569,   424,  6587,   183,   787,  4105, 42699,
          9253, 12605,  1205,   640, 12805,  9452,     4,   175,     2]])


In [30]:
# Feed the tokenized tweet to the model.

result = model(tokens)
print(result)

SequenceClassifierOutput(loss=None, logits=tensor([[ 2.4574, -0.2846, -2.4405]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


In [31]:
# Make the output more readable.

raw_scores = result.logits[0].detach().numpy()
print(f'Raw scores: {raw_scores}')

# Convert the raw scores (logits) to probabilities and pair each with its label.
probabilities = softmax(raw_scores)
for k, v in zip(labels, probabilities):
    print(f'{k}: {v:.3f}')

Raw scores: [ 2.4574018  -0.28458977 -2.4404929 ]
Negative: 0.933
Neutral: 0.060
Positive: 0.007
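
To make the softmax step concrete, here is a minimal pure-Python sketch of the same calculation, applied to the raw scores printed above. `softmax_scores` is a name introduced for illustration. The notebook itself uses `scipy.special.softmax`, and this is only an equivalent hand-rolled version, not SciPy's implementation.

```python
import math


def softmax_scores(logits):
    """Softmax with the usual max-shift for numerical stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.4574018, -0.28458977, -2.4404929]  # raw scores printed above
probs = softmax_scores(logits)
for label, p in zip(["Negative", "Neutral", "Positive"], probs):
    print(f"{label}: {p:.3f}")
```

Each score is exponentiated and divided by the sum of all exponentials, so the results are positive and sum to 1. Subtracting the maximum first does not change the result but avoids overflow for large scores.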
