# Sentiment Analysis

In this exercise, we're going to perform sentiment analysis on tweets.

In [1]:
import torch

## Data

We're going to use a truncated version of the [Sentiment140 dataset](https://www.kaggle.com/datasets/kazanova/sentiment140). The full dataset contains 1.6M tweets with a sentiment (polarity) label. Each line of the CSV file contains the following 6 fields:
1. target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
2. ids: the id of the tweet (2087)
3. date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
4. flag: the query (lyx); if there is no query, this value is NO_QUERY
5. user: the user who tweeted (robotickilldozr)
6. text: the text of the tweet (Lyx is cool)

Instead of the full dataset, we load the truncated CSV file `tweets.2000.processed.noemoticon.csv`. It contains 2000 tweets (one per line): 1000 with label 0 (negative) and 1000 with label 4 (positive).

In [2]:
import csv
fieldnames = ['label', 'id', 'date', 'query', 'user', 'text']
with open('tweets.2000.processed.noemoticon.csv', 'r', encoding='ISO-8859-1') as csvfile:
    reader = csv.DictReader(csvfile, fieldnames=fieldnames)
    for row in reader:
        print(f"Label: {row['label']}")
        print(f"User: {row['user']}")
        print(f"Text: {row['text']}")
        break
print(f"Label is of type {type(row['label'])}, convert to int like this: {int(row['label'])}")

Label: 0
User: _TheSpecialOne_
Text: @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D
Label is of type <class 'str'>, convert to int like this: 0
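
As a quick sanity check on the label balance described above, one could count the labels with a `Counter`. This is only a sketch: `count_labels` is a hypothetical helper, and the two in-memory rows are made-up stand-ins for the real file so that the snippet runs on its own.

```python
import csv
import io
from collections import Counter

fieldnames = ['label', 'id', 'date', 'query', 'user', 'text']

def count_labels(csvfile):
    """Count how often each integer label occurs in an open CSV file."""
    reader = csv.DictReader(csvfile, fieldnames=fieldnames)
    return Counter(int(row['label']) for row in reader)

# Hypothetical two-row stand-in for the real file (made-up ids, dates, users, text).
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so sad today"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","what a great day"\n'
)
label_counts = count_labels(sample)
print(label_counts)
```

Passing the real file, opened with `encoding='ISO-8859-1'` as in the cell above, should report 1000 tweets for each of the labels 0 and 4 if the description of the truncated dataset holds.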


## Model

We're going to use a popular sentiment analysis model from the CardiffNLP group, which we download from Hugging Face: `cardiffnlp/twitter-roberta-base-sentiment`. As the name suggests, it is based on a pretrained RoBERTa model and was fine-tuned on tweets for sentiment analysis. Check out the [model description](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment?text=I+like+turtles.) for examples of how to use it.

In [4]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.47.1-py3-none-any.whl.metadata (44 kB)
Collecting huggingface-hub<1.0,>=0.24.0 (from transformers)
  Downloading huggingface_hub-0.27.0-py3-none-any.whl.metadata (13 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2024.11.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting safetensors>=0.4.1 (from transformers)
  Downloading safetensors-0.4.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Downloading transformers-4.47.1-py3-none-any.whl (10.1 MB)

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')
model.eval()

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
         

From the model description, we see that tweets must first be preprocessed into the format the model expects. Let's implement the `preprocess` function as described in their example.

In [6]:
# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
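
To see what the placeholders do in practice, here is a small sketch that applies the function to the beginning of the first tweet in the dataset. The function is repeated only so that the snippet runs on its own; `raw` and `cleaned` are illustrative names.

```python
def preprocess(text):
    # Same logic as the notebook's preprocess function, repeated so the sketch runs alone.
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

raw = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
cleaned = preprocess(raw)
print(cleaned)  # @user http - Awww, that's a bummer.
```

The mention becomes `@user` and the link becomes `http`, while a lone `@` would be left untouched because of the `len(t) > 1` check.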

Let's classify the first tweet in our dataset. Compute the probabilities that the model assigns to each of its 3 output classes: negative, neutral, and positive sentiment.

In [12]:
text = preprocess(row["text"])

In [19]:
with torch.no_grad():
    out = model(**tokenizer(text, return_tensors="pt"))
probabilities = torch.softmax(out.logits[0], dim=-1)
probabilities

tensor([0.6341, 0.3154, 0.0505])
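
To turn these probabilities into a readable prediction, one could pair them with the class names in the order given above (negative, neutral, positive). The sketch below uses plain Python lists rather than tensors; `labels`, `probs`, `best`, and `predicted` are illustrative names, and the numbers are copied from the output above.

```python
# Class order as described above: index 0 = negative, 1 = neutral, 2 = positive.
labels = ["negative", "neutral", "positive"]

# Probabilities copied from the output above (rounded to 4 decimals).
probs = [0.6341, 0.3154, 0.0505]

# Index of the largest probability, i.e. a plain-Python argmax.
best = max(range(len(probs)), key=lambda i: probs[i])
predicted = labels[best]

for name, p in zip(labels, probs):
    print(f"{name:>8}: {p:.2%}")
print(f"Predicted sentiment: {predicted}")
```

For this tweet the largest probability belongs to the negative class, which matches its dataset label of 0.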

Now classify all the tweets and evaluate how well the model does in terms of classification accuracy.

In [25]:
tokenizer(text, return_tensors='pt')

{'input_ids': tensor([[    0,  1039, 12105,  2054,   111,    83,  1401,     6,    14,    18,
            10,   741, 22539,     4,  1437,   370,   197,   102,   300,   871,
          8902,     9,  7470,  1053,     7,   109,    24,     4, 25606,   495,
             2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1]])}

In [22]:
def classify(tweet):
    """ Returns probabilities for each of the 3 classes: negative, neutral, positive. """
    text = preprocess(tweet)
    encoded_input = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        output = model(**encoded_input)
    probabilities = torch.softmax(output.logits[0], dim=-1)
    return probabilities

In [23]:
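def score(label, predictions):
    """ Returns 1 if the true label has the highest predicted probability, else 0. """
    # The dataset's labels are 0 (negative) and 4 (positive); the model's class
    # indices are 0 (negative), 1 (neutral), 2 (positive) --> map 0 to 0 and 4 to 2.
    # Integer division keeps the label an int instead of producing a float.
    true_label = label // 2
    predicted_label = predictions.argmax().item()
    return 1 if true_label == predicted_label else 0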
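
The label mapping and its consequences are easier to see on a few hand-made inputs. The sketch below uses `score_from_list`, a hypothetical list-based stand-in for `score`, so that it runs without PyTorch; the probability vectors are made up for illustration.

```python
def score_from_list(label, probs):
    """List-based stand-in for score: 1 if the mapped label is the argmax, else 0."""
    true_label = label // 2  # dataset 0 -> class 0, dataset 4 -> class 2
    predicted_label = max(range(len(probs)), key=lambda i: probs[i])
    return 1 if true_label == predicted_label else 0

# Made-up probability vectors in the order negative, neutral, positive.
hit = score_from_list(0, [0.7, 0.2, 0.1])           # negative tweet, negative prediction
miss = score_from_list(4, [0.7, 0.2, 0.1])          # positive tweet, negative prediction
neutral_miss = score_from_list(4, [0.1, 0.6, 0.3])  # positive tweet, neutral prediction
print(hit, miss, neutral_miss)  # 1 0 0
```

Note the third case: since the truncated dataset has no neutral tweets, a neutral prediction can never be scored as correct.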

In [24]:
import tqdm

correct = 0
total = 0
with open('tweets.2000.processed.noemoticon.csv', 'r', encoding='ISO-8859-1') as csvfile:
    reader = csv.DictReader(csvfile, fieldnames=fieldnames)
    for row in tqdm.tqdm(reader):
        label = int(row['label'])
        tweet = row['text']
        predictions = classify(tweet)
        correct += score(label, predictions)
        total += 1
print(f"Accuracy: {correct/total:.2%} ({correct}/{total})")

2000it [01:34, 21.09it/s]

Accuracy: 58.40% (1168/2000)
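
The evaluation file contains only labels 0 and 4, while the model can also predict neutral, and `score` counts every neutral prediction as an error. How much of the gap this explains is not established here. As a sketch of one alternative, the hypothetical `binary_score` below ignores the neutral class and compares only the negative and positive probabilities; it is not part of the original exercise.

```python
def binary_score(label, probs):
    """Hypothetical variant: ignore the neutral class and compare only
    the negative (index 0) and positive (index 2) probabilities."""
    predicted_label = 0 if probs[0] >= probs[2] else 4
    return 1 if label == predicted_label else 0

# Made-up example: neutral has the highest probability, but positive beats negative.
probs = [0.1, 0.6, 0.3]
print(binary_score(4, probs))  # 1
print(binary_score(0, probs))  # 0
```

Swapping this in for `score` in the evaluation loop would give a two-class accuracy, which answers a different question than the three-class accuracy reported above.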



