## Task 1: Sentiment Analysis

In this task, I developed a sentiment analysis model to classify tweets related to sustainability as positive or negative. Although Sentiment140 defines a neutral label, the copy used here contains only negative and positive examples, so the task is binary. The model was trained on a balanced sample of general tweets and tested on sustainability-specific tweets. Evaluation used accuracy, precision, recall, and F1-score, and the results give insight into how well a model trained on general-purpose data generalizes to the sustainability domain.

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install datasets transformers

# Approach Overview

This work aims to build a sentiment analysis model using a subset of the **Sentiment140 dataset** [[LINK](https://www.kaggle.com/datasets/kazanova/sentiment140)]. Here's the detailed breakdown of my approach:

1. **Sustainability-Related Data Extraction**:
   - To test the model's performance on sustainability-related tweets, we apply a **keyword filter** to the full dataset.
   - Each tweet is checked for any phrase from a **predefined list of sustainability-related keywords** (such as "clean energy", "renewable energy", and "climate change").
   - After filtering, we obtain **493 rows** of sustainability-related tweets.

2. **Dataset Selection**:
   - From the Sentiment140 dataset, we will randomly select **30,000 samples** for training purposes. These samples are **likely not related to sustainability**, as they are selected randomly from the general dataset after removing rows that we extracted earlier.

3. **Training and Testing Dataset**:
   - The extracted **493 sustainability-related tweets** will be used **exclusively for testing**.
   - The **30,000 rows** selected earlier, which are **probably unrelated to sustainability**, will be used for **training the sentiment analysis model**.

# Summary

- **Training Data**: 30,000 tweets from Sentiment140 (randomly selected, likely not related to sustainability).
- **Testing Data**: 493 tweets filtered from the Sentiment140 dataset that are **explicitly related to sustainability**.

This approach ensures that the model is trained on general tweets but evaluated on a more specific domain (sustainability-related tweets).

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv("/content/drive/My Drive/Advanced NLP/Dataset/twitter_dataset", encoding ="ISO-8859-1" , names=["target", "ids", "date", "flag", "user", "text"])

df.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   ids     1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [None]:
df["target"].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
0,800000
4,800000


### According to the dataset providers we have:
    0 -> NEGATIVE
    2 -> NEUTRAL
    4 -> POSITIVE

In [None]:
decode_map = {0: "NEGATIVE", 2: "NEUTRAL", 4: "POSITIVE"}

def decode_sentiment(label):
    return decode_map[int(label)]

In [None]:
df.target = df.target.apply(lambda x: decode_sentiment(x))

# Data Preparation

## Let's check for missing and duplicated values

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing Values:")
for column, count in missing_values.items():
    print(f"{column}: {count}")

print("*"*50)

# Check for duplicated values
duplicated_values = df.duplicated().sum()
print("Duplicated Values:")
print(f"Total duplicated rows: {duplicated_values}")

Missing Values:
target: 0
ids: 0
date: 0
flag: 0
user: 0
text: 0
**************************************************
Duplicated Values:
Total duplicated rows: 0


## Removing unnecessary columns

In [None]:
df.head()

Unnamed: 0,target,ids,date,flag,user,text
0,NEGATIVE,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,NEGATIVE,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,NEGATIVE,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,NEGATIVE,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,NEGATIVE,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


We are only interested in the target and text columns, so the rest will be removed.

In [None]:
df = df[["text", "target"]]

In [None]:
# Display the DataFrame without truncating long text values
pd.set_option('display.max_colwidth', None)

df.head()

Unnamed: 0,text,target
0,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",NEGATIVE
1,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,NEGATIVE
2,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,NEGATIVE
3,my whole body feels itchy and like its on fire,NEGATIVE
4,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",NEGATIVE


## -> We can see that the text column contains URLs and other special characters. We will deal with that later, before modeling.

## Let's plot the distribution of our sentiments

In [None]:
from collections import Counter
import plotly.graph_objects as go

target_cnt = Counter(df.target)

fig = go.Figure(data=[
    go.Bar(x=list(target_cnt.keys()), y=list(target_cnt.values()))
])

fig.update_layout(
    title="Sentiments distribution",
    xaxis_title="Sentiments",
    yaxis_title="Count",
    width=800,
    height=600
)

# Show the figure
fig.show()

# Let's extract our testing dataset first

In [None]:
df.shape

(1600000, 2)

In [None]:
# Example list of sustainability-related keywords
sustainability_keywords = [
    'climate change', 'renewable energy', 'clean energy', 'sustainable', 'green energy',
    'carbon emissions', 'environment', 'recycling', 'solar power', 'wind energy', 'sustainability',
    'biofuel', 'global warming', 'sustainable transport', 'fossil fuels', 'net zero', 'greenhouse gases',
    'carbon footprint', 'conservation', 'pollution'
]

In [None]:
# Now we will filter rows that contain any of the sustainability-related keywords in the 'text' column
def contains_sustainability_keywords(text):
    text = text.lower()  # Convert text to lowercase for case-insensitive matching
    return any(keyword in text for keyword in sustainability_keywords)
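
Plain substring matching, as used above, also fires when a keyword is embedded in a longer word, and it cannot tell whether a word like "environment" is used in an ecological sense. The sketch below compares it with a word-boundary regex. The shortened keyword list, the helper names, and the sample sentences are made up for illustration and are not part of the notebook's pipeline.

```python
import re

# Hypothetical subset of the keyword list, for illustration only.
keywords = ["sustainable", "environment", "net zero"]

def contains_substring(text):
    # Same idea as contains_sustainability_keywords: plain substring search.
    text = text.lower()
    return any(k in text for k in keywords)

# Word-boundary variant: a keyword must appear as a whole word or phrase.
pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b")

def contains_whole_word(text):
    return pattern.search(text.lower()) is not None

samples = [
    "this pace is unsustainable",     # keyword embedded in a longer word
    "we need sustainable transport",  # genuine match
    "love my new work environment",   # matches both ways, but is off-topic
]
for s in samples:
    print(f"{s!r}: substring={contains_substring(s)}, whole_word={contains_whole_word(s)}")
```

Word boundaries remove the first kind of false positive but not the third; off-topic uses of a keyword would need manual review or a stricter keyword list.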

In [None]:
# Build the keyword mask once, then split the data into the two subsets
is_sustainability = df['text'].apply(contains_sustainability_keywords)
testing_df = df[is_sustainability].copy()  # Sustainability-related rows (copy avoids SettingWithCopyWarning later)
df = df[~is_sustainability]                # Remaining rows (non-sustainability-related)

# Now 'testing_df' contains the sustainability-related tweets, and 'df' contains the rest

In [None]:
print(f"Test dataset contains {len(testing_df)}")

print(f"The rest is {len(df)}")

Test dataset contains 493
The rest is 1599507


### Let's check the sentiment distribution again

In [None]:
target_cnt = Counter(testing_df.target)

fig = go.Figure(data=[
    go.Bar(x=list(target_cnt.keys()), y=list(target_cnt.values()))
])

fig.update_layout(
    title="Sentiments distribution",
    xaxis_title="Sentiments",
    yaxis_title="Count",
    width=800,
    height=600
)

fig.show()

In [None]:
from collections import Counter
import plotly.graph_objects as go


target_cnt = Counter(testing_df.target)

labels = list(target_cnt.keys())
values = list(target_cnt.values())

fig = go.Figure(
    data=[go.Pie(labels=labels, values=values, hole=0.0,
                 textinfo='percent+label', # Show both label and percentage
                 hoverinfo='label+percent+value' # Display extra info on hover
                 )]
)

fig.update_layout(
    title="Dataset labels distribution (Pie Chart)",
    width=800,
    height=600
)

# Show the figure
fig.show()

# We can say that our dataset is balanced for the testing set.

# Let's randomly select 30000 rows for our training
    - 15000 for Positive
    - 15000 for Negative

In [None]:
# Randomly select 15000 positive samples
positive_samples = df[df['target'] == 'POSITIVE'].sample(n=15000, random_state=42)

# Randomly select 15000 negative samples
negative_samples = df[df['target'] == 'NEGATIVE'].sample(n=15000, random_state=42)

# Concatenate the two
training_df = pd.concat([positive_samples, negative_samples]).reset_index(drop=True)

# Shuffle the rows in the new training dataset
training_df = training_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"Training dataset contains {len(training_df)} samples")

Training dataset contains 30000 samples


### Let's check the sentiment distribution

In [None]:
from collections import Counter
import plotly.graph_objects as go


target_cnt = Counter(training_df.target)

labels = list(target_cnt.keys())
values = list(target_cnt.values())

fig = go.Figure(
    data=[go.Pie(labels=labels, values=values, hole=0.0,
                 textinfo='percent+label',
                 hoverinfo='label+percent+value'
                 )]
)

fig.update_layout(
    title="Dataset labels distribution (Pie Chart)",
    width=800,
    height=600
)

# Show the figure
fig.show()

In [None]:
training_df.shape

(30000, 2)

# Let's continue cleaning the data

In [None]:
# Display the DataFrame without truncating long text values
pd.set_option('display.max_colwidth', None)

training_df.head()

Unnamed: 0,text,target
0,@Smush21 that's funny... it would be awesome tho,POSITIVE
1,has an awful nosebleed...,NEGATIVE
2,I still haven't gotten my cupcake,NEGATIVE
3,Back to square one looking at houses,NEGATIVE
4,On my way to clarks villiage,POSITIVE


In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Raw string so that \S reaches the regex engine unchanged
TEXT_CLEANING_RE = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"
stop_words = set(stopwords.words("english"))  # a set gives fast membership tests
# Stemmer used only when stem=True (SnowballStemmer is an assumed choice)
stemmer = SnowballStemmer("english")

def preprocess(text, stem=False):
    # Remove links, user mentions and special characters
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if token not in stop_words:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)

training_df.text = training_df.text.apply(preprocess)
testing_df.text = testing_df.text.apply(preprocess)

In [None]:
training_df.head()

Unnamed: 0,text,target
0,funny would awesome tho,POSITIVE
1,awful nosebleed,NEGATIVE
2,still gotten cupcake,NEGATIVE
3,back square one looking houses,NEGATIVE
4,way clarks villiage,POSITIVE
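
One side effect of the cleaning step is worth seeing on a concrete input: NLTK's English stop-word list includes negations such as "not" and "no", so removing stop words can erase the word that carries the sentiment. The sketch below reuses the same cleaning pattern but swaps in a tiny hand-written stop-word set (an assumption, so that it runs without downloading NLTK data); the function name `clean` and the sample tweets are made up.

```python
import re

# Same pattern as TEXT_CLEANING_RE, written as a raw string.
CLEAN_RE = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"

# Hypothetical mini stop-word set; NLTK's English list is much longer
# and also contains negations such as "not" and "no".
STOP = {"i", "do", "not", "the", "this", "is", "at", "all"}

def clean(text):
    # Lowercase, strip mentions/links/punctuation, then drop stop words.
    text = re.sub(CLEAN_RE, " ", str(text).lower()).strip()
    return " ".join(t for t in text.split() if t not in STOP)

negative = clean("@user I do NOT like this http://t.co/abc :(")
positive = clean("I like this!")
print(repr(negative), repr(positive))
print("identical after cleaning:", negative == positive)
```

Both tweets reduce to the same token, so the classifier would receive identical input for opposite sentiments. Keeping negations, or skipping stop-word removal for a transformer model, are possible alternatives to evaluate.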


# Modeling

For the modeling, we will fine-tune **RoBERTa** for sentiment classification.

The performance of the model will be evaluated on the testing dataset using the following metrics:
- **Accuracy**
- **Recall**
- **Precision**
- **F1-Score**

In [None]:
from sklearn.preprocessing import LabelBinarizer

label_binarizer = LabelBinarizer()

# Fit and transform the training labels
train_labels = label_binarizer.fit_transform(training_df['target']).astype('float32')
test_labels = label_binarizer.transform(testing_df['target']).astype('float32')

train_texts = training_df['text'].tolist()
test_texts = testing_df['text'].tolist()

In [None]:
len(test_labels[0])

1

In [None]:
test_labels[0]

array([0.], dtype=float32)

# Model Building

In [None]:
import torch
# RobertaForSequenceClassification adds a classification head on top of the encoder, instead of the fill-mask head used for masked language modeling
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset

In [None]:
print(train_labels[0])
print(test_labels[0])

[1.]
[0.]


In [None]:
# train_texts

In [None]:
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained("roberta-base",
                                                           num_labels=len(train_labels[0]))



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
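
Because `num_labels` is 1 here, the classification head has a single output and, with float labels, `transformers` treats the problem as regression and trains with mean squared error against the 0/1 targets. The sketch below shows that loss and the later 0.5 threshold on toy numbers; the outputs and targets are made up and do not come from the model.

```python
# Toy raw outputs of a single-output head and their 0/1 targets (made-up numbers).
outputs = [0.9, 0.2, 0.6, 0.4]
targets = [1.0, 0.0, 0.0, 1.0]

# Mean squared error, the loss used for a regression-style head.
mse = sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)

# Thresholding the raw output at 0.5 turns the score into a class decision.
preds = [1 if o > 0.5 else 0 for o in outputs]
accuracy = sum(p == int(t) for p, t in zip(preds, targets)) / len(targets)

print(f"mse={mse:.4f}, preds={preds}, accuracy={accuracy}")
```

A more conventional setup for two classes would be `num_labels=2` with integer labels and cross-entropy loss; the single-output variant is what this notebook uses.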


### After loading the model, we need to build our custom dataset. Using a PyTorch `Dataset`, we can customize our training and testing data so that RoBERTa can work with our dataset.

In [None]:
model

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

In [None]:
# Let's create our Dataset.
class CustomDataset(Dataset):
  # Initialize the Dataset variables
  def __init__(self, texts, labels, tokenizer, max_length=128):
    self.texts = texts
    self.labels = labels
    self.tokenizer = tokenizer
    self.max_length = max_length

  # Get the Dataset Length
  def __len__(self):
    return len(self.texts)

  def __getitem__(self, idx):
    # Look up the text at position idx (texts is a plain list, so integer indexing works)
    text = str(self.texts[idx])
    # Important: the label must be a torch tensor
    label = torch.tensor(self.labels[idx])

    # truncation=True cuts inputs longer than max_length (128 tokens by default here),
    # and padding="max_length" pads shorter ones, so every example has the same length.
    # return_tensors='pt' makes the tokenizer return PyTorch tensors.
    encoding = self.tokenizer(text, truncation=True, padding="max_length", max_length=self.max_length, return_tensors='pt')

    # Return a dictionary with the keys the model expects:
    # input_ids holds the token ids of the input text, attention_mask marks real tokens vs. padding.
    return {
        'input_ids': encoding['input_ids'].flatten(),
        'attention_mask': encoding['attention_mask'].flatten(),
        'labels': label
    }
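
To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a library-free sketch of what happens to a sequence of token ids. The function `pad_or_truncate` and the token ids are hypothetical; the real tokenizer additionally keeps the special start and end tokens when truncating, which this toy version ignores. The pad id of 1 matches the `padding_idx=1` shown in the model summary.

```python
def pad_or_truncate(ids, max_length, pad_id=1):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = ids[:max_length]                          # truncation
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad              # right padding
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Short input: padded up to max_length.
short_ids, short_mask = pad_or_truncate([0, 31414, 2], max_length=5)
print(short_ids, short_mask)

# Long input: cut down to max_length, nothing to pad.
long_ids, long_mask = pad_or_truncate([0, 5, 6, 7, 2], max_length=3)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the dataset returns it alongside `input_ids`.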

In [None]:
train_dataset = CustomDataset(train_texts, train_labels, tokenizer)
val_dataset = CustomDataset(test_texts, test_labels, tokenizer)

In [None]:
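from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # With num_labels=1 the model output is a raw score trained against 0/1 targets,
    # so it is thresholded at 0.5 directly (no sigmoid is applied here)
    predictions = (logits > 0.5).astype(int)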

    # Compute metrics for binary classification
    precision = precision_score(labels, predictions, average='binary', zero_division=0)
    recall = recall_score(labels, predictions, average='binary', zero_division=0)
    f1 = f1_score(labels, predictions, average='binary', zero_division=0)
    accuracy = accuracy_score(labels, predictions)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }


In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=15,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    learning_rate=1e-5,
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=1000
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)


`evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead



In [None]:
# Train the model
trainer.train()

Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
100,0.1663,0.170699,0.750507,0.743945,0.814394,0.777577
200,0.1687,0.16422,0.764706,0.789062,0.765152,0.776923
300,0.1504,0.164528,0.772819,0.799213,0.768939,0.783784
400,0.1361,0.164588,0.766734,0.776952,0.791667,0.78424
500,0.096,0.174188,0.758621,0.786561,0.753788,0.769826
600,0.0903,0.174723,0.750507,0.80786,0.700758,0.750507
700,0.089,0.199782,0.770791,0.761246,0.833333,0.79566
800,0.0756,0.192411,0.748479,0.748227,0.799242,0.772894
900,0.0828,0.216327,0.73428,0.696165,0.893939,0.782753
1000,0.0803,0.176142,0.776876,0.805556,0.768939,0.786822



TrainOutput(global_step=28125, training_loss=0.07603290996975369, metrics={'train_runtime': 11333.955, 'train_samples_per_second': 39.704, 'train_steps_per_second': 2.481, 'total_flos': 2.95997279616e+16, 'train_loss': 0.07603290996975369, 'epoch': 15.0})
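
The `global_step=28125` in the training output can be cross-checked against the training arguments. This is a small arithmetic sketch, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook).

```python
import math

n_train = 30_000     # training samples
batch_size = 16      # per_device_train_batch_size (single device assumed)
epochs = 15
warmup_steps = 500
eval_steps = 100

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
n_evals = total_steps // eval_steps
warmup_fraction = warmup_steps / total_steps

print(f"steps per epoch: {steps_per_epoch}")      # 1875
print(f"total steps:     {total_steps}")          # 28125
print(f"evaluation runs: {n_evals}")              # 281
print(f"warmup fraction: {warmup_fraction:.4f}")  # 0.0178
```

At 1875 steps per epoch, the ten logged rows in the table (steps 100 to 1000) cover only a little more than half of the first epoch.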

In [None]:
trainer.evaluate()

{'eval_loss': 0.2405790239572525,
 'eval_accuracy': 0.7525354969574036,
 'eval_precision': 0.762962962962963,
 'eval_recall': 0.7803030303030303,
 'eval_f1': 0.7715355805243446,
 'eval_runtime': 3.3255,
 'eval_samples_per_second': 148.247,
 'eval_steps_per_second': 4.811,
 'epoch': 15.0}
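
The four reported metrics are consistent with a single confusion matrix over the 493 test tweets. The counts below are inferred from the reported ratios and are not printed by the notebook; the sketch recomputes each metric from them as a sanity check.

```python
# Counts inferred from the reported ratios (not printed by the notebook).
tp, fp, fn, tn = 206, 64, 58, 165

total = tp + fp + fn + tn
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

print(f"total={total}")
print(f"precision={precision:.6f} recall={recall:.6f} f1={f1:.6f} accuracy={accuracy:.6f}")
```

Under these inferred counts the test set holds 264 positive and 229 negative tweets, and the model produces slightly more false positives (64) than false negatives (58).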

In [None]:
trainer.save_model('/content/drive/MyDrive/Advanced NLP/Models')

# Load and Use the Saved Model for Inference

In [None]:
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model2 = RobertaForSequenceClassification.from_pretrained("/content/drive/MyDrive/Advanced NLP/Models",
                                                           num_labels=1)

In [None]:
# Set model to evaluation mode
model2.eval()

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

In [None]:
text = "I love this product!"

inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)

with torch.no_grad():
    outputs = model2(**inputs)

In [None]:
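logits = outputs.logits
# Note: compute_metrics thresholds the raw output at 0.5, whereas here the output is passed through a sigmoid first
sentiment_score = torch.sigmoid(logits).item()

In [None]:
print(f"Sentiment score: {sentiment_score:.4f}")

# A sigmoid score of 0.65 corresponds to a raw model output of about 0.62
if sentiment_score > 0.65:
    print("Positive Sentiment")
else:
    print("Negative Sentiment")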

Sentiment score: 0.7376
Positive Sentiment
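
The inference cells apply a sigmoid and a 0.65 cutoff, while evaluation during training compared the raw output with 0.5. The relationship between the two thresholds can be worked out with the inverse of the sigmoid (the logit). This is a plain-Python sketch; the helper names are my own.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    # Inverse of the sigmoid: the raw score that maps to probability p.
    return math.log(p / (1.0 - p))

raw_cutoff = logit(0.65)        # raw output equivalent to sigmoid(x) > 0.65
score_at_0 = sigmoid(0.0)       # sigmoid of the "negative" training target
score_at_1 = sigmoid(1.0)       # sigmoid of the "positive" training target
raw_for_example = logit(0.7376) # raw output behind the printed score

print(f"raw cutoff for 0.65: {raw_cutoff:.3f}")
print(f"sigmoid(0)={score_at_0:.3f}, sigmoid(1)={score_at_1:.3f}")
print(f"raw output for score 0.7376: {raw_for_example:.2f}")
```

Since the model was trained toward targets of 0 and 1, sigmoid scores mostly fall between about 0.5 and 0.73, so the 0.65 cutoff is somewhat stricter than the 0.5 raw threshold used in `compute_metrics`. Thresholding the raw output at 0.5 at inference time would keep the two stages consistent.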
