# Reading Library

In [None]:
from datasets import load_dataset
import pandas as pd
import numpy as np
import os

from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the dataset
dataset = load_dataset("climatebert/climate_sentiment")

# Convert the train split to a Pandas DataFrame
train_df = dataset['train'].to_pandas()

# Convert the test split to a Pandas DataFrame
test_df = dataset['test'].to_pandas()

# Concatenate train and test dataframes
dataframe = pd.concat([train_df, test_df], ignore_index=True)
dataframe.columns = ['message', 'sentiment']
print(dataframe.head())
print(dataframe.keys())

In [2]:
pd.set_option('display.max_colwidth', None)
dataframe

Unnamed: 0,message,sentiment
0,"− Scope 3: Optional scope that includes indirect emissions associated with the goods and services supply chain produced outside the organization. Included are emissions from the transport of products from our logistics centres to stores (downstream) performed by external logistics operators (air, land and sea transport) as well as the emissions associated with electricity consumption in franchise stores.",1
1,"The Group is not aware of any noise pollution that could negatively impact the environment, nor is it aware of any impact on biodiversity. With regards to land use, the Group is only a commercial user, and the Group is not aware of any local constraints with regards to water supply. The Group does not believe that it is at risk with regards to climate change in the near-or mid-term.",0
2,"Global climate change could exacerbate certain of the threats facing our business, including the frequency and severity of weather-related events referred to in Performance of critical infrastructure in this section 9. In addition, increases in energy prices are partly influenced by government policies to address climate change which, combined with a growing data demand that increases our energy requirements, could increase our energy costs beyond our current expectations.",0
3,"Setting an investment horizon is part and parcel of our policy of focusing on the long term and helping clients to build capital. Both financial and non-financial aspects play a role in measuring investment returns. Even if we make a successful investment in a mining company today, the same company may nonetheless cause damage to the environment tomorrow, and thus be compelled to make substantial provisions for improving its waste-processing activities and paying fines. As an asset manager that focuses on the long-term prospects, we can’t ignore the non-financial aspects.",0
4,"Climate change the physical impacts of climate change on our operations are uncertain and particular to geographic circumstances. in addition, a number of national governments have already introduced or are contemplating the introduction of regulatory responses to greenhouse gas emissions from the combustion of fossil fuels to address the impacts of climate change. these physical effects and regulatory responses may adversely impact the productivity and financial performance of our operations.",0
...,...,...
1315,"Indirect emissions result from operational activities we do not own or control. These include indirect energy emissions produced as a consequence of electricity we purchase to power our treatment plants and other indirect emissions as a consequence of our activities, e.g. from travel on company business and sludge and process waste disposal emissions.",1
1316,"All data in this TCFD report is as of, or for the year-ended December 31, 2020 unless otherwise noted. References to Daimler’s Sustainability Report 2020 will be available with its publication by March 29, 2021. References to the CDP Climate Change Questionnaire are related to the reporting year 2019.",1
1317,"Outcome: The bank explained that it would be winding down its fossil fuel-related merger and acquisition advice, investing substantially in clean tech and banking services, and that it was preparing its first TCFD report.",1
1318,"In 2020, Banco do Brasil Foundation celebrated its 35th anniversary. Along its journey, it has contributed to the societal transformation of Brazilians and the sustainable development of the country, focused on serving the society’s most vulnerable segments, from north to south, from east to west, in cities and the countryside.",2


# Data Preprocessing

## Steps

1. Cleaning the data by removing speical characters and numbers using regular expression
2. Removing stop words such as "a", "the", "and", etc.
3. Convert all words in text to lower cases

### Reference
https://www.baeldung.com/cs/naive-bayes-classification-performance

In [3]:
import re

def clean_message(text):
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Remove special characters and punctuation
    text = re.sub(r'[^A-Za-z\s]', '', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

# Apply the cleaning function to the "message" column
dataframe['message'] = dataframe['message'].apply(clean_message)

In [5]:
def tokenize_message(text):
    return word_tokenize(text)

dataframe['tokenized_message'] = dataframe['message'].apply(tokenize_message)

In [6]:
# Remove stop words
import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # Change 'english' to your language if needed

def remove_stopwords(tokenized_text):
    return [word for word in tokenized_text if word.lower() not in stop_words]

dataframe['tokenized_message'] = dataframe['tokenized_message'].apply(remove_stopwords)

In [7]:
# Convert all words to lowercase
def convert_to_lowercase(tokenized_text):
    return [word.lower() for word in tokenized_text]

dataframe['tokenized_message'] = dataframe['tokenized_message'].apply(convert_to_lowercase)

In [8]:
dataframe = dataframe[['sentiment', 'tokenized_message']]

# Data Splitting

In [10]:
X = dataframe['tokenized_message']
y = dataframe['sentiment']
X_train, X_test, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
# Convert lists of tokens back into strings
X_train = X_train.apply(lambda x: ' '.join(x))
X_test = X_test.apply(lambda x: ' '.join(x))

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_valid_tfidf = tfidf_vectorizer.transform(X_test)

# Support Vector Machine (SVM) Model

In [12]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Initialize the SVM model
svm = SVC()

# Train the model on the training data
svm.fit(X_train_tfidf, y_train)

# Make predictions on the training set
y_train_pred_svm = svm.predict(X_train_tfidf)

# Make predictions on the validation set
y_valid_pred_svm = svm.predict(X_valid_tfidf)

# Calculate accuracy on the training and validation sets
training_accuracy_svm = accuracy_score(y_train, y_train_pred_svm)
validation_accuracy_svm = accuracy_score(y_valid, y_valid_pred_svm)

print("SVM Training Accuracy:", training_accuracy_svm)
print("SVM Validation Accuracy:", validation_accuracy_svm)

SVM Training Accuracy: 0.9962121212121212
SVM Validation Accuracy: 0.7613636363636364


In [13]:
from sklearn.model_selection import GridSearchCV

# Define a range of hyperparameters to search over
param_grid = {
    'C': [10, 20 ,30],  # Regularization parameter
    'kernel': ['linear', 'rbf', 'poly'],  # Kernel function
    'gamma': ['scale', 'auto', 0.1, 1],  # Kernel coefficient (only for 'rbf' and 'poly' kernels)
}

# Initialize the SVM model
svm = SVC()

# Initialize the GridSearchCV object with cross-validation
grid_search = GridSearchCV(estimator=svm, param_grid=param_grid, cv=5, n_jobs=-1)

# Perform hyperparameter tuning on the training data
grid_search.fit(X_train_tfidf, y_train)

# Get the best hyperparameters from the grid search
best_params = grid_search.best_params_

# Use the best hyperparameters to create a new SVM model
best_svm = SVC(**best_params)

# Train the new SVM model on the training data
best_svm.fit(X_train_tfidf, y_train)

# Make predictions on the training set with the best model
y_train_pred_best_svm = best_svm.predict(X_train_tfidf)

# Make predictions on the validation set with the best model
y_valid_pred_best_svm = best_svm.predict(X_valid_tfidf)

# Calculate accuracy on the training and validation sets with the best model
training_accuracy_best_svm = accuracy_score(y_train, y_train_pred_best_svm)
validation_accuracy_best_svm = accuracy_score(y_valid, y_valid_pred_best_svm)

print("Best SVM Hyperparameters:", best_params)
print("Best SVM Training Accuracy:", training_accuracy_best_svm)
print("Best SVM Validation Accuracy:", validation_accuracy_best_svm)

Best SVM Hyperparameters: {'C': 20, 'gamma': 0.1, 'kernel': 'rbf'}
Best SVM Training Accuracy: 1.0
Best SVM Validation Accuracy: 0.7916666666666666


# Naïve Bayes

In [14]:
from sklearn.naive_bayes import MultinomialNB

# Initialize the Naive Bayes model
naive_bayes = MultinomialNB()

# Train the model on the training data
naive_bayes.fit(X_train_tfidf, y_train)

# Make predictions on the training set
y_train_pred_nb = naive_bayes.predict(X_train_tfidf)

# Make predictions on the validation set
y_valid_pred_nb = naive_bayes.predict(X_valid_tfidf)

# Calculate accuracy on the training and validation sets
training_accuracy_nb = accuracy_score(y_train, y_train_pred_nb)
validation_accuracy_nb = accuracy_score(y_valid, y_valid_pred_nb)

print("Naive Bayes Training Accuracy:", training_accuracy_nb)
print("Naive Bayes Validation Accuracy:", validation_accuracy_nb)

Naive Bayes Training Accuracy: 0.8787878787878788
Naive Bayes Validation Accuracy: 0.7083333333333334


In [15]:
from sklearn.model_selection import GridSearchCV

# Initialize the Naive Bayes model
naive_bayes = MultinomialNB()

# Define a range of alpha values to try
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
    'fit_prior': [True, False],
    'class_prior': [None, [0.3, 0.4, 0.3]]
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(naive_bayes, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_tfidf, y_train)

# Get the best hyperparameters from grid search
best_alpha = grid_search.best_params_['alpha']
best_fit_prior = grid_search.best_params_['fit_prior']
best_class_prior = grid_search.best_params_['class_prior']

# Initialize a new Naive Bayes model with the best hyperparameters
best_naive_bayes = MultinomialNB(
    alpha=best_alpha,
    fit_prior=best_fit_prior,
    class_prior=best_class_prior
)

# Train the final model on the entire training dataset
best_naive_bayes.fit(X_train_tfidf, y_train)

# Train the model on the training data
naive_bayes.fit(X_train_tfidf, y_train)

# Make predictions on the training set
y_train_pred_nb = naive_bayes.predict(X_train_tfidf)

# Make predictions on the validation set
y_valid_pred_nb = naive_bayes.predict(X_valid_tfidf)

# Calculate accuracy on the training and validation sets
training_accuracy_nb = accuracy_score(y_train, y_train_pred_nb)
validation_accuracy_nb = accuracy_score(y_valid, y_valid_pred_nb)

print("Naive Bayes Training Accuracy:", training_accuracy_nb)
print("Naive Bayes Validation Accuracy:", validation_accuracy_nb)
print("Best Alpha:", best_alpha)


Naive Bayes Training Accuracy: 0.8787878787878788
Naive Bayes Validation Accuracy: 0.7083333333333334
Best Alpha: 0.5


# LogisticRegression

In [18]:
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression()
# Train the model on the training data
logistic_regression.fit(X_train_tfidf, y_train)

# Make predictions on the training set
y_train_pred_lr = logistic_regression.predict(X_train_tfidf)

# Make predictions on the validation set
y_valid_pred_lr = logistic_regression.predict(X_valid_tfidf)

# Calculate accuracy on the training and validation sets
training_accuracy_lr = accuracy_score(y_train, y_train_pred_lr)
validation_accuracy_lr = accuracy_score(y_valid, y_valid_pred_lr)

print("Logistic Regression Training Accuracy:", training_accuracy_lr)
print("Logistic Regression Validation Accuracy:", validation_accuracy_lr)

Logistic Regression Training Accuracy: 0.9545454545454546
Logistic Regression Validation Accuracy: 0.7727272727272727


In [16]:
# The l1 penalty encourages sparsity in the model by adding a penalty term that
# encourages many feature coefficients to be exactly zero. This can be useful when
# you suspect that many features are irrelevant or redundant.
# The l2 penalty adds a penalty term based on the square of the coefficients' magnitudes.
# It discourages coefficients from becoming too large, which helps prevent overfitting.

# A smaller C value, such as 0.1, increases the strength of regularization.
# In other words, it adds a stronger penalty for large coefficient values.
# This can help prevent overfitting by keeping the model's coefficients smaller.

logistic_regression = LogisticRegression(penalty='l2', C=0.1, max_iter=1000)

# Train the model on the training data
logistic_regression.fit(X_train_tfidf, y_train)

# Make predictions on the training set
y_train_pred_lr = logistic_regression.predict(X_train_tfidf)

# Make predictions on the validation set
y_valid_pred_lr = logistic_regression.predict(X_valid_tfidf)

# Calculate accuracy on the training and validation sets
training_accuracy_lr = accuracy_score(y_train, y_train_pred_lr)
validation_accuracy_lr = accuracy_score(y_valid, y_valid_pred_lr)

print("Logistic Regression Training Accuracy:", training_accuracy_lr)
print("Logistic Regression Validation Accuracy:", validation_accuracy_lr)

# We successfully reduced overfitting; however, since the dataset has 3 levels (negative, neutral, positive),
# and logistic regression is for binary decision, it may not perform optimally.

Logistic Regression Training Accuracy: 0.6732954545454546
Logistic Regression Validation Accuracy: 0.6136363636363636


# Limitations

Logistic Regression Model Limitations:

Initially, the Logistic Regression model exhibited strong performance with a training accuracy of 95.45% and a validation accuracy of 77.27%. However, when applied to the sentiment analysis task with three levels (negative, neutral, positive), Logistic Regression struggled to achieve optimal results. This is because Logistic Regression is inherently designed for binary classification, making it less suitable for multi-class problems like ours. The model's performance dropped significantly after tuning, with a training accuracy of 67.33% and a validation accuracy of 61.36%. This decrease in accuracy suggests that despite efforts to reduce overfitting, Logistic Regression may not capture the nuances of multi-class sentiment analysis effectively.

Naive Bayes Model Limitations:

Despite our best efforts, we encountered challenges with the Naive Bayes model. Even after extensive hyperparameter tuning, we faced overfitting issues, where the model performed exceptionally well on the training data but struggled to generalize to unseen validation data. Specifically, the Naive Bayes model achieved a high training accuracy of 87.88%, but its validation accuracy remained at a comparatively lower 70.83%. This suggests that the model may have become overly complex or that we need to explore further techniques to control overfitting.

SVM Model Limitations:

Similarly, with the SVM model, we encountered certain limitations. Although we performed hyperparameter tuning and obtained the best hyperparameters (C=20, gamma=0.1, kernel='rbf'), we still observed a slight performance gap between the training accuracy (100%) and the validation accuracy (79.17%). This discrepancy implies that, despite optimization efforts, there might be inherent limitations in the SVM's ability to generalize to unseen data. The SVM model excels at finding complex decision boundaries, but it may not fully capture the intricacies of sentiment analysis, especially in cases where the data exhibits high nonlinearity or complex relationships. Further strategies may be necessary to mitigate this performance gap and enhance model generalization.

Improvements:

To address these limitations, it may be beneficial to explore more advanced algorithms explicitly designed for multi-class sentiment analysis, such as gradient boosting methods (e.g., XGBoost) or deep learning approaches (e.g., neural networks with embeddings). 