# **PROBLEM STATEMENT**

To build a sentiment classification model over given HuggingFace dataset: **carblacac/twitter-sentiment-analysis**

# **DATASET DETAILS**

**Link:** https://huggingface.co/datasets/carblacac/twitter-sentiment-analysis

**Dataset Structure:** Dataset consists of 2 columns: text and feeling. 'text' is a sentence/tweet. 'feeling' is a binary classification (0/1) of the sentiment of the sentence. 0 indicates negative sentiment and 1 indicates a positive sentiment. <br/>

**Size:** Contains 1,578,627 classified tweets, each row is marked as 1 for positive sentiment and 0 for negative sentiment.

**Language:** English (Monolingual) <br/>

**Sources:**
1.   University of Michigan Sentiment Analysis competition on Kaggle
2.   Twitter Sentiment Corpus by Niek Sanders

Training subset has been divided in two smallest datasets, train (80%) and validation (20%)

# Installing Libraries

In [49]:
!pip install -Uqq huggingface_hub
!pip install -Uqq datasets
!pip install -Uqq scikit-learn pandas

# Importing Libraries

In [50]:
from datasets import load_dataset
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix

# Loading Dataset

In [51]:
dataset = load_dataset("carblacac/twitter-sentiment-analysis")

In [52]:
train_data = dataset['train'].to_pandas()
test_data = dataset['test'].to_pandas()
validation_data = dataset['validation'].to_pandas()

In [53]:
train_data.head(10)
# 1 - POSITIVE, 0 - NEGATIVE

Unnamed: 0,text,feeling
0,@fa6ami86 so happy that salman won. btw the 1...,0
1,@phantompoptart .......oops.... I guess I'm ki...,0
2,@bradleyjp decidedly undecided. Depends on the...,1
3,@Mountgrace lol i know! its so frustrating isn...,1
4,@kathystover Didn't go much of any where - Lif...,1
5,@TashaWilson like questions she asks me the da...,1
6,@lisastarlynn I haven't heard anything. I'll t...,0
7,@SusanCosmos @speakgirl Thx 4 sharing!,1
8,"@lamere thank you so much, looking at these pi...",0
9,"not it teh best form today, dont no why, just ...",0


In [54]:
# Get training and testing splits

X_train = train_data['text']
y_train = train_data['feeling']
X_test = test_data['text']
y_test = test_data['feeling']

# **1. NAIVE BAYES CLASSIFIER**

**Metholody:**
1.   Build a scikit-learn pipeline for the sentiment analysis model
2.   Use CountVectorizer for feature extraction, converting text data into a numerical format
3. Choose the classification algorithm MultinomialNB (Naive Bayes), as the classifier


**Experimental Set-Up and Evaluation Metrics:**
1.   Confusion matrix and Classification Report
2.   Accuracy
3.   Precision
4.   Recall
5.   F1 Score





In [55]:
# Tokenization
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Training Naives Bayes Classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_vec, y_train)

In [56]:
# Building a pipeline

nb_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])

In [57]:
nb_clf.fit(X_train, y_train)

In [58]:
y_pred = nb_clf.predict(X_test)

In [59]:
print("Classification Report for NAIVE BAYES CLASSIFIER: ")
print(classification_report(y_test, y_pred))

Classification Report for NAIVE BAYES CLASSIFIER: 
              precision    recall  f1-score   support

           0       0.74      0.82      0.78     30969
           1       0.80      0.71      0.75     31029

    accuracy                           0.76     61998
   macro avg       0.77      0.76      0.76     61998
weighted avg       0.77      0.76      0.76     61998



In [60]:
print("Confusion Matrix for NAIVE BAYES CLASSIFIER:")
print(confusion_matrix(y_test, y_pred))

Confusion Matrix for NAIVE BAYES CLASSIFIER:
[[25287  5682]
 [ 8894 22135]]


In [61]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"EVALUATION METRICS FOR NAIVE BAYES CLASSIFIER: ")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

EVALUATION METRICS FOR NAIVE BAYES CLASSIFIER: 
Accuracy: 0.76
Precision: 0.80
Recall: 0.71
F1-score: 0.75


# **2. SUPPORT VECTOR MACHINE**

**Metholody:**
1.   Build a scikit-learn pipeline for the sentiment analysis model
2.   Use CountVectorizer for feature extraction, converting text data into a numerical format
3.   Use SVM as the classifier


**Experimental Set-Up and Evaluation Metrics:**
1.   Confusion matrix and Classification Report
2.   Accuracy
3.   Precision
4.   Recall
5.   F1 Score

In [62]:
# Building a pipeline

svm_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None)),
])

In [63]:
svm_clf.fit(X_test, y_test)

In [64]:
predicted = svm_clf.predict(X_test)

In [65]:
print("Classification Report for SVM CLASSIFIER: ")
print(classification_report(y_test, predicted))

Classification Report for SVM CLASSIFIER: 
              precision    recall  f1-score   support

           0       0.81      0.74      0.77     30969
           1       0.76      0.83      0.79     31029

    accuracy                           0.78     61998
   macro avg       0.79      0.78      0.78     61998
weighted avg       0.79      0.78      0.78     61998



In [66]:
print("Confusion Matrix for SVM CLASSIFIER: ")
print(confusion_matrix(y_test, predicted))

Confusion Matrix for SVM CLASSIFIER: 
[[22789  8180]
 [ 5295 25734]]


In [67]:
accuracy = accuracy_score(y_test, predicted)
precision = precision_score(y_test, predicted)
recall = recall_score(y_test, predicted)
f1 = f1_score(y_test, predicted)

print(f"EVALUATION METRICS FOR SVM CLASSIFIER: ")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

EVALUATION METRICS FOR SVM CLASSIFIER: 
Accuracy: 0.78
Precision: 0.76
Recall: 0.83
F1-score: 0.79


# **3. Bi-LSTM MODEL**

**Metholody:**
1.   Tokenize and pad the training and testing data
2.   Employ a Bidirectional LSTM layer to capture both forward and backward contextual information in the input sequences
3.   Follow it by a dense output layer with a single neuron and a sigmoid activation function
4.   Compile the model using the Adam optimizer, which adapts learning rates during training
5.   Employ binary cross-entropy as the loss function, suitable for binary classification problems
6.   Monitor the evaluation metrics for the Bi-LSTM model

**Experimental Set-Up and Evaluation Metrics:**
1.   Confusion matrix and Classification Report
2.   Accuracy
3.   Precision
4.   Recall
5.   F1 Score

In [80]:
# Import necessary libraries
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

In [81]:
# Tokenize and pad the text data

max_words = 100000
max_len = 100

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)

In [70]:
# Build the BiLSTM model

embedding_dim = 100
lstm_units = 64

model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_len))
model.add(Bidirectional(LSTM(units=lstm_units)))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [78]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 100)          1000000   
                                                                 
 bidirectional (Bidirection  (None, 128)               84480     
 al)                                                             
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1084609 (4.14 MB)
Trainable params: 1084609 (4.14 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [71]:
# Train the model
model.fit(X_train_pad, y_train, epochs=5, batch_size=32, validation_split=0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x780520095fc0>

In [72]:
# Evaluate the model on the test set
y_pred_bilstm = model.predict(X_test_pad)
ypred_model = np.round(y_pred_bilstm)

accuracy = accuracy_score(y_test, ypred_model)
print(f'Test Accuracy: {accuracy}')

Test Accuracy: 0.7792832026839576


In [77]:
import pickle
pickle.dump(model , open('sentiment-analysis-bilstm.pk1' , 'wb'))

In [74]:
print("Classification Report for Bi-LSTM MODEL: ")
print(classification_report(y_test, ypred_model))

Classification Report for Bi-LSTM MODEL: 
              precision    recall  f1-score   support

           0       0.78      0.79      0.78     30969
           1       0.78      0.77      0.78     31029

    accuracy                           0.78     61998
   macro avg       0.78      0.78      0.78     61998
weighted avg       0.78      0.78      0.78     61998



In [79]:
print("Confusion Matrix for Bi-LSTM MODEL: ")
print(confusion_matrix(y_test, ypred_model))

Confusion Matrix for Bi-LSTM MODEL: 
[[24318  6651]
 [ 7033 23996]]


In [76]:
accuracy = accuracy_score(y_test, ypred_model)
precision = precision_score(y_test, ypred_model)
recall = recall_score(y_test, ypred_model)
f1 = f1_score(y_test, ypred_model)

print(f"EVALUATION METRICS FOR Bi-LSTM MODEL: ")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

EVALUATION METRICS FOR Bi-LSTM MODEL: 
Accuracy: 0.78
Precision: 0.78
Recall: 0.77
F1-score: 0.78


# **RESULTS AND DISCUSSIONS**

Summary of evaluation metrics for above 3 methods:

**1. NAIVE BAYES** <br>
Accuracy: 0.76 <br>
Precision: 0.80 <br>
Recall: 0.71 <br>
F1-score: 0.75 <br>

**2. SVM** <br>
Accuracy: 0.78 <br>
Precision: 0.76 <br>
Recall: 0.83 <br>
F1-score: 0.79 <br>

**3. Bi-LSTM** <br>
Accuracy: 0.78 <br>
Precision: 0.78 <br>
Recall: 0.77 <br>
F1-score: 0.78 <br>

# **CONCLUSION**