## Sentiment Analysis with NLP Binary Classifier 

### Overview
This is a machine learning project in Natural Language Processing. The task is to predict the binary sentiment (label 1 or label -1, remapped to 0 below) of the comments in stock_data.csv.

In [25]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch

### Data Loading

In [26]:
data = pd.read_csv('stock_data.csv')
text = data['Text'].astype("string")
sentiment = data['Sentiment']
# convert -1 in sentiment to 0
sentiment = pd.DataFrame([0 if int(label) == -1 else 1 for label in sentiment])

### Data Exploration

In [27]:
print(text.isnull().sum())
print(sentiment.isnull().sum())
print(sentiment.value_counts())

0
0    0
dtype: int64
1    3685
0    2106
dtype: int64


The data has no missing values, but the classes are imbalanced: 3685 comments with label 1 versus 2106 with label 0, roughly 64% to 36%.
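
To put the imbalance in numbers, here is a small sketch that uses only the counts printed above. It computes the class shares, the accuracy of a hypothetical always-predict-the-majority baseline, and inverse-frequency weights of the form n_samples / (n_classes * class_count), the same heuristic scikit-learn uses for `class_weight='balanced'`. The variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the value_counts() output above.
counts = {1: 3685, 0: 2106}
n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class in the full dataset.
shares = {label: n / n_samples for label, n in counts.items()}

# Accuracy of a trivial baseline that always predicts the majority class.
majority_baseline_accuracy = max(shares.values())

# Inverse-frequency weights: the rarer class gets the larger weight.
balanced_weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

rounded_shares = {label: round(s, 3) for label, s in shares.items()}
rounded_weights = {label: round(w, 3) for label, w in balanced_weights.items()}
print("total samples:", n_samples)
print("class shares:", rounded_shares)
print("majority baseline accuracy:", round(majority_baseline_accuracy, 3))
print("balanced weights:", rounded_weights)
```

A model is only useful here if its accuracy clearly exceeds this majority baseline.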

### Approach 1: Using CountVectorizer and Naive Bayes

### Data Preprocessing

In [28]:
# Split the data into training (70%), validation (12%) and test (18%) sets
train_texts, tmp_texts, train_labels, tmp_labels = train_test_split(
    text, sentiment, test_size=0.3, random_state=42
)
val_texts, test_texts, val_labels, test_labels = train_test_split(
    tmp_texts, tmp_labels, test_size=0.6, random_state=42
)
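
The two-stage split can be checked with a little arithmetic. The sketch below assumes that `train_test_split` rounds the held-out share up to a whole number of samples and gives the remainder to the other side, and that the dataset has the 5791 rows counted earlier. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 5791  # 3685 + 2106 rows
n_train, n_tmp = split_sizes(n_total, 0.3)   # first split: 70% / 30%
n_val, n_test = split_sizes(n_tmp, 0.6)      # second split: 40% / 60% of the 30%

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {n} samples ({n / n_total:.1%})")
```

Under these assumptions the validation and test sizes agree with the support totals (695 and 1043) in the classification reports further down.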


In [29]:
import string
import nltk
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
import re

In [30]:
def preprocess_sentence(sentence):
    # remove urls
    pattern = re.compile(r'https?://\S+|www\.\S+')
    processed_text = pattern.sub(r'', sentence)
    # remove retweet markers and @mentions
    processed_text = re.sub(r'RT[\s]+', '', processed_text)
    processed_text = re.sub(r'@\S+', '', processed_text)
    # remove non-ascii characters
    processed_text = processed_text.encode('ascii', 'ignore').decode()
    # convert to lowercase and tokenize
    processed_text = nltk.word_tokenize(str(processed_text).lower())
    # remove numbers (to be assessed if this is a good idea)
    processed_text = [word for word in processed_text if not word.isdigit()]
    # remove stopwords, punctuation and the word 'user'
    # ({'user'} is the whole word; set('user') would be the letters u, s, e, r)
    stop_words = set.union(set(stopwords.words('english')), set(string.punctuation), {'user'})
    processed_text = [word for word in processed_text if word not in stop_words]
    # lemmatize
    lemmatizer = WordNetLemmatizer()
    def penn2morphy(penntag):
        """ Converts Penn Treebank tags to WordNet. """
        morphy_tag = {'NN':'n', 'JJ':'a',
                    'VB':'v', 'RB':'r'}
        try:
            return morphy_tag[penntag[:2]]
        except KeyError:
            # default to noun for tags without a WordNet equivalent
            return 'n'
    def lemmatize_sent(tokens):
        # Input is a list of already lowercased tokens; returns the lemmatized tokens.
        return [lemmatizer.lemmatize(word, pos=penn2morphy(tag)) for word, tag in pos_tag(tokens)]
    processed_text = lemmatize_sent(processed_text)
    return processed_text

preprocess_sentence(train_texts[0])

['kicker',
 'watchlist',
 'xide',
 'tit',
 'soq',
 'pnk',
 'cpw',
 'bpz',
 'aj',
 'trade',
 'method',
 'method',
 'see',
 'prev',
 'post']
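
The cleaning steps at the top of `preprocess_sentence` can be tried in isolation with only the standard library. The sketch below applies the same three regular expressions to a made-up tweet (the text and the `strip_noise` helper are hypothetical). It also tries a stricter retweet pattern with a word boundary, since `RT[\s]+` matches the end of any word that finishes in "RT".

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def strip_noise(sentence, rt_pattern=r'RT[\s]+'):
    """Remove URLs, retweet markers and @mentions, then split on whitespace."""
    text = URL_PATTERN.sub('', sentence)
    text = re.sub(rt_pattern, '', text)
    text = re.sub(r'@\S+', '', text)
    return text.split()

tweet = "RT @trader AAP breaking out http://example.com/x now"  # made-up example
cleaned = strip_noise(tweet)
print(cleaned)

# Without a word boundary, the retweet pattern also removes the tail of "SMART ".
loose = strip_noise("SMART money buying")
strict = strip_noise("SMART money buying", rt_pattern=r'\bRT\s+')
print(loose)
print(strict)
```

Whether the stricter pattern changes the final scores was not tested here; it is shown only as a possible refinement.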

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer=preprocess_sentence)
train_X = count_vect.fit_transform(train_texts)
val_X = count_vect.transform(val_texts)
test_X = count_vect.transform(test_texts)

In [32]:
train_X_df = pd.DataFrame(train_X.toarray(), columns=count_vect.get_feature_names_out())
features_count = train_X_df.sum(axis=0)
print("Number of words that appeared once: ", len(features_count[features_count == 1]))
print("Number of words that appeared twice: ", len(features_count[features_count == 2]))
print("Number of words that appeared 3 times: ", len(features_count[features_count == 3]))
print("Total number of words: ", len(features_count))

Number of words that appeared once:  4552
Number of words that appeared twice:  1063
Number of words that appeared 3 times:  471
Total number of words:  7668


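There are many low-frequency words: 4552 of the 7668 distinct tokens (about 59%) appear only once, so the vectorizer is refit with min_df=3 to drop rare tokens.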
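
One subtlety: the counts above are total occurrences, whereas `min_df` in `CountVectorizer` filters on document frequency, the number of documents a token appears in. The two differ when a token is repeated within one document, as this toy sketch with made-up token lists shows.

```python
from collections import Counter

# Made-up tokenized documents, standing in for preprocess_sentence output.
docs = [
    ["buy", "buy", "buy", "stock"],
    ["sell", "stock"],
    ["stock", "rally"],
]

# Total occurrences across the corpus (what summing the count matrix gives).
total_count = Counter(token for doc in docs for token in doc)

# Document frequency: each document contributes at most once per token.
doc_freq = Counter(token for doc in docs for token in set(doc))

min_df = 3
kept = sorted(token for token, df in doc_freq.items() if df >= min_df)

print("total counts:", dict(total_count))
print("document frequencies:", dict(doc_freq))
print("kept with min_df=3:", kept)
```

In this toy corpus "buy" occurs three times in total but in only one document, so a document-frequency threshold of 3 drops it.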

In [33]:
count_vect = CountVectorizer(analyzer=preprocess_sentence, min_df=3)
train_X = count_vect.fit_transform(train_texts)
val_X = count_vect.transform(val_texts)
test_X = count_vect.transform(test_texts)

In [34]:
from sklearn.naive_bayes import BernoulliNB

nb_clf = BernoulliNB()
nb_clf.fit(train_X, train_labels[0])

In [35]:
from sklearn.metrics import accuracy_score, f1_score, classification_report

# predictions
predictions_valid = nb_clf.predict(val_X)
print(classification_report(val_labels[0], predictions_valid))

print("Accuracy on validation set: {:.4f}".format(accuracy_score(val_labels[0], predictions_valid)))
print("f1 score on validation set: {:.4f}".format(f1_score(val_labels[0], predictions_valid)))

# test set:
predictions_test = nb_clf.predict(test_X)
print(classification_report(test_labels[0], predictions_test))

print("Accuracy on test set: {:.4f}".format(accuracy_score(test_labels[0], predictions_test)))
print("f1 score on test set: {:.4f}".format(f1_score(test_labels[0], predictions_test)))

              precision    recall  f1-score   support

           0       0.68      0.60      0.63       252
           1       0.78      0.84      0.81       443

    accuracy                           0.75       695
   macro avg       0.73      0.72      0.72       695
weighted avg       0.75      0.75      0.75       695

Accuracy on validation set: 0.7511
f1 score on validation set: 0.8113
              precision    recall  f1-score   support

           0       0.66      0.64      0.65       367
           1       0.81      0.82      0.82       676

    accuracy                           0.76      1043
   macro avg       0.74      0.73      0.73      1043
weighted avg       0.76      0.76      0.76      1043

Accuracy on test set: 0.7593
f1 score on test set: 0.8161


Some other variations were considered, such as keeping numbers during preprocessing and changing the minimum document frequency (min_df). However, the current configuration seems to perform best in terms of both F1 and accuracy.
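
For reference, the F1 score printed above is the harmonic mean of precision and recall for the positive class (label 1). Here is a quick check using the rounded test-set values from the classification report. Because the inputs are rounded to two decimals, the result only approximates the printed 0.8161.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded class-1 values from the Naive Bayes test-set report.
precision_1, recall_1 = 0.81, 0.82
f1_1 = f1_from_precision_recall(precision_1, recall_1)
print(f"approximate F1 for class 1: {f1_1:.3f}")
```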

### Approach 2: Using Fine-tuned DistilBert Model

In [11]:
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import evaluate

PRE_TRAINED_MODEL_NAME = 'distilbert-base-uncased'

# Load the DistilBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Tokenize the training, validation and test sets
train_encodings = tokenizer(train_texts.tolist(), add_special_tokens=True, truncation=True, padding='max_length', return_tensors='pt')
val_encodings = tokenizer(val_texts.tolist(), add_special_tokens=True, truncation=True, padding='max_length', return_tensors='pt')
test_encodings = tokenizer(test_texts.tolist(), add_special_tokens=True, truncation=True, padding='max_length', return_tensors='pt')

class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

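    def __getitem__(self, idx):
        # encodings and labels are already tensors, so index them directly
        # (re-wrapping them in torch.tensor triggers a copy-construct warning)
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item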

    def __len__(self):
        return len(self.labels)


# Convert labels to tensor
train_labels_tensor = torch.tensor(train_labels.values.flatten(), dtype=torch.long)
val_labels_tensor = torch.tensor(val_labels.values.flatten(), dtype=torch.long)
test_labels_tensor = torch.tensor(test_labels.values.flatten(), dtype=torch.long)

# Create PyTorch datasets
train_dataset = SentimentDataset(train_encodings, train_labels_tensor)
val_dataset = SentimentDataset(val_encodings, val_labels_tensor)
test_dataset = SentimentDataset(test_encodings, test_labels_tensor)


f1 = evaluate.load('f1')
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1.compute(predictions=predictions, references=labels)

model = AutoModelForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL_NAME, num_labels=2)

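# to handle class imbalance: weight each class by its inverse frequency
# (class 0 has 2106 samples, class 1 has 3685)
# NOTE: class_weights is not passed to the Trainer below, so training uses the default unweighted loss
weight_for_class_0 = (1 / 2106)
weight_for_class_1 = (1 / 3685)
class_weights = torch.tensor([weight_for_class_0, weight_for_class_1])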

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,  # Adjust as needed
    per_device_eval_batch_size=16,   # Adjust as needed
    weight_decay=0.001,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()

results = trainer.evaluate()
print(results)

predictions = trainer.predict(test_dataset)
print(predictions.predictions)


{'input_ids': tensor([[  101,  1040,  2361,  ...,     0,     0,     0],
        [  101, 25430,  2100,  ...,     0,     0,     0],
        [  101,  3504,  2066,  ...,     0,     0,     0],
        ...,
        [  101, 19387,  1030,  ...,     0,     0,     0],
        [  101,  3795, 15768,  ...,     0,     0,     0],
        [  101,  6522,  4160,  ...,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 0.4186, 'learning_rate': 7.874015748031496e-07, 'epoch': 1.97}




{'train_runtime': 14295.2869, 'train_samples_per_second': 0.567, 'train_steps_per_second': 0.036, 'train_loss': 0.4156319001528222, 'epoch': 2.0}



{'eval_loss': 0.44967523217201233, 'eval_f1': 0.851063829787234, 'eval_runtime': 425.9066, 'eval_samples_per_second': 1.632, 'eval_steps_per_second': 0.103, 'epoch': 2.0}



[[ 1.1218294  -1.1930736 ]
 [ 0.49513736 -0.4262538 ]
 [-1.2042898   1.3345178 ]
 ...
 [-2.2954617   2.4387577 ]
 [-0.32987761  0.46210662]
 [-2.2696893   2.3852952 ]]
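
The array printed above holds raw logits, one column per class, and `np.argmax` in the next cell picks the larger one. If class probabilities are wanted instead, a softmax can be applied to each row. This is a minimal standard-library sketch using the first and third rows shown above; the `softmax` helper is illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

rows = [
    [1.1218294, -1.1930736],   # first row of the printed predictions
    [-1.2042898, 1.3345178],   # third row of the printed predictions
]

for logits in rows:
    probs = softmax(logits)
    predicted = probs.index(max(probs))
    print(f"logits={logits} -> P(class 1)={probs[1]:.3f}, predicted class {predicted}")
```

The predicted class is the same as with `argmax` on the logits, since softmax preserves their ordering.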


In [21]:
from sklearn.metrics import accuracy_score, f1_score, classification_report

predictions = np.argmax(predictions.predictions, axis=1)
print(classification_report(test_labels_tensor, predictions))
print("Accuracy on test set: {:.4f}".format(accuracy_score(test_labels_tensor, predictions)))
print("f1 score on test set: {:.4f}".format(f1_score(test_labels_tensor, predictions)))

              precision    recall  f1-score   support

           0       0.74      0.73      0.73       367
           1       0.86      0.86      0.86       676

    accuracy                           0.81      1043
   macro avg       0.80      0.80      0.80      1043
weighted avg       0.81      0.81      0.81      1043

Accuracy on test set: 0.8140
f1 score on test set: 0.8567


### Conclusion

On the test set, the CountVectorizer and Naive Bayes model has an F1 score of 0.816, while the DistilBERT model has an F1 score of 0.857. DistilBERT therefore performs slightly better, by about 0.04 in absolute terms (roughly 5% relative). However, the DistilBERT model took significantly longer to train: about 14,295 seconds (roughly 4 hours) for 2 epochs.

The trend is similar in terms of accuracy: DistilBERT still performed better, at 0.814 versus 0.759 for Naive Bayes.
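
The size of the gap depends on whether it is read as an absolute or a relative difference. A short sketch using the four test-set scores printed earlier:

```python
# Test-set scores copied from the outputs above.
scores = {
    "f1": {"naive_bayes": 0.8161, "distilbert": 0.8567},
    "accuracy": {"naive_bayes": 0.7593, "distilbert": 0.8140},
}

gaps = {}
for metric, s in scores.items():
    absolute = s["distilbert"] - s["naive_bayes"]
    relative = absolute / s["naive_bayes"]
    gaps[metric] = (absolute, relative)
    print(f"{metric}: +{absolute:.4f} absolute, +{relative:.1%} relative")
```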

One thing to note is that the DistilBERT run reports F1 rather than accuracy as its evaluation metric (the Trainer itself minimises cross-entropy loss either way). However, the requirement of the project was to maximise accuracy, so the DistilBERT accuracy could possibly be higher if the model and hyperparameters were selected on validation accuracy instead.
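
One way to track accuracy during fine-tuning is to have the metrics function report it alongside F1. The sketch below is a dependency-free illustration of what such a function would compute from logits and labels. `accuracy_and_f1` is a hypothetical helper. Wiring it into `compute_metrics`, and selecting checkpoints on it, is an assumption about how the notebook might be extended.

```python
def accuracy_and_f1(logits, labels, positive=1):
    """Compute accuracy and positive-class F1 from per-class logits."""
    # argmax over each row of logits gives the predicted class
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    pairs = list(zip(preds, labels))
    correct = sum(int(p == y) for p, y in pairs)
    tp = sum(int(p == positive and y == positive) for p, y in pairs)
    fp = sum(int(p == positive and y != positive) for p, y in pairs)
    fn = sum(int(p != positive and y == positive) for p, y in pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}

# Tiny made-up batch: four examples, two classes.
logits = [[1.1, -1.2], [0.5, -0.4], [-1.2, 1.3], [-0.3, 0.5]]
labels = [0, 1, 1, 1]
metrics = accuracy_and_f1(logits, labels)
print(metrics)
```

With the Hugging Face `Trainer`, keys returned from `compute_metrics` appear in the evaluation log with an `eval_` prefix, as `eval_f1` does above. A setting such as `metric_for_best_model` in `TrainingArguments` could then point at the accuracy key.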
