# Mini Project: Naive Bayes for NLP

Written by Adam Ten Hoeve  
COMP 4448 - Data Science Tools 2  
Summer 2021

In [16]:
# Load necessary libraries
import numpy as np
import pandas as pd
import re

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score

from nltk.corpus import stopwords

Find some text data of your choice; it could be labelled tweets, for example. 
Your dataset should have at least 200 instances, and if there are several columns of text, you can choose to merge the text columns into a single text column.   

Clean the data, split it, transform it into a representation suitable for your algorithm, build your model, and evaluate it. Tune some parameters of interest and write a short report covering the problem your mini project addresses, a description of your data, the choice of algorithm, its performance, overfitting, the hyperparameters tuned, and your recommendation or conclusion (imagine you are recommending this algorithm to a stakeholder, so the report should include important and persuasive elements). Your report can be one or two paragraphs and should include relevant code and output at the end. 

Dataset can be found here: https://www.kaggle.com/uciml/sms-spam-collection-dataset

In [3]:
# Load the Spam vs Ham dataset
df = pd.read_csv("spam.csv", encoding='ISO-8859-1')
# Remove unneeded columns
df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1, inplace=True)
# Rename columns
df.columns = ["label", "text"]
df.head()

,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [6]:
# Clean the data
# Remove trailing and leading whitespace, then convert everything to lowercase
df["text"] = df["text"].apply(lambda x: x.strip().lower())
# Remove punctuation and special characters
df["text"] = df["text"].apply(lambda x: re.sub(r"[^\w ]", "", x))
# Remove stopwords (stored in a set so each membership check is O(1))
cached_stop_words = set(stopwords.words("english"))
df["text"] = df["text"].apply(lambda x: " ".join(word for word in x.split() if word not in cached_stop_words))
df.head()

,label,text
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say early hor u c already say
4,ham,nah dont think goes usf lives around though
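
One thing the output above shows is that "dont" survives the stopword filter: punctuation is stripped first, so "don't" becomes "dont" and no longer matches the NLTK entry "don't". A minimal sketch of the alternative order (filter stopwords first, then strip punctuation) follows. The `clean` helper, the tiny `STOPWORDS` set, and the sample sentence are all made up for illustration; the notebook itself uses the full NLTK English list.

```python
import re

# Illustrative subset only; the notebook uses stopwords.words("english") from NLTK
STOPWORDS = {"i", "don't", "he's", "at", "so"}

def clean(text, stopwords=STOPWORDS):
    """Lowercase, drop stopwords, then strip punctuation (hypothetical helper)."""
    text = text.strip().lower()
    # Compare each token with surrounding punctuation removed,
    # but keep inner apostrophes so contractions still match the stopword list
    kept = [w for w in text.split() if w.strip(".,!?;:\"'") not in stopwords]
    # Only now remove punctuation and special characters
    return re.sub(r"[^\w ]", "", " ".join(kept))

cleaned = clean("I don't think he's at home, so call later!")
print(cleaned)
```

Whether this ordering helps the classifier is an open question here; it only changes which tokens reach the vectorizer.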


In [9]:
# Convert the text to TF-IDF features (note: the vectorizer is fit on the full dataset, before the train/test split)
vectorizer = TfidfVectorizer()
text_tfidf = vectorizer.fit_transform(df["text"])

In [10]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(text_tfidf, df["label"], test_size=0.2, random_state=42)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(4457, 9404)
(4457,)
(1115, 9404)
(1115,)
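
Because the vectorizer was fit on every message before splitting, the test set's vocabulary and IDF weights are already baked into the training features. A common way to avoid this, sketched here under assumptions, is to split the raw text first and wrap `TfidfVectorizer` and `MultinomialNB` in a scikit-learn `Pipeline` so the vectorizer only sees training data. The toy `texts` and `labels` below are invented stand-ins for `df["text"]` and `df["label"]`; this is not the code that produced the results in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for the cleaned df["text"] / df["label"] columns
texts = [
    "free entry win cash prize now", "call now claim free prize",
    "win free tickets text now", "urgent claim cash reward free",
    "ok see you later", "going home now talk later",
    "dont think he lives around here", "lunch tomorrow sounds good",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text first, so the test messages stay unseen
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training text only
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
pipe.fit(text_train, y_train)

# At predict time the test text is transformed with the training vocabulary
test_preds = pipe.predict(text_test)
n_features = len(pipe.named_steps["tfidf"].vocabulary_)
print(test_preds, n_features)
```

With this layout the feature count reflects the training vocabulary alone, so it would likely be somewhat smaller than the 9404 columns reported above.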


In [17]:
# Fit a Naive Bayes algorithm to the text tfidf data
nb_clf = MultinomialNB()
# Grid search over the smoothing parameter alpha (GridSearchCV uses 5-fold cross-validation by default)
param_grid = {"alpha": np.arange(0.05, 1.2, 0.05)}
grid = GridSearchCV(nb_clf, param_grid).fit(X_train, y_train)
# Find the best model from the gridsearch
best_nb = grid.best_estimator_
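
The cell above keeps the best estimator but never shows which `alpha` won or how close the other candidates were. The sketch below shows one way to inspect a fitted `GridSearchCV` through its `best_params_`, `best_score_`, and `cv_results_` attributes. It runs on a small invented corpus with `cv=3` so that it is self-contained; the scores it prints say nothing about the SMS dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Made-up messages; in the notebook these would be X_train and y_train
spam = ["free entry win cash prize now", "call now claim free prize",
        "win free tickets text now", "urgent claim cash reward free",
        "free ringtone reply now win", "claim your free prize call today"]
ham = ["ok see you later", "going home now talk later",
       "dont think he lives around here", "lunch tomorrow sounds good",
       "can you call me later", "see you at home tonight"]
texts = spam + ham
labels = ["spam"] * len(spam) + ["ham"] * len(ham)

X = TfidfVectorizer().fit_transform(texts)
param_grid = {"alpha": np.arange(0.05, 1.2, 0.05)}
# cv=3 only because the toy corpus is tiny
grid = GridSearchCV(MultinomialNB(), param_grid, cv=3).fit(X, labels)

# The winning hyperparameter and its mean cross-validated accuracy
print("Best alpha:", grid.best_params_["alpha"])
print("Mean CV accuracy:", grid.best_score_)

# Full table of candidates, best first
results = pd.DataFrame(grid.cv_results_)[
    ["param_alpha", "mean_test_score", "std_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False).head())
```

Reporting the chosen `alpha` alongside the spread of scores makes it easier to judge whether the tuning mattered or whether most values performed about the same.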

In [18]:
# Find the accuracy on the training and test sets
# Training set
train_preds = best_nb.predict(X_train)
print("Accuracy on the training set:", accuracy_score(y_train, train_preds))
# Test set
test_preds = best_nb.predict(X_test)
print("Accuracy on the test set:", accuracy_score(y_test, test_preds))

Accuracy on the training set: 0.9955126766883554
Accuracy on the test set: 0.9838565022421525


In [19]:
# Find the F1 score on the training and test sets
print("F1 score on the training set:", f1_score(y_train, train_preds, pos_label="ham"))
print("F1 score on the test set:", f1_score(y_test, test_preds, pos_label="ham"))

F1 score on the training set: 0.9974160206718347
F1 score on the test set: 0.9907407407407408
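
The F1 scores above treat "ham" as the positive class. Since ham is the majority label, that score can stay high even when a fair share of spam is missed, so it may be worth also reporting F1 with "spam" as the positive class together with a confusion matrix. The sketch below uses invented `y_true` and `y_pred` lists to show the difference; the numbers are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels: 6 ham and 4 spam, with half of the spam misclassified as ham
y_true = ["ham"] * 6 + ["spam"] * 4
y_pred = ["ham"] * 6 + ["spam", "spam", "ham", "ham"]

# F1 with each class treated as the positive class
f1_ham = f1_score(y_true, y_pred, pos_label="ham")
f1_spam = f1_score(y_true, y_pred, pos_label="spam")
print("F1 (ham as positive): ", f1_ham)
print("F1 (spam as positive):", f1_spam)

# Rows are true labels, columns are predictions, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
print(cm)
```

In this toy case the ham-positive F1 looks much healthier than the spam-positive one, which is the gap a stakeholder interested in catching spam would want to see.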
