Hello! This is the notebook for a group project I created for a Data Science course. We used a publicly available dataset of 5,572 text messages from Kaggle, and the aim was to train a Naive Bayes model to predict whether a text message is spam.

The first step is to download the dataset, which has been saved as a .csv on Google Drive for easier download.

In [None]:
!gdown --id 1htegPIcvPH_maI2nP0bQjauzHLwYlJQO

Downloading...
From: https://drive.google.com/uc?id=1htegPIcvPH_maI2nP0bQjauzHLwYlJQO
To: /content/spam.csv
100% 504k/504k [00:00<00:00, 56.0MB/s]


Next, we take a look at the dataset.

In [None]:
import pandas as pd
sms = pd.read_csv("spam.csv", encoding='latin-1')
sms

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


The extra columns (Unnamed: 2 to Unnamed: 4) contain NaN values, so they are dropped, and the two remaining columns are renamed for easier reference.

The summary below shows that there are 4,825 legitimate (ham) messages and 747 spam messages.

In [None]:
sms.dropna(inplace = True, axis = 1)
sms.columns = ["label", "msg"]
sms.groupby("label").describe()

Unnamed: 0_level_0,msg,msg,msg,msg
Unnamed: 0_level_1,count,unique,top,freq
label,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
ham,4825,4516,"Sorry, I'll call later",30
spam,747,653,Please call our customer service representativ...,4
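
The class counts above also give a useful reference point for judging any model trained on this data. As a quick sketch using only the two counts from the table (the variable names are illustrative), the share of spam and the accuracy of a trivial classifier that always answers "ham" can be worked out like this:

```python
# Class counts taken from the groupby output above.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# Share of each class in the dataset.
spam_share = spam_count / total
ham_share = ham_count / total

# A classifier that always predicts "ham" is right on every ham message,
# so its accuracy equals the ham share.
majority_baseline_accuracy = ham_share

print(f"total messages: {total}")
print(f"spam share: {spam_share:.1%}")
print(f"always-ham baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Any accuracy figure is easier to interpret when compared against this baseline rather than against 50%.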


Next, we encode the label as a binary variable: 0 for ham and 1 for spam.

In [None]:
sms["label_sign"] = sms.label.map({"ham" : 0, "spam" : 1})
sms.head()

Unnamed: 0,label,msg,label_sign
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


Using Python's string module and the NLTK library, punctuation and stopwords are removed from the messages so that they do not add noise to the word counts. The example below first strips the punctuation from a single message.

In [None]:
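import string
import nltk
from nltk.corpus import stopwords

# Fetch the stopword list once; it is needed before stopwords.words() can be called
nltk.download('stopwords', quiet=True)

mess = "Nah I don't think he goes to usf, he lives around here though"
# Keep every character that is not punctuation, then join back into a string
nopunc = [char for char in mess if char not in string.punctuation]
nopunc = ''.join(nopunc)
nopunc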

'Nah I dont think he goes to usf he lives around here though'

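After punctuation is removed, each message is tokenized with a simple whitespace split, and any token found in NLTK's English stopword list is dropped. The same steps are then wrapped in a clean_word function.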

In [None]:
# Tokenize on whitespace and drop English stopwords
word_tokens = nopunc.split()
stopwords_ac = stopwords.words('english')
filtered_sentence = []
for w in word_tokens:
    if w not in stopwords_ac:
        filtered_sentence.append(w)

print(filtered_sentence)

def clean_word(mess):
    """Strip punctuation and English stopwords from a message."""
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    word_tokens = nopunc.split()
    stopwords_ac = stopwords.words('english')
    filtered_sentence = []
    for w in word_tokens:
        if w not in stopwords_ac:
            filtered_sentence.append(w)
    return ' '.join(filtered_sentence)

['Nah', 'I', 'dont', 'think', 'goes', 'usf', 'lives', 'around', 'though']
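
One detail in the output above is that 'I' survives the filter: NLTK's English stopword list is lowercase and the membership check is case-sensitive. If that is not wanted, one option is to lowercase each token before the lookup. The sketch below illustrates the idea; `SMALL_STOPWORDS` and `clean_word_ci` are hypothetical names, and the tiny hand-written set only stands in for NLTK's list so the example runs without any downloads. Switching to this variant would change the features, so the results later in the notebook would not carry over as-is.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); NLTK's real list is longer.
SMALL_STOPWORDS = {"i", "he", "to", "here", "the", "a"}

def clean_word_ci(mess, stop_words=SMALL_STOPWORDS):
    """Remove punctuation, then drop stopwords using a case-insensitive lookup."""
    nopunc = "".join(ch for ch in mess if ch not in string.punctuation)
    # Compare the lowercased token, but keep the original casing in the output.
    kept = [w for w in nopunc.split() if w.lower() not in stop_words]
    return " ".join(kept)

example = "Nah I don't think he goes to usf, he lives around here though"
print(clean_word_ci(example))
```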


Here, we apply the clean_word function to every message in the dataframe to remove punctuation and stopwords, then explore the most common words in the spam texts.

In [None]:
from collections import Counter
sms["clean_msg"] = sms.msg.apply(clean_word)
# Lowercase and tokenize the cleaned spam messages
words = sms[sms.label=='spam'].clean_msg.apply(lambda x: [word.lower() for word in x.split()])

spam_words = Counter()

for msg in words:
    spam_words.update(msg)

print(spam_words.most_common(5))

[('call', 347), ('free', 216), ('2', 173), ('txt', 150), ('u', 147)]


In [None]:
sms.head()

Unnamed: 0,label,msg,label_sign,clean_msg
0,ham,"Go until jurong point, crazy.. Available only ...",0,Go jurong point crazy Available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,0,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1,Free entry 2 wkly comp win FA Cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,0,U dun say early hor U c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...",0,Nah I dont think goes usf lives around though


We chose the sklearn library because it makes building a simple Naive Bayes model straightforward. First, we split the dataset into training and testing sets; with the default settings, 25% of the messages are held out for testing.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(sms.clean_msg, sms.label_sign, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4179,)
(1393,)
(4179,)
(1393,)


We used a CountVectorizer to turn the messages into word-count vectors. The vocabulary is fitted on the training set only, and both the training and testing sets are then transformed with it.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# instantiate the vectorizer
vect = CountVectorizer()
vect.fit(X_train)
X_train_dim = vect.transform(X_train)
X_test_dim = vect.transform(X_test)
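
To make the difference between fitting and transforming concrete, here is a minimal pure-Python sketch of a bag-of-words count. It is not CountVectorizer's actual implementation: `fit_vocabulary` and `transform_counts` are hypothetical helpers, the toy texts are made up, and tokenization is a plain lowercase whitespace split, which is simpler than what CountVectorizer does by default. The point it illustrates is that the vocabulary comes from the training texts only, so a word seen only at test time has no column and is ignored.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Map each distinct lowercase token in the training texts to a column index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform_counts(texts, vocabulary):
    """Turn each text into a list of per-column word counts."""
    rows = []
    for text in texts:
        counts = Counter(text.lower().split())
        # Tokens missing from the vocabulary simply never get looked up.
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_texts = ["free entry win", "ok see you later", "win win prize"]
test_texts = ["free prize tomorrow"]  # "tomorrow" never appears in training

vocab = fit_vocabulary(train_texts)
train_matrix = transform_counts(train_texts, vocab)
test_matrix = transform_counts(test_texts, vocab)

print(vocab)
print(test_matrix)
```

Fitting on the training set alone also keeps information from the test messages out of the features, which is why the notebook calls fit only on X_train.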

The multinomial Naive Bayes model reached 98.6% accuracy on the test set. This is quite high, however, and may partly reflect a well-curated and commonly used dataset. Because only about 13% of the messages are spam, accuracy alone should also be read with some care.

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
clf_mnb = MultinomialNB(alpha = 0.2)

clf_mnb.fit(X_train_dim, y_train)

y_test_pd = clf_mnb.predict(X_test_dim)
metrics.accuracy_score(y_test, y_test_pd)

0.9856424982053122
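
For intuition about what the `alpha` parameter does, the following is a small from-scratch sketch of multinomial Naive Bayes with additive smoothing. It is a teaching aid under simplifying assumptions, not scikit-learn's code: the toy messages and the `train_nb` and `predict_nb` functions are all made up for illustration. With smoothing, a word never seen in a class gets a small nonzero probability instead of zeroing out the whole product.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=0.2):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    classes = sorted(set(labels))
    log_prior, log_like = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        denom = sum(counts.values()) + alpha * len(vocab)
        # Additive smoothing: every vocabulary word gets alpha pseudo-counts.
        log_like[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_like

def predict_nb(text, log_prior, log_like):
    """Return the class with the highest posterior log score."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in text.split() if w in log_like[c]
        )
    return max(scores, key=scores.get)

# Toy data: 0 = ham, 1 = spam (same coding as label_sign).
docs = ["free prize call now", "call me later", "win free cash", "see you later"]
labels = [1, 0, 1, 0]

log_prior, log_like = train_nb(docs, labels, alpha=0.2)
print(predict_nb("free cash now", log_prior, log_like))
print(predict_nb("see you later", log_prior, log_like))
```

A smaller `alpha` trusts the observed counts more, while a larger one pulls all word probabilities toward uniform.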

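We saved the trained model to a .joblib file so that it can be loaded when deploying it behind an API. The fitted vectorizer (vect) would need to be saved as well, since the API has to transform incoming text the same way before calling the model.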

In [None]:
from joblib import dump, load
dump(clf_mnb, 'filename.joblib') 
clf_mnb = load('filename.joblib') 

['filename.joblib']