# Spam classification using Naive Bayes Algorithm
Raw dataset here: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection  
Adapted from: https://github.com/udacity/machine-learning/blob/master/projects/practice_projects/naive_bayes_tutorial/Naive_Bayes_tutorial.ipynb

Table of Contents:
1. Prepare data
2. Train model
3. Predict
4. Evaluate model

# 1 Prepare data

## 1.1 Load data

In [23]:
import pandas as pd

df = pd.read_table('data/smsspamcollection/SMSSpamCollection', 
                   sep='\t', 
                   header=None, 
                   names=['label', 'sms_message'])

print('No. of rows:', df.shape[0])
print('No. of columns:', df.shape[1])
df.head()

('No. of rows:', 5572)
('No. of columns:', 2)


|   | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## 1.2 Convert labels to numerical values

In [27]:
# convert label strings to numerical values (ham -> 0, spam -> 1)
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()

|   | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
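
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so an unexpected label would silently become a missing value instead of raising an error. A minimal sketch of a sanity check, using a made-up `labels` series (hypothetical, not the real dataset):

```python
import pandas as pd

# Hypothetical labels; 'Spam' (capitalised) is deliberately not a key in the mapping.
labels = pd.Series(['ham', 'spam', 'Spam'])
mapped = labels.map({'ham': 0, 'spam': 1})

# Values missing from the dictionary are mapped to NaN rather than raising an error.
n_unmapped = int(mapped.isna().sum())
print('Unmapped labels: {}'.format(n_unmapped))
```

Running the same `isna().sum()` check on `df['label']` after the mapping would confirm that every row was converted.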


## 1.3 Create training and testing datasets

In [28]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Total no. of data points: {}'.format(df.shape[0]))
print('No. of data points in training dataset: {}'.format(X_train.shape[0]))
print('No. of data points in testing dataset: {}'.format(X_test.shape[0]))

Total no. of data points: 5572
No. of data points in training dataset: 4179
No. of data points in testing dataset: 1393
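
The 4179/1393 split follows from the default `test_size` of 0.25 that `train_test_split` uses when neither `test_size` nor `train_size` is given. A minimal sketch on placeholder data (a list of 5572 integers standing in for the messages, an assumption for illustration) that makes the proportion explicit:

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 5572 integers standing in for the 5572 messages.
items = list(range(5572))

# test_size=0.25 matches the default used when no size is passed.
train_items, test_items = train_test_split(items, test_size=0.25, random_state=1)

print('Training items: {}'.format(len(train_items)))
print('Testing items: {}'.format(len(test_items)))
```

Passing `test_size` explicitly keeps the split proportion visible to the reader rather than relying on the library default.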


## 1.4 Extract features by converting text data into a BoW representation

Represent each message as a row in a matrix of tf-idf weighted word frequencies.

In [29]:
# Use TfidfVectorizer, which is equivalent to using CountVectorizer and then applying a tf-idf transformation (TfidfTransformer)
from sklearn.feature_extraction.text import TfidfVectorizer

text_vector = TfidfVectorizer(stop_words='english')

# Fit and transform the training data into a matrix of tf-idf weighted token counts.
training_data = text_vector.fit_transform(X_train)

# Transform the testing data. Note that we only transform, and do not fit, the testing data,
# so the vocabulary and idf weights come from the training data alone.
testing_data = text_vector.transform(X_test)
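
To illustrate why the vectorizer is fitted on the training data only, here is a minimal sketch on a toy corpus. The messages and the `toy_*` names are hypothetical and are not taken from the SMS dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the SMS dataset.
toy_train = ['win a free prize', 'free entry to win cash', 'see you at lunch']
toy_test = ['win a lottery']

toy_vector = TfidfVectorizer(stop_words='english')
toy_train_matrix = toy_vector.fit_transform(toy_train)  # learns vocabulary and idf weights
toy_test_matrix = toy_vector.transform(toy_test)        # reuses them unchanged

# Both matrices have one column per vocabulary term learned from the training text.
print(toy_train_matrix.shape[1] == toy_test_matrix.shape[1])

# 'lottery' never appeared in the training text, so it has no column and is ignored.
print('lottery' in toy_vector.vocabulary_)
```

Because the test matrix shares the training vocabulary, its columns line up with the features the classifier was trained on. Words seen only at test time contribute nothing.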

# 2 Train model

In [30]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
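
As a rough illustration of what the fitted model does, the sketch below trains `MultinomialNB` on four made-up messages. For simplicity it swaps in plain word counts from `CountVectorizer` instead of the tf-idf features used above. The messages, labels and `toy_*` names are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = spam, 0 = ham.
toy_messages = ['win free prize', 'free entry win cash',
                'see you at lunch', 'are you coming home']
toy_labels = [1, 1, 0, 0]

# Plain word counts (a simplification of the tf-idf features used in the notebook).
toy_counts = CountVectorizer()
toy_features = toy_counts.fit_transform(toy_messages)

# alpha=1.0 is Laplace smoothing: every word gets a pseudo-count of 1 in each class,
# so a word unseen in one class does not force that class's probability to zero.
toy_model = MultinomialNB(alpha=1.0)
toy_model.fit(toy_features, toy_labels)

new_messages = ['free cash prize', 'lunch at home']
toy_predictions = toy_model.predict(toy_counts.transform(new_messages))
print(list(toy_predictions))
```

On this toy data the first new message shares its words with the spam examples and the second with the ham examples, so the model should label them spam and ham respectively.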

# 3 Predict

In [31]:
predictions = naive_bayes.predict(testing_data)

# 4 Evaluate model

In [41]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', accuracy_score(y_test, predictions))
print('Precision score: ', precision_score(y_test, predictions))
print('Recall score: ', recall_score(y_test, predictions))
print('F1 score: ', f1_score(y_test, predictions))

print('\nConclusion: Every message that the Naive Bayes classifier flagged as spam was actually spam '
      '(precision 1.0). However, the classifier caught only 80% of all spam messages (recall 0.8), '
      'so the remaining 20% were misclassified as ham.')

('Accuracy score: ', 0.97343862167982775)
('Precision score: ', 1.0)
('Recall score: ', 0.80000000000000004)
('F1 score: ', 0.88888888888888895)

Conclusion: Every message that the Naive Bayes classifier flagged as spam was actually spam (precision 1.0). However, the classifier caught only 80% of all spam messages (recall 0.8), so the remaining 20% were misclassified as ham.
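
The four scores can be recomputed by hand from confusion-matrix counts. The counts below are not printed anywhere in the notebook. They are inferred from the reported scores and the 1393 test messages, and should be read as an assumption: 148 true positives, 0 false positives, 37 false negatives and 1208 true negatives. A minimal sketch:

```python
from __future__ import division  # true division, in case this runs under Python 2

# Confusion counts inferred from the reported scores (an assumption, not notebook output).
tp, fp, fn, tn = 148, 0, 37, 1208

accuracy = (tp + tn) / (tp + fp + fn + tn)          # share of all messages classified correctly
precision = tp / (tp + fp)                          # share of flagged messages that are spam
recall = tp / (tp + fn)                             # share of spam messages that were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print('Accuracy: {:.4f}'.format(accuracy))
print('Precision: {:.4f}'.format(precision))
print('Recall: {:.4f}'.format(recall))
print('F1: {:.4f}'.format(f1))
```

Under these counts every error is a false negative, that is, a spam message labelled as ham. This is why precision is perfect while recall is not.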
