## Naive Bayes Spam Detection Classifier

### Introduction
Spam detection is one of the major applications of Machine Learning on the web today. Pretty much all of the major email service providers have spam detection systems built in that automatically classify such mail as 'Junk Mail'.

In this mission we will be using the Naive Bayes algorithm to create a model that can classify SMS messages from a dataset as spam or not spam, based on the training we give to the model. It is important to have some level of intuition as to what a spammy text message might look like.

What are spammy messages?
Usually they have words like 'free', 'win', 'winner', 'cash', 'prize' and the like in them as these texts are designed to catch your eye and in some sense tempt you to open them. Also, spam messages tend to have words written in all capitals and also tend to use a lot of exclamation marks. To the recipient, it is usually pretty straightforward to identify a spam text and our objective here is to train a model to do that for us!
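
The signals described above can be made concrete with a small sketch. The keyword list and the `spam_signals` helper are hypothetical names used only for illustration; the model trained later in this notebook learns from word counts and does not use hand-written rules like these.

```python
import re

# Hypothetical keyword list, taken from the examples in the text above.
SPAM_WORDS = {"free", "win", "winner", "cash", "prize"}

def spam_signals(message):
    """Return simple hand-crafted signals for one message (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", message)
    return {
        # Words that appear in the keyword list, ignoring case
        "spam_words": sum(w.lower() in SPAM_WORDS for w in words),
        # Words of two or more letters written entirely in capitals
        "all_caps_words": sum(len(w) > 1 and w.isupper() for w in words),
        # Number of exclamation marks anywhere in the message
        "exclamations": message.count("!"),
    }

signals = spam_signals("WINNER!! Claim your FREE cash prize now!")
ham_signals = spam_signals("Ok lar... Joking wif u oni...")
print(signals)
print(ham_signals)
```

Rules like these are brittle, which is one reason to let a model learn the patterns from labelled data instead.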

Identifying spam messages is a binary classification problem, as messages are classified as either 'Spam' or 'Not Spam' and nothing else. It is also a supervised learning problem, as we know what we are trying to predict. We will be feeding a labelled dataset into the model, which it can learn from to make future predictions.


In this notebook we will use Naive Bayes to classify spam. The dataset is publicly available at https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. For our convenience, it is also in the project folder. We will use sklearn. We will also calculate the accuracy, precision, recall and F1 scores.

#### Use this cell for all your imports

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#### Let us go through the documentation

In [2]:
pd?

In [3]:
train_test_split?

In [4]:
CountVectorizer?

In [5]:
MultinomialNB?

In [6]:
accuracy_score?

#### Import the dataframe

In [7]:
df = pd.read_table('SMSSpamCollection', sep='\t', header=None, names=['label', 'sms_message'])

####  Study the dataframe, use head, tail, sample, shape, groupby etc. 

In [8]:
# head
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [9]:
# tail
df.tail()

| | label | sms_message |
|---|---|---|
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
| 5571 | ham | Rofl. Its true to its name |


In [10]:
# sample
df.sample(5)

| | label | sms_message |
|---|---|---|
| 2831 | ham | Howz that persons story |
| 5450 | ham | Sac needs to carry on:) |
| 3700 | ham | How i noe... Did ü specify da domain as nusstu... |
| 2993 | ham | K.i did't see you.:)k:)where are you now? |
| 3275 | ham | Thanx a lot... |


In [11]:
# group by
df.groupby('label').size()

label
ham     4825
spam     747
dtype: int64

In [12]:
# shape
df.shape

(5572, 2)

#### Data Preprocessing
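Scikit-learn's estimators work on numerical values. If we left the labels as strings, the classifier would encode them internally and could still make predictions, but metrics such as precision and recall assume by default that the positive class is labelled 1 and would raise an error on 'ham'/'spam' strings. Therefore we change our ham label to 0 and our spam label to 1.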

In [13]:
# change ham to 0 and spam to 1. This is our response value
df['label'] = df.label.map({'ham':0, 'spam':1})
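
One thing worth checking after a mapping like this is that every label was covered. The sketch below uses a small toy frame (hypothetical rows, not the real dataset) to show that `Series.map` yields `NaN` for any value missing from the dictionary.

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical rows).
# The last label is deliberately capitalised to simulate a stray value.
toy = pd.DataFrame({"label": ["ham", "spam", "ham", "Spam"]})

mapped = toy["label"].map({"ham": 0, "spam": 1})

# Series.map returns NaN for values that are not keys of the dictionary,
# so an unexpected label shows up as a missing value.
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist())
print("Unmapped labels:", n_unmapped)
```

On the real dataframe the same check would be `df['label'].isna().sum()`, which should be 0 if 'ham' and 'spam' are the only labels present.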

In [14]:
# Split our dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

In [15]:
print('Number of rows in the total set: ', df.shape[0])
print('Number of rows in the training set: ', X_train.shape[0])
print('Number of rows in the test set: ', X_test.shape[0])

Number of rows in the total set:  5572
Number of rows in the training set:  4179
Number of rows in the test set:  1393
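
The split sizes above follow from scikit-learn's default of holding out 25% of the rows when neither `test_size` nor `train_size` is given. A short arithmetic sketch, using only the numbers already printed in this notebook:

```python
import math

n_rows = 5572        # total rows, from df.shape above
test_size = 0.25     # scikit-learn's default when no size is passed

# The test set size is rounded up; the training set gets the remainder.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

# Class balance, from the groupby output above (747 spam messages)
spam_fraction = 747 / n_rows

print(n_train, n_test)
print(round(spam_fraction, 3))
```

Since only about 13% of the messages are spam, passing `stratify=df['label']` to `train_test_split` would keep that proportion the same in both splits. The notebook does not do this, so treat it as an optional variation.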


### Bag of words

What we have here in our data set is a large collection of text data (5,572 rows of data). Most ML algorithms rely on numerical data to be fed into them as input, and email/sms messages are usually text heavy.

Here we'd like to introduce the Bag of Words (BoW) concept, a term for problems that involve a 'bag of words', that is, a collection of text data that needs to be worked with. The basic idea of BoW is to take a piece of text and count the frequency of the words in that text. It is important to note that the BoW concept treats each word individually and the order in which the words occur does not matter.

Using a process which we will go through now, we can convert a collection of documents to a matrix, with each document being a row and each word(token) being the column, and the corresponding (row,column) values being the frequency of occurrence of each word or token in that document.

For example:

Let's say we have 4 documents as follows:

['Hello, how are you!',
'Win money, win from home.',
'Call me now',
'Hello, Call you tomorrow?']

Our objective here is to convert this set of text to a frequency distribution matrix, as follows:

![title](countvectorizer.png)
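
To see how such a matrix comes about, here is a minimal hand-rolled sketch using only the standard library. The `tokenize` helper is a hypothetical name, and its rule (lowercase, keep runs of two or more word characters) is an assumption chosen to mirror common defaults.

```python
import re
from collections import Counter

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now',
             'Hello, Call you tomorrow?']

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters;
    # punctuation acts as a separator and is discarded.
    return re.findall(r"\b\w\w+\b", text.lower())

# One word-frequency counter per document
counts = [Counter(tokenize(doc)) for doc in documents]

# Columns: every distinct word, in alphabetical order
vocabulary = sorted(set().union(*counts))

# Rows: one per document; a Counter returns 0 for words it has not seen
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, and the second row contains a 2 in the 'win' column because that word occurs twice in the second document.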

> To handle this, we will be using sklearn's `CountVectorizer` class, which does the following:

> - It tokenizes the string (separates the string into individual words) and gives an integer ID to each token.
> - It counts the occurrences of each of those tokens.
> - By default it converts all tokens to lowercase and treats punctuation as a separator, so special characters are dropped. Stop words (very frequently used English words like 'and' and 'the') are removed only if the `stop_words` parameter is set; by default they are kept.
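
A short sketch of `CountVectorizer` on the four example documents, once with default settings and once with `stop_words='english'`. The `fitted_vocabulary` helper is a hypothetical name; the vocabulary is read from the `vocabulary_` attribute so the snippet does not depend on a particular scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now',
             'Hello, Call you tomorrow?']

def fitted_vocabulary(vectorizer, docs):
    """Fit the vectorizer and return (terms in column order, dense count matrix)."""
    matrix = vectorizer.fit_transform(docs)
    # vocabulary_ maps each term to its column index
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return terms, matrix.toarray()

# Default settings: lowercasing and punctuation handling, stop words kept
default_terms, default_counts = fitted_vocabulary(CountVectorizer(), documents)

# Same documents with scikit-learn's built-in English stop word list removed
filtered_terms, filtered_counts = fitted_vocabulary(
    CountVectorizer(stop_words='english'), documents)

print(default_terms)
print(default_counts)
print(filtered_terms)
```

Whether to remove stop words is a design choice. The notebook keeps the default, which retains them.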

In [16]:
# Instantiate the CountVectorizer class
count_vector = CountVectorizer()

In [17]:
# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

In [18]:
# Transform testing data and return the matrix. 
# Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)
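
The reason for calling only `transform` on the test data is that the vocabulary, and therefore the set of columns, must come from the training data alone. A small sketch with made-up messages (hypothetical, not taken from the dataset) shows what happens to words that appear only at test time:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages used only to illustrate fit vs. transform
train_docs = ["free prize now", "call me now"]
test_docs = ["free lunch tomorrow"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)   # learns the vocabulary
test_matrix = vectorizer.transform(test_docs)         # reuses that vocabulary

# Terms in column order, read from the learned vocabulary
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)
print(test_matrix.toarray())
```

Both matrices have the same number of columns, and 'lunch' and 'tomorrow' are silently ignored because they were never seen during fitting. Fitting on the test data as well would leak information about it into the features.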

### Model training and predicting

In [19]:
# Instantiate our model
naive_bayes = MultinomialNB()

In [20]:
# Fit our model to the training data
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
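
The `alpha=1.0` shown in the output is the additive smoothing parameter. As a sketch of what it controls (the notation here is ours, not from the notebook): let $N_{c,w}$ be the number of times word $w$ occurs in training messages of class $c$, $N_c = \sum_w N_{c,w}$ the total word count for that class, and $|V|$ the vocabulary size. The smoothed estimate of the probability of word $w$ in class $c$ is

$$\hat{\theta}_{c,w} = \frac{N_{c,w} + \alpha}{N_c + \alpha\,|V|}$$

and a message with word counts $x_w$ is assigned to the class

$$\hat{y} = \arg\max_{c}\left[\log P(c) + \sum_{w} x_w \log \hat{\theta}_{c,w}\right]$$

With $\alpha = 1$ (Laplace smoothing), a word never seen in a class during training still gets a small non-zero probability, so a single unseen word cannot force the whole product to zero. `fit_prior=True` means $P(c)$ is estimated from the class frequencies in the training data.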

In [21]:
# Predict on the test data
predictions = naive_bayes.predict(testing_data)
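
The same vectorize-then-predict steps apply to any new message. The sketch below is self-contained and trains on four made-up messages (hypothetical data, far too small for a real model) purely to show the mechanics, including `predict_proba` for class probabilities.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: 1 = spam, 0 = ham
train_messages = ["win free cash prize now",
                  "free prize winner call now",
                  "are you coming home",
                  "see you tomorrow at lunch"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(train_messages), train_labels)

# New messages must go through the same fitted vectorizer
new_messages = ["free cash prize", "see you at home"]
new_matrix = vectorizer.transform(new_messages)

new_predictions = model.predict(new_matrix)
# Columns of predict_proba follow model.classes_, so column 1 is spam
spam_probabilities = model.predict_proba(new_matrix)[:, 1]

print(new_predictions)
print(spam_probabilities)
```

In this notebook the equivalent calls would be `count_vector.transform([...])` followed by `naive_bayes.predict(...)`.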

### Model Evaluation

In [22]:
# Score our model
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
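
To make the four scores less abstract, the sketch below recomputes them from confusion-matrix counts. The counts are inferred from the scores printed above and the test set size of 1393; the notebook itself does not print them, so treat them as an assumption.

```python
# Confusion-matrix counts inferred from the printed scores
# (an assumption; the notebook does not output them directly).
tp = 174    # spam correctly flagged as spam
fp = 5      # ham wrongly flagged as spam
fn = 11     # spam that slipped through as ham
tn = 1203   # ham correctly left alone
total = tp + fp + fn + tn

accuracy = (tp + tn) / total          # share of all messages classified correctly
precision = tp / (tp + fp)            # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)               # of actual spam, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts ham (label 0)
always_ham_accuracy = (tn + fp) / total

print(accuracy, precision, recall, f1)
print(always_ham_accuracy)
```

Because the classes are imbalanced, a model that always answers ham would already reach about 87% accuracy under these counts, which is why precision and recall are reported alongside accuracy.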


### Conclusion
One of the major advantages that Naive Bayes has over other classification algorithms is its ability to handle an extremely large number of features. In our case, each word is treated as a feature and there are thousands of different words. It also performs well in the presence of irrelevant features and is relatively unaffected by them. The other major advantage is its relative simplicity. Naive Bayes works well right out of the box and tuning its parameters is rarely necessary, except usually in cases where the distribution of the data is known. It tends not to overfit the data. Another important advantage is that its training and prediction times are very fast for the amount of data it can handle. All in all, Naive Bayes really is a gem of an algorithm!