# SMS Spam filter

The code below loads the data into a pandas data frame. Each row holds a label (ham or spam) and the text of one message.


In [11]:
import numpy as np
import pandas as pd
import csv
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier  
from datetime import date
from sklearn.svm import LinearSVC
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

version = date.today().strftime("%Y_%B_%d")

df = pd.read_csv('SMSSpamCollection', sep = '\t', names = ["labels", "message"], quoting=csv.QUOTE_NONE) 

# We deliberately do not call df.drop_duplicates(subset='message', inplace=True):
# since this is a spam filter, it is better to train the model with the duplicates kept.

In the data frame we change the label 'spam' to 1 and the label 'ham' to 0.

This is done using the function change.

In [12]:
def change(x):
    if(x == 'ham'):
        return 0
    else:
        return 1

y = df['labels'].map(change)

The next function takes the model pipeline, the training data (X_train and y_train), the test labels (y_test) and the predicted test labels (y_test_predicted). It prints the 10-fold cross-validation scores on the training data with their mean and standard deviation, the accuracy on the test data, the number of test data points the model classifies correctly, the confusion matrix and the classification report.

In [13]:
def print_accuracies(pipeline_dummy, X_train, y_train, y_test, y_test_predicted):
    # 10-fold cross-validation accuracy on the training data
    scores = cross_val_score(pipeline_dummy, X_train, y_train, cv=10, scoring='accuracy')
    print("The cross validation scores are :", scores)
    print("The mean and standard deviation of the scores are:", scores.mean(), scores.std())
    # Accuracy on the held-out test data, as a fraction and as a count
    print("The fraction of correctly classified samples is {}".format(accuracy_score(y_test, y_test_predicted)))
    print("The number of correctly classified samples is {} out of {}".format(accuracy_score(y_test, y_test_predicted, normalize=False), len(y_test)))
    # Rows are the true labels, columns are the predicted labels
    print("The confusion matrix is ", pd.DataFrame(confusion_matrix(y_test, y_test_predicted), index=['ham', 'spam'], columns=['pred_ham', 'pred_spam']))
    # Per-class precision, recall and F1 score
    print(classification_report(y_test, y_test_predicted))

# Naive Bayes Model
We use Naive Bayes to classify ham and spam.

The vectorizer we use for the Naive Bayes model is the Count Vectorizer.

Naive Bayes is a probabilistic classifier. It estimates the probability of each word appearing in a spam message and in a ham message, and it assumes that, given the class, the probability of a word appearing in a message is independent of the other words in the message.

The model fits these probabilities so that they best describe the given data set.
By best fit, we mean maximizing the log-likelihood of the data set given the probabilities.
To maximize the log-likelihood, we have to assume a distribution for the word counts.
In my code, I have assumed that the word counts of each class follow a multinomial distribution. (The class label itself is binary here: ham or spam.)

With Multinomial Naive Bayes I have used alpha = 1, which is Laplace smoothing. Without smoothing, a word that never occurs in the training messages of one class gets probability zero for that class, and any message containing that word then gets probability zero for that class as well.
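
To make the effect of alpha concrete, here is a minimal sketch of additive smoothing on made-up word counts. The vocabulary, the counts and the helper word_probabilities are hypothetical and only illustrate the formula (count + alpha) / (total + alpha * vocabulary size); they are not taken from the SMS data or from sklearn's internals.

```python
import numpy as np

# Hypothetical toy counts: how often each word occurs in the spam training messages.
vocab = ["free", "win", "meeting"]
spam_counts = np.array([3, 2, 0])  # "meeting" never occurs in spam


def word_probabilities(counts, alpha):
    """Estimate P(word | class) with additive smoothing of strength alpha."""
    return (counts + alpha) / (counts.sum() + alpha * len(counts))


unsmoothed_spam = word_probabilities(spam_counts, alpha=0.0)
smoothed_spam = word_probabilities(spam_counts, alpha=1.0)

# Without smoothing "meeting" has probability zero in the spam class;
# with alpha = 1 every word keeps a small non-zero probability.
print(dict(zip(vocab, unsmoothed_spam)))
print(dict(zip(vocab, smoothed_spam)))
```

With alpha = 0 a single unseen word would rule out the spam class for a whole message, which is the situation smoothing is meant to avoid.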




In [14]:
#-------------------Naive_bayes---------------------------------------------------------
vectorizer_nb = CountVectorizer()
Xnb = vectorizer_nb.fit_transform(df['message'])
# y needs no further processing; the labels were already encoded in the previous section

Xnb_train, Xnb_test, ynb_train, ynb_test = train_test_split(Xnb, y, random_state= 42, test_size = 0.2)

pipeline_nb = Pipeline([('classifier', MultinomialNB(alpha = 1))])
pipeline_nb.fit(Xnb_train, ynb_train)
model_path_nb = 'model.joblib_{}_nb'.format(version)
joblib.dump(pipeline_nb, model_path_nb)
reloaded_pipeline_nb = joblib.load(model_path_nb)
ynb_test_predicted = reloaded_pipeline_nb.predict(Xnb_test)

print_accuracies(pipeline_nb, Xnb_train, ynb_train, ynb_test, ynb_test_predicted)                                                          
                                                       

The cross validation scores are : [0.98434004 0.97986577 0.96644295 0.99327354 0.97757848 0.98430493
 0.97977528 0.9752809  0.98651685 0.97752809]
The mean and standard deviation of the scores are: 0.9804906843843095 0.006868210220589314
The fraction of correctly classified samples is 0.9847533632286996
The number of correctly classified samples is 1098 out of 1115
The confusion matrix is        pred_ham  pred_spam
ham        944         10
spam         7        154
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       954
           1       0.94      0.96      0.95       161

   micro avg       0.98      0.98      0.98      1115
   macro avg       0.97      0.97      0.97      1115
weighted avg       0.98      0.98      0.98      1115
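
The Naive Bayes cell above builds the count matrix from all messages before splitting, so the pipeline itself only contains the classifier. One possible variation, sketched here on hypothetical toy messages rather than the SMS data, puts the vectorizer inside the Pipeline so that each cross-validation fold learns its vocabulary from its own training part only, and so that the saved model can be applied to raw text. The names messages, labels and text_pipeline are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for df['message'] and y (1 = spam, 0 = ham).
messages = [
    "win a free prize now", "free entry win cash", "claim your free prize",
    "urgent win cash now", "are we meeting today", "see you at lunch",
    "call me when you are home", "lunch at home today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is a pipeline step, so it is refitted on every training fold.
text_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB(alpha=1)),
])

# Cross-validation now receives raw text instead of a precomputed matrix.
scores = cross_val_score(text_pipeline, messages, labels, cv=2, scoring="accuracy")

# After fitting, the pipeline predicts directly from raw strings.
text_pipeline.fit(messages, labels)
prediction = text_pipeline.predict(["win a free prize"])
print(scores, prediction)
```

Whether this changes the scores reported above was not checked; the sketch only shows the structure.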



# Random Forest Model
We use Random Forest to classify ham and spam.

The vectorizer we use for the Random Forest model is the Count Vectorizer.

A random forest builds many decision trees and classifies by combining the predictions of all of them.
Using many trees counters the tendency of a single deep decision tree to overfit.
Random forest is an ensemble classifier: unlike a single decision tree, it uses many models to classify. This may increase the bias of the model a little, but it reduces the variance of the final model and so will probably get rid of the overfitting issue.

Random forest in sklearn has many parameters, such as the number of nodes in a tree, the number of trees used by the ensemble, and other decision-tree settings such as the impurity measure. These parameters can be tuned to get better results.
I am just using random_state = 31 so that we get the same classifier every time we run the code.
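
As a rough illustration of the parameters mentioned above, the sketch below sets a few of them explicitly on synthetic data. The data from make_classification and the particular values (50 trees, 64 leaf nodes, entropy) are assumptions chosen only so the snippet runs on its own; they are not tuned for the SMS data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data, used only to make the example self-contained.
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)


def build_forest():
    return RandomForestClassifier(
        n_estimators=50,      # number of trees in the ensemble
        max_leaf_nodes=64,    # cap on the number of leaf nodes per tree
        criterion="entropy",  # impurity measure ("gini" is the default)
        random_state=31,      # fixed seed, as in this notebook
    )


forest_a = build_forest().fit(X_toy, y_toy)
forest_b = build_forest().fit(X_toy, y_toy)

# With the same seed and data, both forests give identical predictions.
same_predictions = (forest_a.predict(X_toy) == forest_b.predict(X_toy)).all()
print(len(forest_a.estimators_), same_predictions)
```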

In [15]:
#-------------------random_forest----------------------------------------------------------
vectorizer_rf = CountVectorizer()
Xrf = vectorizer_rf.fit_transform(df['message'])
# y needs no further processing; the labels were already encoded in an earlier section

Xrf_train, Xrf_test, yrf_train, yrf_test = train_test_split(Xrf, y, random_state= 42, test_size = 0.2)

pipeline_rf = Pipeline([('classifier', RandomForestClassifier(random_state = 31))])
pipeline_rf.fit(Xrf_train, yrf_train)
model_path_rf = 'model.joblib_{}_rf'.format(version)
joblib.dump(pipeline_rf, model_path_rf)
reloaded_pipeline_rf = joblib.load(model_path_rf)
yrf_test_predicted = reloaded_pipeline_rf.predict(Xrf_test)


print_accuracies(pipeline_rf, Xrf_train, yrf_train, yrf_test, yrf_test_predicted)



The cross validation scores are : [0.96644295 0.98434004 0.96644295 0.97982063 0.96636771 0.96412556
 0.9505618  0.9752809  0.96853933 0.96179775]
The mean and standard deviation of the scores are: 0.968371962740919 0.009075936878679459
The fraction of correctly classified samples is 0.9730941704035875
The number of correctly classified samples is 1085 out of 1115
The confusion matrix is        pred_ham  pred_spam
ham        954          0
spam        30        131
              precision    recall  f1-score   support

           0       0.97      1.00      0.98       954
           1       1.00      0.81      0.90       161

   micro avg       0.97      0.97      0.97      1115
   macro avg       0.98      0.91      0.94      1115
weighted avg       0.97      0.97      0.97      1115



# SVM Model
We use SVM to classify ham and spam.

The vectorizer we use for the SVM model is the Tf-idf Vectorizer.

A support-vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks such as outlier detection. It tries to find the best separating hyperplane: the one with the largest distance to the nearest training data point of any class (a.k.a. the margin), since in general the larger the margin, the lower the generalization error of the classifier.

An SVM model can be kernelized, but here we simply use the linear kernel.
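
The idea of a separating hyperplane and its margin can be sketched on a few two-dimensional points. The points below are hypothetical and chosen to be linearly separable; the sketch reads the learned hyperplane w·x + b = 0 from LinearSVC and computes each point's geometric distance to it, together with the width 2 / ||w|| between the planes w·x + b = ±1.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical 2-D points forming two linearly separable clusters.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

svm = LinearSVC(C=1.0, random_state=0, max_iter=10000)
svm.fit(X_toy, y_toy)

# The separating hyperplane is w . x + b = 0.
w = svm.coef_[0]
b = svm.intercept_[0]

# Geometric distance of every training point to the hyperplane.
distances = np.abs(X_toy @ w + b) / np.linalg.norm(w)

# Width of the band between the planes w . x + b = -1 and w . x + b = +1.
margin_width = 2.0 / np.linalg.norm(w)

print(svm.predict(X_toy), distances.min(), margin_width)
```

LinearSVC uses a soft margin controlled by C, so points are allowed to fall inside the band on data that is not cleanly separable.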


In [16]:
#-----------------SVM------------------------------------------------------------
vectorizer_svm = TfidfVectorizer()
Xsvm = vectorizer_svm.fit_transform(df['message'])
# y needs no further processing; the labels were already encoded in an earlier section

Xsvm_train, Xsvm_test, ysvm_train, ysvm_test = train_test_split(Xsvm, y, random_state= 42, test_size = 0.2)

pipeline_svm = Pipeline([('classifier',LinearSVC())])

pipeline_svm.fit(Xsvm_train, ysvm_train)
model_path_svm = 'model.joblib_{}_svm'.format(version)
joblib.dump(pipeline_svm, model_path_svm)
reloaded_pipeline_svm = joblib.load(model_path_svm)
ysvm_test_predicted = reloaded_pipeline_svm.predict(Xsvm_test)


print_accuracies(pipeline_svm, Xsvm_train, ysvm_train, ysvm_test, ysvm_test_predicted)


The cross validation scores are : [0.98657718 0.98881432 0.98657718 0.99103139 0.98206278 0.98878924
 0.97303371 0.99325843 0.97752809 0.97752809]
The mean and standard deviation of the scores are: 0.9845200402767975 0.006321679371473787
The fraction of correctly classified samples is 0.9811659192825112
The number of correctly classified samples is 1094 out of 1115
The confusion matrix is        pred_ham  pred_spam
ham        951          3
spam        18        143
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       954
           1       0.98      0.89      0.93       161

   micro avg       0.98      0.98      0.98      1115
   macro avg       0.98      0.94      0.96      1115
weighted avg       0.98      0.98      0.98      1115

