# Constructing a Spam Filter

In this assignment we are given a collection of messages, some of which are "spam" (unwanted messages) while the others are "ham". We will build a classifier that, when trained on a sufficient amount of data, can distinguish spam messages from ham. For this task we will use three supervised learning procedures:

**1. Support Vector Machines**

**2. Naive Bayes Classifier**

**3. Multilayer Perceptron (Neural Networks)**



After fitting these supervised learning models, we will evaluate the performance of each model by calculating the accuracy, precision, recall and F1 score on the test data.
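To make the four metrics concrete, here is a minimal sketch that computes them from confusion-matrix counts in plain Python. The counts are hypothetical and not taken from the data set, and the helper name `binary_metrics` is introduced only for illustration. It also shows macro averaging, which is the `average="macro"` option used in the evaluation cells later on.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for one class treated as 'positive'."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 1000 messages (not taken from the data set).
spam_caught, ham_flagged, spam_missed, ham_passed = 40, 10, 10, 940

# Treat spam as the positive class, then ham as the positive class.
spam_view = binary_metrics(tp=spam_caught, fp=ham_flagged, fn=spam_missed, tn=ham_passed)
ham_view = binary_metrics(tp=ham_passed, fp=spam_missed, fn=ham_flagged, tn=spam_caught)

# Macro averaging takes the unweighted mean of the per-class scores.
macro = {name: (spam_view[name] + ham_view[name]) / 2
         for name in ("precision", "recall", "f1")}

print("accuracy:", spam_view["accuracy"])
for name, value in macro.items():
    print("macro", name, round(value, 4))
```

Because ham is far more common than spam, accuracy alone can look good even when many spam messages are missed, which is why the per-class scores are reported as well.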


In [16]:
import pandas as pd
import numpy as np

df = pd.read_csv('/home/biswadeep/smsspamcollection/SMSSpamCollection', sep='\t',names=["label", "sms"])
print(df.head())

  label                                                sms
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [17]:
df.drop_duplicates(subset='sms', inplace=True)  # Drop duplicate messages, keeping the first occurrence of each

In [18]:
df.describe()

| | label | sms |
|---|---|---|
| count | 5169 | 5169 |
| unique | 2 | 5169 |
| top | ham | Eh sorry leh... I din c ur msg. Not sad alread... |
| freq | 4516 | 1 |


In [19]:
df.groupby('label').count().reset_index()  # Total number of spam and ham messages

| | label | sms |
|---|---|---|
| 0 | ham | 4516 |
| 1 | spam | 653 |


In [20]:
from sklearn.model_selection import train_test_split

X = df['sms']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print("Shape of X is {}".format(X.shape))
print("Shape of X_train is {} and shape of y_train is {}".format(X_train.shape, y_train.shape))
print("Shape of X_test is {} and shape of y_test is {}".format(X_test.shape, y_test.shape))

train_corpus = list(X_train)

Shape of X is (5169,)
Shape of X_train is (3876,) and shape of y_train is (3876,)
Shape of X_test is (1293,) and shape of y_test is (1293,)


In [21]:
# Construct the feature vector for each message using the TF-IDF vectorizer (fitted on the training messages only)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000)
vectorizer.fit(train_corpus)

print("Number of features = {}".format(len(vectorizer.vocabulary_)))
print("Number of omitted words = {}".format(len(vectorizer.stop_words_)))

X_train_text_features = vectorizer.transform(list(X_train))
print("Shape of X_train_text_features is {}".format(X_train_text_features.shape))

Number of features = 5000
Number of omitted words = 2277
Shape of X_train_text_features is (3876, 5000)
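The TF-IDF weighting used above can be sketched by hand in plain Python. This is a simplified illustration, not the library's implementation: it assumes whitespace tokenization (scikit-learn's default tokenizer differs), a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector. The function name `tfidf_vectors` and the three toy messages are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """TF-IDF with smoothed IDF and L2 normalisation, one dict per document."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

toy_corpus = ["free prize now", "call me now", "see you now"]  # hypothetical messages
toy_vectors = tfidf_vectors(toy_corpus)
print(toy_vectors[0])
```

In the first toy message, "now" occurs in every document and so receives a lower weight than "free" or "prize", which occur only once in the corpus.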


In [22]:
df['len'] = df['sms'].map(len)  # Extra feature: message length in characters

X = df[['sms', 'len']]
y = df['label']

X_train, X_test, y_train, y_test =  train_test_split(X, y, random_state=42)

train_corpus = list(X_train['sms'])

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000)
vectorizer.fit(train_corpus)

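from scipy import sparse

def get_features(X):
    """Return the TF-IDF text features with the message length appended as the last column."""
    X_text_features = vectorizer.transform(list(X['sms']))
    # Turn the length column into an (n_samples, 1) sparse matrix so it can be stacked horizontally
    X_len_features = sparse.csr_matrix(X['len']).T
    X_features = sparse.hstack([X_text_features, X_len_features])
    return X_features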

X_train_features = get_features(X_train)

## 1. Support Vector Machine

The original SVM algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. Suppose we are given data points that each belong to one of two classes, and the goal is to decide which class a new data point belongs to. In the case of support vector machines, a data point is viewed as a p-dimensional vector, and we want to know whether we can separate such points with a (p-1)-dimensional hyperplane. A classifier of this kind is called a linear classifier. There are many possible hyperplanes that might classify the data successfully. One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on either side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as a maximum-margin classifier.
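The idea of preferring the hyperplane with the largest margin can be illustrated with a small sketch in plain Python. It does not train an SVM; it only compares the margin of three hand-picked candidate hyperplanes on hypothetical 2-D toy data. The function name `geometric_margin`, the data points and the candidates are all assumptions made for illustration.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from the hyperplane w.x + b = 0 to any point.

    Labels are +1 or -1; a negative result means some point is misclassified.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical 2-D toy data: two points per class.
points = [(1.0, 1.0), (1.0, 2.0), (4.0, 1.0), (4.0, 2.0)]
labels = [-1, -1, +1, +1]

# Three candidate separating hyperplanes, written as (w, b).
candidates = {
    "A": ((1.0, 0.0), -2.5),  # vertical line x1 = 2.5, midway between the classes
    "B": ((1.0, 0.0), -2.0),  # vertical line x1 = 2.0, closer to the negative class
    "C": ((1.0, 1.0), -4.5),  # tilted line x1 + x2 = 4.5
}

margins = {name: geometric_margin(w, b, points, labels)
           for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, "-> best:", best)
```

All three candidates separate the toy data, but candidate A keeps the nearest point of each class furthest away, so it is the one a maximum-margin classifier would prefer among them.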





In [23]:
#Training a Support Vector Machine with the data
from sklearn import svm
model = svm.LinearSVC()

model.fit(X_train_features, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [24]:
from sklearn.metrics import accuracy_score, confusion_matrix,f1_score, precision_score, recall_score

y_train_predicted = model.predict(X_train_features)

print("The fraction of correctly classified samples is {}".format(accuracy_score(y_train, y_train_predicted)))
print("The number of correctly classified samples is {}".format(accuracy_score(y_train, 
                                                                               y_train_predicted, normalize=False)))
print("The precision score is {}".format(precision_score(y_train, y_train_predicted,average = "macro")))
print("The recall score is {}".format(recall_score(y_train, y_train_predicted,average = "macro")))
print("The f1 score is {}".format(f1_score(y_train, y_train_predicted,average = "macro")))

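# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions.
# Labels are sorted alphabetically ('ham' before 'spam'); lists keep that order, sets do not.
pd.DataFrame(confusion_matrix(y_train, y_train_predicted),
             index=['true ham', 'true spam'],
             columns=['pred ham', 'pred spam'])

The fraction of correctly classified samples is 0.963622291022
The number of correctly classified samples is 3735
The precision score is 0.894976052299
The recall score is 0.957960149306
The f1 score is 0.923016710026


| | pred ham | pred spam |
|---|---|---|
| true ham | 3275 | 117 |
| true spam | 24 | 460 |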


In [25]:
"""
    Evaluation within training data: k-fold cross validation
        - partition the training data into k parts (with cv=5 and a classifier, scikit-learn uses stratified folds without shuffling)
        - train on k-1 parts and evaluate on the remaining part
"""

from sklearn.model_selection import cross_val_score
clf = svm.LinearSVC()
cv_scores = cross_val_score(clf, X=X_train_features, y=y_train, cv=5, n_jobs=4)  # Evaluate the fresh, unfitted estimator
print(cv_scores)

[0.9742268  0.93685567 0.92387097 0.94709677 0.94186047]
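The k-fold procedure described in the docstring above can be sketched in plain Python. This is a simplified illustration under stated assumptions: it shuffles the indices with a fixed seed before splitting and does not stratify by class, whereas `cross_val_score` with an integer `cv` and a classifier uses stratified folds. The function name `k_fold_indices` is hypothetical.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Train on the k-1 folds that are not held out.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

Each sample is held out exactly once, so the k scores together give an estimate of performance that uses all of the training data for evaluation without ever scoring a model on messages it was fitted on.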


In [26]:
"""
    Evaluation on test data: This score is important
"""
X_test_features = get_features(X_test)
y_test_predicted = model.predict(X_test_features)

print("The fraction of correctly classified samples is {}".format(accuracy_score(y_test, y_test_predicted)))
print("The number of correctly classified samples is {}".format(accuracy_score(y_test, y_test_predicted, normalize=False)))
print("The precision score is {}".format(precision_score(y_test, y_test_predicted,average = "macro")))
print("The recall score is {}".format(recall_score(y_test, y_test_predicted,average = "macro")))
print("The f1 score is {}".format(f1_score(y_test, y_test_predicted,average = "macro")))

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions.
pd.DataFrame(confusion_matrix(y_test, y_test_predicted),
             index=['true ham', 'true spam'],
             columns=['pred ham', 'pred spam'])

The fraction of correctly classified samples is 0.962877030162
The number of correctly classified samples is 1245
The precision score is 0.90764288879
The recall score is 0.935914106425
The f1 score is 0.921075120046


| | pred ham | pred spam |
|---|---|---|
| true ham | 1093 | 31 |
| true spam | 17 | 152 |


## 2. Naive Bayes Classifier

Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. Naive Bayes is a technique for constructing models that assign class labels to problem instances, which are represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.

For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods. 

Despite the fact that the far-reaching independence assumptions are often inaccurate, the naive Bayes classifier has several properties that make it surprisingly useful in practice. In particular, the decoupling of the class-conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This helps alleviate problems stemming from the curse of dimensionality, such as the need for data sets that scale exponentially with the number of features. While naive Bayes often fails to produce a good estimate of the correct class probabilities, this may not be a requirement for many applications.

It is commonly used in problems such as sex classification (male or female), document classification and spam filtering.
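The independence assumption described above can be made concrete with a minimal Gaussian naive Bayes sketch in plain Python. Each feature gets its own one-dimensional Gaussian per class, and the class score is the log prior plus the sum of the per-feature log densities. This is an illustration under assumptions, not scikit-learn's implementation: the function names, the toy data and the way a small constant is added to each variance are all choices made for this sketch.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive (an assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + var_smoothing
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Pick the class with the highest log prior plus summed per-feature log densities."""
    def log_score(prior, means, variances):
        score = math.log(prior)
        for value, m, v in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (value - m) ** 2 / (2 * v)
        return score
    return max(model, key=lambda label: log_score(*model[label]))

# Hypothetical features per message: (TF-IDF weight of "free", length in characters).
X_toy = [(0.0, 40), (0.1, 55), (0.0, 35), (0.8, 150), (0.7, 140), (0.9, 160)]
y_toy = ["ham", "ham", "ham", "spam", "spam", "spam"]

toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(toy_model, (0.05, 50)))
print(predict_gaussian_nb(toy_model, (0.75, 155)))
```

Because every feature is modelled separately, fitting only requires one mean and one variance per feature and class, which is why training stays cheap even with thousands of TF-IDF columns.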

In [27]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

X_train_features = X_train_features.toarray()  # GaussianNB needs a dense array, so convert the sparse matrix

model.fit(X_train_features, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [28]:
from sklearn.metrics import accuracy_score, confusion_matrix,f1_score, precision_score, recall_score

y_train_predicted = model.predict(X_train_features)

print("The fraction of correctly classified samples is {}".format(accuracy_score(y_train, y_train_predicted)))
print("The number of correctly classified samples is {}".format(accuracy_score(y_train, 
                                                                               y_train_predicted, normalize=False)))
print("The precision score is {}".format(precision_score(y_train, y_train_predicted,average = "macro")))
print("The recall score is {}".format(recall_score(y_train, y_train_predicted,average = "macro")))
print("The f1 score is {}".format(f1_score(y_train, y_train_predicted,average = "macro")))

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions.
pd.DataFrame(confusion_matrix(y_train, y_train_predicted),
             index=['true ham', 'true spam'],
             columns=['pred ham', 'pred spam'])

The fraction of correctly classified samples is 0.946852425181
The number of correctly classified samples is 3670
The precision score is 0.850724637681
The recall score is 0.969634433962
The f1 score is 0.896607503303


| | pred ham | pred spam |
|---|---|---|
| true ham | 3186 | 206 |
| true spam | 0 | 484 |


In [29]:
"""
    Evaluation within training data: k-fold cross validation
        - partition the training data into k parts (with cv=5 and a classifier, scikit-learn uses stratified folds without shuffling)
        - train on k-1 parts and evaluate on the remaining part
"""

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
cv_scores = cross_val_score(clf, X=X_train_features, y=y_train, cv=5, n_jobs=4)
print(cv_scores)

[0.91623711 0.9007732  0.8916129  0.91225806 0.92894057]


In [30]:
"""
    Evaluation on test data: This score is important
"""
X_test_features = get_features(X_test)
y_test_predicted = model.predict(X_test_features.toarray())

print("The fraction of correctly classified samples is {}".format(accuracy_score(y_test, y_test_predicted)))
print("The number of correctly classified samples is {}".format(accuracy_score(y_test, y_test_predicted, normalize=False)))
print("The precision score is {}".format(precision_score(y_test, y_test_predicted,average = "macro")))
print("The recall score is {}".format(recall_score(y_test, y_test_predicted,average = "macro")))
print("The f1 score is {}".format(f1_score(y_test, y_test_predicted,average = "macro")))

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions.
pd.DataFrame(confusion_matrix(y_test, y_test_predicted),
             index=['true ham', 'true spam'],
             columns=['pred ham', 'pred spam'])

The fraction of correctly classified samples is 0.907192575406
The number of correctly classified samples is 1173
The precision score is 0.789556900783
The recall score is 0.893830676578
The f1 score is 0.828119461184


| | pred ham | pred spam |
|---|---|---|
| true ham | 1025 | 99 |
| true spam | 21 | 148 |


## 3. Multilayer Perceptron (Neural Networks)

An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons.

A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP consists of, at least, three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training. Its multiple layers and non-linear activation distinguish MLP from a linear perceptron.

Learning occurs in the perceptron by changing connection weights after each piece of data is processed, based on the amount of error in the output compared to the expected result. This is an example of supervised learning, and is carried out through backpropagation, a generalization of the least mean squares algorithm in the linear perceptron. 
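The weight update described above can be sketched for a tiny network in plain Python: one hidden layer with ReLU units, a sigmoid output, and a single gradient step on one training example. This is a minimal illustration under assumptions, not how `MLPClassifier` is implemented; the weights, the input, the learning rate and the function names are all hypothetical, and log loss is used as the error measure.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass: returns hidden pre-activations, hidden outputs and the prediction."""
    W1, b1, W2, b2 = params
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [relu(z) for z in z1]
    z2 = sum(w * hi for w, hi in zip(W2, h)) + b2
    return z1, h, sigmoid(z2)

def log_loss(o, y):
    return -(y * math.log(o) + (1 - y) * math.log(1 - o))

def sgd_step(params, x, y, lr=0.1):
    """One backpropagation step on a single example."""
    W1, b1, W2, b2 = params
    z1, h, o = forward(params, x)
    # Error signal at the output (gradient of log loss w.r.t. the output pre-activation).
    delta2 = o - y
    # Propagate the error back through W2 and the ReLU derivative.
    delta1 = [delta2 * w * (1.0 if z > 0 else 0.0) for w, z in zip(W2, z1)]
    new_W2 = [w - lr * delta2 * hi for w, hi in zip(W2, h)]
    new_b2 = b2 - lr * delta2
    new_W1 = [[w - lr * d * xi for w, xi in zip(row, x)] for row, d in zip(W1, delta1)]
    new_b1 = [b - lr * d for b, d in zip(b1, delta1)]
    return new_W1, new_b1, new_W2, new_b2

# Hypothetical starting weights: 2 inputs -> 2 hidden units -> 1 output.
params = ([[0.5, 0.3], [0.2, -0.8]], [0.1, -0.1], [0.7, -0.4], 0.0)
x, y = [1.0, 2.0], 1  # one training example with target class 1

loss_before = log_loss(forward(params, x)[2], y)
params_after = sgd_step(params, x, y)
loss_after = log_loss(forward(params_after, x)[2], y)
print(loss_before, "->", loss_after)
```

After the step the prediction moves towards the target, so the loss on this example goes down. Repeating such steps over many examples is what training amounts to.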

MLPs are useful in research for their ability to solve problems stochastically, which often allows approximate solutions for extremely complex problems like fitness approximation.

MLPs are universal function approximators, as shown by Cybenko's theorem, so they can be used to create mathematical models by regression analysis. As classification is a particular case of regression when the response variable is categorical, MLPs make good classifier algorithms.

MLPs were a popular machine learning solution in the 1980s, finding applications in diverse fields such as speech recognition, image recognition, and machine translation software, but thereafter faced strong competition from much simpler support vector machines. Interest in backpropagation networks returned due to the successes of deep learning.


In [31]:
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(solver='lbfgs', alpha=1e-2,hidden_layer_sizes=(5, ), random_state=1)

# X_train_features is already a dense array here (converted in the Naive Bayes section), so it is reused as-is

model.fit(X_train_features, y_train)

MLPClassifier(activation='relu', alpha=0.01, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(5,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=1, shuffle=True, solver='lbfgs', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [32]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

y_train_predicted = model.predict(X_train_features)

print("The fraction of correctly classified samples is {}".format(accuracy_score(y_train, y_train_predicted)))
print("The number of correctly classified samples is {}".format(accuracy_score(y_train, 
                                                                               y_train_predicted, normalize=False)))
print("The precision score is {}".format(precision_score(y_train, y_train_predicted,average = "macro")))
print("The recall score is {}".format(recall_score(y_train, y_train_predicted,average = "macro")))
print("The f1 score is {}".format(f1_score(y_train, y_train_predicted,average = "macro")))

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions.
pd.DataFrame(confusion_matrix(y_train, y_train_predicted),
             index=['true ham', 'true spam'],
             columns=['pred ham', 'pred spam'])

The fraction of correctly classified samples is 0.99226006192
The number of correctly classified samples is 3846
The precision score is 0.987381495349
The recall score is 0.976979134181
The f1 score is 0.982102442736


| | pred ham | pred spam |
|---|---|---|
| true ham | 3383 | 9 |
| true spam | 21 | 463 |


In [35]:
"""
    Evaluation within training data: k-fold cross validation
        - partition the training data into k parts (with cv=5 and a classifier, scikit-learn uses stratified folds without shuffling)
        - train on k-1 parts and evaluate on the remaining part
"""

from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='lbfgs', alpha=1e-2,hidden_layer_sizes=(5, ), random_state=1)
cv_scores = cross_val_score(clf, X=X_train_features, y=y_train, cv=5, n_jobs=4)
print(cv_scores)

[0.99613402 0.98969072 0.97419355 0.98064516 0.98062016]


In [34]:
"""
    Evaluation on test data: This score is important
"""

#from sklearn.neural_network import MLPClassifier
#model = MLPClassifier(solver='lbfgs', alpha=1e-2,hidden_layer_sizes=(5, ), random_state=1)


model.fit(X_train_features, y_train)
X_test_features = get_features(X_test)
y_test_predicted = model.predict(X_test_features)

print("The fraction of correctly classified samples is {}".format(accuracy_score(y_test, y_test_predicted)))
print("The number of correctly classified samples is {}".format(accuracy_score(y_test, y_test_predicted, normalize=False)))
print("The precision score is {}".format(precision_score(y_test, y_test_predicted,average = "macro")))
print("The recall score is {}".format(recall_score(y_test, y_test_predicted,average = "macro")))
print("The f1 score is {}".format(f1_score(y_test, y_test_predicted,average = "macro")))

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions.
pd.DataFrame(confusion_matrix(y_test, y_test_predicted),
             index=['true ham', 'true spam'],
             columns=['pred ham', 'pred spam'])

The fraction of correctly classified samples is 0.981438515081
The number of correctly classified samples is 1269
The precision score is 0.961293221727
The recall score is 0.956645223104
The f1 score is 0.958952380952


| | pred ham | pred spam |
|---|---|---|
| true ham | 1113 | 11 |
| true spam | 13 | 156 |
