## Problem Statement

SMS spam (sometimes called mobile phone spam) is any junk message delivered to a mobile phone
as a text message via the Short Message Service (SMS). Use a probabilistic approach (Naive Bayes
classifier / Bayesian network) to implement an SMS spam filtering system. SMS messages are
categorized as SPAM or HAM using features such as message length, word count, unique keywords, etc.

Download the dataset from: http://archive.ics.uci.edu/ml/datasets/sms+spam+collection

This dataset consists of a single text file in which each line holds the correct class followed by
the raw message.

a. Apply data pre-processing techniques (label encoding, data transformation, ...) if necessary

b. Perform data-preparation (Train-Test Split)

c. Apply at least two Machine Learning Algorithms and Evaluate Models

d. Apply Cross-Validation and Evaluate Models and compare performance.

e. Apply hyperparameter tuning, evaluate the models, and compare performance.
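
The problem statement describes the original file as one message per line, with the class first and the raw text after it. A minimal sketch of parsing that layout, assuming the two fields are separated by a tab; the helper name `parse_sms_lines` and the two sample lines are illustrative and not part of the notebook:

```python
import io

def parse_sms_lines(lines):
    """Split each 'label<TAB>message' line into a (label, message) pair."""
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        # partition splits on the first tab only, so tabs inside a message survive
        label, _, message = line.partition("\t")
        rows.append((label, message))
    return rows

# Illustrative in-memory stand-in for the downloaded text file
sample = io.StringIO(
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tFree entry in 2 a wkly comp\n"
)
rows = parse_sms_lines(sample)
print(rows[0])  # ('ham', 'Ok lar... Joking wif u oni...')
```

The notebook itself reads an already converted `spam.csv`, so this step is only needed when starting from the raw text file.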

## Loading Dataset

In [1]:
import numpy as np
import pandas as pd


In [2]:
df  = pd.read_csv('spam.csv',encoding="ISO-8859-1")
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [3]:
df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis = 1,inplace = True)

In [4]:
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Data Pre-processing

In [5]:
df.rename(columns = {'v1':'target','v2':'sms'},inplace = True)

In [6]:
data = df.copy()
data.head()

Unnamed: 0,target,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [7]:
data['target'].value_counts()

ham     4825
spam     747
Name: target, dtype: int64

In [8]:
data['sms'].duplicated().sum()

403

There are 403 duplicate SMS messages in the data.

In [9]:
# removing duplicate records
data.drop_duplicates(inplace = True)

In [10]:
# reset the index after removing duplicates (the old index is kept as an 'index' column)
data.reset_index(inplace = True)

In [11]:
data.isnull().sum()

index     0
target    0
sms       0
dtype: int64

In [12]:
# encode the string labels as integers (ham -> 0, spam -> 1)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data['target'] = le.fit_transform(data['target'])
data.head()


Unnamed: 0,index,target,sms
0,0,0,"Go until jurong point, crazy.. Available only ..."
1,1,0,Ok lar... Joking wif u oni...
2,2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,3,0,U dun say so early hor... U c already then say...
4,4,0,"Nah I don't think he goes to usf, he lives aro..."
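
The encoded `target` column above shows ham as 0 and spam as 1. A rough sketch of the idea behind that mapping, assuming (as `LabelEncoder` does) that the distinct labels are sorted before being numbered; `fit_label_mapping` is a hypothetical helper, not a scikit-learn function:

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, numbering them in sorted order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}

# Labels of the first five rows shown above
labels = ["ham", "ham", "spam", "ham", "ham"]
mapping = fit_label_mapping(labels)
encoded = [mapping[label] for label in labels]

print(mapping)  # {'ham': 0, 'spam': 1}
print(encoded)  # [0, 0, 1, 0, 0]
```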


## Extracting Features 

In [13]:
x = data['sms']
y = data['target']

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cf = cv.fit_transform(x)


In [15]:
cf.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
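
Each row of the array above is one message and each column is one vocabulary word, which is why almost every entry is 0. A simplified sketch of that bag-of-words counting, assuming the `CountVectorizer` defaults of lower-casing and keeping tokens of two or more word characters; the helper names and sample texts are illustrative:

```python
import re
from collections import Counter

# Tokens of at least two word characters, mirroring the default token pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(texts):
    """Assign a column index to every distinct token, in alphabetical order."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}

def to_count_vector(text, vocabulary):
    """Count how often each vocabulary token occurs in one text."""
    vector = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocabulary:  # tokens outside the vocabulary are ignored
            vector[vocabulary[tok]] = n
    return vector

texts = ["Free entry free prize", "Ok lar ok"]
vocab = build_vocabulary(texts)
vectors = [to_count_vector(t, vocab) for t in texts]

print(vocab)    # {'entry': 0, 'free': 1, 'lar': 2, 'ok': 3, 'prize': 4}
print(vectors)  # [[1, 2, 0, 0, 1], [0, 0, 1, 2, 0]]
```

The real vectorizer returns a sparse matrix rather than dense lists, which matters for a vocabulary of several thousand words.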

## Splitting Dataset into Training and Testing Sets

In [16]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(cf,y,test_size = 0.2,random_state=0)
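
With 5572 messages minus 403 duplicates, 5169 rows go into the split. A small sketch of how a 20% hold-out divides them, assuming the test count is rounded up; the shuffling here uses Python's `random` module, so the selected rows will differ from those picked by `train_test_split`, and `split_indices` is a hypothetical helper:

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices and hold out a test_size fraction (rounded up)."""
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(5169, 0.2, seed=0)
print(len(train_idx), len(test_idx))  # 4135 1034
```

The 1034 test rows match the support reported in the evaluation cells. One thing to keep in mind: the notebook fits the `CountVectorizer` on all messages before splitting, so the vocabulary also reflects the test messages; fitting it on the training portion only would be the stricter setup.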


## Training Model using SVM (Support Vector Machine)

In [32]:
from sklearn import svm


model = svm.SVC(kernel='rbf',C=30,gamma='auto')
model.fit(x_train,y_train)
y_predict = model.predict(x_test)
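
The model above uses an RBF kernel with `gamma='auto'`, which scikit-learn documents as 1 / n_features. As an illustration of what the kernel computes for two count vectors (the vectors and the helper `rbf_kernel` are made up for this sketch, not taken from the dataset):

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Two illustrative word-count vectors over a five-word vocabulary
a = [1, 2, 0, 0, 1]
b = [0, 0, 1, 2, 0]

gamma_auto = 1 / len(a)  # gamma='auto' means 1 / n_features
similarity = rbf_kernel(a, b, gamma_auto)
print(round(similarity, 4))      # 0.1108
print(rbf_kernel(a, a, gamma_auto))  # 1.0
```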

## Model Evaluation

In [30]:
model.score(x_test, y_test)

0.9671179883945842

In [28]:
from sklearn.metrics import classification_report
print("* Classification Report for SVM (support Vector Machine) :")
print("------------------------------------------------------")
print(classification_report(y_test,y_predict))

* Classification Report for SVM (support Vector Machine) :
------------------------------------------------------
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       885
           1       0.99      0.78      0.87       149

    accuracy                           0.97      1034
   macro avg       0.98      0.89      0.93      1034
weighted avg       0.97      0.97      0.97      1034



## Model Building using Naive Bayes classifier

In [35]:
from sklearn.naive_bayes import MultinomialNB

In [36]:
model1 = MultinomialNB()
model1.fit(x_train,y_train)

MultinomialNB()

In [37]:
ny_pred = model1.predict(x_test) 

In [38]:
ny_pred

array([0, 0, 0, ..., 0, 0, 0])
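
To make the probabilistic idea behind `MultinomialNB` concrete, here is a from-scratch sketch on a toy corpus. It assumes add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0`; the function names and the four tiny documents are illustrative, and the library implementation differs in its details:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: list of class ids (0 = ham, 1 = spam)."""
    vocabulary = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    model = {}
    for label, n_docs in class_counts.items():
        total = sum(token_counts[label].values())
        denom = total + alpha * len(vocabulary)  # smoothed normalizer
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                tok: math.log((token_counts[label][tok] + alpha) / denom)
                for tok in vocabulary
            },
        }
    return model

def predict(model, doc):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in doc:
            # Tokens never seen during training are skipped
            if tok in params["log_likelihood"]:
                score += params["log_likelihood"][tok]
        scores[label] = score
    return max(scores, key=scores.get)

docs = [["free", "prize", "win"], ["ok", "see", "you"],
        ["win", "free", "entry"], ["ok", "lar"]]
labels = [1, 0, 1, 0]
toy_model = train_multinomial_nb(docs, labels)

print(predict(toy_model, ["free", "win"]))  # 1
print(predict(toy_model, ["ok", "you"]))    # 0
```

Working in log space avoids underflow when many small word probabilities are multiplied together.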

## Model Evaluation

In [39]:
# plot_confusion_matrix was unused here and is no longer available in scikit-learn 1.2+
from sklearn.metrics import accuracy_score,confusion_matrix

accuracy = accuracy_score(y_test,ny_pred)
print(f"Accuracy of the model is {accuracy}")

Accuracy of the model is 0.9777562862669246


In [40]:
conf_matrix = confusion_matrix(y_test,ny_pred)
conf_matrix

array([[872,  13],
       [ 10, 139]], dtype=int64)

In [41]:
from sklearn.metrics import classification_report
print("Classification Report : \n")
print(classification_report(y_test,ny_pred))

Classification Report : 

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       885
           1       0.91      0.93      0.92       149

    accuracy                           0.98      1034
   macro avg       0.95      0.96      0.96      1034
weighted avg       0.98      0.98      0.98      1034
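
The accuracy and the spam-class row of the report can be recomputed by hand from the Naive Bayes confusion matrix `[[872, 13], [10, 139]]`, where rows are true classes and columns are predicted classes. A short worked check, treating spam (class 1) as the positive class:

```python
conf_matrix = [[872, 13], [10, 139]]
(tn, fp), (fn, tp) = conf_matrix  # rows: true class, columns: predicted class

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.9778 0.91 0.93 0.92
```

These values agree with the reported accuracy of 0.9778 and the class 1 row of the classification report.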




## Applying Cross-Validation

In [48]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model1,x_train,y_train,cv=5,scoring='f1_macro')
scores2 = cross_val_score(model,x_train,y_train,cv=5,scoring='f1_macro')
print(" Cross validation score for 'Naive Bayes Theorem':",scores.mean())
print("\n Cross validation score for 'SVM':",scores2.mean())

 Cross validation score for 'Naive Bayes Theorem': 0.9525404660395917

 Cross validation score for 'SVM': 0.8990167865274484
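
Each score above is the mean of five per-fold F1 values. A simplified sketch of how 5-fold cross-validation partitions the 4135 training rows, using plain contiguous folds; note that for classifiers `cross_val_score` with an integer `cv` uses stratified folds, which this sketch does not reproduce, and `k_fold_indices` is a hypothetical helper:

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train, test) index lists; every row is in exactly one test fold."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread any remainder over the first folds

    folds, start = [], 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(4135, 5)
print([len(test) for _, test in folds])  # [827, 827, 827, 827, 827]
```

The model is trained five times, each time on four folds and scored on the remaining one, and the five scores are averaged.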


## Hyperparameter Tuning using GridSearchCV

In [None]:
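# candidate models and their parameter grids (svm and MultinomialNB are imported above)
model_params = {
    'svm':{
        'model': svm.SVC(gamma='auto'),
        'params':{
            'C':[1,10,20],  # SVC's regularization parameter is the upper-case 'C'
            'kernel':['rbf','linear']
        }
    },
    'model1':{
        'model':MultinomialNB(),
        'params':{
            # MultinomialNB has no 'n_estimators'; 'alpha' (smoothing) is its tunable parameter
            'alpha':[1,5,10]
        }
    },
    # this entry was left unfinished, so it is commented out to keep the dict valid
    # 'logistic_regresson':{
    #     'model':Lo
    # }
}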

In [53]:
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(svm.SVC(gamma='auto'), {
    'C': [1,10,20],
    'kernel': ['rbf','linear']
}, cv=5, return_train_score=False)
clf.fit(x_train, y_train)

GridSearchCV(cv=5, estimator=SVC(gamma='auto'),
             param_grid={'C': [1, 10, 20], 'kernel': ['rbf', 'linear']})

In [54]:
df2 = pd.DataFrame(clf.cv_results_)[['param_C','param_kernel','mean_test_score']]
df2

Unnamed: 0,param_C,param_kernel,mean_test_score
0,1,rbf,0.878114
1,1,linear,0.981862
2,10,rbf,0.905925
3,10,linear,0.981378
4,20,rbf,0.955018
5,20,linear,0.981378
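
The six rows above correspond to every combination of the two parameter lists. A small sketch of how such a grid is enumerated and how the best row can be picked, using the mean test scores copied from the table; in the notebook itself, `clf.best_params_` and `clf.best_score_` would give the same information directly:

```python
from itertools import product

param_grid = {"C": [1, 10, 20], "kernel": ["rbf", "linear"]}

# Every combination of the listed values, in the same order as the table
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]

# Mean test scores copied from the results table above
mean_test_scores = [0.878114, 0.981862, 0.905925, 0.981378, 0.955018, 0.981378]

best_score, best_params = max(zip(mean_test_scores, combinations),
                              key=lambda pair: pair[0])

n_fits = len(combinations) * 5  # 5-fold cross-validation per combination
print(best_params, best_score)  # {'C': 1, 'kernel': 'linear'} 0.981862
print(n_fits)                   # 30
```

In this run the linear kernel scores about 0.98 for every value of `C`, while the RBF kernel improves as `C` grows but stays below it.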
