## Aim: build models that classify SMS messages as spam or ham on the dataset described below, using CountVectorizer, Multinomial Naive Bayes and Bernoulli Naive Bayes.

## Dataset Description:

### url: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [1]:
## Importing required Libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [2]:
## reading the file
df = pd.read_table('SMSSpamCollection.txt',header=None,names=['Class','sms'])

In [3]:
## Checking shape of dataset
df.shape

(5572, 2)

In [4]:
## Checking Head of data
df.head()

| | Class | sms |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


- column 1 (`Class`) is the message type: `ham` or `spam`
- column 2 (`sms`) is the text of the SMS message

In [5]:
## Checking count of ham and spam:

ham_spam_count = df.Class.value_counts()
print(ham_spam_count)

ham     4825
spam     747
Name: Class, dtype: int64


In [6]:
## percentage of spam in the data:
print('Spam % is ',(ham_spam_count['spam']/float(ham_spam_count.sum()))*100)

Spam % is  13.406317300789663
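
With spam at roughly 13% of the data, accuracy on its own can mislead: a classifier that always answers ham is already right most of the time. The sketch below works out that baseline from the class counts printed above. It uses the whole dataset rather than the test split, and the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# Share of spam, and the accuracy of a hypothetical classifier
# that labels every message as ham.
spam_share = spam_count / total
baseline_accuracy = ham_count / total

print(f"Spam share: {spam_share:.4f}")
print(f"Always-ham baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy figure for this dataset is best read against that baseline, alongside precision and recall for the spam class.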


In [7]:
## Attaching label 0 for ham and label 1 for spam
df['label'] = df.Class.map({'ham':0,'spam':1})

In [8]:
df.head()

Unnamed: 0,Class,sms,label
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [9]:
## separating feature and target:
x = df.sms
y = df.label

In [10]:
x.shape,y.shape

((5572,), (5572,))

In [11]:
# splitting data into train and test in the ratio 70:30

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=2)
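
The split above is random rather than stratified, so the spam share in train and test can drift slightly from the overall 13%. `train_test_split` accepts a `stratify` argument for this purpose; the plain-Python sketch below only illustrates the idea on hypothetical toy labels, and `stratified_split_indices` is an illustrative helper, not a scikit-learn function.

```python
import random
from collections import defaultdict

def stratified_split_indices(labels, test_size=0.3, seed=2):
    """Return (train_idx, test_idx) with every class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class only
        n_test = int(round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy labels: 80 ham (0) and 20 spam (1).
toy_labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split_indices(toy_labels)
test_spam_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_spam_share)
```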

In [12]:
x_train.head()

1915    New TEXTBUDDY Chat 2 horny guys in ur area 4 j...
1056                             I'm at work. Please call
3717              Networking technical support associate.
5375    I cant pick the phone right now. Pls send a me...
945     I sent my scores to sophas and i had to do sec...
Name: sms, dtype: object

In [13]:
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((3900,), (1672,), (3900,), (1672,))

In [14]:
## Vectorizing the messages and removing English stop words

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english')

In [15]:
## CountVectorizer ignores labels, so only the training text is passed
vect.fit(x_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [16]:
vect.vocabulary_

{'new': 4252,
 'textbuddy': 6111,
 'chat': 1562,
 'horny': 3115,
 'guys': 2927,
 'ur': 6468,
 'area': 936,
 'just': 3453,
 '25p': 334,
 'free': 2672,
 'receive': 5037,
 'search': 5354,
 'postcode': 4763,
 'gaytextbuddy': 2763,
 'com': 1700,
 'txt': 6374,
 '89693': 658,
 '08715500022': 117,
 'rpl': 5227,
 'stop': 5838,
 'cnl': 1669,
 'work': 6809,
 'networking': 4245,
 'technical': 6068,
 'support': 5953,
 'associate': 986,
 'pick': 4634,
 'phone': 4622,
 'right': 5180,
 'pls': 4689,
 'send': 5393,
 'message': 3979,
 'sent': 5402,
 'scores': 5335,
 'sophas': 5683,
 'secondary': 5360,
 'application': 912,
 'schools': 5329,
 'think': 6152,
 'thinking': 6155,
 'applying': 915,
 'research': 5130,
 'cost': 1812,
 'contact': 1776,
 'joke': 3422,
 'ogunrinde': 4380,
 'school': 5328,
 'expensive': 2431,
 'ones': 4408,
 'urgent': 6472,
 '09066612661': 210,
 'landline': 3567,
 'complimentary': 1735,
 'lux': 3810,
 'costa': 1813,
 'del': 2001,
 'sol': 5654,
 'holiday': 3088,
 '1000': 229,
 'cash':
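
To make the vocabulary above less opaque, here is a rough plain-Python sketch of what fitting does: lowercase the text, extract tokens with the `token_pattern` shown in the fitted vectorizer, drop stop words, then number the remaining terms in sorted order. `STOP_WORDS` is a tiny hypothetical stand-in for scikit-learn's much longer English list, and `tokenize` and `build_vocabulary` are illustrative names.

```python
import re

# Same pattern as the token_pattern printed above: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical, heavily shortened stop-word list (assumption for this sketch).
STOP_WORDS = {"at", "the", "to", "is", "in", "please", "call"}

def tokenize(text):
    """Lowercase, split into tokens, and drop stop words."""
    return [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in STOP_WORDS]

def build_vocabulary(docs):
    """Map each distinct term to a column index, in alphabetical order."""
    terms = sorted({t for doc in docs for t in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

docs = ["I'm at work. Please call", "Networking technical support associate."]
vocab = build_vocabulary(docs)
print(vocab)
```

Single-character tokens such as the `I` and `m` in `I'm` never match the pattern, which is why they do not appear in the vocabulary.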

In [17]:
## creating the Document Term Matrix (bag of words), which is stored as a sparse matrix

x_train_transformed = vect.transform(x_train)
x_test_transformed = vect.transform(x_test)
type(x_train_transformed)

scipy.sparse.csr.csr_matrix

In [18]:
print(x_train_transformed)

  (0, 117)	1
  (0, 334)	1
  (0, 658)	1
  (0, 936)	1
  (0, 1562)	1
  (0, 1669)	1
  (0, 1700)	1
  (0, 2672)	1
  (0, 2763)	1
  (0, 2927)	1
  (0, 3115)	1
  (0, 3453)	1
  (0, 4252)	1
  (0, 4763)	1
  (0, 5037)	1
  (0, 5227)	1
  (0, 5354)	1
  (0, 5838)	1
  (0, 6111)	1
  (0, 6374)	1
  (0, 6468)	1
  (1, 6809)	1
  (2, 986)	1
  (2, 4245)	1
  (2, 5953)	1
  :	:
  (3897, 1014)	1
  (3897, 1458)	1
  (3897, 2078)	1
  (3897, 2672)	2
  (3897, 4287)	2
  (3897, 4694)	1
  (3897, 4703)	1
  (3897, 5393)	1
  (3897, 6592)	1
  (3897, 6749)	1
  (3897, 6794)	1
  (3898, 2863)	1
  (3898, 3431)	1
  (3898, 4280)	1
  (3898, 6583)	1
  (3899, 347)	1
  (3899, 843)	1
  (3899, 2323)	1
  (3899, 3494)	1
  (3899, 4638)	2
  (3899, 5402)	1
  (3899, 6107)	1
  (3899, 6542)	1
  (3899, 6543)	2
  (3899, 6868)	1
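
Each printed line above is a `(row, column)` pair followed by a count: the row is the message, the column is the vocabulary index of a term, and zero entries are simply not stored. A minimal sketch of that representation, using a hypothetical four-word vocabulary and an illustrative `transform` helper:

```python
import re
from collections import Counter

# Hypothetical vocabulary (term -> column index); not the one fitted above.
vocab = {"cash": 0, "free": 1, "prize": 2, "win": 3}
docs = ["Free free cash", "win a prize", "hello there"]

def transform(docs, vocab):
    """Return {(row, column): count} holding only the non-zero entries."""
    entries = {}
    for row, doc in enumerate(docs):
        tokens = re.findall(r"(?u)\b\w\w+\b", doc.lower())
        counts = Counter(t for t in tokens if t in vocab)
        for term, n in counts.items():
            entries[(row, vocab[term])] = n
    return entries

entries = transform(docs, vocab)
for key in sorted(entries):
    print(key, entries[key])
```

The third message contains no vocabulary terms, so it contributes no entries at all. The same thing happens to words in `x_test` that were never seen in `x_train`: they are ignored at transform time.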


In [19]:
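## Converting x_train_transformed to a dense array just to display and check the DTM:
x_array = x_train_transformed.toarray()
## get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names()
pd.DataFrame(x_array,columns=vect.get_feature_names_out())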

Unnamed: 0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,02073162414,...,zealand,zebra,zed,zeros,zoe,zogtorius,zoom,zouk,ú1,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Predicting with Multinomial Naive Bayes:

In [20]:
from sklearn.naive_bayes import MultinomialNB

In [21]:
mnb = MultinomialNB()

In [22]:
## Fitting/Training the model:

mnb.fit(x_train_transformed,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
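
As a rough picture of what fitting estimates, the sketch below implements multinomial Naive Bayes in plain Python: a log prior per class from class frequencies, and a smoothed log likelihood per term from word counts, with `alpha=1.0` playing the role of the Laplace smoothing shown in the fitted model. The toy documents and the function names `fit_multinomial_nb` and `predict` are assumptions for illustration, not scikit-learn's implementation.

```python
import math
from collections import Counter

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """docs: lists of tokens; labels: class ids. Returns (log_prior, log_likelihood)."""
    vocab = sorted({t for doc in docs for t in doc})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(t for doc in class_docs for t in doc)
        # Add alpha to every term count so unseen terms never get probability zero.
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unknown tokens are skipped."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in doc if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy training set: label 1 = spam, label 0 = ham.
train_docs = [["free", "prize", "win"], ["free", "cash"],
              ["call", "later"], ["see", "you", "later"]]
train_labels = [1, 1, 0, 0]
log_prior, log_likelihood = fit_multinomial_nb(train_docs, train_labels)

print(predict(["free", "cash", "prize"], log_prior, log_likelihood))
print(predict(["call", "you", "later"], log_prior, log_likelihood))
```

Working in log space avoids underflow when many small probabilities are multiplied together.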

In [23]:
## Predicting labels using DTM created for x_test:
y_pred_class = mnb.predict(x_test_transformed)

In [24]:
y_pred_class

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [25]:
## Predicting probability of label being 0(ham) or 1(spam) using DTM created for x_test:
y_pred_prob = mnb.predict_proba(x_test_transformed)

In [26]:
y_pred_prob

array([[9.99948979e-01, 5.10206162e-05],
       [9.98838354e-01, 1.16164642e-03],
       [9.56277513e-01, 4.37224874e-02],
       ...,
       [9.99633683e-01, 3.66316753e-04],
       [9.99981499e-01, 1.85014380e-05],
       [9.90126009e-01, 9.87399081e-03]])

In [27]:
## checking order of classes for predicted prob
mnb.classes_

array([0, 1], dtype=int64)

In [28]:
#Creating a dataframe of corresponding prob scores:
pd.DataFrame(y_pred_prob,columns=mnb.classes_)

| | 0 | 1 |
|---|---|---|
| 0 | 9.999490e-01 | 5.102062e-05 |
| 1 | 9.988384e-01 | 1.161646e-03 |
| 2 | 9.562775e-01 | 4.372249e-02 |
| 3 | 9.964031e-01 | 3.596910e-03 |
| 4 | 9.987206e-01 | 1.279414e-03 |
| 5 | 1.000000e+00 | 6.079780e-10 |
| 6 | 9.980785e-01 | 1.921524e-03 |
| 7 | 9.999925e-01 | 7.460160e-06 |
| 8 | 2.379224e-14 | 1.000000e+00 |
| 9 | 9.809091e-01 | 1.909093e-02 |


In [29]:
print("Probability of test document belonging to label 0(ham):",y_pred_prob[:,0])
print("Probability of test document belonging to label 1(spam):",y_pred_prob[:,1])

Probability of test document belonging to label 0(ham): [0.99994898 0.99883835 0.95627751 ... 0.99963368 0.9999815  0.99012601]
Probability of test document belonging to label 1(spam): [5.10206162e-05 1.16164642e-03 4.37224874e-02 ... 3.66316753e-04
 1.85014380e-05 9.87399081e-03]
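
For two classes, `predict` picks the more probable label, which amounts to a 0.5 cut-off on the spam probability. The probabilities can also be thresholded by hand to trade precision for recall. The sketch below applies two cut-offs to the first ten spam probabilities shown in the table above; `flag_spam` and the 0.01 threshold are illustrative choices, not part of the notebook.

```python
# Spam probabilities (column 1) for the first ten test messages, copied from the table above.
spam_prob = [5.102062e-05, 1.161646e-03, 4.372249e-02, 3.596910e-03, 1.279414e-03,
             6.079780e-10, 1.921524e-03, 7.460160e-06, 1.000000e+00, 1.909093e-02]

def flag_spam(probabilities, threshold=0.5):
    """Return the positions whose spam probability reaches the threshold."""
    return [i for i, p in enumerate(probabilities) if p >= threshold]

flagged_default = flag_spam(spam_prob)                 # default-style cut-off
flagged_low = flag_spam(spam_prob, threshold=0.01)     # more aggressive cut-off

print("threshold 0.5 :", flagged_default)
print("threshold 0.01:", flagged_low)
```

A lower threshold flags more messages as spam, which tends to raise recall at the cost of more ham being flagged.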


### Checking model performance for Multinomial Naive Bayes:

In [30]:
from sklearn import metrics
print(metrics.accuracy_score(y_test,y_pred_class))

0.9796650717703349


In [31]:
cm = metrics.confusion_matrix(y_test,y_pred_class)
print(cm)

[[1434   11]
 [  23  204]]


In [32]:
TN = cm[0,0]
FP = cm[0,1]
FN = cm[1,0]
TP = cm[1,1]

In [33]:
sensitivity = TP/float(FN+TP)
print("Sensitivity",sensitivity)

Sensitivity 0.8986784140969163


In [34]:
specificity = TN/float(TN+FP)
print("specificity",specificity)

specificity 0.9923875432525952


In [35]:
precision = TP/float(TP+FP)
print("precision",precision)
print(metrics.precision_score(y_test,y_pred_class))

precision 0.9488372093023256
0.9488372093023256


In [36]:
from sklearn.metrics import classification_report
cr = classification_report(y_test,y_pred_class)

In [37]:
print(cr)

              precision    recall  f1-score   support

           0       0.98      0.99      0.99      1445
           1       0.95      0.90      0.92       227

    accuracy                           0.98      1672
   macro avg       0.97      0.95      0.96      1672
weighted avg       0.98      0.98      0.98      1672
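
The spam-class figures in the report can be reproduced by hand from the confusion matrix printed earlier. The helper below is a small sketch (`binary_metrics` is an illustrative name) that assumes the scikit-learn layout `[[TN, FP], [FN, TP]]` and also derives accuracy and F1, which the cells above do not compute manually.

```python
def binary_metrics(cm):
    """Compute metrics for the positive class from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "specificity": specificity,
            "accuracy": accuracy, "f1": f1}

# Confusion matrix for Multinomial Naive Bayes, copied from the output above.
mnb_metrics = binary_metrics([[1434, 11], [23, 204]])
for name, value in mnb_metrics.items():
    print(f"{name}: {value:.4f}")
```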



## Predicting with Bernoulli Naive Bayes:

In [38]:
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()

In [39]:
bnb.fit(x_train_transformed,y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
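
Bernoulli Naive Bayes differs from the multinomial model in two ways: `binarize=0.0` turns every count above zero into 1, so only presence matters, and terms that are absent from a message also contribute to the score. The sketch below illustrates both points; `bernoulli_log_likelihood` and the per-term probabilities are hypothetical values chosen for illustration.

```python
import math

def bernoulli_log_likelihood(row_counts, feature_prob, binarize=0.0):
    """Log likelihood of one document under per-term presence probabilities."""
    present = [1 if count > binarize else 0 for count in row_counts]
    return sum(
        math.log(p) if x else math.log(1.0 - p)   # absent terms contribute log(1 - p)
        for x, p in zip(present, feature_prob)
    )

# Hypothetical P(term present | class) for a three-term vocabulary.
feature_prob = [0.5, 0.25, 0.5]

ll_repeated = bernoulli_log_likelihood([2, 0, 1], feature_prob)
ll_single = bernoulli_log_likelihood([1, 0, 1], feature_prob)
print(ll_repeated, ll_single)
```

Both calls return the same value, because a term appearing twice counts the same as a term appearing once.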

In [40]:
y_pred_class_bnb = bnb.predict(x_test_transformed)
y_pred_class_bnb

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [41]:
## Predicting probability of label being 0(ham) or 1(spam) using DTM created for x_test:
y_pred_prob_bnb = bnb.predict_proba(x_test_transformed)

In [42]:
y_pred_prob_bnb

array([[1.00000000e+00, 3.16294038e-10],
       [1.00000000e+00, 6.06469473e-11],
       [9.99999998e-01, 1.98881156e-09],
       ...,
       [1.00000000e+00, 7.60406899e-11],
       [1.00000000e+00, 3.95187015e-10],
       [9.99999993e-01, 6.98826991e-09]])

In [43]:
#Creating a dataframe of corresponding prob scores:
pd.DataFrame(y_pred_prob_bnb,columns=bnb.classes_)

| | 0 | 1 |
|---|---|---|
| 0 | 1.000000e+00 | 3.162940e-10 |
| 1 | 1.000000e+00 | 6.064695e-11 |
| 2 | 1.000000e+00 | 1.988812e-09 |
| 3 | 1.000000e+00 | 2.352743e-10 |
| 4 | 1.000000e+00 | 9.461758e-11 |
| 5 | 1.000000e+00 | 1.804478e-13 |
| 6 | 1.000000e+00 | 3.408516e-11 |
| 7 | 1.000000e+00 | 4.684551e-12 |
| 8 | 7.779584e-14 | 1.000000e+00 |
| 9 | 1.000000e+00 | 3.971697e-10 |


In [44]:
print("Probability of test document belonging to label 0(ham):",y_pred_prob_bnb[:,0])
print("Probability of test document belonging to label 1(spam):",y_pred_prob_bnb[:,1])

Probability of test document belonging to label 0(ham): [1.         1.         1.         ... 1.         1.         0.99999999]
Probability of test document belonging to label 1(spam): [3.16294038e-10 6.06469473e-11 1.98881156e-09 ... 7.60406899e-11
 3.95187015e-10 6.98826991e-09]


### Checking model performance for Bernoulli Naive Bayes:

In [45]:
print(metrics.accuracy_score(y_test,y_pred_class_bnb))

0.9677033492822966


In [46]:
cm_bnb = metrics.confusion_matrix(y_test,y_pred_class_bnb)
print(cm_bnb)

[[1444    1]
 [  53  174]]


In [47]:
TN_bnb = cm_bnb[0,0]
FP_bnb = cm_bnb[0,1]
FN_bnb = cm_bnb[1,0]
TP_bnb = cm_bnb[1,1]

In [48]:
sensitivity_bnb = TP_bnb/float(FN_bnb+TP_bnb)
print("Sensitivity",sensitivity_bnb)

Sensitivity 0.7665198237885462


In [49]:
specificity_bnb = TN_bnb/float(TN_bnb+FP_bnb)
print("specificity",specificity_bnb)

specificity 0.9993079584775086


In [50]:
precision_bnb = TP_bnb/float(TP_bnb+FP_bnb)
print("precision",precision_bnb)
print(metrics.precision_score(y_test,y_pred_class_bnb))

precision 0.9942857142857143
0.9942857142857143


In [51]:
from sklearn.metrics import classification_report
cr_bnb = classification_report(y_test,y_pred_class_bnb)

In [52]:
print(cr_bnb)

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      1445
           1       0.99      0.77      0.87       227

    accuracy                           0.97      1672
   macro avg       0.98      0.88      0.92      1672
weighted avg       0.97      0.97      0.97      1672
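
Putting the two confusion matrices side by side shows where the models differ: Bernoulli Naive Bayes almost never flags ham as spam but misses more spam, while Multinomial Naive Bayes catches more spam at the cost of a few more false alarms. The sketch below tabulates those error types from the matrices printed in this notebook; the dictionary layout and key names are illustrative.

```python
# Confusion matrices copied from the outputs above, laid out as [[TN, FP], [FN, TP]].
confusion = {
    "MNB": [[1434, 11], [23, 204]],
    "BNB": [[1444, 1], [53, 174]],
}

rates = {}
for name, ((tn, fp), (fn, tp)) in confusion.items():
    rates[name] = {
        "ham_flagged": fp,                        # false positives
        "missed_spam": fn,                        # false negatives
        "false_positive_rate": fp / (fp + tn),
        "miss_rate": fn / (fn + tp),
    }

for name, r in rates.items():
    print(f"{name}: ham flagged={r['ham_flagged']} ({r['false_positive_rate']:.4f}), "
          f"spam missed={r['missed_spam']} ({r['miss_rate']:.4f})")
```

Which trade-off is preferable depends on whether a lost legitimate message or a delivered spam message is considered the costlier mistake.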



## Appending predicted values to the test data just to cross check predicted labels:

In [53]:
x_test

5086    Omg if its not one thing its another. My cat h...
2120                I hope you know I'm still mad at you.
2318    Waqt se pehle or naseeb se zyada kisi ko kuch ...
2917      What time should I tell my friend to be around?
1352                       Yo theres no class tmrw right?
1457    U sleeping now.. Or you going to take? Haha.. ...
1908                                      ELLO BABE U OK?
411     Come by our room at some point so we can iron ...
385     Double mins and txts 4 6months FREE Bluetooth ...
5463                                    U GOIN OUT 2NITE?
380     I taught that Ranjith sir called me. So only i...
3476    Night has ended for another day, morning has c...
3349                               Sorry, I'll call later
1190    In that case I guess I'll see you at campus lodge
961                    U sure u can't take any sick time?
4077    87077: Kick off a new season with 2wks FREE go...
1067         Once free call me sir. I am waiting for you.
4411          

In [54]:
type(x_test)

pandas.core.series.Series

In [55]:
type(y_test)

pandas.core.series.Series

In [56]:
type(y_pred_class)

numpy.ndarray

In [57]:
df_result = pd.DataFrame(columns = ['sms', 'Original Label'])

In [58]:
df_result['sms'] = x_test

In [59]:
df_result['Original Label'] = y_test

In [60]:
df_result['MNB Prediction'] = y_pred_class

In [61]:
df_result['BNB Prediction'] = y_pred_class_bnb

In [62]:
df_result

| | sms | Original Label | MNB Prediction | BNB Prediction |
|---|---|---|---|---|
| 5086 | Omg if its not one thing its another. My cat h... | 0 | 0 | 0 |
| 2120 | I hope you know I'm still mad at you. | 0 | 0 | 0 |
| 2318 | Waqt se pehle or naseeb se zyada kisi ko kuch ... | 0 | 0 | 0 |
| 2917 | What time should I tell my friend to be around? | 0 | 0 | 0 |
| 1352 | Yo theres no class tmrw right? | 0 | 0 | 0 |
| 1457 | U sleeping now.. Or you going to take? Haha.. ... | 0 | 0 | 0 |
| 1908 | ELLO BABE U OK? | 0 | 0 | 0 |
| 411 | Come by our room at some point so we can iron ... | 0 | 0 | 0 |
| 385 | Double mins and txts 4 6months FREE Bluetooth ... | 1 | 1 | 1 |
| 5463 | U GOIN OUT 2NITE? | 0 | 0 | 0 |
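
With all labels in one frame, the quickest cross-check is to filter for rows where a prediction differs from the original label, or where the two models disagree. The sketch below shows the filters on a small hypothetical frame with the same column names as `df_result`; the four rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for df_result.
toy_result = pd.DataFrame(
    {
        "sms": ["msg a", "msg b", "msg c", "msg d"],
        "Original Label": [0, 1, 1, 0],
        "MNB Prediction": [0, 1, 1, 1],
        "BNB Prediction": [0, 1, 0, 0],
    }
)

# Rows where the two models disagree with each other.
disagreements = toy_result[toy_result["MNB Prediction"] != toy_result["BNB Prediction"]]

# Rows where each model disagrees with the true label.
mnb_errors = toy_result[toy_result["MNB Prediction"] != toy_result["Original Label"]]
bnb_errors = toy_result[toy_result["BNB Prediction"] != toy_result["Original Label"]]

print(disagreements)
```

Applying the same filters to `df_result` would list the misclassified messages for manual inspection.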
