<img height="60" width="120" src="https://shwetkm.github.io/upxlogo.png"></img>
# UpX Academy - Machine Learning Track
# Naive Bayes Classifier

## Goal:  Train a Naive Bayes model to classify future SMS messages as either spam or ham.

Steps:

1.  Convert the labels ham and spam to a binary indicator variable (0/1)

2.  Convert the text to a sparse matrix of TF-IDF vectors

3.  Fit a Naive Bayes Classifier

4.  Measure your success using roc_auc_score



In [2]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

In [5]:
df = pd.read_csv("D:/UpX/ML_with_Python_May17/Datasets/sms_spam.csv")

In [6]:
df.head()

Unnamed: 0,type,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


#### Train the classifier to predict whether a message is spam or ham from its text

In [7]:
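# TF-IDF vectorizer
# Requires the NLTK stop-word list: nltk.download('stopwords')
# stop_words is passed as a list because recent scikit-learn versions reject a set
stop_words = stopwords.words('english')
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=stop_words)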

#### Convert the labels spam and ham to 1 and 0 respectively, so the model can predict the probability of spam

In [8]:
# Assign the result back instead of using inplace=True on an attribute,
# which may silently leave df unchanged in newer pandas versions
df['type'] = df['type'].map({'ham': 0, 'spam': 1})
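
A dictionary-based mapping makes the encoding explicit and surfaces unexpected labels. The sketch below uses a small made-up frame (`toy`, `label_map` and the `label` column are hypothetical names, not part of the notebook) to show the idea.

```python
import pandas as pd

# Hypothetical stand-in for the 'type' column of the SMS dataset
toy = pd.DataFrame({"type": ["ham", "spam", "ham", "ham"]})

label_map = {"ham": 0, "spam": 1}
toy["label"] = toy["type"].map(label_map)

# map() leaves NaN for any value missing from label_map,
# so this check catches misspelled or unexpected labels
assert toy["label"].notna().all()

toy["label"] = toy["label"].astype(int)
print(toy["label"].tolist())  # [0, 1, 0, 0]
```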

In [10]:
df.head()

Unnamed: 0,type,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [11]:
df.shape

(5574, 2)

In [12]:
## The dependent variable is the label: 1 for spam, 0 for ham
y = df.type

In [13]:
# Convert df.text from raw text to TF-IDF features
X = vectorizer.fit_transform(df.text)

In [15]:
print(y.shape)
print(X.shape)

(5574,)
(5574, 8586)


In [16]:
## Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
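
Because the vectorizer above is fitted on the full dataset before the split, IDF statistics from the test messages influence the training features. One common alternative, sketched here on made-up data with hypothetical variable names, is to split the raw text first and let a scikit-learn `Pipeline` fit TF-IDF on the training portion only. The `stratify` argument is an addition that keeps the spam/ham ratio the same in both parts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df.text and df.type
texts = [
    "win a free prize now", "free entry claim your cash",
    "urgent call to claim prize", "cash prize waiting call now",
    "are we still on for lunch", "see you at home tonight",
    "can you call me later", "thanks for the lift today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text first; stratify keeps the class ratio equal in both parts
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# The pipeline fits TF-IDF on the training text only, then fits the classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(text_train, y_tr)

print(model.predict(text_test))
```

On a dataset of this notebook's size the effect on the score is likely small, but the pipeline form also makes it harder to accidentally call `fit_transform` on unseen data.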

In [17]:
## Train the Naive Bayes classifier
## Fast (one pass over the data)
## Works well with sparse data: most of the 8586 vocabulary words do not occur in any single message
clf = naive_bayes.MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
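
Once fitted, the classifier can score messages it has never seen, provided they pass through the same fitted vectorizer. The following is a minimal sketch on invented data (`toy_vec`, `toy_clf` and the messages are hypothetical, not the notebook's objects).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data, not the SMS dataset
train_texts = [
    "win a free prize now", "free cash claim now",
    "see you at lunch", "call me when you are home",
]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

toy_vec = TfidfVectorizer()
toy_clf = MultinomialNB()
toy_clf.fit(toy_vec.fit_transform(train_texts), train_labels)

# New messages go through transform(), not fit_transform(), so they are
# mapped onto the vocabulary learned during training
new_messages = ["claim your free prize", "are you home for lunch"]
new_features = toy_vec.transform(new_messages)

# Column 1 holds P(spam), following the order in toy_clf.classes_
spam_probability = toy_clf.predict_proba(new_features)[:, 1]
for message, p in zip(new_messages, spam_probability):
    print(f"{p:.2f}  {message}")
```

Words absent from the training vocabulary (such as "your" here) are simply ignored by `transform()`, and the default `alpha=1.0` smoothing keeps a word seen in only one class from forcing a probability of exactly 0 or 1.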

In [18]:
y_test

3690    0
3527    0
724     0
3370    0
468     0
5412    0
4362    0
4241    0
5442    0
5309    0
2232    0
3573    0
4379    0
3316    1
4895    0
296     1
453     0
4880    0
2034    0
4287    0
605     0
1615    0
5169    0
4655    0
2754    0
2727    0
4295    1
3893    1
2559    0
730     0
       ..
3768    0
3809    0
3034    0
5082    0
257     0
507     0
1438    0
99      0
1957    0
5216    1
3412    0
4058    0
3650    0
2707    0
1954    0
4028    0
2164    0
4564    0
366     0
2561    0
3680    0
4320    0
3133    0
949     0
4842    0
19      1
4758    0
668     0
218     0
4660    0
Name: type, dtype: int64

#### Check for null values in the type column

In [20]:
df[df.type.isnull()]

Unnamed: 0,type,text


#### There are no null values
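
A per-column count is a compact way to run the same check across every column at once. This sketch uses a made-up frame with deliberately missing values (the `toy` frame is hypothetical).

```python
import pandas as pd

# Hypothetical frame with one missing label and one missing message
toy = pd.DataFrame({
    "type": [0, 1, None],
    "text": ["see you soon", "win cash now", None],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)
```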

In [21]:
clf.predict_proba(X_test)[:,1]

array([ 0.00270358,  0.01501181,  0.0666378 , ...,  0.00803285,
        0.0139652 ,  0.00349621])

In [22]:
## Evaluate the model with ROC AUC on the test set
roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])

0.98607103532616969

### The model achieves a ROC AUC of ~0.986 on the test set
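
ROC AUC is a ranking measure: it is the probability that a randomly chosen spam message receives a higher predicted score than a randomly chosen ham message. It is therefore not the same as accuracy, which depends on a decision threshold. The sketch below uses invented labels and scores (not the notebook's output) to show the two metrics diverging.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels and predicted spam probabilities
y_true = [0, 0, 0, 0, 1, 1]
y_score = [0.10, 0.20, 0.30, 0.60, 0.40, 0.90]

# AUC: fraction of (spam, ham) pairs in which the spam message scores higher.
# Here 7 of the 8 pairs are ordered correctly.
auc = roc_auc_score(y_true, y_score)

# Accuracy requires choosing a threshold first; 0.5 is used here
y_pred = [int(s >= 0.5) for s in y_score]
acc = accuracy_score(y_true, y_pred)

print(f"ROC AUC:  {auc:.3f}")   # 0.875
print(f"Accuracy: {acc:.3f}")   # 0.667
```

Since most messages in this dataset are ham, reporting accuracy or a confusion matrix at a chosen threshold alongside the AUC gives a fuller picture of how the classifier would behave in use.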