## Naive Bayes ##

Before we get into the Naive Bayes algorithm, we have to understand Bayes' theorem. Bayes' theorem sets out a probabilistic formula for predicting the likelihood (probability) of an event based on the probabilities of certain other events or features.

Mathematically, Bayes' theorem is as follows:

<img src="files/bayes.png">

where:

P(A) and P(B) are the probabilities of observing A and B independently of each other.

P(A | B) is the probability of observing event A given that B is true.

P(B | A) is the probability of observing event B given that A is true.
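
In case the image above does not render, the standard statement of Bayes' theorem in the notation of these definitions is given below. This is the textbook form, which is presumably what the image shows.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```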

For example, when we search for the term 'Sacramento Kings' on a search engine, we expect results about the Sacramento Kings NBA basketball team. To return them, the search engine needs to associate the two words with each other rather than treat them individually. If it treated them individually, we would get images tagged with 'Sacramento', like pictures of city landscapes, and images tagged with 'Kings', which could be pictures of crowns or kings from history. This is a classic case of the search engine treating the words as independent entities and hence being 'naive' in its approach.


The 'Naive' bit gets added to Bayes' theorem through the extra assumption that the features we use to make our predictions are independent of each other, given the class we are predicting.
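
A minimal sketch of what the independence assumption buys us: the likelihood of a whole message given a class can be approximated as the product of per-word likelihoods. The word probabilities and the helper name `naive_likelihood` below are made up for illustration and are not estimated from the dataset.

```python
import math

# Hypothetical per-word likelihoods P(word | spam); illustrative numbers only.
p_word_given_spam = {"free": 0.30, "win": 0.20, "money": 0.25}

def naive_likelihood(words, p_word_given_class):
    """Approximate P(words | class) as the product of P(word | class)."""
    return math.prod(p_word_given_class[w] for w in words)

likelihood = naive_likelihood(["free", "win", "money"], p_word_given_spam)
print(round(likelihood, 4))  # 0.3 * 0.2 * 0.25 = 0.015
```

Without the independence assumption we would need a probability for every possible combination of words, which is not practical to estimate from a few thousand messages.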


Our objective in this mission is to train a classifier that can identify whether a message is spam or not, using the Naive Bayes algorithm.

We will break the Naive Bayes formula down using the mission objective.

**Prior: P(Spam)**

The prior is the probability of a message being spam without considering any other factors.

**Posterior: P(Spam | One), P(~Spam | One)**

The posterior is the probability of a message being spam, given that our classifier classified it with the value '1' (i.e. classified it as a spam message), and the probability of a message not being spam, given that our classifier classified it with the value '1'.


**Sensitivity (True Positive Rate):**

For our mission, the sensitivity is *P(One | Spam)*, the probability that our algorithm assigns a '1' (spam) classification to a message, given that the message is actually spam.


**Specificity (True Negative Rate):**

Similarly, the specificity is *P(Zero | ~Spam)*, the probability that our algorithm assigns a '0' (not spam) classification to a message, given that the message is actually not spam.

Using the above values, we can calculate our posterior as follows:

`P(Spam | One) = (P(Spam) * P(One | Spam)) / P(One)`

where `P(One) = P(Spam) * P(One | Spam) + P(~Spam) * P(One | ~Spam)` and `P(One | ~Spam) = 1 - P(Zero | ~Spam)`.
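
A worked numeric sketch of the posterior via Bayes' theorem, with P(One) expanded by the law of total probability. The prior, sensitivity and specificity values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical inputs, not measured from the dataset.
p_spam = 0.2          # prior P(Spam)
sensitivity = 0.9     # P(One | Spam)
specificity = 0.95    # P(Zero | ~Spam)

p_not_spam = 1 - p_spam
p_one_given_not_spam = 1 - specificity  # false positive rate P(One | ~Spam)

# Law of total probability: overall chance that the classifier outputs '1'.
p_one = p_spam * sensitivity + p_not_spam * p_one_given_not_spam

# Bayes' theorem: probability that a message flagged '1' really is spam.
p_spam_given_one = p_spam * sensitivity / p_one
print(round(p_one, 2), round(p_spam_given_one, 3))  # 0.22 0.818
```

Even with a fairly accurate classifier, the posterior stays noticeably below the sensitivity here because most messages are not spam to begin with.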

### Step 1: Understanding our dataset ### 


We will be using a dataset from the UCI Machine Learning Repository, which has a very good collection of datasets for experimental research purposes.


**Here's a preview of the data:**

<img src="files/dqnb.png">

The columns in the data set are currently not named and, as you can see, there are 2 columns.

The first column takes two values: 'ham', which signifies that the message is not spam, and 'spam', which signifies that the message is spam. The second column is the text content of the SMS message being classified.



In [None]:
'''
Instructions:

Import the dataset into a pandas dataframe using the read_table method. Because this is a tab-separated dataset,
we will be using '\t' as the value for the 'sep' argument, which specifies this format. Also, rename the columns
by passing the list ['label', 'sms_message'] to the 'names' argument.

Print the first five values of the dataframe with the new column names.
'''

In [10]:
'''
Solution
'''
import pandas as pd
# Dataset - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection#
df = pd.read_table('/Users/adarshnair/Desktop/DQ_NB/smsspamcollection/SMSSpamCollection',
                   sep='\t', 
                   header=None, 
                   names=['label', 'sms_message'])

# Output
df.head()

|   | label | sms_message |
|---|-------|-------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Step 2: Cleaning our dataset ###

Now that we have a basic understanding of what our dataset looks like, let's convert our labels to binary variables, with 0 representing 'ham' and 1 representing 'spam'.

In [None]:
'''
Instructions: Convert the values in the 'label' column to numerical values using the map method as follows:
{'ham':0, 'spam':1}. This maps the 'ham' value to 0 and the 'spam' value to 1.
'''

In [13]:
'''
Solution
'''
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()

|   | label | sms_message |
|---|-------|-------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
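
One thing worth checking after a dictionary-based `map`: any label that is not a key in the dictionary becomes `NaN` rather than raising an error. The sketch below uses a tiny made-up dataframe (not the real dataset) with one deliberately mis-cased label to show the effect; the names `toy`, `mapped` and `n_unmapped` are hypothetical.

```python
import pandas as pd

# Tiny stand-in for the real dataset; these rows are made up for illustration.
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "Spam"],  # note the unexpected 'Spam'
    "sms_message": ["see you soon", "win cash now", "ok", "free prize"],
})

mapped = toy["label"].map({"ham": 0, "spam": 1})

# Series.map leaves values missing from the dict as NaN, so count them.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)  # 1
```

A count of zero on the real `df['label']` column would confirm that every row was either 'ham' or 'spam'.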


### Step 3: Training and testing sets ###

In [None]:
'''
Instructions:
Split the dataset into a training and testing set by using the train_test_split method in sklearn. 
'''

In [18]:
# split into training and testing sets
# (train_test_split lives in sklearn.model_selection; the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split
# Check total number of rows in data
print(df.shape)
X_train, X_test, y_train, y_test = train_test_split(df.sms_message, 
                                                    df.label, 
                                                    random_state=1)
# Training dataset size
print(X_train.shape)
# Testing dataset size
print(X_test.shape)

(5572, 2)
(4179,)
(1393,)
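
The sizes above are consistent with `train_test_split` holding out 25% of the rows, which, as far as I know, is its default when neither `test_size` nor `train_size` is passed. A short sketch of the arithmetic under that assumption:

```python
import math

n_rows = 5572            # total rows, from df.shape above
test_fraction = 0.25     # assumed default test_size of train_test_split

# The test set takes the held-out fraction (rounded up); the rest is training data.
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test
print(n_train, n_test)  # 4179 1393
```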


### Step 4: Bag of words ###

Spam detection is one of the major applications of machine learning and specifically email spam detection is something most of us take for granted these days. Most major email service providers have pretty efficient spam detection systems built in, such that we rely on their algorithms to take care of this for us.

There is one small issue though: most ML algorithms rely on numerical data being fed into them, and email/SMS messages are usually text heavy. To handle this, we will be using sklearn's `sklearn.feature_extraction.text.CountVectorizer` class, which does the following:

* It tokenizes the string (lowercasing it and dropping punctuation by default) and gives an integer ID to each token.
* It counts the occurrence of each of those tokens.
* It can optionally drop extremely common words (words like 'the', 'a', 'an', 'is', 'from', pronouns, etc.) through its `stop_words` parameter so that they don't skew the values. It does not normalize the counts itself.

Using this approach, we can convert a collection of documents to a matrix, with each document being a row, each token (word) being a column, and each value being the frequency of occurrence of that token in that document.
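
To make the document-by-token matrix concrete, here is a minimal pure-Python sketch that builds one by hand for three short example messages. It only roughly mirrors `CountVectorizer`'s default tokenization (lowercase, tokens of two or more word characters); the helper `tokenize` and the variables `vocabulary` and `matrix` are hypothetical names used for illustration.

```python
import re
from collections import Counter

docs = ['Hello how are you!',
        'Win money, from home.',
        'I am busy, call me later']

def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters,
    # so punctuation and single-letter words such as 'I' are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted alphabetically.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary token.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))  # missing tokens count as 0
    matrix.append([counts[tok] for tok in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row sums to the number of tokens kept from that message, and most entries are zero, which is why real implementations store the result as a sparse matrix.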

We have taken care of the part where the textual data is converted into this matrix, so you can focus on the actual prediction. If you'd like to take a look at how we did it, check the following code snippets; otherwise, you can head straight to the problem set.

In [None]:
'''
This is an optional segment showing you how to create a frequency matrix given a corpus of documents. The code is 
already complete and is meant for you to play around with. Feel free to create your own training sample documents 
to see how the frequency matrix changes with changes in the documents. 
'''

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

# Let's create some data to play with.
train_simple = ['Hello how are you!',
                'Win money, from home.',
                'I am busy, call me later']

'''
Get the feature list, that is the word list from our corpus of 3 documents.
(get_feature_names_out replaces get_feature_names, which newer scikit-learn releases removed.)
'''
vect = CountVectorizer()
vect.fit(train_simple)
vect.get_feature_names_out()

'''
Create a matrix with the rows being the 3 documents, and the columns being each word. 
The corresponding (row, column) value is the frequency of occurrence of that word (in the column) in a particular
document (in the row).
'''
train_simple_dtm = vect.transform(train_simple)
train_simple_dtm
train_simple_dtm.toarray()

# Here is the matrix in the form of a dataframe
pd.DataFrame(train_simple_dtm.toarray(), columns=vect.get_feature_names_out())

# # transform testing data into a document-term matrix (using existing vocabulary)
# test_simple = ["please don't call me"]
# test_simple_dtm = vect.transform(test_simple)
# test_simple_dtm.toarray()
# pd.DataFrame(test_simple_dtm.toarray(), columns=vect.get_feature_names_out())

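|   | am | are | busy | call | from | hello | home | how | later | me | money | win | you |
|---|----|-----|------|------|------|-------|------|-----|-------|----|-------|-----|-----|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |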
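
The commented-out lines in the cell above hint at an important detail: when a fitted vectorizer transforms new text, words it never saw during `fit` are silently ignored. A small sketch of that behaviour, reusing the three training messages and the test sentence from the cell; the variable names `test_row`, `call_count` and `me_count` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_simple = ['Hello how are you!',
                'Win money, from home.',
                'I am busy, call me later']

vect = CountVectorizer()
vect.fit(train_simple)  # vocabulary is learned from these three documents only

# 'please' and "don't" were never seen during fit, so they get no column.
test_row = vect.transform(["please don't call me"]).toarray()[0]

# vocabulary_ maps each learned token to its column index.
call_count = test_row[vect.vocabulary_['call']]
me_count = test_row[vect.vocabulary_['me']]
print(call_count, me_count, test_row.sum())  # 1 1 2
```

This is why the testing data is only ever passed to `transform`, never to `fit`: the columns must line up with the vocabulary learned from the training data.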


### (Hidden) Creating Frequency Matrix ###

In [38]:
# instantiate the vectorizer
vect = CountVectorizer()

# learn vocabulary and create document-term matrix in a single step
train_dtm = vect.fit_transform(X_train)
train_dtm

<4179x7456 sparse matrix of type '<type 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

In [39]:
# transform testing data into a document-term matrix
test_dtm = vect.transform(X_test)
test_dtm

<1393x7456 sparse matrix of type '<type 'numpy.int64'>'
	with 17604 stored elements in Compressed Sparse Row format>

In [41]:
train_features = vect.get_feature_names_out()
print('Number of training features: ', len(train_features))
test_features = vect.get_feature_names_out()
print('Number of testing features: ', len(test_features))

Number of training features:  7456
Number of testing features:  7456


### Step 6: Naive Bayes Classifier ###

In [37]:

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(train_dtm, y_train)

# make predictions on test data using test_dtm
preds = nb.predict(test_dtm)
preds

# compare predictions to true labels
from sklearn import metrics
print(metrics.accuracy_score(y_test, preds))
print(metrics.confusion_matrix(y_test, preds))

0.988513998564
[[1203    5]
 [  11  174]]
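
The confusion matrix can be tied back to the sensitivity and specificity defined at the start. The sketch below recomputes the accuracy and those rates from the four counts printed above, assuming scikit-learn's convention of true labels on rows and predicted labels on columns, with classes ordered 0 (ham) then 1 (spam).

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp = 1203, 5
fn, tp = 11, 174

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # estimate of P(One | Spam): spam correctly flagged
specificity = tn / (tn + fp)   # estimate of P(Zero | ~Spam): ham correctly passed
precision = tp / (tp + fp)     # estimate of P(Spam | One): flagged messages that are spam

print(round(accuracy, 4), round(sensitivity, 4),
      round(specificity, 4), round(precision, 4))
# 0.9885 0.9405 0.9959 0.9721
```

The accuracy matches the value printed by `metrics.accuracy_score`. The sensitivity is the weakest of the four numbers, meaning that missed spam (11 messages) is a more common error on this test set than ham wrongly flagged as spam (5 messages).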
