In [1]:
import pandas as pd
# Dataset available using filepath 'smsspamcollection/SMSSpamCollection'
df = pd.read_table('smsspamcollection/SMSSpamCollection', sep = '\t', names = ['label', 'sms_message'])

# Print the first 5 rows
df.head()

Unnamed: 0,label,sms_message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
'''
Solution
'''
df.label = df.label.map({'ham':0, 'spam':1})
df.head()

Unnamed: 0,label,sms_message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
df.shape

(5572, 2)

### Step 3.1: Training and testing sets ###

Now that we understand how to use the Bag of Words approach, we can return to our original, larger UCI dataset and proceed with our analysis. Our first step is to split our dataset into a training set and a testing set so we can **first train, and then test** our model. 


>>**Instructions:**
Split the dataset into a training and testing set using the `train_test_split` function in sklearn, and print out the number of rows in each of the training and testing sets. Split the data using the following variables:
* `X_train` is our training data for the 'sms_message' column.
* `y_train` is our training data for the 'label' column.
* `X_test` is our testing data for the 'sms_message' column.
* `y_test` is our testing data for the 'label' column.


In [4]:
'''
Solution 
'''
# split into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
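The 4179/1393 split follows from the default behaviour of `train_test_split`, which holds out 25% of the rows for testing when neither `test_size` nor `train_size` is given. A small sketch of that arithmetic, using only the standard library (the helper name `default_split_sizes` is hypothetical and not part of sklearn):

```python
import math

def default_split_sizes(n_rows, test_fraction=0.25):
    """Return (n_train, n_test) for a fractional test split.

    Assumption: the test count is rounded up and the training set gets
    the remainder, which matches the sizes printed above.
    """
    n_test = math.ceil(n_rows * test_fraction)
    n_train = n_rows - n_test
    return n_train, n_test

n_train, n_test = default_split_sizes(5572)
print(n_train, n_test)
```

Passing an explicit `test_size` to `train_test_split` changes this ratio; `random_state=1` only fixes which rows land in each set, not how many.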


### Step 3.2: Applying Bag of Words processing to our dataset. ###

Now that we have split the data, our next objective is to follow the steps from "Step 2: Bag of Words" and convert our data into the desired matrix format. To do this we will use `CountVectorizer()` as we did before. There are two steps to consider here:

* First, we fit `CountVectorizer()` on our training data (`X_train`) to learn the vocabulary, and return the resulting document-term matrix.
* Second, we transform our testing data (`X_test`) with that same fitted vectorizer to return its matrix. We only transform here; we do not fit on the testing data.

Note that `X_train` is our training data for the 'sms_message' column in our dataset and we will be using this to train our model. 

`X_test` is our testing data for the 'sms_message' column and this is the data we will be using (after transformation to a matrix) to make predictions on. We will then compare those predictions with `y_test` in a later step. 
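To make the difference between fitting and transforming concrete, here is a minimal pure-Python sketch of the idea. It is not `CountVectorizer` itself: the helper names `fit_vocabulary` and `transform` and the toy sentences are assumptions for illustration, and the tokenizer only approximates the default behaviour (lowercasing, words of two or more characters).

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())

def fit_vocabulary(documents):
    """Learn a sorted word -> column mapping from the training documents only."""
    words = {word for doc in documents for word in tokenize(doc)}
    return {word: column for column, word in enumerate(sorted(words))}

def transform(documents, vocabulary):
    """Count vocabulary words per document; words outside the vocabulary are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

# Hypothetical toy data standing in for X_train and X_test.
train_docs = ["Win a free prize now", "Are you free tonight"]
test_docs = ["Free prize for Sean"]

vocab = fit_vocabulary(train_docs)           # learned from training data only
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)    # reuses the training vocabulary

print(list(vocab))
print(train_matrix)
print(test_matrix)
```

The words "for" and "sean" never appear in the training sentences, so they have no column and are dropped from the test matrix. Fitting on the test data instead would produce different columns, and the classifier trained on the training matrix could not use them.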

For now, we have provided the code that does the matrix transformations for you!

In [5]:
'''
[Practice Note]

The code for this segment is in 2 parts. First, we are learning a vocabulary dictionary for the training data 
and then transforming the data into a document-term matrix; secondly, for the testing data we are only 
transforming the data into a document-term matrix.

This is similar to the process we followed in Step 2.3.

We will provide the transformed data to students in the variables 'training_data' and 'testing_data'.
'''

In [6]:
# add another sentence by Sean
type(X_test)

pandas.core.series.Series

In [7]:
X_test

1078                         Yep, by the pretty sculpture
4028        Yes, princess. Are you going to make me moan?
958                            Welp apparently he retired
4642                                              Havent.
4674    I forgot 2 ask ü all smth.. There's a card on ...
                              ...                        
3207                                        At home also.
4655                     Hope you are having a great day.
1140    Message:some text missing* Sender:Name Missing...
1793    WIN: We have a winner! Mr. T. Foley won an iPo...
1710    U meet other fren dun wan meet me ah... Muz b ...
Name: sms_message, Length: 1393, dtype: object

In [8]:
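sean = pd.Series(['Sean is the greatest.'])
# Series.append was removed in pandas 2.0; pd.concat is the replacement.
# Note: X_test now has one more row than y_test.
X_test = pd.concat([X_test, sean])
X_test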

1078                         Yep, by the pretty sculpture
4028        Yes, princess. Are you going to make me moan?
958                            Welp apparently he retired
4642                                              Havent.
4674    I forgot 2 ask ü all smth.. There's a card on ...
                              ...                        
4655                     Hope you are having a great day.
1140    Message:some text missing* Sender:Name Missing...
1793    WIN: We have a winner! Mr. T. Foley won an iPo...
1710    U meet other fren dun wan meet me ah... Muz b ...
0                                   Sean is the greatest.
Length: 1394, dtype: object

In [9]:
'''
Solution
'''
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the CountVectorizer class (default settings, so stop_words=None)
count_vector = CountVectorizer()

# Fit on the training data to learn the vocabulary, then return the document-term matrix
training_data = count_vector.fit_transform(X_train)

# Transform the testing data with the vocabulary learned above.
# Note that we do not fit the CountVectorizer on the testing data.
testing_data = count_vector.transform(X_test)

### Step 5: Naive Bayes implementation using scikit-learn ###

Now let's return to our spam classification context. Thankfully, sklearn has several Naive Bayes implementations that we can use, so we do not have to do the math from scratch. We will be using a classifier from sklearn's `sklearn.naive_bayes` module to make predictions on our SMS messages dataset.

Specifically, we will be using the **multinomial Naive Bayes algorithm**. This classifier is suitable for classification with discrete features, such as the word counts we use for text classification, and it takes integer word counts as its input. **Gaussian Naive Bayes**, on the other hand, is better suited to continuous data, as it assumes that each feature follows a Gaussian (normal) distribution within each class.
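To show what "taking integer word counts as input" means in practice, here is a minimal pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is an illustration under stated assumptions, not sklearn's implementation: the function names, the four-word vocabulary, and the toy counts are all hypothetical.

```python
import math

def train_multinomial_nb(count_rows, labels, alpha=1.0):
    """Estimate a log prior and smoothed log word likelihoods for each class."""
    n_features = len(count_rows[0])
    model = {}
    for cls in sorted(set(labels)):
        rows = [row for row, label in zip(count_rows, labels) if label == cls]
        word_totals = [sum(column) for column in zip(*rows)]
        denominator = sum(word_totals) + alpha * n_features
        model[cls] = {
            "log_prior": math.log(len(rows) / len(labels)),
            "log_likelihood": [
                math.log((count + alpha) / denominator) for count in word_totals
            ],
        }
    return model

def predict(model, row):
    """Return the class with the highest log posterior score for one count row."""
    scores = {
        cls: params["log_prior"]
        + sum(count * ll for count, ll in zip(row, params["log_likelihood"]))
        for cls, params in model.items()
    }
    return max(scores, key=scores.get)

# Columns: counts of the hypothetical words ["free", "prize", "tonight", "you"].
toy_counts = [
    [2, 1, 0, 0],  # spam
    [1, 2, 0, 1],  # spam
    [0, 0, 1, 2],  # ham
    [0, 0, 2, 1],  # ham
]
toy_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham, as in the label mapping above

toy_model = train_multinomial_nb(toy_counts, toy_labels)
spam_guess = predict(toy_model, [1, 1, 0, 0])  # "free prize"
ham_guess = predict(toy_model, [0, 0, 1, 1])   # "tonight you"
print(spam_guess, ham_guess)
```

Each word count multiplies the log likelihood of that word under a class, which is why the model expects non-negative counts rather than arbitrary real-valued features.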

In [10]:
'''
Instructions:

We have loaded the training data into the variable 'training_data' and the testing data into the 
variable 'testing_data'.

Import the MultinomialNB classifier and fit it to the training data using fit(). Name your classifier
'naive_bayes'. You will be training the classifier using 'training_data' and 'y_train' from our split earlier.
'''

In [11]:
'''
Solution
'''
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB()

In [12]:
'''
Instructions:
Now that our algorithm has been trained on the training data set, we can make some predictions on the test data
stored in 'testing_data' using predict(). Save your predictions into the 'predictions' variable.
'''

In [13]:
'''
Solution
'''
predictions = naive_bayes.predict(testing_data)

Now that predictions have been made on our test set, we need to check the accuracy of our predictions.
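Checking the predictions means comparing them element by element with the true labels. The sketch below computes accuracy, precision, and recall by hand on made-up labels; the helper name `confusion_counts` and the toy lists are assumptions for illustration, and this is not the notebook's own evaluation code. The comparison only makes sense when both sequences have the same length and order.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical labels: 1 = spam, 0 = ham.
toy_true = [0, 0, 1, 1, 0, 1, 0, 0]
toy_pred = [0, 0, 1, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(toy_true, toy_pred)
accuracy = (tp + tn) / len(toy_true)   # share of all messages classified correctly
precision = tp / (tp + fp)             # share of predicted spam that really is spam
recall = tp / (tp + fn)                # share of actual spam that was caught
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Because spam is much rarer than ham in this dataset, accuracy alone can look high even when spam is missed, which is why precision and recall are worth reporting alongside it.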

In [14]:
pd.DataFrame(X_test).reset_index(drop=True)

Unnamed: 0,0
0,"Yep, by the pretty sculpture"
1,"Yes, princess. Are you going to make me moan?"
2,Welp apparently he retired
3,Havent.
4,I forgot 2 ask ü all smth.. There's a card on ...
...,...
1389,Hope you are having a great day.
1390,Message:some text missing* Sender:Name Missing...
1391,WIN: We have a winner! Mr. T. Foley won an iPo...
1392,U meet other fren dun wan meet me ah... Muz b ...


In [15]:
result = pd.DataFrame(X_test)
result['Spam_1_Prediction'] = predictions
result

Unnamed: 0,0,Spam_1_Prediction
1078,"Yep, by the pretty sculpture",0
4028,"Yes, princess. Are you going to make me moan?",0
958,Welp apparently he retired,0
4642,Havent.,0
4674,I forgot 2 ask ü all smth.. There's a card on ...,0
...,...,...
4655,Hope you are having a great day.,0
1140,Message:some text missing* Sender:Name Missing...,0
1793,WIN: We have a winner! Mr. T. Foley won an iPo...,1
1710,U meet other fren dun wan meet me ah... Muz b ...,0


The appended message "Sean is the greatest." is classified as ham (label 0).