# DOST AI Summer School 2017
# Multinomial Naive Bayes Spam Classifier

Prepared by Jerelyn Co (ADMU) and Hadrian Paulo Lim (ADMU) 

In [None]:
%pylab inline
import pandas as pd

# Practicals: Spam Filtering with Multinomial Naive Bayes Classifier

## Agenda

2. Representing text as numerical data
3. Reading a text-based dataset into pandas
4. Vectorizing our dataset
5. Building and evaluating a model
6. Comparing models
7. Examining a model for further insight
9. Tuning the vectorizer (challenge)

## Part 2: Representing text as numerical data

In [None]:
# example text for model training
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [None]:
# import and instantiate CountVectorizer (with the default parameters)
# using the variable name vect
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
# learn the 'vocabulary' of the training data (occurs in-place)
# by calling vect.fit() on the simple_train array


In [None]:
# examine the fitted vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
vect.get_feature_names_out()

In [None]:
# transform training data into a 'document-term matrix'
# using the transform() method of vect on the simple_train array
# Assign the result to a variable simple_train_dtm


In [None]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

In [None]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names_out())

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
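
To make the Bag of Words idea concrete, here is a minimal hand-rolled sketch that uses only the standard library. The `tokenize` helper is hypothetical; it is assumed to mirror the default behaviour of `CountVectorizer` (lowercasing, and keeping only tokens of two or more word characters, which is why `a` disappears).

```python
import re

# Same three documents as simple_train above
docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

def tokenize(text):
    # Lowercase, then keep runs of 2+ word characters
    # (assumption: this mirrors CountVectorizer's default tokenization)
    return re.findall(r'\b\w\w+\b', text.lower())

# Vocabulary: every distinct token in the corpus, sorted alphabetically
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# Document-term matrix: one row per document, one column per token
dtm = [[tokenize(doc).count(term) for term in vocabulary] for doc in docs]

print(vocabulary)
for row in dtm:
    print(row)
```

Word order is lost: each row only records how often each vocabulary term occurs in that document.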

In [None]:
# check the type of the document-term matrix
type(simple_train_dtm)

In [None]:
# examine the sparse matrix contents
print(simple_train_dtm)

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
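
As a small illustration of the sparse representation, the sketch below converts a dense count matrix into Compressed Sparse Row format with `scipy.sparse`. The matrix values are the counts expected for the three toy training documents with default settings; treat them as an assumption if your vectorizer is configured differently.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense document-term matrix for the three toy documents
# (columns assumed to be: cab, call, me, please, tonight, you)
dense = np.array([[0, 1, 0, 0, 1, 1],
                  [1, 1, 1, 0, 0, 0],
                  [0, 1, 1, 2, 0, 0]])

# CSR stores only the non-zero entries and their coordinates
sparse = csr_matrix(dense)

print('stored values:', sparse.nnz, 'of', dense.size, 'cells')
print(sparse)
```

On a real corpus the saving is far larger, since almost every cell is zero.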

In [None]:
# example text for model testing
simple_test = ["please don't call me"]

In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

In [None]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

In [None]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names_out())

**Summary:**

- `vect.fit(train)` **learns the vocabulary** of the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)
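
The three steps in the summary can be sketched as follows. This is one possible solution to the exercise cells above, not the only one; the column names are read from `vect.vocabulary_` so that the snippet does not depend on a particular scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
simple_test = ["please don't call me"]

vect = CountVectorizer()
vect.fit(simple_train)                        # learn the vocabulary

# Column order of the document-term matrix (term -> column index mapping)
columns = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

train_dtm = vect.transform(simple_train)      # counts for the training data
test_dtm = vect.transform(simple_test)        # same columns; unseen tokens dropped

print(columns)
print(test_dtm.toarray())
```

The token `don` never appeared in the training data, so it contributes nothing to the test row.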

## Part 3: Reading a text-based dataset into pandas

In [None]:
# read file into pandas using a relative path
# assign the DataFrame object to a variable called spam_ham.
# Use the extra parameters: header=0, names=['label', 'location','message']
# for the read function
path = 'data/spam_ham.csv'
spam_ham = None
# Drop entries with null values using dropna(inplace=True) on your DataFrame


In [None]:
# Drop the irrelevant column.
# examine the shape
# You should get (some_number, 2)
spam_ham.shape

In [None]:
# examine the first 10 rows


In [None]:
# examine the class distribution
spam_ham.label.value_counts()

In [None]:
# convert the labels to a numerical variable
# where ham is reassigned to 0, and spam is 1.
# The converted labels should be under the label_num column.


In [None]:
# check that the conversion worked
spam_ham.head(10)
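
The loading and cleaning steps above could look like the sketch below. The CSV text is a made-up stand-in for `data/spam_ham.csv` (the rows and the contents of the `location` column are hypothetical), so that the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/spam_ham.csv (made-up rows)
csv_text = """label,location,message
ham,inbox/001.txt,See you at lunch tomorrow
spam,inbox/002.txt,WIN a FREE prize now
ham,inbox/003.txt,
spam,inbox/004.txt,Cheap meds click here
"""

df = pd.read_csv(io.StringIO(csv_text), header=0,
                 names=['label', 'location', 'message'])
df.dropna(inplace=True)               # drops the row with the empty message
df = df.drop('location', axis=1)      # drop the irrelevant column

# Map the text labels to numbers: ham -> 0, spam -> 1
df['label_num'] = df.label.map({'ham': 0, 'spam': 1})

print(df.shape)
print(df.label.value_counts())
```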

In [None]:
# how to define X and y (from the spam data) for use with CountVectorizer
X = spam_ham.message
y = spam_ham.label_num
print(X.shape)
print(y.shape)

In [None]:
# split X and y into training and testing sets
# Use the following variables: X_train, X_test, y_train, y_test
from sklearn.model_selection import train_test_split
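
A minimal sketch of the split, on made-up messages. `random_state=1` is an arbitrary choice for reproducibility, and the default `test_size` (25% of the data) is assumed.

```python
from sklearn.model_selection import train_test_split

# Hypothetical messages and labels (0 = ham, 1 = spam)
X = ['msg one', 'msg two', 'msg three', 'msg four',
     'msg five', 'msg six', 'msg seven', 'msg eight']
y = [0, 1, 0, 1, 0, 1, 0, 1]

# Split the raw text BEFORE vectorizing, so the vocabulary is learned
# from the training portion only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(len(X_train), len(X_test))
```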


## Part 4: Vectorizing our dataset

In [None]:
# instantiate the count vectorizer and assign it to vect again.


In [None]:
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)

In [None]:
# examine the fitted vocabulary
vect.get_feature_names_out()

In [None]:
# transform training data into a 'document-term matrix'
X_train_dtm = vect.transform(X_train)

In [None]:
# equivalently: combine fit and transform into a single step using the fit_transform method


In [None]:
# examine the document-term matrix
# This should be:
# a sparse matrix of type '<class 'numpy.int64'>'
# in Compressed Sparse Row format>
X_train_dtm

In [None]:
# transform testing data, too, into a document-term matrix
X_test_dtm = None
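
The sketch below, on made-up messages, checks that `fit_transform` gives the same matrix as `fit` followed by `transform`, and that the test data is only transformed, never fitted.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and testing messages
train_msgs = ['win a free prize now', 'lunch meeting tomorrow',
              'free cash prize', 'see you at the meeting']
test_msgs = ['free lunch tomorrow', 'brand new words only']

vect = CountVectorizer()

# Two steps: learn the vocabulary, then build the matrix
dtm_two_step = vect.fit(train_msgs).transform(train_msgs)

# One step: the same result from a fresh vectorizer
dtm_one_step = CountVectorizer().fit_transform(train_msgs)

# Testing data: transform only, so it shares the training columns
test_dtm = vect.transform(test_msgs)

print(dtm_one_step.shape, test_dtm.shape)
```

Fitting on the test data would let information about the test set leak into the features.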


## Part 5: Building and evaluating a model

We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

In [None]:
# import and instantiate a Multinomial Naive Bayes model
# use the nb variable
from sklearn.naive_bayes import MultinomialNB
nb = None

In [None]:
# train the model using X_train_dtm and the fit() method


In [None]:
# make class predictions for X_test_dtm using the predict() function


In [None]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

In [None]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

In [None]:
# Print the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_class, digits=4))

In [None]:
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

In [None]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
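
Putting Part 5 together, here is an end-to-end sketch on a tiny made-up dataset. It is one way to fill in the exercise cells, not a reference solution, and the scores on six toy messages say nothing about performance on the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Hypothetical messages (1 = spam, 0 = ham)
train_msgs = ['win a free prize now', 'free cash prize claim now',
              'lunch meeting tomorrow', 'see you at the meeting']
train_labels = [1, 1, 0, 0]
test_msgs = ['claim your free prize', 'meeting at lunch']
test_labels = [1, 0]

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(train_msgs)
X_test_dtm = vect.transform(test_msgs)

nb = MultinomialNB()
nb.fit(X_train_dtm, train_labels)

y_pred_class = nb.predict(X_test_dtm)              # hard class labels
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]   # P(spam) per message

print(metrics.accuracy_score(test_labels, y_pred_class))
print(metrics.confusion_matrix(test_labels, y_pred_class))
print(metrics.classification_report(test_labels, y_pred_class, digits=4))
print(metrics.roc_auc_score(test_labels, y_pred_prob))
```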

## Part 7: Examining a model for further insight

We will examine our **trained Naive Bayes model** to calculate the approximate **"spamminess" of each token**.

In [None]:
# store the vocabulary of X_train using get_feature_names_out()
# its length should be 161925
X_train_tokens = None

In [None]:
# examine the first 50 tokens


In [None]:
# examine the last 50 tokens


In [None]:
# Naive Bayes counts the number of times each token appears in each class
nb.feature_count_

In [None]:
# rows represent classes, columns represent tokens
nb.feature_count_.shape

In [None]:
# number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0, :]
ham_token_count

In [None]:
# number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1, :]
spam_token_count

In [None]:
# create a DataFrame of tokens with their separate ham and spam counts
tokens = None

In [None]:
# examine 5 random DataFrame rows using the sample() method


In [None]:
# Naive Bayes counts the number of observations in each class
nb.class_count_

Before we can calculate the "spamminess" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**.

In [None]:
# add 1 to ham and spam counts to avoid dividing by 0


tokens.sample(5, random_state=427)

In [None]:
# convert the ham and spam counts into frequencies
# by dividing each by its class total in nb.class_count_

tokens.sample(5, random_state=427)

In [None]:
# calculate the ratio of spam-to-ham for each token

tokens.sample(5, random_state=427)

In [None]:
# examine the DataFrame sorted by spam_ratio
tokens.sort_values('spam_ratio', ascending=False)

In [None]:
# look up the spam_ratio for a given token
# 'adobe' is only an example; whether it is in the vocabulary depends on the train_test_split randomness
tokens.loc['adobe', 'spam_ratio']
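
The spam-ratio steps in this part can be sketched on made-up numbers as follows. The arrays `feature_count` and `class_count` are hypothetical stand-ins for `nb.feature_count_` and `nb.class_count_`, so the arithmetic is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for nb.feature_count_ and nb.class_count_
feature_count = np.array([[0., 10., 5.],    # row 0: ham counts per token
                          [8., 10., 0.]])   # row 1: spam counts per token
class_count = np.array([100., 20.])         # 100 ham messages, 20 spam
token_names = ['free', 'call', 'meeting']

tokens = pd.DataFrame({'ham': feature_count[0], 'spam': feature_count[1]},
                      index=token_names)

# Add 1 to every count to avoid dividing by zero
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1

# Divide by the class sizes to correct for the class imbalance
tokens['ham'] = tokens.ham / class_count[0]
tokens['spam'] = tokens.spam / class_count[1]

# Ratio above 1: the token is relatively more common in spam
tokens['spam_ratio'] = tokens.spam / tokens.ham

print(tokens.sort_values('spam_ratio', ascending=False))
```

For `free`, the ratio is (9 / 20) / (1 / 100) = 45, even though the token never appeared in ham.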

## Part 9: Tuning the vectorizer (Challenge)

Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:

- **stop_words:** string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used.

- **ngram_range:** tuple (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
    - All values of n such that min_n <= n <= max_n will be used.

- **max_df:** float in range [0.0, 1.0] or int, default=1.0
    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.
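
To see how these parameters change the feature set, the sketch below refits the vectorizer on the three toy documents from earlier and reports the vocabulary size each time. The `vocab_size` helper is hypothetical, introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

def vocab_size(**params):
    # Fit a CountVectorizer with the given parameters; return its vocabulary size
    return len(CountVectorizer(**params).fit(simple_train).vocabulary_)

print('defaults:            ', vocab_size())
print('ngram_range=(1, 2):  ', vocab_size(ngram_range=(1, 2)))   # adds bigrams
print('min_df=2:            ', vocab_size(min_df=2))             # drops rare terms
print('max_df=0.5:          ', vocab_size(max_df=0.5))           # drops common terms
print("stop_words='english':", vocab_size(stop_words='english'))
```

Fewer features is not automatically better; compare the model's score on the test set after each change.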

**Guidelines for tuning CountVectorizer:**

- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.

Tasks:
1. **Experiment**, and let the data tell you the best approach!
2. Try reducing or increasing the number of features to beat the score of the previous model.
   * Score above 99.5%? Tell us! :)

## Part 10: Tuning the Laplacian Correction Factor (Challenge)

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

> Parameters:
>
> - alpha : float, optional (default=1.0)
>   Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

One of the parameters that we can tune when training a Multinomial Naive Bayes classifier is the Laplacian correction factor, which scikit-learn exposes as `alpha`.

Tasks:
1. Vary the correction factor from 0 to 3 in increments of 0.1, then also try 5 and 10, training one classifier per value.
2. Plot the precision-recall curves for these classifiers to compare and contrast.
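
One way to organise the sweep is sketched below on a tiny made-up dataset. As an assumption of this sketch, the grid starts at 0.1 rather than 0, because `alpha=0` disables smoothing and can produce zero probabilities for unseen tokens; add it back if you want to observe that effect. Plotting is left as a comment so that the snippet runs without matplotlib.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_curve

# Hypothetical messages (1 = spam, 0 = ham)
train_msgs = ['win a free prize now', 'free cash prize claim now',
              'lunch meeting tomorrow', 'see you at the meeting']
train_labels = [1, 1, 0, 0]
test_msgs = ['claim your free prize', 'meeting at lunch',
             'free lunch now', 'see you tomorrow']
test_labels = [1, 0, 1, 0]

# 0.1, 0.2, ..., 3.0, then 5 and 10
alphas = [round(0.1 * i, 1) for i in range(1, 31)] + [5.0, 10.0]

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(train_msgs)
X_test_dtm = vect.transform(test_msgs)

curves = {}
for alpha in alphas:
    nb = MultinomialNB(alpha=alpha).fit(X_train_dtm, train_labels)
    y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
    precision, recall, _ = precision_recall_curve(test_labels, y_pred_prob)
    curves[alpha] = (precision, recall)
    # To plot one curve per alpha: plt.plot(recall, precision, label=str(alpha))

print(len(curves), 'classifiers trained')
```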