# DOST AI Summer School 2017
# Multinomial Naive Bayes Spam Classifier

Prepared by Jerelyn Co (ADMU) and Hadrian Paulo Lim (ADMU)

# Text Classification

Text classification is the task of assigning a text or document to a predefined category based on its content. The classification may be topic-based or genre-based.

### Data Attributes Description
* document: document ID
* content: document content
* category: document category, e.g. Auto, Sports, Computer

### Goal
Determine the category of the documents below using a Naive Bayes classifier:

| document | content           | category |
|----------|-------------------|----------|
| D6       | Home Runs Game    | ?        |
| D7       | Car Engine Noises | ?        |
| D8       | Computer GIFs     | ?        |

## Step 0: Loading the dataset into a pandas DataFrame

In [1]:
dat_path = "data/doc_classification.csv"

In [2]:
import pandas as pd

df = pd.read_csv(dat_path, index_col=False)
df

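| Unnamed: 0 | document | content             | category |
|------------|----------|---------------------|----------|
| 0          | D1       | Saturn Dealer’s Car | Auto     |
| 1          | D2       | Toyota Car Tercel   | Auto     |
| 2          | D3       | Baseball Game Play  | Sports   |
| 3          | D4       | Pulled Muscle Game  | Sports   |
| 4          | D5       | Colored GIFs Root   | Computer |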
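The `Unnamed: 0` column in the output is what pandas typically shows when a CSV was saved together with its row index and then read back without declaring that index. A minimal sketch of one way to avoid it, assuming the first column of the file really is such an unnamed index (the in-memory CSV text is a hypothetical stand-in for `data/doc_classification.csv`, and `df_clean` is an illustrative name):

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: the first header cell is empty,
# which is how pandas writes an unnamed row index.
csv_text = """,document,content,category
0,D1,Saturn Dealer’s Car,Auto
1,D2,Toyota Car Tercel,Auto
2,D3,Baseball Game Play,Sports
3,D4,Pulled Muscle Game,Sports
4,D5,Colored GIFs Root,Computer
"""

# index_col=0 uses the first column as the row index instead of
# keeping it as an extra "Unnamed: 0" data column.
df_clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(df_clean.columns))
```

The rest of the notebook only uses the `content` and `category` columns, so the extra column is harmless either way.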


## Step 1: Integer-based labels/categories
> Many machine learning algorithms work on numerical values, so categorical labels are commonly encoded as integers.

In [3]:
category_ids = {"Auto":0, "Sports":1, "Computer":2}
df["category_num"] = df.category.map(category_ids)
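As an alternative to a hand-written mapping, scikit-learn's `LabelEncoder` can derive the integer ids from the labels themselves. This is only a sketch and not what the notebook uses; note that the encoder numbers the classes in sorted order, so the ids differ from the `category_ids` dictionary (the `categories` list simply repeats the category column of D1 to D5):

```python
from sklearn.preprocessing import LabelEncoder

# Category column of D1..D5, copied from the dataset table.
categories = ["Auto", "Auto", "Sports", "Sports", "Computer"]

encoder = LabelEncoder()
category_nums = encoder.fit_transform(categories)

# Classes are numbered in sorted order: Auto=0, Computer=1, Sports=2.
print(list(encoder.classes_))
print(list(category_nums))

# inverse_transform maps ids back to the original names.
print(list(encoder.inverse_transform([2])))
```

Keeping the fitted encoder around removes the need for a second, manually maintained id-to-name dictionary later on.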

## Step 2: Splits
* Split features to X (predictor/s) and y (label)
* Split observations into train and test sets 

In [4]:
# Splitting features to X (predictor/s) and y(label)
X = df.content
y = df.category_num
print(X.shape)
print(y.shape)

(5,)
(5,)


In [5]:
# Splitting observations into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
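When no `test_size` is given, `train_test_split` holds out 25% of the rows, rounded up. On five documents that means two test rows and three training rows. The sketch below spells this out; the `docs` and `labels` lists are illustrative stand-ins for `X` and `y`:

```python
from sklearn.model_selection import train_test_split

# Stand-ins for X and y: document ids and their integer categories.
docs = ["D1", "D2", "D3", "D4", "D5"]
labels = [0, 0, 1, 1, 2]

# test_size=0.25 is the default; it is written out here for clarity.
docs_train, docs_test, labels_train, labels_test = train_test_split(
    docs, labels, test_size=0.25, random_state=1
)

print(len(docs_train), len(docs_test))
```

With so few rows, which documents land in the test set depends entirely on `random_state`, so any score computed on it should be read with care.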

## Step 3: Building and evaluating the model

Use of [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
>Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.

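Use of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
> Convert a collection of text documents to a matrix of token counts.

The pipeline in this step uses raw token counts. [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which is equivalent to CountVectorizer followed by TfidfTransformer, can be swapped in to use TF-IDF features instead.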

Use of [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
> The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
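To make the classifier less of a black box, here is a rough pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is an illustration under stated assumptions, not scikit-learn's implementation: it trains on all five documents rather than the train split, tokenizes by lowercasing and splitting on whitespace, and skips words it has never seen. The helper names (`tokenize`, `train_nb`, `log_posteriors`, `predict_nb`) are made up for this example.

```python
import math
from collections import Counter, defaultdict

# All five labelled documents from the dataset table (D1..D5).
train_docs = [
    ("Saturn Dealer’s Car", "Auto"),
    ("Toyota Car Tercel", "Auto"),
    ("Baseball Game Play", "Sports"),
    ("Pulled Muscle Game", "Sports"),
    ("Colored GIFs Root", "Computer"),
]


def tokenize(text):
    # Simplified tokenizer (assumption): lowercase, split on whitespace.
    return text.lower().split()


def train_nb(docs):
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for text, label in docs:
        tokens = tokenize(text)
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def log_posteriors(text, model, alpha=1.0):
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        # log prior: share of training documents in this class
        score = math.log(class_counts[label] / n_docs)
        for token in tokenize(text):
            if token not in vocab:
                continue  # words never seen in training are ignored
            # smoothed log likelihood of the word given the class
            numerator = word_counts[label][token] + alpha
            denominator = total_words + alpha * len(vocab)
            score += math.log(numerator / denominator)
        scores[label] = score
    return scores


def predict_nb(text, model):
    scores = log_posteriors(text, model)
    return max(scores, key=scores.get)


model = train_nb(train_docs)
for doc in ["Home Runs Game", "Car Engine Noises", "Computer GIFs"]:
    print("{}: {}".format(doc, predict_nb(doc, model)))
```

For "Home Runs Game", only "game" is a known word: it appears twice among the six Sports words, so the smoothed likelihood is (2 + 1) / (6 + 13) for Sports versus (0 + 1) / (6 + 13) for Auto, which is why Sports wins.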

In [6]:
# Importing packages necessary for building the model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Pipeline: token counts -> Multinomial Naive Bayes
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('clf', MultinomialNB())])
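To see what the first pipeline step hands to the classifier, the sketch below fits a `CountVectorizer` on three of the dataset's documents and prints the learned vocabulary and the count matrix. The variable names are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer

sample_docs = ["Toyota Car Tercel", "Baseball Game Play", "Pulled Muscle Game"]

vect = CountVectorizer()
counts = vect.fit_transform(sample_docs)  # sparse matrix: documents x terms

# vocabulary_ maps each lowercased term to its column index.
vocab = sorted(vect.vocabulary_)
print(vocab)

# Each row is a document, each column the count of one vocabulary term.
print(counts.toarray())
```

Each document becomes a row of eight counts, one per distinct word, and "game" gets a 1 in both Sports rows. These counts are the discrete features that `MultinomialNB` models.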

Fitting the model to the train set

In [7]:
text_clf = text_clf.fit(X_train, y_train)

Get the predicted classes for X_test

In [8]:
y_predicted = text_clf.predict(X_test)
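Besides hard class labels, a pipeline ending in `MultinomialNB` also exposes `predict_proba`, which can help judge how confident a prediction is. A small self-contained sketch; the three training documents and the `demo_clf` name are chosen for illustration and are not the notebook's actual train split:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# One illustrative training document per category (0=Auto, 1=Sports, 2=Computer).
demo_docs = ["Toyota Car Tercel", "Baseball Game Play", "Colored GIFs Root"]
demo_labels = [0, 1, 2]

demo_clf = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
demo_clf.fit(demo_docs, demo_labels)

# One row per input document, one column per class in classes_ order.
proba = demo_clf.predict_proba(["Pulled Muscle Game"])
print(demo_clf.named_steps["clf"].classes_)
print(proba.round(3))
```

Only "game" is a known word here, so the Sports class gets the highest probability while the other two classes tie.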

Calculate accuracy of class predictions

In [9]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_predicted)

1.0

Print the confusion matrix

In [10]:
metrics.confusion_matrix(y_test, y_predicted)

array([[1, 0],
       [0, 1]])
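In scikit-learn's confusion matrix, rows are the true classes and columns the predicted classes. The matrix above is 2x2 because only two of the three categories occur in the test set. Passing `labels` makes the missing category visible; in this sketch `y_true` and `y_pred` are stand-ins that mirror the result above (one Auto and one Sports document, both classified correctly):

```python
from sklearn import metrics

# Stand-ins mirroring the test result: one Auto (0), one Sports (1).
y_true = [0, 1]
y_pred = [0, 1]

# labels fixes the row/column order and includes classes absent from the data.
cm = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
```

The all-zero third row and column show that the Computer category was never tested at all.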

Print the classification report

In [11]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predicted, digits=4))

             precision    recall  f1-score   support

          0     1.0000    1.0000    1.0000         1
          1     1.0000    1.0000    1.0000         1

avg / total     1.0000    1.0000    1.0000         2
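The columns of the report can be recomputed by hand from true and predicted labels. The sketch below does so for a single class; the label lists are hypothetical (chosen so the scores are not all 1.0) and `precision_recall_f1` is an illustrative helper, not a scikit-learn function:

```python
def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels: one class-1 document is misclassified as class 0.
y_true_demo = [0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true_demo, y_pred_demo, positive=1)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Here class 1 has precision 1.0 (no false positives), recall 2/3 (one of three class-1 documents missed) and an F1 score of 0.8.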



### !! Caution !!
Our model showed perfect scores, but this does not mean the model is 'perfect'. The test set holds only two documents and none from the Computer category, so these scores say little about how well the model generalizes. Small datasets are prone to bias and overfitting.
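One common way to get a less optimistic picture on a tiny dataset is leave-one-out cross-validation, where each document is held out once and predicted by a model trained on the rest. This sketch is an addition to the notebook's workflow, not part of it; it re-creates the five documents inline, with `cv_docs`, `cv_labels` and `cv_clf` as illustrative names:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# The five labelled documents (0=Auto, 1=Sports, 2=Computer).
cv_docs = [
    "Saturn Dealer’s Car",
    "Toyota Car Tercel",
    "Baseball Game Play",
    "Pulled Muscle Game",
    "Colored GIFs Root",
]
cv_labels = [0, 0, 1, 1, 2]

cv_clf = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])

# One fold per document; each score is 1.0 (correct) or 0.0 (wrong).
scores = cross_val_score(cv_clf, cv_docs, cv_labels, cv=LeaveOneOut())
print(scores)
print(scores.mean())
```

The fold that holds out the only Computer document cannot be predicted correctly, because the model trained on the remaining four has never seen that class. This is a concrete instance of the small-dataset problem.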


## Going back to the goal:

Determine the category of the documents below using a Naive Bayes classifier:

| document | content           | category |
|----------|-------------------|----------|
| D6       | Home Runs Game    | ?        |
| D7       | Car Engine Noises | ?        |
| D8       | Computer GIFs     | ?        |

Put the document contents into a list and predict their categories

In [12]:
documents_to_predict = ["Home Runs Game", "Car Engine Noises", "Computer GIFs"]
predictions = text_clf.predict(documents_to_predict)

Mapping back category ids to the original category names and displaying resulting predictions

In [13]:
category_names = {0: "Auto", 1: "Sports", 2: "Computer"}

for document, prediction in zip(documents_to_predict, predictions):
    print("{}: {}".format(document, category_names[prediction]))

Home Runs Game: Sports
Car Engine Noises: Auto
Computer GIFs: Computer
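The id-to-name dictionary repeats the information already held in `category_ids`. A small sketch of deriving it by inverting that mapping, so the two cannot drift apart (the `predicted_ids` list is a stand-in matching the predictions printed above):

```python
category_ids = {"Auto": 0, "Sports": 1, "Computer": 2}

# Invert the name -> id mapping into id -> name.
category_names = {num: name for name, num in category_ids.items()}

# Stand-in for the model output: Sports, Auto, Computer.
predicted_ids = [1, 0, 2]
print([category_names[i] for i in predicted_ids])
```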
