# Text classification with the Naive Bayes algorithm #

**Machine Learning** (ML) is a sub-field of Artificial Intelligence (AI) that automates analytical model building. As the amount of data grows, ML is becoming a de facto standard across many research fields. So what does it mean that a machine *learns*? Given enough data, we can offload programming tasks to the algorithm; that is, the algorithm learns structure from the data:

Tom Mitchell's definition of **a well-posed learning problem**: "A computer program is said to learn from experience $E$ with respect to some task $T$ and some performance measure $P$, if its performance on $T$, as measured by $P$, improves with experience $E$."

**Classification** is one of the primary tasks of supervised learning. Given labeled data, a classification algorithm outputs a solution that categorizes new examples (i.e., it associates labels with subsets of the data). While unsupervised learning searches for groups within the data, classification learns to map the data onto categorical class values or labels (i.e., function approximation).

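The **Naive Bayes** (NB) algorithm is a generative algorithm that is very popular for text classification. It is called naive because it assumes that the terms of a document occur independently of each other given the class.
The probability of a document $d$, consisting of the terms $t_1, \dots, t_m$, being in class $c$, $P(c \mid d)$, is computed as: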


$$ P(c \mid d) \propto P(c) \prod_{i = 1}^{m}P(t_i \mid c) $$
and the class of a document $d$ is then computed as:

$$c_{MAP} = \underset{c \in \{c_1, c_2\}}{\operatorname{arg\,max}} \; P(c \mid d)$$
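
To make the two formulas concrete, here is a minimal from-scratch sketch in plain Python. It is not the implementation used later in this tutorial (that is scikit-learn's `MultinomialNB`). The function names `train_nb` and `classify_nb`, the add-one smoothing, and the four toy documents are assumptions made for illustration only. Products of many small probabilities underflow, so the sketch sums log-probabilities instead, which leaves the arg max unchanged.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(c) and log P(t | c) with add-alpha smoothing."""
    vocab = {t for doc in docs for t in doc}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
    # log prior: relative frequency of each class
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    # log likelihood: smoothed relative frequency of each term within a class
    log_lik = {}
    for c in class_counts:
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((term_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_lik


def classify_nb(doc, log_prior, log_lik):
    """Return the MAP class and the unnormalized log score of every class."""
    scores = {}
    for c in log_prior:
        # terms that were never seen in training are skipped
        scores[c] = log_prior[c] + sum(log_lik[c][t] for t in doc if t in log_lik[c])
    return max(scores, key=scores.get), scores


# invented toy documents, already tokenized (not taken from Gesta Danorum)
toy_docs = [["rex", "draco", "gigas"], ["gigas", "draco"],
            ["rex", "episcopus", "ecclesia"], ["episcopus", "rex"]]
toy_labels = ["early", "early", "late", "late"]

log_prior, log_lik = train_nb(toy_docs, toy_labels)
c_map, scores = classify_nb(["draco", "gigas", "rex"], log_prior, log_lik)
print(c_map, scores)
```

Exponentiating a score gives back the right-hand side of the proportionality above, i.e. $P(c) \prod_i P(t_i \mid c)$.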





## Research problem ##

The medieval writer Saxo Grammaticus (c. 1160 - post 1208) represents the beginning of the modern-day historian in Scandinavia. Saxo's history of the Danes, *Gesta Danorum* ("Deeds of the Danes"), is the single most important written source for Danish history in the 12<sup>th</sup> century. *Gesta Danorum* is tendentious, contains elements of fiction, and its composition has been a subject of academic debate for more than a century. The more recent debate treats the bipartite composition of *Gesta Danorum* and centers on two related issues: 1) is the transition between the old mythical part and the new historical part located in book eight, nine, or ten; and 2) is this transition gradual (continuous) or sudden (point-like)? In this tutorial we will ask: "is book nine more similar to the early books (1-8) or the late books (10-16)?" This example uses simple vector space techniques from author chronometry to represent the most salient stylistic and semantic lexical features of *Gesta Danorum*.

### Import data ###
We start by importing *Gesta Danorum* sampled in slices of 250 words (see `make_data.py` for more detail).

In [None]:
import os
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from collections import Counter


data_path = os.path.join("..","dat","saxo_books","saxo_class.csv")

data = pd.read_csv(data_path)

print(data.tail())

print(set(data["book_class"]))

### Split the data set into training data and to-be-determined data ###

In [None]:
idx = data["book_class"] == "uncertain"

data_uncertain = data[idx]
data = data[~idx]

print(set(data["book_class"]))
print(set(data_uncertain["book_class"]))

In [None]:
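## CLASS DISTRIBUTION AND CLASS BIAS
def printdist(df):
    """Print the number of slices per class label."""
    for label, count in df["book_class"].value_counts().items():
        print("number of {}: {}".format(label, count))

printdist(data)
printdist(data_uncertain)

# majority-class baseline: accuracy of always predicting the most frequent class
# (computed from the data instead of the hardcoded 707/(518+707))
print(data["book_class"].value_counts(normalize=True).max())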

### Split the training data into a training set and a test set to detect overfitting ###

In [None]:
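# SPLIT DATA SET
ratio = .8
# fixed seed so the split is reproducible (the seed value itself is arbitrary)
rng = np.random.default_rng(42)
mask = rng.random(len(data)) <= ratio
train = data[mask]
test = data[~mask]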

## training set
X_train = train["content"].values
y_train = train["book_class"].values
## test set
X_test = test["content"].values
y_test = test["book_class"].values
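
The random mask above yields only approximately an 80/20 split and does not guarantee that both classes are represented proportionally in each part. As a hedged alternative, the sketch below uses scikit-learn's `train_test_split` with the `stratify` argument. The toy data frame, its 8/12 class balance, and the 25% test share are made up for illustration; only the column names mirror the ones used in this tutorial.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up stand-in for the real data: 8 "early" and 12 "late" slices
toy = pd.DataFrame({
    "content": ["slice {}".format(i) for i in range(20)],
    "book_class": ["early"] * 8 + ["late"] * 12,
})

# stratify keeps the class proportions the same in both parts;
# random_state makes the split reproducible
train_toy, test_toy = train_test_split(
    toy, test_size=0.25, stratify=toy["book_class"], random_state=42
)

print(len(train_toy), len(test_toy))
print(test_toy["book_class"].value_counts())
```

With unbalanced classes, a stratified split keeps the test-set accuracy comparable to the majority-class baseline computed on the full data.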

### Build document vector space ###

In [None]:
vectorizer = CountVectorizer(ngram_range = (1,2), stop_words = None,
    lowercase = True, max_df = .9, min_df = .01, max_features = 500)

feature_train = vectorizer.fit_transform(X_train)
feature_test =  vectorizer.transform(X_test)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()

print(feature_names)

### Train Naive Bayes classifier ###

In [None]:
# TRAIN CLASSIFIER
nb_classifier = MultinomialNB()
nb_classifier.fit(feature_train, y_train)

# EVALUATION
pred = nb_classifier.predict(feature_test)
# rows: true class; columns: predicted class (order given by nb_classifier.classes_)
confmat = metrics.confusion_matrix(y_test, pred, labels=nb_classifier.classes_)
print(confmat)

# observed accuracy
print("Accuracy: {}".format(round(metrics.accuracy_score(y_test, pred), 2)))
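
Accuracy alone can be misleading when one class is more frequent than the other, so it helps to compare it with the majority-class baseline and with a chance-corrected measure. The sketch below does this on invented labels and predictions (`y_true_toy`, `y_pred_toy`), not on the tutorial's results; using Cohen's kappa here is a suggestion, not something the original analysis reports.

```python
from collections import Counter
from sklearn import metrics

# invented true labels and predictions, for illustration only
y_true_toy = ["early", "early", "early", "late", "late", "late", "late", "late"]
y_pred_toy = ["early", "early", "late", "late", "late", "late", "late", "early"]
class_order = ["early", "late"]

# rows: true class; columns: predicted class
cm = metrics.confusion_matrix(y_true_toy, y_pred_toy, labels=class_order)
accuracy = metrics.accuracy_score(y_true_toy, y_pred_toy)
# accuracy of always predicting the most frequent class
baseline = max(Counter(y_true_toy).values()) / len(y_true_toy)
# agreement between truth and prediction, corrected for chance agreement
kappa = metrics.cohen_kappa_score(y_true_toy, y_pred_toy)

print(cm)
print("accuracy: {:.3f}, baseline: {:.3f}, kappa: {:.3f}".format(accuracy, baseline, kappa))
```

A classifier is only informative if its accuracy clearly exceeds the baseline; kappa expresses the same comparison on a scale where 0 means chance-level agreement.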

### Classify the to-be-determined data using a majority-vote decision rule ###

In [None]:
feature_uncertain = vectorizer.transform(data_uncertain["content"])
uncertain_class = nb_classifier.predict(feature_uncertain)
decision = Counter(uncertain_class)

print("book 9 has {} votes for early style and {} for late style".format(decision["early"],decision["late"]))
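
A majority vote treats a slice classified with 51% confidence the same as one classified with 99%. One possible complement, sketched here under the assumption that it suits the research question, is to also average the posterior probabilities from `predict_proba` over all slices. The count matrices `X_toy` and `X_new` and the classifier `toy_clf` are invented for illustration and unrelated to the real features.

```python
from collections import Counter

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# invented term-count matrices: 4 training slices, 3 undetermined slices, 2 features
X_toy = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
y_toy = np.array(["early", "early", "late", "late"])
X_new = np.array([[2, 0], [2, 1], [0, 1]])

toy_clf = MultinomialNB().fit(X_toy, y_toy)

# hard decision: one vote per slice
votes = Counter(toy_clf.predict(X_new))
# soft decision: mean posterior per class, columns ordered as toy_clf.classes_
mean_proba = toy_clf.predict_proba(X_new).mean(axis=0)

print(votes)
print(dict(zip(toy_clf.classes_, mean_proba.round(3))))
```

If the vote share and the mean posterior point in different directions, the majority vote is resting on many low-confidence slices and should be interpreted with care.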