# Naive Bayes

The Naive Bayes family of classifiers is not really Bayesian, but the process kind of looks that way if you squint hard enough. I like this classifier a lot because it's fast, simple, and does reasonably well on many datasets. You can usually do better than Naive Bayes with a tailor-made model, but as a quick-and-dirty check on whether you meet a minimum baseline for a go/no-go decision, it does well.

In [None]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Now let's prep the data.  Because we'll do it the same way for each, we only need to do this once.  I'll also remove the bits where we analyze the data, as we've seen it enough times already.

In [None]:
campus_data = "../data/CampusRecruitment.csv"
df = pd.read_csv(campus_data, header=0)
y = df['status']
X = df.drop(['sl_no', 'status', 'salary'], axis=1)

## Pre-Processing

Something special about categorical Naive Bayes is that it needs to know the number of unique categories in each feature, and we have to work that out before splitting into training and test data. Otherwise, we can run into an error where a categorical value shows up for the first time in the test data. We can use `nunique()` to get the number of unique values for each column, and then convert the result to an array for input.

In [None]:
X.nunique().array

But remember that this includes some continuous variables as well!

In [None]:
X.describe()

There are a couple of approaches we could take here.

1. Perform two Naive Bayes calculations, one with continuous variables and one with categorical variables. Multiply the results of the two together to get our final outcome.
2. Convert our continuous variables to discrete variables. We have a five-number summary and can bucketize results that way.
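
The first approach can be sketched as follows. This is a minimal illustration on synthetic data, not the campus recruitment dataset: `X_cat_demo`, `X_cont_demo`, and `y_demo` are made-up names and values. One detail worth noting is that each model's posterior already includes the class prior, so multiplying the two posteriors directly would count the prior twice; the sketch divides it out once by working in log space.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Synthetic stand-in data (hypothetical, not the campus recruitment dataset)
rng = np.random.default_rng(0)
n = 200
y_demo = rng.integers(0, 2, size=n)
# Two continuous features whose mean shifts with the class
X_cont_demo = rng.normal(loc=1.5 * y_demo[:, None], scale=1.0, size=(n, 2))
# One categorical feature (codes 0-2) that usually matches the class label
X_cat_demo = np.where(rng.random((n, 1)) < 0.7, y_demo[:, None], 2)

# One model per feature type
cat_nb = CategoricalNB(alpha=1, min_categories=3).fit(X_cat_demo, y_demo)
gauss_nb = GaussianNB().fit(X_cont_demo, y_demo)

# Each model's log posterior already includes the log class prior, so
# subtract the prior once to avoid counting it twice when combining.
log_post = (
    cat_nb.predict_log_proba(X_cat_demo)
    + gauss_nb.predict_log_proba(X_cont_demo)
    - cat_nb.class_log_prior_
)
# Renormalize so each row sums to 1
log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
combined_proba = np.exp(log_post)
combined_pred = cat_nb.classes_[combined_proba.argmax(axis=1)]
```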

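Either approach is reasonable, but I'll go with the second approach, using scikit-learn's `KBinsDiscretizer` to bin the data. Note that the code below uses five equal-width bins (`strategy='uniform'`) rather than cut points taken from the five-number summary. This is almost the opposite approach from kNN and logistic regression, where we needed to perform ordinal encoding on categorical variables.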

In [None]:
est = preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform', subsample=None)

# Get only continuous features
X_continuous = X.select_dtypes('float64')

# Fit the input features to our encoder
est.fit(X_continuous)

# Perform the transformation on our dataset
codes = est.transform(X_continuous)
feature_names = est.get_feature_names_out(['ssc_p', 'hsc_p', 'degree_p', 'etest_p', 'mba_p'])

# Combine the new discretized features with our existing categorical features and everything is now categorical.
X = pd.concat([X.select_dtypes(exclude='float64'), pd.DataFrame(codes, columns=feature_names).astype(int)], axis=1)
X.shape
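
To see what equal-width binning does, here is a minimal sketch on a toy column. The values in `toy` are hypothetical and chosen only so the cut points are easy to read; `toy_est` and `toy_codes` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy column, not taken from the dataset
toy = np.array([[0.0], [2.5], [5.0], [7.5], [10.0]])

toy_est = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
toy_codes = toy_est.fit_transform(toy)

print(toy_est.bin_edges_[0])  # cut points for the single column
print(toy_codes.ravel())      # bin index assigned to each value
```

With `strategy='uniform'`, the range from the column minimum to its maximum is split into `n_bins` intervals of equal width, and each value is replaced by the index of the interval it falls into.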

Taking a look at one of the new columns, we can see it is no longer continuous.

In [None]:
X['ssc_p'].unique()

And this seriously cuts down on the number of unique values in our dataset.

In [None]:
X.nunique().array

Now we can create the Categorical Naive Bayes classifier based on these categories.

In [None]:
# One entry per column: the minimum number of categories the model should allow for
num_categories = X.nunique().to_numpy()
clf = CategoricalNB(alpha=1, min_categories=num_categories)
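
A small sketch of what `min_categories` guards against, using hypothetical toy arrays (`X_tr`, `y_tr`, `X_new`) rather than our dataset. The failure on an unseen code is an `IndexError` in the scikit-learn versions I'm aware of, but treat the exact exception type as an assumption.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical toy data: the training rows only contain codes 0 and 1
X_tr = np.array([[0], [1], [0], [1]])
y_tr = np.array([0, 1, 0, 1])
X_new = np.array([[2]])  # code 2 never appeared during training

# Without min_categories, the fitted probability table has no column for code 2
plain = CategoricalNB(alpha=1).fit(X_tr, y_tr)
try:
    plain.predict(X_new)
    unseen_failed = False
except Exception as exc:  # assumed to be an IndexError; caught broadly to be safe
    unseen_failed = True
    print(type(exc).__name__)

# Telling the model to expect three categories reserves a smoothed slot for code 2
roomy = CategoricalNB(alpha=1, min_categories=[3]).fit(X_tr, y_tr)
print(roomy.n_categories_)               # categories allowed per feature
print(roomy.feature_log_prob_[0].shape)  # (n_classes, n_categories)
print(roomy.predict(X_new))
```

The `alpha=1` smoothing is what gives the never-observed category a non-zero probability once a slot for it exists.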

Before we can feed this data into the categorical Naive Bayes algorithm (or pretty much any other classification algorithm), we need to convert any text data into numeric data. There are a few common encoding techniques; the one we will use for our dataset is ordinal encoding, which maps each distinct value of a feature to an integer code. For example, `gender` has two values, M and F, so ordinal encoding will set one of these to 0 and the other to 1. We need to do this for each of the non-numeric features. The features we've already discretized are integer codes, so they come through unchanged as long as every bin is occupied; if a bin is empty, `OrdinalEncoder` simply renumbers the occupied bins consecutively.

In [None]:
enc = preprocessing.OrdinalEncoder()

X = enc.fit_transform(X)
X.shape

The `shape` here shows that we have the same number of rows as before (215) as well as the same number of columns (12).

In [None]:
X
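
To make the mapping concrete, here is a minimal sketch on a hypothetical toy frame. The column names are borrowed from our dataset for readability, but the values are made up, and `toy_df`, `toy_enc`, and `toy_encoded` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame: one text column and one already-discretized column
# whose codes have gaps (bins 1 and 3 are empty)
toy_df = pd.DataFrame({'gender': ['M', 'F', 'F', 'M'], 'ssc_p': [0, 2, 4, 2]})

toy_enc = OrdinalEncoder()
toy_encoded = toy_enc.fit_transform(toy_df)

print(toy_enc.categories_)  # sorted distinct values found in each column
print(toy_encoded)          # float array holding the integer codes
```

Each value is replaced by its position in that column's sorted list of distinct values, which is why `categories_` is the place to look when you need to translate codes back.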

By contrast, I'm going to perform a simple label encoding of `status`, our target.

In [None]:
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
y
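
It helps to know which label ended up as 0 and which as 1 before reading a confusion matrix. A minimal sketch, assuming the two `status` labels are the strings `'Placed'` and `'Not Placed'` (the notebook does not show them, so treat these values as illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label values, for illustration only
toy_status = ['Placed', 'Not Placed', 'Placed', 'Placed']

toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_status)

print(toy_le.classes_)                  # position in this array is the integer code
print(toy_y)
print(toy_le.inverse_transform(toy_y))  # map codes back to the original labels
```

On the real encoder, `le.classes_` gives the same lookup for our fitted labels.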

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1740)

In [None]:
X_train

Now let's train the model.

In [None]:
clf = clf.fit(X_train, y_train)

## How'd We Do?

Let's first use the `accuracy_score` function in sklearn to see just how well we did.

In [None]:
predicted = clf.predict(X_test)
accuracy_score(y_test, predicted)

Now let's review the confusion matrix and classification report.

In [None]:
confusion_matrix(y_test, predicted)

In [None]:
print(classification_report(y_test, predicted))
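
As a reminder of how to read sklearn's confusion matrix, here is a minimal sketch with hypothetical toy labels (`toy_true` and `toy_pred` are made up, not our predictions).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels (0 = negative class, 1 = positive class)
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

# Rows are true classes and columns are predicted classes, so for a binary
# problem ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()
print(tn, fp, fn, tp)
```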

Let's compare this to the rest of the group:

|Case|kNN|XGBoost|Logistic Regression|Naive Bayes|
|----|---|-------|-------------------|-----------|
|True Negative|13|18|17|15|
|False Positive|8|3|4|6|
|False Negative|4|5|6|5|
|True Positive|40|39|38|39|

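Compared to kNN, Naive Bayes did a little better at indicating when a person would not be placed after graduation (15 true negatives versus 13). It did slightly worse at picking out the students who would receive placements (39 true positives versus 40). Its accuracy, 54 of 65 test cases or about 83%, is right in line with the other algorithms, which range from about 82% (kNN) to about 88% (XGBoost).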
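
The comparison can be checked directly from the counts in the table. This sketch only re-does the arithmetic; the dictionary keys and metric names are my own labels, with "placed" taken as the positive class.

```python
# Counts copied from the comparison table above: (TN, FP, FN, TP)
results = {
    'kNN': (13, 8, 4, 40),
    'XGBoost': (18, 3, 5, 39),
    'Logistic Regression': (17, 4, 6, 38),
    'Naive Bayes': (15, 6, 5, 39),
}

summary = {}
for name, (tn, fp, fn, tp) in results.items():
    total = tn + fp + fn + tp
    summary[name] = {
        'accuracy': (tn + tp) / total,
        'recall_placed': tp / (tp + fn),      # share of placed students found
        'recall_not_placed': tn / (tn + fp),  # share of unplaced students found
    }
    print(f"{name}: " + ", ".join(f"{k}={v:.3f}" for k, v in summary[name].items()))
```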

As a side note, if we did not discretize our dataset and simply ran `CategoricalNB()` on our continuous data, the accuracy would drop by about 8%, leading us to rather different conclusions.