# Participation: Naive Bayes

This week, we will explore the ins and outs of Naive Bayes Classification. We'll be looking at different ways to tease apart the data, different forms of Naive Bayes Classification, and how they impact our results.

In [1]:
#IMPORTS
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB

For our work this week, we'll be exploring a classic ML dataset - the "Adult" data. The objective here is to predict a binary measure of income (above or below 50K) using features about the person.

In [2]:
dataset = fetch_openml(name="adult", version=2)

## Categorical NB

For Categorical Naive Bayes, we need to work with features that can be properly treated as categories. However, the implementation of Categorical NB in scikit-learn requires each feature to be encoded as integers in the range [0, num_categories-1]. We use an OrdinalEncoder to get the variables into this form.

The classifier also cannot handle missing information / NaN values, so we need the OrdinalEncoder to encode them as their own category. We use -1 as the placeholder for this new category, then we add 1 to every encoded value so the feature values start at 0 again.
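
To make the "-1, then add 1" step concrete, here is a minimal pure-Python sketch of the same idea on one toy column. It does not call scikit-learn; the column values and the names `workclass`, `code_of`, `encoded`, and `shifted` are made up for illustration, and `None` stands in for NaN.

```python
# Hypothetical toy column with one missing entry (None stands in for NaN).
workclass = ["Private", "State-gov", None, "Private"]

# Give each observed category an integer code, in sorted order.
categories = sorted({v for v in workclass if v is not None})
code_of = {cat: i for i, cat in enumerate(categories)}

# Missing values get the placeholder code -1 ...
encoded = [code_of[v] if v is not None else -1 for v in workclass]

# ... and adding 1 shifts every code so the values start at 0 again.
shifted = [c + 1 for c in encoded]

print(encoded)  # [0, 1, -1, 0]
print(shifted)  # [1, 2, 0, 1]
```

After the shift, code 0 means "missing" and the real categories occupy 1 and up, so every value is a non-negative integer.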

**0) Change the random state below to your unique 5 digit identifier from your BuffOne Card.**

In [3]:
#Drop Numeric Features
catData = dataset.data.drop(["age", "fnlwgt", "education-num", "capital-gain", "capital-loss"], axis=1)

#Encode Features
feature_encoder = OrdinalEncoder(handle_unknown = "use_encoded_value", encoded_missing_value=-1, unknown_value=-1)
#feature_encoder = OrdinalEncoder()
feature_encoder.fit(catData)
encoded_dataset_X = feature_encoder.transform(catData)
encoded_dataset_X = encoded_dataset_X + 1

#Create Train/Test splits for our data
train_X, test_X, train_y, test_y = train_test_split(encoded_dataset_X, dataset.target.values, test_size = 0.3, random_state = 78578)

Here, we create, fit, and score the CategoricalNB Classifier on our data.

**1) Report your Accuracy.**

The accuracy was around $0.79$.

In [6]:
nb = CategoricalNB()
nb.fit(train_X, train_y)
#Use the NB score function to get your resulting accuracy
nb.score(test_X, test_y)

0.7904865897768375
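
For reference, the `score` method of a scikit-learn classifier returns the mean accuracy, i.e. the fraction of test rows whose predicted label matches the true label. A minimal sketch of that calculation, using a hypothetical helper `accuracy` and made-up labels rather than the Adult data:

```python
def accuracy(predicted, actual):
    """Return the fraction of positions where predicted equals actual."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Hypothetical labels for five people (not taken from the Adult data).
y_true = [">50K", "<=50K", "<=50K", "<=50K", ">50K"]
y_pred = ["<=50K", "<=50K", "<=50K", "<=50K", ">50K"]

# Four of the five predictions match the true labels.
print(accuracy(y_pred, y_true))  # 0.8
```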

**2) Classify using the same features (i.e. use the train_X, test_X, train_y, and test_y) on a KNN and a Random Forest (just like our previous participations, you can copy that previous code). Which ones do better? Worse?**

First, I fit a KNN classifier for each `n_neighbors` value from 1 to 30 and found that $n = 15$ yields the highest accuracy, about $0.81$.

For the Random Forest, I tested each `n_estimators` value from 50 to 200 and found that `n_estimators = 96` yields the highest accuracy, about $0.82$.

Both therefore do better than Naive Bayes, which was the worst of the three at about $0.79$.

In [11]:
#(We went ahead and set up the imports for you)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

#Create and run your classifiers.

#knn
#Score KNN on the test split for every n_neighbors from 1 to 30
knn_lst_scores = []
for i in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(train_X, train_y)
    knn_lst_scores.append(knn.score(test_X, test_y))

#The list starts at n_neighbors = 1, so add 1 to the index of the best score
print(f"The best hyperparameter for knn (neighbors) with the highest accuracy is {knn_lst_scores.index(max(knn_lst_scores)) + 1} with an accuracy of {max(knn_lst_scores)}")

#rf
#Score a Random Forest on the test split for every n_estimators from 50 to 200
#(no random_state is set here, so the results can vary between runs)
rf_lst_scores = []
for i in range(50, 201):
    rf = RandomForestClassifier(n_estimators=i)
    rf.fit(train_X, train_y)
    rf_lst_scores.append(rf.score(test_X, test_y))

#The list starts at n_estimators = 50, so add 50 to the index of the best score
print(f"The best hyperparameter for rf (estimators) with the highest accuracy is {rf_lst_scores.index(max(rf_lst_scores)) + 50} with an accuracy of {max(rf_lst_scores)}")

The best hyperparameter for knn (neighbors) with the highest accuracy is 15 with an accuracy of 0.8099365317682385
The best hyperparameter for rf (estimators) with the highest accuracy is 96 with an accuracy of 0.8203098341636524
