---
title: "Naive Bayes"
format: html
---

# Introduction to Naive Bayes

Naive Bayes is a supervised machine learning algorithm used for classification tasks. The algorithm is rooted in Bayes' theorem: it computes the probability of a hypothesis given the data and prior knowledge. Below is the equation the Naive Bayes algorithm uses.

![](./images/Bayes_rule.png)

Image source: https://www.saedsayad.com/naive_bayesian.htm

Where P(c|x) = the posterior probability of the target variable class given the features <br>
P(x|c) = the likelihood, i.e. the probability of the features given the class <br>
and P(c), P(x) are the prior probabilities of the target variable class and of the features, respectively.
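
For reference, and assuming the image shows the standard form of Bayes' rule with the symbols defined above, the equation can be written out as:

$$
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}
$$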

The algorithm estimates the probability of each class for a given data point and assigns the class with the highest probability to that data point.
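
As a minimal sketch of that decision rule, the snippet below applies Bayes' rule to a single observation and picks the most probable class. The class names and all probabilities are made-up illustrative numbers, not values estimated from the census data.

```python
# Toy numbers for illustration only (assumptions, not estimated from the data).
priors = {"increase": 0.4, "no_increase": 0.6}          # P(c)
likelihoods = {"increase": 0.30, "no_increase": 0.10}   # P(x | c) for one observed x

# Evidence P(x) = sum over classes of P(x | c) * P(c)
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# Posterior P(c | x) for each class via Bayes' rule
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}

# The predicted class is the one with the highest posterior probability
predicted = max(posteriors, key=posteriors.get)
print(posteriors, predicted)
```

Because P(x) is the same for every class, it only rescales the posteriors so that they sum to 1; the winning class is already determined by P(x|c) P(c).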

It assumes that all features are independent of each other given the class; this is usually not the case in the real world, but the algorithm often still provides accurate predictions.
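
Written out, the independence assumption lets the likelihood of a feature vector factor into one term per feature. The notation $x = (x_1, \dots, x_n)$ and $\hat{c}$ is introduced here for illustration and does not appear in the original figure.

$$
P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The denominator $P(x)$ is omitted from the second expression because it does not depend on $c$.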

Ultimately, the goal of Naive Bayes is to achieve accurate classifications when the algorithm is given data with a set of variables. In my case, I would like to predict whether a state's home value will increase substantially in a given year based on census data.

Different variants of Naive Bayes are used for different applications. Multinomial Naive Bayes is used for word counts and frequency analysis (discrete data). Gaussian Naive Bayes is used for numeric data that is approximately normally distributed and independent. Bernoulli Naive Bayes is used for binary data, such as whether or not a word appears in a document.
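
The three variants map onto three scikit-learn classes. The sketch below fits each one to randomly generated toy data of the matching type; the data, the variable names, and the thresholds are assumptions made for illustration and have nothing to do with the census data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y_toy = np.array([0] * 20 + [1] * 20)  # two toy classes, 20 rows each

# Continuous features: class 1 is shifted upward (suits GaussianNB)
X_cont = rng.normal(loc=2.0 * y_toy[:, None], scale=1.0, size=(40, 3))

# Count features, e.g. word counts, with class-specific rates (suits MultinomialNB)
rates = np.where(y_toy[:, None] == 1, [1.0, 1.0, 4.0], [4.0, 1.0, 1.0])
X_count = rng.poisson(lam=rates)

# Binary features, e.g. "count exceeds 2 or not" (suits BernoulliNB)
X_bin = (X_count > 2).astype(int)

variants = {
    "GaussianNB": (GaussianNB(), X_cont),
    "MultinomialNB": (MultinomialNB(), X_count),
    "BernoulliNB": (BernoulliNB(), X_bin),
}

# Fit each variant on its own feature type and record the training accuracy
scores = {}
for name, (model, X_toy) in variants.items():
    scores[name] = model.fit(X_toy, y_toy).score(X_toy, y_toy)
    print(name, round(scores[name], 3))
```

The census features used later are continuous percent changes, which is why the Gaussian variant is the natural choice for this data.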

# Prepare Data for Naive Bayes

In [24]:
import pandas as pd

In [25]:
record=pd.read_csv('data/RecordData.csv')
record=record.drop('DP05_0073E',axis=1)

In [26]:
record.head()

| | Year | DP02_0001E | DP02_0002E | DP02_0003E | DP02_0007E | DP02_0011E | DP02_0037E | DP02_0060E | DP02_0061E | DP02_0062E | ... | DP04_0134E | DP05_0001E | DP05_0004E | DP05_0018E | DP05_0037E | DP05_0038E | DP05_0039E | DP05_0044E | RegionName | Typical Home Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018 | 0.007341 | 0.009226 | -0.008186 | 0.090413 | 0.007228 | 0.004783 | 0.02271 | -0.002918 | 0.018174 | ... | 0.050667 | 0.002692 | 0.003198 | 0.010283 | -0.001775 | -0.000327 | -0.123824 | -0.027097 | Alabama | 0.038688 |
| 1 | 2019 | 0.022851 | -0.265678 | -0.360479 | -0.317095 | -0.802204 | 1.30299 | -0.58836 | -0.7052 | 0.439418 | ... | 0.024112 | 0.003133 | -0.006376 | 0.002545 | 0.005908 | 0.009572 | 0.05448 | 0.015884 | Alabama | 0.07036 |
| 2 | 2021 | 0.03688 | 0.013836 | 0.056032 | -0.010988 | 0.08189 | -0.050374 | -0.025077 | -0.044056 | 0.038532 | ... | 0.066914 | 0.027878 | 0.008556 | 0.010152 | -0.013376 | -0.010947 | 0.067268 | 0.040587 | Alabama | 0.263465 |
| 3 | 2022 | 0.024848 | 0.041137 | 0.053954 | -0.058821 | 0.009307 | 0.106362 | -0.070979 | -0.06272 | -0.020504 | ... | 0.060395 | 0.006829 | -0.00106 | -0.005025 | 0.006291 | -0.002353 | -0.017318 | 0.146484 | Alabama | 0.076587 |
| 4 | 2018 | 0.015195 | 0.000753 | -0.043082 | -0.078405 | 0.043289 | 0.942898 | -0.208181 | 0.031255 | 0.017837 | ... | -0.019983 | -0.003186 | -0.009174 | 0.011594 | -0.000634 | 0.139775 | 0.012942 | -0.060828 | Alaska | 0.005789 |


Currently my target variable is continuous. I must convert it to a class label. To do this, I will set home value percent changes that are above the mean to 1 and those that are at or below the mean to 0.

In [27]:
mean=record['Typical Home Value'].mean()
record['Typical Home Value']=record['Typical Home Value'].apply(lambda x:1 if x>mean else 0)
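
The same thresholding can be sketched on a tiny stand-in frame to check how balanced the resulting classes are. The five values below are the rounded target values from the `head()` output above; the mean here is computed over just those five rows, so it is a toy threshold and not the mean of the full dataset.

```python
import pandas as pd

# Rounded target values from the first five rows shown above (toy subset only)
toy = pd.DataFrame({"Typical Home Value": [0.04, 0.07, 0.26, 0.08, 0.01]})

# Threshold at the mean of this toy subset (assumption: not the full-data mean)
threshold = toy["Typical Home Value"].mean()

# Vectorized equivalent of apply(lambda x: 1 if x > mean else 0)
labels = (toy["Typical Home Value"] > threshold).astype(int)

# Share of rows in each class; a strong imbalance affects how accuracy should be read
print(labels.value_counts(normalize=True))
```

Because a mean is pulled toward large values, a mean threshold can leave fewer rows in the positive class than in the negative one; a median threshold would be one alternative if balanced classes are wanted.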

In [30]:
# Drop the Year and RegionName columns so only the census features and the label remain
record=record.drop(['Year','RegionName'],axis=1)

In [31]:
record.head()

| | DP02_0001E | DP02_0002E | DP02_0003E | DP02_0007E | DP02_0011E | DP02_0037E | DP02_0060E | DP02_0061E | DP02_0062E | DP02_0063E | ... | DP04_0047E | DP04_0134E | DP05_0001E | DP05_0004E | DP05_0018E | DP05_0037E | DP05_0038E | DP05_0039E | DP05_0044E | Typical Home Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.007341 | 0.009226 | -0.008186 | 0.090413 | 0.007228 | 0.004783 | 0.02271 | -0.002918 | 0.018174 | 0.033732 | ... | 0.007581 | 0.050667 | 0.002692 | 0.003198 | 0.010283 | -0.001775 | -0.000327 | -0.123824 | -0.027097 | 0 |
| 1 | 0.022851 | -0.265678 | -0.360479 | -0.317095 | -0.802204 | 1.30299 | -0.58836 | -0.7052 | 0.439418 | 1.429793 | ... | -0.000968 | 0.024112 | 0.003133 | -0.006376 | 0.002545 | 0.005908 | 0.009572 | 0.05448 | 0.015884 | 0 |
| 2 | 0.03688 | 0.013836 | 0.056032 | -0.010988 | 0.08189 | -0.050374 | -0.025077 | -0.044056 | 0.038532 | 0.007408 | ... | -0.004602 | 0.066914 | 0.027878 | 0.008556 | 0.010152 | -0.013376 | -0.010947 | 0.067268 | 0.040587 | 1 |
| 3 | 0.024848 | 0.041137 | 0.053954 | -0.058821 | 0.009307 | 0.106362 | -0.070979 | -0.06272 | -0.020504 | 0.014854 | ... | 0.017788 | 0.060395 | 0.006829 | -0.00106 | -0.005025 | 0.006291 | -0.002353 | -0.017318 | 0.146484 | 0 |
| 4 | 0.015195 | 0.000753 | -0.043082 | -0.078405 | 0.043289 | 0.942898 | -0.208181 | 0.031255 | 0.017837 | -0.065209 | ... | -0.043262 | -0.019983 | -0.003186 | -0.009174 | 0.011594 | -0.000634 | 0.139775 | 0.012942 | -0.060828 | 0 |


# Naive Bayes with Labeled Record Data

In [44]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [36]:
# Features: every column except the label
X=record.drop(['Typical Home Value'],axis=1)
# Target: the binary home value label
y=record['Typical Home Value']

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [39]:
X_train.shape, X_test.shape

((142, 30), (62, 30))
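
With only 204 rows, the class mix in a 62-row test set can drift from the mix in the training set. One option, not used in the analysis above, is a stratified split. The sketch below uses random stand-in data with the same shape as `X`; the class counts (130 and 74) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data with the same shape as X (204 rows, 30 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(204, 30))
y_demo = np.array([0] * 130 + [1] * 74)  # assumed class counts, for illustration

# stratify=y_demo keeps the class proportions (nearly) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of class 1 in each split
```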

In [41]:
naive = GaussianNB()
naive.fit(X_train, y_train)

In [42]:
y_pred = naive.predict(X_test)

In [43]:
y_pred 

array([1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1], dtype=int64)

In [45]:
accuracy_score(y_test, y_pred)

0.7096774193548387

A test accuracy of about 71% is fairly accurate for a first model.
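
Accuracy is easier to judge next to a baseline and a confusion matrix. The sketch below shows one way to compute both; it runs on synthetic data from `make_classification` as a stand-in for the census features, so its numbers say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the census data (assumption: 204 rows, 30 features)
X_demo, y_demo = make_classification(n_samples=204, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Baseline that always predicts the most frequent class in the training set
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# Gaussian Naive Bayes on the same split
model = GaussianNB().fit(X_tr, y_tr)
y_hat = model.predict(X_te)
model_acc = accuracy_score(y_te, y_hat)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat, labels=[0, 1])

print("baseline accuracy:", baseline_acc)
print("model accuracy:", model_acc)
print(cm)
```

On the real data, the same comparison would use `X_train`, `X_test`, `y_train`, and `y_test` from the split above.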

In [46]:
y_pred_train = naive.predict(X_train)

In [47]:
accuracy_score(y_train, y_pred_train)

0.6197183098591549

At about 62%, the model is not very accurate on the training set, and it scores lower there than on the test set, so there is some underfitting going on.
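
With a test set of only 62 rows, a single split can give a noisy accuracy estimate. Cross-validation is one way to get a steadier number. This is a sketch on synthetic stand-in data, and the choice of 5 folds is an assumption rather than something taken from the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the census data (assumption: 204 rows, 30 features)
X_demo, y_demo = make_classification(n_samples=204, n_features=30, random_state=0)

# 5 stratified folds, shuffled with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold; the spread shows how much a single split can vary
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```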

I may need to examine the data further and remove outliers.
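
One common way to flag outliers is the 1.5 × IQR rule, sketched below on two columns copied from the five rows of the `head()` output above. Both the choice of rule and the choice of columns are assumptions for illustration. With only five rows the quartiles are rough, so this shows the mechanics only.

```python
import pandas as pd

# Two feature columns copied from the five rows shown earlier (toy subset)
toy = pd.DataFrame({
    "DP02_0011E": [0.007228, -0.802204, 0.08189, 0.009307, 0.043289],
    "DP02_0037E": [0.004783, 1.30299, -0.050374, 0.106362, 0.942898],
})

# Per-column quartiles and interquartile range
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# A value is flagged if it lies more than 1.5 * IQR outside the quartiles
outside = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)

# A row is flagged if any of its features is flagged
flagged = outside.any(axis=1)
print(toy[flagged])
```

Whether flagged rows should be dropped, clipped, or kept is a separate decision, since large percent changes in census variables may be genuine.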