# Naive Bayes Algorithm

This algorithm is pretty much similar to Logistic Regression algorithm. It is a **classification technique** based on **Bayes Theorem** with an assumption of independence among predictors (ie. it assumes the presence of a particular feature in a class is unrelated to the presence of any other feature). 

Naive Bayes is known to outperform even highly sophisticated classification method. But it is very important to consider that continuous features follow the normal distribution.

The model assumes that there is an independence among predictors, it is suggested to remove correlated features, because the two highly correlated features will be voted twice in the model and it can lead to over inflating importance.

This model is used generally when the dimensionality of the input is very high. This algorithm can only be used for **classification problems**

**Bayes Theorem**
                    $$P(Y|X) = P(X|Y) P(X)$$
                    
This model basically calculates the probability of Y for given X, where X is the prior event and Y is the dependence event.

In [1]:
# Import the necessary packages
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Loading the datasets
training = pd.read_csv("./occupancy_data/occu_trg.csv")
test = pd.read_csv("./occupancy_data/occu_test.csv")

In [3]:
# Display the characteristics of both training and test datasets
print("Dimensions of training dataset: ", training.shape)
print("Dimensions of test dataset: ", test.shape)
print("The variables in training dataset are: \n", training.columns)

Dimensions of training dataset:  (8143, 6)
Dimensions of test dataset:  (2665, 6)
The variables in training dataset are: 
 Index(['Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio',
       'Occupancy'],
      dtype='object')


In [4]:
# Train-Test Split
x_trg = training.drop("Occupancy", axis = 1)
y_trg = training["Occupancy"]

x_test = test.drop("Occupancy", axis = 1)
y_test = test["Occupancy"]

In [5]:
x_trg

Unnamed: 0,Temperature,Humidity,Light,CO2,HumidityRatio
0,23.18,27.2720,426.0,721.250000,0.004793
1,23.15,27.2675,429.5,714.000000,0.004783
2,23.15,27.2450,426.0,713.500000,0.004779
3,23.15,27.2000,426.0,708.250000,0.004772
4,23.10,27.2000,426.0,704.500000,0.004757
...,...,...,...,...,...
8138,21.05,36.0975,433.0,787.250000,0.005579
8139,21.05,35.9950,433.0,789.500000,0.005563
8140,21.10,36.0950,433.0,798.500000,0.005596
8141,21.10,36.2600,433.0,820.333333,0.005621


In [6]:
y_trg

0       1
1       1
2       1
3       1
4       1
       ..
8138    1
8139    1
8140    1
8141    1
8142    1
Name: Occupancy, Length: 8143, dtype: int64

In [7]:
x_test

Unnamed: 0,Temperature,Humidity,Light,CO2,HumidityRatio
0,23.700000,26.272000,585.200000,749.200000,0.004764
1,23.718000,26.290000,578.400000,760.400000,0.004773
2,23.730000,26.230000,572.666667,769.666667,0.004765
3,23.722500,26.125000,493.750000,774.750000,0.004744
4,23.754000,26.200000,488.600000,779.000000,0.004767
...,...,...,...,...,...
2660,24.290000,25.700000,808.000000,1150.250000,0.004829
2661,24.330000,25.736000,809.800000,1129.200000,0.004848
2662,24.330000,25.700000,817.000000,1125.800000,0.004841
2663,24.356667,25.700000,813.000000,1123.000000,0.004849


In [8]:
y_test

0       1
1       1
2       1
3       1
4       1
       ..
2660    1
2661    1
2662    1
2663    1
2664    1
Name: Occupancy, Length: 2665, dtype: int64

In [9]:
# Feature Scaling
sc = StandardScaler()

x_trg = sc.fit_transform(x_trg)
x_test = sc.fit_transform(x_test)

In [10]:
# Label Encoding
labelencoder_y = LabelEncoder()

y_trg = labelencoder_y.fit_transform(y_trg)
y_test = labelencoder_y.fit_transform(y_test)

In [11]:
y_trg

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [12]:
y_test

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

### Model Building - Naive Bayes

In [13]:
# Model Building - Naive Bayes
naive_occu = GaussianNB()

naive_occu.fit(x_trg, y_trg)
print("Accuracy of Naive Bayes model on training dataset: %0.3f"% naive_occu.score(x_trg, y_trg))
print("Accuracy of Naive Bayes model on test dataset: %0.3f"% naive_occu.score(x_test, y_test))

Accuracy of Naive Bayes model on training dataset: 0.979
Accuracy of Naive Bayes model on test dataset: 0.904


In [14]:
# Predicting the values on test dataset
naive_pred = naive_occu.predict(x_test)

In [15]:
naive_pred

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [16]:
# Confusion Matrix
naive_results = confusion_matrix(y_test, naive_pred)
print("The confusion matrix: \n", naive_results)

The confusion matrix: 
 [[1682   11]
 [ 245  727]]


### Logistic Regression Model

In [17]:
# Creating a Logistic Regression Model
log_occu = LogisticRegression()

log_occu.fit(x_trg, y_trg)
log_pred = log_occu.predict(x_test)
log_acc_score = accuracy_score(y_test, log_pred)
log_results = confusion_matrix(y_test, log_pred)

print("The accuracy of Logistic Regression model: %0.3f"% log_acc_score)
print("The confusion matrix of Logistic Regression model: \n", log_results)

The accuracy of Logistic Regression model: 0.890
The confusion matrix of Logistic Regression model: 
 [[1671   22]
 [ 270  702]]
