# Gaussian Naive Bayes Classifier

---





*   Install *pandas* library
*   Install using *anaconda prompt*: `conda install pandas`
*   Install using *cmd*: `pip install pandas`
*   [Read excel file](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) using pandas
*   [Read csv file](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) using pandas
*   You can also read a txt file using the *read_csv* method. [Check here](https://kite.com/python/answers/how-to-read-a-text-file-with-pandas-in-python)!
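
As a minimal sketch of the last point: `read_csv` accepts a `sep` argument, so a tab-delimited text file can be read the same way as a csv file. The in-memory text and its values below are made up for illustration; with a real file you would pass its path instead of the `StringIO` object.

```python
import io
import pandas as pd

# hypothetical tab-delimited text, standing in for a .txt file on disk
raw_text = "name\tdiameter\tweight\norange\t2.96\t86.76\ngrapefruit\t7.50\t150.20\n"

# sep="\t" tells read_csv that the columns are separated by tabs
txt_data = pd.read_csv(io.StringIO(raw_text), sep="\t")
print(txt_data)
```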








In [1]:
# we will need pandas to read our data from a csv file
import pandas as pd

# this will read the citrus.csv file and convert it to a dataframe. We have saved the dataframe in a variable named data
data = pd.read_csv("../Dataset/citrus.csv") 

# check that the data has been read properly by printing the first 3 rows
data.head(3)

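|   | name | diameter | weight | red | green | blue |
|---|------|----------|--------|-----|-------|------|
| 0 | orange | 2.96 | 86.76 | 172 | 85 | 2 |
| 1 | orange | 3.91 | 88.05 | 166 | 78 | 3 |
| 2 | orange | 4.42 | 95.17 | 156 | 81 | 2 |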




*   See the documentation for more details on how the [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) function works!
*   Also check out the [tail](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) function!
*   To get random instances, use [sample](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).
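
The three functions can be compared side by side in a short sketch. The dataframe here is a small made-up one, not the citrus data.

```python
import pandas as pd

# small made-up dataframe, only for illustration
df = pd.DataFrame({
    "name": ["orange"] * 3 + ["grapefruit"] * 3,
    "diameter": [2.96, 3.91, 4.42, 7.60, 7.70, 7.80],
})

first_two = df.head(2)                        # first 2 rows
last_two = df.tail(2)                         # last 2 rows
random_two = df.sample(n=2, random_state=0)   # 2 random rows, reproducible because of random_state

print(first_two, last_two, random_two, sep="\n\n")
```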

---











*   To install the *scikit-learn* library, use either the `pip` or the `conda` command. See [here](https://pypi.org/project/scikit-learn/).
*   If you are using `pip`, check the installation dependencies and install the required libraries beforehand. If you are using `conda`, the dependencies will be resolved automatically during installation.






In [2]:
# to separate our data into train and test set
from sklearn.model_selection import train_test_split



---

*   For more details on indexing and selecting a subset of data from pandas dataframe, see [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)!
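
As a rough illustration of the two most common selection styles, the sketch below picks the same columns once by position (`iloc`) and once by label (`loc`). The dataframe is made up for the example.

```python
import pandas as pd

# small made-up dataframe, only for illustration
df = pd.DataFrame({
    "name": ["orange", "grapefruit"],
    "diameter": [2.96, 7.60],
    "weight": [86.76, 150.10],
})

# position-based: all rows, columns at positions 1 and 2 (the end position 3 is excluded)
by_position = df.iloc[:, 1:3]

# label-based: all rows, columns from "diameter" through "weight" (the end label is included)
by_label = df.loc[:, "diameter":"weight"]

print(by_position.equals(by_label))
```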


In [3]:
# separate the feature vector
X = data.iloc[:,1:6]
X.head(3)

|   | diameter | weight | red | green | blue |
|---|----------|--------|-----|-------|------|
| 0 | 2.96 | 86.76 | 172 | 85 | 2 |
| 1 | 3.91 | 88.05 | 166 | 78 | 3 |
| 2 | 4.42 | 95.17 | 156 | 81 | 2 |


In [4]:
# separate the class labels
y = data['name']
y.head(3)

0    orange
1    orange
2    orange
Name: name, dtype: object

In [5]:
# we will separate 30% of the data for testing; the remaining 70% will be used to train the model

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
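
To see what the split produces, here is a small sketch on made-up data. The `stratify` argument is an addition that the cell above does not use; it is shown only as one option for keeping the class ratio similar in the train and test parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: 10 samples with 2 features each, 5 samples per class
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array(["orange"] * 5 + ["grapefruit"] * 5)

# 30% of the samples go to the test set; random_state makes the shuffle reproducible
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123, stratify=y_demo
)

print(Xd_train.shape, Xd_test.shape)
```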



*   To learn more about the parameters and outputs of the Gaussian Naive Bayes classifier, read the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) here!
*   If the model had not worked well on our dataset, we could have tried to find better values for the model parameters (i.e., *priors* and *var_smoothing*).
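
One possible way to search for a better *var_smoothing* value is a cross-validated grid search. This is only a sketch under stated assumptions: the data is synthetic (generated with `make_classification`), and the candidate grid is an arbitrary choice, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# synthetic two-class data, standing in for a real dataset
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

# candidate values for var_smoothing (arbitrary grid, for illustration only)
param_grid = {"var_smoothing": np.logspace(-12, -3, 10)}

# 5-fold cross-validation over the grid, scored by accuracy
search = GridSearchCV(GaussianNB(), param_grid, cv=5, scoring="accuracy")
search.fit(X_syn, y_syn)

print(search.best_params_, search.best_score_)
```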



In [6]:
# we will use Gaussian (normal distribution) Naive Bayes
from sklearn.naive_bayes import GaussianNB

# call the model constructor
gnb = GaussianNB()

# fit estimates the mean (μ) and variance (σ²) of each feature for each class from the training data
model = gnb.fit(X_train, y_train)

# predict the class labels of the test data
y_predict = model.predict(X_test)
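
To make the fitting step more concrete, the sketch below fits a model on a tiny made-up dataset and compares the learned per-class means (stored in `theta_`) with means computed by hand. The per-class variances are stored in a similar attribute (`var_` in recent scikit-learn versions), which is not checked here.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# made-up training data: one feature, two classes
X_toy = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y_toy = np.array(["a", "a", "a", "b", "b", "b"])

toy_model = GaussianNB().fit(X_toy, y_toy)

# theta_ holds the mean of each feature per class, in the order given by classes_
print(toy_model.classes_)
print(toy_model.theta_)

# the same means computed by hand, one row per class
manual_means = np.array([X_toy[y_toy == c].mean(axis=0) for c in toy_model.classes_])
print(manual_means)
```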



*   Please look into *y_predict* to see what values the model is predicting!
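
A few ways to inspect predictions are sketched below on a tiny made-up dataset (not the citrus data): the raw predicted labels, how often each label is predicted, and the class probabilities from `predict_proba`.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# made-up training data: one feature, two classes
X_small = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y_small = np.array(["a", "a", "a", "b", "b", "b"])
clf = GaussianNB().fit(X_small, y_small)

# two new made-up samples to classify
X_new = np.array([[2.5], [8.5]])

preds = clf.predict(X_new)            # predicted class labels
proba = clf.predict_proba(X_new)      # one probability per class, columns ordered as clf.classes_

print(preds)
print(pd.Series(preds).value_counts())  # how many times each label was predicted
print(proba)
```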



In [7]:
# let's check model performance: score returns the mean accuracy on the test data
accuracy = model.score(X_test, y_test)
accuracy

0.9226666666666666



*   Why is accuracy alone **NOT** a good measure? See [here](https://tryolabs.com/blog/2013/03/25/why-accuracy-alone-bad-measure-classification-tasks-and-what-we-can-do-about-it/)!
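
A small made-up example shows the problem. With imbalanced classes, a "model" that always predicts the majority class gets a high accuracy while never finding a single positive sample.

```python
from sklearn.metrics import accuracy_score, recall_score

# made-up imbalanced labels: 95 negatives (0) and 5 positives (1)
y_true_imb = [0] * 95 + [1] * 5

# a useless "model" that always predicts the majority class
y_pred_imb = [0] * 100

acc = accuracy_score(y_true_imb, y_pred_imb)   # fraction of correct predictions
rec = recall_score(y_true_imb, y_pred_imb)     # fraction of positives that were found

print(acc)  # 0.95
print(rec)  # 0.0
```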



In [8]:
TP=0 # True Positive
TN=0 # True Negative
FP=0 # False Positive
FN=0 # False Negative



*   Learn more about [Confusion Matrix](https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62)!



In [9]:
# 'orange' is treated as the positive class and 'grapefruit' as the negative class
for i in range(len(y_test)):
    if y_test.iloc[i] == y_predict[i]:
        if y_predict[i] == 'orange':
            TP=TP+1
        else:
            TN=TN+1
    else:
        if y_predict[i] == 'grapefruit':
            FN=FN+1
        else:
            FP=FP+1
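
The four counters from the loop above correspond to the cells of a 2x2 confusion matrix. As a cross-check, the same counts can be obtained with scikit-learn's `confusion_matrix`; the sketch below uses short made-up label lists rather than the notebook's `y_test` and `y_predict`, and passes `labels` so that 'orange' is the positive class.

```python
from sklearn.metrics import confusion_matrix

# made-up true and predicted labels, only for illustration
y_true_bin = ["orange", "orange", "orange", "grapefruit", "grapefruit", "grapefruit"]
y_pred_bin = ["orange", "orange", "grapefruit", "grapefruit", "orange", "grapefruit"]

# rows are true labels, columns are predicted labels, both in the order [negative, positive]
cm_bin = confusion_matrix(y_true_bin, y_pred_bin, labels=["grapefruit", "orange"])

# for a 2x2 matrix in this order, ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = cm_bin.ravel()
print(tn, fp, fn, tp)
```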



*   More on [precision, recall, and F1-measure](https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c)!
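
These measures are also available in `sklearn.metrics`. The sketch below is an alternative to computing them by hand, shown on short made-up label lists; `pos_label` names the class treated as positive.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# made-up true and predicted labels: TP=2, FP=1, FN=1, TN=2 with 'orange' as positive
y_true_prf = ["orange", "orange", "orange", "grapefruit", "grapefruit", "grapefruit"]
y_pred_prf = ["orange", "orange", "grapefruit", "grapefruit", "orange", "grapefruit"]

precision = precision_score(y_true_prf, y_pred_prf, pos_label="orange")  # TP / (TP + FP)
recall = recall_score(y_true_prf, y_pred_prf, pos_label="orange")        # TP / (TP + FN)
f1 = f1_score(y_true_prf, y_pred_prf, pos_label="orange")                # 2*TP / (2*TP + FP + FN)

print(precision, recall, f1)
```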



In [10]:
# sensitivity/ recall/ TPR (True Positive Rate)
TPR = TP/(TP+FN)
TPR

0.9272727272727272

In [11]:
# specificity/ TNR (true negative rate)
TNR = TN/(TN+FP)
TNR

0.9181518151815181

In [12]:
# Precision for orange
PPV = TP/(TP+FP)
PPV

0.9173884077281812

In [13]:
# precision for grapefruit
NPV = TN/(TN+FN)
NPV

0.9279519679786524

In [14]:
# F1 score: harmonic mean of precision and recall
F1 = 2*TP/(2*TP+FP+FN)
F1

0.9223040857334226



*   If you have *n* classes, you can use the [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) function available from scikit-learn!
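
As a sketch of the multi-class case, the example below builds a 3x3 confusion matrix from made-up labels. The third class ('lemon') is hypothetical and is not part of the citrus dataset. Per-class recall is then read off as the diagonal divided by the row sums.

```python
from sklearn.metrics import confusion_matrix

# made-up labels for three classes ('lemon' is a hypothetical extra class)
class_labels = ["grapefruit", "lemon", "orange"]
y_true_multi = ["orange", "orange", "lemon", "lemon", "grapefruit", "grapefruit"]
y_pred_multi = ["orange", "lemon", "lemon", "lemon", "grapefruit", "orange"]

# rows are true classes, columns are predicted classes, both ordered as class_labels
cm_multi = confusion_matrix(y_true_multi, y_pred_multi, labels=class_labels)
print(cm_multi)

# recall per class: correctly predicted samples (diagonal) over all true samples (row sum)
recall_per_class = cm_multi.diagonal() / cm_multi.sum(axis=1)
print(dict(zip(class_labels, recall_per_class)))
```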

