# Naive Bayes Practical Implementation (Step-by-Step)

## Introduction to Naive Bayes Implementation
In this session, we will discuss the practical implementation of the Naive Bayes machine learning algorithm using Python and the Scikit-learn library.

## Variants of Naive Bayes
Previously, we discussed the variants of Naive Bayes. We first looked at Bernoulli Naive Bayes, where every feature is binary and takes the value 0 or 1. For example, with four features F1, F2, F3, and F4, a sample would be represented as a row of zeros and ones, such as 1 1 0 1 or 1 0 1 0. This binary representation is common when categorical features are converted into numerical form.

## Sparse Matrix Representation
When categorical features are converted into numerical features, for example through one-hot encoding, they often form a sparse matrix. A sparse matrix is one in which most entries are zero, with only a few non-zero values (often ones) in each row. This characteristic is especially relevant in Natural Language Processing (NLP) problems, where text is converted into numerical data using techniques that produce sparse matrices.
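
As a minimal sketch of this idea, the snippet below one-hot encodes a single categorical column and measures how many entries are zero. The toy `colors` column is made up for illustration and is not part of the session's data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical feature with four distinct categories.
colors = np.array([["red"], ["green"], ["blue"], ["yellow"], ["red"], ["blue"]])

# OneHotEncoder returns a SciPy sparse matrix by default.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors)

# Each row has exactly one non-zero entry, so most of the matrix is zeros.
n_cells = encoded.shape[0] * encoded.shape[1]
sparsity = 1.0 - encoded.nnz / n_cells

print(encoded.toarray())
print("Fraction of zero entries:", sparsity)
```

With more categories, the fraction of zeros grows, which is why such data is usually stored in a sparse format.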

## Use of Bernoulli and Multinomial Naive Bayes in NLP
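In NLP, converting text to numerical data can produce binary values, integer counts, or real-valued weights. For example, a bag-of-words representation yields word counts (or 0/1 presence flags), while TF-IDF yields non-negative real values, usually small in magnitude. For such sparse matrices, either Bernoulli or Multinomial Naive Bayes can be used: Bernoulli for binary presence/absence features and Multinomial for counts (it also tends to work with TF-IDF weights in practice). Multinomial Naive Bayes is the more common choice, but Bernoulli Naive Bayes is also valid. Word2Vec is different: it produces dense vectors that can contain negative values, so it is not a natural fit for either of these two variants.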
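
To make the difference between binary, count, and real-valued text features concrete, here is a small sketch using scikit-learn's vectorizers. The three-document corpus is hypothetical and was invented for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, made up for illustration.
docs = ["good good movie", "bad movie", "good plot"]

binary_vec = CountVectorizer(binary=True)  # 1 if the word occurs, else 0
count_vec = CountVectorizer()              # raw word counts
tfidf_vec = TfidfVectorizer()              # non-negative real-valued weights

X_binary = binary_vec.fit_transform(docs)
X_counts = count_vec.fit_transform(docs)
X_tfidf = tfidf_vec.fit_transform(docs)

# Columns follow the alphabetically sorted vocabulary.
print(sorted(count_vec.vocabulary_))
print(X_binary.toarray())
print(X_counts.toarray())
print(X_tfidf.toarray().round(2))
```

The binary matrix is the kind of input Bernoulli Naive Bayes expects, while the count matrix matches the assumptions of Multinomial Naive Bayes.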

## Practical Implementation Overview
In this session, we will focus on the practical implementation of Naive Bayes using the Iris dataset from Scikit-learn. We will not cover Bernoulli or Multinomial Naive Bayes here, as those are more relevant to NLP and will be discussed later.

## Choosing the Appropriate Naive Bayes Variant
Since the features of the Iris dataset are continuous numerical values, the appropriate variant is **Gaussian Naive Bayes**. This variant assumes that, within each class, every feature follows a Gaussian (normal) distribution.
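
For reference, the Gaussian assumption can be written out as follows. The notation is introduced here for illustration and is not taken from the session: $\mu_{c,i}$ and $\sigma_{c,i}^{2}$ denote the mean and variance of feature $i$ within class $c$, and $n$ is the number of features.

```latex
P(x_i \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}
  \exp\!\left(-\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}}\right),
\qquad
\hat{y} = \arg\max_{c} \; P(y = c) \prod_{i=1}^{n} P(x_i \mid y = c)
```

The product over features reflects the "naive" assumption that features are conditionally independent given the class.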

## Python Code Implementation

We will now implement the Gaussian Naive Bayes classifier, breaking each step into its own cell.

### Step 1: Import Libraries
First, we import all the necessary modules from scikit-learn.

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### Step 2: Load the Dataset
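We load the Iris dataset, which contains 150 samples. `X` holds the features (four measurements per flower: sepal length, sepal width, petal length, and petal width) and `y` holds the target (the species, encoded as 0, 1, or 2).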

In [2]:
X, y = load_iris(return_X_y=True)

### Step 3: Split Data into Training and Testing Sets
We split our data, reserving 30% of it for testing the model's performance. `random_state=42` ensures we get the same split every time we run the code, making our results reproducible.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
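
As a quick sanity check, the sketch below repeats the split in a self-contained form and inspects the resulting sizes and class counts. It uses the same parameters as the cell above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
print(np.bincount(y_train))         # samples per class in the training set
print(np.bincount(y_test))          # samples per class in the test set
```

A plain random split does not guarantee equal class counts in the test set. If balanced proportions matter, `train_test_split` also accepts `stratify=y`, though that would change the results shown in this walkthrough.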

### Step 4: Initialize the Model
We create an instance of the Gaussian Naive Bayes classifier.

In [4]:
gnb = GaussianNB()

### Step 5: Train (Fit) the Model
We train the model using our training data (`X_train` and `y_train`). For Gaussian Naive Bayes, training means estimating the mean and variance of each feature within each class, along with the prior probability of each class.

In [5]:
gnb.fit(X_train, y_train)

### Step 6: Make Predictions
Now that the model is trained, we ask it to predict the species for the test features (`X_test`).

In [6]:
y_pred = gnb.predict(X_test)
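
Besides hard labels, the classifier can also report how confident it is. The sketch below, which rebuilds the model so that it runs on its own, uses `predict_proba` to show the class probabilities for the first five test samples.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
gnb = GaussianNB().fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1.
proba = gnb.predict_proba(X_test[:5])
labels = gnb.predict(X_test[:5])

print(np.round(proba, 3))
print(labels)
```

The predicted label for each sample is simply the class with the highest probability in its row.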

### Step 7: Evaluate the Model - Confusion Matrix
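We compare the model's predictions (`y_pred`) with the true values (`y_test`) to see where the model was right and wrong. In the matrix, rows correspond to the true classes and columns to the predicted classes, so the diagonal holds the correct predictions.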

In [7]:
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[19  0  0]
 [ 0 12  1]
 [ 0  0 13]]


### Step 8: Evaluate the Model - Accuracy Score
We calculate the overall accuracy: (Number of correct predictions) / (Total number of predictions).

In [8]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy Score:", accuracy)

Accuracy Score: 0.9777777777777777
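
The same number can be recovered by hand from the confusion matrix, which is a useful check on how the two metrics relate. This sketch hard-codes the matrix printed in Step 7 rather than recomputing it.

```python
import numpy as np

# Confusion matrix copied from the output of Step 7.
conf_matrix = np.array([[19, 0, 0],
                        [0, 12, 1],
                        [0, 0, 13]])

correct = np.trace(conf_matrix)  # diagonal entries = correctly classified samples
total = conf_matrix.sum()        # all test samples
manual_accuracy = correct / total

print(correct, total, manual_accuracy)
```

Here 44 of the 45 test samples lie on the diagonal, which gives the accuracy of roughly 0.978 reported above.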


### Step 9: Evaluate the Model - Classification Report
Finally, we print a detailed report showing precision, recall, and f1-score for each class (each iris species).

In [9]:
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        13
           2       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45
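
To connect the report to the confusion matrix, the sketch below derives per-class precision, recall, and F1 directly from the matrix printed in Step 7. It is meant as an illustration of the definitions, not a replacement for `classification_report`.

```python
import numpy as np

# Confusion matrix copied from the output of Step 7
# (rows = true classes, columns = predicted classes).
conf_matrix = np.array([[19, 0, 0],
                        [0, 12, 1],
                        [0, 0, 13]])

true_pos = np.diag(conf_matrix)
precision = true_pos / conf_matrix.sum(axis=0)  # column sums = predicted counts
recall = true_pos / conf_matrix.sum(axis=1)     # row sums = true counts (support)
f1 = 2 * precision * recall / (precision + recall)

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
```

The single misclassified sample (true class 1, predicted as class 2) is what lowers the recall of class 1 and the precision of class 2.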



## Conclusion
In this session, we implemented Gaussian Naive Bayes on the Iris dataset. As we progress and learn more about NLP, we will explore Bernoulli and Multinomial Naive Bayes in detail. Thank you for following along, and see you in the next video.

## Key Takeaways
* Naive Bayes has different variants such as Bernoulli, Multinomial, and Gaussian, each suited for specific data types.
* Bernoulli Naive Bayes is typically used for sparse binary features, common in NLP tasks.
* Gaussian Naive Bayes is appropriate for continuous numerical features, demonstrated with the Iris dataset.
* A practical implementation involves loading the data, splitting it into training and test sets, fitting the model, and evaluating it with a confusion matrix, accuracy score, and classification report.