# Naive Bayes Algorithm Overview
Naive Bayes classifiers are based on Bayes' theorem and make the assumption that the features are conditionally independent given the class. There are several types of Naive Bayes classifiers based on the nature of the data:

- **Gaussian Naive Bayes:** For continuous data; assumes that each feature follows a normal distribution within each class.
- **Multinomial Naive Bayes:** Suitable for discrete data, often used for text classification where feature vectors represent term frequencies.
- **Bernoulli Naive Bayes:** For binary/boolean data where features are either present or absent.
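
As a compact reminder of what the independence assumption buys, the decision rule shared by all three variants can be written as below. This is standard textbook notation; the symbols $x_i$ (the $i$-th feature), $y$ (a class), and $n$ (the number of features) are introduced here for illustration and are not used elsewhere in this notebook.

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The variants differ only in how $P(x_i \mid y)$ is modeled: a normal density for Gaussian, count-based probabilities for Multinomial, and per-feature Bernoulli probabilities for Bernoulli.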

# Implementation Steps:

For most machine learning workflows involving Naive Bayes, these steps are followed:

### 1. Data Preprocessing:

- Split data into features (X) and target labels (y).
- Handle missing data and convert categorical variables to numerical values if required.
- Split the data into training and test sets.

### 2. Model Training:

- Select the appropriate Naive Bayes algorithm (Gaussian, Multinomial, or Bernoulli).
- Train the model on the training set using fit().

### 3. Prediction:

- Use the trained model to predict the target labels for the test set using predict().

### 4. Evaluation:

- Evaluate the model's performance using various metrics.

# Evaluation Techniques:
Here are some common evaluation metrics used for Naive Bayes:

- **Confusion Matrix:** A table that shows the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). It is used to calculate other metrics.

- **Accuracy:** Measures the overall correctness of the model.
- **Precision:** Measures how many of the predicted positive samples are actually positive.
- **Recall (Sensitivity):** Measures how many of the actual positive samples are correctly predicted.
- **F1-Score:** The harmonic mean of precision and recall, used when there is an uneven class distribution.
- **ROC Curve and AUC:** Receiver Operating Characteristic curve plots the true positive rate against the false positive rate. The area under the curve (AUC) is a metric that summarizes the performance of the model.
- **Cross-validation:** Perform k-fold cross-validation to ensure that the model generalizes well across different subsets of the data.
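
ROC-AUC and k-fold cross-validation are listed above but not run later in this notebook, so here is a minimal sketch of both. It assumes synthetic data from `make_classification` as a stand-in (the names `X_demo`, `y_demo`, and `proba_pos` are illustrative, not part of the notebook), and it is meant as a starting point rather than a definitive evaluation recipe.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption for illustration, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)

# 5-fold cross-validation: one accuracy value per held-out fold
cv_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", cv_scores.round(3))
print("Mean accuracy:", round(cv_scores.mean(), 3))

# ROC-AUC is computed from predicted probabilities, not from hard labels
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
model = GaussianNB().fit(X_tr, y_tr)
proba_pos = model.predict_proba(X_te)[:, 1]  # estimated probability of class 1
auc = roc_auc_score(y_te, proba_pos)
print("ROC-AUC:", round(auc, 3))
```

A large spread between fold accuracies would suggest that a single train/test split is not a stable estimate of performance.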



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Importing the dataset

In [None]:
genclass = pd.read_csv("/kaggle/input/gender-classification-dataset/gender_classification_v7.csv")
genclass

# Info

In [None]:
genclass.info()

# Finding and Clearing Missing Values

In [None]:
genclass.isnull().sum()

# Describe

In [None]:
genclass.describe()

# Label Encoding 

In [None]:
# Convert categorical data to numerical
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to the 'gender' column
genclass['gender'] = label_encoder.fit_transform(genclass['gender'])

# Show the encoded data
genclass.head()
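
To see which integer stands for which label after encoding, the fitted encoder can be inspected. The sketch below uses a small hypothetical list of labels (`sample_labels`) instead of the real column, and the names `encoder` and `mapping` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values standing in for the 'gender' column
sample_labels = ["Male", "Female", "Female", "Male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sample_labels)

# classes_ lists the original labels in sorted order; position i is encoded as i
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts integer codes back to the original labels
decoded = encoder.inverse_transform(codes).tolist()
print(decoded)
```

In this notebook the same check would be `label_encoder.classes_`, run after the encoding cell.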

In [None]:
genclass.info()

# Creating Target variable and features
- X = Features 
- y = Target

In [None]:
# Splitting the data into features (X) and target (y)
X = genclass.drop(['gender'], axis=1)
y = genclass['gender']

# Train test Split

In [None]:
from sklearn.model_selection import train_test_split

# Splitting the dataset into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Gaussian Naive Bayes
- Gaussian Naive Bayes works directly on continuous features, so no extra transformation is needed.
- We therefore fit it on the training data as-is.

In [None]:
from sklearn.naive_bayes import GaussianNB

# 1. Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions with GaussianNB
y_pred_gnb = gnb.predict(X_test)

# Feature Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Scaling features for MultinomialNB (converts data to a range between 0 and 1)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# y_train_scaled = scaler.fit_transform(y_train)
# y_test_scaled = scaler.transform(y_test)
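
MultinomialNB expects non-negative feature values, which is why the features are scaled first. One caveat: a scaler fitted on the training set can still produce values outside the 0 to 1 range for test rows that fall outside the training minimum or maximum. The sketch below illustrates this on toy numbers (hypothetical values, not from the dataset) and shows the `clip=True` option of `MinMaxScaler`, which assumes scikit-learn 0.24 or later.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy one-feature data (hypothetical values, not from the notebook's dataset)
train_col = np.array([[10.0], [12.0], [14.0]])
test_col = np.array([[9.0], [13.0]])  # 9.0 lies below the training minimum

# Default behaviour: a test value below the training minimum maps below 0
plain = MinMaxScaler().fit(train_col)
scaled_plain = plain.transform(test_col)
print(scaled_plain.ravel())

# clip=True keeps transformed values inside the [0, 1] feature range
clipped = MinMaxScaler(clip=True).fit(train_col)
scaled_clipped = clipped.transform(test_col)
print(scaled_clipped.ravel())
```

Whether clipping is appropriate depends on the data; it hides how far a test value lies outside the training range.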

# Multinomial Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

# 2. Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X_train_scaled, y_train)

# Make predictions with MultinomialNB
y_pred_mnb = mnb.predict(X_test_scaled)


# Binarize Features

In [None]:
from sklearn.preprocessing import Binarizer

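# 3. Bernoulli Naive Bayes (binarize features)
# Note: Binarizer is stateless and its default threshold is 0.0, so every value > 0 becomes 1
binarizer = Binarizer()
X_train_bin = binarizer.fit_transform(X_train)  # fit() learns nothing here; kept for a consistent API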
X_test_bin = binarizer.transform(X_test)        # Transform the test data
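
Because `Binarizer()` uses a threshold of 0.0 by default, any strictly positive column turns into a constant column of ones and carries no information for BernoulliNB. The sketch below shows the effect on toy rows and one possible alternative, per-column median cut-offs. The values and the names `X_toy_train`, `cutoffs`, and `median_bin` are hypothetical; whether this applies to the notebook's data depends on its columns, which is worth checking with `genclass.describe()`.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy rows with strictly positive measurements plus one 0/1 flag
# (hypothetical values, for illustration only)
X_toy_train = np.array([[11.8, 6.1, 1.0],
                        [14.0, 5.4, 0.0],
                        [13.2, 6.8, 1.0],
                        [12.1, 5.9, 0.0]])

# Default threshold=0.0: every value above 0 becomes 1
default_bin = Binarizer().fit_transform(X_toy_train)
print(default_bin)

# Alternative: per-column cut-offs computed from the training data only
cutoffs = np.median(X_toy_train, axis=0)
median_bin = (X_toy_train > cutoffs).astype(int)
print(median_bin)
```

With the default threshold the first two columns are all ones; with median cut-offs each column splits the rows into two groups. The same `cutoffs` would then be applied to the test set.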

# Bernoulli Naive Bayes

In [None]:
from sklearn.naive_bayes import BernoulliNB

# 3. Bernoulli Naive Bayes
bnb = BernoulliNB()
bnb.fit(X_train_bin, y_train)

# Make predictions with BernoulliNB
y_pred_bnb = bnb.predict(X_test_bin)

# Evaluation

In [None]:
# Print accuracies
print("GaussianNB Test Accuracy:", gnb.score(X_test, y_test))
print("MultinomialNB Test Accuracy:", mnb.score(X_test_scaled, y_test))
print("BernoulliNB Test Accuracy:", bnb.score(X_test_bin, y_test))

### What is a Confusion Matrix?

A **confusion matrix** is a table that helps us understand how well a machine learning model is doing when it comes to making predictions. It tells us how many predictions were correct and where the model made mistakes. It’s called "confusion matrix" because it shows where the model is confused in making predictions!

#### Imagine this:
Let’s say you are the teacher of a class, and you give a test to your students. After grading, you want to see how many students passed and how many failed. But you also want to check if your predictions about who would pass or fail were correct.

So you create a table to compare:
- **Predicted** results (what you guessed).
- **Actual** results (the real outcome).

The confusion matrix helps in comparing these two things: your **predictions** vs. the **truth**.

### The Confusion Matrix Table

Here’s what a confusion matrix looks like:

|                    | **Predicted Positive** | **Predicted Negative** |
|--------------------|------------------------|------------------------|
| **Actual Positive** | True Positive (TP)     | False Negative (FN)    |
| **Actual Negative** | False Positive (FP)    | True Negative (TN)     |

Now, let’s break this down:

1. **True Positive (TP):**
   - These are the cases where the model correctly predicted **positive**, and the actual result is also **positive**.
   - **Example:** If the model predicts that a student passed, and they actually passed, this is a true positive.

2. **True Negative (TN):**
   - These are the cases where the model correctly predicted **negative**, and the actual result is also **negative**.
   - **Example:** If the model predicts that a student failed, and they actually failed, this is a true negative.

3. **False Positive (FP):** (Also called a "Type I Error")
   - These are the cases where the model predicted **positive**, but the actual result is **negative**.
   - **Example:** If the model predicts that a student passed, but they actually failed, this is a false positive. This is a mistake!

4. **False Negative (FN):** (Also called a "Type II Error")
   - These are the cases where the model predicted **negative**, but the actual result is **positive**.
   - **Example:** If the model predicts that a student failed, but they actually passed, this is a false negative. Another mistake!

### Example in Real Life:

Let’s say you’re a doctor, and you want to predict whether a patient has a disease. You perform a test and use a machine learning model to predict the result.

- **Positive** = Patient has the disease.
- **Negative** = Patient does not have the disease.

Now the confusion matrix helps you evaluate how your model performed:

|                    | **Predicted Disease**   | **Predicted No Disease** |
|--------------------|-------------------------|--------------------------|
| **Actual Disease**  | True Positive (TP)      | False Negative (FN)       |
| **Actual No Disease**| False Positive (FP)     | True Negative (TN)        |

- **True Positive (TP):** The model predicted that the patient has the disease, and the patient really has it.
- **True Negative (TN):** The model predicted that the patient doesn’t have the disease, and the patient really doesn’t have it.
- **False Positive (FP):** The model predicted that the patient has the disease, but the patient actually doesn’t (a false alarm!).
- **False Negative (FN):** The model predicted that the patient doesn’t have the disease, but the patient actually does (this could be dangerous!).

### What Does the Confusion Matrix Tell Us?

Once we fill the confusion matrix with numbers, we can use it to calculate important metrics to evaluate how good the model is:

1. **Accuracy:**
   - How often the model made the correct prediction.
   $$
   Accuracy = \frac{TP + TN}{TP + TN + FP + FN} 
   $$

   - Accuracy tells us the proportion of correct predictions, but it can be misleading if there’s an imbalance between the classes.

2. **Precision:**
   - Out of all the times the model predicted **positive**, how many were actually positive?
   $$
   Precision = \frac{TP}{TP + FP}
   $$
   
   - Precision helps us know how much we can trust the positive predictions.

3. **Recall (Sensitivity):**
   - Out of all the actual **positive** cases, how many did the model correctly identify?
   $$
   Recall = \frac{TP}{TP + FN}
   $$
   - Recall tells us how good the model is at identifying positive cases.

4. **F1-Score:**
   - The F1-Score combines precision and recall into one number, especially useful when we want a balance between them.
   $$
   F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}
   $$
   
   - It gives a more balanced view when both precision and recall matter.


### Example for a Clear Understanding:

Let’s imagine you’re testing a model that predicts whether students pass an exam based on their practice scores. You predicted the results for 10 students.

- **Actual Results:** 6 students passed (Positive), 4 students failed (Negative).
- **Predictions:**
   - 5 students were predicted to pass (3 of them really passed, 2 didn’t).
   - 5 students were predicted to fail (2 of them really failed, 3 actually passed).

Now the confusion matrix would look like this:

|                    | **Predicted Pass**      | **Predicted Fail**      |
|--------------------|-------------------------|-------------------------|
| **Actual Pass**     | 3 (TP)                  | 3 (FN)                  |
| **Actual Fail**     | 2 (FP)                  | 2 (TN)                  |

- **True Positive (TP)** = 3 (The model correctly predicted 3 students passed).
- **True Negative (TN)** = 2 (The model correctly predicted 2 students failed).
- **False Positive (FP)** = 2 (The model incorrectly predicted 2 students would pass, but they failed).
- **False Negative (FN)** = 3 (The model incorrectly predicted 3 students would fail, but they passed).

Now you can calculate accuracy, precision, recall, etc., based on this matrix!
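
Carrying the student example through the formulas gives the numbers below. This is a minimal sketch; the helper `metrics_from_counts` is a hypothetical name introduced here, and it does not guard against division by zero (for example when nothing is predicted positive).

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts from the student example: TP=3, TN=2, FP=2, FN=3
accuracy, precision, recall, f1 = metrics_from_counts(tp=3, tn=2, fp=2, fn=3)

print(f"Accuracy:  {accuracy:.2f}")   # 5 correct out of 10
print(f"Precision: {precision:.2f}")  # 3 of the 5 predicted passes were right
print(f"Recall:    {recall:.2f}")     # 3 of the 6 actual passes were found
print(f"F1-Score:  {f1:.2f}")
```

Here accuracy is 0.50, precision 0.60, recall 0.50, and F1 about 0.55, so this toy model is only slightly better than guessing.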

### In Summary:
- The confusion matrix is a simple table showing **correct** and **incorrect** predictions.
- It helps us calculate **accuracy, precision, recall, and F1-score**, which tell us how well our model is performing.
- **True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN)** help us understand where the model gets it right or wrong.

Understanding the confusion matrix helps you evaluate the model’s strengths and weaknesses clearly!



In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print("=== === === === === === === === === === Gaussian Naive Bayes === === === === === === === === === ===")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_gnb))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_gnb))
# -------------------------------------------------------
print("\n=== === === === === === === === === === Multinomial Naive Bayes === === === === === === === === === ===")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_mnb))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_mnb))
# -------------------------------------------------------
print("\n=== === === === === === === === === === Bernoulli Naive Bayes === === === === === === === === === ===")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_bnb))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_bnb))


# Insights
In the classification report, the values **0** and **1** are the encoded classes of the target variable. `LabelEncoder` assigns integer codes in sorted order of the original labels, so if the `gender` column contains the strings `Female` and `Male`, the mapping is as follows (check `label_encoder.classes_` to confirm):

- **0**: Female
- **1**: Male

### Breakdown of the Classification Report:
- **Precision**: The proportion of instances predicted as a given class that actually belong to that class.
  - For **class 0** (precision = 0.96): Out of all instances predicted as class 0, 96% were correctly identified.
  - For **class 1** (precision = 0.97): Out of all instances predicted as class 1, 97% were correctly identified.

- **Recall**: This metric indicates the proportion of actual positives that were correctly identified.
  - For **class 0** (recall = 0.97): Out of all actual instances of class 0, 97% were correctly predicted.
  - For **class 1** (recall = 0.96): Out of all actual instances of class 1, 96% were correctly predicted.

- **F1-Score**: This is the harmonic mean of precision and recall. It provides a balance between the two metrics, especially useful when you have an uneven class distribution.
  - Both classes have an F1-score of around 0.96, indicating good performance.

- **Support**: This indicates the number of actual occurrences of the class in the specified dataset.
  - For class 0: 502 instances
  - For class 1: 499 instances

- **Overall Accuracy**: The accuracy of the model across all instances is **96%**.

### Interpretation
In summary, the model performs very well on both classes, with high precision, recall, and F1-scores close to 0.96 for both genders. This suggests that the classifier is effective in distinguishing between the two genders based on the features provided. 


# Confusion Matrix & Classification Report

1. **Confusion Matrix:**
   - The `confusion_matrix` function compares the true values (`y_test`) with the predicted values (`y_pred_*` for each model).
   - It returns a matrix that helps you understand how many predictions were correct and where the model made errors (True Positives, True Negatives, False Positives, and False Negatives).

2. **Classification Report:**
   - The `classification_report` gives you detailed metrics:
     - **Precision:** How many selected items are relevant.
     - **Recall:** How many relevant items are selected.
     - **F1-Score:** The balance between precision and recall.
     - **Support:** The number of actual occurrences of each class in the dataset.

### Example Output for Confusion Matrix:

For a binary classification problem, the confusion matrix may look like this:

```
[[TN  FP]
 [FN  TP]]
```

- **TN (True Negative):** Correctly predicted negative cases.
- **FP (False Positive):** Incorrectly predicted positive cases (Type I error).
- **FN (False Negative):** Incorrectly predicted negative cases (Type II error).
- **TP (True Positive):** Correctly predicted positive cases.

By examining this matrix and the classification report, you'll have a clear picture of how well each Naive Bayes model is performing.
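
The layout above can be checked on a small example. The sketch below uses hypothetical labels (`y_true`, `y_pred`) that mirror the student example, with 1 as the positive class. Note that scikit-learn orders rows and columns by label value, so the negative class (0) comes first, unlike the tables earlier in this notebook that list the positive class first.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# For binary labels, ravel() flattens row by row: TN, FP, FN, TP
tn, fp, fn, tp = (int(v) for v in cm.ravel())
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Unpacking with `ravel()` only works for a 2x2 matrix, that is, for binary classification.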

# Confusion Matrix Heatmap
A heatmap of the **confusion matrix** shows how the true labels and the predicted labels line up. Despite the similar look, it is not a correlation heatmap: each cell holds a count of predictions. Visualizing it makes the model's misclassifications easier to spot.

### Steps:
1. **Calculate the confusion matrix.**
2. **Convert the confusion matrix into a DataFrame** for easier manipulation.
3. **Create a heatmap** using `seaborn` or `matplotlib` to visualize the confusion matrix.
4. **Interpret insights** from the heatmap: Focus on where the model is performing well (high counts on the diagonal) and where it's struggling (off-diagonal elements).

Here’s how you can do it for one model, e.g., Gaussian Naive Bayes:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report


# Step 1: Calculate the confusion matrix for Gaussian Naive Bayes
cm_gnb = confusion_matrix(y_test, y_pred_gnb)

# Step 2: Convert confusion matrix into a DataFrame for better visualization
cm_df_gnb = pd.DataFrame(cm_gnb, index=["Actual Negative", "Actual Positive"], 
                     columns=["Predicted Negative", "Predicted Positive"])


# Step 3: Plot a heatmap of the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm_df_gnb, annot=True, cmap="Blues", fmt="g", cbar=False)
plt.title("Gaussian Naive Bayes Confusion Matrix Heatmap")
plt.show()

In [None]:
# Calculate and display metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

print("Accuracy:", accuracy_score(y_test, y_pred_gnb))
print("Precision:", precision_score(y_test, y_pred_gnb))
print("Recall:", recall_score(y_test, y_pred_gnb))
print("F1 Score:", f1_score(y_test, y_pred_gnb))

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Step 1: Calculate the confusion matrix for Multinomial Naive Bayes
cm_mnb = confusion_matrix(y_test, y_pred_mnb)

# Step 2: Convert confusion matrix into a DataFrame for better visualization
cm_df_mnb = pd.DataFrame(cm_mnb, index=["Actual Negative", "Actual Positive"], 
                     columns=["Predicted Negative", "Predicted Positive"])


# Step 3: Plot a heatmap of the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm_df_mnb, annot=True, cmap="Blues", fmt="g", cbar=False)
plt.title("Multinomial Naive Bayes Confusion Matrix Heatmap")
plt.show()

In [None]:
# Calculate and display metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

print("Accuracy:", accuracy_score(y_test, y_pred_mnb))
print("Precision:", precision_score(y_test, y_pred_mnb))
print("Recall:", recall_score(y_test, y_pred_mnb))
print("F1 Score:", f1_score(y_test, y_pred_mnb))

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Step 1: Calculate the confusion matrix for Bernoulli Naive Bayes
cm_bnb = confusion_matrix(y_test, y_pred_bnb)

# Step 2: Convert confusion matrix into a DataFrame for better visualization
cm_df_bnb = pd.DataFrame(cm_bnb, index=["Actual Negative", "Actual Positive"], 
                     columns=["Predicted Negative", "Predicted Positive"])


# Step 3: Plot a heatmap of the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm_df_bnb, annot=True, cmap="Blues", fmt="g", cbar=False)
plt.title("Bernoulli Naive Bayes Confusion Matrix Heatmap")
plt.show()

In [None]:
# Calculate and display metrics
print("Accuracy:", accuracy_score(y_test, y_pred_bnb))
print("Precision:", precision_score(y_test, y_pred_bnb))
print("Recall:", recall_score(y_test, y_pred_bnb))
print("F1 Score:", f1_score(y_test, y_pred_bnb))

### Insights from the Confusion Matrix Heatmap:

- **Diagonal Elements (True Positives and True Negatives):**
  - High values on the diagonal (top left and bottom right) represent correct predictions:
    - **Top Left (True Negatives - TN):** These are the cases where the model correctly predicted "negative."
    - **Bottom Right (True Positives - TP):** These are the cases where the model correctly predicted "positive."
  - A high number on these diagonal elements indicates that the model is making many correct predictions.

- **Off-Diagonal Elements (False Positives and False Negatives):**
  - Non-diagonal values (top right and bottom left) represent errors:
    - **Top Right (False Positives - FP):** The model incorrectly predicted positive when it was actually negative. A high value here might indicate overprediction of the positive class.
    - **Bottom Left (False Negatives - FN):** The model incorrectly predicted negative when it was actually positive. A high value here means the model is missing positive cases, which could be concerning in critical applications (like detecting diseases).
  - A high value on the off-diagonal elements suggests that the model is making many errors.

### How to Interpret:
- **Balanced Performance:** If the diagonal values are significantly higher than the off-diagonal values, your model is performing well.
- **Misclassifications:** If the off-diagonal values are high, focus on where the model is making mistakes (False Positives or False Negatives). This may help you adjust thresholds, collect more data, or choose a better-suited model.
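
Adjusting the decision threshold, as mentioned above, can be sketched as follows. The example assumes synthetic data and a Gaussian Naive Bayes model; `X_demo`, `y_demo`, `proba_pos`, and `recalls` are illustrative names. For a binary problem, `predict()` roughly corresponds to a threshold of 0.5 on the positive-class probability.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

model = GaussianNB().fit(X_tr, y_tr)
proba_pos = model.predict_proba(X_te)[:, 1]  # estimated P(class 1)

recalls = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict positive only when the probability reaches the threshold
    y_hat = (proba_pos >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    prec = precision_score(y_te, y_hat, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={prec:.3f}  recall={recalls[threshold]:.3f}")
```

Lowering the threshold can only keep or increase recall (fewer false negatives), usually at the cost of more false positives; which trade-off is right depends on the application.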

---
# Accuracy
Simply looking at **accuracy** alone is not sufficient to determine whether a model is suitable or reliable for your problem. Accuracy can be misleading in several cases, especially when dealing with **imbalanced datasets** or when your model might be overfitting or underfitting. Here's why:

### 1. **Imbalanced Datasets:**
   - If your dataset has a significant class imbalance (e.g., 95% of one class and 5% of another), accuracy might give a false sense of high performance. For example, if your model predicts the majority class all the time, it will have high accuracy, but it will fail to identify the minority class, which could be more important.
   - **Example:** In a dataset where 95% of the data belongs to Class A and only 5% to Class B, if the model predicts Class A for all instances, the accuracy will be 95%, but the model will be useless in detecting Class B.

   **What to use instead:** Metrics like **Precision, Recall, F1-Score**, or **ROC-AUC** (for binary classification) are better indicators for imbalanced data.
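
The 95/5 example can be reproduced in a few lines. This sketch builds the labels by hand (label 0 for Class A, label 1 for Class B, an encoding chosen here for illustration) and scores a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 95 samples of Class A (label 0) and 5 of Class B (label 1)
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec_b = recall_score(y_true, y_pred, pos_label=1)
# zero_division=0 avoids a warning when no sample is predicted positive
f1_b = f1_score(y_true, y_pred, pos_label=1, zero_division=0)

print("Accuracy:", acc)
print("Recall for Class B:", rec_b)
print("F1 for Class B:", f1_b)
```

Accuracy comes out at 0.95 while recall and F1 for Class B are both 0, which is exactly the failure that accuracy alone hides.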

### 2. **Overfitting:**
   - A model might show very high accuracy on the training data but perform poorly on unseen data (test data). This is known as **overfitting**, where the model has learned the noise in the training set rather than the actual patterns.
   - Overfitting can be identified when the model has significantly higher accuracy on the training set than on the test set.

   **What to use instead:** Look at the difference in accuracy (or other metrics) between training and test sets. Using **cross-validation** can also help to assess model stability across different subsets of data.
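
A minimal sketch of the train-versus-test comparison, assuming synthetic data and Gaussian Naive Bayes (`X_demo`, `y_demo`, and `gap` are illustrative names). There is no universal cut-off for what counts as a worrying gap; the number is only a prompt for closer inspection.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

model = GaussianNB().fit(X_tr, y_tr)

# score() returns mean accuracy on the data it is given
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
gap = train_acc - test_acc

print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
print(f"Gap (train - test): {gap:.3f}")
```

In this notebook the same comparison would use `gnb.score(X_train, y_train)` against `gnb.score(X_test, y_test)`, and likewise for the other two models with their transformed features.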

### 3. **Underfitting:**
   - If the accuracy is low on both the training and test sets, it could indicate **underfitting**, where the model is too simple to capture the underlying patterns in the data.
   
   **What to use instead:** Consider if your model is too basic and might need more features, a more complex algorithm, or better tuning of hyperparameters.

### 4. **Accuracy Doesn't Distinguish Between Types of Errors:**
   - Accuracy only tells you how many instances were correctly classified out of the total. It doesn't provide details on **what types of errors** the model is making. For example, predicting a disease in medical diagnostics might require minimizing **false negatives** (cases where the disease is predicted to be absent when it is actually present), and accuracy might not be the best metric to capture this.

   **What to use instead:** Evaluate performance using a **Confusion Matrix**, where you can examine false positives and false negatives. From there, metrics like **Precision**, **Recall**, and **F1-Score** can help you decide whether the model is suitable for your use case.

---

### Key Points:
- **Accuracy is just one metric** and might be misleading in certain scenarios, especially with imbalanced datasets.
- **Additional metrics** like **Precision, Recall, F1-Score, and AUC-ROC** provide more insight into the model's actual performance.
- Use **cross-validation** and compare performance across the training and test sets to check for **overfitting or underfitting**.

By evaluating these aspects, you can better understand whether a model is suitable for your problem beyond just its accuracy score.

---

In [None]:
# next knn