# Naive Bayes

In this notebook, we explore various Naive Bayes classifiers to determine the most suitable one for our dataset. We begin by evaluating the performance of different Naive Bayes classifiers, including GaussianNB, CategoricalNB, and BernoulliNB, on the given training data. After identifying the best-performing classifier, we attempt to further improve the classification accuracy by using altered or partial training data.

The steps involved in this notebook are as follows:
1. **Initial Exploration**: Evaluate the performance of GaussianNB, CategoricalNB, and BernoulliNB classifiers on the training data.
2. **Binary Feature Selection**: Focus on binary features to enhance the performance of the BernoulliNB classifier.
3. **Hyperparameter Tuning**: Use GridSearchCV to find the optimal hyperparameters for the BernoulliNB classifier.
4. **Dimensionality Reduction**: Apply Principal Component Analysis (PCA) to reduce the dimensionality of the dataset and evaluate the classifier's performance on the reduced data.
5. **ROC Curve Analysis**: Plot ROC curves for different numbers of PCA components to visualize the classifier's performance.

By following these steps, we aim to identify the most effective Naive Bayes classifier and optimize its performance for our specific dataset.

## Initial Exploration

Loading the data

In [None]:
import os
import sys

sys.path.append(os.path.abspath("../scripts"))
from data_loader import DataLoader

data_loader = DataLoader()
X_train, y_train = data_loader.training_data
X_val, y_val = data_loader.validation_data
X_test, y_test = data_loader.test_data

In [None]:
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import GaussianNB

# Initialize the classifier
nb_classifier = GaussianNB()

# Train the classifier
nb_classifier.fit(X_train, y_train)

# Make predictions on the validation set
y_val_pred = nb_classifier.predict(X_val)

# Calculate the accuracy
accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {accuracy}")
print(classification_report(y_val, y_val_pred))

In [None]:
# Make predictions on the test set
y_test_pred = nb_classifier.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_test_pred)
print(f"Test Accuracy: {accuracy}")
print(classification_report(y_test, y_test_pred))

In [None]:
from sklearn.naive_bayes import CategoricalNB

# Initialize the classifier
nb_classifier = CategoricalNB(alpha=1)

# Train the classifier
nb_classifier.fit(X_train, y_train)

# Make predictions on the validation set
y_val_pred = nb_classifier.predict(X_val)

# Calculate the accuracy
accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {accuracy}")
print(classification_report(y_val, y_val_pred))

In [None]:
from sklearn.naive_bayes import BernoulliNB

# Initialize the classifier
nb_classifier = BernoulliNB(alpha=1)

# Train the classifier
nb_classifier.fit(X_train, y_train)

# Make predictions on the validation set
y_val_pred = nb_classifier.predict(X_val)

# Calculate the accuracy
accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {accuracy}")
print(classification_report(y_val, y_val_pred))

The Bernoulli Naive Bayes classifier achieves the highest accuracy (82.3%), followed by the Categorical Naive Bayes classifier (80.6%) and the Gaussian Naive Bayes classifier (76.8%). This outcome aligns with the nature of the training data: most features are either binary (14) or categorical (5), while only 2 features follow a normal distribution. The Bernoulli classifier performs best because it is designed for binary data. The Categorical Naive Bayes classifier also performs relatively well, as it is suited to categorical data, which is prevalent in this dataset. In contrast, the Gaussian Naive Bayes classifier assumes that every feature is normally distributed within each class, an assumption that holds for only a few features here, so it achieves the lowest accuracy.

However, these results are not very promising: every classifier except the Bernoulli classifier performs worse than the trivial baseline of always predicting "no diabetes".
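The comparison against always predicting the majority class can be made concrete with scikit-learn's `DummyClassifier`. The following is a minimal sketch on synthetic labels: the 85/15 class split and the `*_demo` names are illustrative assumptions, not values from this dataset. In the notebook, `y_train` and `y_val` would take their place.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic, illustrative labels (assumption): 85% class 0, 15% class 1
y_train_demo = np.array([0] * 850 + [1] * 150)
y_val_demo = np.array([0] * 170 + [1] * 30)

# The dummy classifier ignores the features, so placeholder zeros suffice
X_train_demo = np.zeros((len(y_train_demo), 1))
X_val_demo = np.zeros((len(y_val_demo), 1))

# Always predict the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_accuracy = accuracy_score(y_val_demo, baseline.predict(X_val_demo))
print(f"Baseline Accuracy: {baseline_accuracy}")
```

On imbalanced data, this baseline accuracy equals the share of the majority class in the evaluation set, which is why plain accuracy can look high even for a classifier that never predicts the minority class.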

Let's see if we can improve this.

## Binary Feature Selection

In [None]:
binary_features = [
    "HighBP",
    "HighChol",
    "CholCheck",
    "Smoker",
    "Stroke",
    "HeartDiseaseorAttack",
    "PhysActivity",
    "Fruits",
    "Veggies",
    "HvyAlcoholConsump",
    "AnyHealthcare",
    "NoDocbcCost",
    "DiffWalk",
    "Sex",
]

In [None]:
# Select the binary features
X_train_binary = X_train[binary_features]

# Initialize the classifier
nb_classifier = BernoulliNB(alpha=1)

# Train the classifier
nb_classifier.fit(X_train_binary, y_train)

# Make predictions on the validation set
X_val_binary = X_val[binary_features]
y_val_pred = nb_classifier.predict(X_val_binary)

# Calculate the accuracy
accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {accuracy}")
print(classification_report(y_val, y_val_pred))

Using only binary features for the Bernoulli Naive Bayes classifier **increases** the performance slightly.

## Hyperparameter Tuning

Hyperparameter tuning with all features

In [None]:
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Define the parameter grid
param_grid = {
    "alpha": [0.1, 0.2, 0.3, 0.5, 1.0, 3.0, 4.0, 5.0],
    "fit_prior": [True, False],
}

# Initialize the classifier
nb_classifier = BernoulliNB()

# Specify the cross-validation strategy
stratified_10_fold_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=nb_classifier,
    param_grid=param_grid,
    cv=stratified_10_fold_cv,
    scoring="accuracy",
)

# Perform the grid search on all features
grid_search.fit(X_train, y_train)

# Print the best parameters and the best score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_}")

# Use the best estimator to make predictions on the validation set
best_nb_classifier = grid_search.best_estimator_
y_val_pred = best_nb_classifier.predict(X_val)

# Calculate the accuracy
accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy with Best Parameters: {accuracy}")
print(classification_report(y_val, y_val_pred))

Hyperparameter tuning with only binary features

In [None]:
# Initialize the classifier
nb_classifier = BernoulliNB()

# Specify the cross-validation strategy
stratified_10_fold_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=nb_classifier,
    param_grid=param_grid,
    cv=stratified_10_fold_cv,
    scoring="accuracy",
)

# Perform the grid search on the binary features
grid_search.fit(X_train_binary, y_train)

# Print the best parameters and the best score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_}")

# Use the best estimator to make predictions on the validation set
best_nb_classifier = grid_search.best_estimator_
y_val_pred = best_nb_classifier.predict(X_val_binary)

# Calculate the accuracy
accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy with Best Parameters: {accuracy}")
print(classification_report(y_val, y_val_pred))

In [None]:
results = pd.DataFrame(grid_search.cv_results_)
results.T

### Interpretation of Hyperparameter Tuning

The hyperparameter tuning did not improve the performance of the Bernoulli Naive Bayes classifier. This outcome can be attributed to the nature of the Bernoulli Naive Bayes classifier, which is inherently simple and assumes that the features are binary and conditionally independent given the class label. As a result, the classifier's performance is primarily influenced by the quality and relevance of the features rather than the hyperparameters. Therefore, significant improvements through hyperparameter tuning are challenging to achieve for this type of classifier.
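A small worked example may help show why `alpha` matters so little once the training set is large. For a binary feature, the smoothed estimate of P(x = 1 | y) used by Bernoulli Naive Bayes is (N_yi + alpha) / (N_y + 2 * alpha), where N_yi is the number of class-y samples with the feature set and N_y is the number of class-y samples. The counts below are hypothetical and chosen only for illustration; `smoothed_probability` is a helper introduced for this sketch.

```python
def smoothed_probability(feature_count, class_count, alpha):
    """Smoothed estimate of P(feature = 1 | class) for a binary feature."""
    return (feature_count + alpha) / (class_count + 2 * alpha)


# Hypothetical counts (assumption): 12,000 of 40,000 samples in a class
# have the feature set to 1
feature_count, class_count = 12_000, 40_000

# Same alpha values as in the parameter grid above
alphas = [0.1, 0.2, 0.3, 0.5, 1.0, 3.0, 4.0, 5.0]
estimates = {a: smoothed_probability(feature_count, class_count, a) for a in alphas}

for alpha, estimate in estimates.items():
    print(f"alpha={alpha}: P(x=1|y) = {estimate:.6f}")

# Difference between the largest and smallest estimate over the grid
spread = max(estimates.values()) - min(estimates.values())
print(f"Spread across the grid: {spread:.2e}")
```

With counts in the thousands, the estimates differ only in the fifth decimal place across the whole grid, so the resulting predictions rarely change. Smoothing has a visible effect mainly when some feature/class combinations are rare.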

## Dimensionality Reduction - PCA

As we have seen, Naive Bayes classifiers rely heavily on the quality and relevance of the input data. Now, we want to explore whether applying dimensionality reduction through Principal Component Analysis (PCA) can enhance their performance. By reducing the dataset to its most significant components, PCA may improve computational efficiency and potentially boost classifier accuracy. In this analysis, we’ll test if PCA genuinely contributes to better performance.
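The PCA datasets used in this section are loaded from precomputed files, and this notebook does not show how they were created. As a rough sketch of one common approach (an assumption, not necessarily the procedure used for these files), the features can be standardized and then projected with scikit-learn's `PCA`, fitting both steps on the training data only. The random data and the `*_demo` names are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data (assumption): 21 features, matching the feature count above
rng = np.random.default_rng(42)
X_train_demo = rng.normal(size=(200, 21))
X_val_demo = rng.normal(size=(50, 21))

# Fit the scaler and the PCA on the training data only to avoid leakage
scaler = StandardScaler().fit(X_train_demo)
pca = PCA(n_components=10).fit(scaler.transform(X_train_demo))

# Apply the fitted transformations to both splits
X_train_demo_pca = pca.transform(scaler.transform(X_train_demo))
X_val_demo_pca = pca.transform(scaler.transform(X_val_demo))

print(X_train_demo_pca.shape, X_val_demo_pca.shape)
print(pca.explained_variance_ratio_.round(3))
```

The components returned by `PCA` are ordered by decreasing explained variance, so selecting the first `n` columns keeps the `n` highest-variance directions.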

In [None]:
# Load the PCA datasets
train_pca = pd.read_csv("../data/pca/dataset_train_pca.csv")
val_pca = pd.read_csv("../data/pca/dataset_val_pca.csv")
test_pca = pd.read_csv("../data/pca/dataset_test_pca.csv")

In [None]:
# Split the PCA datasets into features and target
X_train_pca = train_pca.drop(columns=["Diabetes"])
y_train_pca = train_pca["Diabetes"]

X_val_pca = val_pca.drop(columns=["Diabetes"])
y_val_pca = val_pca["Diabetes"]

X_test_pca = test_pca.drop(columns=["Diabetes"])
y_test_pca = test_pca["Diabetes"]

In [None]:
# Initialize the classifier
nb_classifier = BernoulliNB()

# Train the classifier
nb_classifier.fit(X_train_pca, y_train_pca)

# Make predictions on the validation set
y_val_pred = nb_classifier.predict(X_val_pca)

# Calculate the accuracy
accuracy = accuracy_score(y_val_pca, y_val_pred)
print(f"Validation Accuracy: {accuracy}")
print(classification_report(y_val_pca, y_val_pred))

In [None]:
# Initialize the classifier
nb_classifier = GaussianNB()

# Train the classifier
nb_classifier.fit(X_train_pca, y_train_pca)

# Make predictions on the validation set
y_val_pred = nb_classifier.predict(X_val_pca)

# Calculate the accuracy
accuracy = accuracy_score(y_val_pca, y_val_pred)
print(f"Validation Accuracy: {accuracy}")
print(classification_report(y_val_pca, y_val_pred))

Using only the best n components for classification

In [None]:
num_components = 5
X_train_best_components = X_train_pca.iloc[:, :num_components]
X_val_best_components = X_val_pca.iloc[:, :num_components]


# Initialize the classifier
nb_classifier = BernoulliNB()

# Train the classifier
nb_classifier.fit(X_train_best_components, y_train_pca)

# Make predictions on the validation set
y_val_pred = nb_classifier.predict(X_val_best_components)

# Calculate the accuracy
accuracy = accuracy_score(y_val_pca, y_val_pred)
print(f"Validation Accuracy: {accuracy}")
print(classification_report(y_val_pca, y_val_pred))

The classifier stops predicting trivially (always "no diabetes") only when 5 or more components are used.
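One factor that may contribute to this behavior: `BernoulliNB` binarizes its inputs with a default threshold of `binarize=0.0`, so each continuous PCA component is reduced to whether it is positive or not. The sketch below uses synthetic data (an assumption, not this dataset) to check that fitting on continuous values gives the same model as fitting on the explicitly thresholded values.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Synthetic continuous features (assumption), similar in spirit to PCA scores
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Fit on the continuous values; BernoulliNB thresholds them at 0.0 internally
clf_continuous = BernoulliNB().fit(X_demo, y_demo)

# Fit on the explicitly thresholded values
X_demo_sign = (X_demo > 0).astype(int)
clf_binary = BernoulliNB().fit(X_demo_sign, y_demo)

same_predictions = np.array_equal(
    clf_continuous.predict(X_demo), clf_binary.predict(X_demo_sign)
)
print(f"Identical predictions: {same_predictions}")
```

Because only the sign of each component is used, a handful of components carries very few bits of information per sample. That is consistent with the classifier needing several components before it predicts anything other than the majority class.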

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve


# Function to plot ROC Curve
def plot_roc_curve(
    nb_classifier, X_train_pca, y_train_pca, X_val_pca, y_val_pca, num_components_list
):
    plt.figure(figsize=(10, 8))

    for num_components in num_components_list:
        # Select the best components
        X_train_best_components = X_train_pca.iloc[:, :num_components]
        X_val_best_components = X_val_pca.iloc[:, :num_components]

        # Train the classifier
        nb_classifier.fit(X_train_best_components, y_train_pca)

        # Predict probabilities
        y_val_prob = nb_classifier.predict_proba(X_val_best_components)[:, 1]

        # Compute ROC curve and ROC area
        fpr, tpr, _ = roc_curve(y_val_pca, y_val_prob)
        roc_auc = roc_auc_score(y_val_pca, y_val_prob)

        # Plot ROC curve
        plt.plot(fpr, tpr, label=f"Components: {num_components} (AUC = {roc_auc:.2f})")

    plt.plot([0, 1], [0, 1], "k--")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver Operating Characteristic (ROC) Curve")
    plt.legend(loc="lower right")
    plt.grid(True)
    plt.show()


# List of number of components to test
num_components_list = [5, 7, 10, 20]

# Plot ROC Curve
plot_roc_curve(
    BernoulliNB(alpha=1),
    X_train_pca,
    y_train_pca,
    X_val_pca,
    y_val_pca,
    num_components_list,
)