<a href="https://colab.research.google.com/github/bhatzahidmajeed/breast_cancer_ml/blob/zahid/Gaussian_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gaussian Naive Bayes (GNB):
Gaussian Naive Bayes (GNB) is a probabilistic classification technique in Machine Learning (ML) that combines Bayes' theorem with a Gaussian model of the features.
 - It applies Bayes' theorem under the "naive" assumption that the features are conditionally independent given the class, and it further assumes that each feature follows a Gaussian (normal) distribution within each class.
 - It is commonly used for classification tasks with continuous or numerical features.
 - The class-conditional probability of each feature is modeled as a Gaussian distribution whose mean and variance are estimated per class from the training data.
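
To make the points above concrete, here is a minimal from-scratch sketch of the idea in plain Python. It is not scikit-learn's implementation: the helper names (`fit_gnb`, `log_posterior`, `predict`), the constant `eps`, and the toy data are all assumptions made for illustration. Each class gets a prior plus a per-feature mean and variance, and a sample is assigned to the class with the highest log prior plus summed Gaussian log-likelihoods.

```python
import math
from collections import defaultdict

def fit_gnb(X, y, eps=1e-9):
    """Estimate a prior, per-feature means, and per-feature variances for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    params = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Population variance plus a small constant (hypothetical `eps`)
        # so that a constant feature never yields a zero variance.
        variances = [
            sum((v - m) ** 2 for v in col) / n + eps
            for col, m in zip(columns, means)
        ]
        params[label] = (n / len(X), means, variances)
    return params

def log_posterior(params, x):
    """Return the unnormalized log P(class | x) for every class."""
    scores = {}
    for label, (prior, means, variances) in params.items():
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (xi - m) ** 2 / (2 * var)
            for xi, m, var in zip(x, means, variances)
        )
        scores[label] = math.log(prior) + log_likelihood
    return scores

def predict(params, x):
    """Pick the class with the highest log posterior."""
    scores = log_posterior(params, x)
    return max(scores, key=scores.get)

# Toy data (hypothetical): class 0 clusters near (1, 1), class 1 near (5, 5).
X = [[1.0, 1.2], [0.8, 0.9], [1.3, 1.1], [5.0, 5.2], [4.8, 5.1], [5.3, 4.7]]
y = [0, 0, 0, 1, 1, 1]
params = fit_gnb(X, y)
print(predict(params, [1.1, 1.0]))  # a point near the class-0 cluster
print(predict(params, [5.1, 5.0]))  # a point near the class-1 cluster
```

Working in log space avoids numerical underflow when many small per-feature probabilities are multiplied together.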

Gaussian Naive Bayes is a natural fit for breast cancer detection: the dataset's features are continuous measurements, and the task is a binary classification of each tumor as either malignant (cancerous) or benign (non-cancerous).

In [1]:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import f1_score, precision_score, accuracy_score, recall_score, balanced_accuracy_score
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('breast-cancer.csv')
df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [3]:
df.drop('id', axis=1, inplace=True)
# Encode M (malignant) as 1 and B (benign) as 0.
df['diagnosis'] = (df['diagnosis'] == 'M').astype(int)
df.head()

| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 1 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 1 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 1 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


## Training the Model:
First, we define a helper function that fits a model and reports its evaluation metrics, so that we can iterate on different feature sets faster.

In [4]:
def train_evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit `model` on the training split and return its test metrics as a one-row DataFrame."""
    model.fit(X_train, y_train)  # Fit the model instance
    predictions = model.predict(X_test)  # Predict on the test split
    # Compute metrics for evaluation
    accuracy = accuracy_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    # zero_division=1 reports a precision of 1 when no positive predictions are made
    precision = precision_score(y_test, predictions, zero_division=1)
    recall = recall_score(y_test, predictions)
    balanced_accuracy = balanced_accuracy_score(y_test, predictions)
    # Create a dataframe to visualize the results
    eval_df = pd.DataFrame([[accuracy, f1, precision, recall, balanced_accuracy]], columns=['Accuracy', 'F1-Score', 'Precision', 'Recall', 'Balanced Accuracy'])
    return eval_df

Then, we evaluate the model, first with all features and later with several feature selection methods.

### Case I: With All Features:

In [5]:
X = df.drop('diagnosis', axis = 1)
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
scaler = StandardScaler() # Create an instance of standard scaler
scaler.fit(X_train) # Fit it to the training data
X_train = scaler.transform(X_train) # Transform training data
X_test = scaler.transform(X_test) # Transform test data with the training statistics
nb = GaussianNB()
results = train_evaluate_model(nb, X_train, y_train, X_test, y_test)

In [7]:
results.index = ['Naive Bayes Classifier (All Features)']
results.sort_values(by='F1-Score',ascending=False).style.background_gradient(cmap = sns.color_palette("ch:s=-.2,r=.6", as_cmap=True))

| | Accuracy | F1-Score | Precision | Recall | Balanced Accuracy |
|---|---|---|---|---|---|
| Naive Bayes Classifier (All Features) | 0.964912 | 0.952381 | 0.97561 | 0.930233 | 0.958074 |
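
As a sanity check on how these five metrics relate to each other, the sketch below recomputes them from confusion-matrix counts in plain Python. The counts (40 true positives, 1 false positive, 3 false negatives, 70 true negatives) are not printed anywhere in the notebook; they are an assumption, chosen because they reproduce the reported "All Features" row. The helper name `binary_metrics` is also hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute the five reported metrics from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that are found
    specificity = tn / (tn + fp)        # recall of the negative class
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    return {
        "Accuracy": accuracy,
        "F1-Score": f1,
        "Precision": precision,
        "Recall": recall,
        "Balanced Accuracy": balanced_accuracy,
    }

# Assumed counts, consistent with the reported metrics (not taken from the notebook).
metrics = binary_metrics(tp=40, fp=1, fn=3, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Under these assumed counts, the three false negatives are malignant tumors predicted as benign, which is why recall is the lowest of the five numbers.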


### Case II: Manual Feature Selection:

In [6]:
# Get the absolute value of the correlation:
corr = df.corr()
cor_target = abs(corr["diagnosis"])
cor_target

diagnosis                  1.000000
radius_mean                0.730029
texture_mean               0.415185
perimeter_mean             0.742636
area_mean                  0.708984
smoothness_mean            0.358560
compactness_mean           0.596534
concavity_mean             0.696360
concave points_mean        0.776614
symmetry_mean              0.330499
fractal_dimension_mean     0.012838
radius_se                  0.567134
texture_se                 0.008303
perimeter_se               0.556141
area_se                    0.548236
smoothness_se              0.067016
compactness_se             0.292999
concavity_se               0.253730
concave points_se          0.408042
symmetry_se                0.006522
fractal_dimension_se       0.077972
radius_worst               0.776454
texture_worst              0.456903
perimeter_worst            0.782914
area_worst                 0.733825
smoothness_worst           0.421465
compactness_worst          0.590998
concavity_worst            0

In [8]:
best_threshold = 0
best_f1_score = 0
best_names = None  # Initialize best_names to None

for threshold in np.arange(0, 1, 0.005):  # Iterate from 0 to 1 (exclusive) in steps of 0.005
    # Select highly correlated features based on the current threshold
    relevant_features = cor_target[cor_target > threshold]
    names = relevant_features.index.tolist()  # Series.iteritems() was removed in pandas 2.0
    names.remove('diagnosis')  # Drop the target variable from the results

    # Check if there are any relevant features
    if not names:
        continue  # Skip this iteration and try the next threshold

    X = df[names]
    y = df['diagnosis']

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    scaler = StandardScaler()  # Create an instance of standard scaler
    scaler.fit(X_train)  # Fit it to the training data
    X_train = scaler.transform(X_train)  # Transform training data
    X_test = scaler.transform(X_test)  # Transform test data with the training statistics

    nb = GaussianNB()

    # Train the model
    nb.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = nb.predict(X_test)

    # Calculate the F1 score
    current_f1_score = f1_score(y_test, y_pred)

    # Check if the current F1 score is better than the best F1 score so far
    if current_f1_score > best_f1_score:
        best_f1_score = current_f1_score
        best_threshold = threshold
        best_names = names  # Update the best_names with the current relevant features

# Check if any relevant features were found before proceeding
if best_names is None:
    print("No relevant features found with any threshold. Please adjust the threshold.")
else:
    print("Best Threshold:", best_threshold)
    print("Best F1 Score:", best_f1_score)
    print("Best Relevant Features:", best_names)
    print('No. of Best Relevant Features:', len(best_names))

Best Threshold: 0.745
Best F1 Score: 0.9655172413793104
Best Relevant Features: ['concave points_mean', 'radius_worst', 'perimeter_worst', 'concave points_worst']
No. of Best Relevant Features: 4
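
The selection rule used in the search above keeps a feature when the absolute value of its Pearson correlation with the target exceeds a threshold. The plain-Python sketch below illustrates that rule on toy data; the helper names (`pearson`, `select_by_correlation`) and the feature values are hypothetical, and only the threshold 0.745 is taken from the output above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

def select_by_correlation(features, target, threshold):
    """Keep features whose absolute correlation with the target exceeds `threshold`."""
    return [
        name for name, values in features.items()
        if abs(pearson(values, target)) > threshold
    ]

# Toy data (hypothetical): two informative features and one unrelated feature.
target = [0, 0, 0, 1, 1, 1]
features = {
    "strong_pos": [1.0, 1.2, 0.9, 3.1, 2.9, 3.3],
    "strong_neg": [5.0, 4.8, 5.1, 2.0, 2.2, 1.9],
    "noise": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
}
print(select_by_correlation(features, target, threshold=0.745))
```

Taking the absolute value means a strongly negative correlation counts as much as a strongly positive one. Note that the search above scores each threshold on the test split, so the reported best F1 score is likely to be optimistic; tuning the threshold on a separate validation split would give a less biased estimate.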


In [9]:
corr = df.corr()
cor_target = abs(corr["diagnosis"])
# Select highly correlated features:
# (Set the threshold manually to the best value found above)
relevant_features = cor_target[cor_target > 0.745]
# Collect the names of the features:
names = relevant_features.index.tolist()  # Series.iteritems() was removed in pandas 2.0
# Drop the target variable from the results:
names.remove('diagnosis')
len(names)

4

In [10]:
X = df[names]
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
scaler = StandardScaler() # Create an instance of standard scaler
scaler.fit(X_train) # Fit it to the training data
X_train = scaler.transform(X_train) # Transform training data
X_test = scaler.transform(X_test) # Transform test data with the training statistics
nb = GaussianNB()
mfs = train_evaluate_model(nb, X_train, y_train, X_test, y_test)

In [11]:
mfs.index = ['Naive Bayes Classifier (Manual)']
results = pd.concat([results, mfs])  # DataFrame.append() was removed in pandas 2.0
results.sort_values(by = 'F1-Score', ascending = False).style.background_gradient(cmap = sns.color_palette("ch:s=-.2,r=.6", as_cmap=True))

| | Accuracy | F1-Score | Precision | Recall | Balanced Accuracy |
|---|---|---|---|---|---|
| Naive Bayes Classifier (Manual) | 0.973684 | 0.965517 | 0.954545 | 0.976744 | 0.974288 |
| Naive Bayes Classifier (All Features) | 0.964912 | 0.952381 | 0.97561 | 0.930233 | 0.958074 |


### Case III: Variance Thresholding:
This method removes features with low variance, assuming that features with little variation provide limited information for the model.
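
The following plain-Python sketch illustrates the idea on toy columns. It is a simplified stand-in, not scikit-learn's `VarianceThreshold`: the helper `variance_threshold_select` and the column values are hypothetical. It also shows that variance depends on a feature's scale, so the same feature can be dropped or kept depending on its units.

```python
from statistics import pvariance

def variance_threshold_select(columns, threshold):
    """Return the names of columns whose population variance exceeds `threshold`."""
    return [name for name, values in columns.items() if pvariance(values) > threshold]

# Toy columns (hypothetical values on very different scales).
columns = {
    "area_like": [400.0, 1000.0, 1300.0, 560.0],    # large scale, large variance
    "smoothness_like": [0.10, 0.12, 0.08, 0.14],    # small scale, tiny variance
    "constant": [1.0, 1.0, 1.0, 1.0],               # zero variance
}
kept_raw = variance_threshold_select(columns, threshold=0.1)

# Expressing the small-scale column in different units (x100) changes the outcome.
rescaled = dict(columns, smoothness_like=[v * 100 for v in columns["smoothness_like"]])
kept_rescaled = variance_threshold_select(rescaled, threshold=0.1)

print(kept_raw)
print(kept_rescaled)
```

Because of this scale dependence, a fixed threshold applied to raw features tends to favor features measured in large units, regardless of how informative they are.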

In [12]:
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# Initialize an array of threshold values to try
threshold_values = np.arange(0.1, 1.0, 0.005)

best_threshold = None
best_f1_score = 0.0

for threshold_value in threshold_values:
    # Perform variance thresholding
    variance_threshold = VarianceThreshold(threshold=threshold_value)
    X_high_variance = variance_threshold.fit_transform(X)

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X_high_variance, y, test_size=0.2, random_state=42)

    # Standardize the data
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Train the Gaussian Naive Bayes model
    nb = GaussianNB()
    nb.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = nb.predict(X_test)

    # Calculate the F1 score
    f1 = f1_score(y_test, y_pred)

    # Check if this threshold gives a higher F1 score
    if f1 > best_f1_score:
        best_f1_score = f1
        best_threshold = threshold_value

print("Best Threshold Value:", best_threshold)
print("Best F1 Score:", best_f1_score)

Best Threshold Value: 0.1
Best F1 Score: 0.9382716049382717


In [13]:
X = df.drop('diagnosis', axis = 1)
y = df['diagnosis']
# (Set the threshold manually to the best value found above)
variance_threshold = VarianceThreshold(threshold=0.1)
# Fit the VarianceThreshold object to the data.
X_high_variance = variance_threshold.fit_transform(X)
# The "get_support" method returns a boolean mask of the selected features.
selected_features_mask = variance_threshold.get_support()
# You can use this mask to get the indices or names of the selected features.
selected_feature_indices = np.where(selected_features_mask)[0]
selected_feature_names = X.columns[selected_feature_indices]
selected_feature_names

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'texture_se', 'perimeter_se', 'area_se', 'radius_worst',
       'texture_worst', 'perimeter_worst', 'area_worst'],
      dtype='object')

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X_high_variance, y, test_size = 0.2, random_state = 42)
scaler = StandardScaler() # Create an instance of standard scaler
scaler.fit(X_train) # Fit it to the training data
X_train = scaler.transform(X_train) # Transform training data
X_test = scaler.transform(X_test) # Transform test data with the training statistics
nb = GaussianNB()
vt = train_evaluate_model(nb, X_train, y_train, X_test, y_test)

In [15]:
vt.index = ['Naive Bayes Classifier (VT)']
results = pd.concat([results, vt])  # DataFrame.append() was removed in pandas 2.0
results.sort_values(by = 'F1-Score', ascending = False).style.background_gradient(cmap = sns.color_palette("ch:s=-.2,r=.6", as_cmap=True))

| | Accuracy | F1-Score | Precision | Recall | Balanced Accuracy |
|---|---|---|---|---|---|
| Naive Bayes Classifier (Manual) | 0.973684 | 0.965517 | 0.954545 | 0.976744 | 0.974288 |
| Naive Bayes Classifier (All Features) | 0.964912 | 0.952381 | 0.97561 | 0.930233 | 0.958074 |
| Naive Bayes Classifier (VT) | 0.95614 | 0.938272 | 1.0 | 0.883721 | 0.94186 |


## Stratified K-Fold:
In binary classification problems, a dataset is imbalanced when one class (the minority class) has significantly fewer samples than the other class (the majority class).
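
A short sketch of why imbalance matters for evaluation: on a toy label set with a 90/10 split (hypothetical numbers, not this dataset's class counts), a classifier that always predicts the majority class scores high plain accuracy but only chance-level balanced accuracy.

```python
# Toy labels (hypothetical): 90 majority-class samples and 10 minority-class samples.
labels = [0] * 90 + [1] * 10
predictions = [0] * len(labels)  # always predict the majority class

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Per-class recall: the fraction of each class that is predicted correctly.
recall_negative = sum(p == 0 for p, t in zip(predictions, labels) if t == 0) / labels.count(0)
recall_positive = sum(p == 1 for p, t in zip(predictions, labels) if t == 1) / labels.count(1)
balanced_accuracy = (recall_negative + recall_positive) / 2

print("Accuracy:", accuracy)
print("Balanced accuracy:", balanced_accuracy)
```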

In [16]:
px.histogram(data_frame=df, x='diagnosis', color='diagnosis',color_discrete_sequence=['red','blue'])

Cross-validation is a technique that helps estimate the model's generalization ability by dividing the data into multiple folds and training/evaluating the model on different subsets of the data. Stratified K-Fold ensures that each fold maintains the same class distribution as the original dataset, which is particularly important for imbalanced datasets.
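
The stratification idea can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's `StratifiedKFold` algorithm: the helper `stratified_folds` is hypothetical, it does no shuffling, and it simply deals the indices of each class out to the folds in round-robin order so that every fold receives the same class proportions.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign sample indices to folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Toy labels (hypothetical): 30 samples of class 0 and 20 samples of class 1.
labels = [0] * 30 + [1] * 20
folds = stratified_folds(labels, n_splits=5)
for number, fold in enumerate(folds, start=1):
    counts = Counter(labels[i] for i in fold)
    print(f"Fold {number}: class 0 = {counts[0]}, class 1 = {counts[1]}")
```

Each fold serves once as the test set while the remaining folds form the training set, so every sample is used for evaluation exactly once.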

In [17]:
X = df.drop('diagnosis', axis = 1)
y = df['diagnosis']

# Define the number of folds for cross-validation
n_splits = 5  # Set the desired number of folds

# Create the StratifiedKFold object
stratified_kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

In [18]:
# Create an empty list to store the accuracy scores for each fold
accuracy_scores = []

# Loop through each fold and train/evaluate the Gaussian Naive Bayes model
for train_index, test_index in stratified_kfold.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Create and train the Gaussian Naive Bayes model
    model = GaussianNB()
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Calculate accuracy for this fold and store it in the list
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)

# Calculate the mean and standard deviation of the accuracy scores
mean_accuracy = np.mean(accuracy_scores)
std_accuracy = np.std(accuracy_scores)

# Print the mean and standard deviation of accuracy
print("Mean Accuracy:", mean_accuracy*100, '%')
print("Standard Deviation of Accuracy:", std_accuracy)

Mean Accuracy: 93.85343890700202 %
Standard Deviation of Accuracy: 0.023468351296512962
