PROJECT NAME: NAIVE BAYES UNDER THE HOOD

PROJECT GOAL:

The primary objective of this project is to demystify the Naive Bayes classifier, presenting it in an accessible manner that anyone can understand. It aims to break down the fundamental concepts behind this powerful machine learning algorithm, illustrating how it works and providing a step-by-step implementation at a basic level. By the end of this project, readers will have a clear understanding of the Naive Bayes classifier's principles, applications, and practical implementation, empowering them to apply these insights in real-world scenarios.

DESCRIPTION:

This project focuses on implementing a robust Naive Bayes classifier using Python, 
building upon the foundational concepts of probability theory and statistical inference. 
At its core, the Naive Bayes algorithm leverages Bayes' theorem, which relates the conditional and marginal probabilities of random variables, 
to make predictions based on input features. 

To enhance the model's reliability, we incorporate logarithmic transformation to address potential underflow issues 
that can arise during probability calculations, especially when dealing with very small values. 
When probabilities are multiplied together, as is common in Naive Bayes, the resulting product can quickly become exceedingly small, 
potentially falling below the range of standard floating-point representation and causing numerical instability. 
By applying logarithms, we transform these multiplications into sums, effectively stabilizing the program and eliminating the risk of underflow 
while maintaining the accuracy of our predictions.

Moreover, we introduce additive smoothing, a technique that mitigates the problem of zero probabilities when estimating class-conditional probabilities. 
This functionality allows users to specify the degree of smoothing, enhancing the model's performance in scenarios with limited training data 
or unseen features. By adding a small constant (smoothing parameter) to the frequency counts, we ensure that no probability is ever exactly zero, 
allowing the algorithm to gracefully handle unseen attribute values during prediction.

Finally, the project expands its applicability by enabling the Naive Bayes algorithm to work with numerical input attributes. 
Instead of treating these attributes as categorical, we model their conditional distribution using a normal (Gaussian) distribution. 
This approach entails estimating the mean and standard deviation for each numerical feature within each class, 
thereby allowing us to compute probabilities based on the properties of the normal distribution. 
Consequently, the classifier can effectively handle a mixture of categorical and continuous features, 
making it a versatile tool for various classification tasks. 

Through these enhancements, the project aims to provide a comprehensive implementation of the Naive Bayes classifier that balances mathematical rigor 
with practical usability.

DATA OVERVIEW:

The training data consists of 200 entries with the following attributes:
Age: Integer values (e.g., 23, 47)
Sex: Categorical values (F, M)
BP: Categorical values (HIGH, LOW, NORMAL)
Cholesterol: Categorical values (HIGH, NORMAL)
Na: Continuous values (float)
K: Continuous values (float)
Drug: Target variable (categorical with classes drugY, drugX, drugA, drugC, drugB)

USED PYTHON LIBRARIES: pandas, numpy, scipy.stats

In [27]:
import pandas as pd  # Importing the pandas library for data manipulation and analysis
import numpy as np  # Importing NumPy for numerical operations
from scipy.stats import norm  # Importing the normal distribution from scipy.stats

# Function to learn the Naive Bayes model
def learn(data, class_att, smoothing=1e-5): 
    model = {}  # Initialize an empty dictionary to store the model

    # Calculate the prior probabilities of each class
    apriori = data[class_att].value_counts(normalize=True)
    model['_apriori'] = apriori 

    # Loop through each attribute in the dataset, excluding the class attribute
    for attribute in data.drop(class_att, axis=1).columns:
        if data[attribute].dtype == 'object':  # Check if the attribute is categorical
            # Create a contingency table (counts of each category for each class) with smoothing to avoid zero probabilities
            mat_kont = pd.crosstab(data[attribute], data[class_att]) + smoothing
            mat_kont = mat_kont / mat_kont.sum(axis=0)  # Normalize the counts to get probabilities
            model[attribute] = np.log(mat_kont)  # Store the log of the probabilities in the model to avoid underflow
        else:
            # For numerical attributes, calculate the mean and standard deviation for each class
            mean_std = data.groupby(class_att)[attribute].agg(['mean', 'std'])
            model[attribute]= mean_std  # Store the mean and std in the model
    return model  # Return the trained model

# Function to predict the class of new instances
def predict(model, new_instance):
    class_probabilities = {}  # Initialize a dictionary to store class probabilities
    
    # Loop through each class in the model
    for class_value in model['_apriori'].index:
        probability = 0  # Initialize the probability for the current class
        
        # Loop through each attribute in the model
        for attribute in model:
            if attribute == '_apriori':  # If the attribute is the prior probabilities
                probability += np.log(model['_apriori'][class_value])  # Add the log prior probability
            else:
                # Check if the attribute is categorical and exists in the new instance
                if isinstance(model[attribute], pd.DataFrame):
                    if new_instance[attribute] in model[attribute].index:
                        probability += model[attribute].loc[new_instance[attribute], class_value]  # Add log probability
                    else:
                        probability += np.log(1e-5)  # Add a small value for unseen categories
                else: 
                    # For numerical attributes, calculate the probability using the normal distribution
                    mean, std = model[attribute].loc[class_value]
                    probability += np.log(norm.pdf(new_instance[attribute], mean, std))  # Add log of Gaussian probability
                    
        class_probabilities[class_value] = probability  # Store the calculated probability for the class

    # Get the class with the highest probability
    prediction = max(class_probabilities, key=class_probabilities.get)

    return prediction, class_probabilities  # Return the predicted class and its probabilities

In [28]:
# Load the dataset from a CSV file
data = pd.read_csv('C:/Users/Korisnik/Downloads/BIG DATA/BIG DATA/DOMACI 1/data/drug.csv') # local path on my PC

print("TRAINING DATA (FIRST 10 ROWS): ")
print(data.head(10))
print()
print("BASIC DATAFRAME INFO: ")
data.info()

TRAINING DATA (FIRST 10 ROWS): 
   Age Sex      BP Cholesterol        Na         K   Drug
0   23   F    HIGH        HIGH  0.792535  0.031258  drugY
1   47   M     LOW        HIGH  0.739309  0.056468  drugC
2   47   M     LOW        HIGH  0.697269  0.068944  drugC
3   28   F  NORMAL        HIGH  0.563682  0.072289  drugX
4   61   F     LOW        HIGH  0.559294  0.030998  drugY
5   22   F  NORMAL        HIGH  0.676901  0.078647  drugX
6   49   F  NORMAL        HIGH  0.789637  0.048518  drugY
7   41   M     LOW        HIGH  0.766635  0.069461  drugC
8   60   M  NORMAL        HIGH  0.777205  0.051230  drugY
9   43   M     LOW      NORMAL  0.526102  0.027164  drugY

BASIC DATAFRAME INFO: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          200 non-null    int64  
 1   Sex          200 non-null    object 
 2   BP           200 non-null    

- The dataset contains 200 entries and consists of 7 columns, which include both categorical and numerical features. 

- Columns:

  * Age: An integer column representing the age of the individuals. It is a numerical value.

  * Sex: A categorical column (with values 'M' and 'F') indicating the gender of the individuals. This feature can help capture any gender-related     trends in drug prescriptions.

  * BP: A categorical column indicating blood pressure levels with three possible values: 'HIGH', 'LOW', and 'NORMAL'. This information may play a   significant role in determining drug prescriptions based on blood pressure conditions.

  * Cholesterol: Another categorical column that describes cholesterol levels, also with values 'HIGH' and 'NORMAL'. 

  * Na: A float column representing sodium levels in the blood, measured as a continuous variable. This feature may help in predicting drug type based on biochemical factors.

  * K: A float column representing potassium levels, similar to the sodium column. This is also a continuous variable that may influence drug predictions.

  * Drug: Cathegorical variable. Indicating the prescribed drug type, with multiple classes (e.g., 'drugY', 'drugX', 'drugC', etc.). This is the outcome variable that the model aims to predict.

- There are no missing values in the dataframe.

- Target variable: 'Drug' 

In [41]:
print("TARGET VARIABLE ('Drug') INFO: ")
distinct_values = data['Drug'].unique()
value_counts = len(distinct_values)
print(f"NUMBER OF CLASSES: {value_counts}, CLASS LABELS: ", distinct_values)
print("VALUE COUNTS: ")
print(data['Drug'].value_counts())

TARGET VARIABLE ('Drug') INFO: 
NUMBER OF CLASSES: 5, CLASS LABELS:  ['drugY' 'drugC' 'drugX' 'drugA' 'drugB']
VALUE COUNTS: 
Drug
drugY    91
drugX    54
drugA    23
drugC    16
drugB    16
Name: count, dtype: int64


In [49]:
output_var = 'Drug'  # Specify the target variable (class attribute)

print("\nTraining Naive Bayes model...")
# Train the model using the learn function with a specified smoothing parameter
model = learn(data, output_var, 10000)
print("\nMODEL: ")
print(model)


Training Naive Bayes model...

MODEL: 
{'_apriori': Drug
drugY    0.455
drugX    0.270
drugA    0.115
drugC    0.080
drugB    0.080
Name: proportion, dtype: float64, 'Age':             mean        std
Drug                       
drugA  35.869565   9.696786
drugB  62.500000   7.127412
drugC  42.500000  16.725230
drugX  44.018519  16.435685
drugY  43.747253  17.031731, 'Sex': Drug     drugA     drugB     drugC     drugX     drugY
Sex                                                   
F    -0.693397 -0.693347 -0.693247 -0.693147 -0.692998
M    -0.692897 -0.692947 -0.693047 -0.693147 -0.693297, 'BP': Drug       drugA     drugB     drugC     drugX     drugY
BP                                                      
HIGH   -1.097081 -1.097547 -1.099145 -1.100411 -1.097848
LOW    -1.099379 -1.099145 -1.097547 -1.098612 -1.098646
NORMAL -1.099379 -1.099145 -1.099145 -1.096817 -1.099344, 'Cholesterol': Drug            drugA     drugB     drugC     drugX     drugY
Cholesterol                     

- The _apriori section gives the prior probabilities of each class (Drug type) in the dataset. These probabilities represent how likely each drug is to occur in the absence of any other information and they are calculated as 
number of appearances for each class divided by a total number of entries (200):
drugY: 0.455
drugX: 0.270
drugA: 0.115
drugC: 0.080
drugB: 0.080
- This means that drugY is the most common class, appearing in 45.5% of the data, while drugB and drugC are the least common.
- When it comes to numerical features like 'Age' for example, model contains the mean and std for every class label when it comes to that feature:
'Age':    mean        std
Drug                       
drugA  35.869565   9.696786
drugB  62.500000   7.127412
drugC  42.500000  16.725230
drugX  44.018519  16.435685
drugY  43.747253  17.031731
- When it comes to cathegorical features like 'BP' for example, we calculated the probability of a specific BP category given that a patient is taking a certain drug (CONDITIONAL PROBABILITY). Each entry corresponds to the log probability of having a specific BP level (HIGH, LOW, NORMAL) when a particular drug is being taken (drugA, drugB, etc.), calculated by a log of the conditional probability - log(P(BP∣Drug)):
BP
Drug       drugA     drugB     drugC     drugX     drugY
HIGH   -1.097081 -1.097547 -1.099145 -1.100411 -1.097848
LOW    -1.099379 -1.099145 -1.097547 -1.098612 -1.098646
NORMAL -1.099379 -1.099145 -1.099145 -1.096817 -1.099344
The values for these features are negative, indicating logged probabilities. The closer the log probability is to zero (less negative), the higher the corresponding probability for that category.

- The _apriori section gives the prior probabilities of each class (Drug type) in the dataset. These probabilities represent how likely each drug is to occur in the absence of any other information:
drugY: 0.455
drugX: 0.270
drugA: 0.115
drugC: 0.080
drugB: 0.080
- This means that drugY is the most common class, appearing in 45.5% of the data, while drugB and drugC are the least common.
- When it comes to numerical features like 'Age' for example, we got mean and std of Age for each class
- When it comes to cathegorical features like 'BP', we calculated the probability of a specific BP category given that a patient is taking a certain drug. Each entry in the table corresponds to the log probability of having a specific BP level (HIGH, LOW, NORMAL) when a particular drug is being taken (drugA, drugB, etc.), calculated by formula log(P(BP/Drug)):

In [38]:
# Prediction instances
new_instance = {'Age': 10, 'Sex': 'F', 'BP': 'HIGH', 'Cholesterol': 'HIGH', 'Na': 0.65222, 'K':0.4}
new_instance2 = {'Age': 25, 'Sex': 'F', 'BP': 'LOW', 'Cholesterol': 'LOW', 'Na': 0.222, 'K':0.3}
new_instance3 = {'Age': 60, 'Sex': 'M', 'BP': 'HIGH', 'Cholesterol': 'HIGH', 'Na': 0.66, 'K':0.061939}
new_instance4 = {
    'Age': 35,
    'Sex': 'M',
    'BP': 'HIGH',
    'Cholesterol': 'HIGH',
    'Na': 0.66,
    'K': 0.061939,
}

# Make predictions for each new instance and print the results
prediction, confidence = predict(model, new_instance)
prediction2, confidence2 = predict(model, new_instance2)
prediction3, confidence3 = predict(model, new_instance3)
prediction4, confidence4 = predict(model, new_instance4)

print("PREDICTIONS FOR NEW INSTANCES:")
print("Instance: ", new_instance)
print("Prediction:", prediction)
print("Confidence:", confidence)

print("\nInstance: ", new_instance2)
print("Prediction:", prediction2)
print("Confidence:", confidence2)

print("\nInstance: ", new_instance3)
print("Prediction:", prediction3)
print("Confidence:", confidence3)

print("\nInstance: ", new_instance4)
print("Prediction:", prediction4)
print("Confidence:", confidence4)

PREDICTIONS FOR NEW INSTANCES:
Instance:  {'Age': 10, 'Sex': 'F', 'BP': 'HIGH', 'Cholesterol': 'HIGH', 'Na': 0.65222, 'K': 0.4}
Prediction: drugY
Confidence: {'drugY': -37.81007822966299, 'drugX': -38.3355131055089, 'drugA': -39.185175011891516, 'drugC': -39.54924576403176, 'drugB': -39.5485460189035}

Instance:  {'Age': 25, 'Sex': 'F', 'BP': 'LOW', 'Cholesterol': 'LOW', 'Na': 0.222, 'K': 0.3}
Prediction: drugY
Confidence: {'drugY': -48.63080311277627, 'drugX': -49.15279464909273, 'drugA': -50.007300596669445, 'drugC': -50.368224368271484, 'drugB': -50.36992302467748}

Instance:  {'Age': 60, 'Sex': 'M', 'BP': 'HIGH', 'Cholesterol': 'HIGH', 'Na': 0.66, 'K': 0.061939}
Prediction: drugY
Confidence: {'drugY': -37.810376870847826, 'drugX': -38.3355131055089, 'drugA': -39.18467558622065, 'drugC': -39.5490459239032, 'drugB': -39.54814633864238}

Instance:  {'Age': 35, 'Sex': 'M', 'BP': 'HIGH', 'Cholesterol': 'HIGH', 'Na': 0.66, 'K': 0.061939}
Prediction: drugY
Confidence: {'drugY': -37.810376

ANALYSIS AND CONCLUSION:

The Naive Bayes classifier consistently predicts the drugY class, highlighting the challenges posed by imbalanced class distributions in our dataset. The predominance of drugY in the training data leads the model to favor this class, as it has learned to prioritize the most frequently occurring category over others.

It is important to note that this project focuses primarily on explaining the basic Naive Bayes algorithm, which is why the topic of balancing classes has been left out of our implementation. However, to mitigate the effects of class imbalance in real-world applications, practitioners can consider several strategies:

- Resampling Techniques: Implement methods such as oversampling the minority class or undersampling the majority class to create a more balanced training dataset.
- Cost-Sensitive Learning: Adjust the learning algorithm to give more weight to the minority class, helping to improve its predictive performance.
- Evaluation Metrics: Utilize metrics beyond accuracy, such as precision, recall, and F1-score, to gain a better understanding of the model's performance across all classes.

By being mindful of class imbalance and employing appropriate techniques, we can enhance the model's ability to generalize across all classes, ultimately leading to more reliable and equitable predictions.