<a href="https://www.kaggle.com/code/sacrum/ml-labs-08-naive-bayes?scriptVersionId=178243645" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

Train a Gaussian Naïve Bayes classifier on the MNIST digits dataset. Compare your results with classifiers including Decision Trees, Support Vector Classifier and Logistic Regression Classifier. Identify which technique performs best on the digits dataset and explain why the performance of this technique is better than all the others.

In [1]:
# Import Necessary Libraries

import pandas as pd
import numpy as np
import time

# Silence all library warnings (note: this also hides any convergence warnings from the models)
import warnings
warnings.filterwarnings("ignore")

# Data

In [2]:
from sklearn.datasets import fetch_openml

# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1, cache=True, parser='auto')

# Pandas data frame with feature vectors
X = mnist.data

# Scale pixel values
X = X / 255.

# Labels
y = mnist.target

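# Use a sample of the data
# Note: N = -1 drops the last row, so 69,999 of the 70,000 samples remain;
# set N = None to keep every row
N = -1
X = X[:N]
y = y[:N]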

In [3]:
from sklearn.model_selection import train_test_split

# train test split
X_train, X_test, y_train, y_test = train_test_split(
	X,
	y,
	test_size=0.2,
	stratify=y
)

# show shapes
print("X_train.shape:", X_train.shape)
print("y_train.shape:", y_train.shape)
print("X_test.shape:", X_test.shape)
print("y_test.shape:", y_test.shape)

X_train.shape: (55999, 784)
y_train.shape: (55999,)
X_test.shape: (14000, 784)
y_test.shape: (14000,)
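
The shapes above can be checked by hand. The sketch below reproduces the arithmetic, assuming the full `mnist_784` dataset has 70,000 rows and that `train_test_split` rounds the test count up; the variable names are illustrative.

```python
import math

n_total = 70_000       # assumed size of the full mnist_784 dataset
n_used = n_total - 1   # X[:N] with N = -1 drops the last row
test_size = 0.2

# Assumption: the test count is rounded up and the remainder goes to training
n_test = math.ceil(test_size * n_used)
n_train = n_used - n_test

print("train rows:", n_train)
print("test rows:", n_test)
```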


# Model

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Initializing Models
models = [
	DecisionTreeClassifier(),
	SVC(),
	LogisticRegression(),
	GaussianNB(),
]

In [5]:
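results = {}

for model in models:
	name = model.__class__.__name__
	print(">>>", name)

	# Fit on train data and time it
	start = time.perf_counter()
	model.fit(X_train, y_train)
	train_time = round(time.perf_counter() - start, 3)
	print(f"    Training Time: {train_time} seconds")

	# Get predictions on test data and time them
	start = time.perf_counter()
	preds = model.predict(X_test)
	pred_time = round(time.perf_counter() - start, 3)
	print(f"    Prediction Time: {pred_time} seconds")

	# Get the per-class report
	report = classification_report(y_test, preds)

	# Keep timings and report so they can be compared after the loop
	results[name] = {
		"train_time": train_time,
		"pred_time": pred_time,
		"report": report,
	}

	print()
	print(report)
	print()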


>>> DecisionTreeClassifier
    Training Time: 21.416 seconds
    Prediction Time: 0.036 seconds

              precision    recall  f1-score   support

           0       0.91      0.92      0.92      1381
           1       0.94      0.96      0.95      1575
           2       0.86      0.84      0.85      1398
           3       0.83      0.83      0.83      1428
           4       0.89      0.88      0.89      1365
           5       0.82      0.82      0.82      1262
           6       0.89      0.90      0.89      1375
           7       0.90      0.90      0.90      1459
           8       0.83      0.81      0.82      1365
           9       0.83      0.85      0.84      1392

    accuracy                           0.87     14000
   macro avg       0.87      0.87      0.87     14000
weighted avg       0.87      0.87      0.87     14000


>>> SVC
    Training Time: 228.699 seconds
    Prediction Time: 150.132 seconds

              precision    recall  f1-score   support

       

# Results

| Classifier                      | Accuracy | Precision | Recall | F1-Score | Training Time (s) | Prediction Time (s) |
|---------------------------------|----------|-----------|--------|----------|-------------------|---------------------|
| DecisionTreeClassifier          | 0.87     | 0.87      | 0.87   | 0.87     | 17.882            | 0.03                |
| Support Vector Classifier (SVC) | 0.98     | 0.98      | 0.98   | 0.98     | 206.083           | 127.571             |
| Logistic Regression             | 0.92     | 0.92      | 0.92   | 0.92     | 26.421            | 0.063               |
| Gaussian Naïve Bayes            | 0.56     | 0.69      | 0.55   | 0.51     | 0.975             | 0.691               |
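
`classification_report` prints both a macro average (every class counts equally) and a weighted average (each class is weighted by its support). As a rough check, the sketch below recomputes both for precision from the per-class rows of the DecisionTreeClassifier report shown earlier. The inputs are the rounded values from that report, so the results are approximate.

```python
# Per-class precision and support, copied from the DecisionTreeClassifier report
precision = [0.91, 0.94, 0.86, 0.83, 0.89, 0.82, 0.89, 0.90, 0.83, 0.83]
support = [1381, 1575, 1398, 1428, 1365, 1262, 1375, 1459, 1365, 1392]

# Macro average: plain mean over the ten classes
macro_avg = sum(precision) / len(precision)

# Weighted average: mean weighted by the number of test samples per class
weighted_avg = sum(p * s for p, s in zip(precision, support)) / sum(support)

print(f"macro avg precision:    {macro_avg:.2f}")
print(f"weighted avg precision: {weighted_avg:.2f}")
```

Both values round to 0.87 here because the test set is stratified and the classes are close to balanced.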

The Support Vector Classifier (SVC) outperforms the other classifiers on the MNIST digits dataset, achieving an accuracy of 98%. scikit-learn's `SVC()` uses an RBF kernel by default, so it finds a maximum-margin separating hyperplane in an implicit high-dimensional feature space. In the original 784-pixel space this corresponds to a non-linear decision boundary, which suits the complex, non-linear relationships in the MNIST dataset.
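
The non-linear behaviour comes from the kernel. Below is a minimal sketch of the RBF (Gaussian) kernel value between two vectors, written in plain Python for illustration. The function name `rbf_kernel_value`, the toy four-pixel vectors and the `gamma` value are assumptions made for this example, not values used in the notebook.

```python
import math

def rbf_kernel_value(a, b, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * ||a - b||^2)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Toy 4-"pixel" vectors with hypothetical intensities in [0, 1]
x = [0.0, 0.5, 1.0, 0.0]
near = [0.0, 0.6, 0.9, 0.0]   # small perturbation of x
far = [1.0, 0.0, 0.0, 1.0]    # very different pattern

gamma = 0.5  # illustrative value only

k_same = rbf_kernel_value(x, x, gamma)
k_near = rbf_kernel_value(x, near, gamma)
k_far = rbf_kernel_value(x, far, gamma)

print("k(x, x)    =", k_same)
print("k(x, near) =", round(k_near, 3))
print("k(x, far)  =", round(k_far, 3))
```

Similar images give a kernel value close to 1 and dissimilar images a value close to 0, so the classifier effectively compares each test digit with the support vectors by similarity rather than by a single linear weighting of pixels.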

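In comparison, the DecisionTreeClassifier, LogisticRegression, and Gaussian Naïve Bayes classifiers have lower accuracies. The DecisionTreeClassifier (87%) is grown without a depth limit by default, which makes it prone to overfitting. LogisticRegression (92%) is a linear model, so it struggles with non-linear relationships between pixels. Gaussian Naïve Bayes (56%) assumes that features are conditionally independent given the class and that each feature is normally distributed within a class; neither assumption is likely to hold for MNIST pixel values, where neighbouring pixels are strongly correlated.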
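
To make the independence assumption concrete, here is a minimal from-scratch sketch of Gaussian Naïve Bayes in plain Python. It is not scikit-learn's implementation: the function names, the constant `epsilon` added to each variance, and the toy two-feature data are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, epsilon=1e-9):
    """Estimate a log-prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # epsilon keeps the variance positive for constant features
        variances = [
            sum((v - m) ** 2 for v in col) / n + epsilon
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior (up to a constant)."""
    def log_posterior(params):
        log_prior, means, variances = params
        # Independence assumption: per-feature log-likelihoods are simply
        # summed, so correlations between features are ignored entirely
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))

# Toy data: two well-separated classes with two features each
X_toy = [[0.9, 0.8], [1.0, 0.9], [0.8, 1.0], [0.1, 0.2], [0.0, 0.1], [0.2, 0.0]]
y_toy = ["a", "a", "a", "b", "b", "b"]

toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(toy_model, [0.85, 0.9]))
print(predict_gaussian_nb(toy_model, [0.1, 0.1]))
```

Each feature contributes its own term to the sum, so 784 correlated pixels are treated as 784 independent pieces of evidence, which is one plausible reason for the low accuracy on MNIST.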

Overall, the Support Vector Classifier (SVC) is the best-performing classifier on the MNIST digits dataset, but its accuracy comes at a cost: it is by far the slowest model, taking about 206 seconds to train and 128 seconds to predict, whereas Gaussian Naïve Bayes needs under one second for each.

**Find More Labs**

This lab is from my Machine Learning course, which is part of my [Software Engineering](https://seecs.nust.edu.pk/program/bachelor-of-software-engineering-for-fall-2021-onward) degree at [NUST](https://nust.edu.pk).

The notebooks listed below cover a range of topics in **machine learning** and **data analysis**, implemented from scratch or with popular libraries such as **NumPy**, **pandas**, **scikit-learn**, **seaborn**, and **matplotlib**. They include an introduction to NumPy that showcases its efficiency for mathematical operations, followed by **linear regression**, **logistic regression**, **decision trees**, **K-nearest neighbors (KNN)**, **support vector machines (SVM)**, **Naive Bayes**, **K-means** clustering, principal component analysis (**PCA**), and **neural networks** with **backpropagation**. Each notebook demonstrates the practical implementation and application of these algorithms on datasets such as the **California Housing** dataset, the **MNIST** dataset, the **Iris** dataset, the **Auto-MPG** dataset, and the **UCI Adult Census Income** dataset. The series also covers **gradient descent optimization**, model evaluation metrics (e.g., **accuracy, precision, recall, f1 score**), **regularization** techniques (e.g., **Lasso**, **Ridge**), and **data visualization**.

| Title                                                                                                                   | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| ----------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [01 - Intro to Numpy](https://www.kaggle.com/code/sacrum/ml-labs-01-intro-to-numpy)                                     | The notebook demonstrates NumPy's efficiency for mathematical operations like array `reshaping`, `sigmoid`, `softmax`, `dot` and `outer products`, `L1 and L2 losses`, and matrix operations. It highlights NumPy's superiority over standard Python lists in speed and convenience for scientific computing and machine learning tasks.                                                                                                                                                                                              |
| [02 - Linear Regression From Scratch](https://www.kaggle.com/code/sacrum/ml-labs-02-linear-regression-from-scratch)     | This notebook implements `linear regression` and `gradient descent` from scratch in Python using `NumPy`, focusing on predicting house prices with the `California Housing Dataset`. It defines functions for prediction, `MSE` calculation, and gradient computation. Batch gradient descent is used for optimization. The dataset is loaded, scaled, and split. `Batch, stochastic, and mini-batch gradient descents` are applied with varying hyperparameters. Finally, the MSEs of the predictions from each method are compared. |
| [03 - Logistic Regression from Scratch](https://www.kaggle.com/code/sacrum/ml-labs-03-logistic-regression-from-scratch) | This notebook implements `logistic regression` from scratch in Python using `NumPy`, including functions for prediction, loss calculation, gradient computation, and batch `gradient descent` optimization, applied to the `MNIST` dataset for handwritten digit recognition and to the `Iris` data. It also includes metrics such as `accuracy`, `precision`, `recall`, and `f1 score`. |
| [04 - Auto-MPG Regression](https://www.kaggle.com/code/sacrum/ml-labs-04-auto-mpg-regression)                           | The notebook uses `pandas` for data manipulation, `seaborn` and `matplotlib` for visualization, and `sklearn` for `linear regression` and `regularization` techniques (`Lasso` and `Ridge`). It includes data loading, processing, visualization, model training, and evaluation on the `Auto-MPG dataset`.                                                                                                                                                                                                                           |
| [05 - Decision Trees from Scratch](https://www.kaggle.com/code/sacrum/ml-labs-05-desicion-trees-from-scratch)           | In this notebook, the `DecisionTree` algorithm is implemented from scratch and applied to a dummy dataset. |
| [06 - KNN from Scratch](https://www.kaggle.com/code/sacrum/ml-labs-06-knn-from-scratch)                                 | In this notebook, the `K-Nearest Neighbour` algorithm is implemented from scratch and compared with the KNN implementation provided by the scikit-learn package. |
| [07 - SVM](https://www.kaggle.com/code/sacrum/ml-labs-07-svm)                                                           | This notebook trains an `SVM classifier` on the `Iris Dataset`. |
| [08 - Naive Bayes](https://www.kaggle.com/code/sacrum/ml-labs-08-naive-bayes)                                           | This notebook trains `Naive Bayes` and compares it with other algorithms `Decision Trees`, `SVM` and `Logistic Regression`                                                                                                                                                                                                                                                                                                                                                                                                            |
| [09 - K-means](https://www.kaggle.com/code/sacrum/ml-labs-09-k-means)                                                   | In this notebook, the `K-means` algorithm is applied using `scikit-learn`, and different values of `k` are compared to illustrate the `elbow method` using `Calinski-Harabasz scores`. |
| [10 - UCI Adult Census Income](https://www.kaggle.com/code/sacrum/ml-labs-10-uci-adult-census-income)                   | Here I have used the UCI Adult Income dataset and applied different machine learning algorithms to find the best model configuration for predicting salary from the given information                                                                                                                                                                                                                                                                                                                                                 |
| [11 - PCA](https://www.kaggle.com/code/sacrum/ml-labs-11-pca)                                                           | `Principal Component Analysis` implemented from scratch. |
| [12 - Neural Networks](https://www.kaggle.com/code/sacrum/ml-labs-12-neural-networks)                                   | This notebook implements neural networks with backpropagation from scratch. |