### Micro Chip QA Classification using Naive Bayes Classifier

### Our Objective
* To reliably classify whether a micro chip is suitable for production usage, based on results of the quality tests.

### Getting to know our MCQA dataset!
The micro chip dataset contains only 3 columns, two test scores and a target:
* Test_1 - Score quantifying the micro chip's performance on test 1.
* Test_2 - Score quantifying the micro chip's performance on test 2.
* QA_Passed - Target variable identifying if the micro chip passed the test.

### Approach
* Explore the dataset to check for class imbalance & missing values.
* Explore the correlation between the features in the dataset.
* Split the pre-processed dataset into train & test sets.
* Create & train a Naive Bayes Classifier using mlpack.
* Evaluate the model on the test set using several metrics to quantify its performance.

In [None]:
!wget -q http://datasets.mlpack.org/microChip.csv

In [1]:
// Import necessary library headers.
#include <mlpack/xeus-cling.hpp>
#include <mlpack/core.hpp>
#include <mlpack/core/data/split_data.hpp>
#include <mlpack/methods/naive_bayes/naive_bayes_classifier.hpp>

In [2]:
#define WITHOUT_NUMPY 1
#include "matplotlibcpp.h"
#include "xwidgets/ximage.hpp"
#include "../utils/plot.hpp"

namespace plt = matplotlibcpp;

In [3]:
using namespace mlpack;

In [4]:
using namespace mlpack::data;

In [5]:
using namespace mlpack::naive_bayes;

In [6]:
// Utility functions for evaluation metrics.
double Accuracy(const arma::Row<size_t>& yPreds, const arma::Row<size_t>& yTrue)
{
    const size_t correct = arma::accu(yPreds == yTrue);
    return (double)correct / (double)yTrue.n_elem;
}

In [7]:
double Precision(const size_t truePos, const size_t falsePos)
{
    return (double)truePos / (double)(truePos + falsePos);
}

In [8]:
double Recall(const size_t truePos, const size_t falseNeg)
{
    return (double)truePos / (double)(truePos + falseNeg);
}

In [9]:
double F1Score(const size_t truePos, const size_t falsePos, const size_t falseNeg)
{
    double prec = Precision(truePos, falsePos);
    double rec = Recall(truePos, falseNeg);
    return 2 * (prec * rec) / (prec + rec);
}

In [10]:
void ClassificationReport(const arma::Row<size_t>& yPreds, const arma::Row<size_t>& yTrue)
{
    arma::Row<size_t> uniqs = arma::unique(yTrue);
    std::cout << std::setw(29) << "precision" << std::setw(15) << "recall" 
              << std::setw(15) << "f1-score" << std::setw(15) << "support" 
              << std::endl << std::endl;
    
    for(auto val: uniqs)
    {
        // Per-class counts, treating `val` as the positive class.
        size_t truePos = arma::accu(yTrue == val && yPreds == val);
        size_t falsePos = arma::accu(yTrue != val && yPreds == val);
        size_t trueNeg = arma::accu(yTrue != val && yPreds != val);
        size_t falseNeg = arma::accu(yTrue == val && yPreds != val);
        
        std::cout << std::setw(15) << val
                  << std::setw(12) << std::setprecision(2) << Precision(truePos, falsePos) 
                  << std::setw(16) << std::setprecision(2) << Recall(truePos, falseNeg) 
                  << std::setw(14) << std::setprecision(2) << F1Score(truePos, falsePos, falseNeg)
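                  // Note: the "support" column prints the true-positive count;
                  // the number of true samples per class is arma::accu(yTrue == val).
                  << std::setw(16) << truePos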
                  << std::endl;
    }
}

In [11]:
!mkdir data && cat microChip.csv | sed 1d > ./data/microChip_trim.csv

In [12]:
// Load the preprocessed dataset into armadillo matrix.
arma::mat microChipData;
data::Load("./data/microChip_trim.csv", microChipData);

In [13]:
// Examine the first 6 samples (columns 0 to 5) from our dataset.
std::cout.precision(4);
std::cout.setf(std::ios::fixed);
std::cout << std::setw(10) << "Test_1" << std::setw(10) << "Test_2" << std::setw(13) << "QA_Passed" << std::endl;
std::cout << microChipData.submat(0, 0, microChipData.n_rows-1, 5).t() << std::endl;

    Test_1    Test_2    QA_Passed
   34.6237   78.0247         0
   30.2867   43.8950         0
   35.8474   72.9022         0
   60.1826   86.3086    1.0000
   79.0327   75.3444    1.0000
   45.0833   56.3164         0



In [14]:
// Plot the correlation matrix as a heatmap.
HeatMapPlot("microChip.csv", "coolwarm", "Micro Chip Correlation Heatmap", 1, 5, 5);
auto img = xw::image_from_file("./plots/Micro Chip Correlation Heatmap.png").finalize();
img


The heatmap above shows some correlation between the two test scores (Test_1, Test_2) and QA_Passed.
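
The internals of the `HeatMapPlot` helper are not shown here. Assuming it displays pairwise Pearson correlation coefficients (the usual choice for such heatmaps), each cell could be computed roughly as in the sketch below. `PearsonCorrelation` is a hypothetical name used for illustration only and relies solely on the standard library.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation coefficient between two equally sized samples.
// Hypothetical helper for illustration; returns 0.0 if the inputs are empty,
// differ in size, or either sample has zero variance.
double PearsonCorrelation(const std::vector<double>& x,
                          const std::vector<double>& y)
{
    const std::size_t n = x.size();
    if (n == 0 || y.size() != n)
        return 0.0;

    // Sample means.
    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= static_cast<double>(n);
    meanY /= static_cast<double>(n);

    // Sums of squared deviations and cross deviations.
    double cov = 0.0, varX = 0.0, varY = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        cov += dx * dy;
        varX += dx * dx;
        varY += dy * dy;
    }
    if (varX == 0.0 || varY == 0.0)
        return 0.0;

    return cov / std::sqrt(varX * varY);
}
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.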

### Exploratory Data Analysis

In [15]:
CountPlot("microChip.csv", "QA_Passed", "", "Distribution of target class");
auto img = xw::image_from_file("./plots/Distribution of target class.png").finalize();
img


In [16]:
PlotCatData("microChip.csv", 2, "Microchip Test 1", "Microchip Test 2", "MCQA");
auto img = xw::image_from_file("./plots/MCQA.png").finalize();
img


In [17]:
// Split the data into features (X) and target (y) variables, targets are the last row.
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(microChipData.row(microChipData.n_rows-1));
// Targets are dropped from the loaded matrix.
microChipData.shed_row(microChipData.n_rows-1);

### Train Test Split
The dataset has to be split into a training set and a test set. The dataset has 100 observations and the test ratio is 25%, so the test set has 0.25 * 100 = 25 observations and the training set has the remaining 75.
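
To make the arithmetic concrete, here is a minimal standard-library sketch of a shuffled index split. It is not mlpack's implementation: `SplitIndices`, the `seed` parameter, and the choice to truncate `n * testRatio` to an integer are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Shuffle the indices 0..n-1 and split them into {train, test}.
// Hypothetical helper for illustration; the test size is n * testRatio,
// truncated towards zero (an assumption of this sketch).
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
SplitIndices(const std::size_t n, const double testRatio, const unsigned seed)
{
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});

    // Shuffle so that the test set is a random sample of the observations.
    std::mt19937 gen(seed);
    std::shuffle(idx.begin(), idx.end(), gen);

    const std::size_t testSize =
        static_cast<std::size_t>(static_cast<double>(n) * testRatio);
    const std::size_t trainSize = n - testSize;

    std::vector<std::size_t> train(idx.begin(), idx.begin() + trainSize);
    std::vector<std::size_t> test(idx.begin() + trainSize, idx.end());
    return {train, test};
}
```

With `n = 100` and `testRatio = 0.25`, this yields 75 training indices and 25 test indices, matching the counts above.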

In [18]:
// Split the dataset into train and test sets using mlpack.
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;
mlpack::data::Split(microChipData, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

### Training the Naive Bayes Classifier
Naive Bayes is a machine learning algorithm for classification problems, based on Bayes' theorem. It is one of the simplest yet most effective ML algorithms and rests on two assumptions:
* Every feature is conditionally independent of the others, given the class.
* Every feature is given the same level of importance.

$ P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)} $
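
Under the independence assumption, the likelihood $P(X|Y)$ factorizes into a product of per-feature terms, and since $P(X)$ is the same for every class, it can be ignored when picking the most likely class. The sketch below illustrates this for a Gaussian model of each feature, which is how mlpack's `NaiveBayesClassifier` models the data. The `ClassModel` struct and the function names are hypothetical and only serve the illustration; this is not mlpack's code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Parameters of one class in a Gaussian naive Bayes model
// (hypothetical struct for illustration).
struct ClassModel
{
    double prior;                  // P(Y = c)
    std::vector<double> mean;      // Per-feature mean for class c.
    std::vector<double> variance;  // Per-feature variance for class c (> 0).
};

// Unnormalized log posterior:
// log P(Y = c) + sum_i log N(x_i | mean_i, variance_i).
double LogPosteriorUnnormalized(const ClassModel& m,
                                const std::vector<double>& x)
{
    const double pi = std::acos(-1.0);
    double logp = std::log(m.prior);
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        const double diff = x[i] - m.mean[i];
        logp += -0.5 * std::log(2.0 * pi * m.variance[i])
                - 0.5 * diff * diff / m.variance[i];
    }
    return logp;
}

// Pick the class with the highest unnormalized log posterior.
// Assumes `models` is non-empty.
std::size_t PredictClass(const std::vector<ClassModel>& models,
                         const std::vector<double>& x)
{
    std::size_t best = 0;
    double bestScore = LogPosteriorUnnormalized(models[0], x);
    for (std::size_t c = 1; c < models.size(); ++c)
    {
        const double score = LogPosteriorUnnormalized(models[c], x);
        if (score > bestScore)
        {
            bestScore = score;
            best = c;
        }
    }
    return best;
}
```

Working with log probabilities turns the product of per-feature likelihoods into a sum, which avoids numerical underflow.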

In [19]:
// Train a Naive Bayes Classifier on the training set; the last argument is the number of classes.
NaiveBayesClassifier<> nbc(Xtrain, Ytrain, 2);

### Making Predictions on Test set

In [20]:
// Predict class labels and class probabilities for the test data using the trained model.
arma::Row<size_t> output;
arma::mat probs;
nbc.Classify(Xtest, output, probs);

In [21]:
// Save predicted probabilities and ground truth as csv for generating a ROC AUC curve.
data::Save("./data/probabilities.csv", probs);
data::Save("./data/ytest.csv", Ytest);

### Evaluation metrics

* True Positive - The actual value was true & the model predicted true.
* False Positive - The actual value was false & the model predicted true, Type I error.
* True Negative - The actual value was false & the model predicted false.
* False Negative - The actual value was true & the model predicted false, Type II error.
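
The four outcomes above can be counted directly from the true and predicted labels. The following is a minimal standard-library sketch; `ConfusionCounts` and `CountOutcomes` are hypothetical names, and the two vectors are assumed to have the same length.

```cpp
#include <cstddef>
#include <vector>

// Counts of the four possible outcomes for one positive class
// (hypothetical struct for illustration).
struct ConfusionCounts
{
    std::size_t truePos = 0;
    std::size_t falsePos = 0;
    std::size_t trueNeg = 0;
    std::size_t falseNeg = 0;
};

// Count outcomes, treating `positive` as the positive class label.
// Assumes yTrue and yPred have the same number of elements.
ConfusionCounts CountOutcomes(const std::vector<std::size_t>& yTrue,
                              const std::vector<std::size_t>& yPred,
                              const std::size_t positive)
{
    ConfusionCounts counts;
    for (std::size_t i = 0; i < yTrue.size(); ++i)
    {
        const bool actual = (yTrue[i] == positive);
        const bool predicted = (yPred[i] == positive);

        if (actual && predicted)
            ++counts.truePos;
        else if (!actual && predicted)
            ++counts.falsePos;   // Type I error.
        else if (!actual && !predicted)
            ++counts.trueNeg;
        else
            ++counts.falseNeg;   // Type II error.
    }
    return counts;
}
```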

`Accuracy`: is a metric that describes how the model performs across all classes. It is useful when all classes are of equal importance. It is calculated as the ratio of the number of correct predictions to the total number of predictions.

$$Accuracy = \frac{True_{positive} + True_{negative}}{True_{positive} + True_{negative} + False_{positive} + False_{negative}}$$

`Precision`: is calculated as the ratio of the number of Positive samples correctly classified to the total number of samples classified as Positive. The precision measures how reliable the model is when it classifies a sample as Positive.

$$Precision = \frac{True_{positive}}{True_{positive} + False_{positive}}$$

`Recall`: is calculated as the ratio of the number of Positive samples correctly classified as Positive to the total number of Positive samples. The recall measures the model's ability to detect Positive samples. The higher the recall, the more Positive samples are detected.

$$Recall = \frac{True_{positive}}{True_{positive} + False_{negative}}$$

* The decision of whether to use precision or recall depends on the type of problem being solved.
* If the goal is to detect all Positive samples, then use recall.
* Use precision if false positives are costly, i.e. the problem is sensitive to wrongly classifying a sample as Positive.

* The ROC graph plots the True Positive rate on the y axis against the False Positive rate on the x axis.
* The area under the ROC curve (AUC) summarizes how well the classifier separates the classes; the higher the value, the better the model performs.

In [22]:
// Classification report.
std::cout << "Accuracy: " << Accuracy(output, Ytest) << std::endl;
ClassificationReport(output, Ytest);

Accuracy: 0.9200
                    precision         recall       f1-score        support

              0        1.00            0.82          0.90               9
              1        0.88            1.00          0.93              14
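
As a sanity check, the class 1 row can be reproduced by hand. The counts used here are inferred from the printed values rather than printed by the notebook: 14 true positives, 2 false positives and 0 false negatives for class 1, and 9 correct predictions for class 0 out of 25 test samples.

$$Precision_1 = \frac{14}{14 + 2} = 0.875 \approx 0.88, \qquad Recall_1 = \frac{14}{14 + 0} = 1.00$$

$$F1_1 = \frac{2 \cdot 0.875 \cdot 1.00}{0.875 + 1.00} \approx 0.93, \qquad Accuracy = \frac{9 + 14}{25} = 0.92$$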


In [23]:
// Plot ROC AUC Curve to visualize the performance of the model on TP & FP.
RocAucPlot("./data/ytest.csv", "./data/probabilities.csv", "ROC AUC Curve");
auto img = xw::image_from_file("./plots/ROC AUC Curve.png").finalize();
img


### Conclusion
From the above classification report (accuracy of 0.92 on the test set) & ROC AUC curve, we can infer that our Naive Bayes Classifier performs well on the micro chip QA dataset. Feel free to experiment with the split ratio and other settings.