# Bayesian Classifiers

## Introduction

### Outline
In this notebook, we discuss a few different (but closely related) classifiers based on Bayes' Theorem:
1. Bayes Classifier
2. Naive Bayes Classifier
3. Linear Discriminant Analysis
4. Quadratic Discriminant Analysis
I refer to these as Bayesian classifiers because they are all based on Bayes' Theorem, but differ in the assumptions they make about the distribution of the variables in the dataset.

### Setup of dataset
In this notebook, we assume that we are given a labelled (training) dataset with the following format:
- There are features $X_1,\dotsc,X_n$ (continuous and/or discrete).
- There is a categorical target $Y$ which takes values in some finite set $\mathcal{C} = \{c_1,\dotsc,c_k\}$. 
- There are $m$ instances in our dataset. 
- As usual, we denote by $\mathbf{x}_i \in \mathbb{R}^m$ the column vector of values of the $i$-th feature, and by $\mathbf{y} \in \mathcal{C}^m$ the column vector of true labels.

### (Soft) Classifiers
Our goal is to predict the label $c \in \mathcal{C}$ given a row of features $\vec{x} \in \mathbb{R}^n$. Thus, we want to construct a **classifier**, i.e. a function
\begin{equation*}
    F : \mathbb{R}^n \to \mathcal{C},
\end{equation*}
which assigns to any instance $\vec{x} = (x_1,\dotsc,x_n)$ of features a class (or label) $F(\vec{x}) \in \mathcal{C}$. 

For this purpose, it is natural to first construct a **soft classifier**, which assigns to each instance $\vec{x}$ an entire probability distribution over the classes $\mathcal{C}$. Since we are in the discrete case, asking for a probability distribution over the classes is equivalent to asking for a **probability mass function** (PMF), i.e. a function
\begin{equation*}
    p: \mathcal{C} \to [0,1],
\end{equation*}
such that $\sum_{c \in \mathcal{C}} p(c) = 1$. 

So, formally speaking, a soft classifier can be thought of as a function
\begin{align*}
    \mathbb{R}^n & \to \{ \textup{probability distributions over } \mathcal{C} \}, \\
    \vec{x} & \mapsto p_{\vec{x}},
\end{align*}
which assigns to each instance $\vec{x}$ a probability distribution $p_{\vec{x}}$ over the classes $\mathcal{C}$. The hard classifier $F$ is then obtained by taking the class with the highest probability:
\begin{equation*}
    F(\vec{x}) = \argmax_{c \in \mathcal{C}} \; p_{\vec{x}}(c).
\end{equation*}
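
As a minimal sketch of the passage from a soft to a hard classifier, the snippet below stores a PMF over three classes as a dictionary and takes the argmax. The class names and probabilities are made up purely for illustration, and `hard_classify` is a hypothetical helper name.

```python
import math

# Hypothetical PMF p_x over the classes for a single instance x.
p_x = {"acc": 0.2, "good": 0.1, "unacc": 0.7}

# A PMF takes values in [0, 1] and sums to 1.
assert all(0.0 <= p <= 1.0 for p in p_x.values())
assert math.isclose(sum(p_x.values()), 1.0)

def hard_classify(pmf):
    """Return the class with the highest probability.

    Ties are broken in favour of the class that appears first in the dictionary.
    """
    return max(pmf, key=pmf.get)

prediction = hard_classify(p_x)
print(prediction)
```

Note that the argmax discards the information about how confident the soft classifier was, which is one reason to keep the PMF around when it is available.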


## Bayes Classifier

### The most natural soft classifier
Consider the problem of assigning to each instance $\vec{x}$ a probability distribution over the classes $\mathcal{C}$. Such a distribution needs to come (somehow) from the data. There is basically only one (natural) way to obtain such a distribution from the dataset, namely, we can use the **conditional probability** of the class given the features:
\begin{align*}
    p_{\vec{x}} : \mathcal{C} & \to [0,1], \\
    c & \mapsto p(c | \vec{x}).
\end{align*}
This is the **Bayes Classifier**. It is the most natural classifier because its logic is very intuitive: we look at the data and say "in the training dataset, given that the features were $\vec{x}$, the probability distribution of the labels was $p(c | \vec{x})$, so I'm going to predict that the same will hold for future data in which the features are $\vec{x}$".

To convert this soft classifier into a hard classifier, we take the class with the highest probability conditional on the features being equal to $\vec{x}$:
\begin{equation*}
    F(\vec{x}) = \argmax_{c \in \mathcal{C}} \; p(c | \vec{x}).
\end{equation*}

### Implementing the Bayes Classifier
Just to keep up with our ongoing tradition of implementing ML models from scratch, we also implement the Bayes Classifier below, with some caveats:
- We assume that the features are discrete, so we can use a simple frequency count to estimate the conditional probabilities.
- We assume that we have a small number of features and classes, so we can use a brute-force approach to compute the conditional probabilities.
- In the case of an unseen combination of feature values, we assign the prior distribution of the classes to the instance.

In [None]:
import pandas as pd
import numpy as np
import itertools

class MyBayesClassifier:
    """
    MyBayesClassifier implements a simple Bayesian classifier for categorical data.
    It computes class priors and conditional probabilities based on the training data,
    and uses these estimates to compute posterior probabilities for new observations.
    If an unseen combination of feature values is encountered during prediction, the
    classifier defaults to the overall class priors.

    Attributes:
        X : pandas.DataFrame or array
            The feature data used during training. Internally, X is stored as a DataFrame.
        y : pandas.Series
            The target (label) vector corresponding to the training examples.
        classes : list
            A sorted list of unique target class values extracted from y.
        priors : dict
            A mapping from each class label to its prior probability based on the training data.
        cond_probs : dict
            A dictionary mapping a tuple of feature values (representing a unique combination)
            to a list of posterior probabilities for each class. The order of probabilities in
            the list corresponds to the order of classes in the 'classes' attribute.

    Methods:
        fit(X, y):
            Accepts a DataFrame or array of categorical features X and a vector y of targets.
            Computes and stores the unique classes, the class prior probabilities, and the
            conditional probabilities for each observed combination of feature values.
        
        predict_proba(X_test):
            Given new observations (as a DataFrame or array), returns a DataFrame of
            posterior probabilities for each class. For any row where the feature combination
            is unseen in the training data, the overall class priors are returned.
        
        predict(X_test):
            Uses predict_proba to calculate the posterior probabilities, and returns a Series
            containing the class label with the highest probability for each observation.
            
        posteriors():
            Returns a DataFrame listing every possible combination of feature values (based on the training data)
            and the corresponding conditional distribution (posterior probabilities) of y. There is one column
            per feature and one column per class.
        
        unseen_instances(X_test):
            Given a new test dataset, returns a DataFrame containing the rows representing feature combinations
            that were not observed in the training data.
            
    Example usage:
        clf = MyBayesClassifier()
        clf.fit(X_train, y_train)
        proba = clf.predict_proba(X_test)
        predictions = clf.predict(X_test)
        posteriors = clf.posteriors()
        new_instances = clf.unseen_instances(X_test)
    """
    def __init__(self):
        self.X = None
        self.y = None
        self.classes = None
        self.priors = None
        self.cond_probs = {}
    
    def fit(self, X, y):
        """
        Computes classes, priors, and conditional probabilities for 
        each unique combination of feature values in X.
        
        Parameters:
            X (pd.DataFrame or np.array): Features (categorical).
            y (pd.Series or np.array): Column vector of targets.
        """
        # If X is not a DataFrame, convert it.
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        self.X = X.copy()
        self.y = pd.Series(y).copy()
        
        # Determine the list of classes.
        self.classes = sorted(self.y.unique())
        
        # Compute class priors.
        prior_series = self.y.value_counts(normalize=True)
        self.priors = prior_series.to_dict()
        
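        # Compute conditional probabilities for each observed combination.
        # Reset any estimates left over from a previous call to fit.
        self.cond_probs = {}
        # Attach y to a copy of X for grouping. The values are assigned
        # positionally so that differing indexes cannot misalign the rows.
        df_train = self.X.copy()
        df_train["target"] = self.y.to_numpy()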
        
        # Group by the feature columns.
        grouped = df_train.groupby(list(self.X.columns))
        
        for combo, group in grouped:
            # Ensure combo is a tuple even if single feature.
            if not isinstance(combo, tuple):
                combo = (combo,)
                
            # Count occurrences of each class for this feature combination.
            counts = group["target"].value_counts().to_dict()
            total = sum(counts.values())
            
            # Build the posterior probability vector in the order of self.classes.
            prob_vector = [counts.get(cls, 0) / total for cls in self.classes]
            
            self.cond_probs[combo] = prob_vector
            
        return self
    
    def predict_proba(self, X_test):
        """
        For each row in X_test, returns the posterior probabilities over classes.
        If a combination of feature values is unseen, the overall class priors are used.
        
        Parameters:
            X_test (pd.DataFrame or np.array): Test features.
        
        Returns:
            pd.DataFrame: DataFrame of posterior probabilities with columns corresponding
                          to the classes.
        """
        # Convert to DataFrame if necessary.
        if not isinstance(X_test, pd.DataFrame):
            X_test = pd.DataFrame(X_test, columns=self.X.columns)
        
        # Define a lookup function for a row of features.
        def lookup(row):
            combo = tuple(row)
            if combo in self.cond_probs:
                return self.cond_probs[combo]
            else:
                # Return priors in the order of self.classes.
                return [self.priors.get(cls, 0) for cls in self.classes]
        
        # Apply the lookup along axis=1.
        proba_series = X_test.apply(lookup, axis=1)
        proba_df = pd.DataFrame(proba_series.tolist(), 
                                index=X_test.index, 
                                columns=self.classes)
        return proba_df
    
    def predict(self, X_test):
        """
        Returns the predicted class label for each row in X_test.
        
        Parameters:
            X_test (pd.DataFrame or np.array): Test features.
        
        Returns:
            pd.Series: Predicted class for each row.
        """
        proba_df = self.predict_proba(X_test)
        # Return the class with maximum probability along each row.
        predictions = proba_df.idxmax(axis=1)
        return predictions
    
    def posteriors(self):
        """
        Returns a DataFrame with every possible combination of feature values from the training data 
        and the corresponding conditional distribution (posterior probabilities) of y.
        
        The resulting DataFrame contains one column per feature and one column per class.
        For any combination not observed during training, the overall class priors are returned.
        
        Returns:
            pd.DataFrame: DataFrame where each row represents a unique combination of feature values,
                          with feature columns (same names as in training data) and additional columns for 
                          each class (named using the class value).
        """
        # For each feature column, get unique sorted values.
        unique_vals = {col: sorted(self.X[col].unique()) for col in self.X.columns}
        # Generate all possible combinations.
        combos = list(itertools.product(*[unique_vals[col] for col in self.X.columns]))
        
        records = []
        for combo in combos:
            record = {col: combo[i] for i, col in enumerate(self.X.columns)}
            # Use cond_probs if seen; otherwise, use overall priors.
            if combo in self.cond_probs:
                probs = self.cond_probs[combo]
            else:
                probs = [self.priors.get(cls, 0) for cls in self.classes]
            for idx, cls in enumerate(self.classes):
                record[cls] = probs[idx]
            records.append(record)
        return pd.DataFrame(records)
    
    def unseen_instances(self, X_test):
        """
        Given a test dataset, returns the sub-dataframe containing rows that are not found 
        in the training data (i.e. their feature combination was not observed during training).
        
        Parameters:
            X_test (pd.DataFrame or np.array): Test features.
        
        Returns:
            pd.DataFrame: Sub-dataframe of X_test of the unseen instances.
        """
        # Convert to DataFrame if necessary.
        if not isinstance(X_test, pd.DataFrame):
            X_test = pd.DataFrame(X_test, columns=self.X.columns)
        
        # Mark rows whose feature combination is not in cond_probs.
        unseen_mask = X_test.apply(lambda row: tuple(row) not in self.cond_probs, axis=1)
        return X_test[unseen_mask]

Let's test out the Bayes Classifier on the `car_evaluation` dataset. 

In [None]:
#read in the car evaluation datasets
cars_train = pd.read_csv("../data/classification/car_evaluation/train.csv")
cars_test = pd.read_csv("../data/classification/car_evaluation/test.csv")

# Create X and y for train and test sets
X_train = cars_train.drop(columns=["Y"])
y_train = cars_train["Y"]
X_test = cars_test.drop(columns=["Y"])
y_test = cars_test["Y"]

# Initialize and fit the classifier.
clf = MyBayesClassifier()
clf.fit(X_train, y_train)

# Predict probabilities and classes for the train and test set
train_proba = clf.predict_proba(X_train)
train_predictions = clf.predict(X_train)
test_proba = clf.predict_proba(X_test)
test_predictions = clf.predict(X_test)

# append true and predicted values of train and test sets to the proba dataframes
train_proba["Y_true"] = y_train
train_proba["Y_pred"] = train_predictions
test_proba["Y_true"] = y_test
test_proba["Y_pred"] = test_predictions

### Limitations of the Bayes Classifier
The Bayes Classifier is a very natural classifier, but it has a few limitations. We give two of them below:
1. **Unseen features**: If we are given an instance $\vec{x}$ which does not appear in the training set, then the Bayes classifier does not know how to assign a probability distribution to the classes. There are a few ways to deal with this issue, but none of them is very satisfying:
    - We can assign a uniform distribution over the classes, i.e. 
    \begin{equation*}
        p(c | \vec{x}) = \frac{1}{|\mathcal{C}|} \textup{ for all } c \in \mathcal{C}.
    \end{equation*}
    - We can assign the prior distribution of the classes, i.e. 
    \begin{equation*}
        p(c | \vec{x}) = p(c) \textup{ for all } c \in \mathcal{C}.
    \end{equation*}
    - If the features are continuous (or more generally, if we have a way of identifying feature values "near" $\vec{x}$), then we can consider a small sample of our training data near the point $\vec{x} \in \mathbb{R}^n$ and use the conditional probability of the classes given this sample. 
2. **Computationally expensive**: The Bayes classifier requires us to compute the conditional probability of the classes given the features, which is computationally expensive. This is not a problem if we have a small number of features and classes, but it can be very expensive if we have many features and/or classes. Moreover, in general there aren't any particular assumptions we can make about the joint distributions which would allow us to compute the conditional probabilities in a more efficient way.
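
To get a feel for how quickly both limitations bite, here is a small sketch on synthetic data. The numbers (6 categorical features with 4 values each, 500 training rows drawn uniformly at random) are assumptions chosen only for illustration. The snippet compares the number of possible feature combinations with the number that actually appears in the training set.

```python
import random

random.seed(0)

# Hypothetical setup: 6 categorical features, each with 4 possible values.
n_features, n_values, n_train = 6, 4, 500
values = range(n_values)

# Number of distinct feature combinations that could occur.
n_possible = n_values ** n_features

# Draw a synthetic training set uniformly at random and keep the distinct rows.
train_rows = {
    tuple(random.choice(values) for _ in range(n_features))
    for _ in range(n_train)
}
n_seen = len(train_rows)

print(f"possible combinations: {n_possible}")
print(f"seen in training:      {n_seen}")
print(f"fraction never seen:   {1 - n_seen / n_possible:.2%}")
```

Since at most 500 of the 4096 combinations can be observed, the large majority of possible instances would fall back to the prior (or some other default) at prediction time, and every additional feature multiplies the number of combinations by another factor of 4.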

### Bayes Theorem recalled
Recall from last class that Bayes' Theorem allows us to re-write the posteriors as follows:
\begin{equation*}
    p(c | \vec{x}) = \frac{p(\vec{x} | c) p(c)}{p(\vec{x})}.
\end{equation*}
Here, we have the following terminology and concepts:
- **Prior** $p(c)$:
    
    This is the probability $P(Y=c)$ that the target is $c$ before we have observed the features. We either make a reasonable assumption about this probability, or we estimate it from the data (e.g. by taking it to be the fraction of the training data that has target $c$ in the discrete case, or by computing an approximation to the density of the target at $c$ in the continuous case).
- **Likelihood** $p(\vec{x} | c)$:

    This is the conditional probability $P(\vec{X} = \vec{x}|Y=c)$ of observing the features $\vec{x}$ given that the target is $c$.

- **Evidence** $p(\vec{x})$:

    This is the probability $P(\vec{X} = \vec{x})$ of observing the features $\vec{x}$, regardless of the target. 

- **Posterior** $p(c | \vec{x})$:

    This is the probability $P(Y = c | \vec{X} = \vec{x})$ that the target is $c$ given that we have observed the features $\vec{x}$. 

Thus, the (soft) Bayes Classifier can be re-written as the mapping
\begin{align*}
    \mathbb{R}^n & \to \{ \textup{probability distributions over } \mathcal{C} \}, \\
    \vec{x} & \mapsto \left( p_{\vec{x}}(c) = \frac{p(\vec{x} | c) p(c)}{p(\vec{x})} \right).
\end{align*}
The hard classifier is then obtained by taking the class with the highest probability (which allows us to drop the denominator term $p(\vec{x})$):
\begin{align*}
    F(\vec{x}) & = \argmax_{c \in \mathcal{C}} \; p(c | \vec{x}) \\
    & = \argmax_{c \in \mathcal{C}} \; p(\vec{x} | c) p(c).
\end{align*}
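
The following worked example is a sketch with made-up numbers: the priors and likelihoods are hypothetical values for a single fixed instance $\vec{x}$. It computes the evidence as $p(\vec{x}) = \sum_{c} p(\vec{x} | c) p(c)$, normalizes to obtain the posteriors, and checks that the argmax is the same with or without the denominator.

```python
# Hypothetical priors p(c) and likelihoods p(x | c) for one fixed instance x.
priors = {"c1": 0.6, "c2": 0.3, "c3": 0.1}
likelihoods = {"c1": 0.05, "c2": 0.20, "c3": 0.30}

# Unnormalized scores p(x | c) * p(c), i.e. the numerator of Bayes' Theorem.
scores = {c: likelihoods[c] * priors[c] for c in priors}

# Evidence p(x): sum of the numerators over all classes.
evidence = sum(scores.values())

# Posteriors p(c | x): numerators divided by the evidence.
posteriors = {c: s / evidence for c, s in scores.items()}

# Dividing by a positive constant does not change which class is largest.
pred_from_posteriors = max(posteriors, key=posteriors.get)
pred_from_scores = max(scores, key=scores.get)

print(posteriors)
print(pred_from_posteriors, pred_from_scores)
```

Here the class with the largest prior (`c1`) is not the predicted one: the likelihood of the observed features under `c2` is high enough to outweigh it.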

### From Bayes Classifier to Bayesian Classifiers
Although the Bayes Classifier is not useful from the point of view of practical applications, it is still theoretically very useful because it is (in a sense which can be made precise) the "best possible" classifier that can be constructed from the data. Thus, any soft classifier can be thought of as a weakening or approximation of the Bayes Classifier, in which certain assumptions are made about the likelihoods $p(\vec{x} \mid c)$ in order to make the computation of the posterior $p(c \mid \vec{x})$ more tractable. 

The priors $p(c)$ need only be computed once, and can be used for all instances $\vec{x}$. Thus, the main easing of computational burden comes from the fact that we estimate likelihoods rather than posteriors. The trade-off is that instead of constructing a PMF over the classes for each instance $\vec{x}$, we construct a PMF (or PDF) of the features given each class (which is "only" $|\mathcal{C}|$-many distributions). 

Three different assumptions lead to three different Bayesian classifiers:
1. Assuming conditional independence of the features given the class leads to the **Naive Bayes Classifier**.
2. Assuming that the likelihoods are Gaussian (with identical class-conditional covariances) leads to **Linear Discriminant Analysis**.
3. Assuming that the likelihoods are Gaussian (with different class-conditional covariances) leads to **Quadratic Discriminant Analysis**.
In the next sections, we will discuss each of these classifiers in turn.

## Naive Bayes Classifier

### Conditional independence
In NB, we will assume that the features $X_1,\dotsc,X_n$ are conditionally independent given $Y$. This means that the likelihoods factorize as
\begin{equation*}
    p(\vec{x} | c) = \prod_{i=1}^n p(x_i | c).
\end{equation*}
This is a very strong (some might say, *naive*) assumption which very rarely holds in real life, but it is very useful because it allows us to compute the likelihoods much more efficiently. Note that for each class $c \in \mathcal{C}$, there are now $n$ conditional distributions $p(x_i | c)$ to estimate rather than a single joint distribution $p(\vec{x} | c)$, so we trade one high-dimensional distribution for many low-dimensional ones.
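
To make the trade-off concrete, here is a rough count of the free parameters in each approach, under the simplifying assumption that every feature is categorical with the same number of values $v$. A full joint PMF over $n$ such features has $v^n - 1$ free probabilities per class, while the factorized form has $n(v - 1)$. The function names below are hypothetical.

```python
def joint_param_count(n_features, n_values, n_classes):
    """Free parameters needed to store p(x | c) as a full joint PMF per class."""
    return n_classes * (n_values ** n_features - 1)

def naive_param_count(n_features, n_values, n_classes):
    """Free parameters needed when p(x | c) factorizes over the features."""
    return n_classes * n_features * (n_values - 1)

# Compare the two counts for 3 classes and 4 values per feature.
for n in (2, 6, 10):
    print(n, joint_param_count(n, 4, 3), naive_param_count(n, 4, 3))
```

The joint count grows exponentially in the number of features, whereas the factorized count grows only linearly.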

Now, the computation of the likelihoods depends on whether the features are discrete or continuous. We will discuss the two cases separately below.

### Categorical Naive Bayes
Fix a class $c \in \mathcal{C}$, and assume that the features are all categorical (i.e. discrete). Then, we can estimate the conditional distributions $p(X_i = x_i |Y= c)$ by computing the corresponding fractions in the dataset:
\begin{equation*}
    p(x_i \mid c) = \frac{\textup{number of instances with } X_i = x_i \textup{ and } Y = c}{\textup{number of instances with } Y = c}.
\end{equation*}
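
The fraction estimate can be sketched from scratch in a few lines. The toy dataset below (two categorical features, two classes) is made up for illustration, and the helper names `likelihood` and `posterior` are hypothetical. No smoothing is applied, so the estimates are exactly the raw fractions.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set: rows of (feature_0, feature_1) and their labels.
X = [("low", "small"), ("low", "big"), ("high", "small"),
     ("high", "big"), ("high", "big"), ("low", "big")]
y = ["no", "no", "no", "yes", "yes", "yes"]

# Class counts and priors p(c).
class_counts = Counter(y)
priors = {c: class_counts[c] / len(y) for c in class_counts}

# feature_counts[(i, c)][v] = number of instances with X_i = v and Y = c.
feature_counts = defaultdict(Counter)
for row, label in zip(X, y):
    for i, value in enumerate(row):
        feature_counts[(i, label)][value] += 1

def likelihood(i, value, c):
    """Estimate p(X_i = value | Y = c) as a raw fraction of counts."""
    return feature_counts[(i, c)][value] / class_counts[c]

def posterior(row):
    """Naive Bayes posterior p(c | row), normalized over the classes."""
    scores = {}
    for c in priors:
        score = priors[c]
        for i, value in enumerate(row):
            score *= likelihood(i, value, c)
        scores[c] = score
    total = sum(scores.values())  # caveat: zero if every class has a zero factor
    return {c: s / total for c, s in scores.items()}

post = posterior(("high", "big"))
print(post)
```

A single zero count makes the whole product for that class vanish (here, `small` never occurs with `yes`). Library implementations usually add a smoothing term to the counts to avoid this; scikit-learn's `CategoricalNB`, for instance, applies additive smoothing by default (`alpha=1.0`), so its estimates differ slightly from the raw fractions.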
To implement this, we can use the `CategoricalNB` class from `sklearn.naive_bayes`. We illustrate this on the `car_evaluation` dataset, which is a dataset of categorical features.

In [None]:
# To run sklearn's CategoricalNB
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

# To compute accuracy of our predictions
from sklearn.metrics import accuracy_score

# To visualize the results in a confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# read in car evaluation datasets
cars_train = pd.read_csv("../data/classification/car_evaluation/train.csv")
cars_test = pd.read_csv("../data/classification/car_evaluation/test.csv")

# Create X and y for train and test sets
X_train = cars_train.drop(columns=["Y"])
y_train = cars_train["Y"]
X_test = cars_test.drop(columns=["Y"])
y_test = cars_test["Y"]

# Convert categorical features into integer codes.
encoder = OrdinalEncoder()
X_train_enc = encoder.fit_transform(X_train)
X_test_enc = encoder.transform(X_test)

# Convert target labels into integer codes.
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)
# Reuse the encoding fitted on the training labels (do not refit on the test labels).
y_test_enc = label_encoder.transform(y_test)

# Create and train the CategoricalNB classifier.
# Create the classifier with the number of categories for each feature
n_categories = [len(encoder.categories_[i]) for i in range(len(encoder.categories_))]
clf = CategoricalNB(min_categories=n_categories)
clf.fit(X_train_enc, y_train_enc)

# Predict probabilities and classes for the train and test set
train_proba = clf.predict_proba(X_train_enc)
train_predictions = clf.predict(X_train_enc)
test_proba = clf.predict_proba(X_test_enc)
test_predictions = clf.predict(X_test_enc)

# Compute and print accuracy of our predictions (as a percentage) on train and test sets
train_accuracy = accuracy_score(y_train_enc, train_predictions)
test_accuracy = accuracy_score(y_test_enc, test_predictions)
print(f"Train accuracy: {train_accuracy * 100:.2f}%")
print(f"Test accuracy: {test_accuracy * 100:.2f}%")

# Print the class priors for the train and test set
print("Class priors in train set:")
print(cars_train["Y"].value_counts(normalize=True))
print("Class priors in test set:")
print(cars_test["Y"].value_counts(normalize=True))

# Convert numeric predictions back to original categorical labels.
train_predictions_orig = label_encoder.inverse_transform(train_predictions)
test_predictions_orig = label_encoder.inverse_transform(test_predictions)

# Compute confusion matrices using the true labels (already categorical) and the converted predictions.
train_cm = confusion_matrix(y_train, train_predictions_orig, labels=label_encoder.classes_)
test_cm = confusion_matrix(y_test, test_predictions_orig, labels=label_encoder.classes_)

# Convert the confusion matrices into DataFrames for prettier output, with the original categories as row and column labels.
train_cm_df = pd.DataFrame(train_cm, index=label_encoder.classes_, columns=label_encoder.classes_)
test_cm_df = pd.DataFrame(test_cm, index=label_encoder.classes_, columns=label_encoder.classes_)

# Compute fraction matrices: each element divided by the overall total of the confusion matrix
train_frac = (train_cm_df / train_cm_df.to_numpy().sum())*100
test_frac = (test_cm_df / test_cm_df.to_numpy().sum())*100

# Compute row-normalized matrices: each row divided by its total (in percentage)
train_row_norm = (train_cm_df.div(train_cm_df.sum(axis=1), axis=0)) * 100
test_row_norm = (test_cm_df.div(test_cm_df.sum(axis=1), axis=0)) * 100

# Create a figure with 3 rows and 2 columns of subplots
fig, axes = plt.subplots(3, 2, figsize=(8,8))

# First row: standard confusion matrices
sns.heatmap(train_cm_df, annot=True, fmt='d', cmap='Blues', ax=axes[0,0])
axes[0,0].set_title('Train predictions')
axes[0,0].set_xlabel('Predicted')
axes[0,0].set_ylabel('True')

sns.heatmap(test_cm_df, annot=True, fmt='d', cmap='Blues', ax=axes[0,1])
axes[0,1].set_title('Test predictions')
axes[0,1].set_xlabel('Predicted')
axes[0,1].set_ylabel('True')

# Second row: fraction matrices
sns.heatmap(train_frac, annot=True, fmt='.2f', cmap='Greens', ax=axes[1,0])
axes[1,0].set_title('Train prediction %')
axes[1,0].set_xlabel('Predicted')
axes[1,0].set_ylabel('True')

sns.heatmap(test_frac, annot=True, fmt='.2f', cmap='Greens', ax=axes[1,1])
axes[1,1].set_title('Test prediction %')
axes[1,1].set_xlabel('Predicted')
axes[1,1].set_ylabel('True')

# Third row: row-normalized matrices
sns.heatmap(train_row_norm, annot=True, fmt='.2f', cmap='Oranges', ax=axes[2,0])
axes[2,0].set_title('Train prediction % per true label')
axes[2,0].set_xlabel('Predicted')
axes[2,0].set_ylabel('True')

sns.heatmap(test_row_norm, annot=True, fmt='.2f', cmap='Oranges', ax=axes[2,1])
axes[2,1].set_title('Test prediction % per true label')
axes[2,1].set_xlabel('Predicted')
axes[2,1].set_ylabel('True')

plt.tight_layout()
plt.show()


### Gaussian Naive Bayes
Assume now that the features are all continuous, and fix a class $c \in \mathcal{C}$. In order to estimate the conditional distribution of $X_i$ given $Y=c$, we need to approximate its PDF. There are many ways to do this, but a very common and simple way is to assume that each feature is **Gaussian** (i.e. follows a **normal distribution**) within each class. The graph of the Gaussian PDF is a bell-shaped curve, and it is in fact *the* Bell Curve. Recall that the PDF of a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ is given by
\begin{equation*}
    p(x) = \frac{1}{\sqrt{2 \pi} \sigma} \exp \left( {-\frac{(x - \mu)^2}{2 \sigma^2}} \right).
\end{equation*}
Thus, we can estimate the conditional distribution of $X_i$ given $Y=c$ by computing the mean $\mu_{i,c}$ and variance $\sigma_{i,c}^2$ of $X_i$ over all instances in the training set with $Y=c$. So, if there are $m_c$ such instances, and $x_i^{(1)},\dotsc,x_i^{(m_c)}$ denote their values of the feature $X_i$, then we have
\begin{align*}
    \mu_{i,c} & = \frac{1}{m_c} \sum_{j=1}^{m_c} x_i^{(j)}, \\
    \sigma_{i,c}^2 & = \frac{1}{m_c} \sum_{j=1}^{m_c} \left( x_i^{(j)} - \mu_{i,c} \right)^2.
\end{align*}
We can then use these estimates to compute the conditional distribution of $X_i$ given $Y=c$ as
\begin{equation*}
    p(x_i \mid c) = \frac{1}{\sqrt{2 \pi} \sigma_{i,c}} \exp \left( -\frac{(x_i - \mu_{i,c})^2}{2 \sigma_{i,c}^2} \right).
\end{equation*}
Thus, the formula for the posterior is given by
\begin{align*}
    p(c \mid \vec{x}) & = \frac{p(c)}{p(\vec{x})} \cdot \prod_{i=1}^n p(x_i \mid c) \\
    & = \frac{p(c)}{p(\vec{x})} \cdot \prod_{i=1}^n \frac{1}{\sqrt{2 \pi} \sigma_{i,c}} \exp \left( {-\frac{(x_i - \mu_{i,c})^2}{2 \sigma_{i,c}^2}} \right).
\end{align*}
It is convenient to take the log of both sides to get the **log posterior**:
\begin{align*}
    \log p(c \mid \vec{x}) & = \log p(c) - \log p(\vec{x}) + \sum_{i=1}^n \left( -\log(\sqrt{2 \pi} \sigma_{i,c}) - \frac{(x_i - \mu_{i,c})^2}{2 \sigma_{i,c}^2} \right) \\
    & = \log p(c) - \log p(\vec{x}) - \sum_{i=1}^n \log(\sqrt{2 \pi} \sigma_{i,c}) - \sum_{i=1}^n \frac{(x_i - \mu_{i,c})^2}{2 \sigma_{i,c}^2}.
\end{align*}
The hard classifier is then obtained by taking the class with the highest log posterior. Since $\log p(\vec{x})$ does not depend on $c$, it can be dropped from the argmax:
\begin{align*}
    F_{\textup{NB}}(\vec{x}) & = \argmax_{c \in \mathcal{C}} \; \log p(c | \vec{x})\\
    & = \argmax_{c \in \mathcal{C}} \; \left( \log p(c) - \sum_{i=1}^n \log(\sqrt{2 \pi} \sigma_{i,c}) - \sum_{i=1}^n \frac{(x_i - \mu_{i,c})^2}{2 \sigma_{i,c}^2} \right).
\end{align*}
So, if we put $A_c = \sum_{i=1}^n \log(\sqrt{2 \pi} \sigma_{i,c}) - \log p(c)$, then we have
\begin{align*}
    F_{\textup{NB}}(\vec{x}) & = \argmin_{c \in \mathcal{C}} \; \left( \sum_{i=1}^n \frac{(x_i - \mu_{i,c})^2}{2 \sigma_{i,c}^2} + A_c \right).
\end{align*}
(NOTE: in the formulas for the posterior and the classifier, each $x_i$ is an entry in a row vector $\vec{x}$, so we are summing over the entries of $\vec{x}$, not over instances).
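
Here is a minimal NumPy sketch of the resulting argmin rule, assuming a separate mean and variance is estimated for every (class, feature) pair. The data is synthetic (two classes, two features, parameters chosen arbitrarily), and the names `A` and `predict` are hypothetical. The constant $A_c$ collects the log-normalizers of the Gaussians and the log prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two classes, two features.
X0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.5], size=(100, 2))
X1 = rng.normal(loc=[3.0, 2.0], scale=[0.5, 1.5], size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

classes = np.unique(y)

# Priors, per-class feature means and per-class feature variances.
priors = np.array([np.mean(y == c) for c in classes])        # shape (k,)
mu = np.array([X[y == c].mean(axis=0) for c in classes])     # shape (k, n)
var = np.array([X[y == c].var(axis=0) for c in classes])     # shape (k, n), 1/m_c normalization

# A_c = sum_i log(sqrt(2 pi) * sigma_{i,c}) - log p(c),
# using log(sqrt(2 pi) * sigma) = 0.5 * log(2 pi * sigma^2).
A = 0.5 * np.log(2 * np.pi * var).sum(axis=1) - np.log(priors)

def predict(X_new):
    """Return the class minimizing the variance-scaled squared distance plus A_c."""
    diff = X_new[:, None, :] - mu[None, :, :]                 # shape (m, k, n)
    dist = (diff ** 2 / (2 * var[None, :, :])).sum(axis=2)    # shape (m, k)
    return classes[np.argmin(dist + A[None, :], axis=1)]

train_accuracy = np.mean(predict(X) == y)
print(f"Train accuracy: {train_accuracy * 100:.2f}%")
```

Working with the log posterior rather than the product of densities also avoids numerical underflow when the number of features is large.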

### Implementing Gaussian Naive Bayes
We can implement Gaussian Naive Bayes using the `GaussianNB` class from `sklearn.naive_bayes`. We illustrate this on the `iris` dataset, which is a dataset of continuous features.

In [None]:
# import Gaussian NB
from sklearn.naive_bayes import GaussianNB

# read in iris datasets
iris_train = pd.read_csv("../data/classification/iris/train.csv")
iris_test = pd.read_csv("../data/classification/iris/test.csv")

# Create X and y for train and test sets
X_train = iris_train.drop(columns=["Y"])
y_train = iris_train["Y"]
X_test = iris_test.drop(columns=["Y"])
y_test = iris_test["Y"]

# Print the class priors for the train and test set
print("Class priors in train set:")
print(iris_train["Y"].value_counts(normalize=True))
print("Class priors in test set:")
print(iris_test["Y"].value_counts(normalize=True))

# Initialize and fit the classifier.
clf = GaussianNB()
clf.fit(X_train, y_train)

# Predict probabilities and classes for the train and test set
train_proba = clf.predict_proba(X_train)
train_predictions = clf.predict(X_train)
test_proba = clf.predict_proba(X_test)
test_predictions = clf.predict(X_test)

# Compute and print accuracy of our predictions (as a percentage) on train and test sets
train_accuracy = accuracy_score(y_train, train_predictions)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Train accuracy: {train_accuracy * 100:.2f}%")
print(f"Test accuracy: {test_accuracy * 100:.2f}%")

# Compute confusion matrices for train and test sets
train_cm = confusion_matrix(y_train, train_predictions, labels=clf.classes_)
test_cm = confusion_matrix(y_test, test_predictions, labels=clf.classes_)

# Compute percentage matrices: each element divided by the overall total of the confusion matrix
train_frac = (train_cm / train_cm.sum()) * 100
test_frac = (test_cm / test_cm.sum()) * 100

# Compute row-normalized matrices: each row divided by its total (in percentage)
train_row_norm = (train_cm / train_cm.sum(axis=1, keepdims=True)) * 100
test_row_norm = (test_cm / test_cm.sum(axis=1, keepdims=True)) * 100

# Create a figure with 3 rows and 2 columns of subplots
fig, axes = plt.subplots(3, 2, figsize=(6,6))
# First row: standard confusion matrices
sns.heatmap(train_cm, annot=True, fmt='d', cmap='Blues', ax=axes[0,0])
axes[0,0].set_title('Train predictions')
axes[0,0].set_xlabel('Predicted')
axes[0,0].set_ylabel('True')
sns.heatmap(test_cm, annot=True, fmt='d', cmap='Blues', ax=axes[0,1])
axes[0,1].set_title('Test predictions')
axes[0,1].set_xlabel('Predicted')
axes[0,1].set_ylabel('True')

# Second row: fraction matrices
sns.heatmap(train_frac, annot=True, fmt='.2f', cmap='Greens', ax=axes[1,0])
axes[1,0].set_title('Train prediction %')
axes[1,0].set_xlabel('Predicted')
axes[1,0].set_ylabel('True')
sns.heatmap(test_frac, annot=True, fmt='.2f', cmap='Greens', ax=axes[1,1])
axes[1,1].set_title('Test prediction %')
axes[1,1].set_xlabel('Predicted')
axes[1,1].set_ylabel('True')

# Third row: row-normalized matrices
sns.heatmap(train_row_norm, annot=True, fmt='.2f', cmap='Oranges', ax=axes[2,0])
axes[2,0].set_title('Train prediction % per true label')
axes[2,0].set_xlabel('Predicted')
axes[2,0].set_ylabel('True')
sns.heatmap(test_row_norm, annot=True, fmt='.2f', cmap='Oranges', ax=axes[2,1])
axes[2,1].set_title('Test prediction % per true label')
axes[2,1].set_xlabel('Predicted')
axes[2,1].set_ylabel('True')

plt.tight_layout()
plt.show()


That went quite well! Below, we visualize the individual estimated Gaussian distributions (blue curve) for each feature conditional on the class, and we overlay a kernel density estimate (KDE) of the corresponding training data for comparison (red curve).

In [None]:
from scipy.stats import norm

# Get feature names and class labels
features = X_train.columns  # iris features
classes = clf.classes_
theta = clf.theta_         # shape (n_classes, n_features)
var = clf.var_            # shape (n_classes, n_features)

n_features = len(features)
n_classes = len(classes)

sns.set_style("whitegrid")

# Create a grid of subplots: one row per feature and one column per class.
fig, axes = plt.subplots(n_features, n_classes, figsize=(4*n_classes, 3*n_features), sharex=False, sharey=False)

for i, feat in enumerate(features):
    # Define an x-axis range based on the overall training data for this feature.
    feat_data = X_train[feat]
    x_min, x_max = feat_data.min(), feat_data.max()
    x_values = np.linspace(x_min, x_max, 200)
    
    for j, cls in enumerate(classes):
        # Get the correct subplot axis.
        ax = axes[i, j] if n_features > 1 and n_classes > 1 else (axes[j] if n_features==1 else axes[i])
        
        # Filter the training data for the current class.
        data_cls = X_train.loc[y_train == cls, feat]
        # Plot the true distribution as a KDE (use a reddish color).
        sns.kdeplot(data_cls, ax=ax, color='darkred', label="True KDE", fill=True, alpha=0.2)
        
        # Extract the estimated Gaussian parameters for this feature in this class.
        mu = theta[j, i]
        std = np.sqrt(var[j, i])
        # Compute the Gaussian pdf at the x values.
        pdf = norm.pdf(x_values, loc=mu, scale=std)
        # Plot the estimated Gaussian (using a blue color).
        ax.plot(x_values, pdf, color='steelblue', label="Predicted Gaussian")
        
        ax.set_title(f"{feat} | Class {cls}")
        # Show the labels assigned to the KDE and the Gaussian curve above.
        ax.legend()

plt.tight_layout()
plt.show()

## Quadratic Discriminant Analysis

### Multi-variate Gaussian
Given continuous features $X_1,\dotsc,X_n$, we can assume that the joint distribution of the features given the class is a **multi-variate Gaussian**; this is a higher-dimensional analogue of the one-dimensional normal distribution. So, if there are two features, then it looks like a bell-shaped surface over the $X_1,X_2$ plane, and if there are three features, then it looks like a bell-shaped volume over the $X_1,X_2,X_3$ space (whatever the heck that is). 

The multi-variate Gaussian distribution is parameterized by a mean vector $\mu \in \mathbb{R}^n$ and a covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$, and its PDF is given (for each $\vec{x} \in \mathbb{R}^n$) by
\begin{equation*}
    p(\vec{x}) = \frac{1}{\sqrt{(2 \pi)^n |\Sigma|}} \exp \left( -\frac{1}{2} (\vec{x} - \mu)^\top \Sigma^{-1} (\vec{x} - \mu) \right),
\end{equation*}
where $|\Sigma|$ is the determinant of the covariance matrix $\Sigma$, which must be symmetric and positive definite for the formula to make sense. In practice, we estimate $\Sigma$ by the sample covariance matrix $[\textup{Cov}(\mathbf{x}_i,\mathbf{x}_j)]$. The determinant is a scalar (sometimes called the generalized variance) that measures how "spread out" the distribution is overall: if it is small, then the probability mass is concentrated in a small volume around the mean, and if it is large, then the mass is more spread out.
The inverse $\Sigma^{-1}$ of the covariance matrix is called the **precision matrix**. It determines how quickly the density decays as $\vec{x}$ moves away from $\mu$: the faster the quadratic form $(\vec{x} - \mu)^\top \Sigma^{-1} (\vec{x} - \mu)$ grows in a given direction, the more tightly the distribution is concentrated around the mean in that direction.
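
As a quick sanity check of the formula above, here is a minimal sketch that evaluates the density by hand and compares it with `scipy.stats.multivariate_normal`. The helper name `gaussian_pdf` and the particular values of `mu`, `Sigma` and `x` are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multi-variate Gaussian density at a single point x."""
    n = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu); solving a linear system
    # avoids forming the inverse explicitly.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Illustrative parameters (not taken from any dataset in this notebook).
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, 0.5])

p_manual = gaussian_pdf(x, mu, Sigma)
p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(p_manual, p_scipy)
```

The two printed values should agree up to floating-point error.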

### QDA assumptions
In QDA, we assume that the joint distribution of the features given the class is a multi-variate Gaussian distribution with mean vector $\mu_c \in \mathbb{R}^n$ and covariance matrix $\Sigma_c \in \mathbb{R}^{n \times n}$, i.e.
\begin{equation*}
    p(\vec{x} | c) = \frac{1}{\sqrt{(2 \pi)^n |\Sigma_c|}} \exp \left( -\frac{1}{2} (\vec{x} - \mu_c)^\top \Sigma_c^{-1} (\vec{x} - \mu_c) \right).
\end{equation*}
NOTE: we have dropped the assumption that the features are independent given the class! So, for each class $c \in \mathcal{C}$, we have to now estimate a single multi-variate Gaussian distribution $p(\vec{x} | c)$, rather than $n$ independent distributions $p(x_i | c)$. This approach allows us to capture the correlations between the features (conditioned on the class).
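
To make the estimation step concrete, the following sketch computes a prior, a mean vector and a full covariance matrix for each class. It uses synthetic data generated on the spot (not the datasets used in this notebook), and the dictionary `params` is just one convenient way to store the estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: the features are positively correlated in class 0
# and negatively correlated in class 1 (illustrative values only).
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)
X1 = rng.multivariate_normal([3, 3], [[1.0, -0.5], [-0.5, 2.0]], size=500)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 500)

params = {}
for c in np.unique(y):
    Xc = X[y == c]
    params[int(c)] = {
        "prior": len(Xc) / len(X),        # estimate of p(c)
        "mean": Xc.mean(axis=0),          # estimate of mu_c
        "cov": np.cov(Xc, rowvar=False),  # estimate of Sigma_c (divides by m_c - 1)
    }

for c, p in params.items():
    print(f"class {c}: prior={p['prior']:.2f}, mean={p['mean'].round(2)}")
    print(p["cov"].round(2))
```

The off-diagonal entries of the two estimated covariance matrices have opposite signs, which is exactly the kind of class-specific correlation that a naive Bayes model cannot represent.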

Then, the posterior is given by
\begin{align*}
    p(c | \vec{x}) & = \frac{p(c)}{p(\vec{x})} \cdot p(\vec{x} | c) \\
    & = \frac{1}{p(\vec{x})} \cdot \frac{p(c)}{\sqrt{(2 \pi)^n |\Sigma_c|}} \exp \left( -\frac{1}{2} (\vec{x} - \mu_c)^\top \Sigma_c^{-1} (\vec{x} - \mu_c) \right).
\end{align*}

Taking the log of both sides and setting $$A_c = \log p(c) - \frac{n}{2} \log(2 \pi) - \frac{1}{2} \log(|\Sigma_c|),$$ we get the following formula for the log posterior:
\begin{equation*}
    \log p(c \mid \vec{x}) = A_c - \frac{1}{2} (\vec{x} - \mu_c)^\top \Sigma_c^{-1} (\vec{x} - \mu_c) - \log p(\vec{x}).
\end{equation*}
The hard classifier is then obtained by taking the class with the highest log posterior:
\begin{align*}
    F_{\textup{QDA}}(\vec{x}) & = \argmax_{c \in \mathcal{C}} \; \log p(c | \vec{x})\\
    & = \argmax_{c \in \mathcal{C}} \; \left( A_c - \frac{1}{2} (\vec{x} - \mu_c)^\top \Sigma_c^{-1} (\vec{x} - \mu_c) - \log p(\vec{x}) \right)\\
    & = \argmin_{c \in \mathcal{C}} \; \left( (\vec{x} - \mu_c)^\top \Sigma_c^{-1} (\vec{x} - \mu_c) - 2 A_c \right),
\end{align*}
where we have dropped the term $-\log p(\vec{x})$ because it does not depend on $c$, and then multiplied by $-2$, which turns the maximum into a minimum.

NOTE: If you look at the quantity being minimized above, you will see that it is a quadratic function of $\vec{x}$: the quadratic form $(\vec{x} - \mu_c)^\top \Sigma_c^{-1} (\vec{x} - \mu_c)$ measures how far the instance $\vec{x}$ is from the mean $\mu_c$ of the class $c$, weighted by the precision matrix $\Sigma_c^{-1}$, and the remaining term is a class-dependent constant that accounts for the prior and the spread of the class. Thus, the hard classifier essentially assigns to $\vec{x}$ the class whose mean $\mu_c$ is closest to $\vec{x}$ in this weighted sense. Since the function is quadratic in $\vec{x}$, the decision boundaries between the classes are quadratic surfaces (hence the name **Quadratic Discriminant Analysis**).
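
The decision rule can also be sketched directly in NumPy. The function `qda_log_scores` below is a hypothetical helper (it is not part of `sklearn`) that returns $\log p(c) + \log p(\vec{x} | c)$ for each class, assuming the priors, means and covariance matrices are already known. The toy parameters are chosen so that both classes share the same mean and differ only in their spread, which makes the boundary a circle rather than a line.

```python
import numpy as np

def qda_log_scores(X, priors, means, covs):
    """Return an (m, k) array whose (i, c) entry is log p(c) + log p(x_i | c)."""
    m, n = X.shape
    scores = np.empty((m, len(priors)))
    for idx, (prior, mu, Sigma) in enumerate(zip(priors, means, covs)):
        diff = X - mu
        # Row-wise quadratic form (x - mu)^T Sigma^{-1} (x - mu).
        quad = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigma), diff)
        _, logdet = np.linalg.slogdet(Sigma)
        A_c = np.log(prior) - 0.5 * n * np.log(2 * np.pi) - 0.5 * logdet
        scores[:, idx] = A_c - 0.5 * quad
    return scores

# Toy parameters (illustrative only): same mean, different spread.
priors = [0.5, 0.5]
means = [np.zeros(2), np.zeros(2)]
covs = [0.25 * np.eye(2), 4.0 * np.eye(2)]

X_new = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, -3.0]])
scores = qda_log_scores(X_new, priors, means, covs)
preds = scores.argmax(axis=1)

# Normalizing exp(scores) over the classes recovers the posterior p(c | x);
# subtracting the row maximum first keeps the exponentials numerically stable.
shifted = np.exp(scores - scores.max(axis=1, keepdims=True))
posterior = shifted / shifted.sum(axis=1, keepdims=True)
print(preds)
print(posterior.round(3))
```

The point at the origin is assigned to the tightly concentrated class 0, while the two points far from the origin are assigned to the widely spread class 1, even though both classes have the same mean.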

### Implementing QDA
We illustrate this on the `wine` dataset, which is a dataset of continuous features. We can implement QDA using the `QuadraticDiscriminantAnalysis` class from `sklearn.discriminant_analysis`.

In [None]:
# import QDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

#read in wine datasets
wine_train = pd.read_csv("../data/classification/wine/train.csv")
wine_test = pd.read_csv("../data/classification/wine/test.csv")

# Create X and y for train and test sets
X_train = wine_train.drop(columns=["Y"])
y_train = wine_train["Y"]
X_test = wine_test.drop(columns=["Y"])
y_test = wine_test["Y"]

# Print the class priors for the train and test set
print("Class priors in train set:")
print(wine_train["Y"].value_counts(normalize=True))
print("Class priors in test set:")
print(wine_test["Y"].value_counts(normalize=True))

# Initialize and fit the classifier.
clf = QuadraticDiscriminantAnalysis()
clf.fit(X_train, y_train)

# Predict probabilities and classes for the train and test set
train_proba = clf.predict_proba(X_train)
train_predictions = clf.predict(X_train)
test_proba = clf.predict_proba(X_test)
test_predictions = clf.predict(X_test)

# Compute and print accuracy of our predictions (as a percentage) on train and test sets
train_accuracy = accuracy_score(y_train, train_predictions)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Train accuracy: {train_accuracy * 100:.2f}%")
print(f"Test accuracy: {test_accuracy * 100:.2f}%")

# Compute confusion matrices for train and test sets
train_cm = confusion_matrix(y_train, train_predictions, labels=clf.classes_)
test_cm = confusion_matrix(y_test, test_predictions, labels=clf.classes_)
# Compute percentage matrices: each element divided by the overall total of the confusion matrix
train_frac = (train_cm / train_cm.sum()) * 100
test_frac = (test_cm / test_cm.sum()) * 100
# Compute row-normalized matrices: each row divided by its total (in percentage)
train_row_norm = (train_cm / train_cm.sum(axis=1, keepdims=True)) * 100
test_row_norm = (test_cm / test_cm.sum(axis=1, keepdims=True)) * 100

sns.set_style("darkgrid")

# Create a figure with 3 rows and 2 columns of subplots
fig, axes = plt.subplots(3, 2, figsize=(6,6))
# First row: standard confusion matrices
sns.heatmap(train_cm, annot=True, fmt='d', cmap='Blues', ax=axes[0,0])
axes[0,0].set_title('Train predictions')
axes[0,0].set_xlabel('Predicted')
axes[0,0].set_ylabel('True')
sns.heatmap(test_cm, annot=True, fmt='d', cmap='Blues', ax=axes[0,1])
axes[0,1].set_title('Test predictions')
axes[0,1].set_xlabel('Predicted')
axes[0,1].set_ylabel('True')

# Second row: fraction matrices
sns.heatmap(train_frac, annot=True, fmt='.2f', cmap='Greens', ax=axes[1,0])
axes[1,0].set_title('Train prediction %')
axes[1,0].set_xlabel('Predicted')
axes[1,0].set_ylabel('True')
sns.heatmap(test_frac, annot=True, fmt='.2f', cmap='Greens', ax=axes[1,1])
axes[1,1].set_title('Test prediction %')
axes[1,1].set_xlabel('Predicted')
axes[1,1].set_ylabel('True')

# Third row: row-normalized matrices
sns.heatmap(train_row_norm, annot=True, fmt='.2f', cmap='Oranges', ax=axes[2,0])
axes[2,0].set_title('Train prediction % per true label')
axes[2,0].set_xlabel('Predicted')
axes[2,0].set_ylabel('True')
sns.heatmap(test_row_norm, annot=True, fmt='.2f', cmap='Oranges', ax=axes[2,1])
axes[2,1].set_title('Test prediction % per true label')
axes[2,1].set_xlabel('Predicted')
axes[2,1].set_ylabel('True')

plt.tight_layout()
plt.show()

## Linear Discriminant Analysis

### LDA assumptions
In LDA, we again assume that the joint distribution of the features given the class is a multi-variate Gaussian distribution with mean vector $\mu_c \in \mathbb{R}^n$, but we additionally assume that the covariance matrix is the same for all classes, i.e. $\Sigma_c = \Sigma$ for all $c \in \mathcal{C}$. As in QDA, the features are not assumed to be independent given the class; rather, they are assumed to be correlated in the same way across all classes. This is a stronger (more restrictive) assumption than the one made in QDA, but it means that we only have to estimate a single covariance matrix $\Sigma$ from all of the data, rather than a separate covariance matrix for each class.
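
One common way to estimate the shared covariance matrix is to pool the within-class scatter of all classes. The sketch below does this on synthetic data; `pooled_covariance` is a hypothetical helper, and dividing by $m - k$ is only one convention (implementations differ in how they normalize).

```python
import numpy as np

def pooled_covariance(X, y):
    """Pooled within-class covariance: total within-class scatter / (m - k)."""
    classes = np.unique(y)
    n = X.shape[1]
    scatter = np.zeros((n, n))
    for c in classes:
        Xc = X[y == c]
        diff = Xc - Xc.mean(axis=0)   # center each class at its own mean
        scatter += diff.T @ diff
    return scatter / (len(X) - len(classes))

# Synthetic data: two classes with different means but the same covariance.
rng = np.random.default_rng(0)
Sigma_true = np.array([[1.0, 0.6],
                       [0.6, 2.0]])
X = np.vstack([
    rng.multivariate_normal([0, 0], Sigma_true, size=300),
    rng.multivariate_normal([4, 1], Sigma_true, size=300),
])
y = np.repeat([0, 1], 300)

Sigma_hat = pooled_covariance(X, y)
print(Sigma_hat.round(2))
```

Centering each class at its own mean is essential: computing the covariance of the whole dataset around the global mean would mix the spread within the classes with the separation between them.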

The posterior is given by
\begin{align*}
    p(c | \vec{x}) & = \frac{p(c)}{p(\vec{x})} \cdot p(\vec{x} | c) \\
    & = \frac{1}{p(\vec{x})} \cdot \frac{p(c)}{\sqrt{(2 \pi)^n |\Sigma|}} \exp \left( -\frac{1}{2} (\vec{x} - \mu_c)^\top \Sigma^{-1} (\vec{x} - \mu_c) \right).
\end{align*}

Taking the log of both sides and setting $$A_c = \log p(c) - \frac{n}{2} \log(2 \pi) - \frac{1}{2} \log(|\Sigma|),$$ we get the following formula for the log posterior:
\begin{equation*}
    \log p(c \mid \vec{x}) = A_c - \frac{1}{2} (\vec{x} - \mu_c)^\top \Sigma^{-1} (\vec{x} - \mu_c) - \log p(\vec{x}).
\end{equation*}

The hard classifier is then obtained by taking the class with the highest log posterior:
\begin{align*}
    F_{\textup{LDA}}(\vec{x}) & = \argmax_{c \in \mathcal{C}} \; \log p(c | \vec{x})\\
    & = \argmax_{c \in \mathcal{C}} \; \left( A_c - \frac{1}{2} (\vec{x} - \mu_c)^\top \Sigma^{-1} (\vec{x} - \mu_c) - \log p(\vec{x}) \right)\\
    & = \argmin_{c \in \mathcal{C}} \; \left( (\vec{x} - \mu_c)^\top \Sigma^{-1} (\vec{x} - \mu_c) - 2 A_c \right),
\end{align*}
where we have dropped the term $-\log p(\vec{x})$ because it does not depend on $c$, and then multiplied by $-2$, which turns the maximum into a minimum.
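
To see where the word "linear" comes from, expand the quadratic form (using that $\Sigma^{-1}$ is symmetric):
\begin{equation*}
    (\vec{x} - \mu_c)^\top \Sigma^{-1} (\vec{x} - \mu_c) = \vec{x}^\top \Sigma^{-1} \vec{x} - 2 \mu_c^\top \Sigma^{-1} \vec{x} + \mu_c^\top \Sigma^{-1} \mu_c.
\end{equation*}
The first term does not depend on $c$, and neither do the terms $\frac{n}{2} \log(2 \pi)$ and $\frac{1}{2} \log(|\Sigma|)$ inside $A_c$, so they can all be dropped from the maximization of $A_c - \frac{1}{2} (\vec{x} - \mu_c)^\top \Sigma^{-1} (\vec{x} - \mu_c)$. This leaves
\begin{equation*}
    F_{\textup{LDA}}(\vec{x}) = \argmax_{c \in \mathcal{C}} \; \left( \mu_c^\top \Sigma^{-1} \vec{x} - \frac{1}{2} \mu_c^\top \Sigma^{-1} \mu_c + \log p(c) \right),
\end{equation*}
which is linear in $\vec{x}$, so the decision boundaries are hyperplanes. The following sketch checks numerically, on randomly generated parameters (all values are made up for illustration), that the linear scores and the quadratic-form scores pick the same class.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 3, 4                                  # number of features and classes
means = 2 * rng.normal(size=(k, n))          # one mean vector per class
B = rng.normal(size=(n, n))
Sigma = B @ B.T + n * np.eye(n)              # shared covariance (positive definite)
priors = np.array([0.1, 0.2, 0.3, 0.4])
X = 3 * rng.normal(size=(200, n))            # points to classify

Sigma_inv = np.linalg.inv(Sigma)

# Quadratic-form scores: log p(c) - 0.5 * (x - mu_c)^T Sigma^{-1} (x - mu_c).
diff = X[:, None, :] - means[None, :, :]     # shape (m, k, n)
quad = np.einsum("mkn,nj,mkj->mk", diff, Sigma_inv, diff)
quad_scores = np.log(priors) - 0.5 * quad

# Linear scores: mu_c^T Sigma^{-1} x - 0.5 * mu_c^T Sigma^{-1} mu_c + log p(c).
W = means @ Sigma_inv                        # row c holds mu_c^T Sigma^{-1}
b = np.log(priors) - 0.5 * np.einsum("kn,kn->k", W, means)
lin_scores = X @ W.T + b

pred_quad = quad_scores.argmax(axis=1)
pred_lin = lin_scores.argmax(axis=1)
print("Same predictions:", np.array_equal(pred_quad, pred_lin))
```

For each point, the two score arrays differ only by the class-independent term $-\frac{1}{2} \vec{x}^\top \Sigma^{-1} \vec{x}$, which is why the predicted classes coincide.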

### Implementing LDA
We illustrate this on the `wine` dataset, which is a dataset of continuous features. We can implement LDA using the `LinearDiscriminantAnalysis` class from `sklearn.discriminant_analysis`.

In [None]:
# import LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

#read in wine datasets
wine_train = pd.read_csv("../data/classification/wine/train.csv")
wine_test = pd.read_csv("../data/classification/wine/test.csv")

# Create X and y for train and test sets
X_train = wine_train.drop(columns=["Y"])
y_train = wine_train["Y"]
X_test = wine_test.drop(columns=["Y"])
y_test = wine_test["Y"]

# Print the class priors for the train and test set
print("Class priors in train set:")
print(wine_train["Y"].value_counts(normalize=True))
print("Class priors in test set:")
print(wine_test["Y"].value_counts(normalize=True))

# Initialize and fit the classifier.
clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)

# Predict probabilities and classes for the train and test set
train_proba = clf.predict_proba(X_train)
train_predictions = clf.predict(X_train)
test_proba = clf.predict_proba(X_test)
test_predictions = clf.predict(X_test)

# Compute and print accuracy of our predictions (as a percentage) on train and test sets
train_accuracy = accuracy_score(y_train, train_predictions)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Train accuracy: {train_accuracy * 100:.2f}%")
print(f"Test accuracy: {test_accuracy * 100:.2f}%")

# Compute confusion matrices for train and test sets
train_cm = confusion_matrix(y_train, train_predictions, labels=clf.classes_)
test_cm = confusion_matrix(y_test, test_predictions, labels=clf.classes_)
# Compute percentage matrices: each element divided by the overall total of the confusion matrix
train_frac = (train_cm / train_cm.sum()) * 100
test_frac = (test_cm / test_cm.sum()) * 100
# Compute row-normalized matrices: each row divided by its total (in percentage)
train_row_norm = (train_cm / train_cm.sum(axis=1, keepdims=True)) * 100
test_row_norm = (test_cm / test_cm.sum(axis=1, keepdims=True)) * 100

sns.set_style("darkgrid")

# Create a figure with 3 rows and 2 columns of subplots
fig, axes = plt.subplots(3, 2, figsize=(6,6))
# First row: standard confusion matrices
sns.heatmap(train_cm, annot=True, fmt='d', cmap='Blues', ax=axes[0,0])
axes[0,0].set_title('Train predictions')
axes[0,0].set_xlabel('Predicted')
axes[0,0].set_ylabel('True')
sns.heatmap(test_cm, annot=True, fmt='d', cmap='Blues', ax=axes[0,1])
axes[0,1].set_title('Test predictions')
axes[0,1].set_xlabel('Predicted')
axes[0,1].set_ylabel('True')

# Second row: fraction matrices
sns.heatmap(train_frac, annot=True, fmt='.2f', cmap='Greens', ax=axes[1,0])
axes[1,0].set_title('Train prediction %')
axes[1,0].set_xlabel('Predicted')
axes[1,0].set_ylabel('True')
sns.heatmap(test_frac, annot=True, fmt='.2f', cmap='Greens', ax=axes[1,1])
axes[1,1].set_title('Test prediction %')
axes[1,1].set_xlabel('Predicted')
axes[1,1].set_ylabel('True')

# Third row: row-normalized matrices
sns.heatmap(train_row_norm, annot=True, fmt='.2f', cmap='Oranges', ax=axes[2,0])
axes[2,0].set_title('Train prediction % per true label')
axes[2,0].set_xlabel('Predicted')
axes[2,0].set_ylabel('True')
sns.heatmap(test_row_norm, annot=True, fmt='.2f', cmap='Oranges', ax=axes[2,1])
axes[2,1].set_title('Test prediction % per true label')
axes[2,1].set_xlabel('Predicted')
axes[2,1].set_ylabel('True')

plt.tight_layout()
plt.show()