# Bayesian Classifiers

## Introduction

### Outline
In this notebook, we discuss a few different (but closely related) classifiers based on Bayes' theorem:
1. Bayes Classifier
2. Naive Bayes Classifier
3. Linear Discriminant Analysis
4. Quadratic Discriminant Analysis

I refer to these as Bayesian classifiers because they are all based on Bayes' theorem, but differ in the assumptions they make about the distribution of the variables in the dataset.

### Setup of dataset
In this notebook, we assume that we are given a labelled (training) dataset with the following format:
- There are features $X_1,\dotsc,X_n$ (continuous and/or discrete).
- There is a categorical target $Y$ which takes values in some finite set $\mathcal{C} = \{c_1,\dotsc,c_k\}$. 
- There are $m$ instances in our dataset. 
- As usual, we denote by $\mathbf{x}_i \in \mathbb{R}^m$ the column vector of values of the $i$-th feature across the $m$ instances, and by $\mathbf{y} \in \mathcal{C}^m$ the column vector of true labels.

### (Soft) Classifiers
Our goal is to predict the label $c \in \mathcal{C}$ given a row of features $\vec{x} \in \mathbb{R}^n$. Thus, we want to construct a **classifier**, i.e. a function
\begin{equation*}
    F : \mathbb{R}^n \to \mathcal{C},
\end{equation*}
which assigns to any instance $\vec{x} = (x_1,\dotsc,x_n)$ of features a class (or label) $F(\vec{x}) \in \mathcal{C}$. 

For this purpose, it is natural to first construct a **soft classifier**, which assigns to each instance $\vec{x}$ an entire probability distribution over the classes $\mathcal{C}$. Since we are in the discrete case, asking for a probability distribution over the classes is equivalent to asking for a **probability mass function** (PMF), i.e. a function
\begin{equation*}
    p: \mathcal{C} \to [0,1],
\end{equation*}
such that $\sum_{c \in \mathcal{C}} p(c) = 1$. 

So, formally speaking, a soft classifier can be thought of as a function
\begin{align*}
    \mathbb{R}^n & \to \{ \textup{probability distributions over } \mathcal{C} \}, \\
    \vec{x} & \mapsto p_{\vec{x}},
\end{align*}
which assigns to each instance $\vec{x}$ a probability distribution $p_{\vec{x}}$ over the classes $\mathcal{C}$. The hard classifier $F$ is then obtained by taking the class with the highest probability:
\begin{equation*}
    F(\vec{x}) = \argmax_{c \in \mathcal{C}} \; p_{\vec{x}}(c).
\end{equation*}
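As a minimal sketch of the step from a soft to a hard classifier, the snippet below represents a PMF $p_{\vec{x}}$ as a Python dictionary and takes the argmax. The function name `hard_classify` and the probability values are assumptions made up for illustration; they do not come from any dataset.

```python
def hard_classify(pmf):
    """Return the class with the highest probability in a PMF given as a dict."""
    total = sum(pmf.values())
    # A PMF must sum to 1 (up to floating-point error).
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"probabilities sum to {total}, expected 1")
    # Rank the classes by their probability; ties go to the first class encountered.
    return max(pmf, key=pmf.get)

# Hypothetical soft prediction p_x for a single instance x.
p_x = {"acc": 0.22, "good": 0.04, "unacc": 0.70, "vgood": 0.04}
label = hard_classify(p_x)
print(label)
```

Note that the argmax discards information: two instances with very different distributions can receive the same hard label.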


## Bayes Classifier

### The most natural soft classifier
Consider the problem of assigning to each instance $\vec{x}$ a probability distribution over the classes $\mathcal{C}$. Such a distribution needs to come (somehow) from the data. There is basically only one (natural) way to obtain such a distribution from the dataset, namely, we can use the **conditional probability** of the class given the features:
\begin{align*}
    p_{\vec{x}} : \mathcal{C} & \to [0,1], \\
    c & \mapsto p(c | \vec{x}).
\end{align*}
This is the **Bayes Classifier**. It is the most natural classifier because its logic is very intuitive: we look at the data and say "in the training dataset, given that the features were $\vec{x}$, the probability distribution of the labels was $p(c | \vec{x})$, so I'm going to predict the same will also hold for future data in which the features are $\vec{x}$".

To convert this soft classifier into a hard classifier, we take the class with the highest probability conditioned on the features being equal to $\vec{x}$:
\begin{equation*}
    F(\vec{x}) = \argmax_{c \in \mathcal{C}} \; p(c | \vec{x}).
\end{equation*}
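For discrete features, the conditional probabilities $p(c | \vec{x})$ can be estimated by simple frequency counts. The sketch below does this with `pandas.crosstab` on a toy table with a single feature; the column names `weather` and `play` and all the values are hypothetical and only serve to illustrate the idea.

```python
import pandas as pd

# Hypothetical toy training set with one discrete feature and a binary target.
toy = pd.DataFrame({
    "weather": ["sunny", "sunny", "sunny", "rainy", "rainy", "rainy", "rainy"],
    "play":    ["yes",   "yes",   "no",    "no",    "no",    "yes",   "no"],
})

# Each row of this table is the empirical distribution p(c | x) for one feature value x.
posterior = pd.crosstab(toy["weather"], toy["play"], normalize="index")

# Hard classifier: for each feature value, pick the class with the largest posterior.
hard = posterior.idxmax(axis=1)

print(posterior)
print(hard)
```

With several features the same counting applies, except that the rows are indexed by combinations of feature values rather than by a single value.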

### Limitations of the Bayes Classifier
The Bayes Classifier is a very natural classifier, but it has a few limitations. We give two of them below:
1. **Unseen features**: If we are given an instance $\vec{x}$ which does not appear in the training set, then the Bayes classifier will not know how to assign a probability distribution to the classes. There are a few potential ways to deal with this issue, but none of them is very satisfying:
    - We can assign a uniform distribution over the classes, i.e. 
    \begin{equation*}
        p(c | \vec{x}) = \frac{1}{|\mathcal{C}|} \textup{ for all } c \in \mathcal{C}.
    \end{equation*}
    - We can assign the prior distribution of the classes, i.e. 
    \begin{equation*}
        p(c | \vec{x}) = p(c) \textup{ for all } c \in \mathcal{C}.
    \end{equation*}
    - If the features are continuous (or more generally, if we have a way of identifying feature values "near" $\vec{x}$), then we can consider a small sample of our training data near the point $\vec{x} \in \mathbb{R}^n$ and use the conditional probability of the classes given this sample. 
2. **Computationally expensive**: The Bayes classifier requires us to estimate the conditional probability of the classes for every possible combination of feature values. This is not a problem if we have a small number of features and classes, but the number of combinations grows exponentially with the number of features, so both the computation and the amount of data needed quickly become prohibitive. Moreover, in general there aren't any particular assumptions we can make about the joint distribution which would allow us to compute the conditional probabilities in a more efficient way.
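
To get a feeling for how quickly the table of conditional probabilities grows, the following sketch counts the number of distinct feature vectors for discrete features. The helper `n_feature_combinations` and the cardinalities passed to it are assumptions chosen for illustration.

```python
import math

def n_feature_combinations(cardinalities):
    """Number of distinct feature vectors when feature i takes cardinalities[i] values."""
    return math.prod(cardinalities)

# Assumed example: six discrete features taking 4, 4, 4, 3, 3 and 3 values.
n_cells = n_feature_combinations([4, 4, 4, 3, 3, 3])

# Even with binary features, the table doubles with every feature that is added.
binary_growth = {n: n_feature_combinations([2] * n) for n in (10, 20, 30)}

print(n_cells)
print(binary_growth)
```

A training set with $m$ instances can cover at most $m$ of these combinations, so with many features most combinations are necessarily unseen, which ties the two limitations together.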

### Implementing the Bayes Classifier
Just to keep up with our ongoing tradition of implementing ML models from scratch, we also implement the Bayes Classifier below, with some caveats:
- We assume that the features are discrete, so we can use a simple frequency count to estimate the conditional probabilities.
- We assume that we have a small number of features and classes, so we can use a brute-force approach to compute the conditional probabilities.
- In the case of unseen features, we assign the prior distribution of the classes to the instance.

In [9]:
import pandas as pd
import numpy as np
import itertools

class MyBayesClassifier:
    """
    MyBayesClassifier implements a simple Bayesian classifier for categorical data.
    It computes class priors and conditional probabilities based on the training data,
    and uses these estimates to compute posterior probabilities for new observations.
    If an unseen combination of feature values is encountered during prediction, the
    classifier defaults to the overall class priors.

    Attributes:
        X : pandas.DataFrame or array
            The feature data used during training. Internally, X is stored as a DataFrame.
        y : pandas.Series
            The target (label) vector corresponding to the training examples.
        classes : list
            A sorted list of unique target class values extracted from y.
        priors : dict
            A mapping from each class label to its prior probability based on the training data.
        cond_probs : dict
            A dictionary mapping a tuple of feature values (representing a unique combination)
            to a list of posterior probabilities for each class. The order of probabilities in
            the list corresponds to the order of classes in the 'classes' attribute.

    Methods:
        fit(X, y):
            Accepts a DataFrame or array of categorical features X and a vector y of targets.
            Computes and stores the unique classes, the class prior probabilities, and the
            conditional probabilities for each observed combination of feature values.
        
        predict_proba(X_test):
            Given new observations (as a DataFrame or array), returns a DataFrame of
            posterior probabilities for each class. For any row where the feature combination
            is unseen in the training data, the overall class priors are returned.
        
        predict(X_test):
            Uses predict_proba to calculate the posterior probabilities, and returns a Series
            containing the class label with the highest probability for each observation.
            
        get_posteriors():
            Returns a DataFrame listing every possible combination of feature values (based on the training data)
            and the corresponding conditional distribution (posterior probabilities) of y. There is one column
            per feature and one column per class.
        
        unseen_instances(X_test):
            Given a new test dataset, returns a DataFrame containing the rows representing feature combinations
            that were not observed in the training data.
            
    Example usage:
        clf = MyBayesClassifier()
        clf.fit(X_train, y_train)
        proba = clf.predict_proba(X_test)
        predictions = clf.predict(X_test)
        all_posteriors = clf.get_posteriors()
        new_instances = clf.unseen_instances(X_test)
    """
    def __init__(self):
        self.X = None
        self.y = None
        self.classes = None
        self.priors = None
        self.cond_probs = {}
    
    def fit(self, X, y):
        """
        Computes classes, priors, and conditional probabilities for 
        each unique combination of feature values in X.
        
        Parameters:
            X (pd.DataFrame or np.array): Features (categorical).
            y (pd.Series or np.array): Column vector of targets.
        """
        # If X is not a DataFrame, convert it.
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        self.X = X.copy()
        self.y = pd.Series(y).copy()
        
        # Determine the list of classes.
        self.classes = sorted(self.y.unique())
        
        # Compute class priors.
        prior_series = self.y.value_counts(normalize=True)
        self.priors = prior_series.to_dict()
        
        # Compute conditional probabilities for each observed combination.
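        # Clear any table left over from a previous call to fit.
        self.cond_probs = {}
        # Concatenate X and y for grouping. Attach y by position (to_numpy), so
        # that a y whose index differs from X's index cannot be misaligned.
        df_train = self.X.copy()
        df_train["target"] = self.y.to_numpy()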
        
        # Group by the feature columns.
        grouped = df_train.groupby(list(self.X.columns))
        
        for combo, group in grouped:
            # Ensure combo is a tuple even if single feature.
            if not isinstance(combo, tuple):
                combo = (combo,)
                
            # Count occurrences of each class for this feature combination.
            counts = group["target"].value_counts().to_dict()
            total = sum(counts.values())
            
            # Build the posterior probability vector in the order of self.classes.
            prob_vector = [counts.get(cls, 0) / total for cls in self.classes]
            
            self.cond_probs[combo] = prob_vector
            
        return self
    
    def predict_proba(self, X_test):
        """
        For each row in X_test, returns the posterior probabilities over classes.
        If a combination of feature values is unseen, the overall class priors are used.
        
        Parameters:
            X_test (pd.DataFrame or np.array): Test features.
        
        Returns:
            pd.DataFrame: DataFrame of posterior probabilities with columns corresponding
                          to the classes.
        """
        # Convert to DataFrame if necessary.
        if not isinstance(X_test, pd.DataFrame):
            X_test = pd.DataFrame(X_test, columns=self.X.columns)
        
        # Define a lookup function for a row of features.
        def lookup(row):
            combo = tuple(row)
            if combo in self.cond_probs:
                return self.cond_probs[combo]
            else:
                # Return priors in the order of self.classes.
                return [self.priors.get(cls, 0) for cls in self.classes]
        
        # Apply the lookup along axis=1.
        proba_series = X_test.apply(lookup, axis=1)
        proba_df = pd.DataFrame(proba_series.tolist(), 
                                index=X_test.index, 
                                columns=self.classes)
        return proba_df
    
    def predict(self, X_test):
        """
        Returns the predicted class label for each row in X_test.
        
        Parameters:
            X_test (pd.DataFrame or np.array): Test features.
        
        Returns:
            pd.Series: Predicted class for each row.
        """
        proba_df = self.predict_proba(X_test)
        # Return the class with maximum probability along each row.
        predictions = proba_df.idxmax(axis=1)
        return predictions
    
    def get_posteriors(self):
        """
        Returns a DataFrame with every possible combination of feature values from the training data 
        and the corresponding conditional distribution (posterior probabilities) of y.
        
        The resulting DataFrame contains one column per feature and one column per class.
        For any combination not observed during training, the overall class priors are returned.
        
        Returns:
            pd.DataFrame: DataFrame where each row represents a unique combination of feature values,
                          with feature columns (same names as in training data) and additional columns for 
                          each class (named using the class value).
        """
        # For each feature column, get unique sorted values.
        unique_vals = {col: sorted(self.X[col].unique()) for col in self.X.columns}
        # Generate all possible combinations.
        combos = list(itertools.product(*[unique_vals[col] for col in self.X.columns]))
        
        records = []
        for combo in combos:
            record = {col: combo[i] for i, col in enumerate(self.X.columns)}
            # Use cond_probs if seen; otherwise, use overall priors.
            if combo in self.cond_probs:
                probs = self.cond_probs[combo]
            else:
                probs = [self.priors.get(cls, 0) for cls in self.classes]
            for idx, cls in enumerate(self.classes):
                record[cls] = probs[idx]
            records.append(record)
        return pd.DataFrame(records)
    
    def unseen_instances(self, X_test):
        """
        Given a test dataset, returns the sub-dataframe containing rows that are not found 
        in the training data (i.e. their feature combination was not observed during training).
        
        Parameters:
            X_test (pd.DataFrame or np.array): Test features.
        
        Returns:
            pd.DataFrame: Sub-dataframe of X_test of the unseen instances.
        """
        # Convert to DataFrame if necessary.
        if not isinstance(X_test, pd.DataFrame):
            X_test = pd.DataFrame(X_test, columns=self.X.columns)
        
        # Mark rows whose feature combination is not in cond_probs.
        unseen_mask = X_test.apply(lambda row: tuple(row) not in self.cond_probs, axis=1)
        return X_test[unseen_mask]

In [14]:
cars_train = pd.read_csv("../data/classification/car_evaluation/train.csv")
cars_test = pd.read_csv("../data/classification/car_evaluation/test.csv")
# Split the train and test data into features and target.
X_train = cars_train.drop(columns=["Y"])
y_train = cars_train["Y"]
X_test = cars_test.drop(columns=["Y"])
y_test = cars_test["Y"]

# Initialize and fit the classifier.
clf = MyBayesClassifier()
clf.fit(X_train, y_train)

# Predict probabilities and classes for the train and test set
train_proba = clf.predict_proba(X_train)
train_predictions = clf.predict(X_train)
test_proba = clf.predict_proba(X_test)
test_predictions = clf.predict(X_test)

# append true and predicted values of train and test sets to the proba dataframes
train_proba["Y_true"] = y_train
train_proba["Y_pred"] = train_predictions
test_proba["Y_true"] = y_test
test_proba["Y_pred"] = test_predictions

In [15]:
posteriors = clf.get_posteriors()
posteriors

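|      | X1    | X2    | X3    | X4   | X5    | X6   | acc      | good     | unacc    | vgood    |
|------|-------|-------|-------|------|-------|------|----------|----------|----------|----------|
| 0    | high  | high  | 2     | 2    | big   | high | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 1    | high  | high  | 2     | 2    | big   | low  | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 2    | high  | high  | 2     | 2    | big   | med  | 0.222142 | 0.039797 | 0.700434 | 0.037627 |
| 3    | high  | high  | 2     | 2    | med   | high | 0.222142 | 0.039797 | 0.700434 | 0.037627 |
| 4    | high  | high  | 2     | 2    | med   | low  | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| ...  | ...   | ...   | ...   | ...  | ...   | ...  | ...      | ...      | ...      | ...      |
| 1723 | vhigh | vhigh | 5more | more | med   | low  | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 1724 | vhigh | vhigh | 5more | more | med   | med  | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 1725 | vhigh | vhigh | 5more | more | small | high | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 1726 | vhigh | vhigh | 5more | more | small | low  | 0.000000 | 0.000000 | 1.000000 | 0.000000 |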


In [17]:
train_proba

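|      | acc | good | unacc | vgood | Y_true | Y_pred |
|------|-----|------|-------|-------|--------|--------|
| 0    | 1.0 | 0.0  | 0.0   | 0.0   | acc    | acc    |
| 1    | 0.0 | 0.0  | 1.0   | 0.0   | unacc  | unacc  |
| 2    | 1.0 | 0.0  | 0.0   | 0.0   | acc    | acc    |
| 3    | 0.0 | 1.0  | 0.0   | 0.0   | good   | good   |
| 4    | 0.0 | 0.0  | 1.0   | 0.0   | unacc  | unacc  |
| ...  | ... | ...  | ...   | ...   | ...    | ...    |
| 1377 | 0.0 | 0.0  | 1.0   | 0.0   | unacc  | unacc  |
| 1378 | 0.0 | 0.0  | 1.0   | 0.0   | unacc  | unacc  |
| 1379 | 0.0 | 0.0  | 1.0   | 0.0   | unacc  | unacc  |
| 1380 | 0.0 | 0.0  | 1.0   | 0.0   | unacc  | unacc  |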


In [18]:
clf.unseen_instances(X_test)

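|     | X1    | X2    | X3    | X4   | X5    | X6   |
|-----|-------|-------|-------|------|-------|------|
| 0   | low   | low   | 4     | more | med   | med  |
| 1   | med   | vhigh | 3     | 4    | big   | low  |
| 2   | high  | vhigh | 5more | 4    | small | high |
| 3   | med   | med   | 2     | 2    | small | low  |
| 4   | high  | high  | 3     | more | big   | high |
| ... | ...   | ...   | ...   | ...  | ...   | ...  |
| 341 | low   | low   | 3     | 4    | med   | low  |
| 342 | vhigh | med   | 3     | more | med   | high |
| 343 | vhigh | low   | 4     | 4    | small | low  |
| 344 | med   | med   | 3     | 2    | big   | med  |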


### Bayes' Theorem recalled
 

Recall from last class that we denote a row of features as $\vec{x} = (x_1,\dotsc,x_n)$, and we write $\vec{X} = \vec{x}$ to mean that $X_i = x_i$ for $i=1,\dotsc,n$. Then, Bayes' theorem states that for any possible values $c$ and $\vec{x}$ of $Y$ and $\vec{X}$, we have
\begin{equation*}
    P(c|\vec{x}) = \frac{P(\vec{x}|c)P(c)}{P(\vec{x})}.
\end{equation*}

Here, we have the following terminology and concepts:
- **Prior** $P(c)$:
    
    This is the probability $P(Y=c)$ that the target is $c$ before we have observed the features. We either make a reasonable assumption about this probability, or we estimate it from the data (e.g. by taking it to be the fraction of the training data that has target $c$ in the discrete case, which is the one relevant here, or by computing an approximation to the density of the target in the continuous case).
- **Likelihood** $P(\vec{x}|c)$:

    This is the conditional probability $P(\vec{X} = \vec{x}|Y=c)$ of observing the features $\vec{x}$ given that the target is $c$.

- **Evidence** $P(\vec{x})$:

    This is the probability $P(\vec{X} = \vec{x})$ of observing the features $\vec{x}$, regardless of the target. 

- **Posterior** $P(c|\vec{x})$:

    This is the probability $P(Y=c|\vec{X} = \vec{x})$ that the target is $c$ given that we have observed the features $\vec{x}$. 

Note that the denominator $P(\vec{x})$ can be computed as a sum over the possible classes of $Y$:
\begin{equation*}
    P(\vec{x}) = \sum_{c \in \mathcal{C}} P(\vec{x}|c)P(c).
\end{equation*}
From this it becomes quite clear that the posteriors $P(c|\vec{x})$ define a probability distribution over the possible classes $c \in \mathcal{C}$. Indeed, each value is non-negative, and when we sum over all classes, we get $1$:
\begin{align*}
    \sum_{c \in \mathcal{C}} P(c| \vec{x}) & = \sum_{c \in \mathcal{C}} \frac{P(\vec{x}|c)P(c)}{ P(\vec{x}) } = \frac{1}{P(\vec{x})} \sum_{c \in \mathcal{C}} P(\vec{x}|c)P(c) = \frac{P(\vec{x})}{P(\vec{x})} = 1.
\end{align*}
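The computation above can be checked numerically. The following is a small sketch with two classes; the priors and likelihoods are hypothetical numbers for a single fixed $\vec{x}$, not estimates from the car dataset.

```python
# Hypothetical priors P(c) and likelihoods P(x | c) for one fixed feature vector x.
priors = {"unacc": 0.7, "acc": 0.3}
likelihoods = {"unacc": 0.01, "acc": 0.08}

# Evidence P(x): sum of P(x | c) P(c) over all classes.
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# Posterior P(c | x) via Bayes' theorem.
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}

print(evidence)
print(posteriors)
print(sum(posteriors.values()))
```

In this example the class with the smaller prior ends up with the larger posterior, because its likelihood is much higher.
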
Quite logically, given a row of features $\vec{x}$, our best bet for which class to predict for $Y$ would be the class $c \in \mathcal{C}$ with the largest posterior probability, because this is (in principle) the class most often observed together with $\vec{x}$ in the dataset. This is the basic idea behind Bayesian Classifiers. 

Now, let's go over the various classifiers mentioned in the outline.

### 1. Bayes Classifier
This is the most direct classifier in the sense that it makes basically no assumptions about how the data are distributed; in practice, that means it is essentially impossible to estimate reliably unless the feature space is small and discrete. Nevertheless, it is still useful as a theoretical construct, and will help give us some intuition for Bayesian classifiers in general. 

First off, let's be completely clear that a classifier is a function
\begin{equation*}
    F : \mathbb{R}^n \to \mathcal{C},
\end{equation*}
which assigns to any instance $\vec{x} = (x_1,\dotsc,x_n)$ of features a class $F(\vec{x}) \in \mathcal{C}$. 

Now, the Bayes Classifier, which we denote by $F_{\textup{Bayes}}$, does the natural probabilistic thing: it classifies a given $\vec{x}$ into the class which has the maximum posterior probability. Notationally, we write this as follows:
\begin{align*}
    F_{\textup{Bayes}}(\vec{x}) & = \argmax_{c \in \mathcal{C}} P(c \mid \vec{x})\\
    & = \argmax_{c \in \mathcal{C}} \dfrac{ P(\vec{x}| c) P(c) }{ P(\vec{x}) }.
\end{align*}
Note that the denominator $P(\vec{x})$ is positive and does not depend on the class $c$, so it has no effect on which class attains the maximum, and we may as well drop it from the right-hand side. This yields the following simplified form of the Bayes Classifier:
\begin{align*}
    F_{\textup{Bayes}}(\vec{x}) & = \argmax_{c \in \mathcal{C}} P(\vec{x}| c) P(c)\\
    & = \argmax_{c \in \mathcal{C}} P(\vec{x},c),
\end{align*}
where $P(\vec{x},c)$ denotes the joint probability that $\vec{X} = \vec{x}$ and $Y = c$. 
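
A quick numerical sanity check of this simplification: dividing by the evidence rescales all values by the same positive constant, so the argmax does not move. The joint probabilities below are hypothetical values for one fixed $\vec{x}$.

```python
import numpy as np

classes = ["acc", "good", "unacc", "vgood"]

# Hypothetical joint probabilities P(x, c) for one fixed x (they need not sum to 1).
joint = np.array([0.010, 0.002, 0.006, 0.001])

evidence = joint.sum()         # P(x), the sum of the joint over all classes
posterior = joint / evidence   # P(c | x)

# The predicted class is the same whether we maximize the joint or the posterior.
pred_from_joint = classes[int(np.argmax(joint))]
pred_from_posterior = classes[int(np.argmax(posterior))]

print(pred_from_joint, pred_from_posterior)
```

The posterior is still needed whenever we want the soft classifier, i.e. calibrated probabilities rather than just the predicted label.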