# Pattern Recognition & Machine Learning Project

*Authors*: Aristeidis Daskalopoulos (AEM: 10640), Georgios Rousomanis (AEM: 10703)

----

<center>
This notebook contains the solutions for Parts A, B, and C.
</center>


## Part A
### Theoretical Analysis

In this part, we address a **binary classification problem** where the goal is to classify samples into one of two classes: $\omega_1$ or $\omega_2$, based on *a single feature* $x$ (a feature vector of dimensionality one).

To achieve this, we use the probability density function (PDF) of the feature $x$, which follows - for both classes - the distribution described below:

$$ p(x|\theta) = \frac{1}{\pi}\frac{1}{1 + (x - \theta)^2} \quad, $$

where $\theta$ is an unknown parameter that has to be estimated separately for each class. This is the PDF of the [Cauchy distribution](https://en.wikipedia.org/wiki/Cauchy_distribution) with location $\theta$ and scale $\gamma = 1$.

To solve this decision problem, we will implement a Generative Probabilistic Model, following the steps described in the subsequent cells.


#### 1. Maximum Likelihood Estimation (MLE)

Our first goal is to estimate the parameters $\hat{\theta}_1$ and $\hat{\theta}_2$ using the Maximum Likelihood (ML) method. To do this, we maximize the log-likelihood function with respect to $\theta_j$, for $j = 1, 2$, where $\theta_j$ is the parameter of class $\omega_j$.
 

Assuming that the samples $D_j$ of the class $\omega_j$, for $j = 1, 2$, are ***independent and identically distributed (i.i.d.)***, meaning they have been drawn independently from the same distribution $p(x | \theta_j, \omega_j)$, the probability density function (PDF) for the samples can be expressed as:

$$ p(D_j | \theta_j) = \prod_{i=1}^{N_j} p(x_i | \theta_j)\quad.$$

We prefer to work with the log-likelihood function because it converts the product into a sum, which is *less* prone to floating-point errors (a product of many small densities can underflow) and is easier to differentiate. The log-likelihood of our problem is:

$$ l(\theta_j) = \log p(D_j | \theta_j) = \sum_{i=1}^{N_j} \log p(x_i | \theta_j), \quad j = 1, 2 \quad \Rightarrow$$


$$ l(\theta_j) = \sum_{i=1}^{N_j} \log \left(\frac{1}{\pi}\frac{1}{1 + (x_i - \theta_j)^2}\right), \quad j = 1, 2 \quad \Rightarrow$$

$$ l(\theta_j) = - N_j \cdot \log \pi - \sum_{i=1}^{N_j} \log (1 + (x_i - \theta_j)^2), \quad j = 1, 2 \quad,$$

with which we can estimate the $\hat{\theta}_j$ for each class. This estimate, $\hat{\theta}_j$, is by definition the value of $\theta_j$ that maximizes the likelihood/log-likelihood. In our case the term $- N_j \cdot \log \pi$ is constant, so we practically just need to *minimize* the term $\sum_{i=1}^{N_j} \log (1 + (x_i - \theta_j)^2)$.
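
The numerical motivation for using logarithms can be illustrated with a minimal sketch. The dataset below is hypothetical (2000 samples placed exactly at $\theta = 0$, so every density equals $1/\pi$) and the helper name `cauchy_pdf` is an assumption made only for this illustration:

```python
import numpy as np

def cauchy_pdf(x, theta):
    """Cauchy density with location theta and unit scale."""
    return 1.0 / (np.pi * (1.0 + (x - theta) ** 2))

# Hypothetical dataset: 2000 samples located exactly at theta = 0,
# so every density value equals 1/pi.
x_demo = np.zeros(2000)
densities = cauchy_pdf(x_demo, theta=0.0)

likelihood = np.prod(densities)             # product of 2000 numbers below 1
log_likelihood = np.sum(np.log(densities))  # sum of 2000 finite logarithms

print(likelihood)       # the product underflows to 0.0 in double precision
print(log_likelihood)   # equals -2000 * log(pi)
```

The product is no longer usable for comparing candidate values of $\theta$, while the sum of logarithms remains finite and well scaled.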

One approach to solving this problem is to calculate the derivatives and solve the following equations, where the solution gives the estimate $\hat{\theta}_j$ for each class:

$$ \frac{d}{d\theta_j} l(\theta_j) = 0 \quad \Rightarrow \quad \frac{d}{d\theta_j} \left(- N_j \cdot \log \pi - \sum_{i=1}^{N_j} \log (1 + (x_i - \theta_j)^2)\right) = 0 \quad \Rightarrow$$

$$ \sum_{i=1}^{N_j} \frac{d}{d\theta_j}\log (1 + (x_i - \theta_j)^2) = 0 \quad \Rightarrow$$

$$ \sum_{i=1}^{N_j} \frac{-2 \cdot (x_i - \theta_j)}{1 + (x_i - \theta_j)^2} = 0 \quad \Rightarrow \quad \sum_{i=1}^{N_j} \frac{x_i - \theta_j}{1 + (x_i - \theta_j)^2} = 0 \quad.$$

*If several values of $\theta_j$ satisfy the above equation, we choose as $\hat{\theta}_j$ the one that gives the largest value of $l(\theta_j)$.*

Given $l(\theta_j)$, its derivative can be computed efficiently (e.g., using a library like SymPy) to solve the equation and obtain the estimate $\hat{\theta}_j$. However, by plotting $l(\theta_j)$ as requested, we inherently *calculate the values of the log-likelihood function across multiple points*. Consequently, selecting the value of $\theta$ that maximizes $l(\theta)$ provides the same solution, thereby **avoiding** the need for the derivative-based approach.
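
As a sanity check on the equivalence of the two approaches, the following sketch locates the grid maximizer of $l(\theta)$ for $D_1$ and evaluates the stationarity condition there. The function names, the search range $[-10, 10]$ and the grid resolution are assumptions made for this illustration only:

```python
import numpy as np

D1 = np.array([2.8, -0.4, -0.8, 2.3, -0.3, 3.6, 4.1])  # class omega_1 samples

def log_likelihood(theta, data):
    """l(theta) for the unit-scale Cauchy model, evaluated for each theta."""
    theta = np.atleast_1d(theta)
    diff = data[None, :] - theta[:, None]
    return -data.size * np.log(np.pi) - np.sum(np.log1p(diff ** 2), axis=1)

def score(theta, data):
    """Sum of (x_i - theta) / (1 + (x_i - theta)^2); zero at stationary points."""
    theta = np.atleast_1d(theta)
    diff = data[None, :] - theta[:, None]
    return np.sum(diff / (1.0 + diff ** 2), axis=1)

grid = np.linspace(-10.0, 10.0, 20001)  # spacing 0.001 (assumed search range)
theta_hat = grid[np.argmax(log_likelihood(grid, D1))]
residual = score(theta_hat, D1)[0]      # should be close to zero

# Sign changes of the score mark stationary points; more than one means
# the log-likelihood has several local extrema on the grid.
n_stationary = np.count_nonzero(np.diff(np.sign(score(grid, D1))))

print(theta_hat, residual, n_stationary)
```

A residual close to zero indicates that the grid maximizer also satisfies the derivative equation, up to the grid spacing. Counting the sign changes shows why taking the global maximum over the grid is safer than accepting an arbitrary root.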



#### 2. Bayes Decision Rule

Using the Bayes Decision Rule, we classify to $\omega_1$ based on the following condition:

$$ P(\omega_1 | x) > P(\omega_2 | x) \quad, $$

which can be rewritten using the *Bayes formula* as:

$$ \frac{p(x|\omega_1) P(\omega_1)}{p(x)} > \frac{p(x|\omega_2) P(\omega_2)}{p(x)} \quad, $$

or equivalently:

$$ \log p(x|\omega_1) + \log P(\omega_1) > \log p(x|\omega_2) + \log P(\omega_2) \quad. $$

*Here, the class-conditional densities $p(x|\omega_1, \theta_1)$ and $p(x|\omega_2, \theta_2)$ have been fully defined using Maximum Likelihood (ML) estimation, for the parameters $\theta_1 = \hat{\theta}_1$ and $\theta_2 = \hat{\theta}_2$ respectively.* So, for each class we have that:

$$ p(x|\omega_j) = \frac{1}{\pi}\frac{1}{1 + (x - \hat{\theta}_j)^2}, \quad P(\omega_j) = \frac{||D_j||}{||D_1|| + ||D_2||}, $$

where $||D_j||$ is the total number of elements, $N_j$, that this dataset has.

We define the following discriminant function:

$$ g(x) = \log p(x|\omega_1) - \log p(x|\omega_2) + \log P(\omega_1) - \log P(\omega_2) \quad, $$

and based on the previous inequality we infer that, using this discriminant function:
- If $g(x) > 0$, the sample with feature $x$ is classified into class $\omega_1$.
- Otherwise, it is classified into class $\omega_2$.

The above **rule** implies that we theoretically expect the discriminant function $g(x)$ to be greater than zero when a sample from the $D_1$ set (class $\omega_1$) is provided. Based on this rule, the feature space - represented by the real number line $\mathbb{R}$ - is divided into two distinct regions: $\mathbb{R}_1$ and $\mathbb{R}_2$. To complete the theoretical analysis of this section, these regions must be defined by determining their boundaries, which can be found by solving $g(x) = 0$:


$$ \log(\frac{1}{\pi}\frac{1}{1 + (x - \hat{\theta}_1)^2}) - \log(\frac{1}{\pi}\frac{1}{1 + (x - \hat{\theta}_2)^2}) + \log(\frac{||D_1||}{||D_1|| + ||D_2||}) - \log(\frac{||D_2||}{||D_1|| + ||D_2||}) = 0 \quad \Rightarrow$$

$$ -\log(1 + (x - \hat{\theta}_1)^2) + \log(1 + (x - \hat{\theta}_2)^2) + \log(\frac{||D_1||}{||D_2||}) = 0, \quad \text{let } r = \frac{||D_1||}{||D_2||} \quad \Rightarrow$$

$$ \log(\frac{1 + (x - \hat{\theta}_2)^2}{1 + (x - \hat{\theta}_1)^2}) = -\log(r) \quad \Rightarrow \quad \frac{1 + (x - \hat{\theta}_2)^2}{1 + (x - \hat{\theta}_1)^2} = \frac{1}{r} \quad \Rightarrow$$

$$ r(1 + (x - \hat{\theta}_2)^2) = 1 + (x - \hat{\theta}_1)^2 \quad \Rightarrow \quad (r-1)x^2 - 2(r\hat{\theta}_2 - \hat{\theta}_1)x + (r\hat{\theta}_2^2 + r - \hat{\theta}_1^2 - 1) = 0 $$

The solutions to this quadratic equation define the decision boundary points, which separate the regions $\mathbb{R}_1$ and $\mathbb{R}_2$ in the feature space. If the equation has two real solutions, $x_a < x_b$, then one of the regions is the union $(-\infty, x_a) \cup (x_b, +\infty)$, while the other is the interval $[x_a, x_b]$. Since $g(x) > 0$ is equivalent to the left-hand side of the quadratic being positive, for $r > 1$ (the parabola opens upward) the outer region is $\mathbb{R}_1$. We will specify these intervals after estimating the values of $\hat{\theta}_j$.



### Algorithm Implementation

Having outlined the theoretical approach to solving this problem, we now proceed with its implementation.


In [59]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

$D_1$ and $D_2$ are the two example datasets provided. $D_1$ contains the values of the feature $x$ for class $\omega_1$, while $D_2$ contains the values for class $\omega_2$:

In [60]:
D1 = np.array([2.8, -0.4, -0.8, 2.3, -0.3, 3.6, 4.1])  # No stress data (class omega_1)
D2 = np.array([-4.5, -3.4, -3.1, -3.0, -2.3])  # Intense stress data (class omega_2)

Now we are ready to implement our classifier. The methods that are implemented are the following:

1. `compute_pdf`: Computes the probability density function evaluation for given $\theta$ and $x$ values
2. `loglkhood`: Calculates the log-likelihood of a dataset $D$ given parameter $\theta$
3. `fit`: Finds the optimal $\hat{\theta}$ parameter that maximizes the log-likelihood for the given dataset
4. `predict`: Predicts the class $\omega$ by evaluating the discriminant function $g(x)$ using the fitted parameters and prior probabilities

More detailed information about these methods can be found in their docstrings.

In [61]:
class ClassifierA:

    @staticmethod
    def compute_pdf(theta, x) -> np.ndarray:
        """
        Compute the probability density function (PDF) evaluation for given theta and x values.
        
        For theta (M elements) and x (N elements), returns an M x N matrix where element (i,j)
        represents the PDF evaluation at x[j] for theta[i]. 
        
        Uses the distribution formula: p(x|θ) = 1/(π(1 + (x-θ)²))
        
        Args:
            theta: Location parameter(s) of the distribution. Can be scalar or array.
            x: Data point(s) to evaluate. Can be scalar or array.
        
        Returns:
            numpy.ndarray: Matrix of PDF evaluations with shape (M, N) where M is the number
                        of theta values and N is the number of x values.
        """
        # Compute differences using broadcasting
        x, theta = map(np.atleast_1d, (x, theta))
        diff = x[None, :] - theta[:, None]

        # By simply modifying the returned PDF here, the same classifier
        # can be utilized without requiring any additional changes (for the same requirements).
        return 1.0 / (np.pi * (1.0 + diff * diff))

    @staticmethod
    def loglkhood(theta, D) -> np.ndarray:
        """
        Compute the log-likelihood of dataset D given parameter theta: l(θ).
        
        For M different theta values, returns an array of M log-likelihood values.
        
        Args:
            theta: Location parameter(s) of the distribution. Can be scalar or array.
            D: Dataset for which the log-likelihood is computed. Can be scalar or array.
        
        Returns:
            numpy.ndarray: Array of log-likelihood values for each theta. 
                        Shape: (M,) where M is the number of theta values.
        """
        D, theta = map(np.atleast_1d, (D, theta))
        
        # Row-wise sum of the log of probabilities computed for each data point in D
        # each row corresponds to a different theta value
        return np.sum(np.log(ClassifierA.compute_pdf(theta, D)), axis=1)

    @staticmethod
    def fit(D, theta_min, theta_max, npoints=10000, plot=False, labels=0) -> float:
        """
        Find the optimal theta parameter that maximizes the log-likelihood for the given dataset.

        This function evaluates the log-likelihood for a range of theta values between
        theta_min and theta_max, using npoints for precision. The theta value corresponding to
        the maximum log-likelihood is selected as the optimal theta.

        Optionally, a plot of the log-likelihood curve can be generated. The plot includes a marker
        for the optimal theta value, with customized labels for different datasets (if specified).

        Args:
            D: Dataset for fitting the model. Can be scalar or array-like.
            theta_min (float): Minimum value of theta for the search range.
            theta_max (float): Maximum value of theta for the search range.
            npoints (int, optional): Number of points to sample between theta_min and theta_max. 
                                    Default is 10000.
            plot (bool, optional): If True, generates a plot of the log-likelihood curve. Default is False.
            labels (int, optional): Dataset label for customizing plot annotations (1, 2, or others if needed). 
                                    Default is 0.

        Returns:
            float: The optimal theta value that maximizes the log-likelihood for the dataset.
        """
        theta_candidates  = np.linspace(theta_min, theta_max, npoints)
        lkhood_values     = ClassifierA.loglkhood(theta_candidates, D)
        opt_theta         = theta_candidates[np.argmax(lkhood_values)]
        
        if plot:
            # Determine appropriate label
            labels = labels if labels else ""
            
            # Plot the log-likelihood curve:
            plt.plot(theta_candidates, lkhood_values, label=rf'$\log P(D{labels}|\theta)$', 
                     color = 'blue' if labels == 1 else 'green')
            # Mark the optimal theta value on the plot:
            plt.scatter(opt_theta, ClassifierA.loglkhood(opt_theta, D), color='red', marker='x')
            
            plt.xlabel(rf'$\theta{labels}$')
            plt.ylabel('Log-likelihood')
            plt.title(rf'Log-likelihood plot for Part A, $\omega{labels}$')  # raw string: '\o' is not a valid escape
            plt.legend()
            plt.show()
        
        return opt_theta
    
    @staticmethod
    def predict(D, p1, p2, theta1, theta2) -> np.ndarray:
        """
        Predict class membership using the log-ratio discriminant function. 
        If theta1 is 1D array of M elements, theta2 is 1D array of K elements and 
        D is 1D array of N elements it returns a 2D array of max(M, K) x N elements 
        where the element at row i and column j is the prediction of feature D[j] corresponding 
        to theta1[i] and theta2[i] parameters.
        
        Args:
            D: Dataset points for prediction. Can be scalar or array.
            p1: Prior probability of class 1 (no stress). Must be in range (0, 1).
            p2: Prior probability of class 2 (intense stress). Must be in range (0, 1).
            theta1: Location parameter for class 1 distribution.
                Can be scalar or array of M elements.
            theta2: Location parameter for class 2 distribution.
                Can be scalar or array of K elements.
        
        Returns:
            numpy.ndarray: Discriminant function values. Shape is (max(M,K), N) where:
                        - M is the number of theta1 values
                        - K is the number of theta2 values
                        - N is the number of points in D
                        Positive values indicate class 1, negative values indicate class 2.
        """
        # Input validation check:
        if not 0 < p1 < 1 or not 0 < p2 < 1:
            raise ValueError("Prior probabilities must be between 0 and 1")
        if abs(p1 + p2 - 1) > 1e-5:
            raise ValueError("Prior probabilities must sum to 1")
            
        # g(x) = log(p(x|θ₁)) - log(p(x|θ₂)) + log(p₁) - log(p₂)
        return (np.log(ClassifierA.compute_pdf(theta1, D)) - 
                np.log(ClassifierA.compute_pdf(theta2, D)) + 
                np.log(p1) - np.log(p2))


#### 1. Maximum Likelihood Estimation (Results)

We execute the `fit` method for each dataset, $D_j$, to determine the Maximum Likelihood estimates, $\hat{\theta}_j$. Additionally, we plot the log-likelihood function:

$$ l(\theta) = \log p(D_j | \theta), \quad \text{for} \; j = 1, 2, $$

and highlight the point where the likelihood reaches its maximum.

In [None]:
clf = ClassifierA()  # Create an instance of the Classifier
theta1 = clf.fit(D1, -10, +10, npoints=10000, plot=True, labels=1)
theta2 = clf.fit(D2, -10, +10, npoints=10000, plot=True, labels=2)
print(f'theta1 ML estimation (no stress):      {theta1}')
print(f'theta2 ML estimation (intense stress): {theta2}')

#### 2. Bayes Decision Rule (Results)

Based on the previously evaluated $\hat{\theta}_j$ results, we can now, before running the code, solve the equation $g(x) = 0$ and determine the two regions, $\mathbb{R}_1$ and $\mathbb{R}_2$, in the feature space. Specifically, we have found that $\hat{\theta}_1 \simeq 2.6$ and $\hat{\theta}_2 \simeq -3.16$.

To solve the quadratic equation:

$$ (r-1)x^2 - 2(r\hat{\theta}_2 - \hat{\theta}_1)x + (r\hat{\theta}_2^2 + r - \hat{\theta}_1^2 - 1) = 0, $$

where $r = \frac{7}{5}$, we can apply the quadratic formula and get the roots:

$$ x_a \approx -34.57, \quad x_b \approx -0.55.$$

These values define the boundaries of the regions $\mathbb{R}_1$ and $\mathbb{R}_2$ in the feature space.
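
The roots can be checked with a short sketch that applies the quadratic formula to the coefficients above. The function name `decision_boundaries` is hypothetical, and the inputs are the rounded estimates quoted in the text, so the output is only as accurate as those roundings:

```python
import numpy as np

def decision_boundaries(theta1, theta2, r):
    """Real roots of (r-1)x^2 - 2(r*theta2 - theta1)x + (r*theta2^2 + r - theta1^2 - 1) = 0."""
    a = r - 1.0
    b = -2.0 * (r * theta2 - theta1)
    c = r * theta2 ** 2 + r - theta1 ** 2 - 1.0
    if np.isclose(a, 0.0):
        # Equal priors: the equation degenerates to a linear one (single boundary).
        return np.array([-c / b]) if not np.isclose(b, 0.0) else np.array([])
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return np.array([])  # no real boundary: one class wins everywhere
    root = np.sqrt(disc)
    return np.sort(np.array([(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)]))

# Rounded estimates quoted in the text, with r = N1 / N2 = 7 / 5
x_a, x_b = decision_boundaries(theta1=2.6, theta2=-3.16, r=7 / 5)
print(round(x_a, 2), round(x_b, 2))
```

The function also covers the two degenerate cases that the quadratic formula alone does not: equal priors ($r = 1$), where a single boundary remains, and a negative discriminant, where no real boundary exists.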


In [None]:
# Calculate apriori probabilities for each class:
N1 = len(D1)
N2 = len(D2)
p1 = N1 / (N1 + N2)
p2 = N2 / (N1 + N2)

# Get the discriminant values for the two classes:
predictions1 = clf.predict(D1, p1, p2, theta1, theta2)
predictions2 = clf.predict(D2, p1, p2, theta1, theta2)

# Scatter plot of the data points and the discriminant function values:
plt.scatter(D1, predictions1, label='no stress', color='blue', marker='o')
plt.scatter(D2, predictions2, label='intense stress', color='green', marker='x')
# Add a horizontal dashed line (threshold for classification):
plt.axhline(y=0.0, color='red', linestyle='--', label="threshold")

plt.xlabel('x')
plt.ylabel('g(x)')
plt.title('Discriminant function values for D1, D2 datasets')
plt.legend()
plt.show()

From the above plot, we observe only one misclassification, where a sample that should have been classified as "no stress" was incorrectly predicted as "intense stress." Additionally, it is evident that some "no stress" values are near the threshold.  

It is important to note that, due to the very limited number of training samples, splitting $D_j$ into training and validation sets will not yield reliable results.  

Now, we proceed to validate the intervals $\mathbb{R}_1$ and $\mathbb{R}_2$ by evaluating the predictor on a dense grid of values and checking the sign of the outcomes. These grid points have no ground-truth labels, so we cannot check whether they are classified correctly; we only verify that the intervals match those obtained theoretically:

In [None]:
xValues = np.linspace(-50, 10, 100000)
predictions = clf.predict(xValues, p1, p2, theta1, theta2)

if len(predictions.shape) > 1:
    predictions = predictions[0]  # Take first row if 2D array

plt.figure(figsize=(12, 2))

# Plot points with positive predictions in red:
plt.plot(xValues[predictions > 0], np.zeros_like(xValues[predictions > 0]), 'ro', 
         label='Class ω₁', markersize=1)

# Plot points with negative predictions in blue:
plt.plot(xValues[predictions <= 0], np.zeros_like(xValues[predictions <= 0]), 'bo', 
         label='Class ω₂', markersize=1)

plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.grid(True, axis='x')
plt.title('Decision Regions on the Real Line')
plt.xlabel('x')
plt.yticks([])
plt.legend()
plt.show()

class2_points = xValues[predictions <= 0]  # Grid points assigned to class ω₂
x_a = class2_points[0]   # Leftmost point of R2
x_b = class2_points[-1]  # Rightmost point of R2

print(f"R1 interval: (-inf, {x_a}) and ({x_b}, +inf) --> class ω₁")
print(f"R2 interval: [{x_a}, {x_b}]                  --> class ω₂")

## Part B

#### 1.

The a priori (prior) PDF of $\theta$ is given by:

$$ p(\theta) = \frac{1}{10\pi} \frac{1}{1 + (\theta / 10)^2} $$

The likelihood $p(D_j|\theta)$ is computed by:

$$ p(D_j|\theta) = \prod_{n=1}^{N_j} p(x_n|\theta), \quad j = 1, 2 \quad. $$

The a posteriori PDF will be:

$$ p(\theta|D_j) = \frac{p(D_j|\theta) p(\theta)}{\int p(D_j|\theta) p(\theta) \, d\theta}, \quad j = 1, 2. $$
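
A minimal sketch of how this posterior can be evaluated on a grid is given below. It works in log space and rescales before exponentiating, which is one way to keep the computation stable. The grid range $[-40, 40]$, its resolution, and all function names are assumptions of this sketch, and the normalization integral is truncated to that range:

```python
import numpy as np

D1 = np.array([2.8, -0.4, -0.8, 2.3, -0.3, 3.6, 4.1])  # class omega_1 samples

def log_prior(theta):
    """Log of p(theta) = 1 / (10*pi*(1 + (theta/10)^2))."""
    return -np.log(10.0 * np.pi) - np.log1p((theta / 10.0) ** 2)

def log_likelihood(theta, data):
    """Log of p(D|theta) for the unit-scale Cauchy model, for each theta."""
    diff = data[None, :] - theta[:, None]
    return -data.size * np.log(np.pi) - np.sum(np.log1p(diff ** 2), axis=1)

def trapezoid(y, x):
    """Plain trapezoidal rule, written out to avoid NumPy version differences."""
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))

theta_grid = np.linspace(-40.0, 40.0, 8001)       # assumed range and resolution
log_post = log_likelihood(theta_grid, D1) + log_prior(theta_grid)
unnormalized = np.exp(log_post - log_post.max())  # rescale before exponentiating
posterior = unnormalized / trapezoid(unnormalized, theta_grid)

theta_map = theta_grid[np.argmax(posterior)]      # location of the posterior peak
print(theta_map, trapezoid(posterior, theta_grid))
```

Subtracting the maximum of the log-posterior changes only the normalization constant, which cancels in the final division.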

#### 2.

Applying the Bayes decision rule with the densities obtained through Bayesian Estimation, we classify to $\omega_1$ when:

$$ p(\omega_1 | x, D_1) > p(\omega_2 | x, D_2) $$

or

$$ 
\frac{p(x | D_1) P(\omega_1)}{p(x | D_1) P(\omega_1) + p(x | D_2) P(\omega_2)} > 
\frac{p(x | D_2) P(\omega_2)}{p(x | D_1) P(\omega_1) + p(x | D_2) P(\omega_2)}
$$

or

$$ \log p(x | D_1) + \log P(\omega_1) > \log p(x | D_2) + \log P(\omega_2) \quad. $$

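Here $p(x | D_j) = \int p(x | \theta) \, p(\theta | D_j) \, d\theta$ is the predictive density of class $\omega_j$. Choosing as discriminant function: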

$$ h(x) = \log p(x | D_1) - \log p(x | D_2) + \log P(\omega_1) - \log P(\omega_2) \quad, $$

we classify the element with feature $x$ to class $\omega_1$ if $h(x) > 0$ and to $\omega_2$ otherwise.



In [None]:
class ClassifierB:

    @staticmethod
    def compute_pdf(theta, x):
        """
        Compute the probability of x given a specific theta. If theta is a 1D array
        of M elements and x is 1D array of N elements it returns a matrix M x N where
        the element at row i and column j is the evaluation of the PDF at point x[j]
        for theta[i].
        
        Args:
            theta: Scalar or array, location parameter of the distribution
            x: Scalar or array, the data points to evaluate
        
        Returns:
            numpy.ndarray: The probability for each x given theta
        """
        x = np.asarray(x)  # Ensure x is an array
        theta = np.asarray(theta)  # Ensure theta is an array
        # Convert theta to 1D array if it is a scalar
        if theta.ndim == 0:
            theta = np.array([theta])
        return 1 / (1 + (x - theta[:, np.newaxis]) ** 2) / np.pi

    @staticmethod
    def p_theta(theta):
        """
        Compute the apriori probability density function of theta.
        If theta is a 1D array of M elements it returns a 1D array with
        M elements with the evaluations of the function for each theta value.
        
        Args:
            theta: Scalar or array, the parameter values to evaluate.
        
        Returns:
            numpy.ndarray: The prior probability for each theta.
        """
        theta = np.asarray(theta)  # Ensure theta is an array
        return 1 / (1 + (theta / 10) ** 2) / (10 * np.pi)

    @staticmethod
    def p_D_theta(theta, D):
        """
        Compute the likelihood of the dataset D given a parameter theta.
        This is the product of individual probabilities p(x | theta) for all x in D.
        If theta is a 1D array of M elements it returns a 1D array of M elements with
        the likelihood values of the dataset D for each theta parameter.
        
        Args:
            theta: Scalar or array, the parameter values to evaluate.
            D: Scalar or array, the dataset.
        
        Returns:
            numpy.ndarray: The likelihood values for each theta.
        """
        # Computes the product of all elements in each row i.e. the likelihood of each theta given dataset D
        return np.prod(ClassifierB.compute_pdf(theta, D), axis=1)
    
    @staticmethod
    def p_theta_D(theta, D):
        """
        Compute the posterior probability density function using Bayes' theorem.
        If theta is a 1D array of M elements it returns a 1D array of M elements with the 
        evaluation of the posterior PDF at each theta.
        
        Args:
            theta: Scalar or array, the parameter values to evaluate.
            D: Scalar or array, the dataset.
        
        Returns:
            numpy.ndarray: The posterior probability values for each theta.
        """
        theta_max = 1000  # Range limit for theta
        npoints = 5000  # Number of points for numerical integration
        x = np.linspace(-theta_max, theta_max, npoints)  # Theta range for integration
        y = ClassifierB.p_D_theta(x, D) * ClassifierB.p_theta(x)  # Unnormalized posterior
        return ClassifierB.p_D_theta(theta, D) * ClassifierB.p_theta(theta) / np.trapz(y, x)
    
    @staticmethod
    def p_x_D(x, D):
        """
        Compute the predictive density p(x | D) by numerically integrating
        p(x | theta) * p(theta | D) over theta with the trapezoidal rule.

        Args:
            x: Scalar or array, the points at which the density is evaluated.
            D: Scalar or array, the training dataset of one class.

        Returns:
            numpy.ndarray: The evaluation of p(x | D) at each point x.
        """
        theta_max = 1000  # Range limit for theta
        npoints = 5000  # Number of points for numerical integration
        theta_values = np.linspace(-theta_max, theta_max, npoints)  # Theta range for integration

        # Compute posterior pdf values across the theta range and convert it to a column vector
        posterior = ClassifierB.p_theta_D(theta_values, D)[:, np.newaxis]

        # Compute the values of p(x | theta) * p(theta | D) across the range of theta for each point x
        # Each row corresponds to the same theta and each column to the same x
        p_values = posterior * ClassifierB.compute_pdf(theta_values, x)
        
        # Integrate along the rows for each column to get the evaluation of p(x | D) at each point x
        return np.trapz(p_values, theta_values, axis=0)

    @staticmethod
    def predict(D, D1, D2, p1, p2):
        """
        Predict the class of each data point by evaluating a discriminant function.
        
        Args:
            D: Scalar or array, the dataset for which predictions are to be made
            D1: Training dataset for class 1 (no stress)
            D2: Training dataset for class 2 (intense stress)
            p1: float, apriori probability of class 1 (no stress)
            p2: float, apriori probability of class 2 (intense stress)
        
        Returns:
            numpy.ndarray: The predicted discriminant function values for each data point
        """
        return np.log(ClassifierB.p_x_D(D, D1)) - np.log(ClassifierB.p_x_D(D, D2)) + np.log(p1) - np.log(p2)


clf = ClassifierB()  # Create an instance of the Classifier
theta_max = 40  # Range limit for theta
npoints = 1000  # Number of points to plot
x = np.linspace(-theta_max, theta_max, npoints)  # Theta range for plotting
y = clf.p_theta(x)  # Apriori pdf values for theta
y1 = clf.p_theta_D(x, D1)  # A posteriori pdf values for theta given dataset D1
y2 = clf.p_theta_D(x, D2)  # A posteriori pdf values for theta given dataset D2

idx1 = np.argmax(y1)  # Find the index of the maximum value of the posterior pdf given D1
idx2 = np.argmax(y2)  # Find the index of the maximum value of the posterior pdf given D2
print(f'theta1 (no stress): {x[idx1]}')  # Print theta value that gives the maximum a posteriori pdf value given D1
print(f'theta2 (intense stress): {x[idx2]}')  # Print theta value that gives the maximum a posteriori pdf value given D2

# Plot prior and posterior pdf of theta
plt.plot(x, y, label=r'$p(\theta)$', color='red')
plt.plot(x, y1, label=r'$p(\theta|D1)$', color='blue')
plt.plot(x, y2, label=r'$p(\theta|D2)$', color='green')

# Mark the peak points of the posterior pdfs
plt.scatter(x[idx1], y1[idx1], color='blue', marker='x')
plt.scatter(x[idx2], y2[idx2], color='green', marker='x')

# Labeling the plot
plt.xlabel(r'$\theta$')
plt.ylabel('Probability Density')
plt.title(r'A priori & a posteriori density functions of $\theta$')
plt.legend()
plt.show()

# Get the discriminant values for the two classes
predictions1 = clf.predict(D1, D1, D2, p1, p2)
predictions2 = clf.predict(D2, D1, D2, p1, p2)

# Scatter plot of the data points and the discriminant function values
plt.scatter(D1, predictions1, label='no stress', color='blue', marker='o')
plt.scatter(D2, predictions2, label='intense stress', color='green', marker='x')

# Add a horizontal dashed line (threshold for classification)
plt.axhline(y=0.0, color='red', linestyle='--', label="threshold")

# Labeling the plot
plt.xlabel('x')
plt.ylabel('h(x)')
plt.title('Discriminant function values for D1, D2 datasets')
plt.legend()
plt.show()


We see that the a posteriori density functions $p(\theta | D_j)$ are sharply peaked around $\hat{\theta}_j$, which lies very close to the ML estimate. Thus, the prior has little influence on the estimated value of $\theta$.

The discriminant values above indicate that Bayesian Estimation (BE) gives better results than ML estimation. A plausible reason is that BE does not commit to a single point estimate: $p(x | D_j)$ averages $p(x | \theta)$ over the whole posterior, which also takes the prior distribution of $\theta$ into account.


## Part C

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()  # Load the Iris dataset
rnd_seed = 42  # Random seed for reproducibility

X = iris.data[:, :2]  # Extract the first two features of the dataset
y = iris.target  # Get the target values of the dataset

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=rnd_seed)
max_depth = 100  # Maximum depth of the tree
depths = np.arange(1, max_depth + 1)  # Range of depths

# ================================ Section 1 ================================

accuracies_DT = np.zeros(max_depth)  # accuracy achieved for each depth of the decision tree

# Test the accuracy of the DT for different tree depths
for depth in depths:
    # Create an instance of the classifier
    clf = DecisionTreeClassifier(max_depth=depth, random_state=rnd_seed)
    clf.fit(X_train, y_train)  # Train the classifier with the training data
    y_pred = clf.predict(X_test)  # Find predictions of the model for the test set
    accuracies_DT[depth - 1] = accuracy_score(y_test, y_pred)  # Calculate accuracy of the model

best_depth_DT = np.argmax(accuracies_DT) + 1  # Find the depth of the tree that gives the best accuracy
best_accuracy_DT = accuracies_DT[best_depth_DT - 1]  # Find the best accuracy
print(f'Decision Tree: Best depth={best_depth_DT}, Accuracy={best_accuracy_DT}')

# Create a meshgrid to plot decision boundaries
npoints = 1000
x_min, x_max = X[:, 0].min(), X[:, 0].max()  # Define x-axis range
y_min, y_max = X[:, 1].min(), X[:, 1].max()  # Define y-axis range
x_margin = 0.1 * (x_max - x_min)  # Define x-axis margin
y_margin = 0.1 * (y_max - y_min)  # Define y-axis margin
xx, yy = np.meshgrid(np.linspace(x_min - x_margin, x_max + x_margin, npoints), 
                     np.linspace(y_min - y_margin, y_max + y_margin, npoints))

# Create an instance of the classifier with the optimal tree depth
clf = DecisionTreeClassifier(max_depth=best_depth_DT, random_state=rnd_seed)
clf.fit(X_train, y_train)  # Train the classifier with the training data

# Predict the class for each point in the grid
predictions = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # Use grid points as input
predictions = predictions.reshape(xx.shape)  # Reshape predictions to match the grid's shape

# Plot the decision boundaries
plt.contourf(xx, yy, predictions, alpha=0.8, cmap=plt.cm.Paired)

# Plot the training and testing points
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, marker='s', edgecolor='k', label='Train')
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='o', edgecolor='k', label='Test')

plt.title(f'Decision Boundaries for DT (depth: {best_depth_DT})')
plt.legend()
plt.show()

# ================================ Section 2 ================================


n_trees = 100  # Number of trees
gamma = 0.5  # Fraction of the original training data to use for bootstrap sampling
accuracies_RF = np.zeros(max_depth)  # Accuracy achieved by the Random Forest Classifier for each tree depth

# Test the accuracy of the RF for different tree depths
for depth in depths:
    # Create an instance of the RF classifier
    clf = RandomForestClassifier(n_estimators=n_trees, max_depth=depth, random_state=rnd_seed, bootstrap=True, max_samples=gamma, n_jobs=-1)
    clf.fit(X_train, y_train)  # Train the classifier
    y_pred = clf.predict(X_test)  # Make predictions
    accuracies_RF[depth - 1] = accuracy_score(y_test, y_pred)  # Compute the accuracy

best_depth_RF = np.argmax(accuracies_RF) + 1  # Find the depth of the tree that gives the best accuracy for RF
best_accuracy_RF = accuracies_RF[best_depth_RF - 1]  # Find the best accuracy for RF
print(f'Random Forest: Best depth={best_depth_RF}, Accuracy={best_accuracy_RF}')

# Create an instance of the random forest classifier with the optimal depth
clf = RandomForestClassifier(n_estimators=n_trees, max_depth=best_depth_RF, random_state=rnd_seed, bootstrap=True, max_samples=gamma, n_jobs=-1)
clf.fit(X_train, y_train)  # Train the classifier
predictions = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # Make predictions, use as input the grid points
predictions = predictions.reshape(xx.shape)  # Reshape predictions to match the grid's shape

# Plot the decision boundaries
plt.contourf(xx, yy, predictions, alpha=0.8, cmap=plt.cm.Paired)

# Plot the training and testing points
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, marker='s', edgecolor='k', label='Train')
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='o', edgecolor='k', label='Test')

plt.title(f'Decision Boundaries for RF (depth: {best_depth_RF})')
plt.legend()
plt.show()

# Plot the accuracy versus the depth of the tree for DT and RF
plt.plot(depths, accuracies_DT, label='DT', color='red')
plt.plot(depths, accuracies_RF, label='RF', color='blue')
plt.title('Accuracy vs Depth for DT & RF')
plt.xlabel('Depth')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

gammas = np.linspace(0.1, 1, 20)  # gamma parameter range
accuracies_RF = np.zeros(gammas.size)

# Test the accuracy of the RF for different values of the gamma parameter
# for fixed tree depth equal to the optimal
for i in range(0, gammas.size):
    # Create an instance of the RF classifier
    clf = RandomForestClassifier(n_estimators=n_trees, max_depth=best_depth_RF, random_state=rnd_seed, bootstrap=True, max_samples=gammas[i], n_jobs=-1)
    clf.fit(X_train, y_train)  # Train the classifier
    y_pred = clf.predict(X_test)  # Make predictions
    accuracies_RF[i] = accuracy_score(y_test, y_pred)  # Compute the accuracy

# Plot accuracy versus gamma parameter of the RF for fixed tree depth
plt.plot(gammas, accuracies_RF)
plt.title(f'Accuracy vs gamma for RF (depth: {best_depth_RF})')
plt.xlabel(r'$\gamma$')
plt.ylabel('Accuracy')
plt.show()