# SYDE 675 --- Assignment 1
**Student ID: 20823934**

*Note:* Please include your numerical student ID only, do *not* include your name.

*Note:* Cells you need to fill out are marked with a "writing hand" symbol. Of course, you can add new cells in between the instructions, but please leave the instructions intact to facilitate marking.

In [1]:
# Import numpy and matplotlib
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Fix the numpy random seed for reproducible results
np.random.seed(18945)

# Some formatting options
%config InlineBackend.figure_formats = ['svg']

# 3. Programming

## 3.1 Overcoming a Prior

As we covered in class, the Beta distribution is the conjugate prior to a Bernoulli distribution. Here we will generate $n$ observations by flipping an unfair coin whose probability of heads is $P(H=1) = p$, where the true value is $p=0.3$.

In [2]:
def flip_coin(p=0.3, num_flips=1):
    return (np.random.uniform(low=0,high=1,size=(num_flips,)) < p).astype(int)

We will compute a posterior distribution for $p$ by updating a Beta distribution using observations from the above ```flip_coin``` function.  To illustrate the effect a prior can have, we will compute the posterior distribution using a "good", "bad", and "uninformative" set of parameters, the "pseudocounts" discussed in class. 

| Condition | $a$ | $b$ | 
|:--- |:---:|:---:|
| Good | 10 | 30 |
| Bad | 10 | 1 | 
| Uninformative | 1 | 1 |
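
As a small illustration of how these pseudocounts are updated: if $h$ heads are observed in $n$ flips, the posterior is $\mathrm{Beta}(a+h,\ b+n-h)$. The sketch below assumes a hypothetical helper `update_beta` and a made-up sequence of five flips; neither is part of the assignment.

```python
import numpy as np

def update_beta(a, b, flips):
    """Return the posterior Beta parameters after observing 0/1 coin flips."""
    flips = np.asarray(flips)
    num_heads = int(flips.sum())
    num_tails = int(flips.size - num_heads)
    # Heads are added to a, tails are added to b
    return a + num_heads, b + num_tails

# Hypothetical data: 2 heads and 3 tails, starting from the uninformative prior
a_post, b_post = update_beta(a=1, b=1, flips=[1, 0, 0, 1, 0])
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)
```

The prior parameters act like previously observed flips, which is why a prior with large $a$ and small $b$ needs many observed tails before the posterior moves away from it.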

Below we show how to evaluate the PDF of a Beta distribution given parameters $a$ and $b$.

In [None]:
good_condition = {'a':10,'b':30}
bad_condition = {'a':10,'b':1}
uninformative_condition = {'a':1,'b':1}

ps = np.linspace(0,1,100) # The true p lies in [0,1]; evaluate the PDF on a grid of 100 candidate values.

cond_names = ['Good','Bad','Uninformative']
plt.figure(figsize=(10,5))
for cond_idx, condition in enumerate([good_condition, bad_condition, uninformative_condition]):
    plt.subplot(1,3,1+cond_idx)

    beta_rv = beta(a=condition['a'],b=condition['b'])
    plt.plot(ps, beta_rv.pdf(ps), label='Prior')    # Plot the Beta PDF at 100 evenly spaced points in [0, 1]
    plt.title(cond_names[cond_idx])
    
    plt.legend()
    
plt.tight_layout()

Your task is to 

1. Compute and plot the posterior distribution over $p$ for each condition, for $n \in \{10,100,1000,10000\}$.
2. For each condition, plot the error in the MAP estimate of $p$ as a function of $n$.

Hints: 

1. The Beta distribution is the conjugate prior for the Bernoulli distribution.
2. The MAP estimate will be the value of $p \in [0,1]$ for which the posterior PDF takes on a maximum value.
3. For the purposes of this assignment, you can approximate the MAP estimate by finely sampling the domain $[0,1]$, say 100 sample points, _i.e._, ```ps = np.linspace(0,1,100)```.
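
To make the last two hints concrete, here is a minimal sketch comparing the grid approximation with the closed-form mode of a Beta distribution, $(a-1)/(a+b-2)$, which holds only when $a > 1$ and $b > 1$. The helper names `beta_mode` and `grid_map` and the example counts (30 heads in 100 flips) are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import beta

def beta_mode(a, b):
    """Closed-form mode of Beta(a, b); only valid when a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("closed-form mode requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

def grid_map(a, b, num_points=100):
    """Approximate the mode by taking the argmax of the PDF on a uniform grid."""
    ps = np.linspace(0, 1, num_points)
    return ps[np.argmax(beta(a=a, b=b).pdf(ps))]

# Hypothetical posterior: prior (a=10, b=30) updated with 30 heads in 100 flips
a_post, b_post = 10 + 30, 30 + 70
exact = beta_mode(a_post, b_post)
approx = grid_map(a_post, b_post)
print(f"closed form: {exact:.4f}, grid: {approx:.4f}")
```

Because the posterior is unimodal, the grid estimate should land within one grid spacing ($1/99$) of the closed-form value.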

In [None]:
# ✍ \<YOUR SOLUTION HERE\>

ns = [10, 100, 1000, 10000]

# 1. Compute and plot the posterior distribution over p for each condition and each n
np.random.seed(18945)
plt.figure(figsize=(10,5))
for cond_idx, condition in enumerate([good_condition, bad_condition, uninformative_condition]):
    # Aim the lines at the appropriate subplot
    plt.subplot(1,3,1+cond_idx)    
    for n in ns:
        # Conjugate update: add the observed heads to a and the observed tails to b
        additional_flips = flip_coin(num_flips=n)
        num_heads = np.sum(additional_flips)
        new_a = condition['a'] + num_heads
        new_b = condition['b'] + n - num_heads
        beta_rv = beta(a=new_a, b=new_b)
        
        plt.plot(ps, beta_rv.pdf(ps), label=f'n={n}')
    
    # Plot the initial prior
    beta_rv = beta(a=condition['a'],b=condition['b'])
    plt.plot(ps, beta_rv.pdf(ps), label='Prior')
    
    plt.title(cond_names[cond_idx])
    plt.xlabel('p')
    plt.ylabel('Probability density')
    plt.legend()
plt.tight_layout()

# 2. For each condition, plot the error in the MAP estimate of p as a function of n
np.random.seed(18945)
plt.figure(figsize=(10,5))
for cond_idx, condition in enumerate([good_condition, bad_condition, uninformative_condition]):
    plt.subplot(1,3,1+cond_idx)
    errors = []
    for n in ns:
        # Conjugate update: add the observed heads to a and the observed tails to b
        additional_flips = flip_coin(num_flips=n)
        num_heads = np.sum(additional_flips)
        new_a = condition['a'] + num_heads
        new_b = condition['b'] + n - num_heads
        beta_rv = beta(a=new_a, b=new_b)
        
        # The MAP estimate is the value of p for which the posterior PDF has the max value
        # Use the absolute error
        map_estimate = ps[np.argmax(beta_rv.pdf(ps))]
        errors.append(np.abs(map_estimate - 0.3))
    plt.plot(ns, errors, 'o-')
    plt.title(cond_names[cond_idx])
    plt.xlabel('n')
    plt.ylabel('Absolute error of MAP estimate')
plt.tight_layout()

**Observations**

Regardless of the quality of the prior, additional observations overcame it and gave a more accurate estimate of the true heads probability. The top three plots show the posterior PDFs concentrating around `p=0.3` as $n$ grows. The bottom three plots confirm this by showing the absolute error of the MAP estimate shrinking as a function of $n$. However, none of the cases reached zero error: the observed head count is a random sample, so the posterior mode only approximates the true $p$, and the 100-point grid does not contain 0.3 exactly (its nearest point is $30/99 \approx 0.303$). Additionally, even though the error decreased monotonically for this seed, such a result is not guaranteed in general.
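
One contributor to the non-zero error is the evaluation grid itself. The short check below finds the smallest error that any grid-based MAP estimate could achieve with `np.linspace(0,1,100)`. The variable names `nearest` and `grid_floor` are introduced here for illustration.

```python
import numpy as np

true_p = 0.3
ps = np.linspace(0, 1, 100)                    # same grid as used for the MAP estimate
nearest = ps[np.argmin(np.abs(ps - true_p))]   # grid point closest to the true p
grid_floor = abs(nearest - true_p)             # smallest error any grid estimate can have
print(nearest, grid_floor)
```

A finer grid, or the closed-form mode of the Beta posterior, would remove this floor but not the sampling error in the observed head count.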

## 3.2 Classification with a Kernel Density Estimator

Here we will use the Scikit-Learn kernel density estimator to construct a Bayesian classifier.  We will test it using the two moons data set.  

Remember that given a dataset of observations $S = ((x_{1},y_{1}),\ldots,(x_{m},y_{m}))$ and a kernel density estimator $p(x\mid (x_{1},\ldots,x_{m}))$, we can construct our Bayesian classifier using the following elements:


1. The likelihood is approximated as $p(x \mid C = c) = p(x \mid \{x_{i} \in S\mid_{x} : y_{i} = c\})$.
1. The marginal distribution for $x$ is approximated as $p(x) = p(x \mid S\mid_{x})$.
1. The prior is approximated as $p(C=c) = \frac{|\{y_{i} \in S\mid_{y} : y_{i} = c\}|}{|S\mid_{y}|}$. Note that we can implement this with a simple histogram with 2 bins for binary classification.
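
These three elements are tied together: when every estimator uses the same kernel and bandwidth, the marginal KDE equals the prior-weighted sum of the class KDEs, $p(x) = \sum_{c} p(x \mid C=c)\,p(C=c)$. The sketch below checks this numerically. The bandwidth `h = 0.5` and the helper `density` are arbitrary choices made for this check, not values from the assignment.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KernelDensity

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)
h = 0.5  # arbitrary bandwidth chosen for this check

def density(points, queries):
    """Evaluate a Gaussian KDE fit on `points` at `queries`."""
    kde = KernelDensity(kernel='gaussian', bandwidth=h).fit(points)
    return np.exp(kde.score_samples(queries))

prior_0 = np.mean(y == 0)
# Prior-weighted sum of the two class-conditional densities
mixture = prior_0 * density(X[y == 0], X) + (1 - prior_0) * density(X[y == 1], X)
# KDE fit on all samples, ignoring the labels
marginal = density(X, X)
print(np.allclose(mixture, marginal))
```

Since the marginal is the same for both classes, it rescales the posteriors without changing which class has the larger one.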

Let us generate some data points.

In [None]:
num_training_samples = 100
num_testing_samples = 100

colors = ['tab:blue','tab:orange']


from sklearn.datasets import make_moons

np.random.seed(0)
X_train, y_train = make_moons(n_samples=num_training_samples, noise=0.1)
X_test, y_test = make_moons(n_samples=num_testing_samples, noise=0.1)


plt.scatter(X_train[:,0], X_train[:,1],color=[colors[y.astype(int)] for y in y_train])

Now we need to fit a kernel density estimator.  We will use SciKit Learn's Kernel Density Estimator to fit the training data.  We will use a Gaussian kernel. We will modify the _length scale_ or _bandwidth_ parameter to change the performance of the classifier.
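
For reference, a Gaussian kernel density estimate averages one isotropic Gaussian bump of width $h$ per training point. The sketch below writes that formula out by hand and compares it with `KernelDensity`. The function `gaussian_kde_manual`, the random data, and `h = 0.3` are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def gaussian_kde_manual(train, queries, h):
    """Average of isotropic Gaussian bumps of width h centred on each training point."""
    d = train.shape[1]
    # Squared distance between every query and every training point
    sq_dists = ((queries[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    norm = (2 * np.pi * h ** 2) ** (d / 2)   # Gaussian normalising constant in d dimensions
    return np.exp(-sq_dists / (2 * h ** 2)).mean(axis=1) / norm

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 2))     # hypothetical 2-D training data
queries = rng.normal(size=(5, 2))    # hypothetical query points
h = 0.3

manual = gaussian_kde_manual(train, queries, h)
kde = KernelDensity(kernel='gaussian', bandwidth=h).fit(train)
library = np.exp(kde.score_samples(queries))
print(np.allclose(manual, library))
```

A small $h$ gives narrow bumps that follow individual training points, while a large $h$ blurs them into one broad density.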

In [None]:
from sklearn.neighbors import KernelDensity

kde = KernelDensity(kernel='gaussian', bandwidth=1).fit(X_train)

prob_train = np.exp(kde.score_samples(X_train)) # score_samples returns the log density, so exponentiate it
prob_test = np.exp(kde.score_samples(X_test))
plt.scatter(X_train[:,0], X_train[:,1], marker='x', c=prob_train, label='Train')
plt.scatter(X_test[:,0], X_test[:,1], marker='o', c=prob_test, label='Test')
plt.colorbar()
plt.legend()

Note that in the above, we computed the probability density of the samples using the statement ```np.exp(kde.score_samples(X_train))```.
We exponentiate the output of ```score_samples``` because it returns the log of the density, as opposed to the density itself.

Next you will construct the classifier using a kernel density estimator for the whole data set (as constructed above), one just for class 0 samples, and another just for class 1 samples. 



In [7]:
def classification_error(y_true, y_pred):
    return np.mean(y_true != y_pred)

# Using a trained classifier to predict the class of the test samples
def classifier(x_test, kde_0, kde_1, kde_marginal, prior):
    likelihood_0 = np.exp(kde_0.score_samples(x_test))          # ✍ p(x | D_0) --> Use the KDE for class 0
    likelihood_1 = np.exp(kde_1.score_samples(x_test))          # ✍ p(x | D_1) --> Use the KDE for class 1
    marginal_prob = np.exp(kde_marginal.score_samples(x_test))  # ✍ p(x | D) --> Get the marginal probability
    posterior_0 = likelihood_0 * prior / marginal_prob          # ✍ p(C=0|x) = p(x|C=0)p(C=0)/p(x) --> Assume prior is for class 0
    posterior_1 = likelihood_1 * (1 - prior) / marginal_prob    # ✍ p(C=1|x) = p(x|C=1)p(C=1)/p(x)
    
    return np.argmax([posterior_0, posterior_1], axis=0)
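
As a possible variation on the classifier above, the comparison can be done directly on the log values, which avoids underflow when a density is extremely small. This is only a sketch: the function `classify_log_space` and the example log-likelihoods are hypothetical, and the marginal is left out because it is common to both classes.

```python
import numpy as np

def classify_log_space(log_lik_0, log_lik_1, prior_0):
    """Pick the class with the larger log posterior; ties go to class 0."""
    log_post_0 = log_lik_0 + np.log(prior_0)
    log_post_1 = log_lik_1 + np.log(1 - prior_0)
    return (log_post_1 > log_post_0).astype(int)

# Hypothetical log-likelihoods; the first point is far from both classes
log_lik_0 = np.array([-1000.0, -2.0])
log_lik_1 = np.array([-900.0, -5.0])

preds_log = classify_log_space(log_lik_0, log_lik_1, prior_0=0.5)
# Exponentiating first underflows to 0.0 for both classes at the first point
preds_naive = np.argmax([np.exp(log_lik_0) * 0.5, np.exp(log_lik_1) * 0.5], axis=0)
print(preds_log, preds_naive)
```

At the first point the two approaches disagree: in log space class 1 is clearly more likely, while after exponentiation both posteriors are zero and `np.argmax` falls back to class 0.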

You will compute the testing and training classification error, which is the classification performance on the testing and training sets using KDEs that are fit using only the training data set.

Below you will plot the classification error, which you can compute using the ```sklearn.metrics.zero_one_loss``` function.  Plot the classification errors as a function of the kernel bandwidth, testing values of $h \in \{0.01,0.1,1,10,100\}$.  Use the same kernel bandwidth for all three estimators.
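
As a quick sanity check, `zero_one_loss` with its default settings returns the fraction of misclassified samples, which is the same quantity as the mean of `y_true != y_pred`. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import zero_one_loss

# Hypothetical labels: 2 of the 5 predictions are wrong
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

manual = np.mean(y_true != y_pred)        # fraction of mismatches
library = zero_one_loss(y_true, y_pred)   # normalised zero-one loss
print(manual, library)
```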

In [None]:
# ✍ \<YOUR SOLUTION HERE\>
from sklearn.metrics import zero_one_loss

def plot_classification_error(kernel='gaussian', num_training_samples=100, num_testing_samples=100, seed=0):
    np.random.seed(seed)
    X_train, y_train = make_moons(n_samples=num_training_samples, noise=0.1)
    X_test, y_test = make_moons(n_samples=num_testing_samples, noise=0.1)
    
    bandwidths = [0.01, 0.1, 1, 10, 100]
    train_errors = []
    test_errors = []
    for bandwidth in bandwidths:
        kde_0 = KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(X_train[y_train == 0])    # Fit using only the training set
        kde_1 = KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(X_train[y_train == 1])
        kde_marginal = KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(X_train)    # Same bandwidth as the class KDEs
        prior = np.mean(y_train == 0)   # Assume prior is for class 0
        
        y_train_pred = classifier(X_train, kde_0, kde_1, kde_marginal, prior)
        train_errors.append(zero_one_loss(y_train, y_train_pred))   # Use zero-one loss which is the same as classification_error()
        
        y_test_pred = classifier(X_test, kde_0, kde_1, kde_marginal, prior)
        test_errors.append(zero_one_loss(y_test, y_test_pred))
        
    plt.plot(bandwidths, train_errors, 'o-', label='Train')
    plt.plot(bandwidths, test_errors, 'o-', label='Test')
    plt.xscale('log')
    plt.title(f'Kernel: {kernel}, Train samples: {num_training_samples}, Test samples: {num_testing_samples}')
    plt.xlabel('Bandwidth')
    plt.ylabel('Classification Error')
    plt.legend()
    plt.show()
    
plot_classification_error(kernel='gaussian', num_training_samples=100)

Repeat the above process for a Top Hat kernel with $n=100$ training sample points.

In [None]:
# ✍ \<YOUR SOLUTION HERE\>
plot_classification_error(kernel='tophat', num_training_samples=100)

Repeat the process for a Gaussian kernel with $n=400$ training sample points.

In [None]:
# ✍ \<YOUR SOLUTION HERE\>
plot_classification_error(kernel='gaussian', num_training_samples=400)

**Observations**

All three setups show signs of overfitting, in that the training error is lower than the test error. This is most evident with the Top Hat kernel, which has 0 training error at `h = 0.01` but a test error of 0.5 (when no training point lies within `h` of a test point, both class likelihoods are zero and the classifier defaults to class 0). In all three setups the error increases with bandwidth, which makes sense: a large bandwidth oversmooths both class densities until they are nearly indistinguishable, so the classifier can no longer separate the classes. Both Gaussian setups achieved 0 training and test error at `h = 0.01` and `h = 0.1`, and they differ little at the larger bandwidths. However, the setup with 400 training samples had a slightly higher test error, which could indicate increased overfitting, although with a single seed the gap may also be sampling noise.
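
The effect of a large bandwidth can be illustrated by measuring how far apart the two class log-likelihoods are on the training points. The sketch below is a rough diagnostic under assumed settings (Gaussian kernel, 100 samples, `random_state=0`, bandwidths 0.1 and 100), not a reproduction of the plots above.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KernelDensity

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)

max_gap = {}
for h in [0.1, 100]:
    # Class-conditional log densities evaluated at every training point
    log_lik_0 = KernelDensity(kernel='gaussian', bandwidth=h).fit(X[y == 0]).score_samples(X)
    log_lik_1 = KernelDensity(kernel='gaussian', bandwidth=h).fit(X[y == 1]).score_samples(X)
    # Largest difference between the two class log-likelihoods
    max_gap[h] = np.max(np.abs(log_lik_0 - log_lik_1))
    print(f"h={h}: max log-likelihood gap = {max_gap[h]:.4g}")
```

At the large bandwidth the gap is close to zero, so the two likelihoods carry almost no information about the class and small numerical differences decide the prediction.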