# Homework 2
**Christian Steinmetz**

Due on November 22nd

In [1]:
import numpy as np

## Problem formulation
We set up an experimental framework to study various aspects of overfitting. 
The input space is $X = [−1, 1]$ with uniform input probability density, $P(x) =\frac{1}{2}$.
We consider the two models $\mathcal{H}_2$ and $\mathcal{H}_{10}$. 
The target function is a polynomial of degree $Q_f$, which we write as $f(x)=\sum_{q=0}^{Q_f} a_q L_q(x)$, where $L_q(x)$ are the Legendre polynomials.
We use the Legendre polynomials because they are a convenient orthogonal basis for the polynomials on $[−1, 1]$.
The first two Legendre polynomials are $L_0(x) = 1$, $L_1(x) = x$.
The higher order Legendre polynomials are defined by the recursion:

\begin{equation}
    L_k(x) = \frac{2k-1}{k} x L_{k-1}(x) - \frac{k-1}{k} L_{k-2}(x)
\end{equation}
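As a minimal sketch, the recursion can also be evaluated iteratively, which avoids recomputing lower-order terms. The name `legendre_iterative` is hypothetical (it is not part of the assignment), and the comparison against `numpy.polynomial.legendre.legval` is only a sanity check.

```python
import numpy as np
from numpy.polynomial import legendre as npleg

def legendre_iterative(k, x):
    """Evaluate L_k(x) with the three-term recursion, without recursive calls."""
    x = np.asarray(x, dtype=float)
    prev, curr = np.ones_like(x), x.copy()   # L_0 and L_1
    if k == 0:
        return prev
    for j in range(2, k + 1):
        # L_j = ((2j-1)/j) x L_{j-1} - ((j-1)/j) L_{j-2}
        prev, curr = curr, ((2 * j - 1) / j) * x * curr - ((j - 1) / j) * prev
    return curr

xs = np.linspace(-1, 1, 5)
# legval with a one-hot coefficient vector evaluates a single Legendre polynomial
reference = npleg.legval(xs, [0, 0, 0, 1])
matches = np.allclose(legendre_iterative(3, xs), reference)
```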

The data set is $\mathcal{D} = (x_1, y_1), ..., (x_N , y_N)$, where $y_n = f (x_n)+\sigma \epsilon_n$ and $\epsilon_n$ are i.i.d. standard Normal random variables.

For a single experiment, with specified values for $Q_f$, $N$, $\sigma$, generate a random degree $Q_f$ target function by selecting coefficients $a_q$ independently from a standard Normal distribution, rescaling them so that $\mathbb{E}_{a,x}[f^2]=1$. Generate a data set, selecting
$x_1,...,x_N$ independently from $P(x)$ and $y_n = f (x_n)+\sigma \epsilon_n$. 
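
One way to work out the rescaling factor, assuming the expectation is taken over both the coefficients and the input, uses the orthogonality of the Legendre polynomials under the density $P(x)=\frac{1}{2}$:

\begin{equation}
    \mathbb{E}_x[L_q(x) L_p(x)] = \frac{1}{2}\int_{-1}^{1} L_q(x) L_p(x)\, dx = \frac{\delta_{qp}}{2q+1}
\end{equation}

Since the $a_q$ are independent with $\mathbb{E}[a_q a_p] = \delta_{qp}$, the cross terms vanish and

\begin{equation}
    \mathbb{E}_{a,x}[f^2] = \sum_{q=0}^{Q_f} \sum_{p=0}^{Q_f} \mathbb{E}[a_q a_p]\, \mathbb{E}_x[L_q(x) L_p(x)] = \sum_{q=0}^{Q_f} \frac{1}{2q+1}
\end{equation}

Under this reading, dividing every $a_q$ by $\sqrt{\sum_{q=0}^{Q_f} \frac{1}{2q+1}}$ gives $\mathbb{E}_{a,x}[f^2]=1$.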

Let $g_2$ and $g_{10}$ be the best fit hypotheses to the data from $\mathcal{H}_2$ and $\mathcal{H}_{10}$, respectively, with respective out-of-sample errors $E_{out}(g_2)$ and $E_{out}(g_{10})$.

In [14]:
def legendre_poly(k, x):
    if k == 0:
        return 1
    elif k == 1:
        return x
    else:
        return ((2*k-1)/k) * x * legendre_poly(k-1, x) - ((k-1)/k) * legendre_poly(k-2, x)

In [15]:
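def generate_a_q(Q_f=None):

    # determine the degree, Q_f (drawn at random if not specified)
    if Q_f is None:
        Q_f = np.random.randint(3, 10)

    # sample the Q_f + 1 coefficients a_0, ..., a_{Q_f}
    a_q = np.random.standard_normal(size=Q_f + 1)

    # normalize for E_{a,x}(f^2) = 1: under the uniform density on [-1, 1],
    # E_x(L_q^2) = 1/(2q+1), so E_{a,x}(f^2) = sum_q 1/(2q+1)
    E = np.sum(1.0 / (2 * np.arange(Q_f + 1) + 1))

    return a_q / np.sqrt(E)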

In [16]:
def f(x, a_q):
    # the built-in sum adds the terms elementwise, so f works for scalar or array x
    return sum(a * legendre_poly(k, x) for k, a in enumerate(a_q))

In [17]:
def generate_dataset(f, a_q, sigma, N):
    X = (np.random.rand(N) * 2) - 1
    Y = f(X, a_q) + (sigma * np.random.standard_normal(size=N))

    return X, Y

In [19]:
a_q = generate_a_q()
X, Y = generate_dataset(f, a_q, 0.0, 10)

**(a) Why do we normalize $f$?** [Hint: how would you interpret σ?]

Based on its appearance in $y_n$, $\sigma$ is the scale of the Gaussian noise term $\sigma \epsilon_n$: since $\epsilon_n$ is drawn from a standard normal, the noise has variance $\sigma^2$, and larger values of $\sigma$ spread the noise over a wider range of values.

The polynomial coefficients $a_q$ are also drawn from a standard normal, so $f$ has zero mean over the choice of target, and its second moment $\mathbb{E}_{a,x}[f^2]$ equals its variance. Without normalization this quantity grows with $Q_f$, so the same $\sigma$ would mean a different amount of noise relative to the signal for each target complexity. Rescaling the coefficients $a_q$ so that $\mathbb{E}_{a,x}[f^2]=1$ fixes the average power of the signal at 1. This fixes only the scale of $f$; the values $f(x)$ need not be normally distributed.

From a higher level, $y_n$ is the sum of a signal term with unit average power and a noise term with variance $\sigma^2$, so $\sigma^2$ can be read directly as the noise-to-signal ratio. When $\sigma=1$ the noise has, on average, the same power as the signal, whereas when $\sigma=0$ the output is just $f(x_n)$.

**(b) How can we obtain $g_2$, $g_{10}$?** [Hint: pose the problem as linear regression]

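To find the best fit hypotheses, we map each input $x_n$ to the feature vector $z_n = (L_0(x_n), ..., L_d(x_n))$, with $d=2$ for $\mathcal{H}_2$ and $d=10$ for $\mathcal{H}_{10}$. Each hypothesis is then linear in its weights, $g(x) = w^T z$, so minimizing the in-sample squared error over all of the datapoints in $\mathcal{D}$ is ordinary linear regression. With $Z$ the matrix whose rows are the $z_n$, the solution is $w = (Z^T Z)^{-1} Z^T y$ whenever $Z^T Z$ is invertible.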
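
A minimal sketch of such a least-squares fit, assuming Legendre features are used as the basis for both models. The helper name `fit_legendre` and the toy target are hypothetical, and `numpy.polynomial.legendre.legvander` stands in for building the feature matrix by hand.

```python
import numpy as np
from numpy.polynomial import legendre as npleg

def fit_legendre(X, Y, degree):
    """Least-squares fit in H_degree; returns Legendre coefficients w_0..w_degree."""
    Z = npleg.legvander(X, degree)              # shape (N, degree + 1), columns L_0..L_degree
    w, *_ = np.linalg.lstsq(Z, Y, rcond=None)   # minimizes ||Z w - Y||^2
    return w

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=50)
# noiseless toy target 0.5 * L_1(x) + 0.25 * L_2(x), chosen only for illustration
Y = 0.5 * X + 0.25 * (1.5 * X**2 - 0.5)

w2 = fit_legendre(X, Y, 2)     # g_2
w10 = fit_legendre(X, Y, 10)   # g_10
```

Working in the Legendre basis keeps the fitted weights directly comparable with the target coefficients $a_q$.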

**(c) How can we compute $E_{out}$ analytically for a given $g_{10}$?**

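Since we know the target function $f$, and the fit gives us $g_{10}$ explicitly, $E_{out}$ is the expected squared difference between the two over the input distribution: $E_{out}(g_{10}) = \mathbb{E}_x[(g_{10}(x) - f(x))^2] = \frac{1}{2}\int_{-1}^{1} (g_{10}(x) - f(x))^2\, dx$ (plus a constant $\sigma^2$ if the error is measured against the noisy outputs $y$ rather than $f$). Writing $g_{10}$ in the Legendre basis with coefficients $w_q$, and padding $w_q$ and $a_q$ with zeros to a common length, orthogonality reduces the integral to $E_{out}(g_{10}) = \sum_q \frac{(w_q - a_q)^2}{2q+1}$. A perfect fit, where $g_{10}$ matches $f$ exactly, gives $E_{out} = 0$.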
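
As a sketch, assuming $E_{out}$ is the mean squared difference between the hypothesis and the noiseless target under the uniform density, and that both are expressed by Legendre coefficients, the error has a closed form because $\mathbb{E}_x[L_q(x)^2] = \frac{1}{2q+1}$. The function name `e_out` and the example coefficients are hypothetical; the quadrature line is only a numerical cross-check.

```python
import numpy as np
from numpy.polynomial import legendre as npleg

def e_out(w, a_q):
    """E_x[(g(x) - f(x))^2] for g and f given by Legendre coefficients w and a_q."""
    n = max(len(w), len(a_q))
    diff = np.zeros(n)
    diff[:len(w)] += w          # pad the shorter coefficient vector with zeros
    diff[:len(a_q)] -= a_q
    # orthogonality: E_x[L_q(x)^2] = 1 / (2q + 1) under the density 1/2 on [-1, 1]
    return np.sum(diff**2 / (2 * np.arange(n) + 1))

# cross-check with Gauss-Legendre quadrature (exact for these low degrees)
w = np.array([0.1, -0.3, 0.2])
a = np.array([0.0, 0.5, 0.0, 0.4])
nodes, weights = npleg.leggauss(20)
numeric = 0.5 * np.sum(weights * (npleg.legval(nodes, w) - npleg.legval(nodes, a))**2)
agree = np.isclose(e_out(w, a), numeric)
```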

**(d) Vary $Q_f$, $N$, $\sigma$ and for each combination of parameters, run a large number of experiments, each time computing $E_{out}(g_2)$ and $E_{out}(g_{10})$.**

Averaging these out-of-sample errors gives estimates of the expected out-of-sample error for the given learning scenario ($Q_f$, $N$, $\sigma$) using $\mathcal{H_2}$ and $\mathcal{H_{10}}$. 

Let 

\begin{equation}
    E_{out}(\mathcal{H}_2) = \text{average over experiments}(E_{out}(g_2)), \\
    E_{out}(\mathcal{H}_{10}) = \text{average over experiments}(E_{out}(g_{10})).
\end{equation}

Define the overfit measure $E_{out}(\mathcal{H}_{10}) - E_{out}(\mathcal{H}_2)$. When is the overfit measure significantly positive (i.e. overfitting is serious) as opposed to significantly negative? Try the choices $Q_f \in \{1, 2,..., 100\}, N \in \{20, 25,..., 120\}, \sigma^2 \in \{0, 0.05, 0.1,..., 2\}$.

Explain your observations.
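
One possible sketch of a single grid point of this experiment is given below. It is self-contained and uses NumPy's `numpy.polynomial.legendre` routines rather than the helper functions defined earlier. The names `overfit_measure`, `n_experiments` and `seed` are hypothetical, $E_{out}$ is assumed to be measured against the noiseless target, and the normalization assumes the expectation is over both the coefficients and the input.

```python
import numpy as np
from numpy.polynomial import legendre as npleg

def overfit_measure(Q_f, N, sigma2, n_experiments=200, seed=0):
    """Estimate E_out(H_10) - E_out(H_2) for one setting of (Q_f, N, sigma^2)."""
    rng = np.random.default_rng(seed)
    # scale so that E_{a,x}[f^2] = 1 (assumed reading of the normalization)
    scale = np.sqrt(np.sum(1.0 / (2 * np.arange(Q_f + 1) + 1)))
    # E_x[L_q^2] = 1/(2q+1) for every degree that can appear in g - f
    inv_norm = 1.0 / (2 * np.arange(max(Q_f, 10) + 1) + 1)
    diffs = np.empty(n_experiments)
    for i in range(n_experiments):
        a = rng.standard_normal(Q_f + 1) / scale            # random target
        X = rng.uniform(-1, 1, size=N)                      # inputs from P(x)
        Y = npleg.legval(X, a) + np.sqrt(sigma2) * rng.standard_normal(N)
        errs = []
        for degree in (2, 10):
            Z = npleg.legvander(X, degree)
            w, *_ = np.linalg.lstsq(Z, Y, rcond=None)       # least-squares fit
            d = np.zeros(len(inv_norm))
            d[:degree + 1] += w
            d[:Q_f + 1] -= a
            errs.append(np.sum(d**2 * inv_norm))            # analytic E_out
        diffs[i] = errs[1] - errs[0]
    return diffs.mean()

# a deliberately tiny grid; the full ranges above are much larger
small_grid = {(Q_f, N): overfit_measure(Q_f, N, 0.5, n_experiments=50)
              for Q_f in (2, 15) for N in (20, 120)}
```

Looping over the full ranges of $Q_f$, $N$ and $\sigma^2$ is the same call inside three nested loops, at a correspondingly higher cost.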

**(e) Why do we take the average over many experiments?** Use the variance to select an acceptable number of experiments to average over.
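
A minimal sketch of one way to use the variance here, assuming the runs are independent: the standard error of an average over $n$ runs is $s/\sqrt{n}$, where $s^2$ is the sample variance, so a pilot batch can suggest how many runs keep the standard error below a chosen tolerance. The function name `experiments_needed`, the tolerance, and the pilot values are hypothetical; the pilot values are made up for illustration and are not measured results.

```python
import numpy as np

def experiments_needed(pilot_values, tolerance):
    """Runs needed so the standard error of the mean is at most `tolerance`."""
    s2 = np.var(pilot_values, ddof=1)           # unbiased sample variance of the pilot batch
    return int(np.ceil(s2 / tolerance**2))      # solve s / sqrt(n) <= tolerance for n

pilot = np.array([0.8, 1.2, 1.0, 0.6, 1.4])     # illustrative values only
n_runs = experiments_needed(pilot, tolerance=0.05)
```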