# Homework 2
**Christian Steinmetz**

Due on November 22nd

In [3]:
import numpy as np

## Problem formulation
We set up an experimental framework to study various aspects of overfitting. 
The input space is $X = [-1, 1]$ with uniform input probability density, $P(x) =\frac{1}{2}$.
We consider the two models $\mathcal{H}_2$ and $\mathcal{H}_{10}$. 
The target function is a polynomial of degree $Q_f$, which we write as $f(x)=\sum_{q=0}^{Q_f} a_q L_q(x)$, where $L_q(x)$ are the Legendre polynomials.
We use the Legendre polynomials because they are a convenient orthogonal basis for the polynomials on $[-1, 1]$.
The first two Legendre polynomials are $L_0(x) = 1$, $L_1(x) = x$.
The higher order Legendre polynomials are defined by the recursion:

\begin{equation}
    L_k(x) = \frac{2k-1}{k} x L_{k-1}(x) - \frac{k-1}{k} L_{k-2}(x)
\end{equation}

The data set is $\mathcal{D} = (x_1, y_1), ..., (x_N , y_N)$, where $y_n = f (x_n)+\sigma \epsilon_n$ and $\epsilon_n$ are i.i.d. standard Normal random variables.

For a single experiment, with specified values for $Q_f$, $N$, $\sigma$, generate a random degree $Q_f$ target function by selecting coefficients $a_q$ independently from a standard Normal distribution, rescaling them so that $\mathbb{E}_{a,x}[f^2]=1$. Generate a data set, selecting
$x_1,...,x_N$ independently from $P(x)$ and $y_n = f (x_n)+\sigma \epsilon_n$. 
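
The rescaling factor can be worked out from the orthogonality of the Legendre polynomials under the uniform density. The derivation below is a sketch, assuming the $a_q$ are independent standard Normal variables before rescaling:

```latex
\begin{equation}
    \mathbb{E}_x[L_q(x) L_{q'}(x)] = \frac{1}{2} \int_{-1}^{1} L_q(x) L_{q'}(x) \, dx = \frac{\delta_{qq'}}{2q+1}
\end{equation}

\begin{equation}
    \mathbb{E}_{a,x}[f^2] = \sum_{q=0}^{Q_f} \frac{\mathbb{E}[a_q^2]}{2q+1} = \sum_{q=0}^{Q_f} \frac{1}{2q+1}
\end{equation}
```

Dividing every $a_q$ by $\sqrt{\sum_{q=0}^{Q_f} \frac{1}{2q+1}}$ therefore gives $\mathbb{E}_{a,x}[f^2]=1$.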

Let $g_2$ and $g_{10}$ be the best fit hypotheses to the data from $\mathcal{H}_2$ and $\mathcal{H}_{10}$, respectively, with respective out-of-sample errors $E_{out}(g_2)$ and $E_{out}(g_{10})$.

In [17]:
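def legendre_poly(k, x):
    # evaluate L_k(x) iteratively; works for scalar or array x
    x = np.asarray(x, dtype=float)
    L_prev, L_curr = np.ones_like(x), x
    if k == 0:
        return L_prev
    for j in range(2, k + 1):
        # recursion: L_j = ((2j-1)/j) x L_{j-1} - ((j-1)/j) L_{j-2}
        L_prev, L_curr = L_curr, ((2*j-1)/j) * x * L_curr - ((j-1)/j) * L_prev
    return L_curr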

In [80]:
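def generate_a_q(Q_f=None):

    # determine the degree, Q_f (drawn at random when not specified)
    if Q_f is None:
        Q_f = np.random.randint(3, 10)

    # sample values for a_q, q = 0..Q_f (Q_f + 1 coefficients)
    a_q = np.random.standard_normal(size=Q_f + 1)

    # normalize for E(f^2) = 1, using E_x[L_q(x)^2] = 1/(2q+1)
    # for x uniform on [-1, 1]
    E = np.sum(1 / (2 * np.arange(Q_f + 1) + 1))

    return a_q / np.sqrt(E)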

In [81]:
def f(x, a_q):
    # built-in sum adds the terms elementwise, so x may be a scalar or an array
    return sum(a * legendre_poly(k, x) for k, a in enumerate(a_q))

In [82]:
def generate_dataset(f, a_q, sigma, N):
    # sample x uniformly from [-1, 1], matching P(x) = 1/2
    X = np.random.uniform(-1, 1, size=N)
    Y = f(X, a_q) + (sigma * np.random.standard_normal(size=N))

    return X, Y

In [83]:
a_q = generate_a_q()
X, Y = generate_dataset(f, a_q, 0.0, 10)

(a) Why do we normalize $f$? [Hint: how would you interpret $\sigma$?]

(b) How can we obtain $g_2$, $g_{10}$? [Hint: pose the problem as linear regression]
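
One way to pose the fit as linear regression is to transform each $x_n$ into its Legendre features and solve the resulting least-squares problem. The block below is a minimal sketch under that assumption; `fit_legendre` and the sample data are hypothetical names introduced for illustration, and it uses NumPy's built-in Legendre helpers rather than the `legendre_poly` function above.

```python
import numpy as np
from numpy.polynomial import legendre


def fit_legendre(x, y, degree):
    """Least-squares fit in the Legendre basis; returns weights w_0..w_degree."""
    # Z[n, q] = L_q(x_n): the nonlinear transform that makes the problem linear.
    Z = legendre.legvander(x, degree)
    # Minimise ||Z w - y||^2; lstsq also copes with a rank-deficient Z.
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return w


# Hypothetical noiseless data from a known degree-2 target.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
true_w = np.array([0.5, -1.0, 2.0])
y = legendre.legval(x, true_w)

w2 = fit_legendre(x, y, 2)    # coefficients of g_2
w10 = fit_legendre(x, y, 10)  # coefficients of g_10
```

Working in the Legendre basis keeps the fitted weights directly comparable to the target coefficients $a_q$.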

(c) How can we compute $E_{out}$ analytically for a given $g_{10}$?
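
If $g$ is expressed in the Legendre basis with weights $w_q$, orthogonality reduces the out-of-sample error to a weighted sum of squared coefficient differences, $\sum_q (w_q - a_q)^2 / (2q+1)$, plus $\sigma^2$ if the error is measured against the noisy $y$. The sketch below assumes that convention; `e_out` and the example coefficients are hypothetical, and the result is compared against Gauss-Legendre quadrature as a sanity check.

```python
import numpy as np
from numpy.polynomial import legendre


def e_out(w, a_q, sigma=0.0):
    """E_x[(g(x) - f(x))^2] + sigma^2 for Legendre coefficients w (g) and a_q (f)."""
    n = max(len(w), len(a_q))
    # Zero-pad both coefficient vectors to a common length and subtract.
    d = np.zeros(n)
    d[:len(w)] += w
    d[:len(a_q)] -= a_q
    # Orthogonality: E_x[L_q(x)^2] = 1 / (2q + 1) for x uniform on [-1, 1].
    q = np.arange(n)
    return np.sum(d**2 / (2 * q + 1)) + sigma**2


# Hypothetical coefficients, checked against numerical integration.
w = np.array([0.1, 0.9, -0.3])
a_q = np.array([0.0, 1.0, 0.0, 0.5])
nodes, weights = legendre.leggauss(20)
diff = legendre.legval(nodes, w) - legendre.legval(nodes, a_q)
numeric = 0.5 * np.sum(weights * diff**2)  # the factor 0.5 is the density P(x)
analytic = e_out(w, a_q)
```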

(d) Vary $Q_f$, $N$, $\sigma$ and for each combination of parameters, run a large number of experiments, each time computing $E_{out}(g_2)$ and $E_{out}(g_{10})$. 
Averaging these out-of-sample errors gives estimates of the expected out-of-sample error for the given learning scenario ($Q_f$, $N$, $\sigma$) using $\mathcal{H}_2$ and $\mathcal{H}_{10}$. 

Let 

\begin{equation}
    \begin{aligned}
    E_{out}(\mathcal{H}_2) &= \text{average over experiments}(E_{out}(g_2)), \\
    E_{out}(\mathcal{H}_{10}) &= \text{average over experiments}(E_{out}(g_{10})).
    \end{aligned}
\end{equation}

Define the overfit measure $E_{out}(\mathcal{H}_{10}) - E_{out}(\mathcal{H}_2)$. When is the overfit measure significantly positive (i.e. overfitting is serious) as opposed to significantly negative?
Try the choices $Q_f \in \{1, 2,..., 100\}, N \in \{20, 25,..., 120\}, \sigma^2 \in \{0, 0.05, 0.1,..., 2\}$.

Explain your observations.
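
The experiment for a single $(Q_f, N, \sigma)$ combination could be sketched as follows. This is an illustration rather than a reference solution: `overfit_measure`, `n_experiments`, and `seed` are hypothetical names, the fits use NumPy's Legendre helpers, and $E_{out}$ is assumed to be measured against the noiseless target $f$.

```python
import numpy as np
from numpy.polynomial import legendre


def overfit_measure(Q_f, N, sigma, n_experiments=200, seed=0):
    """Monte Carlo estimate of E_out(H_10) - E_out(H_2) for one (Q_f, N, sigma)."""
    rng = np.random.default_rng(seed)
    # Scale so that E_{a,x}[f^2] = 1, using E_x[L_q^2] = 1 / (2q + 1).
    scale = np.sqrt(np.sum(1.0 / (2 * np.arange(Q_f + 1) + 1)))
    size = max(Q_f, 10) + 1
    inv_norm = 1.0 / (2 * np.arange(size) + 1)
    diffs = np.empty(n_experiments)
    for i in range(n_experiments):
        # Random target coefficients, zero-padded to a common length.
        a = np.zeros(size)
        a[:Q_f + 1] = rng.standard_normal(Q_f + 1) / scale
        # Noisy data set drawn from the uniform input density.
        x = rng.uniform(-1, 1, size=N)
        y = legendre.legval(x, a) + sigma * rng.standard_normal(N)
        e = {}
        for deg in (2, 10):
            # Least-squares fit in the Legendre basis, then analytic E_out.
            w = np.zeros(size)
            w[:deg + 1] = np.linalg.lstsq(legendre.legvander(x, deg), y, rcond=None)[0]
            e[deg] = np.sum((w - a) ** 2 * inv_norm)
        diffs[i] = e[10] - e[2]
    return diffs.mean()


# Example call for one point of the parameter grid.
measure = overfit_measure(Q_f=20, N=40, sigma=0.5)
```

Looping this function over the grid of $Q_f$, $N$, and $\sigma^2$ values gives the table of overfit measures to interpret.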

(e) Why do we take the average over many experiments? Use the variance to select an acceptable number of experiments to average over.
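
One possible way to use the variance is to run a small pilot batch, estimate the sample standard deviation $s$ of the per-experiment overfit measure, and pick the number of runs $n$ so that the standard error $s/\sqrt{n}$ falls below a chosen tolerance. The sketch below assumes that approach; `experiments_needed`, the tolerance, and the stand-in pilot values are all hypothetical.

```python
import numpy as np


def experiments_needed(samples, tolerance):
    """Number of runs so the standard error of the mean drops below `tolerance`."""
    # Standard error over n runs is s / sqrt(n), so n >= (s / tolerance)^2.
    s = np.std(samples, ddof=1)
    return int(np.ceil((s / tolerance) ** 2))


# Hypothetical pilot batch: stand-in values for E_out(g_10) - E_out(g_2).
rng = np.random.default_rng(0)
pilot = rng.normal(loc=0.3, scale=2.0, size=500)
n_runs = experiments_needed(pilot, tolerance=0.05)
```

In practice the pilot values would come from real experiments, and the tolerance would be set relative to the size of the overfit measure being resolved.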