# Simulation for Exercise 3.2

## Outline

What do I need to calculate the confidence intervals?

- Functions to generate $\mathbf{X}$ and $\mathbf{y}$ given distribution of $X$, parameters $\beta$, variance of $\mathbf{y}$, and sample size $N$
- Function to calculate $\hat{\beta}$ from $\mathbf{X}$ and $\mathbf{y}$ - ~~use scikit-learn~~
- Regression function $\text{E}(Y|X) = f(X) = X^T\beta$
- Function to calculate variance estimate $\hat{\sigma}^2$
- Functions for $t$ and $\chi^2$ percentiles based on degrees of freedom and confidence level - scipy?
- Functions for endpoints of confidence intervals
- ~~Function to average simulations~~
- Function to plot results
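
For the percentile item above, a minimal sketch assuming SciPy is available. The helper names `t_percentile` and `chi2_percentile` are hypothetical, and so is the convention that `alpha` is the miss probability (0.05 for a 95% interval), with the $t$ percentile taken two-sided and the $\chi^2$ percentile one-sided.

```python
from scipy import stats


def t_percentile(df, alpha):
    """1 - alpha/2 percentile of Student's t with df degrees of freedom (two-sided)."""
    return stats.t.ppf(1 - alpha / 2, df)


def chi2_percentile(df, alpha):
    """1 - alpha percentile of the chi-squared distribution with df degrees of freedom."""
    return stats.chi2.ppf(1 - alpha, df)
```

With a cubic fit there are $p+1=4$ coefficients, so the $\chi^2$ percentile would use 4 degrees of freedom and the $t$ percentile $N-4$.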

This should all be packaged in a big function that generates and plots confidence bounds against the original function. It will be a function of $(\beta,\mu,\tau^2,\sigma^2,N,\alpha)$.

To simplify things:

- Assume data points $x_1,\ldots, x_N$ are sampled from a normal distribution with mean $\mu$ and variance $\tau^2$
- By default, generate confidence bounds and plot for $x_0$ within two standard deviations of $\text{E}(X)=\mu$ (can make this an optional parameter)
- *Don't average simulations - just do one*

I think I should start with the big function and work down.

### Main function

`run_sim(beta,alpha,N,sigma2,xmean,xvar,plot_range=None)`

- Should print $f(x)$, $\hat{f}(x)=x^T\hat{\beta}$, $\hat{\sigma}^2$, $x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0$, confidence intervals at xmean
- Print training error?
- Should plot regression function f and upper and lower limits of confidence bounds for two methods for $x_0$ in given range. Print range and confidence level
- Set `plot_range = [xmean-2*sqrt(xvar), xmean+2*sqrt(xvar)]` if not specified
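
The default range and the points $x_0$ to evaluate could be set up roughly as follows. This is only a sketch: `resolve_plot_range` and `cubic_features` are hypothetical helper names, and the cubic expansion $(1, x, x^2, x^3)$ is assumed from the four-dimensional design used in these notes.

```python
import numpy as np


def resolve_plot_range(xmean, xvar, plot_range=None):
    """Default to two standard deviations either side of the mean."""
    if plot_range is None:
        half_width = 2 * np.sqrt(xvar)  # xvar is a variance, so take the root
        plot_range = [xmean - half_width, xmean + half_width]
    return plot_range


def cubic_features(x):
    """Map scalar inputs x to rows (1, x, x^2, x^3)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.stack((np.ones_like(x), x, x**2, x**3), axis=1)


lo, hi = resolve_plot_range(xmean=0, xvar=4)
grid = cubic_features(np.linspace(lo, hi, 5))  # one row per x_0, shape (5, 4)
```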

### Plot function

`plot_conf(regfun, zconf, chiconf, plot_range)`
- plot regression function and upper and lower limits for z and chi methods for $x_0$ in plot_range
- Show legend
- Print range and confidence level
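
One possible shape for `plot_conf`, as a sketch. It assumes `regfun`, `zconf` and `chiconf` each take a 4-vector $x_0 = (1, x, x^2, x^3)$, and that the two interval functions return `[upper, lower]`. The `alpha` and `num` keyword arguments are additions not in the signature above, with `alpha` read as the miss probability.

```python
import numpy as np
from matplotlib import pyplot as plt


def plot_conf(regfun, zconf, chiconf, plot_range, alpha=None, num=200):
    # Grid of scalar inputs and their cubic feature vectors
    xs = np.linspace(plot_range[0], plot_range[1], num)
    feats = np.stack((np.ones_like(xs), xs, xs**2, xs**3), axis=1)

    f = np.array([regfun(x0) for x0 in feats])
    z = np.array([zconf(x0) for x0 in feats])      # columns: upper, lower
    chi = np.array([chiconf(x0) for x0 in feats])  # columns: upper, lower

    fig, ax = plt.subplots()
    ax.plot(xs, f, "k-", label="f(x)")
    ax.plot(xs, z[:, 0], "b--", label="z method")
    ax.plot(xs, z[:, 1], "b--")
    ax.plot(xs, chi[:, 0], "r:", label="chi method")
    ax.plot(xs, chi[:, 1], "r:")
    ax.set_xlabel("x")
    ax.legend()

    print(f"Range: [{plot_range[0]}, {plot_range[1]}]")
    if alpha is not None:
        print(f"Confidence level: {1 - alpha:.0%}")
    return ax
```

Returning the axes leaves it to the caller to decide whether to show or save the figure.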

### Generate confidence interval functions

`gen_zconf(regfunhat, sigma2hat, X, alpha, XTXinv = None)`
- Return function zconf: x_0 -> [zconf_upper, zconf_lower]

`gen_chiconf(regfunhat, sigma2hat, X, alpha, XTXinv = None)`
- Return function chiconf: x_0 -> [chiconf_upper, chiconf_lower]

Thoughts:
- Both of these functions make use of $(\mathbf{X}^T\mathbf{X})^{-1}$. It would speed things up to store this somewhere
- Store values like XTXinv, X(XTXinv), sigmahat*t, sigmahat*chi in the function so they don't have to be recalculated
- Both functions calculate x_0^T(XTXinv)x_0, but these are all 4 dimensional so shouldn't be computationally restrictive
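
A sketch of the two generators, caching the pieces that do not depend on $x_0$ inside a closure. Assumptions not fixed by the notes above: `alpha` is the miss probability, the pointwise interval uses a $t$ percentile with $N-p-1$ degrees of freedom, and the second interval takes the extremes of $x_0^T\beta$ over the $\chi^2_{p+1}$ confidence set for $\beta$, giving half-width $\hat{\sigma}\sqrt{\chi^2_{p+1,1-\alpha}\,x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0}$.

```python
import numpy as np
from numpy import linalg
from scipy import stats


def gen_zconf(regfunhat, sigma2hat, X, alpha, XTXinv=None):
    if XTXinv is None:
        XTXinv = linalg.inv(X.T @ X)
    N, k = X.shape  # k = p + 1 columns, intercept included
    # Computed once: sigmahat times the two-sided t percentile
    scale = np.sqrt(sigma2hat) * stats.t.ppf(1 - alpha / 2, N - k)

    def zconf(x0):
        half = scale * np.sqrt(x0 @ XTXinv @ x0)
        center = regfunhat(x0)
        return [center + half, center - half]

    return zconf


def gen_chiconf(regfunhat, sigma2hat, X, alpha, XTXinv=None):
    if XTXinv is None:
        XTXinv = linalg.inv(X.T @ X)
    k = X.shape[1]
    # Computed once: sigmahat times the root of the chi-squared percentile
    scale = np.sqrt(sigma2hat * stats.chi2.ppf(1 - alpha, k))

    def chiconf(x0):
        half = scale * np.sqrt(x0 @ XTXinv @ x0)
        center = regfunhat(x0)
        return [center + half, center - half]

    return chiconf
```

Passing the same precomputed `XTXinv` to both generators avoids inverting $\mathbf{X}^T\mathbf{X}$ twice.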

### Calculate $\hat{\sigma}^2$

`gen_sigma2hat(y,yhat)`
- Returns $\hat{\sigma}^2 = \frac{1}{N-p-1}(\mathbf{y}-\hat{\mathbf{y}})^T(\mathbf{y}-\hat{\mathbf{y}})$
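
A minimal sketch of the variance estimate. The signature above does not carry $p$, so the keyword argument `p=3` (a cubic fit, hence $p+1=4$ coefficients) is an assumption added here.

```python
import numpy as np


def gen_sigma2hat(y, yhat, p=3):
    """Unbiased variance estimate: residual sum of squares over N - p - 1."""
    resid = np.asarray(y, dtype=float) - np.asarray(yhat, dtype=float)
    N = resid.shape[0]
    return resid @ resid / (N - p - 1)
```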

### Calculate estimated regression function

`gen_regfunhat(betahat)`
- Returns function regfunhat:x -> $x^T\hat{\beta}$
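
This one is a short closure. A sketch, assuming $x$ is passed as a feature vector (or a matrix with one feature vector per row):

```python
import numpy as np


def gen_regfunhat(betahat):
    betahat = np.asarray(betahat, dtype=float)

    def regfunhat(x):
        # Works for a single feature vector or a matrix of rows
        return np.asarray(x) @ betahat

    return regfunhat


regfunhat = gen_regfunhat([0, 0, 0, 1])
value = regfunhat(np.array([1, 2, 4, 8]))  # features of x = 2, so value is 8.0
```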

#### Calculate least squares estimate $\hat{\beta}$

`fit_ls(X, y, XTXinv = None)`
- Returns $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ and $\hat{\mathbf{y}}=\mathbf{X}\hat{\beta}$

#### Calculate $(\mathbf{X}^T\mathbf{X})^{-1}$

`gen_XTXinv(X)`
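
A sketch of how `gen_XTXinv` and the optional `XTXinv` argument of `fit_ls` might fit together, so that the inverse is computed once and reused. It uses an explicit inverse because the confidence intervals need $(\mathbf{X}^T\mathbf{X})^{-1}$ anyway, and it assumes $\mathbf{X}^T\mathbf{X}$ is well conditioned.

```python
import numpy as np
from numpy import linalg


def gen_XTXinv(X):
    """Inverse of X^T X, computed once so callers can share it."""
    return linalg.inv(X.T @ X)


def fit_ls(X, y, XTXinv=None):
    # Reuse a cached inverse if one is supplied
    if XTXinv is None:
        XTXinv = gen_XTXinv(X)
    betahat = XTXinv @ X.T @ y
    yhat = X @ betahat
    return betahat, yhat
```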

### Sample data

`sample_data(beta,N,sigma,xmean,xvar)`
- Return numpy arrays $X$ and $y$

## Test Space

In [50]:
import numpy as np
from numpy import linalg
from matplotlib import pyplot as plt

In [57]:
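def sample_data(beta, N, sigma, xmean, xvar):
    # np.random.normal takes a standard deviation, so convert the variance xvar
    X = np.random.normal(loc=xmean, scale=np.sqrt(xvar), size=N)
    X = np.expand_dims(X, axis=1)

    # Cubic design matrix with columns 1, x, x^2, x^3
    bias = np.ones(shape=(N, 1))
    X = np.concatenate((bias, X, X**2, X**3), axis=1)

    # Responses: X @ beta plus Gaussian noise with standard deviation sigma
    y = np.random.normal(loc=X@beta, scale=sigma, size=N)

    return X, y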

def fit_ls(X, y):
    # Least squares estimate: betahat = (X^T X)^{-1} X^T y
    betahat = linalg.inv(X.T@X)@X.T@y
    # Fitted values at the training inputs
    yhat = X@betahat

    return betahat, yhat



In [69]:
beta = [0, 0, 0, 1]
N = 10000
sigma = 1
xmean = 0
xvar = 2

In [70]:
X, y = sample_data(beta, N, sigma, xmean, xvar)

In [51]:
XTXinv = linalg.inv(X.T@X)

In [71]:
betahat, yhat = fit_ls(X, y)

In [72]:
betahat, beta

(array([ 0.00451426,  0.00476163, -0.00151666,  1.00009733]), [0, 0, 0, 1])

In [73]:
yhat, y

(array([  0.10258289,  -1.04422809,   0.5394553 , ...,  32.04872766,
        -23.08442243,  -1.57031092]),
 array([ -0.48691099,  -0.44964531,   1.22843217, ...,  32.05909578,
        -24.63605879,  -2.09971816]))