Last week, we looked at the problem of linear regression, and in particular at how some funny things can happen when we give a model too much freedom.  We saw that as models became more versatile and complex (for example, by increasing the number of available degrees of freedom), they started to fit the noise in the data rather than the underlying function.  We saw two ways to control this: limiting the degrees of freedom of the model (not always possible or straightforward), or explicitly adding regularization.  But even then, neither approach gives us a lot of tangible insight into whether we're fitting the underlying function or the noise, and because of this, we also don't have any understanding of how much we can trust the parameters that we find.  We can see this with a demonstration.  First, suppose there's some physical process that takes an input $x$ and outputs $y$ deterministically.

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['figure.figsize'] = [12,8]
np.random.seed(42)

def y(x):
    return -1 + x + x**2 + 0*x**3

We have access to the generating function $y(x) = -1 + x + x^2$, but let's pretend that we don't.  Now, let's imagine that we want to infer the model $y$ by measuring the output $\hat{y}=y(x) + \epsilon$ at some discrete points $x\in[0,1]$, where the measurement is subject to some random noise $\epsilon$.

In [None]:
x = np.random.rand(11)
x.sort()
epsilon = 1e-1
yhat = y(x) + epsilon*np.random.randn(11)

Now, we do the normal thing and fit a polynomial of degree, say, 3 to the data, then plot it.

In [None]:
degree = 3
X = np.vander(x,degree+1,increasing=True)
w_0 = np.linalg.solve(np.dot(X.T,X),np.dot(X.T,yhat))
plt.plot(x,yhat,'k.')
plt.plot(x,np.dot(X,w_0))
plt.xlabel('x')
plt.ylabel('y')
plt.show()

In [None]:
w_0
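
As a cross-check on the normal-equations solve above, the same least-squares problem can be handed to `np.linalg.lstsq`, which works on $X$ directly instead of forming $X^TX$.  The sketch below regenerates the data the same way as the cells above (the names `w_normal` and `w_lstsq` are introduced here purely for illustration); for a small, well-conditioned problem like this one the two sets of coefficients should agree.

```python
import numpy as np

np.random.seed(42)

def y(x):
    return -1 + x + x**2 + 0*x**3

# Same setup as above: 11 sorted random points in [0,1] with Gaussian noise
x = np.sort(np.random.rand(11))
epsilon = 1e-1
yhat = y(x) + epsilon*np.random.randn(11)

degree = 3
X = np.vander(x, degree+1, increasing=True)

# Normal equations, as in the cell above
w_normal = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, yhat))

# Least-squares solver applied directly to X (illustrative alternative)
w_lstsq, residuals, rank, sing_vals = np.linalg.lstsq(X, yhat, rcond=None)

print(w_normal)
print(w_lstsq)
print(np.allclose(w_normal, w_lstsq))
```

For higher polynomial degrees the Vandermonde matrix becomes poorly conditioned, and the two routes may start to differ noticeably.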

What if we take our measurements again, and do the same procedure?

In [None]:
yhat = y(x) + epsilon*np.random.randn(11)
w_1 = np.linalg.solve(np.dot(X.T,X),np.dot(X.T,yhat))
plt.plot(x,yhat,'k.')
plt.plot(x,np.dot(X,w_1))
plt.xlabel('x')
plt.ylabel('y')
plt.show()

The fits look similar.  But how about the parameter values?

In [None]:
plt.plot(w_0)
plt.plot(w_1)
plt.show()

Huh, this is somewhat troubling.  Even though we know for certain that the model that is generating the data is the same, we're getting different parameter values each time.  This is especially troubling if our model parameters have a real-world meaning that we need to use.  

What if we were to run the same experiment, say, 10000 times?

In [None]:
w_list = []
N_experiments = 10000
for i in range(N_experiments):
    yhat = y(x) + epsilon*np.random.randn(11)
    w_list.append(np.linalg.solve(np.dot(X.T,X),np.dot(X.T,yhat)))
w_array = np.array(w_list)

In [None]:
fig,axs = plt.subplots(nrows=1,ncols=4,sharey=True)
axs[0].hist(w_array[:,0],20,density=True)
axs[1].hist(w_array[:,1],20,density=True)
axs[2].hist(w_array[:,2],20,density=True)
axs[3].hist(w_array[:,3],20,density=True)
plt.show()

In [None]:
w_array.mean(axis=0)

If we can run our experiments a very large number of times, the mean of all the different solutions for $w$ gets quite close to the correct value!  The mean coefficient on the cubic term even comes out close to zero, which tells us that we didn't need to include it.  Additionally, the width of each histogram gives us a sense of how much trust we should put in a given parameter.  Doing it this way is clearly much better than dealing with all this silly regularization, right? Problem solved, see you next semester.

But we have a bit of a problem: we only have a certain amount of data, and we usually can't rerun the experiment thousands of times.  We still want access to the kind of information that the analysis above gave us.  The way to do this is to move to a way of thinking about these problems that treats data and parameters (and even models) as distributions of possible values (like the histograms above), rather than as single points, which allows us to fully embrace uncertainty.  But to get to this framework, we really need to understand probability first.

In [None]:
x,yhat

In [None]:
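# alpha: precision of a zero-mean Gaussian prior on w (0 means no regularization)
alpha = 0
# beta: precision (inverse variance) of the measurement noise
beta = 1./epsilon**2
X = np.vander(x,degree+1,increasing=True)
# Covariance and mean of the resulting Gaussian distribution over w
Sigma = np.linalg.inv(beta*np.dot(X.T,X) + alpha*np.eye(degree+1))
w = np.dot(Sigma,np.dot(beta*X.T,yhat))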
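
Reading the cell above as the standard Bayesian treatment of linear regression (an interpretation, since the notebook does not spell it out), `alpha` plays the role of the precision of a zero-mean Gaussian prior on $w$, and `beta` is the precision of the measurement noise, $\beta = 1/\epsilon^2$.  Under those assumptions, the two quantities being computed are

$$\Sigma = \left(\beta X^T X + \alpha I\right)^{-1}, \qquad w = \beta\,\Sigma X^T \hat{y}.$$

With $\alpha = 0$ the factors of $\beta$ cancel in the mean, giving $w = (X^TX)^{-1}X^T\hat{y}$, the same least-squares solution as before, while the covariance becomes $\Sigma = \epsilon^2 (X^TX)^{-1}$.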

In [None]:
Sigma

In [None]:
w

In [None]:
plt.plot(x,yhat,'k.')
plt.plot(x,np.dot(X,w))
plt.xlabel('x')
plt.ylabel('y')
plt.show()

In [None]:
fig,axs = plt.subplots(nrows=1,ncols=4,sharey=True)
axs[0].hist(w_array[:,0],20,density=True)
axs[1].hist(w_array[:,1],20,density=True)
axs[2].hist(w_array[:,2],20,density=True)
axs[3].hist(w_array[:,3],20,density=True)
w0s = np.linspace(w_array[:,0].min(),w_array[:,0].max(),101)
w1s = np.linspace(w_array[:,1].min(),w_array[:,1].max(),101)
w2s = np.linspace(w_array[:,2].min(),w_array[:,2].max(),101)
w3s = np.linspace(w_array[:,3].min(),w_array[:,3].max(),101)
axs[0].plot(w0s,1./np.sqrt(2*np.pi*Sigma[0,0])*np.exp(-0.5*(w0s - w[0])**2/Sigma[0,0]))
axs[1].plot(w1s,1./np.sqrt(2*np.pi*Sigma[1,1])*np.exp(-0.5*(w1s - w[1])**2/Sigma[1,1]))
axs[2].plot(w2s,1./np.sqrt(2*np.pi*Sigma[2,2])*np.exp(-0.5*(w2s - w[2])**2/Sigma[2,2]))
axs[3].plot(w3s,1./np.sqrt(2*np.pi*Sigma[3,3])*np.exp(-0.5*(w3s - w[3])**2/Sigma[3,3]))
plt.show()
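
The overlay above can also be checked numerically.  The sketch below is self-contained and makes a few assumptions of its own: a separate seeded generator `rng`, a reduced number of repeats, and illustrative names such as `W`.  It compares the standard deviation of each coefficient across repeated experiments with the square root of the diagonal of `Sigma`.  Assuming Gaussian noise of known size `epsilon` and `alpha = 0`, the two should agree to within sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)  # separate seeded generator, an assumption of this sketch

def y(x):
    return -1 + x + x**2 + 0*x**3

x = np.sort(rng.random(11))
epsilon = 1e-1
degree = 3
X = np.vander(x, degree+1, increasing=True)

# Analytic covariance with alpha = 0: Sigma = inv(beta * X^T X)
beta = 1./epsilon**2
Sigma = np.linalg.inv(beta*np.dot(X.T, X))

# Repeat the noisy measurement many times; each column of Yhat is one experiment
N_experiments = 5000
Yhat = y(x)[:, None] + epsilon*rng.standard_normal((11, N_experiments))
W = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, Yhat)).T  # shape (N_experiments, 4)

std_empirical = W.std(axis=0)           # spread of the fitted coefficients
std_analytic = np.sqrt(np.diag(Sigma))  # spread predicted from a single design matrix
print(std_empirical)
print(std_analytic)
```

Note that `Sigma` depends only on `X` and `epsilon`, not on any particular set of measurements, which is why it can be computed without rerunning the experiment.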

In [None]:
from scipy.stats import multivariate_normal

xhat = np.linspace(0,1,101)
Xhat = np.vander(xhat,degree+1,increasing=True)

# Draw all parameter samples at once, then plot one faint curve per sample
w_samples = multivariate_normal(w,Sigma).rvs(size=2000)
for w_rand in w_samples:
    plt.plot(xhat,np.dot(Xhat,w_rand),'r-',alpha=0.005)
plt.plot(x,yhat,'k.')
plt.show()
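
The spread of the red curves above can also be computed directly rather than by sampling.  For a fixed input, the fitted value is linear in $w$, so under a Gaussian with mean `w` and covariance `Sigma` its variance is the corresponding diagonal entry of $\hat{X}\Sigma\hat{X}^T$.  The sketch below rebuilds its own data with a separate seeded generator and uses illustrative names (`curve_mean`, `curve_std`, `sampled_std`) to compare that formula with the standard deviation of sampled curves.

```python
import numpy as np

rng = np.random.default_rng(1)  # separate seeded generator, an assumption of this sketch

def y(x):
    return -1 + x + x**2 + 0*x**3

x = np.sort(rng.random(11))
epsilon = 1e-1
yhat = y(x) + epsilon*rng.standard_normal(11)

degree = 3
alpha = 0
beta = 1./epsilon**2
X = np.vander(x, degree+1, increasing=True)
Sigma = np.linalg.inv(beta*np.dot(X.T, X) + alpha*np.eye(degree+1))
w = np.dot(Sigma, np.dot(beta*X.T, yhat))

xhat = np.linspace(0, 1, 101)
Xhat = np.vander(xhat, degree+1, increasing=True)

# Analytic mean and standard deviation of the fitted curve at each xhat
curve_mean = np.dot(Xhat, w)
curve_std = np.sqrt(np.sum(np.dot(Xhat, Sigma)*Xhat, axis=1))  # sqrt(diag(Xhat Sigma Xhat^T))

# Monte Carlo version, mirroring the sampled curves above
w_samples = rng.multivariate_normal(w, Sigma, size=5000)
sampled_std = np.dot(w_samples, Xhat.T).std(axis=0)

print(np.max(np.abs(sampled_std - curve_std)))
```

If a plot is wanted, `plt.fill_between(xhat, curve_mean - 2*curve_std, curve_mean + 2*curve_std, alpha=0.3)` would draw the corresponding two-standard-deviation band.  This band describes uncertainty in the fitted curve only; it does not include the measurement noise on a new observation.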

In [None]:
x