# Linear regression

<img src="cricket.jpg" width=30%>
For centuries, it's been understood that the frequency of cricket chirps increases as temperature increases.  In this problem, you will determine the functional relationship between these two variables such that cricket chirps can be used as a thermometer. 

To begin, import the data file crickets.txt. The first column is the temperature in degrees C, and the second column is the number of cricket chirps per 15 seconds. Using scikit-learn's model selection tools, we can split the data into a training set, which will be used to train the model, and a test set, which will be used to validate the model's performance on data that was *not* used to train it.

In [None]:
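import numpy as np
from sklearn.model_selection import train_test_split

# Column 0: temperature (deg C); column 1: chirps per 15 seconds
data = np.loadtxt('crickets.txt')

# Standardize each column to zero mean and unit variance
data -= data.mean(axis=0)
data /= data.std(axis=0)

# Hold out half of the data as a test set
X_train, X_test, Y_train, Y_test = train_test_split(data[:,0], data[:,1], test_size=0.5, random_state=42)

# Sort each set by temperature so that line plots are drawn left to right
train_idx = np.argsort(X_train)
X_train = X_train[train_idx]
Y_train = Y_train[train_idx]

test_idx = np.argsort(X_test)
X_test = X_test[test_idx]
Y_test = Y_test[test_idx]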



In [None]:
import matplotlib.pyplot as plt
plt.plot(X_train,Y_train,'ko')
plt.xlabel('Temp')
plt.ylabel('Freq.')

In [None]:
Sigma_z

In [None]:
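# Concatenate X,Y into Z
Z_train = np.c_[X_train,Y_train]

# Sample mean and covariance of the joint (temperature, chirps) data
mu_z = Z_train.mean(axis=0)
Sigma_z = np.cov(Z_train.T)

mu_x = mu_z[0]
mu_y = mu_z[1]

sigma2_xy = Sigma_z[0,1]  # covariance of X and Y
sigma2_x = Sigma_z[0,0]   # variance of X
sigma2_y = Sigma_z[1,1]   # variance of Y

# Mean and variance of Y conditioned on X for a bivariate Gaussian
mu_bar = mu_y + sigma2_xy/sigma2_x*(X_train-mu_x)
sigma2_bar = sigma2_y - sigma2_xy**2/sigma2_x

# Data with the conditional mean and a +/- 2 standard deviation band
plt.plot(X_train,Y_train,'ko')
plt.xlabel('Temp')
plt.ylabel('Freq.')
plt.errorbar(X_train,mu_bar,yerr=2*np.sqrt(sigma2_bar))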


### 1. Ordinary Least Squares
Your first task is to define a function that fits a polynomial of arbitrary degree to the data, subject to Tikhonov regularization. To do this, you will have to generate the design matrix $\Phi(X_{obs})$ and solve the normal equations
$$
(\Phi^T \Phi + \lambda I) \mathbf{w} = \Phi^T Y_{obs},
$$
where $\mathbf{w}$ is the vector of polynomial coefficients.  Plot the data with the best-fitting polynomial of degree 1 (a line) overlain.  A handy fact is that if you would like to evaluate this model at some location (or set of locations) $X_{pred}$, the corresponding *prediction* $Y_{pred}$ is given by 
$$
Y_{pred} = \underbrace{\Phi(X_{pred})}_{m\times n} \underbrace{\mathbf{w}}_{n\times 1}.
$$
As such, it might be helpful to define a function that computes $\Phi(X)$ outside of fit\_polynomial.
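
One possible shape for such a helper is sketched below. The names `design_matrix` and `predict` are illustrative and not part of the assignment, and the sketch assumes the monomial basis $1, x, \dots, x^d$.

```python
import numpy as np

def design_matrix(X, d):
    """Return the (m, d+1) matrix whose column i holds X**i (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # np.vander with increasing=True orders the columns as X**0, X**1, ..., X**d
    return np.vander(X, d + 1, increasing=True)

def predict(X_pred, w):
    """Evaluate the polynomial with coefficients w at X_pred (hypothetical helper)."""
    return design_matrix(X_pred, len(w) - 1) @ w

# Small check on made-up numbers: w = [1, 2] is the line y = 1 + 2x
w_demo = np.array([1.0, 2.0])
Y_demo = predict(np.array([0.0, 1.0, 2.0]), w_demo)
```

Keeping the design matrix in its own function means the same code builds $\Phi$ for the training set, the test set, and any plotting grid.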

In [None]:
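def fit_polynomial(X,Y,d,lamda=0):
    """Fit a degree-d polynomial to (X, Y) by regularized least squares.

    Returns the weights w and the design matrix Phi, where column i
    of Phi is X**i. lamda is the Tikhonov regularization strength."""
    Phi = np.column_stack([X**i for i in range(d+1)])
    mod_I = np.eye(d+1)
    # Uncomment to leave the intercept term unpenalized
    #mod_I[0,0] = 0.0
    # Solve the normal equations (Phi^T Phi + lamda*I) w = Phi^T Y
    w = np.linalg.solve(Phi.T @ Phi + lamda*mod_I,Phi.T @ Y)

    return w, Phi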

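# Degree 1 gives the best-fitting line
w_line, Phi_train = fit_polynomial(X_train,Y_train,1,lamda=0)

In [None]:
# Evaluate the fitted polynomial on a dense grid for plotting
X_interp = np.linspace(-2,2,1000)
d = len(w_line) - 1  # degree of the fitted polynomial
Phi_interp = np.column_stack([X_interp**i for i in range(d+1)])
Y_interp = Phi_interp @ w_line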



In [None]:
plt.plot(X_interp,Y_interp)
plt.plot(X_train,Y_train,'ko')
plt.ylim(-5,5)

### 2. Overfitting
With the above function in hand, we will now explore the effect of fitting higher-degree polynomials to the data. Fit the training data using polynomials from degree 1 to 15, without regularization (i.e. $\lambda=0$). For each of these fits, record the resulting root mean square error
$$
RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^m (Y_{pred,i} - Y_{obs,i})^2}
$$

in both the training and test data. Plot both of these RMSE values as a function of polynomial degree (using a logarithmic scale for RMSE is helpful).
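
The error measure can be wrapped in a small helper so that the training and test sets are scored by the same code. This is only a sketch: the name `rmse` is hypothetical, and it averages the squared residuals over the number of points before taking the root.

```python
import numpy as np

def rmse(Y_pred, Y_obs):
    """Root mean square error between predictions and observations (hypothetical helper)."""
    Y_pred = np.asarray(Y_pred, dtype=float)
    Y_obs = np.asarray(Y_obs, dtype=float)
    return np.sqrt(np.mean((Y_pred - Y_obs) ** 2))

# Made-up values: the residuals are 3 and -4, so the mean square is (9 + 16)/2 = 12.5
example = rmse([3.0, 0.0], [0.0, 4.0])
```

Dividing by the number of points matters here because it keeps the two errors comparable even when the training and test sets differ in size.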

In [None]:
train_rmse_list = []
test_rmse_list = []
m = len(Y_train)
m_test = len(Y_test)
degrees = np.arange(1,16)  # polynomial degrees 1 through 15
for d in degrees:
    #! Use the function you generated above to fit 
    #! a polynomial of degree d to the cricket data
    w_line, Phi_train = fit_polynomial(X_train,Y_train,d,lamda=0)

    #! Compute and record RMSE for both the training and
    #! test sets.  IMPORTANT: Don't fit a new set of 
    #! weights to the test set!!!
    Y_train_pred = Phi_train @ w_line
    rmse_train = np.sqrt(((Y_train_pred - Y_train)**2).sum()/m)
    train_rmse_list.append(rmse_train)

    # Build the test design matrix directly; no fit is run on the test set
    Phi_test = np.column_stack([X_test**i for i in range(d+1)])
    Y_test_pred = Phi_test @ w_line
    rmse_test = np.sqrt(((Y_test_pred - Y_test)**2).sum()/m_test)
    test_rmse_list.append(rmse_test)

plt.semilogy(degrees,train_rmse_list,'-',label='train')
plt.semilogy(degrees,test_rmse_list,'--',label='test')
plt.xlabel('degree')
plt.ylabel('RMSE')
plt.legend()
plt.show()

### 3. Regularization
Fix the polynomial degree at 15, and now fit the training data for regularization parameter $\lambda \in [10^{-9},10^2]$ (you'll want to distribute these points in log-space; see below).  As above, compute the RMSE in the training and test sets, and plot as a function of $\lambda$.
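
To see why log-spacing is the natural choice over eleven orders of magnitude, the short sketch below compares `np.logspace` with `np.linspace` on the same interval. The variable names are illustrative only.

```python
import numpy as np

# 12 values from 1e-9 to 1e2, one per decade
lamdas_demo = np.logspace(-9, 2, 12)

# For contrast, linear spacing over the same interval puts every
# point except the first above 1, so small lambdas are never sampled
lamdas_linear = np.linspace(1e-9, 1e2, 12)
```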

In [None]:
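train_rmse = []
test_rmse = []
lamdas = np.logspace(-9,2,12)
d = 15
# The test design matrix does not depend on lambda, so build it once
Phi_test = np.column_stack([X_test**i for i in range(d+1)])
for lamda in lamdas:
    #! Use the function you generated above to fit 
    #! a polynomial of degree 15 to the cricket data
    #! with varying lambda 
    w, Phi_train = fit_polynomial(X_train,Y_train,d,lamda=lamda)

    #! Compute and record RMSE for both the training and
    #! test sets.  IMPORTANT: Don't fit a new set of 
    #! weights to the test set!!!
    train_rmse.append(np.sqrt(np.mean((Phi_train @ w - Y_train)**2)))
    test_rmse.append(np.sqrt(np.mean((Phi_test @ w - Y_test)**2)))

plt.loglog(lamdas,train_rmse,label='train')
plt.loglog(lamdas,test_rmse,'--',label='test')
plt.xlabel('lambda')
plt.ylabel('RMSE')
plt.legend()
plt.show()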

# Mauna Loa CO2

In [None]:
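import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

# Fetch the CO2 record from OpenML as a pandas DataFrame
co2 = fetch_openml(data_id=41187, as_frame=True)

# Build a date-indexed frame holding only the CO2 concentration
co2_data = co2.frame
co2_data["date"] = pd.to_datetime(co2_data[["year", "month", "day"]])
co2_data = co2_data[["date", "co2"]].set_index("date")
co2_data.head()

# Time axis in years, treating the record as 43 years of evenly spaced samples
Y = co2_data['co2'].to_numpy()
X = np.linspace(0,43,len(Y))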

In [None]:
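plt.plot(X,Y)

# Design matrix: constant, linear trend, and a sine term with a period of one year
Phi = np.c_[np.ones_like(X),X,np.sin(2*np.pi*X)]
w_opt = np.linalg.solve(Phi.T @ Phi,Phi.T @ Y)

# Overlay the fitted model on the data
plt.plot(X,Phi @ w_opt)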

In [None]:
w_opt
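
A single sine column fixes the phase of the seasonal cycle at zero. One common way to let the fit choose the phase is to add a matching cosine column, since $a\sin u + b\cos u = A\sin(u+\varphi)$. The sketch below illustrates this on synthetic data, not on the CO2 record; all names and numbers in it are made up for the illustration.

```python
import numpy as np

# Synthetic series (not the CO2 record): trend plus a phase-shifted yearly cycle
t = np.linspace(0, 10, 500)
true_amp, true_phase = 3.0, 0.7
y = 2.0 + 0.5 * t + true_amp * np.sin(2 * np.pi * t + true_phase)

# Columns: constant, linear trend, sine, cosine
Phi_demo = np.c_[np.ones_like(t), t, np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)]
w_demo, *_ = np.linalg.lstsq(Phi_demo, y, rcond=None)

# a*sin(u) + b*cos(u) = A*sin(u + phi), with A = hypot(a, b) and phi = arctan2(b, a)
amp = np.hypot(w_demo[2], w_demo[3])
phase = np.arctan2(w_demo[3], w_demo[2])
```

The model stays linear in the weights, so the same normal equations apply; only the design matrix gains a column.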