# [CSCI 3397/PSYC 3317] Lab 5b: Linear Regression

**Posted:** Thursday, February 17, 2022

**Due:** Thursday, February 24, 2022

__Total Points__: 8 pts

__Submission__: Please rename the .ipynb file to __\<your_username\>_lab5b.ipynb__ before you submit it to Canvas. Example: weidf_lab5b.ipynb.

In [None]:
## utility functions
import numpy as np

# dataset split
def data_split(N, ratio=(6, 2, 2)):
    # return a list of index arrays (train, val, test by default) that partition range(N)
    # generate a shuffled index array
    shuffle_idx = np.arange(N)
    np.random.shuffle(shuffle_idx)
    # cumulative split boundaries given by the ratio
    split_pts = (np.cumsum(ratio)/float(sum(ratio))*N).astype(int)
    out_idx = [None] * len(ratio)
    out_idx[0] = shuffle_idx[:split_pts[0]]
    for i in range(1,len(ratio)):
        out_idx[i] = shuffle_idx[split_pts[i-1] : split_pts[i]]
    return out_idx

def MSE(y,y_hat):
    # mean squared error between targets y and predictions y_hat (Lec. 9, page 19)
    return ((y-y_hat)**2).mean()
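
As a quick sanity check on the two helpers above, the sketch below splits 50 indices with the default 6:2:2 ratio and confirms that the parts are disjoint and sized 30/10/10. The helpers are restated compactly so the snippet runs on its own; `np.random.permutation` and `np.split` are substitutions made for illustration, not the lab's code, and the seed value is arbitrary.

```python
import numpy as np

def data_split(N, ratio=(6, 2, 2)):
    # compact restatement of the lab helper (illustrative, not the original code)
    shuffle_idx = np.random.permutation(N)
    bounds = (np.cumsum(ratio) / float(sum(ratio)) * N).astype(int)
    return np.split(shuffle_idx, bounds[:-1])

def MSE(y, y_hat):
    # mean squared error
    return ((y - y_hat) ** 2).mean()

np.random.seed(0)  # arbitrary seed, only for reproducibility of this sketch
parts = data_split(50)
sizes = [len(p) for p in parts]
print(sizes)  # [30, 10, 10]

# every index appears exactly once across train/val/test
all_idx = np.concatenate(parts)
print(len(np.unique(all_idx)) == 50)  # True

# MSE of a prediction that is off by 2 on one of three points: 4/3
err = MSE(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))
print(err)
```

Because the boundaries are truncated with `astype(int)`, ratios that do not divide `N` evenly can leave the parts slightly unequal to the nominal proportions.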

# 1. Linear regression

## 1.1 Dataset Generation

In [None]:
# data generation
import time
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)

# ground-truth (gt) line: y = 0.73x + 0.58
theta_gt = [0.58, 0.73]  # theta_0,theta_1
num_pt = 50

X = np.random.uniform(4, 10, num_pt).reshape(-1,1)
Y = theta_gt[0] + theta_gt[1] * X + 0.5 *np.random.normal(0, 1, num_pt).reshape(-1,1)

# for visualization
XX = np.linspace(4,10,100).reshape(-1,1)

train_idx, val_idx, test_idx = data_split(len(Y))

X_train, Y_train = X[train_idx], Y[train_idx]
X_val, Y_val = X[val_idx], Y[val_idx]
X_test, Y_test = X[test_idx], Y[test_idx]

plt.plot(X_train,Y_train,'ro')

## 1.2 Use 1-dim linear regression formula

Lec. 9, page 23
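
For reference, the cell below computes the following closed-form estimates, where a bar denotes the mean over the training samples. This is read off from the code itself; the notation is an assumption and may differ from the lecture slide.

$$
\theta_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2},
\qquad
\theta_0 = \bar{y} - \theta_1\,\bar{x}
$$

The numerator is the sample covariance of $x$ and $y$ and the denominator is the sample variance of $x$, so $\theta_1$ is the slope that minimizes the training MSE and $\theta_0$ shifts the line through $(\bar{x}, \bar{y})$.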

In [None]:
theta_1 = (X_train*Y_train).mean()-X_train.mean()*Y_train.mean()
theta_1 = theta_1/ ((X_train**2).mean()-X_train.mean()**2)
theta_0 = Y_train.mean()-theta_1*X_train.mean()

plt.plot(X_train,Y_train,'ro')
plt.plot(XX,XX*theta_1+theta_0,'b-')
plt.legend(['data', 'regression result'])

print('gt theta', theta_gt)
print('estimated theta (1-dim formula)', [theta_0,theta_1])


# evaluation on the training data
Y_train_hat = X_train*theta_1+theta_0
print('MSE error', MSE(Y_train, Y_train_hat))

## 1.3 Use N-dim linear regression formula

Lec. 9, page 25
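
In matrix form, the cell below solves the normal equations. Here $X$ denotes the training inputs augmented with a leading column of ones (called `X_train_aug` in the code) and $y$ the training targets; the symbols are chosen for illustration and may not match the lecture's notation.

$$
X^\top X\,\hat{\theta} = X^\top y
\quad\Longleftrightarrow\quad
\hat{\theta} = (X^\top X)^{-1} X^\top y
$$

The code calls `np.linalg.solve` on the left-hand system rather than forming the inverse explicitly, which gives the same $\hat{\theta}$ when $X^\top X$ is invertible.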

In [None]:
X_train_aug = np.hstack([np.ones([len(X_train),1]),X_train])

theta_nd = np.linalg.solve(np.matmul(X_train_aug.T, X_train_aug), np.matmul(X_train_aug.T, Y_train))
print('estimated theta (N-dim formula)', theta_nd)

# evaluation on the training data
Y_train_hat = X_train*theta_nd[1]+theta_nd[0]
print('MSE error', MSE(Y_train, Y_train_hat))
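
The 1-dim formula and the N-dim formula should produce the same estimate on the same data. The sketch below checks this and also compares against NumPy's least-squares solver `np.linalg.lstsq`. It regenerates data with the same recipe as Section 1.1 but, as a simplifying assumption, fits on all 50 points instead of the training split, so the numbers will not match the cells above exactly.

```python
import numpy as np

np.random.seed(123)
# same synthetic recipe as Section 1.1 (no train/val/test split, for brevity)
X = np.random.uniform(4, 10, 50).reshape(-1, 1)
Y = 0.58 + 0.73 * X + 0.5 * np.random.normal(0, 1, 50).reshape(-1, 1)

# augment with a column of ones so theta[0] is the intercept
X_aug = np.hstack([np.ones([len(X), 1]), X])

# N-dim formula: solve the normal equations
theta_ne = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y)

# NumPy's least-squares solver on the same design matrix
theta_ls, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)

# 1-dim formula
theta_1 = ((X * Y).mean() - X.mean() * Y.mean()) / ((X ** 2).mean() - X.mean() ** 2)
theta_0 = Y.mean() - theta_1 * X.mean()

agree_solver = np.allclose(theta_ne, theta_ls)
agree_1d = np.allclose(theta_ne.ravel(), [theta_0, theta_1])
print(agree_solver, agree_1d)  # True True
```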

# 2. Polynomial regression (Linear regression + polynomial feature)

## 2.1 Data generation

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)

# gt polynomial: y = x^2 -10x +25
num_pt = 10

X2 = np.random.uniform(4, 10, num_pt).reshape(-1,1)
theta_gt2 = [25,-10,1]
Y2 = theta_gt2[0] + theta_gt2[1]*X2 +theta_gt2[2]*X2**2 + 0.5 *np.random.normal(0, 1, num_pt).reshape(-1,1)

# for visualization
XX2 = np.linspace(4,10,100).reshape(-1,1)

train_idx2, val_idx2, test_idx2 = data_split(len(Y2))

X_train2, Y_train2 = X2[train_idx2], Y2[train_idx2]
X_val2, Y_val2 = X2[val_idx2], Y2[val_idx2]
X_test2, Y_test2 = X2[test_idx2], Y2[test_idx2]

plt.plot(X_train2,Y_train2,'ro')

## 2.2 Use N-dim linear regression formula

Lec. 9, pages 25 and 30
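
The only change from the linear case is the feature matrix. For a degree-2 polynomial, each training input $x_j$ becomes the row $[1,\ x_j,\ x_j^2]$, as built by `np.hstack` in the cell below. With $n$ training points (a symbol introduced here for illustration):

$$
X =
\begin{bmatrix}
1 & x_1 & x_1^2 \\
\vdots & \vdots & \vdots \\
1 & x_n & x_n^2
\end{bmatrix},
\qquad
\hat{y} = X\theta = \theta_0 + \theta_1 x + \theta_2 x^2
$$

The model is still linear in $\theta$, so the same normal-equation solve applies unchanged.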

In [None]:
X_train2_aug = np.hstack([np.ones([len(X_train2),1]),X_train2,X_train2**2])

theta_nd2 = np.linalg.solve(np.matmul(X_train2_aug.T, X_train2_aug), np.matmul(X_train2_aug.T, Y_train2))
print('gt theta', theta_gt2)
print('estimated theta (N-dim formula)',theta_nd2)

# evaluation on the training data
Y_train2_hat = X_train2 * X_train2 * theta_nd2[2] + X_train2 * theta_nd2[1]+theta_nd2[0]
print('MSE error', MSE(Y_train2, Y_train2_hat))

# Exercise (8 pts)

## (1) [3 pts] Polynomial regressor for any order of K

- [2 pts] Build a function that performs polynomial regression of a given order K (i.e., fits $\sum_{i=0}^{K}\theta_i x^i$) and returns the estimated theta.
- [1 pt] Sanity check: for K=2, print the MSE on the training data (X_train2, Y_train2) from Section 2 and check that it agrees with the MSE computed there.

In [None]:
# hint: create the feature matrix first, then use a for-loop to fill in each feature column
def train_PR(x,y,k):
    ### Your code starts here
    
    ### Your code ends here
    # return estimated theta
    
    
    

## (2) [5 pts] Model selection
Let `Ks=np.arange(1,11)`.
- (a) [1 pt] For each K value, train a polynomial regression model with order=K and evaluate its MSE on the training data.
- (b) [1 pt] Draw a line plot of the training MSE against K (`plt.plot(x,y,'-')`) and answer: "Which K should we choose if the goal is to minimize the training error?"
- (c) [1 pt] Evaluate the models trained above (one per K value) on the validation data and answer: "Which K should we choose if the goal is to minimize the validation error?"
- (d) [1 pt] Repeat (c) on the test data as the "final/real-world" evaluation.
- (e) [1 pt] Which model selection criterion is better: minimizing the training error or minimizing the validation error? Briefly explain why.

Lec. 9, slide 42 

In [None]:
### Your code starts here

### Your code ends here