# 5 Alternative Solution to Ridge and Fake Data/Features Perspectives

**STOP: If you have not completed Problem 2, please do that first!**

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

We will need some functions from previous problems.

In [2]:
def generate_data(x_range, func, sigma=1, n=80):
    y_range = np.array([func(x) + np.random.normal(0, sigma) for x in x_range])
    random_indices = np.arange(len(x_range))
    np.random.shuffle(random_indices)
    x = x_range[random_indices[:n]]
    y = y_range[random_indices[:n]]
    return x, y

def get_features(d, x_scalars):
    X = []
    for x in x_scalars:
        X.append([x**i for i in range(d+1)])
    return np.array(X)

def ols(X, y):
    ### BEGIN CODE ###
    
    ### END CODE ###
    return w_hat

def ridge(X, y, lambd=1):
    return np.linalg.inv(X.T@X + lambd*np.eye(X.shape[1]))@X.T@y

## 5.1 Alternative Solution to Ridge Regression

An important detail to note about OLS is that the closed-form solution $w = (X^TX)^{-1} X^Ty$ is designed for regression problems where the data matrix is *tall*, meaning it has more data points than features ($n > d$).

---

**5.1.1. Sanity Check:** Suppose that the features of a tall data matrix $X$ were linearly independent. **Comment on the existence of a solution. How does that tie into OLS?**

YOUR ANSWER HERE:

---

**5.1.2.** Now suppose that $X$ is a square matrix and has linearly independent columns. **Comment on the existence of a solution. How would you find the solution?**

YOUR ANSWER HERE:

---

**5.1.3.** Now suppose that $X$ is a wide matrix, where it has more features than data points ($n < d$). **Comment on the existence of a solution. *(Hint: Are the columns of $X$ linearly independent?)***

YOUR ANSWER HERE:

---

Now that you know the three shapes the data matrix can take, let's focus on the wide-matrix case.
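The three shapes can also be compared numerically. The sketch below is only an illustration: the sizes, the seed, and the random Gaussian matrices are assumptions, not part of the assignment's data. It relies on the fact that $X^TX$ has the same rank as $X$, so $X^TX$ is invertible exactly when $X$ has $d$ linearly independent columns.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility

# Hypothetical (n, d) sizes for a tall, a square, and a wide data matrix.
shapes = {"tall": (6, 3), "square": (4, 4), "wide": (3, 6)}

ranks = {}
for name, (n, d) in shapes.items():
    X = rng.normal(size=(n, d))      # random Gaussian entries (illustrative data)
    rank = np.linalg.matrix_rank(X)  # number of linearly independent columns
    ranks[name] = rank
    # rank(X^T X) == rank(X), so the d x d matrix X^T X is invertible iff rank == d.
    print(f"{name}: X is {n}x{d}, rank(X) = {rank}, X^T X invertible: {rank == d}")
```

In the wide case the rank can be at most $n < d$, so the columns cannot all be linearly independent and the OLS formula cannot be applied directly.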

**5.1.4. Since there can be infinitely many solutions $w$ to the system $Xw = y$ if $X$ is wide, what would be the "best" $w$ in this case? *(Hint: Think about ridge regression. What was it trying to minimize?)***

YOUR ANSWER HERE:

---

Now that we understand the goal of solving a regression problem with a wide matrix, let's formally define the problem and its closed-form solution:

Optimization Problem:

$$\underset{w}{\min} \|w\|^2_2 \text{ s.t. } Xw = y$$

Closed-Form Solution:

$$w = X^T(XX^T)^{-1}y$$

This looks very similar to the OLS solution! It turns out this solution is known as the *minimum-norm solution* and later in EECS 16B you will learn how this solution is derived.
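As a rough numerical illustration of the formula above, the sketch below builds a small wide system, computes $w = X^T(XX^T)^{-1}y$, and compares it with a second exact solution obtained by adding a null-space component. The sizes, the seed, and the random data are assumptions made for this example, and `np.linalg.solve` is used in place of an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility
n, d = 3, 6                     # hypothetical wide problem: fewer points than features
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Minimum-norm solution w = X^T (X X^T)^{-1} y.
w_min = X.T @ np.linalg.solve(X @ X.T, y)

# Build another exact solution: take a random vector, remove its row-space part
# so that X @ v_null is (numerically) zero, and add it to w_min.
v = rng.normal(size=d)
v_null = v - X.T @ np.linalg.solve(X @ X.T, X @ v)
w_other = w_min + v_null

print("X @ w_min == y:  ", np.allclose(X @ w_min, y))
print("X @ w_other == y:", np.allclose(X @ w_other, y))
print("norms:", np.linalg.norm(w_min), np.linalg.norm(w_other))

# The pseudoinverse gives the same minimum-norm solution when X has full row rank.
print("matches pinv:", np.allclose(w_min, np.linalg.pinv(X) @ y))
```

Both vectors satisfy $Xw = y$, but the one from the closed-form expression has the smaller norm, which is what the optimization problem asks for.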

Similarly, if we were to add a ridge penalty to this minimum-norm solution, we would arrive at an alternative closed-form solution for ridge regression:

$$w = X^T(XX^T + \lambda I)^{-1}y$$

You don't need to fully understand the significance of this alternative solution for ridge regression right now, but it is useful to notice that every entry of $XX^T$ is a dot product between two training points: entry $(i, j)$ is $\vec{x}_i^T\vec{x}_j$, where $\vec{x}_i$ is the $i$-th row of $X$. This property connects very well with kernels, a topic you will learn about in a future lesson.
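The dot-product structure of $XX^T$ can be checked on a tiny example. The matrix below is made up for illustration (two training points with three features each) and is not the assignment's data.

```python
import numpy as np

# Hypothetical data matrix: each row is one training point.
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])

G = X @ X.T  # n x n matrix of pairwise products

# Recompute every entry explicitly as a dot product between rows i and j.
n = X.shape[0]
G_manual = np.array([[X[i] @ X[j] for j in range(n)] for i in range(n)])

print(G)
# [[ 5.  2.]
#  [ 2. 10.]]
print(np.allclose(G, G_manual))  # True
```

Note that $XX^T$ is $n \times n$, so its size depends on the number of training points rather than the number of features.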

**5.1.5. Fill in the code below and run the cell to verify that the alternative closed-form solution for ridge regression gives us the same result:**

In [None]:
# Generating the polynomial toy model data again
x_range = np.linspace(-3, 1, 101, endpoint=True)
func = lambda x: x**3 + 3*x**2 - 2
np.random.seed(123)
x, y = generate_data(x_range, func, 0.4, 80)

N = 40
D = 7
x_train = x[:N] 
y_train = y[:N]
x_test = x[N:]
y_test = y[N:]
X_train = get_features(D, x_train)

def ridge_alternative(X, y, lambd=0.1):
    ### BEGIN CODE ###
    
    ### END CODE ###
    return w

lambd = 0.1
w_ridge = ridge(X_train, y_train, lambd)
w_ridge_alternative = ridge_alternative(X_train, y_train, lambd)
print(f'w_ridge: {w_ridge}')
print(f'w_alternative: {w_ridge_alternative}')

## 5.2 Fake Data and Fake Features Perspective

We are going to introduce two final perspectives on Ridge Regression, which are the fake data and fake features perspectives. More specifically, we will see that the fake data perspective will net us the standard closed-form solution for ridge regression while the fake features perspective will net us the alternative solution to ridge regression.

---

**5.2.1. Fake Data Perspective**

Given that we have a properly constructed $X$ matrix and $\vec{y}$ vector, let us add fake data points to $X$ and $\vec{y}$ such that:

$$\hat{X} = \begin{bmatrix}
X\\
\sqrt{\lambda}I
\end{bmatrix}$$

$$\hat{y}=\begin{bmatrix}
\vec{y}\\
0
\end{bmatrix}$$

**Show that the closed-form OLS solution using the augmented $\hat{X}$ matrix and $\hat{y}$ vector will evaluate to the closed-form solution for ridge regression.**

YOUR ANSWER HERE:


**5.2.2. Fill in the code below and run the cell to see that the fake data perspective gives us the same result as ridge:**

In [None]:
def ridge_fake_data(X, y, lambd = 0.1):
    ### BEGIN CODE ###
    
    ### END CODE ###
    w = ols(X_hat, y_hat)
    return w

X_train = get_features(D, x_train)
lambd = 0.1
w_ridge = ridge(X_train, y_train, lambd)
w_fake_data = ridge_fake_data(X_train, y_train, lambd)
print(f'w_ridge: {w_ridge}')
print(f'w_fake_data: {w_fake_data}')



**5.2.3. Now that we have a learned $\hat{w}$, explain how we could make predictions on test data $X_{test}$.**

YOUR ANSWER HERE: 

---

**5.2.4. Fake Features Perspective**

Let's augment the data matrix again, except this time we add fake features such that:

$$\hat{X} = \begin{bmatrix}
X & \sqrt{\lambda}I
\end{bmatrix}$$

Notice that the $\hat{X}$ matrix is now wide, so we need to use the minimum-norm solution instead of OLS. In addition, the weight vector we find using the minimum-norm solution will have two components: one for the original features and one for the fake features. We will show this decomposition by writing the weight vector from the minimum-norm solution as $\begin{bmatrix}
\hat{w}\\
\hat{\epsilon}
\end{bmatrix}$. **Show that the minimum-norm solution with the augmented $\hat{X}$ matrix gives the same $\hat{w}$ as the alternative closed-form solution for ridge regression.**

YOUR ANSWER HERE: 


**5.2.5. Fill in the code below and run the cell to see that the fake features perspective gives us the same result as ridge:**

In [None]:
def ridge_fake_features(X, y, lambd = 0.1):
    X_hat = np.hstack((X, np.sqrt(lambd)*np.eye(X.shape[0])))
    ### BEGIN CODE ###
    
    ### END CODE ###
    return w

X_train = get_features(D, x_train)
lambd = 1
w_ridge_alternative = ridge_alternative(X_train, y_train, lambd)
w_fake_features = ridge_fake_features(X_train, y_train, lambd)
print(f'w_ridge_alternative: {w_ridge_alternative}')
print(f'w_fake_features: {w_fake_features}')

**5.2.6. Now imagine that instead of keeping only $\hat{w}$, we kept the entire $(d + n)$-dimensional weight vector from the minimum-norm solution: $\begin{bmatrix}
\hat{w}\\
\hat{\epsilon}
\end{bmatrix}$. Explain how we could augment $X_{test}$ so our predictions are equivalent to $X_{test}\hat{w}$.**

YOUR ANSWER HERE: