## HW 2 
Emily Yamauchi

The function $F$ is defined as:

$$
F(\beta) = \frac{1}{2n}\sum_{i=1}^n(y_i-x_i^\top \beta)^2 + \frac{\lambda}{n}||\beta||_2^2
$$
and the optimal solution $\beta^*$ is defined by:
$$
F(\beta^*) = \min_{\beta \in \mathbb{R}^d}F(\beta)
$$

1. Assume that $d = 1$ and $n = 1$. The sample then consists of a single pair $(x, y)$, and the function $F$ reduces to

$$
F(\beta) = \frac{1}{2}(y-x\beta)^2 + \lambda\beta^2
$$
    Compute and write down the gradient $\nabla{F}$ of $F$.

Solution:

$$
\begin{align*}
F(\beta) &= \frac{1}{2}(y-x\beta)^2 + \lambda\beta^2 \\
\frac{d}{d\beta}F(\beta) &= \frac{1}{2}2(y-x\beta)(-x)+2\lambda\beta \\
&= -x(y-x\beta) + 2\lambda\beta \\
\end{align*}
$$

2. Assume now $d > 1$ and $n > 1$. Using the previous result and the linearity of differentiation, compute and write down the gradient $\nabla F(\beta)$ of $F$.

Solution: Let $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ be the matrix whose columns are the samples and $Y = (y_1, \dots, y_n)^\top$, so that the $i$-th entry of $X^\top \beta$ is $x_i^\top \beta$.

$$
\begin{align*}
F(\beta) &= \frac{1}{2n}\sum_{i=1}^n(y_i-x_i^\top \beta)^2 + \frac{\lambda}{n}||\beta||_2^2 \\
&= \frac{1}{2n}(Y - X^\top \beta)^\top (Y - X^\top \beta)+ \frac{\lambda}{n}\beta^\top \beta \\
&= \frac{1}{2n}\left(Y^\top Y - Y^\top X^\top \beta - (X^\top \beta)^\top Y + (X^\top \beta)^\top X^\top \beta\right) + \frac{\lambda}{n}\beta^\top I \beta \\
&= \frac{1}{2n}(Y^\top Y - Y^\top X^\top \beta - \beta^\top XY + \beta^\top XX^\top \beta) + \frac{\lambda}{n}\beta^\top I \beta \\
\frac{d}{d\beta}F(\beta)&=\frac{1}{2n}[0-XY-XY+(XX^\top + (XX^\top)^\top)\beta] + \frac{\lambda}{n}(I+I^\top)\beta \\
&= \frac{1}{2n}[-2XY+2(XX^\top)\beta] + \frac{\lambda}{n}2\beta \\
&= \frac{-X(Y-X^\top \beta)}{n} + \frac{2\lambda\beta}{n} \\
\end{align*}
$$
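
As a cross-check that follows the hint about linearity more directly, the same gradient can be obtained one sample at a time by applying the result from part 1 to each term of the sum. This is a sketch that assumes, as the matrix derivation implies, that $x_i$ is the $i$-th column of $X$ and $y_i$ the $i$-th entry of $Y$:

$$
\begin{align*}
\nabla F(\beta) &= \frac{1}{2n}\sum_{i=1}^n \nabla_\beta (y_i - x_i^\top \beta)^2 + \frac{\lambda}{n}\nabla_\beta ||\beta||_2^2 \\
&= \frac{1}{2n}\sum_{i=1}^n 2(y_i - x_i^\top \beta)(-x_i) + \frac{2\lambda}{n}\beta \\
&= -\frac{1}{n}\sum_{i=1}^n x_i(y_i - x_i^\top \beta) + \frac{2\lambda}{n}\beta
\end{align*}
$$

Since $\sum_{i=1}^n x_i(y_i - x_i^\top \beta) = X(Y - X^\top \beta)$, this agrees with the matrix form $\frac{-X(Y-X^\top \beta)}{n} + \frac{2\lambda\beta}{n}$.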

Consider the `Penguins` dataset, which you should load and divide into training and test sets using the code below.

In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

In [2]:
#Load the data
file = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv'
penguins = pd.read_csv(file, sep=',', header=0)
penguins = penguins.dropna()

In [3]:
#Create our X matrix with the predictors and y vector with the response

X = penguins.drop('body_mass_g', axis=1)
X = pd.get_dummies(X, drop_first=True)
y = penguins['body_mass_g']

In [5]:
#Divide the data into training and test sets. By default, 25% goes into the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Standardize the data. Note that you can convert a DataFrame into an array by using `np.array()`.

In [38]:
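def standardize(X_df):
    """
    Standardize each column of the DataFrame X_df to zero mean and unit variance.
    Returns: standardized data as a NumPy array (the input DataFrame is not modified)
    """

    # Cast to float so that boolean dummy columns are handled like numeric ones
    X = np.array(X_df, dtype=float)

    # Subtract the column means and divide by the column standard deviations
    return (X - X.mean(axis=0)) / X.std(axis=0)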

In [43]:
#standardize(X_train)
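
One point worth illustrating is that the test set is usually standardized with the means and standard deviations of the training set, so that both sets go through the same transformation. The sketch below is one possible way to do this. The helper names `fit_standardizer` and `apply_standardizer` are hypothetical, not part of the assignment, and the toy arrays only stand in for `X_train` and `X_test`.

```python
import numpy as np

def fit_standardizer(X_train):
    """Return the column means and standard deviations of the training data."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Assumption: constant columns are left unscaled to avoid dividing by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std

def apply_standardizer(X, mean, std):
    """Center and scale X with statistics computed elsewhere (e.g. on the training set)."""
    return (np.asarray(X, dtype=float) - mean) / std

# Toy data standing in for X_train and X_test.
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te = np.array([[2.0, 40.0]])

mean, std = fit_standardizer(X_tr)
X_tr_std = apply_standardizer(X_tr, mean, std)  # zero mean, unit variance per column
X_te_std = apply_standardizer(X_te, mean, std)  # same shift and scale as the training set
```

With the real data, the same two calls would be applied to `X_train` and `X_test` in place of the toy arrays.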

Write a function *computegrad* that computes and returns $\nabla F(\beta)$ for any $\beta$. Avoid using `for` loops by vectorizing the computation.

In [44]:
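def computegrad(beta, X, y, lamb):
    """
    Computes the gradient of F at beta, given the data X, the response y, and lambda.
    X holds one sample per row (shape n x d), i.e. the transpose of the X used in the derivation.
    """

    n = len(X)
    # Gradient of the least-squares term: -(1/n) * X^T (y - X beta)
    d1 = -np.dot(X.T, y - np.dot(X, beta)) / n
    # Gradient of the penalty term: (2 * lambda / n) * beta
    d2 = 2 * lamb/n * beta

    return d1 + d2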
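
A common way to gain confidence in a hand-derived gradient is to compare it with a central finite-difference approximation of the objective. The sketch below does this on random data. The names `ridge_objective`, `ridge_gradient` and `finite_difference_gradient` are hypothetical helpers introduced for illustration, and the data matrix is assumed to hold one sample per row (shape n x d). The `for` loop appears only in the checker, not in the gradient itself.

```python
import numpy as np

def ridge_objective(beta, X, y, lamb):
    """F(beta) = (1/2n) * ||y - X beta||^2 + (lambda/n) * ||beta||^2."""
    n = X.shape[0]
    residual = y - X @ beta
    return residual @ residual / (2 * n) + lamb / n * (beta @ beta)

def ridge_gradient(beta, X, y, lamb):
    """Vectorized gradient of F, restated here so the check runs on its own."""
    n = X.shape[0]
    return -X.T @ (y - X @ beta) / n + 2 * lamb / n * beta

def finite_difference_gradient(f, beta, eps=1e-6):
    """Approximate the gradient of f at beta, one coordinate at a time."""
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (f(beta + step) - f(beta - step)) / (2 * eps)
    return grad

# Random toy problem (assumed values, unrelated to the Penguins data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = rng.normal(size=20)
beta_demo = rng.normal(size=3)
lamb_demo = 0.5

analytic = ridge_gradient(beta_demo, X_demo, y_demo, lamb_demo)
numeric = finite_difference_gradient(
    lambda b: ridge_objective(b, X_demo, y_demo, lamb_demo), beta_demo
)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

If the formula is right, `max_abs_diff` should be very small. A sign error or a misplaced transpose typically shows up as a large discrepancy.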

Write a function *graddescent* that implements the gradient descent algorithm described in Algorithm 1. The function *graddescent* calls the function *computegrad* as a subroutine. The function takes as input the initial point, the constant step-size value, and the maximum number of iterations. The stopping criterion is the maximum number of iterations.
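
Algorithm 1 is not reproduced here, so the following is only a minimal sketch of constant step-size gradient descent under stated assumptions: the update is `beta <- beta - eta * grad F(beta)`, the loop stops after `max_iter` iterations, and the data `X`, `y` and `lamb` are passed as extra arguments (an assumption about the signature). `computegrad` is restated inside the block so that it runs on its own, with one sample per row of `X`.

```python
import numpy as np

def computegrad(beta, X, y, lamb):
    """Gradient of F at beta, for X of shape (n, d) with one sample per row."""
    n = X.shape[0]
    return -X.T @ (y - X @ beta) / n + 2 * lamb / n * beta

def graddescent(beta_init, eta, max_iter, X, y, lamb):
    """Run max_iter steps of gradient descent with constant step size eta."""
    beta = np.array(beta_init, dtype=float)  # copy so the initial point is not modified
    for _ in range(max_iter):
        beta = beta - eta * computegrad(beta, X, y, lamb)
    return beta

# Synthetic data (assumed values, unrelated to the Penguins data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lamb_demo = 0.1

beta_gd = graddescent(np.zeros(3), eta=0.1, max_iter=2000,
                      X=X_demo, y=y_demo, lamb=lamb_demo)

# Setting the gradient to zero gives (X^T X + 2 * lambda * I) beta = X^T y,
# which provides a closed-form solution to compare against.
beta_closed = np.linalg.solve(
    X_demo.T @ X_demo + 2 * lamb_demo * np.eye(3), X_demo.T @ y_demo
)
```

On this toy problem the iterates should approach `beta_closed`. On real data the step size `eta` generally needs tuning, since too large a value makes the iterates diverge.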