# DATA 558 Midterm

Will Wright

### Exercise 1

**Instruction**  
Compute the gradient $\nabla F(\beta)$ where the objective is:
$$\min_{\beta \in \mathbb{R}^d} F(\beta):=\frac{1}{n}\sum_{i=1}^{n} \frac{1}{\rho}\log(1+\exp(-\rho y_ix_i^T\beta)) + \lambda\lVert\beta\rVert_2^2$$

**Solution**  
Start by moving the scalar $\frac{1}{\rho}$ outside the summation:  
$F(\beta)=\frac{1}{n\rho}\sum_{i=1}^{n} \log(1+\exp(-\rho y_ix_i^T\beta)) + \lambda\lVert\beta\rVert_2^2$

Next, split the objective into two terms:  
(1) $\frac{1}{n\rho}\sum_{i=1}^{n} \log(1+\exp(-\rho y_ix_i^T\beta))$  
(2) $\lambda\lVert \beta\rVert_2^2$

Find the gradient of the first term:  
(1) $\frac{\partial}{\partial \beta}\Big[\frac{1}{n\rho}\sum_{i=1}^{n} \log(1+\exp(-\rho y_ix_i^T\beta))\Big]$  
  
> Move the constant and the summation outside the derivative:  
$\frac{1}{n\rho}\sum_{i=1}^{n}\frac{\partial}{\partial \beta}\Big[ \log(1+\exp(-\rho y_ix_i^T\beta))\Big]$ 
  
> Use the chain rule with the following functions and their derivatives with respect to $\beta$:  
$(f\circ g \circ h) = \log(g\circ h)$  
$(f\circ g \circ h)' = \frac{1}{(g\circ h)} \cdot (g \circ h)'$  
$(g\circ h) = 1+\exp(-h)$  
$(g\circ h)' = -\exp(-h)\cdot h'$  
$h = \rho y_ix_i^T\beta$  
$h' = \rho y_ix_i$  
  
> Putting the chain together, we have:  
$(f\circ g \circ h)' = -\rho y_ix_i \cdot \frac{\exp(- \rho y_ix_i^T\beta)}{1+\exp(-\rho y_ix_i^T\beta)}$
  
> Re-apply the constant scalar and the summation:  
$=\frac{1}{n\rho}\sum_{i=1}^{n}-\rho y_ix_i \cdot \frac{\exp(- \rho y_ix_i^T\beta)}{1+\exp(-\rho y_ix_i^T\beta)}$

> Move the constant $-\rho$ outside the summation; the $\rho$ cancels against $\frac{1}{\rho}$ and the minus sign remains:  
$=-\frac{1}{n}\sum_{i=1}^{n}y_ix_i \cdot \frac{\exp(- \rho y_ix_i^T\beta)}{1+\exp(-\rho y_ix_i^T\beta)}$

Next, derive the second term:  
(2) $\frac{\partial}{\partial \beta}\lambda\lVert \beta\rVert_2^2$  
  
> Move the constant $\lambda$ outside the derivative and write the squared norm as an inner product:  
$ =\lambda \frac{\partial}{\partial \beta}\beta^T \beta$
  
> Insert the identity matrix $I$, which leaves the product unchanged:  
$ =\lambda \frac{\partial}{\partial \beta}\beta^T I \beta$
  
> Apply the property $\frac{\partial}{\partial x}x^T Ax = (A+A^T)x$:  
$ =\lambda (I + I^T)\beta$  
  
> Since $I + I^T = 2I$ and $I\beta = \beta$, this simplifies to:  

> $=2\lambda \beta $  
  
Finally, add (1) and (2) to get:  
$\nabla F(\beta) = -\frac{1}{n}\sum_{i=1}^{n}y_ix_i \cdot \frac{\exp(- \rho y_ix_i^T\beta)}{1+\exp(-\rho y_ix_i^T\beta)} + 2\lambda \beta$
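
One way to sanity-check the derivation is to compare the closed-form gradient against central finite differences on a small random problem. The sketch below is illustrative only: the function names (`objective`, `gradient`, `numerical_gradient`), the synthetic data, and the choices $\lambda=1$, $\rho=2$ are assumptions made for the check, not part of the exercise.

```python
import numpy as np

def objective(beta, x, y, lam, rho):
    """Average rho-logistic loss plus lam * ||beta||_2^2."""
    margins = rho * y * (x @ beta)
    # logaddexp(0, -m) evaluates log(1 + exp(-m)) without overflow
    return np.mean(np.logaddexp(0.0, -margins)) / rho + lam * beta @ beta

def gradient(beta, x, y, lam, rho):
    """Closed-form gradient from the derivation above."""
    margins = rho * y * (x @ beta)
    # exp(-m) / (1 + exp(-m)) equals 1 / (1 + exp(m)), written stably
    weights = np.exp(-np.logaddexp(0.0, margins))
    return -(x.T @ (y * weights)) / len(y) + 2.0 * lam * beta

def numerical_gradient(f, beta, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(beta)
    for j in range(len(beta)):
        step = np.zeros_like(beta)
        step[j] = h
        grad[j] = (f(beta + step) - f(beta - step)) / (2.0 * h)
    return grad

# Synthetic data, used only for the check (assumed, not the Spam data)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(50, 4))
y_demo = rng.choice([-1.0, 1.0], size=50)
beta_demo = rng.normal(size=4)

analytic = gradient(beta_demo, x_demo, y_demo, lam=1.0, rho=2.0)
numeric = numerical_gradient(
    lambda b: objective(b, x_demo, y_demo, 1.0, 2.0), beta_demo
)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

If the formula is right, the printed difference should be tiny compared with the size of the gradient entries.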

**Instruction**  
Consider the Spam dataset from The Elements of Statistical Learning. Standardize the data, if you have not done so already. Be sure to use the training and test splits from the website.

In [93]:
# Load Packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import copy
import sklearn.preprocessing
from sklearn.linear_model import LogisticRegression

# Both files are whitespace-delimited with no header row
spam = pd.read_table('https://web.stanford.edu/~hastie/ElemStatLearn/datasets/spam.data', 
                   sep=r'\s+', header=None)
test_indicator = pd.read_table('https://web.stanford.edu/~hastie/ElemStatLearn/datasets/spam.traintest',
                         sep=r'\s+', header=None)

In [94]:
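# Features are every column but the last; labels are mapped from {0, 1} to {-1, +1}
x = np.asarray(spam)[:, 0:-1]
y = np.asarray(spam)[:, -1]*2 - 1
# Flatten the indicator to a 1-D array (the split below treats 1 as test, 0 as train)
test_indicator = np.array(test_indicator).T[0]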

# Divide the data into train, test sets
x_train = x[test_indicator == 0, :]
x_test = x[test_indicator == 1, :]
y_train = y[test_indicator == 0]
y_test = y[test_indicator == 1]

# Standardize the data.
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# Keep track of the number of samples and dimension of each sample
n_train = len(y_train)
n_test = len(y_test)
d = np.size(x, 1)
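
To make the standardization step concrete, here is a minimal sketch of the same idea in plain NumPy: the mean and standard deviation are estimated on the training split only and then reused on the test split. The arrays `train_demo` and `test_demo` are synthetic stand-ins, not the Spam data.

```python
import numpy as np

# Synthetic stand-ins for a train/test split (assumed for illustration)
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 3))
test_demo = rng.normal(loc=5.0, scale=3.0, size=(40, 3))

# Column statistics come from the training split only
mu = train_demo.mean(axis=0)
sigma = train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same shift and scale are applied to both splits
train_std = (train_demo - mu) / sigma
test_std = (test_demo - mu) / sigma

print(train_std.mean(axis=0), train_std.std(axis=0))
```

The standardized training columns have mean 0 and standard deviation 1 up to rounding. The test columns generally do not, which is expected, since their statistics were never used.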

**Instruction**  
Write a function _myrhologistic_ that implements the accelerated gradient algorithm to train the $\ell_2^2$-regularized binary logistic regression with $\rho$-logistic loss. The function takes as input the initial step-size for the backtracking rule, the $\epsilon$ for the stopping criterion based on the norm of the gradient of the objective, and the value of $\rho$.
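
As a rough illustration of what such a function could look like, the sketch below combines a Nesterov-style momentum update with an Armijo backtracking rule. It is one possible reading of the instruction, not a reference solution: the extra arguments (`x`, `y`, `lam`, `max_iter`), the backtracking constants `alpha` and `gamma`, the momentum weight `t / (t + 3)`, and the synthetic demo data are all assumptions.

```python
import numpy as np

def objective(beta, x, y, lam, rho):
    margins = rho * y * (x @ beta)
    return np.mean(np.logaddexp(0.0, -margins)) / rho + lam * beta @ beta

def gradient(beta, x, y, lam, rho):
    margins = rho * y * (x @ beta)
    weights = np.exp(-np.logaddexp(0.0, margins))  # 1 / (1 + exp(margins))
    return -(x.T @ (y * weights)) / len(y) + 2.0 * lam * beta

def backtracking(point, grad, x, y, lam, rho, eta,
                 alpha=0.5, gamma=0.8, max_iter=100):
    """Shrink eta until the sufficient-decrease (Armijo) condition holds."""
    f_point = objective(point, x, y, lam, rho)
    sq_norm = grad @ grad
    for _ in range(max_iter):
        candidate = objective(point - eta * grad, x, y, lam, rho)
        if candidate <= f_point - alpha * eta * sq_norm:
            break
        eta *= gamma
    return eta

def myrhologistic(x, y, lam, eta_init, eps, rho, max_iter=1000):
    """Accelerated gradient descent; stops when ||grad F(beta)|| <= eps."""
    d = x.shape[1]
    beta = np.zeros(d)    # main iterate
    theta = np.zeros(d)   # momentum (look-ahead) iterate
    eta = eta_init
    for t in range(max_iter):
        if np.linalg.norm(gradient(beta, x, y, lam, rho)) <= eps:
            break
        grad_theta = gradient(theta, x, y, lam, rho)
        # Step size is carried over, so it never grows between iterations
        eta = backtracking(theta, grad_theta, x, y, lam, rho, eta)
        beta_next = theta - eta * grad_theta
        theta = beta_next + t / (t + 3.0) * (beta_next - beta)
        beta = beta_next
    return beta

# Synthetic demo data with labels in {-1, +1} (assumed, not the Spam data)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 5))
w_true = np.array([1.5, -2.0, 0.0, 0.5, 1.0])
noise = 0.5 * rng.normal(size=200)
y_demo = np.where(x_demo @ w_true + noise > 0, 1.0, -1.0)

beta_hat = myrhologistic(x_demo, y_demo, lam=1.0, eta_init=1.0, eps=1e-3, rho=2.0)
final_grad_norm = np.linalg.norm(gradient(beta_hat, x_demo, y_demo, 1.0, 2.0))
print(final_grad_norm)
```

The stopping test is evaluated at `beta` rather than at the look-ahead point `theta`, so the returned coefficients are the ones whose gradient norm was checked.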

**Instruction**  
Train your $\ell_2^2$-regularized binary logistic regression with $\rho$-logistic loss with $\rho=2$ and $\epsilon=10^{-3}$ on the Spam dataset for $\lambda=1$. Report your misclassification error for this value of $\lambda$.
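
The misclassification error for labels in $\{-1, +1\}$ can be computed as the fraction of samples where the sign of $x_i^T\beta$ disagrees with $y_i$. The helper below is a hypothetical sketch; the name `misclassification_error`, the convention that a score of exactly 0 is predicted as $+1$, and the toy inputs are assumptions.

```python
import numpy as np

def misclassification_error(beta, x, y):
    """Fraction of samples whose predicted label differs from y."""
    # Scores of exactly 0 are assigned to class +1 (an arbitrary convention)
    predictions = np.where(x @ beta >= 0, 1.0, -1.0)
    return np.mean(predictions != y)

# Toy inputs: the last sample is on the wrong side of the boundary
x_demo = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
y_demo = np.array([1.0, 1.0, -1.0, 1.0])
beta_demo = np.array([1.0, 1.0])

error_demo = misclassification_error(beta_demo, x_demo, y_demo)
print(error_demo)  # 0.25
```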

**Instruction**  
Write a function _crossval_ that implements leave-one-out cross-validation.
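
A minimal sketch of leave-one-out cross-validation is shown below. To keep it self-contained, the function takes the training and scoring routines as arguments. The signature `crossval(x, y, fit, error)` and the helpers `fit_ridge` and `sign_error` are hypothetical stand-ins: the demo uses a closed-form ridge fit as the model, not the $\rho$-logistic solver.

```python
import numpy as np

def crossval(x, y, fit, error):
    """Leave-one-out CV: hold out each sample once, train on the rest."""
    n = len(y)
    errors = np.zeros(n)
    for i in range(n):
        keep = np.arange(n) != i            # mask selecting all samples but i
        model = fit(x[keep], y[keep])
        # Slicing with i:i+1 keeps the held-out sample two-dimensional
        errors[i] = error(model, x[i:i + 1], y[i:i + 1])
    return errors.mean()

def fit_ridge(x, y, lam=1.0):
    """Stand-in model: ridge regression on the +/-1 labels (an assumption)."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ y)

def sign_error(beta, x, y):
    """Fraction of samples where sign(x @ beta) disagrees with y."""
    predictions = np.where(x @ beta >= 0, 1.0, -1.0)
    return np.mean(predictions != y)

# Toy, linearly separable data (assumed for illustration)
x_demo = np.array([[2.0], [3.0], [4.0], [-2.0], [-3.0], [-4.0]])
y_demo = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

loo_error = crossval(x_demo, y_demo, fit_ridge, sign_error)
print(loo_error)  # 0.0
```

Each call trains one model per sample, so the cost grows linearly with the number of training points times the cost of a single fit.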