# DATA 558 Midterm

Will Wright

### Exercise 1

**Instructions**  
Compute the gradient $\nabla F(\beta)$ where the objective is:
$$\min_{\mathbf{\beta \in \mathbb{R}^d}} F(\beta):=\frac{1}{n}\sum_{i=1}^{n} \frac{1}{\rho}log(1+exp(-\rho y_ix_i^T\beta)) + \lambda\lVert\beta\rVert_2^2$$

**Solution**  
Start by moving the scalar $\frac{1}{\rho}$ outside the summation:  
$F(\beta)=\frac{1}{n\rho}\sum_{i=1}^{n} log(1+exp(-\rho y_ix_i^T\beta)) + \lambda\lVert\beta\rVert_2^2$

Next, break up into two terms:  
(1) $\frac{1}{n\rho}\sum_{i=1}^{n} log(1+exp(-\rho y_ix_i^T\beta))$  
(2) $\lambda\lVert \beta\rVert_2^2$ 

Find the derivative of the first term:  
(1) $\frac{\partial}{\partial \beta}\Big[\frac{1}{n\rho}\sum_{i=1}^{n} log(1+exp(-\rho y_ix_i^T\beta))\Big]$  
  
> Move the constant and summation to the outside of the derivative: 
$\frac{1}{n\rho}\sum_{i=1}^{n}\frac{\partial}{\partial \beta}\Big[ log(1+exp(-\rho y_ix_i^T\beta))\Big]$ 
  
> Use the chain rule with the following functions and their derivatives:  
$(f\circ g \circ h) = log(g\circ h)$  
$(f\circ g \circ h)' = \frac{1}{(g\circ h)} \cdot (g \circ h)' $  
$(g\circ h) = 1+exp(-h)$  
$(g\circ h)' = -exp(-h)\cdot h'$  
$h = \rho y_ix_i^T\beta$  
$h' = \rho y_ix_i$  
  
> Putting the chains together, we have:  
$(f\circ g \circ h)' = -\rho y_ix_i \cdot \frac{exp(- \rho y_ix_i^T\beta)}{1+exp(-\rho y_ix_i^T\beta)}$
  
> Re-apply constant scalar and summation:  
$=\frac{1}{n\rho}\sum_{i=1}^{n}-\rho y_ix_i \cdot \frac{exp(- \rho y_ix_i^T\beta)}{1+exp(-\rho y_ix_i^T\beta)}$

> Move the constant $-\rho$ outside the summation and cancel it out:
$=-\frac{1}{n}\sum_{i=1}^{n}y_ix_i \cdot \frac{exp(- \rho y_ix_i^T\beta)}{1+exp(-\rho y_ix_i^T\beta)}$

Next, derive the second term:  
(2) $\frac{\partial}{\partial \beta}\lambda\lVert \beta\rVert_2^2$  
  
> Move the constant $\lambda$ outside the derivation and convert to matrix form:  
$ =\lambda \frac{\partial}{\partial \beta}\beta^T \beta$
  
> Multiply by the identity matrix $I$:  
$ =\lambda \frac{\partial}{\partial \beta}\beta^T I \beta$
  
> Apply the property $\frac{\partial}{\partial x}x^T Ax = (A+A^T)x$:  
$ =\lambda (I + I^T)\beta$  
  
> Given that $(I + I^T)$ is simply a scalar of 2 for matrices, we can simplify:  

> $=2\lambda \beta $  
  
Next, we add (1) and (2) to get:  
$\nabla F(\beta) = -\frac{1}{n}\sum_{i=1}^{n}y_ix_i \cdot \frac{exp(- \rho y_ix_i^T\beta)}{1+exp(-\rho y_ix_i^T\beta)} + 2\lambda \beta$

**Instructions**  
Consider the Spam dataset from The Elements of Statistical Learning. Standardize the data, if you have not done so already. Be sure to use the training and test splits from the website.

In [117]:
# Load Packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.preprocessing
import scipy.linalg
from sklearn.linear_model import LogisticRegression

spam = pd.read_table('https://web.stanford.edu/~hastie/ElemStatLearn/datasets/spam.data', 
                   delim_whitespace=True, header = None)
test_indicator = pd.read_table('https://web.stanford.edu/~hastie/ElemStatLearn/datasets/spam.traintest',
                         delim_whitespace=True, header = None)

In [100]:
x = np.asarray(spam)[:, 0:-1]
y = np.asarray(spam)[:, -1]*2 - 1
test_indicator = np.array(test_indicator).T[0]

# Divide the data into train, test sets
x_train = x[test_indicator == 0, :]
x_test = x[test_indicator == 1, :]
y_train = y[test_indicator == 0]
y_test = y[test_indicator == 1]

# Standardize the data.
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# Keep track of the number of samples and dimension of each sample
n_train = len(y_train)
n_test = len(y_test)
d = np.size(x, 1)

In [105]:
def convergence_plots(x_vals, lambduh):
    """
    Plot the convergence in terms of the function values and the gradients
    Input:
        - x_vals: Values the gradient descent algorithm stepped to
    """
    n, d = x_vals.shape
    fs = np.zeros(n)
    grads = np.zeros((n, d))
    for i in range(n):
        fs[i] = obj(x_vals[i], lambduh)
        grads[i, :] = computegrad(x_vals[i], lambduh)
    grad_norms = np.linalg.norm(grads, axis=1)
    plt.subplot(121)
    plt.plot(fs)
    plt.xlabel('Iteration')
    plt.ylabel('Objective value')
    plt.subplot(122)
    plt.plot(grad_norms)
    plt.xlabel('Iteration')
    plt.ylabel('Norm of gradient')
    plt.suptitle('Function Value and Norm of Gradient Convergence', fontsize=16)
    plt.subplots_adjust(left=0.2, wspace=0.8, top=0.8)
    plt.show()

In [112]:
def compute_misclassification_error(beta_opt, x, y):
    y_pred = 1/(1+np.exp(-x.dot(beta_opt))) > 0.5
    y_pred = y_pred*2 - 1 # Convert to +/- 1
    return np.mean(y_pred != y)

def plot_misclassification_error(betas_grad, betas_fastgrad, x, y, save_file='', title=''):
    niter_grad = np.size(betas_grad, 0)
    error_grad = np.zeros(niter_grad)
    niter_fg = np.size(betas_fastgrad, 0)
    error_fastgrad = np.zeros(niter_fg)
    for i in range(niter_grad):
        error_grad[i] = compute_misclassification_error(betas_grad[i, :], x, y)
    for i in range(niter_fg):
        error_fastgrad[i] = compute_misclassification_error(betas_fastgrad[i, :], x, y)
    fig, ax = plt.subplots()
    ax.plot(range(1, niter_grad + 1), error_grad, label='gradient descent')
    ax.plot(range(1, niter_fg + 1), error_fastgrad, c='red', label='fast gradient')
    plt.xlabel('Iteration')
    plt.ylabel('Misclassification error')
    if title:
        plt.title(title)
    ax.legend(loc='upper right')
    if not save_file:
        plt.show()
    else:
        plt.savefig(save_file)

**Instructions**  
Write a function _myrhologistic_ that implements the accelerated gradient algorithm to train the $\ell_2^2$-regularized binary logistic regression with $\rho$-logistic loss. The function takes as input
the initial step-size for the backtracking rule, the $\epsilon$ for the stopping criterion based on the norm of the gradient of the objective, and the value of $\rho$.

In [125]:
# rho logistic gradient
def computegrad(beta, lambduh, rho, x, y):
    yx = y[:, np.newaxis]*x
    denom = 1+np.exp(-rho*yx.dot(beta))
    grad = 1/len(y)*np.sum(-rho*yx*np.exp(-yx.dot(beta[:, np.newaxis]))/
        denom[:, np.newaxis], axis=0) + 2*lambduh*beta
    return grad

# rho logistic objective
def objective(beta, lambduh, rho, x, y):
    return 1/len(y) * np.sum(1/rho*np.log(1 + np.exp(-rho*y*x.dot(beta)))) + lambduh*np.linalg.norm(beta)**2

# backtracking with rho logistic
def backtracking(beta, lambduh, rho, x, y, eta=1, alpha=0.5, betaparam=0.8, maxiter=100):
    grad_beta = computegrad(beta, lambduh, rho, x=x, y=y)
    norm_grad_beta = np.linalg.norm(grad_beta)
    found_eta = 0
    iter = 0
    while found_eta == 0 and iter < maxiter:
        if objective(beta - eta * grad_beta, lambduh, rho, x=x, y=y) < \
            objective(beta, lambduh, rho, x=x, y=y)- alpha * eta * norm_grad_beta ** 2:
                found_eta = 1
        elif iter == maxiter:
            raise ('Max number of iterations of backtracking line search reached')
        else:
            eta *= betaparam
            iter += 1
    return eta

def myrhologistic(beta_init, theta_init, lamduh, rho, eta_init, x, y, eps):
    beta = beta_init
    theta = theta_init
    grad_theta = computegrad(theta, lambduh, rho, x=x, y=y)
    grad_beta = computegrad(beta, lambduh, rho, x=x, y=y)
    beta_vals = beta
    theta_vals = theta
    iter = 0
    while np.linalg.norm(grad_beta) > eps:
        eta = backtracking(theta, lambduh, rho, eta=eta_init, x=x, y=y)
        beta_new = theta - eta*grad_theta
        theta = beta_new + iter/(iter+3)*(beta_new-beta)
        # Store all of the places we step to
        beta_vals = np.vstack((beta_vals, beta))
        theta_vals = np.vstack((theta_vals, theta))
        grad_theta = computegrad(theta, lambduh, rho, x=x, y=y)
        grad_beta = computegrad(beta, lambduh, rho, x=x, y=y)
        beta = beta_new
        iter += 1
    return beta_vals

**Instructions**  
Train your $\ell_2^2$-regularized binary logistic regression with $\rho$-logistic loss with $\rho=2$ and $\epsilon=10^{-3}$ o the Spam dataset for $\lambda=1$. Report your misclassification error for this value of $\lambda$.

In [118]:
rho = 2
eps = 10**-3
lambduh = 1
beta_init = np.zeros(d)
theta_init = np.zeros(d)
eta_init = 1/(scipy.linalg.eigh(1/len(y_train)*x_train.T.dot(x_train),
                                eigvals=(d-1, d-1),
                                eigvals_only=True)[0]+lambduh)

In [147]:
test = myrhologistic(beta_init, theta_init, lambduh, rho, eta_init, x, y, eps)

  """
  


In [148]:
test.shape

(8157, 57)

In [149]:
test[-1,:]

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan])

In [150]:
test[-2,:]

array([-5.64627828e-03, -4.29436769e-02, -7.77057546e-03,  1.17521652e-03,
       -3.58701682e-03, -4.19408329e-04,  1.24437140e-02,  6.05395030e-03,
        1.98862085e-03, -1.28640596e-02, -2.29711213e-04, -4.31100408e-02,
       -3.20018007e-03, -3.93638968e-03,  1.28082432e-03,  2.76967049e-02,
        6.78325785e-03,  2.88372610e-03, -2.53825223e-02,  7.56110702e-03,
        9.50938189e-03,  1.86769453e-03,  7.23994128e-03,  8.00809574e-03,
       -3.84862848e-02, -1.76483908e-02,  1.27878388e-01, -4.85902856e-03,
       -1.08111622e-02, -6.29383850e-03, -4.43407624e-03, -2.44658567e-04,
       -1.35577206e-02, -1.75839217e-04, -9.22292053e-03, -2.60965854e-03,
       -9.76388432e-03, -2.96638581e-03, -1.46431302e-02,  9.71512411e-04,
       -1.11963757e-02, -2.69865303e-02, -4.02145539e-03, -1.75274388e-02,
       -2.53009256e-02, -3.45655638e-02, -4.68002978e-04, -4.62777224e-03,
       -1.39082786e-03, -8.96087574e-03, -2.59387829e-03,  9.45408832e-03,
        5.09374034e-03,  

In [144]:
np.linalg.norm(computegrad(test[-2,:], lambduh, rho, x=x, y=y)) < eps

False

In [146]:
eta_init

0.12818700723615611

**Instructions**  
Write a function _crossval_ that implements leave-one-out cross-validation and hold-out cross-validation. You may either write a function that implements each variant separately depinging on the case, or write a general cross-validation function that can be instantiated in each case.

In [None]:
def crossval():
    

**Instructions**  
Find the optimal value of $\lambda$ using leave-one-out cross-validation.  Find the optimal value of $\lambda$ useing hold-out cross-validation with a 80%/20% split for the training set/testing set.  Report your misclassification errors for the two values of $\lambda$ found.

# Exercise 2 - Data Competition Project

**Instructions**  
Pick two classes of your choice from the dataset. Train a classifier using $\ell_2^2$-regularized binary
logistic regression with $\rho$-logistic loss on the training set using your own accelerated gradient algorithm with $\rho = 2$, $\epsilon = 10^{−3}$, and $\lambda = 1$. Be sure to use the features you previously generated with the provided script rather than the raw image features. Plot, with different colors, the misclassification error on the training set and on the validation set vs iterations.

**Instructions**  
Find the value of the regularization parameter $\lambda$ using using leave-one-out cross-validation. Find the value of the regularization parameter $\lambda$ using using hold-out cross-validation. Train a classifier using $\ell_2^2$-regularized binary logistic regression with $\rho$-logistic loss on the training set using your own accelerated gradient algorithm with that value of $\lambda$ found by hold-out cross-validation. Plot, with different colors, the misclassification error on the training set and on the validation set vs. iterations.

**Instructions**  
Consider all pairs of classes from the dataset. For each pair of classes, train a classifier using
a $\ell_2^2$-regularized binary logistic regression with $\rho$-logistic loss on the training set comprising
only the data-points for that pair of classes using your own fast gradient algorithm. For each
pair of classes, find the value of the regularization parameter $\lambda$ using hold-out cross-validation
on the training set comprising only the data-points for that pair of classes.

**Instructions**  
Write a function that for any new data point predicts its label. To do this, you will perform the following: input the data point into each classifier (for each pair of classes) you trained above. Record the class predicted by each classifier. Then your prediction for this data point is the most frequently predicted class. If there is a tie, randomly choose between the tied classes. Report the misclassification error on the validation set and test set. Report the precision/recall on the validation set.
