# Lecture 5--Optimization

## 1. Data science: Logistic regression

### 1.1. Derivation

#### Linear formulation

$$\mathcal L=\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}$$
$$\mathcal L=\prod_{i=1}^n F(x_i'\beta)^{y_i}(1-F(x_i'\beta))^{1-y_i}$$

$$\ln\mathcal L=\sum_{i=1}^n y_i \ln{F(x_i'\beta)}+(1-y_i)\ln{(1-F(x_i'\beta))}$$
$$\ln\mathcal L=\left[\ln{F(\beta'\mathbf X')}\right]y+\left[\ln{(\mathbf{1}'-F(\beta'\mathbf X'))}\right](1-y)$$

In [20]:
import numpy as np
# logistic CDF F(z)=1/(1+exp(-z))
expit = lambda x: 1/(1+np.exp(-x))
def loglike(x,y,b):
    # F(x_i'b) for every observation i
    Fx = expit(b.T@x.T)
    return np.log(Fx)@y+np.log(1-Fx)@(1-y)

#### Gradient
$$\frac{d\ln\mathcal L}{d\beta}=\mathbf X'\text{diag}\left(\frac{f(\mathbf X\beta)}{F(\mathbf X\beta)}\right)y-\mathbf X'\text{diag}\left(\frac{f(\mathbf X\beta)}{\mathbf 1-F(\mathbf X\beta)}\right)(1-y)$$
$$\frac{d\ln\mathcal L}{d\beta}=\mathbf X'\text{diag}\left(\frac{f(\mathbf X\beta)(1-F(\mathbf X\beta))}{(1-F(\mathbf X\beta))F(\mathbf X\beta)}\right)y-\mathbf X'\text{diag}\left(\frac{f(\mathbf X\beta)F(\mathbf X\beta)}{F(\mathbf X\beta)(1-F(\mathbf X\beta))}\right)(1-y)$$

For the logistic distribution, $f(z)=F(z)(1-F(z))$, so each fraction simplifies:

$$\frac{d\ln\mathcal L}{d\beta}=\mathbf X'\left[\text{diag}\left(1-F(\mathbf X\beta)\right)y-\text{diag}\left(F(\mathbf X\beta)\right)(1-y)\right]$$
$$\frac{d\ln\mathcal L}{d\beta}=\mathbf X'\left[\text{diag}\left(y-F(\mathbf X\beta)y-F(\mathbf X\beta)+F(\mathbf X\beta)y\right)\right]\mathbf 1$$
$$\frac{d\ln\mathcal L}{d\beta}=\mathbf X'\left[\text{diag}\left(y-F(\mathbf X\beta)\right)\right]\mathbf 1$$
$$\frac{d\ln\mathcal L}{d\beta}=\mathbf X'\left[y-F(\mathbf X\beta)\right]$$

In [1]:
def gradient(x,y,b):
    Fx = expit(x@b)
    return x.T@(y-Fx)
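
One way to sanity-check the derivation above is to compare the analytic gradient with a central finite difference of the log-likelihood. The sketch below is illustrative only: the synthetic data, the helper `numerical_gradient`, and the step size `h` are assumptions, not part of the lecture code.

```python
import numpy as np

def expit(z):
    # logistic CDF
    return 1.0 / (1.0 + np.exp(-z))

def loglike(x, y, b):
    Fx = expit(x @ b)
    return np.log(Fx) @ y + np.log(1.0 - Fx) @ (1.0 - y)

def gradient(x, y, b):
    # analytic gradient X'(y - F(Xb)) from the derivation
    return x.T @ (y - expit(x @ b))

def numerical_gradient(x, y, b, h=1e-6):
    # hypothetical helper: central difference, one coordinate at a time
    g = np.zeros_like(b)
    for j in range(b.size):
        e = np.zeros_like(b)
        e[j] = h
        g[j] = (loglike(x, y, b + e) - loglike(x, y, b - e)) / (2.0 * h)
    return g

# synthetic data (an assumption, used only for this check)
rng = np.random.default_rng(0)
x_demo = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y_demo = (rng.uniform(size=50) < 0.5).astype(float)
b_demo = np.array([0.1, -0.2, 0.3])

max_gap = np.abs(gradient(x_demo, y_demo, b_demo)
                 - numerical_gradient(x_demo, y_demo, b_demo)).max()
print(max_gap)  # should be close to zero if the formula is right
```

A large gap would point to a sign or transpose error in either the formula or the code.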

#### Hessian
$$\frac{d}{d\beta}\frac{d\ln\mathcal L}{d\beta}'=\frac{d}{d\beta}\left[y'-F(\beta'\mathbf X')\right]\mathbf X$$
$$\frac{d^2\ln\mathcal L}{d\beta d\beta'}=-\mathbf X'\left[\text{diag}\left(f(\mathbf X\beta)\right)\right]\mathbf X$$

In [2]:
def hessian(x,y,b):
    Fx = expit(x@b)
    fx = Fx*(1-Fx)
    return -x.T@np.diagflat(fx.flatten())@x

#### Cramér-Rao Lower Bound

**Lemma**

Suppose the following conditions hold:

(A)
$$\text{E}\left[\hat\theta\right]=\theta$$
(B)
$$\text{E}\left[\frac{d\ln{\mathcal{L}}}{d\theta}\right]=0$$
(C)
$$\text{E}\left[\frac{d^2\ln{\mathcal{L}}}{d\theta d\theta'}\right]=-\text{E}\left[\frac{d\ln{\mathcal{L}}}{d\theta}\frac{d\ln{\mathcal{L}}}{d\theta'}\right]$$
(D)
$$\int_\Omega{\frac{d\mathcal{L}}{d\theta}\hat\theta'dz}=\mathbf{I}_r$$
Then the variance of $\hat\theta$ is bounded below by
$$\text{Var}\left[\hat\theta\right]\ge\left[-\text{E}\left[\frac{d^2\ln{\mathcal{L}}}{d\theta d\theta'}\right]\right]^{-1}$$



*Proof*

Define $\mathbf{P}=\text{E}\left[(\hat\theta-\theta)(\hat\theta-\theta)'\right]$, $\mathbf{R}=\text{E}\left[\left(d\ln{\mathcal{L}}\big/d\theta\right)\left(d\ln{\mathcal{L}}\big/d\theta'\right)\right]$, and $\mathbf{Q}=\text{E}\left[(\hat\theta-\theta)\left(d\ln{\mathcal{L}}\big/d\theta'\right)\right]$

$$
\begin{align}
\mathbf{Q}&=\text{E}\left[(\hat\theta-\theta)\frac{d\ln{\mathcal{L}}}{d\theta'}\right] \\
          &=\text{E}\left[\hat\theta\frac{d\ln{\mathcal{L}}}{d\theta'}\right]-\theta\text{E}\left[\frac{d\ln{\mathcal{L}}}{d\theta'}\right] \\
          &=\text{E}\left[\hat\theta\frac{1}{\mathcal{L}}\frac{d\mathcal{L}}{d\theta'}\right]-\mathbf{0} \\
          &=\int_\Omega{\left[\hat\theta\frac{1}{\mathcal{L}}\frac{d\mathcal{L}}{d\theta'}\right]\mathcal{L}dz} \\
          &=\left[\int_\Omega{\frac{d\mathcal{L}}{d\theta}\hat\theta' dz}\right]' \\
          &=\mathbf{I}_r'=\mathbf{I}_r
\end{align}
$$

Note that 
$$
\begin{bmatrix}
\mathbf{P} & \mathbf Q \\
\mathbf Q' & \mathbf R 
\end{bmatrix}\ge \mathbf{0}
$$

This holds because the left-hand side is a variance-covariance matrix and is therefore positive semidefinite. Premultiply both sides of the inequality by $[\mathbf I_r,-\mathbf R^{-1}]$ and postmultiply by $[\mathbf I_r,-\mathbf R^{-1}]'$. Using $\mathbf Q=\mathbf I_r$, you get:

$$\mathbf P - \mathbf R^{-1} \ge \mathbf{0}$$

or 

$$\mathbf P \ge \mathbf R^{-1}$$

or 

$$\text{E}\left[(\hat\theta-\theta)(\hat\theta-\theta)'\right] \ge \left[\text{E}\left[\frac{d\ln{\mathcal{L}}}{d\theta}\frac{d\ln{\mathcal{L}}}{d\theta'}\right]\right]^{-1} $$

This can be rewritten as the Cramér-Rao lower bound by substituting the variance definition and assumption (C):

$$\text{Var}\left[\hat\theta\right]\ge\left[-\text{E}\left[\frac{d^2\ln{\mathcal{L}}}{d\theta d\theta'}\right]\right]^{-1}$$



**Theorem** Cramér-Rao Lower Bound

Assume $\mathcal{L}$ is continuous and twice differentiable in $\theta$. For any unbiased estimator $\hat\theta$, the variance is bounded below by
$$\text{Var}\left[\hat\theta\right]\ge\left[-\text{E}\left[\frac{d^2\ln{\mathcal{L}}}{d\theta d\theta'}\right]\right]^{-1}$$


*Proof*

(A) holds because the estimator is unbiased. (B) and (C) follow by interchanging differentiation and integration (the Leibniz integral rule). (D) follows in the same way once (A) holds.

(B)
$$
\begin{align}
\text{E}\left[\frac{d\ln{\mathcal{L}}}{d\theta}\right] &= \text{E}\left[\frac{1}{\mathcal{L}}\frac{d\mathcal{L}}{d\theta}\right] = \int_\Omega{\left[\frac{1}{\mathcal{L}}\frac{d\mathcal{L}}{d\theta}\right]\mathcal Ldz} \\
&=\frac{d}{d\theta}\int_\Omega{\mathcal{L}dz}=\frac{d}{d\theta}1=0 
\end{align}
$$

(C)
$$
\begin{align}
\text{E}\left[\frac{d^2\ln{\mathcal{L}}}{d\theta d\theta'}\right]&=\text{E}\left[\frac{d}{d\theta}\frac{d\ln{\mathcal{L}}}{d\theta'}\right]=\text{E}\left[\frac{d}{d\theta}\left(\frac{1}{\mathcal{L}}\frac{d\mathcal{L}}{d\theta'}\right)\right]\\
&=\text{E}\left[\left(\frac{1}{\mathcal{L}}\frac{d^2\mathcal{L}}{d\theta d\theta'}\right)\right]+\text{E}\left[\left(\frac{-1}{\mathcal{L}^2}\frac{d\mathcal{L}}{d\theta}\frac{d\mathcal{L}}{d\theta'}\right)\right]\\
&=\int_\Omega{\frac{d^2\mathcal{L}}{d\theta d\theta'}dz}-\text{E}\left[\frac{d\ln{\mathcal{L}}}{d\theta}\frac{d\ln{\mathcal{L}}}{d\theta'}\right]\\
&=\frac{d^2}{d\theta d\theta'}\int_\Omega{\mathcal{L}dz}-\text{E}\left[\frac{d\ln{\mathcal{L}}}{d\theta}\frac{d\ln{\mathcal{L}}}{d\theta'}\right]\\
&=\mathbf{0}-\text{E}\left[\frac{d\ln{\mathcal{L}}}{d\theta}\frac{d\ln{\mathcal{L}}}{d\theta'}\right]=-\text{E}\left[\frac{d\ln{\mathcal{L}}}{d\theta}\frac{d\ln{\mathcal{L}}}{d\theta'}\right]
\end{align}
$$
(D)
$$\int{\frac{d\mathcal{L}}{d\theta}\hat\theta'dz}=\frac{d}{d\theta}\int{\mathcal{L}\hat\theta'dz}=\frac{d}{d\theta}E[\hat\theta]=\frac{d}{d\theta}\theta=\mathbf{I}_r$$
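
As a worked illustration (a standard textbook case, not part of the lecture's derivation), take $n$ independent draws $z_i\sim\text{Bernoulli}(p)$ and the unbiased estimator $\hat p=\bar z$:

$$
\begin{align}
\ln\mathcal L &= \sum_{i=1}^n z_i\ln p+(1-z_i)\ln(1-p) \\
\frac{d^2\ln\mathcal L}{dp^2} &= -\sum_{i=1}^n\left[\frac{z_i}{p^2}+\frac{1-z_i}{(1-p)^2}\right] \\
-\text{E}\left[\frac{d^2\ln\mathcal L}{dp^2}\right] &= \frac{n}{p}+\frac{n}{1-p}=\frac{n}{p(1-p)} \\
\text{Var}\left[\hat p\right] &= \frac{p(1-p)}{n}=\left[\frac{n}{p(1-p)}\right]^{-1}
\end{align}
$$

In this case the sample mean attains the bound exactly.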


## 2. Programming

### 2.1 Grid search

Evaluate the objective at every point in a given parameter space and return the point with the optimal value. The result is only as accurate as the spacing of the grid.

In [92]:
from itertools import product
def grid_search(func,space,maximize=False):
    vstates = [(x,func(x)) for x in space]
    vstates.sort(key=lambda x: x[1])
    if maximize: return vstates[-1][0]
    return vstates[0][0]

x = np.linspace(0,10,1000).tolist()
func = lambda x: (x[0]-4.0001)**2*(x[1]-6.0001)**2
grid_search(func,product(x,x))

(4.004004004004004, 5.995995995995996)
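
The dependence on grid spacing can be seen in a small sketch. It restates `grid_search` compactly with `min`/`max`, and it uses a sum-of-squares objective so the minimizer is unique. Both choices, along with the pure-Python `linspace` helper, are assumptions made for illustration.

```python
from itertools import product

def grid_search(func, space, maximize=False):
    # compact restatement: pick the best point by objective value
    best = max if maximize else min
    return best(space, key=func)

def linspace(lo, hi, n):
    # hypothetical stand-in for np.linspace, to keep the sketch dependency-free
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def func(p):
    # illustrative objective with its minimum at (4.0001, 6.0001)
    return (p[0] - 4.0001) ** 2 + (p[1] - 6.0001) ** 2

results, errors = {}, {}
for n in (11, 101):
    axis = linspace(0.0, 10.0, n)
    best_point = grid_search(func, product(axis, axis))
    results[n] = best_point
    # worst coordinate error against the known minimizer
    errors[n] = max(abs(best_point[0] - 4.0001), abs(best_point[1] - 6.0001))
    print(n, best_point, errors[n] <= 10.0 / (n - 1) / 2)
```

The error in each coordinate cannot exceed half the grid spacing, while the number of evaluations grows as `n**2` in two dimensions.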

### 2.2 Gradient descent

Starting from an initial guess, repeatedly step in the direction opposite the gradient (or along it when maximizing). The step length is the gradient scaled by the learning rate.

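In [88]:
def gradient_descent(func,gradient,init_x:np.ndarray,learning_rate:float=0.05,max_reps:int=1000,tolerance:float=1e-6,maximize=False):
    x = np.asarray(init_x,dtype=float).copy()
    sign = 1.0 if maximize else -1.0
    flast = func(x)
    for i in range(max_reps):
        step = sign*learning_rate*gradient(x)
        x_new = x+step
        fnew = func(x_new)
        # reject the step and stop if it makes the objective worse
        worse = fnew<flast if maximize else fnew>flast
        if worse: break
        x,flast = x_new,fnew
        # stop once the step is negligibly small
        if np.abs(step).sum()<tolerance: break
    return x
# illustrative objective and gradient (assumed; the original call omitted func)
f_demo = lambda x: (x[0]-4)**2+(x[1]-6)**2
g_demo = lambda x: np.array([2*(x[0]-4),2*(x[1]-6)])
gradient_descent(f_demo,g_demo,np.array([0.75,0.15]))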
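
The choice of learning rate matters. The sketch below applies a plain fixed-step update (without the stopping rule used above) to $f(x)=x^2$, where each step multiplies $x$ by $1-2\cdot\text{learning rate}$. The function `descend` and the three rates are illustrative assumptions.

```python
def descend(gradient, x0, learning_rate, steps):
    # fixed-step gradient descent with no stopping rule
    x = x0
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x

gradient = lambda x: 2.0 * x  # gradient of f(x) = x**2, minimized at 0

slow = descend(gradient, 1.0, 0.05, 100)     # x shrinks by a factor of 0.9 per step
fast = descend(gradient, 1.0, 0.45, 100)     # x shrinks by a factor of 0.1 per step
diverged = descend(gradient, 1.0, 1.1, 100)  # x grows by a factor of -1.2 per step
print(slow, fast, diverged)
```

For this objective, any rate above 1 makes the iterates grow without bound, which is why a check that rejects worsening steps is a useful safeguard.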

### 2.3 Newton's method

In [23]:
def newton(gradient,hessian,init_x:np.ndarray,max_reps:int=100,tolerance:float=1e-6):
    x = init_x.copy()
    for i in range(max_reps):
        update = -np.linalg.solve(hessian(x),gradient(x))
        x += update
        if np.abs(update).sum()<tolerance:
            return (x,i)
    raise Exception('Newton did not converge')
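
As a rough usage sketch, Newton's method can be combined with the logistic gradient and Hessian from Section 1 to fit a model on simulated data. The data-generating values (`beta_true`, the seed, the sample size) are assumptions chosen for illustration, and the Hessian is built by broadcasting rather than with `np.diagflat`.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton(gradient, hessian, init_x, max_reps=100, tolerance=1e-6):
    x = init_x.copy()
    for i in range(max_reps):
        # Newton step: solve H * update = -g
        update = -np.linalg.solve(hessian(x), gradient(x))
        x += update
        if np.abs(update).sum() < tolerance:
            return x, i
    raise RuntimeError('Newton did not converge')

# simulated data (illustrative assumptions)
rng = np.random.default_rng(1)
n = 500
x_sim = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y_sim = (rng.uniform(size=n) < expit(x_sim @ beta_true)).astype(float)

def grad(b):
    return x_sim.T @ (y_sim - expit(x_sim @ b))

def hess(b):
    Fx = expit(x_sim @ b)
    # -X' diag(F(1-F)) X without forming the n-by-n diagonal matrix
    return -(x_sim.T * (Fx * (1.0 - Fx))) @ x_sim

beta_hat, iters = newton(grad, hess, np.zeros(2))
# standard errors from the inverse of the negative Hessian
std_err = np.sqrt(np.diag(-np.linalg.inv(hess(beta_hat))))
print(beta_hat, iters, std_err)
```

The standard errors use the same inverse negative Hessian that appears in the Cramér-Rao bound.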

### 2.4 Complete code

In [24]:
from cleands import *

class likelihood_model(learning_model):
    def evaluate_lnL(self,pred): raise NotImplementedError
    @property
    def lnL(self): return self.evaluate_lnL(self.fitted)
    @property
    def aic(self): return 2*self.n_feat-2*self.lnL
    @property
    def bic(self): return np.log(self.n_obs)*self.n_feat-2*self.lnL
    @property
    def deviance(self): return 2*self.lnL-2*self._null_lnL_()
    def _gradient_(self,coef): raise NotImplementedError
    def _hessian_(self,coef): raise NotImplementedError
    def _null_lnL_(self): return self.evaluate_lnL(np.ones(self.y.shape)*self.y.mean())
    def __vcov_params_lnL__(self): return -np.linalg.inv(self._hessian_(self.params))
    def __max_likelihood__(self,init_params,gradient=None,hessian=None):
        if gradient is None: gradient=self._gradient_
        if hessian is None: hessian=self._hessian_
        return newton(gradient,hessian,init_params)
class linear_model(prediction_model,likelihood_model):
    def __init__(self,x,y):
        super(linear_model,self).__init__(x,y)
        self.params = self.__fit__(x,y)
    def __fit__(self,x,y): return np.linalg.solve(x.T@x,x.T@y)
    def predict(self,target): return target@self.params
    def evaluate_lnL(self,pred): return -self.n_obs/2*(np.log(2*np.pi*(self.y-pred).var())+1)
    @property
    def r_squared(self):
        return 1-self.residuals.var()/self.y.var()
    @property
    def adjusted_r_squared(self):
        return 1-(1-self.r_squared)*(self.n_obs-1)/(self.n_obs-self.n_feat)
    @property
    def degrees_of_freedom(self):
        return self.n_obs-self.n_feat
    @property
    def ssq(self):
        return self.residuals.var()*(self.n_obs-1)/self.degrees_of_freedom
class logistic_regressor(linear_model):
    def __fit__(self,x,y):
        params,self.iters = self.__max_likelihood__(np.zeros(self.n_feat))
        return params
    @property
    def vcov_params(self):return self.__vcov_params_lnL__()
    def evaluate_lnL(self,pred):return self.y.T@np.log(pred)+(1-self.y).T@np.log(1-pred)
    def _gradient_(self,coefs):return self.x.T@(self.y-expit(self.x@coefs))
    def _hessian_(self,coefs):
        Fx = expit(self.x@coefs)
        return -self.x.T@np.diagflat((Fx*(1-Fx)).values)@self.x
    def predict(self,target):return expit(target@self.params)

class LogisticRegressor(logistic_regressor,broom_model):
    def __init__(self,x_vars:list,y_var:str,data:pd.DataFrame,*args,**kwargs):
        super(LogisticRegressor,self).__init__(data[x_vars],data[y_var],*args,**kwargs)
        self.x_vars = x_vars
        self.y_var = y_var
        self.data = data
    def _glance_dict_(self):
        return {'mcfadden.r.squared':self.r_squared,
                'adjusted.r.squared':self.adjusted_r_squared,
                'self.df':self.n_feat,
                'resid.df':self.degrees_of_freedom,
                'aic':self.aic,
                'bic':self.bic,
                'log.likelihood':self.lnL,
                'deviance':self.deviance,
                'resid.var':self.ssq}

## 3. Programming challenges

### 3.1 Recursive partitioning trees

Write a class that implements a recursive partitioning algorithm. Use our common machine learning code.

### 3.2 Quaternions

Quaternions are a generalization of the complex numbers. Where a complex number $a+bi$ has two components, $a$ and $b$, a quaternion has four, $a$, $b$, $c$, and $d$: $$a+bi+cj+dk$$

Quaternions support four basic operations: addition, subtraction, multiplication, and inversion. Your job is to write a quaternion class that implements these operations. You can learn how to perform them on the Wikipedia page for quaternions.
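
For reference, multiplication is determined by Hamilton's relations among the basis elements, and the inverse of a nonzero quaternion is its conjugate divided by its squared norm. These are standard identities, stated here as a starting point rather than as a solution:

$$
\begin{align}
i^2=j^2=k^2&=ijk=-1 \\
ij=k,\quad jk=i,\quad ki&=j,\quad ji=-k,\quad kj=-i,\quad ik=-j \\
(a+bi+cj+dk)^{-1}&=\frac{a-bi-cj-dk}{a^2+b^2+c^2+d^2}
\end{align}
$$

Note that multiplication is not commutative, so the order of the operands matters in your implementation.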