# Optimization in neural networks

* goal: find the weights and biases that minimize the loss (distance between observed and predicted values)

**Regression perceptron**
* prediction function $\hat{y} = x_1 w_1 + x_2 w_2 + b$ (identity activation)
* square loss $L(y,\hat{y}) = \frac{1}{2}(y-\hat{y})^2$
* gradient $\nabla L = 
\begin{bmatrix}
    \frac{\partial L}{\partial w_1} \\
    \frac{\partial L}{\partial w_2}\\
    \frac{\partial L}{\partial b}
\end{bmatrix} $

Applying the chain rule:

$$
\begin{aligned}
\frac{\partial L}{\partial b} &=
    \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b}
    = -(y-\hat{y})\\
\frac{\partial L}{\partial w_1} &=
    \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_1}
    = -(y-\hat{y}) x_1\\
\frac{\partial L}{\partial w_2} &=
    \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2}
    = -(y-\hat{y}) x_2
\end{aligned}
$$

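Move in the opposite direction of the gradient (gradient descent): each parameter $\theta$ is updated as $\theta \leftarrow \theta - \alpha \frac{\partial L}{\partial \theta}$, where $\alpha$ is the learning rate.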
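
The update can be sketched in NumPy as follows. This is a minimal illustration, not a full training loop: the single training example, the zero initialization, and the learning rate `0.1` are assumptions chosen only to make the example run.

```python
import numpy as np

def regression_gradient(x, y, w, b):
    """Gradient of L = 0.5 * (y - y_hat)^2 with y_hat = x @ w + b (one sample)."""
    y_hat = x @ w + b
    error = -(y - y_hat)        # dL/dy_hat
    return error * x, error     # (dL/dw, dL/db)

# Hypothetical training example and starting parameters.
x = np.array([1.0, 2.0])
y = 3.0
w = np.zeros(2)
b = 0.0
learning_rate = 0.1

for _ in range(100):
    grad_w, grad_b = regression_gradient(x, y, w, b)
    w -= learning_rate * grad_w  # step against the gradient
    b -= learning_rate * grad_b

final_loss = 0.5 * (y - (x @ w + b)) ** 2
```

With a single sample the loss shrinks toward zero; with a real dataset the gradients would be averaged over samples.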

**Classification perceptron**
* prediction function $\hat{y} = \frac{1}{1+e^{-(x_1 w_1 + x_2 w_2 + b)}}$ (sigmoid activation)
* sigmoid derivative $\frac{d}{dz} \sigma(z) = \sigma(z) \cdot (1-\sigma(z))$ (see chain section)
* log-loss $L(y,\hat{y}) = -y \ln(\hat{y}) - (1-y) \ln (1-\hat{y})$
* gradient $\nabla L = 
\begin{bmatrix}
    \frac{\partial L}{\partial w_1} \\
    \frac{\partial L}{\partial w_2}\\
    \frac{\partial L}{\partial b}
\end{bmatrix} $

Applying the chain rule:

$$
\begin{aligned}
\frac{\partial L}{\partial b} &=
    \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b}
    = \frac{-(y-\hat{y})}{\hat{y}(1-\hat{y})} \hat{y}(1-\hat{y})
    = -(y-\hat{y})\\
\frac{\partial L}{\partial w_1} &=
    \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_1}
    = \frac{-(y-\hat{y})}{\hat{y}(1-\hat{y})} \hat{y}(1-\hat{y}) x_1
    = -(y-\hat{y}) x_1\\
\frac{\partial L}{\partial w_2} &=
    \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2}
    = \frac{-(y-\hat{y})}{\hat{y}(1-\hat{y})} \hat{y}(1-\hat{y}) x_2
    = -(y-\hat{y}) x_2
\end{aligned}
$$

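Move in the opposite direction of the gradient using gradient descent, exactly as for the regression perceptron. The sigmoid terms cancel, so the gradient has the same form as in the regression case.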
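
A minimal sketch of the same loop for the classification perceptron. The training example, the zero initialization, and the learning rate `0.5` are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_hat):
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def classification_gradient(x, y, w, b):
    """Gradient of the log-loss for y_hat = sigmoid(x @ w + b) (one sample)."""
    y_hat = sigmoid(x @ w + b)
    error = -(y - y_hat)        # dL/dz once the sigmoid terms cancel
    return error * x, error     # (dL/dw, dL/db)

# Hypothetical training example (label 1) and starting parameters.
x = np.array([1.0, 2.0])
y = 1.0
w = np.zeros(2)
b = 0.0
learning_rate = 0.5

loss_before = log_loss(y, sigmoid(x @ w + b))
for _ in range(50):
    grad_w, grad_b = classification_gradient(x, y, w, b)
    w -= learning_rate * grad_w  # step against the gradient
    b -= learning_rate * grad_b
loss_after = log_loss(y, sigmoid(x @ w + b))
```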

**Neural networks**
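* chain the partial derivatives along the computation graph; the example below is for a weight in the first layer of a two-layer classifier, and the other parameters are computed the same way
* notation: $z_1$ and $a_1 = \sigma(z_1)$ are the pre-activation and activation of the first hidden unit, $w_1$ is the output-layer weight on $a_1$, and $z$ is the output pre-activation with $\hat{y} = \sigma(z)$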

$$
\begin{aligned}
\frac{\partial L}{\partial w_{11}} &=
    \frac{\partial z_1}{\partial w_{11}} \cdot \frac{\partial a_1}{\partial z_1}
    \cdot \frac{\partial z}{\partial a_1} \cdot \frac{\partial \hat{y}}{\partial z}
    \cdot \frac{\partial L}{\partial \hat{y}} \\
    &= x_1 \cdot a_1(1-a_1) \cdot w_1 \cdot \hat{y}(1-\hat{y}) \cdot \frac{-(y-\hat{y})}{\hat{y}(1-\hat{y})} \\
    &= -x_1 w_1 a_1(1-a_1)(y-\hat{y})
\end{aligned}
$$
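
One way to sanity-check the closed-form result is to compare it with a finite-difference estimate. The sketch below assumes a network with two inputs, two sigmoid hidden units, and one sigmoid output; the indexing `W[i, j]` (input `i` to hidden unit `j`) and all numeric values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_hat):
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def forward(x, W, b_hidden, w_out, b_out):
    """W[i, j] connects input i to hidden unit j (assumed indexing)."""
    a = sigmoid(x @ W + b_hidden)        # hidden activations a_1, a_2
    y_hat = sigmoid(a @ w_out + b_out)   # output prediction
    return a, y_hat

# Hypothetical input, label, and parameters.
x = np.array([0.5, -1.0])
y = 1.0
W = np.array([[0.1, -0.2], [0.3, 0.4]])
b_hidden = np.array([0.0, 0.1])
w_out = np.array([0.7, -0.5])
b_out = 0.2

# Closed-form gradient from the chain rule: -x_1 w_1 a_1 (1 - a_1) (y - y_hat).
a, y_hat = forward(x, W, b_hidden, w_out, b_out)
analytic = -x[0] * w_out[0] * a[0] * (1 - a[0]) * (y - y_hat)

# Central finite difference on w_11 (W[0, 0]).
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (
    log_loss(y, forward(x, W_plus, b_hidden, w_out, b_out)[1])
    - log_loss(y, forward(x, W_minus, b_hidden, w_out, b_out)[1])
) / (2 * eps)
```

The two values should agree to several decimal places; a mismatch usually points to a wrong term in the chain.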

# Newton's method

**Univariate**

* goal is to find the zeros of a function; algorithm:
    * construct the tangent at the current point
    * find where the tangent intersects the x axis and project that x back onto the function
    * repeat until you reach the zero (within a tolerance)
    * iterative step $x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$
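
The steps above can be sketched as a short loop. The function name `newton_root`, the tolerance, and the example $f(x) = x^2 - 2$ are assumptions for illustration; the sketch does not guard against $f'(x) = 0$.

```python
def newton_root(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Find a zero of f by repeatedly following the tangent to the x axis."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)   # horizontal distance to the tangent's zero
        x -= step
        if abs(step) < tol:        # stop once the update is negligible
            break
    return x

# Example: the positive zero of f(x) = x^2 - 2 is sqrt(2).
root = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
```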

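* in optimization, we apply this to the derivative of the function to find where $f'(x) = 0$: $x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$
* sign of the second derivative: + (convex), - (concave), 0 (no curvature, e.g. a line)
* $f'(x) = 0, f''(x) > 0 \rightarrow \text{min}$
* $f'(x) = 0, f''(x) < 0 \rightarrow \text{max}$
* $f'(x) = 0, f''(x) = 0 \rightarrow \text{inconclusive}$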
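
A minimal sketch of Newton's method used for optimization, followed by the second-derivative test. The example $f(x) = e^x - 2x$ (minimum at $x = \ln 2$), the starting point, and the helper name `newton_optimize` are assumptions chosen for illustration.

```python
import math

def newton_optimize(f_prime, f_second, x0, tol=1e-10, max_iter=50):
    """Find a critical point of f by applying Newton's method to f'."""
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_second(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and f''(x) = exp(x).
f_prime = lambda x: math.exp(x) - 2
f_second = lambda x: math.exp(x)

x_star = newton_optimize(f_prime, f_second, x0=0.0)

# Second-derivative test at the critical point.
curvature = f_second(x_star)
if curvature > 0:
    kind = "min"
elif curvature < 0:
    kind = "max"
else:
    kind = "inconclusive"
```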

**Hessian**
* multivariable optimization: the matrix of second-order partial derivatives along each pair of dimensions
* $
H = \begin{bmatrix}
    \frac{\partial^2 f(x,y)}{\partial x^2} & \frac{\partial^2 f(x,y)}{\partial x \partial y}\\
    \frac{\partial^2 f(x,y)}{\partial y \partial x}& \frac{\partial^2 f(x,y)}{\partial y^2}\\
\end{bmatrix}
$

* to understand the curvature at a point, compute the eigenvalues of the Hessian there, i.e. solve $\det(H(x_{0},y_{0})-\lambda I) = 0$
* all eigenvalues > 0: convex (local min); all eigenvalues < 0: concave (local max); mixed signs: saddle point; any eigenvalue = 0: inconclusive
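
The eigenvalue test can be sketched as a small helper. The function name `classify_critical_point`, the tolerance, and the example matrices are assumptions; `np.linalg.eigvalsh` is used because a Hessian with continuous second derivatives is symmetric.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-12):
    """Classify a critical point from the eigenvalue signs of its Hessian."""
    eigenvalues = np.linalg.eigvalsh(hessian)  # assumes a symmetric matrix
    if np.all(eigenvalues > tol):
        return "minimum"
    if np.all(eigenvalues < -tol):
        return "maximum"
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle"
    return "inconclusive"  # at least one (near-)zero eigenvalue, none of opposite signs

# Hypothetical Hessians evaluated at a critical point.
kind_min = classify_critical_point(np.array([[2.0, 1.0], [1.0, 3.0]]))
kind_max = classify_critical_point(np.array([[-2.0, 0.0], [0.0, -1.0]]))
kind_saddle = classify_critical_point(np.array([[2.0, 0.0], [0.0, -3.0]]))
kind_flat = classify_critical_point(np.array([[2.0, 0.0], [0.0, 0.0]]))
```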

**Multivariate**

* iterative step $
\begin{bmatrix} x_{k+1}\\y_{k+1}\end{bmatrix} =
\begin{bmatrix} x_{k}\\y_{k}\end{bmatrix} - H^{-1}(x_k, y_k) \nabla f(x_k,y_k)
$
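
A minimal sketch of the multivariate step. The test function $f(x,y) = (x-1)^4 + (x-1)^2 + 2(y+2)^2$ (minimum at $(1, -2)$), the starting point, and the helper names are assumptions for illustration. Instead of forming $H^{-1}$ explicitly, the sketch solves the linear system $H \, s = \nabla f$, which gives the same step.

```python
import numpy as np

def gradient(p):
    x, y = p
    return np.array([4 * (x - 1) ** 3 + 2 * (x - 1), 4 * (y + 2)])

def hessian(p):
    x, y = p
    return np.array([[12 * (x - 1) ** 2 + 2, 0.0],
                     [0.0, 4.0]])

def newton_multivariate(grad, hess, p0, tol=1e-10, max_iter=50):
    """Iterate p <- p - H^{-1}(p) grad(p) until the step is negligible."""
    p = np.asarray(p0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(p), grad(p))  # same as H^{-1} @ grad
        p = p - step
        if np.linalg.norm(step) < tol:
            break
    return p

minimum = newton_multivariate(gradient, hessian, p0=[3.0, 0.0])
```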



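```python
import numpy as np

# Eigenvalues of a diagonal Hessian are its diagonal entries; both are > 0, so the point is a minimum.
np.linalg.eigvals(np.array([[2, 0], [0, 10]]))
# array([ 2., 10.])
```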