# Gradient

## what is gradient

### clarification

- derivative
- partial derivative
- gradient <br>
$\nabla f=\displaystyle\left(\frac{\partial f}{\partial x_{1}},\frac{\partial f}{\partial x_{2}},\dots,\frac{\partial f}{\partial x_{n}}\right)$

**note:** the gradient of a function is a vector: its direction is the direction in which the function increases fastest, and its norm is the rate of increase in that direction.
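
As a small illustration of the note, the sketch below asks PyTorch's autograd for the gradient of an example function. The function $f(x_{1},x_{2})=x_{1}^{2}+x_{2}^{2}$ and the evaluation point $(3,4)$ are assumptions chosen for illustration.

```python
import torch

# Hypothetical example: f(x1, x2) = x1^2 + x2^2 evaluated at (3, 4).
x = torch.tensor([3.0, 4.0], requires_grad=True)
f = (x ** 2).sum()

# autograd.grad returns a tuple with one gradient per input tensor.
(grad,) = torch.autograd.grad(f, [x])
print(grad)         # the gradient vector (2*x1, 2*x2)
print(grad.norm())  # its length, i.e. the rate of steepest increase
```

The gradient here points away from the origin, which is the direction in which this particular $f$ grows fastest.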



### How to search for minima?

$θ_{t+1}=θ_{t}-α_{t}∇f(θ_{t})$

- example:

    - function: $J(θ_{1},θ_{2})=θ_{1}^{2}+θ_{2}^{2}$
    - objective: $\displaystyle\min_{θ_{1},θ_{2}}J(θ_{1},θ_{2})$
    - update rules: $\displaystyle θ_{1}:=θ_{1}-α\frac{\partial}{\partial θ_{1}}J(θ_{1},θ_{2})$, $\displaystyle θ_{2}:=θ_{2}-α\frac{\partial}{\partial θ_{2}}J(θ_{1},θ_{2})$
    - derivatives: $\displaystyle\frac{\partial}{\partial θ_{1}}J(θ_{1},θ_{2})=\frac{\partial}{\partial θ_{1}}θ_{1}^{2}+\frac{\partial}{\partial θ_{1}}θ_{2}^{2}=2θ_{1}$, $\displaystyle\frac{\partial}{\partial θ_{2}}J(θ_{1},θ_{2})=\frac{\partial}{\partial θ_{2}}θ_{1}^{2}+\frac{\partial}{\partial θ_{2}}θ_{2}^{2}=2θ_{2}$
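
The update rules in the example can be sketched in a few lines of plain Python. The starting point, the constant learning rate `alpha`, and the number of steps are assumptions picked for illustration, and `grad_J` is a hypothetical helper name.

```python
def grad_J(theta1, theta2):
    """Partial derivatives of J(theta1, theta2) = theta1^2 + theta2^2."""
    return 2 * theta1, 2 * theta2

theta1, theta2 = 3.0, -4.0  # assumed starting point
alpha = 0.1                 # assumed constant learning rate

for step in range(100):
    g1, g2 = grad_J(theta1, theta2)
    # Update both parameters simultaneously from the same gradient.
    theta1, theta2 = theta1 - alpha * g1, theta2 - alpha * g2

print(theta1, theta2)  # both end up very close to the minimum at (0, 0)
```

With this learning rate each step multiplies both parameters by $1-2α=0.8$, so they shrink geometrically toward the minimum.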



two key problems:

1. local minima
2. saddle point
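
To make the saddle-point problem concrete, here is a minimal sketch using $f(x,y)=x^{2}-y^{2}$, a textbook saddle chosen as an assumption for illustration. The gradient vanishes at the origin, so the update rule stops moving there, yet the origin is not a minimum.

```python
def f(x, y):
    return x ** 2 - y ** 2

def grad_f(x, y):
    """Partial derivatives of f with respect to x and y."""
    return 2 * x, -2 * y

print(grad_f(0.0, 0.0))  # both components are zero at the origin
print(f(0.1, 0.0))       # larger than f(0, 0): the function rises along x
print(f(0.0, 0.1))       # smaller than f(0, 0): the function falls along y
```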

## optimizer performance

- initialization status
- learning rate
- momentum
- etc.
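
Of the factors listed, the learning rate is the easiest to demonstrate. The sketch below runs plain gradient descent on $J(θ)=θ^{2}$ with two assumed learning rates; `run_gd` is a hypothetical helper, and the values 0.1 and 1.1 are illustrative only.

```python
def run_gd(alpha, theta=1.0, steps=20):
    """Gradient descent on J(theta) = theta^2, whose derivative is 2*theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

small = run_gd(alpha=0.1)  # each step scales theta by 0.8, so it shrinks
large = run_gd(alpha=1.1)  # each step scales theta by -1.2, so it diverges

print(small, large)
```

A step size that is too large overshoots the minimum on every update, so the iterate oscillates with growing magnitude instead of converging.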


## activation function

### step function

$\begin{equation*}
o=\begin{cases}
        1,&\text{if }\sum_{i=0}^{n}ω_{i}x_{i}>0 \\
        0,&\text{otherwise}
    \end{cases}
\end{equation*}$
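
A minimal sketch of the step activation, assuming the weights $ω$ and inputs $x$ are plain Python lists of equal length. The function name `step` and the sample values are illustrative.

```python
def step(weights, inputs):
    """Return 1 if the weighted sum of the inputs is positive, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > 0 else 0

print(step([0.5, 1.0], [1.0, 1.0]))   # weighted sum 1.5 is positive
print(step([0.5, -1.0], [1.0, 1.0]))  # weighted sum -0.5 is not positive
```

The output is piecewise constant, so its derivative is zero wherever it is defined, which gives gradient descent nothing to follow.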

### sigmoid/logistic

$\begin{equation*}
f(x)=σ(x)=\frac{1}{1+e^{-x}}
\end{equation*}$

#### derivative of sigmoid

$\begin{equation*}\begin{array}{rcl}\displaystyle
\frac{d}{dx}σ(x)&=&\frac{d}{dx}(\frac{1}{1+e^{-x}}) \\
&=&\frac{e^{-x}}{(1+e^{-x})^{2}} \\
&=&\frac{(1+e^{-x})-1}{(1+e^{-x})^{2}} \\
&=&\frac{(1+e^{-x})}{(1+e^{-x})^{2}} -(\frac{1}{1+e^{-x}})^{2} \\
&=&σ(x)-σ(x)^{2}  \\
σ'&=&σ(1-σ)
\end{array}\end{equation*}$
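
The closed form $σ'=σ(1-σ)$ can be checked against autograd. This is a sketch under the assumption that a handful of sample points in $[-5,5]$ is a sufficient sanity check; the variable names are illustrative.

```python
import torch

x = torch.linspace(-5, 5, 11, requires_grad=True)
s = torch.sigmoid(x)

# Summing gives a scalar whose gradient w.r.t. x[i] is sigma'(x[i]).
(autograd_grad,) = torch.autograd.grad(s.sum(), [x])

# Closed form from the derivation: sigma * (1 - sigma).
closed_form = (s * (1 - s)).detach()

print(torch.allclose(autograd_grad, closed_form, atol=1e-6))
```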


```python
import torch

a = torch.linspace(-100, 100, 10)
print(torch.sigmoid(a))
```

```
tensor([0.0000e+00, 1.6655e-34, 7.4564e-25, 3.3382e-15, 1.4945e-05, 9.9999e-01,
        1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00])
```


### tanh

$\begin{equation*}\begin{array}{rcl}
f(x)=\tanh(x)&=&\displaystyle\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} \\
&=&2σ(2x)-1
\end{array}\end{equation*}$

#### derivative of tanh
$\begin{equation*}\displaystyle
\frac{d}{dx}\tanh(x)=1-\tanh^{2}(x)
\end{equation*}$
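
Both tanh formulas can be sanity-checked numerically with the standard library alone. The sketch below compares tanh with $2σ(2x)-1$ and compares a central finite difference with $1-\tanh^{2}(x)$; the sample points and the step `h` are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

h = 1e-6  # assumed finite-difference step
identity_gap = 0.0
derivative_gap = 0.0

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    # Identity: tanh(x) = 2 * sigmoid(2x) - 1
    identity_gap = max(identity_gap, abs(math.tanh(x) - (2 * sigmoid(2 * x) - 1)))
    # Derivative: central difference vs. 1 - tanh(x)^2
    numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
    derivative_gap = max(derivative_gap, abs(numeric - (1 - math.tanh(x) ** 2)))

print(identity_gap, derivative_gap)  # both should be tiny
```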


```python
import torch

a = torch.linspace(-1, 1, 10)
print(torch.tanh(a))
```

```
tensor([-0.7616, -0.6514, -0.5047, -0.3215, -0.1107,  0.1107,  0.3215,  0.5047,
         0.6514,  0.7616])
```


### ReLU

$\begin{equation*}
f(x)=\begin{cases}
    0, & \text{if } x<0 \\
    x, & \text{if } x\ge 0
\end{cases}
\end{equation*}$
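
Reading the slope off each branch of the definition gives a derivative of 0 for $x<0$ and 1 for $x>0$. The sketch below confirms this with autograd; the sample points are assumptions and deliberately avoid $x=0$, where the slope is not defined.

```python
import torch

# Sample points chosen away from 0, where ReLU has a kink.
x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)

# Summing gives a scalar whose gradient w.r.t. x[i] is the slope at x[i].
(grad,) = torch.autograd.grad(torch.relu(x).sum(), [x])
print(grad)  # 0 where x < 0, 1 where x > 0
```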


```python
import torch
from torch.nn import functional as F

a = torch.linspace(-1, 1, 10)
print(torch.relu(a))
print(F.relu(a))
```

```
tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1111, 0.3333, 0.5556, 0.7778,
        1.0000])
tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1111, 0.3333, 0.5556, 0.7778,
        1.0000])
```


## typical loss

- Mean Squared Error (MSE)
- Cross Entropy Loss

    - binary
    - multi-class
    - +softmax

### mse

- loss $=\sum[y-(xw+b)]^{2}$
- L2 norm: $\|y-(xw+b)\|_{2}$
- so loss $=\|y-(xw+b)\|_{2}^{2}$
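```python
import torch

y = torch.rand(4)     # targets
pred = torch.rand(4)  # predictions
# squared L2 norm of the error = sum of squared errors (not averaged)
loss = torch.norm(y - pred, 2).pow(2)
```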
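
The squared-norm loss sums the squared errors, whereas `F.mse_loss` averages them by default. The sketch below compares the two using the `reduction` argument of `F.mse_loss`; the seed and tensor size are assumptions for reproducibility.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)  # assumed seed, only to make the run repeatable
y = torch.rand(4)
pred = torch.rand(4)

sum_sq = torch.norm(y - pred, 2).pow(2)

# reduction='sum' matches the squared L2 norm.
print(torch.allclose(sum_sq, F.mse_loss(pred, y, reduction='sum')))
# The default reduction='mean' divides by the number of elements.
print(torch.allclose(sum_sq / y.numel(), F.mse_loss(pred, y)))
```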

```python
# demo
import torch
from torch.nn import functional as F

x = torch.ones(1)
w = torch.full([1], 2.)
# mse = F.mse_loss(torch.ones(1), x * w)
# torch.autograd.grad(mse, [w])
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
# (w was created without requires_grad, so no graph was recorded)
w.requires_grad_()
# torch.autograd.grad(mse, [w])
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
# (the old mse was computed before w required grad, so it must be recomputed)
mse = F.mse_loss(torch.ones(1), x * w)
torch.autograd.grad(mse, [w])
```

```
(tensor([2.]),)
```
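
The same gradient can also be obtained with `.backward()`, which stores the result in `w.grad` instead of returning it. This is a sketch with the same values as the demo; it passes the prediction first, following the `(input, target)` argument order of `F.mse_loss`, which does not change the result here.

```python
import torch
from torch.nn import functional as F

x = torch.ones(1)
w = torch.full([1], 2., requires_grad=True)  # track gradients from the start

mse = F.mse_loss(x * w, torch.ones(1))
mse.backward()  # accumulates d(mse)/dw into w.grad

print(w.grad)  # analytic value: 2 * (x*w - 1) * x
```

With $x=1$ and $w=2$ the analytic derivative $2(xw-1)x$ evaluates to 2, which agrees with the demo output.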