# 1: some writing

## Neural nets (1)
$$
\begin{align}
y &= \sigma(W\cdot x + b) \\
\end{align}
$$

```python
x, T = get_data()
while training:
    y = NN(x)
    L = Loss(y, T)  # the loss can only be evaluated once y exists
    # use gradients to descend the loss surface
    NN.weights -= learning_rate * dLdW(y, T)
    NN.biases -= learning_rate * dLdb(y, T)
```
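The loop above is pseudocode. A runnable version might look like the following minimal sketch, which assumes a single sigmoid layer as in (1), a cross-entropy loss, and synthetic toy data. The names `make_toy_data`, `cross_entropy` and `train_nn` are hypothetical stand-ins, not part of these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_toy_data(n=200, d=2, seed=0):
    # Assumption: synthetic, linearly separable data standing in for get_data()
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    T = (x[:, :1] + x[:, 1:2] > 0).astype(float)  # targets, shape (n, 1)
    return x, T

def cross_entropy(y, T, eps=1e-12):
    # Assumption: binary cross-entropy as the loss; eps avoids log(0)
    return -np.mean(T * np.log(y + eps) + (1 - T) * np.log(1 - y + eps))

def train_nn(x, T, learning_rate=0.5, steps=500):
    W = np.zeros((x.shape[1], T.shape[1]))
    b = np.zeros(T.shape[1])
    losses = []
    for _ in range(steps):
        y = sigmoid(x @ W + b)                 # y = sigma(W.x + b)
        losses.append(cross_entropy(y, T))
        dLdz = (y - T) / len(x)                # gradient w.r.t. the pre-activation
        W -= learning_rate * x.T @ dLdz        # dL/dW
        b -= learning_rate * dLdz.sum(axis=0)  # dL/db
    return W, b, losses

x, T = make_toy_data()
W, b, losses = train_nn(x, T)
print(losses[0], losses[-1])
```

The gradient `(y - T)` is specific to the sigmoid plus cross-entropy pairing; a different loss or activation would need its own `dLdz`.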
## Radial basis functions (2)

$$
\begin{align}
y &= W\cdot\phi( \lVert x - c \rVert)\\
\end{align}
$$


```python
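x, T = get_data()
# pick centers via clustering
RBF.centers = kmeans(x)
while training:
    y = RBF(x)
    # fit weights with linear least squares: the gradient of 0.5*||y - T||^2
    # w.r.t. the weights is phi^T (y - T), where phi(x) is a placeholder for
    # the matrix of basis-function outputs phi(||x - c||)
    RBF.weights -= learning_rate * phi(x).T @ (y - T)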
```
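A runnable sketch of the same procedure, under several assumptions that are not in the notes: a Gaussian basis function, a plain Lloyd's-algorithm `kmeans` written inline, and a toy 1-D regression problem (`sin`) as data.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    # Plain Lloyd's algorithm (an assumption; any clustering method would do)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = x[labels == j].mean(axis=0)
    return centers

def rbf_features(x, centers, eps=1.0):
    r = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)  # ||x - c||
    return np.exp(-(eps * r) ** 2)  # Gaussian phi(r)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))  # toy inputs (assumed)
T = np.sin(x)                          # toy targets (assumed)

centers = kmeans(x, k=10)              # pick centers via clustering
Phi = rbf_features(x, centers)         # shape (200, 10), fixed during training
W = np.zeros((Phi.shape[1], T.shape[1]))
learning_rate = 0.1
losses = []
for _ in range(2000):
    y = Phi @ W                        # y = W . phi(||x - c||)
    losses.append(0.5 * np.mean((y - T) ** 2))
    W -= learning_rate * Phi.T @ (y - T) / len(x)  # least-squares gradient step
print(losses[0], losses[-1])
```

Because the centers are fixed before the loop starts, `Phi` is computed once and only the linear weights change.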


### Similarities

Clearly, some elements of each equation have simply been renamed and/or moved.

* The biases in (1) do the same job as the centers in (2).
* The activation function in (1) plays the same role as the radial basis function in (2), in the sense that both are non-linearities.
* The dot products in both (1) and (2) aggregate a row of weights with ____.
* The weights themselves act as a transform on the data, rotating, shifting, inverting, ...

### Dissimilarities

* We have added a norm to (2). 
* In (2), the weights are applied outside the non-linearity; in (1) they are applied inside it.
* The activation functions used in (1) and the radial basis functions used in (2) are generally quite different.
* When training, the centers of (2) are generally selected using some clustering algorithm ([3 learning phases for rbfs](http://www.sciencedirect.com/science/article/pii/S0893608001000272))
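
The structural differences listed above are easier to see with both forward passes written side by side. This is a minimal sketch, assuming a sigmoid activation for (1), a Gaussian basis function for (2), and arbitrary toy shapes; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)  # one input vector

# (1) Neural net layer: weights and bias both act inside the non-linearity
W_nn = rng.normal(size=(4, 3))
b = rng.normal(size=4)
y_nn = 1.0 / (1.0 + np.exp(-(W_nn @ x + b)))

# (2) RBF layer: shift by the centers, take a norm, apply phi,
# and only then apply the weights (outside the non-linearity)
centers = rng.normal(size=(5, 3))
W_rbf = rng.normal(size=(4, 5))
r = np.linalg.norm(x - centers, axis=1)  # one scalar distance per center
y_rbf = W_rbf @ np.exp(-r ** 2)

print(y_nn.shape, y_rbf.shape)
```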

## Discussion

The norm in (2) makes all the information about the input pass through a single scalar value per center (the distance to that center), before the weights map it back out into some vector.

### Non-linearity

$$
\begin{align}
&\textbf{NN} \\
\sigma (z) &= \frac{1}{1+e^{-z}}  \tag{Sigmoid} \\
\rho (z) &= \max(z,0)   \tag{ReLU}\\
&\textbf{RBF} \\
\phi (r)&=e^{-(\varepsilon r)^{2}} \tag{Gaussian}\\
\phi (r)&={\sqrt {1+(\varepsilon r)^{2}}} \tag{Multiquadric}\\
\end{align}
$$

The squared term removes negative values.
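
The four functions above translate directly to NumPy. This is a small sketch for experimenting with them; the default `eps=1.0` (the shape parameter $\varepsilon$) and the sample values are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def gaussian(r, eps=1.0):
    return np.exp(-(eps * r) ** 2)

def multiquadric(r, eps=1.0):
    return np.sqrt(1.0 + (eps * r) ** 2)

z = np.array([-2.0, 0.0, 2.0])  # pre-activations can be negative
r = np.abs(z)                   # radii are distances, so r >= 0

print(sigmoid(z), relu(z))
print(gaussian(r), multiquadric(r))
```

Both RBF functions depend on `r` only through `r ** 2`, so they are even functions: flipping the sign of the argument changes nothing.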

### Weights

The job of the weights is to (generally) map into some space where the ..., like the principal components of the covariance. This allows us to ...

### Directional invariance

What does the radial distance invariance buy us?
Effectively, we are throwing out the information about the direction of our inputs.
In what instances would this be advisable?
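
A quick way to see what is thrown away: evaluate one RBF unit and one sigmoid unit on points that all lie at the same distance from the center but in different directions. This is a sketch assuming a Gaussian basis function, a center at the origin, and a hypothetical weight row `w`.

```python
import numpy as np

def gaussian_rbf(x, c, eps=1.0):
    r = np.linalg.norm(x - c)  # only the distance to the center survives
    return np.exp(-(eps * r) ** 2)

c = np.array([0.0, 0.0])
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
# Eight points on a circle of radius 2 around the center
points = np.stack([2.0 * np.cos(angles), 2.0 * np.sin(angles)], axis=1)

rbf_out = np.array([gaussian_rbf(p, c) for p in points])
w = np.array([1.0, 0.5])                      # hypothetical weight row
nn_out = 1.0 / (1.0 + np.exp(-(points @ w)))  # sigmoid unit from (1), bias 0

print(rbf_out.max() - rbf_out.min())  # spread of the RBF unit's outputs
print(nn_out.max() - nn_out.min())    # spread of the sigmoid unit's outputs
```

The RBF unit cannot tell the eight points apart, while the sigmoid unit responds differently depending on how each point aligns with `w`.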

### Training

RBF training costs more up front (the centers have to be chosen first, e.g. by clustering), but fitting the weights with linear least squares is cheaper than backprop while training. Given that ... better init.

http://www.cc.gatech.edu/~isbell/tutorials/rbf-intro.pdf
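
Once the centers are fixed, the output is linear in the weights, so the weights can also be fitted with a single least-squares solve instead of an iterative loop. A minimal sketch, assuming a Gaussian basis, toy `sin` data, and an evenly spaced grid of centers standing in for the clustering step:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))  # toy inputs (assumed)
T = np.sin(x)                          # toy targets (assumed)

# Up-front cost: choose the centers (a grid here, in place of clustering)
centers = np.linspace(-3, 3, 10).reshape(-1, 1)
r = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
Phi = np.exp(-r ** 2)                  # Gaussian basis with eps = 1

# y = Phi @ W is linear in W, so one least-squares solve fits the weights
W, *_ = np.linalg.lstsq(Phi, T, rcond=None)
mse = np.mean((Phi @ W - T) ** 2)
print(mse)
```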

# 2: some math
$$
\begin{align}
&\textbf{Definitions} \\
E(T,Y) &= -\sum_t \sum_k T_k^t \log(Y_k^t) \\
Y_k^t &= \frac{e^{z_k^t}}{\sum_j e^{z_j^t}} \\
z^t &= Wx^t \\
&\textbf{Partial derivatives} \\
\frac{\partial E}{\partial Y_k^t} &= -\frac{T_k^t}{Y_k^t} \\
\frac{\partial Y_k^t}{\partial z_i^t} &= Y_i^t(\delta_{ik}-Y_k^t) \tag{see derivation below}\\
\frac{\partial z_i^t}{\partial W_{ij}} &= x_j^t \\
&\textbf{Chain rule} \\
\frac{\partial E}{\partial W_{ij}} &= \sum_t \sum_k \frac{\partial E}{\partial Y_k^t} \frac{\partial Y_k^t}{\partial z_i^t} \frac{\partial z_i^t}{\partial W_{ij}} \\
&= -\sum_t \sum_k \frac{T_k^t}{Y_k^t} \quad Y_i^t(\delta_{ik}-Y_k^t) \quad x_j^t \\
&= \sum_t \Big( Y_i^t \sum_k T_k^t - T_i^t \Big) x_j^t \\
&= \sum_t (Y_i^t - T_i^t) \, x_j^t \tag{assuming $\sum_k T_k^t = 1$} \\
\end{align}
$$
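
For the cross-entropy $E = -\sum_t \sum_k T_k^t \log Y_k^t$ with softmax outputs and targets that sum to 1, the gradient with respect to $W$ comes out as $\sum_t (Y^t - T^t)\,(x^t)^\top$. A finite-difference check is a cheap way to confirm the signs and indices. This is a sketch on random toy data, assuming one-hot targets; the sizes are arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, x, T):
    Y = softmax(x @ W.T)                  # z^t = W x^t, one row per sample t
    return -np.sum(T * np.log(Y))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                # 5 samples, 3 inputs (toy sizes)
T = np.eye(4)[rng.integers(0, 4, size=5)]  # one-hot targets, 4 classes
W = rng.normal(size=(4, 3))

# Analytic gradient: sum over samples of (Y^t - T^t) x^t^T
analytic = (softmax(x @ W.T) - T).T @ x

# Central finite differences, one weight at a time
numeric = np.zeros_like(W)
h = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        dW = np.zeros_like(W)
        dW[i, j] = h
        numeric[i, j] = (cross_entropy(W + dW, x, T)
                         - cross_entropy(W - dW, x, T)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))
```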

***
$$
\begin{align}
&\textbf{Softmax derivation} \\
 \\
&\textbf{Case: $i = k$} \\
\frac{\partial y_k}{\partial z_i} &=\frac{\partial}{\partial z_i} \Big( \frac{e^{z_i}}{\sum_j e^{z_j}} \Big) \\
&= \frac{e^{z_i}\big(\sum_j e^{z_j}\big) - e^{z_i}e^{z_i}}{\big(\sum_j e^{z_j}\big)^2} \tag{Quotient rule} \\
&= \frac{e^{z_i}}{\sum_j e^{z_j}}  - \frac{e^{z_i}e^{z_i}}{\big(\sum_j e^{z_j}\big)^2} \\
&= \frac{e^{z_i}}{\sum_j e^{z_j}} \Big(1 - \frac{e^{z_i}}{\sum_j e^{z_j}} \Big) \\
&= y_i (1 - y_i) \\
 \\
&\textbf{Case: $i \neq k$} \\
\frac{\partial y_k}{\partial z_i} &=\frac{\partial}{\partial z_i} \Big( \frac{e^{z_k}}{\sum_j e^{z_j}} \Big) \\
&= e^{z_k} (-1)\big(\frac{1}{\sum_j e^{z_j}}\big)^2 e^{z_i} \tag{chain rule}\\
&= -y_i y_k \\
 \\
 \\
\frac{\partial y_k}{\partial z_i} &= y_i(\delta_{ik}-y_k) \tag{Kronecker delta}
\end{align}
$$
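
The combined result $\partial y_k / \partial z_i = y_i(\delta_{ik} - y_k)$ can be checked numerically. In matrix form it is `diag(y) - outer(y, y)`. A small sketch with an arbitrary test vector `z` (an assumption, any values would do):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])
y = softmax(z)

# Analytic Jacobian: J[k, i] = dy_k/dz_i = y_i * (delta_ik - y_k)
J_analytic = np.diag(y) - np.outer(y, y)

# Numerical Jacobian by central differences, one input at a time
h = 1e-6
J_numeric = np.zeros((3, 3))
for i in range(3):
    dz = np.zeros(3)
    dz[i] = h
    J_numeric[:, i] = (softmax(z + dz) - softmax(z - dz)) / (2 * h)

print(np.max(np.abs(J_analytic - J_numeric)))
```

Each column of the Jacobian sums to zero, which reflects the fact that the softmax outputs always sum to 1.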