### Gradient estimation

* Why do we need to do this?
* Why does backprop not work with stochastic functions?


* [MuProp: Unbiased Backpropagation for Stochastic Neural Networks](http://arxiv.org/abs/1511.05176)
* [Gradient Estimation Using Stochastic Computation Graphs](http://arxiv.org/abs/1506.05254)



Imagine a single-layer feedforward neural network with loss function $f_L$, parameters $W$, input $x$, and output $y$. The gradient of the expected loss $\mathcal{L}(W)$ is
$$
\begin{align}
\frac{\partial \mathcal{L}}{\partial W}
&= \frac{\partial}{\partial W} \mathop{\mathbb{E}}_{x\sim X} \big[ f_L(x,W) \big] \tag{definition}\\
&= \mathop{\mathbb{E}}_{x\sim X} \big[ \frac{\partial}{\partial W} f_L(x,W) \big] \tag{because the distribution of x does not depend on W} \\
&= \frac{1}{N} \sum_i \frac{\partial}{\partial W} f_L(x_i,W) \tag{because X is uniform over N samples}
\end{align}
$$

which leads us to traditional SGD. However, if we also want to learn which $x$ values to sample from $X$ (e.g. some form of active learning), then we need those gradients as well. With the input now generated as $x = g(\phi)$, we have


$$
\begin{align}
\frac{\partial}{\partial \phi} \mathop{\mathbb{E}}_{\phi \sim \Phi} \big[ f_L(g(\phi),W) \big] \tag{definition}
\end{align}
$$

But we don't really have a distribution over $\phi$: it just puts probability 1 on a single element of $\Phi$. So can't we get rid of the expectation? Nah: how would the sampling work then?
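One standard way to keep the expectation and still get a gradient is the score-function (likelihood-ratio) identity, which the Stochastic Computation Graphs paper linked above builds on. The sketch below assumes something not in these notes: that $x$ is drawn from a discrete distribution $p_\theta(x)$ with learnable parameters $\theta$ (a hypothetical name introduced here for illustration).

$$
\begin{align}
\frac{\partial}{\partial \theta} \mathop{\mathbb{E}}_{x\sim p_\theta} \big[ f_L(x,W) \big]
&= \frac{\partial}{\partial \theta} \sum_x p_\theta(x)\, f_L(x,W) \tag{definition, discrete x}\\
&= \sum_x p_\theta(x)\, \frac{\partial \log p_\theta(x)}{\partial \theta}\, f_L(x,W) \tag{$\partial p = p\,\partial \log p$}\\
&= \mathop{\mathbb{E}}_{x\sim p_\theta} \big[ f_L(x,W)\, \frac{\partial}{\partial \theta} \log p_\theta(x) \big]
\end{align}
$$

Under that assumption the derivative acts on $\log p_\theta$ rather than on the sample itself, so the right-hand side can be estimated by averaging over samples.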

In [60]:
import autograd
import autograd.numpy as np

In [79]:
# Toy data: one 784-dimensional input, weights for 10 classes, and a random target
x = np.random.random((1,784))
W = np.random.random((784,10))
T = np.random.random((1,10))

In [80]:
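def Softmax(z):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(z - np.max(z))
    return e/np.sum(e)

def Deterministic(x,W):
    # Class probabilities, shape (1,10)
    return Softmax(np.dot(x,W))

def Stochastic(x,W):
    # Sample one class from the softmax probabilities and return it one-hot
    y = Deterministic(x,W)
    z = np.random.choice(10,p=y.squeeze())
    onehot = np.zeros((1,10))
    onehot[0,z] = 1.0
    return onehot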

In [81]:
print(Deterministic(x,W))
print(Stochastic(x,W))

[[  6.66403521e-01   1.41200475e-04   9.17651759e-07   6.53917560e-02
    2.08619066e-02   2.35149785e-01   2.14904900e-03   1.49700438e-04
    5.85828240e-03   3.89388216e-03]]
[[ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.]]


In [82]:
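# Squared error against the target T. A plain sum of T - y would have zero
# gradient, because the softmax outputs always sum to 1.
def DeterLoss(W,x):
    return np.sum((T-Deterministic(x,W))**2)

def StocLoss(W,x):
    return np.sum((T-Stochastic(x,W))**2)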

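# Gradients of each loss with respect to W (the first argument)
dL = autograd.grad(DeterLoss)
sdL = autograd.grad(StocLoss)

try:
    y = dL(W,x)
    print('Deterministic works')
    stoc_y = sdL(W,x)
    print('Stochastic works')
except Exception:
    # Differentiating through the sampling step fails here
    print('hmm...')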

Deterministic works
hmm...
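
Whatever the exact error autograd raised, there is a more basic reason backprop has nothing to propagate through the sampling step: once the random draw is fixed, the sampled one-hot vector is piecewise constant in `W`. The following is a minimal sketch in plain NumPy, not the notebook's code. It assumes smaller toy shapes, a squared-error loss, and a hypothetical helper `sample_onehot` that samples by inverse CDF from a fixed uniform draw `u`.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / np.sum(e)

def sample_onehot(x, W, u):
    """Inverse-CDF sample from softmax(x @ W), driven by a fixed uniform draw u."""
    p = softmax(x @ W).ravel()
    z = min(int(np.searchsorted(np.cumsum(p), u)), p.size - 1)
    onehot = np.zeros((1, p.size))
    onehot[0, z] = 1.0
    return onehot

def loss(y, T):
    return np.sum((T - y) ** 2)  # squared error, an assumed stand-in loss

rng = np.random.default_rng(0)
x = rng.random((1, 5))  # smaller toy shapes than the notebook's 784x10
W = rng.random((5, 3))
T = rng.random((1, 3))

# Pick u in the middle of the most likely class's CDF interval, so that a
# tiny change in W cannot push it across a class boundary.
p = softmax(x @ W).ravel()
k = int(np.argmax(p))
u = np.cumsum(p)[k] - p[k] / 2

# Nudge one weight and resample with the same u.
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps

base = sample_onehot(x, W, u)
pert = sample_onehot(x, W_pert, u)
finite_diff = (loss(pert, T) - loss(base, T)) / eps
print(base, pert, finite_diff)
```

The sample does not move, so the finite difference is exactly zero. For almost every `u` the same holds, which means the pathwise derivative carries no information about how `W` shifts the class probabilities.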


OK, so:
* Autograd could (?) backprop through the `random.choice` function, but that doesn't seem like it would solve anything.
* Do we need some unbiased estimator? But then the framework won't be functional anymore, since we need state to remember things (e.g. to calculate a mean field). Or do we just run a large number of expensive MC simulations?
    * Wait a minute. All these MC simulations are doing is estimating the probability distribution, and in many cases we already know it, as in the example above.
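
The last point can be sketched directly. When the output distribution is a known softmax over a handful of classes, the expected loss is a finite sum over outcomes, so its gradient can be written in closed form and compared with a Monte Carlo score-function estimate. This is a rough illustration in plain NumPy rather than the notebook's autograd setup. The smaller shapes, the squared-error loss, and the helper names (`outcome_losses`, `exact_grad`, `score_function_grad`) are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / np.sum(e)

def outcome_losses(T):
    """Loss f(k) for each possible one-hot output e_k (squared error, assumed)."""
    K = T.size
    return np.sum((T.reshape(1, K) - np.eye(K)) ** 2, axis=1)

def expected_loss(W, x, T):
    """E_z[f(z)] computed exactly by enumerating the K outcomes."""
    p = softmax(x @ W).ravel()
    return float(p @ outcome_losses(T))

def exact_grad(W, x, T):
    """Closed-form gradient of the expected loss with respect to W."""
    p = softmax(x @ W).ravel()
    f = outcome_losses(T)
    dL_da = p * (f - p @ f)  # gradient with respect to the logits a = x @ W
    return x.T @ dL_da.reshape(1, -1)

def score_function_grad(W, x, T, n_samples, rng):
    """Monte Carlo estimate: average of f(z) * d log p(z) / dW over samples z."""
    p = softmax(x @ W).ravel()
    f = outcome_losses(T)
    z = rng.choice(p.size, size=n_samples, p=p)
    onehots = np.eye(p.size)[z]
    # For a softmax, d log p_z / d a = onehot(z) - p
    g = np.mean(f[z][:, None] * (onehots - p), axis=0)
    return x.T @ g.reshape(1, -1)

rng = np.random.default_rng(0)
x = rng.random((1, 5))  # smaller toy shapes than the notebook's 784x10
W = rng.random((5, 3))
T = rng.random((1, 3))

g_exact = exact_grad(W, x, T)
g_mc = score_function_grad(W, x, T, 200_000, rng)
print("max abs difference:", np.max(np.abs(g_mc - g_exact)))
```

The enumeration needs no samples and no stored state, but it only scales to small discrete outputs. The sampled estimator is unbiased but noisy, which is the variance problem the MuProp paper linked above is concerned with.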