### Weight initialization for a single neuron

Note that this is pseudocode: `W[l]` and `n[l]` denote the weights and the number of units of layer `l`, and are not meant to be indexed as arrays.

Given some matrix `X` whose shape the weights take:

```py
# randn takes the dimensions as separate arguments, so unpack the shape
W[l] = np.random.randn(*X.shape) * np.sqrt(1 / n[l-1])
```

Or, if using ReLU activation (known as He initialization):

```py
W[l] = np.random.randn(*X.shape) * np.sqrt(2 / n[l-1])
```

Or, if using tanh activation, this variant is sometimes used (called Xavier initialization):
```py
# parentheses matter: the variance is 2 / (n[l-1] + n[l])
W[l] = np.random.randn(*X.shape) * np.sqrt(2 / (n[l-1] + n[l]))
```

- Sampling from a standard Gaussian (variance 1) and multiplying it by the `sqrt` of a term such as `1/n[l-1]` gives weights whose variance equals that term (interesting for PPL, see Pyro?)
- In practice, the above are just starting points, and the variance scale can be tuned as part of a more general hyperparameter search
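
As a minimal runnable sketch of the three scalings above: the function name `init_weights`, the scheme labels, and the layer sizes are hypothetical, chosen only for illustration. Here `n_in` plays the role of `n[l-1]` and `n_out` the role of `n[l]`, and the weight matrix is assumed to have shape `(n_out, n_in)`.

```python
import numpy as np

def init_weights(n_in, n_out, scheme="he", rng=None):
    """Draw an (n_out, n_in) Gaussian weight matrix with the chosen variance."""
    rng = np.random.default_rng() if rng is None else rng
    if scheme == "fan_in":      # variance 1 / n[l-1]
        variance = 1.0 / n_in
    elif scheme == "he":        # variance 2 / n[l-1], the ReLU variant
        variance = 2.0 / n_in
    elif scheme == "xavier":    # variance 2 / (n[l-1] + n[l])
        variance = 2.0 / (n_in + n_out)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    # Scaling a standard normal sample by sqrt(variance) yields that variance.
    return rng.standard_normal((n_out, n_in)) * np.sqrt(variance)

rng = np.random.default_rng(0)
W = init_weights(n_in=500, n_out=300, scheme="he", rng=rng)
target_variance = 2.0 / 500
empirical_variance = W.var()  # should land close to target_variance
```

With 150,000 sampled weights the empirical variance should sit very close to the target, which is a cheap sanity check before training.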

These weight initialization techniques help prevent vanishing/exploding gradients and also help reduce required training time (especially wrt dropout)
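
A rough way to see the vanishing/exploding effect is to push a random batch through a stack of ReLU layers and look at the size of the final activations. This is only a sketch: it measures forward activations as a proxy for gradient behaviour, and the helper name `final_activation_scale`, the depth, and the width are assumptions made for the example.

```python
import numpy as np

def final_activation_scale(weight_std_fn, n_layers=10, width=512, seed=0):
    """Push a random batch through a ReLU stack; return the mean squared activation."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((width, 256))      # each column is one example
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * weight_std_fn(width)
        a = np.maximum(0.0, W @ a)             # ReLU
    return np.mean(a ** 2)

he_scale = final_activation_scale(lambda n: np.sqrt(2.0 / n))  # stays near 1
unscaled = final_activation_scale(lambda n: 1.0)               # blows up
tiny = final_activation_scale(lambda n: 0.01)                  # shrinks toward 0
```

With the `sqrt(2/n)` scaling the activation scale stays of order 1 across layers, while the unscaled and the very small standard deviations grow or shrink geometrically with depth.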

## Gradient checking

# $\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}\approx g(\theta)$

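This follows from the definition of the derivative: the two-sided difference converges to $g(\theta)$ as $\lim_{\epsilon\rightarrow0}$, with an error of order $\epsilon^2$ for small $\epsilon$. It allows us to numerically verify any given $g(\theta)$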
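
A quick numeric check illustrates why the two-sided form is used rather than a one-sided difference. The function $f(\theta)=\theta^3$, its derivative $g(\theta)=3\theta^2$, and the values of `theta` and `epsilon` are hypothetical choices for this example only.

```python
def f(theta):
    return theta ** 3

def g(theta):
    return 3 * theta ** 2      # analytic derivative of f

theta, epsilon = 1.0, 0.01

two_sided = (f(theta + epsilon) - f(theta - epsilon)) / (2 * epsilon)
one_sided = (f(theta + epsilon) - f(theta)) / epsilon

two_sided_error = abs(two_sided - g(theta))   # about 1e-4, i.e. epsilon**2
one_sided_error = abs(one_sided - g(theta))   # about 3e-2, i.e. roughly 3 * epsilon
```

For this cubic the two-sided estimate is 3.0001 and the one-sided estimate is 3.0301, against a true derivative of 3.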

In pseudocode
```py
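d_approx = np.zeros_like(theta)
for i in range(theta.size):
    step = epsilon * np.eye(theta.size)[i]  # perturb only component i
    d_approx[i] = (J(theta + step) - J(theta - step)) / (2 * epsilon)
# ...
# d_theta is the analytic (backprop) gradient; the Euclidean distance
# between the two should be on the order of 10**-7 or smaller
assert np.linalg.norm(d_approx - d_theta) < 1e-7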
```

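- Don't use gradient checking in training, only during debugging (it needs two cost evaluations per parameter, so it is slow)
- Doesn't work with dropout, because the cost changes randomly between evaluations; run the check with dropout turned off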
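
A self-contained sketch of the check, under stated assumptions: the least-squares cost `J`, the helper name `gradient_check`, the default `epsilon`, and the normalised form of the distance (dividing by the sum of the two gradient norms) are choices made for this example rather than anything fixed by the notes. A deliberately wrong gradient is included to show what a failure looks like.

```python
import numpy as np

def gradient_check(J, grad, theta, epsilon=1e-7):
    """Return the relative distance between numerical and analytic gradients."""
    d_approx = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = epsilon                  # perturb one parameter at a time
        d_approx[i] = (J(theta + step) - J(theta - step)) / (2 * epsilon)
    d_theta = grad(theta)
    # Normalised Euclidean distance (an assumed, commonly used form).
    return np.linalg.norm(d_approx - d_theta) / (
        np.linalg.norm(d_approx) + np.linalg.norm(d_theta)
    )

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
theta = rng.standard_normal(3)

def J(t):
    return 0.5 * np.sum((A @ t - b) ** 2)  # toy least-squares cost

def grad_ok(t):
    return A.T @ (A @ t - b)               # correct gradient of J

def grad_buggy(t):
    return 2.0 * A.T @ (A @ t - b)         # wrong by a factor of 2

diff_ok = gradient_check(J, grad_ok, theta)
diff_buggy = gradient_check(J, grad_buggy, theta)
```

The correct gradient should give a very small relative distance, while the factor-of-2 bug gives a distance of about 1/3, far above any reasonable threshold.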