Prepared By:
- Ashish Sharma <accssharma@gmail.com>
- AI Saturdays - Week 7 (July)
- AI Developers, Boise 

# Resources:
- [Live CNN training demo](https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html)
- [k-nearest neighbor demo](http://vision.stanford.edu/teaching/cs231n-demos/knn/)
- [Tensorflow - train your first neural network](https://www.tensorflow.org/tutorials/keras/basic_classification)

# Notes - Convolutional Neural Networks

- cs231n 2016 winter [lecture 5](https://www.youtube.com/watch?v=gYpoJMlgyXA) and [lecture 6](https://www.youtube.com/watch?v=hd_KFJ5ktUc)

## Myth
- ConvNets need a lot of data to train
- In practice, we rarely ever train ConvNets from scratch, even with a small dataset. We fine-tune pre-trained models instead (transfer learning).
- Pre-trained networks can be downloaded from model repositories such as the Caffe Model Zoo.

## Why even use any non-linear function?
- Without non-linearities, the whole network is a sandwich of linear layers, which collapses into a single linear function no matter how deep it is.
- Non-linear functions add the "wiggles" that let the network represent more complex functions.
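A minimal sketch of the "linear sandwich" point, using made-up random weights (`W1`, `W2` and the layer sizes are assumptions for illustration): two stacked linear layers behave exactly like one matrix, while a ReLU in between breaks that.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # hypothetical input vector
W1 = rng.standard_normal((5, 4))    # first layer weights
W2 = rng.standard_normal((3, 5))    # second layer weights

def linear_net(v):
    # Two linear layers with no non-linearity in between.
    return W2 @ (W1 @ v)

def relu_net(v):
    # Same layers with a ReLU in between.
    return W2 @ np.maximum(0, W1 @ v)

# The linear stack equals a single layer with weights W2 @ W1.
collapses = np.allclose(linear_net(x), (W2 @ W1) @ x)

# Any linear map satisfies f(x) + f(-x) = 0; the ReLU network does not.
linear_residual = linear_net(x) + linear_net(-x)
relu_residual = relu_net(x) + relu_net(-x)
```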


## Training a neural network is as simple as four steps (Mini-batch SGD):
- Sample a batch of data
- Forward prop through the graph, get loss
- Backprop to calculate the gradients
- Update the parameters using the gradient
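The four steps above can be sketched on a toy problem. This is only an illustration under assumed settings: the data is synthetic linear regression, and the learning rate, batch size and step count are arbitrary choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption for illustration): y = X @ true_w + small noise
N, D = 512, 3
X = rng.standard_normal((N, D))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.standard_normal(N)

w = np.zeros(D)        # parameters to learn
learning_rate = 0.1
batch_size = 32
losses = []

for step in range(300):
    # 1. Sample a batch of data
    idx = rng.choice(N, size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    # 2. Forward prop: predictions and mean squared error loss
    err = xb @ w - yb
    losses.append(np.mean(err ** 2))
    # 3. Backprop: gradient of the loss with respect to w
    grad = 2.0 * xb.T @ err / batch_size
    # 4. Update the parameters using the gradient
    w -= learning_rate * grad
```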


## Training a Neural Network
- 1957 - Frank Rosenblatt came up with the idea of Perceptron [[Original Paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.335.3398&rep=rep1&type=pdf)]
    - Three main questions:
        - How is information about the physical world sensed or detected? (province of sensory physiology)
        - How is that information stored or remembered?
        - How does that information in storage or memory influence recognition and behavior?
    
- The perceptron's non-linearity was a step function, which is not differentiable
- so there was no backprop and no concept of a loss function
- 1960 - Widrow and Hoff - a circuit that can learn
- 1986 - First time back-propagation became popular (Rumelhart et al)
    - could not scale well
    - stayed same for almost 20 years
- 2006 - Hinton and Salakhutdinov
    - Reinvigorated research in deep learning
    - RBM
    - Backprop works, but you have to be careful in initialization
- 2010 - Neural networks replaced the GMM/HMM approach in speech recognition (Microsoft)
- 2012 - Convnet - AlexNet

- Why did they start working?
    - Parameter initialization
    - GPUs
    - More Data
    
- Overview of Training Neural Networks
    - One time setup
        - activation function
        - preprocessing
        - weight initialization
        - regularization
        - gradient checking
    - Training Dynamics
        - babysitting the learning process
        - parameter updates
        - hyperparameter optimization
    - Evaluation
        - model ensembles
        
        
### Activation Functions
- Sigmoid [Historically commonly used]
    - Saturated neurons (outputs very close to 0 or 1) "kill" the gradients
    - Backprop multiplies local derivatives via the chain rule. Where the sigmoid output is near 0 or 1, its derivative is very small (~0), so the gradients stop flowing to the earlier layers.
    - Sigmoid outputs near 0 and 1 are in the saturated regime
    - Gradients flow well only in the active regime, where the sigmoid output is close to 0.5 rather than 0 or 1.
    - Sigmoid outputs are not zero-centered
    - exp() is a somewhat expensive part of the non-linearity to compute

- tanh
    - LeCun 1991 recommended tanh instead of sigmoid
    - squashes numbers to range [-1,1]
    - zero centered
    - still kills gradient when saturated
    
- ReLU
    - max(0,x) - converges much faster than sigmoid/tanh in practice (e.g. 6x)
    - does not saturate (in the positive region)
    - very computationally efficient
    - output is not zero-centered
    - An annoyance
        - the gradient is zero when x < 0
    - Be wary of:
        - dead ReLU: a neuron that never activates gets no gradient and never updates
        - people like to initialize ReLU neurons with slightly positive biases (e.g. 0.001)

- Leaky ReLU
    - f(x) = max(0.01x, x)
        - don't let ReLU die
    - does not saturate
    - computationally efficient
    - converges much faster (6X)


- Parametric Rectifier (PReLU)
    - f(x) = max(alpha * x, x) - alpha is a parameter learned by backprop
    
- Exponential Linear Units (ELU)
    - all benefits of ReLU
    - Does not die
    - Closer to zero mean outputs
    - computation requires exp() - downside
    
- Maxout "Neuron"
    - Generalizes ReLU and Leaky ReLU
    - Linear Regime, does not saturate, does not die!
    - does not have the basic form of dot product -> non linearity
    - doubles the number of parameters/neuron - downside
    
- TLDR
    - use ReLU, be careful with your learning rates
    - try out Leaky ReLU/Maxout/ELU
    - Try out tanh but don't expect much
    - Don't use sigmoid
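The activation functions in this section can be written in a few lines of NumPy. This is a sketch for reference rather than a library implementation; the function names and default slopes are choices made here, and `maxout` uses two linear pieces as an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # local gradient: at most 0.25, ~0 when saturated

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)

def prelu(x, alpha):
    # Same form as leaky ReLU, but alpha would be learned via backprop.
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (exp(x) - 1) otherwise.
    # The inner minimum keeps exp() from overflowing on large positive x.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def maxout(x, W1, b1, W2, b2):
    # Max over two linear pieces, hence twice the parameters per neuron.
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

x = np.array([-10.0, 0.0, 10.0])
saturation = sigmoid_grad(x)  # tiny at both ends, 0.25 in the middle
```

With the second linear piece set to zero, `maxout` reduces to a ReLU applied to the first piece, which is one way to see that it generalizes ReLU.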
    
    
### Data Preprocessing
- zero-center the data by subtracting the mean
- normalize the data by dividing by the standard deviation - for images that's not as common
- see also PCA and whitening of the data

### Images 
- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
- Subtract the per-channel mean (mean along each channel = 3 numbers)
- For images it is not common to normalize the variance or to do PCA and whitening
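Both centering options for images can be sketched as follows. The array `X_train` is random stand-in data shaped like 100 CIFAR-10-sized images, used here only as an assumption to make the shapes concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a training set of 100 images of shape [32, 32, 3]
X_train = rng.uniform(0, 255, size=(100, 32, 32, 3))

# Option 1: subtract the mean image, a [32, 32, 3] array
mean_image = X_train.mean(axis=0)
X_centered_image = X_train - mean_image

# Option 2: subtract the per-channel mean, just 3 numbers
channel_mean = X_train.mean(axis=(0, 1, 2))
X_centered_channel = X_train - channel_mean  # broadcasts over the last axis
```

The statistics are computed on the training set only; the same `mean_image` or `channel_mean` would then be subtracted from validation and test images.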

### Weight initialization
- one of the reasons that the early NNs did not work
- How not to do weight initialization?
    - W = 0
        - no symmetry breaking
        - all neurons compute the same thing
    - small random numbers
        - gaussian with zero mean and 1e-2 std
        - W = 0.01 * np.random.randn(D, H)
        - works okay for small networks, but can lead to non-homogeneous distributions of activations across the layers of a network
        
    - Xavier initialization
        - the derivation assumes linear activations; works reasonably well with tanh
        - breaks when using the ReLU nonlinearity, because ReLU zeroes roughly half of the activations
        - He et al.: add a factor of 2 (divide by sqrt(fan_in / 2)) to compensate
    - you always want roughly unit gaussian activations!
    - SOLUTION: Batch Normalization
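A small experiment makes the initialization points above concrete. It pushes unit gaussian data through a stack of layers and reports the standard deviation of the last layer's activations. The depth, width and the helper name `final_layer_std` are assumptions chosen for illustration.

```python
import numpy as np

def final_layer_std(init, nonlinearity, n_layers=10, width=500, seed=0):
    """Std of the activations after n_layers, starting from unit gaussian data."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((1000, width))
    for _ in range(n_layers):
        fan_in = h.shape[1]
        if init == "small":      # small random numbers
            W = 0.01 * rng.standard_normal((fan_in, width))
        elif init == "xavier":   # scale by 1 / sqrt(fan_in)
            W = rng.standard_normal((fan_in, width)) / np.sqrt(fan_in)
        elif init == "he":       # He et al.: the additional /2
            W = rng.standard_normal((fan_in, width)) / np.sqrt(fan_in / 2)
        else:
            raise ValueError(f"unknown init: {init}")
        h = nonlinearity(h @ W)
    return h.std()

def relu(z):
    return np.maximum(0, z)

small_tanh = final_layer_std("small", np.tanh)    # activations collapse to ~0
xavier_tanh = final_layer_std("xavier", np.tanh)  # stays in a usable range
xavier_relu = final_layer_std("xavier", relu)     # shrinks layer by layer
he_relu = final_layer_std("he", relu)             # stays roughly constant
```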

### Batch Normalization:
- you want unit gaussian activations? just make them so.
- normalization is a perfectly differentiable function, so it can be inserted as a new layer and backpropagated through
- input: a minibatch X of N examples with D neurons (dimensions) each
- Approach:
    - compute the empirical mean and variance independently for each dimension
    - normalize each dimension with the mean and variance across the minibatch
    - scale and shift with learned parameters
- Usually inserted after fully connected (FC) or convolutional layers, and before the activation/non-linearity layer
- Pros
    - improves gradient flow through the network
    - allows higher learning rates
    - [V IMP] reduces the strong dependence on initialization
    - acts as a form of regularization in a funny way, and may slightly reduce the need for dropout
        - each example's output depends on the rest of its batch, which stochastically jitters the activations
- At test time, the BatchNorm layer functions differently
    - The mean/std are not computed based on the batch. Instead, a single fixed empirical mean/std of activations from training is used
    - e.g. can be estimated during training with running averages
- Cons
    - slow-down penalty
    - someone pointed out the overhead may be around 30%
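A minimal sketch of the BatchNorm forward pass described above, covering both training and test behaviour. The names `gamma`, `beta`, `running`, `momentum` and `eps` are assumptions for this sketch, and the backward pass is omitted.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running, train, momentum=0.9, eps=1e-5):
    """x: (N, D) minibatch. running: dict with 'mean' and 'var', each of shape (D,)."""
    if train:
        # Empirical mean and variance, independently for each dimension
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # Running averages, kept for use at test time
        running["mean"] = momentum * running["mean"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        # Test time: fixed statistics estimated during training
        mu, var = running["mean"], running["var"]
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 4))     # made-up activations
gamma, beta = np.ones(4), np.zeros(4)
running = {"mean": np.zeros(4), "var": np.ones(4)}
out_train = batchnorm_forward(x, gamma, beta, running, train=True)
out_test = batchnorm_forward(x, gamma, beta, running, train=False)
```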
        
    
### Choose the architecture
- say we start with one hidden layer of 50 neurons
    - initialize weights and biases
- disable regularization
- run a forward/backward pass, which returns the loss and the gradient of the parameters
- crank up regularization: the loss should go up (sanity check)
- Hyperparameter optimization
- coarse to fine cross-validation stages
- learning rates: sample and optimize in log space
- grid search of hyperparameters
    - Grid layout vs Random layout
    - Always use random layout
- network architecture
- monitor and visualize the loss curve
- Training vs validation accuracy
    - big gap - overfitting
    - no gap - increase model capacity
- Track the ratio of weight updates/weight magnitudes


- Takeaways
    - we do not want our activations to sit in the saturated ("killed") regions, which give zero or near-zero gradients
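The hyperparameter search advice above (sample in log space, prefer a random layout over a grid) could be sketched like this. The search ranges are arbitrary, and `validation_score` is a hypothetical stand-in for "train briefly and return validation accuracy".

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparams(rng):
    # Sample the exponent uniformly, then exponentiate: uniform in log space.
    learning_rate = 10 ** rng.uniform(-6, -1)
    reg_strength = 10 ** rng.uniform(-5, 0)
    return learning_rate, reg_strength

def validation_score(learning_rate, reg_strength):
    # Hypothetical placeholder: a real run would train a model and
    # return its validation accuracy for these settings.
    return -abs(np.log10(learning_rate) + 3) - abs(np.log10(reg_strength) + 2)

# Random layout: every trial draws fresh values for all hyperparameters.
trials = [sample_hyperparams(rng) for _ in range(50)]
best_lr, best_reg = max(trials, key=lambda t: validation_score(*t))
```

In a coarse to fine search, the ranges passed to `rng.uniform` would be narrowed around the best trials and the process repeated with longer training runs.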
    
## TODO
- parameter update schemes
- learning rate schedules
- Dropout
- Gradient Checking
- Model ensembles

### Parameter Updates
- Stochastic Gradient Descent
    - What is the trajectory along which we converge towards the minimum with SGD?
        - very slow progress along the shallow direction, and it bounces a lot along the steep direction
    - Solution: momentum update, which integrates a velocity
    - allows building up velocity along the shallow direction
- Momentum update
    - addresses the SGD problem above: updates bounce back and forth in the steep direction and make slow progress in the shallow direction
    - physical interpretation as a ball rolling down the loss function + friction (mu coefficient)
    - friction makes the velocity decay, so the updates go to zero once the gradient vanishes
   
- Nesterov Momentum update
- AdaGrad update
    - per-parameter adaptive scaling of the gradient, based on the accumulated squared gradients
    - updates along shallow directions are scaled up
    - updates along steep directions are damped
    
- RMSProp Update
    - in vanilla AdaGrad the accumulated cache only grows, so the UPDATES decay to zero
    - Geoff Hinton's approach adds a decay rate (a leaky running average of squared gradients), which avoids that final decay of the UPDATES to zero
    
- Adam
    - roughly RMSProp (the leaky version of AdaGrad) + Momentum
 
- All update approaches use the learning rate (LR) as a hyperparameter
    - decay the learning rate over time, e.g. step decay: halve it every few epochs
    - exponential decay
    - 1/t decay
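The update rules and decay schedules above can be sketched as small functions. This is an illustrative version under assumptions: each function takes the parameters `x`, the gradient `dx`, a per-optimizer `state` dict and a learning rate `lr`, and the default coefficients are common choices rather than values from the notes. The Nesterov form is one common reformulation that only needs the gradient at the current `x`.

```python
import numpy as np

def sgd(x, dx, state, lr):
    return x - lr * dx

def momentum(x, dx, state, lr, mu=0.9):
    # Integrate velocity; mu plays the role of friction.
    state["v"] = mu * state.get("v", 0.0) - lr * dx
    return x + state["v"]

def nesterov(x, dx, state, lr, mu=0.9):
    v_prev = state.get("v", 0.0)
    state["v"] = mu * v_prev - lr * dx
    return x - mu * v_prev + (1 + mu) * state["v"]

def adagrad(x, dx, state, lr, eps=1e-7):
    # Cache of squared gradients only grows, so steps shrink over time.
    state["cache"] = state.get("cache", 0.0) + dx ** 2
    return x - lr * dx / (np.sqrt(state["cache"]) + eps)

def rmsprop(x, dx, state, lr, decay_rate=0.99, eps=1e-7):
    # Leaky cache: old squared gradients are gradually forgotten.
    state["cache"] = decay_rate * state.get("cache", 0.0) + (1 - decay_rate) * dx ** 2
    return x - lr * dx / (np.sqrt(state["cache"]) + eps)

def adam(x, dx, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    m = state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * dx       # momentum-like
    v = state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * dx ** 2  # RMSProp-like
    m_hat = m / (1 - beta1 ** t)  # bias correction for the first steps
    v_hat = v / (1 - beta2 ** t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps)

def step_decay(lr0, epoch, drop=0.5, every=10):
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, t, k):
    return lr0 * np.exp(-k * t)

def inverse_time_decay(lr0, t, k):
    return lr0 / (1 + k * t)
```

A typical loop would call, for example, `x = adam(x, grad(x), state, lr)` once per mini-batch, with `state = {}` created before training starts.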
    
## Optimization Techniques

### Second order optimization methods
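- faster convergence
- no learning rate hyperparameter (in the vanilla Newton update)
- based on a second order Taylor expansion of the loss
- e.g. Newton parameter update
- Why is it impractical in Neural Networks?
    - involves the Hessian matrix
    - imagine a network with 100000 parameters: the Hessian has 100000 x 100000 entries, and you have to invert it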
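A tiny example of the Newton parameter update on a quadratic loss, where the Hessian is known exactly. The matrix `A`, vector `b` and starting point are made-up values for illustration.

```python
import numpy as np

# Quadratic loss f(x) = 0.5 * x^T A x - b^T x, so the Hessian is A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b

x = np.array([10.0, -7.0])   # arbitrary starting point

# Newton update: x <- x - H^{-1} grad(x), with no learning rate.
# Solving the linear system avoids forming the inverse explicitly.
x_new = x - np.linalg.solve(A, grad(x))
```

On a quadratic, one Newton step lands on the minimum. For a network, the Hessian is N x N for N parameters, which is why this does not scale.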

### L-BFGS
- usually works very well in full batch
- gives bad results when transferred to the mini-batch setup - active research

## Suggestion
- Use Adam 
- Use L-BFGS if you can afford to do full batch updates

## Ensembles

## Regularization

### Dropout
- p: probability of keeping a unit active (a unit is dropped with probability 1 - p)
- randomly drop units by setting their activations to zero
- affects both the forward and backward pass (weights associated with the dropped units aren't updated)
    - Open question to check: how does it affect the gradient updates of the dropped neurons?
    - apply the same mask in the backward pass
- How could this possibly be a good idea?
    - prevents overfitting
    - controls the variance of your model
    - see bias/variance trade off
    - many ensembles of smaller neural networks, sampled randomly
    - gives a more robust function approximation by not letting specific features dominate the output (e.g. cat classification)
    - training a large ensemble of models (that share parameters)
        - you're sub-sampling your NN
    - Each binary mask is one model, and gets trained on only ~one data point
- At test time
    - generally, no dropout on the test image in the forward pass
    - scale the activations of the dropout layers by p
    - we must scale the activations so that for each neuron: output at test time = expected output at training time
- MORE COMMON: Inverted dropout (scale by 1/p at training time and leave the test-time forward pass untouched)
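Inverted dropout can be sketched as below, assuming `p` is the probability of keeping a unit active. The function names are chosen here for illustration.

```python
import numpy as np

p = 0.5  # probability of keeping a unit active (higher = less dropout)

def dropout_train(h, rng):
    # Drop units at random and rescale the survivors by 1/p,
    # so the expected output matches the test-time output.
    mask = (rng.random(h.shape) < p) / p
    return h * mask, mask

def dropout_backward(dout, mask):
    # The same mask gates the gradient: dropped units get zero gradient.
    return dout * mask

def dropout_test(h):
    # Nothing to do at test time with inverted dropout.
    return h

rng = np.random.default_rng(0)
h = np.ones((1000, 100))          # made-up activations
out, mask = dropout_train(h, rng)
```

With plain (non-inverted) dropout, the mask would not be divided by `p`, and the test-time activations would be multiplied by `p` instead.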

### Gradient Checking
- see class notes
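As a rough illustration of the idea, a gradient check compares an analytic gradient against a centered finite difference. The step size `h`, the helper names and the example function are assumptions for this sketch.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered difference (f(x + h) - f(x - h)) / (2h) for each entry of x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)
        x.flat[i] = old - h
        f_minus = f(x)
        x.flat[i] = old                      # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

def relative_error(a, b):
    return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

# Example: f(w) = sum(tanh(w)^2), analytic gradient 2 * tanh(w) * (1 - tanh(w)^2)
rng = np.random.default_rng(0)
w = rng.standard_normal((3, 4))
analytic = 2 * np.tanh(w) * (1 - np.tanh(w) ** 2)
numeric = numerical_gradient(lambda v: np.sum(np.tanh(v) ** 2), w)
error = relative_error(analytic, numeric)   # small when the gradient is right
```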
   
    