<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Quick-intro" data-toc-modified-id="Quick-intro-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Quick intro</a></span></li><li><span><a href="#Biological-motivation-and-connections" data-toc-modified-id="Biological-motivation-and-connections-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Biological motivation and connections</a></span></li><li><span><a href="#Single-neuron-as-a-linear-classifier" data-toc-modified-id="Single-neuron-as-a-linear-classifier-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Single neuron as a linear classifier</a></span><ul class="toc-item"><li><span><a href="#Binary-Softmax-classifier." data-toc-modified-id="Binary-Softmax-classifier.-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Binary Softmax classifier.</a></span></li><li><span><a href="#Binary-SVM-classifier" data-toc-modified-id="Binary-SVM-classifier-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Binary SVM classifier</a></span></li><li><span><a href="#Regularization-interpretation" data-toc-modified-id="Regularization-interpretation-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Regularization interpretation</a></span></li></ul></li><li><span><a href="#Commonly-used-activation-functions-(网页中有-functions'-curves)" data-toc-modified-id="Commonly-used-activation-functions-(网页中有-functions'-curves)-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Commonly used activation functions (网页中有 functions' curves)</a></span><ul class="toc-item"><li><span><a href="#Sigmoid" data-toc-modified-id="Sigmoid-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Sigmoid</a></span><ul class="toc-item"><li><span><a href="#对-sigmoid-的分析：" data-toc-modified-id="对-sigmoid-的分析：-4.1.1"><span class="toc-item-num">4.1.1&nbsp;&nbsp;</span>对 sigmoid 的分析：</a></span></li></ul></li><li><span><a href="#Tanh-结合图看更好" data-toc-modified-id="Tanh-结合图看更好-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Tanh 
结合图看更好</a></span></li><li><span><a href="#ReLU" data-toc-modified-id="ReLU-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>ReLU</a></span></li><li><span><a href="#Leaky-ReLU" data-toc-modified-id="Leaky-ReLU-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Leaky ReLU</a></span></li><li><span><a href="#Maxout" data-toc-modified-id="Maxout-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Maxout</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></li></ul></div>

# [Neural Networks Part 1: Setting up the Architecture](http://cs231n.github.io/neural-networks-1/)

model of a biological neuron, activation functions, neural net architecture, representational power

## Quick intro
* <span class="burk">Two layer neural network's</span> score function:
$$ s = W_2 \max(0, W_1 x) $$
where 
    * $W_1 \in \mathbb{R}^{100 \times 3072}$ and $W_2 \in \mathbb{R}^{10\times 100}$ are the parameters (weights)
    * $\max(0, -)$ is a non-linearity that is applied elementwise.
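
The two-layer score function can be sketched in NumPy as follows. This is only an illustration: the random weights, the `0.01` scale, the seed, and the random input `x` (standing in for a flattened 32x32x3 image) are assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, for illustration only

x = rng.standard_normal(3072)           # assumed input, e.g. a flattened 32x32x3 image
W1 = 0.01 * rng.standard_normal((100, 3072))
W2 = 0.01 * rng.standard_normal((10, 100))

h = np.maximum(0, W1 @ x)               # hidden vector: elementwise max(0, .)
s = W2 @ h                              # class scores

print(h.shape, s.shape)                 # (100,) (10,)
```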
    
    
* <span class="burk">Three layer  neural network</span> score function:
$$ s = W_3 \max(0, W_2 \max(0, W_1 x)) $$
where 
    * all of $W_3,W_2,W_1$ are parameters to be learned.
    * The sizes of the <span class="girk">intermediate hidden vectors are hyperparameters</span> of the network and we’ll see how we can set them later. 
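
A minimal sketch of the same idea for any number of layers. The helper name `forward` and the hidden sizes 100 and 50 are hypothetical choices made here to show that the intermediate sizes are free hyperparameters.

```python
import numpy as np

def forward(x, weights):
    """Apply max(0, W h) for every layer except the last, which stays linear."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(0, W @ h)
    return weights[-1] @ h

# Hidden sizes are hyperparameters; 100 and 50 are arbitrary choices here.
sizes = [3072, 100, 50, 10]
rng = np.random.default_rng(0)
weights = [0.01 * rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

s = forward(rng.standard_normal(3072), weights)
print(s.shape)  # (10,)
```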
    
    
## Biological motivation and connections
<img src="http://cs231n.github.io/assets/nn1/neuron_model.jpeg" width="425"/>

* Based on this rate code interpretation, we model the firing rate of the neuron with an activation function $f$, which represents the frequency of the spikes along the axon. 
* Historically, a common choice of activation function is the sigmoid function ${\displaystyle \sigma(x)=\frac{1}{1+\exp(-x)}}$, since it takes a real-valued input (the signal strength after the sum) and squashes it to the range between 0 and 1. 

In [None]:
# Example code for forward-propagating a single neuron might look as follows.
import math
import numpy as np

class Neuron(object):
  # ... 
  def forward(self, inputs):
    """ assume inputs and weights are 1-D numpy arrays and bias is a number """
    cell_body_sum = np.sum(inputs * self.weights) + self.bias  # weighted sum of inputs plus bias
    firing_rate = 1.0 / (1.0 + math.exp(-cell_body_sum))  # sigmoid activation function
    return firing_rate

## Single neuron as a linear classifier
### Binary Softmax classifier.
* We can interpret ${\displaystyle \sigma\left(\sum_{i}w_{i}x_{i}+b\right)}$ as the probability of one of the classes, $P(y_i = 1 \mid x_i; w)$. The probability of the other class is then $P(y_i = 0 \mid x_i; w) = 1 - P(y_i = 1 \mid x_i; w)$. 
* With this interpretation, we can formulate the <span class="girk">cross-entropy loss</span> as we have seen in the Linear Classification section, and optimizing it would lead to a <span class="girk">binary Softmax classifier</span> (also known as logistic regression).
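
As a rough sketch, the cross-entropy loss for a single example under this interpretation could be computed like this. The function names and the toy values of `w`, `b`, and `x` are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(w, b, x, y):
    """Loss for one example; y is the label, 0 or 1."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)  # P(y = 1 | x; w)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

w, b = [0.5, -0.25], 0.0
x = [1.0, 2.0]                           # w.x + b = 0.5 - 0.5 = 0, so p = 0.5
print(binary_cross_entropy(w, b, x, 1))  # log(2), about 0.693
```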

### Binary SVM classifier
* Alternatively, we could attach a <span class="girk">max-margin hinge loss</span> to the output of the neuron and train it to become a binary Support Vector Machine.
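
A minimal sketch of the hinge loss for one example. It assumes the loss is attached to the raw score $w^Tx + b$, that labels are encoded as $-1$ or $+1$, and that the margin is 1; the function name and toy values are hypothetical.

```python
def hinge_loss(w, b, x, y):
    """Max-margin loss for one example; y is the label, -1 or +1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

w, b = [2.0, -1.0], 0.5
x = [1.0, 1.0]                     # score = 2 - 1 + 0.5 = 1.5
print(hinge_loss(w, b, x, +1))     # 0.0  (margin of 1 is satisfied)
print(hinge_loss(w, b, x, -1))     # 2.5  (wrong side of the margin)
```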

### Regularization interpretation
* The regularization loss in both SVM/Softmax cases could in this biological view be interpreted as **<span class="girk">gradual forgetting</span>**, since it would have the effect of driving all synaptic weights $w$ towards zero after every parameter update.
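
To see the "forgetting" effect in isolation, here is a small sketch that assumes an L2 penalty $\frac{\lambda}{2}\lVert w \rVert^2$, plain gradient descent, and a data gradient of zero. The values of `lr` and `lam` are arbitrary.

```python
lr, lam = 0.1, 0.5                 # assumed learning rate and regularization strength
w = [1.0, -2.0, 0.5]

for step in range(3):
    # Gradient of (lam / 2) * ||w||^2 is lam * w; the data gradient is taken as zero here.
    w = [wi - lr * lam * wi for wi in w]
    print(step, w)
```

Each update multiplies every weight by $(1 - \text{lr} \cdot \lambda) = 0.95$, so without a data gradient pushing back, the weights decay geometrically towards zero.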

<span class="burk">A single neuron can be used to implement a binary classifier (e.g. binary Softmax or binary SVM classifiers)</span>

## Commonly used activation functions (the web page has the functions' curves)
* Sigmoid
* Tanh
* ReLU
* Leaky ReLU
* Maxout
* TLDR

### Sigmoid
* mathematical form $\sigma(x) = 1 / (1 + e^{-x})$
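* It takes a real-valued number and “squashes” it into the range between 0 and 1. In particular, large negative numbers map to values close to 0 and large positive numbers map to values close to 1.
* It has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully saturated firing at an assumed maximum frequency (1).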

<img class="left" src='http://cs231n.github.io/assets/nn1/sigmoid.jpeg' width='350'/>

#### Analysis of sigmoid:
<span class="burk">Sigmoids saturate and kill gradients</span>.
* A very undesirable property of the sigmoid neuron is that when the neuron’s activation saturates at either tail of 0 or 1, <span class="girk">the gradient at these regions is almost zero</span>. 
* Recall that during backpropagation, this (local) gradient will be multiplied by the gradient of this gate’s output for the whole objective. Therefore, if the local gradient is very small, it will effectively “kill” the gradient and almost no signal will flow through the neuron to its weights and recursively to its data.
* Additionally, one must take extra care when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large then most neurons would become saturated and the network will barely learn.
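
The saturation can be checked numerically with a short sketch, using the standard identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. The function names and sample points are chosen here for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # d(sigma)/dx = sigma(x) * (1 - sigma(x))

for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid_grad(x))
```

The local gradient peaks at 0.25 for $x = 0$ and falls below $10^{-4}$ by $x = 10$, which is why a saturated neuron passes almost no signal backwards.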
    
<span class="burk">Sigmoid outputs are not zero-centered</span>.
* This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered.
* This has implications for the dynamics of gradient descent: if the <span class="girk">data coming into a neuron is always positive</span> (e.g. $x>0$ elementwise in $f=w^Tx+b$), then during backpropagation the gradient on the weights $w$ will <span class="girk">be either all positive or all negative</span> (depending on the gradient of the whole expression $f$).
* However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above.
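
A small sketch of both points, using $\partial L / \partial w = (\partial L / \partial f)\, x$ for $f = w^Tx + b$. The upstream gradient values and the toy inputs are assumptions made for illustration.

```python
import numpy as np

x = np.array([0.2, 0.9, 0.5])      # all positive, e.g. sigmoid outputs from the previous layer

for upstream in (+1.5, -0.7):      # assumed values of dL/df
    grad_w = upstream * x          # dL/dw = dL/df * df/dw = dL/df * x
    print(upstream, np.sign(grad_w))

# Summing over a batch can still produce mixed signs:
X = np.array([[0.2, 0.9, 0.5],
              [0.8, 0.1, 0.5]])
upstreams = np.array([+1.0, -1.0])
batch_grad = upstreams @ X         # components 0.2 - 0.8, 0.9 - 0.1, 0.5 - 0.5
print(batch_grad)
```

For a single example every component of the weight gradient shares the sign of the upstream gradient, while the batch sum ends up with one negative and one positive component.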

### Tanh (best viewed alongside the plot)

<img class="right" src='http://cs231n.github.io/assets/nn1/tanh.jpeg' width='350'/>
* It squashes a real-valued number to the range $[-1, 1]$. 
* Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. 
* Therefore, in practice the tanh non-linearity is always preferred to the sigmoid nonlinearity.
* Relationship with sigmoid: $\tanh(x) = 2\sigma(2x) - 1$.
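
The relationship between tanh and the sigmoid can be checked numerically with a short sketch. The sample points and the tolerance are arbitrary choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    lhs = math.tanh(x)
    rhs = 2.0 * sigmoid(2.0 * x) - 1.0   # scaled and shifted sigmoid
    print(x, lhs, rhs, math.isclose(lhs, rhs, abs_tol=1e-12))
```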

### ReLU
<img src='http://cs231n.github.io/assets/nn1/relu.jpeg' width='350'/>
* The Rectified Linear Unit (ReLU) has become very popular in the last few years.
* It computes the function $f(x)=\max(0,x)$. In other words, the activation is simply thresholded at zero.


Analysis:
* (+) It was found to <span class="girk">greatly accelerate</span> (e.g. a factor of 6 in [Krizhevsky et al.](http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf)) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its <span class="girk">linear, non-saturating form</span>.
* (+) Compared to tanh/sigmoid neurons that involve expensive operations (exponentials, etc.), the ReLU can be <span class="girk">implemented by simply thresholding a matrix of activations at zero</span>.
* (-) Unfortunately, ReLU units can be fragile during training and can “die”.
    * For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that <span class="girk">the neuron will never activate on any datapoint again</span>, i.e. its output is zero for every input.
    * If this happens, then the gradient flowing through the unit will forever be zero from that point on.
    * That is, the ReLU units can <span class="girk">irreversibly die</span> during training since they can get knocked off the data manifold.
        * For example, you may find that <span class="girk">as much as 40% of your network can be “dead”</span> (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. 
    * With a <span class="girk">proper setting of the learning rate</span> this is less frequently an issue.
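
One possible way to monitor the fraction of “dead” units is sketched below. The helper name `dead_fraction` and the toy pre-activation matrix are hypothetical; in practice the array would hold the ReLU inputs collected over the training set.

```python
import numpy as np

def dead_fraction(pre_activations):
    """pre_activations: array of shape (num_examples, num_units), inputs to the ReLU."""
    activations = np.maximum(0, pre_activations)
    never_active = (activations > 0).sum(axis=0) == 0   # unit is zero on every example
    return never_active.mean()

# Toy data: 4 examples, 3 units; the middle unit is negative everywhere.
z = np.array([[ 1.0, -2.0,  0.5],
              [-0.3, -0.1,  2.0],
              [ 0.7, -5.0, -1.0],
              [ 2.0, -0.4,  0.1]])
print(dead_fraction(z))   # one of three units is dead
```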

### Leaky ReLU
<img src='http://cs231n.github.io/assets/nn1/alexplot.jpeg' width='350'/>
* <span class="girk">Leaky ReLUs</span> are one attempt to fix the “dying ReLU” problem. 
* Instead of the function being zero when $x < 0$, a leaky ReLU has a small slope (of 0.01, or so) in the negative region.
* Mathematical formula: $$f(x) = \mathbb{1}(x < 0) (\alpha x) + \mathbb{1}(x \geq 0) (x)$$ where $\alpha$ is a small constant.

* Some people report success with this form of activation function, but <span class="girk">the results are not always consistent</span>. 
    * The slope in the negative region can also be made into a parameter of each neuron, as seen in PReLU neurons, introduced in [Delving Deep into Rectifiers](http://arxiv.org/abs/1502.01852) by Kaiming He et al., 2015.
    * However, the consistency of the benefit across tasks is presently unclear.
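
A minimal NumPy sketch of the leaky ReLU. The function name and the default `alpha=0.01` follow the "0.01, or so" figure above, and the sample input is arbitrary.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """alpha is the slope used for x < 0; alpha = 0 recovers the plain ReLU."""
    return np.where(x < 0, alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 3.0])
print(leaky_relu(x))             # negative inputs are scaled by 0.01
print(leaky_relu(x, alpha=0.0))  # same values as np.maximum(0, x)
```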


### Maxout
* The Maxout neuron (introduced by Goodfellow et al.) generalizes the ReLU and its leaky version. It computes the function
$$ \max(w_1^Tx+b_1, w_2^Tx + b_2) $$

* Notice that both ReLU and Leaky ReLU are a special case of this form (for example, for ReLU we have $w_1, b_1 = 0$).
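
The ReLU special case can be checked with a short sketch. The function name `maxout`, the input dimension of 4, and the random weights are assumptions made for illustration.

```python
import numpy as np

def maxout(x, w1, b1, w2, b2):
    """Maximum of two affine functions of x."""
    return max(w1 @ x + b1, w2 @ x + b2)

rng = np.random.default_rng(0)
w2, b2 = rng.standard_normal(4), 0.3
zeros = np.zeros(4)

for _ in range(3):
    x = rng.standard_normal(4)
    relu_out = max(0.0, w2 @ x + b2)                   # ReLU applied to w2.x + b2
    print(maxout(x, zeros, 0.0, w2, b2) == relu_out)   # True
```

Setting $w_1 = 0$ and $b_1 = 0$ makes the first argument identically zero, so the Maxout output reduces to $\max(0, w_2^Tx + b_2)$.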


Analysis:
* (+) The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU).
* (-) However, unlike the ReLU neurons it doubles the number of parameters for every single neuron, leading to a high total number of parameters.

### Conclusion
* It is very rare to mix and match different types of neurons in the same network, even though there is no fundamental problem with doing so.

* What neuron type should I use?
    * (1) Use the ReLU non-linearity, (2) be careful with your learning rates and (3) possibly monitor the fraction of “dead” units in a network.
    * If this concerns you, give Leaky ReLU or Maxout a try. 
    * Never use sigmoid.
    * Try tanh, but expect it to work worse than ReLU/Maxout.
  

## Neural Network architectures

### Layer-wise organization
