<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Setting-up-the-data-and-the-model" data-toc-modified-id="Setting-up-the-data-and-the-model-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setting up the data and the model</a></span><ul class="toc-item"><li><span><a href="#Data-Preprocessing" data-toc-modified-id="Data-Preprocessing-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data Preprocessing</a></span><ul class="toc-item"><li><span><a href="#Mean-subtraction" data-toc-modified-id="Mean-subtraction-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Mean subtraction</a></span></li><li><span><a href="#Normalization" data-toc-modified-id="Normalization-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Normalization</a></span></li><li><span><a href="#PCA-and-Whitening" data-toc-modified-id="PCA-and-Whitening-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>PCA and Whitening</a></span></li></ul></li><li><span><a href="#Weight-Initialization" data-toc-modified-id="Weight-Initialization-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Weight Initialization</a></span></li><li><span><a href="#Batch-Normalization" data-toc-modified-id="Batch-Normalization-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Batch Normalization</a></span></li><li><span><a href="#Regularization-(L2/L1/Maxnorm/Dropout)" data-toc-modified-id="Regularization-(L2/L1/Maxnorm/Dropout)-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Regularization (L2/L1/Maxnorm/Dropout)</a></span></li></ul></li></ul></div>

# [Neural Networks Part 2: Setting up the Data and the Loss](http://cs231n.github.io/neural-networks-2/)

* data preprocessing
* weight initialization
* batch normalization
* regularization (L2/dropout)
* loss functions

## Setting up the data and the model
### Data Preprocessing
There are <span class="girk">three common forms of data preprocessing</span> for a data matrix $X \in \mathbb{R}^{N\times D}$, where $N$ is the number of data points and $D$ is the feature dimension.
* Mean subtraction
* Normalization
* PCA and Whitening

#### Mean subtraction
* most common form of preprocessing 
* subtracting the mean across every individual feature in the data $\rightarrow$  centering the cloud of data around the origin along every dimension.
* python code: ``` X -= np.mean(X, axis = 0)```
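
A minimal sketch of mean subtraction on a made-up toy matrix. The values and the names `feature_mean` and `X_centered` are illustrative assumptions, not part of the original notes. It uses an out-of-place subtraction so that it also works when `X` has an integer dtype, where the in-place `X -= ...` form would raise a casting error.

```python
import numpy as np

# Hypothetical toy data: N = 4 samples, D = 3 features (values are made up).
X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0],
              [4.0, 40.0, 400.0]])

feature_mean = np.mean(X, axis=0)   # shape (D,): one mean per feature (column)
X_centered = X - feature_mean       # broadcasting subtracts it from every row

print(feature_mean)                 # per-feature means of the original data
print(X_centered.mean(axis=0))      # per-feature means after centering (all ~0)
```

In a train/validation/test setting, the mean would typically be computed on the training split only and then subtracted from every split.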

#### Normalization
* normalizing the data dimensions so that they are of approximately the same scale.

* <span class="burk">two common ways of achieving this normalization</span>
    * (1) divide each dimension by its <span class="girk">standard deviation</span>, once it has been zero-centered. Python code ```X /= np.std(X, axis = 0)```
    * (2) normalize each dimension so that the <span class="girk">$min$ and $max$ along the dimension are -1 and 1</span> respectively.
* In case of images, the relative scales of pixels are already approximately equal (and in range from 0 to 255), so it is <span class="girk">not strictly necessary</span> to perform this additional preprocessing step.
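
The two normalization options above can be sketched as follows. This is an illustration on synthetic data; the data, the random seed, and the names `X_std` and `X_minmax` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales.
X = rng.normal(loc=[5.0, -300.0], scale=[0.1, 50.0], size=(1000, 2))

# (1) Zero-center, then divide each dimension by its standard deviation.
X_std = X - X.mean(axis=0)
X_std /= X_std.std(axis=0)

# (2) Rescale each dimension linearly so that its min maps to -1 and its max to 1.
x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_minmax = 2.0 * (X - x_min) / (x_max - x_min) - 1.0

print(X_std.std(axis=0))                          # ~1 in every dimension
print(X_minmax.min(axis=0), X_minmax.max(axis=0)) # -1 and 1 in every dimension
```

Option (2) assumes no feature is constant; a constant feature would make `x_max - x_min` zero.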


![](http://cs231n.github.io/assets/nn2/prepro1.jpeg)
Figure. (a) original data, (b) zero-centered data, (c) zero-centered and scaled (normalized) data.

#### PCA and Whitening
* another form of preprocessing.
    * The data is first zero-centered as described above.
    * Then we compute the covariance matrix, which tells us about the correlation structure in the data.
    

    
* <span class="burk">Assessment: in practice</span>:
    * We mention PCA/Whitening in these notes for completeness, but these transformations are not used with Convolutional Networks. PCA/Whitening is also computationally expensive because of the SVD.
    * However, it is very important to zero-center the data, and it is common to see normalization of every pixel as well.
    
    
![](http://cs231n.github.io/assets/nn2/prepro2.jpeg)
Figure. PCA / Whitening. <span class="girk">Left</span>: Original toy, 2-dimensional input data. <span class="girk">Middle</span>: After performing PCA. The data is centered at zero and then rotated into the eigenbasis of the data covariance matrix. This decorrelates the data (the covariance matrix becomes diagonal). <span class="girk">Right</span>: Each dimension is additionally scaled by the inverse square root of its eigenvalue, transforming the data covariance matrix into the identity matrix. Geometrically, this corresponds to stretching and squeezing the data into an isotropic Gaussian blob.

In [None]:
import numpy as np

# Assume input data matrix X of size [N x D] with a floating-point dtype
X -= np.mean(X, axis = 0) # zero-center the data (important)
cov = np.dot(X.T, X) / X.shape[0] # get the data covariance matrix

* The (i,j) element of the data covariance matrix contains the covariance between the i-th and j-th dimensions of the data; the diagonal holds the variances.
* cov is symmetric and positive semi-definite.
* Next, compute the SVD factorization of the data covariance matrix:

In [None]:
U,S,V = np.linalg.svd(cov)

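* the columns of U are the eigenvectors and S is a 1-D array of the singular values, sorted in decreasing order. Because cov is symmetric and positive semi-definite, these singular values equal its eigenvalues.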
* To decorrelate the data, we project the original (but zero-centered) data into the eigenbasis:

In [None]:
# Keep all components
Xrot = np.dot(X, U) # decorrelate the data

# PCA-reduced: keep only the top 100 principal components
Xrot_reduced = np.dot(X, U[:,:100]) # first 100 columns of U, Xrot_reduced becomes [N x 100]
# It is very often the case that you can get very good performance by training 
# linear classifiers or neural networks on the PCA-reduced datasets,
# obtaining savings in both space and time.

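* whiten the data:
    * divide each dimension of the rotated data by the square root of its eigenvalue (for the covariance matrix, the eigenvalues are the singular values in S); the small constant 1e-5 guards against division by zero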

In [None]:
Xwhite = Xrot / np.sqrt(S + 1e-5)
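
As a sanity check, the whole pipeline can be run end to end on synthetic data. The sketch below is an illustration under stated assumptions: the mixing matrix `A`, the sample count, and the seed are made up. It checks three things: the singular values of `cov` match its eigenvalues, the rotated data has a diagonal covariance, and the whitened data has a covariance close to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2-D data: mix independent Gaussians with a matrix A.
A = np.array([[2.0, 0.0],
              [1.5, 0.5]])
X = rng.standard_normal((5000, 2)) @ A.T

X -= X.mean(axis=0)                      # zero-center
cov = X.T @ X / X.shape[0]               # [D x D] data covariance
U, S, _ = np.linalg.svd(cov)             # S is sorted in decreasing order

# eigh returns eigenvalues in ascending order, so reverse to compare with S.
eigvals = np.linalg.eigh(cov)[0][::-1]

Xrot = X @ U                             # decorrelate: rotate into the eigenbasis
Xwhite = Xrot / np.sqrt(S + 1e-5)        # whiten: unit variance in every direction

cov_rot = Xrot.T @ Xrot / Xrot.shape[0]
cov_white = Xwhite.T @ Xwhite / Xwhite.shape[0]

print(np.round(cov_rot, 4))              # diagonal, with S on the diagonal
print(np.round(cov_white, 4))            # approximately the identity matrix
```

One known downside of whitening is that it can amplify noise, since low-variance directions are stretched to the same scale as the others; a larger smoothing constant than `1e-5` reduces this effect.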

### Weight Initialization
* Pitfall: <span class="burk">all zero initialization</span>
    * not good. 
    * there is no source of asymmetry between neurons if their weights are initialized to be the same.
* <span class="burk">Small random numbers</span>. ```W = 0.01* np.random.randn(D,H)``` where ```randn``` samples from a zero mean, unit standard deviation gaussian.
    * Warning: it is not necessarily the case that smaller numbers will work strictly better.
        * a Neural Network layer that has very small weights will during backpropagation compute very small gradients on its data (since this gradient is proportional to the value of the weights). 
        * This could greatly diminish the “gradient signal” flowing backward through a network, and could become a concern for deep networks.
        
        
* <span class="burk">Calibrating the variances with 1/sqrt(n)</span>. ```w = np.random.randn(n) / np.sqrt(n)```,  where n is the number of the neuron's inputs.
    * <span class="mark">Benefits</span>: This ensures that all neurons in the network initially have <span class="girk">approximately the same output distribution</span> and empirically improves the rate of convergence.

* Initializing the biases.
    * it is more common to simply use 0 bias initialization.
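
The effect of the 1/sqrt(n) calibration can be seen by comparing the spread of pre-activation outputs under the two random schemes above. This is a rough sketch; the layer sizes, the seed, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_samples = 500, 100, 2000       # hypothetical layer sizes

x = rng.standard_normal((n_samples, n_in))       # unit-variance inputs

# Small random numbers: the output scale depends on the fan-in n_in.
W_small = 0.01 * rng.standard_normal((n_in, n_hidden))
# Calibrated: each weight has variance 1/n_in, so outputs have roughly unit variance.
W_calibrated = rng.standard_normal((n_in, n_hidden)) / np.sqrt(n_in)
b = np.zeros(n_hidden)                           # zero bias initialization

s_small = x @ W_small + b
s_calibrated = x @ W_calibrated + b

print(s_small.std())        # roughly 0.01 * sqrt(n_in)
print(s_calibrated.std())   # roughly 1, independent of n_in
```

With unit-variance inputs, the output variance of a neuron is the sum of the variances of its `n_in` weights, which is why dividing by `sqrt(n_in)` keeps the output scale the same regardless of fan-in.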

### Batch Normalization
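* [Batch Normalization (BN), lecture by Hung-yi Lee (李宏毅)](https://www.youtube.com/watch?v=BZh1ltr5Rkg)
    * Explains where the problem comes from and the details of the solution; a very good tutorial.
    * Main idea:
        * Goal: normalize the distribution of each layer's inputs, so that the model optimizes its parameters against more stable inputs.
        * Computation: apply BN to the layer output just before the activation, in two steps. (1) Compute the mean and variance of the batch along each feature dimension, then subtract the mean and divide by the standard deviation (the square root of the variance). (2) For extra flexibility, multiply by a parameter $\gamma$ and add $\beta$ to the result; both $\gamma$ and $\beta$ are learned by the network.
        * At test time BN is applied as well, so it needs a mean and std as well as $\gamma, \beta$. The mean and variance come from the training phase: a running average of the per-batch means and variances seen during mini-batch SGD (batches closer to convergence get a larger weight). $\gamma$ and $\beta$ are the values learned during training. The training procedure itself is already well integrated into frameworks, so the details can usually be ignored.
    * Advantages: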
        * 
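
The two-step computation and the test-time behaviour described above can be sketched in NumPy. This is a forward pass only and not a framework implementation: the function name `batchnorm_forward`, the `momentum` and `eps` values, and the synthetic batches are assumptions made for the example. The exponential moving average is one common way to give later batches more weight.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train, momentum=0.9, eps=1e-5):
    """Batch-norm forward pass for x of shape [N x D].

    Returns (out, running_mean, running_var).
    """
    if train:
        mu = x.mean(axis=0)                        # per-feature batch mean
        var = x.var(axis=0)                        # per-feature batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)      # step (1): normalize
        # Exponential moving average: recent batches get the largest weight.
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # Test time: reuse the statistics accumulated during training.
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    out = gamma * x_hat + beta                     # step (2): scale and shift
    return out, running_mean, running_var

rng = np.random.default_rng(0)
D = 4
gamma, beta = np.ones(D), np.zeros(D)              # learned in a real network
running_mean, running_var = np.zeros(D), np.ones(D)

# Simulated training: 200 mini-batches drawn from N(3, 2^2).
for _ in range(200):
    batch = rng.normal(loc=3.0, scale=2.0, size=(64, D))
    out, running_mean, running_var = batchnorm_forward(
        batch, gamma, beta, running_mean, running_var, train=True)

# Test time: a small batch is normalized with the running statistics.
test_batch = rng.normal(loc=3.0, scale=2.0, size=(8, D))
test_out, _, _ = batchnorm_forward(
    test_batch, gamma, beta, running_mean, running_var, train=False)

print(running_mean)   # close to 3
print(running_var)    # close to 4
```

In a real network `gamma` and `beta` would be updated by backpropagation together with the weights; here they are left at their initial values of 1 and 0.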


### Regularization (L2/L1/Maxnorm/Dropout)