# Neural Networks

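Neural networks are inspired by how the human brain works and are widely used today.  
They are a fairly old class of algorithms that tries to mimic the brain.  
Regions of the human cortex can be rewired to learn new kinds of perception. For example, when the voltage signal from a visual sensor is applied to the tongue, a person can learn to "see" with the tongue, and some blind people learn to navigate by echolocation. This suggests that the brain has a learning mechanism of its own, and the prospect of applying that mechanism in a computer is an exciting one.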

## Non-Linear Hypothesis

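When the feature space used for classification has many dimensions and the decision boundary is very complex, logistic regression needs polynomial combinations of the features. The number of such combinations grows rapidly with the feature dimension n (on the order of n^2 for quadratic terms and n^3 for cubic terms), so the new feature count n' becomes very large and logistic regression becomes very expensive to compute. It is therefore <span class="burk">not suitable</span>.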
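
To get a feel for how quickly the feature count grows, the sketch below counts the monomials of a given degree in n variables. The helper name `num_monomials` is hypothetical, and the values of n are chosen only for illustration.

```python
from math import comb

def num_monomials(n, degree):
    """Number of distinct monomials of exactly `degree` in n variables.

    For degree 2 this counts the products x_i * x_j with i <= j.
    """
    return comb(n + degree - 1, degree)

# Quadratic and cubic term counts for two example feature dimensions.
for n in (100, 2500):
    print(n, num_monomials(n, 2), num_monomials(n, 3))
```

With n = 100 there are already 5050 quadratic terms, and a 50x50 grayscale image (n = 2500) gives about 3 million of them.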

## Neural Networks Model

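![Weight matrix definition and dimensions, notation](picture/4.jpg)

The layers before the output layer learn, on their own, features that are useful for classification. They map the input features into a new space in which a simple linear classifier is enough, so the output layer behaves much like logistic regression.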

## forward propagation: Vectorized implementation

Vectorized forward propagation:  
\begin{align*}x = \begin{bmatrix}x_0 \newline x_1 \newline\cdots \newline x_n\end{bmatrix} &z^{(j)} = \begin{bmatrix}z_1^{(j)} \newline z_2^{(j)} \newline\cdots \newline z_n^{(j)}\end{bmatrix}\end{align*}  
\begin{align*}z_k^{(2)} = \Theta_{k,0}^{(1)}x_0 + \Theta_{k,1}^{(1)}x_1 + \cdots + \Theta_{k,n}^{(1)}x_n \newline a^{(j)} = g(z^{(j)}) \newline z^{(j+1)} = \Theta^{(j)}a^{(j)} \newline h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})\end{align*}  

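Here x is the input, z is the weighted sum of the inputs, Θ holds the weights, and a is the activation obtained by applying g to z. The superscript on z and a is the layer index; the superscript i on Θ marks the weights that map layer i to layer i+1. The vectorized computation proceeds as follows:  
1. Add the bias unit 1 to the input vector x (or to a^(i)).  
2. Multiply the weight matrix Θ^(i) by the bias-augmented x (or a^(i)) to obtain z^(i+1).  
3. Apply the activation function to z^(i+1) to obtain the activation a^(i+1).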
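
The three steps can be sketched in plain Python as follows. This is only an illustration under assumptions: the function names `sigmoid` and `forward` and the list-of-rows weight layout are choices made here, not part of the original notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Forward propagation for one input vector x (given without bias).

    thetas is a list of weight matrices (lists of rows). Each matrix has one
    more column than the layer it reads from: column 0 multiplies the bias.
    Returns the activations of every layer, without bias units.
    """
    a = list(x)
    activations = [a]
    for theta in thetas:
        a_bias = [1.0] + a                                  # step 1: add bias
        z = [sum(w * v for w, v in zip(row, a_bias))        # step 2: z = Theta * a
             for row in theta]
        a = [sigmoid(v) for v in z]                         # step 3: a = g(z)
        activations.append(a)
    return activations

# One layer with all-zero weights: z = 0, so the output is g(0) = 0.5.
print(forward([[[0.0, 0.0, 0.0]]], [1.0, 2.0])[-1])
```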

## example for xnor(x1, x2) (not xor(x1, x2))

![](picture/5.jpg)

The hidden layer maps features that are not linearly separable into a space where they are linearly separable.
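
A small sketch of such a network for XNOR follows. The weight values are an assumption (a commonly used hand-picked set, which may or may not match the figure above): one hidden unit computes x1 AND x2, the other computes (NOT x1) AND (NOT x2), and the output unit ORs them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed weights, one row per unit: [bias, w1, w2].
THETA1 = [[-30.0, 20.0, 20.0],    # hidden unit 1: x1 AND x2
          [10.0, -20.0, -20.0]]   # hidden unit 2: (NOT x1) AND (NOT x2)
THETA2 = [[-10.0, 20.0, 20.0]]    # output unit: a1 OR a2

def layer(theta, a):
    """Add the bias unit, take the weighted sums, apply the sigmoid."""
    a_bias = [1.0] + a
    return [sigmoid(sum(w * v for w, v in zip(row, a_bias))) for row in theta]

def xnor(x1, x2):
    hidden = layer(THETA1, [float(x1), float(x2)])
    return layer(THETA2, hidden)[0]

# Print the truth table; outputs are rounded to 0 or 1.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
```

In the hidden-layer space the two classes can be separated by a single OR unit, which is not possible on the raw inputs.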

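## multiple output units: one-vs-all

As in one-vs-all logistic regression, the output layer contains one neuron per class, and each output neuron acts as one classifier.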
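
As a rough illustration, the labels can be encoded as 0/1 vectors for training, and a prediction can be read off as the index of the largest output unit. The helper names `one_hot` and `predict_label` are hypothetical, and labels are assumed to run from 1 to K.

```python
def one_hot(label, num_labels):
    """Map a label in 1..num_labels to a 0/1 vector with a single 1."""
    return [1 if k == label else 0 for k in range(1, num_labels + 1)]

def predict_label(outputs):
    """Return the 1-based class whose output unit has the largest value."""
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    return best + 1

print(one_hot(3, 4))                    # target vector for class 3 of 4
print(predict_label([0.1, 0.7, 0.2]))   # class with the largest output
```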

# cost function

## regularized logistic regression's cost function  
\begin{align*} J(\theta) = - \frac{1}{m} \sum_{i=1}^m [ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2\end{align*}

## regularized neural networks' cost function  
\begin{gather*} J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2\end{gather*}

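Note: in both logistic regression and neural networks, Professor Andrew Ng does not regularize the bias parameter Θ0. His point is that regularizing it or not makes little difference to the result, and the preference is to leave it unregularized.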
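
The regularized cost can be written out directly from the formula. The sketch below assumes the network outputs have already been computed; the function name `nn_cost` and the list-based data layout are illustrative choices.

```python
import math

def nn_cost(h, y, thetas, lam):
    """Regularized cross-entropy cost of a neural network.

    h, y   : m x K lists (network outputs and 0/1 targets)
    thetas : list of weight matrices; column 0 of each holds the bias weights
    lam    : regularization strength lambda
    """
    m = len(h)
    data_term = 0.0
    for h_row, y_row in zip(h, y):
        for h_k, y_k in zip(h_row, y_row):
            data_term += y_k * math.log(h_k) + (1 - y_k) * math.log(1 - h_k)
    reg_term = sum(w ** 2
                   for theta in thetas
                   for row in theta
                   for w in row[1:])        # skip the bias column
    return -data_term / m + lam / (2 * m) * reg_term

# One example, two output units, one 1x2 weight matrix [bias, weight].
print(nn_cost([[0.5, 0.5]], [[1, 0]], [[[1.0, 2.0]]], lam=1.0))
```

The slice `row[1:]` is where the bias weights are left out of the penalty.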

# BackPropagation Algorithm  
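Error backpropagation is a common method for training artificial neural networks, used together with an optimization algorithm such as gradient descent.

The BP algorithm:  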
![](picture/6.jpg)

## Parameter update steps for the sigmoid activation function
 
Given training set {(x(1),y(1))⋯(x(m),y(m))}  
Set Δ^(l)_{i,j} := 0 for all (l,i,j) (so each Δ^(l) starts as a matrix of zeros)  
For training example t = 1 to m:  
% t indexes the training examples  
1. Forward propagation: compute the output a(l) of every layer  
Set a(1) := x(t)  
Perform forward propagation to compute a(l) for l=2,3,…,L  
2. Update the gradient  
2.1 Compute δ (think of it as an intermediate value of the gradient, computed recursively from the output layer backwards)  
Using y(t), compute δ(L)=a(L)−y(t)  
Then compute δ(L−1),δ(L−2),…,δ(2)  
using the update formula: \begin{align*} \delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)}) \end{align*}  
2.2 Compute Δ  
\begin{align*} \Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)} \newline \text{Vectorized: } \Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T\end{align*}  
2.3 Compute the gradient D (once all m examples have been accumulated)  
\begin{align*} D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right) \text{ if } j \neq 0 \newline D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j} \text{ if } j = 0\end{align*}
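
The steps above can be sketched in plain Python for a sigmoid network of any depth. This is a minimal illustration, not an efficient implementation: the names `forward`, `cost` and `backprop` and the list-of-rows weight layout are assumptions made here, and `cost` is the unregularized cost, included only so the gradients can be checked numerically.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Return activations per layer; all but the last include the bias unit."""
    acts = []
    a = list(x)
    for theta in thetas:
        a = [1.0] + a
        acts.append(a)
        a = [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]
    acts.append(a)
    return acts

def cost(thetas, X, Y):
    """Unregularized cross-entropy cost (used here only for checking)."""
    total = 0.0
    for x, y in zip(X, Y):
        h = forward(thetas, x)[-1]
        total += sum(yk * math.log(hk) + (1 - yk) * math.log(1 - hk)
                     for hk, yk in zip(h, y))
    return -total / len(X)

def backprop(thetas, X, Y, lam=0.0):
    """Return the gradients D, one matrix per weight matrix in thetas."""
    m = len(X)
    Delta = [[[0.0] * len(row) for row in theta] for theta in thetas]
    for x, y in zip(X, Y):
        acts = forward(thetas, x)                        # step 1
        delta = [a - t for a, t in zip(acts[-1], y)]     # delta of output layer
        for l in range(len(thetas) - 1, -1, -1):
            a = acts[l]                                  # input to thetas[l], with bias
            for i, d in enumerate(delta):                # step 2.2: Delta += delta * a^T
                for j, aj in enumerate(a):
                    Delta[l][i][j] += d * aj
            if l > 0:                                    # step 2.1, bias entry dropped
                back = [sum(thetas[l][i][j] * delta[i] for i in range(len(delta)))
                        for j in range(1, len(a))]
                delta = [b * a[j] * (1 - a[j])
                         for j, b in zip(range(1, len(a)), back)]
    # Step 2.3: average, regularizing every column except the bias column.
    return [[[(Delta[l][i][j] + (lam * thetas[l][i][j] if j > 0 else 0.0)) / m
              for j in range(len(thetas[l][i]))]
             for i in range(len(thetas[l]))]
            for l in range(len(thetas))]
```

Looping over examples one at a time keeps the code close to the written steps; a vectorized version would process all m examples with matrix products instead.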

# backpropagation in practice  
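When the parameters are fitted with an off-the-shelf optimizer, the parameters and gradients have to be reshaped to match what the optimizer expects.  
Take Octave's fminunc() as an example: the parameters it optimizes and the gradient it requires are both n×1 vectors.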

## unrolling parameters  
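Convert the parameters from matrices to a vector,  
and unroll the gradients into a vector in the same way.  
D1, D2, D3 -> deltaVector  
thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]  
deltaVector = [ D1(:); D2(:); D3(:) ]  

To recover the matrices from the unrolled vector:  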
Theta1 = reshape(thetaVector(1:110),10,11)  
Theta2 = reshape(thetaVector(111:220),10,11)  
Theta3 = reshape(thetaVector(221:231),1,11)  
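
The same round trip can be sketched in Python. The helpers `unroll` and `reshape` below are hypothetical names; they use column-major order so that the result matches Octave's `Theta(:)` and `reshape`.

```python
def unroll(*matrices):
    """Concatenate matrices into one flat list, column by column."""
    vec = []
    for mat in matrices:
        rows, cols = len(mat), len(mat[0])
        vec.extend(mat[i][j] for j in range(cols) for i in range(rows))
    return vec

def reshape(vec, rows, cols):
    """Rebuild a rows x cols matrix from a column-major flat list."""
    return [[vec[j * rows + i] for j in range(cols)] for i in range(rows)]

A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8]]
v = unroll(A, B)
print(v)                        # column-major order of A, then B
print(reshape(v[0:6], 2, 3))    # recovers A
print(reshape(v[6:8], 1, 2))    # recovers B
```

The slice boundaries play the role of the index ranges `1:110`, `111:220` and `221:231` in the Octave lines above.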

## gradient checking  
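To verify that the backpropagation formulas have been derived and implemented correctly, compare the gradient against a numerical approximation:  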
\begin{align*} \dfrac{\partial}{\partial\Theta}J(\Theta) \approx \dfrac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon} \newline \dfrac{\partial}{\partial\Theta_j}J(\Theta) \approx \dfrac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}\end{align*}

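Computing the approximation in Octave:  
```octave
% theta: unrolled parameter vector with n entries; J: the cost function
epsilon = 1e-4;
gradApprox = zeros(size(theta));   % preallocate, same shape as theta
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;         % perturb only the i-th parameter
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon);
end;
```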

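1. Why compute the gradient with backpropagation rather than the numerical approximation?  
The numerical computation is too expensive: it needs two evaluations of J for every parameter.  
2. How should epsilon be chosen?  
A value that works well in practice: ϵ = 10^−4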
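
A small Python sketch of the two-sided difference follows, applied to a toy cost whose exact gradient is known. The names `numerical_gradient` and `toy_cost` are illustrative assumptions; in practice the cost would be the network's J and the result would be compared with the backpropagation gradient.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Two-sided finite-difference approximation of the gradient of J."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        plus[i] += epsilon            # perturb only the i-th parameter
        minus = list(theta)
        minus[i] -= epsilon
        grad.append((J(plus) - J(minus)) / (2 * epsilon))
    return grad

def toy_cost(theta):
    """Toy cost t0^2 + t1^3, whose exact gradient is [2*t0, 3*t1^2]."""
    return theta[0] ** 2 + theta[1] ** 3

approx = numerical_gradient(toy_cost, [1.0, 2.0])
exact = [2.0, 12.0]
print(max(abs(a - e) for a, e in zip(approx, exact)))   # should be tiny
```

Each call evaluates the cost twice per parameter, which is why this check is switched off once the analytic gradient has been validated.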

## random initialization  
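Initialize the parameters randomly, with values that are small and close to 0.

If all parameters are initialized to 0, consider a simple three-layer model: every hidden unit produces the same output, and the gradients computed for the hidden units (other than the bias) are all identical. After each update the input-to-hidden weights therefore remain the same for every hidden unit, so the hidden units keep producing identical outputs. This is highly redundant: the hidden layer effectively carries only one feature, and the network's performance degrades. <span class="mark">To break this symmetry ("symmetry breaking")</span>, initialize the parameters randomly.  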
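```octave
% Assume Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.
INIT_EPSILON = 0.12;  % assumed example value; any small positive constant works

% rand() is uniform on (0,1), so every entry lands in (-INIT_EPSILON, INIT_EPSILON)
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
```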
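
For comparison, a Python sketch of the same idea is shown below. The function names are hypothetical, and the default range uses the heuristic epsilon = sqrt(6)/sqrt(L_in + L_out), which is an assumption here rather than the only reasonable choice.

```python
import math
import random

def init_epsilon(l_in, l_out):
    """Heuristic range for the initial weights between two layers."""
    return math.sqrt(6) / math.sqrt(l_in + l_out)

def rand_initialize_weights(l_in, l_out, epsilon=None):
    """Return an l_out x (l_in + 1) matrix with entries in [-epsilon, epsilon].

    The extra column holds the weights of the bias unit.
    """
    if epsilon is None:
        epsilon = init_epsilon(l_in, l_out)
    return [[random.uniform(-epsilon, epsilon) for _ in range(l_in + 1)]
            for _ in range(l_out)]

theta1 = rand_initialize_weights(10, 10)   # 10 x 11, like Theta1 above
print(len(theta1), len(theta1[0]))
```

Because every entry is drawn independently, the hidden units start with different weights and so receive different gradients.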

# Summary of network training steps

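## pick a network architecture  
1. Decide how many layers are needed in total and how many units each hidden layer contains.
2. The number of input units equals the number of features.  
3. The number of output units equals the number of classes.  
4. Hidden layers usually all have the same number of units. More units generally give better performance at a higher computational cost. The count is usually equal to the input layer size or somewhat larger, 2 to 4 times.
5. Number of hidden layers: usually one. If there is more than one, each should contain the same number of units.  
6. <span class="burk">A 3-layer network already performs very well. Does adding more layers fail to improve performance, and if so, why? --金野</span>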

## training a neural network  
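1. Randomly initialize the weights (symmetry breaking). <span class="mark">epsilon = sqrt(6)/sqrt(s(l) + s(l+1)), as given in the exercise handout</span>  
2. Run forward propagation for every training example.  
3. Compute the cost function.  
4. Run backpropagation to compute the gradient. (Remember that both the gradient and the parameters are in vector form, 4.1)  
5. Use gradient checking (numerical computation) to confirm that backpropagation is correct, then turn gradient checking off. <span class="mark">epsilon = 10^-4</span>   
6. Use gradient descent or another optimization algorithm to minimize the cost function and obtain the optimized parameters.  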
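
Step 6 can be sketched as a plain gradient descent loop. The function below takes a callable that returns both the cost and the gradient for a flat parameter vector; the names `gradient_descent` and `toy_cost_and_grad`, the learning rate and the toy cost are all assumptions chosen for illustration.

```python
def gradient_descent(cost_and_grad, theta, alpha=0.1, iterations=200):
    """Repeatedly step against the gradient; returns the final parameters."""
    for _ in range(iterations):
        _, grad = cost_and_grad(theta)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

def toy_cost_and_grad(theta):
    """J = (t0 - 3)^2 + (t1 + 1)^2 together with its exact gradient."""
    cost = (theta[0] - 3) ** 2 + (theta[1] + 1) ** 2
    grad = [2 * (theta[0] - 3), 2 * (theta[1] + 1)]
    return cost, grad

theta = gradient_descent(toy_cost_and_grad, [0.0, 0.0])
print(theta)   # approaches the minimizer [3, -1]
```

For a real network, `cost_and_grad` would wrap forward propagation and backpropagation and return the unrolled gradient.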

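When running backpropagation, it is best to use a for loop that computes the gradient for each training example:  
```octave
for i = 1:m,
   % Perform forward propagation and backpropagation using example (x(i), y(i))
   % (get activations a(l) and delta terms d(l) for l = 2,...,L)
end
```
Of course, the for loop can be avoided by using a more advanced (vectorized) implementation.  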

# Octave implementation of the gradient and cost function  
```octave
function [J grad] = nnCostFunction(nn_params, ...  
                                   input_layer_size, ...  
                                   hidden_layer_size, ...  
                                   num_labels, ...  
                                   X, y, lambda)  
%NNCOSTFUNCTION Implements the neural network cost function for a two layer  
%neural network which performs classification  
%   [J grad] = NNCOSTFUNCTION(nn_params, hidden_layer_size, num_labels, ...  
%   X, y, lambda) computes the cost and gradient of the neural network. The  
%   parameters for the neural network are "unrolled" into the vector  
%   nn_params and need to be converted back into the weight matrices.   
%   
%   The returned parameter grad should be a "unrolled" vector of the  
%   partial derivatives of the neural network.  
%  

% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices  
% for our 2 layer neural network  
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...  
                 hidden_layer_size, (input_layer_size + 1));  

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...  
                 num_labels, (hidden_layer_size + 1));  

% Setup some useful variables  
m = size(X, 1);  
           
% You need to return the following variables correctly   
J = 0;  
Theta1_grad = zeros(size(Theta1));  
Theta2_grad = zeros(size(Theta2));  

% ====================== YOUR CODE HERE ======================  
% Instructions: You should complete the code by working through the  
%               following parts.  
%  
% Part 1: Feedforward the neural network and return the cost in the  
%         variable J. After implementing Part 1, you can verify that your  
%         cost function computation is correct by verifying the cost  
%         computed in ex4.m  
%  
% Part 2: Implement the backpropagation algorithm to compute the gradients  
%         Theta1_grad and Theta2_grad. You should return the partial derivatives of  
%         the cost function with respect to Theta1 and Theta2 in Theta1_grad and  
%         Theta2_grad, respectively. After implementing Part 2, you can check  
%         that your implementation is correct by running checkNNGradients  
%  
%         Note: The vector y passed into the function is a vector of labels  
%               containing values from 1..K. You need to map this vector into a   
%               binary vector of 1's and 0's to be used with the neural network  
%               cost function.  
%  
%         Hint: We recommend implementing backpropagation using a for-loop  
%               over the training examples if you are implementing it for the   
%               first time.  
%  
% Part 3: Implement regularization with the cost function and gradients.  
%  
%         Hint: You can implement this around the code for  
%               backpropagation. That is, you can compute the gradients for  
%               the regularization separately and then add them to Theta1_grad  
%               and Theta2_grad from Part 2.  
%  
% feedforward  
% forward propagation through the three-layer network  
a1 = [ones(m, 1), X];  
z2 = a1*Theta1';  
a2 = sigmoid(z2);  
a2 = [ones(m, 1), a2];  
z3 = a2*Theta2';  
a3 = sigmoid(z3);      % m by K  

% compute cost without regularization
% y_matrix: is there a quicker way to initialize it?
% convert the target labels 1,2,...,10 into 0/1 (one-hot) row vectors
y_matrix = zeros(m, num_labels);
for i = 1:m
  y_matrix(i, y(i)) = 1;
end
% compute the cost
J = -1/m * sum(sum(y_matrix.*log(a3) + (1-y_matrix).*log(1 - a3)));

% compute cost with regularization
% add the regularization term (the bias column theta0 is excluded)
J_reg = lambda/(2*m) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));
J += J_reg;

% backpropagation algorithm implement without regularization
% error backpropagation
delta1 = zeros(size(Theta1));
delta2 = zeros(size(Theta2));
% accumulate the gradient over the m training examples
for t = 1:m
  delta_3 = a3(t, :)' - y_matrix(t, :)';                      % K by 1
  delta_2 = Theta2'*delta_3.*sigmoidGradient([1, z2(t, :)]'); % (s2 + 1) by 1
  delta2 += delta_3*a2(t, :);                                 % K by (s2 + 1)
  delta1 += delta_2(2: end)*a1(t, :);                         % s2 by (n + 1)
end
% average the gradient
Theta1_grad = 1/m * delta1;
Theta2_grad = 1/m * delta2;

% backpropagation algorithm implement with regularization  
% add the gradient of the regularization term (bias column zeroed first)  
Theta1(:, 1) = 0;  
Theta1_grad += lambda/m * Theta1;  
Theta2(:, 1) = 0;  
Theta2_grad += lambda/m * Theta2;  


% -------------------------------------------------------------  

% =========================================================================  

% Unroll gradients  
% unroll into vector form
grad = [Theta1_grad(:) ; Theta2_grad(:)];  


end  

```  

# The whole process from training to testing

```octave
% load data  
% loads the data X, y
load('ex4data1.mat');  

% define parameters  
input_layer_size = 400;  
hidden_layer_size = 25;  
num_labels = 10;  
theta1 = randInitializeWeights(input_layer_size, hidden_layer_size); % 25 by 401  
theta2 = randInitializeWeights(hidden_layer_size, num_labels);  % 10 by 26  
initial_nn_params = [theta1(:); theta2(:)];  

% fmincg optimal parameters  
%  After you have completed the assignment, change the MaxIter to a larger  
%  value to see how more training helps.  
options = optimset('MaxIter', 50);  
%  You should also try different values of lambda  
lambda = 1;  
% Create "short hand" for the cost function to be minimized  
costFunction = @(p) nnCostFunction(p, ...  
                                   input_layer_size, ...  
                                   hidden_layer_size, ...  
                                   num_labels, X, y, lambda);  
% Now, costFunction is a function that takes in only one argument (the  
% neural network parameters)  
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);  

% predict   
theta1 = reshape(nn_params(1 : (input_layer_size+1)*hidden_layer_size ), [hidden_layer_size, input_layer_size+1]);
theta2 = reshape(nn_params((input_layer_size+1)*hidden_layer_size+1 : end), [num_labels, hidden_layer_size  +1]); 
p = predict(theta1, theta2, X);  
correct = size(find(p == y), 1)/size(X, 1);  
```  