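# Log-loss and mean squared error

Earlier, Luis covered the log-loss function. Many other error functions can also be used in neural networks. Here we introduce another one: the mean squared error. As the name suggests, it is the mean of the squared differences between the predictions and the labels. Next we go into more detail, then implement backpropagation on the student admissions dataset.

What excites me is that we will implement this algorithm efficiently with numpy matrix multiplication.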

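# 1. Learning weights

You have seen how to use perceptrons to build AND and XOR operations, but their weights were set by hand. What if you want to perform an operation, such as predicting college admissions, but don't know the correct weights? You need to learn the weights from examples, then use those weights to make predictions.

To see how we will find these weights, start from the goal. We want the network's predictions to be as close as possible to the true values. To measure this, we need a metric for how wrong the predictions are, the `error`. A common metric is the sum of the squared errors (SSE):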

<img src="../../sources/img/NeuralNetworkIntro/implement-gradient-descent-1.png" alt="" width="1000" height="1200" />
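
As a small illustration of the metric, here is a minimal sketch of the SSE in numpy. The helper name `sse`, the sample values, and the factor of 1/2 are assumptions made for this example; the 1/2 is a common convention that simplifies the derivative later and may or may not match the figure above.

```python
import numpy as np

def sse(targets, predictions):
    """Sum of squared errors, with an assumed conventional 1/2 factor."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return 0.5 * np.sum((targets - predictions) ** 2)

# Hypothetical targets and network predictions for three records
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.5, 0.5, 1.0])

print(sse(y, y_hat))
```

Squaring keeps every term non-negative and penalizes large misses more heavily than small ones.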


# 2. Gradient descent

In [1]:
from IPython.display import HTML

HTML('<iframe src="https://www.youtube.com/embed/29PmNG7fuuM" width="800" height="450" frameborder="0" allowfullscreen></iframe>')


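As Luis said, with gradient descent we reach the goal through many small steps. In this case, we want to change the weights step by step to reduce the error. Borrowing the earlier analogy, the error is a mountain and we want to get to the bottom. The fastest way down is the steepest direction, so we should likewise look for the direction that reduces the error the most. We can find this direction by computing the gradient of the squared error.

Gradient is another term for rate of change or slope. If you need to review this concept, see Khan Academy's [explanation](https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/gradient-and-directional-derivatives/v/gradient).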

<img src="../../sources/img/NeuralNetworkIntro/implement-gradient-descent-2.png" alt="" width="1000" height="1200" />
<img src="../../sources/img/NeuralNetworkIntro/implement-gradient-descent-3.png" alt="" width="600" height="300" />

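The gradient is simply the generalization of the derivative to functions of more than one variable. We can use calculus to find the gradient at any point of the error function, which depends on the input weights. In the next section you will see how the gradient descent step is derived.

Below I've plotted an example of the error of a neural network with two inputs and, accordingly, two weights. You can read it like a topographic map: points on the same contour line have the same error, and darker lines correspond to larger errors.

At each step, you compute the error and the gradient, then use them to decide how to change the weights. Repeating this process eventually finds weights close to the minimum of the error function, the black dot in the middle.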

<img src="../../sources/img/NeuralNetworkIntro/implement-gradient-descent-4.png" alt="" width="600" height="300" />
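
The loop described above can be sketched on a toy problem. The error surface below is a hypothetical two-weight bowl chosen for illustration, not the network's actual error, and the names `error` and `gradient` are made up for this example.

```python
import numpy as np

def error(w):
    # Hypothetical bowl-shaped error surface with its minimum at (1.0, -0.5)
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2

def gradient(w):
    # Partial derivatives of error() with respect to each weight
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

weights = np.array([-1.0, 1.5])  # arbitrary starting point
learnrate = 0.1

for step in range(100):
    # Move a small step against the gradient, i.e. downhill
    weights = weights - learnrate * gradient(weights)

print(weights, error(weights))
```

With a small enough learning rate, each step lowers the error a little, and the weights settle near the bottom of the bowl.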


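# Caveats

Because the weights go wherever the gradient takes them, they can end up where the error is low but not the lowest. Such a point is called a local minimum. If the weights are initialized with the wrong values, gradient descent can lead them into a local minimum, as illustrated below.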

<img src="../../sources/img/NeuralNetworkIntro/implement-gradient-descent-5.png" alt="" width="600" height="300" />

There are methods that help avoid this, such as using [momentum](http://sebastianruder.com/optimizing-gradient-descent/index.html#momentum).
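
For a rough idea of what momentum does, here is a minimal sketch of the classical momentum update on a one-dimensional toy error, E(w) = w². The toy error, the variable names, and the hyperparameter values are assumptions for illustration; the example only shows the update rule and does not demonstrate escaping a local minimum.

```python
def gradient(w):
    # Gradient of the hypothetical error E(w) = w**2
    return 2.0 * w

w = 5.0            # arbitrary starting weight
velocity = 0.0     # running, decaying sum of past steps
learnrate = 0.1
momentum = 0.9     # fraction of the previous step carried over

for step in range(500):
    # Keep part of the previous step, then add the new downhill step
    velocity = momentum * velocity - learnrate * gradient(w)
    w = w + velocity

print(w)
```

Because part of each previous step carries over, the weight can keep moving through small bumps in the error surface instead of stopping at the first flat spot, though this is not guaranteed.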

# 3. Gradient descent: the math

In [2]:

HTML('<iframe src="https://www.youtube.com/embed/7sxA5Ap8AWM" width="800" height="450" frameborder="0" allowfullscreen></iframe>')


If you are not familiar with this material, see Khan Academy's course on [multivariable calculus](https://www.khanacademy.org/math/multivariable-calculus).
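
For readers who prefer to see the result written out, the weight step used in the code of the next section can be derived roughly as follows. This is a sketch under stated assumptions: a single output unit with activation $f$, one training record, and a squared error with a factor of 1/2. The symbols ($\eta$ for the learning rate, $\delta$ for the error term, $h$ for the linear combination) follow the comments in that code.

```latex
% Squared error for one record, with output \hat{y} = f(h)
E = \tfrac{1}{2}\,(y - \hat{y})^2,
\qquad \hat{y} = f(h),
\qquad h = \sum_i w_i x_i

% Chain rule with respect to a single weight w_i
\frac{\partial E}{\partial w_i}
  = -(y - \hat{y})\,\frac{\partial \hat{y}}{\partial w_i}
  = -(y - \hat{y})\, f'(h)\, x_i

% Error term and gradient descent step
\delta = (y - \hat{y})\, f'(h),
\qquad
\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i} = \eta\,\delta\,x_i

% For the sigmoid activation
f(h) = \frac{1}{1 + e^{-h}},
\qquad
f'(h) = f(h)\,\bigl(1 - f(h)\bigr)
```

The minus sign from the chain rule cancels the minus sign of the descent step, which is why the update adds $\eta\,\delta\,x_i$ to each weight.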

# 4. Gradient descent: the code

<img src="../../sources/img/NeuralNetworkIntro/implement-gradient-descent-6.png" alt="" width="1200" height="300" />

```
import numpy as np

# Defining the sigmoid function for activations
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Input data
x = np.array([0.1, 0.3])
# Target
y = 0.2
# Input to output weights
weights = np.array([-0.8, 0.5])

# The learning rate, eta in the weight step equation
learnrate = 0.5

# The linear combination performed by the node (h in f(h) and f'(h))
h = x[0]*weights[0] + x[1]*weights[1]
# or h = np.dot(x, weights)

# The neural network output (y-hat)
nn_output = sigmoid(h)

# Output error (y - y-hat)
error = y - nn_output

# Output gradient (f'(h))
output_grad = sigmoid_prime(h)

# Error term (lowercase delta)
error_term = error * output_grad

# Gradient descent step
del_w = [ learnrate * error_term * x[0],
          learnrate * error_term * x[1]]
# or del_w = learnrate * error_term * x
```

### Exercise

In [4]:
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

def sigmoid_prime(x):
    """
    Derivative of the sigmoid function
    """
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

### Calculate one gradient descent step for each weight
### Note: Some steps have been consolidated, so there are
###       fewer variable names than in the above sample code

# TODO: Calculate the node's linear combination of inputs and weights
h = None

# TODO: Calculate output of neural network
nn_output = None

# TODO: Calculate error of neural network
error = None

# TODO: Calculate the error term
#       Remember, this requires the output gradient, which we haven't
#       specifically added a variable for.
error_term = None

# TODO: Calculate change in weights
del_w = None

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)

Neural Network output:
None
Amount of Error:
None
Change in Weights:
None


### Correct answer

In [6]:
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

def sigmoid_prime(x):
    """
    Derivative of the sigmoid function
    """
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

### Calculate one gradient descent step for each weight
### Note: Some steps have been consolidated, so there are
###       fewer variable names than in the above sample code

# TODO: Calculate the node's linear combination of inputs and weights
h = np.dot(x, w)

# TODO: Calculate output of neural network
nn_output = sigmoid(h)

# TODO: Calculate error of neural network
error = y - nn_output

# TODO: Calculate the error term
#       Remember, this requires the output gradient, which we haven't
#       specifically added a variable for.
error_term = error * sigmoid_prime(h)
# Note: The sigmoid_prime function calculates sigmoid(h) twice,
#       but you've already calculated it once. You can make this
#       code more efficient by calculating the derivative directly
#       rather than calling sigmoid_prime, like this:
# error_term = error * nn_output * (1 - nn_output)

# TODO: Calculate change in weights
del_w = learnrate * error_term * x

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)

Neural Network output:
0.6899744811276125
Amount of Error:
-0.1899744811276125
Change in Weights:
[-0.02031869 -0.04063738 -0.06095608 -0.08127477]
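
One way to gain confidence in the result above is a numerical gradient check. The sketch below recomputes the analytic weight step and compares it with a central finite difference. It assumes the error being minimized is the half squared error, E = ½(y − ŷ)²; the helper `half_squared_error` and the step size `eps` are introduced only for this check.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def half_squared_error(w, x, y):
    # Assumed error function: E = 1/2 * (y - y_hat)^2
    return 0.5 * (y - sigmoid(np.dot(x, w))) ** 2

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = 0.5
w = np.array([0.5, -0.5, 0.3, 0.1])

# Analytic step, same formula as the answer above
output = sigmoid(np.dot(x, w))
del_w = learnrate * (y - output) * output * (1 - output) * x

# Numerical step: -learnrate * dE/dw_i via central differences
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    grad_i = (half_squared_error(w + step, x, y)
              - half_squared_error(w - step, x, y)) / (2 * eps)
    numeric[i] = -learnrate * grad_i

print(np.allclose(del_w, numeric, atol=1e-8))
```

If the two disagree, the usual suspects are a sign error in `error` or a missing derivative factor in the error term.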


# 5. Implementing gradient descent

<img src="../../sources/img/NeuralNetworkIntro/implement-gradient-descent-7.png" alt="" width="1200" height="300" />
<img src="../../sources/img/NeuralNetworkIntro/implement-gradient-descent-8.png" alt="" width="1200" height="300" />

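Our goal is to predict whether a student will be admitted to graduate school based on these features. Here we will use a network with a single output layer, with sigmoid as the activation function.

## Data cleanup

You might think there are three input units, but we actually need to transform the data first. `rank` is a categorical feature: its numbers do not encode any relative value. Rank 2 is not twice rank 1, and rank 3 is not 1.5 times rank 2. So we need to encode `rank` with [dummy variables](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)), splitting it into 4 new columns of 0s and 1s. A row with rank 1 has a 1 in the `rank_1` column and 0 in the other three; a row with rank 2 has a 1 in the `rank_2` column and 0 in the other three, and so on.

We also need to standardize the `GRE` and `GPA` data, that is, scale them to have a mean of 0 and a standard deviation of 1. This step is necessary because the `sigmoid` function squashes very large and very small inputs. The gradient at very large or very small inputs is close to 0, which means the gradient descent step will also be close to 0. Since the GRE and GPA values are fairly large, we would have to be very careful about how we initialize the weights, or the gradient descent steps would vanish and the network would not train. If we standardize the data instead, initializing the weights becomes much easier.

This is only a brief introduction; you will learn how to preprocess data later. If you want to see how I did it, check the `data_prep.py` file in the programming exercise below.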
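
To make the encoding concrete, here is a minimal numpy sketch of turning rank values into four 0/1 columns. The sample `ranks` array is hypothetical. The exercise code itself uses `pd.get_dummies`; indexing an identity matrix is just one compact way to illustrate the same idea.

```python
import numpy as np

# Hypothetical rank values for five applicants (ranks run from 1 to 4)
ranks = np.array([3, 1, 4, 2, 1])
n_ranks = 4

# Row k of the identity matrix has a single 1 in column k,
# so indexing with (rank - 1) picks the matching dummy row
one_hot = np.eye(n_ranks, dtype=int)[ranks - 1]

# Columns correspond to rank_1, rank_2, rank_3, rank_4
print(one_hot)
```

Each row contains exactly one 1, so no ordering or distance between ranks is implied.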
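
The effect of standardization on the sigmoid gradient can be sketched as follows. The GRE scores below are made-up values, and `sigmoid_grad` is a helper defined only for this example. The sample standard deviation (`ddof=1`) is used because that is what the pandas `.std()` call in the exercise code computes by default.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid, f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1 - s)

# Hypothetical raw GRE scores
gre = np.array([380.0, 660.0, 800.0, 640.0, 520.0])

# Standardize: subtract the mean, divide by the sample standard deviation
gre_std = (gre - gre.mean()) / gre.std(ddof=1)

print(sigmoid_grad(gre))      # saturated: gradients are numerically zero
print(sigmoid_grad(gre_std))  # usable gradients
```

With raw scores as inputs the sigmoid is fully saturated, so a weight update proportional to this gradient would not move the weights at all.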

<img src="../../sources/img/NeuralNetworkIntro/example-data.png" alt="" width="1200" height="300" />

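Now that the data is ready, we see there are six input features: `gre`, `gpa`, and the four `rank` dummy variables.

### Mean squared error

Here we make one small change to how we calculate the error. Instead of the SSE, we use the mean of the squared errors (MSE). Now that we are working with a lot of data, summing up all the weight updates can produce very large updates that keep gradient descent from converging. To compensate, you would need a very small learning rate. Instead, we can divide by the number of data points, m, to take the average. This way, no matter how much data we have, our learning rate will typically be in the range of 0.01 to 0.001. We use the MSE (shown below) to compute the gradient, and the result is the same as before, only averaged instead of summed.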

<img src="../../sources/img/NeuralNetworkIntro/implement-gradient-descent-9.png" alt="" width="1200" height="300" />
<img src="../../sources/img/NeuralNetworkIntro/implement-gradient-descent-10.png" alt="" width="1200" height="300" />
<img src="../../sources/img/NeuralNetworkIntro/implement-gradient-descent-11.png" alt="" width="1200" height="300" />
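
Since the averaged update is a sum over records, it can also be written as a single matrix product. The sketch below compares a per-record loop with a vectorized version on random data. The data, shapes, and names (`X`, `step_loop`, `step_vec`) are assumptions for illustration and do not come from the admissions dataset.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))       # hypothetical: 8 records, 3 features
y = rng.integers(0, 2, size=8)    # hypothetical binary targets
w = rng.normal(scale=1 / 3**0.5, size=3)
learnrate = 0.5

# Loop version: accumulate the weight change record by record
del_w = np.zeros_like(w)
for x_i, y_i in zip(X, y):
    out = sigmoid(np.dot(x_i, w))
    del_w += (y_i - out) * out * (1 - out) * x_i
step_loop = learnrate * del_w / len(X)

# Vectorized version: one matrix product for all records at once
out = sigmoid(X @ w)
error_term = (y - out) * out * (1 - out)
step_vec = learnrate * (error_term @ X) / len(X)

print(np.allclose(step_loop, step_vec))
```

Both versions produce the same averaged step; the vectorized one avoids the Python-level loop over records.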


In [9]:
import numpy as np
import pandas as pd


admissions = pd.read_csv('../../sources/files/NeuralNetworkIntro/implement-gradient-descent-binary.csv')

# Make dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank')], axis=1)
data = data.drop('rank', axis=1)

# Standardize features
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data.loc[:,field] = (data[field]-mean)/std
    
# Split off random 10% of the data for testing
np.random.seed(42)
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
data, test_data = data.loc[sample], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']



def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# TODO: We haven't provided the sigmoid_prime function like we did in
#       the previous lesson to encourage you to come up with a more
#       efficient solution. If you need a hint, check out the comments
#       in solution.py from the previous lecture.

# Use the same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

for e in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target

        # Note: We haven't included the h variable from the previous
        #       lesson. You can add it if you want, or you can calculate
        #       the h together with the output

        # TODO: Calculate the output
        output = None

        # TODO: Calculate the error
        error = None

        # TODO: Calculate the error term
        error_term = None

        # TODO: Calculate the change in weights for this sample
        #       and add it to the total weight change
        del_w += 0

    # TODO: Update weights using the learning rate and the average change in weights
    weights += 0

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss


# Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))


('Train loss: ', 0.2641867982508648)
('Train loss: ', 0.2641867982508648)
('Train loss: ', 0.2641867982508648)
('Train loss: ', 0.2641867982508648)
('Train loss: ', 0.2641867982508648)
('Train loss: ', 0.2641867982508648)
('Train loss: ', 0.2641867982508648)
('Train loss: ', 0.2641867982508648)
('Train loss: ', 0.2641867982508648)
('Train loss: ', 0.2641867982508648)
Prediction accuracy: 0.475


## Solution

In [None]:
import numpy as np
from data_prep import features, targets, features_test, targets_test


def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# TODO: We haven't provided the sigmoid_prime function like we did in
#       the previous lesson to encourage you to come up with a more
#       efficient solution. If you need a hint, check out the comments
#       in solution.py from the previous lecture.

# Use the same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

for e in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target

        # Activation of the output unit
        #   Notice we multiply the inputs and the weights here 
        #   rather than storing h as a separate variable 
        output = sigmoid(np.dot(x, weights))

        # The error, the target minus the network output
        error = y - output

        # The error term
        #   Notice we calculate f'(h) here instead of defining a separate
        #   sigmoid_prime function. This just makes it faster because we
        #   can re-use the result of the sigmoid function stored in
        #   the output variable
        error_term = error * output * (1 - output)

        # The gradient descent step, the error times the gradient times the inputs
        del_w += error_term * x

    # Update the weights here. The learning rate times the 
    # change in weights, divided by the number of records to average
    weights += learnrate * del_w / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss


# Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))