## Two Gradient Descent Methods
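> **Batch Gradient Descent: every sample takes part in each update, so each step is computationally expensive.**

> **Stochastic Gradient Descent: each update uses one randomly chosen sample, so a single step is not guaranteed to move in a direction that decreases the loss, let alone the direction of steepest decrease.**

> **The learning rate (step size) should shrink as the number of iterations grows, because smaller steps are preferable close to the optimum.**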

<img src='./picture/6-1.png' style='float:middle'>

<img src='./picture/6-2.png' style='float:middle'>
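
As a small illustration of a shrinking step size, the sketch below evaluates the schedule `t0 / (t + t1)` with `t0 = 5` and `t1 = 50`, the form and constants used by the `sgd` function in this notebook. Making `t0` and `t1` keyword arguments is an assumption made here for readability.

```python
def learning_rate(t, t0=5.0, t1=50.0):
    """Decaying step size eta(t) = t0 / (t + t1)."""
    return t0 / (t + t1)

# The step size starts at t0 / t1 and shrinks as the iteration count t grows.
for t in (0, 10, 100, 1000):
    print(t, learning_rate(t))
```

The constant `t1` keeps the first steps from being too large, while `t0` scales the whole schedule.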

In [1]:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(666)
x = 2 * np.random.random(size=100)
y = x * 4. + 3. + np.random.normal(size=100)
X = x.reshape(-1, 1)  # 100 rows, 1 column; makes it easy to extend to several features

In [2]:
def J(theta, X_b, y):  # X_b already has a leading column of ones
    try:
        return np.sum((y - X_b.dot(theta))**2) / len(X_b)
    except Exception:
        return float('inf')  # treat a failed evaluation as an infinite loss
    
def dJ(theta, X_b, y):
    res = np.empty(len(theta))
    res[0] = np.sum(X_b.dot(theta) - y)
    for i in range(1, len(theta)):
        res[i] = (X_b.dot(theta) - y).dot(X_b[:,i])
    return res * 2 / len(X_b)

def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-10):
    theta = initial_theta
    i_iter = 0
    
    while i_iter < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        
        # Stop once the loss changes by less than epsilon between two steps
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break
            
        i_iter += 1
        
    return theta
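
The loop in `dJ` fills in the gradient one component at a time. Assuming the usual matrix form of the mean-squared-error gradient, `2/m * X_b.T (X_b theta - y)`, the same values can be obtained with a single matrix product. The names `dJ_loop` and `dJ_vectorized` and the random test data are illustrative and not part of the original notebook.

```python
import numpy as np

def dJ_loop(theta, X_b, y):
    # Component-by-component gradient, mirroring dJ from the cell above
    res = np.empty(len(theta))
    res[0] = np.sum(X_b.dot(theta) - y)
    for i in range(1, len(theta)):
        res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
    return res * 2 / len(X_b)

def dJ_vectorized(theta, X_b, y):
    # Same gradient as one matrix product: 2/m * X_b^T (X_b theta - y)
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(X_b)

# Small random problem, used only to compare the two versions
rng = np.random.default_rng(0)
X_b_demo = np.hstack([np.ones((20, 1)), rng.random((20, 2))])
y_demo = rng.random(20)
theta_demo = rng.random(3)

print(np.allclose(dJ_loop(theta_demo, X_b_demo, y_demo),
                  dJ_vectorized(theta_demo, X_b_demo, y_demo)))
```

The first column of `X_b` is all ones, so the first row of the product reproduces the `res[0]` special case without extra code.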

In [3]:
%%time
X_b = np.hstack([np.ones((len(x), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01
theta = gradient_descent(X_b, y, initial_theta, eta)

Wall time: 64.1 ms


In [4]:
theta

array([3.0239203 , 4.00498585])

## Stochastic Gradient Descent
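> **Because a single update is not guaranteed to decrease the loss, the stopping criterion changes: the loop runs for a fixed number of iterations instead of checking how much the loss changed.**

> **Batch gradient descent uses the loss over all samples; stochastic gradient descent uses the loss of a single sample.**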

In [5]:
def dJ_sgd(theta, X_b_i, y_i):  # takes a single sample (one row of X_b), so no division by len(y)
    return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2.
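
One way to see how the single-sample gradient relates to the batch gradient is to average `dJ_sgd` over every sample: the mean should match the full gradient, which is why a randomly chosen sample points in the right direction on average. The sketch below checks this numerically. The name `dJ_batch` and the random data are assumptions made for the illustration.

```python
import numpy as np

def dJ_sgd(theta, X_b_i, y_i):
    # Gradient of the squared error of one sample (one row of X_b)
    return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2.

def dJ_batch(theta, X_b, y):
    # Gradient of the mean squared error over all samples
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(X_b)

# Small random problem, used only for the comparison
rng = np.random.default_rng(0)
X_b_demo = np.hstack([np.ones((50, 1)), rng.random((50, 1))])
y_demo = rng.random(50)
theta_demo = np.array([0.5, -1.0])

per_sample = np.array([dJ_sgd(theta_demo, X_b_demo[i], y_demo[i])
                       for i in range(len(X_b_demo))])
mean_grad = per_sample.mean(axis=0)

print(np.allclose(mean_grad, dJ_batch(theta_demo, X_b_demo, y_demo)))
```

Individual rows of `per_sample` can differ considerably from `mean_grad`, which is the noise that the decaying learning rate is meant to damp.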

In [6]:
def sgd(X_b, y, initial_theta, n_iters):
    # Constants of the decaying learning-rate schedule eta(t) = t0 / (t + t1)
    t0 = 5
    t1 = 50
    
    def learning_rate(t):
        return t0 / (t + t1)
    
    theta = initial_theta
    for cur_iter in range(n_iters):
        rand_i = np.random.randint(len(X_b))
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])   
        theta = theta - learning_rate(cur_iter) * gradient
        
    return theta

In [9]:
X_b = np.hstack([np.ones((len(x), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
theta = sgd(X_b, y, initial_theta, n_iters=len(X_b)//3)  # only a third as many updates as there are samples

In [10]:
theta

array([2.93924254, 3.64381672])
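
With only `len(X_b)//3` updates, most samples are never visited. A common variant, sketched here as an assumption rather than as part of the original notebook, fixes the number of passes (epochs) over the data and shuffles the sample order in each pass so that every sample is used once per epoch. The function name `sgd_epochs`, its parameters, and the use of `np.random.default_rng` are all hypothetical choices for this illustration.

```python
import numpy as np

def sgd_epochs(X_b, y, initial_theta, n_epochs=5, t0=5.0, t1=50.0, seed=0):
    """SGD that visits every sample once per epoch, in shuffled order."""
    rng = np.random.default_rng(seed)
    theta = initial_theta.copy()
    m = len(X_b)
    for epoch in range(n_epochs):
        order = rng.permutation(m)          # new random order for each pass
        for k, i in enumerate(order):
            t = epoch * m + k               # total number of updates so far
            error = X_b[i].dot(theta) - y[i]
            gradient = 2.0 * error * X_b[i] # single-sample gradient
            theta = theta - t0 / (t + t1) * gradient
    return theta

# Synthetic data in the same shape as the notebook: y = 4x + 3 + noise
rng = np.random.default_rng(666)
x_demo = 2 * rng.random(100)
y_demo = 4.0 * x_demo + 3.0 + rng.normal(size=100)
X_b_demo = np.hstack([np.ones((100, 1)), x_demo.reshape(-1, 1)])

theta_demo = sgd_epochs(X_b_demo, y_demo, np.zeros(2))
print(theta_demo)
```

The learning rate is indexed by the total update count rather than by the epoch, so it keeps shrinking across passes in the same way as in `sgd`.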