# Ordinary Least Squares Regression
---
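The most standard linear model is "ordinary least squares regression", usually just called "linear regression". It places no additional constraints on `coef_`, so when the number of features is large relative to the number of samples it becomes ill-behaved and the model overfits.
$$\min_{\theta} \| X\theta - y \|_2^2$$
Setting the derivative of this objective with respect to $\theta$ to zero gives the optimal coefficients $\hat\theta = (X^TX)^{-1}X^Ty$ (assuming $X^TX$ is invertible).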
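
As a quick sanity check of the closed-form solution, the sketch below solves the normal equations with NumPy and compares the result with `LinearRegression`. The synthetic data and the names `X_demo`, `y_demo`, and `theta_true` are illustrative assumptions, not part of the notebook's dataset; `fit_intercept=False` is set so that scikit-learn minimizes the same objective as the formula.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: 50 samples, 3 features, small Gaussian noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
theta_true = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ theta_true + 0.1 * rng.normal(size=50)

# Solve (X^T X) theta = X^T y; solving the system is more stable
# than forming the inverse explicitly
theta_hat = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# No intercept, so the model matches the objective ||X theta - y||^2
ols = LinearRegression(fit_intercept=False).fit(X_demo, y_demo)

print("normal equations :", theta_hat)
print("LinearRegression :", ols.coef_)
```

The two coefficient vectors should agree to numerical precision. In practice a least-squares solver is preferred over the normal equations when $X^TX$ is badly conditioned.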

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
np.set_printoptions(precision=4, suppress=True, threshold=16)
%matplotlib inline

**make_regression**
```
Generate a random regression problem.
    n_samples=100,
    n_features=100,
    n_informative=10,
    n_targets=1,
    bias=0.0,
    effective_rank=None,
    tail_strength=0.5,
    noise=0.0,
    shuffle=True,
    coef=False,
    random_state=None,
```
```
Returns
-------
X : array of shape [n_samples, n_features]
    The input samples.

y : array of shape [n_samples] or [n_samples, n_targets]
    The output values.

coef : array of shape [n_features] or [n_features, n_targets], optional
    The coefficient of the underlying linear model. It is returned only if
    coef is True.
```

In [None]:
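# Only 10 of the 30 features are informative; noise is added
# coef=True makes make_regression also return the true coefficients
X, y, true_coef = make_regression(n_samples=200, n_features=30, n_informative=10,
                                  noise=100, coef=True, random_state=5)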
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5, train_size=60, test_size=140)
print(X_train.shape)
print(y_train.shape)

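**R2 score, which computes the [coefficient of determination](https://baike.baidu.com/item/%E5%8F%AF%E5%86%B3%E7%B3%BB%E6%95%B0)**
$$R^2(y, \hat y) = 1 - \frac {\sum_{i=1}^{n\_sample}(y_i - \hat y_i)^2}{\sum_{i=1}^{n\_sample}(y_i - \bar y)^2}$$
where $y_i$ is the true value of the $i$-th sample, $\hat y_i$ is its predicted value, and  
$\bar y = \frac 1 {n\_sample}\sum_{i=1}^{n\_sample} y_i$ is the mean of the true values.  
The best possible R2 score is 1.0. It can also be negative, because a model can fit arbitrarily worse than always predicting $\bar y$.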
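
To make the formula concrete, here is a minimal sketch that computes $R^2$ by hand and compares it with `sklearn.metrics.r2_score`. The small arrays `y_true`, `y_pred`, and `y_bad` are made-up illustration values.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares and total sum of squares from the formula
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# r2_score takes the true values first, then the predictions
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)

# A constant prediction far from the mean gives a negative score
y_bad = np.full_like(y_true, 100.0)
r2_bad = r2_score(y_true, y_bad)
print(r2_bad)
```

Note that `r2_score` is not symmetric in its arguments: the denominator is computed from the first argument, so swapping the two changes the result.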

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
print("R^2 on training set: %f" % lr.score(X_train, y_train))
print("R^2 on test set: %f" % lr.score(X_test, y_test))

In [None]:
from sklearn.metrics import r2_score
# R2 score of the true linear model on the whole dataset
# r2_score expects (y_true, y_pred) in that order
r2_score(y, np.dot(X, true_coef))

In [None]:
plt.figure(figsize=(10, 5))
# Sort the coefficients from largest to smallest and plot them for comparison
coefficient_sorting = np.argsort(true_coef)[::-1]
plt.plot(true_coef[coefficient_sorting], "o", label="true")
plt.plot(lr.coef_[coefficient_sorting], "o", label="linear regression")

plt.legend()

**Learning curve**
---
A learning curve shows the validation and training score of an estimator for varying numbers of training samples.

In [None]:
from sklearn.model_selection import learning_curve

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y, train_sizes=np.linspace(.1, 1, 5), cv=5)
train_sizes

In [None]:
train_scores

In [None]:
valid_scores

In [None]:
def plot_learning_curve(est, X, y):
    plt.figure()
    # Use the estimator passed in, not a hard-coded LinearRegression
    train_sizes, train_scores, test_scores = learning_curve(
        est, X, y, train_sizes=np.linspace(.1, 1, 20), cv=5)
    estimator_name = est.__class__.__name__
    # Training score as a function of training set size
    plt.plot(train_sizes, train_scores.mean(axis=1), '--', label=f"train scores {estimator_name}")
    plt.plot(train_sizes, test_scores.mean(axis=1), '-', label=f"test scores {estimator_name}")
    plt.xlabel("Training set size")
    plt.legend(loc='best')
    plt.ylim(-0.1, 1.1)

In [None]:
plot_learning_curve(LinearRegression(), X, y)
# Scores obtained with different training set sizes

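**Ordinary least squares complexity**

This method computes the least squares solution using the singular value decomposition of X. If X is a matrix of shape (n_samples, n_features) with $n_{samples} \geq n_{features}$, the cost of the method is $O(n_{samples} n_{features}^2)$.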
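
The SVD route can be sketched directly in NumPy. This is an illustration under the assumption that X has full column rank (no zero singular values), not a copy of scikit-learn's internals; `X_demo` and `y_demo` are made-up data.

```python
import numpy as np

# Hypothetical data with n_samples >= n_features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = rng.normal(size=200)

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X_demo, full_matrices=False)

# Least squares solution: theta = V @ diag(1/s) @ U^T @ y
theta_svd = Vt.T @ ((U.T @ y_demo) / s)

# Reference solution from NumPy's least squares solver
theta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(theta_svd, theta_lstsq))
```

Computing the thin SVD of an (n_samples, n_features) matrix dominates the cost, which is where the $O(n_{samples} n_{features}^2)$ figure comes from.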

# Ridge Regression (L2 Penalty)
---
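The ridge estimator is a simple regularization (called the l2 penalty) of ordinary LinearRegression. In particular, it has the advantage of being computationally no more expensive than the ordinary least squares estimate.
$$\min_{\theta} \| X\theta - y \|_2^2 + \alpha \| \theta \|_2^2$$
Here $\alpha \geq 0$ is a complexity parameter that controls the amount of shrinkage: the larger the value of $\alpha$, the greater the shrinkage and the more robust the coefficients become to collinearity.

The optimal solution is $\hat \theta = (X^TX+\alpha I)^{-1}X^Ty$, which is a function of $\alpha$.  
**Ridge complexity**
This method has the same order of complexity as `ordinary least squares`.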
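
The closed-form ridge solution can be checked against `Ridge` in the same way as for OLS. The sketch below uses made-up data (`X_demo`, `y_demo`) and sets `fit_intercept=False` so that scikit-learn minimizes the penalized objective without an intercept term.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical toy data: 100 samples, 4 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(size=100)
alpha = 1.0

# Closed form: solve (X^T X + alpha I) theta = X^T y
gram = X_demo.T @ X_demo
theta_ridge = np.linalg.solve(gram + alpha * np.eye(4), X_demo.T @ y_demo)

# Same objective in scikit-learn (no intercept)
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo)

# Unpenalized solution for comparison
theta_ols = np.linalg.solve(gram, X_demo.T @ y_demo)

print("closed form :", theta_ridge)
print("Ridge       :", ridge.coef_)
print("norms (ridge, ols):", np.linalg.norm(theta_ridge), np.linalg.norm(theta_ols))
```

For any $\alpha > 0$ the ridge coefficient vector has a smaller Euclidean norm than the OLS one, which is the shrinkage described above.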

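Let's generate a dataset with low effective rank (not full rank) to compare ridge regression with linear regression. The rank of a matrix is the number of linearly independent rows (equivalently, columns); an $m \times n$ matrix has full rank when that number equals $\min(m,n)$.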
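
A small illustration of rank deficiency, using `np.linalg.matrix_rank`. The matrices `A` and `B` are made up for this example; `B` is built so that its third column is an exact linear combination of the first two.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random 6x3 matrix has linearly independent columns (almost surely)
A = rng.normal(size=(6, 3))

# Overwrite the third column with a combination of the first two,
# which removes one independent direction
B = A.copy()
B[:, 2] = 2.0 * B[:, 0] - B[:, 1]

rank_A = np.linalg.matrix_rank(A)
rank_B = np.linalg.matrix_rank(B)
print(rank_A, rank_B)
```

When columns are exactly or nearly dependent like this, $X^TX$ is singular or ill-conditioned, and the OLS coefficients become unstable. This is the situation the $\alpha I$ term in ridge regression is meant to address.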

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

In [None]:
# Build a dataset with 3 features but an effective rank of about 2, so some of the features are correlated
X, y = make_regression(n_samples=2000, n_features=3, effective_rank=2, noise=10)
X

In [None]:
def plot_regression(lr, X, y):
    n_sample, n_feature = X.shape
    n_bootstraps = 1000  # number of random train/test splits
    coefs = np.zeros((n_bootstraps, n_feature))
    scores = np.zeros((n_bootstraps, 2))
    
    for i in range(n_bootstraps):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)
        lr.fit(X_train, y_train)
        scores[i] = (lr.score(X_train, y_train), lr.score(X_test, y_test))
        coefs[i] = lr.coef_
    f, axes = plt.subplots(nrows=n_feature, sharex=True, sharey=True, figsize=(10, 8))
    for i, ax in enumerate(axes):
        # Histogram of the fitted values of this coefficient
        ax.hist(coefs[:, i], alpha=.5)
        ax.set_title("Coef {}".format(i))
    plt.show()
    return coefs, scores

In [None]:
# Ordinary linear regression
coefs_lr, scores_lr = plot_regression(LinearRegression(), X, y)

In [None]:
# Ridge regression
coefs_ridge, scores_ridge = plot_regression(Ridge(), X, y)  # regularization strength alpha defaults to 1.0

Clearly, the ridge regression coefficients are closer to 0.

In [None]:
np.mean(coefs_ridge - coefs_lr, axis=0)

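On average, the linear regression coefficients are much larger in magnitude than the ridge coefficients. This difference in means reflects the bias that ridge regression introduces by shrinking the coefficients toward zero.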

In [None]:
coefs_lr.var(0)

In [None]:
coefs_ridge.var(0)

In [None]:
scores_lr

In [None]:
scores_lr.mean(0), scores_ridge.mean(0)

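The ridge coefficients also have much smaller variance. This is the well-known bias-variance trade-off in machine learning: ridge accepts some bias in exchange for lower variance.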

## Tuning the Ridge Regression Parameter
---
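An OLS (ordinary least squares) fit may suggest certain relationships between variables; once the fit is regularized with the alpha parameter, those relationships can disappear.

The `sklearn.linear_model` module provides an object called RidgeCV, which stands for **ridge cross-validation**. By default it performs an efficient form of **leave-one-out cross-validation** (LOOCV).

Specifying a value for the cv parameter triggers cross-validation through GridSearchCV instead. For example, cv=10 triggers 10-fold cross-validation rather than generalized cross-validation (GCV).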
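
A minimal sketch of the two modes, on a hypothetical dataset (`X_demo`, `y_demo`) and an arbitrary grid of candidate alphas:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Hypothetical data and candidate regularization strengths
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
alphas = np.logspace(-3, 3, 7)

# cv=None (the default): efficient leave-one-out cross-validation
loo = RidgeCV(alphas=alphas).fit(X_demo, y_demo)

# cv=10: ordinary 10-fold cross-validation
kfold = RidgeCV(alphas=alphas, cv=10).fit(X_demo, y_demo)

print("leave-one-out alpha:", loo.alpha_)
print("10-fold alpha      :", kfold.alpha_)
```

The two procedures estimate the generalization error differently, so they may or may not select the same alpha on a given dataset.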

In [None]:
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=2, effective_rank=1, noise=10)
rcv = RidgeCV(alphas=np.logspace(-5, 5, 11))
rcv.fit(X, y)

In [None]:
# After fitting, alpha_ holds the selected regularization strength:
rcv.alpha_

In [None]:
# Search a finer grid around 0.1 for a better alpha
rcv = RidgeCV(alphas=np.linspace(.05, .2, 16))
rcv.fit(X, y)
rcv.alpha_

In [None]:
alpha_list = np.linspace(0.001, 1, 1000)
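# Keep the cross-validation errors (recent scikit-learn releases rename
# this option to store_cv_results and the attribute to cv_results_)
rcv = RidgeCV(alphas=alpha_list, store_cv_values=True)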
rcv.fit(X, y)

In [None]:
rcv.cv_values_.shape  # (100, 1000): leave-one-out squared error for each of the 100 samples and 1000 alphas

In [None]:
min_alpha_idx = rcv.cv_values_.mean(0).argmin()
min_alpha_idx

In [None]:
alpha_list[min_alpha_idx]

In [None]:
rcv.alpha_

In [None]:
def plt_ridgecv(alpha_list, X, y):
    rcv = RidgeCV(alphas=alpha_list, store_cv_values=True)  # keep the cross-validation errors
    rcv.fit(X, y)
    min_alpha_idx = rcv.cv_values_.mean(0).argmin()
    f, ax = plt.subplots(figsize=(10, 6))
    ax.set_title(r"Various values of $\alpha$")
    xy = (alpha_list[min_alpha_idx], rcv.cv_values_.mean(axis=0)[min_alpha_idx])
    xytext = (xy[0] + .01, xy[1] + .1)
    
    ax.annotate(r'Chosen $\alpha$', xy=xy, xytext=xytext,
            arrowprops=dict(facecolor='black', shrink=0, width=0)
            )
    ax.plot(alpha_list, rcv.cv_values_.mean(axis=0))

In [None]:
plt_ridgecv(np.linspace(0.001, 0.1, 100), X, y)