# Multiple Linear Regression
![m](img/mul-x.png)

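The factor multiplying $y$, namely $(X^\mathrm TX)^{-1}X^\mathrm T$, is the pseudo-inverse of $X$.  
The derivation is as follows: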
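$$
(y-X\theta)^\mathrm T(y-X\theta)
$$

$$
=y^\mathrm Ty - \theta^\mathrm TX^\mathrm Ty - y^\mathrm TX\theta + \theta^\mathrm TX^\mathrm TX\theta
$$

Because $\theta^\mathrm TX^\mathrm Ty$ is a scalar, it equals its own transpose $y^\mathrm TX\theta$, so the two middle terms combine:

$$
E_{in}=y^\mathrm Ty - 2y^\mathrm TX\theta + \theta^\mathrm TX^\mathrm TX\theta
$$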

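Differentiate with respect to $\theta$ and set the gradient to zero. The gradient of the linear term $y^\mathrm TX\theta$ is $(y^\mathrm TX)^\mathrm T = X^\mathrm Ty$:

$$
\frac{\partial E_{in}}{\partial \theta} = 0 - 2X^\mathrm Ty + 2X^\mathrm TX\theta = 0
$$

$$
X^\mathrm Ty = X^\mathrm TX\theta
$$

Left-multiplying both sides by $(X^\mathrm TX)^{-1}$, assuming $X^\mathrm TX$ is invertible, isolates $\theta$ and gives the closed-form solution.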

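<font color=red>Normal equation solution for multiple linear regression (time complexity: $O(n^3)$, dominated by inverting the $n \times n$ matrix $X^\mathrm TX$)</font>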

$$
\theta = (X^\mathrm TX)^{-1}X^\mathrm Ty 
$$
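
As a minimal sketch of the normal equation above, the following NumPy function prepends a column of ones to $X$ (so that the first entry of $\theta$ acts as the intercept) and solves $X^\mathrm TX\theta = X^\mathrm Ty$. The function name `fit_normal_equation` and the synthetic data are illustrative assumptions, not part of the original notes. `np.linalg.solve` is used here instead of forming the inverse explicitly, which is a common numerical choice rather than a requirement of the formula.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X_b^T X_b) theta = X_b^T y, where X_b is X with a leading column of ones."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# Noise-free synthetic data generated from known parameters (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_theta = np.array([4.0, 1.5, -2.0, 0.5])  # [intercept, coef_1, coef_2, coef_3]
y_demo = true_theta[0] + X_demo @ true_theta[1:]

theta = fit_normal_equation(X_demo, y_demo)
intercept, coef = theta[0], theta[1:]
```

Splitting `theta` into `intercept` and `coef` mirrors how the fitted models below expose the intercept and the coefficients as separate attributes.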

## Implementing Multiple Linear Regression
![m2](img/mul-x2.png)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

In [2]:
boston = datasets.load_boston()  # removed in scikit-learn 1.2; this cell requires an older release
X = boston.data
y = boston.target
X = X[y < 50]
y = y[y < 50]

In [3]:
X.shape

(490, 13)

In [4]:
y.shape

(490,)

In [5]:
#train_test_split?
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)

In [6]:
from playML.LinearRegression import LinearRegression
reg = LinearRegression()

In [7]:
reg.fit_normal(X_train, y_train)

LinearRegression()

In [8]:
reg.coef_

array([-1.18919477e-01,  3.63991462e-02, -3.56494193e-02,  5.66737830e-02,
       -1.16195486e+01,  3.42022185e+00, -2.31470282e-02, -1.19509560e+00,
        2.59339091e-01, -1.40112724e-02, -8.36521175e-01,  7.92283639e-03,
       -3.81966137e-01])

In [9]:
reg.interception_

34.161435496213905

In [10]:
reg.score(X_test, y_test)

0.8129802602658359
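
The value returned by `score` is easiest to interpret as an $R^2$ coefficient. The following is a minimal sketch of that metric, assuming `playML` follows the same convention as scikit-learn regressors, whose `score` method returns $R^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}$. The helper name `r2_score_sketch` is hypothetical and is not taken from `playML`.

```python
import numpy as np

def r2_score_sketch(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
# Perfect predictions leave no residual error
perfect = r2_score_sketch(y_true_demo, y_true_demo)
# Always predicting the mean explains none of the variance
baseline = r2_score_sketch(y_true_demo, np.full(4, y_true_demo.mean()))
```

Under this definition a score of 1 means perfect predictions, and 0 means the model does no better than predicting the mean of `y_true`.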

## Multiple Linear Regression with scikit-learn

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

In [12]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

In [13]:
lin_reg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [14]:
lin_reg.coef_

array([-1.14235739e-01,  3.12783163e-02, -4.30926281e-02, -9.16425531e-02,
       -1.09940036e+01,  3.49155727e+00, -1.40778005e-02, -1.06270960e+00,
        2.45307516e-01, -1.23179738e-02, -8.80618320e-01,  8.43243544e-03,
       -3.99667727e-01])

In [15]:
lin_reg.intercept_

32.645660839653466

In [16]:
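# These values differ from the earlier ones because the two train_test_split implementations produce different training sets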

In [17]:
lin_reg.score(X_test, y_test)

0.8008916199519102
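
As a sanity check on the claim that the two fits differ only because of their training data, the sketch below fits scikit-learn's `LinearRegression` and the normal equation on the same synthetic training set and compares the results. The data and variable names are illustrative assumptions; this does not reproduce the housing-data numbers above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem with a small amount of noise (illustrative only)
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(200, 4))
y_syn = 2.0 + X_syn @ np.array([1.0, -3.0, 0.5, 2.5]) + rng.normal(scale=0.1, size=200)

# Fit with scikit-learn
sk_reg = LinearRegression().fit(X_syn, y_syn)

# Fit with the normal equation on exactly the same data
X_b = np.hstack([np.ones((len(X_syn), 1)), X_syn])
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y_syn)

same_intercept = np.isclose(sk_reg.intercept_, theta[0])
same_coef = np.allclose(sk_reg.coef_, theta[1:])
```

Both approaches minimize the same squared error, so on identical, well-conditioned data they are expected to agree up to floating-point error.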

## KNN regressor

In [18]:
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor()

In [19]:
knn_reg.fit(X_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform')

In [20]:
knn_reg.score(X_test, y_test)

0.602674505080953

In [21]:
from sklearn.model_selection import GridSearchCV
param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": list(range(1, 5))
    },
    {
        "weights": ["distance"],
        "n_neighbors": list(range(1, 5)),
        "p": list(range(1, 4))
    }
]

knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs = -1)
grid_search.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform'),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4]}, {'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4], 'p': [1, 2, 3]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [22]:
grid_search.best_params_

{'n_neighbors': 4, 'p': 1, 'weights': 'distance'}

In [23]:
# best_score_ is the mean cross-validated score on the training data, not the test-set score
grid_search.best_score_

0.6012047857024175

In [24]:
# This is the actual R^2 score on the held-out test set
grid_search.best_estimator_.score(X_test, y_test)

0.7281175497354979
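
The gap between `best_score_` and the test-set score arises because the two numbers are computed on different data: `best_score_` averages validation-fold scores taken from the training set, while the final `score` call uses the held-out test set. The sketch below, on synthetic data from `make_regression` with an illustrative parameter grid, suggests how `best_score_` relates to `cross_val_score` for the selected parameters in recent scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, not the housing data used above
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

grid = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": [1, 2, 3, 4], "weights": ["uniform", "distance"]},
    cv=5,
)
grid.fit(X_tr, y_tr)

# Mean validation-fold R^2 for the winning parameters, using the same 5 folds
cv_mean = cross_val_score(
    KNeighborsRegressor(**grid.best_params_), X_tr, y_tr, cv=5
).mean()

# R^2 of the refitted best model on data it never saw during the search
test_r2 = grid.best_estimator_.score(X_te, y_te)
```

Here `cv_mean` should match `grid.best_score_`, whereas `test_r2` is a separate estimate and can be higher or lower than the cross-validated value.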