Simple vs. Multiple Linear Regression

![Difference between simple and multiple linear regression](./img/vs.png)

Categorical data is converted into dummy variables.

![dummy variable](./img/dummy_var.png)

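Always omit one dummy column to avoid the dummy variable trap (the full set of dummies is perfectly collinear with the intercept): a categorical variable with X categories needs only X-1 dummy columns.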
![omit_one](./img/omit_one.png)
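The effect of omitting one column can be checked numerically. The following is a minimal sketch on a hypothetical four-row toy frame (the `State` values mirror the dataset used later, but the frame itself is made up for illustration). It compares the rank of the design matrix with and without the extra dummy column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real dataset.
df = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

# One dummy column per category: California, Florida, New York.
all_dummies = pd.get_dummies(df["State"], dtype=float)

# Drop the first category (California) so that it becomes the baseline.
reduced = pd.get_dummies(df["State"], drop_first=True, dtype=float)

# Add an intercept column of ones to both versions.
ones = np.ones((len(df), 1))
full_design = np.hstack([ones, all_dummies.to_numpy()])
reduced_design = np.hstack([ones, reduced.to_numpy()])

# The three dummies always sum to the intercept column, so the full
# design matrix has 4 columns but only rank 3.
rank_full = np.linalg.matrix_rank(full_design)
# After dropping one dummy, the rank equals the number of columns.
rank_reduced = np.linalg.matrix_rank(reduced_design)

print(full_design.shape[1], rank_full)
print(reduced_design.shape[1], rank_reduced)
```

A rank below the column count means the least-squares coefficients are not uniquely determined, which is the practical problem the omitted column avoids.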


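Backward elimination is a feature selection technique commonly used with multiple linear regression. The idea is to start with a model that contains every predictor and then remove the least significant feature, one at a time, until every remaining feature is statistically significant (that is, contributes significantly to the model's predictive performance).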

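Backward elimination proceeds as follows:

1. **Include all features:** Fit a model that contains every predictor.

2. **Assess each feature:** Measure how much each feature contributes to the model. In multiple linear regression this is usually done with the P value of each coefficient. The P value is the probability of observing an estimate at least this far from zero if the true coefficient were zero, so a high P value (above a chosen threshold such as 0.05) suggests the feature may not be important.

3. **Remove the least significant feature:** Drop the feature with the largest P value, provided it is above the threshold.

4. **Repeat steps 2 and 3:** Refit the model, reassess the features, and remove the least significant one until every remaining feature is significant.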

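Forward selection is a feature selection technique commonly used with multiple linear regression. The idea is to start from a model with no predictors and add the most significant feature, one at a time, until adding another feature no longer improves the model's predictive performance significantly.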

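Forward selection proceeds as follows:

1. **Start with an empty model:** Fit a model with no predictors, only an intercept.

2. **Assess each candidate feature:** For every feature not yet in the model, measure what it would contribute if added. In multiple linear regression this is usually done with the P value of its coefficient. The P value is the probability of observing an estimate at least this far from zero if the true coefficient were zero, so a low P value (below a chosen threshold such as 0.05) suggests the feature is likely important.

3. **Add the most significant feature:** Add the candidate with the smallest P value, provided it is below the threshold.

4. **Repeat steps 2 and 3:** Reassess the remaining candidates and keep adding the most significant one until no addition improves the model significantly.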

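Bidirectional elimination, also known as stepwise regression, is a feature selection method that combines forward selection and backward elimination. At every step it evaluates the effect of adding a new feature or removing an existing one and then takes the better action.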

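Bidirectional elimination proceeds as follows:

1. **Start with an empty model:** Fit a model with no predictors, only an intercept.

2. **Forward step:** As in forward selection, add the most significant remaining feature, as long as the addition improves the model significantly.

3. **Backward step:** After each addition, check as in backward elimination whether any feature already in the model has become insignificant (for example, its P value has risen above the threshold). If so, remove it.

4. **Repeat steps 2 and 3:** Alternate forward and backward steps until no feature can be added or removed.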

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
dataset = pd.read_csv('50_Startups.csv')
dataset.head(10)

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |
| 5 | 131876.9 | 99814.71 | 362861.36 | New York | 156991.12 |
| 6 | 134615.46 | 147198.87 | 127716.82 | California | 156122.51 |
| 7 | 130298.13 | 145530.06 | 323876.68 | Florida | 155752.6 |
| 8 | 120542.52 | 148718.95 | 311613.29 | New York | 152211.77 |
| 9 | 123334.88 | 108679.17 | 304981.62 | California | 149759.96 |


## Select every column except the last and convert the selection to a NumPy array as the independent variables X

In [3]:
X = dataset.iloc[:, :-1].values

## Select the last column and convert it to a NumPy array as the dependent variable y

In [4]:
y = dataset.iloc[:, 4].values

Encode the categorical data

In [5]:
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder()
X_reshaped = X[:, 3].reshape(-1, 1)  # State column (index 3), reshaped into a 2-D array
X_onehot = onehotencoder.fit_transform(X_reshaped).toarray()  # one-hot encode the State column

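Avoid the dummy variable trap by dropping the first dummy column (California, since the encoder sorts the categories alphabetically)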

In [6]:
X_onehot = X_onehot[:, 1:]
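The encoding and the column drop can also be done in one step with the encoder's `drop="first"` option. The sketch below uses a hypothetical four-row array rather than the real dataset and shows which categories remain after the drop.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of the State column as a 2-D array.
states = np.array([["New York"], ["California"], ["Florida"], ["New York"]])

# drop="first" removes the first category of each feature during encoding.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(states).toarray()

# Categories are sorted, so California is dropped and becomes the baseline.
kept_names = [str(name) for name in encoder.categories_[0][1:]]

print(kept_names)
print(encoded)
```

A row of all zeros therefore stands for California, which matches the layout produced by slicing off the first column by hand.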

Drop the last column (State) from X and concatenate the encoded columns with the remaining features:

In [7]:
X_new = np.concatenate([X_onehot, X[:, :-1]], axis=1)
X_new

array([[0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [0.0, 1.0, 144372.41, 118671.85, 383199.62],
       [1.0, 0.0, 142107.34, 91391.77, 366168.42],
       [0.0, 1.0, 131876.9, 99814.71, 362861.36],
       [0.0, 0.0, 134615.46, 147198.87, 127716.82],
       [1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [0.0, 0.0, 123334.88, 108679.17, 304981.62],
       [1.0, 0.0, 101913.08, 110594.11, 229160.95],
       [0.0, 0.0, 100671.96, 91790.61, 249744.55],
       [1.0, 0.0, 93863.75, 127320.38, 249839.44],
       [0.0, 0.0, 91992.39, 135495.07, 252664.93],
       [1.0, 0.0, 119943.24, 156547.42, 256512.92],
       [0.0, 1.0, 114523.61, 122616.84, 261776.23],
       [0.0, 0.0, 78013.11, 121597.55, 264346.06],
       [0.0, 1.0, 94657.16, 145077.58, 282574.31],
       [1.0, 0.0, 91749.16, 114175.79, 294919.57],
       [0.0, 1.0, 86419.7

Split the data into training and test sets

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=0)
X_train

array([[1.0, 0.0, 55493.95, 103057.49, 214634.81],
       [0.0, 1.0, 46014.02, 85047.44, 205517.64],
       [1.0, 0.0, 75328.87, 144135.98, 134050.07],
       [0.0, 0.0, 46426.07, 157693.92, 210797.67],
       [1.0, 0.0, 91749.16, 114175.79, 294919.57],
       [1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [1.0, 0.0, 119943.24, 156547.42, 256512.92],
       [0.0, 1.0, 1000.23, 124153.04, 1903.93],
       [0.0, 1.0, 542.05, 51743.15, 0.0],
       [0.0, 1.0, 65605.48, 153032.06, 107138.38],
       [0.0, 1.0, 114523.61, 122616.84, 261776.23],
       [1.0, 0.0, 61994.48, 115641.28, 91131.24],
       [0.0, 0.0, 63408.86, 129219.61, 46085.25],
       [0.0, 0.0, 78013.11, 121597.55, 264346.06],
       [0.0, 0.0, 23640.93, 96189.63, 148001.11],
       [0.0, 0.0, 76253.86, 113867.3, 298664.47],
       [0.0, 1.0, 15505.73, 127382.3, 35534.17],
       [0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [0.0, 0.0, 91992.39, 135495.07, 252664.93],
       [0.0, 0.0, 64664.71, 139553.16, 137962.

Fit the model

In [9]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()

regressor.fit(X_train, y_train)

Predict on the test set (all-in strategy, using every feature)

In [10]:
y_pred = regressor.predict(X_test)
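One way to judge such predictions is to compare them with the held-out targets. The sketch below is self-contained and runs on synthetic data, so the names `X_demo`, `y_demo`, and the coefficients are assumptions. In the notebook the same calls would apply to `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data with three features and a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

# R-squared on the test set: 1.0 would be a perfect fit.
score = r2_score(y_te, pred)

# Actual and predicted values side by side, one row per test sample.
comparison = np.column_stack([y_te, pred])

print(round(score, 3))
print(comparison[:5])
```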

Build an optimized model with backward elimination

In [11]:
import statsmodels.api as sm

Add the intercept column (`sm.OLS` does not add one automatically)

In [12]:
X_train = np.append(arr=np.ones((X_train.shape[0], 1)).astype(int), values=X_train, axis=1)  # prepend a column of ones, one per training row
X_train

array([[1, 1.0, 0.0, 55493.95, 103057.49, 214634.81],
       [1, 0.0, 1.0, 46014.02, 85047.44, 205517.64],
       [1, 1.0, 0.0, 75328.87, 144135.98, 134050.07],
       [1, 0.0, 0.0, 46426.07, 157693.92, 210797.67],
       [1, 1.0, 0.0, 91749.16, 114175.79, 294919.57],
       [1, 1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [1, 1.0, 0.0, 119943.24, 156547.42, 256512.92],
       [1, 0.0, 1.0, 1000.23, 124153.04, 1903.93],
       [1, 0.0, 1.0, 542.05, 51743.15, 0.0],
       [1, 0.0, 1.0, 65605.48, 153032.06, 107138.38],
       [1, 0.0, 1.0, 114523.61, 122616.84, 261776.23],
       [1, 1.0, 0.0, 61994.48, 115641.28, 91131.24],
       [1, 0.0, 0.0, 63408.86, 129219.61, 46085.25],
       [1, 0.0, 0.0, 78013.11, 121597.55, 264346.06],
       [1, 0.0, 0.0, 23640.93, 96189.63, 148001.11],
       [1, 0.0, 0.0, 76253.86, 113867.3, 298664.47],
       [1, 0.0, 1.0, 15505.73, 127382.3, 35534.17],
       [1, 0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [1, 0.0, 0.0, 91992.39, 135495.07, 2

Start with all features in the model

In [13]:
X_opt = X_train[:, [0, 1, 2, 3, 4, 5]]
X_opt = np.array(X_opt, dtype=float)
X_opt

array([[1.0000000e+00, 1.0000000e+00, 0.0000000e+00, 5.5493950e+04,
        1.0305749e+05, 2.1463481e+05],
       [1.0000000e+00, 0.0000000e+00, 1.0000000e+00, 4.6014020e+04,
        8.5047440e+04, 2.0551764e+05],
       [1.0000000e+00, 1.0000000e+00, 0.0000000e+00, 7.5328870e+04,
        1.4413598e+05, 1.3405007e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 4.6426070e+04,
        1.5769392e+05, 2.1079767e+05],
       [1.0000000e+00, 1.0000000e+00, 0.0000000e+00, 9.1749160e+04,
        1.1417579e+05, 2.9491957e+05],
       [1.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.3029813e+05,
        1.4553006e+05, 3.2387668e+05],
       [1.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.1994324e+05,
        1.5654742e+05, 2.5651292e+05],
       [1.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.0002300e+03,
        1.2415304e+05, 1.9039300e+03],
       [1.0000000e+00, 0.0000000e+00, 1.0000000e+00, 5.4205000e+02,
        5.1743150e+04, 0.0000000e+00],
       [1.0000000e+00, 0.0000000e+00,

Fit the OLS model

In [14]:
regressor_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()

In [15]:
print(regressor_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.950
Model:                            OLS   Adj. R-squared:                  0.943
Method:                 Least Squares   F-statistic:                     129.7
Date:                Wed, 26 Jul 2023   Prob (F-statistic):           3.91e-21
Time:                        11:43:26   Log-Likelihood:                -421.10
No. Observations:                  40   AIC:                             854.2
Df Residuals:                      34   BIC:                             864.3
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       4.255e+04   8358.538      5.091      0.0

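Remove column index 2 (the New York dummy) because it has the largest P value; this feature does not contribute significantly to the model's predictions.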

In [16]:
X_opt = X_train[:, [0, 1, 3, 4, 5]]
X_opt = np.array(X_opt, dtype=float)

In [17]:
regressor_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()

In [18]:
print(regressor_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.950
Model:                            OLS   Adj. R-squared:                  0.944
Method:                 Least Squares   F-statistic:                     166.7
Date:                Wed, 26 Jul 2023   Prob (F-statistic):           2.87e-22
Time:                        11:43:26   Log-Likelihood:                -421.12
No. Observations:                  40   AIC:                             852.2
Df Residuals:                      35   BIC:                             860.7
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       4.292e+04   8020.397      5.352      0.0

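Remove column index 1 (the Florida dummy) because it has the largest P value; this feature does not contribute significantly to the model's predictions.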

In [19]:
X_opt = X_train[:, [0, 3, 4, 5]]
X_opt = np.array(X_opt, dtype=float)
X_opt

array([[1.0000000e+00, 5.5493950e+04, 1.0305749e+05, 2.1463481e+05],
       [1.0000000e+00, 4.6014020e+04, 8.5047440e+04, 2.0551764e+05],
       [1.0000000e+00, 7.5328870e+04, 1.4413598e+05, 1.3405007e+05],
       [1.0000000e+00, 4.6426070e+04, 1.5769392e+05, 2.1079767e+05],
       [1.0000000e+00, 9.1749160e+04, 1.1417579e+05, 2.9491957e+05],
       [1.0000000e+00, 1.3029813e+05, 1.4553006e+05, 3.2387668e+05],
       [1.0000000e+00, 1.1994324e+05, 1.5654742e+05, 2.5651292e+05],
       [1.0000000e+00, 1.0002300e+03, 1.2415304e+05, 1.9039300e+03],
       [1.0000000e+00, 5.4205000e+02, 5.1743150e+04, 0.0000000e+00],
       [1.0000000e+00, 6.5605480e+04, 1.5303206e+05, 1.0713838e+05],
       [1.0000000e+00, 1.1452361e+05, 1.2261684e+05, 2.6177623e+05],
       [1.0000000e+00, 6.1994480e+04, 1.1564128e+05, 9.1131240e+04],
       [1.0000000e+00, 6.3408860e+04, 1.2921961e+05, 4.6085250e+04],
       [1.0000000e+00, 7.8013110e+04, 1.2159755e+05, 2.6434606e+05],
       [1.0000000e+00, 2.3640930e+

In [20]:
regressor_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()

In [21]:
print(regressor_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.950
Model:                            OLS   Adj. R-squared:                  0.946
Method:                 Least Squares   F-statistic:                     227.8
Date:                Wed, 26 Jul 2023   Prob (F-statistic):           1.85e-23
Time:                        11:43:26   Log-Likelihood:                -421.19
No. Observations:                  40   AIC:                             850.4
Df Residuals:                      36   BIC:                             857.1
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       4.299e+04   7919.773      5.428      0.0

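Remove column index 4 (Administration) because it has the largest P value; this feature does not contribute significantly to the model's predictions.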

In [22]:
X_opt = X_train[:, [0, 3, 5]]
X_opt = np.array(X_opt, dtype=float)
X_opt

array([[1.0000000e+00, 5.5493950e+04, 2.1463481e+05],
       [1.0000000e+00, 4.6014020e+04, 2.0551764e+05],
       [1.0000000e+00, 7.5328870e+04, 1.3405007e+05],
       [1.0000000e+00, 4.6426070e+04, 2.1079767e+05],
       [1.0000000e+00, 9.1749160e+04, 2.9491957e+05],
       [1.0000000e+00, 1.3029813e+05, 3.2387668e+05],
       [1.0000000e+00, 1.1994324e+05, 2.5651292e+05],
       [1.0000000e+00, 1.0002300e+03, 1.9039300e+03],
       [1.0000000e+00, 5.4205000e+02, 0.0000000e+00],
       [1.0000000e+00, 6.5605480e+04, 1.0713838e+05],
       [1.0000000e+00, 1.1452361e+05, 2.6177623e+05],
       [1.0000000e+00, 6.1994480e+04, 9.1131240e+04],
       [1.0000000e+00, 6.3408860e+04, 4.6085250e+04],
       [1.0000000e+00, 7.8013110e+04, 2.6434606e+05],
       [1.0000000e+00, 2.3640930e+04, 1.4800111e+05],
       [1.0000000e+00, 7.6253860e+04, 2.9866447e+05],
       [1.0000000e+00, 1.5505730e+04, 3.5534170e+04],
       [1.0000000e+00, 1.2054252e+05, 3.1161329e+05],
       [1.0000000e+00, 9.199

In [23]:
regressor_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()

In [24]:
print(regressor_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.950
Model:                            OLS   Adj. R-squared:                  0.947
Method:                 Least Squares   F-statistic:                     349.0
Date:                Wed, 26 Jul 2023   Prob (F-statistic):           9.65e-25
Time:                        11:43:26   Log-Likelihood:                -421.30
No. Observations:                  40   AIC:                             848.6
Df Residuals:                      37   BIC:                             853.7
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       4.635e+04   2971.236     15.598      0.0

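Marketing Spend (column index 5) now has the largest P value (0.071), which is above the 0.05 threshold, so it is the next candidate for removal; keeping only the intercept and R&D Spend corresponds to the selection `[0, 3]`. The cell below selects `[0, 5]` instead, which keeps Marketing Spend and drops R&D Spend (column index 3), so the summary that follows describes a model with Marketing Spend as its only feature.

In [25]:
X_opt = X_train[:, [0, 5]]  # intercept + Marketing Spend; [0, 3] would be intercept + R&D Spend
X_opt = np.array(X_opt, dtype=float)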

In [26]:
regressor_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()

In [27]:
print(regressor_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.564
Model:                            OLS   Adj. R-squared:                  0.552
Method:                 Least Squares   F-statistic:                     49.08
Date:                Wed, 26 Jul 2023   Prob (F-statistic):           2.42e-08
Time:                        11:43:26   Log-Likelihood:                -464.50
No. Observations:                  40   AIC:                             933.0
Df Residuals:                      38   BIC:                             936.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       5.971e+04   8319.622      7.177      0.0

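Conclusion: R&D Spend is the only feature that contributes significantly to the model's predictions. R-squared stays at 0.950 in every fit that includes it and falls to 0.564 in the last fit, which leaves it out.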