# Linear Regression with scikit-learn

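We have covered what the least squares method is and written a complete least-squares linear regression in Python. How do we do the same thing with scikit-learn, the open-source machine learning library?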

Implementing linear regression with scikit-learn is much simpler. It uses the `LinearRegression()` class, whose parameters are:

```
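sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)

- fit_intercept: defaults to True; whether to compute the intercept term.
- normalize: defaults to False; the data is not normalized before fitting. (This parameter was deprecated in scikit-learn 1.0 and removed in 1.2.)
- copy_X: defaults to True; the model works on a copy of the data so the original is not modified.
- n_jobs: number of jobs used for the computation. Defaults to 1 in this signature (None in recent releases); -1 uses all CPUs.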
```
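
As a minimal sketch of what `fit_intercept` does, the snippet below fits `LinearRegression` on a small synthetic dataset. The data and the names `X_demo`, `y_demo`, `with_intercept`, and `no_intercept` are made up for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data following y = 2x + 5 (illustrative only)
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + 5.0

# With an intercept the model can recover both the slope and the offset
with_intercept = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)

# Without an intercept the fitted line is forced through the origin
no_intercept = LinearRegression(fit_intercept=False).fit(X_demo, y_demo)

print(with_intercept.coef_, with_intercept.intercept_)
print(no_intercept.coef_, no_intercept.intercept_)
```

With the intercept disabled, the slope has to absorb the offset, so it comes out larger than the true value of 2.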

In [2]:
import pandas as pd

df = pd.read_csv(
    "https://labfile.oss.aliyuncs.com/courses/1081/course-5-boston.csv")
print(df.shape)
df.head()

(506, 14)


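|   | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |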


```
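The columns are:

CRIM: per-capita crime rate by town. (used as a feature below)
ZN: proportion of residential land zoned for lots over 25,000 square feet.
INDUS: proportion of non-retail business acres per town.
CHAS: whether the tract bounds the Charles River (1 = yes, 0 = no).
NOX: nitric oxides concentration (parts per 10 million).
RM: average number of rooms per dwelling. (used as a feature below)
AGE: proportion of owner-occupied units built before 1940.
DIS: weighted distance to employment centers.
RAD: index of accessibility to highways.
TAX: property tax rate.
PTRATIO: pupil-teacher ratio by town.
BLACK: an index derived from the proportion of Black residents by town.
LSTAT: percentage of the population with lower socioeconomic status. (used as a feature below)
MEDV: median value of homes in the town, in units of $1,000. (prediction target below)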
```

In [3]:
features = df[['crim', 'rm', 'lstat']]
print(features.shape)
features.head()

(506, 3)


|   | crim | rm | lstat |
|---|---|---|---|
| 0 | 0.00632 | 6.575 | 4.98 |
| 1 | 0.02731 | 6.421 | 9.14 |
| 2 | 0.02729 | 7.185 | 4.03 |
| 3 | 0.03237 | 6.998 | 2.94 |
| 4 | 0.06905 | 7.147 | 5.33 |


In [4]:
target = df['medv']
print(target.shape)
target.head()

(506,)


0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: medv, dtype: float64

## Train/test split (70% train, 30% test)


In [5]:
len(features)

506

In [6]:
split_num = int(len(features)*0.7)

X_train = features[:split_num]
y_train = target[:split_num]

X_test = features[split_num:]
y_test = target[split_num:]
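
The slicing above takes the first 70% of rows for training and the rest for testing, in file order. As a hedged alternative sketch, scikit-learn's `train_test_split` can do the same ordered split with `shuffle=False`, or a reproducible random split with `random_state`. The toy arrays `X_toy` and `y_toy` below are stand-ins for `features` and `target` and are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for `features` and `target` (illustrative only)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# shuffle=False keeps the original row order, like plain slicing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, shuffle=False)

# Shuffling with a fixed seed gives a reproducible random split instead
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0)

print(len(X_tr), len(X_te))
print(y_te, y_te_s)
```

If the rows of a dataset happen to be ordered in a meaningful way, an unshuffled split can leave the test set looking quite different from the training set, so a shuffled split is often worth trying as a comparison.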

## Fitting linear regression with scikit-learn

In [7]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)  # train the model

model.coef_, model.intercept_  # coefficients and intercept (bias)

# There are three coefficients, one for each of the three features

(array([ 0.69979497, 10.13564218, -0.20532653]), -38.000969889690275)
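
To make the role of `coef_` and `intercept_` concrete, here is a small sketch showing that `predict` is just the dot product of the inputs with the coefficients plus the intercept. It uses synthetic three-feature data; `X_demo`, `y_demo`, `demo_model`, and `manual` are illustrative names, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with three features (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 3.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# Manual prediction: one weight per feature, plus the intercept
manual = X_demo @ demo_model.coef_ + demo_model.intercept_

# The manual result should agree with predict() up to floating-point error
print(np.allclose(manual, demo_model.predict(X_demo)))
```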

In [8]:
preds = model.predict(X_test)
print(type(preds))
preds

<class 'numpy.ndarray'>


array([17.77439141, 21.09512448, 27.63412265, 26.78577951, 25.38313368,
       24.3286313 , 28.4257879 , 25.12834727, 16.82806601, 20.76498858,
       52.3350748 , -0.18169806, 12.01475786,  7.87878077, 15.13155699,
       32.93748235, 37.07872049, 29.50613719, 25.50800832, 12.35867972,
        9.08901644, 47.08374238, 35.31759193, 33.3738765 , 38.34913316,
       33.10414639, 91.3556125 , 35.11735022, 19.69326952, 18.49805269,
       14.03767555, 20.9235166 , 20.41406182, 21.92218226, 15.20451678,
       18.05362998, 21.26289453, 23.18192502, 15.87149504, 27.70381826,
       27.65958772, 30.17151829, 27.04987446, 21.52730227, 37.82614512,
       22.09872387, 34.71166346, 32.07959454, 29.45253042, 29.51137956,
       41.49935191, 62.4121152 , 13.64508882, 24.71242033, 18.69151684,
       37.4909413 , 54.05864658, 34.94758034, 15.01355249, 30.17849355,
       32.22191275, 33.90252834, 33.02530285, 28.4416789 , 69.60201087,
       34.7617152 , 31.65353442, 24.5644437 , 24.78130285, 24.00

In [9]:
import numpy as np

# Mean absolute error (MAE)
def mae_value(pre_output, true_output):
    n = len(pre_output)
    return sum(np.abs(pre_output - true_output)) / n

# Mean squared error (MSE)
def mse_value(pre_output, true_output):
    n = len(pre_output)
    return sum(np.square(pre_output - true_output)) / n
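
For reference, the two functions above correspond to the standard definitions below, where $y_i$ is the true value, $\hat{y}_i$ the predicted value, and $n$ the number of samples (the symbols are chosen here for illustration):

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|,
\qquad
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2
$$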


In [10]:
# Compute the errors

mae = mae_value(preds, y_test)
mse = mse_value(preds, y_test)

print(mae, mse)

13.02206307278018 303.83312472235764
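
As a cross-check on hand-written metrics like the ones above, scikit-learn ships `mean_absolute_error` and `mean_squared_error` in `sklearn.metrics`. The sketch below runs them on a tiny array whose errors can be verified by hand; `y_true_demo` and `y_pred_demo` are illustrative values, not data from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Small hand-checkable arrays (illustrative only)
y_true_demo = np.array([1.0, 2.0, 3.0])
y_pred_demo = np.array([2.0, 2.0, 5.0])

# Absolute errors are 1, 0, 2; squared errors are 1, 0, 4
mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)  # (1 + 0 + 2) / 3
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)   # (1 + 0 + 4) / 3

print(mae_demo, mse_demo)
```

Both library functions take the true values first and the predictions second. For MAE and MSE the result is the same either way, but keeping the order consistent avoids surprises with other metrics.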
