# Learning Machine Learning by Following the Map: A03 The Machine Learning Development Workflow

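These are the lecture notes for A03, The Machine Learning Development Workflow, part of the course Learning Machine Learning by Following the Map.
The course gets its name from the way the study materials are organized: everything lives in a mind map, so you can pick the learning path that suits your situation and study at your own pace.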

![avatar](pic/swnt.png)

The mind map and the related study materials are hosted on GitHub (git.code946.com) and are continuously being iterated on and updated.

## Step 1: Load the dataset

In [16]:
from sklearn.datasets import load_boston

In [17]:
# Load the data (note: load_boston was removed in scikit-learn 1.2)
lb = load_boston()

print(lb['DESCR'])
print(lb.keys())
print(lb['feature_names'])
for i in range(10):
    print(lb.data[i])
    print(lb.target[i])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu
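
If `load_boston` is unavailable in your installed scikit-learn, the same loading pattern can be tried on another bundled dataset. The sketch below uses `load_diabetes` purely as a stand-in (an assumption, not the dataset used in these notes). It returns the same kind of object, with `data`, `target`, `feature_names` and `DESCR` entries.

```python
from sklearn.datasets import load_diabetes

# Assumption: load_diabetes is a substitute for load_boston here;
# it is a different dataset with the same access pattern.
ds = load_diabetes()

print(ds.keys())           # available entries
print(ds.feature_names)    # column names of ds.data
print(ds.data.shape, ds.target.shape)

# Peek at the first three samples and their target values
for i in range(3):
    print(ds.data[i], ds.target[i])
```

The numbers produced later in these notes (scores, errors) would of course differ on a different dataset.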

## Step 2: Data processing

### Feature processing: check the dataset for missing values

In [18]:
import pandas as pd

pd.DataFrame(lb.data).isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
dtype: int64
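
To see what the same check reports when values really are missing, here is a minimal sketch on a small made-up table. The column names are borrowed from the dataset description, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two deliberate gaps (np.nan)
df = pd.DataFrame({
    "RM": [6.5, np.nan, 7.1],
    "AGE": [65.2, 78.9, np.nan],
    "TAX": [296.0, 242.0, 222.0],
})

# isnull() marks each missing cell as True; sum() counts them per column
missing = df.isnull().sum()
print(missing)
```

Passing column names when building the DataFrame, for example `pd.DataFrame(lb.data, columns=lb.feature_names)`, makes the report easier to read than the integer labels 0 to 12.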

### Splitting the dataset

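train: training set
test: test set
x: feature values
y: target values

x_train: feature values of the training set
x_test: feature values of the test set
y_train: target values of the training set
y_test: target values of the test set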

In [19]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(lb.data,lb.target,test_size=0.25)
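
The sketch below shows what this split produces for a dataset with 506 rows and 13 features. It uses placeholder arrays instead of the real data, and it adds `random_state=42` (an arbitrary value chosen here, not part of the original call) so that the split is the same on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the dataset: 506 rows, 13 features
X = np.arange(506 * 13, dtype=float).reshape(506, 13)
y = np.arange(506, dtype=float)

# random_state fixes the shuffle, so repeated runs give the same split
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(x_train.shape, x_test.shape)  # 379 training rows, 127 test rows
print(y_train.shape, y_test.shape)
```

Without `random_state`, each run shuffles differently, so the scores reported further down change from run to run.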

## Step 3: Choose a model (linear regression)

In [20]:
from sklearn.linear_model import LinearRegression

## Step 4: Train the model

In [21]:
lr = LinearRegression()
lr.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
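
To make what `fit` learns more concrete, here is a minimal sketch on synthetic, noise-free data generated from a known linear rule. The data and the rule are assumptions made for illustration. After fitting, `coef_` and `intercept_` should recover the weights and the bias of that rule.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2*x0 - 3*x1 + 1, with no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 1.0

model = LinearRegression()
model.fit(X, y)

print(model.coef_)       # one learned weight per feature
print(model.intercept_)  # learned bias term
```

On the housing data, `lr.coef_` likewise holds one weight per feature, 13 in total.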

## Step 5: Evaluate the model

In [23]:
# Pass the test-set feature values to the model to get its predicted target values
y_predict = lr.predict(x_test)

print("Predicted values:", y_predict)
print("True values:", y_test)

Predicted values: [37.7696424  18.50063955 13.08198736 17.34479812  8.53718404 15.35328755
 15.33895135 22.09498553  6.45849233 20.61095659 20.70737133 21.01899275
 21.5671184  29.84872204 22.33191995 25.38824381  6.94352121 20.52014786
 14.20009318 28.75396373 20.18585548 18.05659075 24.5092971  23.46917592
 22.71534378 39.49111955 18.34351975 31.62942059 30.0705854  37.22361853
 15.12597802 22.82601066 24.79121559 20.29634213 12.65646525 16.5483236
 19.21068791 30.96931453 19.08764487 24.44506934 16.6226502  19.22501005
 27.43313113 18.6538912  29.82697817 27.04337518 32.29975203 19.24608934
 19.48242818 28.71260503 18.65746845 15.5964108  25.28963253 11.73260401
 19.80686398 30.3435758  19.2788143  15.55941984 24.67579798 20.89805436
 26.71927871 26.73171508  3.48716542 23.55817554 22.2953745  27.81031594
 27.47832981 23.79177862 35.49921597 14.24514374 21.38008698  2.74507102
 25.23395065 25.24571261 30.30182559 10.58907752 16.55476647 23.62706305
 30.3469208  31.75919428 31.78910656 31.9491683

In [25]:
from sklearn.metrics import mean_squared_error

print("Mean squared error:", mean_squared_error(y_test, y_predict))

Mean squared error: 25.641020432427624
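
The mean squared error is the average of the squared differences between true and predicted values. A minimal sketch with made-up numbers (not taken from the run above) checks the hand computation against `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 32.7, 33.4])

# MSE = mean of squared residuals
manual_mse = np.mean((y_true - y_pred) ** 2)
sklearn_mse = mean_squared_error(y_true, y_pred)

# Its square root (RMSE) is in the same units as the target
rmse = np.sqrt(manual_mse)

print(manual_mse, sklearn_mse, rmse)
```

By the same rule, the error of about 25.64 reported above corresponds to an RMSE of roughly 5.06 in the units of the target.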


In [26]:
print(lr.score(x_test,y_test))

0.6686206831707937
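
For a regressor such as `LinearRegression`, `score` returns the coefficient of determination R², which is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up numbers and compares it with `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted values
y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_pred = np.array([25.0, 20.6, 32.7, 33.4, 35.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
manual_r2 = 1.0 - ss_res / ss_tot

print(manual_r2, r2_score(y_true, y_pred))
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the true values.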


## Step 6: Saving and loading the model

In [28]:
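# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib

# Save the trained model to disk so it can be loaded in the next cell
joblib.dump(lr, "house.pkl")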

In [29]:
lr2 = joblib.load("house.pkl")
print(lr2.score(x_test,y_test))

0.6686206831707937
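
The identical score suggests the restored model behaves like the original. Here is a self-contained sketch of the same round trip, using a tiny made-up model and a temporary directory (both are assumptions for illustration) so that no file is left behind:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny hypothetical dataset following y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    joblib.dump(model, path)      # serialize the fitted model
    restored = joblib.load(path)  # read it back as a new object

# The restored model should give the same predictions as the original
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

A saved model is generally best loaded with the same scikit-learn version that produced it, since the pickle format is not guaranteed to be compatible across versions.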
