# Algorithm Principles

![image.png](attachment:6da12c0e-c048-4bec-af5c-8c949e14967e.png)

![image.png](attachment:13b5f3e9-ab9d-44ce-9e3f-4cfa928a4d6c.png)
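For reference, the objective minimized by scikit-learn's `Lasso` can be written as below. Here $n$ is the number of samples, $X$ the feature matrix, $y$ the target vector, $w$ the coefficient vector, and $\alpha$ the regularization strength (the `alpha` parameter tuned later in this notebook). The intercept is omitted for brevity.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The $\ell_1$ penalty can shrink some coefficients to exactly zero, which is why Lasso is often used for feature selection as well as regression.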

# Data Preparation

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Lasso

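| Feature | Description |
| --- | --- |
| CRIM | Per capita crime rate by town |
| ZN | Proportion of residential land zoned for lots over 25,000 sq. ft. |
| INDUS | Proportion of non-retail business acres per town |
| CHAS | Charles River dummy variable (1 if the tract bounds the river; 0 otherwise) |
| NOX | Nitric oxides concentration (parts per 10 million) |
| RM | Average number of rooms per dwelling |
| AGE | Proportion of owner-occupied units built before 1940 |
| DIS | Weighted distance to five Boston employment centers |
| RAD | Index of accessibility to radial highways |
| TAX | Full-value property tax rate per USD 10,000 |
| PTRATIO | Pupil-teacher ratio by town (appears as `PIRATIO` in the CSV output below) |
| B | 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town |
| LSTAT | Percentage of lower-status population |
| MEDV | Median value of owner-occupied homes, in thousands of USD |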

In [2]:
# Read the data
data = pd.read_csv("../data/boston_housing_data.csv")
print(data.head())

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900    1  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671    2  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671    2  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622    3  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622    3  222.0   

   PIRATIO       B  LSTAT  MEDV  
0     15.3  396.90   4.98  24.0  
1     17.8  396.90   9.14  21.6  
2     17.8  392.83   4.03  34.7  
3     18.7  394.63   2.94  33.4  
4     18.7  396.90   5.33  36.2  


In [3]:
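# Prepare the data: drop rows with missing values, then separate the target from the features
data = data.dropna()
y = data['MEDV']
x = data.drop(['MEDV'], axis=1).astype('float64')

# Split into training and test sets (25% held out for testing)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)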
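The effect of `dropna()` and of `test_size=0.25` is easy to check on a small frame. The following is a minimal sketch, assuming a hypothetical toy DataFrame `toy` whose values are made up for illustration and are not taken from the housing CSV.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 of which contain a missing value.
toy = pd.DataFrame({
    "RM": [6.5, 6.4, np.nan, 7.0, 7.1, 6.0, 5.9, 6.2, 6.8, 6.1],
    "LSTAT": [5.0, 9.1, 4.0, 2.9, np.nan, 12.4, 19.2, 17.1, 8.3, 13.3],
    "MEDV": [24.0, 21.6, 34.7, 33.4, 36.2, 22.9, 27.1, 16.5, 18.9, 15.0],
})

# Drop incomplete rows, then separate the target column from the features.
clean = toy.dropna()
features = clean.drop(columns=["MEDV"]).astype("float64")
target = clean["MEDV"]

# Hold out 25% of the remaining rows for testing.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.25, random_state=42
)
print(len(clean), len(x_tr), len(x_te))  # 8 6 2
```

Fixing `random_state` keeps the split reproducible across runs, which matters when comparing models on the same test set.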

# Model Training

In [4]:
# Regression model and hyperparameter grid
model = Lasso()
param_grid = {'alpha': [0.1, 1.0, 10.0]}

# Tune alpha with GridSearchCV (cross-validated) and fit the model
gsearch = GridSearchCV(model, param_grid)
model = gsearch.fit(x_train, y_train)

# Print the coefficients of the best estimator
print('Lasso Regression coefficients:', model.best_estimator_.coef_)

Lasso Regression coefficients: [-1.53356187e-01  3.82857722e-02  1.12934036e-03  7.77688688e-01
 -0.00000000e+00  5.13562266e+00 -2.62448943e-02 -1.15333336e+00
  2.06282867e-01 -1.07686407e-02 -7.28833323e-01  1.51428621e-02
 -5.11630529e-01]
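A raw coefficient array like the one above is easier to read when paired with feature names, and the `alpha` chosen by the search can be inspected through `best_params_`. The following is a minimal sketch on synthetic data. The names `X_demo`, `y_demo`, `search`, and the columns `f0` to `f3` are illustrative assumptions and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical; not the housing CSV).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f0", "f1", "f2", "f3"])
# Only f0 and f2 influence the target; f1 and f3 are pure noise.
y_demo = 3.0 * X_demo["f0"] - 2.0 * X_demo["f2"] + rng.normal(scale=0.1, size=200)

# Same alpha grid as in the notebook, with an explicit 5-fold CV.
search = GridSearchCV(Lasso(), {"alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_demo, y_demo)

# Which alpha was selected, and which coefficient belongs to which feature.
print("Best alpha:", search.best_params_["alpha"])
coef_by_feature = pd.Series(search.best_estimator_.coef_, index=X_demo.columns)
print(coef_by_feature)
```

In the notebook itself, the analogous call would presumably be `pd.Series(model.best_estimator_.coef_, index=x_train.columns)`.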


# Saving the Model

In [5]:
# Method 1: joblib
import joblib

# Save the model
joblib.dump(model, '../outputs/best_models/lasso.pkl')

# Load the model
model = joblib.load('../outputs/best_models/lasso.pkl')

In [6]:
# Method 2: pickle
import pickle

# Save the model
with open('../outputs/best_models/lasso.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load the model
with open('../outputs/best_models/lasso.pkl', 'rb') as f:
    model = pickle.load(f)
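Both methods fail if the target directory does not exist, so it can help to create it first and to confirm that the reloaded model predicts the same values. The following is a minimal sketch under stated assumptions: it writes to a temporary directory rather than `../outputs/best_models/`, and `fitted`, `restored`, and `lasso_demo.pkl` are hypothetical names used only for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Lasso

# Fit a small model on synthetic data (hypothetical; not the housing CSV).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, 0.0, -2.0])
fitted = Lasso(alpha=0.1).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the output directory before writing; dump() does not create it.
    model_dir = os.path.join(tmp_dir, "best_models")
    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "lasso_demo.pkl")

    joblib.dump(fitted, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(fitted.predict(X_demo), restored.predict(X_demo))
print("Round trip preserved predictions:", same_predictions)
```

Pickle-based files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.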

# Model Prediction

In [7]:
prediction = model.predict(x_test)

In [8]:
# Compute R2 and the root mean squared error (RMSE)
r2 = r2_score(y_test, prediction)
rmse = np.sqrt(mean_squared_error(y_test, prediction))

In [9]:
print("R2:", r2)
print("RMSE:", rmse)

R2: 0.6885107614920591
RMSE: 4.506544864853839
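Because the cell above applies `np.sqrt` to `mean_squared_error`, the reported value is the root mean squared error (RMSE), which is in the same units as `MEDV`, rather than the MSE itself. The following is a minimal sketch of how the three quantities relate, using small made-up arrays `y_true` and `y_pred` that are assumptions for illustration only.

```python
import numpy as np

# Hypothetical targets and predictions, chosen so the arithmetic is easy to follow.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred                 # [1, 0, -1, -2]
mse = np.mean(errors ** 2)               # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                      # square root of the MSE, about 1.22

# R2 = 1 - (residual sum of squares) / (total sum of squares) = 1 - 6/20
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MSE:", mse)
print("RMSE:", rmse)
print("R2:", r2)
```

Read this way, the RMSE of about 4.5 reported above corresponds to a typical error of roughly 4,500 USD in the median home value.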
