# Boston Housing Price Prediction Case Study: Linear Regression Analysis

Compare how well a linear model fits y versus log y.

## 1. Import the required packages

In [2]:
import numpy as np  # array and matrix operations
import pandas as pd # tabular (SQL-like) data handling

from sklearn.metrics import r2_score  # evaluates the performance of a regression model

## 2. Load the data
The data has already been through feature engineering. Run 2_FE_BostonHousePrice.ipynb first to produce the file FE_boston_housing.csv.

In [3]:
# path to where the data lies
#dpath = './data/'
df = pd.read_csv("FE_boston_housing.csv")

# Inspect the first 5 rows to get an overview of each column (feature)
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,TAX,PTRATIO,...,RAD_2,RAD_3,RAD_4,RAD_5,RAD_6,RAD_7,RAD_8,RAD_24,MEDV,log_MEDV
0,0.0,0.18,0.067815,0.0,0.314815,0.577505,0.641607,0.269203,0.208015,0.3,...,0,0,0,0,0,0,0,0,0.422222,0.666856
1,0.000236,0.0,0.242302,0.0,0.17284,0.547998,0.782698,0.348962,0.104962,0.5,...,1,0,0,0,0,0,0,0,0.368889,0.619696
2,0.000236,0.0,0.242302,0.0,0.17284,0.694386,0.599382,0.348962,0.104962,0.5,...,1,0,0,0,0,0,0,0,0.66,0.833335
3,0.000293,0.0,0.06305,0.0,0.150206,0.658555,0.441813,0.448545,0.066794,0.6,...,0,1,0,0,0,0,0,0,0.631111,0.816001
4,0.000705,0.0,0.06305,0.0,0.150206,0.687105,0.528321,0.448545,0.066794,0.6,...,0,1,0,0,0,0,0,0,0.693333,0.852567


### Basic information about the data
Number of samples and number of features, plus each feature's non-null count and data type.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 23 columns):
CRIM        506 non-null float64
ZN          506 non-null float64
INDUS       506 non-null float64
CHAS        506 non-null float64
NOX         506 non-null float64
RM          506 non-null float64
AGE         506 non-null float64
DIS         506 non-null float64
TAX         506 non-null float64
PTRATIO     506 non-null float64
B           506 non-null float64
LSTAT       506 non-null float64
RAD_1       506 non-null int64
RAD_2       506 non-null int64
RAD_3       506 non-null int64
RAD_4       506 non-null int64
RAD_5       506 non-null int64
RAD_6       506 non-null int64
RAD_7       506 non-null int64
RAD_8       506 non-null int64
RAD_24      506 non-null int64
MEDV        506 non-null float64
log_MEDV    506 non-null float64
dtypes: float64(14), int64(9)
memory usage: 91.0 KB


### 数据准备

In [5]:
# Separate the input features X and the output y from the data
# y has two columns here: the original MEDV and its log1p-transformed value
col_y = ["MEDV","log_MEDV"]
y = pd.DataFrame(df,columns = col_y)

X = df.drop(["MEDV", "log_MEDV"], axis = 1)

# Feature names, used later to show which feature each weight coefficient belongs to
feat_names = X.columns
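
As a rough check on how the two target columns relate, the sketch below reconstructs the first row's `MEDV` and `log_MEDV` values. It assumes that the raw MEDV ranges from 5 to 50, that the first row's raw value is 24.0 (as in the standard Boston dataset), and that both targets were min-max scaled during feature engineering. None of this is stated in this notebook, and the helper names are hypothetical.

```python
import numpy as np

# Assumed raw MEDV range (standard Boston dataset); not stated in this notebook.
MEDV_MIN, MEDV_MAX = 5.0, 50.0

def min_max(v, lo, hi):
    """Scale v linearly so that lo maps to 0 and hi maps to 1."""
    return (v - lo) / (hi - lo)

def scaled_targets(medv_raw):
    """Return (scaled MEDV, scaled log1p(MEDV)) for a raw median value."""
    medv = min_max(medv_raw, MEDV_MIN, MEDV_MAX)
    log_medv = min_max(np.log1p(medv_raw), np.log1p(MEDV_MIN), np.log1p(MEDV_MAX))
    return float(medv), float(log_medv)

# 24.0 is the assumed raw MEDV of row 0
medv, log_medv = scaled_targets(24.0)
print(medv, log_medv)
```

Under these assumptions the two printed values agree with the `MEDV` and `log_MEDV` entries of row 0 in the `df.head()` output to within rounding.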

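When the dataset is fairly large, train_test_split can be used to hold out part of the training set as a validation set.
When there are few samples, cross-validation is recommended.
For linear regression, leave-one-out cross-validation has a computational shortcut.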
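
The leave-one-out shortcut can be sketched as follows. For ordinary least squares, the leave-one-out residual of sample i equals the ordinary residual divided by 1 - h_ii, where h_ii is the i-th diagonal entry of the hat matrix, so a single fit is enough. The data here is synthetic and the variable names are illustrative, not taken from this notebook.

```python
import numpy as np

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
n, p = 40, 3
X_syn = rng.normal(size=(n, p))
y_syn = X_syn @ np.array([1.5, -2.0, 0.5]) + 0.3 + rng.normal(scale=0.1, size=n)

# Design matrix with an explicit intercept column
A = np.column_stack([np.ones(n), X_syn])

# One fit on all samples
beta, *_ = np.linalg.lstsq(A, y_syn, rcond=None)
residuals = y_syn - A @ beta

# Diagonal of the hat matrix H = A A^+ (A^+ is the pseudo-inverse)
h_diag = np.einsum("ij,ji->i", A, np.linalg.pinv(A))

# Shortcut: leave-one-out residual = residual / (1 - h_ii)
loo_shortcut = residuals / (1.0 - h_diag)

# Brute force for comparison: refit n times, each time leaving one sample out
loo_brute = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(A[mask], y_syn[mask], rcond=None)
    loo_brute[i] = y_syn[i] - A[i] @ b_i

loo_mse = np.mean(loo_shortcut ** 2)
print("shortcut matches brute force:", np.allclose(loo_shortcut, loo_brute))
print("leave-one-out MSE:", loo_mse)
```

The brute-force loop is included only to show that the two computations agree. In practice the shortcut alone would be used.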

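Below, the data is split into a training set and a test set, simply so that you can compare the model's training error, the test-error estimate obtained on a validation set, and the error on the test set.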
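
As a minimal sketch of that three-way comparison, the code below computes a training score, a 5-fold cross-validated estimate on the training portion, and a held-out test score. It uses synthetic data rather than FE_boston_housing.csv, and the choice of 5 folds is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data (illustrative only)
rng = np.random.default_rng(33)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Hold out 20% as the test set, as in the notebook
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=33)

# Cross-validated r2 on the training portion only
cv = KFold(n_splits=5, shuffle=True, random_state=33)
cv_scores = cross_val_score(LinearRegression(), X_tr, y_tr, cv=cv, scoring="r2")

# Training and test r2 from a single fit on the training portion
model = LinearRegression().fit(X_tr, y_tr)
train_r2 = r2_score(y_tr, model.predict(X_tr))
test_r2 = r2_score(y_te, model.predict(X_te))

print("train r2:", train_r2)
print("cross-validated r2 (mean):", cv_scores.mean())
print("test r2:", test_r2)
```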

In [6]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

# Randomly sample 20% of the data as the test set; the rest is the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2)
X_train.shape

(404, 21)

## 3. Choose the model type

### 3.1 Try linear regression with default parameters

In [7]:
# Linear regression
#class sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)
from sklearn.linear_model import LinearRegression

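# 1. Create the estimator with its default configuration
lr = LinearRegression()

# 2. Fit the model parameters on the training data
lr.fit(X_train, y_train)

# 3. Use the trained model to predict on the test set (and on the training set)
y_test_pred_lr = lr.predict(X_test)
y_train_pred_lr = lr.predict(X_train)


# Look at each feature's weight coefficient; the absolute value of a coefficient
# can be read as that feature's importance (only meaningful when features are on comparable scales)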
fs = pd.DataFrame({"columns":list(feat_names), "coef_org":list((lr.coef_[0,:].T)),"coef_log":list((lr.coef_[1,:].T))})
fs.sort_values(by=['coef_org'],ascending=False)

Unnamed: 0,coef_log,coef_org,columns
5,0.236617,0.452377,RM
1,0.073055,0.129239,ZN
20,0.094682,0.103221,RAD_24
10,0.078706,0.078899,B
3,0.050563,0.05966,CHAS
18,0.028402,0.037664,RAD_7
19,0.016906,0.030071,RAD_8
14,0.010503,0.027465,RAD_3
2,0.034671,0.013818,INDUS
6,0.002937,-0.001228,AGE
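
The cell above sorts by the signed value of `coef_org`, while importance is judged by absolute value. A minimal sketch of ranking by magnitude follows, using a few rows copied from the table above. The names `fs_demo` and `abs_coef_org` are introduced here for illustration.

```python
import pandas as pd

# A few rows copied from the coefficient table above
fs_demo = pd.DataFrame({
    "columns": ["AGE", "B", "RAD_24", "ZN", "RM"],
    "coef_org": [-0.001228, 0.078899, 0.103221, 0.129239, 0.452377],
    "coef_log": [0.002937, 0.078706, 0.094682, 0.073055, 0.236617],
})

# Rank by magnitude so that large negative coefficients are not pushed to the bottom
fs_demo["abs_coef_org"] = fs_demo["coef_org"].abs()
ranked = fs_demo.sort_values(by="abs_coef_org", ascending=False)
print(ranked[["columns", "coef_org", "coef_log"]])
```

None of the rows shown has a large negative coefficient, so the order here is the same as the signed sort. On the full table, features with strongly negative weights would move toward the top.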


#### 3.1.1 Model evaluation

In [8]:
# Use r2_score to evaluate the model on the test and training sets and print the results
# Test set
print('The r2 score of LinearRegression on test with original MEDV is', r2_score(y_test.iloc[:,0], y_test_pred_lr[:,0]))
# Training set
print('The r2 score of LinearRegression on train with original MEDV is', r2_score(y_train.iloc[:,0], y_train_pred_lr[:,0]))

# log-transformed y
# Test set
print('The r2 score of LinearRegression on test with log MEDV is', r2_score(y_test.iloc[:,1], y_test_pred_lr[:,1]))
# Training set
print('The r2 score of LinearRegression on train  with log MEDV is', r2_score(y_train.iloc[:,1], y_train_pred_lr[:,1]))

The r2 score of LinearRegression on test with original MEDV is 0.6939789810511077
The r2 score of LinearRegression on train with original MEDV is 0.7549146436868254
The r2 score of LinearRegression on test with log MEDV is 0.7083054590270493
The r2 score of LinearRegression on train  with log MEDV is 0.811201041743473


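After taking the log of y (the price), the r2 score improves slightly: from about 0.694 to 0.708 on the test set and from about 0.755 to 0.811 on the training set. The two scores are measured against different targets (MEDV and log_MEDV), so the comparison is indicative rather than exact.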
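
One way to put both models on the same footing is to map the log-model predictions back to the original scale with `np.expm1` (the inverse of `np.log1p`) and score both against the untransformed target. The sketch below does this on synthetic data. It skips the min-max scaling that the log target appears to have received in this notebook, and all names in it are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic positive target with multiplicative noise (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(300, 3))
linear_part = 1.0 + X_demo @ np.array([1.0, 0.5, -0.5])
y_demo = np.exp(linear_part + rng.normal(scale=0.1, size=300))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=33)

# Model A: fit on the raw target
lr_raw = LinearRegression().fit(X_tr, y_tr)
pred_raw = lr_raw.predict(X_te)

# Model B: fit on log1p(y), then map predictions back with expm1
lr_log = LinearRegression().fit(X_tr, np.log1p(y_tr))
pred_log_back = np.expm1(lr_log.predict(X_te))

# Both scores are now computed against the same untransformed target
r2_raw = r2_score(y_te, pred_raw)
r2_log_back = r2_score(y_te, pred_log_back)
print("r2 on original scale, raw-target model:", r2_raw)
print("r2 on original scale, log-target model:", r2_log_back)
```

Which model scores higher depends on the data. The point of the sketch is only that both numbers refer to the same quantity.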