# 3.2 Evaluating Linear Regression Models

# 3.2.1 Implementing Model Evaluation in Code

Code recap from 3.1.3: a linear regression model of years of service and income in different industries

In [1]:
# 1. Load the data
import pandas
df = pandas.read_excel('IT行业收入表.xlsx')
X = df[['工龄']]  # feature: years of service (2-D, as scikit-learn expects)
Y = df['薪水']    # target: salary

# 2. Train the model
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X,Y)

# 3. Visualize the model
from matplotlib import pyplot as plt
plt.scatter(X,Y)
plt.plot(X, regr.predict(X), color='red')  # draw the fitted line in red
plt.xlabel('Years of service')
plt.ylabel('Salary')
plt.show()

# 4. Build the linear regression equation
print('Coefficient a: ' + str(regr.coef_[0]))
print('Intercept b: ' + str(regr.intercept_))

<Figure size 640x480 with 1 Axes>

Coefficient a: 2497.1513476046866
Intercept b: 10143.131966873787
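The printed coefficient and intercept define the fitted line y = a·x + b. As a minimal sketch, the helper below plugs the two values from the output above into that equation; the function name `predict_salary` and the sample input of 5 years are illustrative assumptions, not part of the original notebook.

```python
# Values copied from the notebook output above.
COEF_A = 2497.1513476046866       # slope: change in salary per extra year of service
INTERCEPT_B = 10143.131966873787  # predicted salary at 0 years of service


def predict_salary(years_of_service):
    """Evaluate the fitted line y = a * x + b (hypothetical helper)."""
    return COEF_A * years_of_service + INTERCEPT_B


# Illustrative input: 5 years of service.
print(predict_salary(5))
```

For the same input this should agree with `regr.predict`, since a one-feature `LinearRegression` model is exactly this line.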


In [2]:
import statsmodels.api as sm
X2 = sm.add_constant(X)
est = sm.OLS(Y, X2).fit()
est.summary()  # outside Jupyter Notebook, write print(est.summary())


| | | | |
|---|---|---|---|
| Dep. Variable: | 薪水 | R-squared: | 0.855 |
| Model: | OLS | Adj. R-squared: | 0.854 |
| Method: | Least Squares | F-statistic: | 578.5 |
| Date: | Wed, 01 Jan 2020 | Prob (F-statistic): | 6.69e-43 |
| Time: | 19:43:12 | Log-Likelihood: | -930.83 |
| No. Observations: | 100 | AIC: | 1866.0 |
| Df Residuals: | 98 | BIC: | 1871.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.014e+04 | 507.633 | 19.981 | 0.000 | 9135.751 | 1.12e+04 |
| 工龄 | 2497.1513 | 103.823 | 24.052 | 0.000 | 2291.118 | 2703.185 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.287 | Durbin-Watson: | 0.555 |
| Prob(Omnibus): | 0.867 | Jarque-Bera (JB): | 0.463 |
| Skew: | 0.007 | Prob(JB): | 0.793 |
| Kurtosis: | 2.667 | Cond. No. | 9.49 |
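Two entries in this summary can be checked by hand from other entries in the same output. The sketch below uses the standard formulas for adjusted R-squared, 1 − (1 − R²)(n − 1)/(n − p − 1), and for the t statistic, coef / std err. The function names are illustrative, and the inputs are the rounded values shown in the summary, so agreement is only expected to about three decimals.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: R-squared penalised for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def t_statistic(coef, std_err):
    """t statistic for the null hypothesis that the coefficient is zero."""
    return coef / std_err


# Inputs copied from the summary: 100 observations, 1 predictor (工龄).
adj = adjusted_r2(0.855, n_obs=100, n_predictors=1)
t_slope = t_statistic(2497.1513, 103.823)

print(round(adj, 3))      # compare with "Adj. R-squared"
print(round(t_slope, 3))  # compare with the t value in the 工龄 row
```

Both results round to the values reported in the summary (0.854 and 24.052).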


# How the evaluation changes when the model is a quadratic in one variable

In [3]:
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_ = poly_reg.fit_transform(X)  # columns: 1, x, x^2

import statsmodels.api as sm
X2 = sm.add_constant(X_)  # pass in X_, which contains the x^2 term
est = sm.OLS(Y, X2).fit()
est.summary()  # outside Jupyter Notebook, write print(est.summary())

| | | | |
|---|---|---|---|
| Dep. Variable: | 薪水 | R-squared: | 0.931 |
| Model: | OLS | Adj. R-squared: | 0.930 |
| Method: | Least Squares | F-statistic: | 654.8 |
| Date: | Wed, 01 Jan 2020 | Prob (F-statistic): | 4.70e-57 |
| Time: | 19:43:12 | Log-Likelihood: | -893.72 |
| No. Observations: | 100 | AIC: | 1793.0 |
| Df Residuals: | 97 | BIC: | 1801.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.399e+04 | 512.264 | 27.307 | 0.000 | 1.3e+04 | 1.5e+04 |
| x1 | -743.6808 | 321.809 | -2.311 | 0.023 | -1382.383 | -104.979 |
| x2 | 400.8040 | 38.790 | 10.333 | 0.000 | 323.816 | 477.792 |

| | | | |
|---|---|---|---|
| Omnibus: | 2.440 | Durbin-Watson: | 1.137 |
| Prob(Omnibus): | 0.295 | Jarque-Bera (JB): | 2.083 |
| Skew: | -0.352 | Prob(JB): | 0.353 |
| Kurtosis: | 3.063 | Cond. No. | 102.0 |


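The model does improve: R-squared rises from 0.855 to 0.931, adjusted R-squared rises from 0.854 to 0.930, and AIC falls from 1866 to 1793.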
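To make the quadratic fit less opaque, here is a plain-Python sketch of the design matrix that `PolynomialFeatures(degree=2)` produces for a single input column: a column of ones, then x, then x². The function name and the sample values are hypothetical; the real transformer returns a NumPy array rather than nested lists.

```python
def quadratic_features(values):
    """Rows of [1, x, x**2], mirroring the column order of
    PolynomialFeatures(degree=2) for one input column (illustrative helper)."""
    return [[1.0, float(x), float(x) ** 2] for x in values]


# Hypothetical years-of-service values, not taken from the data file.
years = [0, 1, 2.5]
for row in quadratic_features(years):
    print(row)
```

The leading column of ones already acts as the intercept term, which is why the summary lists `const`, `x1` and `x2`: with its default settings, `sm.add_constant` leaves an array unchanged when it already contains a constant column.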

**Supplementary note: another way to obtain the R-squared value in code**

In [4]:
# 1. Load the data
import pandas
df = pandas.read_excel('IT行业收入表.xlsx')
X = df[['工龄']]
Y = df['薪水']

# 2. Train the model
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X,Y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [5]:
from sklearn.metrics import r2_score
r2 = r2_score(Y, regr.predict(X))
print(r2)

0.8551365584870814


This agrees with the R-squared of 0.855 reported earlier by the statsmodels summary.
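For reference, the quantity that `r2_score` reports is the coefficient of determination, R² = 1 − SS_res / SS_tot. The sketch below computes it in plain Python on a small made-up dataset; the function name `r_squared` and the numbers are assumptions for illustration, not values from the salary data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical observations and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]
print(r_squared(y_true, y_pred))
```

A value of 1 means the predictions reproduce the observations exactly, and a value of 0 means the model does no better than predicting the mean of `y_true`.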