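# 4 Supervised Learning with sklearn
## 4.1 Generalized Linear Models
### 4.1.1 Ordinary Least Squares
Handles [regression problems](https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD). The feature matrix should be **non-singular** (full column rank, with more samples than features), and the features should be **centered/standardized**.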

In [1]:
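import numpy as np
import pandas as pd
from sklearn import preprocessing
# Column 0 is the target (release year); columns 1-90 are the audio features.
msd = pd.read_csv('data/YearPredictionMSD.txt',header=None)
msd.columns = ['y'] + ['x' + str(i) for i in range(1,91)] # flat list of names; a nested list would create a MultiIndex
# First 463715 rows for training, the rest for testing (the split given on the dataset page).
train = msd.iloc[:463715,:]
test = msd.iloc[463715:,:]
x_train,y_train,x_test,y_test = train.iloc[:,1:],train[['y']],test.iloc[:,1:],test[['y']]
# Fit the scaler on the training set only, then apply it to both sets.
scaler = preprocessing.StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)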

In [2]:
from sklearn import linear_model
reg = linear_model.LinearRegression(fit_intercept=True)
reg.fit(x_train,y_train)

LinearRegression()

In [3]:
# print('params: %s' % reg.get_params())
# print('rank: %s\nsingular values:\n%s' % (reg.rank_,reg.singular_))
# print('coefficients:\n%s\nintercept: %s' % (reg.coef_,reg.intercept_))
print('train R-square:%s\ntest R-square:%s' % (round(reg.score(x_train,y_train),2),round(reg.score(x_test,y_test),2)))


train R-square:0.24
test R-square:0.23
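
The `score` method used above reports the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. As a minimal sketch of what that number means, the helper below (`r_squared` is a hypothetical name and the toy values are made up) recomputes it with NumPy for a perfect fit, a constant mean prediction, and a fit that is worse than the mean.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
r2_perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])   # perfect fit
r2_mean = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])      # always predicting the mean
r2_reversed = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean
print(r2_perfect, r2_mean, r2_reversed)
```

A value near 0.24, as reported here, therefore means the linear model explains roughly a quarter of the variance of the target; a negative value means the model does worse than a constant prediction.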


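### 4.1.2 Ridge Regression and Classification
#### 4.1.2.1 Ridge Regression
Ridge regression addresses the sensitivity of ordinary least squares to collinearity by penalizing the size of the coefficients: it minimizes the residual sum of squares plus an L2 penalty. The larger the L2 coefficient $\alpha$, the stronger the shrinkage and the more robust the fit is to collinearity.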
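
For reference, the objective described above can be written out as follows. The symbols $X$ (feature matrix), $y$ (target vector) and $w$ (coefficient vector) are notation assumed here, and the intercept is left out for brevity.

$$
\min_{w} \; \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

Under the same assumptions the minimizer has the closed form $w = (X^\top X + \alpha I)^{-1} X^\top y$. For $\alpha > 0$ the matrix $X^\top X + \alpha I$ is always invertible, which is why the estimate stays stable when features are collinear.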

In [4]:
from sklearn import linear_model
ridge = linear_model.Ridge(alpha=10000.0, # larger values mean a stronger penalty
                           tol=0.001, # convergence tolerance of the iterative solvers
                           solver='auto', # auto,svd,cholesky,lsqr,sparse_cg,sag,saga
                           max_iter=None, # maximum number of iterations for iterative solvers such as sparse_cg and lsqr
                           random_state=None) # seed used by the sag and saga solvers
ridge.fit(x_train,y_train)
# print('params: %s' % ridge.get_params())
# print('coefficients:\n%s\nintercept: %s' % (ridge.coef_,ridge.intercept_))
print('train R-square:%s\ntest R-square:%s' % (round(ridge.score(x_train,y_train),2),round(ridge.score(x_test,y_test),2)))

train R-square:0.24
test R-square:0.23


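Because ridge regression mitigates collinearity, a ridge trace plot (the coefficients plotted against $\alpha$) can be used to decide whether to drop a feature: coefficients that swing widely or change sign as $\alpha$ grows point to collinear features.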

In [5]:
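# Ridge trace: refit over a grid of alpha values and record the coefficients of each fit.
coef,alpha = [],[]
for i in np.arange(0,100,1):
    ridge = linear_model.Ridge(alpha=i)
    ridge.fit(x_train,y_train)
    coef.append(pd.DataFrame(ridge.coef_)) # y_train is 2-D, so coef_ has shape (1, 90): one row per alpha
    alpha.append(i)
coef = pd.concat(coef,axis=0).reset_index(drop=True)
coef.columns = ['x' + str(i) for i in range(1,91)]
alpha = pd.DataFrame({'alpha':alpha})
alpha_coef = pd.concat([alpha,coef],axis=1) # one row per alpha; plot the x columns against alpha to draw the trace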
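
To illustrate what a ridge trace reveals, here is a small self-contained sketch on synthetic data (not the MSD data). It assumes the closed-form ridge solution without an intercept on centered data; `ridge_coef` is a hypothetical helper, and the feature `x2` is deliberately built as a near copy of `x1`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # nearly a copy of x1: strong collinearity
x3 = rng.normal(size=n)              # independent feature
X = np.column_stack([x1, x2, x3])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the features
y = 3.0 * x1 + 1.0 * x3 + rng.normal(scale=0.5, size=n)
y = y - y.mean()                          # center the target

def ridge_coef(X, y, alpha):
    """Closed-form ridge solution (no intercept; data is centered)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
trace = np.array([ridge_coef(X, y, a) for a in alphas])
for a, w in zip(alphas, trace):
    print(f"alpha={a:>6}: " + "  ".join(f"{c:+.3f}" for c in w))
```

In the printed table the first two columns (the collinear pair) are the ones to watch: how much they move as alpha grows, compared with the independent third column, is the signal a ridge trace is meant to expose.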

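sklearn provides RidgeCV, a ridge estimator with built-in cross-validation, for quickly locating a suitable alpha:

In [12]:
ridgeCV = linear_model.RidgeCV(alphas=np.arange(0,10,0.5),
                               cv=5, # None: efficient leave-one-out cross-validation; int: number of folds
                               scoring='r2') # scoring metric
ridgeCV.fit(x_train,y_train)
ridgeCV.alpha_ # also available: ridgeCV.best_score_,ridgeCV.coef_,ridgeCV.intercept_; if the selected alpha sits at the edge of the grid, widen the grid and refit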

9.5

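#### 4.1.2.2 Ridge Classification
Suited to [binary classification problems](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients): the model encodes the two classes as the labels $-1$ and $1$ and then fits them as a ridge regression problem. The predicted class is given by the sign of the regression output.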
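
The idea can be sketched in a few lines of NumPy. This is an illustration of the encoding-plus-regression recipe on synthetic data, not scikit-learn's actual implementation; the two Gaussian blobs and the value of `alpha` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(11)
# Two synthetic, well-separated classes (illustrative data only).
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(+2.0, 1.0, size=(100, 2))])
labels = np.array([0] * 100 + [1] * 100)

# Step 1: encode the classes as -1 / +1 regression targets.
t = np.where(labels == 1, 1.0, -1.0)

# Step 2: ridge regression on the encoded targets (centered data, closed form).
alpha = 0.1
x_mean, t_mean = X.mean(axis=0), t.mean()
Xc, tc = X - x_mean, t - t_mean
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ tc)
b = t_mean - x_mean @ w  # intercept recovered from the centering

# Step 3: classify by the sign of the regression output.
pred = (X @ w + b > 0).astype(int)
accuracy = (pred == labels).mean()
print(f"accuracy: {accuracy:.2f}")
```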

In [76]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
from sklearn import preprocessing,linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
data = pd.read_csv('data/default of credit card clients.csv')
# NOTE: iloc[:,1:] keeps every column after the first; check that the target column is not among the features.
x_train,x_test,y_train,y_test = train_test_split(data.iloc[:,1:],data['default payment next month'],test_size=0.3,random_state=11)
scaler = preprocessing.StandardScaler()
scaler.fit(x_train) # fit on the training split only
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
ridgeClassifier = linear_model.RidgeClassifier(alpha=0.1,
                                               tol=0.01,
                                               # solver='auto',
                                               # max_iter=None,
                                               # random_state=None,
                                               class_weight='balanced') # dict or 'balanced'
ridgeClassifier.fit(x_train,y_train)
print('train accuracy:%s,test accuracy:%s.' % (round(ridgeClassifier.score(x_train,y_train),2),round(ridgeClassifier.score(x_test,y_test),2)))
# The AUC below is computed from hard class predictions; decision_function scores would give a ranking-based AUC.
print('train auc:%s,test auc:%s.' % (round(roc_auc_score(y_train,ridgeClassifier.predict(x_train)),2),round(roc_auc_score(y_test,ridgeClassifier.predict(x_test)),2)))

train accuracy:0.69,test accuracy:0.68.
train auc:0.67,test auc:0.67.


A matching cross-validated estimator is also available:

In [63]:
ridgeClassifierCV = linear_model.RidgeClassifierCV(alphas=range(670,680,1),
                                                   cv=3,
                                                   scoring='roc_auc',
                                                   class_weight='balanced')
ridgeClassifierCV.fit(x_train,y_train)
print('best alpha:%s,best score:%s.' % (round(ridgeClassifierCV.alpha_,2),round(ridgeClassifierCV.best_score_,2)))

best alpha:678,best score:0.72.


### 4.1.3 Lasso
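Lasso minimizes the residual sum of squares with an L1 penalty. The penalty drives some coefficients exactly to zero, so Lasso quickly singles out the important variables and simplifies the model (the Lasso coefficient path shows how the coefficients change with the penalty).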
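
Why an L1 penalty produces exact zeros is easiest to see in the special case of orthonormal features, where the Lasso solution is the soft-thresholded least-squares solution. The sketch below assumes that special case; `soft_threshold` is a hypothetical helper, and the coefficient values and the threshold are made up for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink every entry of z toward zero by t and clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

ols_coef = np.array([3.0, -0.5, 0.2, -2.0])  # hypothetical unpenalized coefficients
lasso_coef = soft_threshold(ols_coef, 1.0)   # threshold plays the role of the penalty
n_selected = np.count_nonzero(lasso_coef)
print(lasso_coef)
print("non-zero coefficients:", n_selected)
```

Coefficients smaller in magnitude than the threshold are removed entirely, while the larger ones are kept but shrunk. Ridge, by contrast, only rescales coefficients and does not zero them out.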

In [19]:
import numpy as np
import pandas as pd
from sklearn import preprocessing,linear_model
msd = pd.read_csv('data/YearPredictionMSD.txt',header=None)
msd.columns = ['y'] + ['x' + str(i) for i in range(1,91)] # flat list of names; a nested list would create a MultiIndex
train = msd.iloc[:463715,:]
test = msd.iloc[463715:,:]
x_train,y_train,x_test,y_test = train.iloc[:,1:],train[['y']],test.iloc[:,1:],test[['y']]
scaler = preprocessing.StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
lasso = linear_model.Lasso(alpha=0.01, # larger values mean a stronger penalty
                           tol=0.01,
                           precompute=True) # precompute the Gram matrix to speed up the fit
lasso.fit(x_train,y_train)
print('train R-square:%s\ntest R-square:%s' % (round(lasso.score(x_train,y_train),2),round(lasso.score(x_test,y_test),2)))

train R-square:0.24
test R-square:0.23


In [43]:
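lassoCV = linear_model.LassoCV(cv=3,
                               eps=0.01, # length of the path: alpha_min / alpha_max
                               n_alphas=100, # number of alphas on the path; or pass an explicit grid, e.g. alphas=[0.1,10]
                               tol=0.01)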

lassoCV.fit(x_train,np.ravel(y_train))
print('alpha:%s' % lassoCV.alpha_)
print('train R-square:%s\ntest R-square:%s' % (round(lassoCV.score(x_train,y_train),2),round(lassoCV.score(x_test,y_test),2)))

alpha:0.024858764004282215
train R-square:0.24
test R-square:0.23


In [41]:
lassoLarsCV = linear_model.LassoLarsCV(cv=3,
                                       max_n_alphas=1000,
                                       eps=2.220446049250313e-16)
lassoLarsCV.fit(x_train,np.ravel(y_train))
print('alpha:%s' % lassoLarsCV.alpha_)
print('train R-square:%s\ntest R-square:%s' % (round(lassoLarsCV.score(x_train,y_train),2),round(lassoLarsCV.score(x_test,y_test),2)))

alpha:9.502802568068557e-06
train R-square:0.24
test R-square:0.23


In [44]:
lassoLarsIC = linear_model.LassoLarsIC(criterion='aic') # aic/bic
lassoLarsIC.fit(x_train,np.ravel(y_train))
print('alpha:%s' % lassoLarsIC.alpha_)
print('train R-square:%s\ntest R-square:%s' % (round(lassoLarsIC.score(x_train,y_train),2),round(lassoLarsIC.score(x_test,y_test),2)))

alpha:3.1343920992800895e-07
train R-square:0.24
test R-square:0.23


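Like ridge regression, Lasso can handle several targets at once: MultiTaskLasso fits multiple regression targets jointly and selects the same features for all of them.

### 4.1.4 Logistic Regression
Used for binary classification problems. The lbfgs solver is the most robust choice; on large datasets the saga solver is usually faster.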

In [3]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
from sklearn import preprocessing,linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
data = pd.read_csv('data/default of credit card clients.csv')
x_train,x_test,y_train,y_test = train_test_split(data.iloc[:,1:],data['default payment next month'],test_size=0.3,random_state=11)
scaler = preprocessing.StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

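lr = linear_model.LogisticRegression(C=10000, # inverse of the regularization strength; smaller C means stronger regularization
                                     #class_weight='balanced',
                                     penalty='elasticnet', # l1,l2,elasticnet; elasticnet requires solver='saga', l1 works with liblinear or saga
                                     solver='saga', # liblinear (coordinate descent), lbfgs (quasi-Newton), newton-cg (uses second derivatives of the loss), sag (stochastic average gradient), saga
                                     tol=0.01, # tolerance of the stopping criterion
                                     max_iter=500,
                                     random_state=11, # used by sag, saga and liblinear
                                     l1_ratio=0.9) # only used when penalty='elasticnet'; 0 is equivalent to penalty='l2', 1 to penalty='l1'; for 0 < l1_ratio < 1 the penalty is a mix of L1 and L2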

lr.fit(x_train,np.ravel(y_train))
print('train accuracy:%s,test accuracy:%s.' % (round(lr.score(x_train,y_train),2),round(lr.score(x_test,y_test),2))) 
print('train auc:%s,test auc:%s.' % (round(roc_auc_score(y_train,lr.predict_proba(x_train)[:,1]),2),round(roc_auc_score(y_test,lr.predict_proba(x_test)[:,1]),2)))

train accuracy:0.81,test accuracy:0.81.
train auc:0.72,test auc:0.73.
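
How `l1_ratio` blends the two penalties can be sketched directly. The helper below is hypothetical, and the factor 1/2 on the L2 term follows the form given in the scikit-learn user guide; treat the exact scaling as an assumption of this sketch. The weight vector is made up.

```python
import numpy as np

def elastic_net_penalty(w, l1_ratio):
    """l1_ratio * ||w||_1 + (1 - l1_ratio) * 0.5 * ||w||_2^2."""
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))      # L1 part: encourages exact zeros
    l2 = 0.5 * np.sum(w ** 2)   # L2 part: shrinks all weights smoothly
    return l1_ratio * l1 + (1.0 - l1_ratio) * l2

w = [0.5, -2.0, 0.0]
penalties = {ratio: elastic_net_penalty(w, ratio) for ratio in (0.0, 0.9, 1.0)}
for ratio, value in penalties.items():
    print(f"l1_ratio={ratio}: penalty={value:.4f}")
```

With `l1_ratio=0.0` only the L2 term remains, with `l1_ratio=1.0` only the L1 term, and `l1_ratio=0.9` as used above is dominated by the L1 term.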


In [72]:
lrCV = linear_model.LogisticRegressionCV(Cs=[0.1,1,10,100,1000,10000],
                                         cv=3,
                                         #class_weight='balanced',
                                         penalty='elasticnet', # l1,l2,elasticnet
                                         scoring='roc_auc',
                                         solver='saga', # newton-cg,lbfgs,liblinear,sag,saga
                                         tol=0.01,
                                         max_iter=500,
                                         random_state=11,
                                         l1_ratios=[0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])
lrCV.fit(x_train,np.ravel(y_train))
print('train auc:%s,test auc:%s.' % (round(roc_auc_score(y_train,lrCV.predict_proba(x_train)[:,1]),2),round(roc_auc_score(y_test,lrCV.predict_proba(x_test)[:,1]),2))) # predict,predict_proba,predict_log_proba
print('best C:%s,best l1 ratio:%s.' % (lrCV.C_,lrCV.l1_ratio_))

train auc:0.72,test auc:0.73.
best C:[10000.],best l1 ratio:[0.9].


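A practical observation: tuning the hyperparameters with class_weight='balanced' and then training the final model without it tended to give better results here.

### 4.1.6 Stochastic Gradient Descent
Well suited to large datasets. With loss="log" (named "log_loss" from scikit-learn 1.1 onward), SGDClassifier fits a logistic regression model, while loss="hinge" fits a linear support vector machine (SVM).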
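
The only difference between the two models is the per-sample gradient used in the update. The following NumPy sketch makes that explicit. It is a plain constant-step SGD written for illustration, not what SGDClassifier does internally; `sgd_linear_classifier`, its parameters and the synthetic blobs are all assumptions of the example.

```python
import numpy as np

def sgd_linear_classifier(X, t, loss="hinge", lr=0.05, alpha=0.001, epochs=20, seed=11):
    """Plain SGD for a linear classifier with targets t in {-1, +1}.

    loss="hinge" corresponds to a linear SVM, loss="log" to logistic regression.
    """
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):  # visit samples in random order
            margin = t[i] * (X[i] @ w + b)
            if loss == "hinge":
                # Subgradient of max(0, 1 - margin) with respect to the score
                g = -t[i] if margin < 1 else 0.0
            else:
                # Gradient of log(1 + exp(-margin)) with respect to the score
                g = -t[i] / (1.0 + np.exp(margin))
            w -= lr * (g * X[i] + alpha * w)  # loss gradient plus L2 penalty gradient
            b -= lr * g
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
t = np.array([-1.0] * 100 + [1.0] * 100)

acc = {}
for loss in ("hinge", "log"):
    w, b = sgd_linear_classifier(X, t, loss=loss)
    acc[loss] = float(np.mean(np.sign(X @ w + b) == t))
    print(loss, acc[loss])
```

Because each update touches a single sample, the cost per step does not depend on the dataset size, which is what makes the approach attractive for large datasets.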

In [153]:
import numpy as np
import pandas as pd
from pandas import DataFrame,Series
from sklearn import preprocessing,linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
dt1 = pd.read_csv('data/pp_gas_emission/gt_2011.csv')
dt2 = pd.read_csv('data/pp_gas_emission/gt_2012.csv')
dt3 = pd.read_csv('data/pp_gas_emission/gt_2013.csv')
dt4 = pd.read_csv('data/pp_gas_emission/gt_2014.csv')
data = pd.concat([dt1,dt2,dt3,dt4],axis=0).reset_index(drop=True)
verify = pd.read_csv('data/pp_gas_emission/gt_2015.csv')

x_train,x_test,y_train,y_test = train_test_split(data.iloc[:,:9],data.NOX)
y_verify = verify.NOX

scaler = preprocessing.StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
verify = scaler.transform(verify.iloc[:,:9])

sgd = linear_model.SGDRegressor(loss='epsilon_insensitive', # squared_loss (renamed squared_error in newer scikit-learn), huber (less sensitive to outliers), epsilon_insensitive (linear beyond the epsilon band), squared_epsilon_insensitive
                                penalty='elasticnet', # l2,l1,elasticnet
                                alpha=0.01, # larger values mean stronger regularization; also used to compute the step size when learning_rate='optimal'
                                l1_ratio=0.5,
                                max_iter=100,
                                tol=0.01,
                                epsilon=0.1, # threshold used by huber, epsilon_insensitive and squared_epsilon_insensitive
                                random_state=11,
                                learning_rate='optimal', # constant,optimal,invscaling,adaptive
                                eta0=0.1, # initial learning rate for 'constant','invscaling','adaptive'
                                # power_t=0.25, # exponent used by invscaling
                                early_stopping=True,
                                validation_fraction=0.1, # share of the training data held out for early stopping
                                n_iter_no_change=50)
sgd.fit(x_train,y_train)
print(sgd.n_iter_)
print(sgd.score(x_train,y_train)) # score returns R-square; for MSE use mean_squared_error(y_train,sgd.predict(x_train))
print(sgd.score(x_test,y_test))
print(sgd.score(verify,y_verify)) # 2015 hold-out; a negative R-square means the fit is worse than predicting the mean

51
0.38784695658444734
0.3928081903351961
-0.06634117171935361


In [199]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
from sklearn import preprocessing,linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
data = pd.read_csv('data/default of credit card clients.csv')
x_train,x_test,y_train,y_test = train_test_split(data.iloc[:,1:],data['default payment next month'],test_size=0.3,random_state=11)
scaler = preprocessing.StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

sgd = linear_model.SGDClassifier(loss='hinge', # classification losses: hinge, log (log_loss in newer scikit-learn), modified_huber, squared_hinge, perceptron; regression losses: squared_loss, huber, epsilon_insensitive, squared_epsilon_insensitive
                                 penalty='elasticnet',
                                 alpha=0.2,
                                 l1_ratio=0.2,
                                 random_state=11,
                                 learning_rate='invscaling',
                                 eta0=0.1,
                                 power_t=0.4,
                                 early_stopping=True,
                                 validation_fraction=0.1,
                                 n_iter_no_change=100,
                                 class_weight='balanced')
sgd.fit(x_train,y_train)
print(sgd.n_iter_)
print(sgd.score(x_train,y_train))
print(sgd.score(x_test,y_test))
print(round(roc_auc_score(y_train,sgd.predict(x_train)),2))
print(round(roc_auc_score(y_test,sgd.predict(x_test)),2))

151
0.8073809523809524
0.8023333333333333
0.68
0.69


### 4.1.7 Polynomial Regression

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing,linear_model
from sklearn.metrics import roc_auc_score

data = pd.read_table('data/pp_gas_emission/gt_2011.csv',sep=',') # 2011~2015
x_train,x_test,y_train,y_test = train_test_split(data.iloc[:,:9],data.CO,random_state=11,test_size=0.3)

scaler = preprocessing.StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
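
A minimal sketch of polynomial regression, assuming the usual scikit-learn recipe of expanding the features with `PolynomialFeatures` and fitting `LinearRegression` on the result. The data is synthetic and only illustrative; it is not the gas emission data loaded above, and the degree of 2 is chosen to match how the toy target is generated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(11)
x = rng.uniform(-3, 3, size=(200, 1))
# Quadratic ground truth plus a little noise.
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

# Baseline: a straight line cannot capture the curvature.
linear = LinearRegression().fit(x, y)

# Polynomial regression: expand x into [x, x^2], then fit a linear model on the expanded features.
poly = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                     LinearRegression()).fit(x, y)

r2_linear = linear.score(x, y)
r2_poly = poly.score(x, y)
print(f"linear R-square: {r2_linear:.3f}")
print(f"degree-2 R-square: {r2_poly:.3f}")
```

The model stays linear in its coefficients, so any of the regularized estimators from the earlier sections could replace `LinearRegression` in the pipeline.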


### 4.1.8 Robust Regression

In [None]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
from sklearn import preprocessing,linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
data = pd.read_csv('data/default of credit card clients.csv')
x_train,x_test,y_train,y_test = train_test_split(data.iloc[:,1:],data['default payment next month'],test_size=0.3,random_state=11)
scaler = preprocessing.StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

In [None]:
from sklearn import kernel_ridge
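# alpha, gamma, degree and coef0 are set to the library defaults here as placeholders.
kernel_ridge.KernelRidge(alpha=1.0, # regularization strength
                         kernel='linear', # linear,rbf,sigmoid,poly/polynomial,laplacian,cosine,chi2,additive_chi2
                         gamma=None, # kernel coefficient for rbf,laplacian,poly,chi2,sigmoid; ignored by other kernels
                         degree=3, # degree d of the poly kernel; ignored by other kernels
                         coef0=1, # constant term of the poly and sigmoid kernels; ignored by other kernels
                         kernel_params=None) # extra parameters when kernel is a callable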

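## 4.2 Linear and Quadratic Discriminant Analysis
## 4.3 Kernel Ridge Regression
## 4.4 Support Vector Machines
## 4.5 Stochastic Gradient Descent
## 4.6 Nearest Neighbors
## 4.7 Gaussian Processes
## 4.8 Cross Decomposition
## 4.9 Naive Bayes
## 4.10 Decision Trees
## 4.11 Ensemble Methods
## 4.12 Multiclass and Multioutput Algorithms
## 4.13 Feature Selection
## 4.14 Semi-supervised Learning
## 4.15 Isotonic Regression
## 4.16 Probability Calibration
## 4.17 Neural Network Models (Supervised)

## Overfitting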