# Kaggle | Give Me Some Credit - Credit Default Probability Prediction

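## Contents
1. Introduction  
1.1 Field descriptions  

2. Loading and exploring the data  
2.1 Loading the data and summary statistics  
2.2 Field distributions, box plots, and correlation heatmap  

3. Building the preprocessing functions  
3.1 Outlier and obvious-error handling function  
3.2 Collinearity removal function  
3.3 Missing-value handling function  
3.4 Resampling function  

4. Preprocessing and model training  
4.1 Logistic regression model  
4.2 Random forest classifier  
4.3 XGBoost model  

5. Predicting and generating the results  

6. Appendix  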

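## 1. Introduction
**Goal: build models that predict the probability that a user will default in the future, to help borrowers make the best financial decisions.**
* Models used: logistic regression, random forest classifier, XGBoost
* Dataset download: https://www.kaggle.com/c/GiveMeSomeCredit/data

***(If you have any suggestions or questions, please leave them in the comments.)***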

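### 1.1 Field descriptions
Each entry gives the original field, its meaning, the Chinese column name used in the code below, and its type.

* **SeriousDlqin2yrs**  
Whether the borrower may default within the next two years. Column name: `未来两年可能违约`. Type: Y/N

* **RevolvingUtilizationOfUnsecuredLines**  
Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of the credit limits. Column name: `可用信贷额度比例`. Type: percentage

* **age**  
Age of the borrower in years. Column name: `年龄`. Type: integer

* **NumberOfTime30-59DaysPastDueNotWorse**  
Number of times the borrower was 30-59 days past due, but no worse, in the last two years. Column name: `逾期30-59天的笔数`. Type: integer

* **DebtRatio**  
Monthly debt payments, alimony, and living costs divided by monthly gross income. Column name: `负债率`. Type: percentage

* **MonthlyIncome**  
Monthly income. Column name: `月收入`. Type: real

* **NumberOfOpenCreditLinesAndLoans**  
Number of open loans (installment loans such as car loans or mortgages) and lines of credit (such as credit cards). Column name: `信贷数量`. Type: integer

* **NumberOfTimes90DaysLate**  
Number of times the borrower was 90 days or more past due. Column name: `逾期90天+的笔数`. Type: integer

* **NumberRealEstateLoansOrLines**  
Number of mortgage and real estate loans, including home equity lines of credit. Column name: `固定资产贷款数`. Type: integer

* **NumberOfTime60-89DaysPastDueNotWorse**  
Number of times the borrower was 60-89 days past due, but no worse, in the last two years. Column name: `逾期60-89天的笔数`. Type: integer

* **NumberOfDependents**  
Number of dependents in the family, excluding the borrower (spouse, children, etc.). Column name: `家属数量`. Type: integer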

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PowerTransformer
from sklearn.linear_model import LinearRegression,LassoCV,LogisticRegression
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor
from sklearn.model_selection import KFold,train_test_split,StratifiedKFold,GridSearchCV,cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score,accuracy_score, \
                            precision_score,recall_score, roc_auc_score
import warnings
warnings.filterwarnings('ignore')


## 2. Loading and exploring the data

### 2.1 Loading the data and summary statistics

In [2]:
df0 = pd.read_csv('/Users/ranmo/Desktop/机器学习项目集/信用违约预测模型/data/train.csv')
df0 = df0.drop('Unnamed: 0',axis=1)
# Rename the columns to Chinese for easier reading
df0.rename(columns = {'SeriousDlqin2yrs':'未来两年可能违约', 'RevolvingUtilizationOfUnsecuredLines':'可用信贷额度比例', 'age':'年龄',
       'NumberOfTime30-59DaysPastDueNotWorse':'逾期30-59天的笔数', 'DebtRatio':'负债率', 'MonthlyIncome':'月收入',
       'NumberOfOpenCreditLinesAndLoans':'信贷数量', 'NumberOfTimes90DaysLate':'逾期90天+的笔数',
       'NumberRealEstateLoansOrLines':'固定资产贷款数', 'NumberOfTime60-89DaysPastDueNotWorse':'逾期60-89天的笔数',
       'NumberOfDependents':'家属数量'},inplace=True)
print(df0.info())
print(df0.head().T)
df0.describe().T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 11 columns):
未来两年可能违约       150000 non-null int64
可用信贷额度比例       150000 non-null float64
年龄             150000 non-null int64
逾期30-59天的笔数    150000 non-null int64
负债率            150000 non-null float64
月收入            120269 non-null float64
信贷数量           150000 non-null int64
逾期90天+的笔数      150000 non-null int64
固定资产贷款数        150000 non-null int64
逾期60-89天的笔数    150000 non-null int64
家属数量           146076 non-null float64
dtypes: float64(4), int64(7)
memory usage: 12.6 MB
None
                       0            1            2           3             4
未来两年可能违约        1.000000     0.000000     0.000000     0.00000      0.000000
可用信贷额度比例        0.766127     0.957151     0.658180     0.23381      0.907239
年龄             45.000000    40.000000    38.000000    30.00000     49.000000
逾期30-59天的笔数     2.000000     0.000000     1.000000     0.00000      1.000000
负债率             0.802982     0.12

| Field | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 未来两年可能违约 | 150000.0 | 0.06684 | 0.249746 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 可用信贷额度比例 | 150000.0 | 6.048438 | 249.755371 | 0.0 | 0.029867 | 0.154181 | 0.559046 | 50708.0 |
| 年龄 | 150000.0 | 52.295207 | 14.771866 | 0.0 | 41.0 | 52.0 | 63.0 | 109.0 |
| 逾期30-59天的笔数 | 150000.0 | 0.421033 | 4.192781 | 0.0 | 0.0 | 0.0 | 0.0 | 98.0 |
| 负债率 | 150000.0 | 353.005076 | 2037.818523 | 0.0 | 0.175074 | 0.366508 | 0.868254 | 329664.0 |
| 月收入 | 120269.0 | 6670.221237 | 14384.674215 | 0.0 | 3400.0 | 5400.0 | 8249.0 | 3008750.0 |
| 信贷数量 | 150000.0 | 8.45276 | 5.145951 | 0.0 | 5.0 | 8.0 | 11.0 | 58.0 |
| 逾期90天+的笔数 | 150000.0 | 0.265973 | 4.169304 | 0.0 | 0.0 | 0.0 | 0.0 | 98.0 |
| 固定资产贷款数 | 150000.0 | 1.01824 | 1.129771 | 0.0 | 0.0 | 1.0 | 2.0 | 54.0 |
| 逾期60-89天的笔数 | 150000.0 | 0.240387 | 4.155179 | 0.0 | 0.0 | 0.0 | 0.0 | 98.0 |


In [3]:
# The classes are heavily imbalanced, which will affect model performance
print(df0.未来两年可能违约.value_counts())

# Count the missing values: 月收入 has 29731 missing, 家属数量 has 3924
df0.isnull().sum()

0    139974
1     10026
Name: 未来两年可能违约, dtype: int64


未来两年可能违约           0
可用信贷额度比例           0
年龄                 0
逾期30-59天的笔数        0
负债率                0
月收入            29731
信贷数量               0
逾期90天+的笔数          0
固定资产贷款数            0
逾期60-89天的笔数        0
家属数量            3924
dtype: int64
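
To put the counts above in proportion, the short calculation below converts them to rates. The numbers are copied from the output above; the variable names are only illustrative.

```python
# Counts copied from the notebook output above
n_rows = 150000
n_default = 10026
missing = {'月收入': 29731, '家属数量': 3924}

# Share of positive labels (rows flagged as a default)
default_rate = n_default / n_rows

# Share of rows with a missing value, per column
missing_rate = {col: n / n_rows for col, n in missing.items()}

print(f'default rate: {default_rate:.2%}')
for col, rate in missing_rate.items():
    print(f'{col} missing: {rate:.2%}')
```

Roughly 1 row in 15 is a default, and about a fifth of the rows lack 月收入, which is why both resampling and missing-value handling get their own functions in section 3.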

### 2.2 Field distributions, box plots, and correlation heatmap

In [4]:
# Plot the distribution of each field
# Most fields are clearly skewed, so a skew correction should be considered before modeling
plt.figure(figsize=(20,20),dpi=300)
plt.subplots_adjust(wspace =0.3, hspace =0.3)
for n,i in enumerate(df0.columns):
    plt.subplot(4,3,n+1)
    plt.title(i,fontsize=15)
    plt.grid(linestyle='--')
    df0[i].hist(color='grey',alpha=0.5)

In [5]:
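# Use box plots to look for outliers in each field
# 负债率 has many outliers (errors); 可用信贷额度比例 also has many outliers (errors), since in theory it should be at most 1
# '逾期30-59天的笔数', '负债率', '月收入', '逾期90天+的笔数', '固定资产贷款数' and '逾期60-89天的笔数' have so many outliers that the distribution is hard to see.
# The outliers in 年龄 need a closer look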
plt.figure(figsize=(20,20),dpi=300)
plt.subplots_adjust(wspace =0.3, hspace =0.3)
for n,i in enumerate(df0.columns):
    plt.subplot(4,3,n+1)
    plt.title(i,fontsize=15)
    plt.grid(linestyle='--')
    df0[[i]].boxplot(sym='.')

In [6]:
# The heatmap shows that the three past-due count fields are highly collinear, so removing the collinearity is worth considering
plt.figure(figsize=(10,5),dpi=300)
sns.heatmap(df0.corr(),cmap='Reds',annot=True)
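
Reading strongly correlated pairs off a heatmap gets harder as the number of fields grows. As a rough sketch, the helper below lists the column pairs whose absolute correlation exceeds a threshold. The function name, the 0.9 threshold, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return (col_a, col_b, corr) for column pairs whose |corr| exceeds threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, skipping the diagonal
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Synthetic example: b is nearly a copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
demo = pd.DataFrame({'a': a,
                     'b': a + rng.normal(scale=0.01, size=1000),
                     'c': rng.normal(size=1000)})
pairs = high_corr_pairs(demo)
print(pairs)
```

On the real data, `high_corr_pairs(df0)` would be expected to report the three past-due count fields.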


## 3. Building the preprocessing functions

### 3.1 Outlier and obvious-error handling function

In [8]:
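def error_processing(df):
    '''
    Outlier handling. The rules can be tuned repeatedly based on model performance; be careful about deleting data.
    df: source DataFrame
    Returns the filtered DataFrame.
    '''
    def show_error(df,col,whis=1.5,show=False):
        '''
        Print how many values fall above and below the IQR bounds, optionally with sample rows.
        df: source DataFrame
        col: column name
        whis: multiplier of the IQR, default 1.5
        show: whether to print sample outlier rows
        '''
        iqr = df[col].quantile(0.75) - df[col].quantile(0.25)
        upper_bound = df[col].quantile(0.75) + whis * iqr # upper bound
        lower_bound = df[col].quantile(0.25) - whis * iqr # lower bound
        # print(iqr,upper_bound,lower_bound)
        print('[',col,'] values above the upper bound:',df[col][df[col] > upper_bound].count())
        if show:
            print('Sample outliers:\n',df[df[col] > upper_bound].head(5).T)
        print('[',col,'] values below the lower bound:',df[col][df[col] < lower_bound].count())
        if show:
            print('Sample outliers:\n',df[df[col] < lower_bound].head(5).T)
        print('- - - - - - ')

    def drop_error(df,col):
        '''
        Drop the rows whose value lies outside the 1.5 * IQR bounds (not used below).
        df: source DataFrame
        col: column name
        '''
        iqr = df[col].quantile(0.75) - df[col].quantile(0.25)
        upper_bound = df[col].quantile(0.75) + 1.5*iqr # upper bound
        lower_bound = df[col].quantile(0.25) - 1.5*iqr # lower bound
        data_del = df[col][(df[col] > upper_bound) | (df[col] < lower_bound)].count()
        data = df[(df[col] <= upper_bound) & (df[col] >= lower_bound)]
        # print('Rows removed:',data_del)
        return data
    
    # Row count before filtering
    n = len(df)
    
    # 可用信贷额度比例
    # The histogram suggests that ratios above 1 are errors.
    # There are 3321 such values and dropping all of them could hurt the model, so only values above 20000 are dropped
    show_error(df,'可用信贷额度比例')
    df = df[df.可用信贷额度比例<=20000]

    # 年龄
    # Few outliers; keep only ages strictly between 18 and 100
    show_error(df,'年龄')
    df = df[(df['年龄']>18) & (df['年龄']<100)]

    # 逾期30-59天的笔数
    # Based on the box plot, drop values of 80 or more
    show_error(df,'逾期30-59天的笔数')
    df = df[df['逾期30-59天的笔数']<80]

    # 逾期90天+的笔数
    # Based on the box plot, drop values of 80 or more
    show_error(df,'逾期90天+的笔数')
    df = df[df['逾期90天+的笔数']<80]

    # 逾期60-89天的笔数
    # Based on the box plot, drop values of 80 or more
    show_error(df,'逾期60-89天的笔数')
    df = df[df['逾期60-89天的笔数']<80]

    # 负债率
    # Based on the box plot, drop values of 100000 or more
    show_error(df,'负债率')
    df = df[df['负债率']<100000]

    # 月收入
    # Based on the box plot, drop values of 500000 or more; rows with a missing value are kept
    show_error(df,'月收入')
    df = df[(df['月收入']<500000) | df.月收入.isna()]

    # 固定资产贷款数
    # Based on the box plot, drop values of 20 or more
    show_error(df,'固定资产贷款数')
    df = df[df['固定资产贷款数']<20]

    # 家属数量
    # Based on the box plot, drop values of 12 or more; rows with a missing value are kept
    show_error(df,'家属数量')
    df = df[(df['家属数量']<12)|df.家属数量.isna()]

    # 信贷数量: outliers are kept
    
    print('Rows removed:',n - len(df))
    # Each filter above rebinds the local name df, so the caller only sees the result through the return value
    return df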
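
The bounds that `show_error` reports follow the usual box-plot rule: anything further than `whis` times the IQR from the quartiles counts as an outlier. Here is a minimal worked example on a tiny series; the helper name `iqr_bounds` and the data are made up for illustration.

```python
import pandas as pd

def iqr_bounds(series, whis=1.5):
    """Return (lower, upper) bounds at whis * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = iqr_bounds(values)
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())   # -3.5 14.5 [100]
```

Here Q1 = 3.25 and Q3 = 7.75, so the IQR is 4.5 and the bounds sit 6.75 beyond each quartile. For heavily skewed fields such as 负债率 this rule flags a large share of the rows, which is one reason the function above uses hand-picked thresholds instead.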

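### 3.2 Collinearity removal function

In [9]:
# The correlation heatmap in the previous section shows that the three past-due fields are highly collinear
# One option is to keep '逾期90天+的笔数' and compute the ratio '逾期60-89天的笔数' / '逾期30-59天的笔数'
def collineation_processing(df,col,col1,col2,name):
    '''
    Reduce collinearity: keep one field and replace the other two with their ratio.
    The new column is added to df in place.
    df: source DataFrame
    col: field to keep (only referenced by the commented-out correlation check)
    col1, col2: numerator and denominator of the ratio
    name: name of the new ratio column
    '''
    def trans2percent(row):
        # A zero denominator yields 0 instead of a division error
        if row[col2] == 0:
            return 0
        else:
            return row[col1] / row[col2]
    df[name] = df.apply(trans2percent,axis=1)
#     df[[name,col]].corr()
    
# collineation_processing(df,'逾期90天+的笔数','逾期60-89天的笔数','逾期30-59天的笔数','逾期60-89天/30-59天')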
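
Row-wise `apply` is slow on 150,000 rows. As a possible alternative, the ratio can be computed on whole columns and the zero-denominator rows patched afterwards. This is only a sketch, and `safe_ratio` is a hypothetical helper that is not part of the original notebook.

```python
import pandas as pd

def safe_ratio(df, num_col, den_col):
    """Return num_col / den_col, with 0 wherever the denominator is 0."""
    ratio = df[num_col].div(df[den_col])
    # Rows with a zero denominator come out as inf or NaN; replace them with 0
    return ratio.where(df[den_col] != 0, 0.0)

demo = pd.DataFrame({'逾期60-89天的笔数': [1, 2, 0, 3],
                     '逾期30-59天的笔数': [2, 0, 0, 3]})
demo['逾期60-89天/30-59天'] = safe_ratio(demo, '逾期60-89天的笔数', '逾期30-59天的笔数')
result = demo['逾期60-89天/30-59天'].tolist()
print(result)   # [0.5, 0.0, 0.0, 1.0]
```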

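### 3.3 Missing-value handling function

In [10]:
def missing_values_processing(df,func1=1,func2=1):
    '''
    Missing-value handling
    df: source DataFrame
    func1: 1 (default) fills 家属数量 with the mode; 0 drops the rows where it is missing.
    func2: 1 (default) fills 月收入 with the mode; 0 fills it with the mean.
    Returns the processed DataFrame.
    '''
    # 家属数量: fill with the mode or drop the rows
    if func1 == 1:
        df.loc[df.家属数量.isna(),'家属数量'] = df.家属数量.mode()[0]
    elif func1 == 0:
        df = df.dropna(subset=['家属数量'])
    else:
        print('parameter wrong!')
    
    # 月收入: fill with the mode or the mean
    if func2 == 1:
        df.loc[df.月收入.isna(),'月收入'] = df.月收入.mode()[0]
    elif func2 == 0:    
        df.loc[df.月收入.isna(),'月收入'] = df.月收入.mean()
    else:
        print('parameter wrong!')

    # dropna rebinds the local name df, so the result is returned
    return df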

    # Model-based imputation of 月收入 is another option; the helper below reports regression performance (the final results were very poor)
#     def perfomance_reg(model,X,y,name=None):
#         y_predict = model.predict(X)
#         check = pd.DataFrame(y)
#         check['y_predict'] = y_predict
#         check['abs_err'] = abs(check['y_predict'] - check[y.name] )
#         check['ape'] = check['abs_err'] / check[y.name]
#         ape = check['ape'][check['ape']!=np.inf].mean()
#         if name:
#             print(name,':')
#         print(f'mean squared error is: {mean_squared_error(y,y_predict)}')
#         print(f'mean absolute error is: {mean_absolute_error(y,y_predict)}')
#         print(f'R Squared is: {r2_score(y,y_predict)}')
#         print(f'mean absolute percent error is: {ape}')
#         print('- - - - - - ')
        
    # Impute 月收入 with linear regression; the MAE is large
#     train_x = df1[df1.月收入.notna()].drop(['逾期30-59天的笔数','逾期60-89天的笔数','月收入'],axis=1)
#     train_y = df1[df1.月收入.notna()].月收入
#     test_x = df1[df1.月收入.isna()].drop(['逾期30-59天的笔数','逾期60-89天的笔数','月收入'],axis=1)
#     pipe_lr = Pipeline([
#             ('sc',StandardScaler()),
#             ('pow_trans',PowerTransformer()),
#             ('rf',LinearRegression())
#             ])
#     pipe_lr.fit(train_x,train_y)
#     perfomance_reg(pipe_lr,train_x,train_y)
#     pipe_lr.predict(test_x)

    # Impute 月收入 with a random forest; slightly better than linear regression, but still poor
#     dd = df1[df1.月收入.notna()].sample(n=5000)
#     train_x_sample = dd.drop(['逾期30-59天的笔数','逾期60-89天的笔数','月收入'],axis=1)
#     train_y_sample = dd.月收入
#     test_x_sample = df1[df1.月收入.isna()].drop(['逾期30-59天的笔数','逾期60-89天的笔数','月收入'],axis=1)
#     pipe_rf = Pipeline([
#             ('sc',StandardScaler()),
#             ('pow_trans',PowerTransformer()),
#             ('rf',RandomForestRegressor(criterion='mae',n_estimators=200,verbose=1,n_jobs=-1))
#             ])
#     pipe_rf.fit(train_x_sample,train_y_sample)
#     perfomance_reg(pipe_rf,train_x_sample,train_y_sample)
#     df1.loc[df1.月收入.isna(),'月收入'] = pipe_rf.predict(test_x_sample)
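
The prediction step in section 5 fills the test set with values computed from the training data. One way to make that explicit is to split the work into a "fit" step that stores the fill values and an "apply" step that uses them. The two helpers below are hypothetical names for this sketch, and the small frames are made-up data.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train, cols, strategy='mode'):
    """Compute one fill value per column from the training data."""
    if strategy == 'mode':
        return {c: train[c].mode()[0] for c in cols}
    if strategy == 'mean':
        return {c: train[c].mean() for c in cols}
    raise ValueError(f'unknown strategy: {strategy}')

def apply_fill_values(df, fill_values):
    """Return a copy of df with missing values replaced by the stored fill values."""
    return df.fillna(value=fill_values)

train = pd.DataFrame({'月收入': [3000.0, 5000.0, 5000.0, np.nan],
                      '家属数量': [0.0, 0.0, 2.0, np.nan]})
test = pd.DataFrame({'月收入': [np.nan, 4000.0],
                     '家属数量': [np.nan, 1.0]})

fills = fit_fill_values(train, ['月收入', '家属数量'])
filled = apply_fill_values(test, fills)
print(filled)
```

Because `fillna` returns a new frame, `test` itself is left unchanged.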

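### 3.4 Resampling function

In [11]:
# The data exploration showed that the '未来两年可能违约' label is imbalanced, so the samples need to be resampled
def resample(df):
    '''
    Undersample so that the 0 and 1 classes of '未来两年可能违约' each make up half of the data, to improve predictions.
    Consider passing random_state to sample() so that the same sample set is generated every time.
    df: source DataFrame
    '''
    num = df['未来两年可能违约'].value_counts()[1]
    df_t = df[df.未来两年可能违约==1]
    df_f = df[df.未来两年可能违约==0].sample(frac=1)[0:num]
    df_balanced = pd.concat([df_t,df_f]).sample(frac=1).reset_index(drop=True)
#     print(df_balanced.未来两年可能违约.value_counts())
    return df_balanced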
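
The docstring suggests adding `random_state`. A minimal sketch of what that could look like is shown below; `undersample` and its parameters are illustrative names, and the demo frame is synthetic.

```python
import pandas as pd

def undersample(df, label, random_state=0):
    """Keep all positive rows and an equally sized random sample of negative rows."""
    positives = df[df[label] == 1]
    negatives = df[df[label] == 0].sample(n=len(positives), random_state=random_state)
    # Shuffle the combined rows so the two classes are interleaved
    combined = pd.concat([positives, negatives])
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)

demo = pd.DataFrame({'x': range(100), 'label': [1] * 10 + [0] * 90})
balanced = undersample(demo, 'label')
counts = balanced['label'].value_counts().to_dict()
print(counts)
```

With a fixed seed, two calls return identical frames, which makes the downstream metrics comparable between runs.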

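## 4. Preprocessing and model training
* Preprocess the data ==> split the data ==> train the models and search the parameters

In [None]:
# df1 is the dataset used to build the default-probability model
df1 = df0.copy()

# Outlier handling: error_processing filters a local variable, so the filtered frame is taken from its return value
cleaned = error_processing(df1)
if cleaned is not None:
    df1 = cleaned
# Collinearity removal
collineation_processing(df1,'逾期90天+的笔数', '逾期60-89天的笔数', '逾期30-59天的笔数','逾期60-89天/30-59天')
# Missing-value handling (with func1=1 and func2=1 the fills are applied to df1 in place)
missing_values_processing(df1,func1=1,func2=1)
# Resampling
df_balanced = resample(df1)

# Finally, split the data into training and validation sets at a ratio of 8:2
# Columns that could be dropped: '逾期30-59天的笔数','逾期60-89天的笔数','逾期90天+的笔数','逾期60-89天/30-59天','未来两年可能违约'
X = df_balanced.drop(['未来两年可能违约','逾期60-89天/30-59天'],axis=1)
y = df_balanced['未来两年可能违约']
xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=0.2)    # random_state=42

# Stratified k-fold splitter, used by the grid search
cv = StratifiedKFold(n_splits=3,shuffle=True)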
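
The split above is neither seeded nor stratified, so the class ratio in the two parts can drift slightly from run to run. A small sketch of a reproducible, stratified split follows. The synthetic `X` and `y` are assumptions standing in for the balanced data, and `stratify` and `random_state` are standard `train_test_split` parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic balanced labels standing in for the real X and y (assumption)
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 50 + [1] * 50)

# stratify=y keeps the class ratio the same in both parts; random_state fixes the split
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(ytrain.mean(), ytest.mean())
```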

In [13]:
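# Helper that prints the classification metrics
def perfomance_clf(model,X,y,name=None):
    y_predict = model.predict(X)
    if name:
        print(name,':')
    print(f'accuracy score is: {accuracy_score(y,y_predict)}')
    print(f'precision score is: {precision_score(y,y_predict)}')
    print(f'recall score is: {recall_score(y,y_predict)}')
    # Note: this AUC is computed from hard 0/1 predictions, so it equals the balanced accuracy
    # and is typically lower than the probability-based 'roc_auc' reported by cross_val_score
    print(f'auc: {roc_auc_score(y,y_predict)}')
    print('- - - - - - ')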
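
The `auc` values printed by this helper (around 0.78) are noticeably lower than the cross-validated `roc_auc` scores (around 0.86). One likely reason is that the helper scores hard 0/1 predictions, while `scoring='roc_auc'` ranks by predicted probability. The pure-Python sketch below illustrates the difference on made-up numbers; the `auc` function is a simple pairwise implementation written for this example, not the scikit-learn one.

```python
def auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.4, 0.6, 0.45, 0.8, 0.9]          # made-up predicted probabilities
labels = [1 if p >= 0.5 else 0 for p in proba]   # hard predictions at a 0.5 threshold

auc_proba = auc(y_true, proba)     # 8 of 9 positive/negative pairs are ordered correctly
auc_labels = auc(y_true, labels)   # equals (TPR + TNR) / 2 = (2/3 + 2/3) / 2
print(auc_proba, auc_labels)
```

To get a ranking-based AUC from the helper, `model.predict_proba(X)[:, 1]` could be passed to `roc_auc_score` instead of `y_predict`.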

### 4.1 Logistic regression model

In [20]:
# Parameter grid
log_params = {"penalty":['l1','l2'],
                 'C':[0.001*10**i for i in range(0,7)]}
# Grid search
log_gridsearch = GridSearchCV(LogisticRegression(solver='liblinear'),log_params,cv=cv,
                               n_jobs=-1,scoring='roc_auc',verbose=2,refit=True)
# Pipeline
pipe_log = Pipeline([
        ('sc',StandardScaler()),    # z-score standardization
        ('pow_trans',PowerTransformer()),    # skew correction
        ('log_grid',log_gridsearch)
        ])
# Search the parameters and fit the model
pipe_log.fit(xtrain,ytrain)
# Best parameter combination
print(pipe_log.named_steps['log_grid'].best_params_)
# Metrics on the training set
perfomance_clf(pipe_log,xtrain,ytrain,name='train')
# Metrics on the test set
perfomance_clf(pipe_log,xtest,ytest,name='test')

Fitting 3 folds for each of 14 candidates, totalling 42 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


{'C': 100.0, 'penalty': 'l1'}
train :
accuracy score is: 0.7783180599713235
precision score is: 0.8100737442604703
recall score is: 0.7265693248471234
auc: 0.7782697147404526
- - - - - - 
test :
accuracy score is: 0.7738718524058839
precision score is: 0.8120767494356659
recall score is: 0.7148534525583706
auc: 0.7740933929458519
- - - - - - 


[Parallel(n_jobs=-1)]: Done  42 out of  42 | elapsed:    3.0s finished
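
The log line "Fitting 3 folds for each of 14 candidates, totalling 42 fits" follows directly from the grid: 2 penalties times 7 values of `C`, each evaluated on 3 folds. The small helper below (a hypothetical name, written only for this check) reproduces the arithmetic.

```python
from itertools import product

def count_fits(param_grid, n_splits):
    """Return (number of candidates, number of cross-validation fits) for a grid search."""
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * n_splits

log_params = {'penalty': ['l1', 'l2'],
              'C': [0.001 * 10 ** i for i in range(0, 7)]}   # 0.001 up to 1000

fits = count_fits(log_params, n_splits=3)
print(fits)   # (14, 42)
```

With `refit=True` there is one additional fit on the full training set after the search.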


In [22]:
# Mean cross-validated score
cross_val_score(pipe_log,xtrain,ytrain,cv=3,scoring='roc_auc').mean()

Fitting 3 folds for each of 14 candidates, totalling 42 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 out of  42 | elapsed:    0.5s finished


Fitting 3 folds for each of 14 candidates, totalling 42 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 out of  42 | elapsed:    0.6s finished


Fitting 3 folds for each of 14 candidates, totalling 42 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 out of  42 | elapsed:    0.6s finished


0.8570960548600303

### 4.2 Random forest classifier

In [23]:
# Random forest classifier
rf_clf = RandomForestClassifier(criterion='gini',
                               n_jobs=-1,
                               n_estimators=1000)    # random_state
# Parameter grid
# 'auto' means sqrt(n_features) for classifiers; scikit-learn 1.3 and later no longer accept it, so use 'sqrt' there
rf_grid_params = {'max_features':['auto'],    # e.g. ['auto',0.5,0.6,0.9]; define your own grid when the best values are unknown
                    'max_depth':[6,9]}    # [3,6,9]
# Grid search
rf_gridsearch = GridSearchCV(rf_clf,rf_grid_params,cv=cv,
                               n_jobs=-1,scoring='roc_auc',verbose=10,refit=True)
# Pipeline
pipe_rf = Pipeline([
        ('sc',StandardScaler()),
        ('pow_trans',PowerTransformer()),
        ('rf_grid',rf_gridsearch)
        ])
# Search the parameters and fit the model
pipe_rf.fit(xtrain,ytrain)
# Best parameter combination
print(pipe_rf.named_steps['rf_grid'].best_params_)
# Metrics on the training set
perfomance_clf(pipe_rf,xtrain,ytrain,name='train')
# Metrics on the test set
perfomance_clf(pipe_rf,xtest,ytest,name='test')

Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   18.3s
[Parallel(n_jobs=-1)]: Done   4 out of   6 | elapsed:   41.9s remaining:   20.9s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:  1.1min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:  1.1min finished


{'max_depth': 9, 'max_features': 'auto'}
train :
accuracy score is: 0.8172807181597157
precision score is: 0.8249360613810741
recall score is: 0.8050667665044303
auc: 0.8172693075172873
- - - - - - 
test :
accuracy score is: 0.781351283969085
precision score is: 0.7883248730964467
recall score is: 0.771485345255837
auc: 0.7813883182735641
- - - - - - 


In [24]:
# Note: mean cross-validated score (the pipeline repeats the parameter search in every fold, so this is fairly slow)
cross_val_score(pipe_rf,xtrain,ytrain,cv=3,scoring='roc_auc').mean()

Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   12.6s
[Parallel(n_jobs=-1)]: Done   4 out of   6 | elapsed:   28.6s remaining:   14.3s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   43.6s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   43.6s finished


Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   12.5s
[Parallel(n_jobs=-1)]: Done   4 out of   6 | elapsed:   28.6s remaining:   14.3s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   43.6s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   43.6s finished


Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   12.6s
[Parallel(n_jobs=-1)]: Done   4 out of   6 | elapsed:   28.5s remaining:   14.2s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   43.5s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   43.5s finished


0.8636721586763811
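
Because the grid search sits inside the pipeline, `cross_val_score` repeats the whole search in each of its 3 outer folds, which is what makes the cell above slow. If the search has already been run, a cheaper (though slightly optimistic) check is to cross-validate the selected estimator only. The sketch below assumes a different layout from the notebook, with `GridSearchCV` wrapping the pipeline, and uses synthetic data and logistic regression to stay fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for xtrain / ytrain (assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

pipe = Pipeline([('sc', StandardScaler()),
                 ('clf', LogisticRegression(solver='liblinear'))])
search = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=cv, scoring='roc_auc')
search.fit(X, y)   # the grid search runs once

# best_estimator_ is a plain pipeline with fixed parameters, so this step fits 3 models only
scores = cross_val_score(search.best_estimator_, X, y, cv=cv, scoring='roc_auc')
print(search.best_params_, scores.mean())
```

The parameters were chosen on the same data that is then used for scoring, so this estimate is less strict than the nested search in the notebook. It trades some rigor for speed.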

### 4.3 XGBoost model

In [25]:
# XGBoost model
xgb_clf = xgb.XGBClassifier(objective='binary:logistic',
                            n_jobs=-1,
                            booster='gbtree',
                            n_estimators=1000,
                            learning_rate=0.01)
# Parameter grid
xgb_params = {'max_depth':[6,9],    # mind the grid size: more values make the search slower
             'subsample':[0.6,0.9],
             'colsample_bytree':[0.5,0.6],
             'reg_alpha':[0.05,0.1]}
# Grid search
xgb_gridsearch = GridSearchCV(xgb_clf,xgb_params,cv=cv,n_jobs=-1,
                                 scoring='roc_auc',verbose=10,refit=True)
# Pipeline
pipe_xgb = Pipeline([
    ('sc',StandardScaler()),
    ('pow_trans',PowerTransformer()),
    ('xgb_grid',xgb_gridsearch)
])
# Search the parameters and fit the model
pipe_xgb.fit(xtrain,ytrain)
# Best parameter combination
print(pipe_xgb.named_steps['xgb_grid'].best_params_)
# Metrics on the training set
perfomance_clf(pipe_xgb,xtrain,ytrain,name='train')
# Metrics on the test set
perfomance_clf(pipe_xgb,xtest,ytest,name='test')

Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   13.2s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   27.6s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  6.8min finished


{'colsample_bytree': 0.6, 'max_depth': 6, 'reg_alpha': 0.05, 'subsample': 0.6}
train :
accuracy score is: 0.8260083535939156
precision score is: 0.834314980793854
recall score is: 0.8131785847997005
auc: 0.8259963676365218
- - - - - - 
test :
accuracy score is: 0.7840937422089255
precision score is: 0.7939518195797027
recall score is: 0.76949826130154
auc: 0.7841485300501694
- - - - - - 


In [26]:
# Note: mean cross-validated score (the pipeline repeats the parameter search in every fold, so this is very slow)
cross_val_score(pipe_xgb,xtrain,ytrain,cv=3,scoring='roc_auc').mean()

Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    8.8s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   17.9s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   43.5s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  4.5min finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    9.1s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   18.1s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   43.4s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  4.5min finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    9.0s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   17.9s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   43.2s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  4.5min finished


0.8655361886709881

In [27]:
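# Show how important each field is to the XGBoost model
# The fields ['可用信贷额度比例', '年龄', '负债率', '月收入', '信贷数量'] usually rank as the most important
plt.figure(figsize=(10,5))
ax = plt.subplot(1,1,1)
xgb.plot_importance(pipe_xgb.named_steps['xgb_grid'].best_estimator_,
                       max_num_features=40,height=0.5,grid=False,ax=ax)
# The scalers pass a NumPy array to XGBoost, so the plot labels the features f0, f1, ...; print the column order to map them back
xtrain.columns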

Index(['可用信贷额度比例', '年龄', '逾期30-59天的笔数', '负债率', '月收入', '信贷数量', '逾期90天+的笔数',
       '固定资产贷款数', '逾期60-89天的笔数', '家属数量'],
      dtype='object')

## 5. Predicting and generating the results
* Load the prediction set and write the result file

In [28]:
# Load and preprocess the prediction set
# dftest = pd.read_csv('/Users/ranmo/Desktop/机器学习项目集/信用违约预测模型/data/test.csv').drop('Unnamed: 0',axis=1)
# dftest.rename(columns = {'SeriousDlqin2yrs':'未来两年可能违约', 'RevolvingUtilizationOfUnsecuredLines':'可用信贷额度比例', 'age':'年龄',
#       'NumberOfTime30-59DaysPastDueNotWorse':'逾期30-59天的笔数', 'DebtRatio':'负债率', 'MonthlyIncome':'月收入',
#       'NumberOfOpenCreditLinesAndLoans':'信贷数量', 'NumberOfTimes90DaysLate':'逾期90天+的笔数',
#       'NumberRealEstateLoansOrLines':'固定资产贷款数', 'NumberOfTime60-89DaysPastDueNotWorse':'逾期60-89天的笔数',
#       'NumberOfDependents':'家属数量'},inplace=True)
# dftest.loc[dftest.家属数量.isna(),'家属数量'] = df1.家属数量.mode()[0]
# dftest.loc[dftest.月收入.isna(),'月收入'] = df1.月收入.mode()[0]

In [29]:
# Predict with the XGBoost model and write the results to a CSV file
# result = pipe_xgb.predict_proba(dftest.drop('未来两年可能违约',axis=1))
# result_ = [[n+1,i] for n,i in enumerate(result[:,1])]
# df_result = pd.DataFrame(result_,columns=['Id','Probability'])
# df_result.to_csv('sampleEntry.csv',index=False)
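
The commented-out cell builds the result rows with a list comprehension. An equivalent way to build the same two columns directly is sketched below. `build_submission` is a hypothetical helper and the probabilities are made up; the `Id` and `Probability` column names are taken from the cell above.

```python
import numpy as np
import pandas as pd

def build_submission(proba):
    """Build the two-column result frame: Id starts at 1, Probability is P(class 1)."""
    proba = np.asarray(proba)
    return pd.DataFrame({'Id': np.arange(1, len(proba) + 1),
                         'Probability': proba[:, 1]})

# Made-up predict_proba output: one row per sample, columns are [P(0), P(1)]
fake_proba = np.array([[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]])
submission = build_submission(fake_proba)
print(submission)
```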

## 6. Appendix

Public Score - 0.85796

In [31]:
# How to save and reload a model
# import pickle
# with open('pipe_log.pickle','wb') as f:
#     pickle.dump(pipe_log,f)
# with open('pipe_log.pickle','rb') as f:
#     clf = pickle.load(f)
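
The save and load pattern above can be tried without a trained model. The sketch below round-trips a plain dictionary (a stand-in object, not a real pipeline) through a temporary file.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted pipeline (assumption): any picklable object works the same way
model = {'name': 'pipe_log', 'C': 100.0, 'penalty': 'l1'}

path = os.path.join(tempfile.mkdtemp(), 'pipe_log.pickle')
with open(path, 'wb') as f:
    pickle.dump(model, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)   # True
```

Two caveats apply to real models: a pickled scikit-learn or XGBoost object should be loaded with the same library versions that created it, and pickle files from untrusted sources should never be loaded.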