# Regular Competition: Personal Loan Default Prediction - Building a Minimal Personal Loan Default Predictor with Paddle



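## 1. Task description
The task is to use an existing batch of credit data, drawn from a customer population that differs slightly from the target one, to help build a risk-control model for the target business. The two datasets share a large number of fields but very few users. Participants are encouraged to use transfer learning to capture the relationship between users' basic information and default behavior across different businesses, and to use it to predict defaults for users of the new business.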

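## 2. Competition background
This is a long-running competition. The original problem comes from the finance track of the 2021 CCF Big Data & Computing Intelligence Contest, "Personal Loan Default Prediction".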

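## 3. Dataset description
Training data:

  train_public.csv: personal loan default records

  train_internet_public.csv: default records for an online credit loan product

Test data:

  test_public.csv: data used for testing and for the leaderboard ranking

Training data fields:

train_public.csv: personal loan default records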

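<p>train_internet.csv: default records for an online credit loan product</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>loan_id</td>
<td>Unique identifier of the online loan record</td>
</tr>
<tr>
<td>user_id</td>
<td>Unique identifier of the user</td>
</tr>
<tr>
<td>total_loan</td>
<td>Online loan amount</td>
</tr>
<tr>
<td>year_of_loan</td>
<td>Online loan term (years)</td>
</tr>
<tr>
<td>interest</td>
<td>Online loan interest rate</td>
</tr>
<tr>
<td>monthly_payment</td>
<td>Installment payment amount</td>
</tr>
<tr>
<td>class</td>
<td>Online loan grade</td>
</tr>
<tr>
<td>sub_class</td>
<td>Sub-grade within the online loan grade</td>
</tr>
<tr>
<td>work_type</td>
<td>Job type (civil servant, corporate white-collar, entrepreneur, …)</td>
</tr>
<tr>
<td>employment_type</td>
<td>Employer type (Fortune Global 500, state-owned enterprise, ordinary company, …)</td>
</tr>
<tr>
<td>industry</td>
<td>Industry (traditional industry, commerce, internet, finance, …)</td>
</tr>
<tr>
<td>work_year</td>
<td>Years of employment</td>
</tr>
<tr>
<td>house_ownership</td>
<td>Whether the borrower owns a home</td>
</tr>
<tr>
<td>house_loan_status</td>
<td>Mortgage status (no mortgage, currently repaying, fully repaid)</td>
</tr>
<tr>
<td>censor_status</td>
<td>Verification status</td>
</tr>
<tr>
<td>marriage</td>
<td>Marital status (single, married, divorced, widowed)</td>
</tr>
<tr>
<td>offsprings</td>
<td>Children's status (no children, preschool, primary school, secondary school, university, working)</td>
</tr>
<tr>
<td>issue_date</td>
<td>Month in which the online loan was issued</td>
</tr>
<tr>
<td>use</td>
<td>Loan purpose</td>
</tr>
<tr>
<td>post_code</td>
<td>First 3 digits of the borrower's postal code</td>
</tr>
<tr>
<td>region</td>
<td>Region code</td>
</tr>
<tr>
<td>debt_loan_ratio</td>
<td>Debt-to-income ratio</td>
</tr>
<tr>
<td>del_in_18month</td>
<td>Number of delinquency events within 60 days past due in the borrower's credit file over the past 18 months</td>
</tr>
<tr>
<td>scoring_low</td>
<td>Lower bound of the range the borrower falls into in the credit scoring system</td>
</tr>
<tr>
<td>scoring_high</td>
<td>Upper bound of the range the borrower falls into in the credit scoring system</td>
</tr>
<tr>
<td>pub_dero_bankrup</td>
<td>Number of public records cleared</td>
</tr>
<tr>
<td>early_return</td>
<td>Number of early repayments</td>
</tr>
<tr>
<td>early_return_amount</td>
<td>Cumulative early repayment amount</td>
</tr>
<tr>
<td>early_return_amount_3mon</td>
<td>Early repayment amount in the last 3 months</td>
</tr>
<tr>
<td>recircle_bal</td>
<td>Total revolving credit balance</td>
</tr>
<tr>
<td>recircle_util</td>
<td>Revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit</td>
</tr>
<tr>
<td>initial_list_status</td>
<td>Initial listing status of the online loan</td>
</tr>
<tr>
<td>earlies_credit_line</td>
<td>Month in which the online loan credit line was opened</td>
</tr>
<tr>
<td>title</td>
<td>Name of the online loan as provided by the borrower</td>
</tr>
<tr>
<td>policy_code</td>
<td>Publicly available policy = 1; policy not publicly available = 2</td>
</tr>
<tr>
<td>f-series anonymous features</td>
<td>Anonymous features f0-f5, derived from counts of online borrower behaviors</td>
</tr>
</tbody>
</table>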
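
Because the task relies on the two training tables sharing many fields, a first step is usually to list which columns they have in common. The following is a minimal sketch on toy frames; the column names and values in `public_df` and `internet_df` are illustrative stand-ins, not the real file contents.

```python
import pandas as pd

# Toy stand-ins for the two training tables; columns and values are illustrative.
public_df = pd.DataFrame({"loan_id": [1], "total_loan": [1000.0], "isDefault": [0]})
internet_df = pd.DataFrame({"loan_id": [9], "total_loan": [500.0], "sub_class": ["A1"]})

# Columns present in both tables, and columns unique to each one.
shared = [c for c in public_df.columns if c in internet_df.columns]
only_public = [c for c in public_df.columns if c not in internet_df.columns]
only_internet = [c for c in internet_df.columns if c not in public_df.columns]

print("shared:", shared)
print("only in public:", only_public)
print("only in internet:", only_internet)
```

Running the same comparison on the real CSV files shows which features can be reused directly when transferring from one table to the other.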

## 4. Case analysis and code implementation

### 4.1 Inspecting the dataset

In [3]:
# Inspect the samples
import pandas as pd
train_df=pd.read_csv('data/train_public.csv')
train_df

Unnamed: 0,loan_id,user_id,total_loan,year_of_loan,interest,monthly_payment,class,employer_type,industry,work_year,...,policy_code,f0,f1,f2,f3,f4,early_return,early_return_amount,early_return_amount_3mon,isDefault
0,1040418,240418,31818.18182,3,11.466,1174.91,C,政府机构,金融业,3 years,...,1,1.0,0.0,4.0,5.0,4.0,3,9927,0.0,0
1,1025197,225197,28000.00000,5,16.841,670.69,C,政府机构,金融业,10+ years,...,1,7.0,0.0,4.0,45.0,22.0,0,0,0.0,0
2,1009360,209360,17272.72727,3,8.900,603.32,A,政府机构,公共服务、社会组织,10+ years,...,1,6.0,0.0,6.0,28.0,19.0,0,0,0.0,0
3,1039708,239708,20000.00000,3,4.788,602.30,A,世界五百强,文化和体育业,6 years,...,1,5.0,0.0,10.0,15.0,9.0,0,0,0.0,0
4,1027483,227483,15272.72727,3,12.790,470.31,C,政府机构,信息传输、软件和信息技术服务业,< 1 year,...,1,10.0,0.0,6.0,15.0,4.0,0,0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,1028093,228093,17727.27273,3,15.037,510.27,B,普通企业,建筑业,7 years,...,1,4.0,0.0,4.0,11.0,7.0,2,5287,0.0,0
9996,1043911,243911,13636.36364,3,6.534,464.95,A,政府机构,农、林、牧、渔业,2 years,...,1,2.0,0.0,2.0,7.0,6.0,3,7182,0.0,0
9997,1023503,223503,24818.18182,3,14.421,708.69,B,普通企业,信息传输、软件和信息技术服务业,10+ years,...,1,6.0,0.0,5.0,15.0,11.0,1,8540,2562.0,0
9998,1024616,224616,20000.00000,3,18.450,727.58,D,政府机构,农、林、牧、渔业,10+ years,...,1,7.0,0.0,5.0,17.0,10.0,2,6161,616.1,0


In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 39 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   loan_id                   10000 non-null  int64  
 1   user_id                   10000 non-null  int64  
 2   total_loan                10000 non-null  float64
 3   year_of_loan              10000 non-null  int64  
 4   interest                  10000 non-null  float64
 5   monthly_payment           10000 non-null  float64
 6   class                     10000 non-null  object 
 7   employer_type             10000 non-null  object 
 8   industry                  10000 non-null  object 
 9   work_year                 9378 non-null   object 
 10  house_exist               10000 non-null  int64  
 11  censor_status             10000 non-null  int64  
 12  issue_date                10000 non-null  object 
 13  use                       10000 non-null  in

The output above shows that fields such as work_year, pub_dero_bankrup, f0, f1, f2, f3 and f4 have missing values to varying degrees.
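
The same conclusion can be read off directly by counting missing entries per column instead of scanning the `info()` output. A minimal sketch follows, using a small made-up frame (`toy_df`) in place of `train_public.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_public.csv; the values are made up.
toy_df = pd.DataFrame({
    "work_year": ["3 years", None, "10+ years", None],
    "pub_dero_bankrup": [0.0, np.nan, 1.0, 0.0],
    "total_loan": [31818.18, 28000.0, 17272.73, 20000.0],
})

# Count missing entries per column and keep only the columns that have any.
missing = toy_df.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

On the real data, applying the same two lines to `train_df` gives the per-column missing counts.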

### 4.2 Building a PaddlePaddle Dataset

In [8]:
import paddle
import numpy as np
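# import paddle.vision.transforms as T
import pandas as pd

class MyDateset(paddle.io.Dataset):
    # input_dir: path of the CSV file to read; the same file also supplies the
    # mean and standard deviation used to standardize the numeric columns.
    # mode: 'train' returns isDefault as the label, anything else returns loan_id.
    def __init__(self,input_dir,mode = 'train'):
        super(MyDateset, self).__init__()

        # Read the data
        self.df = pd.read_csv(input_dir)

        # Compute the mean and standard deviation of each numeric column
        # (numeric_only=True skips the string columns, which newer pandas
        # versions refuse to average)
        st_df = pd.read_csv(input_dir)
        self.mean_df = st_df.mean(numeric_only=True)
        self.std_df = st_df.std(numeric_only=True)

        # Specify the numeric variables / categorical variables / unused variables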
        self.num_item = ['total_loan', 'year_of_loan', 'interest','monthly_payment',
                         'debt_loan_ratio', 'del_in_18month', 'scoring_low','scoring_high', 
                         'known_outstanding_loan', 'known_dero','pub_dero_bankrup', 'recircle_b', 
                         'recircle_u', 'f0', 'f1','f2', 'f3', 'f4', 'early_return', 
                         'early_return_amount','early_return_amount_3mon']
        self.un_num_item = ['class','employer_type','industry','work_year','house_exist', 'censor_status',
                            'use','initial_list_status','app_type','policy_code']
        self.un_use_item = ['loan_id', 'user_id',
                            'issue_date', 
                            'post_code', 'region',
                            'earlies_credit_mon','title']

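        # Build a lookup table that maps each categorical value (string or
        # number) to an integer index. Note: the order of list(set(...)) is not
        # guaranteed to be stable, so the indices can differ between runs and
        # between the train and test files.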
        un_num_item_list = {}
        for item in self.un_num_item:
            un_num_item_list[item]=list(set(st_df[item].values))
        self.un_num_item_list = un_num_item_list

        self.mode = mode

    def __getitem__(self, index):
        data=[]

        # Standardize each numeric value; a missing value is treated as a raw 0 before standardization
        for item in self.num_item:
            if np.isnan(self.df[item][index]):
                data.append((0-self.mean_df[item])/self.std_df[item])
            else:
                data.append((self.df[item][index]-self.mean_df[item])/self.std_df[item])
        
        emb_data = []

        # Map each categorical value to its index; unknown values or failed lookups become -1
        for item in self.un_num_item:
            try:
                if self.df[item][index] not in self.un_num_item_list[item]:
                    emb_data.append(-1)
                else:
                    emb_data.append(self.un_num_item_list[item].index(self.df[item][index]))
            except Exception:
                emb_data.append(-1)

        data = paddle.to_tensor(data).astype('float32')
        emb_data = paddle.to_tensor(emb_data).astype('float32')

        # In 'train' mode return the isDefault label; otherwise return the loan_id so each prediction can be matched to its sample
        if self.mode == 'train':
            label = self.df['isDefault'][index]
        else:
            label = self.df['loan_id'][index]

        label = np.array(label).astype('int64')
        return data,emb_data,label

    def __len__(self):
        return len(self.df)
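
One caveat with the categorical mapping in `MyDateset`: it is built with `list(set(...))` from whichever file is loaded, so the index assigned to a category is not guaranteed to match between the training and test datasets. The following is a minimal sketch of a deterministic alternative, not the approach used in this notebook. The helpers `build_vocab` and `encode` are hypothetical names introduced here for illustration.

```python
import math

def build_vocab(values):
    """Return the distinct non-missing categories in a fixed, sorted order."""
    distinct = {
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    }
    # Sort by string form so mixed int/str columns still get a stable order.
    return sorted(distinct, key=str)

def encode(value, vocab):
    """Map a category to its index, or -1 if it is missing or unseen."""
    try:
        return vocab.index(value)
    except ValueError:
        return -1

# Build the vocabulary once, from the training values only.
train_values = ["C", "A", "B", "A", float("nan")]
vocab = build_vocab(train_values)

# Reuse the same vocabulary for any later data, including unseen categories.
codes = [encode(v, vocab) for v in ["A", "C", "E", float("nan")]]
print(vocab, codes)
```

Building the vocabulary from the training file and passing it to the test dataset would keep the indices consistent across both.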

Test the MyDateset class

In [11]:
dataset=MyDateset('data/train_public.csv')
[data,emb_data,label] = dataset[0]
print("data:",data,"\nemb_data:",emb_data,"\nlabel:",label)

data: Tensor(shape=[21], dtype=float32, place=Place(cpu), stop_gradient=True,
       [ 1.94507027, -0.56161529, -0.36030969,  2.81924415, -1.06214857,
        -0.35715219, -1.39864016, -1.26400948, -1.57160521, -0.37241074,
        -0.36666167, -0.41815358,  1.46703112, -1.42198610, -0.03773430,
        -0.61069232, -1.16881323, -0.84118545,  1.17932808,  2.56085277,
        -0.52783436]) 
emb_data: Tensor(shape=[10], dtype=float32, place=Place(cpu), stop_gradient=True,
       [4., 5., 2., 5., 0., 1., 2., 0., 0., 0.]) 
label: 0
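
The numeric values printed above are z-scores, and `MyDateset` replaces a missing raw value with 0 before standardizing. The sketch below, on a made-up column, shows what that choice does compared with imputing the column mean; the mean-imputation variant is an alternative shown for comparison, not what the notebook does.

```python
import numpy as np

# Hypothetical raw values for one numeric column; nan marks a missing entry.
col = np.array([10.0, 20.0, np.nan, 30.0])

# pandas' mean()/std() skip NaN and std() uses ddof=1, so mirror that here.
mean = np.nanmean(col)
std = np.nanstd(col, ddof=1)

# Choice made in MyDateset: treat a missing raw value as 0, then standardize.
zero_filled = (np.where(np.isnan(col), 0.0, col) - mean) / std

# Alternative: impute the column mean, which standardizes to exactly 0.
mean_filled = (np.where(np.isnan(col), mean, col) - mean) / std

print(zero_filled)
print(mean_filled)
```

With zero filling, a missing entry ends up at `-mean / std`, which can be far from the bulk of the data when the column's mean is large relative to its spread.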


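### 4.3 Building the network

The network has two branches. The 21 numeric features pass through one fully connected layer (21 → 512), and the 10 categorical indices pass through two stacked linear layers (10 → 2048 → 512). The two 512-dimensional outputs are concatenated and fed into a final fully connected layer with 2 outputs, one per class of the binary classification problem.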

In [13]:
class MyNet(paddle.nn.Layer):
    def __init__(self):
        super(MyNet,self).__init__()
        self.fc = paddle.nn.Linear(in_features=21, out_features=512)   # numeric branch: 21 input features, 512 output units
        self.emb1 = paddle.nn.Linear(in_features=10,out_features=2048) # categorical branch, layer 1: 10 inputs, 2048 outputs
        self.emb2 = paddle.nn.Linear(in_features=2048,out_features=512)# categorical branch, layer 2: 2048 inputs, 512 outputs
        self.out = paddle.nn.Linear(in_features=1024,out_features=2)   # output layer: 1024 inputs (512 + 512), 2 outputs for binary classification

    def forward(self,data,emb_data):
        x = self.fc(data)

        emb = self.emb1(emb_data)
        emb = self.emb2(emb)

        x = paddle.concat([x,emb],axis=-1)

        x = self.out(x)
        
        x = paddle.nn.functional.sigmoid(x)
        return x
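
To make the tensor shapes in `forward` concrete, here is a NumPy-only sketch that traces a batch through the same sequence of layers. The `linear` helper and its random weights are stand-ins used only to follow the shapes; they are not Paddle's initialization or the trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, in_features, out_features):
    """Randomly initialized affine map, used only to trace shapes."""
    w = rng.normal(scale=0.01, size=(in_features, out_features))
    b = np.zeros(out_features)
    return x @ w + b

batch = 4
data = rng.normal(size=(batch, 21))                                # standardized numeric features
emb_data = rng.integers(0, 5, size=(batch, 10)).astype("float32")  # category indices

x = linear(data, 21, 512)                              # numeric branch
emb = linear(linear(emb_data, 10, 2048), 2048, 512)    # categorical branch
merged = np.concatenate([x, emb], axis=-1)             # 512 + 512 = 1024 features
out = 1.0 / (1.0 + np.exp(-linear(merged, 1024, 2)))   # sigmoid, as in forward()

print(x.shape, emb.shape, merged.shape, out.shape)
```

Since no activation sits between the linear layers, everything before the final sigmoid is an affine function of the inputs; adding a nonlinearity such as ReLU between layers would be one way to increase capacity.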

### 4.4 Training the model

In [21]:
# Build the data loader
train_dataset=MyDateset('data/train_public.csv')

train_dataloader = paddle.io.DataLoader(
    train_dataset,
    batch_size=10000,
    shuffle=True,
    drop_last=False)

model = MyNet()

# model_dict = paddle.load('model.pdparams')
# model.set_dict(model_dict)

model.train()

max_epoch=100
opt = paddle.optimizer.SGD(learning_rate=0.1, parameters=model.parameters())

# Training loop
now_step=0
for epoch in range(max_epoch):
    for step, data in enumerate(train_dataloader):
        now_step+=1

        data,emb_data, label = data
        pre = model(data,emb_data)
        loss = paddle.nn.functional.cross_entropy(pre,label,weight=paddle.to_tensor([0.2,1.0]),reduction='mean')
        # loss = paddle.nn.functional.square_error_cost(pre,label.reshape([-1,1]).astype('float32'))
        # loss = paddle.mean(loss)
        loss.backward()
        opt.step()
        opt.clear_gradients()
        if now_step%1==0:
            print("epoch: {}, batch: {}, loss is: {}".format(epoch, step, loss.mean().numpy()))

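    # Evaluate once per epoch. No held-out split is defined in this notebook,
    # so this pass reuses the training data and only serves as a sanity check.
    model.eval()
    accuracies = []
    losses = []
    with paddle.no_grad():
        for batch_id, data in enumerate(train_dataloader):
            # Unpack the batch
            data, emb_data, label = data
            # Forward pass
            predicts = model(data, emb_data)
            # Compute the loss
            loss = paddle.nn.functional.cross_entropy(predicts, label)
            # Compute the accuracy (paddle.metric.accuracy expects labels of shape [N, 1])
            acc = paddle.metric.accuracy(predicts, label.reshape([-1, 1]))
            accuracies.append(acc.numpy())
            losses.append(loss.numpy())

    avg_acc, avg_loss = np.mean(accuracies), np.mean(losses)
    print("[train-set check]After epoch {}: accuracy/loss: {}/{}".format(epoch, avg_acc, avg_loss))
    # Switch back to training mode for the next epoch
    model.train()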


# Save the model to model.pdparams
paddle.save(model.state_dict(), 'model.pdparams')



epoch: 0, batch: 0, loss is: [0.7036219]
epoch: 1, batch: 0, loss is: [0.6963007]
epoch: 2, batch: 0, loss is: [0.690608]
epoch: 3, batch: 0, loss is: [0.6852205]
epoch: 4, batch: 0, loss is: [0.6801178]
epoch: 5, batch: 0, loss is: [0.6752735]
epoch: 6, batch: 0, loss is: [0.6706602]
epoch: 7, batch: 0, loss is: [0.66625434]
epoch: 8, batch: 0, loss is: [0.66203713]
epoch: 9, batch: 0, loss is: [0.65799594]
epoch: 10, batch: 0, loss is: [0.654122]
epoch: 11, batch: 0, loss is: [0.6504088]
epoch: 12, batch: 0, loss is: [0.646851]
epoch: 13, batch: 0, loss is: [0.64344275]
epoch: 14, batch: 0, loss is: [0.6401784]
epoch: 15, batch: 0, loss is: [0.6370512]
epoch: 16, batch: 0, loss is: [0.6340546]
epoch: 17, batch: 0, loss is: [0.63118213]
epoch: 18, batch: 0, loss is: [0.6284276]
epoch: 19, batch: 0, loss is: [0.62578464]
epoch: 20, batch: 0, loss is: [0.6232475]
epoch: 21, batch: 0, loss is: [0.62081075]
epoch: 22, batch: 0, loss is: [0.6184691]
epoch: 23, batch:
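
The training loss uses class weights `[0.2, 1.0]`, which down-weights the majority (non-default) class. The NumPy sketch below illustrates that computation. It assumes the convention described in the Paddle documentation: softmax is applied to the inputs by default, and with `reduction='mean'` the weighted losses are divided by the sum of the per-sample weights. The function `weighted_cross_entropy` is a hypothetical re-implementation for illustration, not Paddle's code.

```python
import numpy as np

def weighted_cross_entropy(scores, labels, class_weight):
    """Softmax cross-entropy with per-class weights, averaged by total weight."""
    shifted = scores - scores.max(axis=1, keepdims=True)      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]   # unweighted losses
    w = class_weight[labels]                                  # weight of each sample's class
    return (w * per_sample).sum() / w.sum()

# Three samples with uninformative scores: two negatives, one positive.
scores = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
labels = np.array([0, 0, 1])
loss = weighted_cross_entropy(scores, labels, np.array([0.2, 1.0]))
print(loss)
```

With equal scores for both classes, every sample contributes ln 2 ≈ 0.693 regardless of the weights, which is a useful reference point when reading the first few loss values in the log.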

### 4.5 Prediction

This step loads a saved model that scored 0.85984. To test your own model, replace the model path accordingly.

Then submit the generated result.csv.

In [24]:
# Load the model and build the data loader
model = MyNet()

# To use your own training result, change the .pdparams path passed to load, e.g. model.pdparams
model_dict = paddle.load('model.pdparams')

model.set_dict(model_dict)

model.eval()

test_dataset=MyDateset('data/test_public.csv',mode = 'test')

test_dataloader = paddle.io.DataLoader(
    test_dataset,
    batch_size=1,
    shuffle=False,
    drop_last=False)

# Save the results to result.csv
result = []
for step, data in enumerate(test_dataloader):
    data ,emb_data, loan_id = data
    pre = model(data,emb_data)
    result.append([loan_id.numpy()[0], pre[:,1].numpy()[0]])
    # result.append([loan_id.numpy()[0], np.argmax(pre.numpy())])

pd.DataFrame(result,columns=['id','isDefault']).to_csv('result.csv',index=False)
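
Before submitting, it can help to run a quick sanity check on the result table. The sketch below uses a hypothetical helper, `check_submission`, on toy frames. The column names `id` and `isDefault` come from the code above; the requirement that predictions lie in [0, 1] is an assumption based on the model emitting sigmoid outputs, not a documented rule of the competition.

```python
import pandas as pd

def check_submission(df):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(df.columns) != ["id", "isDefault"]:
        problems.append("unexpected columns: {}".format(list(df.columns)))
        return problems
    if df["id"].duplicated().any():
        problems.append("duplicate ids")
    if df["isDefault"].isna().any():
        problems.append("missing predictions")
    if not df["isDefault"].dropna().between(0.0, 1.0).all():
        problems.append("predictions outside [0, 1]")
    return problems

# Toy frames standing in for the contents of result.csv.
good = pd.DataFrame({"id": [1, 2, 3], "isDefault": [0.1, 0.8, 0.5]})
bad = pd.DataFrame({"id": [1, 1, 3], "isDefault": [0.1, 1.2, None]})
print(check_submission(good))
print(check_submission(bad))
```

The same function can be applied to the frame read back from `result.csv`.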

This example is based on the project at https://aistudio.baidu.com/aistudio/datasetdetail/130187/1