# Using LightGBM to predict and analyze personal loan default risk in the finance industry

[*Datasets information:*](https://www.datafountain.cn/competitions/530/datasets)
1. train_public.csv: personal loan default records
2. train_internet_public.csv: default records from a Fintech product/service
3. test_public.csv : for prediction testing

**Result:**

0.84

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn
import lightgbm

train_public.csv: personal loan default records
train_internet_public.csv: default records from an online credit-loan product
test_public.csv: test data, used to obtain the leaderboard ranking

**Begin preprocessing**

In [2]:
##read the train and test datasets 
train_bank = pd.read_csv('./raw_data/train_public.csv')
train_internet = pd.read_csv('./raw_data/train_internet.csv')


In [47]:
test = pd.read_csv('./raw_data/test_public.csv')

In [3]:
## Find the column names shared by the bank training set and the
##      Fintech (internet) training set, keeping train_bank's column order
common_cols = [col for col in train_bank.columns if col in train_internet.columns]

print(common_cols)
len(common_cols)

['loan_id', 'user_id', 'total_loan', 'year_of_loan', 'interest', 'monthly_payment', 'class', 'employer_type', 'industry', 'work_year', 'house_exist', 'censor_status', 'issue_date', 'use', 'post_code', 'region', 'debt_loan_ratio', 'del_in_18month', 'scoring_low', 'scoring_high', 'pub_dero_bankrup', 'recircle_b', 'recircle_u', 'initial_list_status', 'earlies_credit_mon', 'title', 'policy_code', 'f0', 'f1', 'f2', 'f3', 'f4', 'early_return', 'early_return_amount', 'early_return_amount_3mon', 'is_default']


36

In [4]:
print(len(train_bank.columns)) ## 39 columns
print(len(train_internet.columns)) ## 42 columns

39
42


In [5]:
train_bank_left = list(set(list(train_bank.columns)) - set(common_cols))
train_internet_left = list(set(list(train_internet.columns)) - set(common_cols))
print('unique train_bank col:', train_bank_left,
        '\nunique train_internet col:', train_internet_left)

unique train_bank col: ['app_type', 'known_outstanding_loan', 'known_dero'] 
unique train_internet col: ['offsprings', 'marriage', 'house_loan_status', 'f5', 'sub_class', 'work_type']
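
The comparison of column names above can also be written with the set-style methods of the pandas `Index`. The following is only a sketch on two toy frames; the frame names and the reduced column lists are illustrative assumptions, not the real datasets.

```python
import pandas as pd

# Toy frames standing in for train_bank and train_internet (reduced, illustrative columns)
bank = pd.DataFrame(columns=['loan_id', 'total_loan', 'app_type', 'is_default'])
internet = pd.DataFrame(columns=['loan_id', 'total_loan', 'marriage', 'is_default'])

# Columns present in both frames
common = bank.columns.intersection(internet.columns)

# Columns that exist on one side only
only_bank = bank.columns.difference(internet.columns)
only_internet = internet.columns.difference(bank.columns)

print(list(common))
print(list(only_bank), list(only_internet))
```

Both approaches should give the same sets of names; the list comprehension has the advantage that the resulting order is explicit.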


In [6]:
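## Extract the common columns; .copy() gives independent frames, so the
##      column assignments below never act on a view of the raw data
train1_data = train_bank[common_cols].copy()
train2_data = train_internet[common_cols].copy()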


In [48]:
test_data = test[common_cols[:-1]].copy() ## all common columns except the last one, `is_default`

## Start using LightGBM

In [7]:
train1_data.info()
train1_data.select_dtypes(include = 'O')

| | class | employer_type | industry | work_year | issue_date | earlies_credit_mon |
|---|---|---|---|---|---|---|
| 0 | C | 政府机构 | 金融业 | 3 years | 2016/10/1 | 1-Dec |
| 1 | C | 政府机构 | 金融业 | 10+ years | 2013/6/1 | Apr-90 |
| 2 | A | 政府机构 | 公共服务、社会组织 | 10+ years | 2014/1/1 | Oct-91 |
| 3 | A | 世界五百强 | 文化和体育业 | 6 years | 2015/7/1 | 1-Jun |
| 4 | C | 政府机构 | 信息传输、软件和信息技术服务业 | < 1 year | 2016/7/1 | 2-May |
| ... | ... | ... | ... | ... | ... | ... |
| 9995 | B | 普通企业 | 建筑业 | 7 years | 2013/11/1 | 6-Feb |
| 9996 | A | 政府机构 | 农、林、牧、渔业 | 2 years | 2015/12/1 | May-97 |
| 9997 | B | 普通企业 | 信息传输、软件和信息技术服务业 | 10+ years | 2012/12/1 | Feb-87 |
| 9998 | D | 政府机构 | 农、林、牧、渔业 | 10+ years | 2018/3/1 | Oct-92 |


The frame contains string (object) columns, which LightGBM does not accept as they are. I therefore encode the categorical strings as integer labels and turn the date strings into numeric features: year, month, and the number of days since a base date.

Methods:
1. `issue_date` is converted to datetime format with pandas; `earlies_credit_mon` mixes formats such as `1-Dec` and `Apr-90`, so it is dropped before training instead.
2. `employer_type` and `industry` are converted to integer labels with a frequency-ordered mapping (label encoding, not one-hot encoding).
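
As a small illustration of the date handling described above, the sketch below derives year, month and day-difference features on a toy frame. The three dates and the column name `issue_date_diff` are made up for the example; only the date format follows the table shown earlier.

```python
import pandas as pd

# Toy frame; the strings mimic the `issue_date` format of the dataset
df = pd.DataFrame({'issue_date': ['2016/10/1', '2013/6/1', '2007/10/1']})

# Parse the strings, then read calendar parts through the .dt accessor
df['issue_date'] = pd.to_datetime(df['issue_date'])
df['issue_date_y'] = df['issue_date'].dt.year
df['issue_date_m'] = df['issue_date'].dt.month

# Days elapsed since a fixed reference date (here the earliest date in the toy data)
base_time = pd.Timestamp('2007-10-01')
df['issue_date_diff'] = (df['issue_date'] - base_time).dt.days

print(df)
```

The day difference keeps the ordering of the dates in a single numeric column, while year and month let the trees pick up seasonal or vintage effects separately.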

In [8]:
import datetime
## Silence pandas' chained-assignment warning (the default setting is 'warn');
##      the frames are modified on purpose and no extra copy is wanted
pd.options.mode.chained_assignment = None  

## Plain column assignment replaces the column, so its dtype becomes datetime64
##      and the .dt accessor works on it
train1_data['issue_date'] = pd.to_datetime(train1_data['issue_date'])
train1_data['issue_date_y'] = train1_data['issue_date'].dt.year
train1_data['issue_date_m'] = train1_data['issue_date'].dt.month

## train1_data['issue_date'].min() 
## returned: Timestamp('2007-10-01 00:00:00')
base_time = datetime.datetime.strptime('2007-10-01','%Y-%m-%d')
## Number of days between each issue date and the earliest one
train1_data['issues_date_diff'] = (train1_data['issue_date'] - base_time).dt.days


In [9]:
## after calculating the year, month, and day difference, `issue_date` is no longer needed
train1_data.drop('issue_date', axis=1, inplace=True) 

In [10]:
## Apply the same rules to the train2 dataset
train2_data['issue_date'] = pd.to_datetime(train2_data['issue_date'])
train2_data['issue_date_y'] = train2_data['issue_date'].dt.year
train2_data['issue_date_m'] = train2_data['issue_date'].dt.month

## Reuse the base date found in train1_data so both frames share one reference point:
## train1_data['issue_date'].min() returned Timestamp('2007-10-01 00:00:00')
base_time = datetime.datetime.strptime('2007-10-01','%Y-%m-%d')
train2_data['issues_date_diff'] = (train2_data['issue_date'] - base_time).dt.days


In [11]:
train2_data.drop('issue_date', axis=1, inplace=True) ## run once only: a second run raises KeyError because the column is already gone

In [12]:
## value_counts() sorts by frequency, so the most common category gets code 0
employer_type = train1_data['employer_type'].value_counts().index
industry = train1_data['industry'].value_counts().index 

## Build the code tables from the data instead of hard-coding 5 and 14 categories
empt_employer_dict = {name: code for code, name in enumerate(employer_type)}
empt_industry_dict = {name: code for code, name in enumerate(industry)}

## The train1 code tables are reused for train2; a category that occurs only in
##      train2 is mapped to NaN
train1_data['employer_type'] = train1_data['employer_type'].map(empt_employer_dict)
train2_data['employer_type'] = train2_data['employer_type'].map(empt_employer_dict)

train1_data['industry'] = train1_data['industry'].map(empt_industry_dict)
train2_data['industry'] = train2_data['industry'].map(empt_industry_dict)
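
Because the code tables are built from one frame and reused on the others, it can be worth checking for categories that the table has never seen. The sketch below shows the idea on toy data; the category names and the variables `reference`, `other` and `code_map` are hypothetical.

```python
import pandas as pd

# Toy data: `reference` defines the codes, `other` reuses them
reference = pd.Series(['gov', 'corp', 'gov', 'startup', 'gov', 'corp'])
other = pd.Series(['corp', 'gov', 'nonprofit'])  # 'nonprofit' never occurs in `reference`

# Most frequent category gets code 0, the next one 1, and so on
categories = reference.value_counts().index
code_map = {name: code for code, name in enumerate(categories)}

# Categories missing from the table become NaN after map()
encoded = other.map(code_map)
unseen = sorted(set(other.dropna()) - set(code_map))

print(code_map)
print(encoded.tolist(), unseen)
```

LightGBM can train on NaN values, so unseen categories will not raise an error, but listing them makes the silent loss of information visible.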

In [13]:
## train1_data['work_year'].isnull().sum() 
#   622 missing values, so fill them with fillna
## train2_data['work_year'].isnull().sum() 
#   43847 missing values, filled the same way

## arbitrary decision: treat a missing value as '10+ years'
train1_data['work_year'] = train1_data['work_year'].fillna('10+ years')
train2_data['work_year'] = train2_data['work_year'].fillna('10+ years')

## No generic label encoder here: each code should equal the number of years worked,
##    so work_year_coder is written by hand to match the order returned by
##    value_counts() on train1_data
work_year_list = list(train1_data['work_year'].value_counts().index)
work_year_coder = [10, 2, 3, 0, 1, 5, 4, 6, 8, 7, 9]
work_year_dict = dict(zip(work_year_list, work_year_coder))

train1_data['work_year'] = train1_data['work_year'].map(work_year_dict)
train2_data['work_year'] = train2_data['work_year'].map(work_year_dict)
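
The hand-written code list above depends on the frequency order of the categories. An alternative that does not depend on that order is to read the number out of the string itself. This is only a sketch: `encode_work_year` is a hypothetical helper, and it assumes the strings look like the ones in the table above (`< 1 year`, `3 years`, `10+ years`).

```python
import pandas as pd

def encode_work_year(value):
    """Map strings such as '< 1 year', '3 years', '10+ years' to 0, 3, 10."""
    if pd.isna(value):
        return None          # leave missing values missing
    if value.startswith('<'):
        return 0             # '< 1 year' counts as zero full years
    # Keep only the digits: '10+ years' -> '10', '3 years' -> '3'
    return int(''.join(ch for ch in value if ch.isdigit()))

work_year = pd.Series(['3 years', '10+ years', '< 1 year', None, '6 years'])
encoded = work_year.map(encode_work_year)
print(encoded.tolist())
```

With this approach the missing values stay missing unless they are filled explicitly, which keeps the fill decision separate from the encoding.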

In [14]:
train1_data['class'] = train1_data['class'].map({'A':0, 'B':1, 'C':2, 'D':3, 'E':4,'F':5, 'G':6})
train2_data['class'] = train2_data['class'].map({'A':0, 'B':1, 'C':2, 'D':3, 'E':4,'F':5, 'G':6})

Similarly, apply data cleaning and formatting rules for the test dataset

In [49]:
test_data['issue_date'] = pd.to_datetime(test_data['issue_date'])
test_data['issue_date_y'] = test_data['issue_date'].dt.year
test_data['issue_date_m'] = test_data['issue_date'].dt.month
## base_time is the same reference date used for the training sets
test_data['issues_date_diff'] = (test_data['issue_date'] - base_time).dt.days

test_data['employer_type'] = test_data['employer_type'].map(empt_employer_dict)
test_data['industry'] = test_data['industry'].map(empt_industry_dict)
test_data['work_year'] = test_data['work_year'].fillna('10+ years')
test_data['work_year'] = test_data['work_year'].map(work_year_dict)
test_data['class'] = test_data['class'].map({'A':0, 'B':1, 'C':2, 'D':3, 'E':4,'F':5, 'G':6})

In [50]:
test_data.drop('issue_date', axis=1, inplace=True)
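
The same cleaning steps are applied three times (train1, train2, test), so one option is to collect them in a single function and call it on each frame. The sketch below covers only the date and `class` steps on a toy frame; `add_common_features`, `BASE_TIME` and `CLASS_MAP` are hypothetical names introduced for the example.

```python
import pandas as pd

BASE_TIME = pd.Timestamp('2007-10-01')                # earliest issue date reported above
CLASS_MAP = {c: i for i, c in enumerate('ABCDEFG')}   # A -> 0, B -> 1, ..., G -> 6

def add_common_features(df):
    """Return a copy of df with date and class features derived (hypothetical helper)."""
    out = df.copy()
    issue_date = pd.to_datetime(out['issue_date'])
    out['issue_date_y'] = issue_date.dt.year
    out['issue_date_m'] = issue_date.dt.month
    out['issues_date_diff'] = (issue_date - BASE_TIME).dt.days
    out['class'] = out['class'].map(CLASS_MAP)
    # The raw date string is no longer needed once the features exist
    return out.drop(columns='issue_date')

toy = pd.DataFrame({'issue_date': ['2016/10/1', '2007/11/1'], 'class': ['C', 'A']})
features = add_common_features(toy)
print(features)
```

Returning a new frame instead of modifying the input keeps the raw data intact, so the cell can be re-run without errors from columns that were already dropped.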


In [17]:
import lightgbm as lgb
from sklearn import metrics
import pickle 

In [23]:
X_train1 = train1_data.drop(['loan_id','user_id','is_default','earlies_credit_mon'], axis=1)
y_train1 = train1_data['is_default']

In [24]:
X_train2 = train2_data.drop(['loan_id','user_id','is_default','earlies_credit_mon'], axis=1)
y_train2 = train2_data['is_default']

In [25]:
X_train = pd.concat([X_train1,X_train2])
y_train = pd.concat([y_train1, y_train2])

In [57]:
X_test = test_data.drop(['loan_id','user_id','earlies_credit_mon'], axis=1)

In [58]:
clf_ex = lgb.LGBMClassifier(n_estimators=200)
clf_ex.fit(X=X_train, y = y_train)
with open ('./clf_ex.pkl','wb') as file:
    pickle.dump(clf_ex, file)
pred = clf_ex.predict(X_test)

In [71]:
pred_probability = clf_ex.predict_proba(X_test)[:,1]

In [72]:
result = pd.DataFrame({'id':test['loan_id'], 'isDefault':pred, 'probability_score':pred_probability})
result.to_csv('baseline_lgbCls.csv', index=False)

| | id | isDefault | probability_score |
|---|---|---|---|
| 0 | 1000575 | 0 | 0.007183 |
| 1 | 1028125 | 0 | 0.009214 |
| 2 | 1010694 | 0 | 0.003304 |
| 3 | 1026712 | 0 | 0.008577 |
| 4 | 1002895 | 0 | 0.004391 |
| ... | ... | ... | ... |
| 4995 | 1008856 | 1 | 0.503161 |
| 4996 | 1016651 | 0 | 0.063326 |
| 4997 | 1024140 | 0 | 0.003433 |
| 4998 | 1014316 | 0 | 0.002917 |
