## Machine learning for credit scoring

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which estimate the probability of default, are what banks use to decide whether to grant a loan. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that a borrower will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

| Variable Name                        | Description                                                 | Type       |
| ------------------------------------ | ----------------------------------------------------------- | ---------- |
| SeriousDlqin2yrs                     | Person experienced 90 days past due delinquency or worse    | Y/N        |
| RevolvingUtilizationOfUnsecuredLines | Total balance on credit divided by the sum of credit limits | percentage |
| age                                  | Age of borrower in years                                    | integer    |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due       | integer    |
| DebtRatio                            | Monthly debt payments divided by monthly gross income       | percentage |
| MonthlyIncome                        | Monthly income                                              | real       |
| NumberOfOpenCreditLinesAndLoans      | Number of open loans and lines of credit                    | integer    |
| NumberOfTimes90DaysLate              | Number of times borrower has been 90 days or more past due  | integer    |
| NumberRealEstateLoansOrLines         | Number of mortgage and real estate loans                    | integer    |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due       | integer    |
| NumberOfDependents                   | Number of dependents in family                              | integer    |


Read the data into pandas


In [2]:
import zipfile

import pandas as pd

pd.set_option("display.max_columns", 500)

with zipfile.ZipFile("KaggleCredit2.csv.zip", "r") as z:  # open the CSV stored inside the zip archive
    f = z.open("KaggleCredit2.csv")
    data = pd.read_csv(f, index_col=0)
data.head()

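|     | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age  | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
| --- | ---------------- | ------------------------------------ | ---- | ------------------------------------ | --------- | ------------- | ------------------------------- | ----------------------- | ---------------------------- | ------------------------------------ | ------------------ |
| 0   | 1                | 0.766127                             | 45.0 | 2.0                                  | 0.802982  | 9120.0        | 13.0                            | 0.0                     | 6.0                          | 0.0                                  | 2.0                |
| 1   | 0                | 0.957151                             | 40.0 | 0.0                                  | 0.121876  | 2600.0        | 4.0                             | 0.0                     | 0.0                          | 0.0                                  | 1.0                |
| 2   | 0                | 0.65818                              | 38.0 | 1.0                                  | 0.085113  | 3042.0        | 2.0                             | 1.0                     | 0.0                          | 0.0                                  | 0.0                |
| 3   | 0                | 0.23381                              | 30.0 | 0.0                                  | 0.03605   | 3300.0        | 5.0                             | 0.0                     | 0.0                          | 0.0                                  | 0.0                |
| 4   | 0                | 0.907239                             | 49.0 | 1.0                                  | 0.024926  | 63588.0       | 7.0                             | 0.0                     | 1.0                          | 0.0                                  | 0.0                |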
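As an alternative sketch, `pd.read_csv` can usually read a single-file zip archive directly, because it infers the compression from the `.zip` extension. The example below builds a tiny throwaway archive (the name `toy.csv.zip` and its contents are made up for illustration, not the Kaggle file) and reads it back in one call.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny zip archive holding one CSV (illustrative data only).
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "toy.csv.zip")
with zipfile.ZipFile(zip_path, "w") as z:
    z.writestr("toy.csv", ",a,b\n0,1,2.5\n1,3,4.5\n")

# read_csv infers zip compression from the extension;
# the archive must contain exactly one data file.
toy = pd.read_csv(zip_path, index_col=0)
print(toy.shape)
```

Opening the archive with `zipfile` explicitly, as the cell above does, remains the safer choice when the archive may hold more than one file.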


In [3]:
data.shape

(112915, 11)

Drop rows with missing values


In [4]:
data.isnull().sum(axis=0)

SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [5]:
data.dropna(inplace=True)  # remove rows that contain missing values
data.shape

(108648, 11)
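Dropping rows discards 4,267 of the 112,915 records here (about 3.8%). A minimal sketch, on a made-up four-row frame rather than the real data, of how to measure the share of incomplete rows and of one common alternative, filling gaps with the column median:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the same two columns that have NaN in the real data.
toy = pd.DataFrame(
    {
        "age": [45.0, np.nan, 38.0, 30.0],
        "NumberOfDependents": [2.0, 1.0, np.nan, 0.0],
    }
)

# Share of rows with at least one missing value.
missing_share = toy.isnull().any(axis=1).mean()

# Option 1: drop incomplete rows (what the notebook does).
dropped = toy.dropna()

# Option 2: keep every row and fill each gap with its column median.
filled = toy.fillna(toy.median())

print(missing_share, dropped.shape, filled.shape)
```

Which option is preferable depends on how much data is lost and on whether the missingness itself carries information; neither is claimed to be the better choice for this dataset.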

Create X and y


In [14]:
y = data["SeriousDlqin2yrs"]
X = data.drop("SeriousDlqin2yrs", axis=1)
print(X.shape)
print(y.shape)

(108648, 10)
(108648,)


In [15]:
y.mean()  # mean of the 0/1 label, i.e. the share of positive cases

0.06742876076872101
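A positive rate of about 6.7% means the classes are heavily imbalanced, so accuracy alone can be misleading. The sketch below uses randomly generated labels with roughly the same positive rate (an assumption for illustration, not the real `y`) to show that a "model" which always predicts 0 already reaches an accuracy of one minus the positive rate while finding no positives at all.

```python
import numpy as np

# Hypothetical labels with roughly the positive rate printed above.
rng = np.random.default_rng(0)
y_toy = (rng.random(10_000) < 0.0674).astype(int)

# A baseline that always predicts the majority class (no default).
always_zero = np.zeros_like(y_toy)

baseline_accuracy = (always_zero == y_toy).mean()
# Recall on the positive class: share of true positives that were predicted 1.
baseline_recall = always_zero[y_toy == 1].mean()

print(f"positive rate:     {y_toy.mean():.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall:   {baseline_recall:.4f}")
```

Any accuracy figure reported later is best read against this baseline.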

# Exercise 1

Split the data into a training set and a test set.


In [None]:
from sklearn import model_selection

# Split: 80% training set, 20% test set
x_tran, x_test, y_tran, y_test = model_selection.train_test_split(X, y, test_size=0.2)

print(x_test.shape)

(21730, 10)
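With so few positives, a plain random split can leave the two sets with slightly different positive rates, and without a seed the split changes on every run. A minimal sketch on toy arrays (not the Kaggle data) of a stratified, reproducible split; `stratify` and `random_state` are real `train_test_split` parameters, while the value 42 is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows, 10% positives.
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,
    stratify=y_toy,   # keep the class proportions equal in both sets
    random_state=42,  # fixed seed so the split is reproducible
)
print(X_te.shape, y_tr.mean(), y_te.mean())
```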


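# Exercise 2

Classify the data with an sklearn classification algorithm such as logistic regression, a decision tree, SVM, or KNN. Consult the sklearn API documentation to understand what the model parameters mean, and experiment with different settings.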


In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer

# Fill any remaining missing values with the column mean
imputer = SimpleImputer(strategy="mean")
x_filled = imputer.fit_transform(x_tran)

# Train the model
lr = LogisticRegression(
    # multi_class="ovr",  # legacy setting, removed
    multi_class="auto",
    solver="sag",
    class_weight="balanced",
    max_iter=100,  # sklearn's default; raise it if the solver warns about non-convergence
)
lr.fit(x_filled, y_tran)  # fit = train the model
# lr.fit(x_tran, y_tran)  # fitting on data that still contains NaN raises an error
print("Coefficients:", lr.coef_)  # one weight per feature
print("Intercept:", lr.intercept_)
# Evaluate
score = lr.score(x_filled, y_tran)
print(f"Model score: {score:.4f}")  # mean accuracy; 1.0 is the best possible

# NOTE: sklearn also warns that the solver did not converge; increasing max_iter resolves it


Coefficients: [[-9.08207367e-07 -1.93375876e-06  1.80273542e-06  6.99577958e-08
  -1.29147206e-05  1.39616615e-08  1.62206294e-06 -1.28869365e-08
   1.37058561e-06  2.88187500e-07]]
Intercept: [7.77318852e-08]
Model score: 0.9327
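Coefficients on the order of 1e-6 together with a non-convergence warning are often a sign that the features live on very different scales, which the `sag` solver handles poorly. A minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the credit data, of standardizing the inputs inside a pipeline before fitting; the sample size, class weights, and `max_iter=1000` are illustrative choices, not tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 10 features, roughly 7% positives.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.93], random_state=0
)
# Inflate one column to imitate an unscaled feature such as MonthlyIncome.
X_toy[:, 0] *= 10_000

# The scaler is fitted on the training data only and reapplied at predict time.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="sag", class_weight="balanced", max_iter=1000),
)
model.fit(X_toy, y_toy)

train_accuracy = model.score(X_toy, y_toy)
print(f"training accuracy: {train_accuracy:.4f}")
```

Wrapping the scaler and the classifier in one pipeline means the same transformation is applied automatically to the test set, which avoids fitting a second preprocessor on test data.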




# Exercise 3

Predict on the test set and compute the accuracy.


In [10]:
from sklearn.metrics import accuracy_score

# Reference: https://blog.csdn.net/qq_16095417/article/details/79590455
train_score = accuracy_score(y_tran, lr.predict(x_filled))

# x_test may also contain missing values; reuse the imputer fitted on the training set
x_test_filled = imputer.transform(x_test)

test_score = lr.score(x_test_filled, y_test)
print("Training accuracy:", train_score)
print("Test accuracy:", test_score)

Training accuracy: 0.9326721737729814
Test accuracy: 0.9320754716981132


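# Exercise 4

Read the official sklearn documentation to learn the evaluation metrics for classification problems, then evaluate this example with them.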


In [11]:
# Recall
from sklearn.metrics import recall_score

# average="macro" is the unweighted mean of the per-class recalls
train_recall = recall_score(y_tran, lr.predict(x_filled), average="macro")
test_recall = recall_score(y_test, lr.predict(x_test_filled), average="macro")
print("Training recall:", train_recall)
print("Test recall:", test_recall)

Training recall: 0.4999876646765678
Test recall: 0.5
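A macro-averaged recall of 0.5 on a binary problem is exactly what a classifier scores when it predicts the majority class for every row: recall is 1.0 on class 0 and 0.0 on class 1. The sketch below reproduces that situation on made-up labels and shows how per-class recall and the confusion matrix make it visible.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: 7 positives out of 100 (illustrative only).
y_true = np.array([1] * 7 + [0] * 93)
y_pred = np.zeros(100, dtype=int)  # a model that never predicts default

# average=None returns one recall value per class, ordered by label (0, then 1).
per_class = recall_score(y_true, y_pred, average=None)
macro = recall_score(y_true, y_pred, average="macro")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(per_class)
print(macro)
print(cm)
```

For credit scoring the recall on class 1 (the borrowers who default) is usually the number of interest, so reporting it separately is more informative than the macro average.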




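# Exercise 5

Banks usually have stricter requirements, because the consequences of fraud tend to be severe, so the model's decision criterion is typically adjusted.<br>
For example, logistic regression normally uses 0.5 as the probability decision boundary, but the threshold can be set lower to make the model more "sensitive". Try a threshold of 0.3 and look at the evaluation metrics again (mainly accuracy and recall).

Tip: many sklearn classifiers expose `predict_proba`, which returns the estimated probabilities; compare them against the chosen threshold to decide the final class.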


In [12]:
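import numpy as np

y_pro = lr.predict_proba(x_test_filled)  # predicted probabilities, one column per class
# Apply the 0.3 threshold to the class-1 column: predict 1 when P(class 1) >= 0.3.
# NOTE: the output below was recorded with list(p >= 0.3).index(1), which returns
# class 0 whenever P(class 0) >= 0.3, so it does not reflect this threshold; rerun to update.
y_prd2 = np.where(y_pro[:, 1] >= 0.3, 1, 0)
test_score_03 = accuracy_score(y_test, y_prd2)
print(test_score_03)

0.9320754716981132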
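To see what lowering the threshold does, it helps to compare accuracy and recall side by side at 0.5 and 0.3. A minimal sketch with made-up probabilities and labels (the arrays `p1` and `y_true` and the helper `evaluate` are hypothetical, not outputs of the model above):

```python
import numpy as np

# Hypothetical positive-class probabilities and true labels.
p1 = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.80])
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])


def evaluate(threshold):
    """Return (accuracy, recall on class 1) for a given decision threshold."""
    y_hat = (p1 >= threshold).astype(int)
    accuracy = (y_hat == y_true).mean()
    recall = y_hat[y_true == 1].mean()
    return accuracy, recall


for t in (0.5, 0.3):
    acc, rec = evaluate(t)
    print(f"threshold={t}: accuracy={acc:.3f}, recall={rec:.3f}")
```

On this toy data the lower threshold catches every positive at the cost of more false alarms, so recall rises while accuracy falls; how far to move the threshold depends on the relative cost of the two kinds of error.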


