## Scorecard

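### In bank lending, a scorecard expresses a customer's credit risk as a score. It measures the likelihood that the borrower (the credit recipient, for example a company seeking financing) fails to repay principal and interest on schedule as the contract requires, causing a financial loss to the lender (the credit grantor, such as a bank or another financial institution). In general, the higher the score a scorecard assigns, the better the customer's credit and the lower the risk.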

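### These borrowers may be individuals, or companies and enterprises with financing needs. For enterprises, depending on what the financing is used for, we apply a corporate financing model, a cash-flow financing model, a project financing model, and so on. For individuals, there are "four cards" for judging creditworthiness: the A card, B card, C card and F card. What people usually call a "scorecard" is actually the A card, also known as the application rating model. It is mainly used to rate new users in financing businesses, that is, to decide whether a financial institution should lend to a new user; if the person's risk is too high, the loan can be refused.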

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from imblearn.over_sampling import SMOTE  # provided by the imbalanced-learn package
from sklearn.linear_model import LogisticRegression as LR
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree
from sklearn.model_selection import train_test_split 
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline


In [None]:
rc = pd.read_csv(r"C:\Users\Cracker Park\Desktop\培训材料\dataset\rankingcard.csv", index_col = 0)

pd.set_option('display.max_rows', 1000,'display.max_columns', 1000,"display.max_colwidth", None,'display.width',None)

In [None]:
rc.head()

In [None]:
rc.tail()

In [None]:
rc.shape

In [None]:
rc.info()

<img src="./借款人信息属性表.png" style="zoom:80%" />

In [None]:
rc.duplicated().sum()

In [None]:
rc.drop_duplicates(inplace = True)

In [None]:
describe = rc.describe().T
describe

In [None]:
rc.isna().sum()

In [None]:
describe['null'] = rc.isna().sum()
describe['% of Total Values'] = describe['null'] / len(rc) *100
describe

In [None]:
missvalue = describe[['null','% of Total Values','min','max']].sort_values(by = 'null', ascending = False)
missvalue

In [None]:
rc.MonthlyIncome.value_counts()

In [None]:
rc['MonthlyIncome'] = rc['MonthlyIncome'].fillna(rc['MonthlyIncome'].median())  # fill missing monthly income with the median

In [None]:
rc.NumberOfDependents.value_counts()

In [None]:
sn.catplot(x = 'NumberOfDependents', data = rc, kind = 'count', aspect = 2)

In [None]:
rc['NumberOfDependents'] = rc['NumberOfDependents'].fillna(rc['NumberOfDependents'].mode()[0])  # fill missing dependents counts with the mode

In [None]:
rc.isna().sum()

### Outlier handling

In [None]:
describe

In [None]:
rc[rc.age < 18]

In [None]:
rc.drop(rc[rc.age < 18].index, inplace = True)

In [None]:
sn.catplot(x = 'NumberOfTime30-59DaysPastDueNotWorse', data = rc, kind = 'count', aspect = 2)

In [None]:
rc[rc['NumberOfTime30-59DaysPastDueNotWorse'] > 90].count()

In [None]:
rc.drop(rc[rc['NumberOfTime30-59DaysPastDueNotWorse'] > 90].index, inplace = True)
rc = rc.reset_index(drop = True)  # reset the index; the result must be assigned back to take effect

### Sample distribution

In [None]:
rc.SeriousDlqin2yrs.value_counts()

In [None]:
sn.catplot(x = 'SeriousDlqin2yrs', data = rc, kind = 'count')

In [None]:
x = rc.drop(columns = 'SeriousDlqin2yrs')
y = rc['SeriousDlqin2yrs']

Xtrain, Xtest, Ytrain, Ytest = train_test_split(x, y, test_size = 0.3, random_state = 1912)  # split into training and test sets

print('Training samples:', Xtrain.shape[0], 'Test samples:', Xtest.shape[0])

In [None]:
sm = SMOTE(random_state = 1912)
Xtrain,Ytrain = sm.fit_resample(Xtrain,Ytrain)

Ytrain.value_counts()

### Binning

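### Discretizing a continuous variable inevitably loses information, and the fewer the bins, the greater the loss. To measure how much information a feature carries and how much it contributes to the prediction function, the banking industry defines the Information Value (IV):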
#### $$IV = \sum_{i=1}^{N}\left(good\%_i - bad\%_i\right) * WOE_i$$
#### $$WOE_i = \ln\left(\frac{good\%_i}{bad\%_i}\right)$$
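### WOE, the Weight of Evidence, is the indicator the banking industry uses to measure default probability. It is the logarithm of the ratio between the share of good customers and the share of bad customers that fall in a bin. WOE is defined per bin: the larger the WOE, the larger the proportion of good customers in that bin. IV is defined for the whole feature: it represents the amount of information the feature carries and its contribution to the model, and is read against the following scale:
#### IV < 0.03: the feature carries almost no useful information and contributes nothing to the model; it can be ignored
#### 0.03 ≤ IV < 0.1: very little useful information, low contribution to the model
#### 0.1 ≤ IV < 0.3: a moderate amount of useful information, medium contribution to the model
#### 0.3 ≤ IV < 0.5: a fairly large amount of useful information, fairly high contribution to the model
#### IV ≥ 0.5: a very large amount of useful information, very high contribution to the model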
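
To make the two definitions concrete, here is a minimal worked sketch on made-up numbers. The three bins and their good/bad counts are purely illustrative assumptions (they are not taken from `rankingcard.csv`), and the column names `good_pct` and `bad_pct` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical counts of good (y = 0) and bad (y = 1) customers in three bins.
toy = pd.DataFrame(
    {"good": [100, 300, 600], "bad": [60, 30, 10]},
    index=["bin_1", "bin_2", "bin_3"],
)

# Share of all good / all bad customers that falls in each bin.
toy["good_pct"] = toy["good"] / toy["good"].sum()
toy["bad_pct"] = toy["bad"] / toy["bad"].sum()

# WOE per bin, and each bin's contribution to the feature's IV.
toy["woe"] = np.log(toy["good_pct"] / toy["bad_pct"])
toy["iv"] = (toy["good_pct"] - toy["bad_pct"]) * toy["woe"]

iv_total = toy["iv"].sum()
print(toy.round(4))
print("IV =", round(iv_total, 4))
```

In this toy table the middle bin holds the same share of good and bad customers, so its WOE is 0 and it adds nothing to the IV; the two outer bins, where the shares differ most, account for all of it.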


In [None]:
def decisiontree_binning(x, y):
    boundary = []  # empty list to collect the bin boundaries
    
    x = x.values
    y = y.values
    
    clf = DecisionTreeClassifier(criterion = 'entropy'
                                ,max_leaf_nodes = 20
                                ,min_samples_leaf = 0.05
                                )    # instantiate the decision tree
    clf.fit(x.reshape(-1,1),y)  # fit the decision tree on the single feature
    
    n_nodes = clf.tree_.node_count
    children_left = clf.tree_.children_left
    children_right = clf.tree_.children_right
    threshold = clf.tree_.threshold
    
    for i in range(n_nodes):
        if children_left[i] != children_right[i]:
            boundary.append(threshold[i])
            
    boundary.sort()
    
    min_x = x.min()
    max_x = x.max() + 0.1
    boundary = [min_x] + boundary + [max_x]
    
    return boundary       

In [None]:
for ft in Xtrain.columns:
    result = decisiontree_binning(x = Xtrain[ft], y = Ytrain
    )
    
    print('Feature:',ft,
          '\n''Number of bins:{}'.format(len(result)-1),
          '\n''Bin edges:{}'.format(result))

In [None]:
def feature_woe_iv(x, y):
    
    boundary = decisiontree_binning(x, y)
    df = pd.concat([x, y], axis = 1)
    df.columns = ['x', 'y']
    df['bins'] = pd.cut(x = x, bins = boundary, right = False)
    
    grouped = df.groupby('bins')['y']
    result_df = grouped.agg([('good', lambda y: (y == 0).sum()), 
                             ('bad', lambda y: (y == 1).sum()),
                             ('total', 'count')])
    
    result_df['% of the good'] = result_df['good'] / result_df['good'].sum()       # share of good customers
    result_df['% of the bad'] = result_df['bad'] / result_df['bad'].sum()          # share of bad customers
    result_df['% of the total'] = result_df['total'] / result_df['total'].sum()    # share of all customers

    result_df['bad_rate'] = result_df['bad'] / result_df['total']             # bad rate within the bin
    
    result_df['woe'] = np.log(result_df['% of the good'] / result_df['% of the bad'])              # WOE
    result_df['iv'] = (result_df['% of the good'] - result_df['% of the bad']) * result_df['woe']  # IV

    print('IV = {}'.format(result_df['iv'].sum()))
    return result_df

In [None]:
feature_woe_iv(Xtrain.age, Ytrain)

In [None]:
for ft in Xtrain.columns:
    print(ft)
    feature_woe_iv(x = Xtrain[ft], y = Ytrain
    )

In [None]:
feature_woe_iv(Xtrain.RevolvingUtilizationOfUnsecuredLines, Ytrain)

In [None]:
feature_woe_iv(Xtrain.age, Ytrain)

In [None]:
feature_woe_iv(Xtrain['NumberOfTime30-59DaysPastDueNotWorse'], Ytrain)

In [None]:
feature_woe_iv(Xtrain['DebtRatio'], Ytrain)

In [None]:
feature_woe_iv(Xtrain['NumberOfTimes90DaysLate'], Ytrain)

In [None]:
feature_woe_iv(Xtrain['NumberRealEstateLoansOrLines'], Ytrain)

In [None]:
feature_woe_iv(Xtrain['NumberOfTime60-89DaysPastDueNotWorse'], Ytrain)

In [None]:
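ft_bins = {}

for ft in Xtrain.columns:
    bounds = decisiontree_binning(x = Xtrain[ft], y = Ytrain)
    # keep only the interior split points; the outer edges become -inf / +inf,
    # so the bins match those used in feature_woe_iv and cover unseen values
    ft_bins[ft] = [-np.inf, *bounds[1:-1], np.inf]

# manually chosen bins for NumberOfDependents: (-inf,0], (0,1], (1,2], (2,inf)
ft_bins['NumberOfDependents'] = [-np.inf, 0, 1, 2, np.inf]

ft_bins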

In [None]:
def get_woe(df,ft,y,bins):
    df = df[[ft,y]].copy()
    df["cut"] = pd.cut(df[ft],bins)
    bins_df = df.groupby("cut", observed = True)[y].value_counts().unstack()
    woe = bins_df["woe"] = np.log((bins_df[0]/bins_df[0].sum())/(bins_df[1]/bins_df[1].sum()))
    return woe

woeall = {}
for ft in ft_bins:
    woeall[ft] = get_woe(pd.concat([Xtrain,Ytrain], axis =1),ft,"SeriousDlqin2yrs",ft_bins[ft])
    
woeall

In [None]:
train_woe = pd.DataFrame(index = Xtrain.index)

for ft in ft_bins:
    train_woe[ft] = pd.cut(Xtrain[ft],ft_bins[ft]).map(woeall[ft])

train_woe['SeriousDlqin2yrs'] = Ytrain
train_woe

In [None]:
test_woe = pd.DataFrame(index = Xtest.index)

for ft in ft_bins:
    test_woe[ft] = pd.cut(Xtest[ft],ft_bins[ft]).map(woeall[ft])

test_woe['SeriousDlqin2yrs'] = Ytest
test_woe

### Modeling

In [None]:
Xtrain = train_woe.iloc[:,:-1]
Ytrain = train_woe.iloc[:,-1]

Xtest = test_woe.iloc[:,:-1]
Ytest = test_woe.iloc[:,-1]

lr = LR()
lr = lr.fit(Xtrain,Ytrain)

score_train = lr.score(Xtrain,Ytrain) 
score_test = lr.score(Xtest,Ytest) 
print('Training accuracy:',score_train,'Test accuracy:',score_test)

In [None]:
from sklearn.metrics import recall_score, precision_score, f1_score

recall_score(Ytest, lr.predict(Xtest))


In [None]:
precision_score(Ytest, lr.predict(Xtest))

In [None]:
f1_score(Ytest, lr.predict(Xtest))

In [None]:
co = pd.concat([pd.DataFrame(Xtrain.columns),pd.DataFrame(lr.coef_).T]
               ,axis = 1
               ,ignore_index = True
              )
co.columns = ['feature','coefficient']
co.sort_values(by = 'coefficient', ascending = False)

## Scorecard

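#### The score is computed as:
#### $$Score = A - B*\log(odds)$$
#### Here $A$ and $B$ are constants: $A$ is called the "offset" and $B$ the "factor" (scale). $odds$ is the odds of default, i.e. the probability of default divided by the probability of no default. In logistic regression the log-odds equals $\boldsymbol\theta^T\cdot x$, the linear combination of the fitted parameters and the features, so $\log(odds)$ can be read directly from the fitted model. $A$ and $B$ are determined by two assumptions:
####     1. The expected score at a specified odds of default
####     2. The number of points that corresponds to a doubling of the odds (PDO, Points to Double the Odds)
#### E.g. suppose the score is set to 600 when the odds are $\frac{1}{60}$ and PDO = 20. Because higher odds of default lower the score in the formula above, the score at odds $\frac{1}{30}$ is 580. Substituting into the formula gives:
#### $$600 = A - B*\log(\frac{1}{60})$$
#### $$580 = A - B*\log(\frac{1}{30})$$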
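
As a quick numerical check of the formula $Score = A - B*\log(odds)$, the sketch below solves for the two constants and evaluates the score at the reference odds and at doubled odds. The helper names `scorecard_constants` and `score_from_odds`, and the `_demo` variables, are hypothetical and chosen so they do not overwrite the notebook's own `A` and `B`. It assumes the convention that doubling the odds of default lowers the score by PDO points.

```python
import numpy as np

def scorecard_constants(base_score, base_odds, pdo):
    """Solve Score = A - B*log(odds) for A and B.

    Assumes the score equals `base_score` at `base_odds` and drops by
    `pdo` points each time the odds of default double.
    """
    B = pdo / np.log(2)
    A = base_score + B * np.log(base_odds)
    return A, B

def score_from_odds(odds, A, B):
    """Evaluate the scorecard formula for given odds of default."""
    return A - B * np.log(odds)

A_demo, B_demo = scorecard_constants(base_score=600, base_odds=1 / 60, pdo=20)
score_at_base = score_from_odds(1 / 60, A_demo, B_demo)
score_at_double = score_from_odds(1 / 30, A_demo, B_demo)
print(score_at_base, score_at_double)
```

With these inputs the score at odds 1/60 comes out at 600 and the score at odds 1/30 at 580, up to floating-point rounding: riskier applicants receive lower scores.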

In [None]:
# solve for A and B

B = 20/np.log(2)
A = 600 + B*np.log(1/60)

B,A

In [None]:
# base score

base_score = A - B*lr.intercept_[0]  # take the scalar intercept so a plain number is written to the file
base_score

In [None]:
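score_age = woeall["age"] * (-B*lr.coef_[0][1])   # lr.coef_ holds the logistic-regression coefficient of each feature; index 1 is assumed to be "age"
score_age  # score assigned to each bin of the "age" feature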

In [None]:
file = r"C:\Users\Cracker Park\Desktop\ScoreData.csv"

with open(file,"w") as fdata:
    fdata.write("base_score,{}\n".format(base_score))
for i,col in enumerate(Xtrain.columns):
    score = woeall[col] * (-B*lr.coef_[0][i])
    score.name = "Score"
    score.index.name = col
    score.to_csv(file,header=True,mode="a")
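
The notebook stops at writing the per-bin points to a file. A minimal sketch of how such a table might then be applied to one applicant is shown below: look up the bin each feature value falls into and add that bin's points to the base score. Every number here (bin edges, points, base score) and the names `demo_bins`, `demo_points` and `score_applicant` are made up for illustration and are not the values produced by the model above.

```python
import numpy as np
import pandas as pd

# Hypothetical scorecard: bin edges and the points assigned to each bin.
demo_bins = {
    "age": [-np.inf, 30, 50, np.inf],
    "DebtRatio": [-np.inf, 0.5, np.inf],
}
demo_points = {
    "age": [-10.0, 5.0, 15.0],
    "DebtRatio": [8.0, -12.0],
}
demo_base_score = 500.0

def score_applicant(applicant, bins, points, base):
    """Add the points of the bin each feature value falls into to the base score."""
    total = base
    for ft, edges in bins.items():
        # labels=False makes pd.cut return the integer position of the
        # right-closed bin (a, b] that contains the value
        idx = pd.cut([applicant[ft]], edges, labels=False)[0]
        total += points[ft][idx]
    return total

applicant = {"age": 42, "DebtRatio": 0.3}
total_score = score_applicant(applicant, demo_bins, demo_points, demo_base_score)
print(total_score)
```

Using the same right-closed intervals as `pd.cut` in the WOE mapping keeps the lookup consistent with how the bins were assigned during training.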