
Suggestion for improving how ScoreCard is saved #110

Closed
FrankDataAnalystPython opened this issue Nov 30, 2022 · 2 comments

FrankDataAnalystPython commented Nov 30, 2022

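Probably few people run into this small bug. When building a ScoreCard, most users simply rely on the defaults pdo=60, rate=2, base_odds=35, base_score=750 and never set them explicitly. These values play a major role in computing the score of each bin.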

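When calling ScoreCard.export, I found that the bin scores are saved, but the values of pdo, rate, base_odds and base_score are not saved at all. When calling ScoreCard().load(), the bin scores are loaded, but self.factor and self.offset in ScoreCard() are still computed from the default values, so the scores they produce are based on the defaults. As a result, the predict results before and after the round trip are inconsistent.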

The problem is reproduced below. The code uses the pipeline branch.

import os
import numpy as np
import pytest
import pandas as pd
from os.path import join
from toad.pipeline import Toad_Pipeline
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

lb  = load_breast_cancer()
X = pd.DataFrame(lb['data'], columns=lb['feature_names'])
Y = lb['target']
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.3, random_state=42)

toad_pipe = Toad_Pipeline()
toad_pipe = toad_pipe.set_params(**params)
toad_pipe = toad_pipe.fit(Xtrain, Ytrain)
Xtrain_ = toad_pipe.transform(Xtrain)

from toad import ScoreCard

card1 = ScoreCard(
    combiner=toad_pipe.combiner,
    transer=toad_pipe.woe,
    base_score=1000,
    pdo=10,
    rate=5,
    base_odds=50
)
card1 = card1.fit(Xtrain_, Ytrain)

card1_result = card1.export()
print(card1_result['mean texture'])
print(card1.predict(Xtrain)[:5])
[910.61408963 891.17084246 960.33222975 900.2856913  932.18954537]
# Load the ScoreCard
card2 = ScoreCard().load(card1_result)
print(card2.predict(Xtrain_)[:5])
[910.63 891.17 960.33 900.3  932.21]
# Because self.offset and self.factor clearly differ, score_to_proba and proba_to_score give very different results
card1.score_to_proba(970), card2.score_to_proba(970)
(0.7142857142857154, 0.002244808514989028)
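
The two probabilities above can be checked by hand. The sketch below is not toad's code: it assumes the usual scorecard scaling, factor = pdo / ln(rate) and offset = base_score - factor * ln(base_odds), and a logistic score-to-probability mapping. The helper names `scaling` and `score_to_proba` are hypothetical. Under these assumptions, the parameters passed to card1 and the defaults silently used by card2 give the two values reported above.

```python
import math


def scaling(pdo, rate, base_odds, base_score):
    """Assumed scaling: factor and offset derived from the four card parameters."""
    factor = pdo / math.log(rate)
    offset = base_score - factor * math.log(base_odds)
    return factor, offset


def score_to_proba(score, factor, offset):
    """Assumed mapping: odds = exp((score - offset) / factor), proba = 1 / (odds + 1)."""
    return 1 / (math.exp((score - offset) / factor) + 1)


# card1 was built with explicit parameters.
p_card1 = score_to_proba(970, *scaling(pdo=10, rate=5, base_odds=50, base_score=1000))
# card2 was loaded from the export, so it falls back to the defaults.
p_card2 = score_to_proba(970, *scaling(pdo=60, rate=2, base_odds=35, base_score=750))

print(p_card1, p_card2)
```

The bin scores survive the export, which is why predict still agrees to within rounding; only the conversions that depend on factor and offset diverge.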

Proposed solution

  1. When exporting to JSON, save it in the following format:
{
    "card_params": {
        "pdo": 60,
        "rate": 2,
        "base_odds": 35,
        "base_score": 750
    },
    "bins_scores": {
        ...
    }
}
  2. When loading, something like the following could be used:
@classmethod
def load(
    cls,
    json_dict,
):
    params = json_dict['card_params']
    bins = json_dict['bins_scores']
    return cls(**params)._load(bins)
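
To make the proposal concrete, here is a minimal round-trip sketch. `MiniCard` is a hypothetical toy class, not toad's ScoreCard, the factor/offset formulas are the usual scorecard scaling and are assumed here, and the single bin score is a made-up placeholder. It only shows that storing `card_params` next to `bins_scores` lets the loaded object recompute the same factor and offset.

```python
import json
import math


class MiniCard:
    """Toy stand-in for a scorecard (hypothetical, not toad's ScoreCard)."""

    def __init__(self, pdo=60, rate=2, base_odds=35, base_score=750):
        self.pdo = pdo
        self.rate = rate
        self.base_odds = base_odds
        self.base_score = base_score
        # Assumed scaling formulas.
        self.factor = pdo / math.log(rate)
        self.offset = base_score - self.factor * math.log(base_odds)
        self.bins_scores = {}

    def export(self):
        """Export the card parameters together with the bin scores."""
        return {
            "card_params": {
                "pdo": self.pdo,
                "rate": self.rate,
                "base_odds": self.base_odds,
                "base_score": self.base_score,
            },
            "bins_scores": self.bins_scores,
        }

    @classmethod
    def load(cls, json_dict):
        """Rebuild a card from the exported dict, restoring the parameters."""
        card = cls(**json_dict["card_params"])
        card.bins_scores = json_dict["bins_scores"]
        return card


card1 = MiniCard(pdo=10, rate=5, base_odds=50, base_score=1000)
card1.bins_scores = {"mean texture": {"bin_0": 12.5}}  # made-up placeholder

# Round trip through an actual JSON string, as a saved file would.
card2 = MiniCard.load(json.loads(json.dumps(card1.export())))
print(card1.factor == card2.factor, card1.offset == card2.offset)
```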

Secbone commented Dec 27, 2022

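@FrankDataAnalystPython Good suggestion. I have actually been thinking about changing this part for a while, but I held off because changing the exported JSON format would cause compatibility problems between versions. I will consider designing a new JSON export format as an upgrade, and also think about how to solve the compatibility problem.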
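
One possible way to handle the compatibility concern, sketched under assumptions: the legacy export is taken to be a flat mapping from feature name to bin scores (as `card1_result['mean texture']` in the report suggests), and the new layout is the one proposed above. `split_card_dict` and `DEFAULT_PARAMS` are hypothetical names, not part of toad.

```python
# Defaults quoted in the issue; used when an export carries no parameters.
DEFAULT_PARAMS = {"pdo": 60, "rate": 2, "base_odds": 35, "base_score": 750}


def split_card_dict(card_dict):
    """Return (params, bins) for either the new or the legacy layout."""
    if "card_params" in card_dict and "bins_scores" in card_dict:
        # New layout: explicit parameters override the defaults.
        params = {**DEFAULT_PARAMS, **card_dict["card_params"]}
        return params, card_dict["bins_scores"]
    # Legacy layout: the whole dict is the bin scores, parameters are unknown.
    return dict(DEFAULT_PARAMS), card_dict


legacy = {"mean texture": {"bin_0": 12.5}}  # made-up placeholder scores
new = {
    "card_params": {"pdo": 10, "rate": 5, "base_odds": 50, "base_score": 1000},
    "bins_scores": legacy,
}

print(split_card_dict(legacy)[0])
print(split_card_dict(new)[0])
```

Detecting the layout by key names would misfire if a feature were literally named `card_params`; an explicit version field in the export would be a more robust marker.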

@FrankDataAnalystPython
Author

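> @FrankDataAnalystPython Good suggestion. I have actually been thinking about changing this part for a while, but I held off because changing the exported JSON format would cause compatibility problems between versions. I will consider designing a new JSON export format as an upgrade, and also think about how to solve the compatibility problem.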

Agreed, agreed.
