<h2>Definitions (reference: <a href="https://github.com/WangliLin/xunfei2021_car_loan_top1">https://github.com/WangliLin/xunfei2021_car_loan_top1</a>)</h2>

In [1]:
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.metrics import roc_auc_score, auc, roc_curve, accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from tqdm import tqdm
import pickle
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import os
from sklearn.preprocessing import StandardScaler, QuantileTransformer, KBinsDiscretizer, LabelEncoder, MinMaxScaler, PowerTransformer
from sklearn.preprocessing import OrdinalEncoder

def train_lgb_kfold(X_train, y_train, X_test, n_fold=5, cate_feats=None):
    '''train lightgbm with k-fold split'''
    gbms = []
    kfold = StratifiedKFold(n_splits=n_fold, random_state=1024, shuffle=True)
    oof_preds = np.zeros((X_train.shape[0],))
    test_preds = np.zeros((X_test.shape[0],))

    for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
        logging.info(f'############ fold {fold} ###########')
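        # Positional indexing for both features and labels, so the split does not depend on the index labels
        X_tr, X_val = X_train.iloc[train_index], X_train.iloc[val_index]
        y_tr, y_val = y_train.iloc[train_index], y_train.iloc[val_index]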
        dtrain = lgb.Dataset(X_tr, y_tr, categorical_feature=cate_feats)
        dvalid = lgb.Dataset(X_val, y_val, categorical_feature=cate_feats, reference=dtrain)

        params = {
            'objective': 'binary',
            'metric': 'auc',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'min_data_in_leaf': 50,
            'feature_fraction': 0.8,
            'bagging_fraction': 0.8,
            'n_jobs': -1,
            'seed': 1024
        }

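        # Early stopping and periodic logging are passed as callbacks; the
        # early_stopping_rounds / verbose_eval keyword arguments were removed in LightGBM 4.0.
        gbm = lgb.train(params,
                        dtrain,
                        num_boost_round=300,
                        valid_sets=[dtrain, dvalid],
                        callbacks=[lgb.early_stopping(stopping_rounds=20),
                                   lgb.log_evaluation(period=50)])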

        oof_preds[val_index] = gbm.predict(X_val, num_iteration=gbm.best_iteration)
        test_preds += gbm.predict(X_test, num_iteration=gbm.best_iteration) / kfold.n_splits
        gbms.append(gbm)

    return gbms, oof_preds, test_preds

<h2>Loading the data</h2>

In [2]:
data = pd.read_csv('../data/账户交易信息.csv')
data_static = pd.read_csv('../data/账户静态信息.csv')
data_label = pd.read_csv('../data/训练集标签.csv')

<h2>Feature engineering</h2>

In [3]:
d = {}
users = data.zhdh.unique().tolist()
for user_id in tqdm(users):
    data_sample = data[data.zhdh == user_id].reset_index(drop=True)
    all_cnt = data_sample.shape[0] # number of transactions
    in_cnt = data_sample.jdbj.sum() # inbound count, outbound count, inbound share of transactions
    out_cnt = all_cnt - in_cnt
    in_ratio = in_cnt / all_cnt
    in_amt = data_sample[data_sample.jdbj == 1].jyje.sum() # inbound amount, outbound amount, inbound share of amount
    out_amt = data_sample[data_sample.jdbj == 0].jyje.sum()
    in_amt_ratio = in_amt / (in_amt + out_amt)
    in_user_cnt = data_sample[data_sample.jdbj == 1].dfzh.nunique() # distinct inbound counterparties, distinct outbound counterparties, inbound share of counterparties
    out_user_cnt = data_sample[data_sample.jdbj == 0].dfzh.nunique()
    in_user_ratio = in_user_cnt / (in_user_cnt + out_user_cnt)
    date_cnt = data_sample.jyrq.nunique() # number of distinct transaction dates
    d[user_id] = [all_cnt, in_cnt, out_cnt, in_ratio, in_amt, out_amt, in_amt_ratio, in_user_cnt, out_user_cnt, in_user_ratio, date_cnt]

data_df = pd.DataFrame.from_dict(d).T.reset_index()
data_df.columns = ['zhdh', 'all_cnt', 'in_cnt', 'out_cnt', 'in_ratio', 'in_amt', 'out_amt', 'in_amt_ratio', 'in_user_cnt', 'out_user_cnt', 'in_user_ratio', 'date_cnt']

100%|████████████████████████████████████████████████████████████████████████████████| 6000/6000 [04:17<00:00, 23.27it/s]
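
The loop above filters the full transaction table once per account, which likely accounts for most of the roughly four minutes shown by the progress bar. A minimal sketch of the same eleven features computed in a single `groupby` pass follows. The function name `build_account_features` and the toy frame are hypothetical, and the sketch assumes `jdbj` is coded 1 for inbound and 0 for outbound, as the loop implies. One behavioral difference: a zero denominator yields NaN here instead of raising.

```python
import pandas as pd

FEATURE_COLS = ['all_cnt', 'in_cnt', 'out_cnt', 'in_ratio', 'in_amt', 'out_amt',
                'in_amt_ratio', 'in_user_cnt', 'out_user_cnt', 'in_user_ratio', 'date_cnt']

def build_account_features(data: pd.DataFrame) -> pd.DataFrame:
    """Per-account features in one groupby pass (hypothetical helper)."""
    is_in = data['jdbj'] == 1
    is_out = data['jdbj'] == 0
    # Mask amounts and counterparties by direction so plain aggregations suffice.
    tmp = data.assign(
        in_amt=data['jyje'].where(is_in, 0.0),
        out_amt=data['jyje'].where(is_out, 0.0),
        in_dfzh=data['dfzh'].where(is_in),    # missing for outbound rows; nunique skips missing values
        out_dfzh=data['dfzh'].where(is_out),
    )
    feats = tmp.groupby('zhdh', sort=False).agg(
        all_cnt=('jdbj', 'size'),
        in_cnt=('jdbj', 'sum'),
        in_amt=('in_amt', 'sum'),
        out_amt=('out_amt', 'sum'),
        in_user_cnt=('in_dfzh', 'nunique'),
        out_user_cnt=('out_dfzh', 'nunique'),
        date_cnt=('jyrq', 'nunique'),
    )
    feats['out_cnt'] = feats['all_cnt'] - feats['in_cnt']
    feats['in_ratio'] = feats['in_cnt'] / feats['all_cnt']
    feats['in_amt_ratio'] = feats['in_amt'] / (feats['in_amt'] + feats['out_amt'])
    feats['in_user_ratio'] = feats['in_user_cnt'] / (feats['in_user_cnt'] + feats['out_user_cnt'])
    return feats[FEATURE_COLS].reset_index()

# Toy data for illustration only; the values are made up.
demo = pd.DataFrame({
    'zhdh': ['A', 'A', 'A', 'B', 'B'],
    'jdbj': [1, 0, 1, 0, 0],
    'jyje': [10.0, 5.0, 30.0, 7.0, 3.0],
    'dfzh': ['x', 'y', 'x', 'z', 'w'],
    'jyrq': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-01', '2020-01-01'],
})
demo_feats = build_account_features(demo)
```

The returned frame has the same column order as `data_df`, so it could stand in for it in the merge step, provided the coding assumption on `jdbj` holds.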


<h2>Merging static information and labels</h2>

In [4]:
data_static.columns = ['zhdh', 'khrq', 'khjgdh', 'xb', 'age']
df_feats = pd.merge(left=data_df, right=data_static[['zhdh', 'xb', 'age']], on='zhdh', how='left')
df_final = pd.merge(left=df_feats, right=data_label, on='zhdh', how='left')
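
Both merges above are left joins on `zhdh`, so a repeated `zhdh` in the static table or the label table would silently duplicate feature rows. A minimal sketch of guarding against that with the `validate` argument of `pd.merge` follows. The helper name `merge_one_row_per_account` and the toy frames are hypothetical, and whether the real files contain duplicates is not known from this notebook.

```python
import pandas as pd
from pandas.errors import MergeError

def merge_one_row_per_account(left, right, key='zhdh'):
    """Left-join `right` onto `left`; raise MergeError if `key` repeats on either side."""
    return pd.merge(left, right, on=key, how='left', validate='one_to_one')

# Toy frames for illustration only.
feats_demo = pd.DataFrame({'zhdh': ['A', 'B'], 'all_cnt': [3, 2]})
labels_ok = pd.DataFrame({'zhdh': ['A'], 'black_flag': [1]})
labels_dup = pd.DataFrame({'zhdh': ['A', 'A'], 'black_flag': [1, 0]})

# Account B has no label, so its black_flag is NaN (it would land in the test set).
merged_demo = merge_one_row_per_account(feats_demo, labels_ok)

try:
    merge_one_row_per_account(feats_demo, labels_dup)
    duplicate_detected = False
except MergeError:
    duplicate_detected = True
```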

<h2>Preparing the datasets</h2>

In [5]:
df_train = df_final[df_final.black_flag.notnull()].reset_index(drop=True)
df_test = df_final[df_final.black_flag.isnull()].reset_index(drop=True)

feats = df_train.columns[1:-1].tolist()
X_train = df_train[feats]
y_train = df_train['black_flag']
X_test = df_test[feats]

<h2>Training</h2>

In [6]:
gbms, oof_preds, test_preds = train_lgb_kfold(X_train, y_train, X_test, n_fold=5)

2023-02-12 12:32:02,238 : INFO : ############ fold 0 ###########
2023-02-12 12:32:02,374 : INFO : ############ fold 1 ###########


Training until validation scores don't improve for 20 rounds
[50]	training's auc: 0.982437	valid_1's auc: 0.972593
Early stopping, best iteration is:
[65]	training's auc: 0.986927	valid_1's auc: 0.976667
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[8]	training's auc: 0.962659	valid_1's auc: 0.952037


2023-02-12 12:32:02,440 : INFO : ############ fold 2 ###########
2023-02-12 12:32:02,582 : INFO : ############ fold 3 ###########
2023-02-12 12:32:02,636 : INFO : ############ fold 4 ###########


Training until validation scores don't improve for 20 rounds
[50]	training's auc: 0.984742	valid_1's auc: 0.930741
[100]	training's auc: 0.995917	valid_1's auc: 0.938889
Early stopping, best iteration is:
[81]	training's auc: 0.992625	valid_1's auc: 0.94037
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[7]	training's auc: 0.957394	valid_1's auc: 0.933333
Training until validation scores don't improve for 20 rounds

2023-02-12 12:32:02,728 : INFO : ############ fold 5 ###########



[50]	training's auc: 0.983489	valid_1's auc: 0.911111
Early stopping, best iteration is:
[36]	training's auc: 0.977867	valid_1's auc: 0.915926
Training until validation scores don't improve for 20 rounds
[50]	training's auc: 0.982899	valid_1's auc: 0.966296
Early stopping, best iteration is:
[59]	training's auc: 0.985693	valid_1's auc: 0.971481


2023-02-12 12:32:02,844 : INFO : ############ fold 6 ###########
2023-02-12 12:32:02,905 : INFO : ############ fold 7 ###########
2023-02-12 12:32:02,982 : INFO : ############ fold 8 ###########


Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[11]	training's auc: 0.962078	valid_1's auc: 0.926481
Training until validation scores don't improve for 20 rounds
[50]	training's auc: 0.984737	valid_1's auc: 0.948889
Early stopping, best iteration is:
[30]	training's auc: 0.975665	valid_1's auc: 0.94963
Training until validation scores don't improve for 20 rounds
[50]	training's auc: 0.984047	valid_1's auc: 0.921111

2023-02-12 12:32:03,130 : INFO : ############ fold 9 ###########



Early stopping, best iteration is:
[78]	training's auc: 0.991742	valid_1's auc: 0.926296
Training until validation scores don't improve for 20 rounds
[50]	training's auc: 0.982286	valid_1's auc: 0.977407
Early stopping, best iteration is:
[73]	training's auc: 0.989072	valid_1's auc: 0.984074
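
The log above reports validation AUC one fold at a time. A single pooled figure over the out-of-fold predictions can be easier to compare between runs. The sketch below is one way to compute both, assuming the folds are rebuilt with the same `StratifiedKFold` settings as `train_lgb_kfold` (`shuffle=True`, `random_state=1024`). The helper name `oof_auc_report` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def oof_auc_report(y_true, oof_preds, n_fold=5, seed=1024):
    """Return the pooled out-of-fold AUC and the per-fold validation AUCs (hypothetical helper)."""
    y_true = np.asarray(y_true)
    oof_preds = np.asarray(oof_preds)
    kfold = StratifiedKFold(n_splits=n_fold, random_state=seed, shuffle=True)
    # Only the labels determine the stratified split, so a dummy X of the right length is enough.
    dummy_X = np.zeros(len(y_true))
    fold_aucs = [roc_auc_score(y_true[val_idx], oof_preds[val_idx])
                 for _, val_idx in kfold.split(dummy_X, y_true)]
    return {'pooled_auc': roc_auc_score(y_true, oof_preds), 'fold_aucs': fold_aucs}

# Toy labels and perfectly separated scores, for illustration only.
y_demo = np.array([0] * 10 + [1] * 10)
p_demo = np.concatenate([np.linspace(0.0, 0.4, 10), np.linspace(0.6, 1.0, 10)])
report = oof_auc_report(y_demo, p_demo, n_fold=5)
```

In the notebook this would be called as `oof_auc_report(y_train, oof_preds, n_fold=...)` with the fold count used for training.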


<h2>Searching for the best threshold</h2>

In [7]:
def gen_thres_new(df_train, oof_preds):
    df_train['oof_preds'] = oof_preds
    quantile_point = df_train['black_flag'].mean()
    thres = df_train['oof_preds'].quantile(1 - quantile_point)

    _thresh = []
    for thres_item in np.arange(thres - 0.2, thres + 0.2, 0.01):
        _thresh.append(
            [thres_item, f1_score(df_train['black_flag'], np.where(oof_preds > thres_item, 1, 0), average='macro')])

    _thresh = np.array(_thresh)
    best_id = _thresh[:, 1].argmax()
    best_thresh = _thresh[best_id][0]

    print("Threshold: {}\nTraining-set F1: {}".format(best_thresh, _thresh[best_id][1]))
    return best_thresh

best_thresh = gen_thres_new(df_train, oof_preds)

Threshold: 0.31801314355750926
Training-set F1: 0.8743046665782968
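
The search starts from the score quantile that would flag the same share of accounts as the observed positive rate, then scans ±0.2 around it in steps of 0.01 and keeps the threshold with the highest macro-F1. A minimal array-only sketch of that idea follows. Unlike `gen_thres_new` it does not add a column to `df_train`; the function name `search_threshold` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score

def search_threshold(y_true, scores, half_width=0.2, step=0.01):
    """Scan thresholds around the positive-rate quantile; return (best_threshold, best_macro_f1)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos_rate = y_true.mean()
    # Threshold that would flag roughly `pos_rate` of the samples.
    start = np.quantile(scores, 1 - pos_rate)
    candidates = np.arange(start - half_width, start + half_width, step)
    f1s = [f1_score(y_true, (scores > t).astype(int), average='macro') for t in candidates]
    best = int(np.argmax(f1s))  # first maximum wins on ties
    return float(candidates[best]), float(f1s[best])

# Toy data for illustration only: two positives with clearly higher scores.
y_toy = [0, 0, 0, 0, 0, 0, 1, 1]
s_toy = [0.10, 0.20, 0.15, 0.30, 0.25, 0.05, 0.80, 0.90]
toy_thresh, toy_f1 = search_threshold(y_toy, s_toy)
```

Because the threshold is tuned on the same out-of-fold predictions it is scored on, the reported F1 is probably somewhat optimistic.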


<h2>Generating the submission</h2>

In [8]:
df_test['black_flag'] = np.where(test_preds > best_thresh, 1, 0)
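# index=False keeps the DataFrame index out of the file (assumes the submission expects only the two columns)
df_test[['zhdh', 'black_flag']].to_csv("../data/submit_%s.csv" % datetime.now().strftime("%Y-%m-%dT%H-%M"), index=False)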