<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#工具函数" data-toc-modified-id="工具函数-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Utility Funcations(工具函数)</a></span></li><li><span><a href="#宏变量配置" data-toc-modified-id="宏变量配置-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Macro Variables(宏变量配置)</a></span><ul class="toc-item"><li><span><a href="#固定配置参数" data-toc-modified-id="固定配置参数-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Fixed Parameters(固定配置参数)</a></span></li><li><span><a href="#可调节超参数" data-toc-modified-id="可调节超参数-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Hyperparameters(可调节超参数)</a></span></li></ul></li><li><span><a href="#准备工作" data-toc-modified-id="准备工作-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Preparation Works(准备工作)</a></span><ul class="toc-item"><li><span><a href="#获取所有列名" data-toc-modified-id="获取所有列名-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Obtain Columns(获取所有列名)</a></span></li><li><span><a href="#数据集划分" data-toc-modified-id="数据集划分-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Data Set Split(数据集划分)</a></span></li></ul></li><li><span><a href="#特征筛选" data-toc-modified-id="特征筛选-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Feature Selection(特征筛选)</a></span></li><li><span><a href="#效果验证" data-toc-modified-id="效果验证-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Check Performance(效果验证)</a></span><ul class="toc-item"><li><span><a href="#训练模型" data-toc-modified-id="训练模型-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Model Training(训练模型)</a></span></li><li><span><a href="#TOPK效果" data-toc-modified-id="TOPK效果-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>TOPK Result(TOPK效果)</a></span></li><li><span><a href="#规则解释" data-toc-modified-id="规则解释-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Rule Explaination(规则解释)</a></span></li></ul></li></ul></div>

In [1]:
import gc
import os
import random
import warnings

import numpy as np
import pandas as pd
import lightgbm as lgb

from collections import defaultdict
from tqdm import tqdm, tqdm_notebook
from sklearn.model_selection import train_test_split,StratifiedKFold
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, confusion_matrix, precision_score, recall_score, f1_score


warnings.filterwarnings('ignore')

# Utility Funcations (工具函数)

In [2]:
# =============================【1. Utility Functions】=============================
def seed_everything(seed=2022):
    """_summary_ 
    set random seed, used for reproduction and comparison
    (设定随机种子,便于复现和进行对比实验)
    Args:
        seed (int, optional):random seed (随机种子). Defaults to 2022.
    """
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    
    
def read_chunk_data(data_path, use_cols=None, chunksize=None, nrows=None, sep=','):
    """_summary_
    read csv file in chunks
    分列/分块读取csv文件
    Args:
        data_path (str): csv files(source files) path. (数据文件目录地址)
        use_cols (List, optional): columns to read, Defaults to None. (读取的列名,默认读取全部列). 
        chunksize (int, optional): rows to read in a chunk. (分块读取csv文件的行数,默认不分快，一次读取全部行)
        nrows (int, optional): Number of rows of file to read. (指定读取行数,默认读取不分块)
        sep (str, optional): csv seperator, Defaults to ','. (文件分隔符). 

    Returns:
        DataFrame: 读取后的dataframe文件
    """
    print("{0:=^70}".format(f'LOADING DATA FROM : {data_path}'))
    
    # when chunksize is not None, read csv in chunks
    if chunksize:
        data = []
        chunk = pd.read_csv(data_path, usecols=use_cols, nrows=nrows, sep=sep, chunksize=chunksize)
        
        # iterate through chunks
        for tmp_data in chunk:
            data.append(tmp_data)
        
        data = pd.concat(data, axis=0)  # merge csv
        del tmp_data; gc.collect()  # release memory
    
    # read the whole csv when chunksize is None
    else:
        data = pd.read_csv(data_path, usecols=use_cols, nrows=nrows, sep=sep, chunksize=chunksize)
    
    return data


# =============================【2. Feature Engineering Functions】=============================
def label_encode(df, cols, map_dict=None):
    """_summary_
    categorical encoding (类别特征编码)
    Args:
        df (dataframe):data files (数据文件)
        cols (list): column names for categorical features (类别特征名称)
        map_dict (dict, optional): a map for categorical features. None when training, 
        use the map from after training when doing the inference
        (类别特征编码字典,训练时为None,推理时应传入训练时返回的dict). Defaults to None.

    Returns:
        dataframe: encoding result for categorical features (类别特征编码完成后的数据)
        dict: a map for categorical features and its encoding (类别特征编码对应关系)
    """
    print("{0:=^70}".format(f'LABEL ENCODING FOR {len(cols)} FEARURES'))
    
    # CallBack when inference (在推理时map_dict不为空，因此只需要遍历ap_dict就行)
    if map_dict is not None:
        for col in tqdm_notebook(cols):
            df[col] = df[col].map(map_dict[col])
    
    # None when Training 训练
    else:
        map_dict = {}
        
        # iterate through categoricals (遍历所有类别特征)
        for col in tqdm_notebook(cols):
            # 1. df.sample(frac=1.0): shuffle, in case of Data leakage (对数据进行打乱，防止出现标签泄露)
            # 2. df.unique(): unique categorical values (类别特征取值)
            # 3. zip(): mapping a categorical to an int (将类别特征取值与一个int对应)
            map_dict[col] = dict(zip(df[col].sample(frac=1.0).unique(), range(len(df[col].sample(frac=1.0).unique()))))
            # 编码
            df[col] = df[col].map(map_dict[col])
            
    return df, map_dict


def filter_cols(df, use_cols, cols, ratio):
    """_summary_
    filter abnormal features
    (过滤异常特征)
    Args:
        df (dataframe): data files (数据文件)
        use_cols (list): all columns in df (传入文件的所有列)
        cols (list): columns need to filter (需要进行过滤的文件(一般来说与use_cols相等))
        ratio (float): missing ratio to filter (缺失值过滤比例(缺失占比大于一定程度就过滤该特征))

    Returns:
        dataframe: df adter removing the abnormal columns 除去异常列的数据
        list: column names after filtering 过滤完成的列名
    """
    print("{0:=^70}".format(f'FILTER {len(cols)} FEARURES'))
    
    useless_cols = []  # abnormal columns 异常列名
    
    # iterate over columns 遍历待过滤的所有列
    for col in tqdm_notebook(cols):
        # firter the column when missing ratio over the set ratio
        #(缺失值占比大于设定值则去除该列)
        if df[col].isnull().sum() / df.shape[0] > ratio:
            useless_cols.append(col)
        
        # firter the column when there is only one value(该列取值唯一时去除该列)
        elif df[col].nunique() == 1:
            useless_cols.append(col)
            
    print(f'Drop {len(useless_cols)} features')
    use_cols = [i for i in use_cols if i not in useless_cols]  # remained columns(过滤后的列名)
    df.drop(useless_cols, axis=1, inplace=True)  # drop abnormal columns(删除所有异常列)
    
    return df, use_cols

def filter_data(data, date_col=None, window=None, filter_=None):
    """_summary_
    filter accounts that are low risks
    (前置过滤 提前过滤低风险账户+天)
    Args:
        data (dataframe): data files (数据文件)
        date_col (str, optional): datetime columns, Defaults to None. (日期列列名, 使用时间窗口时需要指定). 
        window (int, optional): datetime window size, Defaults to None. (时间窗口大小). 
        filter_ (str, optional): filter conditions, Defaults to None. (前置过滤条件). 

    Returns:
        dataframe: filtered files (过滤完成的数据文件)
    """
    print("{0:=^70}".format('FILTER date'))
    print(f'start shape {data.shape}')
    
    # when window is not None, reserve trade data of window size days prior to the last trade
    # (当window不为空时，保留每个用户最后一次交易的近window天数据)
    if window is not None:
        assert date_col is not None, 'When the window is not empty, the date_col must not be empty.'
        
        data[date_col] = pd.to_datetime(data[date_col], format='%Y/%m/%d')  # str2time
        # order by trade date in descending order (每个用户按日期倒叙排列的序值)
        data['rnk'] = data.groupby(UID)[date_col].rank(ascending = False)
        # reserve window size days of trade data (保留每个用户最后一次交易的近window天数据)
        data = data.query(f'rnk <= {window}')
        del data['rnk']; gc.collect()
    
    # some other specific filter conditions (如过传入了别的前置过滤条件则执行dataframe.query)
    if filter_ is not None:
        data = data.query(filter_)
        
    print(f'end shape {data.shape}')
    return data


# =============================【3. Model Training Functions】=============================
def train_single_lgb(x_train, y_train, x_valid, y_valid, fea_cols, params):
    """_summary_
    Standard LGB model Traininig
    (LGB模型训练)
    Args:
        x_train (datafram): training data features (训练集特征)
        y_train (datafram): training ata labels (训练集标签)
        x_valid (datafram): testing data features (验证集特征)
        y_valid (datafram): testing data labels (验证集标签)
        fea_cols (list): used columns (使用的特征)
        params (dict): parameters of lgb (LGB模型参数)

    Returns:
        model: trained model (训练完成的LBG模型)
        dataframe: feature importance (特征重要性df)
    """
    print("{0:=^70}".format(f'TRAIN LGB MODEL WITH {len(fea_cols)} FEARURES'))
    lgb_train = lgb.Dataset(x_train[fea_cols], label = y_train)
    lgb_valid = lgb.Dataset(x_valid[fea_cols], label = y_valid)
    
    lgb_clf = lgb.train(
                params,
                lgb_train,
                num_boost_round=10000,  # maximum iteration times (最大迭代次数)
                valid_sets=lgb_valid,  # validation set (验证集)
                early_stopping_rounds=50,  # early stopping (如果多少次验证集metric不增加则停止训练，并返回最高分数的模型)
                verbose_eval = 100)

    # feature importance (特征重要性)
    imp_df = pd.DataFrame(zip(lgb_clf.feature_name(), lgb_clf.feature_importance()))
    
    return lgb_clf, imp_df


# =============================【4 Model Evaluation Functions】=============================
def model_result(clf, df_valid,top_list=[100,200,500,1000,1500,2000]):
    """_summary_
    self_defined topk evaluation function.
    (自定义TOPK效果评价函数,_d 结果按账户+天计算得到，_u 结果按账户计算得到)
    Args:
        clf (model): trained model (训练得到的模型)
        df_valid (dataframe): data set to validate, test data in general (需验证效果的数据集)
        top_list (list, optional): topk threshold, number of top k accounts with highest predict values. (TOPK阈值设定). 
                                   Defaults to [100,200,500,1000,1500,2000].

    Returns:
        dataframe: model performance wrt different thresholds (包含top_list中不同阈值下模型效果)
    """
    y_proba = clf.predict(df_valid[USE_COLS])  # predict values (预测值)
    y_proba_sort = sorted(y_proba, reverse=True)  # sorted predict values (预测值排序)
    thresholds = [y_proba_sort[i] for i in top_list]  # predict values for different topks(各TOPK下的概率阈值)
    
    # return values (返回值)
    re_dict = defaultdict(list)
    re_dict['topk'] = top_list
    re_dict['thresholds'] = thresholds
    
    # calculate evaluation scores wrt topk values (遍历阈值)
    for thre in thresholds:
        y_predict = [1 if i > thre else 0 for i in y_proba]  # mapping for labels under certian threshold (根据阈值对标签进行映射)
        re_dict['precision_d'].append(precision_score(df_valid[TARGET],y_predict))  # precision (精确率)
        re_dict['recall_d'].append(recall_score(df_valid[TARGET],y_predict))  # recall (召回率)
        re_dict['f_score_d'].append(f1_score(df_valid[TARGET],y_predict))  # F1-score (F1值)
        
        ## evaluations for accounts (聚集到账户级别)
        df_valid['predict'] = y_predict
        df_valid_dup = df_valid.groupby(by=[UID, TARGET]).predict.max().reset_index()
        df_valid_dup[TARGET] = df_valid_dup[TARGET].astype(int)
        
        re_dict['precision_u'].append(precision_score(df_valid_dup[TARGET], df_valid_dup['predict']))  # 精确率 
        re_dict['recall_u'].append(recall_score(df_valid_dup[TARGET], df_valid_dup['predict']))  # 召回率
        re_dict['f_score_u'].append(f1_score(df_valid_dup[TARGET], df_valid_dup['predict']))  # F1值
        
    re_df = pd.DataFrame(re_dict)
    return re_df

# Macro Variables (宏变量配置)
## Fixed Parameters (固定配置参数)

In [3]:
DATA_PATH = '../data/all_features.csv'  # feature file path (数据文件地址)
UID = 'acct_no'  # user_id column (用户ID)
DATE = 'oper_date'  # datetime column (日期列)
TARGET = 'black_flag'  # label column (标签列)
RULE_NUM = 3  # number of rules to filter (规则个数)

MUST_COLS = [UID, DATE, TARGET]  # necessary columns (必要的列 一般是UID+DATE)
# categorical features, need encoding (类别特征 一般为需要编码的特征)
CAT_COLS = ['open_inst_no', 'id_card_number', 'if_sfz', 'gender', 'acct_opene_at']
# columns to be romoved (需要删除的特征)
DROP_COLS = MUST_COLS + ['id_card_number']
# kept columns after filtering (筛选后的特征)
USE_COLS = []

## Hyperparameter (可调节超参数)

In [4]:
# Modifiable hyperparameters (可调节超参数)
RANDOM_SEED = 2022  # random seeds (随机种子)

# choose number of features during training, as many as possible if mem ory allowed
# (特征选择时每次选取多少个特征进行训练，内存不报错的情况下可以尽可能大一点)
ONE_STEP_COLS = 500  

NUM_FEATURE = 200  # number of features chosen for the final model (最终选取多少个特征)
NULL_RATIO = 0.999  # missing value ratio threshold (缺失值大于多少时去掉该特征)

# pre-filtering conditions
# 前置过滤条件 (可以设置为None，则不进行前置过滤)
FILTER = 'amt_sum_day > 100 and amt_cnt_day > 2'  

# LGB model parameters
params = {
    'boosting_type': 'gbdt',  # type of gradient boost tree (提升树类型)
    'objective': 'binary',  # objective function (二分类目标函数)
    'metric': 'auc',  # evaluation martix (评估指标)
    'num_leaves': 256,  # the maximum number of leaves per tree, up to (0, 2^max_depth - 1] (每棵树的叶子节点数目)
    'max_depth': -1,  # max depth of the tree (每棵树的最大深度(-1表示无限制))
    'min_data_in_leaf': 50,  # the minimum number of data/sample/count per leaf (每个叶子节点所需的最小样本数)
    'learning_rate': 0.05,  # learning_rate, deciding the update speed of weights (学习率(每次提升迭代的步长))
    'feature_fraction': 0.8,  # feature sampling, helps avoid overfitting and speed up (每次迭代中使用的特征比例)
    'bagging_fraction': 0.8,  # subsample, (每棵树使用的数据比例(无放回采样))
    'bagging_freq': 5,  # bagging frequency, set with feature_fraction (采样频率(每隔多少次迭代进行一次采样))
    'lambda_l1': 1,  # L1 regularization (正则化项)
    'lambda_l2': 0.001,  # L2 regularization (正则化项(值越小表示正则化程度越高))
    'min_gain_to_split': 0.2,  # minimun gain to split, helps to avoid overfitting (分裂节点所需的最小增益)
    #'device': 'gpu',  # GPU speed up, undo the annotation when used (加速(如果使用 GPU，请取消注释))
    'verbose': -1,  # detailed level (详细程度模式(-1 表示训练过程中不输出信息))
    'is_unbalance': True,  # 处理类别不平衡问题，调整权重
    'seed': RANDOM_SEED  # random seed (随机种子，用于可重复性)
}

# Preparation Works (准备工作)

In [5]:
# Fixed random seed (固定随机种子)
seed_everything(RANDOM_SEED)

## Obtain Columns (获取所有列名)
- remove pre-set columns（去除预先设定需去除的列）
- split and save cloumns that will be put into the model (分裂并保存规则列名)

In [6]:
# obtain columns (获取所有列)
all_features = read_chunk_data(data_path=DATA_PATH, nrows=1, sep='\t').columns
# filter columns (过滤掉不需要的列)
all_features = [fea for fea in all_features if fea not in DROP_COLS]
print(len(all_features))

3641


In [7]:
# filter columns with set RULE_NUM (过滤掉规则匹配列)
rule_cols = all_features[-RULE_NUM:]
all_features = all_features[:-RULE_NUM]
random.shuffle(all_features)

## Data Set Split (数据集划分)
Preprocessing data in case memory not big enough, which includes:
- train, test split
- pre-filtering
- train, validation split

由于数据不能一次性读入内存，因此我们需要预先对数据进行处理主要包括
- 训练集、测试机划分
- 前置过滤
- 训练集、验证集划分

In [8]:
# read data and filtered columns (读取全部数据的UID、DATE, 前置过滤需要用到的列)
data = read_chunk_data(data_path=DATA_PATH, use_cols=[UID, DATE, 'amt_sum_day', 'amt_cnt_day'], chunksize=50000, sep='\t')



In [9]:
data[DATE] = pd.to_datetime(data[DATE])
data = data[data[DATE] < pd.to_datetime('2020-08-01')]  # training data, often divided by datetime (训练集)

In [10]:
# pre-filtering, remove accounts of low risks (前置过滤 训练集中过滤掉低风险的账户+天)
data = filter_data(data, filter_ = FILTER)

# window size can be added to acquire current trade samples 
# (可以加入时间窗口获取用户最近一段时间的样本)
# data = filter_data(data, date_col=DATE, window=30, filter_ = FILTER)

start shape (136447, 4)
end shape (49217, 4)


In [11]:
# split train, validation sets (划分训练集、验证集)
u_train, u_valid = train_test_split(data[[UID]].drop_duplicates(), test_size=0.2, random_state=RANDOM_SEED)

# extract filtered accounts wrt datetime (提取过滤后的账户+天)
u_train = data[data[UID].isin(u_train[UID])][[UID, DATE]]
u_valid = data[data[UID].isin(u_valid[UID])][[UID, DATE]]

# Feature Selection (特征筛选)

In [12]:
best_auc = 0  # store the best auc score (最佳AUC指标初始值)

# in case of memory shortage, train the data using fixed number of features
# (按照固定步长循环遍历特征列表)
for i in tqdm_notebook(range(len(all_features) // ONE_STEP_COLS + 1)):
    tmp_cols = all_features[i*ONE_STEP_COLS: min(len(all_features) - 1, (i + 1) * ONE_STEP_COLS)]
    USE_COLS += tmp_cols  # Add feature columns to USE_COLS list (将当前步长的特征添加到使用的特征列表中)
    
    # Read data chunks, preprocessing first(fill in missing values)
    # (读取分块数据，并进行数据预处理)
    data = read_chunk_data(data_path=DATA_PATH, use_cols=USE_COLS+MUST_COLS, chunksize=50000, sep='\t')
    data[TARGET] = data[TARGET].fillna(0).astype(int)
    data[DATE] = pd.to_datetime(data[DATE])
    data = data[data[DATE] < pd.to_datetime('2020-08-01')]  # training data (过滤出训练集数据)
    
    # filter feature columns wrt filtering conditions (根据条件过滤特征列)
    data, USE_COLS = filter_cols(data, USE_COLS, tmp_cols, NULL_RATIO)
    
    # encoding categorical features (对分类特征进行标签编码)
    data, map_dict = label_encode(data, [i for i in USE_COLS if i in CAT_COLS])
    
    # concat train set (拼接训练集)
    x_train = u_train.merge(data, how='left', on=[UID, DATE])
    y_train = x_train[TARGET].fillna(0).astype('int32')

    #concat val set (拼接验证集)
    x_valid = u_valid.merge(data, how='left', on=[UID, DATE])
    y_valid = x_valid[TARGET].fillna(0).astype('int32')
    
    # Using Lightgbm to train + predict (使用LightGBM进行训练和预测)
    lgb_clf, imp = train_single_lgb(x_train, y_train, x_valid, y_valid, USE_COLS, params)
    
    cur_auc = lgb_clf.best_score['valid_0']['auc']  # AUC for current iteration (当前轮次的AUC指标)
    if cur_auc > best_auc:  # choose the best auc (如果当前AUC指标优于最佳AUC指标)
        # choose the most important features (选择重要性排名靠前的特征作为最终使用的特征)
        USE_COLS = imp[imp[1] > 10].sort_values(by=1, ascending=False).head(NUM_FEATURE)[0].to_list()
        best_auc = cur_auc  # update better auc score (更新最佳AUC指标)
    else:
        # if auc no improvement, remove features of current steps
        # (如果当前AUC指标没有提升，则移除本轮步长添加的特征)
        USE_COLS = [col for col in USE_COLS if col not in tmp_cols]
        
    del data, x_train, x_valid  # memory clean up (删除临时变量，释放内存)
    gc.collect()  # garbage colelction (执行垃圾回收)


  0%|          | 0/8 [00:00<?, ?it/s]



  0%|          | 0/500 [00:00<?, ?it/s]

Drop 301 features


0it [00:00, ?it/s]

Training until validation scores don't improve for 50 rounds
[100]	valid_0's auc: 0.955866
[200]	valid_0's auc: 0.958753
[300]	valid_0's auc: 0.960284
Early stopping, best iteration is:
[318]	valid_0's auc: 0.960573


  0%|          | 0/500 [00:00<?, ?it/s]

Drop 324 features


0it [00:00, ?it/s]

Training until validation scores don't improve for 50 rounds
[100]	valid_0's auc: 0.957392
[200]	valid_0's auc: 0.961125
[300]	valid_0's auc: 0.962134
Early stopping, best iteration is:
[321]	valid_0's auc: 0.962345


  0%|          | 0/500 [00:00<?, ?it/s]

Drop 310 features


  0%|          | 0/1 [00:00<?, ?it/s]

Training until validation scores don't improve for 50 rounds
[100]	valid_0's auc: 0.963146
[200]	valid_0's auc: 0.968773
[300]	valid_0's auc: 0.96997
Early stopping, best iteration is:
[260]	valid_0's auc: 0.970007


  0%|          | 0/500 [00:00<?, ?it/s]

Drop 321 features


  0%|          | 0/3 [00:00<?, ?it/s]

Training until validation scores don't improve for 50 rounds
[100]	valid_0's auc: 0.961366
[200]	valid_0's auc: 0.969159
[300]	valid_0's auc: 0.970459
Early stopping, best iteration is:
[279]	valid_0's auc: 0.970493


  0%|          | 0/500 [00:00<?, ?it/s]

Drop 314 features


  0%|          | 0/3 [00:00<?, ?it/s]

Training until validation scores don't improve for 50 rounds
[100]	valid_0's auc: 0.9614
[200]	valid_0's auc: 0.968093
[300]	valid_0's auc: 0.969376
Early stopping, best iteration is:
[285]	valid_0's auc: 0.969432


  0%|          | 0/500 [00:00<?, ?it/s]

Drop 309 features


  0%|          | 0/3 [00:00<?, ?it/s]

Training until validation scores don't improve for 50 rounds
[100]	valid_0's auc: 0.96186
[200]	valid_0's auc: 0.967956
[300]	valid_0's auc: 0.969345
Early stopping, best iteration is:
[310]	valid_0's auc: 0.969434


  0%|          | 0/500 [00:00<?, ?it/s]

Drop 318 features


  0%|          | 0/3 [00:00<?, ?it/s]

Training until validation scores don't improve for 50 rounds
[100]	valid_0's auc: 0.963526
[200]	valid_0's auc: 0.969178
[300]	valid_0's auc: 0.970771
Early stopping, best iteration is:
[283]	valid_0's auc: 0.97084


  0%|          | 0/137 [00:00<?, ?it/s]

Drop 90 features


  0%|          | 0/3 [00:00<?, ?it/s]

Training until validation scores don't improve for 50 rounds
[100]	valid_0's auc: 0.965282
[200]	valid_0's auc: 0.97053
[300]	valid_0's auc: 0.971624
Early stopping, best iteration is:
[281]	valid_0's auc: 0.971684


# Check Performance (效果验证)
## Model Training (训练模型)

In [13]:
# checking performance (测试效果)
data = read_chunk_data(data_path=DATA_PATH, use_cols=all_features+rule_cols+MUST_COLS, chunksize=50000, sep='\t')
data[TARGET] = data[TARGET].fillna(0).astype(int)

data[DATE] = pd.to_datetime(data[DATE])
# Training data and test data are usually divided according to time
# Historical transactions for training, current transactions(e.g latest several months) are
# used for test.
test = data[data[DATE] > pd.to_datetime('2020-07-31')]  # test data set (测试集)
data = data[data[DATE] < pd.to_datetime('2020-08-01')]  # training data set (训练集)

# filter features same as previous operations (根据条件过滤特征列)
data = filter_data(data, filter_ = FILTER)

# exclude black samples (labeled) 训练集的黑样本不能出现在测试集中
print('过滤前测试集：', test.shape)
test = test[~test[UID].isin(data[data[TARGET] == 1][UID].unique())]
print('过滤后测试集：', test.shape)

start shape (136447, 3644)
end shape (49217, 3644)
过滤前测试集： (59879, 3644)
过滤后测试集： (58138, 3644)


In [14]:
# encoding categorical features (类别编码)
data, map_dict = label_encode(data, [i for i in USE_COLS if i in CAT_COLS])
test, _ = label_encode(test, [i for i in USE_COLS if i in CAT_COLS], map_dict=map_dict)



  0%|          | 0/3 [00:00<?, ?it/s]



  0%|          | 0/3 [00:00<?, ?it/s]

In [15]:
# concat train and val set respectively(拼接训练集和验证集)
x_train = u_train.merge(data, how='left', on=[UID, DATE])
y_train = x_train[TARGET].fillna(0).astype('int32')

x_valid = u_valid.merge(data, how='left', on=[UID, DATE])
y_valid = x_valid[TARGET].fillna(0).astype('int32')

In [16]:
lgb_clf, imp = train_single_lgb(x_train, y_train, x_valid, y_valid, USE_COLS, params)

Training until validation scores don't improve for 50 rounds
[100]	valid_0's auc: 0.962729
[200]	valid_0's auc: 0.969476
[300]	valid_0's auc: 0.970919
Early stopping, best iteration is:
[311]	valid_0's auc: 0.971062


In [17]:
imp.sort_values(by=1, ascending=False).head(50)

Unnamed: 0,0,1
0,n_oppo_acct_bank_id_in_7d,610
1,work_time_trans_sum_ratio_1m,555
2,close_100int_sum_ratio_15d,495
3,khsc_year,450
4,end_balance_3m,327
6,work_time_trans_cnt_ratio_15d,321
8,close_int_sum_ratio_15d,303
7,n_oppo_acct_bank_id_7d,297
9,amt_less10_cnt_ratio_3m,296
5,out_amt_less500_sum_3m,290


## TOPK Result (TOPK效果)

In [18]:
model_result(lgb_clf,test,top_list=[100,200,500,1000,1500,2000])

Unnamed: 0,topk,thresholds,precision_d,recall_d,f_score_d,precision_u,recall_u,f_score_u
0,100,0.999625,1.0,0.265252,0.419287,1.0,0.41129,0.582857
1,200,0.998649,1.0,0.530504,0.693241,1.0,0.701613,0.824645
2,500,0.862581,0.702,0.931034,0.800456,0.851064,0.967742,0.90566
3,1000,0.361564,0.363,0.962865,0.527233,0.614213,0.975806,0.753894
4,1500,0.10676,0.244163,0.970822,0.390192,0.463878,0.983871,0.630491
5,2000,0.049765,0.183,0.970822,0.307951,0.347578,0.983871,0.513684


In [19]:
preds = lgb_clf.predict(test[USE_COLS])
test['preds'] = preds
pre_max = test.groupby(UID)['preds'].idxmax()
preds_df = test.loc[pre_max, [UID, 'preds', TARGET, DATE]]

roc_auc_score(preds_df[TARGET], preds_df['preds']) 

0.9959541347654124

## Rule Explaination (规则解释)

In [20]:
preds_df = preds_df.merge(test[[UID, DATE] + rule_cols], how='left', on=[UID, DATE])
preds_df.sort_values(by='preds', ascending=False).head(200)

Unnamed: 0,acct_no,preds,black_flag,oper_date,yjpf,fszc,jzzc
2806,9EEF218E52B7B373,0.999949,1,2020-08-21,,,0
1249,477C783DB38D79FC,0.999926,1,2020-08-06,,,0
1313,4B23C476C5003ECF,0.999916,1,2020-08-21,,,0
1763,63ECD0B0202D9F7F,0.999913,1,2020-08-18,,,0
3729,D21C2B6F3BDD4392,0.999908,1,2020-08-10,,,0
...,...,...,...,...,...,...,...
3462,C1FDF04B9D6AF4DA,0.368743,0,2020-08-23,,集中转入，分散转出,0
4107,E68032544AFDCFC6,0.366435,0,2020-08-22,,,0
2513,8DDB0B1013EE7C5F,0.358697,0,2020-08-10,,,0
1229,45F7995529FE0A3F,0.339936,0,2020-08-16,,,0
