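## Credit Card Fraud Detection: Supervised Learning with imbalanced-learn, XGBoost, and LightGBM  
>1. Data and project source: [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud)  
>2. Problem type: either a **supervised binary classification problem** or an **unsupervised anomaly detection problem**  
>3. Supervised approach: classify the data with four models: BalancedRandomForestClassifier and RUSBoostClassifier from imbalanced-learn, plus XGBoost and LightGBM  
>4. Unsupervised approach: detect anomalies with Isolation Forest  

>5. Method: change one variable at a time and add factors step by step  
>>1. First, train and test on data without feature scaling and inspect the results. File: without_feature_scaling_without_feature_selection.ipynb  
>>2. Then train and test on data with feature scaling but without feature selection. File: with_feature_scaling_without_feature_selection.ipynb  
>>3. Finally, train and test on data with both feature scaling and feature selection. File: with_feature_scaling_with_feature_selection.ipynb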



### I. Load the data and explore it

In [1]:
# Load numpy and pandas, the general-purpose libraries for data preprocessing
import numpy as np
import pandas as pd

In [2]:
# Read the data into a pandas DataFrame
data_original = pd.read_csv('creditcard.csv')

In [3]:
# Confirm that the data is a pandas DataFrame
type(data_original)

pandas.core.frame.DataFrame

In [4]:
# Preview the data: show the first 10 rows
data_original.head(10)

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0
5,2.0,-0.425966,0.960523,1.141109,-0.168252,0.420987,-0.029728,0.476201,0.260314,-0.568671,...,-0.208254,-0.559825,-0.026398,-0.371427,-0.232794,0.105915,0.253844,0.08108,3.67,0
6,4.0,1.229658,0.141004,0.045371,1.202613,0.191881,0.272708,-0.005159,0.081213,0.46496,...,-0.167716,-0.27071,-0.154104,-0.780055,0.750137,-0.257237,0.034507,0.005168,4.99,0
7,7.0,-0.644269,1.417964,1.07438,-0.492199,0.948934,0.428118,1.120631,-3.807864,0.615375,...,1.943465,-1.015455,0.057504,-0.649709,-0.415267,-0.051634,-1.206921,-1.085339,40.8,0
8,7.0,-0.894286,0.286157,-0.113192,-0.271526,2.669599,3.721818,0.370145,0.851084,-0.392048,...,-0.073425,-0.268092,-0.204233,1.011592,0.373205,-0.384157,0.011747,0.142404,93.2,0
9,9.0,-0.338262,1.119593,1.044367,-0.222187,0.499361,-0.246761,0.651583,0.069539,-0.736727,...,-0.246914,-0.633753,-0.120794,-0.38505,-0.069733,0.094199,0.246219,0.083076,3.68,0


In [5]:
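# Show the dataset size and the dtype of each column; check whether any column has missing (null) values
# According to the dataset description on Kaggle and the output of this cell, every feature is numeric and there are no missing values, so neither one-hot encoding nor missing-value handling is needed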
data_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26  

In [6]:
# Count the nulls in each column; the data indeed has no missing values, consistent with info()
data_original.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [7]:
# Total number of nulls in the whole dataset
data_original.isnull().sum().sum()

0

In [8]:
# Number of unique values in each column
data_original.nunique()

Time      124592
V1        275663
V2        275663
V3        275663
V4        275663
V5        275663
V6        275663
V7        275663
V8        275663
V9        275663
V10       275663
V11       275663
V12       275663
V13       275663
V14       275663
V15       275663
V16       275663
V17       275663
V18       275663
V19       275663
V20       275663
V21       275663
V22       275663
V23       275663
V24       275663
V25       275663
V26       275663
V27       275663
V28       275663
Amount     32767
Class          2
dtype: int64

In [9]:
# Count the samples in each class (label); the two classes are extremely imbalanced
data_original['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64

In [10]:
# Percentage of each class (label), confirming again that the class distribution is extremely imbalanced
data_original['Class'].value_counts(normalize = True)*100

0    99.827251
1     0.172749
Name: Class, dtype: float64

In [11]:
# Summary statistics for each column
data_original.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,3.91956e-15,5.688174e-16,-8.769071e-15,2.782312e-15,-1.552563e-15,2.010663e-15,-1.694249e-15,-1.927028e-16,-3.137024e-15,...,1.537294e-16,7.959909e-16,5.36759e-16,4.458112e-15,1.453003e-15,1.699104e-15,-3.660161e-16,-1.206049e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


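### Data exploration summary:  
1. The dataset contains 284,807 samples.  
2. It has 30 feature columns and one label column.  
3. Neither the feature columns nor the label column has missing values, so no missing-value handling is needed.  
4. All 30 features are continuous numerical features; there are no categorical features, so one-hot encoding is not needed.  
5. The two classes (normal and fraud) are extremely imbalanced: normal (non-fraud) transactions make up 99.83% of the data and fraud makes up 0.17%, so this is a classification problem on imbalanced data.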

### II. Separate features from labels and split into training and test sets  
>To avoid data leakage as far as possible, always split the data into training and test sets before doing any preprocessing  
>[Reference 1: Normalize data before or after split of training and testing data?](https://stackoverflow.com/questions/49444262/normalize-data-before-or-after-split-of-training-and-testing-data)  
>[Reference 2: Onehotencoding before or after split of training and testing data?](https://stackoverflow.com/questions/55525195/do-i-have-to-do-one-hot-encoding-separately-for-train-and-test-dataset)  
>[Reference 3: Imputation before or after train test spliting](https://stats.stackexchange.com/questions/95083/imputation-before-or-after-splitting-into-train-and-test)

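As a minimal sketch of the split-first rule, the snippet below standardizes a toy feature using statistics computed from the training portion only, then reuses those statistics on the test portion. The helper names `fit_standardizer` and `apply_standardizer` and the toy amounts are illustrative assumptions, not part of this project.

```python
import statistics


def fit_standardizer(train_values):
    """Learn the mean and standard deviation from the training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std


def apply_standardizer(values, mean, std):
    """Scale any data with statistics that were learned on the training set."""
    return [(v - mean) / std for v in values]


# Toy 'Amount' values (hypothetical); the test set contains one extreme value.
train_amounts = [10.0, 20.0, 30.0, 40.0]
test_amounts = [25.0, 1000.0]

mean, std = fit_standardizer(train_amounts)  # the test set is never seen here
train_scaled = apply_standardizer(train_amounts, mean, std)
test_scaled = apply_standardizer(test_amounts, mean, std)

print("train mean and std:", mean, std)
print("scaled test values:", test_scaled)
```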
In [12]:
# First separate the features from the label
X = data_original.iloc[:,0:-1]
y = data_original['Class']

In [13]:
X.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99


In [14]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

In [15]:
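# Then split into training and test sets: 75% training, 25% test
# Note: use stratified sampling here so that the class proportions in the training and test sets stay close to those of the full dataset
# In addition, if a feature is known in advance to be a key feature, the split can also be stratified on that feature's distribution
# Reference: https://medium.com/@411.codebrain/train-test-split-vs-stratifiedshufflesplit-374c3dbdcc36
# Reference: https://zhuanlan.zhihu.com/p/49991313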
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42) 

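To make the idea of stratified sampling concrete, here is a simplified pure-Python sketch that splits each class separately so both parts keep the same class ratio. It is not how `train_test_split` is implemented; the function name `stratified_split` and the toy labels are assumptions made for illustration.

```python
import random
from collections import Counter


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) so that each class keeps about the same share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels with 1% positives (hypothetical data, much smaller than the real set).
labels = [0] * 9900 + [1] * 100
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```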
In [43]:
# Show the class counts in the training set, used to compute scale_pos_weight
y_train.value_counts()

0    213236
1       369
Name: Class, dtype: int64

In [44]:
# scale_pos_weight = sum(negative instances) / sum(positive instances)
# This is the key parameter that XGBoost and LightGBM use to control class balance when classifying imbalanced data
# In practice, the square root of the value given by the formula above is also worth trying
# Reference 1: https://xgboost.readthedocs.io/en/latest/parameter.html
# Reference 2: https://stats.stackexchange.com/questions/243207/what-is-the-proper-usage-of-scale-pos-weight-in-xgboost-for-imbalanced-datasets
scale_pos_weight_1 = 213236 / 369
scale_pos_weight_2 = np.sqrt(scale_pos_weight_1)

In [45]:
scale_pos_weight_1

577.8753387533875

In [46]:
scale_pos_weight_2

24.039037808393818

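One way to read the two candidate values, sketched below under the simplifying assumption that `scale_pos_weight` makes each positive sample count that many times: the full ratio brings the weighted positives level with the negatives, while the square root leaves part of the imbalance in place. The variable names are illustrative; the counts are the training-set counts shown above.

```python
import math

# Class counts in the training set, taken from y_train.value_counts() above.
n_negative = 213236
n_positive = 369

# The full ratio and its square root, the two candidates used during tuning.
full_weight = n_negative / n_positive
sqrt_weight = math.sqrt(full_weight)

for name, weight in [("full ratio", full_weight), ("square root", sqrt_weight)]:
    # Total weight carried by the positive class once each positive counts `weight` times.
    weighted_positives = n_positive * weight
    # Imbalance that remains between the negatives and the weighted positives.
    remaining_ratio = n_negative / weighted_positives
    print(
        f"{name}: weight={weight:.3f}, "
        f"weighted positives={weighted_positives:.1f}, "
        f"negatives per weighted positive={remaining_ratio:.2f}"
    )
```

With the square root, the remaining imbalance equals the square root itself (about 24 to 1), which is a milder correction than full rebalancing.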
### III. Handling imbalanced data  
>There are several approaches to imbalanced data:  
>* **Resampling**, which covers oversampling and undersampling. The basic idea is to even out the sizes of the two classes,  
growing the minority class and shrinking the majority class so that the data becomes more balanced. After resampling, any machine learning classifier can be applied to the data.  
>>- Each kind of resampling can be implemented in many ways; in practice, the [imbalanced-learn](https://imbalanced-learn.readthedocs.io/en/stable/api.html#module-imblearn.under_sampling) library provides these implementations  
>* **Classifiers from imbalanced-learn**, which have built-in mechanisms for imbalanced data, such as [BalancedRandomForestClassifier](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.ensemble.BalancedRandomForestClassifier.html#) and [RUSBoostClassifier](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.ensemble.RUSBoostClassifier.html)  
>* **XGBoost and LightGBM**, two strong GBDT-based classifiers that both accept the parameter 'scale_pos_weight' for imbalanced data  
>>- scale_pos_weight = number of negative samples / number of positive samples  
>>- In binary classification, the positive class is labeled 1 and the negative class is labeled 0  
>* Evaluation metrics: accuracy is not a suitable metric for imbalanced data, because a model that always predicts the majority class already scores very high. Consider f1_score or a metric designed for imbalanced problems, such as  
[geometric_mean_score](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.metrics.geometric_mean_score.html#imblearn.metrics.geometric_mean_score)  

>* [Reference 1: Dealing With Class Imbalanced Datasets For Classification](https://towardsdatascience.com/dealing-with-class-imbalanced-datasets-for-classification-2cc6fad99fd9)    
>* [Reference 2: Class imbalance in machine learning (3): sampling methods (in Chinese)](https://www.cnblogs.com/massquantity/p/9382710.html)  
>* [Reference 3: Imbalanced data in machine learning and how to deal with it (in Chinese)](https://zhuanlan.zhihu.com/p/38687978)  
>* [Reference 4: How to handle imbalanced data in machine learning (in Chinese)](https://www.jianshu.com/p/be343414dd24)
    
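The point about metrics can be checked with a small hand-rolled example. The sketch below computes accuracy and the geometric mean of sensitivity and specificity (the binary G-mean) for two made-up predictors on labels that have the same class ratio as this dataset. The helper `binary_metrics` and both predictors are hypothetical; they are not models from this project.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return (accuracy, geometric mean of sensitivity and specificity) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)  # recall on the fraud class
    specificity = tn / (tn + fp)  # recall on the normal class
    return accuracy, math.sqrt(sensitivity * specificity)


# Labels with the same class counts as the full dataset (284315 normal, 492 fraud).
y_true = [0] * 284315 + [1] * 492

# A useless predictor that labels every transaction as normal.
always_normal = [0] * len(y_true)
acc, gmean = binary_metrics(y_true, always_normal)
print(f"always normal: accuracy={acc:.4f}, G-mean={gmean:.4f}")

# A predictor that catches 400 of the 492 frauds at the cost of 5000 false alarms (made-up numbers).
flagging = [1] * 5000 + [0] * (284315 - 5000) + [1] * 400 + [0] * 92
acc2, gmean2 = binary_metrics(y_true, flagging)
print(f"flagging:      accuracy={acc2:.4f}, G-mean={gmean2:.4f}")
```

The first predictor wins on accuracy even though it never finds a single fraud, while its G-mean is zero; the second one is ranked higher by G-mean, which is the behaviour wanted here.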

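### IV. Train the models and tune hyperparameters with [Hyperopt](http://hyperopt.github.io/hyperopt/)  
>* Hyperopt is a tool that tunes parameters through **Bayesian optimization**  
>* For a comparison of the three tuning methods (grid search, random search, and Bayesian search), see:  
>>* [Intuitive Hyperparameter Optimization : Grid Search, Random Search and Bayesian Search](https://towardsdatascience.com/intuitive-hyperparameter-optimization-grid-search-random-search-and-bayesian-search-2102dbfaf5b)  
>>* [Bayesian optimization: a better way to tune hyperparameters (in Chinese)](https://zhuanlan.zhihu.com/p/29779000)  
>* This project tunes parameters with 10-fold cross-validation and scores models with geometric_mean_score, a metric designed for imbalanced data  
>* Method:  
>>1. First, train and test on data without feature scaling and inspect the results  
>>2. Then train and test on data with feature scaling but without feature selection and inspect the results  
>>3. Finally, train and test on data with both feature scaling and feature selection and inspect the results  
>* [Random forest tuning reference (in Chinese)](https://www.cnblogs.com/pinard/p/6160412.html)  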


In [34]:
import json
import time
from sklearn.metrics import make_scorer
from imblearn.metrics import geometric_mean_score
from sklearn.utils import class_weight
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.ensemble import RUSBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score 
from hyperopt import fmin, tpe, atpe, hp, STATUS_OK, Trials, space_eval  # atpe (adaptive TPE) is an algorithm added in newer versions of hyperopt
from sklearn.model_selection import StratifiedKFold

In [19]:
# Compute class_weight for imbalanced data
# Reference 1: https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/53696
# Reference 2: https://stackoverflow.com/questions/44716150/how-can-i-assign-a-class-weight-in-keras-in-a-simple-way
class_weights = class_weight.compute_class_weight(class_weight='balanced',classes = np.unique(y_train),y = y_train)
class_weight_dict = dict(enumerate(class_weights))

In [20]:
type(class_weights)

numpy.ndarray

In [21]:
class_weights

array([  0.50086524, 289.43766938])

In [22]:
type(class_weight_dict)

dict

In [23]:
class_weight_dict

{0: 0.5008652385150725, 1: 289.43766937669375}

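The two weights above can be reproduced by hand. The sketch below applies the 'balanced' heuristic, n_samples / (n_classes * count of that class), to the training-set class counts; the variable names are illustrative.

```python
# Class counts in the training set (from y_train.value_counts()).
class_counts = {0: 213236, 1: 369}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_that_class)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in class_counts.items()
}
print(balanced_weights)

# The ratio of the two weights equals negatives / positives, i.e. scale_pos_weight_1.
print(balanced_weights[1] / balanced_weights[0])
```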
In [24]:
# Train and tune BalancedRandomForestClassifier
start = time.time()
def brdf(params):
    # ip = params["imputer"]
    # del params["imputer"]
    # sc = params["scaler"]
    # del params["scaler"]
    brdf_clf = BalancedRandomForestClassifier(**params)
    str_kfold = StratifiedKFold(
        n_splits=10, shuffle=True, random_state=42
    )  # Keep random_state consistent so that the results are reproducible
    # Reference: https://stackoverflow.com/questions/39782243/how-to-use-cross-val-score-with-random-state
    gms = make_scorer(geometric_mean_score)
    metric = cross_val_score(
        brdf_clf,
        # Data_to_opt(sc)[0],
        # Data_to_opt(sc)[1],
        X_train,
        y_train,
        cv=str_kfold,
        scoring = gms, 
        n_jobs=-1,  
    ).mean()  
    return {"loss": -metric, "status": STATUS_OK}

space4brdf = {
    "n_estimators": hp.choice("n_estimators", range(100, 320, 20)),
    "max_depth": hp.choice("max_depth", range(1, 70)),  #! max_depth controls how complex the model can become
    # "max_features": 1,
    "max_features": hp.choice("max_features", range(1, 30)),
    "class_weight": class_weight_dict,
    "warm_start": True,
    "n_jobs": -1,
    "random_state": 42,  # Keep the random state consistent so that the results are reproducible
    #"imputation": hp.choice("imputation", ["dropna", "SI", "MI"]),
    #"scaling_method": hp.choice("scaling_method", ["min_max", "std"]),
}

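# Newer hyperopt releases expect a np.random.Generator here, e.g. np.random.default_rng(42)
rstate = np.random.RandomState(42)
trials = Trials()
best = fmin(
    brdf, space4brdf, algo=tpe.suggest, max_evals=30, trials=trials, rstate=rstate
)  #! For hp.choice parameters, fmin returns the index of the best option in its list, not the value itself
# print(best)
print(space_eval(space4brdf, best))  #! space_eval() maps those indices back to the actual parameter values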
# print(lgt(best))
print(trials.best_trial["result"]["loss"])
# print(trials.best_trial["result"])

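# Write the best hyperparameters found by the search to a JSON file
# Reference: https://stackabuse.com/scikit-learn-save-and-restore-models/
# "loss" is the negative mean geometric_mean_score from cross-validation
with open("brdf.json", "w") as f:
    f.write(json.dumps({"loss": trials.best_trial["result"]["loss"], "Best params": space_eval(space4brdf, best)}))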
hyperparams_brdf = space_eval(space4brdf, best)
stop = time.time()
print(f"Training time: {stop - start:.3f}s")

100%|██████████| 30/30 [36:47<00:00, 73.58s/it, best loss: -0.9374411453583523]
{'class_weight': {0: 0.5008652385150725, 1: 289.43766937669375}, 'max_depth': 65, 'max_features': 3, 'n_estimators': 300, 'n_jobs': -1, 'random_state': 42, 'warm_start': True}
-0.9374411453583523
Training time: 2207.403s

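The difference between what `fmin` returns and what `space_eval` prints can be illustrated without running a search. In the sketch below, the option lists mirror the `hp.choice` entries of `space4brdf`, and `best_indices` is a hypothetical index dictionary: the real `best` was not printed, so these indices are assumed and were chosen to map onto the values reported above.

```python
# Option lists that mirror the hp.choice entries in space4brdf.
choices = {
    "n_estimators": list(range(100, 320, 20)),
    "max_depth": list(range(1, 70)),
    "max_features": list(range(1, 30)),
}

# Hypothetical dictionary in the form fmin returns for hp.choice parameters:
# positions in the option lists, not parameter values (assumed, not taken from a real run).
best_indices = {"n_estimators": 10, "max_depth": 64, "max_features": 2}

# Looking each index up in its option list recovers the actual values,
# which is the part of space_eval's job illustrated here.
best_values = {name: choices[name][idx] for name, idx in best_indices.items()}
print(best_values)
```

Passing the raw index dictionary straight into a classifier would silently train with the wrong settings (for example `n_estimators=10`), which is why the cells here always go through `space_eval`.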

In [25]:
# Train and tune RUSBoostClassifier
start = time.time()
def rusb(params):
    # ip = params["imputer"]
    # del params["imputer"]
    # sc = params["scaler"]
    # del params["scaler"]
    rusb_clf = RUSBoostClassifier(**params)
    str_kfold = StratifiedKFold(
        n_splits=10, shuffle=True, random_state=42
    )  # Keep random_state consistent so that the results are reproducible
    # Reference: https://stackoverflow.com/questions/39782243/how-to-use-cross-val-score-with-random-state
    gms = make_scorer(geometric_mean_score)
    metric = cross_val_score(
        rusb_clf,
        # Data_to_opt(sc)[0],
        # Data_to_opt(sc)[1],
        X_train,
        y_train,
        cv=str_kfold,
        scoring = gms, 
        n_jobs=-1,  
    ).mean()  
    return {"loss": -metric, "status": STATUS_OK}

space4rusb = {
    "n_estimators": hp.choice("n_estimators", range(50, 320, 20)),
    # "max_depth": hp.choice("max_depth", range(1, 70)),  
    "learning_rate": hp.uniform("learning_rate", 0, 1),
    # "max_features": 1,
    # "max_features": hp.choice("max_features", range(1, 30)),
    # "class_weight": class_weight_dict,
    # "warm_start": True,
    # "n_jobs": -1,
    "random_state": 42,  
    #"imputation": hp.choice("imputation", ["dropna", "SI", "MI"]),
    #"scaling_method": hp.choice("scaling_method", ["min_max", "std"]),
}

rstate = np.random.RandomState(42)
trials = Trials()
best = fmin(
    rusb, space4rusb, algo=tpe.suggest, max_evals=30, trials=trials, rstate=rstate
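)  #! For hp.choice parameters, fmin returns the index of the best option in its list, not the value itself
# print(best)
print(space_eval(space4rusb, best))  #! space_eval() maps those indices back to the actual parameter values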
# print(lgt(best))
print(trials.best_trial["result"]["loss"])
# print(trials.best_trial["result"])

# "loss" is the negative mean geometric_mean_score from cross-validation
with open("rusb.json", "w") as f:
    f.write(json.dumps({"loss": trials.best_trial["result"]["loss"], "Best params": space_eval(space4rusb, best)}))
hyperparams_rusb = space_eval(space4rusb, best)
stop = time.time()
print(f"Training time: {stop - start:.3f}s")

100%|██████████| 30/30 [1:12:24<00:00, 144.81s/it, best loss: -0.9396325585633679]
{'learning_rate': 0.12470065927231533, 'n_estimators': 270, 'random_state': 42}
-0.9396325585633679
Training time: 4344.322s


In [38]:
# Train and tune XGBClassifier
start = time.time()
def xgb(params):
    # ip = params["imputer"]
    # del params["imputer"]
    # sc = params["scaler"]
    # del params["scaler"]
    xgb_clf = XGBClassifier(**params)
    str_kfold = StratifiedKFold(
        n_splits=10, shuffle=True, random_state=42
    )  #!Here the random state should be the same as that in the model
    # ?https://stackoverflow.com/questions/39782243/how-to-use-cross-val-score-with-random-state
    gms = make_scorer(geometric_mean_score)
    metric = cross_val_score(
        xgb_clf,
        # Data_to_opt(sc)[0],
        # Data_to_opt(sc)[1],
        X_train,
        y_train,
        cv=str_kfold,
        scoring = gms, 
        n_jobs=-1,  
    ).mean()  
    return {"loss": -metric, "status": STATUS_OK}

space4xgb = {
    "max_depth": hp.choice("max_depth", range(3, 20)),  
    "learning_rate": hp.uniform("learning_rate", 0, 1),
    "n_estimators": hp.choice("n_estimators", [50, 100, 150, 200, 250, 300]),
    # "objective": "multi:softmax",
    "objective": "binary:logistic",
    "scale_pos_weight": hp.choice("scale_pos_weight", [scale_pos_weight_1,scale_pos_weight_2]),
    "n_jobs": -1,
    "gamma": hp.randint("gamma", 10),
    "min_child_weight": hp.choice("min_child_weight", range(1, 10)),
    "subsample": hp.uniform("subsample", 0.1, 1.0),
    "colsample_bytree": hp.uniform("colsample_bytree", 0.1, 1.0),
    "random_state": 42,  
    "tree_method": "hist",
    # "imputation": hp.choice("imputation", ["dropna", "SI", "MI"]),
    # "scaling_method": hp.choice("scaling_method", ["min_max", "std"]),
}

rstate = np.random.RandomState(42)
trials = Trials()
best = fmin(
    xgb, space4xgb, algo=tpe.suggest, max_evals=30, trials=trials, rstate=rstate
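)  #! For hp.choice parameters, fmin returns the index of the best option in its list, not the value itself
# print(best)
print(space_eval(space4xgb, best))  #! space_eval() maps those indices back to the actual parameter values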
# print(lgt(best))
print(trials.best_trial["result"]["loss"])
# print(trials.best_trial["result"])


# "loss" is the negative mean geometric_mean_score from cross-validation
with open("xgb.json", "w") as f:
    f.write(json.dumps({"loss": trials.best_trial["result"]["loss"], "Best params": space_eval(space4xgb, best)}))
hyperparams_xgb =  space_eval(space4xgb, best)
stop = time.time()
print(f"Training time: {stop - start:.3f}s")

100%|██████████| 30/30 [20:17<00:00, 40.57s/it, best loss: -0.9307152801313322]
{'colsample_bytree': 0.39779512373111203, 'gamma': 5, 'learning_rate': 0.326749198175752, 'max_depth': 3, 'min_child_weight': 2, 'n_estimators': 50, 'n_jobs': -1, 'objective': 'binary:logistic', 'random_state': 42, 'scale_pos_weight': 577.8760162601626, 'subsample': 0.625032357077646, 'tree_method': 'hist'}
-0.9307152801313322
Training time: 1217.138s


In [49]:
# Train and tune LGBMClassifier
start = time.time()
def lgbm(params):
    # ip = params["imputer"]
    # del params["imputer"]
    # sc = params["scaler"]
    # del params["scaler"]
    lgbm_clf = LGBMClassifier(**params)
    str_kfold = StratifiedKFold(
        n_splits=10, shuffle=True, random_state=42
    )  
    gms = make_scorer(geometric_mean_score)
    metric = cross_val_score(
        lgbm_clf,
        # Data_to_opt(sc)[0],
        # Data_to_opt(sc)[1],
        X_train,
        y_train,
        cv=str_kfold,
        scoring = gms, 
        n_jobs=-1,  
    ).mean()  
    return {"loss": -metric, "status": STATUS_OK}

space4lgbm = {
    # Reference: https://lightgbm.readthedocs.io/en/latest/Parameters.html#max_bin
    # max_bin: int, default =255, >1,
    # smaller max_bin, faster speed, maybe underfitting; larger max_bin, slower speed, maybe overfitting
    "max_bin": 63,
    "num_leaves": hp.choice("num_leaves", range(100, 500)),  # * the larger this value, the more complex the model is
    "max_depth": hp.choice("max_depth", range(3, 32)),
    "learning_rate": hp.uniform("learning_rate", 0.01, 0.2),
    # "n_estimators": hp.choice("n_estimators", [50, 100, 150, 200, 250, 300]), #! n_estimators has bug here
    "num_boost_round": hp.choice("num_boost_round", range(50, 500)),  #! this is an alias of n_estimators but no bug
    # "objective": "multiclass",
    "objective": "binary",
    "n_jobs": -1,
    # "class_weight": "balanced",  #! Set this when the classes are imbalanced
    "scale_pos_weight": hp.choice("scale_pos_weight", [scale_pos_weight_1,scale_pos_weight_2]),  # controls the balance between the classes in imbalanced data
    "min_split_gain": hp.uniform("gamma", 0, 50),  #! this is the 'gamma' in xgboost but its type is float now
    "min_child_weight": hp.uniform("min_child_weight", 0, 10),
    "min_child_samples": hp.randint("min_child_samples", 20),  #! too large may cause underfitting
    "subsample": hp.uniform("subsample", 0.1, 1.0),
    "subsample_freq": hp.choice("subsample_freq", range(1, 30)),  #! k means perform bagging at every k iteration
    "colsample_bytree": hp.uniform("colsample_bytree", 0.1, 1.0),
    "random_state": 42,  #! Here the random state should be the same as that in the StratifiedKFold settings
    "gpu_use_dp": False,  #! for result's reproducibility
    "device": "gpu",
    "gpu_platform_id": 0,  # * OpenCL platform ID; the CPU is faster on small data, the GPU on large data
    "gpu_device_id": 0,  # * OpenCL device ID
    # "imputation": hp.choice("imputation", ["dropna", "SI", "MI"]),
    # "scaling_method": hp.choice("scaling_method", ["min_max", "std"]),
}
#! fmin needs this random state for reproducibility; all random seeds should match the ones above.
rstate = np.random.RandomState(42)
trials = Trials()
best = fmin(
    lgbm, space4lgbm, algo=tpe.suggest, max_evals=30, trials=trials, rstate=rstate
)  #! For hp.choice parameters, fmin returns the index of the best option in its list, not the value itself
# print(best)
print(space_eval(space4lgbm, best))  #! space_eval() maps those indices back to the actual parameter values
# print(lgt(best))
print(trials.best_trial["result"]["loss"])
# print(trials.best_trial["result"])


# "loss" is the negative mean geometric_mean_score from cross-validation
with open("lgbm.json", "w") as f:
    f.write(json.dumps({"loss": trials.best_trial["result"]["loss"], "Best params": space_eval(space4lgbm, best)}))
hyperparams_lgbm = space_eval(space4lgbm, best)
stop = time.time()
print(f"Training time: {stop - start:.3f}s")

100%|██████████| 30/30 [14:27<00:00, 28.90s/it, best loss: -0.920832927590423]
{'colsample_bytree': 0.6688200948328725, 'device': 'gpu', 'gpu_device_id': 0, 'gpu_platform_id': 0, 'gpu_use_dp': False, 'learning_rate': 0.01378363697590293, 'max_bin': 63, 'max_depth': 15, 'min_child_samples': 17, 'min_child_weight': 9.842579154419157, 'min_split_gain': 49.99706400365879, 'n_jobs': -1, 'num_boost_round': 499, 'num_leaves': 266, 'objective': 'binary', 'random_state': 42, 'scale_pos_weight': 577.8753387533875, 'subsample': 0.9905036472418532, 'subsample_freq': 18}
-0.920832927590423
Training time: 867.386s


### VII. Testing: check how well the models generalize

#### Model Refit

In [114]:
# BalancedRandomForestClassifier Refit
brdf_refit = BalancedRandomForestClassifier(**hyperparams_brdf)
brdf_refit.fit(X_train, y_train)

BalancedRandomForestClassifier(bootstrap=True,
                               class_weight={0: 0.5008652385150725,
                                             1: 289.43766937669375},
                               criterion='gini', max_depth=65, max_features=3,
                               max_leaf_nodes=None, min_impurity_decrease=0.0,
                               min_samples_leaf=2, min_samples_split=2,
                               min_weight_fraction_leaf=0.0, n_estimators=300,
                               n_jobs=-1, oob_score=False, random_state=42,
                               replacement=False, sampling_strategy='auto',
                               verbose=0, warm_start=True)

In [26]:
# RUSBoostClassifier Refit
rusb_refit = RUSBoostClassifier(**hyperparams_rusb)
rusb_refit.fit(X_train, y_train)

RUSBoostClassifier(algorithm='SAMME.R', base_estimator=None,
                   learning_rate=0.12470065927231533, n_estimators=270,
                   random_state=42, replacement=False,
                   sampling_strategy='auto')

In [47]:
# XGBoostClassifier Refit
xgb_refit = XGBClassifier(**hyperparams_xgb)
xgb_refit.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.39779512373111203, gamma=5,
              learning_rate=0.326749198175752, max_delta_step=0, max_depth=3,
              min_child_weight=2, missing=None, n_estimators=50, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=577.8760162601626,
              seed=None, silent=None, subsample=0.625032357077646,
              tree_method='hist', verbosity=1)

In [50]:
# LGBMClassifier Refit
lgbm_refit = LGBMClassifier(**hyperparams_lgbm)
lgbm_refit.fit(X_train, y_train)

LGBMClassifier(boosting_type='gbdt', class_weight=None,
               colsample_bytree=0.6688200948328725, device='gpu',
               gpu_device_id=0, gpu_platform_id=0, gpu_use_dp=False,
               importance_type='split', learning_rate=0.01378363697590293,
               max_bin=63, max_depth=15, min_child_samples=17,
               min_child_weight=9.842579154419157,
               min_split_gain=49.99706400365879, n_estimators=100, n_jobs=-1,
               num_boost_round=499, num_leaves=266, objective='binary',
               random_state=42, reg_alpha=0.0, reg_lambda=0.0,
               scale_pos_weight=577.8753387533875, silent=True,
               subsample=0.9905036472418532, subsample_for_bin=200000,
               subsample_freq=18)

#### Model Test

In [115]:
# BalancedRandomForestClassifier Test
y_test_pred = brdf_refit.predict(X_test)
gms = geometric_mean_score(y_test, y_test_pred, average="binary")  
print("Final geometric_mean_score:", gms)

Final geometric_mean_score: 0.9281944766749752


In [27]:
# RUSBoostClassifier Test
y_test_pred = rusb_refit.predict(X_test)
gms = geometric_mean_score(y_test, y_test_pred, average="binary")  
print("Final geometric_mean_score:", gms)

Final geometric_mean_score: 0.9306201600729969


In [48]:
# XGBoostClassifier Test
y_test_pred = xgb_refit.predict(X_test)
gms = geometric_mean_score(y_test, y_test_pred, average="binary")  
print("Final geometric_mean_score:", gms)

Final geometric_mean_score: 0.9307896965954383


In [51]:
# LGBMClassifier Test
y_test_pred = lgbm_refit.predict(X_test)
gms = geometric_mean_score(y_test, y_test_pred, average="binary")  
print("Final geometric_mean_score:", gms)

Final geometric_mean_score: 0.931538876124443

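To read the cross-validation and test results side by side, the sketch below tabulates the scores printed in the cells above, rounded to four decimals. The dictionary layout and names are illustrative; no new results are computed.

```python
# (mean 10-fold CV G-mean, test-set G-mean) copied from the outputs above, rounded to 4 decimals.
scores = {
    "BalancedRandomForest": (0.9374, 0.9282),
    "RUSBoost": (0.9396, 0.9306),
    "XGBoost": (0.9307, 0.9308),
    "LightGBM": (0.9208, 0.9315),
}

print("model".ljust(22) + f"{'CV':>8}{'test':>8}{'test - CV':>11}")
for model, (cv, test) in scores.items():
    print(model.ljust(22) + f"{cv:>8.4f}{test:>8.4f}{test - cv:>+11.4f}")

best_on_test = max(scores, key=lambda name: scores[name][1])
print("highest test G-mean:", best_on_test)
```

The gaps between cross-validation and test scores are all around 0.01 or smaller. Since the test set holds only 123 fraud cases (492 minus the 369 in the training set), differences of this size between the four models should probably not be over-interpreted.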