<a href="https://colab.research.google.com/github/dk-wei/recommendation-algo/blob/main/GBDT_LR_(CTR%E9%A2%84%E4%BC%B0_%2B_Churn_rate%E9%A2%84%E6%B5%8B).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://pic4.zhimg.com/80/v2-7fe5861dd26ef82c3e79f25f46c4ce83_1440w.jpg)

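In the figure, Tree1 and Tree2 are two trees learned by a GBDT model, and x is one input sample. After traversing both trees, x lands on one leaf node in each tree. Each leaf node corresponds to one dimension of the LR feature vector, so traversing the trees yields all LR features for that sample. Because each path in a tree is a discriminative path produced by split criteria such as minimizing mean squared error, the features and feature combinations derived from these paths are relatively discriminative, and in theory they should perform no worse than hand-crafted features based on human experience.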
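
As a minimal sketch of this mapping, assume (hypothetically) that Tree1 has 3 leaves and Tree2 has 2 leaves, and that sample x lands in the second leaf of Tree1 and the first leaf of Tree2. The helper name `leaves_to_onehot` and the leaf counts are illustrative assumptions, not something taken from the figure or from a library.

```python
import numpy as np

def leaves_to_onehot(leaf_indices, leaves_per_tree):
    """Concatenate one one-hot block per tree (hypothetical helper)."""
    blocks = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        block = np.zeros(n_leaves, dtype=int)
        block[leaf] = 1  # mark the leaf the sample fell into
        blocks.append(block)
    return np.concatenate(blocks)

# Assumed toy setup: Tree1 has 3 leaves, Tree2 has 2 leaves (0-based leaf ids).
x_features = leaves_to_onehot(leaf_indices=[1, 0], leaves_per_tree=[3, 2])
print(x_features)  # [0 1 0 1 0]
```

The resulting 5-dimensional vector is what the LR model would see for x: one active feature per tree.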

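These properties make GBDT well suited to discovering effective features and feature combinations. In industry, not only GBDT+LR but also GBDT+FM has been used in practice; the winner of a 2014 Kaggle CTR competition used GBDT+FM. Combining GBDT with other models is therefore well worth trying.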

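A review of how Facebook and Kaggle competition entries build trees for GBDT reveals two key points: they use an ensemble of decision trees rather than a single tree, and they build the trees with GBDT rather than RF (Random Forests). The reasoning is as follows: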

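1) Why build an ensemble of decision trees?

  A single tree has weak expressive power and cannot represent many discriminative feature combinations; multiple trees are more expressive.
  In GBDT, each tree learns what the preceding trees still get wrong, and the number of iterations equals the number of trees generated.
  In the GBDT+LR setup used in the paper and in Kaggle competitions, multiple trees are exactly what lets GBDT map each LR training sample to multiple features.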

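  2) Why build the trees with GBDT rather than RF?

  RF also consists of multiple trees, but in practice it has been reported to perform worse than GBDT here. In addition, the splits in GBDT's earlier trees
  mainly reflect features that discriminate among the majority of samples, while the later trees mainly focus on the minority of samples whose residuals
  are still large after the first N trees. Selecting features that discriminate globally first, and then features that discriminate for a minority of samples,
  is a more reasonable approach, which is probably also why GBDT is used.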
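
The idea that each tree learns what the earlier trees still get wrong can be sketched with a hand-rolled boosting loop. This is a toy squared-loss regression on synthetic data, not the binary log-loss setup used in the cases below; the data, `learning_rate`, and tree depth are all assumed values chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
prediction = np.zeros_like(y)
train_mse = []
for _ in range(5):
    residual = y - prediction  # what the earlier trees still get wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)      # the new tree is fitted to the residuals only
    prediction += learning_rate * tree.predict(X)
    train_mse.append(float(np.mean((y - prediction) ** 2)))

print(train_mse)  # training error after each added tree
```

Each round targets only the remaining residuals, so the training error does not increase as trees are added.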

  **GBDT+LR is not limited to CTR prediction (ranking by the LR probability); it can also be applied to ordinary classification models. Two examples follow**:

# Case 1: Porto Seguro’s Safe Driver Prediction

In [1]:
import lightgbm as lgb

import pandas as pd
import numpy as np

from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression

In [2]:
from google.colab import drive

drive.mount('/content/gdrive')

TRAINING_PATH = '/content/gdrive/MyDrive/扬FAANG起航/单项准备/(GBDT+LR) Porto Seguro’s Safe Driver Prediction/train.csv'
TESTING_PATH  = '/content/gdrive/MyDrive/扬FAANG起航/单项准备/(GBDT+LR) Porto Seguro’s Safe Driver Prediction/test.csv'

Mounted at /content/gdrive


In [3]:
pd.read_csv(TRAINING_PATH).shape

(595212, 59)

In [4]:
print('Load data...')
# Read the training file once; rows 10000-11999 serve as a labeled hold-out set
df_all = pd.read_csv(TRAINING_PATH, nrows=12000)
df_train = df_all.iloc[:10000]
df_test  = df_all.iloc[10000:12000]

NUMERIC_COLS = [
    "ps_reg_01", "ps_reg_02", "ps_reg_03",
    "ps_car_12", "ps_car_13", "ps_car_14", "ps_car_15",
]


print(df_train.head(3))
print(df_test.head(3))

Load data...
   id  target  ps_ind_01  ...  ps_calc_18_bin  ps_calc_19_bin  ps_calc_20_bin
0   7       0          2  ...               0               0               1
1   9       0          1  ...               0               1               0
2  13       0          5  ...               0               1               0

[3 rows x 59 columns]
          id  target  ps_ind_01  ...  ps_calc_18_bin  ps_calc_19_bin  ps_calc_20_bin
10000  25242       0          0  ...               1               0               0
10001  25246       0          1  ...               1               1               0
10002  25247       0          0  ...               1               0               0

[3 rows x 59 columns]


In [5]:
y_train = df_train['target']  # training label
y_test = df_test['target']  # testing label
X_train = df_train[NUMERIC_COLS]  # training dataset
X_test = df_test[NUMERIC_COLS]  # testing dataset

Training the GBDT model

This article uses the lightgbm package to train the GBDT model. Each case trains 100 trees, and each tree has up to 64 leaf nodes (leaves).

In [6]:
# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'binary_logloss'},
    'num_trees': 100,  # 100 trees in total per case
    'num_leaves': 64,   # each tree has up to 64 leaves
    'learning_rate': 0.01,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# number of leaves, used later in the feature transformation
num_leaf = 64

print('Start training...')
# train
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=100,
                valid_sets=lgb_train,
                verbose_eval = False
                )

print('Save model...')
# save model to file
gbm.save_model('model.txt')

Start training...




Save model...


<lightgbm.basic.Booster at 0x7efdc6c95cd0>

Feature transformation

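After training the 100 trees, what we need is not the GBDT prediction but the leaf node that each training sample falls into in every tree, so we use the following call: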

In [7]:
print('Start predicting...')
# predict and get data on leaves, training data
y_pred = gbm.predict(X_train, pred_leaf=True)

print(np.array(y_pred).shape)
print(y_pred[0])
print(y_pred[0].shape)

Start predicting...
(10000, 100)
[28 59 17 50 31 22 18 40 22 40 61  7  7  7 53  3 47 22 23 35 41 34 33 43
 50 46 19 42 44 48 42 30 10  9  9 63 21 61 46 58 23 37 50  6 56 35 34 38
 36 40 34 40 30 30 36 52 29 54 35 62  0 48 52 37 55  1 57 61 45 57 36 53
 36 36 14 54 56 40 38 38  1  1 30  1 38 62 25 31 29 23 29 40 18 41 40  1
 42 52 48 45]
(100,)


The printed output shows that the shape is [10000, 100], i.e. `[data size * #trees per case]`.

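Next, the leaf index from each tree is one-hot encoded. As noted above, there are 100 trees in total. Suppose a sample falls on leaf node 28 in the first tree; we then build a 64-dimensional vector (because each tree has 64 leaves) in which every entry is 0 except the one at index 28. The total number of features used for LR training is therefore `[num_trees * num_leaves]`, that is, 100 one-hot vectors of 64 dimensions each.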

This creates a new feature vector of length `[100*64]` for each data point.
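
The index arithmetic behind this encoding can be checked on a small example. The sketch below assumes toy values (3 trees, 4 leaves per tree) instead of the notebook's 100 trees and 64 leaves; the array `leaf_idx` stands in for the output of `gbm.predict(..., pred_leaf=True)`.

```python
import numpy as np

num_leaf = 4                      # toy value; the notebook uses 64
leaf_idx = np.array([[1, 3, 0],   # sample 0: leaf reached in trees 0, 1, 2
                     [2, 2, 1]])  # sample 1
n_samples, n_trees = leaf_idx.shape

# The "hot" column for tree t and leaf l is t * num_leaf + l.
cols = np.arange(n_trees) * num_leaf + leaf_idx
onehot = np.zeros((n_samples, n_trees * num_leaf), dtype=np.int8)
onehot[np.arange(n_samples)[:, None], cols] = 1

print(cols)                # [[1 7 8] [2 6 9]]
print(onehot.sum(axis=1))  # exactly one active feature per tree
```

Every row ends up with exactly `n_trees` ones, one inside each tree's block of `num_leaf` columns.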

In [8]:
print('Writing transformed training data')
transformed_training_matrix = np.zeros([len(y_pred), len(y_pred[0]) * num_leaf],
                                       dtype=np.int64)  # N x (num_trees * num_leaves)

print(transformed_training_matrix[0].shape)   

for i in range(0, len(y_pred)):
    temp = np.arange(len(y_pred[0])) * num_leaf + np.array(y_pred[i])
    transformed_training_matrix[i][temp] += 1


# leaf indices for the test set, consumed by the next cell
y_pred = gbm.predict(X_test, pred_leaf=True)

Writing transformed training data
(6400,)


Each data point is transformed into a new feature vector of length `[100*64]`. The test set must of course be processed in the same way.


In [9]:
print('Writing transformed testing data')
transformed_testing_matrix = np.zeros([len(y_pred), len(y_pred[0]) * num_leaf], dtype=np.int64)
for i in range(0, len(y_pred)):
    temp = np.arange(len(y_pred[0])) * num_leaf + np.array(y_pred[i])
    transformed_testing_matrix[i][temp] += 1

Writing transformed testing data
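
The dense `int64` training matrix above takes roughly 10000 x 6400 x 8 bytes, about 512 MB, although only 100 entries per row are non-zero. One possible alternative, sketched below under the assumption that SciPy is available, is to build the same one-hot features as a sparse matrix, which `LogisticRegression` accepts directly. The helper name `leaves_to_sparse` is hypothetical, and the toy array stands in for `gbm.predict(X, pred_leaf=True)`.

```python
import numpy as np
from scipy.sparse import csr_matrix

def leaves_to_sparse(leaf_idx, num_leaf):
    """Hypothetical helper: (n_samples, n_trees) leaf ids -> sparse one-hot matrix."""
    leaf_idx = np.asarray(leaf_idx)
    n_samples, n_trees = leaf_idx.shape
    rows = np.repeat(np.arange(n_samples), n_trees)
    cols = (np.arange(n_trees) * num_leaf + leaf_idx).ravel()
    data = np.ones(rows.size, dtype=np.int8)
    return csr_matrix((data, (rows, cols)),
                      shape=(n_samples, n_trees * num_leaf))

# Toy leaf ids: 2 samples, 3 trees, 4 leaves per tree (assumed values)
toy = np.array([[1, 3, 0], [2, 2, 1]])
sparse_features = leaves_to_sparse(toy, num_leaf=4)
print(sparse_features.shape, sparse_features.nnz)
```

In the notebook this could be called as `leaves_to_sparse(gbm.predict(X_train, pred_leaf=True), num_leaf)`, and likewise for `X_test`.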


Training the LR model

We can now train the LR model on the transformed training features and labels, and evaluate it on the test set:

In [10]:
lm = LogisticRegression(penalty='l2',C=0.05) # logistic regression model construction
lm.fit(transformed_training_matrix,y_train)  # fitting the data
y_pred_test = lm.predict_proba(transformed_testing_matrix)   # give the probability of each label

print(y_pred_test)

[[0.97742378 0.02257622]
 [0.98815137 0.01184863]
 [0.98046795 0.01953205]
 ...
 [0.98790653 0.01209347]
 [0.99404478 0.00595522]
 [0.97347545 0.02652455]]


What we obtain here is not a hard class label but the probability of each class. These probabilities need to be sorted to obtain the Top N.
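
A minimal sketch of that ranking step, using toy probabilities in the same two-column layout that `predict_proba` returns. The values and `top_n` are assumptions for illustration.

```python
import numpy as np

# Toy class probabilities laid out like predict_proba: [P(y=0), P(y=1)]
proba = np.array([[0.98, 0.02],
                  [0.90, 0.10],
                  [0.99, 0.01],
                  [0.70, 0.30]])
top_n = 2
scores = proba[:, 1]                   # positive-class probability
top_idx = np.argsort(-scores)[:top_n]  # indices of the N highest scores
print(top_idx, scores[top_idx])        # [3 1] [0.3 0.1]
```

In the notebook the same indexing would apply to `y_pred_test[:, 1]`.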

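Evaluation
In the Facebook paper, the model is evaluated with NE (Normalized Cross-Entropy). In that formula the labels take values in {-1, +1}, and the average log loss is divided by the entropy of the background (average) CTR. It is computed as follows: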

![](https://pic2.zhimg.com/80/v2-b2a88a7874e01b316d10a31aa3863171_1440w.jpg)

In [11]:
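# Note: the (1+y)/2 and (1-y)/2 weights assume labels in {-1, +1}, but `target` is in {0, 1};
# the background-entropy denominator is also omitted, so the value printed below is not a true NE.
NE = (-1) / len(y_pred_test) * sum(((1+y_test)/2 * np.log(y_pred_test[:,1]) +  (1-y_test)/2 * np.log(1 - y_pred_test[:,1])))
print("Normalized Cross Entropy " + str(NE))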

Normalized Cross Entropy 2.1481980203420465
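
For labels coded as 0/1, as `target` is in this dataset, the weights (1+y)/2 and (1-y)/2 from the paper's formula reduce to y and 1-y. The sketch below is one way to compute NE under that assumption, including the background-entropy denominator. The function name `normalized_entropy` and the clipping constant `eps` are illustrative choices, and the function assumes both classes are present in `y_true`.

```python
import numpy as np

def normalized_entropy(y_true, p_pred, eps=1e-15):
    """NE for labels in {0, 1}: average log loss divided by background entropy."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    base_rate = y.mean()  # empirical positive rate (background CTR)
    background = -(base_rate * np.log(base_rate)
                   + (1 - base_rate) * np.log(1 - base_rate))
    return log_loss / background

# Toy check: predicting the base rate for every sample should give NE close to 1.
y_toy = np.array([0, 0, 0, 1])
ne_baseline = normalized_entropy(y_toy, np.full(4, 0.25))
print(ne_baseline)
```

In the notebook this could be called as `normalized_entropy(y_test, y_pred_test[:, 1])`; a value below 1 would indicate that the model does better than always predicting the average positive rate.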


# Case 2: Predicting telecom customer churn with GBDT+LR


In [13]:
# -*-coding:utf-8-*-
"""
    Author: Alan
    Desc:
        GBDT+LR model for telecom customer churn prediction
"""
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

class ChurnPredWithGBDTAndLR:
    def __init__(self):
        self.file = "/content/gdrive/MyDrive/扬FAANG起航/单项准备/(GBDT+LR) Porto Seguro’s Safe Driver Prediction/new_churn.csv"
        self.data = self.load_data()
        self.train, self.test = self.split()

    # Load the data
    def load_data(self):
        return pd.read_csv(self.file)

    # Split the dataset
    def split(self):
        train, test = train_test_split(self.data, test_size=0.1, random_state=40)
        return train, test

    # Train the models
    def train_model(self):
        lable = "Churn"
        ID = "customerID"
        x_columns = [x for x in self.train.columns if x not in [lable, ID]]
        x_train = self.train[x_columns]
        y_train = self.train[lable]

        # Create and train the GBDT model
        gbdt = GradientBoostingClassifier()
        gbdt.fit(x_train, y_train)

        # Model fusion
        gbdt_lr = LogisticRegression()
        enc = OneHotEncoder()
        print(gbdt.apply(x_train).shape)
        print(gbdt.apply(x_train).reshape(-1,100).shape)

        # 100 is n_estimators, the number of boosting iterations
        enc.fit(gbdt.apply(x_train).reshape(-1,100))
        gbdt_lr.fit(enc.transform(gbdt.apply(x_train).reshape(-1,100)),y_train)

        return enc, gbdt, gbdt_lr

    # Evaluation
    def evaluate(self,enc,gbdt,gbdt_lr):
        lable = "Churn"
        ID = "customerID"
        x_columns = [x for x in self.test.columns if x not in [lable, ID]]
        x_test = self.test[x_columns]
        y_test = self.test[lable]

        # Evaluate the GBDT model
        gbdt_y_pred = gbdt.predict_proba(x_test)
        new_gbdt_y_pred = list()
        for y in gbdt_y_pred:
            # y[0] is the probability of label=0; y[1] is the probability of label=1
            new_gbdt_y_pred.append(1 if y[1] > 0.5 else 0)
        print("GBDT-MSE: %.4f" % mean_squared_error(y_test, new_gbdt_y_pred))
        print("GBDT-Accuracy : %.4g" % metrics.accuracy_score(y_test.values, new_gbdt_y_pred))
        print("GBDT-AUC Score : %.4g" % metrics.roc_auc_score(y_test.values, new_gbdt_y_pred))

        gbdt_lr_y_pred = gbdt_lr.predict_proba(enc.transform(gbdt.apply(x_test).reshape(-1,100)))
        new_gbdt_lr_y_pred = list()
        for y in gbdt_lr_y_pred:
            # y[0] is the probability of label=0; y[1] is the probability of label=1
            new_gbdt_lr_y_pred.append(1 if y[1] > 0.5 else 0)
        print("GBDT_LR-MSE: %.4f" % mean_squared_error(y_test, new_gbdt_lr_y_pred))
        print("GBDT_LR-Accuracy : %.4g" % metrics.accuracy_score(y_test.values, new_gbdt_lr_y_pred))
        print("GBDT_LR-AUC Score : %.4g" % metrics.roc_auc_score(y_test.values, new_gbdt_lr_y_pred))

if __name__ == "__main__":
    pred = ChurnPredWithGBDTAndLR()
    enc, gbdt, gbdt_lr = pred.train_model()
    pred.evaluate(enc, gbdt,gbdt_lr)


(6338, 100, 1)
(6338, 100)
GBDT-MSE: 0.2199
GBDT-Accuracy : 0.7801
GBDT-AUC Score : 0.7058
GBDT_LR-MSE: 0.2638
GBDT_LR-Accuracy : 0.7362
GBDT_LR-AUC Score : 0.6647


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
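
Note that the evaluation code in Case 2 passes thresholded 0/1 predictions to `roc_auc_score`. AUC measures ranking quality, so it is normally computed from the positive-class probabilities; thresholding first discards most of the ranking information. The sketch below illustrates the difference on toy values (the labels and probabilities are assumptions, not results from the churn data).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and positive-class probabilities (assumed values for illustration)
y_true = np.array([0, 0, 1, 0, 1, 1])
p_pos = np.array([0.10, 0.40, 0.45, 0.20, 0.80, 0.30])

auc_from_scores = roc_auc_score(y_true, p_pos)                      # uses the full ranking
auc_from_labels = roc_auc_score(y_true, (p_pos > 0.5).astype(int))  # thresholded first
print(auc_from_scores, auc_from_labels)
```

In the class above, the equivalent change would be to pass `gbdt_y_pred[:, 1]` and `gbdt_lr_y_pred[:, 1]` to `roc_auc_score`. Separately, the convergence warning can usually be addressed as its message suggests, for example with `LogisticRegression(max_iter=1000)`, where 1000 is an assumed value rather than a tuned one.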
