
![](https://pic4.zhimg.com/80/v2-7fe5861dd26ef82c3e79f25f46c4ce83_1440w.jpg)

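In the figure, Tree1 and Tree2 are two trees learned by a GBDT model, and x is one input sample. After traversing both trees, x lands on one leaf node in each tree. Each leaf node corresponds to one dimension of the LR feature vector, so traversing the trees yields all LR features for that sample. Because every path in a tree is a discriminative path produced by splitting to minimize a criterion such as mean squared error, the features and feature combinations derived from these paths are relatively discriminative, and in theory they should perform no worse than hand-crafted feature engineering.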
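
As a minimal sketch of this mapping, a sample that lands on one leaf per tree becomes the concatenation of one one-hot block per tree. The helper name `leaves_to_lr_features`, the tree sizes, and the leaf positions are made-up toy values for illustration, not read from the figure:

```python
def leaves_to_lr_features(leaf_indices, leaves_per_tree):
    """Concatenate one one-hot block per tree.

    leaf_indices[t]    -- index of the leaf the sample lands on in tree t
    leaves_per_tree[t] -- number of leaves in tree t
    """
    features = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        block = [0] * n_leaves
        block[leaf] = 1
        features.extend(block)
    return features


# Hypothetical example: tree 1 has 3 leaves, tree 2 has 2 leaves;
# the sample lands on leaf 1 of tree 1 and leaf 0 of tree 2.
x_features = leaves_to_lr_features([1, 0], [3, 2])
print(x_features)
```

Each sample therefore has exactly as many non-zero LR features as there are trees.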

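These properties make GBDT well suited to mining effective features and feature combinations. In industry, not only GBDT+LR but also GBDT+FM has been put into practice: the winning solution of the 2014 Kaggle CTR competition used GBDT+FM. Combining GBDT with other models is therefore well worth trying.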

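A look at how Facebook and Kaggle competitors build the trees reveals two key choices: use an ensemble of decision trees rather than a single tree, and build the trees with GBDT rather than RF (Random Forests). The reasoning is as follows: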

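1) Why build an ensemble of decision trees?

  A single tree has weak expressive power and cannot capture many discriminative feature combinations; multiple trees are more expressive. Each GBDT tree learns what the previous trees still get wrong, and one tree is produced per boosting iteration. In the GBDT+LR scheme used in the paper and in Kaggle competitions, multiple trees are exactly what lets each LR training sample be mapped to multiple features.

2) Why build the trees with GBDT rather than RF?

  RF is also an ensemble of trees, but in practice it has been reported to perform worse than GBDT here. In GBDT, the splits in the earlier trees mainly reflect features that discriminate well across most samples, while the later trees mainly reflect the minority of samples whose residuals are still large after the first N trees. Choosing features that are discriminative overall first, and then features that are discriminative for the remaining few samples, is a more sensible order, which is probably also why GBDT is used.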
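
The idea that each tree fits what the previous trees still get wrong can be sketched with a toy booster. This is only an illustration under simplifying assumptions (squared loss, a single 1-D feature, depth-1 stumps, synthetic data); `fit_stump` and `boost` are hypothetical helpers, not LightGBM or scikit-learn APIs:

```python
import numpy as np


def fit_stump(x, r):
    """Least-squares stump on a 1-D feature: returns (threshold, left_value, right_value)."""
    best = None
    for thr in np.unique(x)[:-1]:          # candidate split points (right side never empty)
        left, right = r[x <= thr], r[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]


def boost(x, y, n_trees=5, lr=0.5):
    """Toy gradient boosting with squared loss; returns the training MSE after each tree."""
    pred = np.zeros_like(y, dtype=float)
    history = []
    for _ in range(n_trees):
        residual = y - pred                 # what the previous trees still get wrong
        thr, left_val, right_val = fit_stump(x, residual)
        pred += lr * np.where(x <= thr, left_val, right_val)
        history.append(float(((y - pred) ** 2).mean()))
    return history


rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(6 * x)                           # synthetic target
mse_per_tree = boost(x, y)
print(mse_per_tree)
```

The training error cannot increase from one tree to the next in this setup, because every stump is a least-squares fit to the current residuals.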

  **GBDT + LR is not limited to CTR prediction (ranking by the LR probability); it can also be applied to ordinary classification models. Two examples are given below**:

# Case 1: Porto Seguro’s Safe Driver Prediction

In [None]:
import lightgbm as lgb

import pandas as pd
import numpy as np

from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')

TRAINING_PATH = '/content/gdrive/MyDrive/扬FAANG起航/单项准备/(GBDT+LR) Porto Seguro’s Safe Driver Prediction/train.csv'
TESTING_PATH  = '/content/gdrive/MyDrive/扬FAANG起航/单项准备/(GBDT+LR) Porto Seguro’s Safe Driver Prediction/test.csv'

Mounted at /content/gdrive


In [None]:
pd.read_csv(TRAINING_PATH).shape

(595212, 59)

In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., `ind`, `reg`, `car`, `calc`). In addition, feature names include the postfix `bin` to indicate binary features and `cat` to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The `target` column signifies whether or not a claim was filed for that policy holder.
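
One way to act on these naming conventions is sketched below. `describe_column` is a hypothetical helper written for this example (it is not part of the dataset or of pandas), and it assumes every feature name starts with the `ps_` prefix seen in the column list:

```python
def describe_column(name):
    """Classify a column name by its group tag (ind/reg/car/calc) and type suffix."""
    parts = name.split("_")
    if len(parts) < 2 or parts[0] != "ps":
        return None, "non-feature"          # e.g. `id` or `target`
    group = parts[1]
    if name.endswith("_bin"):
        kind = "binary"
    elif name.endswith("_cat"):
        kind = "categorical"
    else:
        kind = "continuous_or_ordinal"
    return group, kind


for col in ["ps_ind_02_cat", "ps_calc_15_bin", "ps_reg_03", "target"]:
    print(col, describe_column(col))
```

A grouping like this can help decide which columns to pass to the GBDT as numeric inputs and which to treat as categorical.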

In [None]:
print('Load data...')
df_train = pd.read_csv(TRAINING_PATH).iloc[:10000]
df_test  = pd.read_csv(TRAINING_PATH).iloc[10000:12000]

NUMERIC_COLS = [
    "ps_reg_01", "ps_reg_02", "ps_reg_03",
    "ps_car_12", "ps_car_13", "ps_car_14", "ps_car_15",
]


print(df_train.head(3))
print(df_test.head(3))

Load data...
   id  target  ps_ind_01  ...  ps_calc_18_bin  ps_calc_19_bin  ps_calc_20_bin
0   7       0          2  ...               0               0               1
1   9       0          1  ...               0               1               0
2  13       0          5  ...               0               1               0

[3 rows x 59 columns]
          id  target  ps_ind_01  ...  ps_calc_18_bin  ps_calc_19_bin  ps_calc_20_bin
10000  25242       0          0  ...               1               0               0
10001  25246       0          1  ...               1               1               0
10002  25247       0          0  ...               1               0               0

[3 rows x 59 columns]


In [None]:
y_train = df_train['target']  # training label
y_test = df_test['target']  # testing label
X_train = df_train[NUMERIC_COLS]  # training dataset
X_test = df_test[NUMERIC_COLS]  # testing dataset

In [None]:
df_train.head()

Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,ps_ind_10_bin,ps_ind_11_bin,ps_ind_12_bin,ps_ind_13_bin,ps_ind_14,ps_ind_15,ps_ind_16_bin,ps_ind_17_bin,ps_ind_18_bin,ps_reg_01,ps_reg_02,ps_reg_03,ps_car_01_cat,ps_car_02_cat,ps_car_03_cat,ps_car_04_cat,ps_car_05_cat,ps_car_06_cat,ps_car_07_cat,ps_car_08_cat,ps_car_09_cat,ps_car_10_cat,ps_car_11_cat,ps_car_11,ps_car_12,ps_car_13,ps_car_14,ps_car_15,ps_calc_01,ps_calc_02,ps_calc_03,ps_calc_04,ps_calc_05,ps_calc_06,ps_calc_07,ps_calc_08,ps_calc_09,ps_calc_10,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,7,0,2,2,5,1,0,0,1,0,0,0,0,0,0,0,11,0,1,0,0.7,0.2,0.71807,10,1,-1,0,1,4,1,0,0,1,12,2,0.4,0.883679,0.37081,3.605551,0.6,0.5,0.2,3,1,10,1,10,1,5,9,1,5,8,0,1,1,0,0,1
1,9,0,1,1,7,0,0,0,0,1,0,0,0,0,0,0,3,0,0,1,0.8,0.4,0.766078,11,1,-1,0,-1,11,1,1,2,1,19,3,0.316228,0.618817,0.388716,2.44949,0.3,0.1,0.3,2,1,9,5,8,1,7,3,1,1,9,0,1,1,0,1,0
2,13,0,5,4,9,1,0,0,0,1,0,0,0,0,0,0,12,1,0,0,0.0,0.0,-1.0,7,1,-1,0,-1,14,1,1,2,1,60,1,0.316228,0.641586,0.347275,3.316625,0.5,0.7,0.1,2,2,9,1,8,2,7,4,2,7,7,0,1,1,0,1,0
3,16,0,0,1,2,0,0,1,0,0,0,0,0,0,0,0,8,1,0,0,0.9,0.2,0.580948,7,1,0,0,1,11,1,1,3,1,104,1,0.374166,0.542949,0.294958,2.0,0.6,0.9,0.1,2,4,7,1,8,4,2,2,2,4,9,0,0,0,0,0,0
4,17,0,0,2,0,1,0,1,0,0,0,0,0,0,0,0,9,1,0,0,0.7,0.6,0.840759,11,1,-1,0,-1,14,1,1,2,1,82,3,0.31607,0.565832,0.365103,2.0,0.4,0.6,0.0,2,2,6,3,10,2,12,3,1,1,3,0,0,0,1,1,0


Training the GBDT model

This article uses the lightgbm package to train the GBDT model. Each case trains 100 trees, and each tree has up to 64 leaf nodes (leaves).

In [None]:
# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'binary_logloss'},
    'num_trees': 100,  # 100 trees in total for each case
    'num_leaves': 64,   # at most 64 leaves per tree
    'learning_rate': 0.01,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# number of leaves per tree; used in the feature transformation
num_leaf = 64

print('Start training...')
# train
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=100,
                valid_sets=lgb_train,
                verbose_eval = False
                )

print('Save model...')
# save model to file
gbm.save_model('model.txt')

Start training...




Save model...


<lightgbm.basic.Booster at 0x7f2bf0a699d0>

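Feature transformation

After training the 100 trees, what we need is not the GBDT prediction itself but the leaf node that each training sample lands on in every tree, so we use the following statement: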

In [None]:
print('Start predicting...')
# predict and get data on leaves, training data
y_pred = gbm.predict(X_train, pred_leaf=True)

print(np.array(y_pred).shape)
print(y_pred[0])
print(y_pred[0].shape)

Start predicting...
(10000, 100)
[28 59 17 50 31 22 18 40 22 40 61  7  7  7 53  3 47 22 23 35 41 34 33 43
 50 46 19 42 44 48 42 30 10  9  9 63 21 61 46 58 23 37 50  6 56 35 34 38
 36 40 34 40 30 30 36 52 29 54 35 62  0 48 52 37 55  1 57 61 45 57 36 53
 36 36 14 54 56 40 38 38  1  1 30  1 38 62 25 31 29 23 29 40 18 41 40  1
 42 52 48 45]
(100,)


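Printing the result shows that the shape is [10000, 100], i.e. `[data size * #trees per case]`.

Next, each tree's leaf index is one-hot encoded. As described above, there are 100 trees in total. If a sample lands on leaf node 28 in the first tree, we build a 64-dimensional vector (because each tree has 64 leaves) that is 0 everywhere except at index 28. The LR training features therefore have `[num_trees * num_leaves]` dimensions in total, i.e. 100 one-hot vectors of 64 dimensions each.

This gives every data point a new feature vector of length `[100*64]`.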
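
The same layout (column = `tree_index * num_leaf + leaf_index`) can also be produced without a Python loop over samples. The following is a sketch with NumPy fancy indexing; `one_hot_leaves` and the toy input are assumptions made for this example:

```python
import numpy as np


def one_hot_leaves(leaf_matrix, num_leaf):
    """Turn an [N, num_trees] matrix of leaf indices into an [N, num_trees * num_leaf] 0/1 matrix."""
    leaf_matrix = np.asarray(leaf_matrix)
    n_samples, n_trees = leaf_matrix.shape
    # Column of the active leaf for every (sample, tree) pair: tree offset + leaf index.
    cols = np.arange(n_trees) * num_leaf + leaf_matrix
    out = np.zeros((n_samples, n_trees * num_leaf), dtype=np.int64)
    out[np.arange(n_samples)[:, None], cols] = 1
    return out


# Toy input: 2 samples, 3 trees, 4 leaves per tree (made-up values).
toy = [[0, 3, 1],
       [2, 2, 0]]
encoded = one_hot_leaves(toy, num_leaf=4)
print(encoded.shape)
print(encoded[0])
```

For large datasets a sparse matrix would save memory, since only one entry per tree is non-zero in each row.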

In [None]:
print('Writing transformed training data')
transformed_training_matrix = np.zeros([len(y_pred), len(y_pred[0]) * num_leaf],
                                       dtype=np.int64)  # shape: N x (num_trees * num_leaves)

print(transformed_training_matrix[0].shape)   

for i in range(0, len(y_pred)):
    temp = np.arange(len(y_pred[0])) * num_leaf + np.array(y_pred[i])
    transformed_training_matrix[i][temp] += 1


y_pred = gbm.predict(X_test, pred_leaf=True)

Writing transformed training data
(6400,)


Each data point is transformed into a new feature vector of length `[100*64]`. The test set must of course be processed in the same way.


In [None]:
print('Writing transformed testing data')
transformed_testing_matrix = np.zeros([len(y_pred), len(y_pred[0]) * num_leaf], dtype=np.int64)
for i in range(0, len(y_pred)):
    temp = np.arange(len(y_pred[0])) * num_leaf + np.array(y_pred[i])
    transformed_testing_matrix[i][temp] += 1

Writing transformed testing data


LR training

We can now train the LR model on the transformed training features and labels, and evaluate it on the test set:

In [None]:
lm = LogisticRegression(penalty='l2',C=0.05) # logistic regression model construction
lm.fit(transformed_training_matrix,y_train)  # fitting the data
y_pred_test = lm.predict_proba(transformed_testing_matrix)   # Give the probability of each label

print(y_pred_test)

[[0.97742378 0.02257622]
 [0.98815137 0.01184863]
 [0.98046795 0.01953205]
 ...
 [0.98790653 0.01209347]
 [0.99404478 0.00595522]
 [0.97347545 0.02652455]]


What we get here is not a hard class label but the probability of each class. These probabilities need to be sorted to obtain our top-N.
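
A minimal sketch of that ranking step, assuming the `[P(label=0), P(label=1)]` column layout shown in the output above. The helper `top_n` and the probabilities are made up for illustration:

```python
import numpy as np


def top_n(proba, n):
    """Return the row indices of the n samples with the highest positive-class probability."""
    positive = np.asarray(proba)[:, 1]            # column 1 = P(label == 1)
    order = np.argsort(-positive, kind="stable")  # descending order
    return order[:n]


# Made-up probabilities in the same two-column layout as predict_proba.
toy_proba = [[0.98, 0.02],
             [0.90, 0.10],
             [0.99, 0.01],
             [0.95, 0.05]]
print(top_n(toy_proba, 2))
```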

Evaluation

In the Facebook paper, the model is evaluated with NE (Normalized Cross-Entropy), computed as follows:

![](https://pic2.zhimg.com/80/v2-b2a88a7874e01b316d10a31aa3863171_1440w.jpg)

In [None]:
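# NOTE: the (1 +/- y)/2 weights follow the paper's convention of labels in {-1, +1};
# `target` here is in {0, 1}, and the division by the background-CTR entropy is omitted,
# so the printed value is not directly comparable to NE figures reported elsewhere.
NE = (-1) / len(y_pred_test) * sum(((1+y_test)/2 * np.log(y_pred_test[:,1]) +  (1-y_test)/2 * np.log(1 - y_pred_test[:,1])))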
print("Normalized Cross Entropy " + str(NE))

Normalized Cross Entropy 2.1481980203420465
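
For reference, the paper defines NE as the average log loss divided by the entropy of the background (average) click-through rate, with labels written as -1/+1. A sketch of the equivalent computation for 0/1 labels follows; `normalized_entropy`, its `eps` clipping, and the toy data are assumptions of this example, and it was not run on the notebook's data:

```python
import math


def normalized_entropy(y_true, p_pred, eps=1e-15):
    """NE = average log loss / entropy of the background positive rate (labels in {0, 1}).

    Undefined when all labels are identical, because the background entropy is then 0.
    """
    n = len(y_true)
    log_loss = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)           # avoid log(0)
        log_loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    log_loss /= n
    ctr = sum(y_true) / n                       # background (average) positive rate
    background = -(ctr * math.log(ctr) + (1 - ctr) * math.log(1 - ctr))
    return log_loss / background


# Made-up labels; predicting the background rate for every sample gives NE close to 1.
labels = [0, 0, 0, 1]
print(normalized_entropy(labels, [0.25] * 4))
print(normalized_entropy(labels, [0.1, 0.1, 0.1, 0.7]))
```

With this definition, a value below 1 means the model does better than always predicting the average rate.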


# Case 2: Predicting Telecom Customer Churn with GBDT + LR


In [None]:
# -*-coding:utf-8-*-
"""
    Author: Alan
    Desc:
        GBDT+LR model for telecom customer churn prediction
"""
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

class ChurnPredWithGBDTAndLR:
    def __init__(self):
        self.file = "/content/gdrive/MyDrive/扬FAANG起航/单项准备/(GBDT+LR) Porto Seguro’s Safe Driver Prediction/new_churn.csv"
        self.data = self.load_data()
        self.train, self.test = self.split()

    # Load the data
    def load_data(self):
        return pd.read_csv(self.file)

    # Split the dataset
    def split(self):
        train, test = train_test_split(self.data, test_size=0.1, random_state=40)
        return train, test

    # Train the models
    def train_model(self):
        label = "Churn"
        ID = "customerID"
        x_columns = [x for x in self.train.columns if x not in [label, ID]]
        x_train = self.train[x_columns]
        y_train = self.train[label]

        # Create the GBDT model and train it
        gbdt = GradientBoostingClassifier()
        gbdt.fit(x_train, y_train)

        # Model fusion: one-hot encode the GBDT leaf indices and feed them to LR
        gbdt_lr = LogisticRegression()
        enc = OneHotEncoder()
        print(gbdt.apply(x_train).shape)
        print(gbdt.apply(x_train).reshape(-1,100).shape)

        # 100 is n_estimators (the number of boosting iterations)
        enc.fit(gbdt.apply(x_train).reshape(-1,100))
        gbdt_lr.fit(enc.transform(gbdt.apply(x_train).reshape(-1,100)),y_train)

        return enc, gbdt, gbdt_lr

    # Evaluation
    def evaluate(self,enc,gbdt,gbdt_lr):
        label = "Churn"
        ID = "customerID"
        x_columns = [x for x in self.test.columns if x not in [label, ID]]
        x_test = self.test[x_columns]
        y_test = self.test[label]

        # Evaluate the GBDT model on its own
        gbdt_y_pred = gbdt.predict_proba(x_test)
        new_gbdt_y_pred = list()
        for y in gbdt_y_pred:
            # y[0] is the probability of label=0, y[1] is the probability of label=1
            new_gbdt_y_pred.append(1 if y[1] > 0.5 else 0)
        print("GBDT-MSE: %.4f" % mean_squared_error(y_test, new_gbdt_y_pred))
        print("GBDT-Accuracy : %.4g" % metrics.accuracy_score(y_test.values, new_gbdt_y_pred))
        # NOTE: AUC here is computed from the thresholded 0/1 predictions, not the probabilities
        print("GBDT-AUC Score : %.4g" % metrics.roc_auc_score(y_test.values, new_gbdt_y_pred))

        # Evaluate GBDT + LR
        gbdt_lr_y_pred = gbdt_lr.predict_proba(enc.transform(gbdt.apply(x_test).reshape(-1,100)))
        new_gbdt_lr_y_pred = list()
        for y in gbdt_lr_y_pred:
            # y[0] is the probability of label=0, y[1] is the probability of label=1
            new_gbdt_lr_y_pred.append(1 if y[1] > 0.5 else 0)
        print("GBDT_LR-MSE: %.4f" % mean_squared_error(y_test, new_gbdt_lr_y_pred))
        print("GBDT_LR-Accuracy : %.4g" % metrics.accuracy_score(y_test.values, new_gbdt_lr_y_pred))
        print("GBDT_LR-AUC Score : %.4g" % metrics.roc_auc_score(y_test.values, new_gbdt_lr_y_pred))

if __name__ == "__main__":
    pred = ChurnPredWithGBDTAndLR()
    enc, gbdt, gbdt_lr = pred.train_model()
    pred.evaluate(enc, gbdt,gbdt_lr)


(6338, 100, 1)
(6338, 100)
GBDT-MSE: 0.2199
GBDT-Accuracy : 0.7801
GBDT-AUC Score : 0.7058
GBDT_LR-MSE: 0.2638
GBDT_LR-Accuracy : 0.7362
GBDT_LR-AUC Score : 0.6647


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
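
One caveat about the AUC figures above: the evaluation code passes thresholded 0/1 predictions to `roc_auc_score`, whereas AUC is a ranking metric that is normally computed from the predicted probabilities. The toy sketch below uses a hand-written pairwise AUC (`auc` is a hypothetical helper and the data is made up) to show how thresholding can hide a perfectly good ranking:

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and probabilities: every positive outranks every negative,
# but all probabilities fall below the 0.5 threshold.
y_true = [0, 0, 1, 1]
proba = [0.10, 0.30, 0.40, 0.45]
hard = [1 if p > 0.5 else 0 for p in proba]

print(auc(y_true, proba))   # uses the full ranking
print(auc(y_true, hard))    # only sees the 0/1 predictions
```

In the notebook's setting, the analogous change would be to pass column 1 of `predict_proba` to `roc_auc_score`; the numbers printed above were not recomputed that way.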


# Case 3: Criteo Project (Failed)

In [35]:
!git clone https://github.com/chengstone/kaggle_criteo_ctr_challenge-.git

Cloning into 'kaggle_criteo_ctr_challenge-'...
remote: Enumerating objects: 156, done.
remote: Total 156 (delta 0), reused 0 (delta 0), pack-reused 156
Receiving objects: 100% (156/156), 2.10 MiB | 23.42 MiB/s, done.
Resolving deltas: 100% (49/49), done.


In [1]:
import os
import sys
import click
import random
import collections

import numpy as np
import lightgbm as lgb
import json
import pandas as pd
from sklearn.metrics import mean_squared_error

In [2]:
!wget --no-check-certificate https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/10082655/dac.tar.gz

--2021-07-09 23:49:08--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/10082655/dac.tar.gz
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 52.218.112.35
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|52.218.112.35|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4576820670 (4.3G) [binary/octet-stream]
Saving to: ‘dac.tar.gz’


2021-07-09 23:51:21 (32.8 MB/s) - ‘dac.tar.gz’ saved [4576820670/4576820670]



In [3]:
!tar zxf dac.tar.gz

!rm -f dac.tar.gz

tar: Ignoring unknown extended header keyword 'SCHILY.dev'
tar: Ignoring unknown extended header keyword 'SCHILY.ino'
tar: Ignoring unknown extended header keyword 'SCHILY.nlink'
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
tar: Ignoring unknown extended header keyword 'SCHILY.dev'
tar: Ignoring unknown extended header keyword 'SCHILY.ino'
tar: Ignoring unknown extended header keyword 'SCHILY.nlink'
tar: Ignoring unknown extended header keyword 'SCHILY.dev'
tar: Ignoring unknown extended header keyword 'SCHILY.ino'
tar: Ignoring unknown extended header keyword 'SCHILY.nlink'


In [4]:
!mkdir raw

!mv ./*.txt raw/

In [19]:
!mkdir raw2

#!mv ./*.txt raw2/

In [25]:
# Take the first 100,000 samples

# !head -n 1000 Criteo/test.txt > test_sub100w.txt

# !head -n 1000 Criteo/train.txt > train_sub100w.txt

!head -n 100000 raw/test.txt > raw2/test.txt

!head -n 100000 raw/train.txt > raw2/train.txt

In [26]:
import pickle

def save_params(params):
    """
    Save parameters to file
    """
    pickle.dump(params, open('params.p', 'wb'))


def load_params():
    """
    Load parameters from file
    """
    return pickle.load(open('params.p', mode='rb'))


def save_params_with_name(params, name):
    """
    Save parameters to file
    """
    pickle.dump(params, open('{}.p'.format(name), 'wb'))


def load_params_with_name(name):
    """
    Load parameters from file
    """
    return pickle.load(open('{}.p'.format(name), mode='rb'))

In [27]:
# There are 13 integer features and 26 categorical features
continous_features = range(1, 14)
categorial_features = range(14, 40)

# Clip integer features. The clip point for each integer feature
# is derived from the 95% quantile of the total values in each feature
continous_clip = [20, 600, 100, 50, 64000, 500, 100, 50, 500, 10, 10, 10, 50]

class ContinuousFeatureGenerator:
    """
    Normalize the integer features to [0, 1] by min-max normalization
    """

    def __init__(self, num_feature):
        self.num_feature = num_feature
        self.min = [sys.maxsize] * num_feature
        self.max = [-sys.maxsize] * num_feature

    def build(self, datafile, continous_features):
        with open(datafile, 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')
                for i in range(0, self.num_feature):
                    val = features[continous_features[i]]
                    if val != '':
                        val = int(val)
                        if val > continous_clip[i]:
                            val = continous_clip[i]
                        self.min[i] = min(self.min[i], val)
                        self.max[i] = max(self.max[i], val)

    def gen(self, idx, val):
        if val == '':
            return 0.0
        val = float(val)
        return (val - self.min[idx]) / (self.max[idx] - self.min[idx])

class CategoryDictGenerator:
    """
    Generate dictionary for each of the categorical features
    """

    def __init__(self, num_feature):
        self.dicts = []
        self.num_feature = num_feature
        for i in range(0, num_feature):
            self.dicts.append(collections.defaultdict(int))

    def build(self, datafile, categorial_features, cutoff=0):
        with open(datafile, 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')
                for i in range(0, self.num_feature):
                    if features[categorial_features[i]] != '':
                        self.dicts[i][features[categorial_features[i]]] += 1
        for i in range(0, self.num_feature):
            self.dicts[i] = filter(lambda x: x[1] >= cutoff,
                                   self.dicts[i].items())

            self.dicts[i] = sorted(self.dicts[i], key=lambda x: (-x[1], x[0]))
            vocabs, _ = list(zip(*self.dicts[i]))
            self.dicts[i] = dict(zip(vocabs, range(1, len(vocabs) + 1)))
            self.dicts[i]['<unk>'] = 0

    def gen(self, idx, key):
        if key not in self.dicts[idx]:
            res = self.dicts[idx]['<unk>']
        else:
            res = self.dicts[idx][key]
        return res

    def dicts_sizes(self):
        return list(map(len, self.dicts))
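
The clip-then-scale idea behind `ContinuousFeatureGenerator` can be sketched on a single toy column. `clip_and_scale` is a hypothetical helper; unlike `gen` in the cell above, it also clips at transform time so the output stays within [0, 1], which is a choice of this example rather than the notebook's behavior. It assumes the column has at least two distinct clipped values:

```python
def clip_and_scale(values, clip_point):
    """Clip raw integer values at clip_point, then min-max scale them to [0, 1].

    Empty strings stand for missing values and map to 0.0.
    """
    present = [min(int(v), clip_point) for v in values if v != ""]
    lo, hi = min(present), max(present)
    scaled = []
    for v in values:
        if v == "":
            scaled.append(0.0)
        else:
            clipped = min(int(v), clip_point)
            scaled.append((clipped - lo) / (hi - lo))
    return scaled


# Made-up raw column with one missing entry and one outlier; clip point 20.
print(clip_and_scale(["0", "5", "", "10", "300"], clip_point=20))
```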

In [28]:
def preprocess(datadir, outdir):
    """
    All the 13 integer features are normalized to continuous values and these
    continuous features are combined into one vector with dimension 13.

    Each of the 26 categorical features are one-hot encoded and all the one-hot
    vectors are combined into one sparse binary vector.
    """
    dists = ContinuousFeatureGenerator(len(continous_features))
    dists.build(os.path.join(datadir, 'train.txt'), continous_features)

    dicts = CategoryDictGenerator(len(categorial_features))
    dicts.build(
        os.path.join(datadir, 'train.txt'), categorial_features, cutoff=200)#200 50

    dict_sizes = dicts.dicts_sizes()
    categorial_feature_offset = [0]
    for i in range(1, len(categorial_features)):
        offset = categorial_feature_offset[i - 1] + dict_sizes[i - 1]
        categorial_feature_offset.append(offset)

    random.seed(0)

    print('training set process started')

    # 90% of the data are used for training, and 10% of the data are used
    # for validation.
    train_ffm = open(os.path.join(outdir, 'train_ffm.txt'), 'w')
    valid_ffm = open(os.path.join(outdir, 'valid_ffm.txt'), 'w')

    train_lgb = open(os.path.join(outdir, 'train_lgb.txt'), 'w')
    valid_lgb = open(os.path.join(outdir, 'valid_lgb.txt'), 'w')

    with open(os.path.join(outdir, 'train.txt'), 'w') as out_train:
        with open(os.path.join(outdir, 'valid.txt'), 'w') as out_valid:
            with open(os.path.join(datadir, 'train.txt'), 'r') as f:
                for line in f:
                    features = line.rstrip('\n').split('\t')
                    continous_feats = []
                    continous_vals = []
                    for i in range(0, len(continous_features)):

                        val = dists.gen(i, features[continous_features[i]])
                        continous_vals.append(
                            "{0:.6f}".format(val).rstrip('0').rstrip('.'))
                        continous_feats.append(
                            "{0:.6f}".format(val).rstrip('0').rstrip('.'))#('{0}'.format(val))

                    categorial_vals = []
                    categorial_lgb_vals = []
                    for i in range(0, len(categorial_features)):
                        val = dicts.gen(i, features[categorial_features[i]]) + categorial_feature_offset[i]
                        categorial_vals.append(str(val))
                        val_lgb = dicts.gen(i, features[categorial_features[i]])
                        categorial_lgb_vals.append(str(val_lgb))

                    continous_vals = ','.join(continous_vals)
                    categorial_vals = ','.join(categorial_vals)
                    label = features[0]
                    if random.randint(0, 9999) % 10 != 0:
                        out_train.write(','.join(
                            [continous_vals, categorial_vals, label]) + '\n')
                        train_ffm.write('\t'.join(label) + '\t')
                        train_ffm.write('\t'.join(
                            ['{}:{}:{}'.format(ii, ii, val) for ii,val in enumerate(continous_vals.split(','))]) + '\t')
                        train_ffm.write('\t'.join(
                            ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')
                        
                        train_lgb.write('\t'.join(label) + '\t')
                        train_lgb.write('\t'.join(continous_feats) + '\t')
                        train_lgb.write('\t'.join(categorial_lgb_vals) + '\n')

                    else:
                        out_valid.write(','.join(
                            [continous_vals, categorial_vals, label]) + '\n')
                        valid_ffm.write('\t'.join(label) + '\t')
                        valid_ffm.write('\t'.join(
                            ['{}:{}:{}'.format(ii, ii, val) for ii,val in enumerate(continous_vals.split(','))]) + '\t')
                        valid_ffm.write('\t'.join(
                            ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')
                                                
                        valid_lgb.write('\t'.join(label) + '\t')
                        valid_lgb.write('\t'.join(continous_feats) + '\t')
                        valid_lgb.write('\t'.join(categorial_lgb_vals) + '\n')
                        
    print('training set process finished')

    train_ffm.close()
    valid_ffm.close()

    train_lgb.close()
    valid_lgb.close()

    print('testing set process started')

    test_ffm = open(os.path.join(outdir, 'test_ffm.txt'), 'w')
    test_lgb = open(os.path.join(outdir, 'test_lgb.txt'), 'w')

    with open(os.path.join(outdir, 'test.txt'), 'w') as out:
        with open(os.path.join(datadir, 'test.txt'), 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')

                continous_feats = []
                continous_vals = []
                for i in range(0, len(continous_features)):
                    val = dists.gen(i, features[continous_features[i] - 1])
                    continous_vals.append(
                        "{0:.6f}".format(val).rstrip('0').rstrip('.'))
                    continous_feats.append(
                            "{0:.6f}".format(val).rstrip('0').rstrip('.'))#('{0}'.format(val))

                categorial_vals = []
                categorial_lgb_vals = []
                for i in range(0, len(categorial_features)):
                    val = dicts.gen(i,
                                    features[categorial_features[i] -
                                             1]) + categorial_feature_offset[i]
                    categorial_vals.append(str(val))

                    val_lgb = dicts.gen(i, features[categorial_features[i] - 1])
                    categorial_lgb_vals.append(str(val_lgb))

                continous_vals = ','.join(continous_vals)
                categorial_vals = ','.join(categorial_vals)

                out.write(','.join([continous_vals, categorial_vals]) + '\n')
                
                test_ffm.write('\t'.join(['{}:{}:{}'.format(ii, ii, val) for ii,val in enumerate(continous_vals.split(','))]) + '\t')
                test_ffm.write('\t'.join(
                    ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')
                                                                
                test_lgb.write('\t'.join(continous_feats) + '\t')
                test_lgb.write('\t'.join(categorial_lgb_vals) + '\n')

    test_ffm.close()
    test_lgb.close()

    print('testing set process finished')

    return dict_sizes

In [42]:
!mkdir data

dict_sizes = preprocess('./raw2','./data')

training set process started
training set process finished
testing set process started
testing set process finished


In [43]:
save_params_with_name((dict_sizes), 'dict_sizes') #pickle.dump((dict_sizes), open('dict_sizes.p', 'wb'))

In [46]:
dict_sizes = load_params_with_name('dict_sizes') #pickle.load(open('dict_sizes.p', mode='rb'))

In [47]:
sum(dict_sizes)

909
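
`sum(dict_sizes)` is the total number of categorical ids across all fields. Inside `preprocess`, each field's local id is shifted by the combined size of the fields before it, so that ids from different fields never collide. A small sketch of that offset scheme, with a hypothetical helper `feature_offsets` and made-up vocabulary sizes:

```python
from itertools import accumulate


def feature_offsets(dict_sizes):
    """Starting global index of each categorical field: cumulative sum of the preceding sizes."""
    return [0] + list(accumulate(dict_sizes))[:-1]


# Made-up vocabulary sizes for three fields.
sizes = [4, 2, 3]
offsets = feature_offsets(sizes)
print(offsets)

# A (field, local id) pair maps to a unique global id.
global_ids = [offsets[field] + local for field, local in [(0, 3), (1, 0), (2, 2)]]
print(global_ids)
```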

## FFM training

The FFM input data is prepared, so let's start training the FFM model.

 - learning rate = 0.1
 - epoch = 32
 - model saved as `model_ffm`
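
For orientation, each line of the FFM input files written by `preprocess` has the form `label`, then one `field:index:value` token per feature, separated by tabs. The sketch below mirrors that formatting on a shortened, made-up sample; `ffm_line` is a hypothetical helper, not part of libffm:

```python
def ffm_line(label, continuous_vals, categorical_ids, n_continuous=13):
    """Format one sample as `label<TAB>field:index:value ...`, mirroring preprocess()."""
    tokens = [str(label)]
    # Continuous features: field id and feature index are both the column position.
    tokens += ["{}:{}:{}".format(i, i, v) for i, v in enumerate(continuous_vals)]
    # Categorical features: fields follow the continuous ones; the value is always 1.
    tokens += ["{}:{}:1".format(i + n_continuous, cid + n_continuous)
               for i, cid in enumerate(categorical_ids)]
    return "\t".join(tokens)


# Shortened made-up sample: 2 continuous and 2 categorical features.
print(ffm_line(1, [0.05, 0.5], [7, 12], n_continuous=2))
```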

In [93]:
import subprocess, sys, os, time

NR_THREAD = 1

In [95]:
cmd = '/content/kaggle_criteo_ctr_challenge-/libffm/libffm/ffm-train --auto-stop -r 0.1 -t 32 -s {nr_thread} -p ./data/valid_ffm.txt ./data/train_ffm.txt model_ffm'.format(nr_thread=NR_THREAD) 
os.popen(cmd).readlines()

[]

In [96]:
cmd = '/content/kaggle_criteo_ctr_challenge-/libffm/libffm/ffm-predict ./data/train_ffm.txt model_ffm tr_ffm.out'.format(nr_thread=NR_THREAD) 
os.popen(cmd).readlines()

[]

In [97]:
cmd = '/content/kaggle_criteo_ctr_challenge-/libffm/libffm/ffm-predict ./data/valid_ffm.txt model_ffm va_ffm.out'.format(nr_thread=NR_THREAD) 
os.popen(cmd).readlines()

[]

In [63]:
cmd = '/content/kaggle_criteo_ctr_challenge-/libffm/libffm/ffm-predict ./data/test_ffm.txt model_ffm te_ffm.out true'.format(nr_thread=NR_THREAD) 
os.popen(cmd).readlines()

[]

In [70]:
def lgb_pred(tr_path, va_path, _sep = '\t', iter_num = 32):
    # load or create your dataset
    print('Load data...')
    df_train = pd.read_csv(tr_path, header=None, sep=_sep)
    df_test = pd.read_csv(va_path, header=None, sep=_sep)
    
    y_train = df_train[0].values
    y_test = df_test[0].values
    X_train = df_train.drop(0, axis=1).values
    X_test = df_test.drop(0, axis=1).values
    
    # create dataset for lightgbm
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
    
    # specify your configurations as a dict
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': {'l2', 'auc', 'logloss'},
        'num_leaves': 30,
#         'max_depth': 7,
        'num_trees': 32,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
    }
    
    print('Start training...')
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=iter_num,
                    valid_sets=lgb_eval,
                    feature_name=["I1","I2","I3","I4","I5","I6","I7","I8","I9","I10","I11","I12","I13","C1","C2","C3","C4","C5","C6","C7","C8","C9","C10","C11","C12","C13","C14","C15","C16","C17","C18","C19","C20","C21","C22","C23","C24","C25","C26"],
                    categorical_feature=["C1","C2","C3","C4","C5","C6","C7","C8","C9","C10","C11","C12","C13","C14","C15","C16","C17","C18","C19","C20","C21","C22","C23","C24","C25","C26"],
                    early_stopping_rounds=5)
    
    print('Save model...')
    # save model to file
    gbm.save_model('lgb_model.txt')
    
    print('Start predicting...')
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
    # eval
    print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)

    return gbm,y_pred,X_train,y_train

In [71]:
gbm,y_pred,X_train ,y_train = lgb_pred('./data/train_lgb.txt', './data/valid_lgb.txt', '\t', 256)

Load data...
Start training...


New categorical_feature is ['C1', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C2', 'C20', 'C21', 'C22', 'C23', 'C24', 'C25', 'C26', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9']
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))


[1]	valid_0's auc: 0.708795	valid_0's l2: 0.172283
Training until validation scores don't improve for 5 rounds.
[2]	valid_0's auc: 0.719777	valid_0's l2: 0.170605
[3]	valid_0's auc: 0.72299	valid_0's l2: 0.168996
[4]	valid_0's auc: 0.725087	valid_0's l2: 0.167526
[5]	valid_0's auc: 0.724959	valid_0's l2: 0.166189
[6]	valid_0's auc: 0.725621	valid_0's l2: 0.164986
[7]	valid_0's auc: 0.727142	valid_0's l2: 0.163885
[8]	valid_0's auc: 0.728577	valid_0's l2: 0.162806
[9]	valid_0's auc: 0.730393	valid_0's l2: 0.161819
[10]	valid_0's auc: 0.731448	valid_0's l2: 0.160906
[11]	valid_0's auc: 0.732238	valid_0's l2: 0.160099
[12]	valid_0's auc: 0.734043	valid_0's l2: 0.159331
[13]	valid_0's auc: 0.734657	valid_0's l2: 0.158612
[14]	valid_0's auc: 0.735504	valid_0's l2: 0.157925
[15]	valid_0's auc: 0.736296	valid_0's l2: 0.15727
[16]	valid_0's auc: 0.737115	valid_0's l2: 0.156654
[17]	valid_0's auc: 0.737804	valid_0's l2: 0.156098
[18]	valid_0's auc: 0.73855	valid_0's l2: 0.155556
[19]	valid_0's 

In [72]:
gbm.feature_importance()

array([ 10,   2,  36,  20,  36,  66,  21,  21,  21,   1,  62,   0,  51,
         0,  98,   2,  14,   0,   9,   1,   0,   2,   2,  25,   2, 121,
        41,  83,  17,  20,  96,   1,   1,   0,   0,  31,   6,   1,   8])

In [73]:
gbm.feature_importance("gain")

array([ 4959.7851181 ,    70.05719948,  3060.81740379,  1290.30080032,
        3068.40098953, 25257.18478775, 18178.44504356,  1466.6282177 ,
        1088.65239143,    27.51740074, 12205.77954483,     0.        ,
        5568.64053917,     0.        ,  7636.05537224,    66.50469971,
         641.64440918,     0.        ,   368.72410583,    54.0868988 ,
           0.        ,    75.97299957,   107.39220047,  1072.73800278,
         178.64580154,  6770.76226997,  2078.64819145,  4580.94059944,
        2108.68519592,  1755.55281258,  9561.59857368,    29.9470005 ,
          26.52330017,     0.        ,     0.        ,  1936.1554985 ,
         249.021101  ,    32.85879898,   365.58109283])

In [74]:
def ret_feat_impt(gbm):
    gain = gbm.feature_importance("gain").reshape(-1, 1) / sum(gbm.feature_importance("gain"))
    col = np.array(gbm.feature_name()).reshape(-1, 1)
    return sorted(np.column_stack((col, gain)),key=lambda x: x[1],reverse=True)
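
A caveat about `ret_feat_impt`: `np.column_stack` on a string column and a float column produces an array of strings, so the sort key compares text rather than numbers. That can misplace very small shares printed in scientific notation (for example, the string `'1e-05'` sorts above `'0.2'`). A sketch that keeps the shares as floats is shown below; `rank_by_gain` and its inputs are made up, and with a real booster the inputs would come from `gbm.feature_name()` and `gbm.feature_importance("gain")`:

```python
def rank_by_gain(names, gains):
    """Return (name, share_of_total_gain) pairs, largest share first, keeping shares as floats."""
    total = float(sum(gains))
    shares = [(name, gain / total) for name, gain in zip(names, gains)]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Made-up names and gains, including one very small value.
ranked = rank_by_gain(["I1", "I2", "C1"], [30.0, 0.001, 69.999])
print(ranked)
```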

In [75]:
ret_feat_impt(gbm)

[array(['I6', '0.21784656445602604'], dtype='<U32'),
 array(['I7', '0.15679149648593121'], dtype='<U32'),
 array(['I11', '0.10527646539762159'], dtype='<U32'),
 array(['C18', '0.08247005426333529'], dtype='<U32'),
 array(['C2', '0.0658619890861652'], dtype='<U32'),
 array(['C13', '0.058398721459156325'], dtype='<U32'),
 array(['I13', '0.048030262293402785'], dtype='<U32'),
 array(['I1', '0.042778803635427326'], dtype='<U32'),
 array(['C15', '0.039511219478802255'], dtype='<U32'),
 array(['I5', '0.026465364986651786'], dtype='<U32'),
 array(['I3', '0.026399955555101454'], dtype='<U32'),
 array(['C16', '0.018187689139205654'], dtype='<U32'),
 array(['C14', '0.017928615996830218'], dtype='<U32'),
 array(['C23', '0.016699597645041024'], dtype='<U32'),
 array(['C17', '0.015141875555598093'], dtype='<U32'),
 array(['I8', '0.01264986265272704'], dtype='<U32'),
 array(['I4', '0.011129015320886438'], dtype='<U32'),
 array(['I9', '0.009389771083105599'], dtype='<U32'),
 array(['C11', '0.00925250

In [76]:
dump = gbm.dump_model()

In [77]:
save_params_with_name((gbm, dump), 'gbm_dump')

In [78]:
gbm, dump = load_params_with_name('gbm_dump')

In [79]:
def generat_lgb2fm_data(outdir, gbm, dump, tr_path, va_path, te_path, _sep = '\t'):
    with open(os.path.join(outdir, 'train_lgb2fm.txt'), 'w') as out_train:
        with open(os.path.join(outdir, 'valid_lgb2fm.txt'), 'w') as out_valid:
            with open(os.path.join(outdir, 'test_lgb2fm.txt'), 'w') as out_test:
                df_train_ = pd.read_csv(tr_path, header=None, sep=_sep)
                df_valid_ = pd.read_csv(va_path, header=None, sep=_sep)
                df_test_= pd.read_csv(te_path, header=None, sep=_sep)

                y_train_ = df_train_[0].values
                y_valid_ = df_valid_[0].values                

                X_train_ = df_train_.drop(0, axis=1).values
                X_valid_ = df_valid_.drop(0, axis=1).values
                X_test_= df_test_.values
   
                train_leaves= gbm.predict(X_train_, num_iteration=gbm.best_iteration, pred_leaf=True)
                valid_leaves= gbm.predict(X_valid_, num_iteration=gbm.best_iteration, pred_leaf=True)
                test_leaves= gbm.predict(X_test_, num_iteration=gbm.best_iteration, pred_leaf=True)

                tree_info = dump['tree_info']
                tree_counts = len(tree_info)
                for i in range(tree_counts):
                    train_leaves[:, i] = train_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
                    valid_leaves[:, i] = valid_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
                    test_leaves[:, i] = test_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
#                     print(train_leaves[:, i])
#                     print(tree_info[i]['num_leaves'])

                for idx in range(len(y_train_)):            
                    out_train.write((str(y_train_[idx]) + '\t'))
                    out_train.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii,val in enumerate(train_leaves[idx]) if float(val) != 0 ]) + '\n')
                    
                for idx in range(len(y_valid_)):                   
                    out_valid.write((str(y_valid_[idx]) + '\t'))
                    out_valid.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii,val in enumerate(valid_leaves[idx]) if float(val) != 0 ]) + '\n')
                    
                for idx in range(len(X_test_)):                   
                    out_test.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii,val in enumerate(test_leaves[idx]) if float(val) != 0 ]) + '\n')

In [80]:
generat_lgb2fm_data('./data', gbm, dump, './data/train_lgb.txt', './data/valid_lgb.txt', './data/test_lgb.txt', '\t')
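
The function above writes each entry as `tree_index:leaf_id` and shifts leaf ids by `num_leaves * i`, which is exact only when every tree has the same number of leaves. As a hedged alternative (not the notebook's method), the sketch below builds a one-hot style libFM line, where each global leaf id is a feature index with value 1 and the per-tree offset is the cumulative leaf count of the preceding trees. The helper name `leaves_to_libfm_lines` and the demo values are hypothetical.

```python
import numpy as np

def leaves_to_libfm_lines(leaves, leaves_per_tree, labels=None, sep='\t'):
    """Turn a (n_samples, n_trees) matrix of per-tree leaf indices into text lines.

    leaves_per_tree[i] is the leaf count of tree i
    (for example tree_info[i]['num_leaves'] from a LightGBM model dump).
    """
    leaves = np.asarray(leaves, dtype=np.int64)
    counts = np.asarray(leaves_per_tree, dtype=np.int64)
    # Offset of tree i = total number of leaves in trees 0..i-1.
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))
    global_ids = leaves + offsets  # broadcasts the offsets over all rows
    lines = []
    for row_idx, row in enumerate(global_ids):
        # One active feature per tree, each with value 1.
        line = sep.join('{}:1'.format(int(j)) for j in row)
        if labels is not None:
            line = '{}{}{}'.format(labels[row_idx], sep, line)
        lines.append(line)
    return lines

# Two samples, two trees with 4 and 3 leaves respectively.
demo_leaves = [[0, 2], [3, 1]]
demo_lines = leaves_to_libfm_lines(demo_leaves, leaves_per_tree=[4, 3], labels=[1, 0])
for line in demo_lines:
    print(line)
```

With this layout the second tree's leaves occupy indices 4 to 6, so leaf 2 of tree 1 becomes feature 6 and no two trees share an index.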

In [82]:
cmd = "kaggle_criteo_ctr_challenge-/libfm/libfm/bin/libFM -task c -train ./data/train_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 64 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -save_model fm_model"
os.popen(cmd).readlines()

[]

In [83]:
cmd = "kaggle_criteo_ctr_challenge-/libfm/libfm/bin/libFM -task c -train ./data/train_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 32 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -load_model fm_model -train_off true -prefix tr"
os.popen(cmd).readlines()

[]

In [84]:
cmd = "kaggle_criteo_ctr_challenge-/libfm/libfm/bin/libFM -task c -train ./data/valid_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 32 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -load_model fm_model -train_off true -prefix va"
os.popen(cmd).readlines()


[]

In [85]:
cmd = "kaggle_criteo_ctr_challenge-/libfm/libfm/bin/libFM -task c -train ./data/test_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 32 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -load_model fm_model -train_off true -prefix te -test2predict true"
os.popen(cmd).readlines()

[]

In [86]:
embed_dim = 32
sparse_max = 30000 # sparse_feature_dim = 117568
sparse_dim = 26
dense_dim = 13
out_dim = 400

In [87]:
def get_batches(Xs, ys, batch_size):
    for start in range(0, len(Xs), batch_size):
        end = min(start + batch_size, len(Xs))
        yield Xs[start:end], ys[start:end]

In [88]:
def get_batches_downsample(Xs, ys, batch_size):
    ind_0 = ys==0
    ind_1 = ys==1
    Xs_0 = Xs[ind_0]
    ys_0 = ys[ind_0]
    Xs_1 = Xs[ind_1]
    ys_1 = ys[ind_1]
    sampling_ind = np.random.permutation(Xs_0.shape[0])[:Xs_1.shape[0]]
    Xs_0_sampling = Xs_0[sampling_ind]
    ys_0_sampling = ys_0[sampling_ind]
    Xs_downsampled = np.concatenate((Xs_0_sampling, Xs_1))
    ys_downsampled = np.concatenate((ys_0_sampling, ys_1))
    downsampled_ind = np.random.permutation(Xs_downsampled.shape[0])
    Xs_downsampled = Xs_downsampled[downsampled_ind]
    ys_downsampled = ys_downsampled[downsampled_ind]
    for start in range(0, len(Xs_downsampled), batch_size):
        end = min(start + batch_size, len(Xs_downsampled))
        yield Xs_downsampled[start:end], ys_downsampled[start:end]
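
To make the effect of the downsampling step concrete, here is a minimal sketch of the same idea on toy data: keep every positive row, draw an equally sized random subset of negatives, then shuffle. The helper name `balanced_downsample`, the seed, and the demo arrays are assumptions for illustration; it returns the arrays directly instead of yielding batches.

```python
import numpy as np

def balanced_downsample(Xs, ys, seed=0):
    """Keep all positive rows plus an equally sized random sample of negative rows."""
    rng = np.random.default_rng(seed)
    neg_idx = np.flatnonzero(ys == 0)
    pos_idx = np.flatnonzero(ys == 1)
    # Randomly pick as many negatives as there are positives.
    kept_neg = rng.permutation(neg_idx)[:len(pos_idx)]
    # Shuffle so positives and negatives are interleaved.
    keep = rng.permutation(np.concatenate((kept_neg, pos_idx)))
    return Xs[keep], ys[keep]

# Toy data: 3 positives and 17 negatives.
demo_X = np.arange(40).reshape(20, 2)
demo_y = np.array([1] * 3 + [0] * 17)
bal_X, bal_y = balanced_downsample(demo_X, demo_y)
print(bal_X.shape, int(bal_y.sum()), len(bal_y))
```

The balanced set holds twice as many rows as there are positives, which is why the training loop later computes the batch count from `len(train_y[train_y==1])*2`.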

In [89]:
import datetime
import os
import time

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import metrics as sk_metrics
from sklearn.metrics import log_loss
from sklearn.model_selection import learning_curve, train_test_split
from tensorflow import keras
from tensorflow.python.ops import summary_ops_v2

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

MODEL_DIR = "./models"


class ctr_network(object):
    def __init__(self, batch_size=32):
        self.batch_size = batch_size
        self.best_loss = 9999

        self.losses = {'train': [], 'test': []}
        self.pred_lst = []
        self.test_y_lst = []
        
        # Define the inputs
        dense_input = tf.keras.layers.Input(shape=(dense_dim,), name='dense_input')
        sparse_input = tf.keras.layers.Input(shape=(sparse_dim,), name='sparse_input')
        FFM_input = tf.keras.layers.Input(shape=(1,), name='FFM_input')
        FM_input = tf.keras.layers.Input(shape=(1,), name='FM_input')

        # Categorical features: look up their embedding vectors in the embedding layer
        sparse_embed_layer = tf.keras.layers.Embedding(sparse_max, embed_dim, input_length=sparse_dim)(sparse_input)
        sparse_embed_layer = tf.keras.layers.Reshape([sparse_dim * embed_dim])(sparse_embed_layer)

        # Numerical features: concatenate with the embedding vectors, then pass through three fully connected layers
        input_combine_layer = tf.keras.layers.concatenate([dense_input, sparse_embed_layer])  # (?, 845 = 832 + 13)
        fc1_layer = tf.keras.layers.Dense(out_dim, name="fc1_layer", activation='relu')(input_combine_layer)
        fc2_layer = tf.keras.layers.Dense(out_dim, name="fc2_layer", activation='relu')(fc1_layer)
        fc3_layer = tf.keras.layers.Dense(out_dim, name="fc3_layer", activation='relu')(fc2_layer)

        ffm_fc_layer = tf.keras.layers.Dense(1, name="ffm_fc_layer")(FFM_input)
        fm_fc_layer = tf.keras.layers.Dense(1, name="fm_fc_layer")(FM_input)
        feature_combine_layer = tf.keras.layers.concatenate([ffm_fc_layer, fm_fc_layer, fc3_layer], 1)  # (?, 402)

        logits_output = tf.keras.layers.Dense(1, name="logits_layer", activation='sigmoid')(feature_combine_layer)

        self.model = tf.keras.Model(inputs=[dense_input, sparse_input, FFM_input, FM_input], outputs=[logits_output])
        self.model.summary()

        self.optimizer = tf.compat.v1.train.FtrlOptimizer(0.01)  # tf.keras.optimizers.Adam(0.01)
        # Binary cross-entropy is the log loss for binary labels.
        self.ComputeLoss = tf.keras.losses.BinaryCrossentropy()

        if tf.io.gfile.exists(MODEL_DIR):
            #             print('Removing existing model dir: {}'.format(MODEL_DIR))
            #             tf.io.gfile.rmtree(MODEL_DIR)
            pass
        else:
            tf.io.gfile.makedirs(MODEL_DIR)

        train_dir = os.path.join(MODEL_DIR, 'summaries', 'train')
        test_dir = os.path.join(MODEL_DIR, 'summaries', 'eval')

#         self.train_summary_writer = summary_ops_v2.create_file_writer(train_dir, flush_millis=10000)
#         self.test_summary_writer = summary_ops_v2.create_file_writer(test_dir, flush_millis=10000, name='test')

        checkpoint_dir = os.path.join(MODEL_DIR, 'checkpoints')
        self.checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt')
        self.checkpoint = tf.train.Checkpoint(model=self.model, optimizer=self.optimizer)

        # Restore variables on creation if a checkpoint exists.
        self.checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

    def compute_metrics(self, labels, pred):
        correct_prediction = tf.equal(tf.keras.backend.cast(pred > 0.5, 'float32'), labels)
        accuracy = tf.reduce_mean(tf.keras.backend.cast(correct_prediction, 'float32'), name="accuracy")
        return accuracy  

    @tf.function
    def train_step(self, x, y):
        # Record the operations used to compute the loss, so that the gradient
        # of the loss with respect to the variables can be computed.
        metrics = 0
        with tf.GradientTape() as tape:
            pred = self.model([x[0],
                               x[1],
                               x[2],
                               x[3]], training=True)
            loss = self.ComputeLoss(y, pred)
            metrics = self.compute_metrics(y, pred)
        grads = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
        return loss, metrics, pred

    def training(self, train_dataset, test_dataset, downsample_flg=True, epochs=1, log_freq=50):

        train_X, train_y = train_dataset

        for epoch_i in range(epochs):
            if downsample_flg:
                train_batches = get_batches_downsample(train_X, train_y, self.batch_size)
                batch_num = (len(train_y[train_y==1])*2 // self.batch_size)
            else:
                train_batches = get_batches(train_X, train_y, self.batch_size)
                batch_num = len(train_X) // self.batch_size

            train_start = time.time()
#             with self.train_summary_writer.as_default():
            if True:
                start = time.time()
                # Metrics are stateful. They accumulate values and return a cumulative
                # result when you call .result(). Clear accumulated values with .reset_states()
                avg_loss = tf.keras.metrics.Mean('loss', dtype=tf.float32)
                avg_acc = tf.keras.metrics.Mean('acc', dtype=tf.float32)
                avg_auc = tf.keras.metrics.Mean('auc', dtype=tf.float32)

                # Datasets can be iterated over like any other Python iterable.
                for batch_i in range(batch_num):
                    x, y = next(train_batches)
                    if len(x) < self.batch_size:
                        break
                    
                    loss, metrics, pred = self.train_step([x.take([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 1),
                               x.take(
                                   [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
                                    36, 37, 38, 39, 40], 1),
                               np.reshape(x.take(0, 1), [self.batch_size, 1]),
                               np.reshape(x.take(1, 1), [self.batch_size, 1])], np.expand_dims(y, 1))
                    avg_loss(loss)
                    avg_acc(metrics)

                    prediction = tf.reshape(pred, y.shape)
                    self.losses['train'].append(loss)

                    # AUC is undefined unless both classes appear in the batch.
                    if len(np.unique(y)) == 2:
                        auc = sk_metrics.roc_auc_score(y, prediction)
                    else:
                        auc = -1

                    avg_auc(auc)
                    if tf.equal((epoch_i * (batch_num) + batch_i) % log_freq, 0):
#                         summary_ops_v2.scalar('loss', avg_loss.result(), step=self.optimizer.iterations)
                        #                         summary_ops_v2.scalar('mae', self.ComputeMetrics.result(), step=self.optimizer.iterations)
#                         summary_ops_v2.scalar('acc', avg_acc.result(), step=self.optimizer.iterations)

                        rate = log_freq / (time.time() - start)
                        
                        print('Epoch {:>3} Batch {:>4}/{} Loss: {:0.6f} acc: {:0.6f} auc = {} ({} steps/sec)'.format(
                            epoch_i, batch_i, batch_num, avg_loss.result(), (avg_acc.result()), avg_auc.result(), rate))

                        avg_auc.reset_states()
                        avg_loss.reset_states()
                        
                        avg_acc.reset_states()
                        start = time.time()

            train_end = time.time()
            print('\nTrain time for epoch #{} : {}'.format(epoch_i + 1, train_end - train_start))
#             with self.test_summary_writer.as_default():
            self.testing(test_dataset)
            # self.checkpoint.save(self.checkpoint_prefix)
        self.export_path = os.path.join(MODEL_DIR, 'export')
        tf.saved_model.save(self.model, self.export_path)

    def testing(self, test_dataset):
        """Evaluate the model on `test_dataset`, given as a (features, labels) pair."""
        test_X, test_y = test_dataset
        test_batches = get_batches(test_X, test_y, self.batch_size)

        avg_loss = tf.keras.metrics.Mean('loss', dtype=tf.float32)
        avg_acc = tf.keras.metrics.Mean('acc', dtype=tf.float32)
        avg_auc = tf.keras.metrics.Mean('auc', dtype=tf.float32)
        avg_prediction = tf.keras.metrics.Mean('prediction', dtype=tf.float32)

        self.pred_lst=[]
        self.test_y_lst=[]
        
        batch_num = (len(test_X) // self.batch_size)
        for batch_i in range(batch_num):
            x, y = next(test_batches)
            if len(x) < self.batch_size:
                break
            
            pred = self.model([x.take([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 1),
                               x.take(
                                   [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
                                    36, 37, 38, 39, 40], 1),
                               np.reshape(x.take(0, 1), [self.batch_size, 1]),
                               np.reshape(x.take(1, 1), [self.batch_size, 1])], training=False)
            test_loss = self.ComputeLoss(np.expand_dims(y, 1), pred)
            avg_loss(test_loss)
            acc = self.compute_metrics(np.expand_dims(y, 1), pred)
            avg_acc(acc)

            # Record the test loss and predictions
            prediction = tf.reshape(pred, y.shape)
            avg_prediction(prediction)
            self.losses['test'].append(test_loss)

            self.pred_lst.append(prediction)
            self.test_y_lst.append(y)

            # AUC is undefined unless both classes appear in the batch.
            if len(np.unique(y)) == 2:
                auc = sk_metrics.roc_auc_score(y, prediction)
            else:
                auc = -1
            avg_auc(auc)

        self.pred_lst = np.concatenate([val for val in self.pred_lst])
        self.test_y_lst = np.concatenate([val for val in self.test_y_lst])
        print('Model test set loss: {:0.6f}  acc: {:0.6f}  auc = {} prediction = {}'.format(
            avg_loss.result(), avg_acc.result(), avg_auc.result(), avg_prediction.result()))
        print(sk_metrics.classification_report(self.test_y_lst, tf.keras.backend.cast((self.pred_lst) > 0.5, 'float32')))
#         summary_ops_v2.scalar('loss', avg_loss.result(), step=step_num)
        #         summary_ops_v2.scalar('mae', self.ComputeMetrics.result(), step=step_num)
#         summary_ops_v2.scalar('acc', avg_acc.result(), step=step_num)

        if avg_loss.result() < self.best_loss:
            self.best_loss = avg_loss.result()
            print("best loss = {}".format(self.best_loss))
            self.checkpoint.save(self.checkpoint_prefix)

    def predict_click(self, x, axis=0):
        # Column layout: 0 = FFM output, 1 = FM output, 2-14 = dense features, 15-40 = sparse features.
        dense_cols = list(range(2, 15))
        sparse_cols = list(range(15, 41))
        # axis=0: x is a single sample vector; axis=1: x is a 2-D batch of samples.
        n_rows = 1 if axis == 0 else len(x)
        clicked = self.model([np.reshape(x.take(dense_cols, axis), [n_rows, dense_dim]),
                              np.reshape(x.take(sparse_cols, axis), [n_rows, sparse_dim]),
                              np.reshape(x.take(0, axis), [n_rows, 1]),
                              np.reshape(x.take(1, axis), [n_rows, 1])])

        return np.int32(np.array(clicked) > 0.5)

In [90]:
# Number of Epochs
num_epochs = 1
# Batch Size
batch_size = 32

# Learning Rate
learning_rate = 0.01
# Show stats for every n number of batches
show_every_n_batches = 25

save_dir = './save'

ffm_tr_out_path = './tr_ffm.out.logit'
ffm_va_out_path = './va_ffm.out.logit'
fm_tr_out_path = './tr.fm.logits'
fm_va_out_path = './va.fm.logits'
train_path = './data/train.txt'
valid_path = './data/valid.txt'

In [91]:
ffm_train = pd.read_csv(ffm_tr_out_path, header=None)    
ffm_train = ffm_train[0].values

ffm_valid = pd.read_csv(ffm_va_out_path, header=None)    
ffm_valid = ffm_valid[0].values

FileNotFoundError: ignored