<a href="https://colab.research.google.com/github/dk-wei/recommendation-algo-implementation/blob/main/DeepCTR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DeepCTR

DeepCTR is an **easy-to-use, modular** and **extensible** package of deep-learning-based CTR models, along with many core component layers that can be used to build custom models easily. You can use any complex model with `model.fit()` and `model.predict()`.

- Provides a `tf.keras.Model`-like interface for **quick experiments**.
- Provides a TensorFlow Estimator interface for **large-scale data** and **distributed training**.
- Compatible with both TensorFlow 1.x and TensorFlow 2.x.

In [1]:
!git clone https://github.com/shenweichen/DeepCTR.git

Cloning into 'DeepCTR'...
remote: Enumerating objects: 2862, done.
remote: Counting objects: 100% (523/523), done.
remote: Compressing objects: 100% (198/198), done.
remote: Total 2862 (delta 365), reused 423 (delta 322), pack-reused 2339
Receiving objects: 100% (2862/2862), 6.77 MiB | 18.83 MiB/s, done.
Resolving deltas: 100% (2108/2108), done.


In [2]:
!pip install deepctr[gpu]

Collecting deepctr[gpu]
  Downloading https://files.pythonhosted.org/packages/50/2c/dd9dd105e366d80328fbbf1312d2623d41c8bfbd56ed14390b6c59d6719b/deepctr-0.8.6-py3-none-any.whl (128kB)
     |████████████████████████████████| 133kB 4.4MB/s 
Collecting h5py==2.10.0
  Downloading https://files.pythonhosted.org/packages/3f/c0/abde58b837e066bca19a3f7332d9d0493521d7dd6b48248451a9e3fe2214/h5py-2.10.0-cp37-cp37m-manylinux1_x86_64.whl (2.9MB)
     |████████████████████████████████| 2.9MB 29.8MB/s 
Collecting tensorflow-gpu!=1.7.*,!=1.8.*,>=1.4.0; extra == "gpu"
  Downloading https://files.pythonhosted.org/packages/1d/a2/5ccf0a418eb22e0a2ae9edc1e7f5456d0a4b8b49524572897564b4030a9b/tensorflow_gpu-2.5.0-cp37-cp37m-manylinux2010_x86_64.whl (454.3MB)
     |████████████████████████████████| 454.3MB 40kB/s 
ERROR: tensorflow 2.5.0 has requirement h5py~=3.1.0, but you'll have h5py 2.10.0 which is incompatible.
ERROR: tensorflow-gpu 2.5.0 has requirement h5p

![](https://pic3.zhimg.com/80/v2-dd98a58d2676f20ded7d7b0c61e88fa2_1440w.jpg)

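DeepCTR is designed mainly for people who are interested in deep learning and CTR prediction algorithms, so that they can use the package to:

1. View the various models from a unified perspective
2. Run simple comparison experiments quickly
3. Build new models quickly from existing components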

# A Unified Perspective

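DeepCTR abstracts and summarizes the structure of existing deep-learning-based CTR prediction models and follows a modular design: each module is highly reusable, and the modules are independent of one another. By the function of their internal components, deep-learning-based CTR prediction models can be divided into four modules: the input module, the embedding module, the feature extraction module, and the prediction output module.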

![](https://pic1.zhimg.com/80/v2-392784585d6238db7f20744e8e98c5f4_1440w.jpg)
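
To make the four modules concrete, here is a minimal NumPy sketch of one forward pass through them. It is an illustration only, not DeepCTR's implementation: the ids, vocabulary sizes, layer widths, and random weights are all hypothetical, and a real model would learn the weights by training.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Input module: one sample with two categorical ids and one dense value.
sparse_ids = np.array([3, 1])      # hypothetical ids for two categorical fields
dense_values = np.array([0.25])    # one already-normalized numerical feature

# --- Embedding module: one lookup table per categorical field.
vocab_sizes, embedding_dim = [5, 4], 4
tables = [rng.normal(size=(v, embedding_dim)) for v in vocab_sizes]
embeddings = [table[i] for table, i in zip(tables, sparse_ids)]

# --- Feature extraction module: concatenate, then apply one ReLU layer.
features = np.concatenate(embeddings + [dense_values])   # shape (9,)
w_hidden = rng.normal(size=(features.size, 8))
hidden = np.maximum(features @ w_hidden, 0.0)

# --- Prediction output module: a linear layer followed by a sigmoid.
w_out = rng.normal(size=8)
ctr = 1.0 / (1.0 + np.exp(-(hidden @ w_out)))
print(features.shape, float(ctr))
```

The embedding step is just a row lookup, which is why each categorical value must first be turned into an integer id.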

- Input & Embedding    
The data in a CTR estimation task usually includes **highly sparse, high-cardinality categorical features** and some dense numerical features.

  Since DNNs are good at handling **dense numerical features**, we usually map the **sparse categorical features** to dense numerical vectors through an embedding technique.

  For **numerical features**, we usually apply **discretization** or **normalization**.

- Feature Extractor    
The low-order extractor learns feature interactions through products between vectors. The **Factorization Machine (FM)** and its variants, such as FFM, are widely used to learn low-order feature interactions.

  The high-order extractor learns feature combinations through complex neural network functions such as an MLP or a Cross Net.
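
The "product between vectors" used by the low-order extractor can be sketched as follows. This is a minimal NumPy illustration of the standard FM second-order term, assuming every active feature has value 1 (as with one-hot categorical fields); the function names and the toy embeddings are hypothetical and are not taken from DeepCTR. It compares the usual sum-of-squares shortcut against the naive sum over all pairs.

```python
import numpy as np

def fm_second_order(embeddings):
    """Sum of pairwise dot products <v_i, v_j> for i < j, computed in O(n * k)."""
    sum_then_square = embeddings.sum(axis=0) ** 2
    square_then_sum = (embeddings ** 2).sum(axis=0)
    return 0.5 * float((sum_then_square - square_then_sum).sum())

def fm_second_order_naive(embeddings):
    """Same quantity, computed by looping over every pair (O(n^2 * k))."""
    n = len(embeddings)
    return float(sum(embeddings[i] @ embeddings[j]
                     for i in range(n) for j in range(i + 1, n)))

# Three fields, each represented by a 2-dimensional embedding vector.
v = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(fm_second_order(v), fm_second_order_naive(v))  # 67.0 67.0
```

The shortcut matters because the number of pairs grows quadratically with the number of fields, while the shortcut stays linear.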

## DeepFM

![](http://fancyerii.github.io/img/ctr/deepfm-1.png)

# Classification: Criteo

In [3]:
import pandas as pd
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

from deepctr.models import DeepFM
from deepctr.feature_column import SparseFeat, DenseFeat, get_feature_names

In [4]:
avazu_data = pd.read_csv('https://raw.githubusercontent.com/shenweichen/DeepCTR/master/examples/avazu_sample.txt')

avazu_data.head()

Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
0,1000009418151094273,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,ddd2926e,44956a24,1,2,15706,320,50,1722,0,35,-1,79
1,10000169349117863715,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,96809ac8,711ee120,1,0,15704,320,50,1722,0,35,100084,79
2,10000371904215119486,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,b3cf8def,8a4875bd,1,0,15704,320,50,1722,0,35,100084,79
3,10000640724480838376,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,e8275b8f,6332421a,1,0,15706,320,50,1722,0,35,100084,79
4,10000679056417042096,0,14102100,1005,1,fe8cc448,9166c161,0569f928,ecad2386,7801e8d9,07d7df22,a99f214a,9644d0bf,779d90c2,1,0,18993,320,50,2161,0,35,-1,157


Criteo data fields:
- Label - Target variable that indicates whether an ad was clicked (1) or not (0).
- I1-I13 - A total of 13 columns of dense numerical (integer) features (mostly *count features*).
- C1-C26 - A total of 26 columns of sparse categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes. The semantics of these features are undisclosed.

When a value is missing, the field is empty.
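
As a plain-Python illustration of what "empty field" means in practice, the sketch below parses a tiny, made-up two-row excerpt and fills the gaps the same way the next cell does with pandas (`0` for dense columns, `'-1'` for sparse columns). The excerpt and the column subset are hypothetical.

```python
import csv
import io

# Hypothetical excerpt: row 1 has no I1 value, row 2 has no C1 value.
raw = "label,I1,C1\n0,,05db9164\n1,3,\n"
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    # Dense (numerical) column: an empty field becomes 0.
    row["I1"] = float(row["I1"]) if row["I1"] != "" else 0.0
    # Sparse (categorical) column: an empty field becomes its own category '-1'.
    row["C1"] = row["C1"] if row["C1"] != "" else "-1"

print(rows)
```

Treating a missing categorical value as its own category lets the model learn an embedding for "unknown" instead of discarding the row.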

In [5]:
data = pd.read_csv('https://raw.githubusercontent.com/shenweichen/DeepCTR/master/examples/criteo_sample.txt')
sparse_features = ['C' + str(i) for i in range(1, 27)]    # sparse features are generally categorical features, especially high-cardinality ones
dense_features = ['I' + str(i) for i in range(1, 14)]     # dense features are generally numerical features, including count features
 
data[sparse_features] = data[sparse_features].fillna('-1', )
data[dense_features] = data[dense_features].fillna(0, )
target = ['label']

In [6]:
data.head()

Unnamed: 0,label,I1,I2,I3,I4,I5,I6,I7,I8,I9,I10,I11,I12,I13,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26
0,0,0.0,3,260.0,0.0,17668.0,0.0,0.0,33.0,0.0,0.0,0.0,0.0,0.0,05db9164,08d6d899,9143c832,f56b7dd5,25c83c98,7e0ccccf,df5c2d18,0b153874,a73ee510,8f48ce11,a7b606c4,ae1bb660,eae197fd,b28479f6,bfef54b3,bad5ee18,e5ba7672,87c6f83c,-1,-1,0429f84b,-1,3a171ecb,c0d61a5c,-1,-1
1,0,0.0,-1,19.0,35.0,30251.0,247.0,1.0,35.0,160.0,0.0,1.0,0.0,35.0,68fd1e64,04e09220,95e13fd4,a1e6a194,25c83c98,fe6b92e5,f819e175,062b5529,a73ee510,ab9456b4,6153cf57,8882c6cd,769a1844,b28479f6,69f825dd,23056e4f,d4bb7bd8,6fc84bfb,-1,-1,5155d8a3,-1,be7c41b4,ded4aac9,-1,-1
2,0,0.0,0,2.0,12.0,2013.0,164.0,6.0,35.0,523.0,0.0,3.0,0.0,18.0,05db9164,38a947a1,3f55fb72,5de245c7,30903e74,7e0ccccf,b72ec13d,1f89b562,a73ee510,acce978c,3547565f,a5b0521a,12880350,b28479f6,c12fc269,95a8919c,e5ba7672,675c9258,-1,-1,2e01979f,-1,bcdee96c,6d5d1302,-1,-1
3,0,0.0,13,1.0,4.0,16836.0,200.0,5.0,4.0,29.0,0.0,2.0,0.0,4.0,05db9164,8084ee93,02cf9876,c18be181,25c83c98,-1,e14874c9,0b153874,7cc72ec2,2462946f,636405ac,8fe001f4,31b42deb,07d13a8f,422c8577,36103458,e5ba7672,52e44668,-1,-1,e587c466,-1,32c7478e,3b183c5c,-1,-1
4,0,0.0,0,104.0,27.0,1990.0,142.0,4.0,32.0,37.0,0.0,1.0,0.0,27.0,05db9164,207b2d81,5d076085,862b5ba0,25c83c98,fbad5c96,17c22666,0b153874,a73ee510,534fc986,feb49a68,f24b551c,8978af5c,64c94865,32ec6582,b6d021e8,e5ba7672,25c88e42,21ddcdc9,b1252a9d,0e8585d2,-1,32c7478e,0d4a6d1a,001f3601,92c878de


In [7]:
# 1. Label-encode the sparse features and apply a simple transformation to the dense features
# (label encoding + min-max scaling)

# Label encoding: map each feature's values to integers in 0 .. n_unique - 1
for feat in sparse_features:
    lbe = LabelEncoder()
    data[feat] = lbe.fit_transform(data[feat])
mms = MinMaxScaler(feature_range=(0, 1))
data[dense_features] = mms.fit_transform(data[dense_features])

data.head()

Unnamed: 0,label,I1,I2,I3,I4,I5,I6,I7,I8,I9,I10,I11,I12,I13,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26
0,0,0.0,0.001332,0.092362,0.0,0.034825,0.0,0.0,0.673469,0.0,0.0,0.0,0.0,0.0,0,4,96,146,1,4,163,1,1,72,117,127,157,7,127,126,8,66,0,0,3,0,1,96,0,0
1,0,0.0,0.0,0.00675,0.402299,0.059628,0.117284,0.003322,0.714286,0.154739,0.0,0.03125,0.0,0.343137,11,1,98,98,1,6,179,0,1,89,58,97,79,7,72,26,7,52,0,0,47,0,7,112,0,0
2,0,0.0,0.000333,0.00071,0.137931,0.003968,0.077873,0.019934,0.714286,0.505803,0.0,0.09375,0.0,0.176471,0,18,39,52,3,4,140,2,1,93,31,122,16,7,129,97,8,49,0,0,25,0,6,53,0,0
3,0,0.0,0.004664,0.000355,0.045977,0.033185,0.094967,0.016611,0.081633,0.028046,0.0,0.0625,0.0,0.039216,0,45,7,117,1,0,164,1,0,20,61,104,36,1,43,43,8,37,0,0,156,0,0,32,0,0
4,0,0.0,0.000333,0.036945,0.310345,0.003922,0.067426,0.013289,0.653061,0.035783,0.0,0.03125,0.0,0.264706,0,11,59,77,1,5,18,1,1,45,171,162,96,4,36,121,8,14,5,3,9,0,0,5,1,47
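
For readers who want to see what the two transforms above do, here is a dependency-free sketch of the same ideas. It is not scikit-learn's code: `label_encode` and `min_max_scale` are hypothetical helper names, and edge-case handling (for example constant columns) may differ from `LabelEncoder` and `MinMaxScaler`.

```python
def label_encode(values):
    """Map each distinct value to an integer in 0 .. n_unique - 1 (sorted order)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def min_max_scale(values):
    """Rescale numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0   # avoid dividing by zero for constant columns
    return [(v - lo) / span for v in values]

codes, mapping = label_encode(["05db9164", "68fd1e64", "05db9164", "-1"])
scaled = min_max_scale([3.0, -1.0, 0.0, 13.0])
print(codes)   # [1, 2, 1, 0]
print(scaled)
```

Because the encoded ids are contiguous and start at 0, the number of distinct values can be used directly as the embedding table size.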


In [8]:
# 2. Count the unique values of each sparse field, and record the dense feature field names.
# For variable-length (multi-valued) sparse features, use VarLenSparseFeat instead.

fixlen_feature_columns = [SparseFeat(feat, vocabulary_size=data[feat].nunique(), embedding_dim=4)
                          for feat in sparse_features] + [DenseFeat(feat, 1)
                          for feat in dense_features]

dnn_feature_columns = fixlen_feature_columns
linear_feature_columns = fixlen_feature_columns

feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

In [9]:
#fixlen_feature_columns

In [10]:
# 3.generate input data for model

train, test = train_test_split(data, test_size=0.2, random_state=2018)
train_model_input = {name:train[name] for name in feature_names}
test_model_input = {name:test[name] for name in feature_names}

train_model_input['C1']

155    16
113     0
36     11
51     14
81     16
       ..
148    19
156     1
149    11
9       0
102    10
Name: C1, Length: 160, dtype: int64

In [11]:
#train_model_input

In [12]:
# 4.Define Model,train,predict and evaluate
model = DeepFM(linear_feature_columns, dnn_feature_columns, task='binary')
model.compile("adam", "binary_crossentropy",
              metrics=['binary_crossentropy'], )

In [13]:
history = model.fit(train_model_input, train[target].values,
                    batch_size=256, epochs=10, verbose=2, validation_split=0.2, )

pred_ans = model.predict(test_model_input, batch_size=256)

print("test LogLoss", round(log_loss(test[target].values, pred_ans), 4))
print("test AUC", round(roc_auc_score(test[target].values, pred_ans), 4))

Epoch 1/10
1/1 - 7s - loss: 0.7907 - binary_crossentropy: 0.7907 - val_loss: 0.7620 - val_binary_crossentropy: 0.7620
Epoch 2/10
1/1 - 0s - loss: 0.7648 - binary_crossentropy: 0.7648 - val_loss: 0.7469 - val_binary_crossentropy: 0.7469
Epoch 3/10
1/1 - 0s - loss: 0.7398 - binary_crossentropy: 0.7398 - val_loss: 0.7323 - val_binary_crossentropy: 0.7323
Epoch 4/10
1/1 - 0s - loss: 0.7158 - binary_crossentropy: 0.7158 - val_loss: 0.7182 - val_binary_crossentropy: 0.7182
Epoch 5/10
1/1 - 0s - loss: 0.6925 - binary_crossentropy: 0.6925 - val_loss: 0.7044 - val_binary_crossentropy: 0.7044
Epoch 6/10
1/1 - 0s - loss: 0.6700 - binary_crossentropy: 0.6700 - val_loss: 0.6909 - val_binary_crossentropy: 0.6909
Epoch 7/10
1/1 - 0s - loss: 0.6479 - binary_crossentropy: 0.6479 - val_loss: 0.6775 - val_binary_crossentropy: 0.6775
Epoch 8/10
1/1 - 0s - loss: 0.6262 - binary_crossentropy: 0.6261 - val_loss: 0.6644 - val_binary_crossentropy: 0.6644
Epoch 9/10
1/1 - 0s - loss: 0.6046 - binary_crossentropy
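
The two evaluation metrics printed by the cell above can be sketched in plain Python as follows. This is an illustrative re-implementation with hypothetical function names and toy inputs, not scikit-learn's code; `log_loss` and `roc_auc_score` remain the functions the notebook actually calls.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip so that log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def roc_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
print(round(binary_log_loss(labels, probs), 4), roc_auc(labels, probs))
```

Log loss rewards well-calibrated probabilities, while AUC only depends on how the predictions are ranked, which is why CTR work usually reports both.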

# Classification: Criteo with feature hashing on the fly

In [14]:
import pandas as pd
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

from deepctr.models import DeepFM
from deepctr.feature_column import SparseFeat, DenseFeat,get_feature_names

if __name__ == "__main__":
    #data = pd.read_csv('./criteo_sample.txt')
    data = pd.read_csv('https://raw.githubusercontent.com/shenweichen/DeepCTR/master/examples/criteo_sample.txt')
    
    sparse_features = ['C' + str(i) for i in range(1, 27)]
    dense_features = ['I' + str(i) for i in range(1, 14)]

    data[sparse_features] = data[sparse_features].fillna('-1', )
    data[dense_features] = data[dense_features].fillna(0, )
    target = ['label']

    # 1.do simple Transformation for dense features
    mms = MinMaxScaler(feature_range=(0, 1))
    data[dense_features] = mms.fit_transform(data[dense_features])

    # 2.set hashing space for each sparse field,and record dense feature field name

    fixlen_feature_columns = [SparseFeat(feat, vocabulary_size=1000,embedding_dim=4, use_hash=True, dtype='string')  # since the input is string
                              for feat in sparse_features] + [DenseFeat(feat, 1, )
                          for feat in dense_features]

    linear_feature_columns = fixlen_feature_columns
    dnn_feature_columns = fixlen_feature_columns
    feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns, )

    # 3.generate input data for model

    train, test = train_test_split(data, test_size=0.2, random_state=2020)

    train_model_input = {name:train[name] for name in feature_names}
    test_model_input = {name:test[name] for name in feature_names}


    # 4.Define Model,train,predict and evaluate
    model = DeepFM(linear_feature_columns,dnn_feature_columns, task='binary')
    model.compile("adam", "binary_crossentropy",
                  metrics=['binary_crossentropy'], )

    history = model.fit(train_model_input, train[target].values,
                        batch_size=256, epochs=10, verbose=2, validation_split=0.2, )
    pred_ans = model.predict(test_model_input, batch_size=256)
    print("test LogLoss", round(log_loss(test[target].values, pred_ans), 4))
    print("test AUC", round(roc_auc_score(test[target].values, pred_ans), 4))

Epoch 1/10
1/1 - 7s - loss: 0.7852 - binary_crossentropy: 0.7852 - val_loss: 0.7727 - val_binary_crossentropy: 0.7727
Epoch 2/10
1/1 - 0s - loss: 0.7621 - binary_crossentropy: 0.7621 - val_loss: 0.7580 - val_binary_crossentropy: 0.7579
Epoch 3/10
1/1 - 0s - loss: 0.7404 - binary_crossentropy: 0.7403 - val_loss: 0.7441 - val_binary_crossentropy: 0.7440
Epoch 4/10
1/1 - 0s - loss: 0.7197 - binary_crossentropy: 0.7197 - val_loss: 0.7309 - val_binary_crossentropy: 0.7309
Epoch 5/10
1/1 - 0s - loss: 0.6999 - binary_crossentropy: 0.6999 - val_loss: 0.7184 - val_binary_crossentropy: 0.7184
Epoch 6/10
1/1 - 0s - loss: 0.6807 - binary_crossentropy: 0.6807 - val_loss: 0.7064 - val_binary_crossentropy: 0.7064
Epoch 7/10
1/1 - 0s - loss: 0.6621 - binary_crossentropy: 0.6621 - val_loss: 0.6947 - val_binary_crossentropy: 0.6947
Epoch 8/10
1/1 - 0s - loss: 0.6438 - binary_crossentropy: 0.6438 - val_loss: 0.6834 - val_binary_crossentropy: 0.6834
Epoch 9/10
1/1 - 0s - loss: 0.6258 - binary_crossentropy
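
The idea behind `use_hash=True` with a fixed `vocabulary_size` can be sketched as a hash followed by a modulo. The snippet below is an assumption-laden illustration: it uses CRC32 from the standard library as a stand-in hash, whereas DeepCTR's own hashing layer may use a different hash function and may reserve some indices, so the bucket numbers here will not match the library's.

```python
import zlib

def hash_bucket(value, num_buckets):
    """Map a raw string to a bucket in 0 .. num_buckets - 1 with a stable hash."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets

raw_values = ["05db9164", "68fd1e64", "05db9164", "-1"]
buckets = [hash_bucket(v, 1000) for v in raw_values]
print(buckets)
```

Equal strings always land in the same bucket, so no fitted encoder is needed; the trade-off is that distinct strings can collide and then share one embedding, which is why the hashing space is usually chosen larger than the number of distinct values.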

# Regression: Movielens

In [15]:
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from deepctr.models import DeepFM
from deepctr.feature_column import SparseFeat,get_feature_names

if __name__ == "__main__":

    data = pd.read_csv("DeepCTR/examples/movielens_sample.txt")
    sparse_features = ["movie_id", "user_id",
                       "gender", "age", "occupation", "zip"]
    target = ['rating']

    # 1. Label-encode the sparse features (this dataset has no dense features)
    for feat in sparse_features:
        lbe = LabelEncoder()
        data[feat] = lbe.fit_transform(data[feat])
    # 2.count #unique features for each sparse field
    fixlen_feature_columns = [SparseFeat(feat, data[feat].max() + 1,embedding_dim=4)
                              for feat in sparse_features]
    linear_feature_columns = fixlen_feature_columns
    dnn_feature_columns = fixlen_feature_columns
    feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

    # 3.generate input data for model
    train, test = train_test_split(data, test_size=0.2, random_state=2020)
    train_model_input = {name:train[name].values for name in feature_names}
    test_model_input = {name:test[name].values for name in feature_names}

    # 4.Define Model,train,predict and evaluate
    model = DeepFM(linear_feature_columns, dnn_feature_columns, task='regression')
    model.compile("adam", "mse", metrics=['mse'], )

    history = model.fit(train_model_input, train[target].values,
                        batch_size=256, epochs=10, verbose=2, validation_split=0.2, )
    pred_ans = model.predict(test_model_input, batch_size=256)
    print("test MSE", round(mean_squared_error(
        test[target].values, pred_ans), 4))

Epoch 1/10
1/1 - 2s - loss: 13.4138 - mse: 13.4138 - val_loss: 15.6879 - val_mse: 15.6879
Epoch 2/10
1/1 - 0s - loss: 13.2782 - mse: 13.2782 - val_loss: 15.5574 - val_mse: 15.5574
Epoch 3/10
1/1 - 0s - loss: 13.1353 - mse: 13.1353 - val_loss: 15.4163 - val_mse: 15.4163
Epoch 4/10
1/1 - 0s - loss: 12.9816 - mse: 12.9816 - val_loss: 15.2659 - val_mse: 15.2659
Epoch 5/10
1/1 - 0s - loss: 12.8179 - mse: 12.8179 - val_loss: 15.1055 - val_mse: 15.1055
Epoch 6/10
1/1 - 0s - loss: 12.6436 - mse: 12.6436 - val_loss: 14.9349 - val_mse: 14.9349
Epoch 7/10
1/1 - 0s - loss: 12.4583 - mse: 12.4583 - val_loss: 14.7533 - val_mse: 14.7533
Epoch 8/10
1/1 - 0s - loss: 12.2610 - mse: 12.2610 - val_loss: 14.5601 - val_mse: 14.5601
Epoch 9/10
1/1 - 0s - loss: 12.0510 - mse: 12.0510 - val_loss: 14.3545 - val_mse: 14.3545
Epoch 10/10
1/1 - 0s - loss: 11.8273 - mse: 11.8273 - val_loss: 14.1355 - val_mse: 14.1355
test MSE 13.4891


# Multi-value Input : Movielens 

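That is, the case where a categorical feature takes multiple values per sample (for example, a movie that belongs to several genres).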
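
Before the full example, here is a small dependency-free sketch of the preprocessing a multi-valued field needs: split each string, assign integer ids starting at 1, and post-pad with 0 to a common length. The function name and the sample genre strings are hypothetical; the notebook cell below does the same job with its own `split` helper and Keras `pad_sequences`.

```python
def encode_multi_value(rows, sep="|"):
    """Encode separator-joined strings as equal-length integer lists (0 = padding)."""
    key2index = {}
    encoded = []
    for row in rows:
        ids = []
        for key in row.split(sep):
            # Ids start at 1 so that 0 stays reserved for padding.
            ids.append(key2index.setdefault(key, len(key2index) + 1))
        encoded.append(ids)
    max_len = max(len(ids) for ids in encoded)
    padded = [ids + [0] * (max_len - len(ids)) for ids in encoded]  # post-padding
    return padded, key2index

padded, key2index = encode_multi_value(["Comedy|Drama", "Drama", "Action|Comedy|Sci-Fi"])
print(padded)  # [[1, 2, 0], [2, 0, 0], [3, 1, 4]]
```

Since id 0 is reserved, the embedding table needs `len(key2index) + 1` rows.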

In [16]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

from deepctr.models import DeepFM
from deepctr.feature_column import SparseFeat, VarLenSparseFeat,get_feature_names


def split(x):
    key_ans = x.split('|')
    for key in key_ans:
        if key not in key2index:
            # Notice : input value 0 is a special "padding",so we do not use 0 to encode valid feature for sequence input
            key2index[key] = len(key2index) + 1
    return list(map(lambda x: key2index[x], key_ans))

if __name__ == "__main__":
    data = pd.read_csv("DeepCTR/examples/movielens_sample.txt")
    sparse_features = ["movie_id", "user_id",
                       "gender", "age", "occupation", "zip", ]
    target = ['rating']

    # 1.Label Encoding for sparse features,and process sequence features
    for feat in sparse_features:
        lbe = LabelEncoder()
        data[feat] = lbe.fit_transform(data[feat])
    # preprocess the sequence feature

    key2index = {}
    genres_list = list(map(split, data['genres'].values))
    genres_length = np.array(list(map(len, genres_list)))
    max_len = max(genres_length)
    # Notice : padding=`post`
    genres_list = pad_sequences(genres_list, maxlen=max_len, padding='post', )

    # 2.count #unique features for each sparse field and generate feature config for sequence feature

    fixlen_feature_columns = [SparseFeat(feat, data[feat].max() + 1,embedding_dim=4)
                        for feat in sparse_features]

    use_weighted_sequence = False
    if use_weighted_sequence:
        varlen_feature_columns = [VarLenSparseFeat(SparseFeat('genres',vocabulary_size=len(
            key2index) + 1,embedding_dim=4), maxlen= max_len, combiner='mean',weight_name='genres_weight')]  # Notice : value 0 is for padding for sequence input feature
    else:
        varlen_feature_columns = [VarLenSparseFeat(SparseFeat('genres',vocabulary_size= len(
            key2index) + 1,embedding_dim=4), maxlen=max_len, combiner='mean',weight_name=None)]  # Notice : value 0 is for padding for sequence input feature

    linear_feature_columns = fixlen_feature_columns + varlen_feature_columns
    dnn_feature_columns = fixlen_feature_columns + varlen_feature_columns

    feature_names = get_feature_names(linear_feature_columns+dnn_feature_columns)


    # 3.generate input data for model
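    model_input = {name: data[name] for name in feature_names}
    model_input["genres"] = genres_list
    # 'genres_weight' is only consumed when use_weighted_sequence is True;
    # otherwise Keras ignores this extra key (and emits a warning).
    model_input["genres_weight"] = np.random.randn(data.shape[0], max_len, 1)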


    # 4.Define Model,compile and train
    model = DeepFM(linear_feature_columns,dnn_feature_columns,task='regression')

    model.compile("adam", "mse", metrics=['mse'], )
    history = model.fit(model_input, data[target].values,
                        batch_size=256, epochs=10, verbose=2, validation_split=0.2, )

Epoch 1/10


1/1 - 3s - loss: 14.2996 - mse: 14.2996 - val_loss: 13.3741 - val_mse: 13.3741
Epoch 2/10
1/1 - 0s - loss: 14.1497 - mse: 14.1497 - val_loss: 13.2381 - val_mse: 13.2381
Epoch 3/10
1/1 - 0s - loss: 13.9874 - mse: 13.9874 - val_loss: 13.0924 - val_mse: 13.0924
Epoch 4/10
1/1 - 0s - loss: 13.8131 - mse: 13.8131 - val_loss: 12.9355 - val_mse: 12.9355
Epoch 5/10
1/1 - 0s - loss: 13.6255 - mse: 13.6255 - val_loss: 12.7666 - val_mse: 12.7666
Epoch 6/10
1/1 - 0s - loss: 13.4240 - mse: 13.4240 - val_loss: 12.5852 - val_mse: 12.5852
Epoch 7/10
1/1 - 0s - loss: 13.2078 - mse: 13.2078 - val_loss: 12.3907 - val_mse: 12.3907
Epoch 8/10
1/1 - 0s - loss: 12.9755 - mse: 12.9755 - val_loss: 12.1822 - val_mse: 12.1822
Epoch 9/10
1/1 - 0s - loss: 12.7261 - mse: 12.7261 - val_loss: 11.9591 - val_mse: 11.9591
Epoch 10/10
1/1 - 0s - loss: 12.4582 - mse: 12.4582 - val_loss: 11.7204 - val_mse: 11.7204
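
The `combiner='mean'` setting on a padded sequence can be understood as a masked average of the embeddings. The NumPy sketch below shows that idea under the assumption that id 0 marks padding; the function name and the toy embedding table are hypothetical, and DeepCTR's own pooling layer may differ in details.

```python
import numpy as np

def masked_mean_pool(ids, table):
    """Average the embeddings of non-padding ids (id 0 is treated as padding)."""
    mask = (ids != 0).astype(table.dtype)[..., None]   # (batch, maxlen, 1)
    summed = (table[ids] * mask).sum(axis=1)            # (batch, dim)
    counts = np.maximum(mask.sum(axis=1), 1.0)          # avoid division by zero
    return summed / counts

# Row 0 is the padding slot; its values must not leak into the average.
table = np.array([[9.0, 9.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
ids = np.array([[1, 2, 0], [3, 0, 0]])
pooled = masked_mean_pool(ids, table)
print(pooled)
```

Without the mask, padded positions would pull every average toward the padding row, so shorter genre lists would be represented differently from longer ones for no meaningful reason.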


# Multi-value Input : Movielens with feature hashing on the fly

In [17]:
import numpy as np
import pandas as pd
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

from deepctr.feature_column import SparseFeat, VarLenSparseFeat,get_feature_names
from deepctr.models import DeepFM

if __name__ == "__main__":
    data = pd.read_csv("DeepCTR/examples/movielens_sample.txt")
    sparse_features = ["movie_id", "user_id",
                       "gender", "age", "occupation", "zip", ]

    data[sparse_features] = data[sparse_features].astype(str)
    target = ['rating']

    # 1.Use hashing encoding on the fly for sparse features,and process sequence features

    genres_list = list(map(lambda x: x.split('|'), data['genres'].values))
    genres_length = np.array(list(map(len, genres_list)))
    max_len = max(genres_length)

    # Notice : padding=`post`
    genres_list = pad_sequences(genres_list, maxlen=max_len, padding='post', dtype=str, value=0)

    # 2.set hashing space for each sparse field and generate feature config for sequence feature

    fixlen_feature_columns = [SparseFeat(feat, data[feat].nunique() * 5, embedding_dim=4, use_hash=True, dtype='string')
                              for feat in sparse_features]
    varlen_feature_columns = [
        VarLenSparseFeat(SparseFeat('genres', vocabulary_size=100, embedding_dim=4, use_hash=True, dtype="string"),
                         maxlen=max_len, combiner='mean',
                         )]  # Notice : value 0 is for padding for sequence input feature
    linear_feature_columns = fixlen_feature_columns + varlen_feature_columns
    dnn_feature_columns = fixlen_feature_columns + varlen_feature_columns
    feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

    # 3.generate input data for model
    model_input = {name: data[name] for name in feature_names}
    model_input['genres'] = genres_list

    # 4.Define Model,compile and train
    model = DeepFM(linear_feature_columns, dnn_feature_columns, task='regression')

    model.compile("adam", "mse", metrics=['mse'], )
    history = model.fit(model_input, data[target].values,
                        batch_size=256, epochs=10, verbose=2, validation_split=0.2, )

Epoch 1/10
1/1 - 3s - loss: 14.2997 - mse: 14.2997 - val_loss: 13.3717 - val_mse: 13.3717
Epoch 2/10
1/1 - 0s - loss: 14.1511 - mse: 14.1511 - val_loss: 13.2344 - val_mse: 13.2344
Epoch 3/10
1/1 - 0s - loss: 13.9906 - mse: 13.9906 - val_loss: 13.0876 - val_mse: 13.0876
Epoch 4/10
1/1 - 0s - loss: 13.8194 - mse: 13.8194 - val_loss: 12.9316 - val_mse: 12.9316
Epoch 5/10
1/1 - 0s - loss: 13.6376 - mse: 13.6376 - val_loss: 12.7655 - val_mse: 12.7655
Epoch 6/10
1/1 - 0s - loss: 13.4446 - mse: 13.4446 - val_loss: 12.5887 - val_mse: 12.5887
Epoch 7/10
1/1 - 0s - loss: 13.2395 - mse: 13.2395 - val_loss: 12.4003 - val_mse: 12.4003
Epoch 8/10
1/1 - 0s - loss: 13.0211 - mse: 13.0211 - val_loss: 12.1996 - val_mse: 12.1996
Epoch 9/10
1/1 - 0s - loss: 12.7886 - mse: 12.7886 - val_loss: 11.9859 - val_mse: 11.9859
Epoch 10/10
1/1 - 0s - loss: 12.5411 - mse: 12.5411 - val_loss: 11.7586 - val_mse: 11.7586


# Estimator with `TFRecord`: Classification Criteo

In [18]:
import tensorflow as tf

from tensorflow.python.ops.parsing_ops import  FixedLenFeature
from deepctr.estimator import DeepFMEstimator
from deepctr.estimator.inputs import input_fn_tfrecord


if __name__ == "__main__":

    # 1.generate feature_column for linear part and dnn part

    sparse_features = ['C' + str(i) for i in range(1, 27)]
    dense_features = ['I' + str(i) for i in range(1, 14)]

    dnn_feature_columns = []
    linear_feature_columns = []

    for i, feat in enumerate(sparse_features):
        dnn_feature_columns.append(tf.feature_column.embedding_column(
            tf.feature_column.categorical_column_with_identity(feat, 1000), 4))
        linear_feature_columns.append(tf.feature_column.categorical_column_with_identity(feat, 1000))
    for feat in dense_features:
        dnn_feature_columns.append(tf.feature_column.numeric_column(feat))
        linear_feature_columns.append(tf.feature_column.numeric_column(feat))

    # 2.generate input data for model

    feature_description = {k: FixedLenFeature(dtype=tf.int64, shape=1) for k in sparse_features}
    feature_description.update(
        {k: FixedLenFeature(dtype=tf.float32, shape=1) for k in dense_features})
    feature_description['label'] = FixedLenFeature(dtype=tf.float32, shape=1)

    train_model_input = input_fn_tfrecord('DeepCTR/examples/criteo_sample.tr.tfrecords', feature_description, 'label', batch_size=256,
                                          num_epochs=1, shuffle_factor=10)
    test_model_input = input_fn_tfrecord('DeepCTR/examples/criteo_sample.te.tfrecords', feature_description, 'label',
                                         batch_size=2 ** 14, num_epochs=1, shuffle_factor=0)

    # 3.Define Model,train,predict and evaluate
    model = DeepFMEstimator(linear_feature_columns, dnn_feature_columns, task='binary', 
                            config=tf.estimator.RunConfig(tf_random_seed=2021))

    model.train(train_model_input)
    eval_result = model.evaluate(test_model_input)

    print(eval_result)

INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp5rfji1ro', '_tf_random_seed': 2021, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and gr



Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
Instructions for updating:
Use `tf.cast` instead.
Instructions for updating:
The value of AUC returned by this may race with the update so this is deprecated. Please use tf.keras.metrics.AUC instead.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create 

# Estimator with Pandas DataFrame: Classification Criteo

In [19]:
import pandas as pd
import tensorflow as tf
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

from deepctr.estimator import DeepFMEstimator
from deepctr.estimator.inputs import input_fn_pandas

if __name__ == "__main__":
    data = pd.read_csv('DeepCTR/examples/criteo_sample.txt')

    sparse_features = ['C' + str(i) for i in range(1, 27)]
    dense_features = ['I' + str(i) for i in range(1, 14)]

    data[sparse_features] = data[sparse_features].fillna('-1', )
    data[dense_features] = data[dense_features].fillna(0, )
    target = ['label']

    # 1.Label Encoding for sparse features,and do simple Transformation for dense features
    for feat in sparse_features:
        lbe = LabelEncoder()
        data[feat] = lbe.fit_transform(data[feat])
    mms = MinMaxScaler(feature_range=(0, 1))
    data[dense_features] = mms.fit_transform(data[dense_features])

    # 2.count #unique features for each sparse field,and record dense feature field name

    dnn_feature_columns = []
    linear_feature_columns = []

    for i, feat in enumerate(sparse_features):
        dnn_feature_columns.append(tf.feature_column.embedding_column(
            tf.feature_column.categorical_column_with_identity(feat, data[feat].max() + 1), 4))
        linear_feature_columns.append(tf.feature_column.categorical_column_with_identity(feat, data[feat].max() + 1))
    for feat in dense_features:
        dnn_feature_columns.append(tf.feature_column.numeric_column(feat))
        linear_feature_columns.append(tf.feature_column.numeric_column(feat))

    # 3.generate input data for model

    train, test = train_test_split(data, test_size=0.2, random_state=2021)

    # Not setting default value for continuous feature. filled with mean.

    train_model_input = input_fn_pandas(train, sparse_features + dense_features, 'label', shuffle=True)
    test_model_input = input_fn_pandas(test, sparse_features + dense_features, None, shuffle=False)

    # 4.Define Model,train,predict and evaluate
    model = DeepFMEstimator(linear_feature_columns, dnn_feature_columns, task='binary', 
                            config=tf.estimator.RunConfig(tf_random_seed=2021))

    model.train(train_model_input)
    pred_ans_iter = model.predict(test_model_input)
    pred_ans = list(map(lambda x: x['pred'], pred_ans_iter))
    print("test LogLoss", round(log_loss(test[target].values, pred_ans), 4))
    print("test AUC", round(roc_auc_score(test[target].values, pred_ans), 4))



INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp769zctih', '_tf_random_seed': 2021, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
To cons



INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp769zctih/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 572.73376, step = 0
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 1...
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmp769zctih/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 1...
INFO:tensorflow:Loss for final step: 572.73376.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp769zctih/model.

  loss = -(transformed_labels * np.log(y_pred)).sum(axis=1)
