# Training a model with all features on the AdKDD challenge dataset

As explained in the research article, training directly on the aggregated data released in the challenge does not produce good results.
The main reasons are as follows:
- The dataset contains several features with a large number (>50k) of modalities.
- Those features (in particular features 14 and 17) are both strongly predictive (removing either one has a significant impact on the skyline) and very strongly correlated with each other.
- It is not reasonable to have one parameter for each pair of modalities of those features: the model would be too large to fit in memory. I tried hashing the cross-feature, but it does not work well: hashing loses the important information about the correlation between those features, and the resulting model significantly underperforms.
- The best solution so far is to compute target encodings of these features, reducing their cardinality to a reasonable number (< 1000) before aggregating.
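
To make the memory argument concrete, here is a back-of-the-envelope sketch that counts the parameters needed for the cross of features 14 and 17. The modality counts are the ones printed by the encoding cell of this notebook on a 4M-row sample; the 4 bytes per parameter is an assumption made for illustration.

```python
# Modality counts printed by the encoding cell of this notebook (4M-row sample).
raw_sizes = {"hash_14": 441_976, "hash_17": 94_264}
encoded_sizes = {"hash_14": 542, "hash_17": 665}

BYTES_PER_PARAM = 4  # assumption: one float32 weight per pair of modalities


def cross_size(sizes):
    """Number of parameters with one weight per pair of modalities."""
    return sizes["hash_14"] * sizes["hash_17"]


raw_params = cross_size(raw_sizes)
encoded_params = cross_size(encoded_sizes)

print(f"raw cross:     {raw_params:,} params, {raw_params * BYTES_PER_PARAM / 1e9:.1f} GB")
print(f"encoded cross: {encoded_params:,} params, {encoded_params * BYTES_PER_PARAM / 1e6:.2f} MB")
```

Under these assumptions, the raw cross alone needs tens of billions of weights, while the encoded cross stays in the hundreds of thousands.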

In this notebook:
- We precompute target encodings of the features with many modalities on a set of held-out granular data.
- We read the full (granular) dataset released with https://arxiv.org/pdf/2201.13123.pdf (see also https://github.com/criteo-research/ad_click_prediction_from_aggregated_data), preprocess it to replace each feature by these target encodings, and aggregate the data.
- Finally, we train the MRF model on the resulting aggregated data.

Note that the training is done with a *fairly large PySpark session*. The notebook was written and tested on Criteo infrastructure; *making it work from outside may require a few changes* to install PySpark and create a Spark session.

In [None]:
%load_ext autoreload
%autoreload 2
import sys
from aggregated_models.myimports  import *
import aggregated_models.myJupyterUtils as myJupyterUtils  # Remove stack traces on KeyboardInterrupt
plt.style.use('ggplot')
from aggregated_models.aggdataset import * 
import gzip
from itertools import islice
from aggregated_models.RawFeatureMapping import *


## Downloading challenge datasets

In [6]:
datapath = "../data/challenge/"
if not os.path.exists(datapath):
    print(f"creating {datapath}")
    os.makedirs(datapath)  # also creates missing parent directories
    import urllib.request
    # full granular train set (2.4G)
    urllib.request.urlretrieve("http://go.criteo.net/criteo-ppml-challenge-adkdd21-dataset-raw-granular-data.csv.gz",
                               datapath + "large_train.csv.gz")
    # challenge files
    urllib.request.urlretrieve("http://go.criteo.net/criteo-ppml-challenge-adkdd21-dataset.zip", 
                               datapath + "challenge.zip")
    import zipfile
    with zipfile.ZipFile(datapath + "challenge.zip", "r") as zip_ref:
        zip_ref.extractall(datapath)
    # additional lines
    urllib.request.urlretrieve("http://go.criteo.net/criteo-ppml-challenge-adkdd21-dataset-additional-test-data.csv.gz",
                               datapath + "large_test.csv.gz")    

In [7]:
!ls -lah ../data/challenge

total 3.1G
drwxr-xr-x 2 a.gilotte Domain Users  290 Jul  5 16:39 .
drwxr-xr-x 5 a.gilotte Domain Users   99 Jul  5 15:59 ..
-rw-r--r-- 1 a.gilotte Domain Users  26M Jul  5 16:13 X_test.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users 2.9M Jul  5 16:13 X_train.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users 241M Jul  5 16:13 aggregated_noisy_data_pairs.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users  15M Jul  5 16:13 aggregated_noisy_data_singles.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users 285M Jul  5 16:13 challenge.zip
-rw-r--r-- 1 a.gilotte Domain Users 103M Jul  5 16:39 large_test.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users 2.5G Jul  5 16:09 large_train.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users 141K Jul  5 16:13 y_test.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users  16K Jul  5 16:13 y_train.csv.gz


In [8]:
datapath = "../data/challenge/"
filename =  datapath + "large_train.csv.gz"
filename_largetest =  datapath + "large_test.csv.gz"
filename_smalltrain = datapath + "small_train.csv.gz"
filename_smalltest =  datapath + "data/test.csv.gz"

labels = ["click" , "sale"]
allfeatures = ['hash_'+str(i) for i in range(0,19)]

##  Using  "large_test" to compute target encodings
## Aggregating large_train, training on aggregated data
## Test metrics computed on test and/or small_train  ("small_train" is actually not used at all in training)

## Preparing encodings

In [10]:
df = pd.read_csv(filename_largetest , dtype=np.int32 ,  nrows = 4_000_000) 

In [24]:
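logbase = 2
nbStd = 0.3
gaussianStd = 1
mappings = {}
sigma = 0  # std of optional Gaussian noise added to the counts (0 = no noise)

for f in allfeatures:
    mappings[f] = RawFeatureMapping.FromDF(f, df)
    size = mappings[f].Size
    # Only features with more than 100 modalities are re-encoded.
    if size > 100:
        df["d"] = 1  # one display per row
        # Number of clicks and displays per modality of f.
        df_f = df[[f, "click", "d"]].groupby(f).sum().reset_index()
        df_f["click"] += np.random.normal(0, sigma, len(df_f))
        df_f["d"] += np.random.normal(0, sigma, len(df_f))
        # Keep the (possibly noisy) counts in a valid range.
        df_f.loc[df_f["d"] < 1, "d"] = 1
        df_f.loc[df_f["click"] < 0, "click"] = 0
        # Replace the raw modalities by CTR buckets.
        mappings[f] = RawFeatureMapping.BuildCtrBucketsFromAggDf(f, df_f, logbase=logbase, nbStd=nbStd, gaussianStd=gaussianStd)
        print(f, size, "->", mappings[f].Size)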


hash_0 9542 -> 666
hash_3 762 -> 150
hash_10 6417 -> 714
hash_12 4312 -> 270
hash_13 104 -> 82
hash_14 441976 -> 542
hash_16 1042 -> 258
hash_17 94264 -> 665
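
The reduction shown above (for example from 441,976 to 542 modalities for `hash_14`) comes from grouping modalities into click-through-rate buckets. The snippet below is a simplified sketch of that idea on toy data. It is not the actual `RawFeatureMapping.BuildCtrBucketsFromAggDf` implementation: the smoothing prior, the log-spaced bucket rule and the helper name `ctr_bucket_encoding` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def ctr_bucket_encoding(df, feature, label="click", log_base=2.0, prior_count=10.0):
    """Map each modality of `feature` to the index of a log-spaced CTR bucket.

    Hypothetical helper: modalities with a similar smoothed click-through
    rate share one bucket, which shrinks the cardinality of the feature.
    """
    stats = df.groupby(feature)[label].agg(clicks="sum", displays="count")
    global_ctr = df[label].mean()
    # Smooth towards the global CTR so that rare modalities get a stable estimate.
    ctr = (stats["clicks"] + prior_count * global_ctr) / (stats["displays"] + prior_count)
    # Two CTRs share a bucket when they fall between the same powers of log_base.
    buckets = np.floor(np.log(ctr) / np.log(log_base)).astype(int)
    # Renumber the buckets as consecutive ids 0..k-1.
    codes, _ = pd.factorize(buckets, sort=True)
    return pd.Series(codes, index=stats.index, name=feature + "_bucket")


rng = np.random.default_rng(0)
toy = pd.DataFrame({"hash_14": rng.integers(0, 5_000, size=200_000)})
# Toy labels: the click probability depends on the modality.
toy["click"] = rng.random(len(toy)) < 0.02 + 0.2 * (toy["hash_14"] % 7) / 7

mapping = ctr_bucket_encoding(toy, "hash_14")
toy["hash_14_bucket"] = toy["hash_14"].map(mapping)
print(toy["hash_14"].nunique(), "->", toy["hash_14_bucket"].nunique())
```

In this notebook the statistics are computed on `large_test`, which is not part of the data aggregated for training, so the encodings are not fitted on the training labels.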


## Aggregating the large train

In [25]:
rawFeaturesSet = RawFeaturesSet(allfeatures, mappings )
maxNbModalities= {f : 998 for f in allfeatures}
maxNbModalities["default"] = 1_000_000 
cfset = CrossFeaturesSet(rawFeaturesSet , "*&*",maxNbModalities=maxNbModalities  )

In [26]:
aggdata = AggDataset(cfset, ["display", "click"])
batch = 100_000
# Stream the gzipped file in chunks of `batch` rows so that the full granular
# dataset never has to fit in memory. Passing `chunksize` lets pandas handle
# the header and the decompression, and avoids consuming the first row of each
# batch as a header (which `header=0` does on an already opened file).
for i, df in enumerate(pd.read_csv(filename, chunksize=batch), start=1):
    print(f"processing batch {i}        ", end="\r")
    aggdata.aggregate(df)

processing batch 884        
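
Conceptually, the aggregated dataset stores, for each single feature and each pair of features, the number of displays and clicks per modality. The sketch below reproduces this with plain pandas on a toy frame, accumulating counts batch by batch. It is an illustration only and does not mirror the internals of `AggDataset`; the function name `aggregate_batch` is hypothetical.

```python
from collections import Counter
from itertools import combinations

import pandas as pd


def aggregate_batch(df, features, totals):
    """Add the display and click counts of one batch to `totals`.

    `totals` maps a tuple of feature names to a Counter keyed by
    (modalities..., "display" | "click").
    """
    projections = [(f,) for f in features] + list(combinations(features, 2))
    for proj in projections:
        grouped = df.groupby(list(proj))["click"].agg(["count", "sum"])
        counter = totals.setdefault(proj, Counter())
        for key, row in grouped.iterrows():
            key = key if isinstance(key, tuple) else (key,)
            counter[key + ("display",)] += int(row["count"])
            counter[key + ("click",)] += int(row["sum"])
    return totals


toy = pd.DataFrame({
    "hash_0": [1, 1, 2, 2, 2, 1],
    "hash_1": [0, 1, 0, 0, 1, 0],
    "click":  [1, 0, 0, 1, 0, 1],
})

totals = {}
# Process the data in two batches, as the streaming loop above does.
for start in range(0, len(toy), 3):
    aggregate_batch(toy.iloc[start:start + 3], ["hash_0", "hash_1"], totals)

print(totals[("hash_0",)])
print(totals[("hash_0", "hash_1")])
```

Because counts are additive, the result does not depend on how the rows are split into batches.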

In [27]:
with open('data/aggdata_kdd', "wb") as handle:
    aggdata.dump( handle )

## Reloading the data

In [11]:
with open('data/aggdata_kdd', "rb") as handle:
    aggdata = AggDataset.load( handle )

In [12]:
aggdata = aggdata.MakeDiffPrivate( 5 ,1e-5 , True )

GaussianMechanism epsilon:5 delta:1e-05 sigma:19.494904835891614
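
Judging from the output above, this call presumably perturbs the aggregated counts with Gaussian noise of the printed standard deviation. Below is a minimal sketch of such a step on a toy vector of counts. How the library derives `sigma` from `epsilon` and `delta` is not shown in this notebook, so the sketch takes `sigma` as given, and the toy counts are made up for illustration.

```python
import numpy as np

sigma = 19.494904835891614  # value printed by MakeDiffPrivate above

rng = np.random.default_rng(0)
counts = np.array([12_000.0, 350.0, 8.0, 0.0])  # toy aggregated counts
noisy_counts = counts + rng.normal(loc=0.0, scale=sigma, size=counts.shape)

# The same absolute noise matters little for large counts and a lot for small ones.
for c, n in zip(counts, noisy_counts):
    print(f"true={c:>8.0f}  noisy={n:>10.1f}")
```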


## Learning the model

#### Note
This part of the notebook was written to run on Criteo infrastructure.
- *It requires access to a fairly large Spark session.*
- It uses a Criteo internal library (`thx`) to create and configure this session.

It should be possible to make it work on another infrastructure with Spark with minor changes (replacing the `thx` calls with your own calls to create the session).
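
As a starting point for such a replacement, here is a hedged sketch based on the standard `SparkSession.builder` API. The property values are copied from the `create_remote_spark_session` call used later in this notebook. Reading its positional arguments `250` and `8` as the number of executors and the cores per executor is an assumption, and the helper names `spark_conf` and `create_session` are hypothetical.

```python
def spark_conf(executors=250, cores=8, memory="6g",
               memory_overhead="8g", driver_memory="32g"):
    """Standard Spark properties presumed equivalent to the thx call of this notebook."""
    return {
        "spark.executor.instances": str(executors),  # assumption: 2nd positional argument
        "spark.executor.cores": str(cores),          # assumption: 3rd positional argument
        "spark.executor.memory": memory,
        "spark.executor.memoryOverhead": memory_overhead,
        "spark.driver.memory": driver_memory,
        "spark.speculation": "true",
        "spark.speculation.interval": "4s",
        "spark.speculation.multiplier": "3",
        "spark.speculation.quantile": "0.9",
    }


def create_session(app_name="LearningFromAggData", conf=None):
    """Create (or reuse) a SparkSession with the given properties."""
    # Imported lazily so that the configuration can be inspected without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in (conf or spark_conf()).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example for a small test cluster (sizes are placeholders to adapt):
# ss = create_session(conf=spark_conf(executors=2, cores=2, memory="2g",
#                                     memory_overhead="1g", driver_memory="4g"))
# ss.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
```

The master URL, file systems and checkpoint directory depend on your cluster and are deliberately left out of the sketch.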

In [13]:
# pip install thx

In [20]:
%load_ext autoreload
%autoreload 2
import sys
from aggregated_models.myimports  import *
import aggregated_models.myJupyterUtils as myJupyterUtils  # Remove stack traces on KeyboardInterrupt
plt.style.use('ggplot')

from aggregated_models.agg_mrf_model import *
from aggregated_models.validation import * 
from aggregated_models.aggLogistic import AggLogistic
from aggregated_models.aggdataset import *
from aggregated_models.experiment import *
from aggregated_models.mrf_helpers  import *
from thx.hadoop.spark_config_builder import create_remote_spark_session, SparkSession
%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [21]:
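def LLH(prediction, y):
    """Mean log-likelihood of binary labels y under the predicted probabilities."""
    llh = np.log(prediction) * y + np.log(1 - prediction) * (1 - y)
    return np.sum(llh) / len(y)

def Entropy(y):
    """Mean log-likelihood of the constant predictor that outputs the click rate of y."""
    py = np.sum(y > 0) / len(y)
    return Entropy_(py)

def Entropy_(py):
    return py * np.log(py) + (1 - py) * np.log(1 - py)

def NLLH(prediction, y):
    """Normalized log-likelihood: 0 for the constant predictor, close to 1 for a near-perfect model."""
    if np.any(prediction <= 0) or np.any(prediction >= 1):
        return np.nan
    h = Entropy(y)
    llh = LLH(prediction, y)
    return (h - llh) / h

allfeatures = ['hash_' + str(i) for i in range(0, 19)]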
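
As a sanity check on the metric defined above: `NLLH` is 0 for a model that always predicts the average click rate, and it approaches 1 as predictions approach the true labels. The self-contained sketch below restates the metric in vectorized NumPy (the lowercase name `nllh` and the synthetic labels are ours) and checks both cases.

```python
import numpy as np


def nllh(prediction, y):
    """Normalized log-likelihood: 1 - LLH(model) / LLH(constant base-rate predictor)."""
    prediction = np.asarray(prediction, dtype=float)
    y = np.asarray(y, dtype=float)
    if np.any(prediction <= 0) or np.any(prediction >= 1):
        return np.nan
    llh = np.mean(y * np.log(prediction) + (1 - y) * np.log(1 - prediction))
    py = np.mean(y > 0)
    baseline = py * np.log(py) + (1 - py) * np.log(1 - py)
    return (baseline - llh) / baseline


rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.1).astype(int)  # synthetic labels, about 10% clicks

constant = np.full(len(y), y.mean())      # always predict the base rate
confident = np.where(y == 1, 0.99, 0.01)  # almost perfect predictions

print(f"constant predictor:  {nllh(constant, y):.3f}")
print(f"confident predictor: {nllh(confident, y):.3f}")
```

Read this way, the line `nllh=0.284, llh=-0.2321` in the training log further down corresponds to a baseline log-likelihood of about -0.324 on the test set.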

In [22]:
filename_smalltest =  "../../challenge-release/data/test.csv.gz"
test = pd.read_csv(filename_smalltest, sep=',')
test["clicks"] =test["click"]      
len(test)

931843

In [23]:
from thx.hadoop.spark_config_builder import create_remote_spark_session, SparkSession


In [27]:
memory = '6g' # memory = '8g'
ss = create_remote_spark_session(
    'LearningFromAggData', 250, 8, memory=memory,
    memoryOverhead='8g', driver_memory='32g',
    properties=[
        ('spark.speculation', 'true'),
        ('spark.speculation.interval', '4s'),
        ('spark.speculation.multiplier', '3'),
        ('spark.speculation.quantile', '0.9'),
    ],
    hadoop_file_systems=['viewfs://root', 'viewfs://prod-am6'])
ss.sparkContext.setCheckpointDir("viewfs://prod-am6/tmp/a.gilotte/load/")

2022-07-07 14:49:58,587 - cluster_pack.filesystem - INFO - Resolved base filesystem: <class 'pyarrow.hdfs.HadoopFileSystem'>


  fs = EnhancedFileSystem(pyarrow.hdfs.connect(host=host, port=port))


2022-07-07 14:49:59,539 - cluster_pack.uploader - INFO - viewfs://root/user/a.gilotte/envs/.agg_model_venv.pex already exists
2022-07-07 14:50:30,625 - cluster_pack.packaging - INFO - found editable requirements {'aggregated_models': '/mnt/nfs/home/a.gilotte/aggdata_public/aggregated_models'}
2022-07-07 14:50:30,651 - thx.hadoop.spark_config_builder - INFO - applicationId: application_1657197191666_39117
2022-07-07 14:50:30,653 - thx.hadoop.spark_config_builder - INFO - spark UI: http://10.188.159.37:31047


In [28]:
ss

In [30]:
xp0 = Experiment( "aggdata_kdd_mu5" , ss )
xp0.defineModel(
        'data/aggdata_kdd',
    f'''
config_params = AggMRFModelParams( {allfeatures} ,  clicksCfs="*&*", 
            nbSamples=1_000_000,
            regulL2= 50,     
            regulL2Click=1000,
            sampleFromPY0 = True,  
            maxNbRowsPerSlice = 250,
            muStepSizeMultiplier = 5.0
             )
model = AggMRFModel ( aggdata,
            config_params,
            sparkSession= ss   )     
''',
        stepsize =0.01 )
xp0.run( test , logevery = 10, nbiters = 1000)

creating model
starting to train for 1000 iters. logging every 10
 auc=0.848,  nllh=0.249,  llh=-0.2434,  nbiters=10
 auc=0.856,  nllh=0.263,  llh=-0.2389,  nbiters=20
 auc=0.859,  nllh=0.272,  llh=-0.2360,  nbiters=30
 auc=0.861,  nllh=0.277,  llh=-0.2344,  nbiters=40
 auc=0.863,  nllh=0.280,  llh=-0.2336,  nbiters=50
 auc=0.863,  nllh=0.281,  llh=-0.2331,  nbiters=60
 auc=0.864,  nllh=0.282,  llh=-0.2327,  nbiters=70
 auc=0.864,  nllh=0.283,  llh=-0.2325,  nbiters=80
 auc=0.864,  nllh=0.284,  llh=-0.2323,  nbiters=90
 auc=0.865,  nllh=0.284,  llh=-0.2322,  nbiters=100
 auc=0.865,  nllh=0.284,  llh=-0.2321,  nbiters=110
 auc=0.865,  nllh=0.284,  llh=-0.2321,  nbiters=120
simpleGradientStep iter=4     

KeyboardInterrupt


In [31]:
ss.stop()