# Running on the criteo-adkdd competition

In the criteo-adkdd competition (https://www.adkdd.org/2021-privacy-ml-competition), the goal was to learn a click model from aggregated data.
Some small granular datasets were also available, but this notebook trains a model using only the aggregated data released during the competition.

- it downloads the datasets
- it reads the noisy aggregated files and builds an "aggdata" structure suited to training an RMF. In this structure, all features and crossfeatures are "hashed" (actually just a modulo) to 100K modalities.
- it trains an RMF model, using only a subset of 11 features (out of the 19 listed in `allfeatures`).
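
As a rough illustration of the modulo "hashing" mentioned above, the sketch below maps integer modality ids into a fixed number of buckets, and does the same for a pair of features by first flattening the two ids into one. The function names and the way the pair is flattened are assumptions made for illustration; the actual encoding is done by `CrossFeaturesSet` and may differ.

```python
import numpy as np

NB_BUCKETS = 100_000  # bucket count used in this notebook for single features

def hash_single(ids, nb_buckets=NB_BUCKETS):
    """Map integer modality ids to buckets with a plain modulo."""
    return np.asarray(ids, dtype=np.int64) % nb_buckets

def hash_pair(ids_a, ids_b, size_a, nb_buckets=NB_BUCKETS):
    """Hypothetical pair encoding: flatten (a, b) to a + size_a * b, then modulo."""
    a = np.asarray(ids_a, dtype=np.int64)
    b = np.asarray(ids_b, dtype=np.int64)
    return (a + size_a * b) % nb_buckets

ids = np.array([3, 100_003, 250_000])
buckets = hash_single(ids)  # ids 3 and 100_003 end up in the same bucket
pair_buckets = hash_pair([1, 2], [5, 5], size_a=1_000)
```

Two distinct modalities that land in the same bucket become indistinguishable, which is one way the information on correlations between features can get lost.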

The main reason for using a subset of the features is that the method does not work well with all of them. (Scalability is another non-trivial reason, but it can be addressed by using the pyspark training.)
The main problem is that some of the features are very strongly correlated, and the information on those correlations is lost when hashing. A workaround would be to first project these features (e.g. with target encodings) to a smaller number of modalities (say 1000), so that the full crossfeatures can be modelled without hashing. However, this does not work well with the pre-aggregated data available during the competition, for several reasons:
- the data on infrequent modality pairs was filtered out, making it impossible to fully reconstruct the crossfeatures (for a few pairs of features, a significant share of the samples is missing)
- the noise is no longer negligible after re-aggregating (we sum several instances of the noise, leading to a higher noise variance than if we aggregated directly on the target-encoded data)
- finally, the pre-aggregated data do not allow building good target encodings (or we would learn the model on the same set on which those target encodings are trained, which typically leads to strong overfitting)
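
The point about noise can be checked numerically. The simulation below assumes independent Gaussian noise with a standard deviation of 17 per released cell (suggested by the `17*17` variance constant used later in this notebook) and a hypothetical number `k` of fine-grained cells merged into one coarser cell; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 17.0       # assumed per-cell noise standard deviation
k = 50             # hypothetical number of fine-grained cells merged together
n_trials = 20_000

# Re-aggregating: each of the k cells carries its own independent noise draw.
reaggregated_noise = rng.normal(0.0, sigma, size=(n_trials, k)).sum(axis=1)
# Direct aggregation: a single noise draw is added to the already-summed cell.
direct_noise = rng.normal(0.0, sigma, size=n_trials)

var_reaggregated = reaggregated_noise.var()
var_direct = direct_noise.var()
expected_reaggregated = k * sigma**2   # variances of independent noises add up
```

Under these assumptions the re-aggregated cell has about `k` times the noise variance of a cell aggregated directly at the coarse level.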

A model with all features is trained in the other notebook, which directly aggregates on target-encoded features.
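
For readers unfamiliar with target encoding, here is a minimal sketch of projecting a feature to a small number of modalities: modalities are ranked by a smoothed click rate and the ranking is cut into buckets. This is not taken from the other notebook; the function names, the smoothing, and the toy data are all assumptions. The encoding should be fitted on granular data that is not reused to train the model, for the overfitting reason given above.

```python
import pandas as pd

def fit_target_encoding(df, feature, label, nb_buckets):
    """Rank modalities by smoothed click rate and cut the ranking into buckets."""
    stats = df.groupby(feature)[label].agg(["sum", "count"])
    prior = df[label].mean()
    smoothed = (stats["sum"] + prior) / (stats["count"] + 1.0)  # light smoothing
    ranks = smoothed.rank(method="first") - 1                   # 0 .. n_modalities - 1
    buckets = (ranks * nb_buckets // len(ranks)).astype(int)
    return buckets  # Series indexed by modality, values are bucket ids

def apply_target_encoding(values, buckets, default_bucket=0):
    """Map raw modalities to bucket ids; unseen modalities get default_bucket."""
    return pd.Series(values).map(buckets).fillna(default_bucket).astype(int)

# Toy granular fold used only to fit the encoding (values are made up).
encoding_fold = pd.DataFrame({
    "hash_0": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "click":  [0, 0, 0, 1, 1, 1, 1, 0],
})
buckets = fit_target_encoding(encoding_fold, "hash_0", "click", nb_buckets=2)
encoded = apply_target_encoding(["a", "c", "z"], buckets)
```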

In [1]:
%load_ext autoreload
%autoreload 2
import sys
from aggregated_models.myimports  import *
import aggregated_models.myJupyterUtils as myJupyterUtils ## Remove stacktraces on KeyboardInterrupt
plt.style.use('ggplot')
from aggregated_models.aggdataset import * 
import gzip
from itertools import islice
from aggregated_models.RawFeatureMapping import *

2022-07-08 13:26:24,364 - matplotlib.font_manager - INFO - Generating new fontManager, this may take some time...
failed to load pyspark
failed to load pyspark


In [2]:
datapath = "../data/challenge/"
# filename =  datapath + "large_train.csv.gz"
# filename_largetest =  datapath + "large_test.csv.gz"

singleAggFile = datapath + "aggregated_noisy_data_singles.csv.gz"
pairsAggFile  = datapath + "aggregated_noisy_data_pairs.csv.gz"

labels = ["click" , "sale"]
allfeatures = ['hash_'+str(i) for i in range(0,19)]

## Downloading challenge datasets

In [3]:
import os

datapath = "../data/challenge/"
if not os.path.exists(datapath):
    print(f"creating {datapath}")
    os.makedirs(datapath)  # unlike os.mkdir, also creates missing parent directories
    import urllib.request
    # full granular train set (2.4G)
    urllib.request.urlretrieve("http://go.criteo.net/criteo-ppml-challenge-adkdd21-dataset-raw-granular-data.csv.gz",
                               datapath + "large_train.csv.gz")
    # challenge files
    urllib.request.urlretrieve("http://go.criteo.net/criteo-ppml-challenge-adkdd21-dataset.zip", 
                               datapath + "challenge.zip")
    import zipfile
    with zipfile.ZipFile(datapath + "challenge.zip", "r") as zip_ref:
        zip_ref.extractall(datapath)
    # additional lines
    urllib.request.urlretrieve("http://go.criteo.net/criteo-ppml-challenge-adkdd21-dataset-additional-test-data.csv.gz",
                               datapath + "large_test.csv.gz")    

In [4]:
!ls -lah ../data/challenge

total 3.1G
drwxr-xr-x 2 a.gilotte Domain Users  290 Jul  5 16:39 .
drwxr-xr-x 5 a.gilotte Domain Users  150 Jul  7 16:46 ..
-rw-r--r-- 1 a.gilotte Domain Users  26M Jul  5 16:13 X_test.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users 2.9M Jul  5 16:13 X_train.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users 241M Jul  5 16:13 aggregated_noisy_data_pairs.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users  15M Jul  5 16:13 aggregated_noisy_data_singles.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users 285M Jul  5 16:13 challenge.zip
-rw-r--r-- 1 a.gilotte Domain Users 103M Jul  5 16:39 large_test.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users 2.5G Jul  5 16:09 large_train.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users 141K Jul  5 16:13 y_test.csv.gz
-rw-r--r-- 1 a.gilotte Domain Users  16K Jul  5 16:13 y_train.csv.gz


# Preparing Aggregated data

In [5]:
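df = pd.read_csv(singleAggFile)
# The added noise can make aggregates negative: the boolean masks set those to 0.
df["d"] = df["count"] * (df["count"] > 0) + 1  # displays, offset by 1
df["click"] = df["nb_clicks"] * (df["nb_clicks"] > 0)
df["sale"] = df["nb_sales"] * (df["nb_sales"] > 0)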
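
The cell above multiplies each column by a boolean mask, which sets negative noisy values to 0. The toy example below (made-up values, same column names as the aggregated files) shows that this matches pandas' `clip(lower=0)`, which may read more clearly.

```python
import pandas as pd

# Toy stand-in for the noisy single-feature aggregates (values are made up).
toy = pd.DataFrame({"count": [120.5, -3.0, 0.0], "nb_clicks": [10.0, -0.5, 2.5]})

# Same pattern as the notebook cell: negative noisy values become 0.
masked_count = toy["count"] * (toy["count"] > 0)
clipped_count = toy["count"].clip(lower=0)   # equivalent result

displays = masked_count + 1                  # the notebook's "d" column adds 1
clicks = toy["nb_clicks"].clip(lower=0)
```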

In [6]:
mappings = {}
for i in sorted(set( df.feature_1_id.values  )):
    f = f"hash_{i}"
    df_f = df[ df.feature_1_id == i 
             ].rename({ "feature_1_value":f }, axis=1)
    size = len(df_f)
    mappings[f] = RawFeatureMapping.FromDF(f, df_f)

In [7]:
## Hashing all features and all crossfeatures to 100K modalities
rawFeaturesSet = RawFeaturesSet(allfeatures, mappings )
maxNbModalities= {f : 100_000 for f in allfeatures}
maxNbModalities["default"] = 1_000_000 
cfset = CrossFeaturesSet(rawFeaturesSet , "*&*",maxNbModalities=maxNbModalities )
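
To get a feel for how many projections this produces, the sketch below enumerates crossfeature names under the assumption that `"*&*"` stands for every unordered pair of distinct features, named `f&f2` like the keys that are split on `"&"` later in this notebook. That reading of the pattern is an assumption, not something checked against `CrossFeaturesSet`.

```python
from itertools import combinations

features = [f"hash_{i}" for i in range(19)]

# Assumed reading of "*&*": all unordered pairs of distinct features.
crossfeatures = [f"{a}&{b}" for a, b in combinations(features, 2)]

# Single features plus pairs.
nb_projections = len(features) + len(crossfeatures)
```

With 19 features this gives 171 pairs, so 190 projections in total under that assumption.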

In [8]:
# aggdata = AggDataset( cfset , [ "display" , "click", "sale", "variance" ])
aggdata = AggDataset( cfset , [ "display" , "click" ])  # only displays and clicks are filled in below
df = pd.read_csv(singleAggFile ) 
for f in allfeatures:
    i = int( f.split('_')[1] )
    df_f = df[ df.feature_1_id == i 
             ].rename({ "feature_1_value":f }, axis=1)
    df_f = mappings[f].Map(  df_f )
    df_f["variance"] = 17*17
    
    df_f =df_f.groupby(f).sum().reset_index()
    d = cfset.encodings[f].ProjectPandasDF(df_f , "count") 
    c = cfset.encodings[f].ProjectPandasDF(df_f , "nb_clicks")  
    s = cfset.encodings[f].ProjectPandasDF(df_f , "nb_sales")  
    variance = cfset.encodings[f].ProjectPandasDF(df_f , "variance")  
    aggdata.aggDisplays[ f ].Data += d
    aggdata.aggClicks[ f ].Data += c
    # aggdata.aggregations["sale"][f].Data += s
    # aggdata.aggregations["variance"][f].Data += variance    

for k in aggdata.aggregations:    
    aggdata.AggregationSums[k] = np.median( [ aggdata.aggregations[k][ f ].Data.sum()  for f in allfeatures]  )    
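
The last loop estimates each global total as the median of the per-feature sums. Every single-feature table describes the same set of displays, so in the absence of noise and filtering all of these sums would be equal; the median is a robust way to reconcile them. The toy numbers below are hypothetical and only illustrate why a median can be preferable to a mean when one feature is far off.

```python
import numpy as np

rng = np.random.default_rng(1)
true_total = 1_000_000.0   # hypothetical true number of displays
nb_features = 19

# Hypothetical per-feature totals: the same true total, each with its own error.
per_feature_totals = true_total + rng.normal(0.0, 500.0, size=nb_features)
per_feature_totals[3] -= 50_000.0   # one feature heavily affected by filtering

estimate_median = np.median(per_feature_totals)
estimate_mean = per_feature_totals.mean()
```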

In [9]:
df = pd.read_csv(pairsAggFile ) 

In [10]:
for cf in aggdata.aggDisplays:
    if cf in allfeatures:
        continue
    f = cf.split("&")[0]
    f2 = cf.split("&")[1]    
    i = int(f.split("_")[1])       
    i2 = int(f2.split("_")[1])       
    
    df_f = pd.concat(  [ df[ (df.feature_1_id == i) &(df.feature_2_id == i2) 
             ].rename({ "feature_1_value":f, "feature_2_value":f2 }, axis=1),
                        df[ (df.feature_1_id == i2) &(df.feature_2_id == i) 
             ].rename({ "feature_1_value":f2, "feature_2_value":f }, axis=1),
                       ])          
    df_f = mappings[f].Map(  df_f )
    df_f = mappings[f2].Map(  df_f )
    df_f["variance"] = 17*17
    
    df_f =df_f.groupby([f,f2]).sum().reset_index()
    
    d = cfset.encodings[cf].ProjectPandasDF(df_f , "count") 
    c = cfset.encodings[cf].ProjectPandasDF(df_f , "nb_clicks")  
    s = cfset.encodings[cf].ProjectPandasDF(df_f , "nb_sales")  
    variance = cfset.encodings[cf].ProjectPandasDF(df_f , "variance")      
    
    aggdata.aggDisplays[ cf ].Data += d
    aggdata.aggClicks[ cf ].Data += c    
#    aggdata.aggregations["sale"][cf].Data += s    
#    aggdata.aggregations["variance"][cf].Data += variance        
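
The `pd.concat` in the loop above handles the fact that a pair of features may be stored in either order in the pairs file. The standalone toy example below reproduces that selection step with made-up values; `select_pair` is a hypothetical helper name introduced only for this illustration.

```python
import pandas as pd

# Toy stand-in for the pairs file (ids and values are made up).
pairs = pd.DataFrame({
    "feature_1_id":    [0, 1, 0],
    "feature_2_id":    [1, 0, 2],
    "feature_1_value": [10, 77, 10],
    "feature_2_value": [77, 11, 5],
    "count":           [100.0, 40.0, 7.0],
})

def select_pair(df, i, i2, f, f2):
    """Return rows for features (i, i2) in either storage order, values renamed to f and f2."""
    direct = df[(df.feature_1_id == i) & (df.feature_2_id == i2)].rename(
        columns={"feature_1_value": f, "feature_2_value": f2})
    swapped = df[(df.feature_1_id == i2) & (df.feature_2_id == i)].rename(
        columns={"feature_1_value": f2, "feature_2_value": f})
    # concat aligns on column names, so the swapped rows land in the right columns.
    return pd.concat([direct, swapped], ignore_index=True)

pair_df = select_pair(pairs, 0, 1, "hash_0", "hash_1")
```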
    

In [11]:
sum( [ len( aggdata.aggregations["click"][cf].Data ) for cf in aggdata.aggregations["click"] ] )

50992603

In [12]:
with open('data/aggdata_officialcompetition_hash100k', "wb") as handle:
    aggdata.dump( handle )

## Reloading aggdata

In [13]:
with open('data/aggdata_officialcompetition_hash100k', "rb") as handle:
    aggdata = AggDataset.load( handle )

# Learning models

In [14]:
from aggregated_models.agg_mrf_model import *
from aggregated_models.validation import * 
%matplotlib inline



In [15]:
Validation = MetricsComputer("click")

In [16]:
smalldf = pd.read_csv( datapath + "X_train.csv.gz" , sep=',')
smalldf["clicks"]  = pd.read_csv( datapath + "y_train.csv.gz" , sep=',')["click"]
smalldf["click"] = smalldf["clicks"]

In [17]:
features11 = [allfeatures[i] for i in [0, 1, 2, 3, 4, 6, 8, 10, 13, 15, 16]]

In [18]:
config_params = AggMRFModelParams( features11 ,  clicksCfs="*&*", 
            nbSamples=100_000,
            regulL2= 50,     
            regulL2Click=1000,
            sampleFromPY0 = True,  
            maxNbRowsPerSlice = 250,
            muStepSizeMultiplier = 5.0
             )
model = AggMRFModel ( aggdata, config_params)

In [None]:
for i in range(0,20):
    model.fit(20)
    print( Validation.run(model,smalldf))


NLLH=0.2304, NMSE=0.1949       
NLLH=0.2539, NMSE=0.2174       
simpleGradientStep iter=11     
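
The validation output reports `NLLH` and `NMSE`. Their exact definitions live in `aggregated_models.validation` and are not shown in this notebook. The sketch below uses one common normalization, the relative improvement over a constant predictor, so that 0 means "no better than predicting the average click rate" and 1 means perfect predictions. Whether `MetricsComputer` uses exactly these formulas is an assumption, and the function names are hypothetical.

```python
import numpy as np

def normalized_llh(pred, y):
    """Relative log-loss improvement over always predicting the mean click rate."""
    pred = np.clip(np.asarray(pred, dtype=float), 1e-12, 1 - 1e-12)
    y = np.asarray(y, dtype=float)
    log_loss = -np.mean(y * np.log(pred) + (1 - y) * np.log(1 - pred))
    p = y.mean()
    baseline = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return (baseline - log_loss) / baseline

def normalized_mse(pred, y):
    """Relative squared-error improvement over always predicting the mean."""
    pred = np.asarray(pred, dtype=float)
    y = np.asarray(y, dtype=float)
    mse = np.mean((pred - y) ** 2)
    return (np.var(y) - mse) / np.var(y)

y = np.array([0, 0, 1, 1])
constant = np.full(4, 0.5)               # predicts the average click rate
better = np.array([0.1, 0.1, 0.9, 0.9])  # informative predictions

nllh_constant = normalized_llh(constant, y)
nllh_better = normalized_llh(better, y)
nmse_better = normalized_mse(better, y)
```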

In [None]:
test = pd.read_csv( datapath + "X_test.csv.gz" , sep=',')
test["clicks"]  = pd.read_csv( datapath + "y_test.csv.gz" , sep=',')["click"]
test["click"] = test["clicks"]

print( Validation.run(model,test))