# CRITEO SPONSORED SEARCH CONVERSION LOG DATASET

## WHAT IS THIS NOTEBOOK ABOUT

This notebook presents exploratory data analysis (EDA), elements of feature engineering, and correlation tests on the "CRITEO SPONSORED SEARCH CONVERSION LOG DATASET".

## CONTENTS

1. INTRODUCTION
2. EXPLORATORY DATA ANALYSIS
3. TESTING HYPOTHESES
4. FEATURE ENGINEERING
5. STATISTICAL TESTING OF DESCRIBING FEATURES
6. COMPETITION METRIC
7. SUMMARY
8. LITERATURE


# 1. INTRODUCTION

## DESCRIPTION OF THE DATASET

The Criteo Sponsored Search Conversion Log Dataset contains logs obtained from Criteo Predictive Search. \
Each row in the dataset represents an action performed by the user on a product-related advertisement.

### Data description

- Sale: 1 if a conversion occurred and 0 if not.
- SalesAmountInEuro: The revenue obtained when a conversion took place. This might differ from product_price due to attribution issues. It is -1 when no conversion took place.
- time_delay_for_conversion: The time between click and conversion. It is -1 when no conversion took place.

- click_timestamp: Timestamp of the click. The dataset is sorted according to timestamp.
- nb_clicks_1week: Number of clicks the product-related advertisement has received in the last week.
- product_price: Price of the product shown in the advertisement.
- product_age_group: The age group of the users the product is intended for.
- device_type: This indicates whether it is a returning user or a new user on mobile, tablet or desktop.
- audience_id: We do not disclose the meaning of this feature.
- product_gender: The gender of the users the product is intended for.
- product_brand: Categorical feature about the brand of the product.
- product_category(1-7): Categorical features associated with the product. We do not disclose the meaning of these features.
- product_country: Country in which the product is sold.
- product_id: Unique identifier associated with every product.
- product_title: Hashed title of the product.
- partner_id: Unique identifier associated with the seller of the product.
- user_id: Unique identifier associated with every user.

**All categorical features have been hashed**, **-1 is the missing value indicator**

For more information about the dataset head over to https://ailab.criteo.com/criteo-sponsored-search-conversion-log-dataset/

## IMPORTS FOR THE NOTEBOOK

In [1]:
import os
import numpy as np
import pandas as pd

from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder

from matplotlib import pyplot as plt
import seaborn as sns


%matplotlib inline

## CONSTANTS
Constants that will be used later in the notebook

In [2]:
PARTNER_ID = 'BD01BAFAE73CF38C403978BBB458300C'

ALL_COLUMN_NAMES = ['Sale', 'SalesAmountInEuro', 'time_delay_for_conversion', 'click_timestamp',
                    'nb_clicks_1week', 'product_price', 'product_age_group', 'device_type',
                    'product_gender', 'product_brand','product_category(1)', 'product_category(2)',
                    'product_category(3)', 'product_category(4)','product_category(5)',
                    'product_category(6)', 'product_category(7)', 'product_country', 'product_id',
                    'product_title', 'partner_id', 'user_id']

OBJECT_TYPE_COLUMN_NAMES = ['product_age_group', 'device_type','audience_id', 'product_gender', 'product_brand',
                       'product_category(1)', 'product_category(2)', 'product_category(3)', 'product_category(4)',
                       'product_category(5)', 'product_category(6)', 'product_category(7)',
                       'product_country', 'product_id', 'product_title', 'user_id']

## CREATE CSV - CHOOSE THE PARTNER_ID

The analysis is performed for only one of the many `partner_id` values in the dataset. Because the full dataset is large, we created a CSV file containing only the rows related to the chosen `PARTNER_ID`.
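
The extraction step itself is not shown in this notebook. The following is a minimal sketch of how such a per-partner file could be produced, assuming the raw log is a tab-separated text file without a header row. The function name `extract_partner_rows`, the raw file name in the usage comment, the `RAW_COLUMN_NAMES` list and the chunk size are illustrative assumptions, not the original code.

```python
import pandas as pd


def extract_partner_rows(source_path, out_path, partner_id, column_names,
                         sep="\t", chunksize=100_000):
    """Stream a large delimited log and keep only the rows of one partner.

    `column_names` must list every column of the raw file, in file order,
    and must contain 'partner_id'. Returns the number of rows written.
    """
    n_written = 0
    # Read everything as strings so that hashed values and '-1' markers
    # are copied unchanged; chunksize keeps memory usage bounded.
    reader = pd.read_csv(source_path, sep=sep, header=None,
                         names=column_names, dtype=str, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        selected = chunk[chunk["partner_id"] == partner_id]
        # The first chunk creates the file and writes the header,
        # later chunks are appended without a header.
        selected.to_csv(out_path, mode="w" if i == 0 else "a",
                        header=(i == 0), index=False)
        n_written += len(selected)
    return n_written


# Hypothetical usage (file name and RAW_COLUMN_NAMES are assumptions):
# extract_partner_rows("CriteoSearchData", f"CriteoSearchData_{PARTNER_ID}.csv",
#                      PARTNER_ID, RAW_COLUMN_NAMES)
```

Reading in chunks avoids loading the whole log into memory at once, at the cost of one pass over the full file.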

## READ CSV

In [3]:
filepath = f"CriteoSearchData_{PARTNER_ID}.csv"

# fail early with a clear message instead of a NameError on df_raw below
if not os.path.isfile(filepath):
    raise FileNotFoundError(f"{filepath} not found. You have to create an appropriate csv file first.")

df_raw = pd.read_csv(filepath, low_memory=False, usecols=ALL_COLUMN_NAMES)

# drop partner_id column -> same for every row
df_raw.drop(labels='partner_id', inplace=True, axis=1)

Before performing EDA, let's preprocess the data: handle NaNs and hashed values.

First of all, let's change all the `-1` values in the dataset to `np.nan` to indicate missing values.

## PREPROCESSING

In [6]:
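# handle NaNs: -1 can appear as the string '-1' (object columns) or as a number
df_nans = df_raw.replace('-1', np.nan)
df_nans.replace(-1, np.nan, inplace=True)

# keep only the rows where a sale (conversion) occurred
df_nans_sale_1 = df_nans.query("Sale == 1")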

Now let's handle the hashed values. First of all, let's check the datatypes of the columns of `df_nans_sale_1`.

In [7]:
print(df_nans_sale_1.dtypes)
print('-'*30)
print(df_nans_sale_1.dtypes.value_counts())

Sale                           int64
SalesAmountInEuro            float64
time_delay_for_conversion    float64
click_timestamp                int64
nb_clicks_1week              float64
product_price                float64
product_age_group             object
device_type                   object
product_gender                object
product_brand                 object
product_category(1)           object
product_category(2)           object
product_category(3)           object
product_category(4)           object
product_category(5)           object
product_category(6)           object
product_category(7)          float64
product_country               object
product_id                    object
product_title                 object
user_id                       object
dtype: object
------------------------------
object     14
float64     5
int64       2
dtype: int64


Thanks to this summary we now know that 14 features have the `object` type, which means they have been hashed (`product_category(7)` is reported as `float64`, apparently because it holds only missing values for this partner). We can simplify every column with the `object` type. Since the values are hashed, we do not lose any valuable information by replacing them with integer codes (the fitted encoders are kept in the dictionary `encoders`, so the original values can be recovered), and it will simplify the EDA of the dataset later. \
For example, we will transform the `product_gender` column by assigning one number to each category. \
To achieve this we will use `LabelEncoder` from the `sklearn.preprocessing` package.

In [8]:
df_nans_encoded = pd.DataFrame()
encoders = {}

for col in df_nans.columns:
    if col in OBJECT_TYPE_COLUMN_NAMES:
        encoder = LabelEncoder()
        # filter not null values from the column
        series_not_null = df_nans_sale_1[col][df_nans_sale_1[col].notnull()]
        # transform the values using LabelEncoder
        df_nans_encoded[col] = pd.Series(encoder.fit_transform(series_not_null), index=series_not_null.index)
        # save the encoder
        encoders[col] = encoder
    else:
        df_nans_encoded[col] = df_nans_sale_1[col]

df_nans_encoded.head(10)

Unnamed: 0,Sale,SalesAmountInEuro,time_delay_for_conversion,click_timestamp,nb_clicks_1week,product_price,product_age_group,device_type,product_gender,product_brand,...,product_category(2),product_category(3),product_category(4),product_category(5),product_category(6),product_category(7),product_country,product_id,product_title,user_id
7,1,119.0,457035.0,1598898651,19.0,119.0,0.0,1,0.0,852.0,...,3.0,15.0,,,,,0,1354,1412.0,2347
12,1,53.0,457.0,1598919368,3.0,53.0,2.0,0,0.0,659.0,...,13.0,,,,,,0,2179,804.0,1520
35,1,178.0,101671.0,1598903859,0.0,89.0,0.0,1,2.0,396.0,...,6.0,27.0,,,,,0,317,1010.0,970
41,1,103.0,986.0,1598890694,1.0,103.0,0.0,1,0.0,10.0,...,6.0,36.0,1.0,,,,0,1995,10.0,1416
52,1,173.0,585181.0,1598829047,305.0,173.0,1.0,1,2.0,855.0,...,13.0,,,,,,0,2193,1796.0,2021
55,1,107.0,776964.0,1598820645,0.0,107.0,0.0,1,0.0,778.0,...,6.0,42.0,,,,,0,627,754.0,1352
67,1,680.0,1281755.0,1598851564,0.0,157.0,0.0,1,0.0,375.0,...,13.0,,,,,,0,258,1261.0,1489
76,1,70.0,85063.0,1598823977,0.0,21.0,0.0,1,0.0,876.0,...,13.0,,,,,,0,1321,1572.0,1283
86,1,94.0,712911.0,1598816362,0.0,44.0,0.0,1,2.0,372.0,...,13.0,,,,,,0,254,363.0,157
106,1,34.0,3221.0,1597441520,2.0,34.0,0.0,0,0.0,218.0,...,13.0,,,,,,0,396,436.0,1831
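
Because each fitted `LabelEncoder` is kept, the integer codes can be mapped back to the original hashed values. The following is a minimal sketch of that round trip on toy data. The hash strings below are made up for illustration and do not come from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with made-up hashed values and one missing entry
hashed = pd.Series(["C3F1", None, "A9B2", "C3F1"], name="product_gender")

# Encode only the non-null values, keeping their original index
encoder = LabelEncoder()
not_null = hashed[hashed.notnull()]
codes = pd.Series(encoder.fit_transform(not_null), index=not_null.index)

# encoder.classes_ holds the sorted original values;
# each code is the position of the value in that array
restored = pd.Series(encoder.inverse_transform(codes), index=codes.index)

print(codes.tolist())     # [1, 0, 1]
print(restored.tolist())  # ['C3F1', 'A9B2', 'C3F1']
```

In the notebook the same call would take the form `encoders[col].inverse_transform(...)`, applied to the non-null integer codes of a column.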


## CREATE FINAL DATASET

The final dataset aggregates the encoded conversions per day and product: the number of sales, the summed revenue, and the mean (or median) of the remaining features.

In [9]:
df_nans_encoded.click_timestamp = pd.to_datetime(df_nans_encoded.click_timestamp, unit='s', origin='unix')
df_nans_encoded['day'] = df_nans_encoded.click_timestamp.dt.date
df = df_nans_encoded.groupby(['day', 'product_id']).agg({'Sale' : 'size',
                                                         'SalesAmountInEuro': 'sum',
                                                         'time_delay_for_conversion': ['mean', 'median'],
                                                         'nb_clicks_1week': 'mean',
                                                         'product_price': ['sum', 'mean'],
                                                         'product_age_group': 'mean', 
                                                         'device_type': 'mean', 
                                                         'product_gender': 'mean',
                                                         'product_brand': 'mean', 
                                                         'product_category(1)': 'mean', 
                                                         'product_category(2)': 'mean', 
                                                         'product_category(3)': 'mean', 
                                                         'product_category(4)': 'mean', 
                                                         'product_category(5)': 'mean', 
                                                         'product_category(6)': 'mean', 
                                                         'product_category(7)': 'mean', 
                                                         'product_country': 'mean', 
                                                         'product_title': 'mean'})

df.reset_index(inplace=True)
df.columns = ['_'.join(temp).strip('_') for temp in df.columns.to_flat_index() ]
df.head(10)

Unnamed: 0,day,product_id,Sale_size,SalesAmountInEuro_sum,time_delay_for_conversion_mean,time_delay_for_conversion_median,nb_clicks_1week_mean,product_price_sum,product_price_mean,product_age_group_mean,...,product_brand_mean,product_category(1)_mean,product_category(2)_mean,product_category(3)_mean,product_category(4)_mean,product_category(5)_mean,product_category(6)_mean,product_category(7)_mean,product_country_mean,product_title_mean
0,2020-08-04,1068,2,580.0,57369.0,57369.0,0.0,290.0,145.0,0.0,...,343.0,0.0,13.0,,,,,,0,1895.0
1,2020-08-04,2326,2,228.0,986.0,986.0,0.0,228.0,114.0,0.0,...,379.0,0.0,6.0,42.0,,,,,0,583.0
2,2020-08-05,19,2,254.0,105006.0,105006.0,3.0,254.0,127.0,1.0,...,343.0,0.0,13.0,,,,,,0,260.0
3,2020-08-05,32,2,178.0,479.0,479.0,5.0,178.0,89.0,0.0,...,343.0,0.0,13.0,,,,,,0,1038.0
4,2020-08-05,132,2,194.0,479925.0,479925.0,1.0,194.0,97.0,0.0,...,105.0,0.0,13.0,,,,,,0,211.0
5,2020-08-05,580,2,380.0,1643.0,1643.0,26.0,380.0,190.0,0.0,...,343.0,0.0,13.0,,,,,,0,357.0
6,2020-08-05,682,2,250.0,424.0,424.0,0.0,190.0,95.0,0.0,...,343.0,0.0,13.0,,,,,,0,1304.0
7,2020-08-05,734,2,300.0,85069.0,85069.0,0.0,152.0,76.0,0.0,...,707.0,0.0,3.0,15.0,,,,,0,1328.0
8,2020-08-05,785,2,256.0,666.0,666.0,0.0,256.0,128.0,0.0,...,343.0,0.0,13.0,,,,,,0,1377.0
9,2020-08-05,1074,2,222.0,969.0,969.0,0.0,222.0,111.0,1.0,...,791.0,0.0,13.0,,,,,,0,192.0


# 2. EXPLORATORY DATA ANALYSIS

This part of the notebook presents the EDA performed on the final dataset created earlier. \
First of all, we will calculate statistics for the columns:
- `Sale_size`
- `SalesAmountInEuro_sum` 
- `time_delay_for_conversion_mean`

Statistics:
- Mean
- Median
- Variance
- Standard deviation
- Skewness
- Kurtosis
- 0.25, 0.5, 0.75 Quantiles
- IQR
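
The statistics listed above can be computed directly with pandas. The following is a minimal sketch on toy data. The helper name `describe_columns` and the toy values are illustrative assumptions. Note that pandas reports the sample variance (`ddof=1`) and the excess kurtosis (0 for a normal distribution), and skips NaN values by default.

```python
import pandas as pd


def describe_columns(frame, columns):
    """Return one row per column with the statistics listed above."""
    subset = frame[columns]
    q25, q50, q75 = (subset.quantile(q) for q in (0.25, 0.5, 0.75))
    return pd.DataFrame({
        "mean": subset.mean(),
        "median": subset.median(),
        "variance": subset.var(),   # sample variance (ddof=1)
        "std": subset.std(),        # sample standard deviation
        "skewness": subset.skew(),
        "kurtosis": subset.kurt(),  # excess kurtosis
        "q25": q25,
        "q50": q50,
        "q75": q75,
        "iqr": q75 - q25,           # interquartile range
    })


# Toy data standing in for the aggregated dataset
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 2.0, 2.0, 10.0]})
stats = describe_columns(toy, ["a", "b"])
print(stats.T)
```

Applied to this notebook, the call would be `describe_columns(df, ['Sale_size', 'SalesAmountInEuro_sum', 'time_delay_for_conversion_mean'])`.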

In [12]:
df_subset = df[['Sale_size', 'SalesAmountInEuro_sum', 'time_delay_for_conversion_mean']]

df[df_subset.time_delay_for_conversion_mean.isna()]

Unnamed: 0,day,product_id,Sale_size,SalesAmountInEuro_sum,time_delay_for_conversion_mean,time_delay_for_conversion_median,nb_clicks_1week_mean,product_price_sum,product_price_mean,product_age_group_mean,...,product_brand_mean,product_category(1)_mean,product_category(2)_mean,product_category(3)_mean,product_category(4)_mean,product_category(5)_mean,product_category(6)_mean,product_category(7)_mean,product_country_mean,product_title_mean
364,2020-08-17,464,2,196.0,,,5.0,196.0,98.0,0.0,...,343.0,0.0,13.0,,,,,,0,1386.0
387,2020-08-18,418,2,78.0,,,0.0,78.0,39.0,0.0,...,74.0,0.0,13.0,,,,,,0,99.0
623,2020-08-26,1501,2,258.0,,,0.0,258.0,129.0,0.0,...,146.0,3.0,7.0,,,,,,0,210.0
949,2020-09-04,1527,2,538.0,,,0.0,538.0,269.0,0.0,...,102.0,0.0,6.0,36.0,1.0,,,,0,177.0
1300,2020-09-14,1130,2,280.0,,,0.0,280.0,140.0,0.0,...,343.0,0.0,13.0,,,,,,0,1386.0
1329,2020-09-14,2291,2,166.0,,,0.0,166.0,83.0,0.0,...,211.0,0.0,6.0,7.0,,,,,0,431.0
1386,2020-09-15,2229,2,378.0,,,2.0,378.0,189.0,1.0,...,855.0,0.0,13.0,,,,,,0,1804.0
2199,2020-10-15,1704,2,132.0,,,0.0,132.0,66.0,0.0,...,798.0,0.0,13.0,,,,,,0,1705.0
2224,2020-10-16,1802,2,418.0,,,0.0,418.0,209.0,0.0,...,61.0,0.0,6.0,36.0,1.0,,,,0,1296.0
2457,2020-10-26,664,2,196.0,,,0.0,196.0,98.0,0.0,...,343.0,0.0,13.0,,,,,,0,342.0


In [13]:
df_nans_encoded[df_nans_encoded['product_id'] == 664]

Unnamed: 0,Sale,SalesAmountInEuro,time_delay_for_conversion,click_timestamp,nb_clicks_1week,product_price,product_age_group,device_type,product_gender,product_brand,...,product_category(3),product_category(4),product_category(5),product_category(6),product_category(7),product_country,product_id,product_title,user_id,day
23229,1,98.0,,2020-10-26 07:08:39,0.0,98.0,0.0,1,2.0,343.0,...,,,,,,0,664,342.0,832,2020-10-26
56566,1,98.0,,2020-10-26 07:08:39,0.0,98.0,0.0,1,2.0,343.0,...,,,,,,0,664,342.0,832,2020-10-26


# 3. TESTING HYPOTHESES

# 4. FEATURE ENGINEERING

# 5. STATISTICAL TESTING OF DESCRIBING FEATURES

# 6. COMPETITION METRIC

# 7. SUMMARY