# Criteo Sponsored Search Conversion Log Dataset Analysis

## Resources
Dataset: [Criteo Sponsored Search Conversion Log Dataset](https://ailab.criteo.com/criteo-sponsored-search-conversion-log-dataset/)

Paper: [Reacting to Variations in Product Demand: An Application for Conversion Rate (CR) Prediction in Sponsored Search](https://arxiv.org/pdf/1806.08211.pdf)

## Shortform Overview of the data
- The entire dataset spans 90 days of logs.
- The attribution window is 30 days.
- Each row in the dataset represents an action (i.e. click) performed by the user on a product-related advertisement.
- Each row may or may not have an associated conversion. If `Sale = 1`, there was a conversion.
- The data has been sub-sampled, and it is not clear how the sampling was done, so user histories may be partial. For instance, several impressions might have _actually_ led to a conversion, but that conversion might be missing from the dataset. Similarly, several impressions might have _actually_ been recorded in the live logs, but some of them might be missing from the dataset. This has implications for accurately recording privacy budget deductions.
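
To make the 30-day window concrete, here is a minimal sketch of how one might look up the impressions that fall inside the window before a conversion. The toy frames, their values, and the names `toy_impressions`, `toy_conversions`, and `ATTRIBUTION_WINDOW_DAYS` are assumptions made for illustration. Matching on user and partner is also an assumption, not something the dataset description prescribes.

```python
import pandas as pd

ATTRIBUTION_WINDOW_DAYS = 30  # window length stated in the dataset description

# Toy data with made-up values (not taken from the dataset).
toy_impressions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "partner_id": ["p1", "p1", "p1"],
    "click_day": [0, 40, 5],
})
toy_conversions = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "partner_id": ["p1", "p1"],
    "conversion_day": [45, 50],
})

# Pair every conversion with every impression of the same user and partner,
# then keep only impressions that fall inside the window before the conversion.
pairs = toy_conversions.merge(toy_impressions, on=["user_id", "partner_id"])
gap = pairs["conversion_day"] - pairs["click_day"]
attributed = pairs[(gap >= 0) & (gap <= ATTRIBUTION_WINDOW_DAYS)]
print(attributed)
```

With sub-sampled data, an empty or short match list for a conversion does not prove that no other impressions existed, which is the caveat raised in the last bullet.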

## Overview of the data
(Copied from the data source website. Emphasis is mine)

This dataset contains logs obtained from Criteo Predictive Search (CPS). CPS, offers an automated end-to-end solution using sophisticated machine learning techniques to improve Google Shopping experience using robust, predictive optimization across every aspect of the advertiser’s campaign. CPS in general has two main aims : (1) Retarget high-value users via behavioral targeting such that the bids are based on each user’s likelihood to make a purchase. (2) Increase ROI using a bidding strategy which incorporates the effects of product characteristics, user intent, device and user behavior.

**Each row in the dataset represents an action (i.e. click) performed by the user on a product related advertisement**. The product advertisement was shown to the user, post the user expressing an intent via an online search engine.  Each row in the dataset, contains information about the product characteristics (age, brand, gender, price), time of the click ( subject to uniform shift), user characteristics and device information. The **logs also contain information on whether the clicks eventually led to a conversion (product was bought) within a 30 day window and the time between click and the conversion**.

**This dataset represents a sample of 90 days of Criteo live traffic data**. Each line corresponds to one click (product related advertisement) that was displayed to a user. For each advertisement, we have detailed information about the product. Further, we also provide information on whether the click led to a conversion, amount of conversion and the time between the click and the conversion. **Data has been sub-sampled** and anonymized so as not to disclose proprietary elements.

In [None]:
import os

# Select the Modin engine before modin.pandas is imported.
os.environ["MODIN_ENGINE"] = "ray"

# import pandas as pd
import modin.pandas as pd
import numpy as np
from datetime import datetime
import plotly.express as px

### Dataset Retrieval
This notebook assumes you already have the dataset downloaded and in the current directory. If you do not, uncomment and run the following cell. Note that the uncompressed data is ~6 GB.

In [None]:
# !wget http://go.criteo.net/criteo-research-search-conversion.tar.gz
# !tar -xzf criteo-research-search-conversion.tar.gz

In [None]:
DATA_FILE = 'Criteo_Conversion_Search/CriteoSearchData'
# Column names (in file order) and the dtype used to parse each column.
dtype = {
    "Sale": np.int32,
    "SalesAmountInEuro": np.float64,
    "Time_delay_for_conversion": np.int32,
    "click_timestamp": np.int32,
    "nb_clicks_1week": pd.Int64Dtype(),
    "product_price": np.float64,
    "product_age_group": str,
    "device_type": str,
    "audience_id": str,
    "product_gender": str,
    "product_brand": str,
    "product_category1": str,
    "product_category2": str,
    "product_category3": str,
    "product_category4": str,
    "product_category5": str,
    "product_category6": str,
    "product_category7": str,
    "product_country": str,
    "product_id": str,
    "product_title": str,
    "partner_id": str,
    "user_id": str,
}
# Per-column sentinel values that should be read as missing (NaN).
na_values = {
    "click_timestamp": "0",
    "nb_clicks_1week": "-1",
    "product_price": "-1",
    "product_age_group": "-1",
    "device_type": "-1",
    "audience_id": "-1",
    "product_gender": "-1",
    "product_brand": "-1",
    "product_category1": "-1",
    "product_category2": "-1",
    "product_category3": "-1",
    "product_category4": "-1",
    "product_category5": "-1",
    "product_category6": "-1",
    "product_category7": "-1",
    "product_country": "-1",
    "product_id": "-1",
    "product_title": "-1",
    "partner_id": "-1",
    "user_id": "-1",
}
columns_to_drop = [
    'product_category1', 'product_category2', 'product_category3', 'product_category4',
    'product_category5', 'product_category6', 'product_category7', 'nb_clicks_1week', 'device_type',
    'product_title', 'product_brand', 'product_gender', 'audience_id', 'product_age_group', 'product_country'
]

In [None]:
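# The file is tab-separated and has no header row, so the column names come from the dtype mapping.
df = pd.read_csv(DATA_FILE, names=list(dtype), dtype=dtype, na_values=na_values, header=None, sep="\t")
# Drop the columns this analysis does not use, then the rows missing any key identifier.
df = df.drop(columns=columns_to_drop)
df = df.dropna(subset=['product_id', 'partner_id', 'user_id'])
df.head()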
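
The per-column `na_values` mapping is what turns the `-1` sentinels into missing values before `dropna` runs. The following is a small sketch of that behaviour on an in-memory sample. It uses plain pandas rather than Modin so that it runs without a Ray cluster, and the two rows are made up for illustration.

```python
import io
import pandas as pd

# Two made-up rows, tab-separated and without a header, like the dataset file.
raw = "1\t12.5\tA1\n0\t-1\t-1\n"

toy = pd.read_csv(
    io.StringIO(raw),
    names=["Sale", "product_price", "user_id"],
    dtype={"Sale": "int32", "product_price": "float64", "user_id": str},
    na_values={"product_price": "-1", "user_id": "-1"},  # sentinel -> NaN, per column
    header=None,
    sep="\t",
)
print(toy)

# Rows whose identifier was the sentinel are removed by dropna.
toy_clean = toy.dropna(subset=["user_id"])
print(toy_clean)
```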

In [None]:
# Convert the (shifted) click timestamps to datetimes. Note that fromtimestamp() uses the local timezone.
df["click_datetime"] = df["click_timestamp"].apply(lambda x: datetime.fromtimestamp(x))
# Day index built from the ISO week and weekday. This assumes all rows fall within a single ISO year.
df["click_day"] = df["click_datetime"].apply(
    lambda x: (7 * (x.isocalendar().week - 1)) + x.isocalendar().weekday
)
# Rebase so that the earliest click day is day 0.
min_click_day = df["click_day"].min()
df["click_day"] -= min_click_day

# Conversion time = click time + delay. Only meaningful for rows with Sale == 1.
df["conversion_timestamp"] = df["Time_delay_for_conversion"] + df["click_timestamp"]
df["conversion_datetime"] = df["conversion_timestamp"].apply(
    lambda x: datetime.fromtimestamp(x)
)
df["conversion_day"] = df["conversion_datetime"].apply(
    lambda x: (7 * (x.isocalendar().week - 1)) + x.isocalendar().weekday
)
df["conversion_day"] -= min_click_day

# Every row is an impression (click); only rows with Sale == 1 are conversions.
impressions = df[["click_timestamp", "click_day", "user_id", "partner_id"]]
conversions = df.loc[
    df["Sale"] == 1,
    [
        "conversion_timestamp",
        "conversion_day",
        "user_id",
        "partner_id",
        "SalesAmountInEuro",
    ],
]
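
The day index above is built from ISO week numbers, which restart every ISO year, and from `datetime.fromtimestamp`, which depends on the machine's local timezone. As a possible alternative, the sketch below derives the day index with integer division on the raw timestamps. It counts days in UTC, which is an assumption and differs from the local-time days used above. The toy timestamps and the name `SECONDS_PER_DAY` are made up for illustration.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60

# Made-up timestamps and delays, in seconds.
toy = pd.DataFrame({
    "click_timestamp": [1_600_000_000, 1_600_050_000, 1_600_200_000],
    "Time_delay_for_conversion": [100, 90_000, 10],
})

# Whole days since the Unix epoch (UTC), rebased so the first click day is 0.
first_day = toy["click_timestamp"].min() // SECONDS_PER_DAY
toy["click_day"] = toy["click_timestamp"] // SECONDS_PER_DAY - first_day

conversion_timestamp = toy["click_timestamp"] + toy["Time_delay_for_conversion"]
toy["conversion_day"] = conversion_timestamp // SECONDS_PER_DAY - first_day
print(toy)
```

Because this version is vectorized, it also avoids calling a Python lambda once per row.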

In [None]:
impressions

In [None]:
print(conversions["user_id"].nunique())


In [None]:
total_impressions = impressions.shape[0]
total_conversions = conversions.shape[0]
# nunique() counts distinct non-null values directly, without building a grouped frame.
unique_user_count = df["user_id"].nunique()
unique_partner_count = df["partner_id"].nunique()
unique_product_count = df["product_id"].nunique()
# Number of distinct (partner, product) pairs.
unique_partner_product_count = len(df.groupby(['partner_id', 'product_id']).size())

print("total impressions:", total_impressions, "total conversions:", total_conversions)
print("conversion rate:", total_conversions / total_impressions * 100, "%")
print("unique users:", unique_user_count)
print("unique partners:", unique_partner_count)
print("unique products:", unique_product_count)
print("unique (partner, product) pairs:", unique_partner_product_count)

In [None]:
iuser_counts = impressions.groupby(['user_id']).size().reset_index(name="count")
iuser_counts.describe()

In [None]:
cuser_counts = conversions.groupby(['user_id']).size().reset_index(name="count")
cuser_counts.describe()

In [None]:
iday_counts = impressions.groupby(['click_day']).size().reset_index(name="count")
iday_counts.describe()

In [None]:
iuser_day_counts = impressions.groupby(['user_id', 'click_day']).size().reset_index(name="count")
iuser_day_counts.describe()

In [None]:
cday_counts = conversions.groupby(['conversion_day']).size().reset_index(name="count")
cday_counts.describe()

In [None]:
cuser_day_counts = conversions.groupby(['user_id', 'conversion_day']).size().reset_index(name="count")
cuser_day_counts.describe()

In [None]:
# print(iuser_day_counts.loc[iuser_day_counts['count'] == 1].shape)
# print(iuser_day_counts.loc[iuser_day_counts['count'] > 1].shape)
# print(iuser_day_counts.loc[iuser_day_counts['count'] > 2].shape)
# print(iuser_day_counts.loc[iuser_day_counts['count'] > 3].shape)
# print(iuser_day_counts.loc[iuser_day_counts['count'] > 4].shape)
# print(iuser_day_counts.loc[iuser_day_counts['count'] > 5].shape)
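
The threshold checks in the cell above can also be computed in one pass. This is a minimal sketch on toy data. The frame `toy_impressions` and its values are made up, and plain pandas is used so the example runs on its own.

```python
import pandas as pd

# Made-up impressions: u1 has 3 clicks on day 0, u2 has 2 on day 1, u3 has 1 on day 2.
toy_impressions = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2", "u3"],
    "click_day": [0, 0, 0, 1, 1, 2],
})

user_day_counts = (
    toy_impressions.groupby(["user_id", "click_day"]).size().reset_index(name="count")
)

# Number of (user, day) pairs with exactly one impression.
exactly_one = int((user_day_counts["count"] == 1).sum())
# Number of (user, day) pairs with more than k impressions, for k = 1..5.
more_than = {k: int((user_day_counts["count"] > k).sum()) for k in range(1, 6)}

print("exactly 1:", exactly_one)
print("more than k:", more_than)
```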


In [None]:
# iuser_counts = iuser_counts.sort_values(["count"], ascending=False)
# iuser_counts

In [None]:
# iuser = impressions.query("user_id == 'C8C869CD45415BA13541D602D8EA277E'")
# cuser = conversions.query("user_id == 'C8C869CD45415BA13541D602D8EA277E'")
# iuser
# cuser

In [None]:
conversion_user_day_counts = conversions.groupby(['user_id', 'conversion_day']).size().reset_index(name="count")
conversion_user_day_counts.describe()

In [None]:
# fig = px.ecdf(iuser_day_counts, x="count")
# fig.show()
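
For reference, the empirical CDF that such a plot displays can be computed by hand. The sketch below does so with NumPy on a made-up array of per-(user, day) counts. It illustrates the quantity being plotted and is not meant as a replacement for `px.ecdf`.

```python
import numpy as np

# Made-up per-(user, day) impression counts.
counts = np.array([1, 1, 2, 3, 3, 3])

# Distinct values and how often each occurs (values come back sorted).
values, freq = np.unique(counts, return_counts=True)
# ECDF at each distinct value: fraction of observations <= that value.
ecdf = np.cumsum(freq) / counts.size

for v, p in zip(values, ecdf):
    print(f"P(count <= {v}) = {p:.3f}")
```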

In [None]:
# df.query("partner_id=='319A2412BDB0EF669733053640B80112' and product_id=='C9A3F830655829E5E924423E7417AAB4'")