# CTR Prediction Solution

* [Avazu CTR Prediction](https://www.kaggle.com/c/avazu-ctr-prediction) dataset from the 2014 Kaggle competition
* 11 days of anonymized bid requests in 24 columns: timestamp, click, slot properties, device properties, and the anonymized `C*` attributes (`C1`, `C14` to `C21`)
* we are going to use a subset of 500,000 requests (pre-cached for our exact queries; anything else needs credentials)
* this is only a couple of hours of data, not even a full day, so the model won't be able to properly capture temporal features
* we are not doing any serious data science here
* using [Openschema catalog](https://openschema.readthedocs.io/en/latest/_auto/_auto/openschema.kaggle.Avazu.html)


## Project Setup

Let's start by setting up the project skeleton and the main components.

### Starting New Project

We are going to initialize a new ForML project with the following parameters:

* the project name will be `forml-solution-avazuctr`
* but we want the Python package to be called just `avazuctr`
* setting the initial project version to `0.1`
* anticipated dependency requirements are `openschema` and `pandas`

In [1]:
! forml project --path .. init "forml-solution-avazuctr" \
    --version "0.1" \
    --package "avazuctr" \
    --requirements="openschema==0.7,pandas==2.0.1"

In [None]:
! tree ../forml-solution-avazuctr

In [None]:
%cd ../forml-solution-avazuctr

We are going to keep the project under version control from the beginning:

In [None]:
! git init .
! git add .

### Defining Project Source

We use the [Openschema catalog](https://openschema.readthedocs.io/en/latest/_auto/_auto/openschema.kaggle.Avazu.html) to specify the data requirements.

The schema contains the following set of fields (see the [schema page](https://openschema.readthedocs.io/en/latest/_auto/_auto/openschema.kaggle.Avazu.html) for their descriptions):

In [None]:
from openschema import kaggle

print([f.name for f in kaggle.Avazu.schema])

Let's define the [avazuctr/source.py](../forml-solution-avazuctr/avazuctr/source.py) by using this schema:

1. Open the [avazuctr/source.py](../forml-solution-avazuctr/avazuctr/source.py) component.
2. Select the required `FEATURES` (excluding `id`, `device_ip`, and `device_id`):
```python
from openschema import kaggle as schema
from forml import project
from forml.pipeline import payload

# Listing the feature columns
FEATURES = (
    schema.Avazu.hour,
    schema.Avazu.C1,
    schema.Avazu.banner_pos,
    schema.Avazu.site_id,
    schema.Avazu.site_domain,
    schema.Avazu.site_category,
    schema.Avazu.app_id,
    schema.Avazu.app_domain,
    schema.Avazu.app_category,
    schema.Avazu.device_model,
    schema.Avazu.device_type,
    schema.Avazu.device_conn_type,
    schema.Avazu.C14,
    schema.Avazu.C15,
    schema.Avazu.C16,
    schema.Avazu.C17,
    schema.Avazu.C18,
    schema.Avazu.C19,
    schema.Avazu.C20,
    schema.Avazu.C21,
)
```

3. Point `OUTCOMES` to `schema.Avazu.click`.
4. For continuous data (time series), we also need to point ForML to the time dimension to allow for incremental processing; here it is `schema.Avazu.hour`.
5. Compose the source query with the familiar `payload.ToPandas`:
```python
OUTCOMES = schema.Avazu.click
ORDINAL = schema.Avazu.hour

STATEMENT = ... # Write a statement to select just the listed FEATURES, order it by schema.Avazu.hour and limit it to just 500.000 rows

# Setting up the source descriptor:
SOURCE = (
    project.Source.query(STATEMENT, OUTCOMES, ordinal=ORDINAL)
    >> payload.ToPandas()
)

# Registering the descriptor
project.setup(SOURCE)
```
6. **SAVE THE [avazuctr/source.py](../forml-solution-avazuctr/avazuctr/source.py) FILE!**

In [6]:
! git add avazuctr/source.py

### Defining Evaluation Metric

The generated [avazuctr/evaluation.py](../forml-solution-avazuctr/avazuctr/evaluation.py) contains some default evaluation logic (calculating accuracy using 20% holdout). Let's modify the file changing the metric to `logloss`:

1. Open the [avazuctr/evaluation.py](../forml-solution-avazuctr/avazuctr/evaluation.py) component.
2. Update it with the code below specifying the `logloss` metric:
```python
from forml import evaluation, project
from sklearn import metrics

# Using LogLoss on a 20% holdout dataset:
EVALUATION = project.Evaluation(
    evaluation.Function(metrics.log_loss),
    evaluation.HoldOut(test_size=0.2, stratify=True, random_state=42),
)

# Registering the descriptor
project.setup(EVALUATION)
```
3. **SAVE THE [avazuctr/evaluation.py](../forml-solution-avazuctr/avazuctr/evaluation.py) FILE!**

In [7]:
! git add avazuctr/evaluation.py
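
To make the chosen metric more concrete, the following is a minimal pure-Python sketch of the binary log loss formula. It is only an illustration: the project itself relies on `sklearn.metrics.log_loss`, and the helper name `binary_log_loss`, the clipping constant `eps`, and the sample values are assumptions made for this example.

```python
import math


def binary_log_loss(labels, probabilities, eps=1e-15):
    """Average negative log-likelihood of binary labels given predicted probabilities."""
    total = 0.0
    for label, prob in zip(labels, probabilities):
        # Clip the probability so that log(0) can never occur.
        prob = min(max(prob, eps), 1 - eps)
        total += label * math.log(prob) + (1 - label) * math.log(1 - prob)
    return -total / len(labels)


# Hypothetical click labels and predicted click probabilities for four requests.
loss = binary_log_loss([1, 0, 0, 1], [0.9, 0.1, 0.2, 0.6])
print(round(loss, 4))
```

Lower values are better: confident predictions that turn out wrong are penalized heavily, which is why the metric is commonly used for click probability models.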

## Exploration

We can now interactively use our project skeleton to peek into the data:

In [None]:
from forml import project
PROJECT = project.open(path='.', package='avazuctr')
PROJECT.launcher.apply()

Launching it in the _train mode_ allows us to explore the trainset:

In [None]:
trainset = PROJECT.launcher.train()
trainset.features.isnull().sum()

In [None]:
trainset.features.describe()

In [None]:
trainset.labels.value_counts()

## Informal Base Pipeline

Let's now put together some minimal feature engineering to fit our base model.

### Extracting Time Features

We implement a simple stateless transformer for extracting temporal features from the `hour` timestamp:

In [12]:
import pandas
from forml.pipeline import wrap

@wrap.Operator.mapper
@wrap.Actor.apply
def TimeExtractor(features: pandas.DataFrame) -> pandas.DataFrame:
    """Transformer extracting temporal features from the original ``hour`` column."""
    assert 'hour' in features.columns, 'Missing column: hour'
    time = features['hour']
    ...  # add to features a column `dayofweek` (hint: time.dt.dayofweek)
    ...  # add to features a column `day` with the day (of month) number
    ...  # replace the column `hour` with the hour (of day) number
    ...  # add to features a column `month` with the month number
    return features
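
One possible way of deriving these columns is sketched below in plain pandas, without the ForML decorators, so that it can be run standalone. The function name `extract_time_features` and the one-row sample frame are assumptions of this example, and unlike the cell above it works on a copy instead of modifying the input in place.

```python
import pandas


def extract_time_features(features: pandas.DataFrame) -> pandas.DataFrame:
    """Derive calendar features from the ``hour`` timestamp column."""
    assert 'hour' in features.columns, 'Missing column: hour'
    features = features.copy()  # keep the caller's frame untouched
    time = features['hour']
    features['dayofweek'] = time.dt.dayofweek  # Monday=0 ... Sunday=6
    features['day'] = time.dt.day  # day of month
    features['month'] = time.dt.month
    features['hour'] = time.dt.hour  # replace the timestamp with the hour of day
    return features


# Hypothetical single-row input resembling the trainset.
sample = pandas.DataFrame(
    {'hour': pandas.to_datetime(['2014-10-21 03:00:00']), 'C1': [1005]}
)
result = extract_time_features(sample)
print(result)
```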

In [13]:
SOURCE = PROJECT.components.source 
...  # Bind the TimeExtractor() as our pipeline to SOURCE and launch it using Dask in apply mode

|  | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | app_category | device_model | ... | C15 | C16 | C17 | C18 | C19 | C20 | C21 | dayofweek | day | month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1005 | 0 | 235ba823 | f6ebf28e | f028772b | ecad2386 | 7801e8d9 | 07d7df22 | 0eb711ec | ... | 320 | 50 | 761 | 3 | 175 | 100075 | 23 | 4 | 31 | 10 |
| 1 | 0 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | 07d7df22 | ecb851b2 | ... | 320 | 50 | 2616 | 0 | 35 | 100083 | 51 | 4 | 31 | 10 |
| 2 | 0 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | 07d7df22 | 1f0bc64f | ... | 320 | 50 | 2616 | 0 | 35 | 100083 | 51 | 4 | 31 | 10 |
| 3 | 0 | 1005 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 51cedd4e | aefc06bd | 0f2161f8 | 542422a7 | ... | 320 | 50 | 1092 | 3 | 809 | 100156 | 61 | 4 | 31 | 10 |
| 4 | 0 | 1005 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 9c13b419 | 2347f47a | f95efa07 | 1f0bc64f | ... | 320 | 50 | 2667 | 0 | 47 | -1 | 221 | 4 | 31 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 499995 | 4 | 1005 | 1 | b7e9786d | b12b9f85 | f028772b | ecad2386 | 7801e8d9 | 07d7df22 | 0eb711ec | ... | 320 | 50 | 2528 | 0 | 167 | -1 | 221 | 4 | 31 | 10 |
| 499996 | 4 | 1005 | 0 | 85f751fd | c4e18dd6 | 50e219e0 | 9c13b419 | 2347f47a | f95efa07 | 7abbbd5c | ... | 320 | 50 | 2717 | 2 | 47 | 100233 | 23 | 4 | 31 | 10 |
| 499997 | 4 | 1005 | 0 | 5b08c53b | 7687a86e | 3e814130 | ecad2386 | 7801e8d9 | 07d7df22 | 8a4875bd | ... | 300 | 250 | 2295 | 2 | 35 | 100075 | 23 | 4 | 31 | 10 |
| 499998 | 4 | 1002 | 0 | 887a4754 | e3d9ca35 | 50e219e0 | ecad2386 | 7801e8d9 | 07d7df22 | fc10a0d3 | ... | 320 | 50 | 2624 | 0 | 35 | -1 | 221 | 4 | 31 | 10 |


### Encoding Categorical Columns

Let's apply the [Target encoding](https://towardsdatascience.com/dealing-with-categorical-variables-by-using-target-encoder-a0f1733a4c69) technique to all the categorical columns. We can use the [TargetEncoder](https://contrib.scikit-learn.org/category_encoders/targetencoder.html) implementation from the [Category-encoders](https://contrib.scikit-learn.org/category_encoders/) package.
As a new dependency, we add it to the [pyproject.toml](../forml-solution-avazuctr/pyproject.toml) together with [Scikit-learn](https://scikit-learn.org/stable/index.html) which we are going to need in the next step:

1. Open the [pyproject.toml](../forml-solution-avazuctr/pyproject.toml).
2. Update it with the config below, adding the new dependencies `category-encoders==2.6.0` and `scikit-learn==1.2.2`:
```toml
[project]
name = "forml-solution-avazuctr"
version = "0.1"
dependencies = [
    "category-encoders==2.6.0",
    "forml==0.93",
    "openschema==0.7",
    "pandas==2.0.1",
    "scikit-learn==1.2.2"
]

[tool.forml]
package = "avazuctr"
```
3. **SAVE THE [pyproject.toml](../forml-solution-avazuctr/pyproject.toml) FILE!**

In [14]:
! git add pyproject.toml

Now we can add the encoder into our pipeline:

In [15]:
...  # import the TargetEncoder from category_encoders under the wrap.importer() context

CATEGORICAL_COLUMNS = [
    "C1", "banner_pos", "site_id", "site_domain",
    "site_category", "app_id", "app_domain", "app_category",
    "device_model", "device_type", "device_conn_type",
    "C14", "C15", "C16", "C17", "C18", "C19", "C20", "C21"
]

SOURCE.bind(
    TargetEncoder(cols=CATEGORICAL_COLUMNS)
).launcher.train().features

|  | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | app_category | device_model | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2014-10-21 00:00:00 | 0.164949 | 0.155877 | 0.211945 | 0.211945 | 0.208603 | 0.196316 | 0.190123 | 0.196119 | 0.225080 | 0.164528 | 0.124586 | 0.167300 | 0.153985 | 0.154188 | 0.208514 | 0.166794 | 0.166436 | 0.171947 | 0.208514 |
| 1 | 2014-10-21 00:00:00 | 0.164949 | 0.155877 | 0.211945 | 0.211945 | 0.208603 | 0.196316 | 0.190123 | 0.196119 | 0.236538 | 0.164528 | 0.169448 | 0.216279 | 0.153985 | 0.154188 | 0.208514 | 0.166794 | 0.166436 | 0.253048 | 0.208514 |
| 2 | 2014-10-21 00:00:00 | 0.164949 | 0.155877 | 0.211945 | 0.211945 | 0.208603 | 0.196316 | 0.190123 | 0.196119 | 0.131059 | 0.164528 | 0.169448 | 0.216279 | 0.153985 | 0.154188 | 0.208514 | 0.166794 | 0.166436 | 0.253048 | 0.208514 |
| 3 | 2014-10-21 00:00:00 | 0.164949 | 0.155877 | 0.211945 | 0.211945 | 0.208603 | 0.196316 | 0.190123 | 0.196119 | 0.293158 | 0.164528 | 0.169448 | 0.167300 | 0.153985 | 0.154188 | 0.208514 | 0.166794 | 0.166436 | 0.253048 | 0.208514 |
| 4 | 2014-10-21 00:00:00 | 0.164949 | 0.195663 | 0.036642 | 0.036642 | 0.036364 | 0.196316 | 0.190123 | 0.196119 | 0.227891 | 0.164528 | 0.169448 | 0.080279 | 0.153985 | 0.154188 | 0.074237 | 0.166794 | 0.166436 | 0.171947 | 0.086167 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 499995 | 2014-10-21 03:00:00 | 0.164949 | 0.155877 | 0.012195 | 0.012195 | 0.030204 | 0.196316 | 0.190123 | 0.196119 | 0.187754 | 0.164528 | 0.169448 | 0.074220 | 0.153985 | 0.154188 | 0.074220 | 0.109077 | 0.166436 | 0.150633 | 0.073100 |
| 499996 | 2014-10-21 03:00:00 | 0.164949 | 0.195663 | 0.442296 | 0.431840 | 0.195131 | 0.196316 | 0.190123 | 0.196119 | 0.127723 | 0.164528 | 0.169448 | 0.278302 | 0.153985 | 0.154188 | 0.298913 | 0.109077 | 0.230515 | 0.114082 | 0.217605 |
| 499997 | 2014-10-21 03:00:00 | 0.164949 | 0.195663 | 0.091357 | 0.092482 | 0.195131 | 0.196316 | 0.190123 | 0.196119 | 0.224570 | 0.164528 | 0.169448 | 0.123115 | 0.153985 | 0.154188 | 0.125073 | 0.109077 | 0.141789 | 0.177072 | 0.217605 |
| 499998 | 2014-10-21 03:00:00 | 0.164949 | 0.155877 | 0.211945 | 0.211945 | 0.208603 | 0.196316 | 0.190123 | 0.196119 | 0.200614 | 0.164528 | 0.169448 | 0.218173 | 0.153985 | 0.154188 | 0.208514 | 0.166794 | 0.166436 | 0.171947 | 0.208514 |
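
The idea behind target encoding can be sketched in a few lines of plain Python: each category is replaced by the mean of the target observed for that category in the training data. This is a simplified illustration only. The `TargetEncoder` from Category-encoders additionally blends the per-category mean with the global mean (smoothing), which is left out here, and the helper names and toy data are made up for this example.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Learn the mean target per category plus the global mean as a fallback."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def encode(categories, means, fallback):
    """Replace each category with its learned mean (unseen ones get the fallback)."""
    return [means.get(category, fallback) for category in categories]


# Hypothetical site identifiers and click labels.
site_ids = ['a', 'a', 'b', 'b', 'b', 'c']
clicks = [1, 0, 0, 0, 1, 1]
means, prior = fit_target_means(site_ids, clicks)
encoded = encode(['a', 'b', 'c', 'unseen'], means, prior)
print(encoded)
```

Because the encoding is learned from the labels, it has to be fitted on the training data only, which is why the encoder is a stateful step of the pipeline rather than a one-off preprocessing of the whole dataset.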


### Base Model Pipeline on the Fly
Let's just append the [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) and the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier to the pipeline:

In [None]:
with wrap.importer():
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import MinMaxScaler

PIPELINE = (
    TimeExtractor()
    >> TargetEncoder(cols=CATEGORICAL_COLUMNS)
    >> MinMaxScaler()
    >> LogisticRegression(max_iter=200, random_state=42)
)
SOURCE.bind(PIPELINE).launcher(runner="graphviz").train()
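
As a rough illustration of what the scaling step contributes, the sketch below applies min-max scaling to a single column in plain Python. Target-encoded columns already lie between 0 and 1, whereas extracted time features such as the hour of day (0 to 23) do not, so rescaling brings all inputs of the linear model to a comparable range. The function name `min_max_scale` and the handling of constant columns (mapped to 0.0) are choices made for this example, not a description of the scikit-learn internals.

```python
def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0.0 and the maximum to 1.0."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # Constant column: there is no spread to rescale, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(value - low) / span for value in values]


# Hypothetical hour-of-day values.
hours = [0, 6, 12, 23]
scaled = min_max_scale(hours)
print(scaled)
```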

### Evaluating the Pipeline

Using our evaluation definition from [avazuctr/evaluation.py](../forml-solution-avazuctr/avazuctr/evaluation.py), we get the `logloss` of our base model:

In [None]:
SOURCE.bind(PIPELINE, evaluation=PROJECT.components.evaluation).launcher.eval()