Solution: Formal Base Model
===========================

In [1]:
%cd forml-solution-avazuctr

/opt/forml/workspace/3-solution/forml-solution-avazuctr


Updating the Project Code Base
------------------------------
Let's now add the pipeline code prototyped on the fly during our [exploration](1-setup-and-exploration.ipynb) to the project as formal components.

### Adding TimeExtractor to source.py

Since `TimeExtractor` is a *stateless* operator working on a *per-row* basis, it can be moved straight into [avazuctr/source.py](forml-solution-avazuctr/avazuctr/source.py), where it gets applied to the dataset before any splitting:

1. Open the [avazuctr/source.py](forml-solution-avazuctr/avazuctr/source.py) component.
2. Update it with the code below, which appends the `TimeExtractor` operator to the source descriptor.
3. Save the file!

```python
import pandas

from forml import project
from forml.pipeline import payload, wrap
from openschema import kaggle as schema

# Using the ForML DSL to specify the data source:
FEATURES = (
    schema.Avazu.select(
        schema.Avazu.hour,
        schema.Avazu.C1,
        schema.Avazu.banner_pos,
        schema.Avazu.site_id,
        schema.Avazu.site_domain,
        schema.Avazu.site_category,
        schema.Avazu.app_id,
        schema.Avazu.app_domain,
        schema.Avazu.app_category,
        schema.Avazu.device_id,
        schema.Avazu.device_ip,
        schema.Avazu.device_model,
        schema.Avazu.device_type,
        schema.Avazu.device_conn_type,
        schema.Avazu.C14,
        schema.Avazu.C15,
        schema.Avazu.C16,
        schema.Avazu.C17,
        schema.Avazu.C18,
        schema.Avazu.C19,
        schema.Avazu.C20,
        schema.Avazu.C21,
    )
    .orderby(schema.Avazu.hour)
    .limit(500000)
)
OUTCOMES = schema.Avazu.click


@wrap.Operator.mapper
@wrap.Actor.apply
def TimeExtractor(features: pandas.DataFrame) -> pandas.DataFrame:
    """Transformer extracting temporal features from the original ``hour`` column."""
    assert "hour" in features.columns, "Missing column: hour"
    time = features["hour"]
    features["dayofweek"] = time.dt.dayofweek
    features["day"] = time.dt.day
    features["hour"] = time.dt.hour  # replacing the original column
    features["month"] = time.dt.month
    return features


# Setting up the source descriptor:
SOURCE = (
    project.Source.query(FEATURES, OUTCOMES, ordinal=schema.Avazu.hour)
    >> payload.ToPandas()
    >> TimeExtractor()
)

# Registering the descriptor
project.setup(SOURCE)
```

**SAVE THE [avazuctr/source.py](forml-solution-avazuctr/avazuctr/source.py) FILE!**
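
To see what the per-row extraction produces, the body of `TimeExtractor` can be reproduced with plain pandas outside of ForML. This is only a sketch for inspection, not part of the project code: `extract_time` is a hypothetical helper name, and unlike the operator above it works on a copy so the sample frame stays untouched.

```python
import pandas


def extract_time(features: pandas.DataFrame) -> pandas.DataFrame:
    """Plain-pandas stand-in for the TimeExtractor body (hypothetical helper)."""
    assert "hour" in features.columns, "Missing column: hour"
    result = features.copy()  # work on a copy instead of mutating the input
    time = result["hour"]  # the original timestamp column
    result["dayofweek"] = time.dt.dayofweek  # Monday=0 ... Sunday=6
    result["day"] = time.dt.day
    result["hour"] = time.dt.hour  # replacing the original column
    result["month"] = time.dt.month
    return result


sample = pandas.DataFrame(
    {"hour": pandas.to_datetime(["2023-02-01 14:12:10", "2023-03-04 06:13:27"])}
)
extracted = extract_time(sample)
print(extracted)
```

Each output row depends only on the matching input row, which is why the operator is safe to apply before the dataset is split.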

In [5]:
! git add avazuctr/source.py

### Adding the Base Model to pipeline.py

Add the base model pipeline code to the [avazuctr/pipeline.py](forml-solution-avazuctr/avazuctr/pipeline.py):

1. Open the [avazuctr/pipeline.py](forml-solution-avazuctr/avazuctr/pipeline.py) component.
2. Update it with the code below specifying the base model pipeline.
3. Save the file!

```python
from forml import project
from forml.pipeline import wrap

with wrap.importer():
    from category_encoders import TargetEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler


CATEGORICAL_COLUMNS = [
    "C1",
    "banner_pos",
    "site_id",
    "site_domain",
    "site_category",
    "app_id",
    "app_domain",
    "app_category",
    "device_id",
    "device_ip",
    "device_model",
    "device_type",
    "device_conn_type",
    "C14",
    "C15",
    "C16",
    "C17",
    "C18",
    "C19",
    "C20",
    "C21",
]

PIPELINE = (
    TargetEncoder(cols=CATEGORICAL_COLUMNS)
    >> StandardScaler()
    >> LogisticRegression(warm_start=True, random_state=42)
)

# Registering the pipeline
project.setup(PIPELINE)
```

**SAVE THE [avazuctr/pipeline.py](forml-solution-avazuctr/avazuctr/pipeline.py) FILE!**
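
For intuition about the first pipeline step, the idea behind target encoding can be sketched with plain pandas: each category is replaced by the mean outcome observed for it in the training data. This is a simplified illustration on made-up data, assuming no smoothing; the `TargetEncoder` from `category_encoders` additionally blends the per-category mean with the global mean, so its numbers will differ. The `encode` helper and the toy frame are hypothetical.

```python
import pandas

# Toy training data (made up for illustration only).
train = pandas.DataFrame(
    {
        "site_category": ["a", "a", "b", "b", "b"],
        "click": [1, 0, 1, 1, 0],
    }
)

# "Fit": mean outcome per category, plus the global mean as a fallback.
category_means = train.groupby("site_category")["click"].mean()
global_mean = train["click"].mean()


def encode(values: pandas.Series) -> pandas.Series:
    """Map categories to their training-time mean outcome (hypothetical helper)."""
    return values.map(category_means).fillna(global_mean)


# "Apply": the unseen category "c" falls back to the global mean.
encoded = encode(pandas.Series(["a", "b", "c"]))
print(encoded)
```

Because the mapping is learned from the outcomes, this step is stateful and has to be fitted on the training split only, which is why it lives in the pipeline rather than in the source descriptor.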

In [6]:
! git add avazuctr/pipeline.py

In [7]:
! forml project eval

running eval
0.45063007802414795


Adding a Unit Test for TimeExtractor
------------------------------------

In [8]:
! touch tests/test_source.py

Edit the created [test_source.py](forml-solution-avazuctr/tests/test_source.py) and implement the unit test:

1. Open the [test_source.py](forml-solution-avazuctr/tests/test_source.py).
2. Update it with the code below providing the `TestTimeExtractor` unit test implementation.
3. Save the file!

```python
import pandas
from forml import testing

from avazuctr import source


class TestTimeExtractor(testing.operator(source.TimeExtractor)):
    """Unit testing the stateless TimeExtractor transformer."""

    # Dataset fixtures
    EMPTY = pandas.DataFrame()
    INPUT = pandas.DataFrame(
        {
            'hour': [
                pandas.Timestamp('2023-02-01 14:12:10'),
                pandas.Timestamp('2023-03-04 06:13:27'),
                pandas.Timestamp('2023-04-10 12:00:00'),
            ]
        }
    )
    EXPECTED = pandas.DataFrame(
        {
            'hour': [14, 6, 12],
            'dayofweek': [2, 5, 0],
            'day': [1, 4, 10],
            'month': [2, 3, 4],
        }
    ).astype('int32')

    # Test scenarios
    missing_column = (
        testing.Case().apply(EMPTY).raises(AssertionError, 'Missing column: hour')
    )
    valid_extraction = (
        testing.Case().apply(INPUT).returns(EXPECTED, testing.pandas_equals)
    )
```

**SAVE THE [test_source.py](forml-solution-avazuctr/tests/test_source.py) FILE!**
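
The two scenarios above can also be spelled out as plain assertions, which may help when debugging a failing case. The sketch below does not use the ForML testing framework: `extract` is a hypothetical stand-in mirroring the `TimeExtractor` body, and `check_dtype=False` is used on the assumption that the integer width returned by the `.dt` accessors may differ between pandas versions.

```python
import pandas
import pandas.testing


def extract(features: pandas.DataFrame) -> pandas.DataFrame:
    """Hypothetical stand-in mirroring the TimeExtractor logic."""
    assert "hour" in features.columns, "Missing column: hour"
    time = features["hour"]
    return pandas.DataFrame(
        {
            "hour": time.dt.hour,
            "dayofweek": time.dt.dayofweek,
            "day": time.dt.day,
            "month": time.dt.month,
        }
    )


# Scenario 1: a frame without the ``hour`` column must be rejected.
try:
    extract(pandas.DataFrame())
except AssertionError as err:
    message = str(err)
else:
    message = None

# Scenario 2: valid timestamps yield the expected temporal features.
frame = pandas.DataFrame(
    {
        "hour": pandas.to_datetime(
            ["2023-02-01 14:12:10", "2023-03-04 06:13:27", "2023-04-10 12:00:00"]
        )
    }
)
expected = pandas.DataFrame(
    {
        "hour": [14, 6, 12],
        "dayofweek": [2, 5, 0],
        "day": [1, 4, 10],
        "month": [2, 3, 4],
    }
)
actual = extract(frame)
# Compare values only; integer width is deliberately ignored.
pandas.testing.assert_frame_equal(actual, expected, check_dtype=False)
```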

In [9]:
! git add tests/test_source.py

In [10]:
! forml project test

running test
running egg_info
creating forml_solution_avazuctr.egg-info
writing forml_solution_avazuctr.egg-info/PKG-INFO
writing dependency_links to forml_solution_avazuctr.egg-info/dependency_links.txt
writing requirements to forml_solution_avazuctr.egg-info/requires.txt
writing top-level names to forml_solution_avazuctr.egg-info/top_level.txt
writing manifest file 'forml_solution_avazuctr.egg-info/SOURCES.txt'
reading manifest file 'forml_solution_avazuctr.egg-info/SOURCES.txt'
writing manifest file 'forml_solution_avazuctr.egg-info/SOURCES.txt'
running build_ext
test_missing_column (tests.test_source.TestTimeExtractor)
Test of Missing Column ... ERROR: 2023-05-11 15:29:38,249: __init__: Instruction TimeExtractor.apply failed when processing arguments: Empty DataFrame
Columns: []
Index: []
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/forml/flow/_code/target/__init__.py", line 56, in __call__
    result = self.execute(*args)
  File "/usr/local/li