# Description

When developing a Machine Learning model, arguably the most important component is the data. As the old tenet says, "_Garbage In, Garbage Out_". The data we feed our model must be carefully crafted and checked against the most stringent quality standards.

Databricks has a Feature Store that aims to:
- **Aid feature discovery and democratization**, reducing duplicated effort spent recreating features that already exist.
- **Take a data-governance-first approach**, where features are governed alongside models and functions in the catalog, ensuring each role has the right permissions.
- **Track lineage**, increasing Data Scientists' confidence that their models are built with the right data and logic, and helping ML and Data Engineers track the downstream impact of changes.
- **Support cross-workspace access**.
- **Eliminate skew between model training and scoring**, ensuring the model produces consistent results by applying the same featurization logic in both stages.
- **Eliminate skew between offline and online serving**: as with the point above, there is no need for a complex system that replicates logic across systems.
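
The training/scoring point can be illustrated with a minimal sketch in plain Python. It assumes nothing about the Feature Store API: `featurize` and the field names `local_minutes` and `local_calls` are hypothetical, chosen only to show one function acting as the single source of feature logic for both stages.

```python
def featurize(raw: dict) -> dict:
    """Single definition of the feature logic (hypothetical field names)."""
    calls = raw["local_calls"]
    minutes = raw["local_minutes"]
    return {
        "n_local_calls": calls,
        # Guard against division by zero when there were no calls.
        "avg_local_call_minutes": minutes / calls if calls else 0.0,
    }


# Training: the function is applied to historical records.
historical = [
    {"local_calls": 3, "local_minutes": 30},
    {"local_calls": 0, "local_minutes": 0},
]
training_rows = [featurize(r) for r in historical]

# Scoring: the very same function is applied to an incoming record,
# so both stages cannot drift apart.
scoring_row = featurize({"local_calls": 4, "local_minutes": 10})
```

Because both paths call the same function, a change to the feature definition reaches training and scoring at the same time.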

# Boilerplate

## Dependencies

## Parameters

In [0]:
dbutils.widgets.text("catalog_name", "", "00 - Catalog Name")
dbutils.widgets.text("schema_name", "", "01 - Schema Name")

In [0]:
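# Read every widget value into a dict of {name: value}.
params = dbutils.widgets.getAll()

# Fail fast if any parameter was left empty.
for key, value in params.items():
  assert value != "", f"Parameter {key} is empty"

# Expose each parameter (e.g. catalog_name, schema_name) as a notebook-level variable.
globals().update(params)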

# Main

## Demographic Features

In [0]:
%sql
CREATE OR REPLACE FUNCTION IDENTIFIER(:catalog_name || '.' || :schema_name || '.' || 'day_difference')(from_date DATE, to_date DATE)
RETURNS INT
LANGUAGE PYTHON
COMMENT 'Computes the difference in days between two dates (to_date - from_date).'
AS $$
from datetime import date
def day_difference(from_date: date, to_date: date) -> int:
  return (to_date - from_date).days

return day_difference(from_date, to_date)
$$
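
The body of the function can be tried outside the catalog. The following is a minimal sketch that assumes the `DATE` arguments arrive as `datetime.date` values; the sample dates are made up for illustration.

```python
from datetime import date


def day_difference(from_date: date, to_date: date) -> int:
    """Whole days from from_date to to_date (negative if to_date is earlier)."""
    return (to_date - from_date).days


forward = day_difference(date(2024, 1, 1), date(2024, 1, 31))
backward = day_difference(date(2024, 3, 1), date(2024, 2, 28))
print(forward, backward)
```

Note that the result is signed: swapping the arguments flips the sign, so callers that want an age or tenure should pass the earlier date first.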

## Weekly Features

In [0]:
weekly_features = spark.sql(f"""
WITH weekly_events AS (
  SELECT
    c.customer_id
    , CAST(DATE_TRUNC('WEEK', event_ts) AS DATE) event_week
    , SUM(CASE WHEN event_type = 'sms' THEN 1 ELSE 0 END) AS w_n_messages
    , SUM(CASE WHEN event_type = 'local call' THEN 1 ELSE 0 END) AS w_n_local_calls
    , SUM(CASE WHEN event_type = 'local call' THEN minutes ELSE 0 END) AS w_local_calls_minutes
    , AVG(CASE WHEN event_type = 'local call' THEN minutes END) AS w_local_calls_avg_minutes
    , SUM(CASE WHEN event_type = 'ld call' THEN 1 ELSE 0 END) AS w_n_ld_calls
    , SUM(CASE WHEN event_type = 'ld call' THEN minutes ELSE 0 END) AS w_ld_calls_minutes
    , AVG(CASE WHEN event_type = 'ld call' THEN minutes END) AS w_ld_calls_avg_minutes
    , SUM(CASE WHEN event_type = 'intl call' THEN 1 ELSE 0 END) AS w_n_intl_calls
    , SUM(CASE WHEN event_type = 'intl call' THEN minutes ELSE 0 END) AS w_intl_calls_minutes
    , AVG(CASE WHEN event_type = 'intl call' THEN minutes END) AS w_intl_calls_avg_minutes
    FROM {catalog_name}.{schema_name}.events
    INNER JOIN {catalog_name}.{schema_name}.customers c
      USING (device_id)
    GROUP BY ALL
)

SELECT
  *
  , COALESCE(
        LAG(w_n_messages) OVER (PARTITION BY customer_id ORDER BY event_week),
        0) lw_n_messages
  , COALESCE(
        LAG(w_n_local_calls) OVER (PARTITION BY customer_id ORDER BY event_week),
        0) lw_n_local_calls
  , COALESCE(
        LAG(w_local_calls_minutes) OVER (PARTITION BY customer_id ORDER BY event_week),
        0) lw_local_calls_minutes
  , COALESCE(
        LAG(w_local_calls_avg_minutes) OVER (PARTITION BY customer_id ORDER BY event_week),
        0) lw_local_calls_avg_minutes
  , COALESCE(
        LAG(w_n_ld_calls) OVER (PARTITION BY customer_id ORDER BY event_week),
        0) lw_n_ld_calls
  , COALESCE(
        LAG(w_ld_calls_minutes) OVER (PARTITION BY customer_id ORDER BY event_week),
        0) lw_ld_calls_minutes
  , COALESCE(
        LAG(w_ld_calls_avg_minutes) OVER (PARTITION BY customer_id ORDER BY event_week),
        0) lw_ld_calls_avg_minutes
  , COALESCE(
        LAG(w_n_intl_calls) OVER (PARTITION BY customer_id ORDER BY event_week),
        0) lw_n_intl_calls
  , COALESCE(
        LAG(w_intl_calls_minutes) OVER (PARTITION BY customer_id ORDER BY event_week),
        0) lw_intl_calls_minutes
  , COALESCE(
        LAG(w_intl_calls_avg_minutes) OVER (PARTITION BY customer_id ORDER BY event_week),
        0) lw_intl_calls_avg_minutes
FROM weekly_events
""")

(
    weekly_features
    .write.mode("overwrite")
    .saveAsTable(f"{catalog_name}.{schema_name}.customer_weekly_features")
)
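
To make the `COALESCE(LAG(...), 0)` pattern concrete, here is a minimal plain-Python sketch of the same logic on a handful of made-up rows. The helper `add_previous_week` and the sample data are hypothetical; only the column names come from the query above.

```python
from datetime import date


def add_previous_week(rows, value_col, lag_col):
    """Mimic COALESCE(LAG(value_col) OVER (PARTITION BY customer_id ORDER BY event_week), 0)."""
    last_seen = {}  # customer_id -> value from that customer's previous row
    out = []
    for row in sorted(rows, key=lambda r: (r["customer_id"], r["event_week"])):
        previous = last_seen.get(row["customer_id"])
        enriched = dict(row)
        # COALESCE: a missing previous row or a NULL value both become 0.
        enriched[lag_col] = 0 if previous is None else previous
        last_seen[row["customer_id"]] = row[value_col]
        out.append(enriched)
    return out


sample_rows = [
    {"customer_id": "a", "event_week": date(2024, 1, 1), "w_n_messages": 5},
    {"customer_id": "a", "event_week": date(2024, 1, 8), "w_n_messages": 3},
    {"customer_id": "a", "event_week": date(2024, 1, 22), "w_n_messages": 7},
    {"customer_id": "b", "event_week": date(2024, 1, 8), "w_n_messages": 2},
]
lagged = add_previous_week(sample_rows, "w_n_messages", "lw_n_messages")
for r in lagged:
    print(r["customer_id"], r["event_week"], r["w_n_messages"], r["lw_n_messages"])
```

In the sample, customer `a` has no row for the week of 2024-01-15, so the row for 2024-01-22 picks up the value from 2024-01-08. `LAG` looks at the previous row in the partition, which is the previous week with events and not necessarily the previous calendar week.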

In [0]:
%sql
SELECT
  *
FROM IDENTIFIER(:catalog_name || '.' || :schema_name || '.' || 'customer_weekly_features')
LIMIT 10

In [0]:
%sql
ALTER TABLE IDENTIFIER(:catalog_name || '.' || :schema_name || '.' || 'customer_weekly_features')
  ALTER COLUMN customer_id SET NOT NULL;

ALTER TABLE IDENTIFIER(:catalog_name || '.' || :schema_name || '.' || 'customer_weekly_features')
  ALTER COLUMN event_week SET NOT NULL;

ALTER TABLE IDENTIFIER(:catalog_name || '.' || :schema_name || '.' || 'customer_weekly_features')
  ADD CONSTRAINT customer_weekly_features_pk PRIMARY KEY (customer_id, event_week TIMESERIES);
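
The `TIMESERIES` keyword marks `event_week` as the time series part of the primary key. Conceptually, this lets a lookup pick, for each customer, the most recent feature row at or before a given date. The plain-Python sketch below illustrates that idea only; it is not how Databricks implements it, and `point_in_time_lookup` and the sample rows are hypothetical.

```python
from bisect import bisect_right
from datetime import date


def point_in_time_lookup(feature_rows, customer_id, as_of):
    """Return the latest row for customer_id with event_week <= as_of, or None."""
    candidates = sorted(
        (r for r in feature_rows if r["customer_id"] == customer_id),
        key=lambda r: r["event_week"],
    )
    weeks = [r["event_week"] for r in candidates]
    # bisect_right counts the weeks that are <= as_of.
    idx = bisect_right(weeks, as_of)
    return candidates[idx - 1] if idx else None


feature_rows = [
    {"customer_id": "a", "event_week": date(2024, 1, 1), "w_n_messages": 5},
    {"customer_id": "a", "event_week": date(2024, 1, 8), "w_n_messages": 3},
    {"customer_id": "b", "event_week": date(2024, 1, 8), "w_n_messages": 2},
]
mid_week = point_in_time_lookup(feature_rows, "a", date(2024, 1, 10))
too_early = point_in_time_lookup(feature_rows, "a", date(2023, 12, 31))
print(mid_week, too_early)
```

Looking up only rows at or before the requested date keeps future information out of a training set, which is the usual motivation for a time series key.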

In [0]:
spark.sql(f"""
ALTER TABLE {catalog_name}.{schema_name}.customer_weekly_features
  ADD CONSTRAINT customer_weekly_features_customers_fk
    FOREIGN KEY(customer_id) REFERENCES {catalog_name}.{schema_name}.customers;""")