# Data Transformation

In [1]:
# | echo: false
%load_ext lab_black

Import Python modules

In [2]:
import os
from datetime import datetime
from functools import reduce
from glob import glob
from typing import Dict, List, Union

import numpy as np
import pandas as pd
import pytz
from feature_engine.encoding import RareLabelEncoder
from google.oauth2 import service_account
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline

## About

### Objective
This step of the analysis prepares data for ML model training. Training, validation and test data splits are created in order to support training an ML model to predict the propensity of new visitors to the Google Merchandise store on the Google Marketplace to make a purchase on a future visit.

### Overview of Data Transformation
The raw data in the dataset is provided per visit. The ML model needs to be trained to make predictions at the visit level for each visitor who made a purchase on a return visit to the store. However, useful features might exist based on

1. actions performed by a visitor within each visit
   - number of items that were added to a shopping cart during the first visit
   - whether a purchase was completed on the first visit
   - etc.
2. device used by the visitor during the visit

All actions performed per visit are found in a nested column `hits` for each visit, so the raw data for this column must be exploded from one row per visit to one row per action in order to extract these action-based features. Exploding a nested column is similar to the `DataFrame.explode()` method in `pandas` ([link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html#pandas.DataFrame.explode)). Then, the data is aggregated again by visit.

Exploding the `hits` column results in multiple rows where the sequence of actions is important. Features can be created based on this sequence. As an example, if a visit ended in one or more items being added to the shopping cart, then the last action performed in the visit is an add-to-cart (encoded as an integer `3`). So, a feature will be created to indicate the last action of each visit, and this requires exploding the `hits` column, which is a nested column. Similarly, to get the browser used during a visit, a nested value `device.browser` needs to be extracted from the `device` column. Exploding the `device` column is not necessary for this purpose, since there is only one browser used per visit (per the definition of a visit) so, when exploded, all rows for the visit contain the same browser. `hits` is a nested column in the data since it contains a list of dictionaries, while `device` is a dictionary. We can directly access `device.browser` from the nested `device` column, which is [supported by `BigQuery`](https://cloud.google.com/bigquery/docs/nested-repeated#define_nested_and_repeated_columns). For this reason, only the `hits` column in the raw data needs to be exploded.
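
As a rough illustration of the explode-then-aggregate idea described above, the following sketch uses a small hypothetical `pandas.DataFrame`. The `visits` data, its values and the helper column names are made up for illustration and are not taken from the BigQuery table. It explodes a list-valued `hits` column to one row per action, reads `browser` directly from the dictionary-valued `device` column without exploding it, and then aggregates back to one row per visit.

```python
import pandas as pd

# hypothetical toy data: one row per visit, `hits` is a list of dictionaries
visits = pd.DataFrame(
    {
        "visitId": [101, 102],
        "hits": [
            [{"action_type": 1}, {"action_type": 2}, {"action_type": 3}],
            [{"action_type": 0}],
        ],
        "device": [{"browser": "Chrome"}, {"browser": "Safari"}],
    }
)

# explode `hits` from one row per visit to one row per action
actions = visits.explode("hits", ignore_index=True)
actions["action_type"] = actions["hits"].map(lambda h: h["action_type"])

# `device` is a single dictionary per visit, so it is accessed directly
actions["browser"] = actions["device"].map(lambda d: d["browser"])

# aggregate the action-level rows back to one row per visit
per_visit = actions.groupby(["visitId", "browser"], as_index=False).agg(
    last_action=("action_type", "max"),
    added_to_cart=("action_type", lambda s: int((s == 3).sum())),
)
print(per_visit)
```

Taking the maximum `action_type` as `last_action` mirrors the `MAX(action_type)` aggregation used in the SQL query in this step.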

Nested columns that are exploded are

1. `hits`

Nested columns that are used without exploding are

1. `totals`
2. `trafficSource`
3. `device`

### Feature Selection
Features are selected based on

1. EDA performed in previous step
2. intuition about features that might be predictive of a new visitor making a purchase on a future visit to the Merchandise store

The following are the types of features that are selected
1. User-Facing Features
   - name of browser
   - type of operating system (e.g. Windows, Macintosh, Android, iOS)
   - [referring channel](https://www.jellyfish.com/en-us/training/blog/google-analytics-channels-explained)
   - type of device
2. General Features
   - hits
   - bounces
   - page views
   - time spent on the Marketplace website

As a reminder, in order to get these columns, we have developed the following approach in previous steps and the same will be used here

1. get visitors who made a purchase on their return visit (this is the `next_visit_purchasers` CTE in the SQL query used in this step)
2. Get all features for the first visit (`totals.newVisits = 1`) for all visitors
3. `INNER JOIN` features with the visitors who made a purchase on their return visit
   - this provides the features for only the visitors who made a purchase on their return visit, since we are not interested in using data for visitors who did not make a purchase on their return visit
4. `GROUP BY` visit and get the last action that was performed in that visit
   - ML modeling will be performed against data at the visit level since we need to predict propensity to make a purchase during a future visit
   - the `UNNEST` function has exploded nested actions per visit on separate rows, so the exploded data in the `first_visit_attributes` CTE is per action
   - since the ML model will be trained on visits, a `GROUP BY` is required to convert actions into visits
   - since we are only retrieving the first visit for each visitor (from 2. above), getting the last action that was performed in that visit indicates how far the visitor advanced in the purchase process during their first visit to the marketplace
     - intuitively, this seems like it could be an indicator of whether the visitor will make a purchase on a return visit to the Google Merchandise store
   - a unique visit is defined by the combination of the following columns
     - `fullvisitorid`
     - `visitId`
     - `visitNumber`
     - `visitStartTime`

     and so a `GROUP BY` must be performed over these columns. The following columns are included as part of the definition of a visit

     - columns of aggregated stats per visit
       - `hits`
       - `bounces`
       - `pageviews`
       - `time_on_site`
     - columns showing features of the traffic source that initiated a visitor's visit
       - `source`
       - `medium`
       - `channelGrouping`
     - columns with features of the visitor's device used during a visit
       - `browser`
       - `os` (operating system)
       - `deviceCategory` (desktop, mobile or tablet)
     - (as mentioned above) the label column is reported per visit
       - `made_purchase_on_future_visit`

     These are columns that only change at the visit level, so we have grouped over these columns. See the two previous steps (data preparation and EDA) for more details about the structure of this SQL query to retrieve these features from the raw data.

### Feature Processing
Without any reduction in cardinality, the `source` and `os` (operating system) columns present the biggest problem due to their large number of unique sub-categories. If one-hot encoding is used to transform the categorical features in the training data, then tree-based models can perform poorly since the one-hot encoded data will be sparse. Bucketing to reduce the cardinality can help the performance of tree-based models. `XGBoost` has specific guidance to handle categorical data ([1](https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html#categorical-data), [2](https://machinelearningmastery.com/data-preparation-gradient-boosting-xgboost-python/)).

In order to avoid categorical features with a high cardinality, we have only kept the most common categories for each categorical feature. A generic category of `Other` was assigned to all other (less commonly occurring) categories. A bucketing strategy, in which all categories accounting for less than 5% or 10% of all observations per categorical feature are grouped into the same bucket, strongly reduced the cardinality in the training data and will again be used in this step. In the data preparation step, this approach was called frequency encoding. Its main disadvantage is that the predictive power of a feature might be reduced by such bucketing.
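
The bucketing idea can be sketched in plain `pandas` as follows. This is only a minimal illustration under assumptions: the helper `group_rare_labels`, the toy `browsers` series and the 10% threshold are hypothetical, and the pipeline used later in this step relies on `RareLabelEncoder` from `feature_engine` rather than on this function.

```python
import pandas as pd


def group_rare_labels(
    s: pd.Series, min_freq: float, other: str = "other"
) -> pd.Series:
    """Replace categories whose relative frequency is below min_freq."""
    # relative frequency of every category in the series
    freq = s.value_counts(normalize=True)
    # categories that are frequent enough to keep as-is
    keep = freq[freq >= min_freq].index
    # all remaining categories are assigned to a single generic bucket
    return s.where(s.isin(keep), other)


# hypothetical toy data with 20 observations
browsers = pd.Series(["Chrome"] * 12 + ["Safari"] * 6 + ["Firefox"] + ["Edge"])
grouped = group_rare_labels(browsers, min_freq=0.10)
print(grouped.value_counts())
```

In this toy example, `Firefox` and `Edge` each account for 5% of the observations, so both fall below the 10% threshold and are merged into the `other` bucket.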

### Data
The raw data from the `BigQuery` table that was used in the preceding (EDA) step is again used here.

### Timeframe for Study
For the entire project, we have assumed that the current date is March 1, 2017. The data splits to be created are

1. training
   - September 1, 2016 to December 31, 2016
2. validation
   - January 1, 2017 to January 31, 2017
3. test
   - February 1, 2017 to February 28, 2017

As we mentioned in the data preparation step, during ML model development, we are only choosing first-time visitors who made a purchase on a return visit to the store between September 1, 2016 and February 28, 2017. This means each selected visitor initially visited the store on or after September 1, 2016. During this first visit, they may or may not have made a purchase. The same visitor also made a return visit to the store during which they made a purchase. This return visit is allowed to occur no later than February 28, 2017.

With this in mind, in the

1. training data
   - each visitor made their first visit as early as September 1, 2016
     - this first visit gives us the ML features (`X`)
   - each such visitor could have made a return visit, in which they made a purchase, as late as February 28, 2017
     - the outcome of this future visit (purchase or no purchase) gives us the ML label (`y`)
     - since today is March 1, 2017, we know the label (`y`) for all these visitors
       - for this reason, when we prepare our data for ML development, our data preparation does not suffer from data leakage/lookahead bias (this was also discussed in the preceding two steps - data preparation and EDA)
     - note that each such visitor only needs to make a purchase on one or more future visits on or before February 28, 2017
2. validation data
   - each visitor made their first visit as early as January 1, 2017
   - each such visitor could have made a return visit, in which they made a purchase, as late as February 28, 2017
3. test data
   - each visitor made their first visit as early as February 1, 2017
   - each such visitor could have made a return visit, in which they made a purchase, as late as February 28, 2017
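
The date windows listed above can be checked with a short sketch. The names `splits`, `label_window_end` and `summary` are hypothetical and are introduced only for this illustration; the dates are the ones listed in this section.

```python
from datetime import date, datetime


def parse(d: str) -> date:
    """Parse a YYYYMMDD string into a date."""
    return datetime.strptime(d, "%Y%m%d").date()


# first-visit windows per data split (dates taken from the list above)
splits = {
    "train": ("20160901", "20161231"),
    "validation": ("20170101", "20170131"),
    "test": ("20170201", "20170228"),
}
# latest date on which a return visit with a purchase may occur
label_window_end = parse("20170228")

summary = {}
for name, (start, end) in splits.items():
    start_d, end_d = parse(start), parse(end)
    summary[name] = {
        # number of calendar days in which a first visit can fall
        "first_visit_days": (end_d - start_d).days + 1,
        # days between the end of the split and the label cutoff
        "days_from_split_end_to_label_cutoff": (label_window_end - end_d).days,
    }

# the splits should not overlap in time
ordered = [(parse(s), parse(e)) for s, e in splits.values()]
no_overlap = all(ordered[i][1] < ordered[i + 1][0] for i in range(len(ordered) - 1))
print(summary, no_overlap)
```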

### Assumptions
None.

### Output
A file containing the features and labels will be exported for the

1. training
2. validation
3. test

data splits.

## User Inputs

Get relative path to project root directory

In [3]:
# | code-fold: false
PROJ_ROOT_DIR = os.path.join(os.pardir)

Define the following

1. train data start date
2. train data end date
3. validation data start date
4. validation data end date
5. test data start date
6. test data end date
7. list of categorical features to be used
8. dictionary of columns to be frequency-encoded and minimum frequency thresholds to be used
   - thresholds are 5% or 10%, based on our findings from the data preparation step

In [4]:
# | code-fold: false
# start and end dates
train_start_date = "20160901"
train_end_date = "20161231"
val_start_date = "20170101"
val_end_date = "20170131"
test_start_date = "20170201"
test_end_date = "20170228"

# categorical column names
categorical_columns = [
    "bounces",
    "last_action",
    "source",
    "medium",
    "channelGrouping",
    "browser",
    "os",
    "deviceCategory",
]

# categorical columns to be grouped
cols_to_group = {
    "5_pct": ["source", "browser"],
    "10_pct": ["os", "channelGrouping", "medium"],
}

Get path to `data/processed`, to which the transformed data splits (training, validation and test) produced by this step will be exported

In [5]:
# | code-fold: false
processed_data_dir = os.path.join(PROJ_ROOT_DIR, "data", "processed")

Retrieve credentials for `bigquery` client

In [6]:
# | code-fold: false
# Google Cloud PROJECT ID
gcp_project_id = os.environ["GCP_PROJECT_ID"]

Get filepath to Google Cloud Service Account JSON key

In [7]:
# | code-fold: false
raw_data_dir = os.path.join(PROJ_ROOT_DIR, "data", "raw")
gcp_creds_fpath = glob(os.path.join(raw_data_dir, "*.json"))[0]

Authenticate `bigquery` client and get dictionary with credentials

In [8]:
# | code-fold: false
gcp_credentials = service_account.Credentials.from_service_account_file(gcp_creds_fpath)
gcp_auth_dict = dict(gcp_project_id=gcp_project_id, gcp_creds=gcp_credentials)

Create a mapping between action type integer and label, in order to get meaningful names from the `action_type` column

In [9]:
# | code-fold: false
mapper = {
    1: "Click through of product lists",
    2: "Product detail views",
    3: "Add product(s) to cart",
    4: "Remove product(s) from cart",
    5: "Check out",
    6: "Completed purchase",
    7: "Refund of purchase",
    8: "Checkout options",
    0: "Unknown",
}
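
As a small illustration of how such a mapping could be applied, the sketch below uses `pandas.Series.map()` on a hypothetical series of action type integers. The names `mapper_subset`, `last_action` and `last_action_label`, and the toy values, are assumptions made for this example only.

```python
import pandas as pd

# subset of the action type mapping, repeated here to keep the sketch self-contained
mapper_subset = {
    0: "Unknown",
    3: "Add product(s) to cart",
    6: "Completed purchase",
}

# hypothetical integer-encoded last action for four visits
last_action = pd.Series([0, 3, 6, 3], name="last_action")

# replace each integer with its human-readable label
last_action_label = last_action.map(mapper_subset)
print(last_action_label.tolist())
```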

Define a dictionary to change datatypes of prepared data (this was originally developed in the data preparation step)

In [10]:
# | code-fold: false
dtypes_dict = {
    "fullvisitorid": pd.StringDtype(),
    "visitId": pd.StringDtype(),
    "visitNumber": pd.Int8Dtype(),
    "country": pd.StringDtype(),
    "quarter": pd.Int8Dtype(),
    "month": pd.Int8Dtype(),
    "day_of_month": pd.Int8Dtype(),
    "day_of_week": pd.Int8Dtype(),
    "hour": pd.Int8Dtype(),
    "minute": pd.Int8Dtype(),
    "second": pd.Int8Dtype(),
    "source": pd.CategoricalDtype(),  #
    "medium": pd.CategoricalDtype(),  #
    "channelGrouping": pd.CategoricalDtype(),  #
    "hits": pd.Int16Dtype(),
    "bounces": pd.CategoricalDtype(),  #
    "last_action": pd.CategoricalDtype(),  #
    "promos_displayed": pd.Int16Dtype(),
    "promos_clicked": pd.Int16Dtype(),
    "product_views": pd.Int16Dtype(),
    "product_clicks": pd.Int16Dtype(),
    "pageviews": pd.Int16Dtype(),
    "time_on_site": pd.Int16Dtype(),
    "browser": pd.CategoricalDtype(),  #
    "os": pd.CategoricalDtype(),  #
    "added_to_cart": pd.Int16Dtype(),
    "deviceCategory": pd.CategoricalDtype(),  #
    "made_purchase_on_future_visit": pd.BooleanDtype(),
}

Define a Python helper function to perform the following

1. execute a SQL query using Google BigQuery
2. set column datatypes of a `pandas.DataFrame`
3. drop duplicates based on a list of columns in a `pandas.DataFrame`

In [11]:
# | code-fold: false
def run_sql_query(
    query: str,
    gcp_project_id: str,
    gcp_creds: service_account.Credentials,
    show_dtypes: bool = False,
    show_info: bool = False,
    show_df: bool = False,
) -> pd.DataFrame:
    """Run query on BigQuery and return results as pandas.DataFrame."""
    start_time = datetime.now(pytz.timezone("US/Eastern"))
    start_time_str = start_time.strftime("%Y-%m-%d %H:%M:%S.%f")
    print(f"Query execution start time = {start_time_str[:-3]}...", end="")
    df = pd.read_gbq(
        query,
        project_id=gcp_project_id,
        credentials=gcp_creds,
        dialect="standard",
        # configuration is optional, since default for query caching is True
        configuration={"query": {"useQueryCache": True}},
        # use_bqstorage_api=True,
    )
    end_time = datetime.now(pytz.timezone("US/Eastern"))
    end_time_str = end_time.strftime("%Y-%m-%d %H:%M:%S.%f")
    # total_seconds() also accounts for any whole days in the elapsed time
    duration = (end_time - start_time).total_seconds()
    print(f"done at {end_time_str[:-3]} ({duration:.3f} seconds).")
    print(f"Query returned {len(df):,} rows")
    if show_df:
        with pd.option_context("display.max_columns", None):
            display(df)
    if show_dtypes:
        display(df.dtypes.rename("dtype").to_frame().transpose())
    if show_info:
        df.info()
    return df


def set_datatypes(df: pd.DataFrame, dtypes: Dict) -> pd.DataFrame:
    """Set DataFrame datatypes using dictionary."""
    df = df.astype(dtypes)
    return df


def drop_duplicates(df: pd.DataFrame, subset: List[str]) -> pd.DataFrame:
    """Drop duplicates."""
    df = df.drop_duplicates(subset=subset, keep="first")
    return df
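
The two `DataFrame` helpers can be chained with `.pipe()`, which is how they are used later in this step. The sketch below repeats simplified versions of both helpers so that it is self-contained, and applies them to a hypothetical toy `DataFrame` (`raw`) with a hypothetical datatypes dictionary (`toy_dtypes`).

```python
from typing import Dict, List

import pandas as pd


def set_datatypes(df: pd.DataFrame, dtypes: Dict) -> pd.DataFrame:
    """Set DataFrame datatypes using dictionary."""
    return df.astype(dtypes)


def drop_duplicates(df: pd.DataFrame, subset: List[str]) -> pd.DataFrame:
    """Drop duplicates, keeping the first occurrence."""
    return df.drop_duplicates(subset=subset, keep="first")


# hypothetical toy data in which visitor "a" appears twice
raw = pd.DataFrame(
    {
        "fullvisitorid": ["a", "a", "b"],
        "hits": [5, 7, 3],
        "browser": ["Chrome", "Chrome", "Safari"],
    }
)
toy_dtypes = {
    "fullvisitorid": pd.StringDtype(),
    "hits": pd.Int16Dtype(),
    "browser": pd.CategoricalDtype(),
}

# set datatypes first, then keep one row per visitor
clean = raw.pipe(set_datatypes, dtypes=toy_dtypes).pipe(
    drop_duplicates, subset=["fullvisitorid"]
)
print(clean.dtypes)
```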

## Get Data

Define the same custom data transformation pipeline to handle the categorical columns that was developed in the data preparation step and used in the EDA step

In [12]:
# | code-fold: false
encoder_05 = RareLabelEncoder(
    tol=0.05,
    n_categories=2,
    variables=cols_to_group["5_pct"],
    replace_with="other",
)
encoder_10 = RareLabelEncoder(
    tol=0.10,
    n_categories=2,
    variables=cols_to_group["10_pct"],
    replace_with="other",
)
categorical_transformer = Pipeline(
    steps=[("enc_05", encoder_05), ("enc_10", encoder_10)]
)
preprocessor = ColumnTransformer(
    transformers=[("cat", categorical_transformer, categorical_columns)],
    remainder="passthrough",
)
pipe_trans = Pipeline(steps=[("preprocessor", preprocessor)])
pipe_trans
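
One detail of `ColumnTransformer` with `remainder="passthrough"` is worth illustrating: by default, its output is an array in which the transformed columns come first, followed by the passed-through columns. The sketch below shows this on a hypothetical toy `DataFrame`, using an identity `FunctionTransformer` as a stand-in for the categorical transformer (an assumption made to keep the example small), and then restores the original column order.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# hypothetical toy data with the categorical column in the middle
df = pd.DataFrame(
    {"hits": [5, 18], "browser": ["Chrome", "Safari"], "pageviews": [5, 9]}
)
cat_cols = ["browser"]
other_cols = [c for c in df.columns if c not in cat_cols]

# identity transformer used as a stand-in for the categorical pipeline
ct = ColumnTransformer(
    transformers=[("cat", FunctionTransformer(), cat_cols)],
    remainder="passthrough",
)

# output array: transformed (categorical) columns first, then the remainder
out = ct.fit_transform(df)

# rebuild a DataFrame with matching names and restore the original order
restored = pd.DataFrame(out, columns=cat_cols + other_cols)[list(df.columns)]
print(restored)
```

Because the output is a single mixed-type array, the column datatypes are lost, so they would need to be set again after rebuilding the `DataFrame`.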

A helper function will be created to programmatically load data from BigQuery based on the desired start and end dates. The function accepts the following

1. start and end dates for which data is to be retrieved
   - these dates will be different for the training, validation and test splits
2. start date for the training data and end date for the test data
   - these two dates define the period over which ML model development will occur
   - these are used to retrieve visitors who made a purchase on a return (future) visit to the store during this period

The function is defined below

In [13]:
# | code-fold: false
def get_sql_query(
    split_start_date: str,
    split_end_date: str,
    train_split_start_date: str,
    test_split_end_date: str,
) -> str:
    """Assemble query to retrieve attributes about first visits."""
    query_str = f"""
            WITH
            -- Step 1. get visitors with a purchase on a future visit
            next_visit_purchasers AS (
                 SELECT fullvisitorid,
                        IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, True, False) AS made_purchase_on_future_visit
                 FROM `data-to-insights.ecommerce.web_analytics`
                 WHERE date BETWEEN '{train_split_start_date}' AND '{test_split_end_date}'
                 AND geoNetwork.country = 'United States'
                 GROUP BY fullvisitorid
            ),
            -- Steps 2. and 3. get attributes of the first visit
            first_visit_attributes AS (
                SELECT -- =========== GEOSPATIAL AND TEMPORAL ATTRIBUTES OF VISIT ===========
                       geoNetwork.country,
                       EXTRACT(QUARTER FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Pacific')) AS quarter,
                       EXTRACT(MONTH FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Pacific')) AS month,
                       EXTRACT(DAY FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Pacific')) AS day_of_month,
                       EXTRACT(DAYOFWEEK FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Pacific')) AS day_of_week,
                       EXTRACT(HOUR FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Pacific')) AS hour,
                       EXTRACT(MINUTE FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Pacific')) AS minute,
                       EXTRACT(SECOND FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Pacific')) AS second,
                       -- =========== VISIT AND VISITOR METADATA ===========
                       fullvisitorid,
                       visitId,
                       visitNumber,
                       DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Pacific') AS visitStartTime,
                       -- =========== SOURCE OF SITE TRAFFIC ===========
                       -- source of the traffic from which the visit was initiated
                       trafficSource.source,
                       -- medium of the traffic from which the visit was initiated
                       trafficSource.medium,
                       -- referring channel connected to visit
                       channelGrouping,
                       -- =========== VISITOR ACTIVITY ===========
                       -- total number of hits
                       (CASE WHEN totals.hits > 0 THEN totals.hits ELSE 0 END) AS hits,
                       -- number of bounces
                       (CASE WHEN totals.bounces > 0 THEN totals.bounces ELSE 0 END) AS bounces,
                       -- action performed during first visit
                       CAST(h.eCommerceAction.action_type AS INT64) AS action_type,
                       -- page views
                       IFNULL(totals.pageviews, 0) AS pageviews,
                       -- time on the website
                       IFNULL(totals.timeOnSite, 0) AS time_on_site,
                       -- whether add-to-cart was performed during visit
                       (CASE WHEN CAST(h.eCommerceAction.action_type AS INT64) = 3 THEN 1 ELSE 0 END) AS added_to_cart,
                       -- =========== VISITOR DEVICES ===========
                       -- user's browser
                       device.browser,
                       -- user's operating system
                       device.operatingSystem AS os,
                       -- user's type of device
                       device.deviceCategory,
                       -- =========== PROMOTION ===========
                       h.promotion,
                       h.promotionActionInfo AS pa_info,
                       -- =========== PRODUCT ===========
                       h.product,
                       -- =========== ML LABEL (DEPENDENT VARIABLE) ===========
                       made_purchase_on_future_visit
                FROM `data-to-insights.ecommerce.web_analytics`,
                UNNEST(hits) AS h
                INNER JOIN next_visit_purchasers USING (fullvisitorid)
                WHERE date BETWEEN '{split_start_date}' AND '{split_end_date}'
                AND geoNetwork.country = 'United States'
                AND totals.newVisits = 1
            ),
            -- Step 4. get aggregated features (attributes) per visit
            visit_attributes AS (
                SELECT fullvisitorid,
                       visitId,
                       visitNumber,
                       visitStartTime,
                       country,
                       quarter,
                       month,
                       day_of_month,
                       day_of_week,
                       hour,
                       minute,
                       second,
                       source,
                       medium,
                       channelGrouping,
                       hits,
                       bounces,
                       -- get the last action performed during the first visit
                       -- (this indicates where the visitor left off at the end of their visit)
                       MAX(action_type) AS last_action,
                       -- get number of promotions displayed and clicked during the first visit
                       COUNT(CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsView ELSE NULL END) AS promos_displayed,
                       COUNT(CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsClick ELSE NULL END) AS promos_clicked,
                       -- get number of products displayed and clicked during the first visit
                       COUNT(CASE WHEN pu.isImpression IS NULL THEN NULL ELSE 1 END) AS product_views,
                       COUNT(CASE WHEN pu.isClick IS NULL THEN NULL ELSE 1 END) AS product_clicks,
                       pageviews,
                       time_on_site,
                       browser,
                       os,
                       deviceCategory,
                       SUM(added_to_cart) AS added_to_cart,
                       made_purchase_on_future_visit,
                FROM first_visit_attributes
                LEFT JOIN UNNEST(promotion) as p
                LEFT JOIN UNNEST(product) as pu
                GROUP BY fullvisitorid,
                         visitId,
                         visitNumber,
                         visitStartTime,
                         country,
                         quarter,
                         month,
                         day_of_month,
                         day_of_week,
                         hour,
                         minute,
                         second,
                         source,
                         medium,
                         channelGrouping,
                         hits,
                         bounces,
                         pageviews,
                         time_on_site,
                         browser,
                         os,
                         deviceCategory,
                         made_purchase_on_future_visit
            )
            SELECT *
            FROM visit_attributes
            """
    return query_str
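
The label logic in Step 1 of the query (`COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0`, grouped by visitor) can be mimicked in `pandas` to make it easier to follow. This is only a sketch on hypothetical toy data: the `visits` frame and its values are made up, and the flattened column names `newVisits` and `transactions` stand in for `totals.newVisits` and `totals.transactions`.

```python
import pandas as pd

# hypothetical toy data: one row per visit
visits = pd.DataFrame(
    {
        "fullvisitorid": ["a", "a", "b", "b", "c"],
        # 1 on the first visit, missing on return visits
        "newVisits": [1, None, 1, None, 1],
        # number of transactions in the visit, missing if there were none
        "transactions": [None, 2, 1, None, None],
    }
)

# a visit counts if it is a return visit with at least one transaction
is_return_purchase = visits["newVisits"].isna() & (visits["transactions"] > 0)

# a visitor gets a positive label if any of their visits qualifies
labels = (
    is_return_purchase.groupby(visits["fullvisitorid"])
    .any()
    .rename("made_purchase_on_future_visit")
)
print(labels)
```

In this toy data, visitor `b` purchased only on their first visit, so their label is `False`, while visitor `a` purchased on a return visit and is labelled `True`.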

### Create Training Data

The training data will be prepared following the identical approach used in the EDA step, namely

1. load data from Google BigQuery dataset using the Python `bigquery` client
2. set datatypes to support frequency encoding of categorical features
   - all categorical features must have the datatype `pd.CategoricalDtype()`
3. drop duplicates by `fullvisitorid`
4. perform frequency-encoding on categorical features to reduce cardinality

In [14]:
# | code-fold: false
# load data from BigQuery dataset, set datatypes and drop duplicates
query = get_sql_query(train_start_date, train_end_date, train_start_date, test_end_date)
df_train = (
    run_sql_query(query, **gcp_auth_dict, show_df=False)
    .pipe(set_datatypes, dtypes=dtypes_dict)
    .pipe(drop_duplicates, subset=["fullvisitorid"])
)
print(
    f"Got {len(df_train):,} rows and {df_train.shape[1]:,} columns "
    "after dropping duplicates"
)

# perform frequency-encoding on categorical features
# # Get a list of the non-categorical columns
non_categorical_columns = [c for c in list(df_train) if c not in categorical_columns]
# # Apply the custom data transformation pipeline to prepare the training data split
_ = pipe_trans.fit(df_train)
df_train = pd.DataFrame(
    pipe_trans.transform(df_train),
    columns=categorical_columns + non_categorical_columns,
)[non_categorical_columns + categorical_columns].pipe(set_datatypes, dtypes=dtypes_dict)
print(
    f"Got {len(df_train):,} rows and {df_train.shape[1]:,} columns after "
    "frequency-encoding categorical features"
)

with pd.option_context("display.max_columns", None):
    display(df_train.head())
    display(df_train.tail())

Query execution start time = 2023-04-13 17:38:11.057...done at 2023-04-13 17:38:31.166 (20.109 seconds).
Query returned 92,859 rows
Got 92,551 rows and 29 columns after dropping duplicates
Got 92,551 rows and 29 columns after frequency-encoding categorical features


|  | fullvisitorid | visitId | visitNumber | visitStartTime | country | quarter | month | day_of_month | day_of_week | hour | minute | second | hits | promos_displayed | promos_clicked | product_views | product_clicks | pageviews | time_on_site | added_to_cart | made_purchase_on_future_visit | bounces | last_action | source | medium | channelGrouping | browser | os | deviceCategory |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4180680121446408775 | 1476943855 | 1 | 2016-10-19 23:10:55 | United States | 4 | 10 | 19 | 4 | 23 | 10 | 55 | 5 | 0 | 0 | 0 | 0 | 5 | 292 | 0 | False | 0 | 0 | google | organic | Organic Search | Chrome | Android | mobile |
| 1 | 3072592563711482446 | 1476880065 | 1 | 2016-10-19 05:27:45 | United States | 4 | 10 | 19 | 4 | 5 | 27 | 45 | 18 | 27 | 1 | 36 | 0 | 9 | 317 | 0 | False | 0 | 0 | google | organic | Organic Search | Chrome | Android | mobile |
| 2 | 1687301606877489412 | 1477794145 | 1 | 2016-10-29 19:22:25 | United States | 4 | 10 | 29 | 7 | 19 | 22 | 25 | 5 | 9 | 0 | 0 | 0 | 5 | 54 | 0 | False | 0 | 0 | youtube.com | referral | other | other | Windows | desktop |
| 3 | 796191439564725883 | 1473279331 | 1 | 2016-09-07 13:15:31 | United States | 3 | 9 | 7 | 4 | 13 | 15 | 31 | 7 | 9 | 0 | 0 | 0 | 7 | 2494 | 0 | False | 0 | 0 | google | organic | Organic Search | Chrome | Windows | desktop |
| 4 | 9194147359170837949 | 1478035636 | 1 | 2016-11-01 14:27:16 | United States | 4 | 11 | 1 | 3 | 14 | 27 | 16 | 7 | 9 | 0 | 0 | 0 | 4 | 174 | 0 | False | 0 | 0 | youtube.com | referral | other | Chrome | Android | mobile |


|  | fullvisitorid | visitId | visitNumber | visitStartTime | country | quarter | month | day_of_month | day_of_week | hour | minute | second | hits | promos_displayed | promos_clicked | product_views | product_clicks | pageviews | time_on_site | added_to_cart | made_purchase_on_future_visit | bounces | last_action | source | medium | channelGrouping | browser | os | deviceCategory |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 92546 | 6912704476581951344 | 1481395682 | 1 | 2016-12-10 10:48:02 | United States | 4 | 12 | 10 | 7 | 10 | 48 | 2 | 3 | 0 | 0 | 30 | 0 | 3 | 70 | 0 | False | 0 | 0 | google | organic | Organic Search | Chrome | Windows | desktop |
| 92547 | 253693082293291936 | 1475785004 | 1 | 2016-10-06 13:16:44 | United States | 4 | 10 | 6 | 5 | 13 | 16 | 44 | 3 | 0 | 0 | 27 | 0 | 3 | 30 | 0 | False | 0 | 0 | google | organic | Organic Search | Safari | iOS | mobile |
| 92548 | 745341923593722891 | 1481152121 | 1 | 2016-12-07 15:08:41 | United States | 4 | 12 | 7 | 4 | 15 | 8 | 41 | 3 | 0 | 0 | 36 | 0 | 3 | 182 | 0 | False | 0 | 0 | google | organic | Organic Search | Chrome | Macintosh | desktop |
| 92549 | 3099760389535632682 | 1475112188 | 1 | 2016-09-28 18:23:08 | United States | 3 | 9 | 28 | 4 | 18 | 23 | 8 | 3 | 0 | 0 | 9 | 0 | 3 | 40 | 0 | False | 0 | 0 | other | referral | other | other | iOS | mobile |
| 92550 | 7973052081416467006 | 1482541637 | 1 | 2016-12-23 17:07:17 | United States | 4 | 12 | 23 | 6 | 17 | 7 | 17 | 3 | 0 | 0 | 33 | 0 | 3 | 64 | 0 | False | 0 | 0 | youtube.com | referral | other | Chrome | Macintosh | desktop |


::: {.callout-note title="Notes"}

1. Google Analytics uses visits and sessions interchangeably ([1](https://www.blyp.ai/a/question-hub/google-analytics/sessions-vs-visits-are-they-the-same-in-google-analytics), [2](https://databox.com/sessions-users-pageviews-in-google-analytics#head2))
   - a visit identifies a user reaching the marketplace website
     - a visitor can have multiple visits since they can visit the marketplace website multiple times
   - a session captures a visitor's interactions on the site during a visit
     - a session begins at the start of the visit and ends after 30 minutes of inactivity by the visitor
2. Each row here corresponds to the first visit by a single visitor. Within the SQL query, the `UNNEST` function produces one row per action, and the `GROUP BY` in the `visit_attributes` CTE aggregates these rows back to one row per visit.
3. The `made_purchase_on_future_visit` column is the label for ML training. Inside the `first_visit_attributes` CTE, this column is repeated at the user action level (since the visits were exploded using the `UNNEST` function on the `hits` column). The label value only changes at the visit level, since we only know if a visitor will make a purchase on their return (or future) visit after that visit has ended, and that applies to the entire visit. So, this column is included in the `GROUP BY` in order to get it at the visit level
   - this column indicates if a visitor makes a purchase during their *next* visit
   - an ML model will be trained to predict this probability (propensity) of making a purchase during the return visit to the Merchandise store
   - the ML model will be trained on features of the same visitor's *first* visit
   - this is a [forward-looking](https://docs.aws.amazon.com/whitepapers/latest/time-series-forecasting-principles-with-amazon-forecast/step-2-prepare-data.html#concepts-of-featurization-and-related-time-series) label (`y`)
4. We had to select `totals.newVisits = 1` since we only wanted ML features from the first visit. We can't use features from the return visit since we want to predict the outcome of the return visit *before that visit has occurred*. Earlier, we selected visitors who made a purchase on a future visit. So, for these visitors, the ML features (`X`) will be extracted from their first visit only.
:::

The start and end `visitStartTime`s match those specified by the required start (`train_start_date`) and end (`train_end_date`) dates of the training data split

In [15]:
assert df_train["visitStartTime"].min().month == int(train_start_date[4:6])
assert df_train["visitStartTime"].max().month == int(train_end_date[4:6])
print(
    f"Visit Start Times: {df_train['visitStartTime'].min().strftime('%Y-%m-%d %H:%M:%S')} - "
    f"{df_train['visitStartTime'].max().strftime('%Y-%m-%d %H:%M:%S')}"
)

Visit Start Times: 2016-09-01 00:07:35 - 2016-12-31 23:56:56
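
The assertions above compare only the month of the earliest and latest visit. A stricter check on the full dates could look like the following sketch, where the helper `within_window` is hypothetical and the two timestamps are the ones printed above.

```python
import pandas as pd


def within_window(ts: pd.Series, start: str, end: str) -> bool:
    """Check that all timestamps fall between two YYYYMMDD dates (inclusive)."""
    start_date = pd.to_datetime(start, format="%Y%m%d").date()
    end_date = pd.to_datetime(end, format="%Y%m%d").date()
    return ts.min().date() >= start_date and ts.max().date() <= end_date


# earliest and latest visit start times printed above
visit_start_times = pd.to_datetime(
    pd.Series(["2016-09-01 00:07:35", "2016-12-31 23:56:56"])
)

# True for the training window, False for a window that ends one month earlier
print(within_window(visit_start_times, "20160901", "20161231"))
print(within_window(visit_start_times, "20160901", "20161130"))
```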


Below is the class imbalance in the ML labels (`y_train`) from the training data and the number of unique values in all categorical columns

In [16]:
display(
    (
        100
        * df_train["made_purchase_on_future_visit"]
        .value_counts(normalize=True)
        .rename("fraction")
        .to_frame()
    )
    .merge(
        df_train["made_purchase_on_future_visit"]
        .value_counts()
        .rename("number")
        .to_frame(),
        how="left",
        left_index=True,
        right_index=True,
    )
    .reset_index()
)
display(
    pd.DataFrame.from_records(
        [
            {
                "column": c,
                "num_unique_values": df_train[c].nunique(),
            }
            for c in categorical_columns
        ]
    )
)

Unnamed: 0,made_purchase_on_future_visit,fraction,number
0,False,95.407937,88301
1,True,4.592063,4250


Unnamed: 0,column,num_unique_values
0,bounces,2
1,last_action,7
2,source,5
3,medium,4
4,channelGrouping,4
5,browser,3
6,os,5
7,deviceCategory,3


::: {.callout-tip title="Observations"}

1. Class Imbalance
   - we will want to consider random [undersampling, or downsampling](https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/) to improve the imbalance ratio from approximately 20:1 in favor of the majority class to a ratio such as 10:1 or 5:1. This will help the ML model in two ways
     - reducing the degree of imbalance lets the model learn from a relatively larger proportion of positive examples (visitor did make a purchase on a future visit - we are interested in these visitors) than in the raw data, which has a much larger number of negative examples (visitor did not make a purchase on a future visit - we are not interested in these visitors for the current business use-case)
     - keeping the true class imbalance is also inefficient in terms of training time since the model spends most of its time learning from uninteresting examples
       - reducing the class imbalance results in shorter model training times

     Other approaches to handle the class imbalance are
     - do nothing
       - train the model using the true distribution of the classes
       - if a model trained on the true imbalanced distribution can generalize to unseen data then no undersampling is required
       - the disadvantage of this approach is that longer training time will be required
     - use a data-augmentation technique such as [SMOTE](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-106)
:::
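
Random undersampling of the majority class could be sketched as follows. This is a minimal illustration under stated assumptions, not the approach used later in the analysis: the helper name `undersample_majority()` and the toy data are hypothetical, the positive class is assumed to be the minority, and the resampling would be applied to the training split only.

```python
import pandas as pd


def undersample_majority(df, label_col, majority_to_minority=5.0, random_state=42):
    """Randomly drop majority-class rows to reach the requested class ratio."""
    # assumption: the positive class (True) is the minority class
    is_positive = df[label_col].astype(bool)
    df_minority = df[is_positive]
    df_majority = df[~is_positive]

    # number of majority-class rows to keep
    n_keep = min(len(df_majority), int(majority_to_minority * len(df_minority)))
    df_majority_sampled = df_majority.sample(n=n_keep, random_state=random_state)

    # shuffle so that the classes are not grouped together
    return (
        pd.concat([df_minority, df_majority_sampled])
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )


# toy data with a 20:1 imbalance, similar to the training data above
df_toy = pd.DataFrame({"y": [True] * 10 + [False] * 200, "x": range(210)})
df_balanced = undersample_majority(df_toy, "y", majority_to_minority=5)
print(df_balanced["y"].value_counts())
```

All positive examples are kept and only negative examples are dropped, so the validation and test splits can retain the true class distribution for evaluation.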

Show a count of missing values in all the columns

In [17]:
df_train.isna().sum().rename("missing").reset_index().rename(
    columns={"index": "column"}
)

Unnamed: 0,column,missing
0,fullvisitorid,0
1,visitId,0
2,visitNumber,0
3,visitStartTime,0
4,country,0
5,quarter,0
6,month,0
7,day_of_month,0
8,day_of_week,0
9,hour,0


::: {.callout-note title="Notes"}

1. Since the Google Analytics data is reported by visit, missing values in the General features columns
   - `hits`
   - `bounces`
   - `pageviews`
   - `time_on_site`

   would indicate zeros. Zeros were already present in the raw data so there were no missing values in these columns that had to be filled as part of the data preparation SQL query used here.
2. The manually chosen set of User-Facing features (all of which are categoricals) don't have missing values.
3. For both types of features, other choices than the ones made here could have missing values. In that case, feature imputation would be necessary without data leakage/lookahead bias. For the choices in this step, imputing missing values is not necessary in the features aggregated at the visit level.
:::
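
If a different choice of features did contain missing values, imputation without data leakage could look like the sketch below. The fill value is learned from the training split only and then re-used for the other splits. The toy values and the choice of the median are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical splits with missing values in a numerical feature
df_train_toy = pd.DataFrame({"time_on_site": [10.0, np.nan, 30.0, 50.0]})
df_val_toy = pd.DataFrame({"time_on_site": [np.nan, 20.0]})

# learn the fill value from the training split only
fill_values = df_train_toy[["time_on_site"]].median()

# re-use the same fill value for every split
df_train_filled = df_train_toy.fillna(fill_values)
df_val_filled = df_val_toy.fillna(fill_values)
print(df_val_filled)
```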

We'll now assess zeros in the General Features columns

1. `hits`
   - `hits` are users' interactions on the merchandise store's website that send data to the Google Analytics server ([1](https://www.digishuffle.com/blogs/google-analytics-hits/#hitdefinition), [2](https://whatagraph.com/blog/articles/which-kinds-of-hits-does-google-analytics-track/#toc_0))
   - a visit is a group of hits ([1](https://www.optimizesmart.com/why-google-analytics-show-zero-sessions/))
   - [examples include](https://whatagraph.com/blog/articles/which-kinds-of-hits-does-google-analytics-track/)
     - viewing a page
     - social media interactions, like sharing or liking content using *share on social media* buttons
     - e-commerce interactions (add to cart, remove from cart, make purchase, etc.)
     - user timings (loading a page, loading an image, clicking a button)
     - etc.
   - there is no occurrence of zero hits in the Merchandise Store's Google Analytics dataset, so the minimum number of hits is 1, likely since viewing a page (which always occurs) is considered a hit
2. `time_on_site` (total duration of a single visit)
   - reaching the site triggers the start of a visit
   - time on site [is calculated as](https://roirevolution.com/blog/time-on-page-and-time-on-site-how-confident-are-you/) the difference between the timestamps of the last and first pages of a visit
   - a zero indicates the user did not navigate to further pages or trigger events on the site after reaching the site ([1](https://support.google.com/google-ads/thread/1455669?hl=en&msgid=1455678))
3. `bounces`
   - a bounce is when a single request is submitted from scripts embedded in the merchandise store's website to the [Google Analytics server](https://www.analyticsmania.com/post/introduction-to-google-tag-manager-server-side-tagging/) ([1](https://support.google.com/analytics/answer/1009409?hl=en))
     - if a bounce occurs, then time spent on the site is zero and a single page has most likely been viewed (`pageviews`)
     - a zero indicates the absence of a bounced visit, while `1` indicates a bounce was present

If a bounce occurs then the following are expected

1. zero time on site
2. (predominantly) a single hit is registered
3. a single page gets viewed

In [18]:
# | code-fold: false
# time on site versus bounces
display(
    (
        (
            100
            * df_train.query("bounces == 1")["time_on_site"].value_counts(
                normalize=True
            )
        )
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "time_on_site"})
    )
    .assign(split="train")
    # keep the observed `time_on_site` value instead of overwriting it with a constant
    .iloc[[0]]
)

# number of pages viewed versus bounces
display(
    (
        (100 * df_train.query("bounces == 1")["pageviews"].value_counts(normalize=True))
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "pageviews"})
    )
    .assign(split="train")
    .assign(time_on_site=0)
    .iloc[[0]]
)

# hits versus bounces
display(
    (
        (100 * df_train.query("bounces == 1")["hits"].value_counts(normalize=True))
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "hits"})
    )
    .assign(split="train")
    .assign(time_on_site=0)
    .iloc[[0]]
)

Unnamed: 0,time_on_site,frequency,split
0,0,100.0,train


Unnamed: 0,pageviews,frequency,split,time_on_site
0,1,100.0,train,0


Unnamed: 0,hits,frequency,split,time_on_site
0,1,99.167869,train,0


Similarly, if zero time is spent on the site then this is almost always associated with

- a bounce
- single page view
- single hit

In [19]:
# | code-fold: false
# bounce
display(
    (
        (
            100
            * df_train.query("time_on_site == 0")["bounces"].value_counts(
                normalize=True
            )
        )
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "bounces"})
    )
    .assign(split="train")
    .assign(time_on_site=0)
    .iloc[[0]]
)
# single page view
display(
    (
        (
            (
                100
                * df_train.query("time_on_site == 0")["pageviews"].value_counts(
                    normalize=True
                )
            )
            .rename("frequency")
            .to_frame()
            .reset_index()
            .rename(columns={"index": "pageviews"})
        )
    )
    .assign(split="train")
    .assign(time_on_site=0)
    .iloc[[0]]
)
# hits
display(
    (
        (100 * df_train.query("time_on_site == 0")["hits"].value_counts(normalize=True))
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "hits"})
    )
    .assign(split="train")
    .assign(time_on_site=0)
    .iloc[[0]]
)

Unnamed: 0,bounces,frequency,split,time_on_site
0,1,99.830164,train,0


Unnamed: 0,pageviews,frequency,split,time_on_site
0,1,99.863393,train,0


Unnamed: 0,hits,frequency,split,time_on_site
0,1,99.010522,train,0


Similarly, if a single hit occurs on the e-commerce site then this is almost always associated with

- a bounce
- single page view
- zero time on site

In [20]:
# | code-fold: false
# hits versus bounces
display(
    (
        (100 * df_train.query("hits == 1")["bounces"].value_counts(normalize=True))
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "bounces"})
    )
    .assign(split="train")
    .assign(hits=1)
    .iloc[[0]]
)

# hits versus number of pages viewed
display(
    (
        (100 * df_train.query("hits == 1")["pageviews"].value_counts(normalize=True))
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "pageviews"})
    )
    .assign(split="train")
    .assign(hits=1)
    .iloc[[0]]
)

# hits versus time on site
display(
    (
        (100 * df_train.query("hits == 1")["time_on_site"].value_counts(normalize=True))
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "time_on_site"})
    )
    .assign(split="train")
    .assign(hits=1)
    .iloc[[0]]
)

Unnamed: 0,bounces,frequency,split,hits
0,1,99.988813,train,1


Unnamed: 0,pageviews,frequency,split,hits
0,1,99.988813,train,1


Unnamed: 0,time_on_site,frequency,split,hits
0,0,100.0,train,1


::: {.callout-tip title="Observations"}

1. Based on the above, it might be worth training a ML model with only one of these General Features (`hits`, `time_on_site`, `pageviews`).
2. `bounces` is a binary column and should be treated as a categorical ML feature and not as a numerical feature.
:::
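
The three cells above repeat the same pattern: filter on a condition, then report the most common value of another column and its share. A compact alternative is sketched below using a hypothetical helper, `most_common_value_share()`, and made-up toy data. Since it reads the value directly from the index of `value_counts()`, it does not depend on the column name produced by `reset_index()`.

```python
import pandas as pd


def most_common_value_share(df, condition, column):
    """Get the most common value of column (and its % share) for matching rows."""
    shares = 100 * df.query(condition)[column].value_counts(normalize=True)
    return pd.DataFrame(
        {
            "condition": [condition],
            "column": [column],
            # value_counts() sorts in descending order of frequency
            "most_common_value": [shares.index[0]],
            "frequency": [shares.iloc[0]],
        }
    )


# toy visit-level data
df_toy = pd.DataFrame(
    {
        "bounces": [1, 1, 1, 0, 0],
        "hits": [1, 1, 2, 5, 7],
        "pageviews": [1, 1, 1, 4, 6],
        "time_on_site": [0, 0, 0, 120, 300],
    }
)
summary = pd.concat(
    [
        most_common_value_share(df_toy, "bounces == 1", c)
        for c in ["hits", "pageviews", "time_on_site"]
    ],
    ignore_index=True,
)
print(summary)
```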

### Create Validation Data

Using the same approach, the validation data split is now created by only changing

1. `train_start_date` to `val_start_date`
2. `train_end_date` to `val_end_date`

in the `first_visit_attributes` CTE in order to capture the validation data. Categorical features are then encoded using the custom pipeline that was trained using the training data.

In [21]:
# | code-fold: false
# load data from BigQuery dataset, set datatypes and drop duplicates
query = get_sql_query(val_start_date, val_end_date, train_start_date, test_end_date)
df_val = (
    run_sql_query(query, **gcp_auth_dict, show_df=False)
    .pipe(set_datatypes, dtypes=dtypes_dict)
    .pipe(drop_duplicates, subset=["fullvisitorid"])
)
print(
    f"Got {len(df_val):,} rows and {df_val.shape[1]:,} columns "
    "after dropping duplicates"
)

# perform frequency-encoding on categorical features
# - apply the data transformation pipeline (trained on the training data split)
#   to the validation data split
df_val = pd.DataFrame(
    pipe_trans.transform(df_val), columns=categorical_columns + non_categorical_columns
)[non_categorical_columns + categorical_columns].pipe(set_datatypes, dtypes=dtypes_dict)
print(
    f"Got {len(df_val):,} rows and {df_val.shape[1]:,} columns after "
    "frequency-encoding categorical features"
)

with pd.option_context("display.max_columns", None):
    display(df_val.head())
    display(df_val.tail())

Query execution start time = 2023-04-13 17:38:37.871...done at 2023-04-13 17:38:43.515 (5.644 seconds).
Query returned 21,208 rows
Got 21,177 rows and 29 columns after dropping duplicates
Got 21,177 rows and 29 columns after frequency-encoding categorical features


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,country,quarter,month,day_of_month,day_of_week,hour,minute,second,hits,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,added_to_cart,made_purchase_on_future_visit,bounces,last_action,source,medium,channelGrouping,browser,os,deviceCategory
0,7443659111332807488,1484063672,1,2017-01-10 07:54:32,United States,1,1,10,3,7,54,32,11,9,0,66,0,11,151,0,False,0,0,google,organic,Organic Search,Chrome,other,desktop
1,5597527133395896902,1485228391,1,2017-01-23 19:26:31,United States,1,1,23,2,19,26,31,4,9,0,12,0,4,68,0,False,0,0,google,organic,Organic Search,Safari,iOS,tablet
2,9166144922111078017,1483730457,1,2017-01-06 11:20:57,United States,1,1,6,6,11,20,57,6,9,0,0,0,3,27,0,False,0,0,google,organic,Organic Search,Chrome,Macintosh,desktop
3,4913593613905335447,1485720519,1,2017-01-29 12:08:39,United States,1,1,29,1,12,8,39,5,9,0,0,0,5,122,0,False,0,0,google,organic,Organic Search,Chrome,Windows,desktop
4,3123113931923419625,1485363986,1,2017-01-25 09:06:26,United States,1,1,25,4,9,6,26,20,54,0,39,0,20,494,0,False,0,5,google,organic,Organic Search,Chrome,Windows,desktop


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,country,quarter,month,day_of_month,day_of_week,hour,minute,second,hits,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,added_to_cart,made_purchase_on_future_visit,bounces,last_action,source,medium,channelGrouping,browser,os,deviceCategory
21172,7261133747877726873,1485301932,1,2017-01-24 15:52:12,United States,1,1,24,3,15,52,12,3,0,0,32,0,3,28,0,False,0,0,(direct),(none),Direct,Chrome,Macintosh,desktop
21173,1814053538655247996,1483319025,1,2017-01-01 17:03:45,United States,1,1,1,1,17,3,45,3,0,0,28,0,3,49,0,False,0,0,(direct),(none),Direct,Chrome,Macintosh,desktop
21174,6632475970380148823,1485256352,1,2017-01-24 03:12:32,United States,1,1,24,3,3,12,32,3,0,0,36,0,1,29,0,False,0,0,youtube.com,referral,other,Safari,Macintosh,desktop
21175,5357307819206012105,1483378607,1,2017-01-02 09:36:47,United States,1,1,2,2,9,36,47,3,0,0,18,0,3,66,0,False,0,0,other,referral,other,Chrome,Macintosh,desktop
21176,2767207647504105236,1483984592,1,2017-01-09 09:56:32,United States,1,1,9,2,9,56,32,3,0,0,31,0,3,132,0,False,0,0,youtube.com,referral,other,Chrome,Windows,desktop


::: {.callout-note title="Notes"}

1. Earlier, it was mentioned that `totals.newVisits = 1` gives the first visit while `totals.newVisits = NULL` gives the future visit. In the training data, the first visit was picked up using this condition as a SQL filter. This allowed features from the first visit to be extracted. The label was whether the visitor made a purchase during their return visit. Here, the validation data uses the same approach. Per the business use-case, we want to predict new visitors' propensity to make a purchase during their return (or future) visit.

   The ML model will be trained using training data which only captures the first visit and this first visit occurs during the months covered by the training data only. The model will be validated using validation data that similarly covers the first visit that occurs during the months covered by the validation data only. The label of the validation data is analogous to that from the training data in that it indicates whether these new visitors (in the validation data split) made a purchase during a future visit.

   With this in mind, similar to the training data, we can get that first visit of visitors in the validation data using `totals.newVisits = 1` in the validation data SQL query above. For this reason, we will not use `totals.newVisits = NULL` in the SQL query to build the validation or test data splits.
2. The visitors in the training data do not need to be the same as those in the validation (or testing) data splits. Visitor ID will not be used as a feature during training, validation or evaluation (test data). Only the attributes of their first visit will be used since we are not interested in training a ML model for specific visitors who are identified by their ID.
:::
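
The key point when encoding the validation (and test) data is that the set of frequent categories is learned from the training data only and then applied unchanged to the other splits. The sketch below is a simplified `pandas` stand-in for this idea, not the `RareLabelEncoder` implementation itself. The helper names, the 10% threshold and the toy browser values are assumptions.

```python
import pandas as pd


def learn_frequent_categories(s, tol):
    """Get categories whose share in the training data is at least tol."""
    shares = s.value_counts(normalize=True)
    return set(shares[shares >= tol].index)


def group_rare_categories(s, frequent, replace_with="other"):
    """Replace categories that were not frequent in the training data."""
    return s.where(s.isin(frequent), replace_with)


# toy training and validation values for a categorical feature
browser_train = pd.Series(["Chrome"] * 8 + ["Safari"] * 3 + ["Opera"] * 1)
browser_val = pd.Series(["Chrome", "Opera", "Edge", "Safari"])

# learn from the training split only ...
frequent_browsers = learn_frequent_categories(browser_train, tol=0.10)
# ... and apply to the validation split without re-learning
browser_val_encoded = group_rare_categories(browser_val, frequent_browsers)
print(browser_val_encoded.tolist())
```

Categories that are rare in the training data (`Opera`) and categories that never appeared in it (`Edge`) are both mapped to `other` in the validation data.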

The start and end `visitStartTime`s match those specified by the required start (`val_start_date`) and end (`val_end_date`) dates of the validation data split

In [22]:
assert df_val["visitStartTime"].min().month == int(val_start_date[4:6])
assert df_val["visitStartTime"].max().month == int(val_end_date[4:6])
print(
    f"Visit Start Times: {df_val['visitStartTime'].min().strftime('%Y-%m-%d %H:%M:%S')} - "
    f"{df_val['visitStartTime'].max().strftime('%Y-%m-%d %H:%M:%S')}"
)

Visit Start Times: 2017-01-01 00:04:32 - 2017-01-31 23:57:51


### Create Test Data

Finally, the test data split is created using a similar approach to the validation data split (only the dates in the `first_visit_attributes` CTE are changed in order to capture the test data)

In [23]:
# | code-fold: false
# load data from BigQuery dataset, set datatypes and drop duplicates
query = get_sql_query(test_start_date, test_end_date, train_start_date, test_end_date)
df_test = (
    run_sql_query(query, **gcp_auth_dict, show_df=False)
    .pipe(set_datatypes, dtypes=dtypes_dict)
    .pipe(drop_duplicates, subset=["fullvisitorid"])
)
print(
    f"Got {len(df_test):,} rows and {df_test.shape[1]:,} columns "
    "after dropping duplicates"
)

# perform frequency-encoding on categorical features
# - apply the data transformation pipeline (trained on the training data split)
#   to the test data split
df_test = pd.DataFrame(
    pipe_trans.transform(df_test), columns=categorical_columns + non_categorical_columns
)[non_categorical_columns + categorical_columns].pipe(set_datatypes, dtypes=dtypes_dict)
print(
    f"Got {len(df_test):,} rows and {df_test.shape[1]:,} columns after "
    "frequency-encoding categorical features"
)

with pd.option_context("display.max_columns", None):
    display(df_test.head())
    display(df_test.tail())

Query execution start time = 2023-04-13 17:38:43.708...done at 2023-04-13 17:38:48.259 (4.552 seconds).
Query returned 20,180 rows
Got 20,164 rows and 29 columns after dropping duplicates
Got 20,164 rows and 29 columns after frequency-encoding categorical features


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,country,quarter,month,day_of_month,day_of_week,hour,minute,second,hits,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,added_to_cart,made_purchase_on_future_visit,bounces,last_action,source,medium,channelGrouping,browser,os,deviceCategory
0,6667770433164438858,1487630668,1,2017-02-20 14:44:28,United States,1,2,20,2,14,44,28,7,9,0,39,0,7,75,0,False,0,0,google,organic,Organic Search,Chrome,Windows,desktop
1,861894924191702840,1486761793,1,2017-02-10 13:23:13,United States,1,2,10,6,13,23,13,7,18,0,0,0,7,38,0,False,0,0,google,organic,Organic Search,Chrome,Windows,desktop
2,987663115799971454,1487812839,1,2017-02-22 17:20:39,United States,1,2,22,4,17,20,39,4,9,0,0,0,4,21,0,False,0,0,google,organic,Organic Search,Chrome,Macintosh,desktop
3,7090908358039029290,1487412580,1,2017-02-18 02:09:40,United States,1,2,18,7,2,9,40,6,9,0,12,0,6,540,0,False,0,0,google,organic,Organic Search,other,Windows,desktop
4,294074179132707998,1487899413,1,2017-02-23 17:23:33,United States,1,2,23,5,17,23,33,5,18,0,3,0,5,117,0,False,0,0,google,other,other,Chrome,Macintosh,desktop


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,country,quarter,month,day_of_month,day_of_week,hour,minute,second,hits,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,added_to_cart,made_purchase_on_future_visit,bounces,last_action,source,medium,channelGrouping,browser,os,deviceCategory
20159,171945862803080966,1487825038,1,2017-02-22 20:43:58,United States,1,2,22,4,20,43,58,3,0,0,39,0,3,26,0,False,0,0,(direct),(none),Direct,Safari,iOS,mobile
20160,6600545612381993294,1487380495,1,2017-02-17 17:14:55,United States,1,2,17,6,17,14,55,3,0,0,15,0,2,22,1,False,0,3,other,referral,other,Chrome,Macintosh,desktop
20161,258929636492512392,1487293726,1,2017-02-16 17:08:46,United States,1,2,16,5,17,8,46,3,0,0,5,0,3,1640,0,False,0,0,other,referral,other,other,Android,mobile
20162,7064613315175563493,1487258784,1,2017-02-16 07:26:24,United States,1,2,16,5,7,26,24,3,0,0,36,0,3,53,0,False,0,0,youtube.com,referral,other,other,Windows,desktop
20163,3684942462385225804,1487279963,1,2017-02-16 13:19:23,United States,1,2,16,5,13,19,23,3,0,0,29,0,3,117,0,False,0,0,youtube.com,referral,other,other,Windows,desktop


The start and end `visitStartTime`s match those specified by the required start (`test_start_date`) and end (`test_end_date`) dates of the test data split

In [24]:
assert df_test["visitStartTime"].min().month == int(test_start_date[4:6])
assert df_test["visitStartTime"].max().month == int(test_end_date[4:6])
print(
    f"Visit Start Times: {df_test['visitStartTime'].min().strftime('%Y-%m-%d %H:%M:%S')} - "
    f"{df_test['visitStartTime'].max().strftime('%Y-%m-%d %H:%M:%S')}"
)

Visit Start Times: 2017-02-01 00:00:06 - 2017-02-28 23:54:37


## Discussion of Duplicates During Data Preparation

1. By splitting data for ML, at the time of training the ML model, we are assuming that we don't yet have the validation or testing data during ML model development so we would not know if duplicates are or are not present in those data splits.

   If such a model is performant enough to be deployed to production, the same will apply to out-of-sample (unseen) visitors' visit data. The features that would be needed in production are the same as those that would be needed during ML development. So, when we do access the out-of-sample data in production, we again would want only the first visit per visitor so that (as was done during ML development) we can predict their propensity to make a purchase during a future visit. In order to accomplish this, once we get access to the unseen data in production, we could
   - consider the first valid visit per visitor (i.e. per `fullvisitorid`)
   - create features from this visit and make a prediction (inference) of propensity to purchase during a return (or future) visit
   - for subsequent visits
     - check for duplicates by `fullvisitorid`
     - drop any duplicated (subsequent) visits by the same `fullvisitorid` (since an inference prediction has already been made for this visitor)

   and this workflow does not involve data leakage or [lookahead bias](https://www.investopedia.com/terms/l/lookaheadbias.asp).

   For this reason, we can drop this type of duplicate in the validation and test data splits without being affected by data leakage or lookahead bias.
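
Keeping only the first visit per visitor could be sketched in `pandas` as shown below. The toy data is made up, and the `drop_duplicates()` helper defined earlier in this step may be implemented differently. This only illustrates the idea of retaining the earliest visit per `fullvisitorid`.

```python
import pandas as pd

# hypothetical visits, with repeat visits by visitors "a" and "b"
df_visits = pd.DataFrame(
    {
        "fullvisitorid": ["a", "b", "a", "c", "b"],
        "visitStartTime": pd.to_datetime(
            ["2017-02-03", "2017-02-01", "2017-02-02", "2017-02-05", "2017-02-04"]
        ),
    }
)

# sort chronologically, then keep the earliest visit per visitor
df_first_visits = (
    df_visits.sort_values("visitStartTime")
    .drop_duplicates(subset=["fullvisitorid"], keep="first")
    .reset_index(drop=True)
)
print(df_first_visits)
```

Only information available at the time of each visitor's earliest visit is used to decide which row is kept, so no later information is needed to build the features.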

## Export to Disk

Show datatypes for all data splits

In [25]:
dfs = []
for df, split in zip([df_train, df_val, df_test], ["train", "val", "test"]):
    dfs.append(
        df.dtypes.rename(f"datatype_{split}")
        .reset_index()
        .rename(columns={"index": "column"})
    )
df_dtypes = reduce(
    lambda left, right: pd.merge(left, right, on=["column"], how="outer"), dfs
)
df_dtypes

Unnamed: 0,column,datatype_train,datatype_val,datatype_test
0,fullvisitorid,string[python],string[python],string[python]
1,visitId,string[python],string[python],string[python]
2,visitNumber,Int8,Int8,Int8
3,visitStartTime,datetime64[ns],datetime64[ns],datetime64[ns]
4,country,string[python],string[python],string[python]
5,quarter,Int8,Int8,Int8
6,month,Int8,Int8,Int8
7,day_of_month,Int8,Int8,Int8
8,day_of_week,Int8,Int8,Int8
9,hour,Int8,Int8,Int8


Show info for `DataFrame` with training data

In [26]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92551 entries, 0 to 92550
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   fullvisitorid                  92551 non-null  string        
 1   visitId                        92551 non-null  string        
 2   visitNumber                    92551 non-null  Int8          
 3   visitStartTime                 92551 non-null  datetime64[ns]
 4   country                        92551 non-null  string        
 5   quarter                        92551 non-null  Int8          
 6   month                          92551 non-null  Int8          
 7   day_of_month                   92551 non-null  Int8          
 8   day_of_week                    92551 non-null  Int8          
 9   hour                           92551 non-null  Int8          
 10  minute                         92551 non-null  Int8          
 11  second         

Show info for `DataFrame` with validation data

In [27]:
df_val.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21177 entries, 0 to 21176
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   fullvisitorid                  21177 non-null  string        
 1   visitId                        21177 non-null  string        
 2   visitNumber                    21177 non-null  Int8          
 3   visitStartTime                 21177 non-null  datetime64[ns]
 4   country                        21177 non-null  string        
 5   quarter                        21177 non-null  Int8          
 6   month                          21177 non-null  Int8          
 7   day_of_month                   21177 non-null  Int8          
 8   day_of_week                    21177 non-null  Int8          
 9   hour                           21177 non-null  Int8          
 10  minute                         21177 non-null  Int8          
 11  second         

Show info for `DataFrame` with test data

In [28]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20164 entries, 0 to 20163
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   fullvisitorid                  20164 non-null  string        
 1   visitId                        20164 non-null  string        
 2   visitNumber                    20164 non-null  Int8          
 3   visitStartTime                 20164 non-null  datetime64[ns]
 4   country                        20164 non-null  string        
 5   quarter                        20164 non-null  Int8          
 6   month                          20164 non-null  Int8          
 7   day_of_month                   20164 non-null  Int8          
 8   day_of_week                    20164 non-null  Int8          
 9   hour                           20164 non-null  Int8          
 10  minute                         20164 non-null  Int8          
 11  second         

The training data is now exported to disk

In [29]:
# | code-fold: false
fpath_train = os.path.join(processed_data_dir, "train_processed.parquet.gzip")
df_train.to_parquet(fpath_train, index=False, compression='gzip', engine='pyarrow')

The validation data is now exported to disk

In [30]:
# | code-fold: false
fpath_val = os.path.join(processed_data_dir, "val_processed.parquet.gzip")
df_val.to_parquet(fpath_val, index=False, compression='gzip', engine='pyarrow')

The testing data is now exported to disk

In [31]:
# | code-fold: false
fpath_test = os.path.join(processed_data_dir, "test_processed.parquet.gzip")
df_test.to_parquet(fpath_test, index=False, compression='gzip', engine='pyarrow')

## ETL Workflow to Transform Data

We will now define an ETL workflow to perform the end-to-end transformation using Python functions that use `sklearn.pipeline.Pipeline` and `pandas.DataFrame.pipe`. Such a workflow can be quickly re-used during ML development without needing to re-run all the Python code in this data transformation step, and makes it possible to change

1. start and end dates of raw data
   - this could be useful to add more training data if the ML model is not capturing variability in unseen data
2. thresholds used when frequency-encoding categorical features
   - this could be useful to increase or decrease the cardinality of categorical features
     - increasing cardinality might help improve the predictive power of such features
     - decreasing cardinality might help improve (reduce) ML model training duration and explainability

The ETL workflow will accept the following parameters

1. start and end date for each data split
2. dictionary with frequency encoding thresholds and categorical columns to which these thresholds must be applied
   - currently these thresholds are set to 5% and 10% based on the data preparation step
3. columns to use when dropping duplicates
   - currently a single column `fullvisitorid` is used to drop duplicates, based on the data preparation step
4. feature datatypes
   - these are needed before and after encoding categorical features
5. path to folder in which to save transformed data
   - this is a path to a folder on the local disk where this analysis is being run
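
One way to keep these parameters together is a small container object, sketched below. The `EtlParams` class is hypothetical and is not used elsewhere in this step. The `YYYYMMDD` date format, the `"5_pct"`-style keys (inferred from the way the leading integer of each key is parsed as a percentage) and the assignment of columns to thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EtlParams:
    """Hypothetical container for the ETL workflow parameters."""

    # start and end date per data split, as YYYYMMDD strings (assumed format)
    split_dates: Dict[str, List[str]]
    # frequency-encoding thresholds -> categorical columns (key format assumed)
    cols_to_group: Dict[str, List[str]]
    # columns used when dropping duplicates
    duplicate_cols: List[str] = field(default_factory=lambda: ["fullvisitorid"])
    # feature datatypes, applied before and after encoding
    datatypes: Dict[str, str] = field(default_factory=dict)
    # folder in which to save the transformed data
    processed_data_dir: str = "../data/processed"


params = EtlParams(
    split_dates={
        "train": ["20160901", "20161231"],
        "val": ["20170101", "20170131"],
        "test": ["20170201", "20170228"],
    },
    # illustrative grouping only; the real assignment of columns may differ
    cols_to_group={"5_pct": ["source", "medium"], "10_pct": ["browser", "os"]},
)

# thresholds as fractions, parsed from the leading integer of each key
thresholds = {k: int(k.split("_")[0]) / 100 for k in params.cols_to_group}
print(thresholds)
```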

The following components are defined below

1. helper functions to
   - create `scikit-learn` data transformation pipeline
   - clean data
2. workflow functions to
   - extract raw data from Google BigQuery
   - transform raw data using `scikit-learn` data transformation pipeline
   - load transformed data into file in `data/processed`

Helper functions

In [32]:
# | code-fold: false
def make_data_transformation_pipeline(cols_to_group: Dict[str, List[str]]) -> Pipeline:
    """Create sklearn pipeline to transform cleaned data."""
    # define frequency-encoders for categorical columns that will require
    # a minimum frequency for categories
    categorical_transformer = Pipeline(
        steps=[
            (
                f"enc_{(k.split('_')[0]).zfill(2)}",
                RareLabelEncoder(
                    tol=int(k.split("_")[0]) / 100,
                    n_categories=2,
                    variables=v,
                    replace_with="other",
                ),
            )
            for k, v in cols_to_group.items()
        ]
    )
    preprocessor = ColumnTransformer(
        transformers=[("cat", categorical_transformer, categorical_columns)],
        remainder="passthrough",
    )

    # create overall data transformation pipeline
    pipe = Pipeline(steps=[("preprocessor", preprocessor)])
    return pipe


def clean_data(
    df: pd.DataFrame, datatypes_dict: Dict, subset: List[str]
) -> pd.DataFrame:
    """Perform data cleaning."""
    df = (
        # set column datatypes
        df.pipe(set_datatypes, dtypes=datatypes_dict)
        # drop duplicates
        .pipe(drop_duplicates, subset=subset)
    )
    print(
        f"Got {len(df):,} rows and {df.shape[1]:,} columns " "after dropping duplicates"
    )
    return df

Workflow functions

In [33]:
# | code-fold: false
def extract_data(
    split_start_date: str,
    split_end_date: str,
    train_start_date: str,
    test_end_date: str,
) -> pd.DataFrame:
    """Retrieve data from Google BigQuery dataset."""
    query = get_sql_query(
        split_start_date, split_end_date, train_start_date, test_end_date
    )
    df = run_sql_query(query, **gcp_auth_dict, show_df=False)
    return df


def transform_data(
    df: pd.DataFrame,
    categorical_columns: List[str],
    datatypes_dict: Dict,
    pipe: Union[Pipeline, None] = None,
) -> List[Union[pd.DataFrame, Pipeline]]:
    """Transform features in data."""
    # clean data
    df = df.pipe(clean_data, datatypes_dict=datatypes_dict, subset=["fullvisitorid"])

    # Get a list of the non-categorical columns
    non_categorical_columns = [c for c in list(df) if c not in categorical_columns]

    # train a data transformation pipeline to perform frequency-encoding on
    # categorical features in data
    # - this is needed for training data split only
    if pipe is not None:
        _ = pipe.fit(df)

    # Apply a trained data transformation pipeline to perform frequency-encoding on
    # categorical features in data
    # - this is needed for training, validation and test data splits
    # - if no pipeline was passed in, fall back to the pipeline trained earlier
    #   in this step (`pipe_trans`)
    pipe_fitted = pipe if pipe is not None else pipe_trans
    cols_transformed = categorical_columns + non_categorical_columns
    df = pd.DataFrame(pipe_fitted.transform(df), columns=cols_transformed)[
        non_categorical_columns + categorical_columns
    ]

    # set datatypes after data transformation
    df = df.pipe(set_datatypes, dtypes=datatypes_dict)

    print(
        f"Got {len(df):,} rows and {df.shape[1]:,} columns after "
        "frequency-encoding categorical features"
    )
    return [df, pipe]


def load_data(
    df: pd.DataFrame, processed_data_dir: str, split_type: str = "train"
) -> None:
    """Save data to file on local disk."""
    fpath = os.path.join(processed_data_dir, f"{split_type}_processed.parquet.gzip")
    df.to_parquet(fpath, index=False, compression="gzip", engine="pyarrow")
    print(f"Exported data to {fpath}")

::: {.callout-note title="Notes"}

These workflow functions also depend on the following helper functions (defined earlier in this step)

1. `set_datatypes()`
2. `drop_duplicates()`
3. `get_sql_query()`
4. `run_sql_query()`

If such a workflow is to be used in future steps in the analysis (e.g. ML development), then both the

1. workflow functions
2. helper functions

must be defined in that step.
:::

We'll now demonstrate that this ETL workflow for retrieving and transforming data produces output identical to the transformed data from the non-ETL (manual) workflow described earlier in this step.

First, define the data transformation pipeline

In [34]:
# | code-fold: false
pipe = make_data_transformation_pipeline(cols_to_group)

Next, run the ETL workflow to create the training data and train the data transformation pipeline

In [35]:
# | code-fold: false
df_train_v2, pipe_trained = extract_data(
    train_start_date,
    train_end_date,
    train_start_date,
    test_end_date,
).pipe(
    transform_data,
    categorical_columns=categorical_columns,
    datatypes_dict=dtypes_dict,
    pipe=pipe,
)
df_train_v2.pipe(load_data, processed_data_dir, "train")

Query execution start time = 2023-04-13 17:38:48.599...done at 2023-04-13 17:39:06.459 (17.859 seconds).
Query returned 92,859 rows
Got 92,551 rows and 29 columns after dropping duplicates
Got 92,551 rows and 29 columns after frequency-encoding categorical features
Exported data to ../data/processed/train_processed.parquet.gzip


::: {.callout-note title="Notes"}

Here, we also collect the trained data transformation pipeline. This can be directly used to transform validation and test data splits without re-training.
:::
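
The fit-once, transform-many pattern behind this note can be sketched with plain `pandas`. This is a minimal illustration, not the notebook's pipeline: the `fit_frequent_labels()` and `group_rare_labels()` helpers, the `browser` column, the 20% threshold and the toy data are all assumptions made for the example.

```python
import pandas as pd


def fit_frequent_labels(s: pd.Series, tol: float = 0.2) -> set:
    """Learn which labels are frequent (share >= tol) from training data."""
    shares = s.value_counts(normalize=True)
    return set(shares[shares >= tol].index)


def group_rare_labels(
    s: pd.Series, frequent: set, placeholder: str = "other"
) -> pd.Series:
    """Replace labels that were not learned as frequent with a placeholder."""
    return s.where(s.isin(list(frequent)), placeholder)


# hypothetical toy splits
train = pd.Series(["Chrome"] * 6 + ["Safari"] * 3 + ["Opera"], name="browser")
val = pd.Series(["Chrome", "Opera", "Edge"], name="browser")

# learn the frequent labels from the training split only
frequent = fit_frequent_labels(train)

# reuse the learned labels on both splits, without re-fitting on validation data
train_grouped = group_rare_labels(train, frequent)
val_grouped = group_rare_labels(val, frequent)
```

Because `frequent` is learned from `train` only, a label that appears only in the validation split (here `Edge`) falls into the placeholder bucket instead of changing the encoding.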

Next, run the ETL workflow to create the validation data, with the transformation pipeline that was trained using the training data

In [36]:
# | code-fold: false
df_val_v2, _ = extract_data(
    val_start_date,
    val_end_date,
    train_start_date,
    test_end_date,
).pipe(
    transform_data,
    categorical_columns=categorical_columns,
    datatypes_dict=dtypes_dict,
    pipe=pipe_trained,
)
df_val_v2.pipe(load_data, processed_data_dir, "val")

Query execution start time = 2023-04-13 17:39:07.551...done at 2023-04-13 17:39:12.790 (5.240 seconds).
Query returned 21,208 rows
Got 21,177 rows and 29 columns after dropping duplicates
Got 21,177 rows and 29 columns after frequency-encoding categorical features
Exported data to ../data/processed/val_processed.parquet.gzip


::: {.callout-note title="Notes"}

Here, we do not need to collect the data transformation pipeline since it was already trained using the training data.
:::

Finally, run the ETL workflow to create test data, with the transformation pipeline that was trained using the training data

In [37]:
# | code-fold: false
df_test_v2, _ = extract_data(
    test_start_date,
    test_end_date,
    train_start_date,
    test_end_date,
).pipe(
    transform_data,
    categorical_columns=categorical_columns,
    datatypes_dict=dtypes_dict,
    pipe=pipe_trained,
)
df_test_v2.pipe(load_data, processed_data_dir, "test")

Query execution start time = 2023-04-13 17:39:13.145...done at 2023-04-13 17:39:18.113 (4.968 seconds).
Query returned 20,180 rows
Got 20,164 rows and 29 columns after dropping duplicates
Got 20,164 rows and 29 columns after frequency-encoding categorical features
Exported data to ../data/processed/test_processed.parquet.gzip


::: {.callout-note title="Notes"}

Again, we do not need to collect the data transformation pipeline since it was already trained using the training data.
:::

Verify that the manual and ETL approaches give identical output for the training, validation and test data splits after data transformation

In [38]:
# | code-fold: false
assert df_train.equals(df_train_v2)
assert df_val.equals(df_val_v2)
assert df_test.equals(df_test_v2)
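
`DataFrame.equals()` returns only `True` or `False`. If a check like the one above fails, `pandas.testing.assert_frame_equal()` can help locate the difference, since it raises an `AssertionError` with a description of the mismatch. A minimal sketch on hypothetical toy frames (the column names and values are made up for illustration):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# hypothetical toy outputs of the two workflows
df_manual = pd.DataFrame(
    {"hits": [1, 2, 3], "browser": ["Chrome", "other", "Chrome"]}
)
df_etl = df_manual.copy()

# identical frames: both checks pass
assert df_manual.equals(df_etl)
assert_frame_equal(df_manual, df_etl)

# introduce a single differing value
df_etl_changed = df_etl.copy()
df_etl_changed.loc[1, "hits"] = 20

frames_match = df_manual.equals(df_etl_changed)
try:
    assert_frame_equal(df_manual, df_etl_changed)
    mismatch_report = ""
except AssertionError as err:
    # the error message describes where the two frames differ
    mismatch_report = str(err)
```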

## Summary of Tasks Performed

This step performed the following

1. Overall
   - training, validation and test data splits were created that can be used to address the objective of training an ML model to predict a new visitor's propensity to make a purchase from the merchandise store on the Google Marketplace during February 2017
2. Data Transformation
   - Features in the prepared data splits were created at the visit level. Since the objective is to predict propensity to make a purchase during a future visit, features should also be at the level of visits. Based on [Google Analytics' definition](https://sporkmarketing.com/376/what-are-visitors-unique-visitors-and-page-views-google-analytics/), visits were defined by the combination of the
     - `fullvisitorid`
     - `visitId`
     - `visitNumber`
     - `visitStartTime`

     columns.
3. Data Splits
   - the data was split by month of the year
   - new visitors in the training data who returned to the Google Merchandise Store during the months covered by the training data do not also need to appear in the validation or test data splits
     - this is not a problem since the current project's business use-case is targeting new visitors and not the same/existing visitors
4. Feature Selection
   - The data splits were created using a subset of columns provided for visitor transactions on the store's website. These columns were selected based on
     - exploratory data analysis performed in the preceding step
     - intuition about factors that would be predictive of a new visitor's propensity (probability) of making a purchase on a future visit to the store
   - It might be best to start by training an ML model with one of the General Features (`time_on_site`, `hits`, `pageviews`) and only add more if necessary to improve performance
5. Feature Processing
   - approaches to process features were adopted based on frequencies observed in the training data during the EDA step
     - categorical features were bucketed
   - based on analysis in the current step
     - `bounces` was shown to be a binary column and should be treated as a categorical ML feature
     - candidates for handling class imbalance during ML training are
       - undersampling
       - no changes
       - SMOTE
6. Defined ETL workflow
   - this supports quickly changing parameters of the data transformation pipeline in each data split during future steps
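
Of the class-imbalance candidates listed under Feature Processing, random undersampling is the simplest to sketch in `pandas`. The example below is an illustration under assumptions: the `made_purchase` label column, the toy data and the `undersample_majority()` helper are hypothetical, and resampling of this kind is usually applied to the training split only.

```python
import pandas as pd


def undersample_majority(
    df: pd.DataFrame, label_col: str, random_state: int = 0
) -> pd.DataFrame:
    """Randomly downsample every class to the size of the smallest class."""
    n_minority = int(df[label_col].value_counts().min())
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # shuffle so that the classes are not stored in contiguous blocks
    return (
        pd.concat(parts)
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )


# hypothetical toy training data: 8 non-purchasers and 2 purchasers
df_toy = pd.DataFrame({"pageviews": range(10), "made_purchase": [0] * 8 + [1] * 2})
df_balanced = undersample_majority(df_toy, "made_purchase")
```

After undersampling, both classes in `df_balanced` have the same number of rows, at the cost of discarding most of the majority-class rows.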

## Summary of Assumptions

None.

## Limitations

None.

## Next Step

The next step will develop a baseline model using the transformed data splits.