# Data Transformation

In [1]:
%load_ext lab_black

In [2]:
import os
from datetime import datetime
from glob import glob

import numpy as np
import pandas as pd
from google.oauth2 import service_account

## About

### Objective
This notebook prepares data for ML model training. Training, validation and test data splits are created in order to support training an ML model to predict the propensity of new visitors to the Google Merchandise store on the Google Marketplace to make a purchase on a future visit.

### Overview of Data Transformation
The raw data is provided per visit, and the ML model needs to be trained to make predictions at the visit level. However, useful features might exist based on actions (e.g. adding an item to a shopping cart, completing a purchase, etc.) performed by a visitor within each visit or the device used by the visitor during the visit. All actions performed per visit are found in a nested column `hits` for each visit, so the raw data for this column must be exploded from one row per visit to one row per action in order to extract these action-based features. Exploding a nested row is similar to the `DataFrame.explode()` method in `pandas` ([link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html#pandas.DataFrame.explode)). Then, the data is aggregated again by visit.

Exploding the `hits` column results in multiple rows where the sequence of actions is important. Features can be created based on this sequence. As an example, if a visit ended in one or more items being added to the shopping cart, then the last action performed in the visit is an add-to-cart (encoded as the integer `3`). A feature will be created to indicate the last action of each visit, and this requires exploding the `hits` column, which is a nested column.
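To make the explode-then-aggregate idea concrete, here is a minimal `pandas` sketch. The toy `visits` frame, its `visitId` values and the action codes are made up for illustration (they are not taken from the `BigQuery` table), and the last action is taken here as the final element of each visit's sequence.

```python
import pandas as pd

# Toy visit-level data: each row is one visit and `hits` holds the ordered
# action codes for that visit (hypothetical values, not from the real table)
visits = pd.DataFrame(
    {
        "visitId": [101, 102],
        "hits": [[0, 2, 3], [0, 0]],
    }
)

# One row per action: explode the nested column (analogous to UNNEST in BigQuery)
actions = visits.explode("hits").rename(columns={"hits": "action_type"})
actions["action_type"] = actions["action_type"].astype(int)

# Back to one row per visit: keep the last action in each visit's sequence
last_action = (
    actions.groupby("visitId", sort=False)["action_type"].last().rename("last_action")
)
print(last_action.to_dict())
```

In this toy example, visit `101` ends with an add-to-cart (`3`), while visit `102` never gets past an unknown action (`0`).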

Similarly, to get the browser used during a visit, a nested value `device.browser` needs to be extracted from the `device` column. Exploding the `device` column is not necessary for this purpose, since there is only one browser used per visit (per the definition of a visit), so, when exploded, all rows for the visit would contain the same browser. With this in mind, we can directly access `device.browser` from the nested `device` column, which is [supported by `BigQuery`](https://cloud.google.com/bigquery/docs/nested-repeated#define_nested_and_repeated_columns).

Nested columns that are exploded are
- `hits`

Nested columns that are used without exploding are
- `totals`
- `trafficSource`
- `device`
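As a rough `pandas` analogy for reading nested (but not repeated) fields without exploding, the sketch below flattens toy records with `pd.json_normalize`. The records and their values are hypothetical. The point is only that the row count stays at one row per visit.

```python
import pandas as pd

# Toy records mimicking the nested (non-repeated) `device` and `totals` columns;
# the values are made up for illustration
records = [
    {
        "visitId": 101,
        "device": {"browser": "Chrome", "operatingSystem": "Windows"},
        "totals": {"hits": 12},
    },
    {
        "visitId": 102,
        "device": {"browser": "Safari", "operatingSystem": "iOS"},
        "totals": {"hits": 3},
    },
]

# json_normalize flattens nested dictionaries into dotted column names,
# so there is still one row per visit (no explode is needed)
flat = pd.json_normalize(records)
print(flat[["visitId", "device.browser", "totals.hits"]])
```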

### Feature Selection
Features are selected based on
1. EDA performed in previous notebooks
2. intuition about features that might be predictive of a new visitor making a purchase on a future visit to the Merchandise store

### Data
The raw data from the `BigQuery` table that was used in the preceding EDA notebooks is used again here.

### Assumptions
None.

### Output
A file containing the features and labels will be exported for the
1. training
2. validation
3. test

data splits.

## User Inputs

In [3]:
PROJ_ROOT_DIR = os.path.join(os.pardir)

In [4]:
# start and end dates
train_start_date = "20160901"
train_end_date = "20161130"
val_start_date = "20161201"
val_end_date = "20161231"
test_start_date = "20170101"
test_end_date = "20170131"

# # Google Cloud PROJECT ID
# gcp_project_id = <google-project-id>
# # Google Cloud Service Account JSON local filepath
# gcp_creds_fpath = <google-cloud-service-account-json-file>

# IGNORE
# Google Cloud PROJECT ID
gcp_project_id = os.environ["GCP_PROJECT_ID"]
# Google Cloud Service Account JSON local filepath
gcp_creds_fpath = glob(
    os.path.join(os.path.join(PROJ_ROOT_DIR, "data", "raw"), "*.json")
)[0]

In [5]:
# data directories
data_dir = os.path.join(PROJ_ROOT_DIR, "data")
intermediate_data_dir = os.path.join(PROJ_ROOT_DIR, "data", "intermediate")

# authenticate BigQuery
gcp_credentials = service_account.Credentials.from_service_account_file(gcp_creds_fpath)
gcp_auth_dict = dict(gcp_project_id=gcp_project_id, gcp_creds=gcp_credentials)

# mapping dictionary to get meaningful names from the action_type column
mapper = {
    1: "Click through of product lists",
    2: "Product detail views",
    3: "Add product(s) to cart",
    4: "Remove product(s) from cart",
    5: "Check out",
    6: "Completed purchase",
    7: "Refund of purchase",
    8: "Checkout options",
    0: "Unknown",
}

Define a Python helper function to execute a SQL query using Google BigQuery

In [6]:
def run_sql_query(
    query: str,
    gcp_project_id: str,
    gcp_creds: service_account.Credentials,
    show_dtypes: bool = False,
    show_info: bool = False,
    show_df: bool = False,
) -> pd.DataFrame:
    """Run query on Google BigQuery and return results as pandas DataFrame."""
    start_time = datetime.now()
    start_time_str = start_time.strftime("%Y-%m-%d %H:%M:%S")
    print(f"Query execution start time = {start_time_str}...", end="")
    df = pd.read_gbq(
        query,
        project_id=gcp_project_id,
        credentials=gcp_creds,
        dialect="standard",
        # configuration is optional, since default for query caching is True
        configuration={"query": {"useQueryCache": True}},
        # use_bqstorage_api=True,
    )
    end_time = datetime.now()
    # total_seconds() also accounts for the days component of the timedelta
    duration = (end_time - start_time).total_seconds()
    print(f"done at {end_time.strftime('%Y-%m-%d %H:%M:%S')} ({duration:.3f} seconds).")
    print(f"Query returned {len(df):,} rows")
    if show_df:
        with pd.option_context("display.max_columns", None):
            display(df)
    if show_dtypes:
        display(df.dtypes.rename("dtype").to_frame().transpose())
    if show_info:
        df.info()
    return df

## Get Data

### Create Training Data

**Get visitors with a purchase on their return visit to the Marketplace.**

To get these visitors, we will use a similar approach to that from the `01_get_data_eda.ipynb` notebook, where two criteria were used to identify a purchase on a return visit
1. `totals.transactions > 0` to only get visitors who made a purchase on the marketplace
2. `totals.newVisits IS NULL` to only get visitors who have returned to the marketplace (i.e. this is not their first visit)
   - `totals.newVisits` is 1 for a visitor's first visit and `NULL` for subsequent visits
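The two criteria can be illustrated on a toy table with `pandas`. The visitor IDs and values below are hypothetical, and the column names simply mirror `totals.transactions` and `totals.newVisits`. This is a sketch of the labeling logic, not the query used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy visit-level rows (hypothetical): newVisits is 1 on a first visit and
# missing on return visits; transactions is missing when nothing was bought
visits = pd.DataFrame(
    {
        "fullvisitorid": ["a", "a", "b", "b", "c"],
        "newVisits": [1, np.nan, 1, np.nan, 1],
        "transactions": [np.nan, 1, 1, np.nan, np.nan],
    }
)

# A return-visit purchase needs both criteria to hold on the same visit
is_return_purchase = (visits["transactions"] > 0) & visits["newVisits"].isna()

# One label per visitor: True if any of their visits is a return-visit purchase
label = (
    is_return_purchase.groupby(visits["fullvisitorid"])
    .any()
    .rename("made_purchase_on_future_visit")
)
print(label.to_dict())
```

Visitor `b` bought only on the first visit, so the label is `False` for that visitor even though a purchase exists.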

The ML model will be evaluated on its predicted *visitor propensity to make a purchase during a future visit*. The model will be
1. trained using data from visitors' first visit (September 1, 2016 to November 30, 2016 - training data split)
2. used to make predictions about a visitor's return visit (December 1, 2016 to December 31, 2016 - validation data split, and January 1, 2017 to January 31, 2017 - test data split, per the dates in the User Inputs section)

In order to train this model, we need visitors who have made multiple visits to the marketplace during the combined period from the start of the training data to the end of the test data and who have made a purchase during their return visit. For this reason, we will add a date filter (from `train_start_date` to `test_end_date`) to capture such visitors during this combined period.

This is shown below

In [7]:
%%time
query = f"""
        SELECT fullvisitorid,
               IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, True, False) AS made_purchase_on_future_visit,
               IF(SUM(CASE WHEN totals.transactions > 0 AND totals.newVisits IS NULL THEN 1 ELSE 0 END) > 0, True, False) AS made_purchase_on_future_visit_v2
        FROM `data-to-insights.ecommerce.web_analytics`
        WHERE date BETWEEN '{train_start_date}' AND '{test_end_date}'
        AND geoNetwork.country = 'United States'
        GROUP BY fullvisitorid
        """
df = run_sql_query(query, **gcp_auth_dict, show_df=False)
display(
    # breakdown of returning purchasers using COUNTIF()
    df["made_purchase_on_future_visit"].value_counts().rename(
        "num_return_purchasers_using_countif"
    ).to_frame().merge(
        # breakdown of returning purchasers using CASE WHEN()
        df["made_purchase_on_future_visit_v2"].value_counts().rename(
            "num_return_purchasers_using_if_sum_casewhen"
        ).to_frame(),
        left_index=True,
        right_index=True,
    ).merge(
        (
            100
            * df["made_purchase_on_future_visit"]
            .value_counts(normalize=True)
            .rename("frac_made_purchase_on_future_visit")
        ).to_frame(),
        left_index=True,
        right_index=True,
    ).reset_index().rename(columns={"index": "made_return_purchase"})
)

Query execution start time = 2023-04-10 00:23:45...done at 2023-04-10 00:23:52 (6.463 seconds).
Query returned 117,479 rows


Unnamed: 0,made_purchase_on_future_visit,num_return_purchasers_using_countif,num_return_purchasers_using_if_sum_casewhen,frac_made_purchase_on_future_visit
0,False,112301,112301,95.592404
1,True,5178,5178,4.407596


CPU times: user 789 ms, sys: 80.2 ms, total: 869 ms
Wall time: 6.47 s


**Observations**
1. It is reassuring that both approaches used (`COUNTIF()` and `IF(SUM(CASE WHEN...))`) give the same output. `COUNTIF` is a `BigQuery` function ([1](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#countif)).
2. These are the visitors that we will need to select when creating a dataset for use in ML training. We will create a SQL query next to get visits by these visitors during a certain number of months in order to create the ML model training data, and then extract features and the label from these visits. In order to build up the validation data split, a similar process will be repeated using the months immediately following the months covered by the training data. The test data split will occur chronologically after the validation data split.
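The chronological ordering of the three splits can be sketched as follows. The boundaries are copied from the User Inputs section, while the helper `assign_split` and the sample dates are hypothetical and only illustrate how a `YYYYMMDD` date string maps to a split.

```python
import pandas as pd

# Split boundaries copied from the User Inputs section (YYYYMMDD strings)
splits = {
    "train": ("20160901", "20161130"),
    "val": ("20161201", "20161231"),
    "test": ("20170101", "20170131"),
}


def assign_split(date_str: str) -> str:
    """Return the name of the split whose date range contains date_str."""
    for name, (start, end) in splits.items():
        # zero-padded YYYYMMDD strings sort the same way as the dates they encode
        if start <= date_str <= end:
            return name
    return "outside"


# Hypothetical visit dates, one per split plus one after the test period
dates = pd.Series(["20160915", "20161210", "20170120", "20170301"])
print(dates.map(assign_split).tolist())
```

Because the ranges do not overlap and each split starts after the previous one ends, no visit can land in more than one split.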

**Create the dataset of observations to be used in ML training. To do this, get the columns to be used as ML features, which capture visitors' first visit. Show this for the visitors who were identified above to have made a purchase on a future visit to the Marketplace.**

The following are the types of features that are selected
1. User-Facing Features
   - name of browser
   - type of operating system (e.g. Windows, Macintosh, Linux)
   - [referring channel](https://www.jellyfish.com/en-us/training/blog/google-analytics-channels-explained)
   - type of device
2. General Features
   - hits
   - bounces
   - page views
   - time spent on the Marketplace website

In order to get these columns, we'll use the following approach
1. flag each visitor according to whether they made a purchase on a return visit (this comes from the query immediately above)
2. get all features for the first visit (`totals.newVisits = 1`) for all visitors
3. `INNER JOIN` features with the flagged visitors
   - this attaches the `made_purchase_on_future_visit` label to the first-visit features of every visitor; visitors with and without a purchase on a return visit are both kept, since the label takes the value `True` or `False`
4. `GROUP BY` visit and get the last action that was performed in that visit
   - ML modeling will be performed against data at the visit level since we need to predict propensity to make a purchase during a future visit
   - the `UNNEST` function has exploded nested actions per visit on separate rows, so the exploded data in the `first_visit_attributes` CTE is per action
   - since the ML model will be trained on visits, a `GROUP BY` is required to convert actions into visits
   - since we are only retrieving the first visit for each visitor (from 2. above), getting the last action that was performed in that visit indicates how far the visitor advanced in the purchase process during their first visit to the marketplace
     - intuitively, this seems like it could be an indicator of whether the visitor will make a purchase on a return visit to the Google Merchandise store
   - a unique visit is defined by the combination of the following columns
     - `fullvisitorid`
     - `visitId`
     - `visitNumber`
     - `visitStartTime`

     so the `GROUP BY` must be performed over these columns
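Before the SQL, the four steps can be sketched in `pandas` on toy data. Both frames below are hypothetical stand-ins (the real work is done in `BigQuery`), and, as in the SQL further down, the last action is approximated by the largest `action_type` code seen in the visit.

```python
import pandas as pd

# Step 1 (toy result): one row per visitor with the label
labels = pd.DataFrame(
    {
        "fullvisitorid": ["a", "b", "c"],
        "made_purchase_on_future_visit": [True, False, False],
    }
)

# Toy action-level rows, i.e. visits already exploded to one row per action
actions = pd.DataFrame(
    {
        "fullvisitorid": ["a", "a", "a", "b", "b", "c"],
        "visitId": [1, 1, 2, 3, 3, 4],
        "visitNumber": [1, 1, 2, 1, 1, 1],
        "visitStartTime": [10, 10, 50, 20, 20, 30],
        "newVisits": [1, 1, None, 1, 1, 1],
        "action_type": [0, 3, 6, 0, 2, 0],
    }
)

# Step 2: keep only actions that belong to a first visit
first_visits = actions[actions["newVisits"] == 1]

# Step 3: inner join to attach the label to every first-visit action
joined = first_visits.merge(labels, on="fullvisitorid", how="inner")

# Step 4: aggregate actions back to one row per visit
visit_keys = ["fullvisitorid", "visitId", "visitNumber", "visitStartTime"]
per_visit = joined.groupby(
    visit_keys + ["made_purchase_on_future_visit"], as_index=False
).agg(last_action=("action_type", "max"))
print(per_visit)
```

The return visit of visitor `a` (`visitId` 2) is dropped in step 2, so only first-visit behaviour feeds the features while the label still reflects the later purchase.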

Steps 1., 2. and 3. are shown below

In [8]:
%%time
query = f"""
        WITH
        -- Step 1. get visitors with a purchase on a future visit
        next_visit_purchasers AS (
             SELECT fullvisitorid,
                    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, True, False) AS made_purchase_on_future_visit
             FROM `data-to-insights.ecommerce.web_analytics`
             WHERE date BETWEEN '{train_start_date}' AND '{test_end_date}'
             AND geoNetwork.country = 'United States'
             GROUP BY fullvisitorid
        ),
        -- Steps 2. and 3. get features (attributes) of the first visit at the action level
        first_visit_attributes AS (
            SELECT fullvisitorid,
                   visitId,
                   visitNumber,
                   DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime))) AS visitStartTime,
                   -- =========== GENERAL FEATURES ===========
                   -- total number of hits
                   (CASE WHEN totals.hits > 0 THEN totals.hits ELSE 0 END) AS hits,
                   -- number of bounces
                   (CASE WHEN totals.bounces > 0 THEN totals.bounces ELSE 0 END) AS bounces,
                   -- action performed during first visit
                   CAST(h.eCommerceAction.action_type AS INT64) AS action_type,
                   -- page views
                   IFNULL(totals.pageviews, 0) AS pageviews,
                   -- time on the website
                   IFNULL(totals.timeOnSite, 0) AS time_on_site,
                   -- source of the traffic from which the visit was initiated
                   trafficSource.source,
                   -- medium of the traffic from which the visit was initiated
                   trafficSource.medium,
                   -- =========== USER-FACING FEATURES ===========
                   -- referring channel connected to visit
                   channelGrouping,
                   -- user's browser
                   device.browser,
                   -- user's operating system
                   device.operatingSystem AS os,
                   -- user's type of device
                   device.deviceCategory,
                   -- =========== LABEL ===========
                   made_purchase_on_future_visit
            FROM `data-to-insights.ecommerce.web_analytics`,
            UNNEST(hits) AS h
            INNER JOIN next_visit_purchasers USING (fullvisitorid)
            WHERE date BETWEEN '{train_start_date}' AND '{train_end_date}'
            AND geoNetwork.country = 'United States'
            AND totals.newVisits = 1
        )
        SELECT *
        FROM first_visit_attributes
        """
df_train_actions = run_sql_query(query, **gcp_auth_dict, show_df=False)
df_train_actions['action_type'] = df_train_actions['action_type'].map(mapper).astype(pd.StringDtype())
display(df_train_actions)

Query execution start time = 2023-04-05 20:35:42...done at 2023-04-05 20:36:32 (49.723 seconds).
Query returned 526,980 rows


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,hits,bounces,action_type,pageviews,time_on_site,source,medium,channelGrouping,browser,os,deviceCategory,made_purchase_on_future_visit
0,2777908945786878159,1474972357,1,2016-09-27 10:32:37,12,0,Unknown,3,559,youtube.com,referral,Social,Chrome,Windows,desktop,False
1,2777908945786878159,1474972357,1,2016-09-27 10:32:37,12,0,Unknown,3,559,youtube.com,referral,Social,Chrome,Windows,desktop,False
2,2777908945786878159,1474972357,1,2016-09-27 10:32:37,12,0,Unknown,3,559,youtube.com,referral,Social,Chrome,Windows,desktop,False
3,2777908945786878159,1474972357,1,2016-09-27 10:32:37,12,0,Unknown,3,559,youtube.com,referral,Social,Chrome,Windows,desktop,False
4,2777908945786878159,1474972357,1,2016-09-27 10:32:37,12,0,Unknown,3,559,youtube.com,referral,Social,Chrome,Windows,desktop,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
526975,1446285229092173706,1478190491,1,2016-11-03 16:28:11,3,0,Unknown,3,224,(direct),(none),Direct,Chrome,Chrome OS,desktop,True
526976,1446285229092173706,1478190491,1,2016-11-03 16:28:11,3,0,Unknown,3,224,(direct),(none),Direct,Chrome,Chrome OS,desktop,True
526977,0486118040507415508,1477164641,1,2016-10-22 19:30:41,3,0,Unknown,3,78,t.co,referral,Social,Chrome,Windows,desktop,False
526978,0486118040507415508,1477164641,1,2016-10-22 19:30:41,3,0,Unknown,3,78,t.co,referral,Social,Chrome,Windows,desktop,False


CPU times: user 12.8 s, sys: 741 ms, total: 13.5 s
Wall time: 49.8 s


**Notes**
1. Visits versus Sessions ([1](https://www.blyp.ai/a/question-hub/google-analytics/sessions-vs-visits-are-they-the-same-in-google-analytics), [2](https://databox.com/sessions-users-pageviews-in-google-analytics#head2))
   - **VISIT:** A visit identifies a user reaching the marketplace website
     - A visitor can have multiple visits since they can visit the marketplace website multiple times
   - **SESSION:** A session captures a visitor's interactions on the site during a visit
     - A session begins at the start of the visit and ends after 30 minutes of inactivity by the visitor
2. Each row here corresponds to a single action performed by a single visitor during a visit.
3. The `made_purchase_on_future_visit` is the label for ML training. However, this column is currently shown at the user action level (since the visits were exploded using the `UNNEST` function on the `hits` column). The label value only changes at the visit level since we only know if a visitor will make a purchase on their return (or future) visit after that visit has ended and that applies to the entire visit. So, we can aggregate over this column (include this column in the `GROUP BY`) in order to get it at the visit level
   - this column indicates if a visitor makes a purchase during their *next* visit
   - a ML model will be trained to predict this probability (propensity) of making a purchase during the return visit to the Merchandise store
   - the ML model will be trained on features of the same visitor's *first* visit
   - this is a [forward-looking](https://docs.aws.amazon.com/whitepapers/latest/time-series-forecasting-principles-with-amazon-forecast/step-2-prepare-data.html#concepts-of-featurization-and-related-time-series) label (`y`)
4. We had to select `totals.newVisits = 1` since we only wanted ML features from the first visit. We can't use features from the return visit since we want to predict the outcome of the return visit *ahead of that visit* occurring. Earlier, we selected visitors who made a purchase on a future visit. For these visitors, the ML features (`X`) will be extracted from their first visit only. We cannot extract features from the return visit, since **the ML model is being trained before a future visit has occurred and so the features associated with the return visit are not known at the time of ML training.**

Finally, step 4. is performed where we aggregate by visit and get the last action performed during the first visit. The following columns are included as part of the definition of a visit
1. columns of aggregated stats per visit
   - `hits`
   - `bounces`
   - `pageviews`
   - `time_on_site`
2. columns showing features of the traffic source that initiated a visitor's visit
   - `source`
   - `medium`
   - `channelGrouping`
3. columns with features of the visitor's device used during a visit
   - `browser`
   - `os` (operating system)
   - `deviceCategory` (e.g. desktop or mobile)
4. (as mentioned above) the label column is reported per visit
   - `made_purchase_on_future_visit`

These are columns that only change at the visit level, so we must group over these columns.

This `GROUP BY` is shown below

In [9]:
%%time
query = f"""
        WITH
        -- Step 1. get visitors with a purchase on a future visit
        next_visit_purchasers AS (
             SELECT fullvisitorid,
                    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, True, False) AS made_purchase_on_future_visit
             FROM `data-to-insights.ecommerce.web_analytics`
             WHERE date BETWEEN '{train_start_date}' AND '{test_end_date}'
             AND geoNetwork.country = 'United States'
             GROUP BY fullvisitorid
        ),
        -- Steps 2. and 3. get features (attributes) of the first visit at the action level
        first_visit_attributes AS (
            SELECT -- =========== GEOSPATIAL AND TEMPORAL ATTRIBUTES OF VISIT ===========
                   geoNetwork.country,
                   EXTRACT(QUARTER FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS quarter,
                   EXTRACT(MONTH FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS month,
                   EXTRACT(DAY FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS day_of_month,
                   EXTRACT(DAYOFWEEK FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS day_of_week,
                   EXTRACT(HOUR FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS hour,
                   EXTRACT(MINUTE FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS minute,
                   EXTRACT(SECOND FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS second,
                   -- =========== VISIT AND VISITOR METADATA ===========
                   fullvisitorid,
                   visitId,
                   visitNumber,
                   DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern') AS visitStartTime,
                   -- =========== SOURCE OF SITE TRAFFIC ===========
                   -- source of the traffic from which the visit was initiated
                   trafficSource.source,
                   -- medium of the traffic from which the visit was initiated
                   trafficSource.medium,
                   -- referring channel connected to visit
                   channelGrouping,
                   -- =========== VISITOR ACTIVITY ===========
                   -- total number of hits
                   (CASE WHEN totals.hits > 0 THEN totals.hits ELSE 0 END) AS hits,
                   -- number of bounces
                   (CASE WHEN totals.bounces > 0 THEN totals.bounces ELSE 0 END) AS bounces,
                   -- action performed during first visit
                   CAST(h.eCommerceAction.action_type AS INT64) AS action_type,
                   -- page views
                   IFNULL(totals.pageviews, 0) AS pageviews,
                   -- time on the website
                   IFNULL(totals.timeOnSite, 0) AS time_on_site,
                   -- whether add-to-cart was performed during visit
                   (CASE WHEN CAST(h.eCommerceAction.action_type AS INT64) = 3 THEN 1 ELSE 0 END) AS added_to_cart,
                   (CASE WHEN CAST(h.eCommerceAction.action_type AS INT64) = 2 THEN 1 ELSE 0 END) AS product_details_viewed,
                   -- =========== VISITOR DEVICES ===========
                   -- user's browser
                   device.browser,
                   -- user's operating system
                   device.operatingSystem AS os,
                   -- user's type of device
                   device.deviceCategory,
                   -- =========== PROMOTION ===========
                   h.promotion,
                   h.promotionActionInfo AS pa_info,
                   -- =========== PRODUCT ===========
                   h.product,
                   -- =========== ML LABEL (DEPENDENT VARIABLE) ===========
                   made_purchase_on_future_visit
            FROM `data-to-insights.ecommerce.web_analytics`,
            UNNEST(hits) AS h
            INNER JOIN next_visit_purchasers USING (fullvisitorid)
            WHERE date BETWEEN '{train_start_date}' AND '{train_end_date}'
            AND geoNetwork.country = 'United States'
            AND totals.newVisits = 1
        ),
        -- Step 4. get aggregated features (attributes) per visit
        visit_attributes AS (
            SELECT fullvisitorid,
                   visitId,
                   visitNumber,
                   visitStartTime,
                   country,
                   quarter,
                   month,
                   day_of_month,
                   day_of_week,
                   hour,
                   minute,
                   second,
                   source,
                   medium,
                   channelGrouping,
                   hits,
                   bounces,
                   -- get the last action performed during the first visit, approximated
                   -- by the largest action_type code seen during the visit
                   -- (this indicates where the visitor left off at the end of their visit)
                   MAX(action_type) AS last_action,
                   -- get number of products whose details were viewed
                   SUM(product_details_viewed) AS product_detail_views,
                   -- get number of promotions displayed and clicked during the first visit
                   COUNT(CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsView ELSE NULL END) AS promos_displayed,
                   COUNT(CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsClick ELSE NULL END) AS promos_clicked,
                   -- get number of products displayed and clicked during the first visit
                   COUNT(CASE WHEN pu.isImpression IS NULL THEN NULL ELSE 1 END) AS product_views,
                   COUNT(CASE WHEN pu.isClick IS NULL THEN NULL ELSE 1 END) AS product_clicks,
                   pageviews,
                   time_on_site,
                   browser,
                   os,
                   deviceCategory,
                   SUM(added_to_cart) AS added_to_cart,
                   made_purchase_on_future_visit,
            FROM first_visit_attributes
            LEFT JOIN UNNEST(promotion) as p
            LEFT JOIN UNNEST(product) as pu
            GROUP BY fullvisitorid,
                     visitId,
                     visitNumber,
                     visitStartTime,
                     country,
                     quarter,
                     month,
                     day_of_month,
                     day_of_week,
                     hour,
                     minute,
                     second,
                     source,
                     medium,
                     channelGrouping,
                     hits,
                     bounces,
                     pageviews,
                     time_on_site,
                     browser,
                     os,
                     deviceCategory,
                     made_purchase_on_future_visit
        )
        SELECT *
        FROM visit_attributes
        """
df_train = run_sql_query(query, **gcp_auth_dict, show_df=False)
df_train['last_action'] = df_train['last_action'].map(mapper).astype(pd.StringDtype())
display(df_train)

Query execution start time = 2023-04-05 20:36:32...done at 2023-04-05 20:36:46 (14.345 seconds).
Query returned 65,891 rows


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,country,quarter,month,day_of_month,day_of_week,hour,...,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,made_purchase_on_future_visit
0,2777908945786878159,1474972357,1,2016-09-27 06:32:37,United States,3,9,27,3,6,...,0,52,1,3,559,Chrome,Windows,desktop,0,False
1,9007803403043242426,1473891862,1,2016-09-14 18:24:22,United States,3,9,14,4,18,...,3,403,3,28,2052,Chrome,Android,mobile,3,False
2,1966912255212222870,1479189762,1,2016-11-15 01:02:42,United States,4,11,15,3,1,...,0,0,0,4,136,Chrome,Windows,desktop,0,False
3,3445194212240387141,1474403759,1,2016-09-20 16:35:59,United States,3,9,20,3,16,...,0,18,0,4,15,Chrome,Windows,desktop,0,True
4,6808346617764426247,1476164097,1,2016-10-11 01:34:57,United States,4,10,11,3,1,...,1,12,0,3,231,Opera,Windows,desktop,0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65886,8873026352984936526,1477796454,1,2016-10-29 23:00:54,United States,4,10,29,7,23,...,0,15,0,3,21,Chrome,Macintosh,desktop,0,False
65887,0648960245700298531,1476599503,1,2016-10-16 02:31:43,United States,4,10,16,1,2,...,0,35,0,3,89,Chrome,Chrome OS,desktop,0,False
65888,4440109507675628547,1477885074,1,2016-10-30 23:37:54,United States,4,10,30,1,23,...,0,36,0,3,53,Chrome,Android,mobile,0,False
65889,1446285229092173706,1478190491,1,2016-11-03 12:28:11,United States,4,11,3,5,12,...,0,29,0,3,224,Chrome,Chrome OS,desktop,0,True


CPU times: user 2.7 s, sys: 170 ms, total: 2.87 s
Wall time: 14.4 s


Check if any duplicated visits are present

In [10]:
%%time
df_duplicated_visits = df_train[
    df_train.duplicated(
        subset=["fullvisitorid", "visitId", "visitNumber", "visitStartTime"], keep=False
    )
]
try:
    assert df_duplicated_visits.empty
    print("Did not find duplicate visits")
except AssertionError as e:
    print(f"{str(e)} Found duplicate visits")
    display(df_duplicated_visits)

Did not find duplicate visits
CPU times: user 14 ms, sys: 237 µs, total: 14.3 ms
Wall time: 13.4 ms


**Observations**
1. No duplicates are present at the visit level in the training data. This verifies that a valid definition of a visit (using the columns indicated earlier - `fullvisitorid`, `visitId`, etc.) has been chosen here.

Below are the
1. class imbalance in the ML labels (`y_train`)
2. number of unique categories in each categorical ML feature (`X_train`)
   - the individual categories accounting for more than five percent of all observations are also shown

from the training data

In [11]:
display(
    (
        100
        * df_train["made_purchase_on_future_visit"]
        .value_counts(normalize=True)
        .rename("fraction")
        .to_frame()
    ).merge(
        df_train["made_purchase_on_future_visit"]
        .value_counts()
        .rename("number")
        .to_frame(),
        how="left",
        left_index=True,
        right_index=True,
    )
)
for c in [
    "source",
    "medium",
    "channelGrouping",
    "browser",
    "os",
    "deviceCategory",
    "last_action",
]:
    with pd.option_context("display.max_rows", None):
        display(
            (
                100
                * df_train[c].value_counts(normalize=True).rename("fraction").to_frame()
            )
            .merge(
                df_train[c].value_counts().rename("number").to_frame(),
                how="left",
                left_index=True,
                right_index=True,
            )
            .assign(
                rank=lambda df: df["number"]
                .rank(ascending=False, method="dense")
                .astype(pd.Int8Dtype())
            )
            .assign(column=c)
            .assign(num_unique=df_train[c].nunique())
            .query("fraction >= 5")
        )

made_purchase_on_future_visit,fraction,number
False,95.635216,63015
True,4.364784,2876


source,fraction,number,rank,column,num_unique
google,47.117209,31046,1,source,105
(direct),21.716168,14309,2,source,105
mall.googleplex.com,12.807515,8439,3,source,105
youtube.com,10.131884,6676,4,source,105


medium,fraction,number,rank,column,num_unique
organic,42.629494,28089,1,medium,7
referral,28.346815,18678,2,medium,7
(none),21.716168,14309,3,medium,7
cpc,5.292073,3487,4,medium,7


channelGrouping,fraction,number,rank,column,num_unique
Organic Search,42.629494,28089,1,channelGrouping,8
Direct,21.716168,14309,2,channelGrouping,8
Referral,17.170782,11314,3,channelGrouping,8
Social,11.176033,7364,4,channelGrouping,8
Paid Search,5.292073,3487,5,channelGrouping,8


Unnamed: 0_level_0,fraction,number,rank,column,num_unique
browser,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Chrome,75.026938,49436,1,browser,25
Safari,16.04468,10572,2,browser,25


Unnamed: 0_level_0,fraction,number,rank,column,num_unique
os,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Macintosh,33.189662,21869,1,os,14
Windows,28.976643,19093,2,os,14
iOS,13.467697,8874,3,os,14
Android,12.006192,7911,4,os,14
Linux,7.095051,4675,5,os,14
Chrome OS,5.106919,3365,6,os,14


Unnamed: 0_level_0,fraction,number,rank,column,num_unique
deviceCategory,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
desktop,74.307569,48962,1,deviceCategory,3
mobile,22.183606,14617,2,deviceCategory,3


Unnamed: 0_level_0,fraction,number,rank,column,num_unique
last_action,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Unknown,73.002383,48102,1,last_action,7
Product detail views,16.540954,10899,2,last_action,7


**Observations**
1. Categorical Features
   - to avoid categorical features with high cardinality, we will keep only the most common categories of each such feature and assign a generic `Other` category to all remaining (less commonly occurring) categories. A bucketing strategy such as the one shown above, where all categories accounting for less than 5% of all observations (not shown above) are grouped into the same bucket, seems reasonable as a starting point since it strongly reduces the true cardinality (see the `num_unique` column) of the training data. Without any reduction in cardinality, the `source` (105 unique values) and `browser` (25 unique values) columns present the biggest problem due to their high cardinality.
   - if one-hot encoding is used to transform the categorical features in the training data, then tree-based models may perform poorly since the one-hot encoded data will be sparse. Bucketing to reduce the cardinality should help the performance of tree-based models. XGBoost has specific guidance for handling categorical data ([1](https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html#categorical-data), [2](https://machinelearningmastery.com/data-preparation-gradient-boosting-xgboost-python/)).
2. Class Imbalance
   - we will want to consider random [undersampling, or downsampling](https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/) to improve the imbalance ratio from approximately 22:1 (63,015 negative versus 2,876 positive examples) in favor of the majority class to a ratio such as 10:1 or 5:1. This will help the ML model in two ways:
     - with a reduced degree of imbalance, the model will learn from a relatively larger share of positive examples (visitor did make a purchase on a future visit - we are interested in these visitors) than in the raw data, which has a much larger number of negative examples (visitor did not make a purchase on a future visit - we are not interested in these visitors for the current business use-case)
     - keeping the true class imbalance is also inefficient in terms of training time since the model spends most of its time learning from uninteresting examples
       - reducing the class imbalance results in shorter model training times

     Other approaches to handle the class imbalance are
     - do nothing
       - train the model using the true distribution of the classes
       - if a model trained on the true imbalanced distribution can generalize to unseen data then no undersampling is required
       - the disadvantage of this approach is that longer training time will be required
     - use a data-augmentation technique such as [SMOTE](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-106)
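The two observations above can be sketched in `pandas` as follows. This is a minimal illustration rather than the notebook's actual implementation: the helper names (`fit_frequent_categories`, `bucket_rare_categories`, `undersample_majority`) and the toy data are assumptions introduced here. The frequent categories are learned from the training split only, so the same list can later be applied unchanged to the validation and test splits.

```python
import pandas as pd


def fit_frequent_categories(train_col: pd.Series, min_pct: float = 5.0) -> list:
    """Return the categories that cover at least `min_pct` percent of the training rows."""
    pct = 100 * train_col.value_counts(normalize=True)
    return pct[pct >= min_pct].index.tolist()


def bucket_rare_categories(col: pd.Series, keep: list, other: str = "Other") -> pd.Series:
    """Replace every category that is not in `keep` with the generic `other` label."""
    return col.where(col.isin(keep), other)


def undersample_majority(
    df: pd.DataFrame, label_col: str, ratio: int = 10, seed: int = 0
) -> pd.DataFrame:
    """Randomly drop negative rows so that at most `ratio` negatives remain per positive."""
    is_pos = df[label_col].astype(bool)
    pos, neg = df[is_pos], df[~is_pos]
    n_neg = min(len(neg), ratio * len(pos))
    sampled = pd.concat([pos, neg.sample(n=n_neg, random_state=seed)])
    # shuffle so that positive and negative rows are interleaved
    return sampled.sample(frac=1, random_state=seed)


# hypothetical stand-in for the training split (values are illustrative only)
toy_train = pd.DataFrame(
    {
        "browser": ["Chrome"] * 30 + ["Safari"] * 9 + ["Opera"],
        "made_purchase_on_future_visit": [True] * 2 + [False] * 38,
    }
)

keep = fit_frequent_categories(toy_train["browser"])
toy_train["browser"] = bucket_rare_categories(toy_train["browser"], keep)
toy_balanced = undersample_majority(
    toy_train, "made_purchase_on_future_visit", ratio=5
)
```

If undersampling is used, it would normally be applied to the training split only, so that the validation and test splits keep the true class distribution.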

#### **Handling Missing Values**

Show a count of missing values in all the columns

In [12]:
display(df_train.isna().sum().rename("missing").to_frame())

Unnamed: 0,missing
fullvisitorid,0
visitId,0
visitNumber,0
visitStartTime,0
country,0
quarter,0
month,0
day_of_month,0
day_of_week,0
hour,0


**Notes**
1. Since the Google Analytics data is reported by visit, missing values in the General features columns
   - `hits`
   - `bounces`
   - `pageviews`
   - `time_on_site`

   would indicate zeros. Zeros were already present in the raw data so there were no missing values in these columns that had to be filled as part of the data preparation SQL query used here.
2. The manually chosen set of User-Facing features (all of which are categoricals) don't have missing values.
3. For both types of features, choices other than the ones made here could have missing values. In that case, missing values would have to be imputed in a way that avoids data leakage/lookahead bias (e.g. using statistics computed from the training split only). For the choices made in this notebook, imputing missing values is not necessary in the features aggregated at the visit level.
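If a different choice of features did introduce missing values, leakage-free imputation could look like the sketch below. It is not needed for the features used in this notebook. The helper names and the toy data are assumptions made for illustration. The key point is that the fill values (median for numerical columns, most frequent value for categorical columns) are computed from the training split only and then reused for the other splits.

```python
import pandas as pd


def fit_imputation_values(
    df_train: pd.DataFrame, numeric_cols: list, categorical_cols: list
) -> dict:
    """Compute per-column fill values from the training split only."""
    values = {c: df_train[c].median() for c in numeric_cols}
    values.update({c: df_train[c].mode(dropna=True).iloc[0] for c in categorical_cols})
    return values


def apply_imputation(df: pd.DataFrame, values: dict) -> pd.DataFrame:
    """Fill missing values in any split using values learned from the training split."""
    return df.fillna(value=values)


# hypothetical splits with missing values (illustrative only)
toy_train = pd.DataFrame(
    {
        "pageviews": [1.0, 3.0, None, 5.0],
        "browser": ["Chrome", "Chrome", None, "Safari"],
    }
)
toy_val = pd.DataFrame({"pageviews": [None, 100.0], "browser": [None, "Firefox"]})

fill_values = fit_imputation_values(toy_train, ["pageviews"], ["browser"])
toy_val_filled = apply_imputation(toy_val, fill_values)
```

Here the validation split is filled with the training median and mode, so no information from the validation (or test) data influences the imputation.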

#### **Assessing Zeros in General Features Columns**

1. `hits`
   - `hits` are users' interactions on the merchandise store's website that send data to the Google Analytics server ([1](https://www.digishuffle.com/blogs/google-analytics-hits/#hitdefinition), [2](https://whatagraph.com/blog/articles/which-kinds-of-hits-does-google-analytics-track/#toc_0))
   - a visit is a group of hits ([1](https://www.optimizesmart.com/why-google-analytics-show-zero-sessions/))
   - [examples include](https://whatagraph.com/blog/articles/which-kinds-of-hits-does-google-analytics-track/)
     - viewing a page
     - social media interactions, like sharing or liking content using *share on social media* buttons
     - e-commerce interactions (add to cart, remove from cart, make purchase, etc.)
     - user timings (loading a page, loading an image, clicking a button)
     - etc.
   - there is no occurrence of zero hits in the Merchandise Store's Google Analytics dataset, so the minimum number of hits is 1, likely since viewing a page (which always occurs) is considered a hit
2. `time_on_site` (total duration of a single visit)
   - reaching the site triggers the start of a visit
   - time on site [is calculated as](https://roirevolution.com/blog/time-on-page-and-time-on-site-how-confident-are-you/) the difference between the timestamps of the last and first pages of a visit
   - a zero indicates the user did not navigate to further pages or trigger events on the site after reaching the site ([1](https://support.google.com/google-ads/thread/1455669?hl=en&msgid=1455678))
3. `bounces`
   - a bounce is when a single request is submitted from scripts embedded in the merchandise store's website to the [Google Analytics server](https://www.analyticsmania.com/post/introduction-to-google-tag-manager-server-side-tagging/) ([1](https://support.google.com/analytics/answer/1009409?hl=en))
     - if a bounce occurs, then time spent on the site is zero and a single page has most likely been viewed (`pageviews`)
     - a zero indicates the absence of a bounced visit, while `1` indicates a bounce was present

If a bounce occurs then the following hold in the training data (checked in the cell below)
1. zero time on site
2. (predominantly) a single hit is registered
3. a single page gets viewed

In [13]:
# time on site versus bounces
display(
    (
        (
            100
            * df_train.query("bounces == 1")["time_on_site"].value_counts(
                normalize=True
            )
        )
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "time_on_site"})
    )
    .assign(split="train")
    # label the filter that was applied (bounced visits only)
    .assign(bounces=1)
    .iloc[[0]]
)

# number of pages viewed versus bounces
display(
    (
        (100 * df_train.query("bounces == 1")["pageviews"].value_counts(normalize=True))
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "pageviews"})
    )
    .assign(split="train")
    .assign(bounces=1)
    .iloc[[0]]
)

# hits versus bounces
display(
    (
        (100 * df_train.query("bounces == 1")["hits"].value_counts(normalize=True))
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "hits"})
    )
    .assign(split="train")
    .assign(bounces=1)
    .iloc[[0]]
)

Unnamed: 0,time_on_site,frequency,split,bounces
0,0,100.0,train,1


Unnamed: 0,pageviews,frequency,split,bounces
0,1,100.0,train,1


Unnamed: 0,hits,frequency,split,bounces
0,1,99.125592,train,1


Similarly, if zero time is spent on the site then this is almost always associated with
- a bounce
- a single page view
- a single hit

In [14]:
# bounce
display(
    (
        (
            100
            * df_train.query("time_on_site == 0")["bounces"].value_counts(
                normalize=True
            )
        )
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "bounces"})
    )
    .assign(split="train")
    .assign(time_on_site=0)
    .iloc[[0]]
)
# single page view
display(
    (
        (
            (
                100
                * df_train.query("time_on_site == 0")["pageviews"].value_counts(
                    normalize=True
                )
            )
            .rename("frequency")
            .to_frame()
            .reset_index()
            .rename(columns={"index": "pageviews"})
        )
    )
    .assign(split="train")
    .assign(time_on_site=0)
    .iloc[[0]]
)
# hits
display(
    (
        (100 * df_train.query("time_on_site == 0")["hits"].value_counts(normalize=True))
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "hits"})
    )
    .assign(split="train")
    .assign(time_on_site=0)
    .iloc[[0]]
)

Unnamed: 0,bounces,frequency,split,time_on_site
0,1,99.787057,train,0


Unnamed: 0,pageviews,frequency,split,time_on_site
0,1,99.828607,train,0


Unnamed: 0,hits,frequency,split,time_on_site
0,1,98.930092,train,0


Similarly, if a single hit occurs on the e-commerce site then this is almost always associated with
- a bounce
- a single page view
- zero time on site

In [15]:
# hits versus bounces
display(
    (
        (100 * df_train.query("hits == 1")["bounces"].value_counts(normalize=True))
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "bounces"})
    )
    .assign(split="train")
    .assign(hits=1)
    .iloc[[0]]
)

# hits versus number of pages viewed
display(
    (
        (100 * df_train.query("hits == 1")["pageviews"].value_counts(normalize=True))
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "pageviews"})
    )
    .assign(split="train")
    .assign(hits=1)
    .iloc[[0]]
)

# hits versus time on site
display(
    (
        (100 * df_train.query("hits == 1")["time_on_site"].value_counts(normalize=True))
        .rename("frequency")
        .to_frame()
        .reset_index()
        .rename(columns={"index": "time_on_site"})
    )
    .assign(split="train")
    .assign(hits=1)
    .iloc[[0]]
)

Unnamed: 0,bounces,frequency,split,hits
0,1,99.98425,train,1


Unnamed: 0,pageviews,frequency,split,hits
0,1,99.98425,train,1


Unnamed: 0,time_on_site,frequency,split,hits
0,0,100.0,train,1


**Observations**
1. Based on the above, `hits`, `time_on_site` and `pageviews` carry largely overlapping information for bounced visits, so it might be worth training an ML model with only one of these General Features.
2. `bounces` is a binary column and should be treated as a categorical ML feature and not as a numerical feature.
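One way to follow up on both observations is sketched below, using hypothetical toy values in place of `df_train`. Rank (Spearman) correlations between the three General Features give a rough measure of how redundant they are, and `bounces` is cast to a categorical dtype. The toy numbers are assumptions chosen for illustration and are not taken from the dataset.

```python
import pandas as pd

# hypothetical stand-in for a few rows of the training split
toy = pd.DataFrame(
    {
        "hits": [1, 1, 4, 9, 15, 2],
        "pageviews": [1, 1, 3, 7, 12, 2],
        "time_on_site": [0, 0, 60, 240, 600, 15],
        "bounces": [1, 1, 0, 0, 0, 0],
    }
)

general_features = ["hits", "pageviews", "time_on_site"]

# pairwise rank correlation; values close to 1 suggest redundant features
rank_corr = toy[general_features].corr(method="spearman")

# treat the binary bounce indicator as a categorical feature
toy_typed = toy.assign(bounces=toy["bounces"].astype("category"))
```

On the real training data, pairs of features with a rank correlation close to 1 would be candidates for dropping all but one feature.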

### Create Validation Data

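Using the same approach as for the training data, the validation data split is now created. To capture the validation data, only the date filter in the `first_visit_attributes` CTE changes
1. `train_start_date` becomes `val_start_date`
2. `train_end_date` becomes `val_end_date`

The label window in the `next_visit_purchasers` CTE (`train_start_date` to `test_end_date`) is unchanged.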
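Since only the dates differ between the splits, the query text could be generated from a single template instead of being copied per split. The sketch below assumes a hypothetical helper `build_split_query` and a heavily shortened template that keeps only the two date filters. The column list and the example date strings are placeholders, not the notebook's actual values.

```python
# shortened, illustrative template; the real query selects many more columns
QUERY_TEMPLATE = """
WITH next_visit_purchasers AS (
    SELECT fullvisitorid,
           COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0
               AS made_purchase_on_future_visit
    FROM `data-to-insights.ecommerce.web_analytics`
    WHERE date BETWEEN '{label_start}' AND '{label_end}'
    GROUP BY fullvisitorid
),
first_visit_attributes AS (
    SELECT fullvisitorid, visitId, made_purchase_on_future_visit
    FROM `data-to-insights.ecommerce.web_analytics`
    INNER JOIN next_visit_purchasers USING (fullvisitorid)
    WHERE date BETWEEN '{split_start}' AND '{split_end}'
    AND totals.newVisits = 1
)
SELECT * FROM first_visit_attributes
"""


def build_split_query(
    split_start: str, split_end: str, label_start: str, label_end: str
) -> str:
    """Fill in the split-specific and label-window dates of the template."""
    return QUERY_TEMPLATE.format(
        split_start=split_start,
        split_end=split_end,
        label_start=label_start,
        label_end=label_end,
    )


# placeholder dates (assumed format and values, for illustration only)
label_window = {"label_start": "20160901", "label_end": "20170131"}
val_query = build_split_query("20161201", "20161231", **label_window)
test_query = build_split_query("20170101", "20170131", **label_window)
```

With this design, the label window is shared by all splits while only the first-visit window changes, which mirrors the manual edits described above.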

In [16]:
%%time
query = f"""
        WITH
        -- Step 1. get visitors with a purchase on a future visit
        next_visit_purchasers AS (
             SELECT fullvisitorid,
                    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, True, False) AS made_purchase_on_future_visit
             FROM `data-to-insights.ecommerce.web_analytics`
             WHERE date BETWEEN '{train_start_date}' AND '{test_end_date}'
             AND geoNetwork.country = 'United States'
             GROUP BY fullvisitorid
        ),
        -- Steps 2. and 3. get attributes of the first visit
        first_visit_attributes AS (
            SELECT -- =========== GEOSPATIAL AND TEMPORAL ATTRIBUTES OF VISIT ===========
                   geoNetwork.country,
                   EXTRACT(QUARTER FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS quarter,
                   EXTRACT(MONTH FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS month,
                   EXTRACT(DAY FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS day_of_month,
                   EXTRACT(DAYOFWEEK FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS day_of_week,
                   EXTRACT(HOUR FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS hour,
                   EXTRACT(MINUTE FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS minute,
                   EXTRACT(SECOND FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS second,
                   -- =========== VISIT AND VISITOR METADATA ===========
                   fullvisitorid,
                   visitId,
                   visitNumber,
                   DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern') AS visitStartTime,
                   -- =========== SOURCE OF SITE TRAFFIC ===========
                   -- source of the traffic from which the visit was initiated
                   trafficSource.source,
                   -- medium of the traffic from which the visit was initiated
                   trafficSource.medium,
                   -- referring channel connected to visit
                   channelGrouping,
                   -- =========== VISITOR ACTIVITY ===========
                   -- total number of hits
                   (CASE WHEN totals.hits > 0 THEN totals.hits ELSE 0 END) AS hits,
                   -- number of bounces
                   (CASE WHEN totals.bounces > 0 THEN totals.bounces ELSE 0 END) AS bounces,
                   -- action performed during first visit
                   CAST(h.eCommerceAction.action_type AS INT64) AS action_type,
                   -- page views
                   IFNULL(totals.pageviews, 0) AS pageviews,
                   -- time on the website
                   IFNULL(totals.timeOnSite, 0) AS time_on_site,
                   -- whether add-to-cart was performed during visit
                   (CASE WHEN CAST(h.eCommerceAction.action_type AS INT64) = 3 THEN 1 ELSE 0 END) AS added_to_cart,
                   (CASE WHEN CAST(h.eCommerceAction.action_type AS INT64) = 2 THEN 1 ELSE 0 END) AS product_details_viewed,
                   -- =========== VISITOR DEVICES ===========
                   -- user's browser
                   device.browser,
                   -- user's operating system
                   device.operatingSystem AS os,
                   -- user's type of device
                   device.deviceCategory,
                   -- =========== PROMOTION ===========
                   h.promotion,
                   h.promotionActionInfo AS pa_info,
                   -- =========== PRODUCT ===========
                   h.product,
                   -- =========== ML LABEL (DEPENDENT VARIABLE) ===========
                   made_purchase_on_future_visit
            FROM `data-to-insights.ecommerce.web_analytics`,
            UNNEST(hits) AS h
            INNER JOIN next_visit_purchasers USING (fullvisitorid)
            WHERE date BETWEEN '{val_start_date}' AND '{val_end_date}'
            AND geoNetwork.country = 'United States'
            AND totals.newVisits = 1
        ),
        -- Step 4. get aggregated features (attributes) per visit
        visit_attributes AS (
            SELECT fullvisitorid,
                   visitId,
                   visitNumber,
                   visitStartTime,
                   country,
                   quarter,
                   month,
                   day_of_month,
                   day_of_week,
                   hour,
                   minute,
                   second,
                   source,
                   medium,
                   channelGrouping,
                   hits,
                   bounces,
                   -- approximate the last action of the first visit by the largest action_type code
                   -- observed across its hits (indicates where the visitor left off at the end of their visit)
                   MAX(action_type) AS last_action,
                   -- get number of products whose details were viewed
                   SUM(product_details_viewed) AS product_detail_views,
                   -- get number of promotions displayed and clicked during the first visit
                   COUNT(CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsView ELSE NULL END) AS promos_displayed,
                   COUNT(CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsClick ELSE NULL END) AS promos_clicked,
                   -- get number of products displayed and clicked during the first visit
                   COUNT(CASE WHEN pu.isImpression IS NULL THEN NULL ELSE 1 END) AS product_views,
                   COUNT(CASE WHEN pu.isClick IS NULL THEN NULL ELSE 1 END) AS product_clicks,
                   pageviews,
                   time_on_site,
                   browser,
                   os,
                   deviceCategory,
                   SUM(added_to_cart) AS added_to_cart,
                   made_purchase_on_future_visit,
            FROM first_visit_attributes
            LEFT JOIN UNNEST(promotion) as p
            LEFT JOIN UNNEST(product) as pu
            GROUP BY fullvisitorid,
                     visitId,
                     visitNumber,
                     visitStartTime,
                     country,
                     quarter,
                     month,
                     day_of_month,
                     day_of_week,
                     hour,
                     minute,
                     second,
                     source,
                     medium,
                     channelGrouping,
                     hits,
                     bounces,
                     pageviews,
                     time_on_site,
                     browser,
                     os,
                     deviceCategory,
                     made_purchase_on_future_visit
        )
        SELECT *
        FROM visit_attributes
        """
df_val = run_sql_query(query, **gcp_auth_dict, show_df=False)
df_val['last_action'] = df_val['last_action'].map(mapper).astype(pd.StringDtype())
display(df_val)

Query execution start time = 2023-04-05 20:36:47...done at 2023-04-05 20:36:53 (6.507 seconds).
Query returned 26,968 rows


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,country,quarter,month,day_of_month,day_of_week,hour,...,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,made_purchase_on_future_visit
0,0853753869601106645,1482568355,1,2016-12-24 03:32:35,United States,4,12,24,7,3,...,0,0,0,5,98,Chrome,Windows,desktop,0,False
1,8271729065579645365,1483122882,1,2016-12-30 13:34:42,United States,4,12,30,6,13,...,0,37,4,12,385,Chrome,Windows,desktop,0,False
2,3506392277497339165,1482004744,1,2016-12-17 14:59:04,United States,4,12,17,7,14,...,0,27,0,9,242,Chrome,Windows,desktop,0,False
3,0050520589307109172,1481433011,1,2016-12-11 00:10:11,United States,4,12,11,1,0,...,0,0,0,3,29,Chrome,Macintosh,desktop,0,False
4,5679674439019725229,1480653538,1,2016-12-01 23:38:58,United States,4,12,1,5,23,...,0,6,0,3,17,Chrome,Macintosh,desktop,0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26963,4628014361650993386,1481278014,1,2016-12-09 05:06:54,United States,4,12,9,6,5,...,0,30,0,3,86,Firefox,Windows,desktop,0,False
26964,436995257021965910,1480697963,1,2016-12-02 11:59:23,United States,4,12,2,6,11,...,0,22,0,3,20,Chrome,Windows,desktop,0,False
26965,6912704476581951344,1481395682,1,2016-12-10 13:48:02,United States,4,12,10,7,13,...,0,30,0,3,70,Chrome,Windows,desktop,0,False
26966,0745341923593722891,1481152121,1,2016-12-07 18:08:41,United States,4,12,7,4,18,...,0,36,0,3,182,Chrome,Macintosh,desktop,0,False


CPU times: user 1.12 s, sys: 87.9 ms, total: 1.21 s
Wall time: 6.53 s


**Notes**
1. Earlier, it was mentioned that `totals.newVisits = 1` identifies the first visit while `totals.newVisits IS NULL` identifies a future (return) visit. For the training data, the first visit was selected using the former as a SQL filter condition, which allowed features to be extracted from the first visit. The label was whether the visitor made a purchase during a return visit. The validation data uses the same approach. Per the business use-case, we want to predict a new visitor's propensity to make a purchase during a return visit. The ML model will be trained using training data which only captures the first visit and **this first visit occurs during the months covered by the training data only**. The model will be validated using validation data that similarly covers the first visit of visitors that **occurs during the months covered by the validation data only**. The label of the validation data is analogous to that of the training data in that it indicates whether these new visitors (in the validation data split) made a purchase during a future visit.

   The visitors in the training data do not need to be the same as those in the validation (or testing) data splits. Visitor ID will not be used as a feature during training, validation or evaluation (test data). Only the attributes of their first visit will be used.

   With this in mind, similar to the training data, we can get the first visit of visitors in the validation data using `totals.newVisits = 1` in the validation data SQL query above. For this reason, `totals.newVisits IS NULL` is not used to select visits in the `first_visit_attributes` CTE when building the validation or test data splits. It is only used in the `next_visit_purchasers` CTE to build the label.
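A quick sanity check of these points could summarize each split once it has been created. The helper `summarize_splits` and the toy DataFrames below are assumptions for illustration. In the notebook, the dictionary would hold `df_train`, `df_val` and `df_test`. Since each visitor has a single first visit, one would expect no visitor IDs to be shared between the training split and the other splits.

```python
import pandas as pd


def summarize_splits(splits: dict) -> pd.DataFrame:
    """Summarize the time range, size and label rate of each data split."""
    train_ids = set(splits["train"]["fullvisitorid"])
    rows = []
    for name, df in splits.items():
        rows.append(
            {
                "split": name,
                "rows": len(df),
                "first_visit": df["visitStartTime"].min(),
                "last_visit": df["visitStartTime"].max(),
                "pct_positive": 100 * df["made_purchase_on_future_visit"].mean(),
                "ids_shared_with_train": len(train_ids & set(df["fullvisitorid"])),
            }
        )
    return pd.DataFrame(rows).set_index("split")


# hypothetical miniature splits (IDs, dates and labels are made up)
toy_splits = {
    "train": pd.DataFrame(
        {
            "fullvisitorid": ["a", "b", "c", "d"],
            "visitStartTime": pd.to_datetime(
                ["2016-09-01", "2016-10-15", "2016-11-02", "2016-11-30"]
            ),
            "made_purchase_on_future_visit": [False, True, False, False],
        }
    ),
    "val": pd.DataFrame(
        {
            "fullvisitorid": ["e", "f"],
            "visitStartTime": pd.to_datetime(["2016-12-03", "2016-12-20"]),
            "made_purchase_on_future_visit": [False, False],
        }
    ),
}

summary = summarize_splits(toy_splits)
```

The `first_visit` and `last_visit` columns make it easy to confirm that each split only covers its intended months.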

### Create Test Data

Finally, the test data split is created using a similar approach to the validation data split. Only the dates in the `first_visit_attributes` CTE are changed (to `test_start_date` and `test_end_date`) in order to capture the test data.

In [17]:
%%time
query = f"""
        WITH
        -- Step 1. get visitors with a purchase on a future visit
        next_visit_purchasers AS (
             SELECT fullvisitorid,
                    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, True, False) AS made_purchase_on_future_visit
             FROM `data-to-insights.ecommerce.web_analytics`
             WHERE date BETWEEN '{train_start_date}' AND '{test_end_date}'
             AND geoNetwork.country = 'United States'
             GROUP BY fullvisitorid
        ),
        -- Steps 2. and 3. get attributes of the first visit
        first_visit_attributes AS (
            SELECT -- =========== GEOSPATIAL AND TEMPORAL ATTRIBUTES OF VISIT ===========
                   geoNetwork.country,
                   EXTRACT(QUARTER FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS quarter,
                   EXTRACT(MONTH FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS month,
                   EXTRACT(DAY FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS day_of_month,
                   EXTRACT(DAYOFWEEK FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS day_of_week,
                   EXTRACT(HOUR FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS hour,
                   EXTRACT(MINUTE FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS minute,
                   EXTRACT(SECOND FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS second,
                   -- =========== VISIT AND VISITOR METADATA ===========
                   fullvisitorid,
                   visitId,
                   visitNumber,
                   DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern') AS visitStartTime,
                   -- =========== SOURCE OF SITE TRAFFIC ===========
                   -- source of the traffic from which the visit was initiated
                   trafficSource.source,
                   -- medium of the traffic from which the visit was initiated
                   trafficSource.medium,
                   -- referring channel connected to visit
                   channelGrouping,
                   -- =========== VISITOR ACTIVITY ===========
                   -- total number of hits
                   (CASE WHEN totals.hits > 0 THEN totals.hits ELSE 0 END) AS hits,
                   -- number of bounces
                   (CASE WHEN totals.bounces > 0 THEN totals.bounces ELSE 0 END) AS bounces,
                   -- action performed during first visit
                   CAST(h.eCommerceAction.action_type AS INT64) AS action_type,
                   -- page views
                   IFNULL(totals.pageviews, 0) AS pageviews,
                   -- time on the website
                   IFNULL(totals.timeOnSite, 0) AS time_on_site,
                   -- whether add-to-cart was performed during visit
                   (CASE WHEN CAST(h.eCommerceAction.action_type AS INT64) = 3 THEN 1 ELSE 0 END) AS added_to_cart,
                   (CASE WHEN CAST(h.eCommerceAction.action_type AS INT64) = 2 THEN 1 ELSE 0 END) AS product_details_viewed,
                   -- =========== VISITOR DEVICES ===========
                   -- user's browser
                   device.browser,
                   -- user's operating system
                   device.operatingSystem AS os,
                   -- user's type of device
                   device.deviceCategory,
                   -- =========== PROMOTION ===========
                   h.promotion,
                   h.promotionActionInfo AS pa_info,
                   -- =========== PRODUCT ===========
                   h.product,
                   -- =========== ML LABEL (DEPENDENT VARIABLE) ===========
                   made_purchase_on_future_visit
            FROM `data-to-insights.ecommerce.web_analytics`,
            UNNEST(hits) AS h
            INNER JOIN next_visit_purchasers USING (fullvisitorid)
            WHERE date BETWEEN '{test_start_date}' AND '{test_end_date}'
            AND geoNetwork.country = 'United States'
            AND totals.newVisits = 1
        ),
        -- Step 4. get aggregated features (attributes) per visit
        visit_attributes AS (
            SELECT fullvisitorid,
                   visitId,
                   visitNumber,
                   visitStartTime,
                   country,
                   quarter,
                   month,
                   day_of_month,
                   day_of_week,
                   hour,
                   minute,
                   second,
                   source,
                   medium,
                   channelGrouping,
                   hits,
                   bounces,
                   -- approximate the last action of the first visit by the largest action_type code
                   -- observed across its hits (indicates where the visitor left off at the end of their visit)
                   MAX(action_type) AS last_action,
                   -- get number of products whose details were viewed
                   SUM(product_details_viewed) AS product_detail_views,
                   -- get number of promotions displayed and clicked during the first visit
                   COUNT(CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsView ELSE NULL END) AS promos_displayed,
                   COUNT(CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsClick ELSE NULL END) AS promos_clicked,
                   -- get number of products displayed and clicked during the first visit
                   COUNT(CASE WHEN pu.isImpression IS NULL THEN NULL ELSE 1 END) AS product_views,
                   COUNT(CASE WHEN pu.isClick IS NULL THEN NULL ELSE 1 END) AS product_clicks,
                   pageviews,
                   time_on_site,
                   browser,
                   os,
                   deviceCategory,
                   SUM(added_to_cart) AS added_to_cart,
                   made_purchase_on_future_visit,
            FROM first_visit_attributes
            LEFT JOIN UNNEST(promotion) as p
            LEFT JOIN UNNEST(product) as pu
            GROUP BY fullvisitorid,
                     visitId,
                     visitNumber,
                     visitStartTime,
                     country,
                     quarter,
                     month,
                     day_of_month,
                     day_of_week,
                     hour,
                     minute,
                     second,
                     source,
                     medium,
                     channelGrouping,
                     hits,
                     bounces,
                     pageviews,
                     time_on_site,
                     browser,
                     os,
                     deviceCategory,
                     made_purchase_on_future_visit
        )
        SELECT *
        FROM visit_attributes
        """
df_test = run_sql_query(query, **gcp_auth_dict, show_df=False)
df_test['last_action'] = df_test['last_action'].map(mapper).astype(pd.StringDtype())
display(df_test)

Query execution start time = 2023-04-05 20:36:53...done at 2023-04-05 20:36:58 (5.152 seconds).
Query returned 21,208 rows


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,country,quarter,month,day_of_month,day_of_week,hour,...,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,made_purchase_on_future_visit
0,6550840611104307110,1483497115,1,2017-01-03 21:31:55,United States,1,1,3,3,21,...,0,99,2,18,442,Chrome,Windows,desktop,1,False
1,2014383825742859117,1483815159,1,2017-01-07 13:52:39,United States,1,1,7,7,13,...,0,4,0,4,77,Chrome,Android,mobile,0,False
2,1839197787123364583,1484945101,1,2017-01-20 15:45:01,United States,1,1,20,6,15,...,0,12,0,4,58,Chrome,Windows,desktop,0,False
3,3158790095800558975,1485416897,1,2017-01-26 02:48:17,United States,1,1,26,5,2,...,0,12,0,3,10,Safari,Macintosh,desktop,0,False
4,6789711867997096486,1485917602,1,2017-01-31 21:53:22,United States,1,1,31,3,21,...,0,46,0,5,118,Chrome,Chrome OS,desktop,0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21203,8120408577094735786,1483831423,1,2017-01-07 18:23:43,United States,1,1,7,7,18,...,0,48,0,4,104,Safari,iOS,mobile,0,False
21204,4355650898237121781,1484778015,1,2017-01-18 17:20:15,United States,1,1,18,4,17,...,0,36,0,4,112,Chrome,Android,mobile,0,False
21205,5528175146941449531,1484936834,1,2017-01-20 13:27:14,United States,1,1,20,6,13,...,0,23,0,4,606,Chrome,Linux,desktop,0,False
21206,4258710557821065313,1484981328,1,2017-01-21 01:48:48,United States,1,1,21,7,1,...,0,41,0,4,74,Chrome,Macintosh,desktop,0,False


CPU times: user 956 ms, sys: 59.5 ms, total: 1.02 s
Wall time: 5.17 s


In [9]:
# %%time
# query = f"""
#         WITH
#         -- Steps 2. and 3. get attributes of the first visit
#         first_visit_attributes AS (
#             SELECT -- =========== GEOSPATIAL AND TEMPORAL ATTRIBUTES OF VISIT ===========
#                    geoNetwork.country,
#                    EXTRACT(QUARTER FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS quarter,
#                    EXTRACT(MONTH FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS month,
#                    EXTRACT(DAY FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS day_of_month,
#                    EXTRACT(DAYOFWEEK FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS day_of_week,
#                    EXTRACT(HOUR FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS hour,
#                    EXTRACT(MINUTE FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS minute,
#                    EXTRACT(SECOND FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS second,
#                    -- =========== VISIT AND VISITOR METADATA ===========
#                    fullvisitorid,
#                    visitId,
#                    visitNumber,
#                    DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern') AS visitStartTime,
#                    -- =========== SOURCE OF SITE TRAFFIC ===========
#                    -- source of the traffic from which the visit was initiated
#                    trafficSource.source,
#                    -- medium of the traffic from which the visit was initiated
#                    trafficSource.medium,
#                    -- referring channel connected to visit
#                    channelGrouping,
#                    -- =========== VISITOR ACTIVITY ===========
#                    -- total number of hits
#                    (CASE WHEN totals.hits > 0 THEN totals.hits ELSE 0 END) AS hits,
#                    -- number of bounces
#                    (CASE WHEN totals.bounces > 0 THEN totals.bounces ELSE 0 END) AS bounces,
#                    -- action performed during first visit
#                    CAST(h.eCommerceAction.action_type AS INT64) AS action_type,
#                    -- page views
#                    IFNULL(totals.pageviews, 0) AS pageviews,
#                    -- time on the website
#                    IFNULL(totals.timeOnSite, 0) AS time_on_site,
#                    -- whether add-to-cart was performed during visit
#                    (CASE WHEN CAST(h.eCommerceAction.action_type AS INT64) = 3 THEN 1 ELSE 0 END) AS added_to_cart,
#                    (CASE WHEN CAST(h.eCommerceAction.action_type AS INT64) = 2 THEN 1 ELSE 0 END) AS product_details_viewed,
#                    -- =========== VISITOR DEVICES ===========
#                    -- user's browser
#                    device.browser,
#                    -- user's operating system
#                    device.operatingSystem AS os,
#                    -- user's type of device
#                    device.deviceCategory,
#                    -- =========== PROMOTION ===========
#                    h.promotion,
#                    h.promotionActionInfo AS pa_info,
#                    -- =========== PRODUCT ===========
#                    h.product
#             FROM `data-to-insights.ecommerce.web_analytics`,
#             UNNEST(hits) AS h
#             WHERE date BETWEEN '{test_start_date}' AND '{test_end_date}'
#             AND geoNetwork.country = 'United States'
#             AND totals.newVisits = 1
#         ),
#         -- Step 4. get aggregated features (attributes) per visit
#         visit_attributes AS (
#             SELECT fullvisitorid,
#                    visitId,
#                    visitNumber,
#                    visitStartTime,
#                    country,
#                    quarter,
#                    month,
#                    day_of_month,
#                    day_of_week,
#                    hour,
#                    minute,
#                    second,
#                    source,
#                    medium,
#                    channelGrouping,
#                    hits,
#                    bounces,
#                    -- get the highest action_type code seen during the first visit, used as a proxy
#                    -- for the last action (where the visitor left off at the end of their visit)
#                    MAX(action_type) AS last_action,
#                    -- get number of products whose details were viewed
#                    SUM(product_details_viewed) AS product_detail_views,
#                    -- get number of promotions displayed and clicked during the first visit
#                    COUNT(CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsView ELSE NULL END) AS promos_displayed,
#                    COUNT(CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsClick ELSE NULL END) AS promos_clicked,
#                    -- get number of products displayed and clicked during the first visit
#                    COUNT(CASE WHEN pu.isImpression IS NULL THEN NULL ELSE 1 END) AS product_views,
#                    COUNT(CASE WHEN pu.isClick IS NULL THEN NULL ELSE 1 END) AS product_clicks,
#                    pageviews,
#                    time_on_site,
#                    browser,
#                    os,
#                    deviceCategory,
#                    SUM(added_to_cart) AS added_to_cart
#             FROM first_visit_attributes
#             LEFT JOIN UNNEST(promotion) as p
#             LEFT JOIN UNNEST(product) as pu
#             GROUP BY fullvisitorid,
#                      visitId,
#                      visitNumber,
#                      visitStartTime,
#                      country,
#                      quarter,
#                      month,
#                      day_of_month,
#                      day_of_week,
#                      hour,
#                      minute,
#                      second,
#                      source,
#                      medium,
#                      channelGrouping,
#                      hits,
#                      bounces,
#                      pageviews,
#                      time_on_site,
#                      browser,
#                      os,
#                      deviceCategory
#         )
#         SELECT *
#         FROM visit_attributes
#         """
# df_test = run_sql_query(query, **gcp_auth_dict, show_df=False)
# df_test['last_action'] = df_test['last_action'].map(mapper).astype(pd.StringDtype())
# display(df_test)

Query execution start time = 2023-04-10 00:25:24...done at 2023-04-10 00:25:30 (6.632 seconds).
Query returned 21,208 rows


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,country,quarter,month,day_of_month,day_of_week,hour,...,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart
0,9760930093553598408,1485285021,1,2017-01-24 14:10:21,United States,1,1,24,3,14,...,18,0,13,4,11,547,Safari,iOS,tablet,0
1,2798321284779703960,1484590846,1,2017-01-16 13:20:46,United States,1,1,16,2,13,...,9,0,12,0,4,88,Chrome,iOS,mobile,0
2,0179324942539103208,1484111831,1,2017-01-11 00:17:11,United States,1,1,11,4,0,...,9,0,12,0,4,100,Chrome,Macintosh,desktop,0
3,266837473970952674,1485061618,1,2017-01-22 00:06:58,United States,1,1,22,1,0,...,9,0,12,0,5,465,Chrome,Linux,desktop,0
4,5471353368825396015,1485838474,1,2017-01-30 23:54:34,United States,1,1,30,2,23,...,81,0,414,22,132,2396,Chrome,Macintosh,desktop,20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21203,529256164119911406,1485057561,1,2017-01-21 22:59:21,United States,1,1,21,7,22,...,0,0,27,0,3,69,Chrome,Windows,desktop,0
21204,1910328943999025610,1484177427,1,2017-01-11 18:30:27,United States,1,1,11,4,18,...,0,0,36,0,3,65,Chrome,Macintosh,desktop,0
21205,330488721047533143,1485318551,1,2017-01-24 23:29:11,United States,1,1,24,3,23,...,0,0,36,0,3,335,Internet Explorer,Windows,desktop,0
21206,7248579154525368043,1483564092,1,2017-01-04 16:08:12,United States,1,1,4,4,16,...,0,0,36,0,3,1614,Chrome,Windows,desktop,0


CPU times: user 1.01 s, sys: 56.3 ms, total: 1.06 s
Wall time: 6.65 s


## Handle Duplicates

As discussed in the previous notebook on EDA, we now drop duplicate rows by `fullvisitorid`, keeping the first row per visitor.

In [18]:
df_train = df_train.drop_duplicates(subset=["fullvisitorid"], keep="first")
df_val = df_val.drop_duplicates(subset=["fullvisitorid"], keep="first")
df_test = df_test.drop_duplicates(subset=["fullvisitorid"], keep="first")
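
Note that `keep="first"` retains whichever row appears first in the `DataFrame`, which is the earliest visit only if the rows are already in chronological order. A minimal sketch of making that ordering explicit, assuming `visitStartTime` holds sortable timestamps; the toy `DataFrame` below uses made-up values and is not taken from the data splits:

```python
import pandas as pd

# Toy data (hypothetical values): visitor "a" has two visits, stored out of order
df = pd.DataFrame(
    {
        "fullvisitorid": ["a", "b", "a"],
        "visitStartTime": pd.to_datetime(
            ["2017-01-05 10:00", "2017-01-02 09:00", "2017-01-01 08:00"]
        ),
    }
)

# Sort chronologically so that keep="first" retains the earliest visit per visitor
df_first = (
    df.sort_values("visitStartTime")
    .drop_duplicates(subset=["fullvisitorid"], keep="first")
    .reset_index(drop=True)
)
print(df_first)
```

If the query results are already ordered by visit start time, the sort is redundant but harmless.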

**Notes**
1. Splitting the data for ML assumes that the validation and test data are not yet available during ML model development, so we would not know whether duplicates are present in those data splits.

   If such a model performs well enough to be deployed to production, the same applies to out-of-sample (unseen) visitors' visit data. The features needed in production are the same as those needed during ML development. So, when we access the out-of-sample data in production, we again want only the first visit per visitor so that (as was done during ML development) we can predict their propensity to make a purchase during a future visit. To accomplish this, once the unseen data becomes available in production, we could
   - consider the first valid visit per visitor (i.e. per `fullvisitorid`)
   - create features from this visit and make a prediction (inference) of propensity to purchase during a return (or future) visit
   - for subsequent visits
     - check for duplicates by `fullvisitorid`
     - drop any duplicated (subsequent) visits by the same `fullvisitorid` (since an inference prediction has already been made for this visitor)

   and this workflow does not involve data leakage or [lookahead bias](https://www.investopedia.com/terms/l/lookaheadbias.asp).

   For this reason, we can drop this type of duplicate in the validation and test data splits now without being affected by data leakage or lookahead bias.
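
The production workflow described in this note can be sketched as follows. This is only an illustration under stated assumptions: the helper `select_visits_to_score`, the `scored_ids` set and the toy batch of visits are hypothetical and are not part of this notebook.

```python
import pandas as pd


def select_visits_to_score(df_new, scored_ids):
    """Return the first visit per not-yet-scored visitor (hypothetical helper)."""
    # Drop visits from visitors for whom a prediction has already been made
    df_unscored = df_new[~df_new["fullvisitorid"].isin(scored_ids)]
    # Keep only the earliest visit per remaining visitor
    return df_unscored.sort_values("visitStartTime").drop_duplicates(
        subset=["fullvisitorid"], keep="first"
    )


# Toy batch of incoming visits (made-up values)
df_new = pd.DataFrame(
    {
        "fullvisitorid": ["a", "b", "b", "c"],
        "visitStartTime": pd.to_datetime(
            ["2017-03-01", "2017-03-02", "2017-03-03", "2017-03-04"]
        ),
    }
)
scored_ids = {"a"}  # visitors who already received an inference

df_to_score = select_visits_to_score(df_new, scored_ids)
# Remember the newly scored visitors so their later visits are skipped
scored_ids.update(df_to_score["fullvisitorid"])
```

Only information available at or before each visit is used to decide whether to score it, which is why this filtering does not introduce lookahead bias.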

## Export to Disk

Change the datatypes of all data splits to reduce the size of the data in memory.

In [19]:
dtypes_dict = {
    "fullvisitorid": pd.StringDtype(),
    "visitId": pd.StringDtype(),
    "visitNumber": pd.Int8Dtype(),
    "country": pd.StringDtype(),
    "quarter": pd.Int8Dtype(),
    "month": pd.Int8Dtype(),
    "day_of_month": pd.Int8Dtype(),
    "day_of_week": pd.Int8Dtype(),
    "hour": pd.Int8Dtype(),
    "minute": pd.Int8Dtype(),
    "second": pd.Int8Dtype(),
    "source": pd.StringDtype(),
    "medium": pd.StringDtype(),
    "channelGrouping": pd.StringDtype(),
    "hits": pd.Int16Dtype(),
    "bounces": pd.Int16Dtype(),
    "last_action": pd.StringDtype(),
    "product_detail_views": pd.Int16Dtype(),
    "promos_displayed": pd.Int16Dtype(),
    "promos_clicked": pd.Int16Dtype(),
    "product_views": pd.Int16Dtype(),
    "product_clicks": pd.Int16Dtype(),
    "pageviews": pd.Int16Dtype(),
    "time_on_site": pd.Int16Dtype(),
    "browser": pd.StringDtype(),
    "os": pd.StringDtype(),
    "added_to_cart": pd.Int16Dtype(),
    "deviceCategory": pd.StringDtype(),
}
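
Downcasting only preserves values that fit in the target type: `Int8` covers -128 to 127 and `Int16` covers -32,768 to 32,767. A minimal sketch of a check that could be run before casting, assuming a hypothetical helper `find_out_of_range` and made-up toy values (neither is part of this notebook):

```python
import numpy as np
import pandas as pd


def find_out_of_range(df, dtypes):
    """List integer columns whose values do not fit the requested dtype (hypothetical helper)."""
    bad_columns = []
    for col, dtype in dtypes.items():
        # Only integer targets (e.g. Int8, Int16) have a numeric range to check
        if not pd.api.types.is_integer_dtype(dtype):
            continue
        limits = np.iinfo(dtype.numpy_dtype)
        if df[col].min() < limits.min or df[col].max() > limits.max:
            bad_columns.append(col)
    return bad_columns


# Toy data (made-up values): 300 does not fit in Int8
df_toy = pd.DataFrame(
    {
        "visitNumber": [1, 2, 300],
        "time_on_site": [10, 442, 2396],
        "browser": ["Chrome", "Safari", "Chrome"],
    }
)
dtypes_toy = {
    "visitNumber": pd.Int8Dtype(),
    "time_on_site": pd.Int16Dtype(),
    "browser": pd.StringDtype(),
}
out_of_range = find_out_of_range(df_toy, dtypes_toy)
print(out_of_range)
```

Any column reported by such a check would need a wider integer type in the mapping.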

In [20]:
%%time
df_train = df_train.astype(dtypes_dict)
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 65617 entries, 0 to 65890
Data columns (total 30 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   fullvisitorid                  65617 non-null  string        
 1   visitId                        65617 non-null  string        
 2   visitNumber                    65617 non-null  Int8          
 3   visitStartTime                 65617 non-null  datetime64[ns]
 4   country                        65617 non-null  string        
 5   quarter                        65617 non-null  Int8          
 6   month                          65617 non-null  Int8          
 7   day_of_month                   65617 non-null  Int8          
 8   day_of_week                    65617 non-null  Int8          
 9   hour                           65617 non-null  Int8          
 10  minute                         65617 non-null  Int8          
 11  second              

In [21]:
%%time
df_val = df_val.astype(dtypes_dict)
df_val.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26936 entries, 0 to 26967
Data columns (total 30 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   fullvisitorid                  26936 non-null  string        
 1   visitId                        26936 non-null  string        
 2   visitNumber                    26936 non-null  Int8          
 3   visitStartTime                 26936 non-null  datetime64[ns]
 4   country                        26936 non-null  string        
 5   quarter                        26936 non-null  Int8          
 6   month                          26936 non-null  Int8          
 7   day_of_month                   26936 non-null  Int8          
 8   day_of_week                    26936 non-null  Int8          
 9   hour                           26936 non-null  Int8          
 10  minute                         26936 non-null  Int8          
 11  second              

In [22]:
%%time
df_test = df_test.astype(dtypes_dict)
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21177 entries, 0 to 21207
Data columns (total 30 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   fullvisitorid                  21177 non-null  string        
 1   visitId                        21177 non-null  string        
 2   visitNumber                    21177 non-null  Int8          
 3   visitStartTime                 21177 non-null  datetime64[ns]
 4   country                        21177 non-null  string        
 5   quarter                        21177 non-null  Int8          
 6   month                          21177 non-null  Int8          
 7   day_of_month                   21177 non-null  Int8          
 8   day_of_week                    21177 non-null  Int8          
 9   hour                           21177 non-null  Int8          
 10  minute                         21177 non-null  Int8          
 11  second              

The training data is now exported to disk as a gzip-compressed Parquet file.

In [23]:
%%time
fpath_train = os.path.join(intermediate_data_dir, "train_intermediate.parquet.gzip")
df_train.to_parquet(fpath_train, index=False, compression='gzip', engine='pyarrow')

CPU times: user 346 ms, sys: 18.9 ms, total: 365 ms
Wall time: 363 ms


The validation data is now exported to disk as a gzip-compressed Parquet file.

In [24]:
%%time
fpath_val = os.path.join(intermediate_data_dir, "val_intermediate.parquet.gzip")
df_val.to_parquet(fpath_val, index=False, compression='gzip', engine='pyarrow')

CPU times: user 135 ms, sys: 7.88 ms, total: 143 ms
Wall time: 141 ms


The testing data is now exported to disk as a gzip-compressed Parquet file.

In [25]:
%%time
fpath_test = os.path.join(intermediate_data_dir, "test_intermediate.parquet.gzip")
df_test.to_parquet(fpath_test, index=False, compression='gzip', engine='pyarrow')

CPU times: user 111 ms, sys: 315 µs, total: 112 ms
Wall time: 110 ms


## Summary of Tasks Performed

This notebook performed the following:
1. Overall
   - training, validation and test data splits were created that can be used to address the objective of training an ML model to predict a new visitor's propensity to make a purchase from the merchandise store on the Google Marketplace during February 2017
2. Data Transformation
   - Features in the prepared data splits were created at the visit level. Since the objective is to predict propensity to make a purchase during a future visit, features should also be at the level of visits. Based on [Google Analytics' definition](https://sporkmarketing.com/376/what-are-visitors-unique-visitors-and-page-views-google-analytics/), visits were defined by the combination of the
     - `fullvisitorid`
     - `visitId`
     - `visitNumber`
     - `visitStartTime`

     columns and it was verified that no duplicated visits exist within each grouping of these columns.
3. Data Splits
   - the data was split by month of the year
   - new visitors in the training data, who returned to the Google Merchandise Store during the period of months covering the training data, do **not** need to also be in the validation or test data splits
     - this is not a problem since the current project's business use-case is targeting new visitors and not the same/existing visitors
4. **Feature Selection**
   - The data splits were created using a subset of columns provided for visitor transactions on the store's website. These columns were selected based on
     - exploratory data analysis performed in the two preceding notebooks
     - intuition about factors that would be predictive of a new visitor's propensity (probability) of making a purchase on a future visit to the store
   - It might be best to start by training an ML model with one of the General Features (`time_on_site`, `hits`, `pageviews`) and only add more if necessary to improve performance
   - other features for the first visit that intuitively might be useful for predicting the propensity of a purchase on a return visit include
     - get the hour of the first and/or last action per visit, using `MIN(hits.hour)` or `MAX(hits.hour)`
     - get the last interaction per visit, using `hits.isInteraction` ([1](https://stackoverflow.com/questions/71012593/how-can-i-get-the-last-element-of-an-array-sql-bigquery))
     - determine whether a checkout step is specified with the last hit per visit `hits.eCommerceAction.step`
     - `hits.social`
     - `hits.social.socialNetwork`
     - `hits.social.socialInteractionAction`
     - `hits.product`
     - `hits.product.productBrand`
     - `hits.type`
     - `hits.transaction`
     - `socialEngagementType`
     - `datetime` attributes for `visitStartTime`
5. **Feature Processing**
   - approaches to process features were discussed based on frequencies observed in the training data
     - categorical features will be bucketed, with infrequent categories (less than 5% frequency) grouped together
     - `bounces` is a binary column and should be treated as a categorical ML feature
     - undersampling, no changes or SMOTE are candidates for handling class imbalance
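
The bucketing of infrequent categories mentioned above could look roughly like the following sketch. The 5% threshold comes from the summary; the helper name `bucket_infrequent`, the replacement label `"other"` and the toy browser values are assumptions made for illustration. Frequencies are computed on the training data only and then applied to the other splits.

```python
import pandas as pd


def bucket_infrequent(train, other, threshold=0.05, label="other"):
    """Replace categories rarer than `threshold` in `train` with `label` (hypothetical helper)."""
    # Relative frequency of each category, computed on the training data only
    freqs = train.value_counts(normalize=True)
    frequent = set(freqs[freqs >= threshold].index)
    # Apply the same mapping to both series; categories unseen in training also become `label`
    return (
        train.where(train.isin(frequent), label),
        other.where(other.isin(frequent), label),
    )


# Toy data (made-up values): "Opera" makes up 3% of the training rows
browser_train = pd.Series(["Chrome"] * 60 + ["Safari"] * 37 + ["Opera"] * 3)
browser_val = pd.Series(["Chrome", "Opera", "Edge"])

train_bucketed, val_bucketed = bucket_infrequent(browser_train, browser_val)
print(train_bucketed.value_counts())
print(val_bucketed.tolist())
```

Deriving the frequent categories from the training split alone keeps the validation and test splits from influencing the feature definition.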

## Summary of Assumptions

None.

## Limitations

None.

## Next Step

The next notebook will process the data splits in order to prepare them for ML model training. This will include
1. bucketing and encoding categorical features
2. scaling numerical features
3. applying a resampling technique to handle class imbalance

---