# Create Dashboard Charts and Tables

In [1]:
%load_ext autoreload
%autoreload 2

::: {.content-hidden}
Import necessary Python modules
:::

In [2]:
import os
import sys
from calendar import month_name
from glob import glob
from typing import Dict, List

import altair as alt
import pandas as pd

In [3]:
alt.renderers.set_embed_options(actions=False)

RendererRegistry.enable('default')

::: {.content-hidden}
Get relative path to project root directory
:::

In [4]:
PROJ_ROOT_DIR = os.path.join(os.pardir)
src_dir = os.path.join(PROJ_ROOT_DIR, "src")
sys.path.append(src_dir)

::: {.content-hidden}
Import custom Python modules
:::

In [5]:
%aimport bigquery_auth_helpers
from bigquery_auth_helpers import auth_to_bigquery

%aimport dash_helpers
import dash_helpers as dh

%aimport transform_helpers
import transform_helpers as th

%aimport xlsx_helpers
import xlsx_helpers as xlh

## About

### Overview
This step will create charts and tables to be displayed as part of the dashboard to summarize the audience group(s) for the marketing campaign.

### Implementation
The dashboard will be used by the business user (marketing team) to design the campaign. It must therefore show all first-time visitors to the store during the inference period, together with the ML model's prediction of each visitor's propensity to make a purchase during a return visit to the store.

The dashboard will also show the following:

1. most important features for predicting the propensity, separately for each audience group
2. summary of visit attributes (for first-time visitors) during the inference period and, in order to provide some context, a month-over-month comparison of these attributes starting from the first month of Google Analytics tracking data used during ML model development
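
The month-over-month comparison in item 2 can be illustrated with a short sketch. The helper name `pct_change` is hypothetical, and the formula is the usual definition of percentage change; it is assumed, not confirmed, to be what the summary table uses. The visitor counts are taken from the monthly summary table displayed later in this notebook.

```python
def pct_change(previous: float, current: float) -> float:
    """Return the percentage change from previous to current."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) / previous * 100


# Visitor counts for September and October (from the monthly summary table)
sep_visitors, oct_visitors = 18612, 22603
oct_visitors_pct_change = pct_change(sep_visitors, oct_visitors)
print(round(oct_visitors_pct_change, 2))
```

The result agrees with the `visitors_pct_change` value reported for October to within float32 precision.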

### Order of Operations
This step can be run prospectively at the end of the inference period, just before the start of the campaign, when all the inference data (first-time visitors to the store) becomes available. This step can be run after the following BigQuery tables have been created:

1. `audience_cohorts`
2. `audience_profiles`
3. `monthly_summary`

These tables were uploaded to BigQuery in the previous step.
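
Before running the rest of this step, it may help to confirm that the prerequisite tables are present. The sketch below is a minimal illustration: `find_missing_tables` is a hypothetical helper, and the list of available tables is hard-coded here. In practice the names could come from the BigQuery client (for example, from the `table_id` of each item returned by `Client.list_tables`).

```python
from typing import Iterable, List

# Tables named in the list above
REQUIRED_TABLES = ["audience_cohorts", "audience_profiles", "monthly_summary"]


def find_missing_tables(
    available: Iterable[str], required: Iterable[str] = REQUIRED_TABLES
) -> List[str]:
    """Return the required table ids that are not in the available ids."""
    available_set = set(available)
    return [table_id for table_id in required if table_id not in available_set]


# Hypothetical listing of a dataset in which one table has not been created
available_tables = ["audience_cohorts", "monthly_summary"]
missing = find_missing_tables(available_tables)
if missing:
    print(f"Missing tables: {missing}")
```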

## User Inputs

Define the following:

1. BigQuery
   - dataset id
   - table ids for audience
     - cohorts
     - profiles
2. start and end dates for train, validation, test and inference data

In [6]:
#| echo: true
# 1. GCP resources
gbq_dataset_id = 'mydemo2asdf'
gbq_table_id_cohorts = 'audience_cohorts'
gbq_table_id_profiles = 'audience_profiles'
gbq_table_id_feats_imp = 'audience_feats_imp'
gbq_table_id_summary = 'monthly_summary'
gbq_table_id_sa_fracs = "cohort_audience_fractions"
gbq_table_id_conv_rates = "audience_conversion_rates"
gbq_table_id_conv_rates_agg_combo = "conversion_rates_aggregated"
gbq_table_id_daily_perf = "daily_summary"
gbq_table_id_cat_feats_kpis = "categorical_features_kpis"

# 2. start and end dates
train_start_date = "20160901"
train_end_date = "20161231"
val_start_date = "20170101"
val_end_date = "20170131"
test_start_date = "20170201"
test_end_date = "20170228"
infer_start_date = '20170301'
infer_end_date = '20170331'
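
The date inputs above are plain `YYYYMMDD` strings, so a typo would go unnoticed until a query returns unexpected rows. The following sketch parses the strings and checks that each period starts the day after the previous one ends. The contiguity requirement and the helper names are assumptions made for illustration, not something stated in this notebook.

```python
from datetime import date, datetime, timedelta
from typing import List, Tuple


def parse_yyyymmdd(value: str) -> date:
    """Parse a YYYYMMDD string into a date."""
    return datetime.strptime(value, "%Y%m%d").date()


def check_periods(periods: List[Tuple[str, str, str]]) -> None:
    """Raise ValueError unless periods are ordered and contiguous."""
    previous_end = None
    for name, start_str, end_str in periods:
        start, end = parse_yyyymmdd(start_str), parse_yyyymmdd(end_str)
        if start > end:
            raise ValueError(f"{name}: start date is after end date")
        if previous_end is not None and start != previous_end + timedelta(days=1):
            raise ValueError(f"{name}: does not start the day after the previous period")
        previous_end = end


# Same values as the user inputs above
periods = [
    ("train", "20160901", "20161231"),
    ("val", "20170101", "20170131"),
    ("test", "20170201", "20170228"),
    ("infer", "20170301", "20170331"),
]
check_periods(periods)  # passes silently for these inputs
```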

::: {.content-hidden}
Get paths to the data sub-folders and the GCP keys folder
:::

In [7]:
data_dir = os.path.join(PROJ_ROOT_DIR, "data")
raw_data_dir = os.path.join(data_dir, "raw")
processed_data_dir = os.path.join(data_dir, "processed")
gcp_keys_dir = os.path.join(PROJ_ROOT_DIR, "gcp_keys")

::: {.content-hidden}
Get the Google Cloud project ID, for use with the native BigQuery Python client, from the `GCP_PROJECT_ID` environment variable
:::

In [8]:
gcp_proj_id = os.environ["GCP_PROJECT_ID"]
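
Indexing `os.environ` directly raises a bare `KeyError` when the variable is not set. A slightly friendlier lookup could look like the sketch below. The helper `get_required_env` and the project ID shown are hypothetical, and the `env` parameter exists only so the example runs without touching the real environment.

```python
import os
from typing import Mapping, Optional


def get_required_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a non-empty environment variable or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {"GCP_PROJECT_ID": "my-hypothetical-project"}
project_id = get_required_env("GCP_PROJECT_ID", env=demo_env)
```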

::: {.content-hidden}
Get fully resolved name of the BigQuery tables
:::

In [9]:
gbq_table_fully_resolved_cohorts = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_cohorts}"
gbq_table_fully_resolved_profiles = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_profiles}"
gbq_table_fully_resolved_feats_imp = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_feats_imp}"
gbq_summary_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_summary}"
gbq_sa_fracs_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_sa_fracs}"
gbq_conv_rates_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_conv_rates}"
gbq_conv_rates_combo_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_conv_rates_agg_combo}"
gbq_daily_perf_combo_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_daily_perf}"
gbq_table_fully_resolved_cat_feat_kpis = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_cat_feats_kpis}"
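
The nine assignments above all follow the same `project.dataset.table` pattern. As a possible alternative, the names could be built in one pass. This is only a sketch: the helper `fully_resolved`, the dictionary keys and the project ID are hypothetical, while the dataset and table IDs are the ones defined in the user inputs.

```python
def fully_resolved(project_id: str, dataset_id: str, table_id: str) -> str:
    """Build a fully resolved BigQuery table name."""
    return f"{project_id}.{dataset_id}.{table_id}"


# A subset of the table ids defined in the user inputs
table_ids = {
    "cohorts": "audience_cohorts",
    "profiles": "audience_profiles",
    "summary": "monthly_summary",
}

# Hypothetical project id; the notebook reads it from GCP_PROJECT_ID
resolved_tables = {
    key: fully_resolved("my-hypothetical-project", "mydemo2asdf", table_id)
    for key, table_id in table_ids.items()
}
```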

::: {.content-hidden}
Define a dictionary to specify datatypes of the profiles data
:::

In [10]:
dtypes_dict_profiles = {
    "Audience_Strategy": pd.StringDtype(),
    "Stat_Expanded": pd.StringDtype(),
    "High": pd.Float32Dtype(),
    "Low": pd.Float32Dtype(),
    "Medium": pd.Float32Dtype(),
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the feature importances data
:::

In [11]:
dtypes_dict_feats_imp = {
    "audience_strategy": pd.StringDtype(),
    "num_observations": pd.Int16Dtype(),
    "stat": pd.StringDtype(),
    "maudience": pd.StringDtype(),
    "value": pd.StringDtype(),
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the cohorts data
:::

In [12]:
dtypes_dict_cohort = {
    "infer_month": "str",
    "fullvisitorid": "str",
    "visitId": "str",
    "visitNumber": "int",
    "quarter": "int",
    "month": "int",
    "day_of_month": "int",
    "day_of_week": "int",
    "hour": "int",
    "minute": "int",
    "second": "int",
    "source": "str",  #
    "medium": "str",  #
    "channelGrouping": "str",  #
    "hits": "int",
    "bounces": "int",
    "last_action": "str",  #
    "promos_displayed": "int",
    "promos_clicked": "int",
    "product_views": "int",
    "product_clicks": "int",
    "pageviews": "int",
    "time_on_site": "int",
    "browser": "str",  #
    "os": "str",  #
    "deviceCategory": "str",  #
    "added_to_cart": "int",
    "revenue": "float",
    "score": "float",
    "predicted_score_label": "bool",
    "maudience": "str",
    "cohort": "str",
    "audience_strategy": "str",
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the monthly performance summary data
:::

In [13]:
dtypes_dict_monthly_summary = {
    'month': pd.StringDtype(),
    "split_type": pd.StringDtype(),
    "return_purchasers": pd.Int16Dtype(),
    "revenue": pd.Float32Dtype(),
    "visitors": pd.Int16Dtype(),
    "pageviews": pd.Int32Dtype(),
    "time_on_site": pd.Float32Dtype(),
    "audience_strategy": pd.StringDtype(),
    "bounce_rate": pd.Float32Dtype(),
    "conversion_rate": pd.Float32Dtype(),
    "product_clicks_rate": pd.Float32Dtype(),
    "add_to_cart_rate": pd.Float32Dtype(),
    "visitors_pct_change": pd.Float32Dtype(),
    "revenue_pct_change": pd.Float32Dtype(),
    "pageviews_pct_change": pd.Float32Dtype(),
    "time_on_site_pct_change": pd.Float32Dtype(),
    "bounce_rate_pct_change": pd.Float32Dtype(),
    "conversion_rate_pct_change": pd.Float32Dtype(),
    "product_clicks_rate_pct_change": pd.Float32Dtype(),
    "add_to_cart_rate_pct_change": pd.Float32Dtype(),
}
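
Several count columns above use 16-bit integers, which hold values from -32,768 to 32,767. The monthly `visitors` column peaks at 26,936 in the data shown later, so a busier month could exceed that range. The sketch below is one way to check a column before casting. `fits_int16` is a hypothetical helper, and how `th.set_datatypes` behaves on overflow is not shown in this notebook.

```python
from typing import Iterable, Optional

INT16_MIN, INT16_MAX = -(2**15), 2**15 - 1


def fits_int16(values: Iterable[Optional[int]]) -> bool:
    """Return True if every non-missing value fits in a 16-bit integer."""
    return all(v is None or INT16_MIN <= v <= INT16_MAX for v in values)


# Values copied from the monthly summary table in this notebook
visitors = [18612, 22603, 24400, 26936, 21177, 20164, 21752]
pageviews = [122012, 137467, 162391, 168928, 125278, 106469, 117350]

visitors_ok = fits_int16(visitors)    # within range, but not by a wide margin
pageviews_ok = fits_int16(pageviews)  # out of range, hence the 32-bit dtype above
```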

::: {.content-hidden}
Define a dictionary to specify datatypes of the daily performance summary data
:::

In [14]:
dtypes_dict_daily_summary = {
    "date": pd.StringDtype(),
    "maudience": pd.StringDtype(),
    "revenue": pd.Float32Dtype(),
    "time_on_site": pd.Float32Dtype(),
    "bounce_rate": pd.Float32Dtype(),
    "product_clicks_rate": pd.Float32Dtype(),
    "add_to_cart_rate": pd.Float32Dtype(),
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the cohort-to-audience fraction
:::

In [15]:
dtypes_sa_frac = {
    "infer_month": pd.StringDtype(),
    "audience_strategy": pd.StringDtype(),
    "maudience": pd.StringDtype(),
    "cohort": pd.StringDtype(),
    "size": pd.Int16Dtype(),
    "group_size": pd.Int16Dtype(),
    "uplift": pd.Int8Dtype(),
    "power": pd.Int8Dtype(),
    "ci_level": pd.Int8Dtype(),
    "samp_to_aud_frac": pd.Float32Dtype(),
    "size_type": pd.StringDtype(),
    "data_type": pd.StringDtype(),
    "data_size": pd.Int32Dtype(),
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the conversion rates for each dataset (development and inference)
:::

In [16]:
dtypes_conv_rates = {
    "audience_strategy": pd.StringDtype(),
    "infer_month": pd.StringDtype(),
    "maudience": pd.StringDtype(),
    "pred_conversions": pd.Int16Dtype(),
    "total_visitors": pd.Int16Dtype(),
    "min_score": pd.Float32Dtype(),
    "true_conversions": pd.Int16Dtype(),
    "data_type": pd.StringDtype(),
    "data_size": pd.Int16Dtype(),
    "true_conv_rate": pd.Float32Dtype(),
    "overall_true_conv_rate": pd.Float32Dtype(),
    "pred_conv_rate": pd.Float32Dtype(),
    "overall_pred_conv_rate": pd.Float32Dtype(),
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the aggregated conversion rates
:::

In [17]:
dtypes_hmap = {
    "maudience": pd.StringDtype(),
    "data_type": pd.StringDtype(),
    "var": pd.StringDtype(),
    "value": pd.Float32Dtype(),
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the KPIs for all categories within each categorical feature
:::

In [18]:
dtypes_dict_categorical_kpis = {
    "feature_name": pd.StringDtype(),
    "feature_category": pd.StringDtype(),
    "variable": pd.StringDtype(),
    "value": pd.Float32Dtype(),
}

Create a mapping from the integer audience strategy code to a descriptive label, in order to get meaningful names in the `audience_strategy` column of the data

In [19]:
#| echo: true
audience_strategy_mapper = {1: "Multi-Group", 2: "Single Group"}
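
The mapper is later passed to `get_data` wrapped in a dictionary keyed by column name. The sketch below shows how such a nested mapping could be applied to plain records. It is a guess at the idea behind `th.map_columns`, not its implementation, and `map_record_columns` is a hypothetical name. Values without a mapping are left unchanged here, which is a design choice made for this example.

```python
from typing import Any, Dict, List


def map_record_columns(
    records: List[Dict[str, Any]], mapper_dict: Dict[str, Dict[Any, Any]]
) -> List[Dict[str, Any]]:
    """Replace coded values with labels, column by column."""
    mapped = []
    for record in records:
        new_record = dict(record)  # copy, so the input is not modified
        for column, mapping in mapper_dict.items():
            if column in new_record:
                value = new_record[column]
                new_record[column] = mapping.get(value, value)
        mapped.append(new_record)
    return mapped


strategy_mapper = {1: "Multi-Group", 2: "Single Group"}
rows = [{"audience_strategy": 1}, {"audience_strategy": 2}]
labelled_rows = map_record_columns(rows, {"audience_strategy": strategy_mapper})
```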

::: {.content-hidden}
Define helper function to show datatypes and number of missing values for all columns in a `DataFrame`
:::

In [20]:
def summarize_df(df: pd.DataFrame) -> None:
    """Show datatypes and count missing values in columns of DataFrame."""
    display(
        df.dtypes.rename("dtype")
        .to_frame()
        .merge(
            df.isna().sum().rename("missing").to_frame(),
            left_index=True,
            right_index=True,
            how="left",
        )
        .reset_index()
        .rename(columns={"index": "column"})
    )

::: {.content-hidden}
Define helper function to load data from BigQuery table into a `DataFrame` and apply data transformation steps
:::

In [21]:
def get_data(
    query: str,
    mapper_dict: Dict[str, Dict[int, str]],
    dtypes_dict: Dict,
    gcp_keys_dir_path: str,
    date_col: str = "",
    data_type: str = "profiles",
    custom_sort_single_col: Dict[str, List[str]] = dict(),
) -> pd.DataFrame:
    """Get data from BigQuery table."""
    # authenticate and run the query
    gcp_authorization_dict = auth_to_bigquery(gcp_keys_dir_path)
    df = th.extract_data(query, gcp_authorization_dict)
    # map coded values to labels, if a mapper was provided
    if mapper_dict:
        df = df.pipe(th.map_columns, mapper_dict)
    df = df.pipe(th.set_datatypes, dtypes_dict)
    # convert the specified date column to datetime
    if date_col:
        df[date_col] = pd.to_datetime(df[date_col], utc=False)
    # reorder rows to follow a custom order of values in a single column
    if custom_sort_single_col and len(custom_sort_single_col) == 1:
        col_sort, sort_order = next(iter(custom_sort_single_col.items()))
        df = df.set_index(col_sort).loc[sort_order].reset_index()
    return df
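
The custom sort at the end of `get_data` relies on label-based selection: setting the sort column as the index and passing the desired order to `.loc` returns the rows in that order. The toy example below uses a few values from the monthly summary. Note that `.loc` raises a `KeyError` if any label in the list is absent, so an ordered categorical is shown as a more tolerant alternative. Both are sketches, not a recommendation to change the helper.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "month": ["January", "September", "October"],
        "visitors": [21177, 18612, 22603],
    }
)
sort_order = ["September", "October", "January"]

# Approach used in get_data: select index labels in the desired order
df_sorted = df.set_index("month").loc[sort_order].reset_index()

# Alternative: sort on an ordered categorical, which tolerates missing labels
df_sorted_cat = (
    df.assign(
        month=pd.Categorical(df["month"], categories=sort_order, ordered=True)
    )
    .sort_values("month")
    .reset_index(drop=True)
)
```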

## Get Data

### Profiles

In [22]:
#| output: false
query = f"""
        SELECT *
        FROM {gbq_table_fully_resolved_profiles}
        """
df_profiles = get_data(
    query,
    {'Audience_Strategy': audience_strategy_mapper},
    dtypes_dict_profiles,
    gcp_keys_dir,
)

Query execution start time = 2023-07-07 15:36:37.066...done at 2023-07-07 15:36:38.481 (1.415 seconds).
Query returned 20 rows


In [23]:
with pd.option_context('display.max_rows', None):
    display(df_profiles)

Unnamed: 0,Audience_Strategy,Stat_Expanded,High,Medium,Low
0,Multi-Group,Hour (Mean),13.043304,12.969241,13.025514
1,Multi-Group,Day Of Week (Mean),4.00924,4.009931,3.992139
2,Multi-Group,Hits (Mean),6.214315,6.356552,6.497587
3,Multi-Group,Promos Displayed (Mean),8.550683,8.541931,8.612743
4,Multi-Group,Promos Clicked (Mean),0.0,0.0,0.000138
5,Multi-Group,Product Views (Mean),22.395256,23.241379,23.932423
6,Multi-Group,Product Clicks (Mean),0.647083,0.679448,0.704455
7,Multi-Group,Pageviews (Mean),5.287409,5.405241,5.49207
8,Multi-Group,Revenue (Mean),179.237625,178.182846,193.435837
9,Multi-Group,Added To Cart (Mean),0.19487,0.201241,0.19418


::: {.content-hidden}
Summarize the `DataFrame` with the audience profile
:::

In [24]:
#| output: false
summarize_df(df_profiles)

Unnamed: 0,column,dtype,missing
0,Audience_Strategy,string[python],0
1,Stat_Expanded,string[python],0
2,High,Float32,0
3,Medium,Float32,0
4,Low,Float32,0


### Feature Importances

In [25]:
#| output: false
query = f"""
        SELECT *
        FROM {gbq_table_fully_resolved_feats_imp}
        """
df_feats_imp = get_data(
    query,
    {'audience_strategy': audience_strategy_mapper},
    dtypes_dict_feats_imp,
    gcp_keys_dir,
)

Query execution start time = 2023-07-07 15:36:38.567...done at 2023-07-07 15:36:39.955 (1.388 seconds).
Query returned 30 rows


In [26]:
with pd.option_context('display.max_rows', None):
    display(df_feats_imp)

Unnamed: 0,audience_strategy,num_observations,stat,maudience,value
0,Multi-Group,500,medium = cpc,High,0.8555691242218018
1,Multi-Group,500,os = FreeBSD,High,0.7719526290893555
2,Multi-Group,500,browser = other,High,0.7091398239135742
3,Multi-Group,500,os = Nokia,High,0.6878795027732849
4,Multi-Group,500,hits,High,0.6865772604942322
5,Multi-Group,500,last action = Click through of product lists,High,0.4413046240806579
6,Multi-Group,500,promos displayed,High,0.3842121660709381
7,Multi-Group,500,os = Samsung,High,0.3355435132980346
8,Multi-Group,500,os = Windows,High,0.1674028635025024
9,Multi-Group,500,medium = (not set),High,0.1308207958936691


::: {.content-hidden}
Summarize the `DataFrame` with the ML feature importances
:::

In [27]:
#| output: false
summarize_df(df_feats_imp)

Unnamed: 0,column,dtype,missing
0,audience_strategy,string[python],0
1,num_observations,Int16,0
2,stat,string[python],0
3,maudience,string[python],0
4,value,string[python],0


### Cohorts

In [28]:
#| output: false
query = f"""
        SELECT * EXCEPT(made_purchase_on_future_visit, split_type)
        FROM {gbq_table_fully_resolved_cohorts}
        WHERE split_type = 'infer'
        """
df_dev_cohorts = (
    get_data(
        query,
        {'audience_strategy': audience_strategy_mapper},
        dtypes_dict_cohort,
        gcp_keys_dir,
    )
)

Query execution start time = 2023-07-07 15:36:40.033...done at 2023-07-07 15:36:46.901 (6.868 seconds).
Query returned 21,752 rows


In [29]:
with pd.option_context('display.max_columns', None):
    display(df_dev_cohorts.head())

Unnamed: 0,infer_month,fullvisitorid,visitId,visitNumber,visitStartTime,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,revenue,score,predicted_score_label,maudience,cohort,audience_strategy
0,March,2359966058072922045,1488364905,1,2017-03-01 02:41:45,1,3,1,4,2,41,45,(direct),(none),Direct,2,0,Unknown,0,0,15,0,2,18,Chrome,Windows,desktop,0,,0.3570451,True,High,Control,Multi-Group
1,March,7221532448991957109,1488368674,1,2017-03-01 03:44:34,1,3,1,4,3,44,34,(direct),(none),Direct,1,1,Unknown,0,0,12,0,1,0,Safari,Macintosh,desktop,0,,6.821068e-07,False,Low,Test,Multi-Group
2,March,4385673813050165685,1488370525,1,2017-03-01 04:15:25,1,3,1,4,4,15,25,(direct),(none),Direct,1,1,Unknown,9,0,0,0,1,0,Chrome,Linux,desktop,0,,0.03742834,False,Medium,Control,Multi-Group
3,March,1998897161334762608,1488370825,1,2017-03-01 04:20:25,1,3,1,4,4,20,25,(direct),(none),Direct,1,1,Unknown,9,0,0,0,1,0,Chrome,Linux,desktop,0,,0.3213757,False,High,Test,Multi-Group
4,March,633230752572028256,1488371869,1,2017-03-01 04:37:49,1,3,1,4,4,37,49,(direct),(none),Direct,1,1,Unknown,0,0,0,0,1,0,Chrome,Linux,desktop,0,,0.01514476,False,Low,,Multi-Group


::: {.content-hidden}
Summarize the `DataFrame` with the audience cohorts
:::

In [30]:
#| output: false
summarize_df(df_dev_cohorts)

Unnamed: 0,column,dtype,missing
0,infer_month,object,0
1,fullvisitorid,object,0
2,visitId,object,0
3,visitNumber,int64,0
4,visitStartTime,datetime64[ns],0
5,quarter,int64,0
6,month,int64,0
7,day_of_month,int64,0
8,day_of_week,int64,0
9,hour,int64,0


::: {.content-hidden}
Verify that the true label (or outcome of the return visit, `made_purchase_on_future_visit`) is not present in the cohorts data
:::

In [31]:
assert 'made_purchase_on_future_visit' not in list(df_dev_cohorts)

::: {.callout-note title="Notes"}
1. The true outcome is not known at the end of the inference period. It will only be known after the end of the marketing campaign, which occurs after the inference period.
:::

### Monthly Performance Summary

In [32]:
#| output: false
query = f"""
        SELECT * EXCEPT(channelGrouping,deviceCategory,browser,os,visitor_type)
        FROM {gbq_summary_table_id_fully_resolved}
        """
df_monthly_summary = (
    get_data(
        query,
        {
            'audience_strategy': audience_strategy_mapper,
            "month": dict(
                zip(range(1, 12 + 1), month_name[1:])
            ),
        },
        dtypes_dict_monthly_summary,
        gcp_keys_dir,
        custom_sort_single_col={
            "month": [
                'September',
                'October',
                'November',
                'December',
                'January',
                'February',
                'March',
            ]
        }
    )
)

Query execution start time = 2023-07-07 15:36:47.013...done at 2023-07-07 15:36:48.370 (1.357 seconds).
Query returned 7 rows


In [33]:
with pd.option_context('display.max_columns', None):
    display(df_monthly_summary)

Unnamed: 0,month,split_type,return_purchasers,revenue,visitors,pageviews,time_on_site,audience_strategy,bounce_rate,product_clicks_rate,add_to_cart_rate,conversion_rate,visitors_pct_change,revenue_pct_change,pageviews_pct_change,time_on_site_pct_change,bounce_rate_pct_change,conversion_rate_pct_change,product_clicks_rate_pct_change,add_to_cart_rate_pct_change,visitors_pct_change_gt_0,revenue_pct_change_gt_0,pageviews_pct_change_gt_0,time_on_site_pct_change_gt_0,bounce_rate_pct_change_gt_0,conversion_rate_pct_change_gt_0,product_clicks_rate_pct_change_gt_0,add_to_cart_rate_pct_change_gt_0
0,September,Train+Val,515.0,25347.169922,18612,122012,3.155372,Multi-Group,30.217064,1.754417,20.470665,2.767032,,,,,,,,,False,False,False,False,False,False,False,False
1,October,Train+Val,744.0,56828.398438,22603,137467,3.180688,Multi-Group,30.67292,2.973817,18.134762,3.291599,21.443155,124.20018,12.666787,0.802328,1.508602,18.957729,69.504578,-11.410979,True,True,True,True,True,True,True,False
2,November,Train+Val,1651.0,212099.9375,24400,162391,3.522282,Multi-Group,27.090164,3.284731,25.233606,6.766394,7.950272,273.22879,18.130898,10.739615,-11.680517,105.565582,10.455062,39.144966,True,True,True,True,False,True,True,True
3,December,Train+Val,1340.0,188476.625,26936,168928,3.386973,Multi-Group,29.287941,3.132498,23.514999,4.974755,10.393442,-11.137819,4.025469,-3.841523,8.112826,-26.478485,-4.634573,-6.810791,True,False,True,False,True,False,False,False
4,January,Train+Val,722.0,148537.96875,21177,125278,3.237079,Multi-Group,33.035839,3.095364,20.805592,3.409359,-21.380308,-21.190245,-25.839411,-4.425606,12.79673,-31.466791,-1.185452,-11.52204,False,False,False,False,True,False,False,False
5,February,Test,465.0,110950.46875,20164,106469,2.980083,Multi-Group,35.226147,2.697536,18.944654,2.30609,-4.783492,-25.304979,-15.013809,-7.939141,6.630086,-32.360016,-12.852382,-8.944409,False,False,False,False,True,False,False,False
6,March,Infer,,135771.375,21752,117350,3.199325,Multi-Group,33.357853,2.919381,19.676352,,7.875422,22.371161,10.219876,7.356926,-5.303712,0.0,8.223991,3.862292,True,True,True,True,False,False,True,True


::: {.content-hidden}
Summarize the `DataFrame` with the monthly performance summary
:::

In [34]:
#| output: false
summarize_df(df_monthly_summary)

Unnamed: 0,column,dtype,missing
0,month,string[python],0
1,split_type,string[python],0
2,return_purchasers,Int16,1
3,revenue,Float32,0
4,visitors,Int16,0
5,pageviews,Int32,0
6,time_on_site,Float32,0
7,audience_strategy,string[python],0
8,bounce_rate,Float32,0
9,product_clicks_rate,Float32,0


### Daily Performance Summary by Audience Group

In [35]:
#| output: false
query = f"""
        SELECT maudience,
               date,
               revenue,
               product_views,
               bounce_rate,
               product_clicks_rate,
               add_to_cart_rate,
               time_on_site
        FROM {gbq_daily_perf_combo_table_id_fully_resolved}
        WHERE agg_type != 'overall'
        """
df_daily_summary_aud = (
    get_data(
        query,
        {},
        dtypes_dict_daily_summary,
        gcp_keys_dir,
        date_col='date',
    )
)

Query execution start time = 2023-07-07 15:36:48.482...done at 2023-07-07 15:36:49.802 (1.320 seconds).
Query returned 274 rows


In [36]:
with pd.option_context('display.max_columns', None):
    display(df_daily_summary_aud)

Unnamed: 0,maudience,date,revenue,product_views,bounce_rate,product_clicks_rate,add_to_cart_rate,time_on_site
0,Low,2017-03-01,604.049988,6626,33.064518,3.305161,19.758064,3.179368
1,Low,2017-03-02,985.380005,6673,22.869955,2.592537,25.560537,2.928625
2,Low,2017-03-03,475.98999,4452,38.235294,2.336029,15.686275,2.297794
3,Low,2017-03-04,2587.060059,4382,33.333332,2.282063,25.531916,3.894208
4,Low,2017-03-05,1397.890015,5901,31.770834,3.287578,32.8125,2.807292
...,...,...,...,...,...,...,...,...
269,Development,2017-02-24,3498.649902,16981,30.864197,2.691243,19.753086,3.143621
270,Development,2017-02-25,1522.150024,10860,34.115139,2.394107,16.204691,2.898969
271,Development,2017-02-26,2136.179932,13847,32.871288,2.563732,19.20792,2.995182
272,Development,2017-02-27,5619.719727,17665,31.306597,2.666289,15.006469,2.897995


::: {.content-hidden}
Summarize the `DataFrame` with the daily performance summary by audience group
:::

In [37]:
#| output: false
summarize_df(df_daily_summary_aud)

Unnamed: 0,column,dtype,missing
0,maudience,string[python],0
1,date,datetime64[ns],0
2,revenue,Float32,0
3,product_views,Int64,0
4,bounce_rate,Float32,0
5,product_clicks_rate,Float32,0
6,add_to_cart_rate,Float32,0
7,time_on_site,Float32,0


### Daily Performance Summary Overall

In [38]:
#| output: false
query = f"""
        SELECT maudience,
               date,
               revenue,
               product_views,
               bounce_rate,
               product_clicks_rate,
               add_to_cart_rate,
               time_on_site
        FROM {gbq_daily_perf_combo_table_id_fully_resolved}
        WHERE agg_type = 'overall'
        """
df_daily_summary = (
    get_data(
        query,
        {},
        dtypes_dict_daily_summary,
        gcp_keys_dir,
        date_col='date',
    )
)

Query execution start time = 2023-07-07 15:36:49.861...done at 2023-07-07 15:36:51.068 (1.207 seconds).
Query returned 212 rows


In [39]:
with pd.option_context('display.max_columns', None):
    display(df_daily_summary)

Unnamed: 0,maudience,date,revenue,product_views,bounce_rate,product_clicks_rate,add_to_cart_rate,time_on_site
0,Inference,2017-03-01,4346.72998,18450,35.254692,3.241192,24.932976,2.882328
1,Inference,2017-03-02,5676.859863,18124,29.27928,2.907747,20.270269,3.02002
2,Inference,2017-03-03,2911.189941,14904,37.201908,2.630166,15.898252,2.874483
3,Inference,2017-03-04,5008.590332,14153,28.758169,2.727337,31.154684,4.059913
4,Inference,2017-03-05,3501.669922,15255,33.80035,2.766306,21.190893,3.223292
...,...,...,...,...,...,...,...,...
207,Development,2017-02-24,3498.649902,16981,30.864197,2.691243,19.753086,3.143621
208,Development,2017-02-25,1522.150024,10860,34.115139,2.394107,16.204691,2.898969
209,Development,2017-02-26,2136.179932,13847,32.871288,2.563732,19.20792,2.995182
210,Development,2017-02-27,5619.719727,17665,31.306597,2.666289,15.006469,2.897995


::: {.content-hidden}
Summarize the `DataFrame` with the overall daily performance summary
:::

In [40]:
#| output: false
summarize_df(df_daily_summary)

Unnamed: 0,column,dtype,missing
0,maudience,string[python],0
1,date,datetime64[ns],0
2,revenue,Float32,0
3,product_views,Int64,0
4,bounce_rate,Float32,0
5,product_clicks_rate,Float32,0
6,add_to_cart_rate,Float32,0
7,time_on_site,Float32,0


### Conversion Rates

In [41]:
#| output: false
query = f"""
        SELECT *
        FROM {gbq_conv_rates_table_id_fully_resolved}
        """
df_conv_rates = (
    get_data(
        query,
        {'audience_strategy': audience_strategy_mapper},
        dtypes_conv_rates,
        gcp_keys_dir,
    )
)

Query execution start time = 2023-07-07 15:36:51.150...done at 2023-07-07 15:36:52.501 (1.351 seconds).
Query returned 6 rows


In [42]:
with pd.option_context('display.max_columns', None):
    display(df_conv_rates)

Unnamed: 0,audience_strategy,infer_month,maudience,pred_conversions,total_visitors,min_score,true_conversions,data_type,data_size,true_conv_rate,overall_true_conv_rate,pred_conv_rate,overall_pred_conv_rate
0,Multi-Group,March,High,1243,7251,0.143627,0,Inference,21752,0.0,0.0,17.142464,5.714417
1,Multi-Group,March,Low,0,7251,0.0,0,Inference,21752,0.0,0.0,0.0,5.714417
2,Multi-Group,March,Medium,0,7250,0.022746,0,Inference,21752,0.0,0.0,0.0,5.714417
3,Multi-Group,,Low,0,6721,0.0,147,Development,20164,2.187175,2.30609,0.0,5.57925
4,Multi-Group,,Medium,0,6721,0.022545,153,Development,20164,2.276447,2.30609,0.0,5.57925
5,Multi-Group,,High,1125,6722,0.140977,165,Development,20164,2.454627,2.30609,16.73609,5.57925


::: {.content-hidden}
Summarize the `DataFrame` with the conversion rates
:::

In [43]:
#| output: false
summarize_df(df_conv_rates)

Unnamed: 0,column,dtype,missing
0,audience_strategy,string[python],0
1,infer_month,string[python],3
2,maudience,string[python],0
3,pred_conversions,Int16,0
4,total_visitors,Int16,0
5,min_score,Float32,0
6,true_conversions,Int16,0
7,data_type,string[python],0
8,data_size,Int16,0
9,true_conv_rate,Float32,0


### Cohort-to-Audience Size Fractions

In [44]:
#| output: false
query = f"""
        SELECT *
        FROM {gbq_sa_fracs_table_id_fully_resolved}
        """
df_sa_frac = (
    get_data(
        query,
        {},
        dtypes_sa_frac,
        gcp_keys_dir,
    )
)

Query execution start time = 2023-07-07 15:36:52.583...done at 2023-07-07 15:36:53.827 (1.244 seconds).
Query returned 12 rows


In [45]:
with pd.option_context('display.max_columns', None):
    display(df_sa_frac)

Unnamed: 0,audience_strategy,infer_month,maudience,cohort,size,group_size,uplift,power,ci_level,samp_to_aud_frac,size_type,data_type,data_size
0,1,,High,Test,1933,6721,10,55,55,28.760601,required,development,20164
1,1,,High,Control,1933,6721,10,55,55,28.760601,required,development,20164
2,1,,Medium,Test,2088,6721,10,55,55,31.066805,required,development,20164
3,1,,Medium,Control,2088,6721,10,55,55,31.066805,required,development,20164
4,1,,Low,Test,2176,6722,10,55,55,32.371319,required,development,20164
5,1,,Low,Control,2176,6722,10,55,55,32.371319,required,development,20164
6,1,March,High,Control,2085,7251,10,55,55,28.754654,randomly selected,inference,21752
7,1,March,High,Test,2085,7251,10,55,55,28.754654,randomly selected,inference,21752
8,1,March,Low,Control,2347,7251,10,55,55,32.36795,randomly selected,inference,21752
9,1,March,Low,Test,2347,7251,10,55,55,32.36795,randomly selected,inference,21752
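
The `samp_to_aud_frac` values in this table appear to be the cohort `size` expressed as a percentage of the audience `group_size`. That relationship is inferred from the displayed numbers and is not confirmed by the source. A quick check on the first row, with a hypothetical helper name:

```python
def sample_to_audience_pct(size: int, group_size: int) -> float:
    """Return the cohort size as a percentage of the audience group size."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return size / group_size * 100


# First row of the table: High audience, Test cohort, development data
high_test_frac = sample_to_audience_pct(size=1933, group_size=6721)
print(round(high_test_frac, 2))
```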


::: {.content-hidden}
Summarize the `DataFrame` with the cohort-to-audience size fraction
:::

In [46]:
#| output: false
summarize_df(df_sa_frac)

Unnamed: 0,column,dtype,missing
0,audience_strategy,string[python],0
1,infer_month,string[python],6
2,maudience,string[python],0
3,cohort,string[python],0
4,size,Int16,0
5,group_size,Int16,0
6,uplift,Int8,0
7,power,Int8,0
8,ci_level,Int8,0
9,samp_to_aud_frac,Float32,0


### Aggregated Conversion Rates (Overall)

In [47]:
#| output: false
query = f"""
        SELECT *
        FROM {gbq_conv_rates_combo_table_id_fully_resolved}
        WHERE maudience IS NULL
        """
df_hmap = (
    get_data(
        query,
        {},
        dtypes_hmap,
        gcp_keys_dir,
    )
)

Query execution start time = 2023-07-07 15:36:53.885...done at 2023-07-07 15:36:55.053 (1.168 seconds).
Query returned 6 rows


In [48]:
with pd.option_context('display.max_columns', None):
    display(df_hmap)

Unnamed: 0,maudience,data_type,var,value
0,,Inference,Overall Pred Conv Rate,5.714417
1,,Inference,Overall True Conv Rate,0.0
2,,Inference,Data Size,21752.0
3,,Development,Overall Pred Conv Rate,5.57925
4,,Development,Overall True Conv Rate,2.30609
5,,Development,Data Size,20164.0


::: {.content-hidden}
Summarize the `DataFrame` with the overall aggregated conversion rates
:::

In [49]:
#| output: false
summarize_df(df_hmap)

Unnamed: 0,column,dtype,missing
0,maudience,string[python],6
1,data_type,string[python],0
2,var,string[python],0
3,value,Float32,0


### Aggregated Conversion Rates (per Audience Group)

In [50]:
#| output: false
query = f"""
        SELECT *
        FROM {gbq_conv_rates_combo_table_id_fully_resolved}
        WHERE maudience IS NOT NULL
        """
df_hmap_aud = (
    get_data(
        query,
        {},
        dtypes_hmap,
        gcp_keys_dir,
    )
)

Query execution start time = 2023-07-07 15:36:55.110...done at 2023-07-07 15:36:56.382 (1.272 seconds).
Query returned 12 rows


In [51]:
with pd.option_context('display.max_columns', None):
    display(df_hmap_aud)

Unnamed: 0,maudience,data_type,var,value
0,High,Inference,true,0.0
1,Low,Inference,true,0.0
2,Medium,Inference,true,0.0
3,High,Inference,pred,17.142464
4,Low,Inference,pred,0.0
5,Medium,Inference,pred,0.0
6,High,Development,true,2.454627
7,Low,Development,true,2.187175
8,Medium,Development,true,2.276447
9,High,Development,pred,16.73609


::: {.content-hidden}
Summarize the `DataFrame` with the conversion rates aggregated by audience group
:::

In [52]:
#| output: false
summarize_df(df_hmap_aud)

Unnamed: 0,column,dtype,missing
0,maudience,string[python],0
1,data_type,string[python],0
2,var,string[python],0
3,value,Float32,0


### Categorical Feature KPIs

In [53]:
#| output: false
query = f"""
        SELECT feature_name,
               feature_category,
               variable,
               value
        FROM {gbq_table_fully_resolved_cat_feat_kpis}
        WHERE variable IN ('CTR', 'Conversion Rate')
        """
df_development_agg = get_data(
    query,
    {},
    dtypes_dict_categorical_kpis,
    gcp_keys_dir,
)

Query execution start time = 2023-07-07 15:36:56.454...done at 2023-07-07 15:36:57.669 (1.215 seconds).
Query returned 76 rows


In [54]:
with pd.option_context('display.max_columns', None):
    display(df_development_agg)

Unnamed: 0,feature_name,feature_category,variable,value
0,os,Macintosh,CTR,3.307882
1,os,Linux,CTR,3.64461
2,os,Chrome OS,CTR,3.050689
3,os,Windows,CTR,2.519136
4,os,Android,CTR,1.998737
...,...,...,...,...
71,channelGrouping,Direct,Conversion Rate,3.855422
72,channelGrouping,Organic Search,Conversion Rate,0.826806
73,channelGrouping,Paid Search,Conversion Rate,0.420168
74,channelGrouping,Affiliates,Conversion Rate,0.34965


::: {.content-hidden}
Summarize the `DataFrame` with the KPIs for all categories in each categorical feature
:::

In [55]:
#| output: false
summarize_df(df_development_agg)

Unnamed: 0,column,dtype,missing
0,feature_name,string[python],0
1,feature_category,string[python],0
2,variable,string[python],0
3,value,Float32,0


## Create Charts and Tables

### Conversion Rates During Development and Inference

Show metrics used to estimate sample sizes

In [57]:
with pd.option_context('display.max_columns', None):
    display(df_conv_rates)

Unnamed: 0,audience_strategy,infer_month,maudience,pred_conversions,total_visitors,min_score,true_conversions,data_type,data_size,true_conv_rate,overall_true_conv_rate,pred_conv_rate,overall_pred_conv_rate
0,Multi-Group,March,High,1243,7251,0.143627,0,Inference,21752,0.0,0.0,17.142464,5.714417
1,Multi-Group,March,Low,0,7251,0.0,0,Inference,21752,0.0,0.0,0.0,5.714417
2,Multi-Group,March,Medium,0,7250,0.022746,0,Inference,21752,0.0,0.0,0.0,5.714417
3,Multi-Group,,Low,0,6721,0.0,147,Development,20164,2.187175,2.30609,0.0,5.57925
4,Multi-Group,,Medium,0,6721,0.022545,153,Development,20164,2.276447,2.30609,0.0,5.57925
5,Multi-Group,,High,1125,6722,0.140977,165,Development,20164,2.454627,2.30609,16.73609,5.57925
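
The rate columns in this table can be reproduced from its count columns. The sketch below is a minimal illustration, assuming that each rate is simply conversions divided by visitors, expressed as a percentage. The counts are copied from the Development rows above; the variable names `counts` and `rates` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Counts copied from the Development rows of the table above
counts = pd.DataFrame(
    {
        "maudience": ["Low", "Medium", "High"],
        "pred_conversions": [0, 0, 1125],
        "true_conversions": [147, 153, 165],
        "total_visitors": [6721, 6721, 6722],
    }
)

# Per-group rates: conversions as a percentage of visitors in that group
rates = counts.assign(
    true_conv_rate=lambda df: 100 * df["true_conversions"] / df["total_visitors"],
    pred_conv_rate=lambda df: 100 * df["pred_conversions"] / df["total_visitors"],
)

# Overall rates: pooled conversions over all visitors in the dataset
data_size = counts["total_visitors"].sum()
overall_true_conv_rate = 100 * counts["true_conversions"].sum() / data_size
overall_pred_conv_rate = 100 * counts["pred_conversions"].sum() / data_size

print(rates)
print(data_size, overall_true_conv_rate, overall_pred_conv_rate)
```

Under this assumption the computed values agree with the `true_conv_rate`, `pred_conv_rate`, `data_size` and `overall_*` columns shown for the Development rows.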


#### Charts

Show plot comparing true and predicted conversion rates for each audience group

In [58]:
chart = dh.plot_repeated_column_row_grouped_bar_chart(
    df_hmap_aud,
    xvar = "var",
    yvar = "value",
    color_by_col = "var",
    row_var = "data_type",
    col_var = "maudience",
    ptitle_str=(
        "Similar conversion rates by audience across "
        "development & inference"
    ),
    tooltip=[
        alt.Tooltip('maudience:N', title='Audience Group'),
        alt.Tooltip('data_type:N', title="Type of data"),
        alt.Tooltip('var:N', title="Quantity"),
        alt.Tooltip('value:Q', title='Value', format=".2f"),
    ],
    row_spacing = 25,
    bar_order=['true', 'pred'],
    bar_colors=['lightgrey', '#cc1e1f'],
    axis_label_fontsize=15,
    axis_title_fontsize=15,
    show_title = True,
    fig_size=dict(width=175, height=175),
)
chart
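
The chart above reads tidy (long) data with one row per audience group, data type and rate type (`var`). The notebook does not show how `df_hmap_aud` was built in this section, so the following is only a sketch of one way to reshape wide rate columns into that layout with `DataFrame.melt`. The names `wide` and `tidy` are hypothetical; the values are copied from the Development rows shown earlier.

```python
import pandas as pd

# Wide layout: one column per rate type (values from the Development rows)
wide = pd.DataFrame(
    {
        "maudience": ["High", "Low", "Medium"],
        "data_type": ["Development"] * 3,
        "true_conv_rate": [2.454627, 2.187175, 2.276447],
        "pred_conv_rate": [16.73609, 0.0, 0.0],
    }
)

# Long layout: the rate type moves into `var`, the rate itself into `value`
tidy = wide.rename(
    columns={"true_conv_rate": "true", "pred_conv_rate": "pred"}
).melt(
    id_vars=["maudience", "data_type"],
    value_vars=["true", "pred"],
    var_name="var",
    value_name="value",
)
print(tidy)
```

The resulting columns (`maudience`, `data_type`, `var`, `value`) match those reported for `df_hmap_aud`.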

Show plot comparing overall statistics

In [59]:
chart = dh.plot_repeated_column_grouped_bar_chart(
    df_hmap,
    xvar="data_type",
    yvar="value",
    color_by_col="data_type",
    col_var="var",
    bar_order=['Development', 'Inference'],
    bar_colors=['lightgrey', '#cc1e1f'],
    ptitle_str=(
        "Similar overall conversion rates during "
        "development and inference"
    ),
    tooltip=[
        alt.Tooltip('var:N', title='Metric'),
        alt.Tooltip('data_type:N', title='Type of Data'),
        alt.Tooltip('value:Q', title='Value', format=",.2f"),
    ],
    axis_label_fontsize = 14,
    axis_title_fontsize = 15,
    show_title = True,
    fig_size=dict(width=150, height=275),
)
chart

Show plot comparing development and inference statistics for each audience group

In [60]:
for k, yvar in enumerate(["total_visitors", 'min_score', 'pred_conv_rate', 'true_conv_rate']):
    chart = dh.plot_repeated_column_grouped_bar_chart_untidy_data(
        df_conv_rates,
        xvar = 'maudience',
        yvar = yvar,
        color_by_col = 'maudience',
        col_var = 'data_type',
        column_order = ['Development', 'Inference'],
        title_dict={
            "total_visitors": {
                "y": "Audience size",
                "title": (
                    "Similarly sized audiences are predicted between inference "
                    "and historical data"
                ),
            },
            "min_score": {
                "y": "Minimum Predicted Propensity",
                "title": (
                    "Similar propensities predicted by audience "
                    "across inference and historical data"
                ),
            },
            "pred_conv_rate": {
                "y": "Predicted Conversion Rate (%)",
                "title": (
                    "Similar conversion rates predicted by audience "
                    "across inference and historical data"
                ),
            },
            "true_conv_rate": {
                "y": "True Conversion Rate (%)",
                "title": (
                    "Highest true conversion rate observed for "
                    "high-propensity group"
                ),
            },
        },
        tooltip = [
            alt.Tooltip('infer_month:N', title='Inference Month'),
            alt.Tooltip('audience_strategy:N', title='Audience Strategy'),
            alt.Tooltip('data_type:N', title='Type of Dataset'),
            alt.Tooltip('data_size:Q', title='Dataset Size', format=","),
            alt.Tooltip('maudience:N', title='Audience Group'),
            alt.Tooltip('total_visitors:Q', title='Audience Size', format=","),
            alt.Tooltip('pred_conv_rate:Q', title='Predicted Conv. Rate (%)', format=",.2f"),
            alt.Tooltip('true_conv_rate:Q', title='True Conv. Rate (%)', format=",.2f"),
        ],
        bar_order=["High", "Medium", "Low"],
        bar_colors=["#cc1e1f", "#fc8767", "#fcb49a"],
        column_header_fontsize = 15,
        axis_label_fontsize = 15,
        title_fontsize = 15,
        show_title = True,
        show_legend = k == 0,
        fig_size = dict(width=250, height=300),
    )
    display(chart)

### Monthly Summary Statistics About Data

Show metadata

In [61]:
with pd.option_context('display.max_columns', None):
    display(df_monthly_summary)

Unnamed: 0,month,split_type,return_purchasers,revenue,visitors,pageviews,time_on_site,audience_strategy,bounce_rate,product_clicks_rate,add_to_cart_rate,conversion_rate,visitors_pct_change,revenue_pct_change,pageviews_pct_change,time_on_site_pct_change,bounce_rate_pct_change,conversion_rate_pct_change,product_clicks_rate_pct_change,add_to_cart_rate_pct_change,visitors_pct_change_gt_0,revenue_pct_change_gt_0,pageviews_pct_change_gt_0,time_on_site_pct_change_gt_0,bounce_rate_pct_change_gt_0,conversion_rate_pct_change_gt_0,product_clicks_rate_pct_change_gt_0,add_to_cart_rate_pct_change_gt_0
0,September,Train+Val,515.0,25347.169922,18612,122012,3.155372,Multi-Group,30.217064,1.754417,20.470665,2.767032,,,,,,,,,False,False,False,False,False,False,False,False
1,October,Train+Val,744.0,56828.398438,22603,137467,3.180688,Multi-Group,30.67292,2.973817,18.134762,3.291599,21.443155,124.20018,12.666787,0.802328,1.508602,18.957729,69.504578,-11.410979,True,True,True,True,True,True,True,False
2,November,Train+Val,1651.0,212099.9375,24400,162391,3.522282,Multi-Group,27.090164,3.284731,25.233606,6.766394,7.950272,273.22879,18.130898,10.739615,-11.680517,105.565582,10.455062,39.144966,True,True,True,True,False,True,True,True
3,December,Train+Val,1340.0,188476.625,26936,168928,3.386973,Multi-Group,29.287941,3.132498,23.514999,4.974755,10.393442,-11.137819,4.025469,-3.841523,8.112826,-26.478485,-4.634573,-6.810791,True,False,True,False,True,False,False,False
4,January,Train+Val,722.0,148537.96875,21177,125278,3.237079,Multi-Group,33.035839,3.095364,20.805592,3.409359,-21.380308,-21.190245,-25.839411,-4.425606,12.79673,-31.466791,-1.185452,-11.52204,False,False,False,False,True,False,False,False
5,February,Test,465.0,110950.46875,20164,106469,2.980083,Multi-Group,35.226147,2.697536,18.944654,2.30609,-4.783492,-25.304979,-15.013809,-7.939141,6.630086,-32.360016,-12.852382,-8.944409,False,False,False,False,True,False,False,False
6,March,Infer,,135771.375,21752,117350,3.199325,Multi-Group,33.357853,2.919381,19.676352,,7.875422,22.371161,10.219876,7.356926,-5.303712,0.0,8.223991,3.862292,True,True,True,True,False,False,True,True


#### Chart

Show plot

In [63]:
ytitle = {
    "visitors": 'Visitors',
    "revenue": 'Revenue (USD)',
    "add_to_cart_rate": "Fraction of Visitors that Added Items to Cart (%)",
    "bounce_rate": "Bounce Rate (%)",
    "conversion_rate": "Conversion Rate (%)",
    "product_clicks_rate": "Product List Clickthrough Rate (%)",
    "time_on_site": "Average time spent on store website (minutes)",
    'pageviews': "Number of pages viewed during visit",
}

for yvar in list(ytitle):
    chart = dh.plot_statistic_bar_chart_combo(
        data=df_monthly_summary,
        yvar=yvar,
        color_by_col="split_type:N",
        colors={"Train+Val": "lightgrey", "Test": "grey", "Infer": "red"},
        marker_size=80,
        marker_colors=['red', 'green'],
        marker_values=[False, True],
        x_axis_sort=df_monthly_summary['month'].tolist(),
        ptitle=ytitle[yvar],
        axis_label_fontsize=14,
        ptitle_vertical_offset=-1,
        fig_size_bars=dict(width=575, height=300),
        fig_size_lines=dict(width=575, height=125),
    )
    display(chart)

::: {.callout-note title="Notes"}
1. This chart shows select numerical attributes of first-time visitors to the store during the inference (production) period.
2. The bar chart shows the monthly aggregated value of these attributes, separately for each month in the ML training+validation data, ML test data and inference data. Unless indicated, the statistics are monthly totals.
3. The line chart shows the month-over-month percentage change in each attribute.
:::
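
The month-over-month columns in `df_monthly_summary` (for example `visitors_pct_change` and `visitors_pct_change_gt_0`) can be derived as in the sketch below. It assumes a plain percentage change relative to the previous month, which is consistent with the values in the table; the visitor counts are copied from the first three months, and the name `monthly` is hypothetical.

```python
import pandas as pd

# Visitor counts for the first three months of the table above
monthly = pd.DataFrame(
    {
        "month": ["September", "October", "November"],
        "visitors": [18612, 22603, 24400],
    }
)

# Percentage change relative to the previous month (first month has no
# predecessor, so its change is missing)
monthly["visitors_pct_change"] = 100 * monthly["visitors"].pct_change()

# Flag used to color the markers: a missing change compares as False
monthly["visitors_pct_change_gt_0"] = monthly["visitors_pct_change"] > 0
print(monthly)
```

The missing value for the first month explains why September shows `False` in every `*_gt_0` column.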

::: {.callout-tip title="Observations"}
1. Absolute performance peaks during the holiday season (November and/or December) and is most variable during September and October.
2. As expected, month-over-month growth drops during the winter after the holiday shopping season has ended (i.e. drops during January and February).
:::

### KPIs by Categorical Feature

Show KPIs per categorical feature

In [64]:
with pd.option_context('display.max_columns', None):
    display(df_development_agg.head(3))

Unnamed: 0,feature_name,feature_category,variable,value
0,os,Macintosh,CTR,3.307882
1,os,Linux,CTR,3.64461
2,os,Chrome OS,CTR,3.050689


#### Chart

Show plot

In [65]:
ptitle_dict = {
    "os": "Linux and Mac operating systems give the best combination of KPIs",
    "source": "Direct traffic gives best combination of KPIs",
    "browser": "Chrome offers best combination of KPIs among web browsers",
    'medium': 'Traffic reaching from CPM, referral or no medium gives the best combination of KPIs',
    'channelGrouping': "Referral, direct or display channel shows the best combination of KPIs",
    'deviceCategory': "Desktop devices give best combination of KPIs",
    'last_action': (
        "Ending a first visit with a Check Out or Add To/Remove from "
        "Cart gives best KPIs"
    ),
}
for k, feature in enumerate(list(ptitle_dict)):
    chart = dh.plot_stacked_bar_chart(
        data=df_development_agg.query(f"feature_name == '{feature}'"),
        xvar="feature_category",
        yvar="value",
        color_by_col='variable',
        colors={"CTR": '#cccccc', 'Conversion Rate': 'red'},
        show_title=True,
        show_legend=k == 0,
        ptitle_str=ptitle_dict[feature],
        tooltip=[
            alt.Tooltip("feature_name:N", title='Categorical Feature'),
            alt.Tooltip("feature_category:N", title='Feature Sub-Category'),
            alt.Tooltip("variable:N", title='Rate Type'),
            alt.Tooltip("value:Q", title='Rate (%)', format=".3f"),
        ],
        x_label_height=400,
        axis_label_fontsize=16,
        title_fontsize=18,
        title_fontweight='normal',
        x_tick_label_angle=-45,
        fig_size=dict(width=400, height=300),
    )
    display(chart)

### Feature Importances by Audience Group

Show feature importances per audience group

In [66]:
with pd.option_context('display.max_columns', None):
    display(df_feats_imp.head(3))

Unnamed: 0,audience_strategy,num_observations,stat,maudience,value
0,Multi-Group,500,medium = cpc,High,0.8555691242218018
1,Multi-Group,500,os = FreeBSD,High,0.7719526290893555
2,Multi-Group,500,browser = other,High,0.7091398239135742


#### Chart

Show plot

In [67]:
tooltip = [
    alt.Tooltip('stat', title='Feature'),
    alt.Tooltip('audience_strategy', title='Audience Strategy'),
    alt.Tooltip('maudience', title='Audience Group'),
    alt.Tooltip('num_observations', title='Obs. to get importance'),
    alt.Tooltip('value', title='Importance', format=".3f"),
]
chart = dh.plot_feature_importances(
    data=df_feats_imp,
    y_label_width=600,
    axis_label_fontsize=14,
    tooltip=tooltip,
    interactive=True,
    bar_color='#525252',  # "#3181bd", "#636363"
    show_x_ticks=False,
    fig_size=dict(width=450, height=300),
)
chart

Show the first-visit bounce rate, by audience group, among first-time visitors during the inference period

In [68]:
#| output: true
(
    df_dev_cohorts.groupby('maudience', as_index=False)
    .agg({"bounces":"sum", "fullvisitorid": "count"})
    .rename(columns={"fullvisitorid": "visitors"})
    .assign(bounce_rate=lambda df: 100*df['bounces']/df['visitors'])
)

Unnamed: 0,maudience,bounces,visitors,bounce_rate
0,High,2394,7251,33.016136
1,Low,2391,7251,32.974762
2,Medium,2471,7250,34.082759


::: {.callout-note title="Notes"}
1. This chart shows the features that matter most for predicting the outcome in the current use-case (i.e. the propensity of a visitor to make a purchase during a future visit to the e-commerce store). These importances come from combined SHAP (Shapley) values that provide global explanations, which allow us to interpret the entire ML model that was trained to make these predictions ([link](https://christophm.github.io/interpretable-ml-book/shap.html#shap-feature-importance)). The chart shows the features that changed the predicted absolute propensity, on average, by the most percentage points.
2. The feature importances were calculated and are shown separately for each audience group.
:::
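
The notebook does not show how the per-observation SHAP values were combined into `df_feats_imp`. A common definition of global SHAP feature importance, described in the linked reference, is the mean absolute SHAP value per feature. The sketch below illustrates that definition on a small made-up matrix; `shap_values` and `feature_names` are hypothetical and are not the project's data.

```python
import numpy as np

# Hypothetical SHAP values: one row per observation, one column per feature
feature_names = ["medium = cpc", "os = FreeBSD", "hits"]
shap_values = np.array(
    [
        [0.2, -0.1, 0.0],
        [-0.4, 0.3, 0.1],
        [0.6, -0.2, -0.1],
    ]
)

# Global importance: average magnitude of each feature's contribution
importance = np.abs(shap_values).mean(axis=0)

# Rank features from most to least important
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]
print(dict(zip(feature_names, importance)))
print(ranking)
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling out. Computing the importances separately for each audience group, as described in note 2, would mean applying the same calculation to the rows belonging to each group.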

::: {.callout-tip title="Observations"}
1. For high propensity visitors, the most important features of their first visit to predict whether they made a purchase during a return visit were
   - reached the store site through a paid search (`medium__cpc`)
   - used an uncommonly used browser (`browser__other`) (**third highest combination of KPIs by `browser` and close to Safari, which is the second highest**)
   - used one of the following operating systems
     - FreeBSD (`os__FreeBSD`)
     - Nokia-based OS (`os__Nokia`)
       - this would have to be a mobile (phone or tablet) operating system, since Nokia does not offer a desktop-based operating system
   - interacted with the store site (`hits`)
2. For medium propensity visitors, the most important first visit features were
   - interacted with the store site (`hits`)
   - used one of the following frequently used operating systems
     - Macintosh (`os__Macintosh`) (**highest combination of KPIs by `os`**)
     - Chrome OS (`os__Chrome OS`) (**third-highest combination of KPIs by `os`**)
   - used one of the following infrequently used operating systems
     - Nintendo Wii (`os__Nintendo Wii`)
     - Firefox OS (`os__Firefox`)
   - used undetermined (`medium__(not set)`) medium to access the store site
   - reached the store site using a google search (`source__google`) as the traffic [source](https://support.google.com/analytics/answer/1033173?hl=en)
3. For low propensity visitors, the most important features of their first visit were
   - reached the store site by sources other than google search or directly entering URL into web browser (`source__other`) (**highest combination of KPIs by `source`**)
   - used one of the following operating systems
     - SunOS (`os__SunOS`)
     - Macintosh (`os__Macintosh`) (**highest combination of KPIs by `os`**)
   - used an affiliate ([1](https://support.google.com/analytics/thread/21925739?hl=en&msgid=21929096), [2](https://support.google.com/analytics/thread/21925739/what-is-the-difference-between-affiliate-and-referral-traffic?hl=en)) (`medium__affiliate`) or [undetermined](https://www.owox.com/blog/use-cases/not-set-in-google-analytics/) (`medium__(not set)`) medium to access the store site
   - bounced from the site (`bounced__True`) (**average bounce rate among predicted low propensity visitors is approx. 33%**)
:::

### Profile by Audience Group

Show feature importances per audience group

In [69]:
with pd.option_context('display.max_columns', None):
    display(df_feats_imp.head(3))

Unnamed: 0,audience_strategy,num_observations,stat,maudience,value
0,Multi-Group,500,medium = cpc,High,0.8555691242218018
1,Multi-Group,500,os = FreeBSD,High,0.7719526290893555
2,Multi-Group,500,browser = other,High,0.7091398239135742


#### Table

Show the audience group profiles

In [70]:
with pd.option_context('display.max_columns', None):
    display(df_profiles)

Unnamed: 0,Audience_Strategy,Stat_Expanded,High,Medium,Low
0,Multi-Group,Hour (Mean),13.043304,12.969241,13.025514
1,Multi-Group,Day Of Week (Mean),4.00924,4.009931,3.992139
2,Multi-Group,Hits (Mean),6.214315,6.356552,6.497587
3,Multi-Group,Promos Displayed (Mean),8.550683,8.541931,8.612743
4,Multi-Group,Promos Clicked (Mean),0.0,0.0,0.000138
5,Multi-Group,Product Views (Mean),22.395256,23.241379,23.932423
6,Multi-Group,Product Clicks (Mean),0.647083,0.679448,0.704455
7,Multi-Group,Pageviews (Mean),5.287409,5.405241,5.49207
8,Multi-Group,Revenue (Mean),179.237625,178.182846,193.435837
9,Multi-Group,Added To Cart (Mean),0.19487,0.201241,0.19418


::: {.callout-tip title="Observations"}
1. High priority (stronger recommendations)
   - Visitors predicted to have a low propensity to make a return purchase
     - (`Revenue (Mean)`) spent more (higher revenue) on average during their first visit
     - (`Last Action Completed Purchase`) completed a purchase at the end of their first visit more often
     - (`Last Action Check Out`) checked out at the end of their first visit more often

     than visitors in the other two groups. We might want to offer stronger discounts or offers (**be more aggressive**) to the low propensity group as part of the campaign in order to prompt them to make a purchase during a return visit to the store (i.e. to maximize campaign response and ROI).
   - (`Weekend Visitors`) There were more high propensity visitors predicted to make a return purchase who had their first visit on a weekend than medium and low propensity visitors. During the campaign (after the first visit), offers or discounts could be targeted at low > medium > high propensity visitors on weekends, in terms of priority, where low propensity visitors have the highest priority. We could offer more loyalty points on weekends to high propensity visitors, with fewer to medium and even less to low propensity visitors.
   - (`Last Action Removed Products From Cart`) More visitors predicted to have a high propensity to make a return purchase removed an item from the shopping cart at the end of their first visit than visitors in the other two groups.
     - we should offer discounts on non-shopping cart items to low and medium propensity visitors (to convince them to remove items in their cart as was done by high-propensity visitors, and add items that are discounted, in the hopes they will purchase the new items in the cart)
     - we should offer
       - discounts on shopping cart items
       - recommendations for similar products to shopping cart items

       to high propensity visitors to convince them to purchase these during a return visit
2. Low priority (weaker recommendations)
   - (`Last Action Added To Cart`) Visitors predicted to have a medium propensity to make a return purchase added an item to their shopping cart at the end of their first visit more often than visitors in the other two groups.
:::
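
The profile table compared in these observations has one row per statistic and one column per audience group. The code that builds `df_profiles` is not shown in this section, so the following is only a sketch of how such a layout could be produced from visit-level data with a group-wise mean. The `visits` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical visit-level data with an assigned audience group
visits = pd.DataFrame(
    {
        "maudience": ["High", "High", "Medium", "Medium", "Low", "Low"],
        "hits": [4, 8, 5, 7, 6, 8],
        "pageviews": [3, 7, 4, 6, 5, 7],
    }
)

# Mean of each attribute per audience group, then one row per statistic
profile = (
    visits.groupby("maudience")[["hits", "pageviews"]]
    .mean()
    .T[["High", "Medium", "Low"]]
)

# Label rows in the same style as the `Stat_Expanded` column
profile.index = [
    f"{name.replace('_', ' ').title()} (Mean)" for name in profile.index
]
print(profile)
```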

### Daily Summary

Show the daily summary data

In [71]:
with pd.option_context('display.max_columns', None):
    display(df_daily_summary.head(3))

Unnamed: 0,maudience,date,revenue,product_views,bounce_rate,product_clicks_rate,add_to_cart_rate,time_on_site
0,Inference,2017-03-01,4346.72998,18450,35.254692,3.241192,24.932976,2.882328
1,Inference,2017-03-02,5676.859863,18124,29.27928,2.907747,20.270269,3.02002
2,Inference,2017-03-03,2911.189941,14904,37.201908,2.630166,15.898252,2.874483


#### Chart

Show plot

In [72]:
ytitle_daily = {
    "bounce_rate": "Bounce rate was lowest during 2016 holiday season",
    "product_clicks_rate": "Consistent product CTR except for September",
    "add_to_cart_rate": (
        "Rate of adding product(s) to cart was consistently highest during "
        "2016 holiday season"
    ),
    "time_on_site": "Consistent time spent on store site",
    "revenue": (
        "Maximum first-visit revenue during holiday season and consistently "
        "low during 2016 September & October"
    ),
}
daily_legend_params = dict(
    direction="horizontal", orient="bottom", titleAnchor="start"
)

In [73]:
for k, yvar in enumerate(list(ytitle_daily)):
    metric = yvar.replace("_", " ").title()
    chart = dh.plot_time_dependent_scatter_chart(
        df_daily_summary,
        yvar = yvar,
        line_thickness = 0.5,
        color_by_col = "maudience",
        ptitle_str=ytitle_daily[yvar],
        axis_title_fontsize = 16,
        axis_label_fontsize = 14,
        axis_label_angle = -25,
        axis_tick_label_color = '#757575',
        marker_order = ["Development", "Inference"],
        marker_colors = ['#bdbdbd', '#e13128'],
        marker_size = 100,
        show_title = True,
        show_legend = True if k == 0 else '',
        tooltip = [
            alt.Tooltip('maudience', title='Audience group'),
            alt.Tooltip('date:T', title='Date'),
            alt.Tooltip(f"{yvar}:Q", title=metric, format=",.2f"),
        ],
        legend_params={
            "scatter": daily_legend_params,
            "area": daily_legend_params,
            "line": daily_legend_params,
        },
        ci_level=0.95,
        fig_size = dict(width=800, height=150),
    )
    display(chart)

### Daily Summary by Audience Group

Show the daily summary data by audience group

In [74]:
with pd.option_context('display.max_columns', None):
    display(df_daily_summary_aud)

Unnamed: 0,maudience,date,revenue,product_views,bounce_rate,product_clicks_rate,add_to_cart_rate,time_on_site
0,Low,2017-03-01,604.049988,6626,33.064518,3.305161,19.758064,3.179368
1,Low,2017-03-02,985.380005,6673,22.869955,2.592537,25.560537,2.928625
2,Low,2017-03-03,475.98999,4452,38.235294,2.336029,15.686275,2.297794
3,Low,2017-03-04,2587.060059,4382,33.333332,2.282063,25.531916,3.894208
4,Low,2017-03-05,1397.890015,5901,31.770834,3.287578,32.8125,2.807292
...,...,...,...,...,...,...,...,...
269,Development,2017-02-24,3498.649902,16981,30.864197,2.691243,19.753086,3.143621
270,Development,2017-02-25,1522.150024,10860,34.115139,2.394107,16.204691,2.898969
271,Development,2017-02-26,2136.179932,13847,32.871288,2.563732,19.20792,2.995182
272,Development,2017-02-27,5619.719727,17665,31.306597,2.666289,15.006469,2.897995


#### Chart

Show plot

In [75]:
ytitle_daily_aud = {
    "bounce_rate": (
        "Bounce rate for all audience groups within non-holiday range during "
        "development"
    ),
    "product_clicks_rate": (
        "Product CTR for all audience groups within 95% c.i. during "
        "development"
    ),
    "add_to_cart_rate": (
        "Add-to-cart rate for all audience groups within 95% c.i. during "
        "development"
    ),
    "time_on_site": (
        "Average time spent on store site for all audience groups within 95% "
        "c.i. during development"
    ),
    "revenue": (
        "First-visit revenue for all audience groups within post-holiday "
        "range"
    ),
}
daily_aud_legend_params = dict(
    direction="horizontal", orient="bottom", titleAnchor="start"
)

In [76]:
for k, yvar in enumerate(list(ytitle_daily_aud)):
    metric = yvar.replace("_", " ").title()
    chart = dh.plot_time_dependent_scatter_chart(
        df_daily_summary_aud,
        yvar = yvar,
        line_thickness = 0.5,
        color_by_col = "maudience",
        ptitle_str = ytitle_daily_aud[yvar],
        axis_title_fontsize = 16,
        axis_label_fontsize = 14,
        axis_label_angle = -25,
        axis_tick_label_color = '#757575',
        marker_order = ["Low", 'High', "Medium", 'Development'],
        marker_colors = ['#fbb4ae', 'darkred', '#c7e9c0', '#bdbdbd'],
        marker_size = 100,
        show_title = True,
        show_legend = True if k == 0 else '',
        tooltip = [
            alt.Tooltip('maudience', title='Audience group'),
            alt.Tooltip('date:T', title='Date'),
            alt.Tooltip(f"{yvar}:Q", title=metric, format=",.2f"),
        ],
        legend_params={
            "scatter": daily_aud_legend_params,
            "area": daily_aud_legend_params,
            "line": daily_aud_legend_params,
        },
        ci_level=0.95,
        fig_size = dict(width=800, height=150),
    )
    display(chart)

::: {.callout-tip title="Observations"}
1. First-visit bounce rate is at its lowest during the lead-up to the Christmas holiday season and can be largely attributed to holiday shopping.
2. The following first-visit attributes
   - product CTR
   - add-to-cart rate
   - revenue
   - average time spent on the store site

   are consistently highest during holiday season and this too can be largely attributed to increased visitor interaction with the store due to holiday shopping.
:::
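
Several chart titles above state that inference values fall "within 95% c.i. during development", and the plotting calls pass `ci_level=0.95`. How `dh.plot_time_dependent_scatter_chart` computes that band is not shown here. The sketch below assumes a normal-approximation confidence interval for the mean of the daily development values, which is only one possible interpretation. The function name and the sample values are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def mean_confidence_interval(values, ci_level=0.95):
    """Return a normal-approximation confidence interval for the mean."""
    values = np.asarray(values, dtype=float)
    # Two-sided critical value, approximately 1.96 for a 95% level
    z = NormalDist().inv_cdf(0.5 + ci_level / 2)
    # Standard error of the mean using the sample standard deviation
    half_width = z * values.std(ddof=1) / np.sqrt(len(values))
    return values.mean() - half_width, values.mean() + half_width


# Illustrative daily bounce rates (%) for the development period
development_bounce_rates = [30.9, 34.1, 32.9, 31.3, 33.0]
low, high = mean_confidence_interval(development_bounce_rates)

# Check whether a single (illustrative) inference-day value is inside the band
inference_value = 35.25
is_within = low <= inference_value <= high
print(low, high, is_within)
```

If the helper instead draws a band that covers 95% of the daily values, the standard error would be replaced by the standard deviation itself, which produces a wider band.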

### Cohorts

#### Table

Show inference data with audience and cohort groups assigned

In [77]:
with pd.option_context('display.max_columns', None):
    display(df_dev_cohorts.head())

Unnamed: 0,infer_month,fullvisitorid,visitId,visitNumber,visitStartTime,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,revenue,score,predicted_score_label,maudience,cohort,audience_strategy
0,March,2359966058072922045,1488364905,1,2017-03-01 02:41:45,1,3,1,4,2,41,45,(direct),(none),Direct,2,0,Unknown,0,0,15,0,2,18,Chrome,Windows,desktop,0,,0.3570451,True,High,Control,Multi-Group
1,March,7221532448991957109,1488368674,1,2017-03-01 03:44:34,1,3,1,4,3,44,34,(direct),(none),Direct,1,1,Unknown,0,0,12,0,1,0,Safari,Macintosh,desktop,0,,6.821068e-07,False,Low,Test,Multi-Group
2,March,4385673813050165685,1488370525,1,2017-03-01 04:15:25,1,3,1,4,4,15,25,(direct),(none),Direct,1,1,Unknown,9,0,0,0,1,0,Chrome,Linux,desktop,0,,0.03742834,False,Medium,Control,Multi-Group
3,March,1998897161334762608,1488370825,1,2017-03-01 04:20:25,1,3,1,4,4,20,25,(direct),(none),Direct,1,1,Unknown,9,0,0,0,1,0,Chrome,Linux,desktop,0,,0.3213757,False,High,Test,Multi-Group
4,March,633230752572028256,1488371869,1,2017-03-01 04:37:49,1,3,1,4,4,37,49,(direct),(none),Direct,1,1,Unknown,0,0,0,0,1,0,Chrome,Linux,desktop,0,,0.01514476,False,Low,,Multi-Group


::: {.callout-note title="Notes"}
1. The table shows the following for all first-time visitors to the store during the inference (production) period
   - inference month (`infer_month`)
   - attributes (characteristics) of the first visit (some of these are features used by the ML model)
   - the ML model's prediction of such visitors' propensity to make a purchase on a return visit to the store (hard - `predicted_score_label` - and soft - `score` - predictions)
   - assigned audience group (`maudience`), based on specified audience strategy (single or multiple audience groups, `audience_strategy`)
   - audience test and control cohorts (`cohort`).
2. The dashboard will contain a button to download this table as an `.xlsx` file, created using [Python libraries for working with MS Excel files](https://www.python-excel.org/).
3. This table is the main deliverable for this project.
:::
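
The `cohort` column holds `Test`, `Control` or a missing value for each visitor. The assignment logic is not part of this section, so the following is a sketch under the assumption that a fixed number of visitors per audience group is randomly drawn into each cohort and the remainder is left unassigned. The function `assign_cohorts`, its parameters and the sample data are all hypothetical.

```python
import numpy as np
import pandas as pd


def assign_cohorts(df, group_col, n_per_cohort, seed=42):
    """Randomly assign Test and Control cohorts within each group."""
    rng = np.random.default_rng(seed)
    cohort = pd.Series(None, index=df.index, dtype="object")
    for _, labels in df.groupby(group_col).groups.items():
        # Shuffle the row labels of this group, then slice off each cohort
        shuffled = rng.permutation(np.asarray(labels))
        cohort.loc[shuffled[:n_per_cohort]] = "Test"
        cohort.loc[shuffled[n_per_cohort : 2 * n_per_cohort]] = "Control"
    return df.assign(cohort=cohort)


# Hypothetical visitors: five in each of two audience groups
visitors = pd.DataFrame(
    {
        "fullvisitorid": range(10),
        "maudience": ["High"] * 5 + ["Low"] * 5,
    }
)
with_cohorts = assign_cohorts(visitors, "maudience", n_per_cohort=2)
print(with_cohorts)
```

Sampling within each audience group keeps the test and control cohorts the same size in every group, and the fixed seed makes the assignment reproducible.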

Export `DataFrame` with inference data, including the audience cohorts, to XLSX file with the following formatting

1. columns that are sufficiently wide to show the longest value in any row
2. bold header row
3. centered cell contents

In [78]:
# #| echo: true
# xlh.export_df_to_formatted_spreadsheet(
#     df_dev_cohorts,
#     os.path.join(processed_data_dir, "Audience_cohorts_predictions.xlsx"),
# )

::: {.callout-note title="Notes"}
1. The worksheet formatting improves the readability of the contents in this file by the end user (business client).
:::
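
The internals of `xlh.export_df_to_formatted_spreadsheet` are not shown in this notebook. As an illustration of the first formatting rule (columns wide enough for the longest value), the sketch below computes a width for each column from the longest cell or header text. The function name `column_widths`, the `padding` parameter and the sample data are hypothetical.

```python
import pandas as pd


def column_widths(df, padding=2):
    """Return a width (in characters) for each column of a DataFrame."""
    widths = {}
    for col in df.columns:
        # Longest rendered cell value in this column
        longest_value = int(df[col].astype(str).str.len().max())
        # The header must fit as well
        widths[col] = max(longest_value, len(str(col))) + padding
    return widths


sample = pd.DataFrame(
    {"maudience": ["High", "Medium"], "score": [0.5, 0.25]}
)
widths = column_widths(sample)
print(widths)
```

With a library such as `openpyxl`, widths like these could then be applied through the worksheet's `column_dimensions[...].width` attribute; whether the project's helper does this is an assumption.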

## Recommendations

As mentioned in the project scope, segmenting the first-time visitors into audience groups helps to spend the available marketing budget effectively. It allows greater flexibility in customizing the campaign, since a different marketing approach can be used for a customer predicted to have a high, medium or low propensity to make a return purchase. We will now make recommendations for more personalized marketing.

### Based on ML Model's Most Important Features

::: {.callout-note title="Notes"}
Based on

1. discoveries made from exploring the data
2. the ML model's most important features for predicting whether a visitor will make a return purchase

we should target the following visitor profile to maximize campaign response

1. High propensity visitors

   - reached the store site using a paid search
   - used an uncommonly used browser
   - used one of the following operating systems
     - FreeBSD
     - Nokia-based OS
   - interacted with content on the store site

   with an emphasis on visitors who accessed the site from an uncommonly used browser
2. Medium propensity visitors

   - interacted with content on the store site
   - used one of the following frequently used operating systems
     - Macintosh
     - Chrome OS
   - used one of the following infrequently used operating systems
     - Nintendo Wii
     - Firefox OS
   - used an undetermined medium to access the store site
   - reached the store site using a google search

   with an emphasis on
   - visitors who used a Mac-based operating system to access the store site
   - Chrome OS users (i.e. chromebook users)
3. Low propensity visitors

   - reached the store site by
     - sources other than google search
     - directly entering URL into web browser
   - used one of the following operating systems
     - SunOS
     - Macintosh OS 
   - used an affiliate or undetermined medium to access the store site
   - did not bounce from the site

   with an emphasis on
   - visitors who reached the site during their first visit by
     - sources other than google search
     - directly entering URL into web browser
   - visitors who used a Mac-based operating system to access the store site
   - visitors who did not bounce from the site

where the recommended emphasis is based on factors that produced a good combination of KPIs (CTR and conversion rate) among first-time visitors in the closest available month of historical data.
:::
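
A target profile like the ones listed above can be expressed as a filter on the cohort table. The sketch below selects high propensity visitors who arrived through paid search or used one of the listed operating systems. It uses column names that appear in the cohort table (`maudience`, `medium`, `os`), but the sample rows and the exact category values are hypothetical, and the "uncommonly used browser" criterion is omitted because its definition is not shown here.

```python
import pandas as pd

# Hypothetical rows using column names from the cohort table
visitors = pd.DataFrame(
    {
        "fullvisitorid": ["a", "b", "c", "d"],
        "maudience": ["High", "High", "Low", "High"],
        "medium": ["cpc", "organic", "cpc", "referral"],
        "os": ["Windows", "FreeBSD", "Macintosh", "Windows"],
    }
)

# High propensity visitors matching at least one profile criterion
high_targets = visitors.query(
    "maudience == 'High' and (medium == 'cpc' or os in ['FreeBSD', 'Nokia'])"
)
print(high_targets)
```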

### Based on Observed Behavior During Production (Inference) Period

Show the average number of promotions viewed by high propensity first-time visitors during the inference period

In [79]:
#| output: true
avg_promos_disp_high_prop = df_dev_cohorts.query("maudience == 'High'")['promos_displayed'].mean()
print(
    "Average number of promotions viewed by visitors in high propensity "
    f"group = {avg_promos_disp_high_prop:,.2f}"
)

Average number of promotions viewed by visitors in high propensity group = 8.55


::: {.callout-note title="Notes"}
Based on visitor behavior during their first visit to the store, we should include the following actions to maximize campaign response

1. Low propensity audience
   - offer stronger discounts or offers (be more aggressive)
   - offer the strongest discounts on weekends
   - offer discounts on non-shopping cart items
2. High propensity audience
   - offer the weakest discounts on weekends
   - offer discounts on shopping cart items and recommendations for similar products to shopping cart items
3. Medium propensity audience
   - offer intermediate discounts (lower than for the low propensity audience and higher than for the high propensity audience) on weekends
:::

## Next Step

The next step will delete all MLflow-related resources created for this project.