# Create Dashboard Charts and Tables

In [None]:
%load_ext autoreload
%autoreload 2

::: {.content-hidden}
Import necessary Python modules
:::

In [None]:
import os
import sys
from calendar import month_name
from glob import glob
from typing import Dict, List

import altair as alt
import pandas as pd

In [None]:
alt.renderers.set_embed_options(actions=False)

::: {.content-hidden}
Get relative path to project root directory
:::

In [None]:
PROJ_ROOT_DIR = os.path.join(os.pardir)
src_dir = os.path.join(PROJ_ROOT_DIR, "src")
sys.path.append(src_dir)

::: {.content-hidden}
Import custom Python modules
:::

In [None]:
%aimport bigquery_auth_helpers
from bigquery_auth_helpers import auth_to_bigquery

%aimport dash_helpers
import dash_helpers as dh

%aimport transform_helpers
import transform_helpers as th

%aimport xlsx_helpers
import xlsx_helpers as xlh

## About

### Overview
This step will create charts and tables to be displayed as part of the dashboard to summarize the audience group(s) for the marketing campaign.

### Implementation
The marketing team (the business user) will use the dashboard to design the campaign, so it must show all first-time visitors to the store during the inference period, together with the ML model's prediction of each visitor's propensity to make a purchase during a return visit to the store.

The dashboard will also show the following

1. most important features for predicting the propensity, separately for each audience group
2. summary of visit attributes (for first-time visitors) during the inference period and, in order to provide some context, a month-over-month comparison of these attributes starting from the first month of Google Analytics tracking data used during ML model development

### Order of Operations
This step can be run prospectively at the end of the inference period, just before the start of the campaign, once all the inference data (first-time visitors to the store) is available. This step can be run after the following BigQuery tables have been created

1. `audience_cohorts`
2. `audience_profiles`
3. `monthly_summary`

These tables were uploaded to BigQuery in the previous step.

## User Inputs

Define the following

1. BigQuery
   - dataset id
   - table ids for audience
     - cohorts
     - profile
2. start and end dates for train, validation, test and inference data

In [None]:
#| echo: true
# 1. GCP resources
gbq_dataset_id = 'mydemo2asdf'
gbq_table_id_cohorts = 'audience_cohorts'
gbq_table_id_profiles = 'audience_profiles'
gbq_table_id_feats_imp = 'audience_feats_imp'
gbq_table_id_summary = 'monthly_summary'
gbq_table_id_sa_fracs = "cohort_audience_fractions"
gbq_table_id_conv_rates = "audience_conversion_rates"
gbq_table_id_conv_rates_agg_combo = "conversion_rates_aggregated"
gbq_table_id_daily_perf = "daily_summary"
gbq_table_id_cat_feats_kpis = "categorical_features_kpis"

# 2. start and end dates
train_start_date = "20160901"
train_end_date = "20161231"
val_start_date = "20170101"
val_end_date = "20170131"
test_start_date = "20170201"
test_end_date = "20170228"
infer_start_date = '20170301'
infer_end_date = '20170331'
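
The date strings above use the `YYYYMMDD` format. As a minimal sketch (not part of the original workflow), the four periods could be checked to be chronological and contiguous before any query is run. The `periods` dictionary and the `check_periods` helper are hypothetical names introduced here for illustration only.

```python
from datetime import datetime, timedelta

# Hypothetical container that repeats the date strings defined above
periods = {
    "train": ("20160901", "20161231"),
    "val": ("20170101", "20170131"),
    "test": ("20170201", "20170228"),
    "infer": ("20170301", "20170331"),
}


def check_periods(periods):
    """Return True if each period is valid and starts the day after the previous one ends."""
    parsed = [
        (datetime.strptime(start, "%Y%m%d"), datetime.strptime(end, "%Y%m%d"))
        for start, end in periods.values()
    ]
    # Every period must start on or before its own end date
    if any(start > end for start, end in parsed):
        return False
    # Consecutive periods must be exactly one day apart (no gap, no overlap)
    for (_, prev_end), (next_start, _) in zip(parsed, parsed[1:]):
        if next_start - prev_end != timedelta(days=1):
            return False
    return True


periods_are_contiguous = check_periods(periods)
```

A check like this fails fast if, for example, the inference start date is accidentally set inside the test period.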

::: {.content-hidden}
Get path to data sub-folders and model folder
:::

In [None]:
data_dir = os.path.join(PROJ_ROOT_DIR, "data")
raw_data_dir = os.path.join(data_dir, "raw")
processed_data_dir = os.path.join(data_dir, "processed")
gcp_keys_dir = os.path.join(PROJ_ROOT_DIR, "gcp_keys")

::: {.content-hidden}
Load Google Cloud authentication credentials for use with the native BigQuery Python client
:::

In [None]:
gcp_proj_id = os.environ["GCP_PROJECT_ID"]

::: {.content-hidden}
Get fully resolved name of the BigQuery tables
:::

In [None]:
gbq_table_fully_resolved_cohorts = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_cohorts}"
gbq_table_fully_resolved_profiles = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_profiles}"
gbq_table_fully_resolved_feats_imp = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_feats_imp}"
gbq_summary_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_summary}"
gbq_sa_fracs_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_sa_fracs}"
gbq_conv_rates_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_conv_rates}"
gbq_conv_rates_combo_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_conv_rates_agg_combo}"
gbq_daily_perf_combo_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_daily_perf}"
gbq_table_fully_resolved_cat_feat_kpis = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_cat_feats_kpis}"
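
Each fully resolved name above follows the `project.dataset.table` pattern. The repeated f-strings could be replaced by a small helper, sketched below. The function name `fully_resolved_table_id` and the project id `"my-project"` are placeholders for illustration, not values from this project.

```python
def fully_resolved_table_id(project_id: str, dataset_id: str, table_id: str) -> str:
    """Join the three parts of a BigQuery table name with dots."""
    parts = [project_id, dataset_id, table_id]
    if not all(parts):
        raise ValueError("project, dataset and table ids must all be non-empty")
    return ".".join(parts)


# "my-project" is a placeholder; the dataset and table ids repeat the user inputs
table_ids = {
    name: fully_resolved_table_id("my-project", "mydemo2asdf", name)
    for name in ["audience_cohorts", "audience_profiles", "monthly_summary"]
}
```

Keeping the names in one dictionary also makes it harder to mistype a single table reference.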

::: {.content-hidden}
Define a dictionary to specify datatypes of the profiles data
:::

In [None]:
dtypes_dict_profiles = {
    "Audience_Strategy": pd.StringDtype(),
    "Stat_Expanded": pd.StringDtype(),
    "High": pd.Float32Dtype(),
    "Low": pd.Float32Dtype(),
    "Medium": pd.Float32Dtype(),
}
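
The helper `th.set_datatypes` that consumes these dictionaries is not shown in this step. As a minimal sketch, assuming it behaves like `DataFrame.astype`, the example below applies a shortened dictionary to toy data (the values are made up) and shows that pandas nullable dtypes keep missing values as `pd.NA`.

```python
import pandas as pd

# Shortened version of the profiles dictionary, for illustration
dtypes_demo = {
    "Audience_Strategy": pd.StringDtype(),
    "Stat_Expanded": pd.StringDtype(),
    "High": pd.Float32Dtype(),
}

# Toy data with made-up values, including one missing number
df_demo = pd.DataFrame(
    {
        "Audience_Strategy": ["Multi-Group", "Multi-Group"],
        "Stat_Expanded": ["Revenue (Mean)", "Weekend Visitors"],
        "High": [1.5, None],
    }
)

# Cast every listed column to its nullable dtype in one call
df_typed = df_demo.astype(dtypes_demo)
n_missing = int(df_typed["High"].isna().sum())
```

The nullable `Float32` and `Int16` dtypes are useful here because a column can hold missing values without being silently converted to `float64` or `object`.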

::: {.content-hidden}
Define a dictionary to specify datatypes of the feature importances data
:::

In [None]:
dtypes_dict_feats_imp = {
    "audience_strategy": pd.StringDtype(),
    "num_observations": pd.Int16Dtype(),
    "stat": pd.StringDtype(),
    "maudience": pd.StringDtype(),
    "value": pd.StringDtype(),
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the cohorts data
:::

In [None]:
dtypes_dict_cohort = {
    "infer_month": "str",
    "fullvisitorid": "str",
    "visitId": "str",
    "visitNumber": "int",
    "quarter": "int",
    "month": "int",
    "day_of_month": "int",
    "day_of_week": "int",
    "hour": "int",
    "minute": "int",
    "second": "int",
    "source": "str",  #
    "medium": "str",  #
    "channelGrouping": "str",  #
    "hits": "int",
    "bounces": "int",
    "last_action": "str",  #
    "promos_displayed": "int",
    "promos_clicked": "int",
    "product_views": "int",
    "product_clicks": "int",
    "pageviews": "int",
    "time_on_site": "int",
    "browser": "str",  #
    "os": "str",  #
    "deviceCategory": "str",  #
    "added_to_cart": "int",
    "revenue": "float",
    "score": "float",
    "predicted_score_label": "bool",
    "maudience": "str",
    "cohort": "str",
    "audience_strategy": "str",
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the monthly performance summary data
:::

In [None]:
dtypes_dict_monthly_summary = {
    'month': pd.StringDtype(),
    "split_type": pd.StringDtype(),
    "return_purchasers": pd.Int16Dtype(),
    "revenue": pd.Float32Dtype(),
    "visitors": pd.Int16Dtype(),
    "pageviews": pd.Int32Dtype(),
    "time_on_site": pd.Float32Dtype(),
    "audience_strategy": pd.StringDtype(),
    "bounce_rate": pd.Float32Dtype(),
    "conversion_rate": pd.Float32Dtype(),
    "product_clicks_rate": pd.Float32Dtype(),
    "add_to_cart_rate": pd.Float32Dtype(),
    "visitors_pct_change": pd.Float32Dtype(),
    "revenue_pct_change": pd.Float32Dtype(),
    "pageviews_pct_change": pd.Float32Dtype(),
    "time_on_site_pct_change": pd.Float32Dtype(),
    "bounce_rate_pct_change": pd.Float32Dtype(),
    "conversion_rate_pct_change": pd.Float32Dtype(),
    "product_clicks_rate_pct_change": pd.Float32Dtype(),
    "add_to_cart_rate_pct_change": pd.Float32Dtype(),
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the daily performance summary data
:::

In [None]:
dtypes_dict_daily_summary = {
    "date": pd.StringDtype(),
    "maudience": pd.StringDtype(),
    "revenue": pd.Float32Dtype(),
    "time_on_site": pd.Float32Dtype(),
    "bounce_rate": pd.Float32Dtype(),
    "product_clicks_rate": pd.Float32Dtype(),
    "add_to_cart_rate": pd.Float32Dtype(),
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the cohort-to-audience fraction
:::

In [None]:
dtypes_sa_frac = {
    "infer_month": pd.StringDtype(),
    "audience_strategy": pd.StringDtype(),
    "maudience": pd.StringDtype(),
    "cohort": pd.StringDtype(),
    "size": pd.Int16Dtype(),
    "group_size": pd.Int16Dtype(),
    "uplift": pd.Int8Dtype(),
    "power": pd.Int8Dtype(),
    "ci_level": pd.Int8Dtype(),
    "samp_to_aud_frac": pd.Float32Dtype(),
    "size_type": pd.StringDtype(),
    "data_type": pd.StringDtype(),
    "data_size": pd.Int32Dtype(),
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the conversion rates for each dataset (development and inference)
:::

In [None]:
dtypes_conv_rates = {
    "audience_strategy": pd.StringDtype(),
    "infer_month": pd.StringDtype(),
    "maudience": pd.StringDtype(),
    "pred_conversions": pd.Int16Dtype(),
    "total_visitors": pd.Int16Dtype(),
    "min_score": pd.Float32Dtype(),
    "true_conversions": pd.Int16Dtype(),
    "data_type": pd.StringDtype(),
    "data_size": pd.Int16Dtype(),
    "true_conv_rate": pd.Float32Dtype(),
    "overall_true_conv_rate": pd.Float32Dtype(),
    "pred_conv_rate": pd.Float32Dtype(),
    "overall_pred_conv_rate": pd.Float32Dtype(),
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the aggregated conversion rates
:::

In [None]:
dtypes_hmap = {
    "maudience": pd.StringDtype(),
    "data_type": pd.StringDtype(),
    "var": pd.StringDtype(),
    "value": pd.Float32Dtype(),
}

::: {.content-hidden}
Define a dictionary to specify datatypes of the KPIs for all categories within each categorical feature
:::

In [None]:
dtypes_dict_categorical_kpis = {
    "feature_name": pd.StringDtype(),
    "feature_category": pd.StringDtype(),
    "variable": pd.StringDtype(),
    "value": pd.Float32Dtype(),
}

Create a mapping between the integer code and the label of each audience strategy, in order to get meaningful names from the `audience_strategy` column in the profiles data

In [None]:
#| echo: true
audience_strategy_mapper = {1: "Multi-Group", 2: "Single Group"}
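
The helper `th.map_columns` that applies this mapping is not shown here. A minimal sketch of the idea, assuming a plain `Series.map` call, is given below with toy codes. Note that a code missing from the mapper becomes a missing value, which is worth checking for after mapping.

```python
import pandas as pd

audience_strategy_mapper = {1: "Multi-Group", 2: "Single Group"}

# Toy column of integer codes; 3 is deliberately absent from the mapper
codes = pd.Series([1, 2, 1, 3], name="audience_strategy")

# Replace each integer code with its label
labels = codes.map(audience_strategy_mapper)

# Codes without a label are mapped to a missing value
n_unmapped = int(labels.isna().sum())
```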

::: {.content-hidden}
Define helper function to show datatypes and number of missing values for all columns in a `DataFrame`
:::

In [None]:
def summarize_df(df: pd.DataFrame) -> None:
    """Show datatypes and count missing values in columns of DataFrame."""
    display(
        df.dtypes.rename("dtype")
        .to_frame()
        .merge(
            df.isna().sum().rename("missing").to_frame(),
            left_index=True,
            right_index=True,
            how="left",
        )
        .reset_index()
        .rename(columns={"index": "column"})
    )

::: {.content-hidden}
Define helper function to load data from BigQuery table into a `DataFrame` and apply data transformation steps
:::

In [None]:
def get_data(
    query: str,
    mapper_dict: Dict[str, Dict[int, str]],
    dtypes_dict: Dict,
    gcp_keys_dir_path: str,
    date_col: str = "",
    data_type: str = "profiles",
    custom_sort_single_col: Dict[str, List[str]] = dict(),
) -> pd.DataFrame:
    """Get data from BigQuery table."""
    gcp_authorization_dict = auth_to_bigquery(gcp_keys_dir_path)
    df = th.extract_data(query, gcp_authorization_dict)
    if mapper_dict:
        df = df.pipe(th.map_columns, mapper_dict)
    df = df.pipe(th.set_datatypes, dtypes_dict)
    if date_col:
        df[date_col] = pd.to_datetime(df[date_col], utc=False)
    if custom_sort_single_col and len(list(custom_sort_single_col)) == 1:
        col_sort = list(custom_sort_single_col)[0]
        sort_order = list(custom_sort_single_col.values())[0]
        df = df.set_index(col_sort).loc[sort_order].reset_index()
    return df
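
The custom sort step in `get_data` reorders rows by indexing on the sort column and selecting its labels in the requested order. The toy example below (made-up values) isolates that step so its behavior is easier to see.

```python
import pandas as pd

# Toy data in an arbitrary row order
df_demo = pd.DataFrame(
    {
        "month": ["January", "December", "November"],
        "visitors": [300, 200, 100],
    }
)
sort_order = ["November", "December", "January"]

# Same pattern as in get_data: index by the column, select labels in order
df_sorted = df_demo.set_index("month").loc[sort_order].reset_index()
```

Because `.loc` is given a list of labels, it raises a `KeyError` if any requested label is absent from the column, so the sort order must only contain values that occur in the data.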

## Get Data

### Profiles

In [None]:
#| echo: true
query = f"""
        SELECT *
        FROM {gbq_table_fully_resolved_profiles}
        """
df_profiles = get_data(
    query,
    {'Audience_Strategy': audience_strategy_mapper},
    dtypes_dict_profiles,
    gcp_keys_dir,
)

In [None]:
with pd.option_context('display.max_rows', None):
    display(df_profiles)

::: {.content-hidden}
Summarize the `DataFrame` with the audience profile
:::

In [None]:
#| output: false
summarize_df(df_profiles)

### Feature Importances

In [None]:
#| echo: true
query = f"""
        SELECT *
        FROM {gbq_table_fully_resolved_feats_imp}
        """
df_feats_imp = get_data(
    query,
    {'audience_strategy': audience_strategy_mapper},
    dtypes_dict_feats_imp,
    gcp_keys_dir,
)

In [None]:
with pd.option_context('display.max_rows', None):
    display(df_feats_imp)

::: {.content-hidden}
Summarize the `DataFrame` with the ML feature importances
:::

In [None]:
#| output: false
summarize_df(df_feats_imp)

### Cohorts

In [None]:
#| echo: true
query = f"""
        SELECT * EXCEPT(made_purchase_on_future_visit, split_type)
        FROM {gbq_table_fully_resolved_cohorts}
        WHERE split_type = 'infer'
        """
df_dev_cohorts = (
    get_data(
        query,
        {'audience_strategy': audience_strategy_mapper},
        dtypes_dict_cohort,
        gcp_keys_dir,
    )
)

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_dev_cohorts.head())

::: {.content-hidden}
Summarize the `DataFrame` with the audience cohorts
:::

In [None]:
#| output: false
summarize_df(df_dev_cohorts)

::: {.content-hidden}
Verify that the true label (or outcome of the return visit, `made_purchase_on_future_visit`) is not present in the cohorts data
:::

In [None]:
assert 'made_purchase_on_future_visit' not in list(df_dev_cohorts)

::: {.callout-note title="Notes"}
1. The true outcome is not known at the end of the inference period. It will only be known after the end of the marketing campaign, which occurs after the inference period.
:::

### Monthly Performance Summary

In [None]:
#| echo: true
query = f"""
        SELECT * EXCEPT(channelGrouping,deviceCategory,browser,os,visitor_type)
        FROM {gbq_summary_table_id_fully_resolved}
        """
df_monthly_summary = (
    get_data(
        query,
        {
            'audience_strategy': audience_strategy_mapper,
            "month": dict(zip(range(1, 12 + 1), month_name[1:])),
        },
        dtypes_dict_monthly_summary,
        gcp_keys_dir,
        custom_sort_single_col={
            "month": [
                'September',
                'October',
                'November',
                'December',
                'January',
                'February',
                'March',
            ]
        }
    )
)
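
The `month` mapper passed to `get_data` converts month numbers to month names, and the custom sort order lists the months chronologically from the first training month (September 2016) to the inference month (March 2017). A minimal standalone sketch of both pieces follows; the variable names are illustrative.

```python
from calendar import month_name

# Map month number (1-12) to month name; month_name[0] is an empty string
month_mapper = dict(zip(range(1, 13), month_name[1:]))

# Chronological order of the months covered by the data (Sep 2016 to Mar 2017)
month_numbers_in_order = [9, 10, 11, 12, 1, 2, 3]
month_sort_order = [month_mapper[m] for m in month_numbers_in_order]
```

Sorting by name alone would be alphabetical, and sorting by month number would put January first, so the explicit list is needed because the data spans a year boundary.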

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_monthly_summary)

::: {.content-hidden}
Summarize the `DataFrame` with the monthly performance summary
:::

In [None]:
#| output: false
summarize_df(df_monthly_summary)

### Daily Performance Summary by Audience Group

In [None]:
#| echo: true
query = f"""
        SELECT maudience,
               date,
               revenue,
               product_views,
               bounce_rate,
               product_clicks_rate,
               add_to_cart_rate,
               time_on_site
        FROM {gbq_daily_perf_combo_table_id_fully_resolved}
        WHERE agg_type != 'overall'
        """
df_daily_summary_aud = (
    get_data(
        query,
        {},
        dtypes_dict_daily_summary,
        gcp_keys_dir,
        date_col='date',
    )
)

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_daily_summary_aud)

::: {.content-hidden}
Summarize the `DataFrame` with the daily performance summary by audience group
:::

In [None]:
#| output: false
summarize_df(df_daily_summary_aud)

### Daily Performance Summary Overall

In [None]:
#| echo: true
query = f"""
        SELECT maudience,
               date,
               revenue,
               product_views,
               bounce_rate,
               product_clicks_rate,
               add_to_cart_rate,
               time_on_site
        FROM {gbq_daily_perf_combo_table_id_fully_resolved}
        WHERE agg_type = 'overall'
        """
df_daily_summary = (
    get_data(
        query,
        {},
        dtypes_dict_daily_summary,
        gcp_keys_dir,
        date_col='date',
    )
)

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_daily_summary)

::: {.content-hidden}
Summarize the `DataFrame` with the overall daily performance summary
:::

In [None]:
#| output: false
summarize_df(df_daily_summary)

### Conversion Rates

In [None]:
#| echo: true
query = f"""
        SELECT *
        FROM {gbq_conv_rates_table_id_fully_resolved}
        """
df_conv_rates = (
    get_data(
        query,
        {'audience_strategy': audience_strategy_mapper},
        dtypes_conv_rates,
        gcp_keys_dir,
    )
)

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_conv_rates)

::: {.content-hidden}
Summarize the `DataFrame` with the conversion rates
:::

In [None]:
#| output: false
summarize_df(df_conv_rates)

### Cohort-to-Audience Size Fractions

In [None]:
#| echo: true
query = f"""
        SELECT *
        FROM {gbq_sa_fracs_table_id_fully_resolved}
        """
df_sa_frac = (
    get_data(
        query,
        {},
        dtypes_sa_frac,
        gcp_keys_dir,
    )
)

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_sa_frac)

::: {.content-hidden}
Summarize the `DataFrame` with the cohort-to-audience size fraction
:::

In [None]:
#| output: false
summarize_df(df_sa_frac)

### Aggregated Conversion Rates (Overall)

In [None]:
#| echo: true
query = f"""
        SELECT *
        FROM {gbq_conv_rates_combo_table_id_fully_resolved}
        WHERE maudience IS NULL
        """
df_hmap = (
    get_data(
        query,
        {},
        dtypes_hmap,
        gcp_keys_dir,
    )
)

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_hmap)

::: {.content-hidden}
Summarize the `DataFrame` with the overall aggregated conversion rates
:::

In [None]:
#| output: false
summarize_df(df_hmap)

### Aggregated Conversion Rates (per Audience Group)

In [None]:
#| echo: true
query = f"""
        SELECT *
        FROM {gbq_conv_rates_combo_table_id_fully_resolved}
        WHERE maudience IS NOT NULL
        """
df_hmap_aud = (
    get_data(
        query,
        {},
        dtypes_hmap,
        gcp_keys_dir,
    )
)

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_hmap_aud)

::: {.content-hidden}
Summarize the `DataFrame` with the conversion rates aggregated by audience group
:::

In [None]:
#| output: false
summarize_df(df_hmap_aud)

### Categorical Feature KPIs

In [None]:
#| echo: true
query = f"""
        SELECT feature_name,
               feature_category,
               variable,
               value
        FROM {gbq_table_fully_resolved_cat_feat_kpis}
        WHERE variable IN ('CTR', 'Conversion Rate')
        """
df_development_agg = get_data(
    query,
    {},
    dtypes_dict_categorical_kpis,
    gcp_keys_dir,
)

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_development_agg)

::: {.content-hidden}
Summarize the `DataFrame` with the KPIs for all categories in each categorical feature
:::

In [None]:
#| output: false
summarize_df(df_development_agg)

In [None]:
df_dev_cohorts

## Create Charts and Tables

### Conversion Rates During Development and Inference

Show metrics used to estimate sample sizes

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_conv_rates)

#### Charts

Show plot comparing true and predicted conversion rates for each audience group

In [None]:
chart = dh.plot_repeated_column_row_grouped_bar_chart(
    df_hmap_aud,
    xvar = "var",
    yvar = "value",
    color_by_col = "var",
    row_var = "data_type",
    col_var = "maudience",
    ptitle_str=(
        "Similar conversion rates by audience across "
        "development & inference"
    ),
    tooltip=[
        alt.Tooltip('maudience:N', title='Audience Group'),
        alt.Tooltip('data_type:N', title="Type of data"),
        alt.Tooltip('var:N', title="Quantity"),
        alt.Tooltip('value:Q', title='Value', format=".2f"),
    ],
    row_spacing = 25,
    bar_order=['true', 'pred'],
    bar_colors=['lightgrey', '#cc1e1f'],
    axis_label_fontsize=15,
    axis_title_fontsize=15,
    show_title = True,
    fig_size=dict(width=175, height=175),
)
chart

Show plot comparing overall statistics

In [None]:
chart = dh.plot_repeated_column_grouped_bar_chart(
    df_hmap,
    xvar="data_type",
    yvar="value",
    color_by_col="data_type",
    col_var="var",
    bar_order=['Development', 'Inference'],
    bar_colors=['lightgrey', '#cc1e1f'],
    ptitle_str=(
        "Similar overall conversion rates during "
        "development and inference"
    ),
    tooltip=[
        alt.Tooltip('var:N', title='Metric'),
        alt.Tooltip('data_type:N', title='Type of Data'),
        alt.Tooltip('value:Q', title='Value', format=",.2f"),
    ],
    axis_label_fontsize = 14,
    axis_title_fontsize = 15,
    show_title = True,
    fig_size=dict(width=150, height=275),
)
chart

Show plot comparing development and inference statistics for each audience group

In [None]:
for k, yvar in enumerate(["total_visitors", 'min_score', 'pred_conv_rate', 'true_conv_rate']):
    chart = dh.plot_repeated_column_grouped_bar_chart_untidy_data(
        df_conv_rates,
        xvar = 'maudience',
        yvar = yvar,
        color_by_col = 'maudience',
        col_var = 'data_type',
        column_order = ['Development', 'Inference'],
        title_dict={
            "total_visitors": {
                "y": "Audience size",
                "title": (
                    "Similarly sized audiences are predicted between inference "
                    "and historical data"
                ),
            },
            "min_score": {
                "y": "Minimum Predicted Propensity",
                "title": (
                    "Similar propensities predicted by audience "
                    "across inference and historical data"
                ),
            },
            "pred_conv_rate": {
                "y": "Predicted Conversion Rate (%)",
                "title": (
                    "Similar conversion rates predicted by audience "
                    "across inference and historical data"
                ),
            },
            "true_conv_rate": {
                "y": "True Conversion Rate (%)",
                "title": (
                    "Highest true conversion rate observed for "
                    "high-propensity group"
                ),
            },
        },
        tooltip = [
            alt.Tooltip('infer_month:N', title='Inference Month'),
            alt.Tooltip('audience_strategy:N', title='Audience Strategy'),
            alt.Tooltip('data_type:N', title='Type of Dataset'),
            alt.Tooltip('data_size:Q', title='Dataset Size', format=","),
            alt.Tooltip('maudience:N', title='Audience Group'),
            alt.Tooltip('total_visitors:Q', title='Audience Size', format=","),
            alt.Tooltip('pred_conv_rate:Q', title='Predicted Conv. Rate (%)', format=",.2f"),
            alt.Tooltip('true_conv_rate:Q', title='True Conv. Rate (%)', format=",.2f"),
        ],
        bar_order=["High", "Medium", "Low"],
        bar_colors=["#cc1e1f", "#fc8767", "#fcb49a"],
        column_header_fontsize = 15,
        axis_label_fontsize = 15,
        title_fontsize = 15,
        show_title = True,
        show_legend = True if k == 0 else False,
        fig_size = dict(width=250, height=300),
    )
    display(chart)

### Monthly Summary Statistics About Data

Show metadata

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_monthly_summary)

#### Chart

Show plot

In [None]:
ytitle = {
    "visitors": 'Visitors',
    "revenue": 'Revenue (USD)',
    "add_to_cart_rate": "Fraction of Visitors that Added Items to Cart (%)",
    "bounce_rate": "Bounce Rate (%)",
    "conversion_rate": "Conversion Rate (%)",
    "product_clicks_rate": "Product List Clickthrough Rate (%)",
    "time_on_site": "Average time spent on store website (minutes)",
    'pageviews': "Number of pages viewed during visit",
}

for yvar in list(ytitle):
    chart = dh.plot_statistic_bar_chart_combo(
        data=df_monthly_summary,
        yvar=yvar,
        color_by_col="split_type:N",
        colors={"Train+Val": "lightgrey", "Test": "grey", "Infer": "red"},
        marker_size=80,
        marker_colors=['red', 'green'],
        marker_values=[False, True],
        x_axis_sort=df_monthly_summary['month'].tolist(),
        ptitle=ytitle[yvar],
        axis_label_fontsize=14,
        ptitle_vertical_offset=-1,
        fig_size_bars=dict(width=575, height=300),
        fig_size_lines=dict(width=575, height=125),
    )
    display(chart)

::: {.callout-note title="Notes"}
1. This chart shows select numerical attributes of first-time visitors to the store during the inference (production) period.
2. The bar chart shows the monthly aggregated value of these attributes, separately for each month in the ML training+validation data, ML test data and inference data. Unless indicated, the statistics are monthly totals.
3. The line chart shows the month-over-month percentage change in each attribute.
:::
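
The `*_pct_change` columns are computed in an earlier step, so their exact formula is not shown here. As a minimal sketch, assuming the standard definition of month-over-month percentage change, the calculation looks like this (the visitor counts are toy values).

```python
def pct_change(values):
    """Return the percentage change of each value relative to the previous one."""
    # The first month has no previous month to compare against
    changes = [None]
    for prev, curr in zip(values, values[1:]):
        # Guard against division by zero when the previous value is 0
        changes.append(None if prev == 0 else 100 * (curr - prev) / prev)
    return changes


# Toy monthly visitor counts for three consecutive months
monthly_visitors = [1000, 1200, 900]
visitors_pct_change = pct_change(monthly_visitors)
```

With these toy values, the second month is 20% above the first and the third month is 25% below the second.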

::: {.callout-tip title="Observations"}
1. Absolute performance peaks during the holiday season (November and/or December) and is most variable during September and October.
2. As expected, month-over-month growth drops during the winter after the holiday shopping season has ended (i.e. drops during January and February).
:::

### KPIs by Categorical Feature

Show KPIs per categorical feature

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_development_agg.head(3))

#### Chart

Show plot

In [None]:
ptitle_dict = {
    "os": "Linux and Mac operating systems give the best combination of KPIs",
    "source": "Direct traffic gives best combination of KPIs",
    "browser": "Chrome offers best combination of KPIs among web browsers",
    'medium': 'Traffic reaching from CPM, referral or no medium gives the best combination of KPIs',
    'channelGrouping': "Referral, direct or display channel shows the best combination of KPIs",
    'deviceCategory': "Desktop devices give best combination of KPIs",
    'last_action': (
        "Ending a first visit with a Check Out or Add To/Remove from "
        "Cart gives best KPIs"
    ),
}
for k, feature in enumerate(list(ptitle_dict)):
    chart = dh.plot_stacked_bar_chart(
        data=df_development_agg.query(f"feature_name == '{feature}'"),
        xvar="feature_category",
        yvar="value",
        color_by_col='variable',
        colors={"CTR": '#cccccc', 'Conversion Rate': 'red'},
        show_title=True,
        show_legend=True if k == 0 else False,
        ptitle_str=ptitle_dict[feature],
        tooltip=[
            alt.Tooltip("feature_name:N", title='Categorical Feature'),
            alt.Tooltip("feature_category:N", title='Feature Sub-Category'),
            alt.Tooltip("variable:N", title='Rate Type'),
            alt.Tooltip("value:Q", title='Rate (%)', format=".3f"),
        ],
        x_label_height=400,
        axis_label_fontsize=16,
        title_fontsize=18,
        title_fontweight='normal',
        x_tick_label_angle=-45,
        fig_size=dict(width=400, height=300),
    )
    display(chart)

### Feature Importances by Audience Group

Show feature importances per audience group

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_feats_imp.head(3))

#### Chart

Show plot

In [None]:
tooltip = [
    alt.Tooltip('stat', title='Feature'),
    alt.Tooltip('audience_strategy', title='Audience Strategy'),
    alt.Tooltip('maudience', title='Audience Group'),
    alt.Tooltip('num_observations', title='Obs. to get importance'),
    alt.Tooltip('value', title='Importance', format=".3f"),
]
chart = dh.plot_feature_importances(
    data=df_feats_imp,
    y_label_width=600,
    axis_label_fontsize=14,
    tooltip=tooltip,
    interactive=True,
    bar_color='#525252',  # "#3181bd", "#636363"
    show_x_ticks=False,
    fig_size=dict(width=450, height=300),
)
chart

Show the first-visit bounce rate in historical data among first-time visitors

In [None]:
#| output: true
(
    df_dev_cohorts.groupby('maudience', as_index=False)
    .agg({"bounces":"sum", "fullvisitorid": "count"})
    .rename(columns={"fullvisitorid": "visitors"})
    .assign(bounce_rate=lambda df: 100*df['bounces']/df['visitors'])
)

::: {.callout-note title="Notes"}
1. This chart shows the features that are most important for predicting the outcome in the current use-case (i.e. predicting the propensity of a visitor to make a purchase during a future visit to the e-commerce store). These importances are derived from aggregated SHAP (Shapley) values, which provide global explanations that allow us to interpret the entire ML model that was trained to make these predictions ([link](https://christophm.github.io/interpretable-ml-book/shap.html#shap-feature-importance)). The chart shows the features that, on average, changed the predicted absolute propensity by the most percentage points.
2. The feature importances were calculated and are shown separately for each audience group.
:::
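
The importances themselves are computed in an earlier step. As a minimal sketch, assuming the common definition of SHAP feature importance (the mean absolute SHAP value of a feature across observations), the aggregation can be illustrated with toy numbers. The SHAP values below are made up and the feature names are only examples.

```python
# Toy per-observation SHAP values for two features (made-up numbers)
shap_values = {
    "hits": [0.10, -0.30, 0.20],
    "bounces": [-0.05, 0.05, -0.05],
}

# Global importance: mean of the absolute SHAP values per feature
importance = {
    feature: sum(abs(v) for v in values) / len(values)
    for feature, values in shap_values.items()
}

# Rank features from most to least important
ranked_features = sorted(importance, key=importance.get, reverse=True)
```

Taking the absolute value first means that features which push predictions up for some visitors and down for others still rank as important, instead of cancelling out.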

::: {.callout-tip title="Observations"}
1. For high propensity visitors, the most important features (of their first visit) to predict whether they made a return purchase were
   - using an uncommonly used browser (`browser__other`)
   - using one of the following operating systems
     - Windows (`os__Windows`) (**fourth highest combination of KPIs by `os`**)
     - FreeBSD (`os__FreeBSD`)
     - Nokia-based OS (`os__Nokia`)
       - this would have to be a mobile (phone or tablet) operating system, since Nokia does not offer a desktop-based operating system
   - interacting with the store site (`hits`)
   - reaching the store site through a paid search (`medium__cpc`)
2. For medium propensity visitors, the most important first visit features were
   - using the Mozilla Firefox browser (`os__Firefox OS`)
   - using google search (`source__google`) as the [source](https://support.google.com/analytics/answer/1033173?hl=en) in order to access the store site
   - using non-desktop operating systems
     - Windows Phone (`os__Windows Phone`)
     - Chrome OS (`os__Chrome OS`) (**third highest combination of KPIs by `os`**)
     - Firefox OS (`os__Firefox OS`)
   - used an affiliate (`medium__affiliate`) - referring website with a personal connection to the visitor - to access the store site
   - bouncing from the site (`bounced__1`, or `bounced__True`) (**average bounce rate among predicted low propensity visitors is approx. 33%**)
3. For low propensity visitors, the most important features (of their first visit) were
   - bouncing from the site (`bounced__1`, or `bounced__True`) (**average bounce rate among predicted low propensity visitors is approx. 33%**)
   - using a Mac-based OS (`os__MacIntosh`) (**highest combination of KPIs by `os`**)
   - used an affiliate ([1](https://support.google.com/analytics/thread/21925739?hl=en&msgid=21929096), [2](https://support.google.com/analytics/thread/21925739/what-is-the-difference-between-affiliate-and-referral-traffic?hl=en)) (`medium__affiliate`) or [undetermined](https://www.owox.com/blog/use-cases/not-set-in-google-analytics/) (`medium__(not set)`) medium to access the store site
:::

### Profile by Audience Group

Show feature importances per audience group

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_feats_imp.head(3))

#### Table

Show the audience group profiles

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_profiles)

::: {.callout-tip title="Observations"}
1. High priority (stronger recommendations)
   - Visitors predicted to have a low propensity to make a return purchase
     - (`Revenue (Mean)`) spent more (higher revenue) on average during their first visit
     - (`Last Action Completed Purchase`) completed a purchase at the end of their first visit more often
     - (`Last Action Check Out`) checked out at the end of their first visit more often

     than visitors in the other two groups. We might want to offer stronger discounts or offers (**be more aggressive**) to the low propensity group as part of the campaign in order to prompt them to make a purchase on a return visit to the store (i.e. to maximize campaign response and ROI).
   - (`Weekend Visitors`) There were more high propensity visitors predicted to make a return purchase who had their first visit on a weekend than medium and low propensity visitors. During the campaign (after the first visit), offers or discounts could be targeted at low > medium > high propensity visitors on weekends, in terms of priority, where low propensity visitors have the highest priority. We could offer more loyalty points on weekends to high propensity visitors, with fewer to medium and even less to low propensity visitors.
   - (`Last Action Removed Products From Cart`) More visitors predicted to have a high propensity to make a return purchase removed an item from the shopping cart at the end of their first visit than visitors in other two groups.
     - we should offer discounts on non-shopping cart items to low and medium propensity visitors (to convince them to remove items in their cart as was done by high-propensity visitors, and add items that are discounted, in the hopes they will purchase the new items in the cart)
     - we should offer
       - discounts on shopping cart items
       - recommendations for similar products to shopping cart items

       to high propensity visitors to convince them to purchase these during a return visit
2. Low priority (weaker recommendations)
   - (`Last Action Added To Cart`) Visitors predicted to have a medium propensity to make a return purchase added an item to their shopping cart at the end of their first visit more often than visitors in the other two groups.
:::

### Daily Summary

Show the daily summary data

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_daily_summary.head(3))

#### Chart

Show plot

In [None]:
ytitle_daily = {
    "bounce_rate": "Bounce rate was lowest during 2016 holiday season",
    "product_clicks_rate": "Consistent product CTR except for September",
    "add_to_cart_rate": (
        "Rate of adding product(s) to cart was consistently highest during "
        "2016 holiday season"
    ),
    "time_on_site": "Consistent time spent on store site",
    "revenue": (
        "Maximum first-visit revenue during holiday season and consistently "
        "low during 2016 September & October"
    ),
}
daily_legend_params = dict(
    direction="horizontal", orient="bottom", titleAnchor="start"
)

In [None]:
for k, yvar in enumerate(list(ytitle_daily)):
    metric = yvar.replace("_", " ").title()
    chart = dh.plot_time_dependent_scatter_chart(
        df_daily_summary,
        yvar = yvar,
        line_thickness = 0.5,
        color_by_col = "maudience",
        ptitle_str=ytitle_daily[yvar],
        axis_title_fontsize = 16,
        axis_label_fontsize = 14,
        axis_label_angle = -25,
        axis_tick_label_color = '#757575',
        marker_order = ["Development", "Inference"],
        marker_colors = ['#bdbdbd', '#e13128'],
        marker_size = 100,
        show_title = True,
        show_legend = True if k == 0 else '',
        tooltip = [
            alt.Tooltip('maudience', title='Audience group'),
            alt.Tooltip('date:T', title='Date'),
            alt.Tooltip(f"{yvar}:Q", title=metric, format=",.2f"),
        ],
        legend_params={
            "scatter": daily_legend_params,
            "area": daily_legend_params,
            "line": daily_legend_params,
        },
        ci_level=0.95,
        fig_size = dict(width=800, height=150),
    )
    display(chart)
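
The charts above pass `ci_level=0.95` to `dh.plot_time_dependent_scatter_chart`. How that helper computes its band is not shown in this section, so the following is only a sketch of one common choice: a normal-approximation confidence interval around the mean of the development-period values, followed by a check of which inference-period days fall inside it. The function name `normal_ci` and the toy data are assumptions.

```python
from statistics import NormalDist

import pandas as pd


def normal_ci(values: pd.Series, ci_level: float = 0.95) -> tuple:
    """Return a normal-approximation CI for the mean of `values`."""
    # Two-sided critical value, e.g. about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + ci_level / 2)
    mean = values.mean()
    # Standard error of the mean, using the sample standard deviation
    sem = values.std(ddof=1) / len(values) ** 0.5
    return mean - z * sem, mean + z * sem


# Hypothetical daily bounce rates (%); values are illustrative only
df_daily = pd.DataFrame(
    {
        "maudience": ["Development"] * 6 + ["Inference"] * 2,
        "bounce_rate": [40.0, 42.0, 38.0, 41.0, 39.0, 40.0, 40.5, 55.0],
    }
)

dev = df_daily.query("maudience == 'Development'")["bounce_rate"]
lower, upper = normal_ci(dev, ci_level=0.95)

# Flag inference-period days whose value lies inside the development band
infer = df_daily.query("maudience == 'Inference'").copy()
infer["within_ci"] = infer["bounce_rate"].between(lower, upper)
```

A band around the mean is narrow by construction; a band meant to cover individual daily values would instead be based on the spread of the values themselves.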

### Daily Summary by Audience Group

Show the daily summary data by audience group

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_daily_summary_aud)

#### Chart

Show plot

In [None]:
ytitle_daily_aud = {
    "bounce_rate": (
        "Bounce rate for all audience groups within non-holiday range during "
        "development"
    ),
    "product_clicks_rate": (
        "Product CTR for all audience groups within 95% c.i. during "
        "development"
    ),
    "add_to_cart_rate": (
        "Add-to-cart rate for all audience groups within 95% c.i. during "
        "development"
    ),
    "time_on_site": (
        "Average time spent on store site for all audience groups within 95% "
        "c.i. during development"
    ),
    "revenue": (
        "First-visit revenue for all audience groups within post-holiday "
        "range"
    ),
}
daily_aud_legend_params = dict(
    direction="horizontal", orient="bottom", titleAnchor="start"
)

In [None]:
for k, yvar in enumerate(list(ytitle_daily_aud)):
    metric = yvar.replace("_", " ").title()
    chart = dh.plot_time_dependent_scatter_chart(
        df_daily_summary_aud,
        yvar = yvar,
        line_thickness = 0.5,
        color_by_col = "maudience",
        ptitle_str = ytitle_daily_aud[yvar],
        axis_title_fontsize = 16,
        axis_label_fontsize = 14,
        axis_label_angle = -25,
        axis_tick_label_color = '#757575',
        marker_order = ["Low", 'High', "Medium", 'Development'],
        marker_colors = ['#fbb4ae', 'darkred', '#c7e9c0', '#bdbdbd'],
        marker_size = 100,
        show_title = True,
        show_legend = True if k == 0 else '',
        tooltip = [
            alt.Tooltip('maudience', title='Audience group'),
            alt.Tooltip('date:T', title='Date'),
            alt.Tooltip(f"{yvar}:Q", title=metric, format=",.2f"),
        ],
        legend_params={
            "scatter": daily_aud_legend_params,
            "area": daily_aud_legend_params,
            "line": daily_aud_legend_params,
        },
        ci_level=0.95,
        fig_size = dict(width=800, height=150),
    )
    display(chart)

::: {.callout-tip title="Observations"}
1. First-visit bounce rate is at its lowest during the lead-up to the Christmas holiday season and can be largely attributed to holiday shopping.
2. The following first-visit attributes
   - product CTR
   - add-to-cart rate
   - revenue
   - average time spent on the store site

   are consistently highest during the holiday season, and this too can be largely attributed to increased visitor interaction with the store due to holiday shopping.
:::

### Cohorts

#### Table

Show inference data with audience and cohort groups assigned

In [None]:
with pd.option_context('display.max_columns', None):
    display(df_dev_cohorts.head())

::: {.callout-note title="Notes"}
1. The table shows the following for all first-time visitors to the store during the inference (production) period
   - inference month (`infer_month`)
   - attributes (characteristics) of the first visit (some of these are features used by the ML model)
   - the ML model's prediction of such visitors' propensity to make a purchase on a return visit to the store (hard - `predicted_score_label` - and soft - `score` - predictions)
   - assigned audience group (`maudience`), based on specified audience strategy (single or multiple audience groups, `audience_strategy`)
   - audience test and control cohorts (`cohort`).
2. The dashboard will contain a button to download this table as an `.xlsx` file, created using [Python libraries for working with MS Excel files](https://www.python-excel.org/).
3. This table is the main deliverable for this project.
:::
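
The sketch below illustrates one way the `maudience` and `cohort` columns described in the notes above could be derived from `score`: tercile binning for a multiple-group strategy, then a random test/control split within each group. The binning rule, the 50/50 split, the seed and the function name are assumptions for illustration; the project's actual audience strategy and cohort assignment are not shown in this section.

```python
import numpy as np
import pandas as pd


def assign_audience_and_cohort(
    df: pd.DataFrame, test_frac: float = 0.5, seed: int = 0
) -> pd.DataFrame:
    """Add hypothetical `maudience` and `cohort` columns based on `score`."""
    out = df.copy()
    # Assumed rule: split visitors into three equal-sized groups by score
    out["maudience"] = pd.qcut(
        out["score"], q=3, labels=["Low", "Medium", "High"]
    ).astype(str)
    # Randomly assign a test or control cohort within each audience group
    rng = np.random.default_rng(seed)
    out["cohort"] = "Control"
    for _, idx in out.groupby("maudience").groups.items():
        n_test = int(round(test_frac * len(idx)))
        test_idx = rng.choice(idx.to_numpy(), size=n_test, replace=False)
        out.loc[test_idx, "cohort"] = "Test"
    return out


# Hypothetical soft predictions for 12 first-time visitors
df_scores = pd.DataFrame({"score": np.linspace(0.05, 0.95, 12)})
df_assigned = assign_audience_and_cohort(df_scores)
```

Splitting within each audience group, rather than across the whole table, keeps the test and control cohorts balanced for every propensity level.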

Export the `DataFrame` with inference data, including the audience cohorts, to an XLSX file with the following formatting

1. columns that are sufficiently wide to show the longest value in any row
2. bold header row
3. centered cell contents

In [None]:
#| echo: true
xlh.export_df_to_formatted_spreadsheet(
    df_dev_cohorts,
    os.path.join(processed_data_dir, "Audience_cohorts_predictions.xlsx"),
)
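
The implementation of `xlh.export_df_to_formatted_spreadsheet` is not shown in this section. As a hedged sketch of how the three formatting rules listed above could be applied, the example below uses `pandas` with the `openpyxl` engine. The function names `column_widths` and `export_formatted`, the padding of two characters and the sheet name are assumptions.

```python
import pandas as pd


def column_widths(df: pd.DataFrame, padding: int = 2) -> dict:
    """Return a width per column that fits its header and longest value."""
    return {
        col: int(max(len(str(col)), df[col].astype(str).str.len().max()))
        + padding
        for col in df.columns
    }


def export_formatted(df: pd.DataFrame, path: str, sheet: str = "Sheet1") -> None:
    """Write `df` to XLSX with wide columns, bold header, centered cells."""
    # Imported here so that `column_widths` works without openpyxl installed
    from openpyxl.styles import Alignment, Font
    from openpyxl.utils import get_column_letter

    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        df.to_excel(writer, index=False, sheet_name=sheet)
        ws = writer.sheets[sheet]
        # 1. Widen each column to fit its longest value
        for i, width in enumerate(column_widths(df).values(), start=1):
            ws.column_dimensions[get_column_letter(i)].width = width
        # 3. Center the contents of every cell
        for row in ws.iter_rows():
            for cell in row:
                cell.alignment = Alignment(horizontal="center")
        # 2. Make the header row bold
        for cell in ws[1]:
            cell.font = Font(bold=True)


# Toy data; the real table is `df_dev_cohorts`
df_demo = pd.DataFrame({"maudience": ["High", "Low"], "score": [0.91, 0.1]})
widths = column_widths(df_demo)
```

Calling `export_formatted(df_demo, "demo.xlsx")` would then write the formatted file, provided `openpyxl` is installed.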

::: {.callout-note title="Notes"}
1. The worksheet formatting makes the contents of this file easier for the end user (business client) to read.
:::

## Recommendations

Segmenting first-time visitors into audience groups, as mentioned in the project scope, allows greater flexibility in how customized the campaign can be. In order to best spend the available marketing budget, a different marketing approach can be used with a visitor who is predicted to have a high, medium or low propensity to make a return purchase. We will now make recommendations for more personalized marketing.

### Based on ML Model's Most Important Features

::: {.callout-note title="Notes"}
Based on
1. discoveries made from exploring the data
2. the ML model's most important features for predicting whether a visitor will make a return purchase

we should target the following visitor profile to maximize campaign response
1. High propensity visitors
   - used an uncommonly used browser
   - used one of the following operating systems
     - Windows
     - FreeBSD
     - Nokia-based OS
   - interacted with content on the store site
   - reached the store site through a paid search

   with an emphasis on Windows users
2. Medium propensity visitors
   - used the Mozilla Firefox browser
   - used Google search to access the store site
   - used non-desktop operating systems
     - Windows Phone
     - Chrome OS
     - Firefox OS
   - used an affiliate referring website with a personal connection to the visitor to access the store site
   - bounced from the site during their first visit

   with an emphasis on
   - Android users (Chrome OS)
   - visitors who did not bounce from the site during their first visit
3. Low propensity visitors
   - bounced from the site during their first visit
   - used a Mac-based operating system
   - used an affiliate or undetermined medium to access the store site

   with an emphasis on visitors who used a Mac-based operating system to access the store site during their first visit

where the recommended emphasis is based on factors that produced a good combination of KPIs (CTR and conversion rate) among first-time visitors in the closest available month of historical data.
:::
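
The recommended emphasis above relies on a "good combination of KPIs (CTR and conversion rate)". How the two KPIs were combined is not specified in this section, so the sketch below assumes one simple option: ranking categories by the average of their CTR rank and conversion-rate rank. The column names, the KPI definitions (per first-time visitor) and the toy counts are hypothetical.

```python
import pandas as pd

# Hypothetical first-visit counts per operating system
df_os = pd.DataFrame(
    {
        "os": ["Macintosh", "Windows", "Chrome OS", "Linux"],
        "visitors": [1000, 2000, 500, 400],
        "product_clicks": [300, 400, 75, 40],
        "purchases": [30, 40, 5, 6],
    }
).set_index("os")


def rank_by_kpi_combination(df: pd.DataFrame) -> pd.DataFrame:
    """Rank categories by the mean of their CTR and conversion-rate ranks."""
    out = pd.DataFrame(
        {
            # Both KPIs expressed as a percentage of first-time visitors
            "ctr": 100 * df["product_clicks"] / df["visitors"],
            "conversion_rate": 100 * df["purchases"] / df["visitors"],
        }
    )
    # Rank 1 = best for each KPI, then average the two ranks (assumed rule)
    ranks = out.rank(ascending=False)
    out["combined_rank"] = ranks.mean(axis=1)
    return out.sort_values("combined_rank")


df_kpis = rank_by_kpi_combination(df_os)
best_os = df_kpis.index[0]
```

With these toy counts, the category with the best combined rank is the one that would receive the emphasis within its audience group.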

### Based on Observed Behavior During Production (Inference) Period

Show the average number of promotions viewed by high propensity visitors in historical data among first-time visitors

In [None]:
#| output: true
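# Mean number of promotions displayed to visitors in the high propensity group
avg_promos_disp_high_prop = df_dev_cohorts.query("maudience == 'High'")[
    "promos_displayed"
].mean()
# This is an average count, so it is printed without a percent sign
print(
    "Average number of promotions viewed by visitors in high propensity "
    f"group = {avg_promos_disp_high_prop:,.2f}"
)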

::: {.callout-note title="Notes"}
Based on visitor behavior during their first visit to the store, we should include the following actions to maximize campaign response
1. Low propensity audience
   - provide stronger discounts or offers (be more aggressive)
   - offer the strongest discounts on weekends
   - offer discounts on non-shopping cart items
2. High propensity audience
   - offer the weakest discounts on weekends
   - offer discounts on shopping cart items and recommendations for similar products to shopping cart items
3. Medium propensity audience
   - offer intermediate discounts on weekends (weaker than for the low propensity audience and stronger than for the high propensity audience)
:::

## Next Step

The next step will delete all MLflow-related resources created for this project.