# Upload Audience Data to BigQuery

In [1]:
%load_ext autoreload
%autoreload 2

::: {.content-hidden}
Import necessary Python modules
:::

In [2]:
import os
import sys
from calendar import day_name, month_name
from datetime import datetime
from glob import glob

import pandas as pd
import pytz
from google.cloud import bigquery
from google.oauth2 import service_account

::: {.content-hidden}
Get relative path to project root directory
:::

In [3]:
PROJ_ROOT_DIR = os.path.join(os.pardir)
src_dir = os.path.join(PROJ_ROOT_DIR, "src")
sys.path.append(src_dir)

::: {.content-hidden}
Import custom Python modules
:::

In [4]:
%aimport bigquery_auth_helpers
from bigquery_auth_helpers import auth_to_bigquery

%aimport bigquery_upload_helpers
import bigquery_upload_helpers as bquh

%aimport sql_helpers
import sql_helpers as sqlh

%aimport transform_helpers
import transform_helpers as trh

::: {.content-hidden}
Define helper function to show datatypes and number of missing values for all columns in a `DataFrame`
:::

In [5]:
def summarize_df(df: pd.DataFrame) -> None:
    """Show datatypes and count missing values in columns of DataFrame."""
    display(
        df.dtypes.rename("dtype")
        .to_frame()
        .merge(
            df.isna().sum().rename("missing").to_frame(),
            left_index=True,
            right_index=True,
            how="left",
        )
        .reset_index()
        .rename(columns={"index": "column"})
    )
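
As a rough illustration of what `summarize_df` reports, the sketch below builds the same two-column summary on a toy `DataFrame` and returns it instead of calling `display`. The function name `summarize_df_frame` and the toy data are hypothetical, not part of the project's helpers.

```python
import pandas as pd


def summarize_df_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and missing-value count per column (hypothetical variant)."""
    return (
        # both Series are indexed by column name, so they align row by row
        pd.DataFrame({"dtype": df.dtypes, "missing": df.isna().sum()})
        .rename_axis("column")
        .reset_index()
    )


# toy data with one missing value in `hits` and two in `browser`
toy = pd.DataFrame(
    {"hits": [1, 2, None], "browser": ["Chrome", None, None]}
)
summary = summarize_df_frame(toy)
print(summary)
```

Returning the frame rather than displaying it makes the summary easier to check in a script or a test.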

## About

## Overview

This step will upload the following to separate tables in a private dataset on Google BigQuery:

1. predicted audience groups and (test and control) cohorts<sup>[1](#myfootnote1)</sup>
2. audience profiles<sup>[1](#myfootnote1)</sup>
3. monthly performance<sup>[1](#myfootnote1)</sup>, relative to previous months
4. summary of required sample (cohort) sizes that were estimated in a previous step and metrics per cohort, based on desired inputs (combination of uplift, power, confidence level)

These tables will be accessed by the client-facing dashboard.

<a name="myfootnote1">1</a>: using inference data

## Order of Operations
This step can be run prospectively at the end of the inference period, just before the start of the campaign, once all the inference data (first-time visitors to the store) is available. It must be completed before the project's dashboard (the main project deliverable) can be created, since the dashboard is populated from the BigQuery tables uploaded in this step.

## Outputs
The following tables will be created:

1. Combined Audience Profile and Feature Importances
   - audience profile consisting of the following for each predicted audience group
     - audience strategy
     - name of descriptive or behavioral statistic about attribute of first visit
     - value of statistic in
       - High propensity group of visitors
       - Medium propensity group of visitors
       - Low propensity group of visitors
   - ML feature importances
     - audience strategy
     - number of observations used to learn feature importances
     - feature name
     - audience group
     - feature importance
2. All Available First-Time Visitors and Assigned Audience Groups and Cohorts
   - month during which first-time visitors visited the store site during inference period
     - for development data, no audience or cohorts are assigned
     - for inference predictions, audience groups and randomly selected cohorts are assigned
   - visitor ID from GA360 data
   - visit ID from GA360 data
   - visit number from GA360 data
   - visit start time
   - quarter of year during which visit occurred
   - month of year during which visit occurred
   - day of month during which visit occurred
   - day of week during which visit occurred
   - hour of day during which visit occurred
   - minute of hour during which visit occurred
   - second of minute during which visit occurred
   - traffic source from GA360 data
   - medium from GA360 data
   - channel from GA360 data
   - registered hits from GA360 data
   - bounced visits from GA360 data
   - last action performed during visit from GA360 data
   - number of promotions viewed from GA360 data
   - number of promotions clicked from GA360 data
   - number of product lists viewed from GA360 data
   - number of products clicked from GA360 data
   - pageviews from GA360 data
   - time spent on store site from GA360 data
   - browser used to access store site from GA360 data
   - operating system used to access store site from GA360 data
   - category of device used to access store site from GA360 data
   - whether a product was added to the shopping cart during first visit from GA360 data
   - revenue during first visit from GA360 data
   - predicted propensity for making a purchase during a return visit
   - predicted audience group
   - predicted cohort (test or control)
   - audience strategy (single- or multiple-audience)
   - true outcome (whether a purchase was made on a return visit to the store)
     - missing for inference data
   - type of data (development - `train_val`/`test` - or inference - `infer`)
4. Monthly Summary Statistics (Overall)
   - month during which first-time visitors visited the store site during inference period
   - type of data (development - `train_val`/`test` - or inference - `infer`)
   - total number of return purchasers
   - total revenue
   - total number of first-time visitors
   - total pageviews
   - average time spent on the store site
   - most popular channel
   - most popular category of device
   - most popular browser
   - most popular operating system
   - audience strategy (single- or multiple-audience)
   - bounce rate
   - product click rate
   - add-to-cart rate
   - type of visitor (development or inference)
   - conversion rate
   - rate of change in the following relative to the previous month
     - total number of return purchasers (development) or first-time visitors (inference)
     - total revenue
     - total pageviews
     - average time spent on the store site
     - bounce rate
     - conversion rate
     - product click rate
     - add-to-cart rate
5. Conversion Rates During Development and Inference
   - over the period covering the development and inference data
     - audience strategy (single- or multiple-audience)
     - month during which first-time visitors visited the store site during inference period
     - audience group (e.g. High, Medium, Low predicted propensity)
     - predicted conversions
     - total number of visitors per audience group
     - minimum score
     - true number of conversions per audience group
     - type of visitor (development or inference)
     - total number of each type of visitor
     - true conversion rate per audience group
     - overall true conversion rate
     - predicted conversion rate per audience group
     - overall predicted conversion rate
6. Estimated and Actual Fractions of Cohort to Audience Size
   - over the period covering the development and inference data
     - audience strategy (single- or multiple-audience)
     - month during which first-time visitors visited the store site during inference period
     - audience group (e.g. High, Medium, Low predicted propensity)
     - cohort (test or control)
     - cohort size
     - audience size
     - required uplift, power and confidence level used to estimate cohort size
     - ratio of cohort to audience sizes
     - type of cohort size
       - development (estimated based on required uplift, power and confidence level)
       - inference (assigned using estimates during development)
     - type of visitor (development or inference)
     - total number of each type of visitor
7. Aggregated Conversion Rates During Development and Inference (by Audience Group & Overall)
   - over the period covering the development and inference data
     - audience group (e.g. High, Medium, Low predicted propensity)
     - data_type (development or inference)
     - true or predicted conversion rate
8. Daily Summary Statistics by Audience Group and Overall
   - over the period covering the development and inference data
     - total number of return purchasers
     - total revenue
     - total number of first-time visitors
     - total number of add-to-cart actions performed
     - total pageviews
     - average time spent on the store site
     - total product lists viewed
     - total products clicked on
     - total number of bounce events
     - audience strategy (single- or multiple-audience)
     - bounce rate
     - product click rate
     - add-to-cart rate
     - type of visitor (development or inference)
     - aggregation type (by audience group or overall)
9. KPIs by categorical feature
   - over the period covering the test split of the development data, show the following by sub-category
     - categorical feature
     - categorical feature value (sub-category)
     - number of conversions
     - number of visitors
     - average conversion rate
     - average clickthrough rate
     - total number of conversions
     - total number of clicks
     - total product lists viewed
     - total products clicked on
     - total revenue
     - total number of visitors 
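
The "rate of change relative to the previous month" entries in the monthly summary table can be read as a percentage change between consecutive months. The sketch below shows that arithmetic under this assumption. The exact definition used by the project is not shown in this notebook, and the function name and revenue figures are hypothetical.

```python
import math


def pct_change(current: float, previous: float) -> float:
    """Percentage change from `previous` to `current` (NaN if previous is 0)."""
    if previous == 0:
        return math.nan
    return 100 * (current - previous) / previous


# hypothetical monthly revenue, in calendar order
monthly_revenue = {"January": 1000.0, "February": 1200.0, "March": 900.0}

months = list(monthly_revenue)
changes = {
    months[i]: pct_change(
        monthly_revenue[months[i]], monthly_revenue[months[i - 1]]
    )
    for i in range(1, len(months))
}
print(changes)
```

The first month has no predecessor, so it gets no rate of change in this sketch.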

## User Inputs

Define the following:

1. best MLflow run ID
2. list of categorical features
3. list of numerical features
4. BigQuery
   - dataset id
   - table ids for audience
     - cohorts
     - profile
5. dictionary to map profile statistic type to description
6. inputs (uplift, power, confidence level) for which random cohort sizes were estimated during an earlier step
7. type of audience strategy (single- or multi-group) from which cohorts were created during an earlier step

In [6]:
#| echo: true
# 1. best MLflow run ID
best_run_id = "6a38156c7fbb4c289d7e4d1ba6b149b5"

# 2. categorical columns
categorical_features = [
    "bounces",
    "last_action",
    "source",
    "medium",
    "channelGrouping",
    "browser",
    "os",
    "deviceCategory",
]

# 3. numerical columns
numerical_features = [
    "hits",
    "promos_displayed",
    "promos_clicked",
    "product_views",
    "product_clicks",
    "pageviews",
    "time_on_site",
]

# 4. GCP resources
gbq_dataset_id = 'mydemo2asdf'
gbq_table_id_cohorts = 'audience_cohorts'
gbq_table_id_profiles = 'audience_profiles'
gbq_table_feats_imps = 'audience_feats_imp'
gbq_table_id_summary = 'monthly_summary'
gbq_table_id_sa_fracs = "cohort_audience_fractions"
gbq_table_id_conv_rates = "audience_conversion_rates"
gbq_table_id_conv_rates_agg_combo = "conversion_rates_aggregated"
gbq_table_id_daily_perf = "daily_summary"
gbq_table_id_cat_feats_kpis = "categorical_features_kpis"

# 5. dictionary to map statistic type to description
stat_type_desc_mapper_dict = {
    "behavior": "Behavioral",
    "mean": "Descriptive statistics",
    "mode": "Descriptive statistics",
    "max": "Descriptive statistics",
    "feature_importance": "Most important ML features",
}

# 6. inputs used to estimate sample sizes
wanted_inputs = {
    "uplift_percentage": 10,
    "power_percentage": 55,
    "confidence_level_percentage": 55,
}

# 7. type of audience strategy to use when creating groups
audience_strategy = 1

::: {.content-hidden}
Get path to data sub-folders
:::

In [7]:
data_dir = os.path.join(PROJ_ROOT_DIR, "data")
raw_data_dir = os.path.join(data_dir, "raw")
processed_data_dir = os.path.join(data_dir, "processed")
gcp_keys_dir = os.path.join(PROJ_ROOT_DIR, "gcp_keys")

::: {.content-hidden}
Load Google Cloud authentication credentials for use with the native BigQuery Python client
:::

In [8]:
gcp_proj_id = os.environ["GCP_PROJECT_ID"]
gcp_creds_fpath = glob(os.path.join(gcp_keys_dir, "*", "*.json"))[0]
gcp_creds = service_account.Credentials.from_service_account_file(
    gcp_creds_fpath
)

::: {.content-hidden}
Get fully resolved names of the BigQuery tables
:::

In [9]:
gbq_table_fully_resolved_cohorts = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_cohorts}"
gbq_table_fully_resolved_profiles = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_profiles}"
gbq_table_fully_resolved_feats_imp = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_feats_imps}"
gbq_summary_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_summary}"
gbq_sa_fracs_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_sa_fracs}"
gbq_conv_rates_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_conv_rates}"
gbq_conv_rates_combo_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_conv_rates_agg_combo}"
gbq_daily_perf_combo_table_id_fully_resolved = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_daily_perf}"
gbq_table_fully_resolved_cat_feat_kpis = f"{gcp_proj_id}.{gbq_dataset_id}.{gbq_table_id_cat_feats_kpis}"
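
All of the fully resolved names above follow the same `project.dataset.table` pattern. As a minimal sketch, the same names could be built from a dictionary with a small helper. The helper `resolve_table_id`, the dictionary keys and the project ID `my-project` are hypothetical and only illustrate the pattern.

```python
def resolve_table_id(project_id: str, dataset_id: str, table_id: str) -> str:
    """Build a fully resolved BigQuery table name: project.dataset.table."""
    return f"{project_id}.{dataset_id}.{table_id}"


# a subset of the table IDs defined in the user inputs
table_ids = {
    "cohorts": "audience_cohorts",
    "profiles": "audience_profiles",
    "summary": "monthly_summary",
}

# "my-project" stands in for the value of the GCP_PROJECT_ID variable
resolved = {
    key: resolve_table_id("my-project", "mydemo2asdf", table_id)
    for key, table_id in table_ids.items()
}
print(resolved["cohorts"])
```

Keeping the names in one dictionary avoids repeating the f-string for every table.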

::: {.content-hidden}
Create authenticated native BigQuery Python client
:::

In [10]:
client = bigquery.Client(project=gcp_proj_id, credentials=gcp_creds)

::: {.content-hidden} 
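Get filepaths to audience

1. feature importances
2. profiles
3. cohorts
4. development (processed) data
5. cohort-to-audience fractions
6. conversion rates
7. sample sizes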
:::

In [11]:
(
    fpath_feat_imps,
    fpath_profile,
    fpath_cohorts,
    fpath_development,
    fpath_sa_frac,
    fpath_conv_rates,
) = [
    # sort matches so that [-1] picks the same (last-sorted) file on every run
    sorted(
        glob(
            os.path.join(
                processed_data_dir, f"{prefix}_{best_run_id}__*.parquet.gzip"
            )
        )
    )[-1]
    for prefix in [
        "audience_profiles_feature_importances__run",
        "audience_profiles__run",
        'audience_cohorts__run',
        "processed_data__run",
        "cohort_audience_fractions__run",
        "audience_development_inference_conversion_rates__run",
    ]
]
fpath_aud_sizes = glob(
    os.path.join(
        processed_data_dir, f"audience_sample_sizes__run_{best_run_id}.parquet.gzip"
    )
)[-1]
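
`glob` returns matching paths in arbitrary order, so taking the last element only selects the newest file when the matches are sorted. The sketch below shows one way to do this deterministically, assuming the part of the filename matched by `*` sorts chronologically (for example, a timestamp). The helper `latest_match` and the filenames are hypothetical.

```python
import os
import tempfile
from glob import glob


def latest_match(pattern: str) -> str:
    """Return the lexicographically last path that matches `pattern`."""
    matches = sorted(glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No files match {pattern!r}")
    return matches[-1]


with tempfile.TemporaryDirectory() as tmp_dir:
    # create two empty files whose names differ only by a timestamp suffix
    for stamp in ["20230102_090000", "20230101_120000"]:
        fname = f"audience_profiles__run_abc__{stamp}.parquet.gzip"
        open(os.path.join(tmp_dir, fname), "w").close()
    pattern = os.path.join(tmp_dir, "audience_profiles__run_abc__*.parquet.gzip")
    latest = os.path.basename(latest_match(pattern))

print(latest)
```

Raising an error when nothing matches gives a clearer message than the `IndexError` produced by indexing an empty list.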

::: {.content-hidden}
Get inference data month from filepath
:::

In [12]:
#| output: false
infer_month_profile = fpath_profile.partition("infer_month_")[2].split("__")[0]
infer_month_feats = fpath_feat_imps.partition("infer_month_")[2].split("__")[0]
infer_month_cohort = fpath_cohorts.partition("infer_month_")[2].split("__")[0]
infer_month_frac = fpath_sa_frac.partition("infer_month_")[2].split("__")[0]
infer_month_conv_rates = fpath_conv_rates.partition("infer_month_")[2].split("__")[0]
try:
    assert infer_month_feats == infer_month_profile
    assert infer_month_cohort == infer_month_profile
    assert infer_month_frac == infer_month_profile
    assert infer_month_conv_rates == infer_month_profile
    print("Got same inference period from audience profile, feature importances and cohort")
except AssertionError as e:
    print(
        f"{str(e)}Did not get same inference period from audience profile, "
        "feature importances, cohorts, cohort fractions and conversion rates"
    )

Got same inference period from audience profile, feature importances and cohort
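
The month is extracted from each filepath with `str.partition` followed by `str.split`. The sketch below applies the same expression to a hypothetical filename (the real filenames are not shown in this notebook) and also shows what happens when the `infer_month_` marker is absent.

```python
def infer_month_from_path(fpath: str) -> str:
    """Return the text between 'infer_month_' and the next '__' in a filepath."""
    return fpath.partition("infer_month_")[2].split("__")[0]


# hypothetical filename that follows the naming pattern assumed above
example = (
    "audience_profiles__run_abc123__infer_month_March__20230401_000000"
    ".parquet.gzip"
)
month = infer_month_from_path(example)

# without the marker, partition() returns an empty tail, so the result is ''
missing = infer_month_from_path("audience_profiles__run_abc123.parquet.gzip")

print(repr(month), repr(missing))
```

If the marker were missing from every filepath, all extracted values would be empty strings and would still compare equal, so an additional check that the month is non-empty may be worthwhile.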


## Get Data

### Feature Importances

In [13]:
#| echo: true
df_feats = (
    pd.read_parquet(fpath_feat_imps)
    # select only wanted columns
    [['audience_strategy', 'num_observations', 'stat', 'maudience', 'value']]
)
df_feats

Unnamed: 0,audience_strategy,num_observations,stat,maudience,value
0,1,500,browser__other,High,0.765699
1,1,500,os__Nokia,High,0.628597
2,1,500,hits,High,0.618297
3,1,500,medium__cpc,High,0.597859
4,1,500,os__Windows,High,0.520569
5,1,500,os__FreeBSD,High,0.492881
6,1,500,last_action__Click through of product lists,High,0.358966
7,1,500,medium__(not set),High,0.352581
8,1,500,os__Samsung,High,0.240107
9,1,500,promos_displayed,High,0.180898


### Audience Profiles

In [14]:
#| echo: true
df_profile = pd.read_parquet(fpath_profile)
df_profile

Unnamed: 0,stat,stat_type,High,Low,Medium,feature,audience_strategy
0,hour__mean,mean,12.969797269342159,13.0235829540753,13.038344827586206,hour,1
1,day_of_week__mean,mean,3.9863467107985104,3.9991725279271826,4.027034482758621,day_of_week,1
2,source__mode,mode,google,google,google,source,1
3,medium__mode,mode,organic,organic,organic,medium,1
4,channelGrouping__mode,mode,Organic Search,Organic Search,Organic Search,channelGrouping,1
5,last_action__mode,mode,Unknown,Unknown,Unknown,last_action,1
6,browser__mode,mode,Chrome,Chrome,Chrome,browser,1
7,os__mode,mode,Macintosh,Macintosh,Macintosh,os,1
8,deviceCategory__mode,mode,desktop,desktop,desktop,deviceCategory,1
9,hits__mean,mean,6.290580609571093,6.426148117501034,6.3484137931034486,hits,1


### Audience Cohorts

In [15]:
#| echo: true
df_cohorts = pd.read_parquet(fpath_cohorts)

### Audience Sample (Cohort) Size Requirements and Metrics

In [16]:
#| echo: true
df_required_sample_sizes = (
    pd.read_parquet(
        fpath_aud_sizes,
        filters=[
            ('audience_strategy', '=', audience_strategy)
        ]# + [
        #     (k, "=", v)
        #     for k,v in zip(
        #         ['uplift', 'power', 'ci_level'],
        #         list(wanted_inputs.values()),
        #     )
        # ],
    )
    # drop unwanted columns
    .drop(columns=['group_number', 'group_size_proportion'])
)

### ML Development Data

In [17]:
#| echo: true
df_development = pd.read_parquet(fpath_development)
df_development

Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,quarter,month,day_of_month,day_of_week,hour,minute,...,browser,os,deviceCategory,added_to_cart,revenue,made_purchase_on_future_visit,split_type,score,predicted_score_label,predicted_label
0,9087168862193205669,1481619204,1,2016-12-13 00:53:24,4,12,13,3,0,53,...,Chrome,Android,mobile,0,,0,train_val,0.092378,True,0
1,6192138532399050704,1475630300,1,2016-10-04 18:18:20,4,10,4,3,18,18,...,Chrome,Macintosh,desktop,0,,0,train_val,0.190694,True,0
2,9191817357533988982,1476824267,1,2016-10-18 13:57:47,4,10,18,3,13,57,...,Chrome,Macintosh,desktop,3,,0,train_val,0.005013,True,0
3,7461857486231186491,1481823417,1,2016-12-15 09:36:57,4,12,15,5,9,36,...,Chrome,Macintosh,desktop,0,,0,train_val,0.0547,True,0
4,6554145498187044905,1484077523,1,2017-01-10 11:45:23,1,1,10,3,11,45,...,Chrome,Macintosh,desktop,0,,0,train_val,0.279794,True,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133887,2859155514259411479,1486248414,1,2017-02-04 14:46:54,1,2,4,7,14,46,...,Chrome,Android,mobile,0,,0,test,0.058196,True,0
133888,460482791086125299,1486859939,1,2017-02-11 16:38:59,1,2,11,7,16,38,...,Safari,iOS,mobile,0,,0,test,0.028381,True,0
133889,1197241994723568470,1487107604,1,2017-02-14 13:26:44,1,2,14,3,13,26,...,Chrome,Windows,desktop,0,,0,test,0.082862,True,0
133890,91736817749198966,1488142038,1,2017-02-26 12:47:18,1,2,26,1,12,47,...,Chrome,Chrome OS,desktop,0,,0,test,0.005692,True,0


### Conversion Rates in Development and Inference

In [18]:
#| echo: true
df_conv_rates = pd.read_parquet(fpath_conv_rates)

### Estimated and Actual Cohort to Audience Fractions

In [19]:
#| echo: true
df_sa_frac = pd.read_parquet(fpath_sa_frac)

## Transform Data

### Feature Importances

In [20]:
#| echo: true
df_feats = (
    df_feats.assign(
        stat=lambda df: (
            df["stat"].str.replace("__", " = ")
            .str.replace(" = 1", " = True")
            .str.replace("_", " ")
        )
    )
)
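
The chained replacements above turn encoded feature names into readable labels. The sketch below applies the same three literal replacements with plain Python strings to show their effect, including one edge case. The names `bounces__1` and `hits__10` are hypothetical examples, and whether values like the latter occur in this data is not shown here.

```python
def prettify_feature(stat: str) -> str:
    """Apply the same three literal replacements, in the same order."""
    return (
        stat.replace("__", " = ")
        .replace(" = 1", " = True")
        .replace("_", " ")
    )


labels = {
    raw: prettify_feature(raw)
    for raw in ["browser__other", "promos_displayed", "bounces__1", "hits__10"]
}
for raw, label in labels.items():
    print(f"{raw} -> {label}")
```

The last example shows that the `" = 1"` replacement also matches values that merely start with `1`, which would need an anchored pattern if such values can occur.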

### Audience Profile

In [21]:
df_profile_sliced = (
    df_profile
    # get behavioral and (average) descriptive statistics
    .query("(stat_type == 'behavior') | (stat.str.endswith('mean'))")
    # make stat column more reader friendly
    .assign(
        stat_expanded=lambda df: (
            df["stat"]
            .str.replace("__1", " = True")
            .str.replace("__0", " = False")
            .str.replace("__", " = ")
            .str.replace("_", " ")
            .str.replace("= mean", "(Mean)")
            .str.title()
        )
    )
    # drop unwanted columns
    .drop(columns=['stat'])
    .reset_index()
    # rename columns to titlecase
    .rename(columns=str.title)
    # set datatypes
    .astype(
        {
            'Audience_Strategy': pd.Int8Dtype(),
            "Stat_Expanded": pd.StringDtype(),
            "High": pd.Float32Dtype(),
            "Medium": pd.Float32Dtype(),
            "Low": pd.Float32Dtype(),
        }
    )
    # select only wanted columns
    [['Audience_Strategy', 'Stat_Expanded', 'High', 'Medium', 'Low']]
)
df_profile_sliced

Unnamed: 0,Audience_Strategy,Stat_Expanded,High,Medium,Low
0,1,Hour (Mean),12.969797,13.038344,13.023583
1,1,Day Of Week (Mean),3.986347,4.027034,3.999172
2,1,Hits (Mean),6.290581,6.348414,6.426148
3,1,Promos Displayed (Mean),8.506,8.530759,8.667356
4,1,Promos Clicked (Mean),0.0,0.000138,0.0
5,1,Product Views (Mean),23.131706,23.416689,23.020273
6,1,Product Clicks (Mean),0.672321,0.68,0.677975
7,1,Pageviews (Mean),5.33292,5.373931,5.475796
8,1,Revenue (Mean),179.170715,172.959793,197.689194
9,1,Added To Cart (Mean),0.188802,0.199172,0.201765
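
As a quick check of the `stat_expanded` transformation, the sketch below replays the same chain of replacements on two `stat` values using plain Python strings. The function name `expand_stat` is hypothetical; the inputs follow the `<feature>__mean` pattern seen in the profile data.

```python
def expand_stat(stat: str) -> str:
    """Replay the stat_expanded replacements on a single string."""
    return (
        stat.replace("__1", " = True")
        .replace("__0", " = False")
        .replace("__", " = ")
        .replace("_", " ")
        .replace("= mean", "(Mean)")
        .title()
    )


expanded = [expand_stat(s) for s in ["hour__mean", "added_to_cart__mean"]]
print(expanded)
```

Both results match the corresponding `Stat_Expanded` values in the table above.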


### Audience Cohorts

In [22]:
#| echo: true
df_cohorts = (
    df_cohorts
    .assign(made_purchase_on_future_visit=None)
    .assign(split_type='infer')
    .assign(infer_month=infer_month_cohort)
    .astype(
        {
            'split_type': pd.StringDtype(),
            "infer_month": pd.StringDtype(),
            "made_purchase_on_future_visit": pd.BooleanDtype(),
        }
    )
)
col = df_cohorts.pop("infer_month")
df_cohorts.insert(0, col.name, col)
with pd.option_context('display.max_columns', None):
    display(df_cohorts)

Unnamed: 0,infer_month,fullvisitorid,visitId,visitNumber,visitStartTime,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,revenue,score,predicted_score_label,maudience,cohort,audience_strategy,made_purchase_on_future_visit,split_type
0,March,1089841380502674759,1490136731,1,2017-03-21 15:52:11,1,3,21,3,15,52,11,google,organic,Organic Search,1,1,Unknown,0,0,12,0,1,0,Chrome,Windows,desktop,0,,0.962094,False,High,Control,1,,infer
1,March,7602795256647724660,1489465336,1,2017-03-13 21:22:16,1,3,13,2,21,22,16,(direct),(none),Direct,11,0,Unknown,9,0,92,0,11,124,Chrome,Macintosh,desktop,0,,0.9324,False,High,Control,1,,infer
2,March,2420015317690803901,1490906576,1,2017-03-30 13:42:56,1,3,30,5,13,42,56,mall.googleplex.com,referral,Referral,1,1,Unknown,9,0,0,0,1,0,Chrome,Linux,desktop,0,,0.890133,False,High,Control,1,,infer
3,March,9155224690531283117,1488840618,1,2017-03-06 14:50:18,1,3,6,2,14,50,18,google,organic,Organic Search,1,1,Unknown,0,0,12,0,1,0,Chrome,Windows,desktop,0,,0.884398,False,High,Control,1,,infer
4,March,2205802376622271952,1489757220,1,2017-03-17 06:27:00,1,3,17,6,6,27,0,google,organic,Organic Search,1,1,Unknown,0,0,12,0,1,0,Chrome,Android,mobile,0,,0.882623,False,High,Control,1,,infer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21747,March,1484841107186174333,1490739877,1,2017-03-28 15:24:37,1,3,28,3,15,24,37,yahoo,organic,Organic Search,12,0,Product detail views,9,0,27,4,8,1157,Firefox,Macintosh,desktop,0,,0.0,False,Low,,1,,infer
21748,March,8767089865026337607,1488650631,1,2017-03-04 10:03:51,1,3,4,7,10,3,51,youtube.com,referral,Social,3,0,Unknown,9,0,5,0,3,73,Safari,iOS,tablet,0,,0.0,False,Low,,1,,infer
21749,March,1315146560525958752,1490744665,1,2017-03-28 16:44:25,1,3,28,3,16,44,25,siliconvalley.about.com,referral,Referral,4,0,Unknown,9,0,17,0,4,85,Chrome,Windows,desktop,0,,0.0,False,Low,,1,,infer
21750,March,8672023061064502362,1490737601,1,2017-03-28 14:46:41,1,3,28,3,14,46,41,google,organic,Organic Search,4,0,Unknown,9,0,12,0,4,79,Safari,iOS,mobile,0,,0.0,False,Low,,1,,infer


### Conversion Rates in Development and Inference

::: {.content-hidden}
Insert inference month and audience strategy columns
:::

In [23]:
df_conv_rates = (
    df_conv_rates
    .assign(infer_month=[None]*3 + [infer_month_conv_rates]*3)
    .assign(audience_strategy=audience_strategy)
    .astype(
        {
            "infer_month": pd.StringDtype(),
            "audience_strategy": pd.Int8Dtype(),
        }
    )
    .assign(
        data_type=lambda df: df['data_type'].str.replace('_', ' ').str.title()
    )
    .fillna(
        {
            "true_conv_rate": 0,
            "overall_true_conv_rate": 0,
            "true_conversions": 0,
            "pred_conv_rate": 0,
            "total_visitors": 0,
            'min_score': 0,
        }
    )
)
col = df_conv_rates.pop("infer_month")
df_conv_rates.insert(0, col.name, col)
col = df_conv_rates.pop("audience_strategy")
df_conv_rates.insert(0, col.name, col)
df_conv_rates

Unnamed: 0,audience_strategy,infer_month,maudience,pred_conversions,total_visitors,min_score,true_conversions,data_type,data_size,true_conv_rate,overall_true_conv_rate,pred_conv_rate,overall_pred_conv_rate
0,1,,High,1289,6722,0.164389,142,Development,20164,2.112467,2.30609,19.17584,6.392581
1,1,,Low,0,6721,0.0,161,Development,20164,2.395477,2.30609,0.0,6.392581
2,1,,Medium,0,6721,0.031337,162,Development,20164,2.410356,2.30609,0.0,6.392581
3,1,March,High,1408,7251,0.164693,0,Inference,21752,0.0,0.0,19.418011,6.472968
4,1,March,Low,0,7251,0.0,0,Inference,21752,0.0,0.0,0.0,6.472968
5,1,March,Medium,0,7250,0.031879,0,Inference,21752,0.0,0.0,0.0,6.472968
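
The rate columns in the table above appear to be percentages of visitors. This is an inference from the displayed numbers, since the upstream code that computes them is not shown here. The sketch below recomputes the development rows under that assumption.

```python
# counts copied from the Development rows of the table above
dev = {
    "High": {"pred": 1289, "true": 142, "visitors": 6722},
    "Low": {"pred": 0, "true": 161, "visitors": 6721},
    "Medium": {"pred": 0, "true": 162, "visitors": 6721},
}


def rate_pct(conversions: int, visitors: int) -> float:
    """Conversion rate as a percentage of visitors."""
    return 100 * conversions / visitors


data_size = sum(g["visitors"] for g in dev.values())

# per-audience rates for the High group
pred_high = rate_pct(dev["High"]["pred"], dev["High"]["visitors"])
true_high = rate_pct(dev["High"]["true"], dev["High"]["visitors"])

# overall rates use totals across the three groups
overall_true = rate_pct(sum(g["true"] for g in dev.values()), data_size)
overall_pred = rate_pct(sum(g["pred"] for g in dev.values()), data_size)

print(data_size, pred_high, true_high, overall_true, overall_pred)
```

The recomputed values agree with `data_size`, `pred_conv_rate`, `true_conv_rate`, `overall_true_conv_rate` and `overall_pred_conv_rate` in the table to the displayed precision.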


### Conversion Rates Aggregated by Audience Group

In [24]:
df_conv_rates_by_aud = (
    df_conv_rates[['maudience', 'data_type', 'true_conv_rate', 'pred_conv_rate']].melt(
        id_vars=['maudience', 'data_type'],
        value_vars=['true_conv_rate', 'pred_conv_rate'],
        value_name='value',
        var_name='var'
    )
    .assign(var=lambda df: df['var'].str.replace("_conv_rate", ""))
    .astype(
        {
            "maudience": pd.StringDtype(),
            'data_type': pd.StringDtype(),
            "var": pd.StringDtype(),
            "value": pd.Float32Dtype(),
        }
    )
)
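
The `melt` call above reshapes the two rate columns from wide to long format, so that each row holds one rate for one audience group. The sketch below shows the reshaping on a two-row toy `DataFrame`; the values are hypothetical, rounded from the development rows.

```python
import pandas as pd

# toy wide-format data: one row per audience group
wide_df = pd.DataFrame(
    {
        "maudience": ["High", "Low"],
        "data_type": ["Development", "Development"],
        "true_conv_rate": [2.1, 2.4],
        "pred_conv_rate": [19.2, 0.0],
    }
)

long_df = wide_df.melt(
    id_vars=["maudience", "data_type"],
    value_vars=["true_conv_rate", "pred_conv_rate"],
    var_name="var",
    value_name="value",
).assign(
    # keep only "true" or "pred" as the label
    var=lambda df: df["var"].str.replace("_conv_rate", "", regex=False)
)
print(long_df)
```

Each input row becomes two output rows, one per rate, which is a convenient shape for plotting true and predicted rates side by side.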

### Conversion Rates Aggregated Overall

In [25]:
df_conv_rates_overall = (
    df_conv_rates[['data_type', 'overall_true_conv_rate', 'overall_pred_conv_rate', 'data_size']]
    .groupby(['data_type'], as_index=False).mean()
    .melt(
        id_vars=['data_type'],
        value_vars=['overall_pred_conv_rate', 'overall_true_conv_rate', 'data_size'],
        value_name='value',
        var_name='var'
    )
    .assign(var=lambda df: df['var'].str.replace("_", " ").str.title())
    .astype(
        {
            "data_type": pd.StringDtype(),
            "var": pd.StringDtype(),
            "value": pd.Float32Dtype(),
        }
    )
    .assign(maudience=None)
    .astype({"maudience": pd.StringDtype()})
)

### Combine Aggregated Conversion Rates

::: {.content-hidden}
Verify that columns in overall and per-audience conversion rates `DataFrame`s are identical
:::

In [26]:
assert df_conv_rates_by_aud.shape[1] == df_conv_rates_overall.shape[1]
assert list(df_conv_rates_overall[list(df_conv_rates_by_aud)]) == list(df_conv_rates_by_aud)

::: {.content-hidden}
Combine overall and per-audience conversion rates data
:::

In [27]:
df_conv_rates_agg_combo = pd.concat([df_conv_rates_by_aud, df_conv_rates_overall], ignore_index=True)

### Estimated and Actual Cohort to Audience Fractions

::: {.content-hidden}
Insert inference month and audience strategy columns
:::

In [28]:
df_sa_frac = (
    df_sa_frac
    .assign(infer_month=[None]*6 + [infer_month_frac]*6)
    .assign(audience_strategy=audience_strategy)
    .astype(
        {
            "infer_month": pd.StringDtype(),
            "audience_strategy": pd.Int8Dtype(),
        }
    )
)
col = df_sa_frac.pop("infer_month")
df_sa_frac.insert(0, col.name, col)
col = df_sa_frac.pop("audience_strategy")
df_sa_frac.insert(0, col.name, col)
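
The `pop` and `insert` pair used above moves a column to the front of a `DataFrame`, and the same pattern is repeated for several tables in this notebook. A minimal sketch of a reusable helper follows; the name `move_to_front` and the toy data are hypothetical.

```python
import pandas as pd


def move_to_front(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Move `column` to position 0 in place and return the same DataFrame."""
    series = df.pop(column)
    df.insert(0, column, series)
    return df


toy = pd.DataFrame({"maudience": ["High", "Low"], "size": [10, 20]}).assign(
    infer_month="March", audience_strategy=1
)

# same order of operations as above: infer_month first, then audience_strategy
for name in ["infer_month", "audience_strategy"]:
    move_to_front(toy, name)

print(list(toy.columns))
```

Because each call inserts at position 0, the column moved last ends up first.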

::: {.callout-note title="Notes"}

1. The accuracy of the predicted propensities differs between the `inference` and `test_split` datasets. Model performance is worse on the unseen data (`inference`) than on the data seen during development (`test_split`). Audience groups (bins) are created from these predicted propensities, so this model inaccuracy can produce bins of different sizes in the two datasets. A difference in the number of visitors between the `inference` and `test_split` datasets also contributes to differently sized bins. Both factors (inaccuracy and differently sized datasets, see the `data_size` column) are present here, which explains why the following columns
   - predicted conversion rate (`pred_conv_rate`)
   - cohort size (`size`)
   - cohort-to-audience fraction (`sample_to_audience_frac`)

   are different between the test split (during ML model development) and inference (during production).
2. Similarly, the predicted (`pred_conv_rate`) and true (`true_conv_rate`) conversion rates differ for the test split due to the poor accuracy of the ML model's predictions.
:::

### ML Development Data

In [29]:
#| echo: true
df_development = (
    df_development
    .assign(infer_month=None)
    .drop(columns=['predicted_label'])
    .assign(maudience=None)
    .assign(cohort=None)
    .assign(audience_strategy=None)
    .astype(
        {
            "infer_month": pd.StringDtype(),
            "maudience": pd.StringDtype(),
            "cohort": pd.StringDtype(),
            "audience_strategy": pd.Int8Dtype(),
        }
    )
)
col = df_development.pop("infer_month")
df_development.insert(0, col.name, col)
with pd.option_context('display.max_columns', None):
    display(df_development)

Unnamed: 0,infer_month,fullvisitorid,visitId,visitNumber,visitStartTime,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,revenue,made_purchase_on_future_visit,split_type,score,predicted_score_label,maudience,cohort,audience_strategy
0,,9087168862193205669,1481619204,1,2016-12-13 00:53:24,4,12,13,3,0,53,24,google,cpc,Paid Search,6,0,Unknown,9,1,36,0,5,86,Chrome,Android,mobile,0,,0,train_val,0.092378,True,,,
1,,6192138532399050704,1475630300,1,2016-10-04 18:18:20,4,10,4,3,18,18,20,mall.googleplex.com,referral,Referral,6,0,Unknown,18,1,34,0,5,40,Chrome,Macintosh,desktop,0,,0,train_val,0.190694,True,,,
2,,9191817357533988982,1476824267,1,2016-10-18 13:57:47,4,10,18,3,13,57,47,mall.googleplex.com,referral,Referral,21,0,Check out,27,0,78,1,17,640,Chrome,Macintosh,desktop,3,,0,train_val,0.005013,True,,,
3,,7461857486231186491,1481823417,1,2016-12-15 09:36:57,4,12,15,5,9,36,57,mall.googleplex.com,referral,Referral,1,1,Unknown,9,0,0,0,1,0,Chrome,Macintosh,desktop,0,,0,train_val,0.0547,True,,,
4,,6554145498187044905,1484077523,1,2017-01-10 11:45:23,1,1,10,3,11,45,23,google,organic,Organic Search,3,0,Unknown,9,0,12,0,3,20,Chrome,Macintosh,desktop,0,,0,train_val,0.279794,True,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133887,,2859155514259411479,1486248414,1,2017-02-04 14:46:54,1,2,4,7,14,46,54,google,organic,Organic Search,3,0,Unknown,0,0,17,0,3,40,Chrome,Android,mobile,0,,0,test,0.058196,True,,,
133888,,460482791086125299,1486859939,1,2017-02-11 16:38:59,1,2,11,7,16,38,59,google,organic,Organic Search,3,0,Unknown,0,0,12,0,3,50,Safari,iOS,mobile,0,,0,test,0.028381,True,,,
133889,,1197241994723568470,1487107604,1,2017-02-14 13:26:44,1,2,14,3,13,26,44,(direct),(none),Direct,3,0,Unknown,0,0,17,0,3,15,Chrome,Windows,desktop,0,,0,test,0.082862,True,,,
133890,,91736817749198966,1488142038,1,2017-02-26 12:47:18,1,2,26,1,12,47,18,google,organic,Organic Search,3,0,Unknown,0,0,27,0,3,76,Chrome,Chrome OS,desktop,0,,0,test,0.005692,True,,,


### Combine Development and Cohorts Data

::: {.content-hidden}
Verify that columns in cohorts and development `DataFrame`s are identical
:::

In [30]:
assert df_cohorts.shape[1] == df_development.shape[1]
assert list(df_development[list(df_cohorts)]) == list(df_cohorts)

::: {.content-hidden}
Combine development and cohorts data
:::

In [31]:
#| echo: true
df_dev_cohorts = (
    pd.concat([df_development[list(df_cohorts)], df_cohorts])
    .astype(
        {
            "source": pd.CategoricalDtype(),
            "medium": pd.CategoricalDtype(),
            "channelGrouping": pd.CategoricalDtype(),
            "browser": pd.CategoricalDtype(),
            "os": pd.CategoricalDtype(),
            "made_purchase_on_future_visit": pd.BooleanDtype(),
            "split_type": pd.StringDtype(),
        }
    )
)
with pd.option_context('display.max_columns', None):
    display(df_dev_cohorts)

Unnamed: 0,infer_month,fullvisitorid,visitId,visitNumber,visitStartTime,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,revenue,score,predicted_score_label,maudience,cohort,audience_strategy,made_purchase_on_future_visit,split_type
0,,9087168862193205669,1481619204,1,2016-12-13 00:53:24,4,12,13,3,0,53,24,google,cpc,Paid Search,6,0,Unknown,9,1,36,0,5,86,Chrome,Android,mobile,0,,0.092378,True,,,,False,train_val
1,,6192138532399050704,1475630300,1,2016-10-04 18:18:20,4,10,4,3,18,18,20,mall.googleplex.com,referral,Referral,6,0,Unknown,18,1,34,0,5,40,Chrome,Macintosh,desktop,0,,0.190694,True,,,,False,train_val
2,,9191817357533988982,1476824267,1,2016-10-18 13:57:47,4,10,18,3,13,57,47,mall.googleplex.com,referral,Referral,21,0,Check out,27,0,78,1,17,640,Chrome,Macintosh,desktop,3,,0.005013,True,,,,False,train_val
3,,7461857486231186491,1481823417,1,2016-12-15 09:36:57,4,12,15,5,9,36,57,mall.googleplex.com,referral,Referral,1,1,Unknown,9,0,0,0,1,0,Chrome,Macintosh,desktop,0,,0.0547,True,,,,False,train_val
4,,6554145498187044905,1484077523,1,2017-01-10 11:45:23,1,1,10,3,11,45,23,google,organic,Organic Search,3,0,Unknown,9,0,12,0,3,20,Chrome,Macintosh,desktop,0,,0.279794,True,,,,False,train_val
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21747,March,1484841107186174333,1490739877,1,2017-03-28 15:24:37,1,3,28,3,15,24,37,yahoo,organic,Organic Search,12,0,Product detail views,9,0,27,4,8,1157,Firefox,Macintosh,desktop,0,,0.0,False,Low,,1,,infer
21748,March,8767089865026337607,1488650631,1,2017-03-04 10:03:51,1,3,4,7,10,3,51,youtube.com,referral,Social,3,0,Unknown,9,0,5,0,3,73,Safari,iOS,tablet,0,,0.0,False,Low,,1,,infer
21749,March,1315146560525958752,1490744665,1,2017-03-28 16:44:25,1,3,28,3,16,44,25,siliconvalley.about.com,referral,Referral,4,0,Unknown,9,0,17,0,4,85,Chrome,Windows,desktop,0,,0.0,False,Low,,1,,infer
21750,March,8672023061064502362,1490737601,1,2017-03-28 14:46:41,1,3,28,3,14,46,41,google,organic,Organic Search,4,0,Unknown,9,0,12,0,4,79,Safari,iOS,mobile,0,,0.0,False,Low,,1,,infer
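
The combination step above can be illustrated on a much smaller scale. The following is a minimal sketch, assuming two toy `DataFrame`s (`df_dev_toy` and `df_cohorts_toy`, both hypothetical, with illustrative values) that share column names but not column order. It reorders the first frame to match the second, stacks them, and casts to the same kind of pandas dtypes used above.

```python
import pandas as pd

# toy stand-ins for the development and cohorts data (illustrative values only)
df_dev_toy = pd.DataFrame(
    {
        "browser": ["Chrome", "Safari"],
        "made_purchase_on_future_visit": [True, False],
        "split_type": ["train_val", "test"],
    }
)
# same columns, different order, and an outcome that is not yet known
df_cohorts_toy = pd.DataFrame(
    {
        "split_type": ["infer"],
        "browser": ["Firefox"],
        "made_purchase_on_future_visit": [None],
    }
)

df_combined_toy = pd.concat(
    # reorder the development columns to match the cohorts before stacking
    [df_dev_toy[list(df_cohorts_toy)], df_cohorts_toy],
    ignore_index=True,
).astype(
    {
        "browser": pd.CategoricalDtype(),
        # nullable boolean keeps the unknown outcome as <NA> instead of an object
        "made_purchase_on_future_visit": pd.BooleanDtype(),
        "split_type": pd.StringDtype(),
    }
)
print(df_combined_toy.dtypes)
```

Unlike the cell above, this sketch passes `ignore_index=True`, so the combined rows get a fresh `0..n-1` index rather than keeping each frame's original index.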


### Daily Performance Summary in Combined Data, by Audience Group

Aggregate daily performance metrics by audience group

In [32]:
df_aud_hmap = (
    trh.perform_custom_aggregation(
        (
            df_dev_cohorts
            .sort_values(by=['visitStartTime'], ignore_index=True)
            .assign(date=lambda df: pd.to_datetime(df['visitStartTime']).dt.date)
            .assign(maudience=lambda df: df['maudience'].fillna("Development"))
        ),
        groupby_cols=['month', 'date', 'maudience'],
        agg_dict={
            "made_purchase_on_future_visit": ["sum"],
            "revenue": "sum",
            "fullvisitorid": "count",
            "added_to_cart": "sum",
            "pageviews": "sum",
            "time_on_site": "mean",
            "product_views": "sum",
            "product_clicks": "sum",
            "bounces": "sum",
        },
        audience_strategy=df_cohorts['audience_strategy'].unique().tolist()[0],
        column_renamer={
            "made_purchase_on_future_visit_sum": "return_purchasers",
            "revenue_sum": "revenue",
            "added_to_cart_sum": "add_to_cart",
            "fullvisitorid_count": "visitors",
            "product_views_sum": "product_views",
            "product_clicks_sum": "product_clicks",
            "pageviews_sum": "pageviews",
            "bounces_sum": "bounces",
            "time_on_site_mean": "time_on_site",
        },
        visitor_type_mapper=dict(
            zip(
                ['Development', 'High', 'Medium', 'Low'],
                ['return_purchasers', 'all_visitors', 'all_visitors', 'all_visitors']
            )
        ),
        dtypes_out={
            "month": pd.StringDtype(),
            "date": pd.StringDtype(),
            "maudience": pd.StringDtype(),
            "return_purchasers": pd.Int16Dtype(),
            "revenue": pd.Float32Dtype(),
            "visitors": pd.Int16Dtype(),
            "add_to_cart": pd.Int32Dtype(),
            "pageviews": pd.Int32Dtype(),
            "time_on_site": pd.Float32Dtype(),
            "product_views": pd.Int32Dtype(),
            "product_clicks": pd.Int32Dtype(),
            "bounces": pd.Int32Dtype(),
            'audience_strategy': pd.Int8Dtype(),
            "bounce_rate": pd.Float32Dtype(),
            "product_clicks_rate": pd.Float32Dtype(),
            "add_to_cart_rate": pd.Float32Dtype(),
            "visitor_type": pd.StringDtype(),
        },
    )
    .assign(agg_type='audience_group')
    .astype({"agg_type": pd.StringDtype()})
)
df_aud_hmap

Unnamed: 0,month,date,maudience,return_purchasers,revenue,visitors,add_to_cart,pageviews,time_on_site,product_views,product_clicks,bounces,audience_strategy,bounce_rate,product_clicks_rate,add_to_cart_rate,visitor_type,agg_type
0,September,2016-09-01,Development,21,824.710022,719,180,4895,3.083542,53001,892,203,1,28.233658,1.682987,25.034771,return_purchasers,audience_group
1,September,2016-09-02,Development,16,1087.579956,637,125,4314,3.171245,49597,817,193,1,30.298273,1.647277,19.623234,return_purchasers,audience_group
2,September,2016-09-03,Development,3,514.880005,386,80,2349,2.391839,25702,466,124,1,32.124352,1.813088,20.725389,return_purchasers,audience_group
3,September,2016-09-04,Development,7,163.350006,350,66,2145,2.963476,25647,356,114,1,32.57143,1.388077,18.857143,return_purchasers,audience_group
4,September,2016-09-05,Development,11,483.380005,469,89,3034,3.292111,36007,563,167,1,35.607677,1.563585,18.976545,return_purchasers,audience_group
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
269,March,2017-03-30,Low,0,832.950012,336,55,1622,2.727579,7271,186,136,1,40.476189,2.558108,16.369047,all_visitors,audience_group
270,March,2017-03-30,High,0,1356.719971,316,42,1490,3.068249,6316,193,122,1,38.607594,3.055732,13.29114,all_visitors,audience_group
271,March,2017-03-31,Low,0,2896.25,264,28,1429,3.5262,6117,186,85,1,32.196968,3.040706,10.606061,all_visitors,audience_group
272,March,2017-03-31,Medium,0,579.890015,263,56,1273,2.841382,4870,145,90,1,34.220531,2.977413,21.292776,all_visitors,audience_group
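
The source of `trh.perform_custom_aggregation` is not shown in this notebook, so the following is only a sketch of what its core steps might look like, not the helper itself. It assumes a plain pandas `groupby` followed by flattening of the `<column>_<function>` names and a rename. The rate definitions (bounces and add-to-carts per visitor, product clicks per product view, all as percentages) are inferred from the first row displayed above, where 203 bounces over 719 visitors gives a bounce rate of about 28.23. All data values and the `df_visits` name are hypothetical.

```python
import pandas as pd

# toy visit-level data (illustrative values only)
df_visits = pd.DataFrame(
    {
        "date": ["2016-09-01", "2016-09-01", "2016-09-02"],
        "maudience": ["Development", "Development", "Development"],
        "fullvisitorid": ["a", "b", "c"],
        "bounces": [1, 0, 0],
        "added_to_cart": [0, 1, 2],
        "product_views": [10, 30, 20],
        "product_clicks": [1, 1, 4],
    }
)
agg_dict = {
    "fullvisitorid": "count",
    "bounces": "sum",
    "added_to_cart": "sum",
    "product_views": "sum",
    "product_clicks": "sum",
}

# wrapping each function in a list gives two-level (column, function) names
df_daily = df_visits.groupby(["date", "maudience"]).agg(
    {col: [func] for col, func in agg_dict.items()}
)
# flatten to names such as `bounces_sum`, matching the keys of a column renamer
df_daily.columns = [f"{col}_{func}" for col, func in df_daily.columns]
df_daily = (
    df_daily.reset_index()
    .rename(
        columns={
            "fullvisitorid_count": "visitors",
            "bounces_sum": "bounces",
            "added_to_cart_sum": "add_to_cart",
            "product_views_sum": "product_views",
            "product_clicks_sum": "product_clicks",
        }
    )
    .assign(
        # rates expressed as percentages
        bounce_rate=lambda df: 100 * df["bounces"] / df["visitors"],
        add_to_cart_rate=lambda df: 100 * df["add_to_cart"] / df["visitors"],
        product_clicks_rate=lambda df: 100 * df["product_clicks"] / df["product_views"],
    )
)
print(df_daily)
```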


::: {.content-hidden}
Verify that no combination of audience group and date is duplicated, since the stats were aggregated daily and by audience group
:::

In [33]:
assert df_aud_hmap[df_aud_hmap.duplicated(subset=['maudience', 'date'], keep=False)].empty
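
As a small illustration of why the check uses both columns, the sketch below builds a toy frame (`df_daily_toy`, a hypothetical name) from three of the rows displayed above. A date legitimately repeats across audience groups, so only the (audience group, date) pair is expected to be unique.

```python
import pandas as pd

df_daily_toy = pd.DataFrame(
    {
        "date": ["2017-03-30", "2017-03-30", "2017-03-31"],
        "maudience": ["Low", "High", "Low"],
        "visitors": [336, 316, 264],
    }
)
# `keep=False` marks every member of a duplicated group, not just the repeats
dups_by_date = df_daily_toy[df_daily_toy.duplicated(subset=["date"], keep=False)]
dups_by_key = df_daily_toy[
    df_daily_toy.duplicated(subset=["maudience", "date"], keep=False)
]
print(len(dups_by_date), len(dups_by_key))
```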

### Daily Performance Summary in Overall Combined Data

Aggregate daily performance metrics across all visitors

In [34]:
df_hmap = (
    trh.perform_custom_aggregation(
        (
            df_dev_cohorts
            .sort_values(by=['visitStartTime'], ignore_index=True)
            .assign(date=lambda df: pd.to_datetime(df['visitStartTime']).dt.date)
            .assign(maudience=lambda df: df['maudience'].fillna("Development"))
        ),
        groupby_cols=['month', 'date'],
        agg_dict={
            "made_purchase_on_future_visit": ["sum"],
            "revenue": "sum",
            "fullvisitorid": "count",
            "added_to_cart": "sum",
            "pageviews": "sum",
            "time_on_site": "mean",
            "product_views": "sum",
            "product_clicks": "sum",
            "bounces": "sum",
        },
        audience_strategy=df_cohorts['audience_strategy'].unique().tolist()[0],
        column_renamer={
            "made_purchase_on_future_visit_sum": "return_purchasers",
            "revenue_sum": "revenue",
            "added_to_cart_sum": "add_to_cart",
            "fullvisitorid_count": "visitors",
            "product_views_sum": "product_views",
            "product_clicks_sum": "product_clicks",
            "pageviews_sum": "pageviews",
            "bounces_sum": "bounces",
            "time_on_site_mean": "time_on_site",
        },
        visitor_type_mapper=dict(
            zip(
                [
                    'September',
                    'October',
                    'November',
                    'December',
                    'January',
                    'February',
                    'March',
                ],
                [
                    'return_purchasers',
                    'return_purchasers',
                    'return_purchasers',
                    'return_purchasers',
                    'return_purchasers',
                    'return_purchasers',
                    'all_visitors',
                ],
            )
        ),
        visitor_type_col='month',
        dtypes_out={
            "month": pd.StringDtype(),
            "date": pd.StringDtype(),
            "return_purchasers": pd.Int16Dtype(),
            "revenue": pd.Float32Dtype(),
            "visitors": pd.Int16Dtype(),
            "add_to_cart": pd.Int32Dtype(),
            "pageviews": pd.Int32Dtype(),
            "time_on_site": pd.Float32Dtype(),
            "product_views": pd.Int32Dtype(),
            "product_clicks": pd.Int32Dtype(),
            "bounces": pd.Int32Dtype(),
            'audience_strategy': pd.Int8Dtype(),
            "bounce_rate": pd.Float32Dtype(),
            "product_clicks_rate": pd.Float32Dtype(),
            "add_to_cart_rate": pd.Float32Dtype(),
            "visitor_type": pd.StringDtype(),
        },
    )
    .assign(agg_type='overall')
    .assign(
        maudience=lambda df: df['month'].map(
            dict(
                zip(
                    [
                        'September',
                        'October',
                        'November',
                        'December',
                        'January',
                        'February',
                        'March',
                    ],
                    [
                        'Development',
                        'Development',
                        'Development',
                        'Development',
                        'Development',
                        'Development',
                        'Inference',
                    ],
                )
            )
        )
    )
    .astype({"agg_type": pd.StringDtype(), 'maudience': pd.StringDtype()})
)
df_hmap

Unnamed: 0,month,date,return_purchasers,revenue,visitors,add_to_cart,pageviews,time_on_site,product_views,product_clicks,bounces,audience_strategy,bounce_rate,product_clicks_rate,add_to_cart_rate,visitor_type,agg_type,maudience
0,September,2016-09-01,21,824.710022,719,180,4895,3.083542,53001,892,203,1,28.233658,1.682987,25.034771,return_purchasers,overall,Development
1,September,2016-09-02,16,1087.579956,637,125,4314,3.171245,49597,817,193,1,30.298273,1.647277,19.623234,return_purchasers,overall,Development
2,September,2016-09-03,3,514.880005,386,80,2349,2.391839,25702,466,124,1,32.124352,1.813088,20.725389,return_purchasers,overall,Development
3,September,2016-09-04,7,163.350006,350,66,2145,2.963476,25647,356,114,1,32.57143,1.388077,18.857143,return_purchasers,overall,Development
4,September,2016-09-05,11,483.380005,469,89,3034,3.292111,36007,563,167,1,35.607677,1.563585,18.976545,return_purchasers,overall,Development
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
207,March,2017-03-27,0,7658.660156,823,182,4468,3.279364,18306,566,241,1,29.28311,3.091882,22.114216,all_visitors,overall,Inference
208,March,2017-03-28,0,6920.870117,853,112,4178,3.079367,17562,526,315,1,36.928486,2.995103,13.130129,all_visitors,overall,Inference
209,March,2017-03-29,0,5265.200195,840,168,4582,3.235536,18880,683,274,1,32.619049,3.617585,20.0,all_visitors,overall,Inference
210,March,2017-03-30,0,3357.469971,976,166,4701,2.78943,20800,613,390,1,39.959015,2.947115,17.008196,all_visitors,overall,Inference


::: {.content-hidden}
Verify that there are no duplicated dates, since the stats were aggregated daily
:::

In [35]:
assert df_hmap[df_hmap.duplicated(subset=['date'], keep=False)].empty

### Combined Daily Performance Summary

::: {.content-hidden}
Verify that the columns in the overall and per-audience-group daily summary `DataFrame`s are identical
:::

In [36]:
assert df_hmap.shape[1] == df_aud_hmap.shape[1]
# column names must match (order may differ, since `pd.concat` aligns columns by name)
assert set(df_hmap) == set(df_aud_hmap)

::: {.content-hidden}
Combine daily performance summary data
:::

In [37]:
df_hmap_combo = pd.concat([df_aud_hmap, df_hmap], ignore_index=True)
df_hmap_combo

Unnamed: 0,month,date,maudience,return_purchasers,revenue,visitors,add_to_cart,pageviews,time_on_site,product_views,product_clicks,bounces,audience_strategy,bounce_rate,product_clicks_rate,add_to_cart_rate,visitor_type,agg_type
0,September,2016-09-01,Development,21,824.710022,719,180,4895,3.083542,53001,892,203,1,28.233658,1.682987,25.034771,return_purchasers,audience_group
1,September,2016-09-02,Development,16,1087.579956,637,125,4314,3.171245,49597,817,193,1,30.298273,1.647277,19.623234,return_purchasers,audience_group
2,September,2016-09-03,Development,3,514.880005,386,80,2349,2.391839,25702,466,124,1,32.124352,1.813088,20.725389,return_purchasers,audience_group
3,September,2016-09-04,Development,7,163.350006,350,66,2145,2.963476,25647,356,114,1,32.57143,1.388077,18.857143,return_purchasers,audience_group
4,September,2016-09-05,Development,11,483.380005,469,89,3034,3.292111,36007,563,167,1,35.607677,1.563585,18.976545,return_purchasers,audience_group
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
481,March,2017-03-27,Inference,0,7658.660156,823,182,4468,3.279364,18306,566,241,1,29.28311,3.091882,22.114216,all_visitors,overall
482,March,2017-03-28,Inference,0,6920.870117,853,112,4178,3.079367,17562,526,315,1,36.928486,2.995103,13.130129,all_visitors,overall
483,March,2017-03-29,Inference,0,5265.200195,840,168,4582,3.235536,18880,683,274,1,32.619049,3.617585,20.0,all_visitors,overall
484,March,2017-03-30,Inference,0,3357.469971,976,166,4701,2.78943,20800,613,390,1,39.959015,2.947115,17.008196,all_visitors,overall


### Monthly Performance Summary in Combined Data

Aggregate performance metrics by month, with month-over-month percent changes

In [39]:
df_summary = trh.perform_custom_aggregation(
    df_dev_cohorts,
    groupby_cols=['split_type', 'month'],
    agg_dict={
        "made_purchase_on_future_visit": ["sum"],
        "revenue": "sum",
        "fullvisitorid": "count",
        "added_to_cart": "sum",
        "pageviews": "sum",
        "time_on_site": "mean",
        "channelGrouping": pd.Series.mode,
        "deviceCategory": pd.Series.mode,
        "browser": pd.Series.mode,
        "os": pd.Series.mode,
        "product_views": "sum",
        "product_clicks": "sum",
        "bounces": "sum",
    },
    audience_strategy=df_cohorts['audience_strategy'].unique().tolist()[0],
    column_renamer={
        "made_purchase_on_future_visit_sum": "return_purchasers",
        "revenue_sum": "revenue",
        "added_to_cart_sum": "add_to_cart",
        "fullvisitorid_count": "visitors",
        "channelGrouping_mode": "channelGrouping",
        "deviceCategory_mode": "deviceCategory",
        "browser_mode": "browser",
        "os_mode": "os",
        "product_views_sum": "product_views",
        "product_clicks_sum": "product_clicks",
        "pageviews_sum": "pageviews",
        "bounces_sum": "bounces",
        "time_on_site_mean": "time_on_site",
    },
    visitor_type_mapper=dict(
        zip(
            ['train_val', 'test', 'infer'],
            ['return_purchasers', 'return_purchasers', 'all_visitors']
        )
    ),
    dtypes_out={
        'audience_strategy': pd.Int8Dtype(),
        "revenue": pd.Float32Dtype(),
        "visitors": pd.Int16Dtype(),
        "return_purchasers": pd.Int16Dtype(),
        "conversion_rate": pd.Float32Dtype(),
        "add_to_cart_rate": pd.Float32Dtype(),
        "product_clicks_rate": pd.Float32Dtype(),
        "pageviews": pd.Int32Dtype(),
        "bounce_rate": pd.Float32Dtype(),
        "time_on_site": pd.Float32Dtype(),
        "channelGrouping": pd.StringDtype(),
        "deviceCategory": pd.StringDtype(),
        "browser": pd.StringDtype(),
        "os": pd.StringDtype(),
        "split_type": pd.StringDtype(),
        "month": pd.Int8Dtype(),
        "visitor_type": pd.StringDtype(),
        'visitors_pct_change': pd.Float32Dtype(),
        'revenue_pct_change': pd.Float32Dtype(),
        'pageviews_pct_change': pd.Float32Dtype(),
        'time_on_site_pct_change': pd.Float32Dtype(),
        'bounce_rate_pct_change': pd.Float32Dtype(),
        'conversion_rate_pct_change': pd.Float32Dtype(),
        'product_clicks_rate_pct_change': pd.Float32Dtype(),
        'add_to_cart_rate_pct_change': pd.Float32Dtype(),
    },
    visitor_type_col='split_type',
    # list of months in chronological order of the visits in the GA360 data (starting
    # in August 2016 and ending in August 2017)
    df_months_ordered=pd.DataFrame(
        [8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8], columns=['month']
    ),
    zero_replacement_dict={"return_purchasers": {0: None}, "conversion_rate": {0: None}},
    cols_to_drop=[
        'add_to_cart',
        'product_views',
        'product_clicks',
        'bounces',
    ],
    mom_stats=[
        'visitors',
        'revenue',
        'pageviews',
        'time_on_site',
        'bounce_rate',
        'conversion_rate',
        'product_clicks_rate',
        'add_to_cart_rate',
    ],
).assign(split_type=lambda df: df["split_type"].str.replace("_", "+").str.title())
for c in (
    df_summary.columns[
        df_summary.columns.str.endswith("_pct_change")
    ].tolist()
):
    df_summary[f"{c}_gt_0"] = (df_summary[c].astype("float") > 0).astype(pd.BooleanDtype())
df_summary

Unnamed: 0,month,split_type,return_purchasers,revenue,visitors,pageviews,time_on_site,channelGrouping,deviceCategory,browser,...,product_clicks_rate_pct_change,add_to_cart_rate_pct_change,visitors_pct_change_gt_0,revenue_pct_change_gt_0,pageviews_pct_change_gt_0,time_on_site_pct_change_gt_0,bounce_rate_pct_change_gt_0,conversion_rate_pct_change_gt_0,product_clicks_rate_pct_change_gt_0,add_to_cart_rate_pct_change_gt_0
0,9,Train+Val,515.0,25347.169922,18610,122036,3.158059,Organic Search,desktop,Chrome,...,,,False,False,False,False,False,False,False,False
1,10,Train+Val,744.0,57754.410156,22605,137957,3.200277,Organic Search,desktop,Chrome,...,69.909203,-10.501069,True,True,True,True,True,True,True,False
2,11,Train+Val,1651.0,212317.96875,24400,162392,3.521526,Organic Search,desktop,Chrome,...,10.178357,37.701744,True,True,True,True,False,True,True,True
3,12,Train+Val,1340.0,188854.984375,26936,168867,3.386282,Organic Search,desktop,Chrome,...,-4.66746,-6.678822,True,False,True,False,True,False,False,False
4,1,Train+Val,722.0,148284.265625,21177,125336,3.238997,Organic Search,desktop,Chrome,...,-1.049209,-11.481061,False,False,False,False,True,False,False,False
5,2,Test,465.0,111282.46875,20164,106447,2.979854,Organic Search,desktop,Chrome,...,-12.988386,-9.225178,False,False,False,False,True,False,False,False
6,3,Infer,,136287.375,21752,117335,3.199188,Organic Search,desktop,Chrome,...,8.22617,3.873992,True,True,True,True,False,False,True,True
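
How the helper computes the `*_pct_change` columns is not shown here. One plausible approach, sketched below under that assumption, is to put the months in chronological order (which differs from numeric order once the data crosses a year boundary) and then apply `Series.pct_change`. The visitor counts are taken from the table above, while `months_ordered`, `df_monthly_toy`, and `df_mom` are hypothetical names.

```python
import pandas as pd

# chronological order of the months, spanning the 2016 to 2017 year boundary
months_ordered = [9, 10, 11, 12, 1]
# rows deliberately given in numeric month order, which is not chronological
df_monthly_toy = pd.DataFrame(
    {"month": [1, 9, 10, 11, 12], "visitors": [21177, 18610, 22605, 24400, 26936]}
)
df_mom = (
    df_monthly_toy.assign(
        # position of each month in the chronological list
        month_order=lambda df: df["month"].map(
            {m: i for i, m in enumerate(months_ordered)}
        )
    )
    .sort_values("month_order", ignore_index=True)
    .drop(columns="month_order")
    # percent change relative to the previous (chronological) row
    .assign(visitors_pct_change=lambda df: 100 * df["visitors"].pct_change())
)
print(df_mom)
```

The first month has no predecessor, so its percent change is missing, which is consistent with the empty `*_pct_change` values in the first row of the summary above.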


### Categorical Feature KPIs

In [40]:
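infer_month = df_cohorts['month'].unique().tolist()[0]
# use the month immediately before the inference month as the historical data
# (note: `infer_month-1` assumes the inference month is not January)
df_development_grouped = df_development.query(f"month=={infer_month-1}").copy()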
dfs_development_agg = []
for f in categorical_features[1:]:
    if f in ['source', 'browser']:
        df_development_grouped[f] = (
            trh.group_infrequent_categories(
                df_development_grouped[f], f
            )
        )
    df_development_agg = trh.agg_kpis(df_development_grouped, f)
    dfs_development_agg.append(df_development_agg)
df_development_agg = (
    pd.concat(dfs_development_agg, ignore_index=True)
    .astype(
        {
            "feature_name": pd.StringDtype(),
            "feature_category": pd.StringDtype(),
            "conversions": pd.Int32Dtype(),
            'product_views': pd.Int32Dtype(),
            'product_clicks': pd.Int32Dtype(),
            "visitors": pd.Int32Dtype(),
            "proportion": pd.Float32Dtype(),
            "ctr": pd.Float32Dtype(),
            "conversion_rate": pd.Float32Dtype(),
            "feature": pd.StringDtype(),
        }
    )
    .melt(
        id_vars=[
            'feature_name',
            'feature_category',
        ],
        value_vars=[
            'ctr',
            'conversion_rate',
            'revenue',
            'conversions',
            'product_views',
            'product_clicks',
            'visitors',
            'proportion',
        ]
    )
    .assign(
        variable=lambda df: (
            df['variable'].str.replace('_', ' ')
            .str.title().str.replace('Ctr', 'CTR')
        )
    )
    .assign(audience_strategy=audience_strategy)
    .assign(historical_data_month=month_name[infer_month-1])
    .assign(historical_data_size=len(df_development_grouped))
    .astype(
        {
            "historical_data_month": pd.StringDtype(),
            "historical_data_size": pd.Int32Dtype(),
            'audience_strategy': pd.Int8Dtype(),
            "variable": pd.StringDtype(),
            "value": pd.Float32Dtype(),
        }
    )
)
with pd.option_context('display.max_rows', None):
    display(
        df_development_agg.query("variable.str.contains('CTR|Conversion Rate')")
        .pivot(
            index=[
                'audience_strategy',
                'historical_data_month',
                'historical_data_size',
                'feature_name',
                'feature_category',
            ],
            columns=['variable'],
            values=['value'],
        )
        .reset_index()
    )

Unnamed: 0,audience_strategy,historical_data_month,historical_data_size,feature_name,feature_category,CTR,Conversion Rate
0,1,February,20164,browser,Chrome,2.877512,3.009088
1,1,February,20164,browser,Safari,2.060734,0.351256
2,1,February,20164,browser,other,2.301302,0.310945
3,1,February,20164,channelGrouping,Affiliates,1.399795,0.34965
4,1,February,20164,channelGrouping,Direct,3.606626,3.855422
5,1,February,20164,channelGrouping,Display,2.531646,4.62963
6,1,February,20164,channelGrouping,Organic Search,2.343026,0.826806
7,1,February,20164,channelGrouping,Paid Search,2.604475,0.420168
8,1,February,20164,channelGrouping,Referral,3.10793,6.850829
9,1,February,20164,channelGrouping,Social,2.114267,0.0
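
The long-then-wide reshaping used in this cell can be seen more easily on two rows. This sketch assumes a toy KPI frame (`df_kpis_toy`, hypothetical) with values copied from the `browser` rows above. It melts the KPI columns into `variable`/`value` pairs, relabels the variable names for display, and pivots back to one column per KPI.

```python
import pandas as pd

df_kpis_toy = pd.DataFrame(
    {
        "feature_name": ["browser", "browser"],
        "feature_category": ["Chrome", "Safari"],
        "ctr": [2.877512, 2.060734],
        "conversion_rate": [3.009088, 0.351256],
    }
)
# wide to long: one row per (feature category, KPI)
df_long = df_kpis_toy.melt(
    id_vars=["feature_name", "feature_category"],
    value_vars=["ctr", "conversion_rate"],
).assign(
    # display-friendly KPI names, e.g. `conversion_rate` -> `Conversion Rate`
    variable=lambda df: (
        df["variable"].str.replace("_", " ").str.title().str.replace("Ctr", "CTR")
    )
)
# long to wide: one column per KPI again
df_wide = df_long.pivot(
    index=["feature_name", "feature_category"],
    columns="variable",
    values="value",
).reset_index()
print(df_wide)
```

Passing scalars rather than lists for `columns` and `values` gives a single-level column header after the pivot.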


## Upload to BigQuery Tables

### Audience Feature Importances

::: {.content-hidden}
Show summary `DataFrame` with feature importances (see the second item within #1. from the **About** section above)
:::

In [41]:
#| output: false
summarize_df(df_feats)

Unnamed: 0,column,dtype,missing
0,audience_strategy,Int8,0
1,num_observations,Int16,0
2,stat,string[python],0
3,maudience,string[python],0
4,value,Float32,0


::: {.content-hidden}
Define BigQuery Table Schema
:::

In [42]:
job_config_feats_imp = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("audience_strategy", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("num_observations", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("stat", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("maudience", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("value", "FLOAT64", mode='NULLABLE'),
    ]
)
job_config_feats_imp.write_disposition = 'WRITE_APPEND'
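
The schemas in this section are written out by hand. As a cross-check, a schema could also be derived from the `DataFrame` dtypes. The sketch below is a hypothetical helper (`infer_bq_schema` is not part of this project or of the BigQuery client) that maps pandas dtypes to the same BigQuery type names used in this notebook, with every field assumed to be `NULLABLE`. It returns plain tuples so that it runs without the BigQuery client installed.

```python
import pandas as pd


def infer_bq_schema(df: pd.DataFrame) -> list:
    """Hypothetical helper: map pandas dtypes to (name, type, mode) tuples."""
    schema = []
    for col, dtype in df.dtypes.items():
        # booleans are checked first so they are not treated as integers
        if pd.api.types.is_bool_dtype(dtype):
            bq_type = "BOOLEAN"
        elif pd.api.types.is_integer_dtype(dtype):
            bq_type = "INTEGER"
        elif pd.api.types.is_float_dtype(dtype):
            bq_type = "FLOAT64"
        elif pd.api.types.is_datetime64_any_dtype(dtype):
            bq_type = "DATETIME"
        else:
            # strings, categoricals and anything unrecognized fall back to STRING
            bq_type = "STRING"
        schema.append((col, bq_type, "NULLABLE"))
    return schema


# toy frame using a few of the column names and dtypes summarized above
df_feats_toy = pd.DataFrame(
    {
        "audience_strategy": pd.array([1], dtype="Int8"),
        "stat": pd.array(["example"], dtype="string"),
        "value": pd.array([0.5], dtype="Float32"),
    }
)
schema_toy = infer_bq_schema(df_feats_toy)
print(schema_toy)
```

Each tuple follows the order of the first three positional parameters of `bigquery.SchemaField` (name, field type, mode), so it could be unpacked as `bigquery.SchemaField(*field)` and compared against the handwritten schema.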

::: {.content-hidden}
Create BigQuery table (if it does not exist) and populate
:::

In [43]:
bquh.create_bq_table(gbq_table_fully_resolved_feats_imp, client)
bquh.append_df_to_bq_table(
    df_feats, job_config_feats_imp, gbq_table_fully_resolved_feats_imp, client
)

Created table named demoabc-381618.mydemo2asdf.audience_feats_imp
Completed upload
Found 30 rows and 5 columns in table mydemo2asdf.audience_feats_imp
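
The `bquh.create_bq_table` and `bquh.append_df_to_bq_table` helpers are project code whose source is not shown here. The sketch below is an assumption about the kind of calls they might wrap, using the BigQuery client methods `create_table`, `load_table_from_dataframe`, and `get_table`. The function `append_df_and_report` and the `StubClient` classes are hypothetical; the stub only exists so the sketch runs without credentials, and a real `bigquery.Client` would take its place.

```python
import pandas as pd


def append_df_and_report(df, job_config, table_id, client):
    """Hypothetical sketch of create-then-append; not the project's helper."""
    # `exists_ok=True` avoids an error when the table already exists
    client.create_table(table_id, exists_ok=True)
    # start the load job and wait for it to finish
    client.load_table_from_dataframe(df, table_id, job_config=job_config).result()
    table = client.get_table(table_id)
    return table.num_rows, len(table.schema)


class _StubJob:
    def result(self):
        return self


class _StubTable:
    def __init__(self):
        self.num_rows, self.schema = 0, []


class StubClient:
    """In-memory stand-in that mimics appending rows to a table."""

    def __init__(self):
        self.tables = {}

    def create_table(self, table_id, exists_ok=False):
        self.tables.setdefault(table_id, _StubTable())

    def load_table_from_dataframe(self, df, table_id, job_config=None):
        table = self.tables[table_id]
        table.num_rows += len(df)  # append semantics
        table.schema = list(df.columns)
        return _StubJob()

    def get_table(self, table_id):
        return self.tables[table_id]


client_stub = StubClient()
df_toy = pd.DataFrame({"maudience": ["High", "Medium", "Low"], "value": [0.1, 0.2, 0.3]})
first_run = append_df_and_report(df_toy, None, "dataset.toy_table", client_stub)
second_run = append_df_and_report(df_toy, None, "dataset.toy_table", client_stub)
print(first_run, second_run)
```

The second call illustrates a property of append-style writes: running the same upload twice adds the rows twice, so re-running an upload cell against an existing table would be expected to duplicate its contents.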


### Audience Profile

::: {.content-hidden}
Show summary `DataFrame` with audience profile (see the first item within #1. from the **About** section above)
:::

In [44]:
#| output: false
summarize_df(df_profile_sliced)

Unnamed: 0,column,dtype,missing
0,Audience_Strategy,Int8,0
1,Stat_Expanded,string[python],0
2,High,Float32,0
3,Medium,Float32,0
4,Low,Float32,0


::: {.content-hidden}
Define BigQuery Table Schema
:::

In [45]:
job_config_profiles = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("Audience_Strategy", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("Stat_Expanded", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("High", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("Medium", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("Low", "FLOAT64", mode='NULLABLE'),
    ]
)
job_config_profiles.write_disposition = 'WRITE_APPEND'

::: {.content-hidden}
Create BigQuery table (if it does not exist) and populate
:::

In [46]:
bquh.create_bq_table(gbq_table_fully_resolved_profiles, client)
bquh.append_df_to_bq_table(
    df_profile_sliced, job_config_profiles, gbq_table_fully_resolved_profiles, client
)

Created table named demoabc-381618.mydemo2asdf.audience_profiles
Completed upload
Found 20 rows and 5 columns in table mydemo2asdf.audience_profiles


### Audience Cohorts

::: {.content-hidden}
Show summary `DataFrame` with inference data and predicted audience cohorts (see #2. from the **About** section above)
:::

In [47]:
#| output: false
summarize_df(df_dev_cohorts)

Unnamed: 0,column,dtype,missing
0,infer_month,string[python],133892
1,fullvisitorid,string[python],0
2,visitId,string[python],0
3,visitNumber,Int8,0
4,visitStartTime,datetime64[ns],0
5,quarter,Int8,0
6,month,Int8,0
7,day_of_month,Int8,0
8,day_of_week,Int8,0
9,hour,Int8,0


::: {.content-hidden}
Define BigQuery Table Schema
:::

In [48]:
job_config_cohorts = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("infer_month", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("fullvisitorid", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("visitId", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("visitNumber", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("visitStartTime", "DATETIME", mode='NULLABLE'),
        bigquery.SchemaField("quarter", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("month", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("day_of_month", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("day_of_week", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("hour", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("minute", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("second", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("source", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("medium", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("channelGrouping", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("hits", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("bounces", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("last_action", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("promos_displayed", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("promos_clicked", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("product_views", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("product_clicks", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("pageviews", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("time_on_site", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("browser", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("os", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("deviceCategory", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("added_to_cart", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("revenue", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("score", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("predicted_score_label", "BOOLEAN", mode='NULLABLE'),
        bigquery.SchemaField("maudience", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("cohort", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("audience_strategy", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("made_purchase_on_future_visit", "BOOLEAN", mode='NULLABLE'),
        bigquery.SchemaField("split_type", "STRING", mode='NULLABLE'),
    ]
)
job_config_cohorts.write_disposition = 'WRITE_APPEND'

::: {.content-hidden}
Create BigQuery table (if it does not exist) and populate
:::

In [49]:
bquh.create_bq_table(gbq_table_fully_resolved_cohorts, client)
bquh.append_df_to_bq_table(
    df_dev_cohorts, job_config_cohorts, gbq_table_fully_resolved_cohorts, client
)

Created table named demoabc-381618.mydemo2asdf.audience_cohorts
Completed upload
Found 155,644 rows and 36 columns in table mydemo2asdf.audience_cohorts


### Monthly Performance Summary

::: {.content-hidden}
Show summary `DataFrame` with monthly summary statistics for the inference data (see #3. from the **About** section above)
:::

In [50]:
#| output: false
summarize_df(df_summary)

Unnamed: 0,column,dtype,missing
0,month,Int8,0
1,split_type,string[python],0
2,return_purchasers,Int16,1
3,revenue,Float32,0
4,visitors,Int16,0
5,pageviews,Int32,0
6,time_on_site,Float32,0
7,channelGrouping,string[python],0
8,deviceCategory,string[python],0
9,browser,string[python],0


::: {.content-hidden}
Define BigQuery Table Schema
:::

In [55]:
job_config_summary = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("month", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("split_type", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("return_purchasers", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("revenue", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("visitors", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("pageviews", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("time_on_site", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("channelGrouping", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("deviceCategory", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("browser", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("os", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("audience_strategy", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("bounce_rate", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("product_clicks_rate", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("add_to_cart_rate", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("visitor_type", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("conversion_rate", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("visitors_pct_change", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("revenue_pct_change", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("pageviews_pct_change", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("time_on_site_pct_change", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("bounce_rate_pct_change", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("conversion_rate_pct_change", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("product_clicks_rate_pct_change", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("add_to_cart_rate_pct_change", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("visitors_pct_change_gt_0", "BOOLEAN", mode='REQUIRED'),
        bigquery.SchemaField("revenue_pct_change_gt_0", "BOOLEAN", mode='REQUIRED'),
        bigquery.SchemaField("pageviews_pct_change_gt_0", "BOOLEAN", mode='REQUIRED'),
        bigquery.SchemaField("time_on_site_pct_change_gt_0", "BOOLEAN", mode='REQUIRED'),
        bigquery.SchemaField("bounce_rate_pct_change_gt_0", "BOOLEAN", mode='REQUIRED'),
        bigquery.SchemaField("conversion_rate_pct_change_gt_0", "BOOLEAN", mode='REQUIRED'),
        bigquery.SchemaField("product_clicks_rate_pct_change_gt_0", "BOOLEAN", mode='REQUIRED'),
        bigquery.SchemaField("add_to_cart_rate_pct_change_gt_0", "BOOLEAN", mode='REQUIRED'),
    ]
)
# Append rows to the table instead of overwriting existing rows
job_config_summary.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

::: {.content-hidden}
Create BigQuery table (if it does not exist) and populate
:::

In [56]:
bquh.create_bq_table(gbq_summary_table_id_fully_resolved, client)
bquh.append_df_to_bq_table(
    df_summary, job_config_summary, gbq_summary_table_id_fully_resolved, client
)

Created table named demoabc-381618.mydemo2asdf.monthly_summary
Completed upload
Found 7 rows and 33 columns in table mydemo2asdf.monthly_summary
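
The schema above repeats a type that is already implied by each column's pandas dtype. As a hedged sketch (not part of this project's helpers), the field specifications could be generated from the dtype names that `summarize_df` prints. The names `bq_type_for` and `schema_spec` are hypothetical, and the dtype-to-type mapping is an assumption that mirrors the types used in this notebook's schemas.

```python
import re
from typing import Iterable, List, Mapping, Tuple

# Assumed mapping from pandas dtype names to the BigQuery types used in
# this notebook's schemas (INTEGER, FLOAT64, STRING, BOOLEAN).
_DTYPE_PATTERNS = [
    (re.compile(r"^u?int\d*$", re.IGNORECASE), "INTEGER"),
    (re.compile(r"^float\d*$", re.IGNORECASE), "FLOAT64"),
    (re.compile(r"^(string(\[.*\])?|object)$", re.IGNORECASE), "STRING"),
    (re.compile(r"^bool(ean)?$", re.IGNORECASE), "BOOLEAN"),
]


def bq_type_for(dtype_name: str) -> str:
    """Map a pandas dtype name such as 'Int16' to a BigQuery type name."""
    for pattern, bq_type in _DTYPE_PATTERNS:
        if pattern.match(dtype_name.strip()):
            return bq_type
    raise ValueError(f"No BigQuery type known for dtype {dtype_name!r}")


def schema_spec(
    dtypes: Mapping[str, str], required: Iterable[str] = ()
) -> List[Tuple[str, str, str]]:
    """Build (column, BigQuery type, mode) tuples in column order.

    Columns listed in `required` get mode REQUIRED, all others NULLABLE.
    """
    required_cols = set(required)
    return [
        (col, bq_type_for(dtype), "REQUIRED" if col in required_cols else "NULLABLE")
        for col, dtype in dtypes.items()
    ]


# Example with dtype names taken from a summarize_df-style listing
example = schema_spec(
    {"month": "Int8", "revenue": "Float32", "visitors_pct_change_gt_0": "boolean"},
    required=["visitors_pct_change_gt_0"],
)
```

Each `(name, type, mode)` tuple could then be passed to `bigquery.SchemaField(name, type, mode=mode)` when building the `LoadJobConfig`. For a `DataFrame`, the input mapping can be obtained with `df.dtypes.astype(str).to_dict()`.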


### Conversion Rates in Development and Inference

::: {.content-hidden}
Show `DataFrame` with conversion rates (see item 4 in the **About** section above)
:::

In [57]:
#| output: false
summarize_df(df_conv_rates)

| | column | dtype | missing |
|---|---|---|---|
| 0 | audience_strategy | Int8 | 0 |
| 1 | infer_month | string[python] | 3 |
| 2 | maudience | string[python] | 0 |
| 3 | pred_conversions | Int16 | 0 |
| 4 | total_visitors | Int16 | 0 |
| 5 | min_score | Float32 | 0 |
| 6 | true_conversions | Int16 | 0 |
| 7 | data_type | string[python] | 0 |
| 8 | data_size | Int16 | 0 |
| 9 | true_conv_rate | Float32 | 0 |


::: {.content-hidden}
Define BigQuery Table Schema
:::

In [58]:
job_config_conv_rates = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("audience_strategy", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("infer_month", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("maudience", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("pred_conversions", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("total_visitors", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("min_score", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("true_conversions", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("data_type", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("data_size", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("true_conv_rate", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("overall_true_conv_rate", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("pred_conv_rate", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("overall_pred_conv_rate", "FLOAT64", mode='NULLABLE'),
    ]
)
# Append rows to the table instead of overwriting existing rows
job_config_conv_rates.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

::: {.content-hidden}
Create BigQuery table (if it does not exist) and populate
:::

In [59]:
bquh.create_bq_table(gbq_conv_rates_table_id_fully_resolved, client)
bquh.append_df_to_bq_table(
    df_conv_rates, job_config_conv_rates, gbq_conv_rates_table_id_fully_resolved, client
)

Created table named demoabc-381618.mydemo2asdf.audience_conversion_rates
Completed upload
Found 6 rows and 13 columns in table mydemo2asdf.audience_conversion_rates
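
Because each schema is written by hand, a column that is renamed or added upstream can drift out of sync with its `SchemaField` entries. The following is a minimal sketch of a check that could run before each upload. The function `compare_columns` is hypothetical and is not part of `bigquery_upload_helpers`.

```python
from typing import Iterable, List, Tuple


def compare_columns(
    df_columns: Iterable[str], schema_names: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Compare DataFrame column names against schema field names.

    Returns (missing_from_df, missing_from_schema), each in input order.
    """
    df_cols = list(df_columns)
    names = list(schema_names)
    df_set, name_set = set(df_cols), set(names)
    # Schema fields that have no matching DataFrame column
    missing_from_df = [n for n in names if n not in df_set]
    # DataFrame columns that have no matching schema field
    missing_from_schema = [c for c in df_cols if c not in name_set]
    return missing_from_df, missing_from_schema


# Toy example: the schema expects 'pred_conv_rate', the data has 'extra_col'
missing_df, missing_schema = compare_columns(
    ["maudience", "true_conv_rate", "extra_col"],
    ["maudience", "true_conv_rate", "pred_conv_rate"],
)
```

With the objects in this notebook, the call might look like `compare_columns(df_conv_rates.columns, [f.name for f in job_config_conv_rates.schema])`, and the upload would proceed only when both returned lists are empty.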


### Estimated and Actual Cohort-to-Audience Fractions

::: {.content-hidden}
Show `DataFrame` with cohort-to-audience fractions (see item 5 in the **About** section above)
:::

In [60]:
#| output: false
summarize_df(df_sa_frac)

| | column | dtype | missing |
|---|---|---|---|
| 0 | audience_strategy | Int8 | 0 |
| 1 | infer_month | string[python] | 6 |
| 2 | maudience | string[python] | 0 |
| 3 | cohort | string[python] | 0 |
| 4 | size | Int16 | 0 |
| 5 | group_size | Int16 | 0 |
| 6 | uplift | Int8 | 0 |
| 7 | power | Int8 | 0 |
| 8 | ci_level | Int8 | 0 |
| 9 | samp_to_aud_frac | Float32 | 0 |


::: {.content-hidden}
Define BigQuery Table Schema
:::

In [61]:
job_config_sa_frac = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("audience_strategy", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("infer_month", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("maudience", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("cohort", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("size", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("group_size", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("uplift", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("power", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("ci_level", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("samp_to_aud_frac", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("size_type", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("data_type", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("data_size", "INTEGER", mode='NULLABLE'),
    ]
)
# Append rows to the table instead of overwriting existing rows
job_config_sa_frac.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

::: {.content-hidden}
Create BigQuery table (if it does not exist) and populate
:::

In [62]:
bquh.create_bq_table(gbq_sa_fracs_table_id_fully_resolved, client)
bquh.append_df_to_bq_table(
    df_sa_frac, job_config_sa_frac, gbq_sa_fracs_table_id_fully_resolved, client
)

Created table named demoabc-381618.mydemo2asdf.cohort_audience_fractions
Completed upload
Found 12 rows and 13 columns in table mydemo2asdf.cohort_audience_fractions


### Aggregated Conversion Rates

::: {.content-hidden}
Show `DataFrame` with combined aggregated conversion rates (see item 6 in the **About** section above)
:::

In [63]:
#| output: false
summarize_df(df_conv_rates_agg_combo)

| | column | dtype | missing |
|---|---|---|---|
| 0 | maudience | string[python] | 6 |
| 1 | data_type | string[python] | 0 |
| 2 | var | string[python] | 0 |
| 3 | value | Float32 | 0 |


::: {.content-hidden}
Define BigQuery Table Schema
:::

In [64]:
job_config_conv_rates_combo = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("maudience", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("data_type", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("var", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("value", "FLOAT64", mode='NULLABLE'),
    ]
)
# Append rows to the table instead of overwriting existing rows
job_config_conv_rates_combo.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

::: {.content-hidden}
Create BigQuery table (if it does not exist) and populate
:::

In [65]:
bquh.create_bq_table(gbq_conv_rates_combo_table_id_fully_resolved, client)
bquh.append_df_to_bq_table(
    df_conv_rates_agg_combo,
    job_config_conv_rates_combo,
    gbq_conv_rates_combo_table_id_fully_resolved,
    client,
)

Created table named demoabc-381618.mydemo2asdf.conversion_rates_aggregated
Completed upload
Found 18 rows and 4 columns in table mydemo2asdf.conversion_rates_aggregated


### Daily Performance Summary

::: {.content-hidden}
Show `DataFrame` with combined daily performance summary (see item 7 in the **About** section above)
:::

In [66]:
#| output: false
summarize_df(df_hmap_combo)

| | column | dtype | missing |
|---|---|---|---|
| 0 | month | string[python] | 0 |
| 1 | date | string[python] | 0 |
| 2 | maudience | string[python] | 0 |
| 3 | return_purchasers | Int16 | 0 |
| 4 | revenue | Float32 | 0 |
| 5 | visitors | Int16 | 0 |
| 6 | add_to_cart | Int32 | 0 |
| 7 | pageviews | Int32 | 0 |
| 8 | time_on_site | Float32 | 0 |
| 9 | product_views | Int32 | 0 |


::: {.content-hidden}
Define BigQuery Table Schema
:::

In [67]:
job_config_daily_perf_combo = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("month", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("date", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("maudience", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("return_purchasers", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("revenue", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("visitors", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("add_to_cart", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("pageviews", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("time_on_site", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("product_views", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("product_clicks", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("bounces", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("audience_strategy", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("bounce_rate", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("product_clicks_rate", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("add_to_cart_rate", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("visitor_type", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("agg_type", "STRING", mode='NULLABLE'),
    ]
)
# Append rows to the table instead of overwriting existing rows
job_config_daily_perf_combo.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

::: {.content-hidden}
Create BigQuery table (if it does not exist) and populate
:::

In [68]:
bquh.create_bq_table(gbq_daily_perf_combo_table_id_fully_resolved, client)
bquh.append_df_to_bq_table(
    df_hmap_combo,
    job_config_daily_perf_combo,
    gbq_daily_perf_combo_table_id_fully_resolved,
    client,
)

Created table named demoabc-381618.mydemo2asdf.daily_summary
Completed upload
Found 486 rows and 18 columns in table mydemo2asdf.daily_summary


### Categorical Feature KPIs

::: {.content-hidden}
Show summary `DataFrame` with categorical feature KPIs (see the second entry under item 1 in the **About** section above)
:::

In [69]:
#| output: false
summarize_df(df_development_agg)

| | column | dtype | missing |
|---|---|---|---|
| 0 | feature_name | string[python] | 0 |
| 1 | feature_category | string[python] | 0 |
| 2 | variable | string[python] | 0 |
| 3 | value | Float32 | 0 |
| 4 | audience_strategy | Int8 | 0 |
| 5 | historical_data_month | string[python] | 0 |
| 6 | historical_data_size | Int32 | 0 |


::: {.content-hidden}
Define BigQuery Table Schema
:::

In [70]:
job_config_cat_feats_kpis = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("feature_name", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("feature_category", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("variable", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("value", "FLOAT64", mode='NULLABLE'),
        bigquery.SchemaField("audience_strategy", "INTEGER", mode='NULLABLE'),
        bigquery.SchemaField("historical_data_month", "STRING", mode='NULLABLE'),
        bigquery.SchemaField("historical_data_size", "INTEGER", mode='NULLABLE'),
    ]
)
# Append rows to the table instead of overwriting existing rows
job_config_cat_feats_kpis.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

::: {.content-hidden}
Create BigQuery table (if it does not exist) and populate
:::

In [71]:
bquh.create_bq_table(gbq_table_fully_resolved_cat_feat_kpis, client)
bquh.append_df_to_bq_table(
    df_development_agg,
    job_config_cat_feats_kpis,
    gbq_table_fully_resolved_cat_feat_kpis,
    client,
)

Created table named demoabc-381618.mydemo2asdf.categorical_features_kpis
Completed upload
Found 304 rows and 7 columns in table mydemo2asdf.categorical_features_kpis
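
Every table in this notebook is uploaded with the same two calls: create the table, then append the `DataFrame`. As a hedged sketch, the repeated cells could be driven from a single list of upload specifications. The function `upload_all` is hypothetical. It accepts the two helper functions as arguments and assumes only the call signatures used in the cells above.

```python
from typing import Any, Callable, Iterable, List, Tuple

# (DataFrame, LoadJobConfig, fully resolved table ID)
UploadSpec = Tuple[Any, Any, str]


def upload_all(
    specs: Iterable[UploadSpec],
    create_table: Callable[[str, Any], Any],
    append_df: Callable[[Any, Any, str, Any], Any],
    client: Any,
) -> List[str]:
    """Create and populate each table in turn; return the uploaded table IDs."""
    uploaded = []
    for df, job_config, table_id in specs:
        # Same order as the notebook cells: create first, then append
        create_table(table_id, client)
        append_df(df, job_config, table_id, client)
        uploaded.append(table_id)
    return uploaded


# Demonstration with stand-in callables that only record their arguments
calls = []
result = upload_all(
    [("df_a", "cfg_a", "proj.ds.table_a"), ("df_b", "cfg_b", "proj.ds.table_b")],
    create_table=lambda table_id, client: calls.append(("create", table_id)),
    append_df=lambda df, cfg, table_id, client: calls.append(("append", table_id)),
    client=None,
)
```

In this notebook, the stand-ins would be replaced by `bquh.create_bq_table` and `bquh.append_df_to_bq_table`, and each tuple would hold a `DataFrame`, its `LoadJobConfig`, and its fully resolved table ID, for example `(df_summary, job_config_summary, gbq_summary_table_id_fully_resolved)`.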


## Next Step

The next step will be to create summary charts for end users from the tables uploaded to BigQuery in this step.