# Get Randomized Audience Cohorts From Inference

In [1]:
%load_ext autoreload
%autoreload 2

::: {.content-hidden}
Import necessary Python modules
:::

In [2]:
import os
import sys
from calendar import month_name
from datetime import datetime

import mlflow.sklearn
import pandas as pd

::: {.content-hidden}
Get relative path to project root directory
:::

In [3]:
PROJ_ROOT_DIR = os.path.join(os.pardir)
src_dir = os.path.join(PROJ_ROOT_DIR, "src")

sys.path.append(src_dir)

::: {.content-hidden}
Import custom Python modules
:::

In [4]:
#| output: false
%aimport audience_size_helpers
import audience_size_helpers as ash

%aimport bigquery_auth_helpers
from bigquery_auth_helpers import auth_to_bigquery

%aimport cohorts
import cohorts as ch

%aimport data_checks_helpers
import data_checks_helpers as dch

%aimport model_helpers
import model_helpers as modh

%aimport profile_audience
import profile_audience as pa

%aimport sql_helpers
import sql_helpers as sqlh

%aimport transform_helpers
import transform_helpers as th

%aimport utils
import utils as ut



## About

### Overview

This step generates the randomized test and control cohorts needed to run a marketing campaign. The impact of the campaign on conversions (KPI) will be assessed using an A/B test at the end of the campaign, and the randomized cohorts are needed in order to conduct this test. This is step 4 of a [typical A/B Testing workflow](https://www.datacamp.com/blog/data-demystified-what-is-a-b-testing).

### Order of Operations
This step can be run prospectively at the end of the inference period, just before the start of the campaign, when all the inference data (first-time visitors to the store) becomes available.

For the current use-case, the required sizes of one or more audience groups have been determined in the previous step. Recall that an audience strategy determines if one or more groups are used. For a strategy consisting of a single audience group, only the visitors predicted to have a high propensity to make a purchase on a return visit are selected to participate in the campaign, and so randomized cohorts are to be drawn from this single group. For multiple audience groups, namely visitors predicted to have a low, medium or high propensity, randomized cohorts are to be drawn from each such group.

With this in mind, this step first assigns all first-time visitors to the store during the inference period to an audience group based on the required audience strategy (single or multiple audience groups).

Next, before generating audience cohorts, features in the unseen (inference) data are checked for data drift relative to the test data used during ML model development.

1. Use the statistical tests that are called as part of EvidentlyAI's data drift monitor ([1](https://www.evidentlyai.com/blog/tutorial-3-historical-data-drift), [2](https://github.com/evidentlyai/evidently/blob/main/examples/integrations/mlflow_logging/historical_drift_visualization.ipynb)) [for tabular data](https://docs.evidentlyai.com/user-guide/customization/options-for-statistical-tests#tabular-drift-detection). The following [statistical tests and logic are used by EvidentlyAI](https://docs.evidentlyai.com/reference/data-drift-algorithm#tabular-data) and are also implemented here:
   - numerical features with less than or equal to 1,000 observations (this is not used here since the inference data consists of more than 1,000 observations during the inference data period)
     - [Kolmogorov-Smirnov goodness-of-fit Test (two-sample version)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest)
   - numerical features with more than 1,000 observations (this is the case here since the inference data has more than 1,000 first-time visitors during the inference data period)
     - [Wasserstein distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html#scipy.stats.wasserstein_distance)
   - categorical features with less than or equal to 1,000 observations (not used here)
     - [chi-squared goodness-of-fit test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html#scipy.stats.chi2)
   - categorical features with more than 1,000 observations (used here)
     - [Jensen-Shannon distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html#scipy.spatial.distance.jensenshannon)

   Here, a custom function is defined to manually implement these tests on the unseen (inference) data.
2. Perform sanity checks on the data by directly using EvidentlyAI's data test suite ([1](https://docs.evidentlyai.com/get-started/tutorial#7.-run-data-stability-tests), [2](https://docs.evidentlyai.com/user-guide/tests-and-reports/run-tests#using-test-presets)). A suite consisting of the following tests is applied to the unseen (inference) data in order to test that the number of
   - columns with missing values (columns with at least one missing value)
   - rows with missing values 
   - constant (single-valued) columns
   - duplicated rows (rows that are duplicates)
   - duplicated columns (columns that are duplicates)
   - column datatypes
   - empty columns (all values in a column are missing)
   - empty rows (all values in a row are missing)
   - rows with missing values (rows with at least one missing value)
   - number of columns

   in the unseen (inference) data equals the number in the reference data (test split from the data used during ML model development).
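
As a rough illustration of the two large-sample metrics named above (more than 1,000 observations), the sketch below computes the Wasserstein distance for a numerical feature and the Jensen-Shannon distance for a categorical feature using SciPy. The synthetic data, the helper names (`numerical_drift`, `categorical_drift`) and the thresholds are assumptions made for this example only. The project's own implementation lives in `data_checks_helpers` and may differ in its details.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance


def numerical_drift(ref: pd.Series, curr: pd.Series, threshold: float) -> dict:
    """Wasserstein distance between reference and current values (hypothetical helper)."""
    metric = float(
        wasserstein_distance(ref.dropna().to_numpy(), curr.dropna().to_numpy())
    )
    return {"metric_value": metric, "drift_detected": bool(metric > threshold)}


def categorical_drift(ref: pd.Series, curr: pd.Series, threshold: float) -> dict:
    """Jensen-Shannon distance between category frequencies (hypothetical helper)."""
    # align both frequency vectors on the union of observed categories
    categories = sorted(set(ref.dropna()) | set(curr.dropna()))
    p = ref.value_counts(normalize=True).reindex(categories, fill_value=0)
    q = curr.value_counts(normalize=True).reindex(categories, fill_value=0)
    metric = float(jensenshannon(p.to_numpy(), q.to_numpy(), base=2))
    return {"metric_value": metric, "drift_detected": bool(metric > threshold)}


# synthetic data: the numerical feature is shifted, the categorical one is not
rng = np.random.default_rng(42)
ref_num = pd.Series(rng.poisson(5, size=2_000))
curr_num = pd.Series(rng.poisson(5, size=2_000) + 3)
ref_cat = pd.Series(rng.choice(["desktop", "mobile"], size=2_000, p=[0.7, 0.3]))
curr_cat = pd.Series(rng.choice(["desktop", "mobile"], size=2_000, p=[0.7, 0.3]))

# thresholds are arbitrary and chosen for illustration only
num_result = numerical_drift(ref_num, curr_num, threshold=1)
cat_result = categorical_drift(ref_cat, curr_cat, threshold=0.1)
print(num_result)
print(cat_result)
```

Both metrics are distances rather than hypothesis tests, so whether drift is flagged depends entirely on the threshold chosen.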

Finally, within each audience group, visitors are randomly assigned to test or control cohorts. A brief profile of each audience group is then performed using attributes of the visitors' first visit. The profile and the inference data with the audience group and cohort for each first-time visitor during the inference period can be used by the marketing team to design and implement the marketing campaign aimed at growing the customer base.

### Implementation

A custom Python module in `src/cohorts.py` has been developed to create the test and control audience cohorts based on the required sample sizes, which were estimated in the previous step and logged as an MLFlow artifact (file) for the best MLFlow deployment candidate model.

The procedure followed in this step is briefly outlined below:

1. the MLFlow artifact (file) is loaded in order to access the required sample sizes for each
   - audience strategy
   - desired combination of effect sizes (power, confidence level and uplift)
2. the file is filtered based on the
   - audience strategy
   - desired combination of effect sizes (see `wanted_inputs` from **User Inputs**)
3. inference data (first-time visitors) is retrieved and the best ML model is used to make inference predictions of the propensity of these visitors to make a purchase on a return (future) visit to the store.
4. propensities are used to assign first-time visitors to audience groups (or bins)
   - if the desired audience strategy is for a single audience group, then visitors are placed into a single bin
   - if the desired audience strategy is for multiple audience groups, then multiple bins are created
5. for each bin (audience group), visitors are randomly placed into test and control cohorts
6. the inference data with
   - predicted propensity (probability)
   - audience group
   - cohort (test or control)

   is then logged as an MLFlow artifact
7. generate the following
   - profile of the audience groups in terms of
     - descriptive statistics
     - behavioral attributes (values of columns in the data by visitors assigned to each group)
   - summary of the cohorts and audience groups, showing the
     - conversion rate (KPI) per audience group, cohort and overall
     - size of random cohort relative to overall audience group

   which are then logged as separate MLFlow artifacts

Outputs produced by steps 6. and 7. in this procedure are for use by the marketing team to build and design the campaign.

## User Inputs

Define the following

1. start and end dates for inference data
2. name of column containing label (outcome)
3. list of categorical features
4. list of numerical features, including categorical features present in the raw data as integers
5. `audience_groups`
   - desired audience groups into which the first-time visitors' propensities will be placed
     - `num_propens_groups` specifies the number of desired audience propensity groups (low, medium and high)
     - `propens_group_labels` specifies names of the desired audience propensity groups
6. inputs (uplift, power, confidence level) for which random cohort sizes are to be created
7. type of audience strategy (single- or multi- group) from which to create cohorts

In [5]:
#| echo: true
# 1. start and end dates
infer_start_date = "20170301"
infer_end_date = "20170331"

# 2. label column
label = "made_purchase_on_future_visit"

# 3. categorical columns
categoricals=[
    'bounces',
    'source',
    'medium',
    'channelGrouping',
    'last_action',
    'browser',
    'os',
    'deviceCategory',
]

# 4. numerical columns
numericals=[
    'hits',
    'pageviews',
    'promos_displayed',
    'product_views',
    'product_clicks',
    'time_on_site',
]

# 5. mapping dictionaries
audience_groups_strategy_1 = {
    "num_propens_groups": 3,
    "propens_group_labels": ["High", "Medium", "Low"],
}
audience_groups_strategy_2 = {
    "num_propens_groups": 3,
    "propens_group_labels": ["High", "High-Medium", "High-Medium-Low"],
}

# 6. wanted inputs for estimating sample sizes
wanted_inputs = {
    "uplift_percentage": 10,
    "power_percentage": 55,
    "confidence_level_percentage": 55,
}

# 7. type of audience strategy to use when creating groups
audience_strategy = 1

::: {.content-hidden}
Get path to data sub-folders
:::

In [6]:
data_dir = os.path.join(PROJ_ROOT_DIR, "data")
raw_data_dir = os.path.join(data_dir, "raw")
processed_data_dir = os.path.join(data_dir, "processed")
gcp_keys_dir = os.path.join(PROJ_ROOT_DIR, "gcp_keys")

Create a mapping between audience group number (0, 1, 2) and group name

In [7]:
#| echo: true
mapper_dict_audience_strategy_1 = dict(
    zip(
        range(audience_groups_strategy_1["num_propens_groups"]),
        audience_groups_strategy_1["propens_group_labels"],
    )
)
mapper_dict_audience_strategy_2 = dict(
    zip(
        range(audience_groups_strategy_2["num_propens_groups"]),
        audience_groups_strategy_2["propens_group_labels"],
    )
)
print(mapper_dict_audience_strategy_1)
print(mapper_dict_audience_strategy_2)

{0: 'High', 1: 'Medium', 2: 'Low'}
{0: 'High', 1: 'High-Medium', 2: 'High-Medium-Low'}


Get desired effect size queries and number of audience groups

In [8]:
#| echo: true
if audience_strategy == 1:
    query_inputs = (
        f"(uplift == {wanted_inputs['uplift_percentage']}) & "
        f"(power == {wanted_inputs['power_percentage']}) & "
        f"(ci_level == {wanted_inputs['confidence_level_percentage']})"
    )
else:
    query_inputs = (
        f"(group_size_proportion < 34) & "
        f"(uplift == {wanted_inputs['uplift_percentage']}) & "
        f"(power == {wanted_inputs['power_percentage']}) & "
        f"(ci_level == {wanted_inputs['confidence_level_percentage']})"
    )

num_bins = (
    audience_groups_strategy_1["num_propens_groups"]
    if audience_strategy == 1
    else audience_groups_strategy_2["num_propens_groups"]
)
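
To illustrate how a query string such as `query_inputs` might be applied, the sketch below filters a small table of sample-size estimates with `DataFrame.query`. The column names `uplift`, `power` and `ci_level` come from the query built above. The rows and the `required_sample_size` column are hypothetical and made up for this example.

```python
import pandas as pd

wanted_inputs = {
    "uplift_percentage": 10,
    "power_percentage": 55,
    "confidence_level_percentage": 55,
}
query_inputs = (
    f"(uplift == {wanted_inputs['uplift_percentage']}) & "
    f"(power == {wanted_inputs['power_percentage']}) & "
    f"(ci_level == {wanted_inputs['confidence_level_percentage']})"
)

# hypothetical sample-size estimates (values are made up for illustration)
df_sizes = pd.DataFrame(
    {
        "uplift": [10, 10, 20],
        "power": [55, 80, 55],
        "ci_level": [55, 95, 55],
        "required_sample_size": [2_000, 9_000, 600],
    }
)

# keep only the rows that match all three wanted inputs
df_wanted = df_sizes.query(query_inputs)
print(df_wanted)
```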

::: {.content-hidden}
Define MLFlow storage paths
:::

In [9]:
mlruns_db_fpath = f"{raw_data_dir}/mlruns.db"
mlflow.set_tracking_uri(f"sqlite:///{mlruns_db_fpath}")

::: {.content-hidden}
Set environment variable to silence MLFlow `git` warning message
:::

In [10]:
os.environ["GIT_PYTHON_REFRESH"] = "quiet"

::: {.content-hidden}
Define a dictionary to specify datatypes of the transformed *inference* data
:::

In [11]:
dtypes_dict = {
    "fullvisitorid": pd.StringDtype(),
    "visitId": pd.StringDtype(),
    "visitNumber": pd.Int8Dtype(),
    "quarter": pd.Int8Dtype(),
    "month": pd.Int8Dtype(),
    "day_of_month": pd.Int8Dtype(),
    "day_of_week": pd.Int8Dtype(),
    "hour": pd.Int8Dtype(),
    "minute": pd.Int8Dtype(),
    "second": pd.Int8Dtype(),
    "source": pd.CategoricalDtype(),
    "medium": pd.CategoricalDtype(),
    "channelGrouping": pd.CategoricalDtype(),
    "hits": pd.Int16Dtype(),
    "bounces": pd.CategoricalDtype(),
    "last_action": pd.CategoricalDtype(),
    "promos_displayed": pd.Int16Dtype(),
    "promos_clicked": pd.Int16Dtype(),
    "product_views": pd.Int16Dtype(),
    "product_clicks": pd.Int16Dtype(),
    "pageviews": pd.Int16Dtype(),
    "time_on_site": pd.Int16Dtype(),
    "browser": pd.CategoricalDtype(),
    "os": pd.CategoricalDtype(),
    "added_to_cart": pd.Int16Dtype(),
    "deviceCategory": pd.CategoricalDtype(),
    "revenue": pd.Float32Dtype(),
    "made_purchase_on_future_visit": pd.Int8Dtype(),
}

::: {.content-hidden}
Create a mapping between action type integer and label, in order to get meaningful names from the `action_type` column
:::

In [12]:
action_mapper = {
    1: "Click through of product lists",
    2: "Product detail views",
    3: "Add product(s) to cart",
    4: "Remove product(s) from cart",
    5: "Check out",
    6: "Completed purchase",
    7: "Refund of purchase",
    8: "Checkout options",
    0: "Unknown",
}

::: {.content-hidden}
## Authenticate to `BigQuery`
:::

In [13]:
gcp_auth_dict = auth_to_bigquery(gcp_keys_dir)

## Get Inference Data

In [14]:
#| echo: true
query_infer = sqlh.get_sql_query_infer(infer_start_date, infer_end_date)
X_infer, _ = th.extract_data(query_infer, gcp_auth_dict).pipe(
    th.transform_data,
    datatypes_dict={k:v for k,v in dtypes_dict.items() if k != label},
    duplicate_cols=["fullvisitorid"],
    column_mapper_dict={'last_action': action_mapper},
)
X_infer = X_infer.pipe(th.shuffle_data)

Query execution start time = 2023-07-02 19:35:23.185...done at 2023-07-02 19:35:28.289 (5.103 seconds).
Query returned 21,768 rows
Got 21,752 rows and 28 columns after dropping duplicates
Transformed data has 21,752 rows & 28 columns


Get the size of each audience cohort in the inference data

In [15]:
#| echo: true
bin_size_infer = len(X_infer) / num_bins
bin_size_infer_control = int(bin_size_infer / 2)
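
As a worked example of the arithmetic in this cell, the sketch below plugs in the 21,752 rows reported for the transformed inference data and the three audience groups defined in **User Inputs**. Reading the result as an upper bound on the size of a control cohort is an interpretation, based on how `bin_size_infer_control` is passed to `ch.get_suitable_sample_sizes` later on.

```python
n_infer_rows = 21_752  # rows in the transformed inference data (reported above)
num_bins = 3  # number of audience groups

# average number of visitors per audience group
bin_size_infer = n_infer_rows / num_bins

# half of a group, truncated to an integer
bin_size_infer_control = int(bin_size_infer / 2)

print(f"{bin_size_infer:,.2f}")  # 7,250.67
print(bin_size_infer_control)  # 3625
```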

## Get Model

### Fetch Latest Version of Best Deployment Candidate Model

Get name of best deployment candidate model from MLFlow model registry

In [16]:
#| echo: true
df_deployment_candidate_mlflow_models = modh.get_all_deployment_candidate_models()
best_run_model_name = modh.get_best_deployment_candidate_model(
    df_deployment_candidate_mlflow_models
)
with pd.option_context("display.max_colwidth", None):
    display(df_deployment_candidate_mlflow_models)

2023/07/02 23:35:28 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2023/07/02 23:35:28 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.


|   | name | description | run_id | tags | version | score |
|---|------|-------------|--------|------|---------|-------|
| 0 | BetaDistClassifier_20160901_20170228_133892_feats__20230702_193622 | Best Model based on fbeta2 score of 0.4876148105 | 6a38156c7fbb4c289d7e4d1ba6b149b5 | {'deployment-candidate': 'yes'} | 2 | 0.487615 |


::: {.content-hidden}
Load best deployment candidate model object
:::

In [17]:
model = mlflow.sklearn.load_model(model_uri=f"models:/{best_run_model_name}/latest")
model

### Get Data Used to Develop Best Deployment Candidate Model

Get all available data used during model development of the best deployment candidate model

In [18]:
#| echo: true
df_all = modh.get_data_for_run_id(df_deployment_candidate_mlflow_models, 'processed_data')

::: {.content-hidden}
Perform sanity checks on features in inference data
:::

In [19]:
modh.check_data_per_best_expt_run(
    df_deployment_candidate_mlflow_models,
    df_all.drop(columns=["score", "predicted_label", "predicted_score_label", "split_type"]),
    X_infer,
    label,
)

### Check Data Drift and Stability

Get test split from the data used during development of the best deployment candidate model. This was the last month of data used during ML model development and it will be considered as the reference data.

In [20]:
#| echo: true
X_test_best_run = (
    df_all
    .query("split_type == 'test'")
    .drop(columns=['split_type', label])
)

In [21]:
#| output: false
with pd.option_context("display.max_columns", None):
    display(X_test_best_run.head(2))
    display(X_test_best_run.tail(2))

Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,revenue,score,predicted_score_label,predicted_label
113728,4779737702674310083,1487736287,1,2017-02-21 20:04:47,1,2,21,3,20,4,47,google,organic,Organic Search,4,0,Unknown,9,0,12,0,4,54,Chrome,Macintosh,desktop,0,,0.089659,True,0
113729,9671083913773424047,1487184533,1,2017-02-15 10:48:53,1,2,15,4,10,48,53,google,organic,Organic Search,6,0,Unknown,18,0,5,0,6,97,Chrome,Macintosh,desktop,0,,0.076696,True,0


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,revenue,score,predicted_score_label,predicted_label
133890,91736817749198966,1488142038,1,2017-02-26 12:47:18,1,2,26,1,12,47,18,google,organic,Organic Search,3,0,Unknown,0,0,27,0,3,76,Chrome,Chrome OS,desktop,0,,0.005692,True,0
133891,4990910589525036851,1486257216,1,2017-02-04 17:13:36,1,2,4,7,17,13,36,youtube.com,referral,Social,3,0,Unknown,0,0,30,0,3,91,Edge,Windows,desktop,0,,0.201073,True,0


Check numerical features for drift between

1. ML development data (test split)
2. unseen data (inference)

using the Wasserstein distance

In [22]:
#| echo: true
df_num_drift, _ = dch.detect_features_drift(
    X_infer.assign(revenue=lambda df: df['revenue'].fillna(0)),
    X_test_best_run.assign(revenue=lambda df: df['revenue'].fillna(0)),
    numericals=numericals+['revenue'],
    categoricals=[],
    kst_threshold=0.05,
    kst_params=dict(
        method='exact',
        N=max(len(X_test_best_run), len(X_infer)),
        alternative='two-sided',
    ),
    wsd_threshold=1,
)
display(df_num_drift)

|   | feature_type | feature | nunique_ref | nunique_curr | pct_diff_mean | pct_diff_std | abs_diff_mean | abs_diff_std | metric_value | metric_threshold | drift_detected | test_type |
|---|--------------|---------|-------------|--------------|---------------|--------------|---------------|--------------|--------------|------------------|----------------|-----------|
| 0 | numerical | hits | 104 | 112 | -2.780131 | -15.663082 | -0.1719 | -1.516563 | 0.220158 | 1 | False | Wasserstein |
| 1 | numerical | pageviews | 78 | 94 | -2.181352 | -18.05175 | -0.115155 | -1.336889 | 0.159072 | 1 | False | Wasserstein |
| 2 | numerical | promos_displayed | 21 | 18 | -9.698937 | 2.97979 | -0.757536 | 0.314995 | 0.788913 | 1 | False | Wasserstein |
| 3 | numerical | product_views | 273 | 286 | 1.213926 | -5.456722 | 0.284963 | -1.950397 | 0.709329 | 1 | False | Wasserstein |
| 4 | numerical | product_clicks | 30 | 35 | -6.912383 | -13.700511 | -0.043756 | -0.249357 | 0.043756 | 1 | False | Wasserstein |
| 5 | numerical | time_on_site | 1583 | 1702 | -7.360568 | -6.518361 | -13.160053 | -25.687965 | 13.524775 | 1 | True | Wasserstein |
| 6 | numerical | revenue | 361 | 414 | -13.52888 | -31.814381 | -0.746641 | -13.851818 | 0.756292 | 1 | False | Wasserstein |


::: {.callout-note title="Notes"}

1. For descriptive statistics, consider drift being present if both of the following conditions are met

   - `abs(pct_diff_mean)` > 5
     - the percent difference between the mean in the reference (test) data and in the current (inference) data is larger than 5% in magnitude
   - `abs(pct_diff_std)` > 15
     - the percent difference between the standard deviation (an indication of variability) in the reference (test) data and in the current (inference) data is larger than 15% in magnitude
2. Both percent differences are calculated relative to the reference data.
:::
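
The rule in the note above can be sketched with `pandas` as follows. The percent differences are copied from the drift table above. Comparing their absolute values against the thresholds is an assumption made here, since most of the reported differences are negative.

```python
import pandas as pd

# percent differences copied from the drift table above
df_stats = pd.DataFrame(
    {
        "feature": [
            "hits",
            "pageviews",
            "promos_displayed",
            "product_views",
            "product_clicks",
            "time_on_site",
            "revenue",
        ],
        "pct_diff_mean": [
            -2.780131, -2.181352, -9.698937, 1.213926,
            -6.912383, -7.360568, -13.52888,
        ],
        "pct_diff_std": [
            -15.663082, -18.05175, 2.97979, -5.456722,
            -13.700511, -6.518361, -31.814381,
        ],
    }
)

# assumption: thresholds apply to the magnitude of each percent difference
df_stats["mean_exceeds"] = df_stats["pct_diff_mean"].abs() > 5
df_stats["std_exceeds"] = df_stats["pct_diff_std"].abs() > 15
df_stats["both_exceed"] = df_stats["mean_exceeds"] & df_stats["std_exceeds"]
print(df_stats)
```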

::: {.callout-tip title="Observations"}

1. The percent differences in the descriptive statistics (mean and standard deviation) suggest that drift is minimal for the following numerical features

   - `promos_displayed`
   - `product_views`
   - `time_on_site`
   - `product_clicks`
2. The `pageviews` and `hits` features, and the `revenue` column (not a feature), each show a percent difference in standard deviation larger than 15% in magnitude. This points to a change in variability between the two datasets, which may in turn reflect the presence of outliers. Of these three, only `revenue` also shows a percent difference in the mean larger than 5% in magnitude (-13.5%); `pageviews` and `hits` do not.
3. The last four columns relate to the statistical test that compares the same feature across the two datasets. The threshold of 1 was chosen arbitrarily, and the main limitation of this test is the difficulty of setting a reasonable threshold against which the metric can be compared. The test could be more useful if multiple inference periods were present during the production period. In such a scenario, the percent change in the metric could be tracked from one period to the next in order to determine whether a feature is drifting. For a single inference period, these four columns are of less use in drift monitoring.
:::

Perform sanity checks on all features between

1. ML development data (test split)
2. unseen data (inference)

In [23]:
#| echo: true
dtypes_eai = dict(zip(categoricals, ['str']*len(categoricals)))
df_stability = dch.check_data_stability(
    X_infer.astype(dtypes_eai),
    X_test_best_run.astype(dtypes_eai),
    numericals=numericals+['revenue'],
    categoricals=categoricals,
)
df_stability

Test suite start time = 2023-07-02 19:35:29.901...done at 2023-07-02 19:35:34.455 (4.555 seconds).


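|   | all_passed | total_tests | success_tests | failed_tests | len_curr_data | len_ref_data | features_checked | num_features_checked | SUCCESS |
|---|------------|-------------|---------------|--------------|---------------|--------------|------------------|----------------------|---------|
| 0 | True | 8 | 8 | 0 | 21752 | 20164 | ["bounces", "source", "medium", "channelGroupi... | 15 | 8 |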


::: {.callout-note title="Notes"}

1. The EvidentlyAI `TestSuite` module requires string features to have the `object` datatype. So, a datatype mapping dictionary was defined (`dtypes_eai`) to change the datatypes for these features.
:::

::: {.callout-tip title="Observations"}

1. It is reassuring that all data stability tests have passed. This indicates that features in the unseen (inference) data are stable in the 10 attributes listed in the **About** section, relative to those in the data used during ML development.
:::
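
For intuition about what such stability checks compare, the sketch below counts several of the attributes listed in the **About** section with plain `pandas` and reports those that differ between a reference and a current dataset. It does not use EvidentlyAI. The helper name `summarize_stability` and the toy data are hypothetical.

```python
import pandas as pd


def summarize_stability(df: pd.DataFrame) -> dict:
    """Count a few data-quality attributes of a DataFrame (hypothetical helper)."""
    return {
        "num_columns": df.shape[1],
        "cols_with_missing": int(df.isna().any(axis=0).sum()),
        "rows_with_missing": int(df.isna().any(axis=1).sum()),
        "empty_columns": int(df.isna().all(axis=0).sum()),
        "empty_rows": int(df.isna().all(axis=1).sum()),
        "constant_columns": int((df.nunique(dropna=False) <= 1).sum()),
        "duplicated_rows": int(df.duplicated().sum()),
    }


# toy reference and current data (made up for illustration)
ref = pd.DataFrame(
    {"hits": [1, 2, 3, 4], "os": ["iOS", "Android", "iOS", "Linux"]}
)
curr = pd.DataFrame(
    {"hits": [5, 6, 7, 7], "os": ["iOS", "Linux", "iOS", "iOS"]}
)

ref_summary = summarize_stability(ref)
curr_summary = summarize_stability(curr)

# attributes whose counts differ between reference and current data
mismatches = {
    name: (ref_summary[name], curr_summary[name])
    for name in ref_summary
    if ref_summary[name] != curr_summary[name]
}
print(mismatches)
```

In this toy example only the number of duplicated rows differs, because the last two rows of the current data are identical.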

## Make Inference Predictions

Make inference predictions with best deployment candidate model, using features extracted from the inference data

In [24]:
#| echo: true
#| output: false
y_infer_pred, y_infer_pred_proba = modh.make_inference(model, X_infer, label)
display(
    y_infer_pred.value_counts(normalize=True).rename('proportion').to_frame().merge(
        y_infer_pred.value_counts(normalize=False).rename('number').to_frame(),
        left_index=True,
        right_index=True,
        how='left',
    ).reset_index()
)



|   | made_purchase_on_future_visit | proportion | number |
|---|-------------------------------|------------|--------|
| 0 | 0 | 0.936879 | 20379 |
| 1 | 1 | 0.063121 | 1373 |


::: {.callout-note title="Notes"}

1. Per the business use-case, these are predictions of whether a visitor will make a purchase during a later (return) visit to the merchandise store. Such predictions are made using attributes (features) of the first visit by visitors to the store and they can only be evaluated at a later time (after the outcome of the same visitor's later visit is known).
:::

## Create Cohorts

Perform the following to prepare the inference observations for extracting cohorts from the audience group(s)

1. combine inference features and predicted hard and soft labels into single `DataFrame`
2. sort predictions in descending order of the predicted probability (propensity score)
3. separate the predictions into bins, using the predicted probability, based on the required number of bins
4. rename the `group_number` column to `maudience` (for use in a downstream step)
5. set datatypes for columns

In [25]:
#| echo: true
df_infer_pred = (
    ch.combine_infer_data(X_infer, y_infer_pred, y_infer_pred_proba)
    .pipe(ash.sort_scores, False)
    .pipe(ash.get_audience_groups_by_propensity, num_bins)
    .pipe(ch.rename_columns, {"group_number": "maudience"})
    .pipe(
        ash.set_datatypes,
        {
            "row_number": pd.Int16Dtype(),
            "fullvisitorid": pd.StringDtype(),
            "score": pd.Float32Dtype(),
            "predicted_score_label": pd.BooleanDtype(),
            "maudience": pd.Int8Dtype(),
        },
    )
)
with pd.option_context("display.max_columns", 1000):
    display(df_infer_pred)

Set all specified datatypes.


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,revenue,score,predicted_score_label,row_number,maudience
10526,1089841380502674759,1490136731,1,2017-03-21 15:52:11,1,3,21,3,15,52,11,google,organic,Organic Search,1,1,Unknown,0,0,12,0,1,0,Chrome,Windows,desktop,0,,0.962094,False,0,0
11744,7497175875994815624,1490799934,1,2017-03-29 08:05:34,1,3,29,4,8,5,34,(direct),(none),Direct,16,0,Completed purchase,9,0,12,1,13,461,Chrome,Linux,desktop,2,122.0,0.95769,False,1,0
18144,2375083666352246421,1490233047,1,2017-03-22 18:37:27,1,3,22,4,18,37,27,google,organic,Organic Search,3,0,Unknown,9,0,2,0,3,16,Chrome,Macintosh,desktop,0,,0.951394,False,2,0
18397,5413497526864769809,1490993900,1,2017-03-31 13:58:20,1,3,31,6,13,58,20,mall.googleplex.com,referral,Referral,3,0,Unknown,9,0,24,0,3,227,Chrome,Macintosh,desktop,0,,0.938781,False,3,0
4218,850212279804049015,1490995934,1,2017-03-31 14:32:14,1,3,31,6,14,32,14,mall.googleplex.com,referral,Referral,1,1,Unknown,9,0,0,0,1,0,Chrome,Macintosh,desktop,0,,0.938111,False,4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21486,9563358471095861018,1489537209,1,2017-03-14 17:20:09,1,3,14,3,17,20,9,google,organic,Organic Search,1,1,Unknown,0,0,3,0,1,0,Chrome,Android,mobile,0,,0.0,False,21747,2
13583,3520574210856495188,1489181287,1,2017-03-10 13:28:07,1,3,10,6,13,28,7,google,organic,Organic Search,3,0,Unknown,0,0,36,0,3,75,Chrome,Windows,desktop,0,,0.0,True,21748,2
11259,8672023061064502362,1490737601,1,2017-03-28 14:46:41,1,3,28,3,14,46,41,google,organic,Organic Search,4,0,Unknown,9,0,12,0,4,79,Safari,iOS,mobile,0,,0.0,False,21749,2
8604,1110401747577425092,1490811943,1,2017-03-29 11:25:43,1,3,29,4,11,25,43,mall.googleplex.com,referral,Referral,5,0,Unknown,9,0,0,0,5,39,Chrome,Macintosh,desktop,0,,0.0,False,21750,2


::: {.callout-note title="Notes"}

1. This is the same data preparation that was used during the (previous) sample size estimation step in order to prepare the test data split for estimating the required sample size.
:::
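
The sorting and binning steps listed above can be sketched on toy data as follows. Using `pd.qcut` on the row number to form equally sized groups is an assumption for illustration and is not necessarily how `ash.get_audience_groups_by_propensity` works. The scores and visitor IDs are made up.

```python
import pandas as pd

# toy predictions (made up for illustration)
df_scores = pd.DataFrame(
    {
        "fullvisitorid": [f"v{i}" for i in range(9)],
        "score": [0.02, 0.91, 0.15, 0.40, 0.05, 0.33, 0.76, 0.01, 0.08],
    }
)
num_bins = 3

df_binned = (
    # highest propensity first, so that group 0 holds the top scores
    df_scores.sort_values("score", ascending=False, ignore_index=True)
    .assign(
        row_number=lambda d: list(range(len(d))),
        # split the ranked rows into num_bins groups of (nearly) equal size
        maudience=lambda d: pd.qcut(d["row_number"], q=num_bins, labels=False),
    )
)
print(df_binned)
```

With this layout, group 0 contains the visitors with the highest predicted propensity, which matches the `{0: 'High', 1: 'Medium', 2: 'Low'}` mapping defined earlier.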

::: {.content-hidden}
Show datatypes of columns in inference predictions data
:::

In [26]:
with pd.option_context("display.max_columns", 1000):
    display(df_infer_pred.dtypes.rename("dtype").to_frame().transpose())

Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,revenue,score,predicted_score_label,row_number,maudience
dtype,string[python],string[python],Int8,datetime64[ns],Int8,Int8,Int8,Int8,Int8,Int8,Int8,category,category,category,Int16,category,category,Int16,Int16,Int16,Int16,Int16,Int16,category,category,category,Int16,Float32,Float32,boolean,Int16,Int8


Perform the following to get the test and control cohorts from the inference data

1. load estimates for required sample sizes associated with best deployment candidate model
2. get required sample sizes for the chosen audience strategy
3. get required sample sizes that can be supported by size of inference data
4. get required sample sizes that support all required audience groups (bins)
5. get required sample sizes that capture required effect sizes
6. get random test and control cohorts

In [27]:
#| echo: true
df_infer_audience_groups = (
    ch.load_file_from_mlflow_artifact(
        df_deployment_candidate_mlflow_models, "audience_sample_sizes"
    )
    .pipe(ch.get_sample_sizes_by_strategy, audience_strategy)
    .pipe(ch.get_suitable_sample_sizes, bin_size_infer_control)
    .pipe(ch.get_sample_sizes_with_all_audience_groups, num_bins)
    .pipe(ch.get_required_inputs, query_inputs)
    .pipe(
        ch.create_cohorts,
        df_infer_pred,
        mapper_dict_audience_strategy_1
        if audience_strategy == 1
        else mapper_dict_audience_strategy_2,
        audience_strategy,
    )
    .pipe(
        ash.set_datatypes,
        {
            "maudience": pd.StringDtype(),
            "cohort": pd.StringDtype(),
            "audience_strategy": pd.Int8Dtype(),
        },
    )
)
with pd.option_context("display.max_columns", 1000):
    display(df_infer_audience_groups)

audience=0: High, size=7,251,  excluded=2,389, wanted=2,431, control=2,431, test=2,431
audience=1: Medium, size=7,250,  excluded=3,000, wanted=2,125, control=2,125, test=2,125
audience=2: Low, size=7,251,  excluded=2,977, wanted=2,137, control=2,137, test=2,137
Found suitable sample sizes and generated cohorts.
Set all specified datatypes.


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,revenue,score,predicted_score_label,maudience,cohort,audience_strategy
0,1089841380502674759,1490136731,1,2017-03-21 15:52:11,1,3,21,3,15,52,11,google,organic,Organic Search,1,1,Unknown,0,0,12,0,1,0,Chrome,Windows,desktop,0,,0.962094,False,High,Control,1
1,7602795256647724660,1489465336,1,2017-03-13 21:22:16,1,3,13,2,21,22,16,(direct),(none),Direct,11,0,Unknown,9,0,92,0,11,124,Chrome,Macintosh,desktop,0,,0.9324,False,High,Control,1
2,2420015317690803901,1490906576,1,2017-03-30 13:42:56,1,3,30,5,13,42,56,mall.googleplex.com,referral,Referral,1,1,Unknown,9,0,0,0,1,0,Chrome,Linux,desktop,0,,0.890133,False,High,Control,1
3,9155224690531283117,1488840618,1,2017-03-06 14:50:18,1,3,6,2,14,50,18,google,organic,Organic Search,1,1,Unknown,0,0,12,0,1,0,Chrome,Windows,desktop,0,,0.884398,False,High,Control,1
4,2205802376622271952,1489757220,1,2017-03-17 06:27:00,1,3,17,6,6,27,0,google,organic,Organic Search,1,1,Unknown,0,0,12,0,1,0,Chrome,Android,mobile,0,,0.882623,False,High,Control,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21747,6932276822739810471,1490658817,1,2017-03-27 16:53:37,1,3,27,2,16,53,37,(direct),(none),Direct,7,0,Unknown,36,0,36,0,7,216,Safari (in-app),iOS,mobile,0,,0.0,False,Low,,1
21748,1484841107186174333,1490739877,1,2017-03-28 15:24:37,1,3,28,3,15,24,37,yahoo,organic,Organic Search,12,0,Product detail views,9,0,27,4,8,1157,Firefox,Macintosh,desktop,0,,0.0,False,Low,,1
21749,8767089865026337607,1488650631,1,2017-03-04 10:03:51,1,3,4,7,10,3,51,youtube.com,referral,Social,3,0,Unknown,9,0,5,0,3,73,Safari,iOS,tablet,0,,0.0,False,Low,,1
21750,8672023061064502362,1490737601,1,2017-03-28 14:46:41,1,3,28,3,14,46,41,google,organic,Organic Search,4,0,Unknown,9,0,12,0,4,79,Safari,iOS,mobile,0,,0.0,False,Low,,1


::: {.callout-note title="Notes"}

1. The final sample size used to generate the cohorts is the minimum number of samples (first-time visitors) to include in each of the control and test cohorts of each audience group. These sizes come from the output of the previous step (designing a media or marketing experiment - [1](https://blog.hubspot.com/blog/tabid/6307/bid/31634/a-b-testing-in-action-3-real-life-marketing-experiments.aspx), [2](https://www.datacamp.com/blog/data-demystified-what-is-a-b-testing)).
:::
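
As a rough illustration of how per-group sample sizes translate into randomized cohorts, the sketch below shuffles the visitors within each audience group and labels the first `n` as Control and the next `n` as Test. The function `assign_random_cohorts`, the toy data and the sizes are all hypothetical. This is not the implementation used by the `cohorts` module.

```python
import numpy as np
import pandas as pd


def assign_random_cohorts(df, sizes, group_col="maudience", seed=42):
    """Label `sizes[group]` random rows per group as Control and as Test.

    Hypothetical helper for illustration; rows not drawn keep a missing cohort.
    """
    rng = np.random.default_rng(seed)
    cohort = pd.Series(pd.NA, index=df.index, dtype="string")
    for group, idx in df.groupby(group_col).groups.items():
        n = sizes[group]
        if 2 * n > len(idx):
            raise ValueError(f"group {group!r} is too small for two cohorts of {n}")
        # shuffle the row labels of this audience group
        shuffled = rng.permutation(idx.to_numpy())
        cohort.loc[shuffled[:n]] = "Control"
        cohort.loc[shuffled[n : 2 * n]] = "Test"
    return df.assign(cohort=cohort)


# toy data: 10 visitors per audience group (made-up IDs)
df_toy = pd.DataFrame(
    {
        "fullvisitorid": [str(i) for i in range(30)],
        "maudience": ["High"] * 10 + ["Medium"] * 10 + ["Low"] * 10,
    }
)
df_toy = assign_random_cohorts(df_toy, {"High": 3, "Medium": 2, "Low": 4})
cohort_counts = df_toy.groupby(["maudience", "cohort"]).size()
```

Visitors who are not drawn into either cohort keep a missing `cohort` value, which mirrors the blank `cohort` entries in the output above.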

The treatment (or test) and control groups should be similar to each other with respect to the property being tested. This is a [fundamental requirement of test and control groups](https://www.mobileapps.com/blog/test-group#Similarities_Between_Test_and_Control_Groups). In this case, the predicted probability (`score` column) is the property of interest. With this in mind, we now show selected descriptive statistics of the `score` column for both the test (treatment) and control cohorts within each desired audience group (low, medium and high propensity).

In [28]:
#| echo: true
df_aud_stats = ch.get_cohort_stats(df_infer_audience_groups)
df_aud_stats

Unnamed: 0,maudience,cohort,score_count,score_min,score_mean,score_median,score_max
0,High,Control,2431,0.164775,0.355874,0.308467,0.962094
1,High,Test,2431,0.164693,0.361409,0.317408,0.95769
2,High,,2389,0.164704,0.366237,0.32165,0.951394
3,Low,Control,2137,0.0,0.009483,0.006444,0.031874
4,Low,Test,2137,0.0,0.009594,0.006536,0.031806
5,Low,,2977,0.0,0.009781,0.00667,0.031868
6,Medium,Control,2125,0.0319,0.085171,0.07819,0.164649
7,Medium,Test,2125,0.031889,0.084812,0.078362,0.164643
8,Medium,,3000,0.031879,0.085988,0.079279,0.164632


::: {.callout-tip title="Observations"}

1. It is reassuring that, within each audience group, the score statistics for the Test and Control cohorts agree closely. For example, in the *High* group the mean score is 0.356 for Control and 0.361 for Test.
:::
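
A minimal sketch of this kind of check is shown below, using plain pandas on made-up scores. It computes the same descriptive statistics per audience group and cohort, then the absolute gap between the Test and Control mean scores. The toy values and variable names are assumptions, and the sketch does not claim to reproduce `ch.get_cohort_stats`.

```python
import pandas as pd

# toy scores: two Control, two Test and two unassigned visitors per group
df_toy = pd.DataFrame(
    {
        "maudience": ["High"] * 6 + ["Low"] * 6,
        "cohort": ["Control", "Control", "Test", "Test", None, None] * 2,
        "score": [0.40, 0.30, 0.38, 0.34, 0.50, 0.20,
                  0.010, 0.006, 0.009, 0.007, 0.020, 0.000],
    }
)

# descriptive statistics per (audience group, cohort); dropna=False keeps
# the unassigned visitors (missing cohort) as their own row
stats = (
    df_toy.groupby(["maudience", "cohort"], dropna=False)["score"]
    .agg(["count", "min", "mean", "median", "max"])
    .add_prefix("score_")
    .reset_index()
)

# absolute difference between Test and Control mean scores per group
means = stats.dropna(subset=["cohort"]).pivot(
    index="maudience", columns="cohort", values="score_mean"
)
mean_gap = (means["Test"] - means["Control"]).abs()
```

A small `mean_gap` relative to the spread of scores within a group suggests the two cohorts are comparable on the property of interest.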

## Summarize Cohorts

### Audience Profiles

Next, the audience groups in the inference data are briefly profiled to identify characteristics that might help the marketing team build an appropriate strategy for the campaign. For each audience group, the profile consists of the following:

1. proportion of visitors who displayed the following behavior
   - whose last action during their first visit was
     - `last_action == 'Add product(s) to cart'`
     - `last_action == 'Product detail views'`
     - `last_action == 'Click through of product lists'`
   - who
     - used a `medium == 'referral'` to get to the store's website on their first visit
     - added at least one item to their shopping cart (`added_to_cart > 0`) on their first visit
   - whose
     - first visit occurred on a weekend (`day_of_week.isin(@weekend_days)`)
2. bounce rate during all first visits
3. following descriptive statistics
   - `hour`, `day_of_week`
     - statistic: `mean`
   - `source`, `medium`, `channelGrouping`, `last_action`, `browser`, `os`, `deviceCategory`
     - statistic: `mode` (most common value)
   - `promos_displayed`, `promos_clicked`, `product_views`, `product_clicks`, `pageviews`, `added_to_cart`
     - statistics: `mean`, `max`
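
The profile components listed above can be sketched with a single pandas named aggregation, as shown below on toy data. The column values, the output names and the coding `weekend_days = [1, 7]` (Sunday = 1, Saturday = 7, which is consistent with the sample rows shown earlier) are assumptions made for illustration. The actual logic lives in `pa.get_audience_profile` and may differ.

```python
import pandas as pd

weekend_days = [1, 7]  # assumed coding: Sunday = 1, Saturday = 7

df_toy = pd.DataFrame(
    {
        "maudience": ["High"] * 4 + ["Low"] * 4,
        "last_action": ["Add product(s) to cart", "Unknown", "Unknown",
                        "Product detail views", "Unknown", "Unknown",
                        "Unknown", "Unknown"],
        "medium": ["referral", "organic", "organic", "referral",
                   "organic", "organic", "(none)", "organic"],
        "added_to_cart": [2, 0, 0, 1, 0, 0, 0, 0],
        "day_of_week": [7, 3, 2, 1, 4, 5, 7, 2],
        "bounces": [0, 1, 0, 0, 1, 1, 1, 0],
        "hour": [15, 21, 13, 11, 6, 16, 15, 14],
        "browser": ["Chrome", "Chrome", "Safari", "Chrome",
                    "Safari", "Safari", "Chrome", "Firefox"],
    }
)

profile = df_toy.groupby("maudience").agg(
    # 1. proportions of visitors showing a given behavior
    frac_last_action_add_to_cart=(
        "last_action", lambda s: (s == "Add product(s) to cart").mean()
    ),
    frac_referral=("medium", lambda s: (s == "referral").mean()),
    frac_added_to_cart=("added_to_cart", lambda s: (s > 0).mean()),
    frac_weekend=("day_of_week", lambda s: s.isin(weekend_days).mean()),
    # 2. bounce rate (bounces is assumed to be 0 or 1 per first visit)
    bounce_rate=("bounces", "mean"),
    # 3. descriptive statistics
    hour__mean=("hour", "mean"),
    browser__mode=("browser", lambda s: s.mode().iloc[0]),
)
```

Taking the mean of a boolean mask gives the proportion of rows that satisfy the condition, which keeps each behavioral attribute to a single expression.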

In [29]:
dtypes_profile = dict(
    stat=pd.StringDtype(),
    stat_type=pd.StringDtype(),
    High=pd.StringDtype(),
    Low=pd.StringDtype(),
    Medium=pd.StringDtype(),
    feature=pd.StringDtype(),
    audience_strategy=pd.Int8Dtype(),
)

In [30]:
#| echo: true
df_profile = (
    pa.get_audience_profile(df_infer_audience_groups, audience_strategy)
    .pipe(th.set_datatypes, dtypes_profile)
)
with pd.option_context("display.max_columns", 1000):
    display(df_profile)

Set all specified datatypes.
Set all specified datatypes.


Unnamed: 0,stat,stat_type,High,Low,Medium,feature,audience_strategy
0,hour__mean,mean,12.969797269342159,13.0235829540753,13.038344827586206,hour,1
1,day_of_week__mean,mean,3.9863467107985104,3.9991725279271826,4.027034482758621,day_of_week,1
2,source__mode,mode,google,google,google,source,1
3,medium__mode,mode,organic,organic,organic,medium,1
4,channelGrouping__mode,mode,Organic Search,Organic Search,Organic Search,channelGrouping,1
5,last_action__mode,mode,Unknown,Unknown,Unknown,last_action,1
6,browser__mode,mode,Chrome,Chrome,Chrome,browser,1
7,os__mode,mode,Macintosh,Macintosh,Macintosh,os,1
8,deviceCategory__mode,mode,desktop,desktop,desktop,deviceCategory,1
9,hits__mean,mean,6.290580609571093,6.426148117501034,6.3484137931034486,hits,1


::: {.callout-note title="Notes"}

1. The audience profile consists of different types of statistics about the first visit, for each audience group.
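2. The `stat` column names each statistic in the form `<attribute>__<statistic>` (for example, `hour__mean`), and the `feature` column shows the attribute from the first visit in the inference data for which the statistic is calculated.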
3. The `stat_type` column indicates the type of statistic.
4. The *High*, *Medium* and *Low* columns (or just the *High* column, if the single-group audience strategy is being used) show the statistic of each attribute *within* each audience group.
:::
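
To make this layout concrete, the sketch below builds a small table with the same columns (`stat`, `stat_type`, one column per audience group, `feature`) from toy data. The data and the reshaping steps are assumptions chosen for illustration and are not taken from `pa.get_audience_profile`.

```python
import pandas as pd

df_toy = pd.DataFrame(
    {
        "maudience": ["High", "High", "Low", "Low"],
        "hour": [10, 14, 9, 13],
        "pageviews": [1, 11, 4, 2],
    }
)

# per-group statistics; columns are a MultiIndex of (feature, stat_type)
stats = df_toy.groupby("maudience")[["hour", "pageviews"]].agg(["mean", "max"])

# transpose so that each row is one statistic and each column one group
wide = stats.T
wide.index.names = ["feature", "stat_type"]
wide = wide.reset_index()
wide.columns.name = None

# name each statistic as <feature>__<stat_type> and order the columns
wide["stat"] = wide["feature"] + "__" + wide["stat_type"]
wide = wide[["stat", "stat_type", "High", "Low", "feature"]]

# store group values as strings so numeric and text statistics can coexist
wide = wide.astype({"High": "string", "Low": "string"})
```

Storing the group columns as strings is one way to hold numeric statistics (such as a mean) and text statistics (such as a mode) in the same column, which matches the string dtypes specified in `dtypes_profile`.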

### Conversion Rates per Audience Group and Overall

In [31]:
# development (test split)
df_test_best_run_conv_rates = (
    df_all
    .query("split_type == 'test'")
    # get label using predicted probability
    .assign(label=lambda df: df['score']>0.5)
    # create audience groups
    .pipe(ash.sort_scores, False)
    .pipe(ash.get_audience_groups_by_propensity, audience_groups_strategy_1["num_propens_groups"])
    .assign(audience_strategy=audience_strategy)
    # get audience group name
    .assign(
        maudience=lambda df: df['group_number'].map(
            dict(
                zip(
                    list(range(audience_groups_strategy_1["num_propens_groups"])),
                    audience_groups_strategy_1['propens_group_labels'],
                )
            )
        )
    )
    # get conversions, number of visitors and lowest predicted probability per audience group
    .groupby(['audience_strategy', 'maudience'], as_index=False)
    .agg({"label": "sum", "fullvisitorid": "count", "score": "min", label: "sum"})
    .rename(
        columns={
            "label": "pred_conversions",
            "fullvisitorid": "total_visitors",
            label: "true_conversions",
            "score": "min_score",
        }
    )
    # append metadata
    .assign(data_type='development')
    .assign(data_size=lambda df: df['total_visitors'].sum())
    # calculate true conversion rates per audience group and overall
    .assign(true_conv_rate=lambda df: 100*df['true_conversions'] / df['total_visitors'])
    .assign(overall_true_conv_rate=lambda df: 100*df['true_conversions'].sum()/df['data_size'].max())
    # calculate predicted conversion rates per audience group and overall
    .assign(pred_conv_rate=lambda df: 100*df['pred_conversions'] / df['total_visitors'])
    .assign(overall_pred_conv_rate=lambda df: 100*df['pred_conversions'].sum()/df['data_size'].max())
)

# inference
df_infer_conv_rates = (
    df_infer_audience_groups
    # get label using predicted probability
    .assign(label=lambda df: df['score']>0.5)
    # get conversions, number of visitors and lowest predicted probability per audience group
    .groupby(['audience_strategy', 'maudience'], as_index=False)
    .agg({"label": "sum", "fullvisitorid": "count", "score": "min"})
    .rename(columns={"label": "pred_conversions", "fullvisitorid": "total_visitors", "score": "min_score"})
    # append metadata
    .assign(data_type='inference')
    .assign(data_size=lambda df: df['total_visitors'].sum())
    # calculate predicted conversion rates per audience group and overall
    .assign(pred_conv_rate=lambda df: 100*df['pred_conversions'] / df['total_visitors'])
    .assign(overall_pred_conv_rate=lambda df: 100*df['pred_conversions'].sum()/df['data_size'].max())
)

# combined
df_conv_rates = (
    pd.concat([df_test_best_run_conv_rates, df_infer_conv_rates], ignore_index=True)
    .astype(
        {
            "maudience": pd.StringDtype(),
            "audience_strategy": pd.Int8Dtype(),
            "pred_conversions": pd.Int16Dtype(),
            "total_visitors": pd.Int16Dtype(),
            "min_score": pd.Float32Dtype(),
            "true_conversions": pd.Int16Dtype(),
            "data_type": pd.StringDtype(),
            "data_size": pd.Int16Dtype(),
            "true_conv_rate": pd.Float32Dtype(),
            "overall_true_conv_rate": pd.Float32Dtype(),
            "pred_conv_rate": pd.Float32Dtype(),
            "overall_pred_conv_rate": pd.Float32Dtype(),
        }
    )
)
df_conv_rates

Unnamed: 0,audience_strategy,maudience,pred_conversions,total_visitors,min_score,true_conversions,data_type,data_size,true_conv_rate,overall_true_conv_rate,pred_conv_rate,overall_pred_conv_rate
0,1,High,1289,6722,0.164389,142.0,development,20164,2.112467,2.30609,19.17584,6.392581
1,1,Low,0,6721,0.0,161.0,development,20164,2.395477,2.30609,0.0,6.392581
2,1,Medium,0,6721,0.031337,162.0,development,20164,2.410356,2.30609,0.0,6.392581
3,1,High,1408,7251,0.164693,,inference,21752,,,19.418011,6.472968
4,1,Low,0,7251,0.0,,inference,21752,,,0.0,6.472968
5,1,Medium,0,7250,0.031879,,inference,21752,,,0.0,6.472968


::: {.callout-tip title="Observations"}

1. Between the development and inference datasets, the
   - overall predicted conversion rates (`overall_pred_conv_rate`)
   - predicted conversion rates (`pred_conv_rate`) for each audience group
   - minimum score (`min_score`) for each audience group

   are in good agreement with each other.
2. At the time of generating the (forward-looking) propensity predictions using the inference data, the true outcome is not known. So, the true number of converted visitors (`true_conversions`) and true conversion rate (`true_conv_rate`) for each audience group and overall true conversion rate (`overall_true_conv_rate`) are not known. Only their predicted values are known.
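3. The inaccuracy of the ML model is seen from the difference between the true and predicted conversion rates, both overall and for each audience group. In the development data, the overall true conversion rate is 2.31% while the overall predicted rate is 6.39%, and for the *High* group the true rate is 2.11% while the predicted rate is 19.18%.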
:::
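
The arithmetic behind these rates can be checked by hand. The sketch below takes the development counts from the table above and recomputes the per-group and overall rates as percentages. Only the variable names are new.

```python
import pandas as pd

# development (test split) counts copied from the table above
dev = pd.DataFrame(
    {
        "maudience": ["High", "Low", "Medium"],
        "pred_conversions": [1289, 0, 0],
        "true_conversions": [142, 161, 162],
        "total_visitors": [6722, 6721, 6721],
    }
)

# total number of visitors across all audience groups
data_size = dev["total_visitors"].sum()

# per-group rates: conversions divided by visitors in the same group
dev["true_conv_rate"] = 100 * dev["true_conversions"] / dev["total_visitors"]
dev["pred_conv_rate"] = 100 * dev["pred_conversions"] / dev["total_visitors"]

# overall rates: total conversions divided by total visitors
overall_true_conv_rate = 100 * dev["true_conversions"].sum() / data_size
overall_pred_conv_rate = 100 * dev["pred_conversions"].sum() / data_size
```

Because only the *High* group contains scores above the 0.5 threshold, all predicted conversions come from that group, and the overall predicted rate is roughly a third of the *High* group's rate.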

### Fraction of Audience Estimated and Actually Assigned to Random Cohorts

In [32]:
# development (test split)
df_sa_frac_test = (
    # create grid of audience group and cohort type
    pd.DataFrame.from_records(
        [
            {"maudience": 'High', 'cohort': 'Test'},
            {"maudience": 'High', 'cohort': 'Control'},
            {"maudience": 'Medium', 'cohort': 'Test'},
            {"maudience": 'Medium', 'cohort': 'Control'},
            {"maudience": 'Low', 'cohort': 'Test'},
            {"maudience": 'Low', 'cohort': 'Control'},
        ]
    ).merge(
        # within each audience group, get conversion rate per cohort and
        # get fraction of visitors in each audience group that are assigned
        # to a cohort
        # # get conversion rate per cohort and audience group
        ch.load_file_from_mlflow_artifact(
            df_deployment_candidate_mlflow_models, "audience_sample_sizes"
        )
        .pipe(ch.get_sample_sizes_by_strategy, audience_strategy)
        .pipe(ch.get_suitable_sample_sizes, bin_size_infer_control)
        .pipe(ch.get_sample_sizes_with_all_audience_groups, num_bins)
        .pipe(ch.get_required_inputs, query_inputs)
        # # calculate fraction
        .assign(samp_to_aud_frac=lambda df: 100*df['required_sample_size']/df['group_size'])
        .rename(columns={"required_sample_size": "size"})
        # # select required columns
        [['maudience', 'size', 'group_size', 'uplift', 'power', 'ci_level', 'samp_to_aud_frac']],
        on='maudience',
        how='left'
    )
    # append metadata
    .assign(size_type='required')
    .assign(data_type='development')
    .assign(data_size=lambda df: (df.groupby('data_type')['group_size'].transform('sum')/2).astype(pd.Int16Dtype()))
)

# inference
df_sa_frac_infer = (
    pd.concat(
       [
           # get size of cohorts within each audience group
           df_infer_audience_groups.groupby(['maudience', 'cohort'], as_index=False).size()
           # get audience group sizes
           .merge(
               (
                   df_infer_audience_groups
                   .groupby(['maudience'], as_index=False)['fullvisitorid']
                   .count()
                   .rename(columns={"fullvisitorid": "group_size"})
               ),
               on='maudience',
           )
           # # calculate fraction
           .assign(samp_to_aud_frac=lambda df: 100*df['size']/df['group_size']),
           pd.DataFrame.from_dict(wanted_inputs, orient='index').transpose()
       ], axis=1
    )
    # # append required inputs that were used to estimate required sample size per
    # # audience group
    .assign(uplift=lambda df: df['uplift_percentage'].ffill().astype(pd.Int8Dtype()))
    .assign(power=lambda df: df['power_percentage'].ffill().astype(pd.Int8Dtype()))
    .assign(ci_level=lambda df: df['confidence_level_percentage'].ffill().astype(pd.Int8Dtype()))
    # # drop unwanted columns
    .drop(columns=['uplift_percentage', 'power_percentage', 'confidence_level_percentage'])
    # append metadata
    .assign(size_type='randomly selected')
    .assign(data_type='inference')
    .assign(data_size=lambda df: (df.groupby('data_type')['group_size'].transform('sum')/2).astype(pd.Int16Dtype()))
)

# combined
df_sa_frac = (
    pd.concat([df_sa_frac_test, df_sa_frac_infer], ignore_index=True)
    .astype(
        {
            "maudience": pd.StringDtype(),
            "cohort": pd.StringDtype(),
            "size": pd.Int16Dtype(),
            "group_size": pd.Int16Dtype(),
            "samp_to_aud_frac": pd.Float32Dtype(),
            "size_type": pd.StringDtype(),
            "data_type": pd.StringDtype(),
            "data_size": pd.Int16Dtype(),
        }
    )
)
df_sa_frac

Unnamed: 0,maudience,cohort,size,group_size,uplift,power,ci_level,samp_to_aud_frac,size_type,data_type,data_size
0,High,Test,2254,6721,10,55,55,33.536674,required,development,20164
1,High,Control,2254,6721,10,55,55,33.536674,required,development,20164
2,Medium,Test,1970,6721,10,55,55,29.311115,required,development,20164
3,Medium,Control,1970,6721,10,55,55,29.311115,required,development,20164
4,Low,Test,1982,6722,10,55,55,29.485271,required,development,20164
5,Low,Control,1982,6722,10,55,55,29.485271,required,development,20164
6,High,Control,2431,7251,10,55,55,33.526409,randomly selected,inference,21752
7,High,Test,2431,7251,10,55,55,33.526409,randomly selected,inference,21752
8,Low,Control,2137,7251,10,55,55,29.471798,randomly selected,inference,21752
9,Low,Test,2137,7251,10,55,55,29.471798,randomly selected,inference,21752


::: {.callout-tip title="Observations"}

1. It is reassuring that `samp_to_aud_frac` (the percentage of visitors in an audience group who are assigned to a cohort) agrees across the development and inference datasets. The required sizes (`size`) were estimated using the `test` split of the development data (`data_type`). These sample sizes were then scaled up to account for the difference between the sizes of the inference and development datasets (see the `group_size` column), and the scaled-up sample (cohort) sizes were converted to whole numbers. So, it is not surprising that the fractions agree closely across the two datasets (for example, 33.54% versus 33.53% for the *High* group).
:::
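
The scaling can be illustrated with the numbers shown in the outputs above. In the sketch below, each required size is multiplied by the ratio of the inference group size to the development group size. Rounding the result down reproduces the cohort sizes shown for the inference data. That the helper functions scale and round in exactly this way is an assumption. The dictionary layout and variable names are also hypothetical.

```python
# (required size, development group size, inference group size),
# copied from the tables in this notebook
inputs = {
    "High": (2254, 6721, 7251),
    "Medium": (1970, 6721, 7250),
    "Low": (1982, 6722, 7251),
}

scaled_sizes = {}
fractions = {}
for group, (required, dev_group_size, infer_group_size) in inputs.items():
    # scale by the ratio of group sizes and round down (assumed rounding)
    scaled = (required * infer_group_size) // dev_group_size
    scaled_sizes[group] = scaled
    # cohort size as a percentage of the audience group size
    fractions[group] = (
        100 * required / dev_group_size,  # development
        100 * scaled / infer_group_size,  # inference
    )
```

Since the same ratio is applied to each cohort size and its audience group, the percentages before and after scaling can differ only through the rounding step.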

## Export to Disk and ML Experiment Tracking

Get the best MLflow run ID

In [33]:
#| echo: true
best_run_id = df_deployment_candidate_mlflow_models.squeeze()["run_id"]
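
The `.squeeze()` call relies on the `DataFrame` holding exactly one row. The sketch below shows this behavior on a toy `DataFrame` with a made-up run ID and a hypothetical `score` column. These values are assumptions and are not taken from the tracking server.

```python
import pandas as pd

# a single-row DataFrame, standing in for the deployment candidate
df_one = pd.DataFrame({"run_id": ["abc123"], "score": [0.9]})
row = df_one.squeeze()  # one row -> Series indexed by column name
run_id = row["run_id"]  # plain Python string

# with more than one row, squeeze() leaves the DataFrame unchanged,
# so selecting "run_id" returns a Series rather than a single value
df_two = pd.DataFrame({"run_id": ["abc123", "def456"], "score": [0.9, 0.8]})
run_ids = df_two.squeeze()["run_id"]
```

If more than one candidate run could ever be present, selecting the row explicitly before reading `run_id` would avoid passing a `Series` where a single ID is expected.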

### Audience Cohorts

::: {.content-hidden}
Show summary of `DataFrame` with audience cohorts
:::

In [34]:
#| output: false
ut.summarize_df(df_infer_audience_groups)

Unnamed: 0,column,dtype,missing
0,fullvisitorid,string[python],0
1,visitId,string[python],0
2,visitNumber,Int8,0
3,visitStartTime,datetime64[ns],0
4,quarter,Int8,0
5,month,Int8,0
6,day_of_month,Int8,0
7,day_of_week,Int8,0
8,hour,Int8,0
9,minute,Int8,0


::: {.callout-note title="Notes"}

1. The `cohort` column has missing values for visitors who were not assigned to either the test or control groups. This is expected.
:::
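
A summary of this kind (column name, datatype and number of missing values) can be sketched in a few lines of pandas, as shown below. The function `summarize` and the toy data are hypothetical, and `ut.summarize_df` may be implemented differently.

```python
import pandas as pd


def summarize(df):
    """Return one row per column with its dtype and missing-value count."""
    return pd.DataFrame(
        {
            "column": df.columns,
            "dtype": [str(dtype) for dtype in df.dtypes],
            "missing": df.isna().sum().to_numpy(),
        }
    )


# toy cohorts: two visitors were not assigned to a cohort
df_toy = pd.DataFrame(
    {
        "fullvisitorid": pd.array(["1", "2", "3", "4"], dtype="string"),
        "cohort": pd.array(["Test", "Control", pd.NA, pd.NA], dtype="string"),
    }
)
summary = summarize(df_toy)
missing_cohort = summary.loc[summary["column"] == "cohort", "missing"].item()
```

Here `missing_cohort` counts the unassigned visitors, which is the expected source of missing values in the `cohort` column.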

Export to disk and log exported file as MLflow artifact

In [35]:
#| echo: true
#| output: false
ut.export_and_track(
    os.path.join(
        processed_data_dir,
        f"audience_cohorts__run_"
        f"{best_run_id}__"
        f"infer_month_{month_name[1:][X_infer['month'].iloc[-1] - 1]}__"
        f"{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet.gzip",
    ),
    df_infer_audience_groups,
    (
        "inference audience cohorts for "
        f"{month_name[1:][X_infer['month'].iloc[-1] - 1]}"
    ),
    best_run_id,
)

Exported inference audience cohorts for March to file audience_cohorts__run_6a38156c7fbb4c289d7e4d1ba6b149b5__infer_month_March__20230702_194343.parquet.gzip
Logged inference audience cohorts for March as artifact in file audience_cohorts__run_6a38156c7fbb4c289d7e4d1ba6b149b5__infer_month_March__20230702_194343.parquet.gzip
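
The exported file name combines a prefix, the run ID, the inference month and a timestamp. The sketch below rebuilds the name shown in the output above from these parts. The helper `build_export_name` is hypothetical and is not part of `utils`. It also shows that `month_name[1:][m - 1]` and `month_name[m]` return the same month name.

```python
from calendar import month_name
from datetime import datetime


def build_export_name(prefix, run_id, month_number, now=None):
    """Assemble <prefix>__run_<id>__infer_month_<Month>__<timestamp>.parquet.gzip."""
    now = now or datetime.now()
    # month_name[0] is an empty string, so month numbers index it directly
    month = month_name[month_number]
    return (
        f"{prefix}__run_{run_id}__"
        f"infer_month_{month}__"
        f"{now.strftime('%Y%m%d_%H%M%S')}.parquet.gzip"
    )


fname = build_export_name(
    "audience_cohorts",
    "6a38156c7fbb4c289d7e4d1ba6b149b5",
    3,
    now=datetime(2023, 7, 2, 19, 43, 43),
)
same_month = month_name[1:][3 - 1] == month_name[3]
```

Passing the timestamp in as an argument keeps the name reproducible, which is convenient when several files exported in the same run should share one timestamp.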


### Audience Profiles

::: {.content-hidden}
Show summary of `DataFrame` with audience profiles
:::

In [36]:
#| output: false
ut.summarize_df(df_profile)

Unnamed: 0,column,dtype,missing
0,stat,string[python],0
1,stat_type,string[python],0
2,High,string[python],0
3,Low,string[python],0
4,Medium,string[python],0
5,feature,string[python],0
6,audience_strategy,Int8,0


::: {.callout-note title="Notes"}

1. The behavioral attributes in the profile are not specific to an individual column. So, the `feature` column in the profile `DataFrame` may have missing values for these attributes.
:::

Export to disk and log exported file as MLflow artifact

In [37]:
#| echo: true
#| output: false
ut.export_and_track(
    os.path.join(
        processed_data_dir,
        f"audience_profiles__run_"
        f"{best_run_id}__"
        f"infer_month_{month_name[1:][X_infer['month'].iloc[-1] - 1]}__"
        f"{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet.gzip",
    ),
    df_profile,
    (
        "inference audience profiles for "
        f"{month_name[1:][X_infer['month'].iloc[-1] - 1]}"
    ),
    best_run_id,
)

Exported inference audience profiles for March to file audience_profiles__run_6a38156c7fbb4c289d7e4d1ba6b149b5__infer_month_March__20230702_194343.parquet.gzip
Logged inference audience profiles for March as artifact in file audience_profiles__run_6a38156c7fbb4c289d7e4d1ba6b149b5__infer_month_March__20230702_194343.parquet.gzip


### Conversion Rates per Audience Group and Overall

::: {.content-hidden}
Show summary of `DataFrame` with overall and per-audience conversion rates
:::

In [38]:
#| output: false
ut.summarize_df(df_conv_rates)

Unnamed: 0,column,dtype,missing
0,audience_strategy,Int8,0
1,maudience,string[python],0
2,pred_conversions,Int16,0
3,total_visitors,Int16,0
4,min_score,Float32,0
5,true_conversions,Int16,3
6,data_type,string[python],0
7,data_size,Int16,0
8,true_conv_rate,Float32,3
9,overall_true_conv_rate,Float32,3


Export to disk and log exported file as MLflow artifact

In [39]:
#| echo: true
#| output: false
ut.export_and_track(
    os.path.join(
        processed_data_dir,
        f"audience_development_inference_conversion_rates__run_"
        f"{best_run_id}__"
        f"infer_month_{month_name[1:][X_infer['month'].iloc[-1] - 1]}__"
        f"{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet.gzip",
    ),
    df_conv_rates,
    (
        "inference and development conversion rates for "
        f"{month_name[1:][X_infer['month'].iloc[-1] - 1]}"
    ),
    best_run_id,
)

Exported inference and development conversion rates for March to file audience_development_inference_conversion_rates__run_6a38156c7fbb4c289d7e4d1ba6b149b5__infer_month_March__20230702_194343.parquet.gzip
Logged inference and development conversion rates for March as artifact in file audience_development_inference_conversion_rates__run_6a38156c7fbb4c289d7e4d1ba6b149b5__infer_month_March__20230702_194343.parquet.gzip


### Estimated and True Fractions of Audience Assigned to Random Cohorts

::: {.content-hidden}
Show summary of `DataFrame` with cohort fractions
:::

In [40]:
#| output: false
ut.summarize_df(df_sa_frac)

Unnamed: 0,column,dtype,missing
0,maudience,string[python],0
1,cohort,string[python],0
2,size,Int16,0
3,group_size,Int16,0
4,uplift,Int8,0
5,power,Int8,0
6,ci_level,Int8,0
7,samp_to_aud_frac,Float32,0
8,size_type,string[python],0
9,data_type,string[python],0


Export to disk and log exported file as MLflow artifact

In [41]:
#| echo: true
#| output: false
ut.export_and_track(
    os.path.join(
        processed_data_dir,
        f"cohort_audience_fractions__run_"
        f"{best_run_id}__"
        f"infer_month_{month_name[1:][X_infer['month'].iloc[-1] - 1]}__"
        f"{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet.gzip",
    ),
    df_sa_frac,
    (
        "inference and development cohort-to-audience fractions for "
        f"{month_name[1:][X_infer['month'].iloc[-1] - 1]}"
    ),
    best_run_id,
)

Exported inference and development cohort-to-audience fractions for March to file cohort_audience_fractions__run_6a38156c7fbb4c289d7e4d1ba6b149b5__infer_month_March__20230702_194343.parquet.gzip
Logged inference and development cohort-to-audience fractions for March as artifact in file cohort_audience_fractions__run_6a38156c7fbb4c289d7e4d1ba6b149b5__infer_month_March__20230702_194343.parquet.gzip


## Next Step

The next step will add to the audience profiles, by exploring the best ML model's predictions of propensity for first-time visitors to the store during the unseen (inference) data period.