# Get Randomized Audience Cohorts From Inference

In [1]:
%load_ext autoreload
%autoreload 2

::: {.content-hidden}
Import necessary Python modules
:::

In [2]:
import os
import sys
from calendar import month_name
from datetime import datetime

import mlflow.sklearn
import pandas as pd

::: {.content-hidden}
Get relative path to project root directory
:::

In [3]:
PROJ_ROOT_DIR = os.path.join(os.pardir)
src_dir = os.path.join(PROJ_ROOT_DIR, "src")

sys.path.append(src_dir)

::: {.content-hidden}
Import custom Python modules
:::

In [4]:
%aimport audience_size_helpers
import audience_size_helpers as ash

%aimport bigquery_auth_helpers
from bigquery_auth_helpers import auth_to_bigquery

%aimport cohorts
import cohorts as ch

%aimport model_helpers
import model_helpers as modh

%aimport profile_audience
import profile_audience as pa

%aimport sql_helpers
import sql_helpers as sqlh

%aimport transform_helpers
import transform_helpers as th

%aimport utils
import utils as ut

## About

### Overview

This step generates the randomized test and control cohorts needed to run a marketing campaign. The impact of the campaign on conversions (the KPI) will be assessed with an A/B test at the end of the campaign, and the randomized cohorts are required to conduct this test. This is step 4 of a [typical A/B Testing workflow](https://www.datacamp.com/blog/data-demystified-what-is-a-b-testing).

### Order of Operations
This step can be run prospectively at the end of the inference period, just before the start of the campaign, when all the inference data (first-time visitors to the store) becomes available.

For the current use-case, the required sizes of one or more audience groups have been determined in the previous step. Recall that an audience strategy determines if one or more groups are used. For a strategy consisting of a single audience group, only the visitors predicted to have a high propensity to make a purchase on a return visit are selected to participate in the campaign, so randomized cohorts are drawn from this single group. For multiple audience groups, namely visitors predicted to have a low, medium or high propensity, randomized cohorts are drawn from each such group.

With this in mind, this step first assigns all first-time visitors to the store during the inference period to an audience group based on the required audience strategy (single or multiple audience groups).

Next, before generating audience cohorts, features in the unseen (inference) data are checked for data drift relative to the test data used during ML model development.

1. Use statistical tests that are called as part of EvidentlyAI's data drift monitor ([1](https://www.evidentlyai.com/blog/tutorial-3-historical-data-drift), [2](https://github.com/evidentlyai/evidently/blob/main/examples/integrations/mlflow_logging/historical_drift_visualization.ipynb)) [for tabular data](https://docs.evidentlyai.com/user-guide/customization/options-for-statistical-tests#tabular-drift-detection). The following [statistical tests and logic are used by EvidentlyAI](https://docs.evidentlyai.com/reference/data-drift-algorithm#tabular-data) and are also implemented here
   - numerical features with less than or equal to 1,000 observations (this is not used here since the inference data consists of more than 1,000 observations during the inference data period)
     - [Kolmogorov-Smirnov goodness-of-fit Test (two-sample version)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest)
   - numerical features with more than 1,000 observations (this is the case here since the inference data has more than 1,000 first-time visitors during the inference data period)
     - [Wasserstein distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html#scipy.stats.wasserstein_distance)
   - categorical features with less than or equal to 1,000 observations (not used here)
     - [chi-squared goodness-of-fit test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html#scipy.stats.chi2)
   - categorical features with more than 1,000 observations (used here)
     - [Jensen-Shannon distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html#scipy.spatial.distance.jensenshannon)

   Here, a custom function is defined to manually implement these tests on the unseen (inference) data.
2. Perform sanity checks on the data, by directly using EvidentlyAI's data test suite ([1](https://docs.evidentlyai.com/get-started/tutorial#7.-run-data-stability-tests), [2](https://docs.evidentlyai.com/user-guide/tests-and-reports/run-tests#using-test-presets)). A suite consisting of the following tests is applied to check that each of these properties of the unseen (inference) data matches the reference data (test split)
   - number of columns with missing values (columns with at least one missing value)
   - number of rows with missing values (rows with at least one missing value)
   - number of constant (single-valued) columns
   - number of duplicated rows (rows that are duplicates)
   - number of duplicated columns (columns that are duplicates)
   - column datatypes
   - number of empty columns (all values in a column are missing)
   - number of empty rows (all values in a row are missing)
   - number of columns
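
The test-selection logic from item 1 above can be sketched on made-up data as follows. This is a minimal illustration only: the helper `pick_drift_metric` and the toy samples are hypothetical, the 1,000-observation cutoff is taken from the list above, and this is not the custom function defined later in this notebook.

```python
import numpy as np
from scipy import spatial, stats


def pick_drift_metric(feature_type: str, n_obs: int) -> str:
    """Return the drift metric named in the list above (hypothetical helper)."""
    small = n_obs <= 1_000
    if feature_type == "numerical":
        return "kolmogorov-smirnov" if small else "wasserstein"
    return "chi-squared" if small else "jensen-shannon"


# numerical feature: every value shifts by one unit between the two samples
ref_hits = np.array([1.0, 2.0, 3.0, 4.0])
cur_hits = ref_hits + 1.0
wsd = stats.wasserstein_distance(ref_hits, cur_hits)

# categorical feature: compare the relative frequency of each category
ref_freq = np.array([0.6, 0.3, 0.1])  # made-up shares of three categories
cur_freq = np.array([0.5, 0.3, 0.2])
jsd = spatial.distance.jensenshannon(ref_freq, cur_freq)

print(pick_drift_metric("numerical", 21_752), round(wsd, 3))
print(pick_drift_metric("categorical", 21_752), round(jsd, 3))
```

Both metrics are distances rather than p-values, so drift is flagged when the value rises above a chosen threshold.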

Finally, within each audience group, visitors are randomly assigned to test or control cohorts. A brief profile of each audience group is then performed using attributes of the visitors' first visit. The profile and the inference data with the audience group and cohort for each first-time visitor during the inference period can be used by the marketing team to design and implement the marketing campaign aimed at growing the customer base.

### Implementation

A custom Python module in `src/cohorts.py` has been developed to create the test and control audience cohorts based on the required sample sizes that were estimated in the previous step and logged as an MLFlow artifact (file) for the best MLFlow deployment candidate model.

The procedure followed in this step is briefly outlined below

1. the MLFlow artifact (file) is loaded in order to access the required sample sizes for each
   - audience strategy
   - desired combination of effect sizes (power, confidence level and uplift)
2. the file is filtered based on the
   - audience strategy
   - desired combination of effect sizes (see `wanted_inputs` from **User Inputs**)
3. inference data (first-time visitors) is retrieved and the best ML model is used to make inference predictions of the propensity of these visitors to make a purchase on a return (future) visit to the store.
4. propensities are used to assign first-time visitors to audience groups (or bins)
   - if the desired audience strategy is for a single audience group, then visitors are placed into a single bin
   - if the desired audience strategy is for multiple audience groups, then multiple bins are created
5. for each bin (audience group), visitors are randomly placed into test and control cohorts
6. the inference data with
   - predicted propensity (probability)
   - audience group
   - cohort (test or control)

   is then logged as an MLFlow artifact for use by the marketing team to design and build the campaign.
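
Step 5 of the procedure (random assignment within each audience group) can be sketched as below. This is a minimal sketch under stated assumptions, not the implementation in `src/cohorts.py`: the function `assign_cohorts`, the cohort labels `"Test"` and `"Control"`, and the toy visitor data are all hypothetical.

```python
import numpy as np
import pandas as pd


def assign_cohorts(df, group_col, cohort_size, seed=42):
    """Randomly draw equal-sized test and control cohorts within each group."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out["cohort"] = pd.Series(pd.NA, index=out.index, dtype="string")
    for _, row_labels in out.groupby(group_col).groups.items():
        # sample without replacement so no visitor lands in both cohorts
        chosen = rng.choice(
            row_labels.to_numpy(), size=2 * cohort_size, replace=False
        )
        out.loc[chosen[:cohort_size], "cohort"] = "Test"
        out.loc[chosen[cohort_size:], "cohort"] = "Control"
    return out


# 18 made-up visitors, 6 per audience group
visitors = pd.DataFrame(
    {
        "fullvisitorid": [f"v{i}" for i in range(18)],
        "maudience": ["High"] * 6 + ["Medium"] * 6 + ["Low"] * 6,
    }
)
cohorts = assign_cohorts(visitors, "maudience", cohort_size=2)
print(cohorts.groupby(["maudience", "cohort"]).size())
```

Visitors that are not drawn keep a missing `cohort` value, so the number of assigned visitors per group is exactly twice the cohort size.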

## User Inputs

Define the following

1. start and end dates for inference data
2. name of column containing label (outcome)
3. list of categorical features
4. list of numerical features, including categorical features present in the raw data as integers
5. `audience_groups`
   - desired audience groups into which the first-time visitors propensities will be placed
     - `num_propens_groups` specifies the number of desired audience propensity groups (low, medium and high)
     - `propens_group_labels` specifies names of the desired audience propensity groups
6. inputs (uplift, power, confidence level) for which random cohort sizes are to be created
7. type of audience strategy (single- or multi- group) from which to create cohorts

In [5]:
#| echo: true
# 1. start and end dates
infer_start_date = "20170301"
infer_end_date = "20170331"

# 2. label column
label = "made_purchase_on_future_visit"

# 3. categorical columns
categoricals=[
    'bounces',
    'source',
    'medium',
    'channelGrouping',
    'last_action',
    'browser',
    'os',
    'deviceCategory',
]

# 4. numerical columns
numericals=[
    'hits',
    'pageviews',
    'promos_displayed',
    'product_views',
    'product_clicks',
    'time_on_site',
]

# 5. audience groups for each audience strategy
audience_groups_strategy_1 = {
    "num_propens_groups": 3,
    "propens_group_labels": ["High", "Medium", "Low"],
}
audience_groups_strategy_2 = {
    "num_propens_groups": 3,
    "propens_group_labels": ["High", "High-Medium", "High-Medium-Low"],
}

# 6. wanted inputs for estimating sample sizes
wanted_inputs = {
    "uplift_percentage": 10,
    "power_percentage": 55,
    "confidence_level_percentage": 55,
}

# 7. type of audience strategy to use when creating groups
audience_strategy = 1

::: {.content-hidden}
Get path to data sub-folders
:::

In [6]:
data_dir = os.path.join(PROJ_ROOT_DIR, "data")
raw_data_dir = os.path.join(data_dir, "raw")
processed_data_dir = os.path.join(data_dir, "processed")

Create a mapping between audience group number (0, 1, 2) and group name

In [7]:
#| echo: true
mapper_dict_audience_strategy_1 = dict(
    zip(
        range(audience_groups_strategy_1["num_propens_groups"]),
        audience_groups_strategy_1["propens_group_labels"],
    )
)
mapper_dict_audience_strategy_2 = dict(
    zip(
        range(audience_groups_strategy_2["num_propens_groups"]),
        audience_groups_strategy_2["propens_group_labels"],
    )
)
print(mapper_dict_audience_strategy_1)
print(mapper_dict_audience_strategy_2)

{0: 'High', 1: 'Medium', 2: 'Low'}
{0: 'High', 1: 'High-Medium', 2: 'High-Medium-Low'}


Get desired effect size queries and number of audience groups

In [8]:
#| echo: true
if audience_strategy == 1:
    query_inputs = (
        f"(uplift == {wanted_inputs['uplift_percentage']}) & "
        f"(power == {wanted_inputs['power_percentage']}) & "
        f"(ci_level == {wanted_inputs['confidence_level_percentage']})"
    )
else:
    query_inputs = (
        f"(group_size_proportion < 34) & "
        f"(uplift == {wanted_inputs['uplift_percentage']}) & "
        f"(power == {wanted_inputs['power_percentage']}) & "
        f"(ci_level == {wanted_inputs['confidence_level_percentage']})"
    )

num_bins = (
    audience_groups_strategy_1["num_propens_groups"]
    if audience_strategy == 1
    else audience_groups_strategy_2["num_propens_groups"]
)
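
To illustrate how a query string of this form selects rows, the sketch below applies one to a small made-up table with `DataFrame.query`. It assumes the string is ultimately evaluated this way (the body of `ch.get_required_inputs` is not shown here). The `required_sample_size` column and all values are hypothetical.

```python
import pandas as pd

# made-up sample-size estimates; filter columns follow the query strings above
df_sizes = pd.DataFrame(
    {
        "uplift": [10, 10, 20],
        "power": [55, 80, 55],
        "ci_level": [55, 55, 55],
        "required_sample_size": [1200, 2600, 400],  # hypothetical column
    }
)
wanted = {
    "uplift_percentage": 10,
    "power_percentage": 55,
    "confidence_level_percentage": 55,
}
query = (
    f"(uplift == {wanted['uplift_percentage']}) & "
    f"(power == {wanted['power_percentage']}) & "
    f"(ci_level == {wanted['confidence_level_percentage']})"
)
# keep only the rows matching all three requested inputs
df_match = df_sizes.query(query)
print(df_match)
```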

::: {.content-hidden}
Define MLFlow storage paths
:::

In [9]:
mlruns_db_fpath = f"{raw_data_dir}/mlruns.db"
mlflow.set_tracking_uri(f"sqlite:///{mlruns_db_fpath}")

::: {.content-hidden}
Set environment variable to silence MLFlow `git` warning message
:::

In [10]:
os.environ["GIT_PYTHON_REFRESH"] = "quiet"

::: {.content-hidden}
Define a dictionary to specify datatypes of the transformed *inference* data
:::

In [11]:
dtypes_dict = {
    "fullvisitorid": pd.StringDtype(),
    "visitId": pd.StringDtype(),
    "visitNumber": pd.Int8Dtype(),
    "quarter": pd.Int8Dtype(),
    "month": pd.Int8Dtype(),
    "day_of_month": pd.Int8Dtype(),
    "day_of_week": pd.Int8Dtype(),
    "hour": pd.Int8Dtype(),
    "minute": pd.Int8Dtype(),
    "second": pd.Int8Dtype(),
    "source": pd.CategoricalDtype(),
    "medium": pd.CategoricalDtype(),
    "channelGrouping": pd.CategoricalDtype(),
    "hits": pd.Int16Dtype(),
    "bounces": pd.CategoricalDtype(),
    "last_action": pd.CategoricalDtype(),
    "promos_displayed": pd.Int16Dtype(),
    "promos_clicked": pd.Int16Dtype(),
    "product_views": pd.Int16Dtype(),
    "product_clicks": pd.Int16Dtype(),
    "pageviews": pd.Int16Dtype(),
    "time_on_site": pd.Int16Dtype(),
    "browser": pd.CategoricalDtype(),
    "os": pd.CategoricalDtype(),
    "added_to_cart": pd.Int16Dtype(),
    "deviceCategory": pd.CategoricalDtype(),
    "made_purchase_on_future_visit": pd.Int8Dtype(),
}

::: {.content-hidden}
Create a mapping between action type integer and label, in order to get meaningful names from the `action_type` column
:::

In [12]:
action_mapper = {
    1: "Click through of product lists",
    2: "Product detail views",
    3: "Add product(s) to cart",
    4: "Remove product(s) from cart",
    5: "Check out",
    6: "Completed purchase",
    7: "Refund of purchase",
    8: "Checkout options",
    0: "Unknown",
}

::: {.content-hidden}
## Authenticate to `BigQuery`
:::

In [13]:
gcp_auth_dict = auth_to_bigquery(raw_data_dir)

## Get Inference Data

In [14]:
#| echo: true
query_infer = sqlh.get_sql_query_infer(infer_start_date, infer_end_date)
X_infer, _ = th.extract_data(query_infer, gcp_auth_dict).pipe(
    th.transform_data,
    datatypes_dict={k:v for k,v in dtypes_dict.items() if k != label},
    duplicate_cols=["fullvisitorid"],
    column_mapper_dict={'last_action': action_mapper},
)
X_infer = X_infer.pipe(th.shuffle_data)

Query execution start time = 2023-05-28 12:09:53.112...done at 2023-05-28 12:10:00.562 (7.450 seconds).
Query returned 21,768 rows
Got 21,752 rows and 27 columns after dropping duplicates
Transformed data has 21,752 rows & 27 columns


Get the size of each audience cohort in the inference data

In [15]:
#| echo: true
bin_size_infer = len(X_infer) / num_bins
bin_size_infer_control = int(bin_size_infer / 2)
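
As a worked example of this arithmetic, the numbers below use the 21,752 rows reported for the transformed inference data and three audience groups. Reading the halving as leaving equal room for a test and a control cohort in each group is an interpretation, since `ch.get_suitable_sample_sizes` is not shown here.

```python
n_visitors = 21_752  # rows in the transformed inference data shown above
num_bins = 3         # number of audience groups

bin_size = n_visitors / num_bins      # visitors available per audience group
max_control_size = int(bin_size / 2)  # half of a group, rounded down

print(round(bin_size, 2), max_control_size)
```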

## Get Model

### Fetch Latest Version of Best Deployment Candidate Model from Model Registry

Get name of best deployment candidate model from model registry

In [16]:
#| echo: true
df_deployment_candidate_mlflow_models = modh.get_all_deployment_candidate_models()
best_run_model_name = modh.get_best_deployment_candidate_model(
    df_deployment_candidate_mlflow_models
)
with pd.option_context("display.max_colwidth", None):
    display(df_deployment_candidate_mlflow_models)

Unnamed: 0,name,description,run_id,tags,version,score
0,BetaDistClassifier_20160901_20170228_133892_feats__20230526_214629,Best Model based on fbeta2 score of 0.4953794479,6c229408d32348f7bf84c5f2c78fdb87,{'deployment-candidate': 'yes'},2,0.495379


::: {.content-hidden}
Load best deployment candidate model object
:::

In [17]:
model = mlflow.sklearn.load_model(model_uri=f"models:/{best_run_model_name}/latest")
model

### Get Data Used in Development of Best Deployment Candidate Model

Get all available data used during model development of the best deployment candidate model

In [18]:
df_all = modh.get_data_for_run_id(df_deployment_candidate_mlflow_models, 'processed_data')

::: {.content-hidden}
Perform sanity checks on features in inference data
:::

In [19]:
modh.check_data_per_best_expt_run(
    df_deployment_candidate_mlflow_models,
    df_all.drop(columns=["score", "predicted_label", "predicted_score_label", "split_type"]),
    X_infer,
    label,
)

### Check Data Drift

Get test split from the data used during development of the best deployment candidate model

In [21]:
#| echo: true
X_test_best_run = (
    df_all
    .query("split_type == 'test'")
    .drop(columns=['split_type', label])
)

In [33]:
#| output: false
with pd.option_context("display.max_columns", None):
    display(X_test_best_run.head(2))
    display(X_test_best_run.tail(2))

Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,score,predicted_score_label,predicted_label
113728,92786413416632932,1487747471,1,2017-02-21 23:11:11,1,2,21,3,23,11,11,google,organic,Organic Search,4,0,Unknown,9,0,2,0,4,29,Chrome,Macintosh,desktop,0,0.234413,True,0
113729,9246024261954442050,1487316329,1,2017-02-16 23:25:29,1,2,16,5,23,25,29,siliconvalley.about.com,referral,Referral,4,0,Unknown,9,0,3,0,4,29,Chrome,Android,mobile,0,0.041587,True,0


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,score,predicted_score_label,predicted_label
133890,4670578101675392339,1487109517,1,2017-02-14 13:58:37,1,2,14,3,13,58,37,facebook.com,referral,Social,4,0,Unknown,0,0,9,0,4,39,Chrome,Macintosh,desktop,0,0.285869,True,0
133891,5309434940178594263,1487076882,1,2017-02-14 04:54:42,1,2,14,3,4,54,42,youtube.com,referral,Social,4,0,Unknown,0,0,48,0,4,150,Chrome,Chrome OS,desktop,0,0.046238,True,0


In [335]:
import json
from typing import Dict, List, Union
from scipy import spatial, stats
from evidently.test_suite import TestSuite
import evidently.tests as eaits

def detect_features_drift(
    curr_data: pd.DataFrame,
    refer_data: pd.DataFrame,
    numericals: List[str],
    categoricals: List[str],
    kst_threshold: float = 0.05,
    kst_params: Dict[str, Union[str, int]] = dict(
        method='exact', N=20, alternative='two-sided'
    ),
    wsd_threshold: float=0.5,
    jsd_threshold: float=0.5,
) -> List[pd.DataFrame]:
    """Check numerical and categorical features for drift.

    Returns a two-element list: numerical summary, then categorical summary.
    """
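    categorical_stats_summary = []
    for c in categoricals:
        # get drift stats
        if len(curr_data) <= 1_000:
            # chi-squared test is not implemented (not needed for this data)
            csd = None
        else:
            # Jensen-Shannon distance compares probability vectors, so use the
            # relative frequency of every category seen in either dataset
            freq_curr = curr_data[c].astype(str).value_counts(normalize=True)
            freq_ref = refer_data[c].astype(str).value_counts(normalize=True)
            all_cats = freq_curr.index.union(freq_ref.index)
            jsd = spatial.distance.jensenshannon(
                freq_curr.reindex(all_cats, fill_value=0).to_numpy(),
                freq_ref.reindex(all_cats, fill_value=0).to_numpy(),
            )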
        # get summary stats
        summary_stats = {
            "feature_type": 'categorical',
            "feature": c,
            "nunique_ref": refer_data[c].nunique(),
            "nunique_curr": curr_data[c].nunique(),
            "metric_value": csd if len(curr_data) <= 1_000 else jsd,
            "metric_threshold": None if len(curr_data) <= 1_000 else jsd_threshold,
        }
        df_cat_stats = pd.DataFrame.from_dict(summary_stats, orient='index').transpose()
        # check if drift is present
        if len(curr_data) <= 1_000:
            df_cat_stats = pd.DataFrame()
        else:
            df_cat_stats = (
                df_cat_stats.assign(drift_detected=lambda df: df['metric_value'] > df['metric_threshold'])
                .assign(test_type='Jensen-Shannon')
            )
        categorical_stats_summary.append(df_cat_stats)

    numerical_stats_summary = []
    for c in numericals:
        # get descriptive stats
        feat_desc_stats = (
            refer_data[c].describe().rename('ref').to_frame().merge(
                curr_data[c].describe().rename('curr').to_frame(),
                left_index=True,
                right_index=True,
                how='left',
            )
            .assign(abs_diff=lambda df: df['ref']-df['curr'])
            .assign(pct_diff=lambda df: 100*((df['ref']-df['curr'])/df['ref']))
            .assign(feature=c)
        )
        # get drift stats
        if len(curr_data) <= 1_000:
            kst = stats.kstest(refer_data[c].to_numpy(), curr_data[c].to_numpy(), **kst_params)
        else:
            wsd = stats.wasserstein_distance(refer_data[c].to_numpy(), curr_data[c].to_numpy())
        # get summary stats
        summary_stats = {
            "feature_type": 'numerical',
            "feature": c,
            "nunique_ref": refer_data[c].nunique(),
            "nunique_curr": curr_data[c].nunique(),
            "pct_diff_mean": feat_desc_stats['pct_diff']['mean'],
            "pct_diff_std": feat_desc_stats['pct_diff']['std'],
            "abs_diff_mean": feat_desc_stats['abs_diff']['mean'],
            "abs_diff_std": feat_desc_stats['abs_diff']['std'],
            "metric_value": kst.pvalue if len(curr_data) <= 1_000 else wsd,
            "metric_threshold": kst_threshold if len(curr_data) <= 1_000 else wsd_threshold,
        }
        df_num_stats = pd.DataFrame.from_dict(summary_stats, orient='index').transpose()
        # check if drift is present
        if len(curr_data) <= 1_000:
            df_num_stats = (
                df_num_stats.assign(
                    reject_null=lambda df: df['metric_value'] < df['metric_threshold']
                ).assign(drift_detected=lambda df: df['reject_null'])
                .assign(test_type='Kolmogorov–Smirnov')
            )
            for k,v in kst_params.items():
                df_num_stats[k] = v
        else:
            df_num_stats = (
                df_num_stats.assign(
                    drift_detected=lambda df: df['metric_value'] > df['metric_threshold']
                ).assign(test_type='Wasserstein')
            )
        numerical_stats_summary.append(df_num_stats)

    df_num_stats = pd.concat(numerical_stats_summary or [pd.DataFrame()], ignore_index=True)
    df_cat_stats = pd.concat(categorical_stats_summary or [pd.DataFrame()], ignore_index=True)
    return [df_num_stats, df_cat_stats]

Check numerical features for drift between

1. ML development data (test split)
2. unseen data (inference)

using the Wasserstein distance

In [336]:
df_num_drift, _ = detect_features_drift(
    X_infer_eai.astype(dtypes_eai),
    X_test_best_run_eai.astype(dtypes_eai),
    numericals=numericals,
    categoricals=[],
    kst_threshold=0.05,
    kst_params=dict(
        method='exact',
        N=max(len(X_test_best_run_eai), len(X_infer_eai)),
        alternative='two-sided',
    ),
    wsd_threshold=1,
)
display(df_num_drift)

Unnamed: 0,feature_type,feature,nunique_ref,nunique_curr,pct_diff_mean,pct_diff_std,abs_diff_mean,abs_diff_std,metric_value,metric_threshold,drift_detected,test_type
0,numerical,hits,104,112,-2.796628,-16.205373,-0.172968,-1.569269,0.222792,1,False,Wasserstein
1,numerical,pageviews,78,94,-2.224936,-19.113836,-0.117477,-1.415658,0.163044,1,False,Wasserstein
2,numerical,promos_displayed,21,18,-9.691698,2.996109,-0.757057,0.316746,0.788434,1,False,Wasserstein
3,numerical,product_views,273,286,1.213478,-5.642963,0.28491,-2.017004,0.722241,1,False,Wasserstein
4,numerical,product_clicks,30,35,-6.7913,-13.547475,-0.043027,-0.246633,0.043027,1,False,Wasserstein
5,numerical,time_on_site,1583,1701,-7.353722,-6.516268,-13.146065,-25.677424,13.510787,1,True,Wasserstein


::: {.callout-tip title="Observations"}

1. ... .
:::
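
The Wasserstein distance is expressed in the units of the feature, so a single fixed threshold treats a wide-ranging feature such as `time_on_site` differently from a narrow count such as `product_clicks`. One way to make the distance scale-free is to divide it by the standard deviation of the reference data. The sketch below illustrates this on synthetic data. The normalization and the helper name `normalized_wasserstein` are assumptions for illustration and are not applied in the cell above.

```python
import numpy as np
from scipy import stats


def normalized_wasserstein(ref, curr, min_scale=1e-3):
    """Wasserstein distance divided by the reference standard deviation."""
    scale = max(float(np.std(ref)), min_scale)
    return stats.wasserstein_distance(ref, curr) / scale


rng = np.random.default_rng(0)
ref = rng.exponential(scale=100.0, size=5_000)  # synthetic reference sample
curr = ref * 1.1                                # synthetic 10% upward shift

# changing the unit of measurement (x60) changes the raw distance only
raw = stats.wasserstein_distance(ref, curr)
raw_rescaled = stats.wasserstein_distance(ref * 60, curr * 60)
norm = normalized_wasserstein(ref, curr)
norm_rescaled = normalized_wasserstein(ref * 60, curr * 60)

print(f"raw: {raw:.2f} vs {raw_rescaled:.2f}")
print(f"normalized: {norm:.4f} vs {norm_rescaled:.4f}")
```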

Perform sanity checks on all features between

1. ML development data (test split)
2. unseen data (inference)

In [304]:
%%time
dtypes_eai = {
    "bounces": str,
    'source': str,
    'medium': str,
    'channelGrouping': str,
    'last_action': str,
    'browser': str,
    'os': str,
    'deviceCategory': str,
}
ts_cols_eai = categoricals + numericals
tests = TestSuite(tests=[
    eaits.TestNumberOfColumnsWithMissingValues(),
    eaits.TestNumberOfRowsWithMissingValues(),
    eaits.TestNumberOfConstantColumns(),
    eaits.TestNumberOfDuplicatedRows(),
    eaits.TestNumberOfDuplicatedColumns(),
    eaits.TestColumnsType(),
    eaits.TestNumberOfEmptyColumns(),
    eaits.TestNumberOfEmptyRows(),
    eaits.TestNumberOfRowsWithMissingValues(),
    eaits.TestNumberOfColumns(),
])
tests.run(
    reference_data=X_test_best_run.astype(dtypes_eai)[ts_cols_eai],
    current_data=X_infer.astype(dtypes_eai)[ts_cols_eai],
)
df_stability = (
    pd.DataFrame.from_dict(tests.as_dict()['summary'], orient='index')
    .transpose()
    .assign(success=lambda df: pd.json_normalize(df['by_status']))
    .assign(len_curr_data=len(X_infer))
    .assign(len_ref_data=len(X_test_best_run))
    .assign(features_checked=json.dumps(ts_cols_eai))
    .assign(num_features_checked=len(ts_cols_eai))
    .drop(columns=['by_status'])
)
df_stability

CPU times: user 12.2 s, sys: 6.94 ms, total: 12.2 s
Wall time: 12.2 s


Unnamed: 0,all_passed,total_tests,success_tests,failed_tests,success,len_curr_data,len_ref_data,features_checked,num_features_checked
0,True,11,11,0,11,21752,20164,"[""bounces"", ""source"", ""medium"", ""channelGroupi...",14
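
For intuition about what these structural checks count, the quantities can be approximated with plain `pandas`. This is a rough sketch on a toy `DataFrame`, with a hypothetical helper `structure_summary`. It is not how EvidentlyAI implements its tests and it does not reproduce the pass or fail conditions of the suite.

```python
import numpy as np
import pandas as pd


def structure_summary(df: pd.DataFrame) -> dict:
    """Count the structural properties listed for the data test suite."""
    return {
        "columns_with_missing": int(df.isna().any(axis=0).sum()),
        "rows_with_missing": int(df.isna().any(axis=1).sum()),
        "constant_columns": int((df.nunique(dropna=False) <= 1).sum()),
        "duplicated_rows": int(df.duplicated().sum()),
        "duplicated_columns": int(df.T.duplicated().sum()),
        "empty_columns": int(df.isna().all(axis=0).sum()),
        "empty_rows": int(df.isna().all(axis=1).sum()),
        "n_columns": df.shape[1],
    }


toy = pd.DataFrame({"a": [1, 2, 2], "b": [np.nan, 1.0, 1.0], "c": [7, 7, 7]})
summary = structure_summary(toy)
print(summary)
```

Computing such a summary for both the reference and the current data and comparing the two dictionaries gives a quick manual cross-check of the suite's result.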


## Make Inference Predictions

Make inference predictions with best deployment candidate model

In [None]:
#| echo: true
y_infer_pred, y_infer_pred_proba = modh.make_inference(model, X_infer, label)
display(
    y_infer_pred.value_counts(normalize=True).rename('proportion').to_frame().merge(
        y_infer_pred.value_counts(normalize=False).rename('number').to_frame(),
        left_index=True,
        right_index=True,
        how='left',
    ).reset_index()
)

::: {.callout-note title="Notes"}

1. Per the business use-case, these are predictions of whether a visitor will make a purchase during a later (return) visit to the merchandise store. Such predictions are made using attributes (features) of the first visit by visitors to the store and they can only be evaluated at a later time (after the outcome of the same visitor's later visit is known).
:::

## Create Cohorts

Perform the following to prepare the inference observations for extracting cohorts

1. combine inference features and predicted hard and soft labels into single `DataFrame`
2. sort predictions in ascending order of the predicted probability (propensity) (score)
3. separate the predictions into bins, using the predicted probability, based on the required number of bins
4. rename the `group_number` column to `maudience` (for use in downstream step)
5. set datatypes for columns

In [None]:
#| echo: true
df_infer_pred = (
    ch.combine_infer_data(X_infer, y_infer_pred, y_infer_pred_proba)
    .pipe(ash.sort_scores, False)
    .pipe(ash.get_audience_groups_by_propensity, num_bins)
    .pipe(ch.rename_columns, {"group_number": "maudience"})
    .pipe(
        ash.set_datatypes,
        {
            "row_number": pd.Int16Dtype(),
            "fullvisitorid": pd.StringDtype(),
            "score": pd.Float32Dtype(),
            "predicted_score_label": pd.BooleanDtype(),
            "maudience": pd.Int8Dtype(),
        },
    )
)
with pd.option_context("display.max_columns", 1000):
    display(df_infer_pred.dtypes.rename("dtype").to_frame().T)
    display(df_infer_pred)
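
The sorting and binning steps can be sketched with plain `pandas` as below. The helper `bin_by_propensity` and the toy scores are hypothetical, and the bodies of `ash.sort_scores` and `ash.get_audience_groups_by_propensity` are not shown here, so this is only an approximation of the idea. Rows are ranked with the highest score first, an assumption made so that group 0 lines up with the *High* label in the mapping shown earlier.

```python
import numpy as np
import pandas as pd


def bin_by_propensity(df, num_bins, score_col="score"):
    """Rank rows by score and split them into near-equal sized groups."""
    ranked = df.sort_values(score_col, ascending=False).reset_index(drop=True)
    # integer arithmetic maps row position 0..n-1 onto group 0..num_bins-1
    ranked["maudience"] = np.arange(len(ranked)) * num_bins // len(ranked)
    return ranked


scores = pd.DataFrame(
    {
        "fullvisitorid": [f"v{i}" for i in range(9)],
        "score": [0.05, 0.91, 0.33, 0.47, 0.12, 0.78, 0.64, 0.29, 0.08],
    }
)
binned = bin_by_propensity(scores, num_bins=3)
print(binned)
```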

::: {.callout-note title="Notes"}

1. This is the same data preparation that was used during the (previous) sample size estimation step in order to prepare the test data split for estimating the required sample size.
:::

Perform the following to get the test and control cohorts from the inference data

1. load estimates for required sample sizes associated with best deployment candidate model
2. get required sample sizes for the chosen audience strategy
3. get required sample sizes that can be supported by size of inference data
4. get required sample sizes that support all required audience groups (bins)
5. get required sample sizes that capture required effect sizes
6. get random test and control cohorts

In [None]:
#| echo: true
df_infer_audience_groups = (
    ch.load_file_from_mlflow_artifact(
        df_deployment_candidate_mlflow_models, "audience_sample_sizes"
    )
    .pipe(ch.get_sample_sizes_by_strategy, audience_strategy)
    .pipe(ch.get_suitable_sample_sizes, bin_size_infer_control)
    .pipe(ch.get_sample_sizes_with_all_audience_groups, num_bins)
    .pipe(ch.get_required_inputs, query_inputs)
    .pipe(
        ch.create_cohorts,
        df_infer_pred,
        mapper_dict_audience_strategy_1
        if audience_strategy == 1
        else mapper_dict_audience_strategy_2,
        audience_strategy,
    )
    .pipe(
        ash.set_datatypes,
        {
            "maudience": pd.StringDtype(),
            "cohort": pd.StringDtype(),
            "audience_strategy": pd.Int8Dtype(),
        },
    )
)
with pd.option_context("display.max_columns", 1000):
    display(df_infer_audience_groups)

::: {.callout-note title="Notes"}

1. The final sample size used to generate the cohorts is the least number of samples (first-time visitors) to be included in each of the control and test cohorts of each audience group. These sizes come from the output of the previous step (designing a media or marketing experiment - [1](https://blog.hubspot.com/blog/tabid/6307/bid/31634/a-b-testing-in-action-3-real-life-marketing-experiments.aspx), [2](https://www.datacamp.com/blog/data-demystified-what-is-a-b-testing)).
:::

The treatment (or test) and control groups should be similar to each other for the property to be tested. This is a [fundamental requirement of test and control groups](https://www.mobileapps.com/blog/test-group#Similarities_Between_Test_and_Control_Groups). In this case, the probability (score column) is the property of interest. With this in mind, we now show selected descriptive statistics, for the score column, for both the test (treatment) and control cohorts within each desired audience group (low, medium and high propensity)

In [None]:
#| echo: true
df_aud_stats = ch.get_cohort_stats(df_infer_audience_groups)
df_aud_stats

::: {.callout-tip title="Observations"}

1. It is reassuring that there is agreement in the statistics for the probabilities per Test-Control cohort within the same audience group.
:::
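
A comparison of this kind can be sketched with a grouped aggregation. The example below uses synthetic scores and hypothetical cohort labels (`"Test"`, `"Control"`). It is not the implementation of `ch.get_cohort_stats`, only an illustration of checking that the two cohorts within an audience group have similar score distributions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy_cohorts = pd.DataFrame(
    {
        "maudience": np.repeat(["High", "Medium", "Low"], 200),
        "cohort": np.tile(["Test", "Control"], 300),
        # made-up scores: each audience group gets its own score range
        "score": np.concatenate(
            [
                rng.uniform(0.6, 0.9, 200),
                rng.uniform(0.3, 0.6, 200),
                rng.uniform(0.0, 0.3, 200),
            ]
        ),
    }
)
score_stats = toy_cohorts.groupby(["maudience", "cohort"])["score"].agg(
    ["count", "mean", "std", "min", "max"]
)
# gap between test and control mean scores within each audience group
means = score_stats["mean"].unstack("cohort")
mean_gap = (means["Test"] - means["Control"]).abs()

print(score_stats.round(3))
print(mean_gap.round(3))
```

With random assignment, the gap between cohort means within a group should be small relative to the spread of scores in that group.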

Finally, the audience groups in the inference data are now briefly profiled, in order to identify characteristics that might help the marketing team build an appropriate strategy to be implemented during the campaign. For each audience group, the profile consists of the following

1. proportion of visitors who displayed the following behavior
   - whose last action during their first visit was
     - `last_action == 'Add product(s) to cart'`
     - `last_action == 'Product detail views'`
     - `last_action == 'Click through of product lists'`
   - who
     - used a `medium == 'referral'` to get to the store's website on their first visit
     - added at least one item to their shopping cart (`added_to_cart > 0`) on their first visit
   - whose
     - first visit occurred on a weekend (`day_of_week.isin(@weekend_days)`)
2. bounce rate during all first visits
3. following descriptive statistics
   - `hour`, `day_of_week`
     - statistic: `mean`
   - `source`, `medium`, `channelGrouping`, `last_action`, `browser`, `os`, `deviceCategory`
     - statistic: `mode` (most common value)
   - `promos_displayed`, `promos_clicked`, `product_views`, `product_clicks`, `pageviews`, `added_to_cart`
     - statistics: `mean`, `max`

In [None]:
#| echo: true
df_profile = pa.get_audience_profile(df_infer_audience_groups)
with pd.option_context("display.max_columns", 1000):
    display(df_profile)
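
The behavioral proportions in the profile can be sketched as boolean flags averaged within each audience group. The toy data below is made up and this is not the code in `src/profile_audience.py`. The value `weekend_days = [1, 7]` is an assumption: the sample rows shown earlier are consistent with 1 = Sunday and 7 = Saturday, but the definition used by the module is not shown here.

```python
import pandas as pd

weekend_days = [1, 7]  # assumes 1 = Sunday and 7 = Saturday

toy_visits = pd.DataFrame(
    {
        "maudience": ["High"] * 4 + ["Low"] * 4,
        "medium": [
            "referral", "organic", "referral", "cpc",
            "organic", "organic", "referral", "cpc",
        ],
        "added_to_cart": [1, 0, 2, 0, 0, 0, 0, 1],
        "day_of_week": [7, 3, 1, 5, 2, 7, 4, 6],
    }
)
profile = (
    toy_visits.assign(
        used_referral=lambda df: df["medium"] == "referral",
        added_item=lambda df: df["added_to_cart"] > 0,
        weekend_visit=lambda df: df["day_of_week"].isin(weekend_days),
    )
    .groupby("maudience")[["used_referral", "added_item", "weekend_visit"]]
    .mean()  # mean of a boolean flag is the proportion of visitors
    .T
)
print(profile)
```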

::: {.callout-note title="Notes"}

1. The audience profile consists of different types of statistics about the first visit, for each audience group.
2. The `stat` column shows the attribute from the first visit in the inference data for which a statistic is calculated.
3. The `stat_type` column indicates the type of statistic.
4. The *High*, *Medium* and *Low* columns (or just the *High* column, if the single-group audience strategy is being used) show the statistic of each attribute *within* each audience group.
:::

## Export to Disk and ML Experiment Tracking

Get the best MLFlow run ID

In [None]:
#| echo: true
best_run_id = df_deployment_candidate_mlflow_models.squeeze()["run_id"]

### Audience Cohorts

Show summary of `DataFrame` with audience cohorts

In [None]:
#| echo: true
ut.summarize_df(df_infer_audience_groups)

::: {.callout-note title="Notes"}

1. The `cohort` column has missing values for visitors who were not assigned to either the test or control groups. This is expected.
:::

Export to disk and log exported file as MLFlow artifact

In [None]:
#| echo: true
ut.export_and_track(
    os.path.join(
        processed_data_dir,
        f"audience_cohorts__run_"
        f"{best_run_id}__"
        f"infer_month_{month_name[1:][X_infer['month'].iloc[-1] - 1]}__"
        f"{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet.gzip",
    ),
    df_infer_audience_groups,
    (
        "inference audience cohorts for "
        f"{month_name[1:][X_infer['month'].iloc[-1] - 1]}"
    ),
    best_run_id,
)
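
The file name built above can be traced with fixed stand-in values. In this sketch the month number, the run ID placeholder and the timestamp are all made up for illustration. It also shows that `month_name[1:][m - 1]` and `month_name[m]` give the same label, since `month_name[0]` is an empty string.

```python
import os
from calendar import month_name
from datetime import datetime

infer_month = 3      # stand-in for X_infer['month'].iloc[-1]
run_id = "<run_id>"  # placeholder for the MLFlow run ID
# fixed timestamp for illustration (the notebook uses datetime.now())
timestamp = datetime(2023, 5, 28, 12, 0, 0).strftime("%Y%m%d_%H%M%S")

month_label = month_name[1:][infer_month - 1]
fname = (
    f"audience_cohorts__run_{run_id}__"
    f"infer_month_{month_label}__{timestamp}.parquet.gzip"
)
print(os.path.join("data", "processed", fname))
```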

### Audience Profiles

Show summary of `DataFrame` with audience profiles

In [None]:
#| echo: true
ut.summarize_df(df_profile)

::: {.callout-note title="Notes"}

1. The behavioral attributes in the profile are not specific to an individual column. So, the `column` in the profile `DataFrame` has missing values for these attributes.
:::

Export to disk and log exported file as MLFlow artifact

In [None]:
#| echo: true
ut.export_and_track(
    os.path.join(
        processed_data_dir,
        f"audience_profiles__run_"
        f"{best_run_id}__"
        f"infer_month_{month_name[1:][X_infer['month'].iloc[-1] - 1]}__"
        f"{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet.gzip",
    ),
    df_profile,
    (
        "inference audience profiles for "
        f"{month_name[1:][X_infer['month'].iloc[-1] - 1]}"
    ),
    best_run_id,
)

## Next Step

The next step will assess the performance of the marketing campaign, after the campaign has completed.