# Create Marketing Audience Cohorts

In [1]:
# | echo: false
%load_ext lab_black

Import Python libraries

In [2]:
import os
import random

import numpy as np
import pandas as pd

## About

### Objective
The final step in the workflow for this project is to generate the marketing audience test (or treatment) and control cohorts. These cohorts are the deliverable required at the end of this project, and this step creates them.

### Audience Propensity Groups
Three marketing audience propensity groups were chosen in the previous step (post-processing)

1. `low`
   - visitors in this audience group are predicted to have a low propensity (or likelihood) to make a purchase on a return visit to the store
2. `medium`
3. `high`

Predictions of propensity (probabilities) come from the ML model's predicted probabilities for visitors in the unseen data split. The unseen data covers March 1 - 31, 2017 and represents the production period for this project.

### Test (or Treatment) and Control Cohorts
The test and control cohorts are randomly selected from each audience propensity group (low, medium and high). The size of each cohort is chosen based on the output of the previous analysis step (the media experiment design, or post-processing).

In the current step

1. the control cohort is randomly chosen from each audience propensity group (low, medium and high)
2. a random sampling from all visitors that do not belong to the control group is drawn and these visitors are placed in the test (or treatment) group
   - all other visitors do not belong to a cohort

### Summary of Analysis in This Step
In summary, the final output of this step is to append two new columns to the propensity predictions for the unseen data

1. `audience_group`
2. `cohort`

### Assumptions

As was the case with the preceding (post-processing, or media design) step, the current step can only be run after March 31, 2017, once

1. all the unseen data has been collected
2. propensity (probabilities) have been predicted for this data

So, we will assume that

1. the current date is April 1, 2017
2. the unseen data has been
   - collected
   - processed (drop duplicates, bin categorical features, etc. as was done during ML model development)
3. we have used the trained ML model to make predictions of the probability (likelihood or propensity) of making a purchase on a future visit for all visitors in the unseen data

## User Inputs

In [3]:
PROJ_ROOT_DIR = os.path.join(os.pardir)

Define the following

1. `audience_groups`
   - desired audience groups into which the first-time visitors' propensities will be placed
   - this is a Python dictionary with the following key-value pairs
     - `num_propens_groups` represents the number of desired audience propensity groups (low, medium and high)
     - `propens_group_labels` gives the desired audience propensity group labels
2. `min_control_group_sizes`
   - the minimum number of samples (first-time visitors) to be included in each of the control and test cohorts of each audience group
   - these should come from the output of the previous step ([designing a media or marketing experiment](https://blog.hubspot.com/blog/tabid/6307/bid/31634/a-b-testing-in-action-3-real-life-marketing-experiments.aspx))
   - this is a Python list of integers, one per audience group, that represent the desired minimum size of the control and test cohort

In [4]:
# | code-fold: false
audience_groups = {
    "num_propens_groups": 3,
    "propens_group_labels": ["High", "Medium", "Low"],
}

min_control_group_sizes = [1000, 2000, 3000]
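
Since the control and test cohorts are later drawn without replacement from the same audience group, each requested size can be at most half of that group's size. A minimal sketch of such a check is given here. The helper name `check_cohort_sizes` is hypothetical, and the sketch assumes equal-sized audience groups (as produced by quantile binning) and the 24,000 visitors found in the unseen data loaded later in this step.

```python
def check_cohort_sizes(num_visitors, min_sizes, num_groups):
    """Check that control + test cohorts fit inside each audience group.

    Hypothetical helper: assumes all audience groups have the same size.
    """
    if len(min_sizes) != num_groups:
        raise ValueError("one minimum cohort size is needed per audience group")
    group_size = num_visitors // num_groups
    # control and test cohorts each need `size` distinct visitors
    return [2 * size <= group_size for size in min_sizes]


fits = check_cohort_sizes(
    num_visitors=24_000, min_sizes=[1000, 2000, 3000], num_groups=3
)
print(fits)  # [True, True, True]
```

If any entry were `False`, sampling without replacement for that audience group would fail, so the sizes from the experiment design would need to be revisited.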

Define filepath to data directory where the unseen data, with predictions, is stored

In [5]:
# | code-fold: false
model_dir = os.path.join(PROJ_ROOT_DIR, "models")
ml_preds_fpath = os.path.join(model_dir, "unseen_data_predictions.parquet.gzip")

Define filepath to data directory where the created audience cohorts will be stored

In [6]:
# | code-fold: false
data_dir = os.path.join(PROJ_ROOT_DIR, "data")
processed_data_dir = os.path.join(data_dir, "processed")

Create a mapping between audience group number (0, 1, 2) and name (high, medium, low), where

- 0 is mapped to high
- 1 is mapped to medium
- 2 is mapped to low

since it is [standard to](https://stackoverflow.com/a/26502255/4057186) assign the top percentile (highest propensity) the label with the smallest number (0)

In [7]:
# | code-fold: false
mapper_dict_audience = dict(
    zip(
        range(audience_groups["num_propens_groups"]),
        audience_groups["propens_group_labels"],
    )
)
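
As a quick illustration, the mapping built above can be reproduced in isolation. This sketch simply repeats the inputs defined earlier in this step so that it is self-contained.

```python
# Copy of the user inputs defined earlier in this step
audience_groups = {
    "num_propens_groups": 3,
    "propens_group_labels": ["High", "Medium", "Low"],
}

# Pair each group number with its label, in order
mapper_dict_audience = dict(
    zip(
        range(audience_groups["num_propens_groups"]),
        audience_groups["propens_group_labels"],
    )
)
print(mapper_dict_audience)  # {0: 'High', 1: 'Medium', 2: 'Low'}
```

The order of `propens_group_labels` matters here, since the first label is attached to group 0 (the highest propensity).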

Define a helper function to show datatypes and number of missing values in a `DataFrame`

In [8]:
def summarize_df(df: pd.DataFrame) -> None:
    """Show datatypes and count missing values in columns of DataFrame."""
    display(
        df.dtypes.rename("dtype")
        .to_frame()
        .merge(
            df.isna().sum().rename("missing").to_frame(),
            left_index=True,
            right_index=True,
            how="left",
        )
    )

## Load Data

In [9]:
# | echo: false

# # DUMMY DATA - IGNORE

# nrows = 24_000

# from scipy.stats import beta


# df_prediction = (
#     pd.DataFrame(
#         np.random.choice(nrows, nrows, replace=False), columns=["fullvisitorid"]
#     )
#     .assign(
#         label=lambda df: pd.Series(np.random.default_rng(88).beta(1.0, 4.2, size=nrows))
#         > 0.5
#     )
#     .assign(
#         score=lambda df: pd.Series(
#             beta.rvs(0.125, 2.42, size=nrows, random_state=88)
#         ).astype(pd.Float32Dtype())
#     )
#     .assign(predicted_score_label=lambda df: df["score"] > 0.5)
# )
# df_prediction.astype(
#     {
#         "fullvisitorid": pd.StringDtype(),
#         "score": pd.Float32Dtype(),
#         "predicted_score_label": pd.BooleanDtype(),
#     }
# ).to_parquet(ml_preds_fpath, index=False, engine="pyarrow", compression="gzip")

Load the predictions of propensity to make a purchase on a return visit for all first-time visitors in the unseen data split

In [10]:
# | code-fold: false
df = pd.read_parquet(ml_preds_fpath, engine="pyarrow")
display(df.head())
display(df.tail())
summarize_df(df)
df.info()

Unnamed: 0,fullvisitorid,label,score,predicted_score_label
0,9568,False,0.008079,False
1,18180,False,0.030055,False
2,154,False,0.0,False
3,1718,False,0.000253,False
4,413,False,0.065967,False


Unnamed: 0,fullvisitorid,label,score,predicted_score_label
23995,19834,False,7.7e-05,False
23996,3875,False,1e-06,False
23997,20780,False,1e-05,False
23998,11178,True,0.278035,False
23999,12921,False,0.003235,False


Unnamed: 0,dtype,missing
fullvisitorid,string[python],0
label,bool,0
score,Float32,0
predicted_score_label,boolean,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24000 entries, 0 to 23999
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   fullvisitorid          24000 non-null  string 
 1   label                  24000 non-null  bool   
 2   score                  24000 non-null  Float32
 3   predicted_score_label  24000 non-null  boolean
dtypes: Float32(1), bool(1), boolean(1), string(1)
memory usage: 375.1 KB


**Notes**

1. The `fullvisitorid` column is the ID of each visitor who
   - made a purchase on a return visit to the store
   - made their first visit to the store during the dates covered by the test data split
2. The true ML label (`y_true`) will not be known until a later date, after March 31, 2017 (see the project scope for details)
3. The `predicted_score_label` column is the predicted label using ML (`y_pred`).
4. The `score` column is the predicted probability (using `.predict_proba()`), which is the propensity of a visitor to make a purchase on a return visit.
5. Any other columns are either
   - used in ML as features (`X`), or
   - not used in ML

## Process Data

Labels for the audience and cohort groups will now be assigned to all visitors in the unseen data.

For the audience, all visitors will be placed in one of three groups based on their predicted propensity to make a purchase on a return visit to the store. Groups are created using `pandas.qcut()`. There will be three such groups - low, medium and high propensity.

To do this, a separate column `audience_group` will be assigned and will contain the integers 0, 1 and 2. The smallest integer (0) corresponds to the highest propensity since

1. the scores (predicted probabilities) have been sorted in descending order
2. `pandas.qcut()` assigns bin numbers starting at 0, which is a standard practice (see the explanation for `mapper_dict_audience` earlier in this step)

Later, labels (`Low`, `Medium` or `High`) will be mapped to the integers using the mapping dictionary created above (`mapper_dict_audience`).

For the cohort, a subset of visitors will be placed in one of two cohorts - test (or treatment) and control. All other visitors will be assigned a cohort of `None`. To do this, a separate column `cohort` will be assigned and will contain `Control`, `Test` or `None`.

### Get Audience Groups Based on Propensity to Make Purchase on Return Visit

To create the audience for the marketing campaign, we'll first

1. (optionally) sort the predicted probabilities in the unseen (or inference) data, in descending order
2. assign a row number to each observation (i.e. to each row or visitor) in this unseen data

In [11]:
%%time
df = (
    df
    .sort_values(by="score", ascending=False)
    .assign(row_number=lambda df: range(len(df)))
)
df.head()

CPU times: user 3.09 ms, sys: 2.87 ms, total: 5.96 ms
Wall time: 5.35 ms


Unnamed: 0,fullvisitorid,label,score,predicted_score_label,row_number
5277,6539,False,0.957472,True,0
997,15317,False,0.929886,True,1
12343,4313,False,0.918627,True,2
1644,9806,False,0.908974,True,3
22156,4375,False,0.906313,True,4


Next, we'll bin the predicted probabilities using their quantiles with `pandas.qcut`, where the visitors are binned based on row number in order to avoid duplication of bin boundaries. Three bins are created since earlier we specified that we wanted three audience groups, namely low, medium and high propensity to purchase.

This is done below

In [12]:
# | code-fold: false
df["audience_group"] = pd.qcut(
    x=df["row_number"], q=audience_groups["num_propens_groups"], labels=False
)
display(df.head())
display(df.tail())

Unnamed: 0,fullvisitorid,label,score,predicted_score_label,row_number,audience_group
5277,6539,False,0.957472,True,0,0
997,15317,False,0.929886,True,1,0
12343,4313,False,0.918627,True,2,0
1644,9806,False,0.908974,True,3,0
22156,4375,False,0.906313,True,4,0


Unnamed: 0,fullvisitorid,label,score,predicted_score_label,row_number,audience_group
21391,11855,False,0.0,False,23995,2
12926,1147,False,0.0,False,23996,2
1387,10329,False,0.0,False,23997,2
20790,14354,False,0.0,False,23998,2
575,610,False,0.0,False,23999,2
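
The reason for binning on row number, rather than on the score itself, can be illustrated with a small example. This is a sketch on made-up scores (not the project data): when many visitors share the same score, such as 0.0, several quantile boundaries coincide and `pandas.qcut()` raises a `ValueError` by default, whereas row numbers are always unique.

```python
import pandas as pd

# Made-up scores with many ties at zero (illustrative only)
scores = pd.Series([0.9, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# Binning the raw scores fails since two of the bin edges are both 0.0
try:
    pd.qcut(scores, q=3, labels=False)
    qcut_on_scores_failed = False
except ValueError:
    qcut_on_scores_failed = True

# Binning on the row number of the sorted scores always gives unique edges
ranked = scores.sort_values(ascending=False).to_frame("score")
ranked["row_number"] = range(len(ranked))
ranked["audience_group"] = pd.qcut(ranked["row_number"], q=3, labels=False)

print(qcut_on_scores_failed)
print(ranked["audience_group"].tolist())
```

One consequence of this design choice is that visitors with identical scores can land in different audience groups, since ties at a group boundary are split by row order.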


### Create Test and Control Cohorts

With the three audience propensity groups now assigned, we can create the test and control cohorts within each audience group.

Test and control group cohorts should be randomly selected without replacement. This means the same visitor should not be randomly selected multiple times from the audience group and placed in each cohort. The two cohorts need to be independent and so the same visitor cannot be assigned to both cohorts.

In Python, there are two options to make such random selections without duplication

1. `random.sample()` [in the Python standard library](https://www.educative.io/answers/what-is-randomsample-in-python)
2. [`numpy.random.choice()` with `replace=False` in the `numpy` library](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html#numpy.random.choice)
   
Here, the `numpy` approach is used.
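
For comparison, both options are sketched here on a small list of made-up visitor IDs. The NumPy variant uses the `Generator` returned by `numpy.random.default_rng()`, and the seed value is arbitrary.

```python
import random

import numpy as np

# Made-up visitor IDs (illustrative only)
visitor_ids = [f"v{i}" for i in range(10)]

# Option 1: standard library, sampling without replacement
random.seed(88)
sample_stdlib = random.sample(visitor_ids, k=4)

# Option 2: NumPy Generator, sampling without replacement
rng = np.random.default_rng(88)
sample_numpy = rng.choice(visitor_ids, size=4, replace=False).tolist()

# Neither sample contains the same visitor twice
print(len(set(sample_stdlib)), len(set(sample_numpy)))
```

Both approaches return four distinct visitors. The NumPy variant returns an array, which is why `.tolist()` is called before further set operations.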

To create these groups, we'll iterate over each of the desired minimum test and control visitor cohort sizes that were recommended by the previous step about experiment design (post-processing) and then draw a [random sample](https://www.optimove.com/resources/learning-center/control-groups-in-marketing) from each propensity audience group to be chosen as control and test.

The approach will consist of the following steps for each audience group (low, medium or high propensity)

1. draw random sample of visitors, without replacement, as the control group
2. get all non-selected visitors
3. from non-selected visitors, draw random sample of visitors, without replacement, as the treatment (test) group
4. (optional) get all visitors that are not part of either control or test group
5. extract the following
   - all features (`X`)
   - metadata columns not used in ML
   - predicted ML label (`y_pred`)
   - predicted propensities (probabilities, in the `score` column)

   and store the results in
   - `df_test` (test or treatment cohort of visitors)
   - `df_control` (control cohort of visitors)
   - (optional) `df_excluded` (cohort of visitors that are not selected in either test or control group)

This approach is shown below

In [13]:
%%time
groups = []
for k, group_size in enumerate(min_control_group_sizes):
    df_audience = df.query(f"audience_group == {k}")
    audience_array = df_audience['fullvisitorid'].to_numpy()

    # 1. get control 
    rng = np.random.default_rng(88)
    control_group_visitors = rng.choice(audience_array, group_size, replace=False).tolist()

    # 2. get all remaining visitors
    other_array = list(set(audience_array) - set(control_group_visitors))

    # 3. get test group visitors by randomly sampling from remaining visitors
    test_group_visitors = rng.choice(other_array, group_size, replace=False).tolist()

    # 4. get combined control and test cohorts of visitors
    combined_cohort_visitors = control_group_visitors + test_group_visitors
    # get all excluded visitors (not part of test or control cohort)
    excluded_visitors = list(set(audience_array) - set(combined_cohort_visitors))

    # # get audience scores
    # audience_scores = df_audience.query(
    #     "fullvisitorid.isin(@combined_cohort_visitors)"
    # )['score'].to_numpy()

    print(
        f"audience={k}: {mapper_dict_audience[k]}, "
        f"size={len(audience_array):,}, "
        f"excluded={len(excluded_visitors):,}, "
        f"wanted={group_size:,}, "
        f"control={len(control_group_visitors):,}, "
        f"test={len(test_group_visitors):,}"
    )

    # 5. extract attributes for each cohort per audience group
    df_test = (
        df_audience.drop(columns=['row_number'])
        .query("fullvisitorid.isin(@test_group_visitors)")
        .assign(audience_group=k)
        .assign(cohort='Test')
    )
    df_control = (
        df_audience.drop(columns=['row_number'])
        .query("fullvisitorid.isin(@control_group_visitors)")
        .assign(audience_group=k)
        .assign(cohort='Control')
    )
    df_excluded = (
        df_audience.drop(columns=['row_number'])
        .query("fullvisitorid.isin(@excluded_visitors)")
        .assign(audience_group=k)
        .assign(cohort=None)
    )
    df_coh = pd.concat([df_control, df_test, df_excluded], ignore_index=True)
    groups.append(df_coh)
df_test_audience_groups = (
    pd.concat(groups, ignore_index=True)
    .assign(audience_group=lambda df: df['audience_group'].map(mapper_dict_audience))
    .astype({'audience_group': pd.StringDtype(), 'cohort': pd.StringDtype()})
)
display(df_test_audience_groups.head())

audience=0: High, size=8,000, excluded=6,000, wanted=1,000, control=1,000, test=1,000
audience=1: Medium, size=8,000, excluded=4,000, wanted=2,000, control=2,000, test=2,000
audience=2: Low, size=8,000, excluded=2,000, wanted=3,000, control=3,000, test=3,000


Unnamed: 0,fullvisitorid,label,score,predicted_score_label,audience_group,cohort
0,11635,False,0.805879,True,High,Control
1,9652,False,0.783223,True,High,Control
2,15041,False,0.769196,True,High,Control
3,6620,False,0.765363,True,High,Control
4,3295,False,0.736128,True,High,Control


CPU times: user 61.1 ms, sys: 5.16 ms, total: 66.3 ms
Wall time: 65.6 ms
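
Since the two cohorts need to be independent, it can be worth confirming that no visitor appears in more than one cohort and that the control and test cohorts have equal sizes within each audience group. A minimal sketch of such a check follows, using a made-up `DataFrame` with the same column names as the output above. The helper name `check_cohorts` is hypothetical.

```python
import pandas as pd


def check_cohorts(df, id_col="fullvisitorid"):
    """Return True if cohorts are disjoint and balanced (hypothetical helper)."""
    # keep only visitors that were assigned to a cohort
    assigned = df.dropna(subset=["cohort"])
    # a visitor ID that appears twice would break independence
    no_duplicates = not assigned[id_col].duplicated().any()
    # count visitors per audience group and cohort
    sizes = (
        assigned.groupby(["audience_group", "cohort"]).size().unstack("cohort")
    )
    balanced = bool((sizes["Control"] == sizes["Test"]).all())
    return no_duplicates and balanced


# Made-up data (illustrative only)
toy = pd.DataFrame(
    {
        "fullvisitorid": ["a", "b", "c", "d", "e", "f"],
        "audience_group": ["High", "High", "High", "Low", "Low", "Low"],
        "cohort": ["Control", "Test", None, "Control", "Test", None],
    }
)
print(check_cohorts(toy))  # True
```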


The treatment (or test) and control groups should be similar to each other for the property to be tested. This is a [fundamental requirement of test and control groups](https://www.mobileapps.com/blog/test-group#Similarities_Between_Test_and_Control_Groups). In this case, the probability (`score` column) is the property of interest.

With this in mind, we now show selected descriptive statistics for both the test (treatment) and control cohorts within each desired audience group (low, medium and high propensity)

In [14]:
# | code-fold: false
df_aud_stats = df_test_audience_groups.groupby(
    ["audience_group", "cohort"], dropna=False, as_index=False
).agg({"score": ["count", "min", "mean", "median", "max"]})
df_aud_stats.columns = [
    "_".join(c).rstrip("_") for c in df_aud_stats.columns.to_flat_index()
]
df_aud_stats = df_aud_stats.astype(
    {f"score_{stat}": pd.Float32Dtype() for stat in ["min", "mean", "median", "max"]}
).sort_values(by=["audience_group"])
df_aud_stats

Unnamed: 0,audience_group,cohort,score_count,score_min,score_mean,score_median,score_max
0,High,Control,1000,0.011979,0.137127,0.073412,0.805879
1,High,Test,1000,0.011967,0.145986,0.083493,0.918627
2,High,,6000,0.011962,0.143285,0.077881,0.957472
3,Low,Control,3000,0.0,6e-06,0.0,5.2e-05
4,Low,Test,3000,0.0,6e-06,0.0,5.2e-05
5,Low,,2000,0.0,6e-06,0.0,5.2e-05
6,Medium,Control,2000,5.3e-05,0.002735,0.001312,0.011958
7,Medium,Test,2000,5.3e-05,0.002649,0.00122,0.011951
8,Medium,,4000,5.3e-05,0.002617,0.001202,0.011958


**Observations**

1. With the exception of outliers in the *High* propensity audience group, we see good agreement in the statistics for the probabilities per Test-Control cohort within the same audience group.
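
One possible way to put a number on this agreement, which was not done in this step, is a standardized mean difference between the test and control scores. The sketch below is an assumption rather than part of the analysis: the helper name is hypothetical and the scores are synthetic draws from a single beta distribution.

```python
import numpy as np


def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd


# Synthetic scores (illustrative only): both cohorts share one distribution
rng = np.random.default_rng(88)
control_scores = rng.beta(1.0, 4.2, size=5000)
test_scores = rng.beta(1.0, 4.2, size=5000)

smd = standardized_mean_difference(test_scores, control_scores)
print(abs(smd) < 0.1)
```

Values close to zero suggest the two cohorts are similar in terms of their mean score. Any cut-off used to judge this, such as the 0.1 above, is a convention that would need to be agreed upon for the project.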

## Export to Disk

Finally, the unseen data with the

1. audience (`audience_group`)
2. cohorts (`cohort`)

columns assigned will now be exported to disk for use by the marketing team.

Show the datatypes

In [15]:
summarize_df(df_test_audience_groups)

Unnamed: 0,dtype,missing
fullvisitorid,string[python],0
label,bool,0
score,Float32,0
predicted_score_label,boolean,0
audience_group,string[python],0
cohort,string[python],12000


**Notes**
1. The `cohort` column has missing values for visitors who were not assigned to either the test or control groups. This is expected.

Save to disk

In [16]:
# | code-fold: false
audience_cohorts_fpath = os.path.join(
    processed_data_dir, "marketing_audience_cohort_groups__unseen_data.parquet.gzip"
)
df_test_audience_groups.to_parquet(
    audience_cohorts_fpath, index=False, engine="pyarrow", compression="gzip"
)

## Summary of Tasks Performed

This step in the project's overall workflow has performed the following

1. created marketing audience propensity groups
2. created test (or treatment) and control cohorts
3. demonstrated similarity between test and control cohorts, in terms of propensity to make a purchase, as required

## Summary of Assumptions

1. As was the case with the preceding (post-processing, or media design) step, the current step can only be run after March 31, 2017, once
   - all the unseen data has been collected
   - propensity (probabilities) have been predicted for this data

   So, we have assumed that

   - the current date is April 1, 2017
   - the unseen data has been
     - collected
     - processed (drop duplicates, bin categorical features, etc. as was done during ML model development)
   - we have used the trained ML model to make predictions of the probability (likelihood or propensity) of making a purchase on a future visit for all visitors in the unseen data

## Limitations

1. In practice, we may want to briefly profile the test and control cohorts within each audience propensity group (summarize their main attributes using some of the columns with ML features, such as browser, number of pageviews, etc.) and append this profile as a new column. This could provide helpful context to the marketing team when they are devising a marketing promotion/strategy. This cohort profiling step was not performed here.
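
A rough sketch of what such a profile could look like is given here. It is only an illustration: the `browser` and `pageviews` columns and their values are made up, and the choice of summary statistics (most common browser, median pageviews) is an assumption.

```python
import pandas as pd

# Made-up cohort data with two hypothetical feature columns
toy = pd.DataFrame(
    {
        "audience_group": ["High"] * 4 + ["Low"] * 4,
        "cohort": ["Control", "Control", "Test", "Test"] * 2,
        "browser": [
            "Chrome", "Chrome", "Safari", "Chrome",
            "Firefox", "Chrome", "Chrome", "Safari",
        ],
        "pageviews": [5, 7, 6, 8, 1, 2, 1, 3],
    }
)

# One row per audience group and cohort, summarizing the main attributes
profile = toy.groupby(["audience_group", "cohort"], as_index=False).agg(
    top_browser=("browser", lambda s: s.mode().iat[0]),
    median_pageviews=("pageviews", "median"),
)
print(profile)
```

The resulting table could be merged back onto the exported cohorts on `audience_group` and `cohort` if a per-visitor profile column is preferred.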