# Exploratory Data Analysis (continued)

In [1]:
%load_ext lab_black

In [2]:
import os
from datetime import datetime
from glob import glob
from typing import Dict

import numpy as np
import pandas as pd
from google.oauth2 import service_account

## About

This notebook explores the visit data for the Google Merchandise store on the Google Marketplace.

### Objectives
The goals of this notebook are the following:
1. prepare visit data
2. explore the prepared data for underlying patterns that could be captured as features to help ML modeling
3. observe the class imbalance of the labels (`y`) to get a sense of any resampling techniques that could aid the performance of an ML model

### Discussion of Study Period for This Project
EDA should only use data that does not introduce lookahead bias (data leakage) into later modeling. The visits data will also be transformed from a single row per visit to a single row per visitor.

### Data Selection for EDA
With the above in mind, EDA in this notebook will cover the training data.
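
As a rough illustration of the leakage concern, the sketch below restricts a toy visits table to a date window using `YYYYMMDD` strings. The table `toy_visits`, its values and the window boundaries are made up for illustration; only the filtering pattern matters.

```python
import pandas as pd

# Hypothetical toy visits table; `date` uses a zero-padded YYYYMMDD string
toy_visits = pd.DataFrame(
    {
        "fullvisitorid": ["a", "a", "b", "c"],
        "date": ["20160815", "20170410", "20170301", "20170420"],
    }
)

# Assumed window boundaries (illustrative values)
window_start, window_end = "20160801", "20170331"

# Zero-padded YYYYMMDD strings sort lexicographically in date order,
# so a plain (inclusive) string comparison selects the window
in_window = toy_visits["date"].between(window_start, window_end)
eda_visits = toy_visits[in_window]
held_out_visits = toy_visits[~in_window]
print(eda_visits)
```

Rows outside the window (`held_out_visits`) are the ones that should stay unseen during EDA.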

## User Inputs

In [3]:
PROJ_ROOT_DIR = os.path.join(os.pardir)

In [4]:
# start and end dates
train_start_date = "20160801"
train_end_date = "20170331"
test_end_date = "20170430"

# # Google Cloud PROJECT ID
# gcp_project_id = <google-project-id>
# # Google Cloud Service Account JSON local filepath
# gcp_creds_fpath = <google-cloud-service-account-json-file>

# IGNORE
# Google Cloud PROJECT ID
gcp_project_id = os.environ["GCP_PROJECT_ID"]
# Google Cloud Service Account JSON local filepath
gcp_creds_fpath = glob(
    os.path.join(os.path.join(PROJ_ROOT_DIR, "data", "raw"), "*.json")
)[0]

In [5]:
# authenticate BigQuery
gcp_credentials = service_account.Credentials.from_service_account_file(gcp_creds_fpath)
gcp_auth_dict = dict(gcp_project_id=gcp_project_id, gcp_creds=gcp_credentials)

# mapping dictionary to get meaningful names from the action_type column
mapper = {
    1: "Click through of product lists",
    2: "Product detail views",
    3: "Add product(s) to cart",
    4: "Remove product(s) from cart",
    5: "Check out",
    6: "Completed purchase",
    7: "Refund of purchase",
    8: "Checkout options",
    0: "Unknown",
}

Define a Python helper function to execute a SQL query using Google BigQuery

In [6]:
def run_sql_query(
    query: str,
    gcp_project_id: str,
    gcp_creds: service_account.Credentials,
    show_dtypes: bool = False,
    show_info: bool = False,
    show_df: bool = False,
) -> pd.DataFrame:
    """Run query on Google BigQuery and return results as pandas DataFrame."""
    start_time = datetime.now()
    start_time_str = start_time.strftime("%Y-%m-%d %H:%M:%S")
    print(f"Query execution start time = {start_time_str}...", end="")
    df = pd.read_gbq(
        query,
        project_id=gcp_project_id,
        credentials=gcp_creds,
        dialect="standard",
        # configuration is optional, since default for query caching is True
        configuration={"query": {"useQueryCache": True}},
        # use_bqstorage_api=True,
    )
    end_time = datetime.now()
    duration = end_time - start_time
    duration = duration.seconds + (duration.microseconds / 1_000_000)
    print(f"done at {end_time.strftime('%Y-%m-%d %H:%M:%S')} ({duration:.3f} seconds).")
    print(f"Query returned {len(df):,} rows")
    if show_df:
        with pd.option_context("display.max_columns", None):
            display(df)
    if show_dtypes:
        display(df.dtypes.rename("dtype").to_frame().transpose())
    if show_info:
        df.info()
    return df
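
The helper computes the elapsed time as `seconds + microseconds / 1_000_000`. For queries that finish within a day this matches `timedelta.total_seconds()`, but the manual formula ignores the `days` component. A minimal sketch with made-up timestamps shows the difference:

```python
from datetime import datetime, timedelta

# Made-up start and end times, deliberately more than one day apart
start = datetime(2023, 4, 5, 19, 45, 37)
end = start + timedelta(days=1, seconds=8, microseconds=859_000)
elapsed = end - start

# Manual formula: drops the `days` component of the timedelta
manual_seconds = elapsed.seconds + elapsed.microseconds / 1_000_000
# Built-in: accounts for days, seconds and microseconds
total_seconds = elapsed.total_seconds()

print(f"manual = {manual_seconds:.3f} s, total_seconds() = {total_seconds:.3f} s")
```

Using `total_seconds()` would be a drop-in alternative if long-running queries were ever a concern.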

## Data Preparation

**Question 1. Get visitors with a purchase on a future visit to the Marketplace.**

To get these visitors, a similar approach to that from the `01_get_data_eda.ipynb` notebook will be used. In that notebook, two criteria were used to identify a purchase on a future visit, namely `totals.transactions > 0` and `totals.newVisits IS NULL`. Those will be used here as well.

A query with these filters is executed below

In [7]:
%%time
query = f"""
        SELECT fullvisitorid,
               IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, True, False) AS made_purchase_on_future_visit,
               IF(SUM(CASE WHEN totals.transactions > 0 AND totals.newVisits IS NULL THEN 1 ELSE 0 END) > 0, True, False) AS made_purchase_on_future_visit_v2
        FROM `data-to-insights.ecommerce.web_analytics`
        WHERE date BETWEEN '{train_start_date}' AND '{test_end_date}'
        AND geoNetwork.country = 'United States'
        GROUP BY fullvisitorid
        """
df = run_sql_query(query, **gcp_auth_dict, show_df=False)
display(
    # breakdown of returning purchasers using COUNTIF()
    df["made_purchase_on_future_visit"].value_counts().rename(
        "num_return_purchasers_using_countif"
    ).to_frame().merge(
        # breakdown of returning purchasers using CASE WHEN()
        df["made_purchase_on_future_visit_v2"].value_counts().rename(
            "num_return_purchasers_using_if_sum_casewhen"
        ).to_frame(),
        left_index=True,
        right_index=True,
    ).merge(
        (
            100
            * df["made_purchase_on_future_visit"]
            .value_counts(normalize=True)
            .rename("frac_made_purchase_on_future_visit")
        ).to_frame(),
        left_index=True,
        right_index=True,
    ).reset_index().rename(columns={"index": "made_return_purchase"})
)

Query execution start time = 2023-04-05 19:45:37...done at 2023-04-05 19:45:46 (8.859 seconds).
Query returned 200,507 rows


Unnamed: 0,made_purchase_on_future_visit,num_return_purchasers_using_countif,num_return_purchasers_using_if_sum_casewhen,frac_made_purchase_on_future_visit
0,False,192248,192248,95.880942
1,True,8259,8259,4.119058


CPU times: user 1.09 s, sys: 164 ms, total: 1.26 s
Wall time: 8.87 s
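
To make the labeling logic concrete, here is a small pandas analogue of the two SQL variants on a toy table. The data is hypothetical, and the column names `transactions` and `newVisits` simply stand in for `totals.transactions` and `totals.newVisits`.

```python
import pandas as pd

# Hypothetical visits: visitor "a" buys on a return visit, "b" buys only on
# the first visit, and "c" never returns
toy = pd.DataFrame(
    {
        "fullvisitorid": ["a", "a", "b", "b", "c"],
        "transactions": [None, 1, 1, None, None],
        "newVisits": [1, None, 1, None, 1],
    }
)

# Row-level condition: a purchase was made AND the visit is not a first visit
is_return_purchase = (toy["transactions"] > 0) & toy["newVisits"].isna()

# COUNTIF-style: count qualifying rows per visitor, then compare with zero
label_countif = is_return_purchase.groupby(toy["fullvisitorid"]).sum() > 0
# SUM(CASE WHEN ...)-style: map each row to 1/0 first, then sum per visitor
label_casewhen = (
    is_return_purchase.astype(int).groupby(toy["fullvisitorid"]).sum() > 0
)

print(label_countif)
```

Both approaches give the same per-visitor flag, which mirrors the agreement between the two SQL columns.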


**Observations**
1. `COUNTIF()` is a BigQuery SQL function ([1](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#countif)) but it gives the same output as a standard SQL-based approach using `IF(SUM(CASE WHEN...))`. For the rest of this notebook, `COUNTIF()` will be used
2. Over the selected months, the class imbalance is roughly 96% to 4% (or 96:4)
   - this comes from the `made_purchase_on_future_visit` column
   - these are the class labels for ML experiments
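
To get a feel for how strong this imbalance is, the sketch below turns the two class counts from the query output into an imbalance ratio and a pair of "balanced" class weights, using the `n_samples / (n_classes * n_samples_in_class)` heuristic (the same formula scikit-learn uses for `class_weight="balanced"`). This is only an illustration and not a modeling choice made in this notebook. Note that the counts cover the full window from `train_start_date` to `test_end_date`.

```python
# Class counts copied from the query output
n_negative = 192_248  # made_purchase_on_future_visit == False
n_positive = 8_259  # made_purchase_on_future_visit == True
n_total = n_negative + n_positive

# Number of negative examples per positive example
imbalance_ratio = n_negative / n_positive

# "Balanced" heuristic: n_samples / (n_classes * n_samples_in_class)
weight_negative = n_total / (2 * n_negative)
weight_positive = n_total / (2 * n_positive)

print(f"imbalance ratio = {imbalance_ratio:.1f}:1")
print(f"class weights = {weight_negative:.3f} (False), {weight_positive:.3f} (True)")
```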

**Question 2. Extract attributes from the first visit by visitors that made a purchase on a future visit.**

The following are the attributes extracted from the first visit (for the above visitors only) and the high-level categories that they belong to
1. geospatial and temporal
   - country
   - `datetime` attributes (day of month, hour of day, etc.)
2. metadata of each visit and visitor
   - these are *id* (or equivalent) columns
3. traffic sources and channels
   - traffic sources
     - these are search engines, social media networks, and other sources that result in visitors reaching the merchandise store's website ([link](https://support.google.com/analytics/answer/6205762?hl=en#understanding&zippy=%2Cin-this-article))
   - channels
     - these are groups of traffic sources ([link](https://support.google.com/analytics/answer/6010097?hl=en#zippy=%2Cin-this-article))
4. visitor activity on site
   - hits
   - bounces
   - page views
   - time spent on site
   - number of add-to-cart actions performed
5. visitor's device used to access site
   - browser
   - device category
   - operating system
6. label for machine learning
   - `made_purchase_on_future_visit`
     - same as in the above query
     - indicates whether a visitor makes a purchase during a future (return) visit
7. Product
   - products viewed
   - products clicked
8. Promotion
   - promotions viewed (impressions)
   - promotions clicked

In [8]:
%%time
query = f"""
        WITH
        -- Step 1. get visitors with a purchase on a future visit
        next_visit_purchasers AS (
             SELECT fullvisitorid,
                    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, True, False) AS made_purchase_on_future_visit
             FROM `data-to-insights.ecommerce.web_analytics`
             WHERE date BETWEEN '{train_start_date}' AND '{test_end_date}'
             AND geoNetwork.country = 'United States'
             GROUP BY fullvisitorid
        ),
        -- Steps 2. and 3. get attributes of the first visit
        first_visit_attributes AS (
            SELECT -- =========== GEOSPATIAL AND TEMPORAL ATTRIBUTES OF VISIT ===========
                   geoNetwork.country,
                   EXTRACT(QUARTER FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS quarter,
                   EXTRACT(MONTH FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS month,
                   EXTRACT(DAY FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS day_of_month,
                   EXTRACT(DAYOFWEEK FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS day_of_week,
                   EXTRACT(HOUR FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS hour,
                   EXTRACT(MINUTE FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS minute,
                   EXTRACT(SECOND FROM DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern')) AS second,
                   -- =========== VISIT AND VISITOR METADATA ===========
                   fullvisitorid,
                   visitId,
                   visitNumber,
                   DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern') AS visitStartTime,
                   -- =========== SOURCE OF SITE TRAFFIC ===========
                   -- source of the traffic from which the visit was initiated
                   trafficSource.source,
                   -- medium of the traffic from which the visit was initiated
                   trafficSource.medium,
                   -- referring channel connected to visit
                   channelGrouping,
                   -- =========== VISITOR ACTIVITY ===========
                   -- total number of hits
                   (CASE WHEN totals.hits > 0 THEN totals.hits ELSE 0 END) AS hits,
                   -- number of bounces
                   (CASE WHEN totals.bounces > 0 THEN totals.bounces ELSE 0 END) AS bounces,
                   -- action performed during first visit
                   CAST(h.eCommerceAction.action_type AS INT64) AS action_type,
                   -- page views
                   IFNULL(totals.pageviews, 0) AS pageviews,
                   -- time on the website
                   IFNULL(totals.timeOnSite, 0) AS time_on_site,
                   -- whether add-to-cart was performed during visit
                   (CASE WHEN CAST(h.eCommerceAction.action_type AS INT64) = 3 THEN 1 ELSE 0 END) AS added_to_cart,
                   (CASE WHEN CAST(h.eCommerceAction.action_type AS INT64) = 2 THEN 1 ELSE 0 END) AS product_details_viewed,
                   -- =========== VISITOR DEVICES ===========
                   -- user's browser
                   device.browser,
                   -- user's operating system
                   device.operatingSystem AS os,
                   -- user's type of device
                   device.deviceCategory,
                   -- =========== PROMOTION ===========
                   h.promotion,
                   h.promotionActionInfo AS pa_info,
                   -- =========== PRODUCT ===========
                   h.product,
                   -- =========== ML LABEL (DEPENDENT VARIABLE) ===========
                   made_purchase_on_future_visit
            FROM `data-to-insights.ecommerce.web_analytics`,
            UNNEST(hits) AS h
            INNER JOIN next_visit_purchasers USING (fullvisitorid)
            WHERE date BETWEEN '{train_start_date}' AND '{train_end_date}'
            AND geoNetwork.country = 'United States'
            AND totals.newVisits = 1
        ),
        -- Step 4. get aggregated features (attributes) per visit
        visit_attributes AS (
            SELECT fullvisitorid,
                   visitId,
                   visitNumber,
                   visitStartTime,
                   country,
                   quarter,
                   month,
                   day_of_month,
                   day_of_week,
                   hour,
                   minute,
                   second,
                   source,
                   medium,
                   channelGrouping,
                   hits,
                   bounces,
                   -- get the highest action_type code recorded during the first visit
                   -- (a proxy for how far the visitor progressed, not strictly the chronologically last action)
                   MAX(action_type) AS last_action,
                   -- get number of products whose details were viewed
                   SUM(product_details_viewed) AS product_detail_views,
                   -- get number of promotions displayed and clicked during the first visit
                   COUNT(CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsView ELSE NULL END) AS promos_displayed,
                   COUNT(CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsClick ELSE NULL END) AS promos_clicked,
                   -- get number of products displayed and clicked during the first visit
                   COUNT(CASE WHEN pu.isImpression IS NULL THEN NULL ELSE 1 END) AS product_views,
                   COUNT(CASE WHEN pu.isClick IS NULL THEN NULL ELSE 1 END) AS product_clicks,
                   pageviews,
                   time_on_site,
                   browser,
                   os,
                   deviceCategory,
                   SUM(added_to_cart) AS added_to_cart,
                   made_purchase_on_future_visit,
            FROM first_visit_attributes
            LEFT JOIN UNNEST(promotion) as p
            LEFT JOIN UNNEST(product) as pu
            GROUP BY fullvisitorid,
                     visitId,
                     visitNumber,
                     visitStartTime,
                     country,
                     quarter,
                     month,
                     day_of_month,
                     day_of_week,
                     hour,
                     minute,
                     second,
                     source,
                     medium,
                     channelGrouping,
                     hits,
                     bounces,
                     pageviews,
                     time_on_site,
                     browser,
                     os,
                     deviceCategory,
                     made_purchase_on_future_visit
        )
        SELECT *
        FROM visit_attributes
        """
df = run_sql_query(query, **gcp_auth_dict, show_df=False)
with pd.option_context('display.max_columns', None):
    display(df)

Query execution start time = 2023-04-05 19:45:52...done at 2023-04-05 19:46:34 (41.703 seconds).
Query returned 176,304 rows


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,country,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,product_detail_views,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,made_purchase_on_future_visit
0,483329569933708956,1477437687,1,2016-10-25 19:21:27,United States,4,10,25,3,19,21,27,google,organic,Organic Search,6,0,0,0,0,0,0,0,6,602,Chrome,Windows,desktop,0,False
1,9534112552538425546,1476671570,1,2016-10-16 22:32:50,United States,4,10,16,1,22,32,50,youtube.com,referral,Social,6,0,0,0,54,0,0,0,1,270,Opera,Windows,desktop,0,False
2,4648924122067625674,1475958245,1,2016-10-08 16:24:05,United States,4,10,8,7,16,24,5,youtube.com,referral,Social,5,0,0,0,36,0,4,0,2,198,Opera,Windows,desktop,0,False
3,6917772450375508123,1488649070,1,2017-03-04 12:37:50,United States,1,3,4,7,12,37,50,google,organic,Organic Search,24,0,0,0,54,0,104,0,24,830,Chrome,Chrome OS,desktop,0,False
4,2743152869399749836,1481229164,1,2016-12-08 15:32:44,United States,4,12,8,5,15,32,44,google,organic,Organic Search,5,0,0,0,18,0,0,0,5,1004,Chrome,Windows,desktop,0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
176299,7394810252875006838,1487700692,1,2017-02-21 13:11:32,United States,1,2,21,3,13,11,32,plus.google.com,referral,Social,3,0,3,0,0,0,6,0,2,192,Chrome,Linux,desktop,1,False
176300,3465406422267443252,1484007004,1,2017-01-09 19:10:04,United States,1,1,9,2,19,10,4,youtube.com,referral,Social,3,0,0,0,0,0,33,0,3,162,Chrome,Windows,desktop,0,False
176301,9646206586674539171,1486848395,1,2017-02-11 16:26:35,United States,1,2,11,7,16,26,35,youtube.com,referral,Social,3,0,0,0,0,0,36,0,3,21,Chrome,Windows,desktop,0,False
176302,9088496055778533168,1482668980,1,2016-12-25 07:29:40,United States,4,12,25,1,7,29,40,youtube.com,referral,Social,3,0,0,0,0,0,36,0,3,69,Chrome,Linux,desktop,0,False


CPU times: user 7.19 s, sys: 416 ms, total: 7.61 s
Wall time: 41.7 s


**Notes**
1. Below is a brief overview of the CTEs used here
   - `next_visit_purchasers`
     - flags, for every visitor, whether they made a purchase on a return visit to the merchandise store on the Google Marketplace (`made_purchase_on_future_visit`)
   - `first_visit_attributes`
     - gets attributes of the first visit
     - the statement `INNER JOIN next_visit_purchasers USING (fullvisitorid)` attaches this label to each first visit; since the `next_visit_purchasers` CTE holds one row per visitor (purchasers and non-purchasers alike), the join does not restrict the result to purchasers
   - `visit_attributes`
     - aggregates values from nested columns to get views and clicks for promotions and products
   - the BigQuery SQL function `UNNEST` was used to flatten nested columns
2. The start and end dates of the ML training, validation and test data splits were defined. The training dates are used to filter the `first_visit_attributes` CTE, since EDA in this notebook is performed using only the training data in order to avoid data leakage (or lookahead bias).
3. The SQL required to extract most of these columns was fairly straightforward and was determined from (a) the documentation for the dataset and (b) examining the first few rows of the dataset in these columns. For brevity, we won't discuss these columns in further detail. These column categories are listed below
   - geospatial and temporal
   - metadata of each visit and visitor
   - traffic sources and channels
   - visitor activity on merchandise store website
   - visitor's device
   - label for machine learning (discussed in *Data Preparation Question 1. above*)
4. These columns were extracted based on intuition about the attributes of each visit that will help to predict the probability of a visitor making a purchase on a future (return) visit to the merchandise store.
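
The temporal columns can be cross-checked in pandas. The sketch below converts the epoch value `1477437687` (the `visitId` of the first row in the output, which converts to the `visitStartTime` shown there) to Eastern time and derives the same calendar fields. One detail to watch: BigQuery's `DAYOFWEEK` runs from Sunday = 1 to Saturday = 7, while pandas' `dayofweek` runs from Monday = 0 to Sunday = 6, so a small conversion is needed. The tz name `America/New_York` is used here as the canonical equivalent of `US/Eastern`.

```python
import pandas as pd

# Epoch seconds for the first row shown in the output
visit_start = pd.Series([1477437687])

# Interpret as UTC, then convert to Eastern time
local = pd.to_datetime(visit_start, unit="s", utc=True).dt.tz_convert(
    "America/New_York"
)

features = pd.DataFrame(
    {
        "quarter": local.dt.quarter,
        "month": local.dt.month,
        "day_of_month": local.dt.day,
        # pandas: Monday=0..Sunday=6; BigQuery DAYOFWEEK: Sunday=1..Saturday=7
        "day_of_week": (local.dt.dayofweek + 1) % 7 + 1,
        "hour": local.dt.hour,
    }
)
print(features)
```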

**Question 3. What fraction of visitors added one or more items to their shopping cart during their first visit?**

In [9]:
for c in ["added_to_cart"]:
    display(
        (100 * df[c].value_counts(dropna=False, normalize=True))
        .rename("number")
        .reset_index()
        .rename(columns={"index": "added_to_cart"})
    )

Unnamed: 0,added_to_cart,number
0,0.0,89.543629
1,1.0,6.247164
2,2.0,2.044764
3,3.0,0.870655
4,4.0,0.457165
5,5.0,0.258644
6,6.0,0.167325
7,7.0,0.101529
8,8.0,0.085648
9,9.0,0.044809


**Observations**
1. During the months covered by the training data, nearly 90% of first visits did not include an add-to-cart action (`added_to_cart = 0`). Only about 10% of visitors added at least one item to their shopping cart during their first visit.
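
The "at least one item" share can also be computed directly, without reading it off the value counts. A minimal sketch on made-up counts (not the real data):

```python
import pandas as pd

# Hypothetical per-visitor add-to-cart counts
added_to_cart = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 2, 0])

# Share of visitors with at least one add-to-cart action
share_any = (added_to_cart > 0).mean()
# Share with none; the two shares sum to 1 when there are no missing values
share_none = (added_to_cart == 0).mean()

print(f"any = {100 * share_any:.1f}%, none = {100 * share_none:.1f}%")
```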

**Question 4. Comment on duplicates present in the data prepared above. What are some possible reasons for the presence of duplicates in the above prepared data? How should these be handled?**

Below we show that there are duplicates within the `fullvisitorid` column

In [10]:
print(
    f"Number of rows = {len(df):,}\nNumber of unique visitor IDs = "
    f"{df['fullvisitorid'].nunique():,}\n"
    f"Largest visitNumber = {df['visitNumber'].max()}"
)

Number of rows = 176,304
Number of unique visitor IDs = 175,859
Largest visitNumber = 1


These duplicates are retrieved below, showing that multiple visit records are present for the same `fullvisitorid`

In [11]:
df.groupby(["fullvisitorid"]).agg(
    {"visitId": "count", "visitNumber": "max"}
).reset_index().rename(columns={"visitId": "num_visitIds"}).query("num_visitIds > 1")

Unnamed: 0,fullvisitorid,num_visitIds,visitNumber
243,0014997413479849928,2,1
1455,0083782248104182622,2,1
2206,012569301201854368,2,1
2625,0153393931967124172,2,1
2781,0161343516497795152,2,1
...,...,...,...
174217,9906208132011345120,2,1
174390,9915457192772678365,2,1
174978,9949751653823311987,2,1
175028,9952616174324085427,2,1


**Observations**
1. Duplicates can occur by
   - `fullvisitorid`
   - `visitId`

   so, we should explore both cases separately.

Duplicated `visitId`s are shown below

In [12]:
%%time
dup_visit_ids = df[df.duplicated(subset=["visitId"], keep=False)]
num_dups = len(df[df.duplicated(subset=["visitId"], keep="first")])
print(
    f"Found {num_dups:,} duplicated visitIds out of "
    f"{len(df):,} ({100*num_dups/len(df):.3f}%)"
)
with pd.option_context("display.max_columns", None):
    display(dup_visit_ids.sort_values(by=["visitId"]).head(25))

Found 1,411 duplicated visitIds out of 176,304 (0.800%)


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,country,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,product_detail_views,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,made_purchase_on_future_visit
64103,2935275602576654325,1470150511,1,2016-08-02 11:08:31,United States,3,8,2,3,11,8,31,(direct),(none),Direct,1,1,0,0,0,0,2,0,1,0,Chrome,Android,mobile,0,False
79429,9697795609276773118,1470150511,1,2016-08-02 11:08:31,United States,3,8,2,3,11,8,31,youtube.com,referral,Social,1,1,0,0,0,0,0,0,1,0,Chrome,Windows,desktop,0,False
168291,9683319538826858411,1470160418,1,2016-08-02 13:53:38,United States,3,8,2,3,13,53,38,youtube.com,referral,Social,1,1,0,0,0,0,0,0,1,0,Chrome,Windows,desktop,0,False
90517,9603025468779152457,1470160418,1,2016-08-02 13:53:38,United States,3,8,2,3,13,53,38,google,organic,Organic Search,13,0,3,1,0,0,124,1,11,161,Chrome,Windows,desktop,1,False
1206,255738894985569638,1470168999,1,2016-08-02 16:16:39,United States,3,8,2,3,16,16,39,google,organic,Organic Search,7,0,0,0,0,0,38,0,7,28,Chrome,Macintosh,desktop,0,False
50802,2034380726948166051,1470168999,1,2016-08-02 16:16:42,United States,3,8,2,3,16,16,42,(direct),(none),Direct,10,0,2,2,0,0,136,2,8,328,Chrome,Windows,desktop,0,False
7665,1497363999004274326,1470171393,1,2016-08-02 16:56:33,United States,3,8,2,3,16,56,33,mall.googleplex.com,referral,Referral,14,0,2,3,0,0,94,3,11,90,Chrome,Macintosh,desktop,0,False
49494,9866587659793762889,1470171393,1,2016-08-02 16:56:33,United States,3,8,2,3,16,56,33,mall.googleplex.com,referral,Referral,7,0,0,0,0,0,87,0,7,50,Chrome,Linux,desktop,0,False
111949,9095035992209125819,1470172055,1,2016-08-02 17:07:35,United States,3,8,2,3,17,7,35,google,organic,Organic Search,9,0,2,1,0,2,38,1,6,127,Safari,iOS,tablet,0,False
148141,6423253706798498849,1470172055,1,2016-08-02 17:07:35,United States,3,8,2,3,17,7,35,google,organic,Organic Search,3,0,0,0,0,0,29,0,3,26,Chrome,Macintosh,desktop,0,False


CPU times: user 32.1 ms, sys: 71 µs, total: 32.2 ms
Wall time: 31 ms


**Observations**
1. For the same `visitId`, different traffic sources (`source`, `medium`, `channelGrouping`) bring the same or different visitors (`fullvisitorid`) to the website at the same `datetime` (`visitStartTime`). Google Analytics assigns the same `visitId` to such visitors. There are two types of nested duplicates here
   - the same visitor accessing the merchandise store from
     - multiple devices at the same time
       - this cross-device tracking appears to be [allowed by Google Analytics](https://blog.google/products/marketingplatform/360/cross-device-capabilities/)
     - the same device and same browser (using separate browser windows after clearing cookies) at the same time
       - this is also [allowed by Google Analytics](https://www.quora.com/In-Google-Analytics-why-do-I-have-more-unique-visitors-than-visits)
   - different visitors are accessing the merchandise store from multiple devices at the same time
     - this is not a duplicated occurrence
     - most likely this corresponds to two distinct visitors who happened to navigate to the site at the same time
2. Such duplicates are rare: about 0.8% of rows share a `visitId` with an earlier row.
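
The two `duplicated()` calls used above answer slightly different questions, and a per-`visitId` count of distinct visitors helps separate the "same visitor" case from the "different visitors" case. A small sketch on a hypothetical table:

```python
import pandas as pd

# Hypothetical visits: visitId 100 and 300 are each shared by two visitors
toy = pd.DataFrame(
    {
        "fullvisitorid": ["a", "b", "c", "d", "e"],
        "visitId": [100, 100, 200, 300, 300],
    }
)

# keep=False flags every row that shares its visitId with another row
all_dup_rows = toy[toy.duplicated(subset=["visitId"], keep=False)]
# keep="first" flags only the repeats, i.e. rows beyond the first occurrence
extra_rows = toy[toy.duplicated(subset=["visitId"], keep="first")]
# Number of distinct visitors behind each visitId
visitors_per_visit = toy.groupby("visitId")["fullvisitorid"].nunique()

print(len(all_dup_rows), len(extra_rows))
print(visitors_per_visit)
```

A `visitId` with more than one distinct visitor corresponds to the second case listed above, which is not a true duplicate.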

Duplicated `fullvisitorid`s are shown below

In [13]:
%%time
dup_visitor_ids = df[df.duplicated(subset=["fullvisitorid"], keep=False)]
num_dups = len(df[df.duplicated(subset=["fullvisitorid"], keep="first")])
print(
    f"Found {num_dups:,} duplicated fullvisitorid out of "
    f"{len(df):,} ({100*num_dups/len(df):.3f}%)"
)
with pd.option_context("display.max_columns", None):
    display(dup_visitor_ids.sort_values(by=["fullvisitorid"]).head(25))

Found 445 duplicated fullvisitorid out of 176,304 (0.252%)


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,country,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,product_detail_views,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,made_purchase_on_future_visit
95776,14997413479849928,1477513747,1,2016-10-26 16:29:07,United States,4,10,26,4,16,29,7,(direct),(none),Direct,14,0,2,1,18,0,38,4,10,77,Chrome,Macintosh,desktop,0,False
52214,14997413479849928,1474324872,1,2016-09-19 18:41:12,United States,3,9,19,2,18,41,12,mall.googleplex.com,referral,Referral,17,0,2,3,9,0,168,5,12,349,Chrome,Macintosh,desktop,0,False
158289,83782248104182622,1488911527,1,2017-03-07 13:32:07,United States,1,3,7,3,13,32,7,sites.google.com,referral,Referral,4,0,0,0,9,0,14,0,4,21,Chrome,Linux,desktop,0,False
96181,83782248104182622,1470254339,1,2016-08-03 15:59:02,United States,3,8,3,4,15,59,2,(direct),(none),Direct,16,0,0,0,0,0,280,0,16,538,Chrome,Linux,desktop,0,False
122225,12569301201854368,1477496472,1,2016-10-26 11:41:12,United States,4,10,26,4,11,41,12,analytics.google.com,referral,Referral,1,1,0,0,9,0,0,0,1,0,Chrome,Macintosh,desktop,0,False
16939,12569301201854368,1477339547,1,2016-10-24 16:05:52,United States,4,10,24,2,16,5,52,(direct),(none),Direct,2,0,0,0,9,0,10,0,2,12,Chrome,Macintosh,desktop,0,False
51549,153393931967124172,1481528488,1,2016-12-12 03:00:03,United States,4,12,12,2,3,0,3,google,organic,Organic Search,13,0,2,1,9,0,92,2,11,614,Safari,Macintosh,desktop,0,False
132537,153393931967124172,1481528488,1,2016-12-12 02:41:28,United States,4,12,12,2,2,41,28,google,organic,Organic Search,5,0,0,0,9,0,15,0,5,1105,Safari,Macintosh,desktop,0,False
47409,161343516497795152,1490770250,1,2017-03-29 03:00:19,United States,1,3,29,4,3,0,19,(direct),(none),Direct,3,0,6,0,9,0,0,0,3,28,Chrome,Macintosh,desktop,0,False
74025,161343516497795152,1490770250,1,2017-03-29 02:50:50,United States,1,3,29,4,2,50,50,(direct),(none),Direct,17,0,5,1,0,0,12,1,14,517,Chrome,Macintosh,desktop,2,False


CPU times: user 85.6 ms, sys: 0 ns, total: 85.6 ms
Wall time: 84 ms


**Observations**
1. For the same `fullvisitorid`, different traffic sources (`source`, `medium`, `channelGrouping`) bring the same visitor to the website at different `datetime`s (`visitStartTime`s) from the same device (`browser`, `os`, `deviceCategory`). There are two types of nested duplicates here
   - the same visit by the same visitor with a <30 minute period of inactivity between duplicates
     - (by default) [Google Analytics allows up to 30 minutes of inactivity before starting a new visit](https://support.google.com/analytics/answer/2731565?hl=en#time-based-expiration&zippy=%2Cin-this-article), so it is not clear why such a short period of inactivity should start a new *instance* of the same visit instead of just accumulating stats into the same *instance*
   - a different visit by the same visitor with a >30 minute period of inactivity between duplicates
     - it is also not clear why these duplicates are present in the data
2. Since
   - the use-case for this project requires attributes of **only** the first visit (per visitor) to be used to predict the probability of a purchase during a future visit by the same visitor
   - it is not clear why this type of duplicated record is present in the prepared data

   we will want to drop this type of duplicate from the training, validation and test splits of the prepared data (we're assuming that the same type of problem can occur throughout the dataset and not just in the training data).
3. Such duplicates are rare: about 0.25% of rows repeat a `fullvisitorid` seen in an earlier row.
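
The two cases in the first observation can be told apart by measuring the gap between the start times recorded for each duplicated visitor. The sketch below does this for two visitors copied from the output and compares the gap against the default 30-minute timeout.

```python
import pandas as pd

# Two duplicated visitors copied from the output
dups = pd.DataFrame(
    {
        "fullvisitorid": ["153393931967124172"] * 2 + ["14997413479849928"] * 2,
        "visitStartTime": pd.to_datetime(
            [
                "2016-12-12 03:00:03",
                "2016-12-12 02:41:28",
                "2016-10-26 16:29:07",
                "2016-09-19 18:41:12",
            ]
        ),
    }
)

# Time between the earliest and latest start time recorded for each visitor
grouped = dups.groupby("fullvisitorid")["visitStartTime"]
gap = grouped.max() - grouped.min()

# Compare against the default 30-minute inactivity timeout
within_timeout = gap < pd.Timedelta(minutes=30)
print(pd.DataFrame({"gap": gap, "within_timeout": within_timeout}))
```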

With this in mind, rows with a duplicated `fullvisitorid` are dropped, keeping the first occurrence of each visitor. This will be done using Python

In [14]:
%%time
df = df.drop_duplicates(subset=["fullvisitorid"], keep="first")

CPU times: user 34.3 ms, sys: 11.5 ms, total: 45.8 ms
Wall time: 45.1 ms
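
Note that `keep="first"` retains whichever duplicate comes first in the DataFrame's current row order, which is not necessarily the earliest visit. If the earliest record per visitor is wanted, one option (an alternative, not what was run above) is to sort by `visitStartTime` first. A sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows: visitor "a" appears twice, with the later visit listed first
toy = pd.DataFrame(
    {
        "fullvisitorid": ["a", "a", "b"],
        "visitStartTime": pd.to_datetime(
            ["2016-10-26 16:29:07", "2016-09-19 18:41:12", "2016-08-02 11:08:31"]
        ),
    }
)

# keep="first" follows the current row order
kept_by_row_order = toy.drop_duplicates(subset=["fullvisitorid"], keep="first")
# Sorting by start time first makes "first" mean "earliest visit"
kept_earliest = toy.sort_values("visitStartTime").drop_duplicates(
    subset=["fullvisitorid"], keep="first"
)

print(kept_by_row_order)
print(kept_earliest)
```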


**Question 5. Show and comment on unique values in the nested promotion column.**

The number of promotions and products displayed (impressions) and clicked are shown below

In [15]:
%%time
for c in [
    "promos_displayed",
    "promos_clicked",
    "product_views",
    "product_clicks",
    "product_detail_views",
]:
    df_num_visitor_counts = (
        df[c]
        .value_counts(dropna=False)
        .rename("num_visitors")
        .reset_index()
        .rename(columns={"index": f"num_{c}"})
    )
    assert (
        type(df_num_visitor_counts.query("num_visitors == 0")[c].squeeze()).__name__
        == "NAType"
    )
    display(df_num_visitor_counts.query("num_visitors > 0"))

Unnamed: 0,promos_displayed,num_visitors
0,9,72437
1,0,61381
2,18,19665
3,13,7524
4,27,6294
...,...,...
62,12,1
63,432,1
64,299,1
65,195,1


Unnamed: 0,promos_clicked,num_visitors
0,0,157997
1,1,13522
2,2,2572
3,3,948
4,4,401
5,5,183
6,6,101
7,7,55
8,8,25
9,9,15


Unnamed: 0,product_views,num_visitors
0,0,51208
1,12,26675
2,24,8544
3,3,4520
4,36,4189
...,...,...
705,997,1
706,627,1
707,653,1
708,673,1


Unnamed: 0,product_clicks,num_visitors
0,0,131028
1,1,15967
2,2,10191
3,3,5618
4,4,3749
...,...,...
67,138,1
68,91,1
69,158,1
70,51,1


Unnamed: 0,product_detail_views,num_visitors
0,0,131103
1,1,19590
2,2,9568
3,3,5450
4,4,3170
...,...,...
62,52,1
63,61,1
64,70,1
65,58,1


CPU times: user 53.6 ms, sys: 27 µs, total: 53.6 ms
Wall time: 51.6 ms


**Notes**
1. Product views are the number of times a product was seen while in a list of other products (e.g. on a product listing page or in a product category page).
2. Product clicks are the number of times a product was clicked on after being viewed.
3. Product detail views are the number of times a visitor has visited a product's page (not just viewed its details as part of a product listing).
   - a visitor might have viewed product details page for products that are
     - part of a product listing
     - not part of a product listing
4. Similar logic applies to views and clicks of promotions (e.g. banners).
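
The counting logic behind `product_views` and `product_clicks` can be sketched in plain Python. The nested structure below is a hypothetical stand-in for the `hits` and `product` records of one visit, where a flag is `True` when the event happened and missing (`None`) otherwise.

```python
# Hypothetical hits for one visit; each hit carries a list of product records
hits = [
    {
        "product": [
            {"isImpression": True, "isClick": None},
            {"isImpression": True, "isClick": None},
        ]
    },
    {"product": [{"isImpression": None, "isClick": True}]},
    {"product": []},  # a hit with no product records
]

# Flatten hit -> product records, similar to the UNNEST on `product`
products = [p for hit in hits for p in hit["product"]]

# Count non-missing flags, like COUNT(CASE WHEN flag IS NULL THEN NULL ELSE 1 END)
product_views = sum(p["isImpression"] is not None for p in products)
product_clicks = sum(p["isClick"] is not None for p in products)

print(f"product_views = {product_views}, product_clicks = {product_clicks}")
```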

**Observations**
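1. Most visitors interact little with products and promotions on the merchandise store's website on the Google Marketplace during their first visit
   - clicks are rare: roughly 90% of visitors clicked no promotion and about 75% clicked no product
   - detail views are similarly rare: about 75% of visitors viewed no product detail page
   - impressions are more common, but about 35% of visitors were shown no promotion and about 29% saw no product in a listing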

Promotion-related columns are flattened and shown for a single visit

In [16]:
%%time
query = f"""
        WITH visit_promotion_attrs AS (
            SELECT fullvisitorid,
                   visitId,
                   visitNumber,
                   DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern') AS visitStartTime,
                   CAST(h.ecommerceaction.action_type AS INT64) AS action_type,
                   h.promotion,
                   h.promotionActionInfo AS pa_info,
                   trafficSource.source,
                   trafficSource.medium,
                   channelGrouping,
                   device.browser,
                   device.operatingSystem,
                   device.deviceCategory
            FROM `data-to-insights.ecommerce.web_analytics`,
            UNNEST(hits) AS h
            WHERE visitId = 1476880065  -- 1476880065, 1478579523, 1474972357, 1478844153
            AND geoNetwork.country = 'United States'
        )
        SELECT * EXCEPT(promotion, promoId, pa_info, visitStartTime),
               pa_info,
               CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsView ELSE NULL END AS view_promo,
               CASE WHEN pa_info IS NOT NULL THEN pa_info.promoIsClick ELSE NULL END AS click_promo
        FROM visit_promotion_attrs
        LEFT JOIN UNNEST(promotion) as p
        """
df_raw = run_sql_query(query, **gcp_auth_dict, show_df=False)
df_raw["action_type"] = df_raw["action_type"].map(mapper)
with pd.option_context('display.max_colwidth', None, 'display.max_rows', None):
    display(df_raw.head(10))

Query execution start time = 2023-04-05 19:48:28...done at 2023-04-05 19:48:30 (1.579 seconds).
Query returned 42 rows


Unnamed: 0,fullvisitorid,visitId,visitNumber,action_type,source,medium,channelGrouping,browser,operatingSystem,deviceCategory,promoName,promoCreative,promoPosition,pa_info,view_promo,click_promo
0,3072592563711482446,1476880065,1,Unknown,google,organic,Organic Search,Chrome,Android,mobile,,,,,,
1,3072592563711482446,1476880065,1,Unknown,google,organic,Organic Search,Chrome,Android,mobile,,,,,,
2,3072592563711482446,1476880065,1,Unknown,google,organic,Organic Search,Chrome,Android,mobile,,,,,,
3,3072592563711482446,1476880065,1,Unknown,google,organic,Organic Search,Chrome,Android,mobile,,,,,,
4,3072592563711482446,1476880065,1,Unknown,google,organic,Organic Search,Chrome,Android,mobile,Apparel,home_main_link_apparel.jpg,Row 1,"{'promoIsView': True, 'promoIsClick': None}",True,
5,3072592563711482446,1476880065,1,Unknown,google,organic,Organic Search,Chrome,Android,mobile,Backpacks,home_bags_google_2.jpg,Row 2 Combo,"{'promoIsView': True, 'promoIsClick': None}",True,
6,3072592563711482446,1476880065,1,Unknown,google,organic,Organic Search,Chrome,Android,mobile,Mens T-Shirts,mens-tshirts.jpg,Row 3-1,"{'promoIsView': True, 'promoIsClick': None}",True,
7,3072592563711482446,1476880065,1,Unknown,google,organic,Organic Search,Chrome,Android,mobile,Womens T-Shirts,womens-tshirts.jpg,Row 3-2,"{'promoIsView': True, 'promoIsClick': None}",True,
8,3072592563711482446,1476880065,1,Unknown,google,organic,Organic Search,Chrome,Android,mobile,Office,green_row_link_to_office.jpg,Row 5 Color Combo,"{'promoIsView': True, 'promoIsClick': None}",True,
9,3072592563711482446,1476880065,1,Unknown,google,organic,Organic Search,Chrome,Android,mobile,Drinkware,red_row_hydrate.jpg,Row 4 Color Combo,"{'promoIsView': True, 'promoIsClick': None}",True,


CPU times: user 154 ms, sys: 3.55 ms, total: 157 ms
Wall time: 1.6 s


**Notes**
1. The `CASE WHEN` was constructed for both `view_promo` and `click_promo` columns based on the nested `pa_info` column.

**Observations**
1. When `view_promo = True`, a visitor has viewed a promotion. If it is not viewed, then it is `NULL`.
2. When a visitor clicks a promotion after viewing it
   - `click_promo = True`
   - `view_promo` is `NULL`
     - this prevents double-counting a promotion that is both viewed and clicked
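
Because a viewed promotion and a clicked promotion are recorded on separate rows, each flag can be summed on its own. Below is a minimal sketch of this on a hypothetical `DataFrame` (the values are made up for illustration and do not come from the query above).

```python
import pandas as pd

# Hypothetical hit-level flags mirroring view_promo / click_promo:
# a row is either a view (click is missing) or a click (view is missing)
hits = pd.DataFrame(
    {
        "view_promo": pd.array([True, True, None, True, None], dtype="boolean"),
        "click_promo": pd.array([None, None, True, None, None], dtype="boolean"),
    }
)

# Missing values are skipped by sum(), so each flag can be counted directly
promos_displayed = int(hits["view_promo"].sum())
promos_clicked = int(hits["click_promo"].sum())

# No row should carry both flags if views and clicks are mutually exclusive
is_view = hits["view_promo"].fillna(False)
is_click = hits["click_promo"].fillna(False)
both_flags = int((is_view & is_click).sum())
print(promos_displayed, promos_clicked, both_flags)
```

If `both_flags` were ever greater than zero, summing the two columns separately would double-count those promotions.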

**Question 6. Show and comment on unique values in the nested product column.**

Product-related columns are flattened and shown for a single visit

In [17]:
%%time
query = f"""
        WITH visit_product_attrs AS (
            SELECT fullvisitorid,
               visitId,
               visitNumber,
               DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern') AS visitStartTime,
               CAST(h.ecommerceaction.action_type AS INT64) AS action_type,
               h.product,
               (CASE WHEN ARRAY_LENGTH(h.product) = 0 THEN 0 ELSE ARRAY_LENGTH(h.product) END) AS product_count,
               (CASE WHEN CAST(h.eCommerceAction.action_type AS INT64) = 2 THEN 1 ELSE 0 END) AS product_details_viewed,
               trafficSource.source,
               trafficSource.medium,
               channelGrouping,
               device.browser,
               device.operatingSystem,
               device.deviceCategory
            FROM `data-to-insights.ecommerce.web_analytics`,
            UNNEST(hits) AS h
            WHERE visitId = 1478579523  -- 1478579523, 1474972357
            AND geoNetwork.country = 'United States'
        )
        SELECT *,
               p.isImpression AS viewed_product,
               p.isClick AS clicked_product
        FROM visit_product_attrs
        LEFT JOIN UNNEST(product) as p
        """
df_raw = run_sql_query(query, **gcp_auth_dict, show_df=False)
df_raw["action_type"] = df_raw["action_type"].map(mapper)
with pd.option_context('display.max_columns', None):
    display(df_raw)

Query execution start time = 2023-04-05 19:48:40...done at 2023-04-05 19:48:41 (1.598 seconds).
Query returned 266 rows


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,action_type,product,product_count,product_details_viewed,source,medium,channelGrouping,browser,operatingSystem,deviceCategory,productSKU,v2ProductName,v2ProductCategory,productVariant,productBrand,productRevenue,localProductRevenue,productPrice,localProductPrice,productQuantity,productRefundAmount,localProductRefundAmount,isImpression,isClick,customDimensions,customMetrics,productListName,productListPosition,viewed_product,clicked_product
0,7270403007208566857,1478579523,1,2016-11-07 23:32:03,Unknown,[],0,0,siliconvalley.about.com,referral,Referral,Chrome,Chrome OS,desktop,,,,,,,,,,,,,,,[],[],,,,
1,7270403007208566857,1478579523,1,2016-11-07 23:32:03,Unknown,[],0,0,siliconvalley.about.com,referral,Referral,Chrome,Chrome OS,desktop,,,,,,,,,,,,,,,[],[],,,,
2,7270403007208566857,1478579523,1,2016-11-07 23:32:03,Unknown,[],0,0,siliconvalley.about.com,referral,Referral,Chrome,Chrome OS,desktop,,,,,,,,,,,,,,,[],[],,,,
3,7270403007208566857,1478579523,1,2016-11-07 23:32:03,Unknown,[],0,0,siliconvalley.about.com,referral,Referral,Chrome,Chrome OS,desktop,,,,,,,,,,,,,,,[],[],,,,
4,7270403007208566857,1478579523,1,2016-11-07 23:32:03,Unknown,[],0,0,siliconvalley.about.com,referral,Referral,Chrome,Chrome OS,desktop,,,,,,,,,,,,,,,[],[],,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
261,7270403007208566857,1478579523,1,2016-11-07 23:32:03,Unknown,"[{'productSKU': 'GGOEGAAX0318', 'v2ProductName...",12,0,siliconvalley.about.com,referral,Referral,Chrome,Chrome OS,desktop,GGOEGAAX0686,YouTube Youth Short Sleeve Tee Red,Home/Shop by Brand/YouTube/,(not set),(not set),,,18990000,18990000,,,,True,,[],[],Category,9,True,
262,7270403007208566857,1478579523,1,2016-11-07 23:32:03,Unknown,"[{'productSKU': 'GGOEGAAX0318', 'v2ProductName...",12,0,siliconvalley.about.com,referral,Referral,Chrome,Chrome OS,desktop,GGOEYHPA003510,YouTube Trucker Hat,Home/Shop by Brand/YouTube/,(not set),(not set),,,21990000,21990000,,,,True,,[],[],Category,10,True,
263,7270403007208566857,1478579523,1,2016-11-07 23:32:03,Unknown,"[{'productSKU': 'GGOEGAAX0318', 'v2ProductName...",12,0,siliconvalley.about.com,referral,Referral,Chrome,Chrome OS,desktop,GGOEGAAX0330,YouTube Men's Skater Tee Charcoal,Home/Shop by Brand/YouTube/,(not set),(not set),,,19990000,19990000,,,,True,,[],[],Category,11,True,
264,7270403007208566857,1478579523,1,2016-11-07 23:32:03,Unknown,"[{'productSKU': 'GGOEGAAX0318', 'v2ProductName...",12,0,siliconvalley.about.com,referral,Referral,Chrome,Chrome OS,desktop,GGOEYDHJ056099,22 oz YouTube Bottle Infuser,Home/Shop by Brand/YouTube/,(not set),(not set),,,4990000,4990000,,,,True,,[],[],Category,12,True,


CPU times: user 185 ms, sys: 7.17 ms, total: 192 ms
Wall time: 1.64 s


**Observations**
1. The `product` column is a nested column with a multi-element array. Each element (dictionary) of the array corresponds to a different product in a listing or category of products. With the BigQuery `UNNEST()` function, this column is exploded into the following standalone columns
   - `productSKU`
   - `v2ProductName`
   - `v2ProductCategory`
   - `productVariant`
   - `productBrand`
   - `productRevenue`
   - `localProductRevenue`
   - `productPrice`
   - `localProductPrice`
   - `productQuantity`
   - `productRefundAmount`
   - `localProductRefundAmount`
   - `isImpression`
   - `isClick`
   - `customDimensions`
   - `customMetrics`
   - `productListName`
   - `productListPosition`
   - `viewed_product`
   - `clicked_product`
2. `viewed_product` is `True` for every product in the product listing that was viewed.
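
To build some intuition for what `LEFT JOIN UNNEST(product)` does, below is a rough `pandas` analogue on hypothetical data. It is only a sketch: `DataFrame.explode()` is not what BigQuery runs, but it similarly keeps a row with missing values for a hit whose `product` array is empty.

```python
import pandas as pd

# Hypothetical hits: one with an empty product array, one listing two products
hits = pd.DataFrame(
    {
        "hit_id": [1, 2],
        "product": [
            [],
            [
                {"productSKU": "A", "isImpression": True},
                {"productSKU": "B", "isImpression": True},
            ],
        ],
    }
)

# explode() keeps the hit with the empty list as a single row holding a
# missing value, much like LEFT JOIN UNNEST(product) keeps such hits
exploded = hits.explode("product", ignore_index=True)

# Expand each product dictionary into standalone columns
products = pd.DataFrame(
    [p if isinstance(p, dict) else {} for p in exploded["product"]]
)
flat = pd.concat([exploded[["hit_id"]], products], axis=1)
print(flat)
```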

If a product is viewed in a listing (`product_count > 0`) during a visit, then there are only two possible values for `clicked_product` and `viewed_product`, as shown below

In [18]:
for c in ["viewed_product", "clicked_product"]:
    # show unique values
    display(df_raw.query("product_count > 0")[c].value_counts(dropna=False).to_frame())

    # verify that False is not a unique value
    assert df_raw.query("product_count > 0").query(f"{c} == False").empty

Unnamed: 0_level_0,count
viewed_product,Unnamed: 1_level_1
True,189
,41


Unnamed: 0_level_0,count
clicked_product,Unnamed: 1_level_1
,213
True,17


If a product is not viewed in a listing, then the only value in these same two columns is `NULL`, since they come from the nested column `product`, which contains an empty array `[]` in such a scenario.

This is shown below

In [19]:
for c in ["viewed_product", "clicked_product"]:
    # show unique values
    display(df_raw.query("product_count == 0")[c].value_counts(dropna=False).to_frame())

Unnamed: 0_level_0,count
viewed_product,Unnamed: 1_level_1
,36


Unnamed: 0_level_0,count
clicked_product,Unnamed: 1_level_1
,36


For every product in the product listing that was viewed and clicked
- `clicked_product` is `True`
- `viewed_product` is `NULL`

which prevents double-counting products that are both viewed and clicked (similar to promotions), as shown below

In [20]:
display(df_raw.query("clicked_product == True")[["viewed_product", "clicked_product"]])

Unnamed: 0,viewed_product,clicked_product
20,,True
23,,True
42,,True
44,,True
75,,True
140,,True
149,,True
152,,True
155,,True
158,,True


Product detail views and clicking of products that were viewed in a product or product category listing can also be retrieved from the `action_type` column, which tracks each action performed by a visitor during a visit. Its unique values are shown below

In [21]:
df_raw["action_type"].value_counts(dropna=False).to_frame()

Unnamed: 0_level_0,count
action_type,Unnamed: 1_level_1
Unknown,225
Product detail views,22
Click through of product lists,17
Add product(s) to cart,2


Below, we verify that the products that were clicked can be equivalently determined using either of two separate nested columns
- `clicked_product` (extracted from nested column `product`)
  - `clicked_product == True`
- `action_type` (extracted from nested column `hits`)
  - `action_type == 'Click through of product lists'`

In [22]:
visit_prodict_view_click_cols = [
    "fullvisitorid",
    "visitStartTime",
    "action_type",
    "product_details_viewed",
    "isImpression",
    "isClick",
    "viewed_product",
    "clicked_product",
]
assert df_raw.query("clicked_product == True")[visit_prodict_view_click_cols].equals(
    df_raw.query("action_type == 'Click through of product lists'")[
        visit_prodict_view_click_cols
    ]
)

When a product is viewed in a listing during a visit (`viewed_product == True`), the corresponding product count is greater than zero

In [23]:
assert df_raw.query("viewed_product == True")["product_count"].min() > 0
display(df_raw.query("viewed_product == True")["product_count"].describe().to_frame())

Unnamed: 0,product_count
count,189.0
mean,9.719577
std,3.355004
min,2.0
25%,6.0
50%,12.0
75%,12.0
max,12.0


**For informational purposes, the raw dataset without unnesting the products and promotions columns is shown below for a small number of visits.**

A small number of `visitId`s is shown below

In [24]:
visit_ids_dict = {
    1478844153: "papayawhip",
    1476880065: "mistyrose",
    1478579523: "lavender",
    1474972357: "lightcyan",
}
visit_ids_str = "(" + ", ".join([str(v) for v in list(visit_ids_dict)]) + ")"

Attributes for these visits are retrieved below without unnesting `product` and `promotion`

In [25]:
%%time
query = f"""
        WITH visit_promotion_attrs AS (
            SELECT fullvisitorid,
                   visitId,
                   visitNumber,
                   DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Eastern') AS visitStartTime,
                   CAST(h.ecommerceaction.action_type AS INT64) AS action_type,
                   (CASE WHEN CAST(h.eCommerceAction.action_type AS INT64) = 2 THEN 1 ELSE 0 END) AS product_details_viewed,
                   trafficSource.source,
                   trafficSource.medium,
                   channelGrouping,
                   device.browser,
                   device.operatingSystem,
                   device.deviceCategory,
                   -- visit
                   totals.timeOnSite,
                   totals.timeOnScreen,
                   totals.visits,
                   -- nested columns
                   h.product,
                   h.promotion,
                   -- experimental columns that were not used
                   h.isInteraction,
                   trafficSource.campaign,
                   trafficSource.isTrueDirect,
            FROM `data-to-insights.ecommerce.web_analytics`,
            UNNEST(hits) AS h
            WHERE visitId IN {visit_ids_str}
        )
        SELECT * EXCEPT(visitStartTime)
        FROM visit_promotion_attrs
        """
df_raw = run_sql_query(query, **gcp_auth_dict, show_df=False)
df_raw["action_type"] = df_raw["action_type"].map(mapper)

Query execution start time = 2023-04-05 19:48:56...done at 2023-04-05 19:48:57 (1.398 seconds).
Query returned 143 rows
CPU times: user 141 ms, sys: 4.77 ms, total: 145 ms
Wall time: 1.4 s


These attributes are shown below per `visitId` (uncomment code block to show output)

In [26]:
# %%time
# for visit_id in list(visit_ids_dict):
#     with pd.option_context("display.max_columns", None, "display.max_rows", None):
#         display(df_raw.query(f"visitId == {visit_id}"))

### Change Data Types in Prepared Data

In [27]:
dtypes_dict = {
    "fullvisitorid": pd.StringDtype(),
    "visitId": pd.StringDtype(),
    "visitNumber": pd.Int8Dtype(),
    "country": pd.StringDtype(),
    "quarter": pd.Int8Dtype(),
    "month": pd.Int8Dtype(),
    "day_of_month": pd.Int8Dtype(),
    "day_of_week": pd.Int8Dtype(),
    "hour": pd.Int8Dtype(),
    "minute": pd.Int8Dtype(),
    "second": pd.Int8Dtype(),
    "source": pd.StringDtype(),
    "medium": pd.StringDtype(),
    "channelGrouping": pd.StringDtype(),
    "hits": pd.Int16Dtype(),
    "bounces": pd.Int16Dtype(),
    "last_action": pd.Int8Dtype(),
    "product_detail_views": pd.Int16Dtype(),
    "promos_displayed": pd.Int16Dtype(),
    "promos_clicked": pd.Int16Dtype(),
    "product_views": pd.Int16Dtype(),
    "product_clicks": pd.Int16Dtype(),
    "pageviews": pd.Int16Dtype(),
    "time_on_site": pd.Int16Dtype(),
    "browser": pd.StringDtype(),
    "os": pd.StringDtype(),
    "added_to_cart": pd.Int16Dtype(),
    "deviceCategory": pd.StringDtype(),
}

In [28]:
%%time
df = df.astype(dtypes_dict)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 175859 entries, 0 to 176303
Data columns (total 30 columns):
 #   Column                         Non-Null Count   Dtype         
---  ------                         --------------   -----         
 0   fullvisitorid                  175859 non-null  string        
 1   visitId                        175859 non-null  string        
 2   visitNumber                    175859 non-null  Int8          
 3   visitStartTime                 175859 non-null  datetime64[ns]
 4   country                        175859 non-null  string        
 5   quarter                        175859 non-null  Int8          
 6   month                          175859 non-null  Int8          
 7   day_of_month                   175859 non-null  Int8          
 8   day_of_week                    175859 non-null  Int8          
 9   hour                           175859 non-null  Int8          
 10  minute                         175859 non-null  Int8          
 11  secon

### Separate Columns by Type

In [29]:
with pd.option_context(
    "display.max_colwidth", None, "display.max_rows", None, "display.max_columns", None
):
    display(df.head(3))

Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,country,quarter,month,day_of_month,day_of_week,hour,minute,second,source,medium,channelGrouping,hits,bounces,last_action,product_detail_views,promos_displayed,promos_clicked,product_views,product_clicks,pageviews,time_on_site,browser,os,deviceCategory,added_to_cart,made_purchase_on_future_visit
0,483329569933708956,1477437687,1,2016-10-25 19:21:27,United States,4,10,25,3,19,21,27,google,organic,Organic Search,6,0,0,0,0,0,0,0,6,602,Chrome,Windows,desktop,0,False
1,9534112552538425546,1476671570,1,2016-10-16 22:32:50,United States,4,10,16,1,22,32,50,youtube.com,referral,Social,6,0,0,0,54,0,0,0,1,270,Opera,Windows,desktop,0,False
2,4648924122067625674,1475958245,1,2016-10-08 16:24:05,United States,4,10,8,7,16,24,5,youtube.com,referral,Social,5,0,0,0,36,0,4,0,2,198,Opera,Windows,desktop,0,False


Create lists of columns based on their type
- `datetime`
- categorical
- numerical

In [30]:
datetime_columns = [
    "quarter",
    "month",
    "day_of_month",
    "day_of_week",
    "hour",
]
categorical_columns = [
    "bounces",
    "last_action",
    "source",
    "medium",
    "channelGrouping",
    "browser",
    "os",
    "deviceCategory",
]
numerical_columns = [
    "hits",
    "product_detail_views",
    "promos_displayed",
    "promos_clicked",
    "product_views",
    "product_clicks",
    "pageviews",
    "time_on_site",
    "added_to_cart",
]

### Handling Categorical Columns

High-cardinality categorical features are a problem for machine learning models as they create a large number of dummy variables (after dummy encoding), or a sparse matrix ([1](https://blog.knoldus.com/sparse-matrices-what-makes-them-important-for-machine-learning/), [2](https://www.aiplusinfo.com/blog/what-is-a-sparse-matrix-how-is-it-used-in-machine-learning/)) that slows ML model training. So, it is frequently necessary to reduce this cardinality before training an ML model.

Two of the well-known approaches to reduce the cardinality of such features are ([1](https://arxiv.org/abs/2301.12710), [2](https://www.linkedin.com/advice/0/how-do-you-deal-categorical-features-high-cardinality))
1. [frequency encoding](https://towardsdatascience.com/dealing-with-features-that-have-high-cardinality-1c9212d7ff1b)
   - only keep the `N` most common values for each feature and replace all the other (infrequently occurring) values with a placeholder value such as `other`
   - this will be the approach used for the current project
2. class label or target encoding
   - group categorical features by the class labels (the dependent variable, or `y`)
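
For comparison, below is a minimal sketch of one common form of target encoding, which is not the approach used in this project. Each category is replaced by the mean of the label learned from the training split. The data, the `target_encode()` helper and the fallback to the overall mean for unseen categories are all assumptions made for illustration, and no smoothing is applied.

```python
import pandas as pd

# Hypothetical training data: one categorical feature and a binary label
train = pd.DataFrame(
    {
        "browser": ["Chrome", "Chrome", "Safari", "Safari", "Firefox", "Chrome"],
        "y": [1, 0, 0, 0, 1, 1],
    }
)

# Learn the mean of the label per category from the training split only
target_means = train.groupby("browser")["y"].mean()
# Fall back to the overall label mean for categories unseen during training
global_mean = train["y"].mean()


def target_encode(values: pd.Series) -> pd.Series:
    """Replace each category by its mean label learned from the training split."""
    return values.map(target_means).fillna(global_mean)


encoded = target_encode(pd.Series(["Chrome", "Safari", "Edge"]))
print(encoded.tolist())
```

Since the encoding is learned from the labels, it should only ever be fitted on the training split to avoid data leakage.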

Reducing the cardinality of such features is performed during the data transformation step of an ML workflow. Here, we will demonstrate this before exploratory data analysis (this notebook) and then apply it during data transformation (next notebook).

Show the number of unique values in all categorical columns

In [31]:
%%time
df_nunique = pd.DataFrame.from_records(
    [{"column": c, "num_unique_values": df[c].nunique()} for c in categorical_columns]
)
df_nunique

CPU times: user 40.2 ms, sys: 2.7 ms, total: 42.9 ms
Wall time: 41.7 ms


Unnamed: 0,column,num_unique_values
0,bounces,2
1,last_action,7
2,source,162
3,medium,7
4,channelGrouping,8
5,browser,30
6,os,17
7,deviceCategory,3


**Observations**
1. High-cardinality categorical columns are present in the training data.
2. The following categorical columns are high-cardinality columns with the largest number of unique values
   - `source` (source of visitor traffic reaching the merchandise store's website)
   - `browser`
   - `os` (visitor's operating system used to access merchandise store's website)

   and will likely need to be binned or grouped
3. Infrequently occurring values in the following medium-cardinality columns related to the source of website visitor traffic will also be grouped
   - `channelGrouping`
   - `medium`
4. `last_action` will be left unchanged
   - it is likely that the last action performed by a visitor during their first visit to the merchandise store will have some influence on their probability (propensity) to make a purchase during a future visit

The category distributions (frequencies) before grouping are shown below for all categorical columns

In [32]:
%%time
dfs_cats_groups = []
for c in categorical_columns:
    # get fraction of unique values
    df_frequencies = df[c].value_counts().rename('number_of_visitors').to_frame().merge(
        (df[c].value_counts(normalize=True).rename('fraction_of_visitors')*100).to_frame(),
        left_index=True,
        right_index=True,
    )

    # map unique values for last_action and bounces to get meaningful names
    if c == 'last_action':
        df_frequencies.index = df_frequencies.index.map(mapper)
    if c == 'bounces':
        df_frequencies.index = df_frequencies.index.map({0: False, 1: True})

    # get running total of fraction (cumulative sum)
    df_frequencies = df_frequencies.sort_values(by=["fraction_of_visitors"]).assign(
        cumulative_fraction_of_visitors=lambda df: df["fraction_of_visitors"].cumsum(), column_name=c
    ).sort_values(by=["fraction_of_visitors"], ascending=False)

    # rename columns
    df_frequencies = df_frequencies.reset_index().rename(columns={c: "column_value"})
    dfs_cats_groups.append(df_frequencies)
df_frequencies_raw = pd.concat(dfs_cats_groups, ignore_index=True)
col = df_frequencies_raw.pop("column_name")
df_frequencies_raw.insert(0, col.name, col)
with pd.option_context("display.max_rows", None):
    display(df_frequencies_raw)

Unnamed: 0,column_name,column_value,number_of_visitors,fraction_of_visitors,cumulative_fraction_of_visitors
0,bounces,False,121218,68.929085,100.0
1,bounces,True,54641,31.070915,31.070915
2,last_action,Unknown,129476,73.624893,100.0
3,last_action,Product detail views,27207,15.470917,26.375107
4,last_action,Add product(s) to cart,8870,5.043814,10.90419
5,last_action,Completed purchase,5708,3.245782,5.860377
6,last_action,Check out,2948,1.676343,2.614595
7,last_action,Remove product(s) from cart,1581,0.899016,0.938252
8,last_action,Click through of product lists,69,0.039236,0.039236
9,source,google,83363,47.403317,100.0


CPU times: user 120 ms, sys: 3.37 ms, total: 123 ms
Wall time: 122 ms


**Observations**
1. We'll create frequency groupings as follows
   - `source` and `browser`
     - all categories which occur with a frequency of less than 5% will be grouped into a single value `other`
   - `os`, `channelGrouping` and `medium`
     - all categories which occur with a frequency of less than 10% will be grouped into a single value `other`

   These thresholds were determined by examining the output of `df_frequencies_raw`, which shows the frequencies of all categories for all categorical columns.
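
One way to compare candidate thresholds is to count how many categories would remain at each one. Below is a minimal sketch of this on a hypothetical column. The helper `n_categories_after_grouping()` is made up for illustration and is not used elsewhere in this notebook.

```python
import pandas as pd


def n_categories_after_grouping(values: pd.Series, threshold: float) -> int:
    """Count the categories left if values rarer than threshold become 'other'."""
    frequencies = values.value_counts(normalize=True)
    n_kept = int((frequencies >= threshold).sum())
    # One extra category is needed for 'other' if anything was grouped
    return n_kept + int((frequencies < threshold).any())


# Hypothetical column with frequencies of 60%, 25%, 8%, 4% and 3%
s = pd.Series(["a"] * 60 + ["b"] * 25 + ["c"] * 8 + ["d"] * 4 + ["e"] * 3)
for threshold in (0.01, 0.05, 0.10):
    print(threshold, n_categories_after_grouping(s, threshold))
```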

Below are lists of categorical columns to be grouped based on this threshold (5% or 10%)

In [33]:
cols_to_group_5_pct = ["source", "browser"]
cols_to_group_10_pct = ["os", "channelGrouping", "medium"]

We'll get names for the columns after grouping, by adding a `_grouped` suffix

In [34]:
grouped_cols_5_pct = [f"{c}_grouped" for c in cols_to_group_5_pct]
grouped_cols_10_pct = [f"{c}_grouped" for c in cols_to_group_10_pct]

Next, create lists of categorical columns that will and will not be grouped and then combine them into a single list

In [35]:
categorical_columns_mapped = (
    # columns that will not be grouped
    list(
        set(categorical_columns) - set(cols_to_group_5_pct) - set(cols_to_group_10_pct)
    )
    # columns that will be grouped
    + grouped_cols_5_pct
    + grouped_cols_10_pct
)

Create a duplicate of the columns that will be grouped and add a `_grouped` suffix to their column name

In [36]:
for c in cols_to_group_5_pct + cols_to_group_10_pct:
    df[f"{c}_grouped"] = df[c]

Finally, perform the grouping using
1. `pandas.Series.value_counts(normalize=True) < 0.05` (5% threshold)
2. `pandas.Series.value_counts(normalize=True) < 0.10` (10% threshold)

where all infrequently occurring values that satisfy these filters will have their values replaced by `other`

In [37]:
df[grouped_cols_5_pct] = df[grouped_cols_5_pct].apply(
    lambda x: x.mask(x.map(x.value_counts(normalize=True)) < 0.05, "other"), axis=0
)
df[grouped_cols_10_pct] = df[grouped_cols_10_pct].apply(
    lambda x: x.mask(x.map(x.value_counts(normalize=True)) < 0.10, "other"), axis=0
)
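
To see how this `mask()` and `map()` combination works, below is a step-by-step sketch on a small hypothetical column (the values and their counts are made up for illustration).

```python
import pandas as pd

# Hypothetical column with one dominant, one common and two rare values
s = pd.Series(["google"] * 15 + ["(direct)"] * 8 + ["dfa", "Partners"])

# Relative frequency of every unique value (sums to 1)
frequencies = s.value_counts(normalize=True)
# Look up each row's value to get that value's frequency, row by row
row_frequencies = s.map(frequencies)
# Replace values whose frequency is below the threshold (5% here)
s_grouped = s.mask(row_frequencies < 0.05, "other")

print(s_grouped.value_counts().to_dict())
```

Here, `dfa` and `Partners` each make up 4% of the rows, so both are replaced by `other`.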

The cardinality of the columns before and after grouping is shown below

In [38]:
%%time
df_nunique.merge(
    pd.DataFrame.from_records(
        [
            {
                "column": c.replace('_grouped', ''),
                'column_grouped': c,
                "num_unique_values_after_grouping": df[c].nunique(),
            }
            for c in categorical_columns_mapped
        ]
    ).assign(column_grouped=lambda df: df['column_grouped'] != df['column']),
    on=['column'],
    how='left',
)

CPU times: user 44.9 ms, sys: 148 µs, total: 45.1 ms
Wall time: 44.7 ms


Unnamed: 0,column,num_unique_values,column_grouped,num_unique_values_after_grouping
0,bounces,2,False,2
1,last_action,7,False,7
2,source,162,True,5
3,medium,7,True,4
4,channelGrouping,8,True,4
5,browser,30,True,3
6,os,17,True,5
7,deviceCategory,3,False,3


**Observations**
1. The cardinality has been significantly reduced for the columns where the infrequently occurring values were grouped (`column_grouped == True`).
2. The cardinality is unchanged for the columns where the infrequently occurring values were not grouped (`column_grouped == False`).

The category distributions (frequencies) after grouping are shown below for all categorical columns (including those that were grouped)

In [39]:
%%time
dfs_cats_groups = []
for c in categorical_columns_mapped:
    # get fraction of unique values
    df_frequencies = df[c].value_counts().rename('number_of_visitors').to_frame().merge(
        (df[c].value_counts(normalize=True).rename('fraction_of_visitors')*100).to_frame(),
        left_index=True,
        right_index=True,
    )

    # map unique values for last_action and bounces to get meaningful names
    if c == 'last_action':
        df_frequencies.index = df_frequencies.index.map(mapper)
    if c == 'bounces':
        df_frequencies.index = df_frequencies.index.map({0: False, 1: True})

    # get running total of fraction (cumulative sum)
    df_frequencies = df_frequencies.sort_values(by=["fraction_of_visitors"]).assign(
        cumulative_fraction_of_visitors=lambda df: df["fraction_of_visitors"].cumsum(), column_name=c
    ).sort_values(by=["fraction_of_visitors"], ascending=False)

    # rename columns
    df_frequencies = df_frequencies.reset_index().rename(columns={c: "column_value"})
    dfs_cats_groups.append(df_frequencies)
df_frequencies_grouped = pd.concat(dfs_cats_groups, ignore_index=True)
col = df_frequencies_grouped.pop("column_name")
df_frequencies_grouped.insert(0, col.name, col)
with pd.option_context("display.max_rows", None):
    display(df_frequencies_grouped)

Unnamed: 0,column_name,column_value,number_of_visitors,fraction_of_visitors,cumulative_fraction_of_visitors
0,deviceCategory,desktop,126053,71.678447,100.0
1,deviceCategory,mobile,43282,24.611763,28.321553
2,deviceCategory,tablet,6524,3.70979,3.70979
3,last_action,Unknown,129476,73.624893,100.0
4,last_action,Product detail views,27207,15.470917,26.375107
5,last_action,Add product(s) to cart,8870,5.043814,10.90419
6,last_action,Completed purchase,5708,3.245782,5.860377
7,last_action,Check out,2948,1.676343,2.614595
8,last_action,Remove product(s) from cart,1581,0.899016,0.938252
9,last_action,Click through of product lists,69,0.039236,0.039236


CPU times: user 113 ms, sys: 0 ns, total: 113 ms
Wall time: 112 ms


**Notes**
1. These distributions are shown here after frequency encoding (grouping) the high-cardinality columns, using the thresholds (5% and 10%) chosen for replacing infrequently occurring values in these columns. Earlier, the same distributions were shown for the raw categorical columns. In that `DataFrame`, there were 236 unique categories across all categorical columns (length of `df_frequencies_raw`). After dummy encoding (where we will drop one redundant category in each raw categorical column - [1](https://towardsdatascience.com/encoding-categorical-variables-one-hot-vs-dummy-encoding-6d5b9c46e2db)), there would be 236 - `<number-of-categorical-columns>` = 236 - 8 = 228 features.
2. After frequency grouping, there are 33 unique categories (length of `df_frequencies_grouped`). After dummy encoding (where we will again drop one redundant category in each grouped categorical column), the number of dummy variables will be 33 - `<number-of-categorical-columns>` = 33 - 8 = 25 features. This frequency encoding approach has reduced the cardinality by (228 - 25) / 228 = 0.89 (or 89%).
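
The arithmetic above relies on dummy encoding producing one column fewer than the number of unique categories in each feature. Below is a minimal sketch that checks this on a hypothetical `DataFrame`, assuming `pandas.get_dummies(..., drop_first=True)` is used for dummy encoding (the encoder used later in this project may differ).

```python
import pandas as pd

# Hypothetical frame with two categorical columns (3 and 2 unique values)
toy = pd.DataFrame(
    {
        "deviceCategory": ["desktop", "mobile", "tablet", "desktop"],
        "bounces": ["False", "True", "False", "True"],
    }
)

n_categories = int(toy.nunique().sum())  # 3 + 2 = 5
dummies = pd.get_dummies(toy, drop_first=True)

# One column is dropped per categorical feature: 5 - 2 = 3 dummy variables
print(n_categories, dummies.shape[1])
```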

The reduction in cardinality of the categorical features, after frequency grouping, is calculated below

In [40]:
frac_reduction_in_cats_cardinality = (
    100
    * (
        (len(df_frequencies_raw) - len(categorical_columns))
        - (len(df_frequencies_grouped) - len(categorical_columns))
    )
    / (len(df_frequencies_raw) - len(categorical_columns))
)
print(
    "Frequency encoding (grouping) has reduced cardinality of categorical features by "
    f"{frac_reduction_in_cats_cardinality:,.3f}%"
)

Frequency encoding (grouping) has reduced cardinality of categorical features by 89.035%


#### **Impact on Data Processing**

The groupings above have been learnt from the training data. We now need to create a lookup table for columns that were grouped so that we can apply the same groupings to unseen data (validation and test data splits). This means whenever we encounter the same infrequently occurring values in the validation and test data splits, they will be replaced by `other`.

This lookup table is defined below

In [41]:
df_groupings = pd.DataFrame.from_dict(
    {
        c: df[c]
        .value_counts(normalize=True)
        .rename("fraction")
        .to_frame()
        .query(f"fraction < {threshold}")
        .index.tolist()
        for cols, threshold in zip(
            [cols_to_group_5_pct, cols_to_group_10_pct], [0.05, 0.10]
        )
        for c in cols
    },
    orient="index",
).transpose()
df_groupings

Unnamed: 0,source,browser,os,channelGrouping,medium
0,sites.google.com,Firefox,Linux,Social,cpc
1,Partners,Internet Explorer,Chrome OS,Paid Search,affiliate
2,dfa,Edge,(not set),Affiliates,cpm
3,moma.corp.google.com,Safari (in-app),Windows Phone,Display,(not set)
4,siliconvalley.about.com,Opera,Nintendo Wii,(Other),
...,...,...,...,...,...
153,google.com.mx,,,,
154,start.wow.com,,,,
155,collaborate.northwestern.edu,,,,
156,grow.googleplex.com,,,,
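
The columns of this lookup table have different numbers of infrequent values, which is why shorter columns are padded with missing values. Below is a minimal sketch of the same `from_dict(..., orient="index").transpose()` pattern on hypothetical lists, including how the padding can be dropped when reading a column back.

```python
import pandas as pd

# Hypothetical lists of infrequent values, of different lengths per column
infrequent_values = {
    "source": ["dfa", "Partners", "sites.google.com"],
    "browser": ["Opera"],
}

# orient="index" builds one row per key and pads shorter lists with missing
# values; transposing then gives one column per categorical feature
lookup = pd.DataFrame.from_dict(infrequent_values, orient="index").transpose()
print(lookup)

# Drop the padding again when reading a column's values back
browser_values = lookup["browser"].dropna().tolist()
```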


We'll also create a lookup table of the frequently occurring values (those that were not grouped) in these same categorical columns. Whenever we encounter these values in the validation or test data splits, they will remain unchanged.

This lookup table is defined below

In [42]:
df_ungrouped = pd.DataFrame.from_dict(
    {
        c: df[c]
        .value_counts(normalize=True)
        .rename("fraction")
        .to_frame()
        .query(f"fraction >= {threshold}")
        .index.tolist()
        for cols, threshold in zip(
            [cols_to_group_5_pct, cols_to_group_10_pct], [0.05, 0.10]
        )
        for c in cols
    },
    orient="index",
).transpose()
df_ungrouped

Unnamed: 0,source,browser,os,channelGrouping,medium
0,google,Chrome,Macintosh,Organic Search,organic
1,(direct),Safari,Windows,Direct,referral
2,mall.googleplex.com,,iOS,Referral,(none)
3,youtube.com,,Android,,


For a quick demonstration of how these two lookup tables are used, we'll create a dummy validation `DataFrame` below with two categorical features

In [43]:
df_val = pd.DataFrame.from_records(
    [
        {"source": "Partners", "browser": "Internet Explorer"},
        {"source": "dfa", "browser": "new-browser"},
        {"source": "new-source-value", "browser": "Chrome"},
    ]
)
df_val

Unnamed: 0,source,browser
0,Partners,Internet Explorer
1,dfa,new-browser
2,new-source-value,Chrome


We'll now apply both the lookup tables defined above using the following approach
1. for all columns that were grouped, create columns with a suffix `_grouped` which contains the value `other` for infrequently occurring values
2. for all columns that were not grouped, create columns with a suffix `_ungrouped` which contains the same values with no changes
3. combine columns with the `_ungrouped` and `_grouped` suffixes into a single column
   - to do this, fill missing values in the `_grouped` column with those in the `_ungrouped` column
4. drop original columns and rename the combined columns appropriately

In [44]:
categorical_columns_validation_data = ["source", "browser"]
for c in categorical_columns_validation_data:
    # 1. replace infrequent values in columns that were grouped (add suffix _grouped)
    #    (drop the padding entries of the lookup table so they are never used as keys)
    df_val[f"{c}_grouped"] = df_val[c].map(
        {c_grouped: "other" for c_grouped in df_groupings[c].dropna().tolist()}
    )
    # 2. keep all values in columns that were not grouped (add suffix _ungrouped)
    df_val[f"{c}_ungrouped"] = df_val[c].map(
        {c_ungrouped: c_ungrouped for c_ungrouped in df_ungrouped[c].dropna().tolist()}
    )
    # 3. combine columns that were replaced (_grouped) and those that were not replaced (_ungrouped)
    df_val[f"{c}_grouped"] = df_val[f"{c}_grouped"].fillna(df_val[f"{c}_ungrouped"])
# 4. drop unwanted columns and rename
df_val = df_val.drop(
    columns=[f"{c}_ungrouped" for c in categorical_columns_validation_data]
    + categorical_columns_validation_data
).rename(columns={f"{c}_grouped": c for c in categorical_columns_validation_data})
df_val

Unnamed: 0,source,browser
0,other,other
1,other,
2,,Chrome


**Observations**
1. In both features of the validation data, there are new categories that were not seen in the training data. After applying the two lookup tables above, these values become missing (shown as blanks in the output above). We can fill these missing values using
   - `new` (or leave them missing) to indicate this is a new category
     - the ML model has not seen this value in the corresponding feature during training, so the predictive power of the feature for these rows of the unseen (validation) data will likely be reduced or minimal (the model won't know its relationship to the label `y`)
   - `other` to group this into the infrequently occurring categories that were identified from the training data
     - the disadvantage is that these new categories might have a different relationship to the label (`y`) than the grouped (`other`) category
     - in such a scenario
       - the ML model might not be able to leverage the full predictive power of such new categories in the validation (unseen) data when it makes predictions, since it was not trained to learn this relationship in the training data
       - the model will make predictions based on the relationship learnt between the grouped (`other`) category and the label (`y`)
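
The two fill options can be compared on a small example. The sketch below is illustrative only: `apply_category_lookup` and its `unseen_fill` parameter are hypothetical names that do not appear elsewhere in this notebook, and the lookup values are a hand-picked subset of the tables above.

```python
import pandas as pd


def apply_category_lookup(values, grouped, kept, unseen_fill="new"):
    """Map infrequent categories to 'other', keep frequent ones, label the rest."""
    mapping = {v: "other" for v in grouped}
    mapping.update({v: v for v in kept})
    # values absent from both lookups become missing after .map(), then get filled
    # (note: values that were already missing are filled with unseen_fill as well)
    return values.map(mapping).fillna(unseen_fill)


# hand-picked subset of the lookup tables (assumption: for illustration only)
grouped_sources = ["Partners", "dfa"]
kept_sources = ["google", "(direct)"]
sources = pd.Series(["Partners", "google", "new-source-value"])

as_new = apply_category_lookup(sources, grouped_sources, kept_sources, "new")
as_other = apply_category_lookup(sources, grouped_sources, kept_sources, "other")
print(as_new.tolist())  # ['other', 'google', 'new']
print(as_other.tolist())  # ['other', 'google', 'other']
```

With `unseen_fill="new"` the model can at least distinguish unseen categories from infrequent ones, whereas `unseen_fill="other"` merges the two.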

During data processing (after this EDA notebook), we will create the training, validation and test data splits for ML development and the same workflow will be used to handle categorical features during data processing.

## Exploratory Data Analysis

We will refer to each row in the data that was prepared above interchangeably as a visit or a visitor. Since duplicates by `fullvisitorid` have been dropped, each row represents a unique visitor. Also, for the rest of this notebook (EDA), we will only use the training data.

The prepared training data is shown below

In [45]:
with pd.option_context("display.max_rows", None):
    display(df.head())

Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,country,quarter,month,day_of_month,day_of_week,hour,...,browser,os,deviceCategory,added_to_cart,made_purchase_on_future_visit,source_grouped,browser_grouped,os_grouped,channelGrouping_grouped,medium_grouped
0,483329569933708956,1477437687,1,2016-10-25 19:21:27,United States,4,10,25,3,19,...,Chrome,Windows,desktop,0,False,google,Chrome,Windows,Organic Search,organic
1,9534112552538425546,1476671570,1,2016-10-16 22:32:50,United States,4,10,16,1,22,...,Opera,Windows,desktop,0,False,youtube.com,other,Windows,other,referral
2,4648924122067625674,1475958245,1,2016-10-08 16:24:05,United States,4,10,8,7,16,...,Opera,Windows,desktop,0,False,youtube.com,other,Windows,other,referral
3,6917772450375508123,1488649070,1,2017-03-04 12:37:50,United States,1,3,4,7,12,...,Chrome,Chrome OS,desktop,0,False,google,Chrome,other,Organic Search,organic
4,2743152869399749836,1481229164,1,2016-12-08 15:32:44,United States,4,12,8,5,15,...,Chrome,Windows,desktop,0,False,google,Chrome,Windows,Organic Search,organic


**Notes**
1. For the categorical columns, both the raw and frequency grouped versions are present. The versions with frequency grouping contain the suffix `_grouped` in their column names.

### Informative

<span style='color:red'>**To be Done.**</span>

### Insights

**Question 7. Show the fraction of visitors who did and did not make a purchase on their return visit to the merchandise store by month and day of the month.**

In [46]:
df_agg = df.groupby(["month", "day_of_month"], as_index=False).agg(
    {"made_purchase_on_future_visit": ["sum"], "fullvisitorid": "count"}
)
df_agg.columns = ["_".join(c).rstrip("_") for c in df_agg.columns.to_flat_index()]
df_agg = df_agg.rename(
    columns={
        "made_purchase_on_future_visit_sum": "return_purchasers",
        "fullvisitorid_count": "return_visitors",
    }
).assign(
    frac_return_purchasers=lambda df: 100
    * df["return_purchasers"]
    / df["return_visitors"]
)
df_agg

Unnamed: 0,month,day_of_month,return_purchasers,return_visitors,frac_return_purchasers
0,1,1,14,430,3.255814
1,1,2,10,529,1.890359
2,1,3,46,795,5.786164
3,1,4,31,853,3.634232
4,1,5,37,722,5.124654
...,...,...,...,...,...
239,12,27,37,665,5.56391
240,12,28,30,577,5.199307
241,12,29,23,588,3.911565
242,12,30,14,436,3.211009
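
The same aggregation can also be written with pandas named aggregation, which avoids flattening and renaming the column index afterwards. The sketch below uses a made-up stand-in (`df_toy`) for the prepared training data, so its values are not taken from the real dataset.

```python
import pandas as pd

# made-up stand-in for the prepared training data
df_toy = pd.DataFrame(
    {
        "fullvisitorid": ["a", "b", "c", "d"],
        "month": [1, 1, 1, 2],
        "day_of_month": [1, 1, 2, 1],
        "made_purchase_on_future_visit": [True, False, False, True],
    }
)

df_agg_toy = (
    df_toy.groupby(["month", "day_of_month"], as_index=False)
    .agg(
        # output_name=(input_column, aggregation)
        return_purchasers=("made_purchase_on_future_visit", "sum"),
        return_visitors=("fullvisitorid", "count"),
    )
    .assign(
        # scaled by 100, so this is a percentage despite the frac_ prefix
        frac_return_purchasers=lambda d: 100
        * d["return_purchasers"]
        / d["return_visitors"]
    )
)
print(df_agg_toy)
```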


Show plot

<span style='color:red'>**To be Done.**</span>

**Observations**
1. ...

**Question 8. Show the fraction of visitors who made a purchase on their return visit to the merchandise store by month and day of the week. Compare this to the fraction of channels used by these visitors.**

For visitors who did and did not make a purchase on a return visit, aggregate to get the number of
- visitors
- distinct channels used

In [47]:
df_agg = df.groupby(
    ["month", "day_of_week", "made_purchase_on_future_visit"], as_index=False
).agg({"fullvisitorid": "count", "channelGrouping": "nunique"})
df_agg = df_agg.rename(
    columns={"channelGrouping": "num_channels", "fullvisitorid": "return_visitors"}
)
df_agg.head()

Unnamed: 0,month,day_of_week,made_purchase_on_future_visit,return_visitors,num_channels
0,1,1,False,2409,7
1,1,1,True,79,4
2,1,2,False,3279,7
3,1,2,True,147,6
4,1,3,False,4158,7


Pivot the data so that the number of visitors who did and did not make a return purchase are shown as separate columns (repeat for number of channels used by visitors who did and did not make a return purchase)

In [48]:
df_agg_untidy = df_agg.pivot(
    index=["month", "day_of_week"],
    columns=["made_purchase_on_future_visit"],
    values=["return_visitors", "num_channels"],
).reset_index()
df_agg_untidy.columns = [
    "_".join([str(c) for c in c_list]).rstrip("_")
    for c_list in df_agg_untidy.columns.to_flat_index()
]
df_agg_untidy.head()

Unnamed: 0,month,day_of_week,return_visitors_False,return_visitors_True,num_channels_False,num_channels_True
0,1,1,2409,79,7,4
1,1,2,3279,147,7,6
2,1,3,4158,191,7,5
3,1,4,3164,122,7,6
4,1,5,2800,130,7,5


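Get the total number of visitors and channels. Note that the channel total adds the two per-group distinct counts, so a channel used by both groups is counted twice.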

In [49]:
# get totals
df_agg_untidy["num_channels"] = (
    df_agg_untidy["num_channels_False"] + df_agg_untidy["num_channels_True"]
)
df_agg_untidy["return_visitors"] = (
    df_agg_untidy["return_visitors_False"] + df_agg_untidy["return_visitors_True"]
)
df_agg_untidy.head()

Unnamed: 0,month,day_of_week,return_visitors_False,return_visitors_True,num_channels_False,num_channels_True,num_channels,return_visitors
0,1,1,2409,79,7,4,11,2488
1,1,2,3279,147,7,6,13,3426
2,1,3,4158,191,7,5,12,4349
3,1,4,3164,122,7,6,13,3286
4,1,5,2800,130,7,5,12,2930
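
Because `num_channels_False` and `num_channels_True` are distinct counts taken separately per group, their sum can exceed the number of distinct channels in the combined data. A minimal sketch with made-up rows for a single (`month`, `day_of_week`) group shows the difference:

```python
import pandas as pd

# made-up visitors for a single (month, day_of_week) group
df_toy = pd.DataFrame(
    {
        "channelGrouping": ["Direct", "Referral", "Direct", "Social"],
        "made_purchase_on_future_visit": [True, True, False, False],
    }
)

# distinct channels per group, then summed (the approach used above)
per_group = df_toy.groupby("made_purchase_on_future_visit")[
    "channelGrouping"
].nunique()
summed = per_group.sum()  # "Direct" is counted once in each group

# distinct channels across both groups combined
distinct = df_toy["channelGrouping"].nunique()

print(summed, distinct)  # 4 3
```

If the overall number of distinct channels is the intended denominator, it would have to be computed on the combined data rather than by summing the two columns.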


Convert the number of visitors and channels into fractions of visitors and channels (expressed as percentages, since both are scaled by 100)

In [50]:
df_agg_untidy["frac_channels_used_by_return_purchasers"] = 100 * (
    df_agg_untidy["num_channels_True"] / df_agg_untidy["num_channels"]
)
df_agg_untidy["frac_return_purchasers"] = 100 * (
    df_agg_untidy["return_visitors_True"] / df_agg_untidy["return_visitors"]
)
df_agg_untidy

Unnamed: 0,month,day_of_week,return_visitors_False,return_visitors_True,num_channels_False,num_channels_True,num_channels,return_visitors,frac_channels_used_by_return_purchasers,frac_return_purchasers
0,1,1,2409,79,7,4,11,2488,36.363636,3.175241
1,1,2,3279,147,7,6,13,3426,46.153846,4.290718
2,1,3,4158,191,7,5,12,4349,41.666667,4.391814
3,1,4,3164,122,7,6,13,3286,46.153846,3.712721
4,1,5,2800,130,7,5,12,2930,41.666667,4.43686
5,1,6,2600,106,7,5,12,2706,41.666667,3.917221
6,1,7,1926,51,7,4,11,1977,36.363636,2.579666
7,2,1,2065,59,7,4,11,2124,36.363636,2.777778
8,2,2,2903,121,7,4,11,3024,36.363636,4.001323
9,2,3,3089,131,7,5,12,3220,41.666667,4.068323


Melt the data back into long (`GROUP BY`-style) format, keeping only the columns with fractions (drop the columns with counts)

In [51]:
df_agg_tidy = df_agg_untidy.melt(
    id_vars=["month", "day_of_week"],
    value_vars=["frac_channels_used_by_return_purchasers", "frac_return_purchasers"],
)
df_agg_tidy

Unnamed: 0,month,day_of_week,variable,value
0,1,1,frac_channels_used_by_return_purchasers,36.363636
1,1,2,frac_channels_used_by_return_purchasers,46.153846
2,1,3,frac_channels_used_by_return_purchasers,41.666667
3,1,4,frac_channels_used_by_return_purchasers,46.153846
4,1,5,frac_channels_used_by_return_purchasers,41.666667
...,...,...,...,...
109,12,3,frac_return_purchasers,5.510156
110,12,4,frac_return_purchasers,5.274665
111,12,5,frac_return_purchasers,5.283259
112,12,6,frac_return_purchasers,5.003626


Show plot

<span style='color:red'>**To be Done.**</span>

**Observations**
1. ...

## Summary of Assumptions

1. Duplicates in `fullvisitorid` are not well understood and so they should be dropped in the training, validation and test data splits during data transformation.

## Summary of Tasks Performed

This notebook has performed the following
1. extracted attributes from dataset to create a *prepared dataset* for use in EDA
   - flattened nested columns for products and promotions
   - extracted columns that should intuitively help predict probability of making a purchase on a return (future) visit
2. addressed duplicated visits
3. handled high-cardinality categorical columns
4. performed non-exhaustive EDA for insights into the prepared data

## Limitations

1. Only limited EDA has been performed. For columns not covered by EDA, we will rely on intuition to decide whether they will be useful as ML features.

## Next Step

The next notebook will transform the raw data in this dataset, in order to prepare it for machine learning, as follows
1. extract the attributes of each visit that were recommended in this notebook
2. drop duplicates by `fullvisitorid`
3. handle high-cardinality categorical columns

This will be done separately for training, validation and test data splits. To prevent lookahead bias, the data will first be split into training, validation and test splits before
1. dropping duplicates
2. creating a lookup table using the training data to create category groupings for the categorical features
   - as shown in this notebook, these groupings are used to address the issue of high-cardinality categorical features during ML model development
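
The order of operations above can be sketched on a small made-up dataset. This is an illustration under stated assumptions, not the next notebook's implementation: the values in `df_toy`, the 0.5 threshold and keeping the first row per `fullvisitorid` are all hypothetical choices.

```python
import pandas as pd

# made-up visits; column names follow the prepared data, values do not
df_toy = pd.DataFrame(
    {
        "fullvisitorid": ["a", "a", "b", "c", "d", "e"],
        "visitStartTime": pd.to_datetime(
            [
                "2016-08-05",
                "2016-09-01",
                "2016-10-10",
                "2017-01-15",
                "2017-04-02",
                "2017-04-20",
            ]
        ),
        "browser": ["Chrome", "Chrome", "Chrome", "Opera", "Opera", "Edge"],
    }
)

# 1. split by time first (cutoff is the day after train_end_date)
train_cutoff = pd.Timestamp("2017-04-01")
is_train = df_toy["visitStartTime"] < train_cutoff

# 2. drop duplicates within each split (assumption: keep the first row)
df_train = df_toy[is_train].drop_duplicates(subset=["fullvisitorid"])
df_test = df_toy[~is_train].drop_duplicates(subset=["fullvisitorid"])

# 3. learn the groupings from the training split only
threshold = 0.5  # hypothetical, chosen to suit the toy data
fractions = df_train["browser"].value_counts(normalize=True)
infrequent = fractions[fractions < threshold].index.tolist()
frequent = fractions[fractions >= threshold].index.tolist()

# 4. apply the learnt groupings to the unseen split
mapping = {v: "other" for v in infrequent}
mapping.update({v: v for v in frequent})
test_browser = df_test["browser"].map(mapping)
print(test_browser.tolist())  # unseen categories (here "Edge") come out missing
```

Since nothing from the test split influences `fractions`, the groupings cannot leak information from the later period into training.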

---