# Explore Google Analytics Tracking Data

In [1]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

Import necessary Python modules

In [2]:
import os
import sys

import pandas as pd

Get relative path to project root directory

In [3]:
PROJ_ROOT_DIR = os.path.join(os.pardir)
src_dir = os.path.join(PROJ_ROOT_DIR, "src")
sys.path.append(src_dir)

Import custom Python modules

In [4]:
%aimport bigquery_auth_helpers
from bigquery_auth_helpers import auth_to_bigquery

%aimport transform_helpers
import transform_helpers as th

## About

Explore Google Analytics (GA 360) tracking data.

## User Inputs

In [5]:
split_start_date = "20160901"
split_end_date = "20160910"
train_split_start_date = "20160901"
test_split_end_date = "20160930"

Get path to data sub-folders

In [6]:
data_dir = os.path.join(PROJ_ROOT_DIR, "data")
raw_data_dir = os.path.join(data_dir, "raw")

## Authenticate to `BigQuery`

In [7]:
gcp_auth_dict = auth_to_bigquery(raw_data_dir)

## Explore Google Analytics Tracking Data

### Get Return Visits

Get all visits made from the United States during September 2016, flagging each visit on which a returning visitor made a purchase

In [8]:
%%time
query_str = f"""
            WITH
            return_visits AS (
                 SELECT fullvisitorid,
                        visitId,
                        visitNumber,
                        DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Pacific') AS visitStartTime,
                        IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, True, False) AS made_purchase_on_future_visit
                 FROM `data-to-insights.ecommerce.web_analytics`
                 WHERE date BETWEEN '{train_split_start_date}' AND '{test_split_end_date}'
                 AND geoNetwork.country = 'United States'
                 GROUP BY fullvisitorid,
                          visitId,
                          visitNumber,
                          visitStartTime
            )
            SELECT *
            FROM return_visits
            ORDER BY fullvisitorid
            """
df = th.extract_data(query_str, gcp_auth_dict)
df.head()

Query execution start time = 2023-05-09 18:01:05.113...done at 2023-05-09 18:01:08.594 (3.481 seconds).
Query returned 28,013 rows
CPU times: user 559 ms, sys: 59.3 ms, total: 618 ms
Wall time: 3.49 s


Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,made_purchase_on_future_visit
0,93957001069502,1474985724,1,2016-09-27 07:15:24,False
1,245437374675368,1472862842,1,2016-09-02 17:34:02,False
2,639845445148063,1473694653,1,2016-09-12 08:37:33,False
3,139156957304532,1473013369,1,2016-09-04 11:22:49,False
4,1601342180848204,1474930306,1,2016-09-26 15:51:46,False
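
The query converts `visitStartTime` from Unix epoch seconds to a timezone-naive `US/Pacific` datetime. As a rough sanity check, the same conversion can be sketched in pandas. The variable names are illustrative, and the sketch assumes that the `visitId` of the first row above equals that visit's start time in epoch seconds.

```python
import pandas as pd

# Assumption: for this row, visitId equals visitStartTime in epoch seconds
epoch_seconds = 1474985724

local_start = (
    pd.to_datetime(epoch_seconds, unit="s", utc=True)  # timezone-aware UTC timestamp
    .tz_convert("US/Pacific")  # shift to Pacific local time
    .tz_localize(None)  # drop the timezone, similar to a BigQuery DATETIME
)
print(local_start)  # 2016-09-27 07:15:24
```

The result agrees with the `visitStartTime` shown for that row.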


Keep only the visits of visitors who made more than one visit during this period

In [9]:
df_return_visits = df[df.duplicated(subset=["fullvisitorid"], keep=False)]
df_return_visits.head(10)

Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,made_purchase_on_future_visit
12,5884918507288420,1474464720,1,2016-09-21 06:32:00,False
13,5884918507288420,1474469723,2,2016-09-21 07:55:23,False
19,10286039787739137,1475249827,2,2016-09-30 08:37:07,False
20,10286039787739137,1475084026,1,2016-09-28 10:33:46,False
32,15065858137292339,1473641554,5,2016-09-11 17:52:34,False
33,15065858137292339,1473119040,2,2016-09-05 16:44:00,False
34,15065858137292339,1473630321,4,2016-09-11 14:45:21,False
35,15065858137292339,1473466324,3,2016-09-09 17:12:04,False
36,15065858137292339,1472856937,1,2016-09-02 15:55:37,False
37,15065858137292339,1473997381,6,2016-09-15 20:43:01,False
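
The filter above relies on `keep=False`, which marks every row of a repeated `fullvisitorid` rather than only the later occurrences. A minimal sketch on hypothetical toy data (the frame and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: visitor "a" has two visits, "b" has one, "c" has three
toy = pd.DataFrame(
    {
        "fullvisitorid": ["a", "a", "b", "c", "c", "c"],
        "visitNumber": [1, 2, 1, 1, 2, 3],
    }
)

# keep=False flags all rows whose id appears more than once
multi_visit = toy[toy.duplicated(subset=["fullvisitorid"], keep=False)]
print(multi_visit["fullvisitorid"].unique().tolist())  # ['a', 'c']
```

Visitor `b`, who has a single visit, is dropped, while all visits of `a` and `c` are retained.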


Return visits are made by two types of visitors

1. those that made a purchase
2. those that did not make a purchase

The number of these visits on which a return-visit purchase was and was not made is shown below

In [10]:
df_return_visits["made_purchase_on_future_visit"].value_counts().reset_index().assign(
    proportion=lambda df: df["count"] / df["count"].sum()
)

Unnamed: 0,made_purchase_on_future_visit,count,proportion
0,False,10507,0.955616
1,True,488,0.044384


::: {.callout-tip title="Observations"}

1. Only approximately 4.4% of all return visits in September 2016 resulted in a purchase.
:::
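
The count and proportion table above depends on `value_counts().reset_index()` naming the counts column `count`, which is the behavior in pandas 2.0 and later. A minimal sketch that avoids that dependency, using hypothetical toy flags in place of the real column:

```python
import pandas as pd

# Hypothetical toy flags standing in for made_purchase_on_future_visit
flags = pd.Series([False, False, False, True], name="made_purchase_on_future_visit")

summary = (
    pd.DataFrame(
        {
            "count": flags.value_counts(),  # absolute frequency
            "proportion": flags.value_counts(normalize=True),  # relative frequency
        }
    )
    .rename_axis("made_purchase_on_future_visit")  # name the index explicitly
    .reset_index()
)
print(summary)
```

Naming the columns explicitly keeps the result the same regardless of how a given pandas version labels the output of `value_counts()`.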

### Show Frequency of Repeat Customers During Return Visits

Some visitors made multiple return visits and a small subset of such visitors made multiple purchases. This is shown below.

First, for each visitor, get the following

1. whether that visitor ever made a purchase on a return visit
2. number of purchases made across all return visits

In [11]:
df_return_visits_with_purchase = df_return_visits.groupby(
    "fullvisitorid", as_index=False
).agg({"made_purchase_on_future_visit": ["max", "sum"]})
df_return_visits_with_purchase.columns = [
    "_".join(a).rstrip("_")
    for a in df_return_visits_with_purchase.columns.to_flat_index()
]
df_return_visits_with_purchase

Unnamed: 0,fullvisitorid,made_purchase_on_future_visit_max,made_purchase_on_future_visit_sum
0,0005884918507288420,False,0
1,0010286039787739137,False,0
2,0015065858137292339,False,0
3,0026203741366904270,True,1
4,0027817676806595220,False,0
...,...,...,...
3755,9986848664463401272,False,0
3756,9990362099175067703,False,0
3757,999203594099745000,False,0
3758,9992704342633956099,False,0
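
The per-visitor aggregation above builds a two-level column index and then flattens it by joining the levels. As an alternative sketch, pandas named aggregation produces the same flat column names directly. The toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy visits: "a" never purchased, "b" purchased on two visits
toy = pd.DataFrame(
    {
        "fullvisitorid": ["a", "a", "b", "b", "b"],
        "made_purchase_on_future_visit": [False, False, True, False, True],
    }
)

# Named aggregation: output_column=(input_column, aggregation)
per_visitor = toy.groupby("fullvisitorid", as_index=False).agg(
    made_purchase_on_future_visit_max=("made_purchase_on_future_visit", "max"),
    made_purchase_on_future_visit_sum=("made_purchase_on_future_visit", "sum"),
)
print(per_visitor)
```

Here `max` over booleans answers "did any visit include a purchase?" and `sum` counts the visits that did.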


Next, get visitors who made a purchase on a return visit

In [12]:
df_return_visits_with_purchase = df_return_visits_with_purchase.query(
    "made_purchase_on_future_visit_max == True"
)
df_return_visits_with_purchase

Unnamed: 0,fullvisitorid,made_purchase_on_future_visit_max,made_purchase_on_future_visit_sum
3,0026203741366904270,True,1
7,0036417634769000138,True,1
8,0037518757923116572,True,1
14,0061519776091452595,True,1
19,0070976956518566605,True,1
...,...,...,...
3709,9891815404632176641,True,1
3718,9912185644936709935,True,1
3725,9941749289816017941,True,1
3742,9961396584113412108,True,1


::: {.callout-note title="Notes"}

1. `made_purchase_on_future_visit` indicates if a purchase was made during a return visit.
2. `made_purchase_on_future_visit_max` indicates whether a visitor who made a return visit to the store made a purchase during any such visit.
3. `made_purchase_on_future_visit_sum` shows the total number of purchases made by a visitor across all return visits.
:::
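
For reference, the flag in note 1 comes from the `COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0` expression in the query. The sketch below mirrors that condition in pandas on hypothetical sessions. It assumes, as the query does, that a null `newVisits` marks a visit that is not the visitor's first.

```python
import pandas as pd

# Hypothetical sessions; column names mirror totals.transactions and totals.newVisits
sessions = pd.DataFrame(
    {
        "visitNumber": [1, 2, 3, 1],
        "transactions": [1, None, 2, None],
        "newVisits": [1, None, None, 1],
    }
)

# A purchase counts only when the session is not a first visit (newVisits is null)
sessions["made_purchase_on_future_visit"] = (
    sessions["transactions"].fillna(0) > 0
) & sessions["newVisits"].isna()
print(sessions["made_purchase_on_future_visit"].tolist())
# [False, False, True, False]
```

The first session includes a purchase but is a first visit, so it is not flagged. Only the third session, a return visit with transactions, is.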

Finally, filter the visits during September 2016 to keep only those of visitors who made **a purchase on at least one return visit**, so that visitors with purchases on multiple return visits can be selected from this subset

In [13]:
df_return_visits_merged = df_return_visits.merge(
    df_return_visits_with_purchase, on=["fullvisitorid"], how="inner"
)
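
The inner merge acts as a filter: visits survive only if their `fullvisitorid` appears in the per-visitor purchase table. A minimal sketch on hypothetical toy frames, adding the `validate` argument as an optional guard that each visitor appears at most once on the right-hand side:

```python
import pandas as pd

# Hypothetical visits (one row per visit) and purchasers (one row per visitor)
visits = pd.DataFrame(
    {"fullvisitorid": ["a", "a", "b", "c"], "visitNumber": [1, 2, 1, 1]}
)
purchasers = pd.DataFrame(
    {"fullvisitorid": ["a"], "made_purchase_on_future_visit_sum": [1]}
)

# Inner join keeps only visits of visitors present in purchasers;
# validate raises MergeError if purchasers has duplicate visitor ids
merged = visits.merge(
    purchasers, on=["fullvisitorid"], how="inner", validate="many_to_one"
)
print(merged)
```

Both visits of visitor `a` are kept with the per-visitor total repeated on each row, while `b` and `c` are dropped.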

Visitors who made a purchase on more than two return visits are shown below

In [14]:
df_return_visits_merged.query("made_purchase_on_future_visit_sum > 2").head(15)

Unnamed: 0,fullvisitorid,visitId,visitNumber,visitStartTime,made_purchase_on_future_visit,made_purchase_on_future_visit_max,made_purchase_on_future_visit_sum
316,2074164338647079047,1474310682,22,2016-09-19 11:44:42,False,True,4
317,2074164338647079047,1472735454,16,2016-09-01 06:10:54,False,True,4
318,2074164338647079047,1474373863,23,2016-09-20 05:17:43,True,True,4
319,2074164338647079047,1473196307,19,2016-09-06 14:11:47,True,True,4
320,2074164338647079047,1473792585,21,2016-09-13 11:49:45,True,True,4
321,2074164338647079047,1473446196,20,2016-09-09 11:36:36,True,True,4
322,2074164338647079047,1472746638,18,2016-09-01 09:17:18,False,True,4
323,2074164338647079047,1472739238,17,2016-09-01 07:13:58,False,True,4
535,280738376597848400,1473873900,2,2016-09-14 10:25:00,True,True,3
536,280738376597848400,1473878689,3,2016-09-14 11:44:49,False,True,3


::: {.callout-tip title="Observations"}

1. These rows belong to the subset of return visitors who made a purchase on multiple return visits, since every value in the `made_purchase_on_future_visit_sum` column is larger than 1 (the filter above keeps only values larger than 2).
:::

For visitors who made a purchase on at least one return visit, show how many of their visits belong to visitors who purchased on a single return visit and how many belong to visitors who purchased on multiple return visits

In [15]:
df_return_visits_merged.assign(
    made_purchase_on_multiple_return_visits=lambda df: df[
        "made_purchase_on_future_visit_sum"
    ]
    > 1
).groupby("made_purchase_on_multiple_return_visits", as_index=False)[
    "fullvisitorid"
].count().rename(
    columns={"fullvisitorid": "count"}
).assign(
    proportion=lambda df: df["count"] / df["count"].sum()
)

Unnamed: 0,made_purchase_on_multiple_return_visits,count,proportion
0,False,1385,0.835344
1,True,273,0.164656


::: {.callout-tip title="Observations"}

1. Most visits (approximately 84%) by visitors who made a purchase on a return visit during September 2016 belong to visitors who purchased on only one of these visits. Only approximately 16% belong to visitors who purchased on multiple return visits. Note that these counts are per visit (row) rather than per unique visitor.
2. The SQL logic defined above captures repeat customers. These are visitors who made a purchase during more than one visit to the store. This is required per the scope of this project, since the business wants to grow both repeat and new customers.
:::
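
Because the merged frame has one row per visit, the grouped `count()` above counts visits. If a per-visitor breakdown is wanted instead, one option is `nunique()`. The sketch below uses a hypothetical toy frame, and the `num_visitors` column name is illustrative.

```python
import pandas as pd

# Hypothetical merged frame: one row per visit, per-visitor totals repeated
toy_merged = pd.DataFrame(
    {
        "fullvisitorid": ["a", "a", "a", "b", "b", "c"],
        "made_purchase_on_future_visit_sum": [2, 2, 2, 1, 1, 1],
    }
)

by_visitor = (
    toy_merged.assign(
        made_purchase_on_multiple_return_visits=lambda df: df[
            "made_purchase_on_future_visit_sum"
        ]
        > 1
    )
    .groupby("made_purchase_on_multiple_return_visits")["fullvisitorid"]
    .nunique()  # count distinct visitors, not visit rows
    .reset_index(name="num_visitors")
    .assign(proportion=lambda df: df["num_visitors"] / df["num_visitors"].sum())
)
print(by_visitor)
```

In this toy example a row count would give 3 and 3, whereas the distinct-visitor count gives 2 (`b`, `c`) and 1 (`a`).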