# Get Data

In [1]:
# | echo: false
%load_ext lab_black

Import Python modules

In [2]:
import os
from datetime import date, datetime
from glob import glob

import pandas as pd
import pytz
from google.oauth2 import service_account

## About

### Objective
To start the analysis for this project, we will first retrieve visitor session data from a small sample of Google Merchandise Store sessions recorded between August 1, 2016 and August 1, 2017.

### Data
The dataset is provided through Google `BigQuery` (Google's [data warehouse](https://cloud.google.com/bigquery)) and is available [here](https://support.google.com/analytics/answer/7586738?hl=en#zippy=%2Cin-this-article). It consists of a single table of visitor session data. The [documentation for this dataset](https://support.google.com/analytics/answer/3437719?hl=en&ref_topic=3416089) lists the column names and their descriptions.

## User Inputs

Get relative path to project root directory

In [3]:
# | code-fold: false
PROJ_ROOT_DIR = os.path.join(os.pardir)

In [4]:
# | echo: false

Retrieve credentials for `bigquery` client

In [5]:
# | code-fold: false
# Google Cloud PROJECT ID
gcp_project_id = os.environ["GCP_PROJECT_ID"]

Get filepath to Google Cloud Service Account JSON key

In [6]:
# | code-fold: false
raw_data_dir = os.path.join(PROJ_ROOT_DIR, "data", "raw")
gcp_creds_fpath = glob(os.path.join(raw_data_dir, "*.json"))[0]
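
Indexing the `glob()` result with `[0]` raises a bare `IndexError` when no key file is present, and silently picks one file when several are present. The following is a minimal sketch of a stricter lookup, assuming exactly one `*.json` key is expected in the directory. The helper name `find_single_json_key` and the file name `service-account.json` are hypothetical and are used only for illustration.

```python
import tempfile
from pathlib import Path


def find_single_json_key(directory):
    """Return the only *.json file in `directory`, or raise a descriptive error."""
    matches = sorted(Path(directory).glob("*.json"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"Expected exactly one *.json key in {directory}, found {len(matches)}"
        )
    return matches[0]


# Demonstration against a temporary directory (not the project's data/raw folder)
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "service-account.json").write_text("{}")
    key_path = find_single_json_key(tmp_dir)
    key_name = key_path.name
    print(key_name)
```

In the notebook, such a helper could be called with `raw_data_dir` in place of the temporary directory.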

Authenticate `bigquery` client and get dictionary with credentials

In [7]:
# | code-fold: false
gcp_credentials = service_account.Credentials.from_service_account_file(gcp_creds_fpath)
gcp_auth_dict = dict(gcp_project_id=gcp_project_id, gcp_creds=gcp_credentials)

Create a mapping from each integer action type to its label, so that the `action_type` column can be shown with meaningful names

In [8]:
# | code-fold: false
mapper = {
    1: "Click through of product lists",
    2: "Product detail views",
    3: "Add product(s) to cart",
    4: "Remove product(s) from cart",
    5: "Check out",
    6: "Completed purchase",
    7: "Refund of purchase",
    8: "Checkout options",
    0: "Unknown",
}
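
As a small illustration of how such a mapping behaves with `pandas.Series.map()`, the sketch below uses a subset of the labels above. The code `9` is a hypothetical value that is absent from the mapping, included only to show that unmapped codes become missing values rather than raising an error.

```python
import pandas as pd

# Subset of the action-type labels defined above
action_labels = {
    0: "Unknown",
    3: "Add product(s) to cart",
    6: "Completed purchase",
}

# 9 is a hypothetical code that has no entry in the mapping
codes = pd.Series([0, 3, 6, 9])

# Codes without a label are mapped to a missing value
labels = codes.map(action_labels).astype(pd.StringDtype())
print(labels)
print(labels.isna().tolist())
```

Checking `labels.isna()` after mapping is one way to detect codes that the mapping does not cover.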

Define a Python helper function to execute a SQL query using Google BigQuery

In [9]:
# | code-fold: false
def run_sql_query(
    query: str,
    gcp_project_id: str,
    gcp_creds: service_account.Credentials,
    show_dtypes: bool = False,
    show_info: bool = False,
    show_df: bool = False,
) -> pd.DataFrame:
    """Run query on BigQuery and return results as pandas.DataFrame."""
    start_time = datetime.now(pytz.timezone("US/Eastern"))
    start_time_str = start_time.strftime("%Y-%m-%d %H:%M:%S.%f")
    print(f"Query execution start time = {start_time_str[:-3]}...", end="")
    df = pd.read_gbq(
        query,
        project_id=gcp_project_id,
        credentials=gcp_creds,
        dialect="standard",
        # configuration is optional, since default for query caching is True
        configuration={"query": {"useQueryCache": True}},
        # use_bqstorage_api=True,
    )
    end_time = datetime.now(pytz.timezone("US/Eastern"))
    end_time_str = end_time.strftime("%Y-%m-%d %H:%M:%S.%f")
    duration = (end_time - start_time).total_seconds()
    print(f"done at {end_time_str[:-3]} ({duration:.3f} seconds).")
    print(f"Query returned {len(df):,} rows")
    if show_df:
        with pd.option_context("display.max_columns", None):
            display(df)
    if show_dtypes:
        display(df.dtypes.rename("dtype").to_frame().transpose())
    if show_info:
        df.info()
    return df
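
The helper above mixes timing with query execution. As a sketch only, the timing part could be separated into a reusable context manager built on `time.perf_counter()`. The names `timed` and `timings` are hypothetical, and a list comprehension stands in for the BigQuery call so that the example runs without credentials.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration (in seconds) of the block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("fake_query", timings):
    # Stand-in for a call such as pd.read_gbq(...)
    rows = [i * i for i in range(1000)]

print(f"fake_query took {timings['fake_query']:.3f} seconds for {len(rows):,} rows")
```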

## Get Data

Show the [column properties (schema)](https://cloud.google.com/bigquery/docs/information-schema-columns#example) of the table of user sessions on the Google Merchandise Store

In [10]:
# | code-fold: true
query = """
        SELECT table_name,
               column_name,
               data_type,
               is_nullable,
               ordinal_position
        FROM `data-to-insights`.ecommerce.INFORMATION_SCHEMA.COLUMNS
        WHERE table_name = 'web_analytics'
        """
_ = run_sql_query(query, **gcp_auth_dict, show_df=True)

Query execution start time = 2023-04-13 16:50:18.755...done at 2023-04-13 16:50:20.428 (1.673 seconds).
Query returned 15 rows


Unnamed: 0,table_name,column_name,data_type,is_nullable,ordinal_position
0,web_analytics,visitorId,INT64,YES,1
1,web_analytics,visitNumber,INT64,YES,2
2,web_analytics,visitId,INT64,YES,3
3,web_analytics,visitStartTime,INT64,YES,4
4,web_analytics,date,STRING,YES,5
5,web_analytics,totals,"STRUCT<visits INT64, hits INT64, pageviews INT...",YES,6
6,web_analytics,trafficSource,"STRUCT<referralPath STRING, campaign STRING, s...",YES,7
7,web_analytics,device,"STRUCT<browser STRING, browserVersion STRING, ...",YES,8
8,web_analytics,geoNetwork,"STRUCT<continent STRING, subContinent STRING, ...",YES,9
9,web_analytics,customDimensions,"ARRAY<STRUCT<index INT64, value STRING>>",NO,10


::: {.callout-note title="Notes"}

1. There are nine flat (non-nested) columns and six nested columns in the `web_analytics` table, for a total of 15 columns.
2. The nested columns are briefly explained below
   - `geoNetwork` provides information about the geography of the users who accessed the store website
   - `customDimensions` are [user-configured combinations of values of specific metrics](https://support.google.com/analytics/answer/2709828?hl=en#zippy=%2Cin-this-article) to be tracked
      - these won't be used for the current project
   - `device` provides information about the user's electronic device that was used to access the store website
   - `totals` provides aggregated stats per visit
   - `trafficSource` contains metadata about the traffic source that resulted in a user's visit
   - `hits` [tracks execution of Google Analytics tracking code](https://paradoxmarketing.io/capabilities/knowledge-management/insights/which-kind-of-hits-does-google-analytics-track-what-to-know/) embedded in the store's website (this happens when a user interacts with a store webpage)
:::
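
Nested columns can also be identified programmatically from the `data_type` strings returned by `INFORMATION_SCHEMA.COLUMNS`. The sketch below assumes that a type starting with `STRUCT<` or `ARRAY<` marks a nested column. The small `schema` frame is built by hand for illustration, and the `STRUCT` type for `totals` is deliberately abbreviated rather than copied from the real table.

```python
import pandas as pd

# Hand-built stand-in for the schema query result (types abbreviated)
schema = pd.DataFrame(
    {
        "column_name": ["visitNumber", "date", "totals", "customDimensions"],
        "data_type": [
            "INT64",
            "STRING",
            "STRUCT<visits INT64, hits INT64>",
            "ARRAY<STRUCT<index INT64, value STRING>>",
        ],
    }
)

# A column is treated as nested if its type starts with STRUCT< or ARRAY<
schema["is_nested"] = schema["data_type"].str.match(r"(?:STRUCT|ARRAY)<")
print(schema[["column_name", "is_nested"]])
```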

::: {.callout-tip title="Observations"}

1. While features for machine learning can be extracted from all the nested and flattened columns, the usefulness of these features is based on our understanding of the data. Only the features that can be
   - well understood
   - extracted without data leakage/lookahead bias

   should be extracted for use in ML. The following nested columns are intuitive, so they will likely be easier to understand in the context of e-commerce transactions than the other nested columns
   - `geoNetwork`
   - `device`
   - `totals`
:::

A simple `SELECT` is used to show a subset of the columns for 15 hit-level rows from visits on November 1 and November 2, 2016. Since the query has no `ORDER BY` clause, these are not necessarily the 15 earliest rows.

In [11]:
# | code-fold: false
query = """
        SELECT fullvisitorid,
               -- get date of transaction
               date,
               -- convert date to datetime in year-month-date format
               PARSE_DATE('%Y%m%d', DATE) AS datetime_ymd,
               -- visit start time and number
               DATETIME(TIMESTAMP(TIMESTAMP_SECONDS(visitStartTime)), 'US/Pacific') AS visitStartTime,
               visitNumber,
               -- source of the traffic from which the session was initiated
               trafficSource.source,
               -- medium of the traffic from which the session was initiated
               trafficSource.medium,
               -- referring channel connected to session
               channelGrouping,
               -- user's type of device
               device.deviceCategory AS device_category,
               -- user's operating system
               device.operatingSystem AS os,
               -- type of e-commerce action performed during the hit
               CAST(h.eCommerceAction.action_type AS INT64) AS action_type,
               -- transactions
               totals.transactions,
               -- flag for first-time visits to the merchandise store (1 if new, NULL otherwise)
               totals.newVisits AS newVisits
        FROM `data-to-insights.ecommerce.web_analytics`,
        UNNEST(hits) AS h
        WHERE date BETWEEN '20161101' AND '20161102'
        LIMIT 15
        """
df = run_sql_query(query, **gcp_auth_dict, show_df=False)
df["action_type"] = df["action_type"].map(mapper).astype(pd.StringDtype())
print(
    f"Datetimes: {df['visitStartTime'].min().strftime('%Y-%m-%d %H:%M:%S')} - "
    f"{df['visitStartTime'].max().strftime('%Y-%m-%d %H:%M:%S')}"
)
display(df)
df.info()

Query execution start time = 2023-04-13 16:50:20.458...done at 2023-04-13 16:50:21.646 (1.188 seconds).
Query returned 15 rows
Datetimes: 2016-11-01 16:26:08 - 2016-11-02 12:58:15


Unnamed: 0,fullvisitorid,date,datetime_ymd,visitStartTime,visitNumber,source,medium,channelGrouping,device_category,os,action_type,transactions,newVisits
0,7563857575439343066,20161102,2016-11-02,2016-11-02 12:58:15,1,google,organic,Organic Search,desktop,Windows,Unknown,,1.0
1,7563857575439343066,20161102,2016-11-02,2016-11-02 12:58:15,1,google,organic,Organic Search,desktop,Windows,Unknown,,1.0
2,7563857575439343066,20161102,2016-11-02,2016-11-02 12:58:15,1,google,organic,Organic Search,desktop,Windows,Unknown,,1.0
3,7563857575439343066,20161102,2016-11-02,2016-11-02 12:58:15,1,google,organic,Organic Search,desktop,Windows,Unknown,,1.0
4,7563857575439343066,20161102,2016-11-02,2016-11-02 12:58:15,1,google,organic,Organic Search,desktop,Windows,Unknown,,1.0
5,7563857575439343066,20161102,2016-11-02,2016-11-02 12:58:15,1,google,organic,Organic Search,desktop,Windows,Unknown,,1.0
6,7563857575439343066,20161102,2016-11-02,2016-11-02 12:58:15,1,google,organic,Organic Search,desktop,Windows,Unknown,,1.0
7,719274696629877759,20161101,2016-11-01,2016-11-01 16:26:08,2,google,organic,Organic Search,desktop,Macintosh,Unknown,,
8,719274696629877759,20161101,2016-11-01,2016-11-01 16:26:08,2,google,organic,Organic Search,desktop,Macintosh,Unknown,,
9,719274696629877759,20161101,2016-11-01,2016-11-01 16:26:08,2,google,organic,Organic Search,desktop,Macintosh,Unknown,,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   fullvisitorid    15 non-null     object        
 1   date             15 non-null     object        
 2   datetime_ymd     15 non-null     dbdate        
 3   visitStartTime   15 non-null     datetime64[ns]
 4   visitNumber      15 non-null     Int64         
 5   source           15 non-null     object        
 6   medium           15 non-null     object        
 7   channelGrouping  15 non-null     object        
 8   device_category  15 non-null     object        
 9   os               15 non-null     object        
 10  action_type      15 non-null     string        
 11  transactions     0 non-null      Int64         
 12  newVisits        10 non-null     Int64         
dtypes: Int64(3), datetime64[ns](1), dbdate(1), object(7), string(1)
memory usage: 1.7+ KB
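
For comparison, the two conversions performed in SQL above (`PARSE_DATE()` on `date` and `DATETIME(..., 'US/Pacific')` on `visitStartTime`) can be approximated in pandas. This is a sketch under assumptions: the epoch value below was chosen to correspond to the `2016-11-02 12:58:15` start time shown in the output, and `America/Los_Angeles` is used as the canonical IANA name for the *US/Pacific* timezone.

```python
import pandas as pd

# One raw row: date as a YYYYMMDD string, visit start time as epoch seconds (UTC)
raw = pd.DataFrame({"date": ["20161102"], "visitStartTime": [1478116695]})

# Equivalent of PARSE_DATE('%Y%m%d', date)
date_ymd = pd.to_datetime(raw["date"], format="%Y%m%d")

# Equivalent of DATETIME(TIMESTAMP_SECONDS(visitStartTime), 'US/Pacific'):
# interpret as UTC, convert to Pacific time, then drop the timezone information
start_local = (
    pd.to_datetime(raw["visitStartTime"], unit="s", utc=True)
    .dt.tz_convert("America/Los_Angeles")
    .dt.tz_localize(None)
)

print(date_ymd.iloc[0].date(), start_local.iloc[0])
```

Doing the conversion in SQL, as the notebook does, keeps the timezone choice in one place. The pandas version is shown only to make the transformation explicit.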


::: {.callout-note title="Notes"}

1. For the `datetime_ymd` column, [`dbdate` is the expected pandas dtype](https://cloud.google.com/python/docs/reference/bigquery/latest/upgrading#changes-to-data-types-when-reading-a-pandas-dataframe) for a BigQuery `DATE` column.
2. The columns shown here are a subset of all the columns available in the table.
3. `UNNEST` has to be used [to flatten arrays (nested columns, such as `hits`)](https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays) (explode them into separate rows), which is similar to [`pandas.DataFrame.explode()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html).
4. `BigQuery` caches query results, so rerunning the same SQL query will usually return results faster than the first run. See limitations [here](https://cloud.google.com/bigquery/docs/cached-results#limitations).
5. The timezone in the `DATETIME()` function was arbitrarily [set to be](https://stackoverflow.com/a/56869441/4057186) *US/Pacific*. This means we are assuming our local timezone is *US/Pacific*. This will be kept consistent throughout this project.
:::
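
To make the comparison between `UNNEST` and `pandas.DataFrame.explode()` concrete, the sketch below explodes a hand-built nested `hits` column into one row per hit. The `sessions` frame and its values are hypothetical. The action types are stored as strings here on the assumption that this mirrors why the query applies `CAST(... AS INT64)`.

```python
import pandas as pd

# Two hypothetical sessions, each holding a list of hit records
sessions = pd.DataFrame(
    {
        "visitId": [101, 102],
        "hits": [
            [{"action_type": "0"}, {"action_type": "3"}],
            [{"action_type": "6"}],
        ],
    }
)

# One row per hit, with session-level columns repeated (similar to UNNEST)
hit_rows = sessions.explode("hits", ignore_index=True)

# Pull the field out of each hit record and cast it to an integer
hit_rows["action_type"] = hit_rows["hits"].map(lambda hit: int(hit["action_type"]))
hit_rows = hit_rows.drop(columns="hits")
print(hit_rows)
```

As with `UNNEST`, session-level values such as `visitId` are duplicated on every hit row, which explains the repeated rows in the query output above.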

## Next Step

The next step in the analysis workflow will involve exploring the visits data.