# Week 3 Exercise: Exploratory Data Analysis

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys

import pandas as pd
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

In [3]:
PROJ_ROOT = os.pardir
src_dir = os.path.join(PROJ_ROOT, "src")
sys.path.append(src_dir)

In [4]:
%aimport sql_utils
from sql_utils import compare_outputs

## About

Exploratory Data Analysis (EDA) of the `events.website_activity` table for the Week 3 exercise.

## Create SQL Engine

In [5]:
engine = create_engine(
    URL(
        drivername="driver",
        account=os.getenv("UPLIMIT_SNOWFLAKE_ACCOUNT"),
        user=os.getenv("UPLIMIT_SNOWFLAKE_USER"),
        password=os.getenv("UPLIMIT_SNOWFLAKE_PASS"),
        warehouse=os.getenv("UPLIMIT_SNOWFLAKE_WAREHOUSE"),
        database=os.getenv("UPLIMIT_SNOWFLAKE_DB_NAME"),
        role=os.getenv("UPLIMIT_SNOWFLAKE_ROLE"),
        timezone=os.getenv("MY_TIMEZONE"),
    ),
    connect_args=dict(session_parameters={"USE_CACHED_RESULT": False}),
)

## Connect

Load SQL extension

In [6]:
%load_ext sql

Connect to Uplimit's Snowflake instance

In [7]:
%sql engine --alias connection

## Exploratory Data Analysis

### Overview of Table

Show the first four rows of the events table

In [8]:
%%sql result = <<
SELECT *
FROM events.website_activity
LIMIT 4

event_id,session_id,user_id,event_timestamp,event_details
cbf76489-37c3-4fc9-a6f1-be0823067aec,a81ce2ae-74f6-4409-8ba6-e0807e714db9,86d12402-6608-4d50-af53-dd12b6041531,2023-01-24 03:58:59.424000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
e4090e60-bb72-4057-a576-c6d3c0dfae62,70ce9652-49f6-4c00-a9d3-ca0aee482147,4558e9dd-41e9-424c-b149-7d8f93729be8,2023-01-24 11:58:57.985000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
3d781e12-3834-4f8c-9b5f-224b0f3f7e38,63b4464c-0523-4c66-bc50-8e54470308bf,98a8e24b-10e2-4d5e-b749-d73e067c8127,2023-01-24 15:58:56.625000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
8590374a-924a-426f-b90e-8660560dfb20,c7d3a7e5-1a77-4aa3-b575-93d7761b260b,1dfbafda-1510-4427-a9fb-5994460e3e50,2023-01-24 19:58:55.322000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"


Show the data type for the `event_details` column

In [9]:
%%sql result = <<
SELECT event_details,
       TYPEOF(event_details::VARIANT) AS data_type
FROM events.website_activity
LIMIT 4

event_details,data_type
"{ ""event"":""pageview"", ""page"":""home"" }",VARCHAR
"{ ""event"":""pageview"", ""page"":""home"" }",VARCHAR
"{ ""event"":""pageview"", ""page"":""home"" }",VARCHAR
"{ ""event"":""pageview"", ""page"":""home"" }",VARCHAR


**Observations**

1. The `event_details` column is stored as a string. This needs to be converted to a `JSON` object [using `PARSE_JSON`](https://docs.snowflake.com/en/sql-reference/functions/parse_json) before its attributes (`event` and `page`) can be queried.
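
As a minimal Python sketch of the same idea (an analogy, not part of the Snowflake query), the standard-library `json.loads` plays the role that `PARSE_JSON` plays in SQL: the value is a plain string until it is parsed, after which the `event` and `page` attributes can be read by key. The sample string is copied from the query output above.

```python
import json

# Raw value of `event_details`, copied from the query output above
raw_event_details = '{ "event":"pageview", "page":"home" }'

# Before parsing, the value is just a string, so its attributes cannot be read by key
assert isinstance(raw_event_details, str)

# After parsing, the attributes are available by key
event_details = json.loads(raw_event_details)
print(event_details["event"], event_details["page"])
```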

Count the number of rows in the events table

In [10]:
%%sql result = <<
SELECT COUNT(*) AS num_rows
FROM events.website_activity

num_rows
348


### Checking for Duplicates

Count the number of unique values in each column and compare them with the total number of rows. Duplicated events are identified using the following columns
1. `event_id`
2. `event_timestamp`

In [11]:
%%sql result = <<
SELECT COUNT(DISTINCT(session_id)) AS num_sessions,
       COUNT(DISTINCT(user_id)) AS num_users,
       COUNT(DISTINCT(event_id)) AS num_events,
       COUNT(DISTINCT(event_timestamp)) AS num_event_timestamps,
       COUNT(*) AS num_rows,
       COUNT(DISTINCT(event_details)) AS num_event_details
FROM events.website_activity

num_sessions,num_users,num_events,num_event_timestamps,num_rows,num_event_details
178,178,309,309,348,9


**Observations**

1. The number of rows is greater than the number of unique `event_id`s and unique `event_timestamps`. This indicates there are some duplicated `event_id`s and `event_timestamp`s in the events table.
2. The number of unique sessions equals the number of unique users (178), which suggests that one session is recorded per user. In that case the `user_id` and `session_id` columns are redundant. If only unique sessions are of interest, then we can describe a session using only the `session_id` column.
3. Each `event_id` corresponds to its own `event_timestamp`. Similar to above, the `event_id` and `event_timestamp` columns are redundant. If unique events are of interest, then we can describe an event using the `event_id` column.
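
The duplicate check above boils down to comparing a row count with a distinct count. A minimal Python sketch of that comparison follows; the three rows are illustrative (IDs abbreviated, with the first event deliberately logged twice), while the final subtraction uses the counts reported by the query output above.

```python
# Illustrative rows (IDs abbreviated); the first event is logged twice
rows = [
    {"session_id": "a81ce2ae", "event_id": "cbf76489"},
    {"session_id": "a81ce2ae", "event_id": "cbf76489"},
    {"session_id": "70ce9652", "event_id": "e4090e60"},
]

num_rows = len(rows)
num_unique_events = len({row["event_id"] for row in rows})

# More rows than unique event IDs means at least one event is duplicated
print(num_rows > num_unique_events)

# The same comparison with the counts from the query output above
extra_rows = 348 - 309  # rows beyond the first copy of each event
print(extra_rows)
```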

From five example sessions, get all rows that belong to a session containing a duplicated `event_id` or `event_timestamp`

In [12]:
%%sql result = <<
WITH num_events_per_session AS (
    SELECT user_id,
           session_id,
           COUNT(DISTINCT(event_id)) AS num_unique_event_ids,
           COUNT(event_id) AS num_event_ids,
           COUNT(DISTINCT(event_timestamp)) AS num_unique_event_timestamps,
           COUNT(event_timestamp) AS num_event_timestamps,
           COUNT(DISTINCT(event_details)) AS num_unique_event_details,
           COUNT(event_details) AS num_event_details
    FROM events.website_activity
    /* restrict the check to five example sessions */
    WHERE session_id IN (
        '3a0b4329-3ee0-4196-ad7f-27a6fac03a5f',
        'c7d3a7e5-1a77-4aa3-b575-93d7761b260b',
        'a81ce2ae-74f6-4409-8ba6-e0807e714db9',
        'fb3e5bb6-b715-4a9a-b5fe-74d298502acc',
        'dd1c8bed-752d-4ca4-9b64-314d0a8f29c0'
    )
    GROUP BY ALL
),
sessions_with_duplicated_events AS (
    SELECT *
    FROM num_events_per_session
    WHERE num_unique_event_ids != num_event_ids
    OR num_event_timestamps != num_unique_event_timestamps
)
SELECT wa.*
FROM sessions_with_duplicated_events
INNER JOIN events.website_activity AS wa USING (session_id)

session_id,event_id,user_id,event_timestamp,event_details
a81ce2ae-74f6-4409-8ba6-e0807e714db9,cbf76489-37c3-4fc9-a6f1-be0823067aec,86d12402-6608-4d50-af53-dd12b6041531,2023-01-24 03:58:59.424000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
a81ce2ae-74f6-4409-8ba6-e0807e714db9,cbf76489-37c3-4fc9-a6f1-be0823067aec,86d12402-6608-4d50-af53-dd12b6041531,2023-01-24 03:58:59.424000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
c7d3a7e5-1a77-4aa3-b575-93d7761b260b,8590374a-924a-426f-b90e-8660560dfb20,1dfbafda-1510-4427-a9fb-5994460e3e50,2023-01-24 19:58:55.322000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
c7d3a7e5-1a77-4aa3-b575-93d7761b260b,8590374a-924a-426f-b90e-8660560dfb20,1dfbafda-1510-4427-a9fb-5994460e3e50,2023-01-24 19:58:55.322000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
fb3e5bb6-b715-4a9a-b5fe-74d298502acc,0eb76668-026d-44a5-a824-a1e68456d4d3,d985a0f1-0af3-4204-bf16-275f3639e892,2023-01-24 23:58:54.022000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
fb3e5bb6-b715-4a9a-b5fe-74d298502acc,0eb76668-026d-44a5-a824-a1e68456d4d3,d985a0f1-0af3-4204-bf16-275f3639e892,2023-01-24 23:58:54.022000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
fb3e5bb6-b715-4a9a-b5fe-74d298502acc,8f16e244-9d02-4cba-9d59-7959fefa8ccb,d985a0f1-0af3-4204-bf16-275f3639e892,2023-01-24 23:59:04.022000-05:00,"{ ""event"":""search"", ""page"":""search"", ""search_tag"":""comfort-food"" }"
fb3e5bb6-b715-4a9a-b5fe-74d298502acc,7e8c6e59-4852-4989-9745-239650e7e028,d985a0f1-0af3-4204-bf16-275f3639e892,2023-01-25 00:05:00.022000-05:00,"{ ""event"":""search"", ""page"":""search"", ""search_tag"":""comfort-food"" }"
fb3e5bb6-b715-4a9a-b5fe-74d298502acc,d59fede2-4e48-42d0-b4c2-b1878a6a9146,d985a0f1-0af3-4204-bf16-275f3639e892,2023-01-25 00:05:21.022000-05:00,"{ ""event"":""view_recipe"", ""page"":""recipe"", ""recipe_id"":""44dcd777-5b10-41e2-90df-4dca4b696971"" }"
dd1c8bed-752d-4ca4-9b64-314d0a8f29c0,01c1e6f5-690c-4894-b82d-8927f702aa59,2bc9219e-0bd2-4aa7-bc17-5f435fcfaae4,2023-01-30 19:32:45.473000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"


Show all the rows with a duplicated `event_id` or `event_timestamp` that were found above

In [13]:
# show full column contents and all rows
with pd.option_context('display.max_colwidth', None, 'display.max_rows', None):
    display(result.DataFrame())

Unnamed: 0,session_id,event_id,user_id,event_timestamp,event_details
0,a81ce2ae-74f6-4409-8ba6-e0807e714db9,cbf76489-37c3-4fc9-a6f1-be0823067aec,86d12402-6608-4d50-af53-dd12b6041531,2023-01-24 03:58:59.424000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
1,a81ce2ae-74f6-4409-8ba6-e0807e714db9,cbf76489-37c3-4fc9-a6f1-be0823067aec,86d12402-6608-4d50-af53-dd12b6041531,2023-01-24 03:58:59.424000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
2,c7d3a7e5-1a77-4aa3-b575-93d7761b260b,8590374a-924a-426f-b90e-8660560dfb20,1dfbafda-1510-4427-a9fb-5994460e3e50,2023-01-24 19:58:55.322000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
3,c7d3a7e5-1a77-4aa3-b575-93d7761b260b,8590374a-924a-426f-b90e-8660560dfb20,1dfbafda-1510-4427-a9fb-5994460e3e50,2023-01-24 19:58:55.322000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
4,fb3e5bb6-b715-4a9a-b5fe-74d298502acc,0eb76668-026d-44a5-a824-a1e68456d4d3,d985a0f1-0af3-4204-bf16-275f3639e892,2023-01-24 23:58:54.022000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
5,fb3e5bb6-b715-4a9a-b5fe-74d298502acc,0eb76668-026d-44a5-a824-a1e68456d4d3,d985a0f1-0af3-4204-bf16-275f3639e892,2023-01-24 23:58:54.022000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"
6,fb3e5bb6-b715-4a9a-b5fe-74d298502acc,8f16e244-9d02-4cba-9d59-7959fefa8ccb,d985a0f1-0af3-4204-bf16-275f3639e892,2023-01-24 23:59:04.022000-05:00,"{ ""event"":""search"", ""page"":""search"", ""search_tag"":""comfort-food"" }"
7,fb3e5bb6-b715-4a9a-b5fe-74d298502acc,7e8c6e59-4852-4989-9745-239650e7e028,d985a0f1-0af3-4204-bf16-275f3639e892,2023-01-25 00:05:00.022000-05:00,"{ ""event"":""search"", ""page"":""search"", ""search_tag"":""comfort-food"" }"
8,fb3e5bb6-b715-4a9a-b5fe-74d298502acc,d59fede2-4e48-42d0-b4c2-b1878a6a9146,d985a0f1-0af3-4204-bf16-275f3639e892,2023-01-25 00:05:21.022000-05:00,"{ ""event"":""view_recipe"", ""page"":""recipe"", ""recipe_id"":""44dcd777-5b10-41e2-90df-4dca4b696971"" }"
9,dd1c8bed-752d-4ca4-9b64-314d0a8f29c0,01c1e6f5-690c-4894-b82d-8927f702aa59,2bc9219e-0bd2-4aa7-bc17-5f435fcfaae4,2023-01-30 19:32:45.473000-05:00,"{ ""event"":""pageview"", ""page"":""home"" }"


**Observations**

1. In the sessions examined above, the event logging in the `events.website_activity` table produces a single duplicate for events corresponding to a view of the Virtual Kitchen website home page. Rows with `{ "event":"pageview", "page":"home" }` in the `event_details` column appear twice, and the duplicate contains identical values in the `event_id` and `event_timestamp` columns. This cannot hold for every home page view in the table: the counts above (348 rows versus 309 unique `event_id`s) imply only 39 duplicate rows, fewer than the 178 home page views counted in the *Unique Events Per Page* section.
2. The `event_id` and `event_timestamp` columns are never duplicated for user searches. See the `session_id`s `fb3e5bb6-b715-4a9a-b5fe-74d298502acc`, `dd1c8bed-752d-4ca4-9b64-314d0a8f29c0` and `3a0b4329-3ee0-4196-ad7f-27a6fac03a5f` in which the user performed multiple searches during a single session and each search was assigned a unique `event_id` and `event_timestamp`.
3. If a recipe is viewed (`view_recipe`), it is the last event in the session, i.e. the session ends when the user views a recipe.

### Strategy to Remove Duplicates

Duplicated `event_id` or `event_timestamp` (events corresponding to a view of the home page) can be removed by taking the first such event only, as shown below

In [14]:
%%sql result = <<
WITH website_activity_no_duplicates_no_redundancy AS (
    SELECT session_id,
           event_id,
           event_details,
           MIN(event_timestamp) AS event_timestamp
    FROM events.website_activity
    GROUP BY ALL
)
SELECT *
FROM website_activity_no_duplicates_no_redundancy

session_id,event_id,event_details,event_timestamp
a81ce2ae-74f6-4409-8ba6-e0807e714db9,cbf76489-37c3-4fc9-a6f1-be0823067aec,"{ ""event"":""pageview"", ""page"":""home"" }",2023-01-24 03:58:59.424000-05:00
70ce9652-49f6-4c00-a9d3-ca0aee482147,e4090e60-bb72-4057-a576-c6d3c0dfae62,"{ ""event"":""pageview"", ""page"":""home"" }",2023-01-24 11:58:57.985000-05:00
63b4464c-0523-4c66-bc50-8e54470308bf,3d781e12-3834-4f8c-9b5f-224b0f3f7e38,"{ ""event"":""pageview"", ""page"":""home"" }",2023-01-24 15:58:56.625000-05:00
c7d3a7e5-1a77-4aa3-b575-93d7761b260b,8590374a-924a-426f-b90e-8660560dfb20,"{ ""event"":""pageview"", ""page"":""home"" }",2023-01-24 19:58:55.322000-05:00
fb3e5bb6-b715-4a9a-b5fe-74d298502acc,0eb76668-026d-44a5-a824-a1e68456d4d3,"{ ""event"":""pageview"", ""page"":""home"" }",2023-01-24 23:58:54.022000-05:00
44484161-a129-4b7d-947b-184a4c1c5423,afc01df0-3b6d-4f6d-b171-7912f16fcf70,"{ ""event"":""pageview"", ""page"":""home"" }",2023-01-25 11:51:58.751000-05:00
dc11aa86-e1bd-489e-8dba-bc91d7f90786,46e5231c-b9e3-4364-b736-142356bf2972,"{ ""event"":""pageview"", ""page"":""home"" }",2023-01-25 15:51:57.051000-05:00
c28c6f7d-fdec-4d6a-8ef6-4f341a420e9d,7114298b-64a5-4d7f-b654-a76a6d070783,"{ ""event"":""pageview"", ""page"":""home"" }",2023-01-25 19:49:58.224000-05:00
ee793591-d1f3-4c45-8476-f7775283137f,1d65b9fc-4076-43c3-b38a-8b9cb1384cf7,"{ ""event"":""pageview"", ""page"":""home"" }",2023-01-25 23:51:55.831000-05:00
47d255b7-4f00-412c-a4ca-7f651db8287f,74c18f69-6012-451f-916e-767c6568d7da,"{ ""event"":""pageview"", ""page"":""home"" }",2023-01-26 18:56:45.152000-05:00
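
The `GROUP BY ALL` with `MIN(event_timestamp)` strategy can also be sketched in plain Python. This is only an illustration under simplifying assumptions: the rows are hand-written (IDs abbreviated, `event_details` reduced to the event name), and timestamps are kept as fixed-width strings so that string comparison matches chronological order.

```python
# Illustrative rows: (session_id, event_id, event_details, event_timestamp)
rows = [
    ("fb3e5bb6", "0eb76668", "pageview", "2023-01-24 23:58:54.022"),
    ("fb3e5bb6", "0eb76668", "pageview", "2023-01-24 23:58:54.022"),
    ("fb3e5bb6", "8f16e244", "search", "2023-01-24 23:59:04.022"),
]

# Group by (session_id, event_id, event_details) and keep the earliest timestamp,
# mirroring GROUP BY ALL combined with MIN(event_timestamp)
earliest = {}
for session_id, event_id, details, ts in rows:
    key = (session_id, event_id, details)
    if key not in earliest or ts < earliest[key]:
        earliest[key] = ts

deduplicated = [key + (ts,) for key, ts in earliest.items()]
print(len(rows), "->", len(deduplicated))
```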


### Unique Events Per Page

#### Checking for Last Action Performed by Users During a Session

Get the last event performed by users during a session and the number of users who performed them

In [15]:
%%sql result = <<
/* get last event performed by each user */
WITH users_last_events AS (
    SELECT DISTINCT user_id,
           LAST_VALUE(event_details_json:event) OVER(
               PARTITION BY user_id
               ORDER BY event_timestamp
           ) AS last_event
    FROM (
        SELECT user_id,
               event_timestamp,
               PARSE_JSON(event_details) AS event_details_json
        FROM events.website_activity
    )
),
/* get how often the last event was performed */
frequency_of_last_events AS (
    SELECT last_event,
           COUNT(*) AS num_users
    FROM users_last_events
    GROUP BY ALL
)
SELECT *,
       100*num_users/(
           SELECT SUM(num_users)
           FROM frequency_of_last_events
       ) AS frac_users
FROM frequency_of_last_events
GROUP BY ALL

last_event,num_users,frac_users
"""pageview""",123,69.101124
"""view_recipe""",55,30.898876


**Observations**

1. Approximately 31% of users (55 of 178) ended their session by viewing a recipe. The remaining 69% (123 of 178) ended their session with a view of the home page of the Virtual Kitchen website.
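
The "last event per session" logic can be sketched in Python as follows. The rows are illustrative (IDs abbreviated, timestamps as fixed-width strings), and the sketch partitions by session rather than by user, which is equivalent here only under the earlier observation that one session is recorded per user. The last two lines recompute the percentage from the counts in the query output above.

```python
from collections import Counter

# Illustrative rows: (session_id, event_timestamp, event)
events = [
    ("fb3e5bb6", "2023-01-24 23:58:54.022", "pageview"),
    ("fb3e5bb6", "2023-01-24 23:59:04.022", "search"),
    ("fb3e5bb6", "2023-01-25 00:05:21.022", "view_recipe"),
    ("a81ce2ae", "2023-01-24 03:58:59.424", "pageview"),
]

# Walk the events in timestamp order; later events overwrite earlier ones,
# so each session ends up mapped to its last event
last_event = {}
for session_id, ts, event in sorted(events, key=lambda e: e[1]):
    last_event[session_id] = event

# Share of sessions per last event, as a percentage
counts = Counter(last_event.values())
total = sum(counts.values())
shares = {event: 100 * n / total for event, n in counts.items()}
print(shares)

# Percentage of recipe viewers recomputed from the query output above (55 and 123 users)
pct_view_recipe = round(100 * 55 / (123 + 55), 1)
print(pct_view_recipe)
```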

(optional) Get two users who ended their session by viewing a recipe

In [16]:
%%sql result = <<
SELECT user_id
FROM (
    SELECT user_id,
           LAST_VALUE(event_details_json:event) OVER(
               PARTITION BY session_id
               ORDER BY event_timestamp
           ) AS last_event
    FROM (
        SELECT user_id,
               session_id,
               event_timestamp,
               PARSE_JSON(event_details) AS event_details_json
        FROM events.website_activity
    )
)
WHERE last_event = 'view_recipe'
LIMIT 2

user_id
4558e9dd-41e9-424c-b149-7d8f93729be8
d985a0f1-0af3-4204-bf16-275f3639e892


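(optional) Get five users whose session did not end with a view of the home page of the Virtual Kitchen website. The filter below is `last_event != 'pageview'`, so the result lists users whose last event was something other than a home page view (the first two match the recipe viewers found above). To list users who ended their session on the home page instead, the filter would be `last_event = 'pageview'`.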

In [17]:
%%sql result = <<
SELECT user_id
FROM (
    SELECT user_id,
           LAST_VALUE(event_details_json:event) OVER(
               PARTITION BY session_id
               ORDER BY event_timestamp
           ) AS last_event
    FROM (
        SELECT user_id,
               session_id,
               event_timestamp,
               PARSE_JSON(event_details) AS event_details_json
        FROM events.website_activity
    )
)
WHERE last_event != 'pageview'
LIMIT 5

user_id
4558e9dd-41e9-424c-b149-7d8f93729be8
d985a0f1-0af3-4204-bf16-275f3639e892
57a8bfd3-87cb-4415-9e42-39314a7f8926
03a4f6b8-9c42-4242-b15f-fce3674f5b9f
9b5e5510-40d1-4dda-bc31-05800e3347c4


#### Get the number of unique events per type of page viewed

For each type of page that is viewed, count the number of events

In [18]:
%%sql result = <<
WITH website_activity_no_duplicates_no_redundancy AS (
    SELECT session_id,
           event_id,
           event_details,
           MIN(event_timestamp) AS event_timestamp
    FROM events.website_activity
    GROUP BY ALL
),
t1 AS (
    SELECT event_timestamp,
           PARSE_JSON(event_details) AS ev
    FROM website_activity_no_duplicates_no_redundancy
)
SELECT ev:page AS page,
       COUNT(event_timestamp) AS num_events
FROM t1
GROUP BY ALL

page,num_events
"""home""",178
"""search""",76
"""recipe""",55


**Observations**

1. Users visited three types of pages. The home page of the Virtual Kitchen website was the most frequently viewed page, followed by the recipe search page and then the recipe page. This is expected since users must access the home page in order to perform a search. Similarly, users must perform a search to view a recipe.
2. When a (recipe) search is performed, the recipe suggestion algorithm is used to display results (recipes). The search algorithm is being changed by the Virtual Kitchen developers, so (recipe) search results returned in the future will be different from those being currently returned.

#### Get the number of unique events per user action

For each type of user action that is performed on the site, count the number of events

In [20]:
%%sql result = <<
WITH website_activity_no_duplicates_no_redundancy AS (
    SELECT session_id,
           event_id,
           event_details,
           MIN(event_timestamp) AS event_timestamp
    FROM events.website_activity
    GROUP BY ALL
),
t1 AS (
    SELECT event_timestamp,
           PARSE_JSON(event_details) AS ev
    FROM website_activity_no_duplicates_no_redundancy
)
SELECT ev:event AS action,
       COUNT(event_timestamp) AS num_events
FROM t1
GROUP BY ALL

action,num_events
"""pageview""",178
"""search""",76
"""view_recipe""",55


**Observations**

1. Users performed three types of actions. The most frequently performed action was viewing the Virtual Kitchen home page.
2. `search` and `view_recipe` result in a page being viewed. However, these events are not logged as page views. Only views of the home page are logged as page views.
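
For comparison, the per-action count can be sketched in Python with `json.loads` and `collections.Counter`. The `event_details` strings are copied from one session shown earlier in this notebook (with the duplicated home page view already removed), so the counts below describe that sample only, not the whole table.

```python
import json
from collections import Counter

# event_details strings from a single, already de-duplicated session
event_details_rows = [
    '{ "event":"pageview", "page":"home" }',
    '{ "event":"search", "page":"search", "search_tag":"comfort-food" }',
    '{ "event":"search", "page":"search", "search_tag":"comfort-food" }',
    '{ "event":"view_recipe", "page":"recipe", "recipe_id":"44dcd777-5b10-41e2-90df-4dca4b696971" }',
]

# Parse each string and count events by the `event` attribute
events_per_action = Counter(json.loads(row)["event"] for row in event_details_rows)
print(dict(events_per_action))
```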

#### Compare Unique Events and Pages

For each action and the page on which it was performed, count the number of events

In [22]:
%%sql result = <<
WITH website_activity_no_duplicates_no_redundancy AS (
    SELECT session_id,
           event_id,
           event_details,
           MIN(event_timestamp) AS event_timestamp
    FROM events.website_activity
    GROUP BY ALL
),
event_details_json AS (
    SELECT event_timestamp,
           PARSE_JSON(event_details) AS ev
    FROM website_activity_no_duplicates_no_redundancy
)
SELECT ev:event AS action,
       ev:page AS page,
       COUNT(event_timestamp) AS num_events
FROM event_details_json
GROUP BY ALL

action,page,num_events
"""pageview""","""home""",178
"""search""","""search""",76
"""view_recipe""","""recipe""",55


**Observations**

1. A single action is registered per page. So, the following have only one possible combination of `action` and `page`
   - viewing the home page (`pageview` and `home`)
   - performing a search (`search` and `search`)
   - viewing a recipe (`view_recipe` and `recipe`)

   For this reason, the event counts found here for the combination of `action` and `page` match those found above using only `action` or only `page`.
2. In order to avoid selecting redundant columns and optimize query performance when website traffic increases, the `event` attribute of the `event_details` column will be used to extract info about the page being viewed or the action being performed. The `page` attribute will be excluded in all `SELECT` statements.
3. This query output provides an indication of the [customer journey](https://www.snowflake.com/guides/customer-journey/) on the Virtual Kitchen website. Based on 1. and 2. and on the two sub-sections above (*Get the number of unique events per type of page viewed* and *Get the number of unique events per user action*), we can infer the customer journey as View Home Page > Perform a Recipe Search > View a Recipe.
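
The two points above (one page per action, and the inferred journey) can be checked on a single session with a short Python sketch. The `(action, page)` pairs are taken, in timestamp order, from the de-duplicated events of session `fb3e5bb6-b715-4a9a-b5fe-74d298502acc` shown earlier. Collapsing consecutive repeats is an assumption made here so that repeated searches count as one step of the journey.

```python
from itertools import groupby

# (action, page) pairs for one de-duplicated session, in timestamp order
session_events = [
    ("pageview", "home"),
    ("search", "search"),
    ("search", "search"),
    ("view_recipe", "recipe"),
]

# Collect the set of pages seen for each action; one page per action
# means the `page` attribute adds no information beyond `event`
pages_per_action = {}
for action, page in session_events:
    pages_per_action.setdefault(action, set()).add(page)
one_page_per_action = all(len(pages) == 1 for pages in pages_per_action.values())

# Collapse consecutive repeats (e.g. repeated searches) to get the journey
journey = [action for action, _ in groupby(a for a, _ in session_events)]
print(one_page_per_action, " > ".join(journey))
```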

## Disconnect

In [21]:
%sql --close connection