# Reporting on user journeys to a GOV.UK page

Calculate the count and proportion of sessions that have the same journey behaviour.

This script finds sessions that visit a specific page (`DESIRED_PAGE`) in their journey. From the last visit to
`DESIRED_PAGE` in the session, the journey is subsetted to include the last N pages including `DESIRED_PAGE`
(`NUMBER_OF_STAGES`). If the subsetted journey contains the entrance page this is flagged.

The count and proportion of sessions for each distinct subsetted journey are compiled and returned in descending
order of session count, split by whether the subsetted journey includes the entrance page.
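
The subsetting described above can be sketched for a single session in plain Python. This is only an illustration, not the notebook's implementation (the query does this in SQL); the function name `subset_journey` and the example session are hypothetical.

```python
def subset_journey(pages, desired_page, number_of_stages):
    """Return (flag_entrance, user_journey) for one session, or None.

    `pages` is the session's ordered list of page paths.
    """
    if desired_page not in pages:
        return None  # sessions that never visit the desired page are excluded

    # Index of the last visit to the desired page
    last_index = len(pages) - 1 - pages[::-1].index(desired_page)

    # Keep at most `number_of_stages` pages, ending at the last visit
    start_index = max(0, last_index - number_of_stages + 1)
    subset = pages[start_index : last_index + 1]

    # Flag subsetted journeys that reach back to the session's first page
    flag_entrance = start_index == 0

    # Present in reverse order (desired page first), delimited by " <<< "
    return flag_entrance, " <<< ".join(reversed(subset))


# Hypothetical session, for illustration only
session = [
    "/",
    "/browse/abroad",
    "/travel-abroad",
    "/foreign-travel-advice",
    "/travel-abroad",
    "/help",
]
print(subset_journey(session, "/travel-abroad", 3))
```

With three stages the subsetted journey stops short of the entrance page, so it is not flagged; with six stages the whole journey up to the last visit is kept and the entrance flag is set.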

## Arguments

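- `START_DATE`: String in YYYY-MM-DD format defining the start date of your query (inclusive). The notebook converts this to YYYYMMDD format before querying.
- `END_DATE`: String in YYYY-MM-DD format defining the end date of your query (inclusive). The notebook converts this to YYYYMMDD format before querying.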
- `DESIRED_PAGE`: String of the desired GOV.UK page path of interest.
- `NUMBER_OF_STAGES`: Integer defining how many pages in the past (including `DESIRED_PAGE`) should be considered when subsetting the user journeys. Note that journeys with fewer pages than `NUMBER_OF_STAGES` will always be included.
- `PAGE_TYPE`: Boolean flag indicating that `PAGE` page paths are required. At least one of `PAGE_TYPE` or `EVENT_TYPE` must be selected.
- `EVENT_TYPE`: Boolean flag indicating that `EVENT` page paths are required. At least one of `PAGE_TYPE` or `EVENT_TYPE` must be selected.
- `DEVICE_DESKTOP`: Boolean flag indicating that desktop devices should be included in this query. At least one of `DEVICE_DESKTOP`, `DEVICE_MOBILE`, or `DEVICE_TABLET` must be selected.
- `DEVICE_MOBILE`: Boolean flag indicating that mobile devices should be included in this query. At least one of `DEVICE_DESKTOP`, `DEVICE_MOBILE`, or `DEVICE_TABLET` must be selected.
- `DEVICE_TABLET`: Boolean flag indicating that tablet devices should be included in this query. At least one of `DEVICE_DESKTOP`, `DEVICE_MOBILE`, or `DEVICE_TABLET` must be selected.

### Optional arguments

- `FLAG_EVENTS`: Boolean flag. If `TRUE`, all `EVENT` page paths will have a ` [E]` suffix. This is useful if both `PAGE_TYPE` and `EVENT_TYPE` are selected, so you can differentiate between the same page path with different types. If `FALSE`, no suffix is appended to `EVENT` page paths.
- `EVENT_CATEGORY`: Boolean flag. If `TRUE`, the event category is appended to `EVENT` page paths in a bracketed suffix.
- `EVENT_ACTION`: Boolean flag. If `TRUE`, the event action is appended to `EVENT` page paths in a bracketed suffix.
- `EVENT_LABEL`: Boolean flag. If `TRUE`, the event label is appended to `EVENT` page paths in a bracketed suffix.
- `REMOVE_DESIRED_PAGE_REFRESHES`: Boolean flag. If `TRUE`, sequential page paths of the same type are removed when the query calculates the last visit to the desired page. In other words, it will only use the first visit in a series of sequential visits to the desired page if they have the same type. Other earlier visits to the desired page will remain, as will any earlier desired page refreshes.
- `TRUNCATE_SEARCHES`: Boolean flag. If `TRUE`, all GOV.UK search page paths are truncated to `Sitesearch ({TYPE}): {KEYWORDS}`, where `{TYPE}` is the GOV.UK search content type, and `{KEYWORDS}` are the search keywords. If there are no keywords, this is set to `none`. If `FALSE`, GOV.UK search page paths are not truncated.
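
As a rough illustration of how `REMOVE_DESIRED_PAGE_REFRESHES` changes which hit counts as the last visit, here is a minimal Python sketch. It assumes a session is an ordered list of `(type, page path)` tuples; the function name `last_visit_index` and the example hits are hypothetical. A refresh is treated as a hit that repeats the previous hit's page path and type, which is a simplification of the query's condition, not a copy of it.

```python
def last_visit_index(hits, desired_page, remove_refreshes):
    """Return the 0-based index of the last visit to `desired_page`, or None.

    `hits` is a session's ordered list of (hit_type, page_path) tuples.
    """
    candidates = []
    for i, (hit_type, page) in enumerate(hits):
        if page != desired_page:
            continue
        if remove_refreshes and i > 0:
            previous_type, previous_page = hits[i - 1]
            # Skip refreshes: same page path and same type as the previous hit
            if previous_page == page and previous_type == hit_type:
                continue
        candidates.append(i)
    return candidates[-1] if candidates else None


# Hypothetical session ending in two refreshes of the desired page
hits = [
    ("PAGE", "/"),
    ("PAGE", "/travel-abroad"),
    ("PAGE", "/travel-abroad"),
    ("PAGE", "/travel-abroad"),
]
print(last_visit_index(hits, "/travel-abroad", remove_refreshes=False))
print(last_visit_index(hits, "/travel-abroad", remove_refreshes=True))
```

With refreshes removed, the last visit moves back to the first hit in the run of repeated visits, so the subsetted journey ends there instead of at the final refresh.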

## Returns

A Google BigQuery result containing each distinct subsetted user journey, made up of `PAGE_TYPE` and/or `EVENT_TYPE` page paths in reverse order from the last visit to `DESIRED_PAGE`, with a maximum length of `NUMBER_OF_STAGES`. The count and proportion of sessions that have each subsetted journey are also shown. Subsetted journeys that include the first page visited in a session are flagged. The results are sorted in descending order of session count, with the most popular subsetted user journey first.
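
To make the shape of the result concrete, the counting step can be sketched in Python. This assumes one `(flagEntrance, userJourney)` pair per session; the function name `summarise_journeys` and the example data are hypothetical, and the dictionary keys simply mirror the column names used by the query.

```python
from collections import Counter


def summarise_journeys(session_journeys):
    """Count sessions per distinct (flag_entrance, user_journey) pair.

    `session_journeys` holds one (flag_entrance, user_journey) tuple per session.
    """
    total_sessions = len(session_journeys)
    counts = Counter(session_journeys)
    # `most_common()` yields the pairs in descending order of count
    return [
        {
            "flagEntrance": flag_entrance,
            "userJourney": user_journey,
            "totalSessions": total_sessions,
            "countSessions": count,
            "proportionSessions": count / total_sessions,
        }
        for (flag_entrance, user_journey), count in counts.most_common()
    ]


# Hypothetical subsetted journeys for four sessions
sessions = [
    (True, "/travel-abroad <<< /"),
    (True, "/travel-abroad <<< /"),
    (False, "/travel-abroad <<< /"),
    (True, "/travel-abroad"),
]
rows = summarise_journeys(sessions)
for row in rows:
    print(row)
```

The same journey string appears in two rows here because sessions are split by whether the subsetted journey includes the entrance page.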

## Assumptions

- Only exact matches to `DESIRED_PAGE` are currently supported.
- Previous visits to `DESIRED_PAGE` are ignored; only the last visit is used.
- If `REMOVE_DESIRED_PAGE_REFRESHES` is `TRUE`, only the first visit in a series of sequential visits (page refreshes) to `DESIRED_PAGE` is used to determine which is the last visit.
- If `REMOVE_DESIRED_PAGE_REFRESHES` is `TRUE`, and there is more than one page type (`PAGE_TYPE` and `EVENT_TYPE` are both selected), only the first visit in page refreshes to the same `DESIRED_PAGE` and page type is used to determine which is the last visit.
- Journeys shorter than the number of desired stages (`NUMBER_OF_STAGES`) are always included.
- GOV.UK search page paths are assumed to have the format `/search/{TYPE}?keywords={KEYWORDS}{...}`, where `{TYPE}` is the GOV.UK search content type, `{KEYWORDS}` are the search keywords separated by `+`, and `{...}` are any other parts of the search query that are not keyword-related (if they exist).
- GOV.UK search page titles are assumed to have the format `{KEYWORDS} - {TYPE} - GOV.UK`, where `{TYPE}` is the GOV.UK search content type, and `{KEYWORDS}` are the search keywords.
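
The assumed search page path format can be illustrated with a small Python sketch that produces the `Sitesearch ({TYPE}): {KEYWORDS}` form described under `TRUNCATE_SEARCHES`. It covers page paths only, not page titles; the function name `truncate_search` and the example paths are hypothetical, and the regular expressions are an approximation written for this sketch.

```python
import re


def truncate_search(page_path):
    """Return 'Sitesearch ({TYPE}): {KEYWORDS}' for search paths.

    Any other page path is returned unchanged.
    """
    type_match = re.match(r"^/search/([^ ?#/]+)", page_path)
    if type_match is None:
        return page_path  # not a GOV.UK search page path

    # Keywords are expected immediately after `?`, with `+` between keywords
    keywords_match = re.match(r"^/search/[^ ?#/]+\?keywords=([^&]+)", page_path)
    if keywords_match:
        keywords = keywords_match.group(1).replace("+", " ")
    else:
        keywords = "none"
    return f"Sitesearch ({type_match.group(1)}): {keywords}"


# Hypothetical page paths, for illustration only
print(truncate_search("/search/all?keywords=travel+abroad&order=relevance"))
print(truncate_search("/search/all"))
print(truncate_search("/travel-abroad"))
```

Paths that do not match the assumed format, such as a search URL where `keywords` is not the first query parameter, fall back to `none` in this sketch.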

In [None]:
from datetime import datetime

import pandas as pd
from google.cloud import bigquery
from google.colab import auth, files
from IPython.core.interactiveshell import InteractiveShell

# Allow multiline outputs
InteractiveShell.ast_node_interactivity = "all"

# Authenticate the user - follow the link and the prompts to get an authentication token
auth.authenticate_user()

In [None]:
# @markdown ## Set query parameters
# @markdown Define the start and end dates
START_DATE = "2021-08-03"  # @param {type:"date"}
END_DATE = "2021-08-16"  # @param {type:"date"}

# @markdown Set the desired page path - must start with '/'
DESIRED_PAGE = "/travel-abroad"  # @param {type:"string"}

# @markdown Set the number of pages, including `DESIRED_PAGE` to include in the subsetted journeys
NUMBER_OF_STAGES = 6  # @param {type:"integer"}

# @markdown Set the page types; at least one must be checked
PAGE_TYPE = True  # @param {type:"boolean"}
EVENT_TYPE = True  # @param {type:"boolean"}

# @markdown Set the device categories; at least one must be checked
DEVICE_DESKTOP = True  # @param {type:"boolean"}
DEVICE_MOBILE = True  # @param {type:"boolean"}
DEVICE_TABLET = False  # @param {type:"boolean"}

# @markdown ### Other options
# @markdown Add a ` [E]` suffix to EVENT page paths - easier to differentiate between PAGE and
# @markdown EVENT types for the same page path
FLAG_EVENTS = False  # @param {type:"boolean"}

# @markdown Add event information suffix to EVENT page paths
EVENT_CATEGORY = True  # @param {type:"boolean"}
EVENT_ACTION = True  # @param {type:"boolean"}
EVENT_LABEL = False  # @param {type:"boolean"}

# @markdown Remove page refreshes when determining the last visit to `DESIRED_PAGE`?
REMOVE_DESIRED_PAGE_REFRESHES = False  # @param {type:"boolean"}

# @markdown Truncate search pages to only show the search content type, and search keywords
TRUNCATE_SEARCHES = False  # @param {type:"boolean"}

In [None]:
# Convert the inputted start and end date into `YYYYMMDD` formats
QUERY_START_DATE = datetime.strptime(START_DATE, "%Y-%m-%d").strftime("%Y%m%d")
QUERY_END_DATE = datetime.strptime(END_DATE, "%Y-%m-%d").strftime("%Y%m%d")

# Check that `DESIRED_PAGE` starts with '/'
assert DESIRED_PAGE.startswith(
    "/"
), f"`DESIRED_PAGE` must start with '/': {DESIRED_PAGE}"

# Compile the query page types
if PAGE_TYPE and EVENT_TYPE:
    QUERY_PAGE_TYPES = ["PAGE", "EVENT"]
elif PAGE_TYPE:
    QUERY_PAGE_TYPES = ["PAGE"]
elif EVENT_TYPE:
    QUERY_PAGE_TYPES = ["EVENT"]
else:
    raise AssertionError("At least one of `PAGE_TYPE` or `EVENT_TYPE` must be checked!")

# Compile the device categories
QUERY_DEVICE_CATEGORIES = [
    "desktop" if DEVICE_DESKTOP else "",
    "mobile" if DEVICE_MOBILE else "",
    "tablet" if DEVICE_TABLET else "",
]
QUERY_DEVICE_CATEGORIES = [d for d in QUERY_DEVICE_CATEGORIES if d]
assert QUERY_DEVICE_CATEGORIES, (
    "At least one of `DEVICE_DESKTOP`, `DEVICE_MOBILE`, or "
    "`DEVICE_TABLET` must be checked!"
)

# Set the notebook execution date
NOTEBOOK_EXECUTION_DATE = datetime.now().strftime("%Y%m%d")

# Define the output file names
OUTPUT_FILE = (
    f"{NOTEBOOK_EXECUTION_DATE}_user_journeys_{QUERY_START_DATE}_{QUERY_END_DATE}_"
    + f"{'_'.join(QUERY_DEVICE_CATEGORIES)}.csv"
)

In [None]:
query = """
WITH
    get_session_data AS (
        -- Get all the session data between `start_date` and `end_date`, subsetting for specific `page_type`s. As
        -- some pages might be dropped by the subsetting, recalculate `hitNumber` as `journeyNumber` so the values
        -- are sequential.
        SELECT
            CONCAT(fullVisitorId, "-", visitId) AS sessionId,
            ROW_NUMBER() OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber) AS journeyNumber,
            hits.type,
            CONCAT(
                hits.page.pagePath,  -- modify this line to `hits.page.pageTitle` if required
                IF(hits.type = "EVENT" AND @flagEvents, IF ((@eventCategory OR @eventAction OR @eventLabel), " [E", " [E]"), ""),
                IF(hits.type = "EVENT" AND @eventCategory, CONCAT(IF ((@flagEvents), ", ", " ["), hits.eventInfo.eventCategory, IF ((@eventAction OR @eventLabel), "", "]")), ""),
                IF(hits.type = "EVENT" AND @eventAction, CONCAT(IF ((@flagEvents OR @eventCategory), ", ", " ["), hits.eventInfo.eventAction, IF ((@eventLabel), "", "]")), ""),
                IF(hits.type = "EVENT" AND @eventLabel, CONCAT(IF ((@flagEvents OR @eventCategory OR @eventAction), ", ", " ["), hits.eventInfo.eventLabel, "]"), "") 
            ) AS pageId
        FROM `govuk-bigquery-analytics.87773428.ga_sessions_*`
        CROSS JOIN UNNEST(hits) AS hits
        WHERE _TABLE_SUFFIX BETWEEN @startDate AND @endDate
        AND hits.type IN UNNEST(@pageType)
        AND device.deviceCategory in UNNEST(@deviceCategories)
    ),
    get_search_content_type_and_keywords AS (
        -- Extract the content type and keywords (if any) for GOV.UK search pages.
        SELECT
            *,
            IFNULL(
              REGEXP_EXTRACT(pageId, r"^/search/([^ ?#/]+)"),
              REGEXP_EXTRACT(pageId, r"^.+ - ([^-]+) - GOV.UK$")
            ) AS searchContentType,
            IFNULL(
              REPLACE(REGEXP_EXTRACT(pageId, r"^/search/[^ ?#/]+\?keywords=([^&]+)"), "+", " "),
              REGEXP_EXTRACT(pageId, r"^(.+)- [^-]+ - GOV.UK$")
            ) AS searchKeywords
        FROM get_session_data
    ),
    compile_search_entry AS (
        -- Truncate the search page into an entry of the search content type and keywords (if any).
        SELECT
            * EXCEPT (searchContentType, searchKeywords),
            CONCAT(
                "Sitesearch (",
                searchContentType,
                "): ",
                COALESCE(searchKeywords, "none")
            ) AS search_entry
        FROM get_search_content_type_and_keywords
    ),
    revise_search_pageids AS (
        -- Replace `pageId` for search pages with the compiled entries if selected by the user.
        SELECT
            * REPLACE (
                IF(@truncatedSearches, COALESCE(search_entry, pageId), pageId) AS pageId
            )
        FROM compile_search_entry

    ),
    identify_page_refreshes AS (
        -- Lag the page `type` and `pageId` columns. This helps identify page refreshes that can be removed in the
        -- next CTE
        SELECT
            *,
            LAG(type) OVER (PARTITION BY sessionId ORDER BY journeyNumber) AS lagType,
            LAG(pageId) OVER (PARTITION BY sessionId ORDER BY journeyNumber) AS lagPageId
        FROM revise_search_pageids
    ),
    identify_last_hit_to_desired_page AS (
        -- Get the last hit to the desired page. Ignores previous visits to the desired page. Page refreshes of the
        -- desired page are also ignored if the correct option is declared.
        SELECT
            sessionId,
            MAX(journeyNumber) AS maxDesiredPageJourneyNumber
        FROM identify_page_refreshes
        WHERE pageId = @desiredPage
        AND IF(
            @desiredPageRemoveRefreshes,
            (
                lagPageId IS NULL
                OR pageId != lagPageId
                OR IF(ARRAY_LENGTH(@pageType) > 1, pageId = lagPageId AND type != lagType, FALSE)
            ),
            TRUE
        )
        GROUP BY sessionId
    ),
    subset_journey_to_last_hit_of_desired_page AS (
        -- Subset all user journeys to the last hit of the desired page.
        SELECT revise_search_pageids.*
        FROM revise_search_pageids
        INNER JOIN identify_last_hit_to_desired_page
        ON revise_search_pageids.sessionId = identify_last_hit_to_desired_page.sessionId
        AND revise_search_pageids.journeyNumber <= identify_last_hit_to_desired_page.maxDesiredPageJourneyNumber
    ),
    calculate_stages AS (
        -- Calculate the number of stages from the last hit to the desired page, where the last hit to the desired
        -- page is '1'.
        SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY sessionId ORDER BY journeyNumber DESC) AS reverseDesiredPageJourneyNumber
        FROM subset_journey_to_last_hit_of_desired_page
    ),
    subset_journey_to_number_of_stages AS (
        -- Compile the subsetted user journeys together for each session in reverse order (last hit to the desired
        -- page first), delimited by " <<< ".
        SELECT DISTINCT
            sessionId,
            MIN(journeyNumber) OVER (PARTITION BY sessionId) = 1 AS flagEntrance,
            STRING_AGG(pageId, " <<< ") OVER (
                PARTITION BY sessionId
                ORDER BY reverseDesiredPageJourneyNumber ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
            ) AS userJourney
        FROM calculate_stages
        WHERE reverseDesiredPageJourneyNumber <= @numberOfStages
    ),
    count_distinct_journeys AS (
        -- Count the number of sessions for each distinct subsetted user journey, split by whether the sessions
        -- entered on the first page of the subsetted journey or not
        SELECT
            flagEntrance,
            userJourney,
            (SELECT COUNT(sessionId) FROM subset_journey_to_number_of_stages) AS totalSessions,
            COUNT(sessionId) AS countSessions
        FROM subset_journey_to_number_of_stages
        GROUP BY
            flagEntrance,
            userJourney
    )
SELECT
    *,
    countSessions / totalSessions AS proportionSessions
FROM count_distinct_journeys
ORDER BY countSessions DESC;
"""

In [None]:
# Initialise a Google BigQuery client, and define the query parameters
client = bigquery.Client(project="govuk-bigquery-analytics", location="EU")
query_parameters = [
    bigquery.ScalarQueryParameter("startDate", "STRING", QUERY_START_DATE),
    bigquery.ScalarQueryParameter("endDate", "STRING", QUERY_END_DATE),
    bigquery.ArrayQueryParameter("pageType", "STRING", QUERY_PAGE_TYPES),
    bigquery.ArrayQueryParameter("deviceCategories", "STRING", QUERY_DEVICE_CATEGORIES),
    bigquery.ScalarQueryParameter("flagEvents", "BOOL", FLAG_EVENTS),
    bigquery.ScalarQueryParameter("eventCategory", "BOOL", EVENT_CATEGORY),
    bigquery.ScalarQueryParameter("eventAction", "BOOL", EVENT_ACTION),
    bigquery.ScalarQueryParameter("eventLabel", "BOOL", EVENT_LABEL),
    bigquery.ScalarQueryParameter("truncatedSearches", "BOOL", TRUNCATE_SEARCHES),
    bigquery.ScalarQueryParameter("desiredPage", "STRING", DESIRED_PAGE),
    bigquery.ScalarQueryParameter(
        "desiredPageRemoveRefreshes", "BOOL", REMOVE_DESIRED_PAGE_REFRESHES
    ),
    bigquery.ScalarQueryParameter("numberOfStages", "INT64", NUMBER_OF_STAGES),
]

# Dry run the query, asking for user input to confirm the query execution size is okay
bytes_processed = client.query(
    query,
    job_config=bigquery.QueryJobConfig(query_parameters=query_parameters, dry_run=True),
).total_bytes_processed

# Compile a message, and flag to the user for a response; if not "yes", terminate execution
user_message = (
    f"This query will process {bytes_processed / (1024 ** 3):.1f} GB when run, "
    + f"which is approximately ${bytes_processed / (1024 ** 4)*5:.3f}. Continue ([yes])? "
)
if input(user_message).lower() != "yes":
    raise RuntimeError("Stopped execution!")

# Execute the query, and return as a pandas DataFrame
df_raw = client.query(
    query, job_config=bigquery.QueryJobConfig(query_parameters=query_parameters)
).to_dataframe()
df_raw.head()

In [None]:
df_stages = df_raw.set_index(["flagEntrance", "userJourney"], drop=False)[
    "userJourney"
].str.split(" <<< ", expand=True)
df_stages.columns = [
    "goalCompletionLocation",
    *[f"goalPreviousStep{c}" for c in df_stages.columns[1:]],
]

df = df_raw.merge(
    df_stages,
    how="left",
    left_on=["flagEntrance", "userJourney"],
    right_index=True,
    validate="1:1",
)
df.head()

In [None]:
# Output the results to a CSV file, and download it
df.to_csv(OUTPUT_FILE)
files.download(OUTPUT_FILE)

# Presenting results in Google Sheets
Here's an [example of how you could present the results](https://docs.google.com/spreadsheets/d/1vSFXnPE8XozpRhI1G3x4tl5oro3pUIgZnoFmJ_AjPbY/edit#gid=1115034830) to facilitate sharing with colleagues.

## Original SQL query

```sql
/*
Calculate the count and proportion of sessions that have the same journey behaviour.

This script finds sessions that visit a specific page (`desiredPage`) in their journey. From the last visit to
`desiredPage` in the session, the journey is subsetted to include the last N pages including `desiredPage`
(`numberOfStages`). If the subsetted journey contains the entrance page this is flagged.

The count and proportion of sessions for each distinct subsetted journey are compiled and returned in descending
order of session count, split by whether the subsetted journey includes the entrance page.

Arguments:

    startDate: String in YYYYMMDD format defining the start date of your query.
    endDate: String in YYYYMMDD format defining the end date of your query.
    pageType: String array containing comma-separated strings of page types. Must contain one or more of "PAGE" and
        "EVENT".
    deviceCategories: String array containing comma-separated strings of device categories. Must contain one or more
        of "mobile", "desktop", and "tablet".
    flagEvents: Boolean flag. If TRUE, all "EVENT" page paths will have a " [E]" suffix. This is useful if `pageType`
        contains both "PAGE" and "EVENT" so you can differentiate between the same page path with different types. If
        FALSE, no suffix is appended to "EVENT" page paths.
    eventCategory: Boolean flag. If TRUE, all "EVENT" page paths will be followed by the " [eventCategory]". If FALSE, 
        no " [eventCategory]" suffix is appended to "EVENT" page paths. 
    eventAction: Boolean flag. If TRUE, all "EVENT" page paths will be followed by the " [eventAction]". If FALSE, no 
        " [eventAction]" suffix is appended to "EVENT" page paths. 
    eventLabel: Boolean flag. If TRUE, all "EVENT" page paths will be followed by the " [eventLabel]". If FALSE, no 
       " [eventLabel]" suffix is appended to "EVENT" page paths.     
    truncatedSearches: Boolean flag. If TRUE, all GOV.UK search page paths are truncated to
        "Sitesearch ({TYPE}): {KEYWORDS}", where `{TYPE}` is the GOV.UK search content type, and `{KEYWORDS}` are the
        search keywords. If there are no keywords, this is set to `none`. If FALSE, GOV.UK search page paths are
        not truncated.
    desiredPage: String of the desired GOV.UK page path of interest.
    desiredPageRemoveRefreshes: Boolean flag. If TRUE, sequential page paths of the same type are removed when the
        query calculates the last visit to the desired page. In other words, it will only use the first visit in a
        series of sequential visits to the desired page if they have the same type. Other earlier visits to the
        desired page will remain, as will any earlier desired page refreshes.
    numberOfStages: Integer defining how many pages in the past (including `desiredPage`) should be considered when
        subsetting the user journeys. Note that journeys with fewer pages than `numberOfStages` will always be
        included.

Returns:

    A Google BigQuery result containing the subsetted user journey containing `pageType` page paths in reverse from
    the last visit to `desiredPage` with a maximum length `numberOfStages`. Counts and the proportion of sessions
    that have this subsetted journey are also shown. Subsetted journeys that incorporate the first page visited by
    a session are flagged as well. The results are presented in descending order, with the most popular subsetted
    user journey first.

Assumptions:

    - Only exact matches to `desiredPage` are currently supported.
    - Previous visits to `desiredPage` are ignored; only the last visit is used.
    - If `desiredPageRemoveRefreshes` is TRUE, only the first visit in a series of sequential visits (page refreshes)
      to `desiredPage` is used to determine which is the last visit.
    - If `desiredPageRemoveRefreshes` is TRUE, and there is more than one page type (`pageType`), only the first visit
      in page refreshes to the same `desiredPage` and page type is used to determine which is the last visit.
    - Journeys shorter than the number of desired stages (`numberOfStages`) are always included.
    - GOV.UK search page paths are assumed to have the format `/search/{TYPE}?keywords={KEYWORDS}{...}`, where
      `{TYPE}` is the GOV.UK search content type, `{KEYWORDS}` are the search keywords separated by `+`, and `{...}`
      are any other parts of the search query that are not keyword-related (if they exist).
    - GOV.UK search page titles are assumed to have the format `{KEYWORDS} - {TYPE} - GOV.UK`, where `{TYPE}` is the
      GOV.UK search content type, and `{KEYWORDS}` are the search keywords.

*/

-- Declare query variables
DECLARE startDate DEFAULT "20210628";
DECLARE endDate DEFAULT "20210628";
DECLARE pageType DEFAULT ["PAGE"];
DECLARE deviceCategories DEFAULT ["mobile", "desktop", "tablet"];
DECLARE flagEvents DEFAULT TRUE;
DECLARE eventCategory DEFAULT TRUE;
DECLARE eventAction DEFAULT TRUE;
DECLARE eventLabel DEFAULT TRUE;
DECLARE truncatedSearches DEFAULT TRUE;
DECLARE desiredPage DEFAULT "/trade-tariff";
DECLARE desiredPageRemoveRefreshes DEFAULT TRUE;
DECLARE numberOfStages DEFAULT 3;

WITH
    get_session_data AS (
        -- Get all the session data between `start_date` and `end_date`, subsetting for specific `page_type`s. As
        -- some pages might be dropped by the subsetting, recalculate `hitNumber` as `journeyNumber` so the values
        -- are sequential.
        SELECT
            CONCAT(fullVisitorId, "-", visitId) AS sessionId,
            ROW_NUMBER() OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber) AS journeyNumber,
            hits.type,
            CONCAT(
                hits.page.pagePath,  -- modify this line to `hits.page.pageTitle` if required
                IF(hits.type = "EVENT" AND flagEvents, IF ((eventCategory OR eventAction OR eventLabel), " [E", " [E]"), ""),
                IF(hits.type = "EVENT" AND eventCategory, CONCAT(IF ((flagEvents), ", ", " ["), hits.eventInfo.eventCategory, IF ((eventAction OR eventLabel), "", "]")), ""),
                IF(hits.type = "EVENT" AND eventAction, CONCAT(IF ((flagEvents OR eventCategory), ", ", " ["), hits.eventInfo.eventAction, IF ((eventLabel), "", "]")), ""),
                IF(hits.type = "EVENT" AND eventLabel, CONCAT(IF ((flagEvents OR eventCategory OR eventAction), ", ", " ["), hits.eventInfo.eventLabel, "]"), "") 
            ) AS pageId
        FROM `govuk-bigquery-analytics.87773428.ga_sessions_*`
        CROSS JOIN UNNEST(hits) AS hits
        WHERE _TABLE_SUFFIX BETWEEN startDate AND endDate
        AND hits.type IN UNNEST(pageType)
        AND device.deviceCategory in UNNEST(deviceCategories)
    ),
    get_search_content_type_and_keywords AS (
        -- Extract the content type and keywords (if any) for GOV.UK search pages.
        SELECT
            *,
            IFNULL(
              REGEXP_EXTRACT(pageId, r"^/search/([^ ?#/]+)"),
              REGEXP_EXTRACT(pageId, r"^.+ - ([^-]+) - GOV.UK$")
            ) AS searchContentType,
            IFNULL(
              REPLACE(REGEXP_EXTRACT(pageId, r"^/search/[^ ?#/]+\?keywords=([^&]+)"), "+", " "),
              REGEXP_EXTRACT(pageId, r"^(.+)- [^-]+ - GOV.UK$")
            ) AS searchKeywords
        FROM get_session_data
    ),
    compile_search_entry AS (
        -- Truncate the search page into an entry of the search content type and keywords (if any).
        SELECT
            * EXCEPT (searchContentType, searchKeywords),
            CONCAT(
                "Sitesearch (",
                searchContentType,
                "): ",
                COALESCE(searchKeywords, "none")
            ) AS search_entry
        FROM get_search_content_type_and_keywords
    ),
    revise_search_pageids AS (
        -- Replace `pageId` for search pages with the compiled entries if selected by the user.
        SELECT
            * REPLACE (
                IF(truncatedSearches, COALESCE(search_entry, pageId), pageId) AS pageId
            )
        FROM compile_search_entry

    ),
    identify_page_refreshes AS (
        -- Lag the page `type` and `pageId` columns. This helps identify page refreshes that can be removed in the
        -- next CTE
        SELECT
            *,
            LAG(type) OVER (PARTITION BY sessionId ORDER BY journeyNumber) AS lagType,
            LAG(pageId) OVER (PARTITION BY sessionId ORDER BY journeyNumber) AS lagPageId
        FROM revise_search_pageids
    ),
    identify_last_hit_to_desired_page AS (
        -- Get the last hit to the desired page. Ignores previous visits to the desired page. Page refreshes of the
        -- desired page are also ignored if the correct option is declared.
        SELECT
            sessionId,
            MAX(journeyNumber) AS maxDesiredPageJourneyNumber
        FROM identify_page_refreshes
        WHERE pageId = desiredPage
        AND IF(
            desiredPageRemoveRefreshes,
            (
                lagPageId IS NULL
                OR pageId != lagPageId
                OR IF(ARRAY_LENGTH(pageType) > 1, pageId = lagPageId AND type != lagType, FALSE)
            ),
            TRUE
        )
        GROUP BY sessionId
    ),
    subset_journey_to_last_hit_of_desired_page AS (
        -- Subset all user journeys to the last hit of the desired page.
        SELECT revise_search_pageids.*
        FROM revise_search_pageids
        INNER JOIN identify_last_hit_to_desired_page
        ON revise_search_pageids.sessionId = identify_last_hit_to_desired_page.sessionId
        AND revise_search_pageids.journeyNumber <= identify_last_hit_to_desired_page.maxDesiredPageJourneyNumber
    ),
    calculate_stages AS (
        -- Calculate the number of stages from the last hit to the desired page, where the last hit to the desired
        -- page is '1'.
        SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY sessionId ORDER BY journeyNumber DESC) AS reverseDesiredPageJourneyNumber
        FROM subset_journey_to_last_hit_of_desired_page
    ),
    subset_journey_to_number_of_stages AS (
        -- Compile the subsetted user journeys together for each session in reverse order (last hit to the desired
        -- page first), delimited by " <<< ".
        SELECT DISTINCT
            sessionId,
            MIN(journeyNumber) OVER (PARTITION BY sessionId) = 1 AS flagEntrance,
            STRING_AGG(pageId, " <<< ") OVER (
                PARTITION BY sessionId
                ORDER BY reverseDesiredPageJourneyNumber ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
            ) AS userJourney
        FROM calculate_stages
        WHERE reverseDesiredPageJourneyNumber <= numberOfStages
    ),
    count_distinct_journeys AS (
        -- Count the number of sessions for each distinct subsetted user journey, split by whether the sessions
        -- entered on the first page of the subsetted journey or not
        SELECT
            flagEntrance,
            userJourney,
            (SELECT COUNT(sessionId) FROM subset_journey_to_number_of_stages) AS totalSessions,
            COUNT(sessionId) AS countSessions
        FROM subset_journey_to_number_of_stages
        GROUP BY
            flagEntrance,
            userJourney
    )
SELECT
    *,
    countSessions / totalSessions AS proportionSessions
FROM count_distinct_journeys
ORDER BY countSessions DESC;
```