# Reporting on user journeys from a GOV.UK page

Calculate the count and proportion of sessions that have the same journey behaviour.

This script finds sessions that visit a specific page (`DESIRED_PAGE`) in their journey. Starting from the first or last visit to
`DESIRED_PAGE` in the session, the journey is subsetted to the next `NUMBER_OF_STAGES` pages, counting `DESIRED_PAGE`
itself as the first.

The count and proportion of sessions visiting distinct, subsetted journeys are compiled together, and returned as a
sorted list in descending order split by subsetted journeys including the entrance and/or exit page.
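
The subsetting and counting described above can be sketched in plain Python. This is only an illustration of the idea, not the BigQuery implementation used in this notebook; the function name `subset_journey` and the example sessions are hypothetical.

```python
from collections import Counter


def subset_journey(pages, desired_page, number_of_stages, first_hit=True):
    """Return up to `number_of_stages` pages starting at the first (or last)
    visit to `desired_page`, or None if the session never visits it."""
    hits = [i for i, page in enumerate(pages) if page == desired_page]
    if not hits:
        return None
    start = hits[0] if first_hit else hits[-1]
    return tuple(pages[start : start + number_of_stages])


# Hypothetical sessions, each an ordered list of page paths
sessions = [
    ["www.gov.uk/", "www.gov.uk/vehicle-tax", "www.gov.uk/"],
    ["www.gov.uk/browse/tax", "www.gov.uk/", "www.gov.uk/vehicle-tax"],
    ["www.gov.uk/", "www.gov.uk/coronavirus"],
    ["www.gov.uk/coronavirus"],  # never visits the desired page, so is dropped
]

journeys = [subset_journey(s, "www.gov.uk/", number_of_stages=2) for s in sessions]
journey_counts = Counter(j for j in journeys if j is not None)
total_sessions = sum(journey_counts.values())

# Most popular subsetted journey first, with its share of all matching sessions
summary = [
    (" >>> ".join(journey), count, count / total_sessions)
    for journey, count in journey_counts.most_common()
]
```

Journeys shorter than `number_of_stages` are kept as they are, which matches the behaviour described for `NUMBER_OF_STAGES`.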

A visualisation and Google Sheet of the top 10 unique journeys are also created. These outputs include the count and percentage of all journeys that the top 10 unique journeys represent.

**NOTE:** The forward path tool will often output hundreds or thousands of unique journeys. For ease of interpretation, the visualisation and Google Sheet only present the top 10 unique journeys; all other unique journeys are excluded. To explore and summarise all unique user journeys, please try the [`User Intent Explorer`](https://colab.research.google.com/drive/1xW7uEXpkDfrqAsUBcKMwQcvLit-dUpbx#forceEdit=true&sandboxMode=true).

For help and advice, use the `#data-services` channel in Slack.

## Arguments

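- `START_DATE`: Date in `YYYY-MM-DD` format defining the start date of your query. It is converted to `YYYYMMDD` format before the query is run.
- `END_DATE`: Date in `YYYY-MM-DD` format defining the end date of your query. It is converted to `YYYYMMDD` format before the query is run.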
- `DESIRED_PAGE`: String of the desired GOV.UK page path of interest.
- `NUMBER_OF_STAGES`: Integer defining how many pages (including `DESIRED_PAGE`) should be considered when subsetting the user journeys. Note that journeys with fewer pages than `NUMBER_OF_STAGES` will always be included.
- `FIRST_HIT`: Boolean flag indicating that the `FIRST` hit to the `DESIRED_PAGE` in the session is used for the subsetted journey. If this option is selected, `LAST_HIT` cannot be selected. 
- `LAST_HIT`: Boolean flag indicating that the `LAST` hit to the `DESIRED_PAGE` in the session is used for the subsetted journey. If this option is selected, `FIRST_HIT` cannot be selected.  
- `PAGE_TYPE`: Boolean flag indicating that `PAGE` page paths are required. One of `PAGE_TYPE` or `EVENT_TYPE` must be selected.
- `EVENT_TYPE`: Boolean flag indicating that `EVENT` page paths are required. One of `PAGE_TYPE` or `EVENT_TYPE` must be selected.
- `DEVICE_DESKTOP`: Boolean flag indicating that desktop devices should be included in this query. One of `DEVICE_DESKTOP`, `DEVICE_MOBILE`, `DEVICE_TABLET`, or `DEVICE_ALL` must be selected. However, `DEVICE_DESKTOP` cannot be selected if `DEVICE_ALL` is selected.
- `DEVICE_MOBILE`: Boolean flag indicating that mobile devices should be included in this query. One of `DEVICE_DESKTOP`, `DEVICE_MOBILE`, `DEVICE_TABLET`, or `DEVICE_ALL` must be selected. However, `DEVICE_MOBILE` cannot be selected if `DEVICE_ALL` is selected.
- `DEVICE_TABLET`: Boolean flag indicating that tablet devices should be included in this query. One of `DEVICE_DESKTOP`, `DEVICE_MOBILE`, `DEVICE_TABLET`, or `DEVICE_ALL` must be selected. However, `DEVICE_TABLET` cannot be selected if `DEVICE_ALL` is selected.
- `DEVICE_ALL`: Boolean flag indicating that all devices should be segmented but included in this query. One of `DEVICE_DESKTOP`, `DEVICE_MOBILE`, `DEVICE_TABLET`, or `DEVICE_ALL` must be selected. However, `DEVICE_ALL` cannot be selected if `DEVICE_DESKTOP`, `DEVICE_MOBILE`, or `DEVICE_TABLET` is selected. 

### Optional arguments

- `QUERY_STRING`: Boolean flag. If `TRUE`, remove query strings from all page paths. If `FALSE`, keep query strings in all page paths.
- `FLAG_EVENTS`: Boolean flag. If `TRUE`, all `EVENT` page paths will have a ` [E]` suffix. This is useful if both `PAGE_TYPE` and `EVENT_TYPE` are selected, so you can differentiate between the same page path with different types. If `FALSE`, no suffix is appended to `EVENT` page paths.
- `EVENT_CATEGORY`: Boolean flag. If `TRUE`, all event categories will be displayed.
- `EVENT_ACTION`: Boolean flag. If `TRUE`, all event actions will be displayed. 
- `EVENT_LABEL`: Boolean flag. If `TRUE`, all event labels will be displayed. 
- `ENTRANCE_PAGE`: Boolean flag. If `TRUE`, subsetted journeys that contain the entrance page are flagged.
- `EXIT_PAGE`: Boolean flag. If `TRUE`, subsetted journeys that contain the exit page are flagged.
- `REMOVE_DESIRED_PAGE_REFRESHES`: Boolean flag. If `TRUE`, sequential page paths of the same type are removed when the query calculates the first/last visit to the desired page. In other words, it will only use the first visit in a series of sequential visits to the desired page if they have the same type. Other visits to the desired page will remain, as will any other desired page refreshes.
- `TRUNCATE_SEARCHES`: Boolean flag. If `TRUE`, all GOV.UK search page paths are truncated to `Sitesearch ({TYPE}): {KEYWORDS}`, where `{TYPE}` is the GOV.UK search content type, and `{KEYWORDS}` are the search keywords. If there are no keywords, this is set to `none`. If `FALSE`, GOV.UK search page paths are not truncated.

## Returns

A csv file containing a Google BigQuery result showing the subsetted user journey containing `PAGE_TYPE` and/or `EVENT_TYPE` page paths in order from the first or last visit to `DESIRED_PAGE` with a maximum length `NUMBER_OF_STAGES`. The results are presented in descending order, with the most popular subsetted user journey first.

Results show:
- `flagEntrance`: Whether the subsetted journey includes the first page visited during a session; shown as `no flag` unless `ENTRANCE_PAGE` is selected
- `flagExit`: Whether the subsetted journey includes the last page visited during a session; shown as `no flag` unless `EXIT_PAGE` is selected
- `deviceCategory`: The device category/ies of the subsetted journeys
- `totalSessions`: The total number of sessions
- `countSessions`: The total number of sessions per subsetted journey
- `proportionSessions`: The proportion of sessions per subsetted journey
- `goal`: The `DESIRED_PAGE`
- `goalNextStepX`: The page path at stage X of the subsetted journey, where stage 1 is the `DESIRED_PAGE`; X runs from 2 up to `NUMBER_OF_STAGES`

A second csv file showing the count for each next step page path, regardless of the overall subsetted journey. 
The results are presented in descending order, with the most popular next step first. 

Results show: 
- `goal`: The `DESIRED_PAGE`
- `countsgoal`: The number of sessions that visited the `DESIRED_PAGE`
- `goalNextStepX`: The page path at stage X; X runs from 2 up to `NUMBER_OF_STAGES`
- `countsgoalNextStepX`: The number of sessions that visited the page path at stage X
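
How the second file relates to the first can be illustrated with a small pure-Python sketch. It assumes the `userJourney` strings are delimited by ` >>> `, as in the first output; the example rows and the helper name `count_pages_per_step` are hypothetical, and the notebook itself does this with pandas.

```python
from collections import Counter, defaultdict


def count_pages_per_step(journey_counts):
    """Sum session counts per page path at each step, ignoring the rest of the journey."""
    step_counts = defaultdict(Counter)
    for user_journey, count_sessions in journey_counts:
        for step, page in enumerate(user_journey.split(" >>> "), start=1):
            step_counts[step][page] += count_sessions
    return step_counts


# Hypothetical (userJourney, countSessions) rows from the first file
rows = [
    ("www.gov.uk/", 50),
    ("www.gov.uk/ >>> www.gov.uk/vehicle-tax", 30),
    ("www.gov.uk/ >>> www.gov.uk/coronavirus", 15),
    ("www.gov.uk/ >>> www.gov.uk/vehicle-tax >>> www.gov.uk/check-mot-status", 5),
]

step_counts = count_pages_per_step(rows)
# Step 1 is the desired page; step 2 onwards are the next steps, most popular first
next_step_2 = step_counts[2].most_common()
```

Because counts are summed per step, a page path can rank highly at one step even if no single journey containing it is among the most popular.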

## Assumptions

- Only exact matches to `DESIRED_PAGE` are currently supported.
- Other visits to `DESIRED_PAGE` are ignored, only the first or last visit is used.
- If `REMOVE_DESIRED_PAGE_REFRESHES` is `TRUE`, and there is more than one page type (`PAGE_TYPE` and `EVENT_TYPE` are both selected), only the first visit in a series of page refreshes with the same `DESIRED_PAGE` and page type is used to determine which is the first/last visit.
- Journeys shorter than the number of desired stages (`NUMBER_OF_STAGES`) are always included.
- GOV.UK search page paths are assumed to have the format `/search/{TYPE}?keywords={KEYWORDS}{...}`, where `{TYPE}` is the GOV.UK search content type, `{KEYWORDS}` are the search keywords, where each keyword is
  separated by `+`, and `{...}` are any other parts of the search query that are not keyword-related (if they exist).
- GOV.UK search page titles are assumed to have the format `{KEYWORDS} - {TYPE} - GOV.UK`, where `{TYPE}` is the GOV.UK search content type, and `{KEYWORDS}` are the search keywords.
- If `ENTRANCE_PAGE` is `FALSE`, each journey (row) covers both sessions where the entrance page is included and sessions where it is not. Therefore, if there are more page paths than `NUMBER_OF_STAGES`, this will not be flagged.
- If `EXIT_PAGE` is `FALSE`, each journey (row) covers both sessions where the exit page is included and sessions where it is not. Therefore, if there are more page paths than `NUMBER_OF_STAGES`, this will not be flagged.
- If `DEVICE_ALL` is selected in combination with any of `DEVICE_DESKTOP`, `DEVICE_MOBILE`, or `DEVICE_TABLET`, then the analysis will use `DEVICE_ALL` and ignore the other arguments.
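
The assumed search page path format can be illustrated with a short Python sketch. It applies the `Sitesearch ({TYPE}): {KEYWORDS}` format to a bare page path (without the domain); the function name `truncate_search` is hypothetical, and the query in this notebook does the equivalent work in SQL rather than with this code.

```python
import re


def truncate_search(page_path):
    """Truncate a GOV.UK search page path to `Sitesearch ({TYPE}): {KEYWORDS}`."""
    content_type = re.match(r"^/search/([^ ?#/]+)", page_path)
    if content_type is None:
        return page_path  # not a search page, so leave it unchanged
    keywords = re.match(r"^/search/[^ ?#/]+\?keywords=([^&]+)", page_path)
    # Keywords are separated by `+` in the path; use `none` when there are no keywords
    keyword_text = keywords.group(1).replace("+", " ") if keywords else "none"
    return f"Sitesearch ({content_type.group(1)}): {keyword_text}"


example = truncate_search("/search/all?keywords=student+visa&order=relevance")
```

Anything after the keywords, such as `&order=relevance`, is dropped, so searches that differ only in sort order are grouped together.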


In [101]:
from datetime import datetime

import pandas as pd
import numpy as np
from google.cloud import bigquery
from google.colab import auth, files
from IPython.core.interactiveshell import InteractiveShell

import plotly.graph_objects as go

!pip install --upgrade gspread -q
import gspread
!pip install gspread_formatting -q
import gspread_formatting as gsf
from google.auth import default

# Allow multiline outputs
InteractiveShell.ast_node_interactivity = "all"

# Authenticate the user - follow the link and the prompts to get an authentication token
auth.authenticate_user()

In [102]:
# @markdown ## Set query parameters
# @markdown Define the start and end dates
START_DATE = "2022-09-01"  # @param {type:"date"}
END_DATE = "2022-09-01"  # @param {type:"date"}

# @markdown Set the desired page path
# @markdown <br><i>NB: This must start with the domain, e.g. `www.gov.uk/coronavirus`, `signin.account.gov.uk/enter-password`</i>
DESIRED_PAGE = "www.gov.uk/"  # @param {type:"string"}

# @markdown Set the hit to the desired page in the session; select one option only
FIRST_HIT = True  # @param {type:"boolean"}
LAST_HIT = False  # @param {type:"boolean"}

# @markdown Set the number of pages, including `DESIRED_PAGE` to include in the subsetted journeys
NUMBER_OF_STAGES = 6  # @param {type:"integer"}

# @markdown Set the page types; at least one must be checked
PAGE_TYPE = True  # @param {type:"boolean"}
EVENT_TYPE = True  # @param {type:"boolean"}

# @markdown Set the device categories; select one or more devices `[DEVICE_DESKTOP, DEVICE_MOBILE, DEVICE_TABLET]`, OR select all device categories divided up but included in the same analysis `[DEVICE_ALL]`
DEVICE_DESKTOP = True  # @param {type:"boolean"}
DEVICE_MOBILE = True  # @param {type:"boolean"}
DEVICE_TABLET = True  # @param {type:"boolean"}
DEVICE_ALL = False  # @param {type:"boolean"}

# @markdown ### Other options

# @markdown Remove query strings from all page paths
QUERY_STRING = False  # @param {type:"boolean"}

# @markdown Add a ` [E]` suffix to EVENT page paths - easier to differentiate between PAGE and
# @markdown EVENT types for the same page path
FLAG_EVENTS = False  # @param {type:"boolean"}

# @markdown Add event information suffix to EVENT page paths
EVENT_CATEGORY = False  # @param {type:"boolean"}
EVENT_ACTION = False  # @param {type:"boolean"}
EVENT_LABEL = False  # @param {type:"boolean"}

# @markdown Include entrance page flag
ENTRANCE_PAGE = False  # @param {type:"boolean"}

# @markdown Include exit page flag
EXIT_PAGE = False  # @param {type:"boolean"}

# @markdown Remove page refreshes when determining the first/last visit to `DESIRED_PAGE`
REMOVE_DESIRED_PAGE_REFRESHES = True  # @param {type:"boolean"}

# @markdown Truncate search pages to only show the search content type, and search keywords
TRUNCATE_SEARCHES = True  # @param {type:"boolean"}

In [103]:
# Convert the inputted start and end date into `YYYYMMDD` formats
QUERY_START_DATE = datetime.strptime(START_DATE, "%Y-%m-%d").strftime("%Y%m%d")
QUERY_END_DATE = datetime.strptime(END_DATE, "%Y-%m-%d").strftime("%Y%m%d")

# Check that exactly one of `FIRST_HIT` or `LAST_HIT` is selected; if neither is, the query matches no sessions
if FIRST_HIT == LAST_HIT:
    raise AssertionError("Exactly one of `FIRST_HIT` or `LAST_HIT` must be checked!")

# Compile the query page types
if PAGE_TYPE and EVENT_TYPE:
    QUERY_PAGE_TYPES = ["PAGE", "EVENT"]
elif PAGE_TYPE:
    QUERY_PAGE_TYPES = ["PAGE"]
elif EVENT_TYPE:
    QUERY_PAGE_TYPES = ["EVENT"]
else:
    raise AssertionError("At least one of `PAGE_TYPE` or `EVENT_TYPE` must be checked!")

# Compile the device categories
QUERY_DEVICE_CATEGORIES = [
    "desktop" if DEVICE_DESKTOP else "",
    "mobile" if DEVICE_MOBILE else "",
    "tablet" if DEVICE_TABLET else "",
]
QUERY_DEVICE_CATEGORIES = [d for d in QUERY_DEVICE_CATEGORIES if d]
assert QUERY_DEVICE_CATEGORIES or DEVICE_ALL, (
    "At least one of `DEVICE_DESKTOP`, `DEVICE_MOBILE`, `DEVICE_TABLET`"
    " or `DEVICE_ALL` must be checked!"
)

# Set the notebook execution date
NOTEBOOK_EXECUTION_DATE = datetime.now().strftime("%Y%m%d")

# Define the output file names
OUTPUT_FILE = (
    f"{NOTEBOOK_EXECUTION_DATE}_user_journeys_{QUERY_START_DATE}_{QUERY_END_DATE}_"
    + f"{'_'.join(QUERY_DEVICE_CATEGORIES)}.csv"
)

In [104]:
query = """
WITH
    get_session_data AS (
        -- Get all the session data between `start_date` and `end_date`, subsetting for specific `page_type`s. As
        -- some pages might be dropped by the subsetting, recalculate `hitNumber` as `journeyNumber` so the values
        -- are sequential.
        SELECT
            CONCAT(fullVisitorId, "-", visitId) AS sessionId,
            ROW_NUMBER() OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber) AS journeyNumber,
            ROW_NUMBER() OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber DESC) AS revJourneyNumber,
            hits.type,
            device.deviceCategory,
            hits.page.pagePath,
            hits.page.hostname,
            CONCAT(
                IF(@queryString, REGEXP_REPLACE(hits.page.pagePath, r'[?#].*', ''), hits.page.pagePath),  -- modify this line to `hits.page.pageTitle` if required
                IF(hits.type = "EVENT" AND @flagEvents, IF ((@eventCategory OR @eventAction OR @eventLabel), " [E", " [E]"), ""),
                IF(hits.type = "EVENT" AND @eventCategory, CONCAT(IF ((@flagEvents), ", ", " ["), IFNULL(hits.eventInfo.eventCategory, "null"), IF ((@eventAction OR @eventLabel), "", "]")), ""),
                IF(hits.type = "EVENT" AND @eventAction, CONCAT(IF ((@flagEvents OR @eventCategory), ", ", " ["), IFNULL(hits.eventInfo.eventAction, "null"), IF ((@eventLabel), "", "]")), ""),
                IF(hits.type = "EVENT" AND @eventLabel, CONCAT(IF ((@flagEvents OR @eventCategory OR @eventAction), ", ", " ["), IFNULL(hits.eventInfo.eventLabel, "null"), "]"), "")  
            ) AS pageId
        FROM `govuk-bigquery-analytics.87773428.ga_sessions_*`
        CROSS JOIN UNNEST(hits) AS hits
        WHERE _TABLE_SUFFIX BETWEEN @startDate AND @endDate
        AND hits.type IN UNNEST(@pageType)
        AND (CASE WHEN @deviceAll THEN device.deviceCategory in UNNEST(["mobile", "desktop", "tablet"]) END 
            OR CASE WHEN @deviceCategories IS NOT NULL THEN device.deviceCategory in UNNEST(@deviceCategories) END )
    ),
  combine_host_with_pageids AS (
      -- Combine hostname with pageId
      SELECT
        * 
        EXCEPT (hostname)
        REPLACE ( 
          IF(STARTS_WITH(pagePath, '/search/'), CONCAT(hostname, ': ', pageId), CONCAT(hostname, pageId)) AS pageId
            )
      FROM get_session_data
    ), 
    get_search_content_type_and_keywords AS (
        -- Extract the content type and keywords (if any) for GOV.UK search pages.
        SELECT
            *,
            IFNULL(
              REGEXP_EXTRACT(pagePath, r"^/search/([^ ?#/]+)"),
              REGEXP_EXTRACT(pagePath, r"^.+ - ([^-]+) - GOV.UK$")
            ) AS searchContentType,
            IFNULL(
              REPLACE(REGEXP_EXTRACT(pagePath, r"^/search/[^ ?#/]+\?keywords=([^&]+)"), "+", " "),
              REGEXP_EXTRACT(pagePath, r"^(.+)- [^-]+ - GOV.UK$")
            ) AS searchKeywords
        FROM combine_host_with_pageids
    ),
    compile_search_entry AS (
        -- Truncate the search page into an entry of the search content type and keywords (if any).
        SELECT
            * EXCEPT (searchContentType, searchKeywords),
            CONCAT(
                "Sitesearch (",
                searchContentType,
                "): ",
                COALESCE(searchKeywords, "none")
            ) AS search_entry
        FROM get_search_content_type_and_keywords
    ),
    replace_escape_characters AS (
    -- Replace \ with / as otherwise following REGEXP_REPLACE will not execute  
       SELECT
          *,
          REGEXP_REPLACE(search_entry, r"\\\\", "/") AS searchEntryEscapeRemoved
       FROM compile_search_entry 
    ),  
    revise_search_pageids AS (
    -- Replace `pageId` for search pages with the compiled entries if selected by the user.
       SELECT
          * REPLACE (
              IFNULL(IF(@truncatedSearches, (REGEXP_REPLACE(pageId, r"^/search/.*", searchEntryEscapeRemoved)), pageId), pageId) AS pageId
            )
        FROM replace_escape_characters
    ),
    identify_page_refreshes AS (
        -- Lead the page `type` and `pageId` columns. This helps identify page refreshes that can be removed in the
        -- next CTE
        SELECT
            *,
            LEAD(type) OVER (PARTITION BY sessionId ORDER BY journeyNumber) AS leadType,
            LEAD(pageId) OVER (PARTITION BY sessionId ORDER BY journeyNumber) AS leadPageId
        FROM revise_search_pageids
    ),
    identify_hit_to_desired_page AS (
        -- Get the first/last hit to the desired page. Ignores previous visits to the desired page. Page refreshes of the
        -- desired page are also ignored if the correct option is declared.
        SELECT
            sessionId,
            deviceCategory,
            CASE 
                WHEN @firstHit THEN MIN(journeyNumber) 
                WHEN @lastHit THEN MAX(journeyNumber) 
            END AS desiredPageJourneyNumber
        FROM identify_page_refreshes
        WHERE pageId = @desiredPage
        AND IF(
            @desiredPageRemoveRefreshes,
            (
                leadPageId IS NULL
                OR pageId != leadPageId
                OR IF(ARRAY_LENGTH(@pageType) > 1, pageId = leadPageId AND type != leadType, FALSE)
            ),
            TRUE
        )
        GROUP BY sessionId, deviceCategory
    ),
    subset_journey_to_hit_of_desired_page AS (
        -- Subset all user journeys to the first/last hit of the desired page.
        SELECT revise_search_pageids.*
        FROM revise_search_pageids
        INNER JOIN identify_hit_to_desired_page
        ON revise_search_pageids.sessionId = identify_hit_to_desired_page.sessionId
        AND revise_search_pageids.deviceCategory = identify_hit_to_desired_page.deviceCategory
        AND revise_search_pageids.journeyNumber >= identify_hit_to_desired_page.desiredPageJourneyNumber
    ),
    calculate_stages AS (
        -- Calculate the number of stages from the first/last hit to the desired page, where the first/last hit to the desired
        -- page is '1'.
        SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY sessionId ORDER BY journeyNumber ASC) AS forwardDesiredPageJourneyNumber
        FROM subset_journey_to_hit_of_desired_page
    ),
    subset_journey_to_number_of_stages AS (
        -- Compile the subsetted user journeys together for each session in order (first/last hit to the desired
        -- page first), delimited by " >>> ".
        SELECT DISTINCT
            sessionId,
            deviceCategory,
            MIN(journeyNumber) OVER (PARTITION BY sessionId) = 1 AS flagEntrance,
            MIN(revJourneyNumber) OVER (PARTITION BY sessionId) = 1 AS flagExit,
            STRING_AGG(pageId, " >>> ") OVER (
                PARTITION BY sessionId
                ORDER BY forwardDesiredPageJourneyNumber ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
            ) AS userJourney
        FROM calculate_stages
        WHERE forwardDesiredPageJourneyNumber <= @numberOfStages
    ),
    count_distinct_journeys AS (
        -- Count the number of sessions for each distinct subsetted user journey, split by whether the sessions
        -- entered on the first page of the subsetted journey or not
        SELECT
            CASE WHEN @entrancePage THEN CAST(flagEntrance AS STRING) ELSE 'no flag' END AS flagEntrance,
            CASE WHEN @exitPage THEN CAST(flagExit AS STRING) ELSE 'no flag' END AS flagExit,
            CASE WHEN @deviceAll THEN CAST(deviceCategory AS STRING) ELSE ARRAY_TO_STRING(@deviceCategories, ", ") END AS deviceCategory,
            userJourney,
            (SELECT COUNT(sessionId) FROM subset_journey_to_number_of_stages) AS totalSessions,
            COUNT(sessionId) AS countSessions
        FROM subset_journey_to_number_of_stages
        GROUP BY
            flagEntrance, flagExit, userJourney, deviceCategory
    )
SELECT
    *,
    countSessions / totalSessions AS proportionSessions
FROM count_distinct_journeys
ORDER BY countSessions DESC;
"""

In [105]:
# Initialise a Google BigQuery client, and define the query parameters
client = bigquery.Client(project="govuk-bigquery-analytics", location="EU")
query_parameters = [
    bigquery.ScalarQueryParameter("startDate", "STRING", QUERY_START_DATE),
    bigquery.ScalarQueryParameter("endDate", "STRING", QUERY_END_DATE),
    bigquery.ArrayQueryParameter("pageType", "STRING", QUERY_PAGE_TYPES),
    bigquery.ScalarQueryParameter("firstHit", "BOOL", FIRST_HIT),
    bigquery.ScalarQueryParameter("lastHit", "BOOL", LAST_HIT),
    bigquery.ArrayQueryParameter("deviceCategories", "STRING", QUERY_DEVICE_CATEGORIES),
    bigquery.ScalarQueryParameter("deviceAll", "BOOL", DEVICE_ALL),
    bigquery.ScalarQueryParameter("flagEvents", "BOOL", FLAG_EVENTS),
    bigquery.ScalarQueryParameter("eventCategory", "BOOL", EVENT_CATEGORY),
    bigquery.ScalarQueryParameter("eventAction", "BOOL", EVENT_ACTION),
    bigquery.ScalarQueryParameter("eventLabel", "BOOL", EVENT_LABEL),
    bigquery.ScalarQueryParameter("truncatedSearches", "BOOL", TRUNCATE_SEARCHES),
    bigquery.ScalarQueryParameter("desiredPage", "STRING", DESIRED_PAGE),
    bigquery.ScalarQueryParameter("queryString", "BOOL", QUERY_STRING),
    bigquery.ScalarQueryParameter("entrancePage", "BOOL", ENTRANCE_PAGE),
    bigquery.ScalarQueryParameter("exitPage", "BOOL", EXIT_PAGE),
    bigquery.ScalarQueryParameter(
        "desiredPageRemoveRefreshes", "BOOL", REMOVE_DESIRED_PAGE_REFRESHES
    ),
    bigquery.ScalarQueryParameter("numberOfStages", "INT64", NUMBER_OF_STAGES),
]

# Dry run the query, asking for user input to confirm the query execution size is okay
bytes_processed = client.query(
    query,
    job_config=bigquery.QueryJobConfig(query_parameters=query_parameters, dry_run=True),
).total_bytes_processed

# Compile a message, and flag to the user for a response; if not "yes", terminate execution
user_message = (
    f"This query will process {bytes_processed / (1024 ** 3):.1f} GB when run, "
    + f"which is approximately ${bytes_processed / (1024 ** 4)*5:.3f}. Continue ([yes])? "
)
if input(user_message).lower() != "yes":
    raise RuntimeError("Stopped execution!")

# Execute the query, and return as a pandas DataFrame
df_raw = client.query(
    query, job_config=bigquery.QueryJobConfig(query_parameters=query_parameters)
).to_dataframe()
df_raw.head()

This query will process 1.2 GB when run, which is approximately $0.006. Continue ([yes])? yes


Unnamed: 0,flagEntrance,flagExit,deviceCategory,userJourney,totalSessions,countSessions,proportionSessions
0,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/,252302,81644,0.323596
1,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/,252302,9317,0.036928
2,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/ >>> www.gov.uk/sig...,252302,5435,0.021542
3,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/ >>> www.gov.uk/,252302,3171,0.012568
4,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/ >>> www.gov.uk/ >>...,252302,2567,0.010174


In [106]:
df_stages = df_raw.set_index(
    ["flagEntrance", "flagExit", "deviceCategory", "userJourney"], drop=False
)["userJourney"].str.split(" >>> ", expand=True)

df_stages.columns = ["goal", *[f"goalNextStep{c+1}" for c in df_stages.columns[1:]]]

df = df_raw.merge(
    df_stages,
    how="left",
    left_on=["flagEntrance", "flagExit", "deviceCategory", "userJourney"],
    right_index=True,
    validate="1:1",
)

df.head()

Unnamed: 0,flagEntrance,flagExit,deviceCategory,userJourney,totalSessions,countSessions,proportionSessions,goal,goalNextStep2,goalNextStep3,goalNextStep4,goalNextStep5,goalNextStep6
0,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/,252302,81644,0.323596,www.gov.uk/,,,,,
1,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/,252302,9317,0.036928,www.gov.uk/,www.gov.uk/,,,,
2,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/ >>> www.gov.uk/sig...,252302,5435,0.021542,www.gov.uk/,www.gov.uk/,www.gov.uk/sign-in-universal-credit,www.gov.uk/sign-in-universal-credit,,
3,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/ >>> www.gov.uk/,252302,3171,0.012568,www.gov.uk/,www.gov.uk/,www.gov.uk/,,,
4,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/ >>> www.gov.uk/ >>...,252302,2567,0.010174,www.gov.uk/,www.gov.uk/,www.gov.uk/,www.gov.uk/,www.gov.uk/,www.gov.uk/


# Outputs

In [107]:
# Output the results to a CSV file, and download it
df.to_csv(OUTPUT_FILE)
files.download(OUTPUT_FILE)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [108]:
# Amalgamate the previous steps to provide a summary of the most popular pages (regardless of order of steps)
all_data = []
for c in df.columns[7:]:
    df_amal = (df
        .groupby([c])['countSessions']
        .sum()
        .reset_index(name=f"counts{c}")
        .sort_values([f"counts{c}"], ascending=False)
    )
    all_data.append(df_amal)

df2 = pd.concat(all_data, axis=0, ignore_index=True)
df2 = df2.apply(lambda x: pd.Series(x.dropna().values))
df2.head()

Unnamed: 0,goal,countsgoal,goalNextStep2,countsgoalNextStep2,goalNextStep3,countsgoalNextStep3,goalNextStep4,countsgoalNextStep4,goalNextStep5,countsgoalNextStep5,goalNextStep6,countsgoalNextStep6
0,www.gov.uk/,252302.0,www.gov.uk/,129367.0,www.gov.uk/,53254.0,www.gov.uk/,22134.0,www.gov.uk/,11397.0,www.gov.uk/,7393.0
1,,,www.gov.uk/log-in-register-hmrc-online-services,3486.0,www.gov.uk/sign-in-universal-credit,7872.0,www.gov.uk/sign-in-universal-credit,9743.0,www.gov.uk/sign-in-universal-credit,2756.0,www.gov.uk/log-in-register-hmrc-online-services,2865.0
2,,,www.gov.uk/vehicle-tax,2614.0,www.gov.uk/log-in-register-hmrc-online-services,5194.0,www.gov.uk/browse/driving,5199.0,www.gov.uk/coronavirus,2328.0,www.gov.uk/browse/driving/vehicle-tax-mot-insu...,2764.0
3,,,www.gov.uk/contact/govuk/anonymous-feedback/th...,1858.0,www.gov.uk/browse/driving,4187.0,www.gov.uk/browse/visas-immigration,4649.0,www.gov.uk/browse/driving/vehicle-tax-mot-insu...,2277.0,www.gov.uk/browse/childcare-parenting/childcare,2089.0
4,,,www.gov.uk/sign-in-universal-credit,1416.0,www.gov.uk/browse/visas-immigration,3702.0,www.gov.uk/browse/tax,3803.0,www.gov.uk/browse/visas-immigration,2219.0,www.gov.uk/coronavirus,1911.0


In [109]:
# Save amalgamation of next steps to CSV file
filename = "next_steps_amalgamated.csv"
df2.to_csv(filename, index=False)
files.download(filename)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

# Presenting results as a Sankey diagram 
Run this code to create a pseudo-Sankey diagram summarising the top 10 journeys and the remaining journeys.

* If you want to view `EVENT` hit information, consider using the Google Sheets template instead. The Sankey diagram can only present a limited number of characters, so `EVENT` hit information is likely to be lost
* The plot is best when `NUMBER_OF_STAGES` <= 4. More characters are truncated the greater the number of stages, which will impact the coherence and quality of the diagram 
* Because of the above, the Sankey plot cannot be created when `NUMBER_OF_STAGES` is equal to or greater than 8
* If, for example, `NUMBER_OF_STAGES` = 5, but the max journey length is 4, then re-do the analysis with `NUMBER_OF_STAGES` = 4. Fewer characters will be truncated
* When the plot is created, it is possible to drag the nodes to a different position. This is particularly useful when you have wide nodes, such as nodes with a proportion greater than 70%, as sometimes these nodes will overlap
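
To make the idea concrete, the sketch below shows one way journeys could be turned into the `source`, `target` and `value` lists that Sankey plotting tools such as `plotly.graph_objects.Sankey` expect for their links. It is an illustration under assumptions, not the plotting code used in this notebook: the helper name `journeys_to_links`, the example journeys, and the choice to give each journey its own chain of nodes are all hypothetical.

```python
def journeys_to_links(journeys):
    """Convert (pages, count) journeys into node labels and Sankey link lists.

    Every journey starts from one shared first node, then gets its own chain of
    nodes so that separate journeys do not merge further along the diagram.
    """
    labels = [journeys[0][0][0]]  # shared node for the desired page
    source, target, value = [], [], []
    for pages, count in journeys:
        previous = 0
        for page in pages[1:]:
            labels.append(page)
            node = len(labels) - 1
            source.append(previous)
            target.append(node)
            value.append(count)
            previous = node
    return labels, source, target, value


# Hypothetical top journeys, plus a remainder row for everything else
journeys = [
    (["/", "Exit"], 60),
    (["/", "/vehicle-tax", "Exit"], 25),
    (["/", "Other journeys"], 15),
]
labels, source, target, value = journeys_to_links(journeys)
```

Each extra stage adds another column of nodes, which is one reason labels have less room as `NUMBER_OF_STAGES` grows.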

**NOTE:** The forward path tool will often output hundreds or thousands of unique journeys. For ease of interpretation, the visualisation and Google Sheet only present the top 10 unique journeys; all other unique journeys are excluded. To explore and summarise all unique user journeys, please try the [`User Intent Explorer`](https://colab.research.google.com/drive/1xW7uEXpkDfrqAsUBcKMwQcvLit-dUpbx#forceEdit=true&sandboxMode=true).

For help and advice, use the `#data-services` channel in Slack.

In [110]:
# Plot without the domain, as otherwise page paths aren't clearly displayed 
df_no_domain = df.copy()  

for column in df_no_domain:
    if column.startswith("goal"):
        # `regex=True` is explicit, as the pandas default is changing to `False`
        df_no_domain[column] = df_no_domain[column].str.replace(
            r"^[^/]*/", "/", regex=True
        )
df_no_domain



Unnamed: 0,flagEntrance,flagExit,deviceCategory,userJourney,totalSessions,countSessions,proportionSessions,goal,goalNextStep2,goalNextStep3,goalNextStep4,goalNextStep5,goalNextStep6
0,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/,252302,81644,0.323596,/,,,,,
1,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/,252302,9317,0.036928,/,/,,,,
2,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/ >>> www.gov.uk/sig...,252302,5435,0.021542,/,/,/sign-in-universal-credit,/sign-in-universal-credit,,
3,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/ >>> www.gov.uk/,252302,3171,0.012568,/,/,/,,,
4,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/ >>> www.gov.uk/ >>...,252302,2567,0.010174,/,/,/,/,/,/
...,...,...,...,...,...,...,...,...,...,...,...,...,...
56916,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/ >>> www.gov.uk/ >>...,252302,1,0.000004,/,/,/,/search/all?keywords=student+visa+overseas+app...,/search/all?keywords=student+visa+overseas+app...,/search/all?keywords=student+visa+overseas+app...
56917,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/ >>> www.gov.uk/ >>...,252302,1,0.000004,/,/,/,/search/all?keywords=white+goods+grant&order=r...,/search/all?keywords=white+goods+grant&order=r...,/search/all?keywords=white+goods+grant&order=r...
56918,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/ >>> www.gov.uk: /s...,252302,1,0.000004,/,/,/search/all?keywords=property+capital+gains&or...,/search/all?keywords=property+capital+gains&or...,/guidance/capital-gains-tax-for-non-residents-...,/search/all?keywords=property+capital+gains&or...
56919,no flag,no flag,"desktop, mobile, tablet",www.gov.uk/ >>> www.gov.uk/ >>> www.gov.uk/ >>...,252302,1,0.000004,/,/,/,/,/search/all?keywords=travel+to+us&order=relevance,/search/all?keywords=travel+to+us&order=relevance


In [111]:
# Raise an error if `NUMBER_OF_STAGES` >= 8 
assert NUMBER_OF_STAGES < 8, "`NUMBER_OF_STAGES` must be equal to or less than 7"

In [112]:
# Filter the data to show the top 10 journeys only and order columns 
df_top = df_no_domain.iloc[:, np.r_[7:len(df_no_domain.columns), 5, 6]].head(10)

# For each journey (row), replace the first NaN value with 'Exit'
for column in df_top.transpose():
   df_top.loc[column] = (df_top.loc[column]
                               .fillna('Exit', limit=1))

# Sum count and proportion for top 10 journeys
top_10_count = df_top['countSessions'].sum()
top_10_count = f'{top_10_count:,}'
top_10_prop = df_top['proportionSessions'].sum()*100
top_10_prop = top_10_prop.round(decimals = 1)

# Create an 11th journey, `Other journeys`, which amalgamates the remaining journeys
journey_remainder = ([[df_no_domain[10:]['countSessions']
                       .sum(axis=0)], [df[10:]['proportionSessions']
                       .sum(axis=0)], [DESIRED_PAGE], ['Other journeys']])
journey_remainder = (pd.DataFrame(data=journey_remainder)
                       .transpose())
journey_remainder.columns =['countSessions', 'proportionSessions', 'goal', 'goalNextStep2']
# `DataFrame.append` was removed in pandas 2.0; `pd.concat` gives the same result
df_top = pd.concat([df_top, journey_remainder], ignore_index=True)
df_top['proportionSessions'] = df_top['proportionSessions'].astype('float')
df_prop = df_top['proportionSessions']*100
df_prop = df_prop.round(decimals = 1)

# Amalgamate countSessions and proportionSessions
df_top['proportionSessions'] = df_top['proportionSessions']*100 
df_top['proportionSessions'] = df_top['proportionSessions'].round(decimals = 1) 
df_top['countSessions'] = [f'{val:,}' for val in df_top['countSessions']] 
df_top['sessions'] = (' [' + df_top['countSessions'].astype(str) 
                           + ': '
                           + df_top['proportionSessions'].astype(str) 
                           + '%]')

# Drop redundant columns
df_top = (df_top.drop(['countSessions', 'proportionSessions'], axis=1)
                .dropna(axis=1, how='all'))

# Create a title for the figure
figure_title = f'<b>Forward Path Tool: `{DESIRED_PAGE}`</b><br>[{START_DATE} to {END_DATE}]'

# Define node colours 
desired_page_node_colour = ['rgb(136,34,85)']
node_colour = ['rgb(222,29,29)', 'rgb(82,188,163)', 'rgb(153,201,69)', 'rgb(204,97,196)', 
               'rgb(36,121,108)', 'rgb(218,165,27)', 'rgb(47,138,196)', 'rgb(118,78,115)', 
               'rgb(237,100,90)', 'rgb(229,134,6)', 'rgb(136,34,85)']
white_colour = ['rgb(255,255,255)']
grey_colour = ['rgb(192,192,192)']

# Get total number of sessions
total_sessions = df_no_domain['totalSessions'][0]
total_sessions = f'{total_sessions:,}'

In [113]:
# Create `x_coord` parameter, and truncate page path characters depending on `NUMBER_OF_STAGES` 
df_top = df_top.astype(str)

if NUMBER_OF_STAGES <= 2: 
   # create `x_coord`
   x_coord = list(np.linspace(0.1,1.05,2))
   
   for column in df_top:
      # truncate characters and add '...' where string lengths are more than 100
      df_top[column] = df_top[column].where(df_top[column].str.len() < 100, 
                                            df_top[column].str[:100] + '...')

elif NUMBER_OF_STAGES == 3:
   x_coord = [0.1, 0.4, 1.05] 

   for column in df_top:
        df_top[column] = df_top[column].where(df_top[column].str.len() < 55, 
                                              df_top[column].str[:55] + '...')   
   # for the last `goalNextStep`, truncate characters and add '...' where strings are more than 39
   df_top.iloc[:,-2] = df_top.iloc[:,-2].where(df_top.iloc[:,-2].str.len() < 39, 
                                               df_top.iloc[:,-2].str[:39] + '...')
   # for the second to last `goalNextStep`, truncate characters to the first 43 and add '...' ONLY if the last column (df_top.iloc[:,-2]) has also been truncated (as otherwise [:,-3] will overlap with [:,-2])
   df_top.iloc[:,-3] = np.where((df_top.iloc[:,-3].str.len() >= 43) & (df_top.iloc[:,-2].str.endswith('...', na=False)),
                                 df_top.iloc[:,-3].str[:43] +'...',
                                 df_top.iloc[:,-3])

elif NUMBER_OF_STAGES == 4: 
   x_coord = [0.1, 0.35, 0.64, 1.05]

   for column in df_top:
       df_top[column] = df_top[column].where(df_top[column].str.len() < 44, 
                                             df_top[column].str[:44] + '...')
   df_top.iloc[:,-2] = df_top.iloc[:,-2].where(df_top.iloc[:,-2].str.len() < 24, 
                                               df_top.iloc[:,-2].str[:24] + '...') 
   df_top.iloc[:,-3] = np.where((df_top.iloc[:,-3].str.len() >= 25) & (df_top.iloc[:,-2].str.endswith('...', na=False)),
                                 df_top.iloc[:,-3].str[:22] +'...',
                                 df_top.iloc[:,-3])

elif NUMBER_OF_STAGES == 5: 
   x_coord = list(np.linspace(0.1,1.05,5))

   for column in df_top:
      df_top[column] = df_top[column].where(df_top[column].str.len() < 33, 
                                            df_top[column].str[:33] + '...')
   df_top.iloc[:,-2] = df_top.iloc[:,-2].where(df_top.iloc[:,-2].str.len() < 8, 
                                               df_top.iloc[:,-2].str[:8] + '...')
   df_top.iloc[:,-3] = np.where((df_top.iloc[:,-2].str.len() >= 5), 
                                df_top.iloc[:,-3].str[:10] + '...', 
                                df_top.iloc[:,-3])

elif NUMBER_OF_STAGES == 6: 
   x_coord = list(np.linspace(0.1,1.05,6))

   for column in df_top:
       df_top[column] = df_top[column].where(df_top[column].str.len() < 26, df_top[column].str[:26] + '...')
   df_top.iloc[:,-2] = df_top.iloc[:,-2].where(df_top.iloc[:,-2].str.len() < 7, df_top.iloc[:,-2].str[:7] + '...')
   df_top.iloc[:,-3] = np.where((df_top.iloc[:,-2].str.len() >= 5), 
                                 df_top.iloc[:,-3].str[:7] + '...', 
                                 df_top.iloc[:,-3])

else: 
   x_coord = list(np.linspace(0.1,1.05,7))

   for column in df_top:
       df_top[column] = df_top[column].where(df_top[column].str.len() < 20, df_top[column].str[:20] + '...')
   df_top.iloc[:,-2] = df_top.iloc[:,-2].where(df_top.iloc[:,-2].str.len() < 2, df_top.iloc[:,-2].str[:2] + '...')
   df_top.iloc[:,-3] = np.where((df_top.iloc[:,-2].str.len() >= 5),
                                df_top.iloc[:,-3].str[:4] + '...', 
                                df_top.iloc[:,-3])

# Remove `None` or 'nan' values
label_list = [[x for x in y 
               if str(x) != 'None' and str(x) != 'nan'] 
               for y in df_top.values.tolist()]

# Concatenate count and proportion in the last `goalNextStep` field 
label_list_concatanated = []

for lists in label_list:
    temp = []
    temp = lists[:-2] + [(' '.join(lists[-2:]))]
    label_list_concatanated.append(temp)

# Get length for each journey
journey_lengths = [len(n) for n in label_list_concatanated]

# Create `x_coord` parameter 
x_coord_list = [x_coord[1:journey_lengths[x]] for x in range(11)]
x_coord_unnested = [item for sublist in x_coord_list
                    for item in sublist]
x_coord_unnested.insert(0,0.1)
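
As a rough illustration of how the per-journey `x` positions are assembled, the sketch below uses made-up journey lengths together with the three-stage `x_coord` from the cell above. The helper name `build_x_coords` is hypothetical and not part of this notebook.

```python
def build_x_coords(x_coord, journey_lengths):
    """Return one x position per Sankey node: the shared start node first,
    then the positions of each journey's remaining nodes."""
    # Skip the first position for every journey, as all journeys share the start node
    per_journey = [x_coord[1:length] for length in journey_lengths]
    flat = [x for journey in per_journey for x in journey]
    return [x_coord[0]] + flat


# Two illustrative journeys: one with 3 labels, one with 2 labels
example = build_x_coords([0.1, 0.4, 1.05], [3, 2])
print(example)  # [0.1, 0.4, 1.05, 0.4]
```

The shared start node appears once, which is why each journey contributes one fewer position than its length.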

In [114]:
# Create `y_coord` parameter
y_coord = [0.1]

for index in range(0, 10):
    
    if index == 0 and df_prop[index] <= 30:  
       prev_elem = y_coord[0]
       y_coord.append(prev_elem+0.1)

    elif index == 0 and df_prop[index] >= 30 and df_prop[index] <= 50:
       prev_elem = y_coord[0]
       y_coord.append(prev_elem+0.2)

    elif index == 0 and df_prop[index] >= 50 and df_prop[index] <= 70: 
       prev_elem = y_coord[0]    
       y_coord.append(prev_elem+0.25)

    elif index == 0 and df_prop[index] >= 70 and df_prop[index] <= 90: 
       prev_elem = y_coord[0]
       y_coord.append(prev_elem+0.3)

    elif index == 0 and df_prop[index] >= 90 and df_prop[index] <= 100: 
       prev_elem = y_coord[0]
       y_coord.append(prev_elem+0.4)

    elif index >= 1 and index <= 8 and df_prop[index] <= 10:
       prev_elem = y_coord[index]
       y_coord.append(prev_elem+0.05)

    elif index >= 1 and index <= 8 and df_prop[index] >= 10 and df_prop[index] <= 30:
      prev_elem = y_coord[index]
      y_coord.append(prev_elem+0.1)

    elif index >= 1 and index <= 8 and df_prop[index] >= 30 and df_prop[index] <= 50:
      prev_elem = y_coord[index]
      y_coord.append(prev_elem+0.2)

    elif index >= 1 and index <= 8 and df_prop[index] >= 50 and df_prop[index] <= 70:
      prev_elem = y_coord[index]
      y_coord.append(prev_elem+0.3)

    elif index >= 1 and index <= 8 and df_prop[index] >= 70 and df_prop[index] <= 100:
      prev_elem = y_coord[index]
      y_coord.append(prev_elem+0.5)

    elif index == 9:
      y_coord.append(0.9)

y_coord_list = [[y_coord[y]]*(journey_lengths[y]-1) for y in range(0,11)]
y_coord_unnested = [item for sublist in y_coord_list 
                    for item in sublist]
y_coord_unnested.insert(0,0.5)

In [115]:
# Get previous item function 
from itertools import tee, islice, chain

def previous(some_iterable):
    prevs, items = tee(some_iterable, 2)
    prevs = chain([None], prevs)
    return zip(prevs, items)

# Create new list of lists with node number
node_no_list = []

for prevlength, length in previous(journey_lengths):
    if prevlength is None: 
       temp1 = list(range(0, length))
       node_no_list.append(temp1)

    elif temp1!=[] and len(node_no_list)==1:
        temp2 = list(range(temp1[-1]+1, temp1[-1]+length+1))
        node_no_list.append(temp2)
    
    else: 
        node_no_list.append(list(range(node_no_list[-1][-1]+1, node_no_list[-1][-1]+length+1)))

# Replace every first value with '0'
for journey in node_no_list:
    journey[0] = 0
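
To make the node numbering concrete, here is a minimal sketch that should produce the same numbering as the cell above for a small, made-up set of journey lengths. The function name `number_nodes` is hypothetical, and the sketch assumes every journey has at least one label.

```python
def number_nodes(journey_lengths):
    """Give every label a consecutive node id, then point the first label
    of every journey at node 0 (the shared desired page)."""
    node_no_list = []
    next_id = 0
    for length in journey_lengths:
        node_no_list.append(list(range(next_id, next_id + length)))
        next_id += length

    # Every journey starts from the same desired-page node
    for journey in node_no_list:
        journey[0] = 0
    return node_no_list


print(number_nodes([3, 2, 4]))  # [[0, 1, 2], [0, 4], [0, 6, 7, 8]]
```

Node ids such as 3 and 5 are skipped on purpose: they belong to the duplicate start labels that are replaced by node 0.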

In [116]:
# Within `node_no_list`, combine the source and target values
source_target_list = []

for journey in node_no_list:
    number_of_pairs = len(journey)-1

    for prev_elem, elem in previous(journey):
        if prev_elem is None:
           continue   
        
        elif prev_elem is not None:
            temp = []
            temp.append(prev_elem)
            temp.append(elem)

        source_target_list.append(temp)
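
The pairing step can also be written with `zip`, which may be easier to follow. This is only a sketch under the assumption that the node lists have the shape produced above; `to_links` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
def to_links(node_no_list):
    """Turn each journey's node ids into (source, target) pairs of
    consecutive nodes, returned as two parallel lists."""
    source, target = [], []
    for journey in node_no_list:
        # Pair each node with the node that follows it in the journey
        for prev_node, node in zip(journey, journey[1:]):
            source.append(prev_node)
            target.append(node)
    return source, target


source, target = to_links([[0, 1, 2], [0, 4], [0, 6, 7, 8]])
print(source)  # [0, 1, 0, 0, 6, 7]
print(target)  # [1, 2, 4, 6, 7, 8]
```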

In [117]:
# Create `source` and `target` parameter 
source = [item[0] for item in source_target_list]
target = [item[1] for item in source_target_list]

# Unnest `label_list_concatanated` to create `label` parameter
label_list_unnested = [item for sublist in label_list_concatanated 
                       for item in sublist]

# Create `color` parameter 
colours = [desired_page_node_colour + [node_colour[colour]]*(journey_lengths[colour]-1) 
           for colour in range(11)]
colours_unnested = [item for sublist in colours 
                    for item in sublist]

# Create `link_color` parameter
link_colour = [grey_colour *(journey_lengths[colour]-1) 
               + white_colour for colour in range(11)]
link_colour_unnested = [item for sublist in link_colour 
                        for item in sublist]

# Create `value` parameter based on proportion
amin, amax = min(df_prop), max(df_prop)
val = [((val-amin) / (amax-amin)) for i, val in enumerate(df_prop)] 
val_list = [[val[y]]*(journey_lengths[y]-1) for y in range(0,11)]         
val_list_unnested = [item for sublist in val_list 
                     for item in sublist]

# Replace `0.0` with the second lowest number, as otherwise journeys with value `0.0` will not display 
val_list_unnested = [sorted(set(val_list_unnested))[1] 
                     if item==0.0 else item 
                     for item in val_list_unnested]
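
The link widths come from min-max scaling of the proportions, with zeros lifted to the second-lowest value. The sketch below shows that calculation on made-up proportions; `scale_values` is a hypothetical name, and it applies the replacement per journey rather than on the unnested list.

```python
def scale_values(proportions):
    """Min-max scale proportions to [0, 1], then replace 0.0 with the
    second-lowest scaled value so the smallest journey is still drawn."""
    lowest, highest = min(proportions), max(proportions)
    scaled = [(p - lowest) / (highest - lowest) for p in proportions]
    second_lowest = sorted(set(scaled))[1]
    return [second_lowest if s == 0.0 else s for s in scaled]


print(scale_values([50.0, 30.0, 10.0]))  # [1.0, 0.5, 0.5]
```

Note that this approach needs at least two distinct proportions: if they are all equal, the division by `highest - lowest` raises a `ZeroDivisionError`.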

In [118]:
# Create figure
fig = go.Figure(data=[go.Sankey(
    node=dict(
      x=x_coord_unnested,
      y=y_coord_unnested,
      pad=35,
      thickness=35,
      line=dict(color="white", width=0.5),
      label=label_list_unnested,
      color=colours_unnested
    ),
     arrangement='freeform', # 'fixed' 'snap' 'freeform' 'perpendicular'
     link=dict(
      source=source, 
      target=target, 
      value=val_list_unnested
  ))])

# Add annotations
fig = fig.add_annotation(x=1.05,
                         y=1.05,
                         text=f'<br>Total visits and proportion for the top 10 journeys: {top_10_count} [{top_10_prop}%]',
                         showarrow=False,
                         font=dict(family="Arial", size=22),
                         align="right"
)

fig = fig.add_annotation(x=0.01,
                         y=0.485,
                         text=f'<br>Total visits:<br>{total_sessions}',
                         showarrow=False,
                         font=dict(family="Arial", size=19),
                         align="center"
)

# Update layout
fig.update_layout(title_text=figure_title, 
                  font=dict(family='Arial', size=19, color='black'),
                  title_font_size=30,
                  width=1700,
                  height=900,
                  hovermode=False,
                  xaxis={
                        'showgrid': False,
                        'zeroline': False, 
                        'visible': False,  
                        },
                  yaxis={
                        'showgrid': False, 
                        'zeroline': False, 
                        'visible': False,  
                        },
                  plot_bgcolor='rgba(0,0,0,0)'
)

# Presenting results in Google sheets
Here's an [example of how you could present the results](https://docs.google.com/spreadsheets/d/1vSFXnPE8XozpRhI1G3x4tl5oro3pUIgZnoFmJ_AjPbY/edit#gid=1115034830) to facilitate sharing with colleagues. To do this, run the code below. When prompted, type `yes` into the box (the check is not case-sensitive); any other response stops execution.

This code uses a [template google sheet](https://docs.google.com/spreadsheets/d/1kISyKu2jVzINCxwPe8ydQM8cibgEX2a3WCPxkJM9W80/) to create a new google sheet in `Product and Technology Directorate/Data Services/Data Products/16 User Journey tools/Path tools: google sheet result tables`, with the title: `{START_DATE} to {END_DATE} - Forward path tool - {DESIRED_PAGE}`. The template can present up to 6 stages (`NUMBER_OF_STAGES`). Copy or delete the formatting on the newly created google sheet if more or fewer stages are required. 

It is advisable to present the results like this when the page paths are long, and if you want to visualise `EVENT` hits, as well as `PAGE` hits.

In [119]:
# Compile a message, and flag to the user for a response; if not "yes", terminate execution
user_message = (
    f"Do you want to create a google sheet of the top 10 journeys?"
)
if input(user_message).lower() != "yes":
    raise RuntimeError("Stopped execution!")

Do you want to create a google sheet of the top 10 journeys?yes


In [120]:
# Authentication
creds, _ = default()
gc = gspread.authorize(creds)

In [121]:
## Set up data
df_top = df.iloc[:, np.r_[6:len(df.columns), 4, 5]].head(10) # Filter the data to show the top 10 journeys only and order columns 
df_top['proportionSessions'] = df_top['proportionSessions']*100  # Convert proportion to %
df_top['proportionSessions'] = df_top['proportionSessions'].round(decimals = 2)  # Round % 2 decimal places 

# Iterate over each journey (row), and replace the first NaN value in that journey with 'Exit'
for column in df_top.transpose():
   df_top.loc[column] = df_top.loc[column].fillna('Exit', limit=1)

In [122]:
# Create google sheet in `Product and Technology Directorate/Data Services/Data Products/16 User Journey tools/Path tools: google sheet result tables`
gc.copy("1kISyKu2jVzINCxwPe8ydQM8cibgEX2a3WCPxkJM9W80", title=f'{START_DATE} to {END_DATE} - Forward path tool - {DESIRED_PAGE}', copy_permissions=True)
sheet = gc.open(f'{START_DATE} to {END_DATE} - Forward path tool - {DESIRED_PAGE}')
worksheet = sheet.worksheet("forward_path_tool")
print('\n', sheet.url)

<Spreadsheet '2022-09-01 to 2022-09-01 - Forward path tool - www.gov.uk/' id:1-EhRDBuVJdgt0Bj8J0aOUWw3kQj0ot9UTnqmMR4JrMY>


 https://docs.google.com/spreadsheets/d/1-EhRDBuVJdgt0Bj8J0aOUWw3kQj0ot9UTnqmMR4JrMY


In [123]:
## Fill spreadsheet

# Replace df nan values with ''
df = df_top.fillna('')

# Update title header cells
title = f'Forward path tool: `{DESIRED_PAGE}`'
worksheet.update('B1', f'{title}')
worksheet.update('B2', f'{START_DATE} to {END_DATE}')

# Update `% of sessions` cells
cell_range = list(map('C{}'.format, range(4, 14)))
sessions = list(map("{}%".format, list(df['proportionSessions']))) 
[worksheet.update(cell, sessionProp) for cell, sessionProp in zip(cell_range, sessions)]

# Update `No of. sessions` cells
cell_range = list(map('D{}'.format, range(4, 14)))
sessions = list(df['countSessions'])
[worksheet.update(cell, sessionCount) for cell, sessionCount in zip(cell_range, sessions)]

# Update `Goal page` cells
cell_range = list(map('F{}'.format, range(4, 14)))
goal = list(df['goal'])
[worksheet.update(cell, goalPage) for cell, goalPage in zip(cell_range, goal)]

## Update `Next step N` cells

# Get cell ID letter for all `Next step N` cells (start from cell `H`, skip 1, until cell `Z`)
cell_letters = [chr(c) for c in range(ord('h'), ord('z')+1, 2)]
cell_letters = cell_letters[:NUMBER_OF_STAGES] # only keep the number of elements that match NUMBER_OF_STAGES

# Get cell ID number for all `Next step N` cells 
cell_numbers = list(range(4, 14))
cell_numbers = [str(x) for x in cell_numbers]

# Combine cell ID letter and number to create a list of cell IDs for `Next step N` cells
goal_next_step_cells = []
for letter in cell_letters:
  for number in cell_numbers:
      goal_next_step_cells.append(letter + number)

# Create a list of the `Next step N` paths 
goal_next_step = []
for step in range(2,NUMBER_OF_STAGES+1):
    goal_next_step.extend(df[f'goalNextStep{step}'])

# Update `Next step N` cells
[worksheet.update(cell, goalPage) for cell, goalPage in zip(goal_next_step_cells, goal_next_step)]
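
The cell above makes one API request per cell. As a possible alternative, the sketch below builds one range payload per column, in the `{'range': ..., 'values': ...}` shape that gspread's `Worksheet.batch_update` accepts, so a whole column can be sent in one go. The helper name `column_payload` and the sample values are made up for illustration, and the final call is shown commented out because it needs an authorised `worksheet`.

```python
def column_payload(column_letter, values, first_row=4):
    """Build a single A1-notation range covering one column of the
    template, with one row per value."""
    last_row = first_row + len(values) - 1
    return {
        "range": f"{column_letter}{first_row}:{column_letter}{last_row}",
        "values": [[value] for value in values],
    }


payload = column_payload("C", ["12.5%", "3.1%"])
print(payload)  # {'range': 'C4:C5', 'values': [['12.5%'], ['3.1%']]}

# Assumed usage with an authorised gspread worksheet:
# worksheet.batch_update([payload])
```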

# Original SQL query

```sql
/*
Calculate the count and proportion of sessions that have the same journey behaviour.

This script finds sessions that visit a specific page (`desiredPage`) in their journey. From the first/last visit to
`desiredPage` in the session, the journey is subsetted to include the following N pages including `desiredPage`
(`numberofStages`).

The count and proportion of sessions visiting distinct, subsetted journeys are compiled together, and returned as a
sorted list in descending order split by subsetted journeys including the entrance page.

Arguments:

    startDate: String in YYYYMMDD format defining the start date of your query.
    endDate: String in YYYYMMDD format defining the end date of your query.
    pageType: String array containing comma-separated strings of page types. Must contain one or more of "PAGE" and
              "EVENT".
    firstHit: Boolean flag. If TRUE the first hit to the desired page is used for the subsetted journey. If set to TRUE, 
              `lastHit` must be set to FALSE.
    lastHit: Boolean flag. If TRUE the last hit to the desired page is used for the subsetted journey. If set to TRUE, 
              `firstHit` must be set to FALSE.
    deviceCategories: String array containing comma-separated strings of device categories. Can contain one or more
        of "mobile", "desktop", and "tablet". Can be left blank if deviceAll is selected.
    deviceAll: Boolean flag. If TRUE all device categories are included in the query but divided into their respective 
        categories. This must be set to TRUE if deviceCategories is left blank. If deviceCategories is not left blank, 
        this must be set to FALSE.
    flagEvents: Boolean flag. If TRUE, all "EVENT" page paths will have a " [E]" suffix. This is useful if `pageType`
        contains both "PAGE" and "EVENT" so you can differentiate between the same page path with different types. If
        FALSE, no " [E]" suffix is appended to "EVENT" page paths.
    eventCategory: Boolean flag. If TRUE, all "EVENT" page paths will be followed by the " [eventCategory]". If FALSE, 
        no " [eventCategory]" suffix is appended to "EVENT" page paths. 
    eventAction: Boolean flag. If TRUE, all "EVENT" page paths will be followed by the " [eventAction]". If FALSE, no 
        " [eventAction]" suffix is appended to "EVENT" page paths. 
    eventLabel: Boolean flag. If TRUE, all "EVENT" page paths will be followed by the " [eventLabel]". If FALSE, no 
       " [eventLabel]" suffix is appended to "EVENT" page paths. 
    truncatedSearches: Boolean flag. If TRUE, all GOV.UK search page paths are truncated to
        "Sitesearch ({TYPE}): {KEYWORDS}", where `{TYPE}` is the GOV.UK search content type, and `{KEYWORDS}` are the
        search keywords. If there are no keywords, this is set to `none`. If FALSE, GOV.UK search page paths are
        not truncated.
    desiredPage: String of the desired GOV.UK page path of interest.
    queryString: Boolean flag. If TRUE, remove query string from all page paths. If FALSE, keep all query strings in all page paths.  
    desiredPageRemoveRefreshes: Boolean flag. If TRUE sequential page paths of the same type are removed when the query
        calculates the first/last visit to the desired page. In other words, it will only use the first visit in a series
        of sequential visits to desired page if they have the same type. Other earlier visits to the desired page will
        remain, as will any earlier desired page refreshes.
    numberOfStages: Integer defining how many pages forward (including `desiredPage`) should be considered when
        subsetting the user journeys. Note that journeys with fewer pages than `numberOfStages` will always be
        included.
    entrancePage: Boolean flag. If TRUE, if the subsetted journey contains the entrance page this is flagged. If FALSE 
        no flag is used (e.g. the journey contains both instances where the entrance page is included, and the
        entrance page is not included).
    exitPage: Boolean flag. If TRUE, if the subsetted journey contains the exit page this is flagged. If FALSE 
        no flag is used (e.g. the journey contains both instances where the exit page is included, and the
        exit page is not included).
Returns:

    A Google BigQuery result containing the subsetted user journey containing `pageType` page paths in order from
    the first/last visit to `desiredPage` with a maximum length `numberOfStages`. Counts and the proportion of sessions
    that have this subsetted journey are also shown. Subsetted journeys that incorporate the first or last page visited by a 
    session are flagged if selected. The device category/ies of the subsetted journeys are also included. The results 
    are presented in descending order, with the most popular subsetted user journey first.

Assumptions:

    - Only exact matches to `desiredPage` are currently supported.
    - Previous visits to `desiredPage` are ignored, only the first/last visit is used.
    - If `desiredPageRemoveRefreshes` is TRUE, and there is more than one page type (`pageType`), only the first visit
      in page refreshes to the same `desiredPage` and page type are used to determine which is the first/last visit.
    - Journeys shorter than the number of desired stages (`numberOfStages`) are always included.
    - GOV.UK search page paths are assumed to have the format `/search/{TYPE}?keywords={KEYWORDS}{...}`, where
      `{TYPE}` is the GOV.UK search content type, `{KEYWORDS}` are the search keywords, where each keyword is
      separated by `+`, and `{...}` are any other parts of the search query that are not keyword-related (if they
      exist).
    - GOV.UK search page titles are assumed to have the format `{KEYWORDS} - {TYPE} - GOV.UK`, where `{TYPE}` is the
      GOV.UK search content type, and `{KEYWORDS}` are the search keywords.
    - If `entrancePage` is FALSE, each journey (row) contains both instances where the entrance page is included, 
      and the entrance page is not included. Therefore, if there are more page paths than `numberOfStages`, this 
      will not be flagged. 
    - If `deviceAll` is set to TRUE, and `deviceCategories` set to 'desktop', 'mobile', and/or 'tablet', the 
      query will use `deviceAll` and ignore all other arguments. 
*/

-- Declare query variables
DECLARE startDate DEFAULT "20210628";
DECLARE endDate DEFAULT "20210628";
DECLARE pageType DEFAULT ["PAGE", "EVENT"];
DECLARE firstHit DEFAULT TRUE;
DECLARE lastHit DEFAULT FALSE; 
DECLARE deviceCategories DEFAULT ['desktop', 'mobile', 'tablet'];
DECLARE deviceAll DEFAULT FALSE;
DECLARE flagEvents DEFAULT TRUE;
DECLARE eventCategory DEFAULT TRUE;
DECLARE eventAction DEFAULT TRUE;
DECLARE eventLabel DEFAULT TRUE;
DECLARE truncatedSearches DEFAULT TRUE;
DECLARE desiredPage DEFAULT "www.gov.uk/trade-tariff";
DECLARE queryString DEFAULT TRUE; 
DECLARE desiredPageRemoveRefreshes DEFAULT TRUE;
DECLARE numberOfStages DEFAULT 3;
DECLARE entrancePage DEFAULT TRUE; 
DECLARE exitPage DEFAULT TRUE; 

WITH
    get_session_data AS (
        -- Get all the session data between `startDate` and `endDate`, subsetting for the hit types in `pageType`.
        -- As some hits might be dropped by the subsetting, recalculate `hitNumber` as `journeyNumber` so the
        -- values are sequential.
        SELECT
            CONCAT(fullVisitorId, "-", visitId) AS sessionId,
            ROW_NUMBER() OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber) AS journeyNumber,
            ROW_NUMBER() OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.hitNumber DESC) AS revJourneyNumber,
            hits.type,
            device.deviceCategory,
            hits.page.pagePath,
            hits.page.hostname,
            CONCAT(
                IF(queryString, REGEXP_REPLACE(hits.page.pagePath, r'[?#].*', ''), hits.page.pagePath),  -- modify this line to `hits.page.pageTitle` if required
                IF(hits.type = "EVENT" AND flagEvents, IF ((eventCategory OR eventAction OR eventLabel), " [E", "[E]"), ""),
                IF(hits.type = "EVENT" AND eventCategory, CONCAT(IF ((flagEvents), ", ", " ["), IFNULL(hits.eventInfo.eventCategory, "null"), IF ((eventAction OR eventLabel), "", "]")), ""),
                IF(hits.type = "EVENT" AND eventAction, CONCAT(IF ((flagEvents OR eventCategory), ", ", " ["), IFNULL(hits.eventInfo.eventAction, "null"), IF ((eventLabel), "", "]")), ""),
                IF(hits.type = "EVENT" AND eventLabel, CONCAT(IF ((flagEvents OR eventCategory OR eventAction), ", ", " ["), IFNULL(hits.eventInfo.eventLabel, "null"), "]"), "") 
            ) AS pageId
        FROM `govuk-bigquery-analytics.87773428.ga_sessions_*`
        CROSS JOIN UNNEST(hits) AS hits
        WHERE _TABLE_SUFFIX BETWEEN startDate AND endDate
        AND hits.type IN UNNEST(pageType)
        AND (CASE WHEN deviceAll THEN device.deviceCategory in UNNEST(["mobile", "desktop", "tablet"]) END 
            OR CASE WHEN deviceCategories IS NOT NULL THEN device.deviceCategory in UNNEST(deviceCategories) END )
    ),

    combine_host_with_pageids AS (
    -- Combine hostname with pageId
      SELECT
        * 
        EXCEPT (hostname)
        REPLACE ( 
          IF(STARTS_WITH(pagePath, '/search/'), CONCAT(hostname, ': ', pageId), CONCAT(hostname, pageId)) AS pageId
            )
      FROM get_session_data
    ), 

    get_search_content_type_and_keywords AS (
        -- Extract the content type and keywords (if any) for GOV.UK search pages.
        SELECT
            *,
            IFNULL(
              REGEXP_EXTRACT(pagePath, r"^/search/([^ ?#/]+)"),
              REGEXP_EXTRACT(pagePath, r"^.+ - ([^-]+) - GOV.UK$")
            ) AS searchContentType,
            IFNULL(
              REPLACE(REGEXP_EXTRACT(pagePath, r"^/search/[^ ?#/]+\?keywords=([^&]+)"), "+", " "),
              REGEXP_EXTRACT(pagePath, r"^(.+)- [^-]+ - GOV.UK$")
            ) AS searchKeywords
        FROM combine_host_with_pageids
    ),

    compile_search_entry AS (
        -- Truncate the search page into an entry of the search content type and keywords (if any).
        SELECT
            * EXCEPT (searchContentType, searchKeywords),
            CONCAT(
                "Sitesearch (",
                searchContentType,
                "): ",
                COALESCE(searchKeywords, "none")
            ) AS search_entry
        FROM get_search_content_type_and_keywords
    ),

    replace_escape_characters AS (
    -- Replace \ with / as otherwise following REGEXP_REPLACE will not execute  
       SELECT
          *,
          REGEXP_REPLACE(search_entry, r"\\", "/") AS searchEntryEscapeRemoved
       FROM compile_search_entry 
    ),  

    revise_search_pageids AS (
    -- Replace `pageId` for search pages with the compiled entries if selected by the user.
        SELECT
            * REPLACE (
                 IFNULL(IF(truncatedSearches, (REGEXP_REPLACE(pageId, r"^/search/.*", searchEntryEscapeRemoved)), pageId), pageId) AS pageId
            )
        FROM replace_escape_characters
    ),

    identify_page_refreshes AS (
        -- Lead the page `type` and `pageId` columns. This helps identify page refreshes that can be removed in the
        -- next CTE
        SELECT
            *,
            LEAD(type) OVER (PARTITION BY sessionId ORDER BY journeyNumber) AS leadType,
            LEAD(pageId) OVER (PARTITION BY sessionId ORDER BY journeyNumber) AS leadPageId
        FROM revise_search_pageids
    ),

    identify_last_hit_to_desired_page AS (
        -- Get the first or last hit to the desired page, depending on whether `firstHit` or `lastHit` is set; other
        -- visits to the desired page are ignored. Page refreshes of the desired page are also ignored if
        -- `desiredPageRemoveRefreshes` is set.
        SELECT
            sessionId,
            deviceCategory,
            CASE 
                WHEN firstHit THEN MIN(journeyNumber) 
                WHEN lastHit THEN MAX(journeyNumber) 
            END AS desiredPageJourneyNumber
        FROM identify_page_refreshes
        WHERE pageId = desiredPage
        AND IF(
            desiredPageRemoveRefreshes,
            (
                leadPageId IS NULL
                OR pageId != leadPageId
                OR IF(ARRAY_LENGTH(pageType) > 1, pageId = leadPageId AND type != leadType, FALSE)
            ),
            TRUE
        )
        GROUP BY sessionId, deviceCategory
    ),

    subset_journey_to_last_hit_of_desired_page AS (
        -- Subset all user journeys to the first/last hit of the desired page.
        SELECT revise_search_pageids.*
        FROM revise_search_pageids
        INNER JOIN identify_last_hit_to_desired_page
        ON revise_search_pageids.sessionId = identify_last_hit_to_desired_page.sessionId
        AND revise_search_pageids.deviceCategory = identify_last_hit_to_desired_page.deviceCategory
        AND revise_search_pageids.journeyNumber >= identify_last_hit_to_desired_page.desiredPageJourneyNumber
        -- Only show pageIds following the desired page 
    ),

    calculate_stages AS (
        -- Calculate the number of stages from the first/last hit to the desired page, where the selected (first or
        -- last) hit to the desired page is '1'.
        SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY sessionId ORDER BY journeyNumber ASC) AS forwardDesiredPageJourneyNumber
        FROM subset_journey_to_last_hit_of_desired_page
    ),

    subset_journey_to_number_of_stages AS (
        -- Compile the subsetted user journeys together for each session in order (selected first/last hit to the
        -- desired page first), delimited by " >>> ".
        SELECT DISTINCT
            sessionId,
            deviceCategory,
            MIN(journeyNumber) OVER (PARTITION BY sessionId) = 1 AS flagEntrance,
            MIN(revJourneyNumber) OVER (PARTITION BY sessionId) = 1 AS flagExit,
            STRING_AGG(pageId, " >>> ") OVER (
                PARTITION BY sessionId
                ORDER BY forwardDesiredPageJourneyNumber ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
            ) AS userJourney
        FROM calculate_stages
        WHERE forwardDesiredPageJourneyNumber <= numberOfStages 
    ),

    count_distinct_journeys AS (
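        -- Count the number of sessions for each distinct subsetted user journey. When `entrancePage` and/or `exitPage`
        -- are set, the counts are split by whether the subsetted journey includes the session's entrance and/or exit
        -- page; they are also split by device category.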
        SELECT
            CASE WHEN entrancePage 
                   THEN CAST(flagEntrance AS STRING) 
                   ELSE 'no flag' 
            END AS flagEntrance,
            CASE WHEN exitPage 
                   THEN CAST(flagExit AS STRING) 
                   ELSE 'no flag' 
            END AS flagExit,
            CASE WHEN deviceAll
                   THEN CAST(deviceCategory AS STRING) 
                   ELSE ARRAY_TO_STRING(deviceCategories, ", ") 
            END AS deviceCategory,
            userJourney,
            (SELECT COUNT(sessionId) FROM subset_journey_to_number_of_stages) AS totalSessions,
            COUNT(sessionId) AS countSessions
        FROM subset_journey_to_number_of_stages
        GROUP BY
            flagEntrance, flagExit, deviceCategory, userJourney
    )

SELECT
    *,
    countSessions / totalSessions AS proportionSessions
FROM count_distinct_journeys
ORDER BY countSessions DESC;
```
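
To make the core of the query easier to follow, the subsetting and counting steps can be sketched in plain Python on toy data. This is only an illustrative sketch, not the notebook's implementation: the function names `forward_journey` and `summarise`, and the example sessions, are hypothetical. It also assumes page refreshes have already been handled and ignores the entrance/exit flags and device split.

```python
from collections import Counter


def forward_journey(pages, desired_page, number_of_stages, first_hit=True):
    """Subset one session's ordered pages to the forward path from the desired page.

    Returns None if the session never visits `desired_page`.
    """
    hits = [i for i, page in enumerate(pages) if page == desired_page]
    if not hits:
        return None
    # Start from the first or the last hit, mirroring FIRST_HIT / LAST_HIT.
    start = hits[0] if first_hit else hits[-1]
    # Keep up to `number_of_stages` pages, including the desired page itself.
    return " >>> ".join(pages[start:start + number_of_stages])


def summarise(sessions, desired_page, number_of_stages, first_hit=True):
    """Count sessions per distinct subsetted journey, with proportions, in descending order."""
    journeys = [
        forward_journey(pages, desired_page, number_of_stages, first_hit)
        for pages in sessions.values()
    ]
    journeys = [journey for journey in journeys if journey is not None]
    total = len(journeys)
    return [
        (journey, count, count / total)
        for journey, count in Counter(journeys).most_common()
    ]


# Hypothetical toy sessions: session id -> ordered page paths.
sessions = {
    "s1": ["/", "/browse", "/vat", "/pay"],
    "s2": ["/browse", "/vat", "/pay", "/browse"],
    "s3": ["/browse", "/help"],
    "s4": ["/browse", "/vat", "/browse", "/tax"],
}

result = summarise(sessions, "/browse", number_of_stages=2, first_hit=True)
for journey, count, proportion in result:
    print(journey, count, proportion)
```

Passing `first_hit=False` starts each journey from the last visit to the desired page instead, so a session such as `s2` above, which ends on `/browse`, yields a single-page journey. This matches the note under `NUMBER_OF_STAGES` that journeys shorter than the requested number of stages are still included.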