# Page detector 

A functional graph is built from page hit session data to find the pages related to a Whole User Journey (WUJ). This approach does not rely on the existing knowledge graph. There are three main steps:

- Set variables
- Create functional network 
- Apply biased random walks

Requirements: 
- The functional network is based on user movement data, which is retrieved from Google BigQuery. Google BigQuery
  credentials are therefore required to run this notebook.

## Authentication and imports

In [None]:
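# Assumption: the notebook runs in Google Colab, which provides the `auth` module used below
from google.colab import auth

from src.utils.create_functional_network import (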
    create_networkx_graph,
    extract_nodes_and_edges,
    extract_seed_sessions,
    identify_seed_pages,
)

# Authenticate the user - follow the link and the prompts to get an authentication token
auth.authenticate_user()

## Set variables

`seed0_pages`: a list of pages that are known to be vital to the WUJ. `seed1_pages` are defined as the pages that `seed0_pages` hyperlink to, so it is logical to choose `seed0_pages` that contain hyperlinks to other pages, such as browse or topic pages.

`start_date`: the first date of the page hit session data used to build the functional graph. In the cell below it is given as a `YYYYMMDD` string inside a single-element list.

`end_date`: the last date of the page hit session data used to build the functional graph, in the same format as `start_date`.

In [None]:
seed0_pages = [
    "/guidance/travel-to-england-from-another-country-during-coronavirus-covid-19",
    "/email/subscriptions/single-page/new?topic_id=travel-to-england-from-another-country-during-coronavirus-covid-19",
    "/sign-in/callback",
    "/email/manage",
]

start_date = ["20210803"]

end_date = ["20210803"]
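Before querying, it can help to check that the two dates parse and are in the right order. The sketch below is not part of the project code: `check_date_range` is a hypothetical helper, and it assumes each variable is a single-element list holding a `YYYYMMDD` string, as the values above suggest.

```python
from datetime import datetime


def check_date_range(start_date, end_date, fmt="%Y%m%d"):
    """Hypothetical helper: parse the two date lists and check their order."""
    # Assumption: each argument is a one-item list holding a YYYYMMDD string
    start = datetime.strptime(start_date[0], fmt).date()
    end = datetime.strptime(end_date[0], fmt).date()
    if start > end:
        raise ValueError(f"start_date {start} is after end_date {end}")
    # Number of days covered, counting both endpoints
    return (end - start).days + 1


n_days = check_date_range(["20210803"], ["20210803"])
print(n_days)
```

With identical start and end dates, as in this notebook, the range covers a single day of session data.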

## Create functional network

The first step is to create the functional network. To do this, call four functions, in order:
- `identify_seed_pages(seed0_pages)`: scrapes the `seed0_pages` and creates a sparse topology matrix whose row entities are the `seed0_pages` and whose column entities are the pages that the `seed0_pages` hyperlink to. Returns the list of `seed1_pages` (i.e. the column entities).

- `extract_seed_sessions(start_date, end_date, seed0_pages, seed1_pages)`: using the list of `seed1_pages` returned by `identify_seed_pages()`, retrieves from Google BigQuery all page hits from sessions that visit at least one seed0 or seed1 page.

- `extract_nodes_and_edges(page_view_network)`: extracts nodes and edges from the functional network `page_view_network` returned by `extract_seed_sessions()`.

- `create_networkx_graph(nodes, edges)`: combines the `nodes` and `edges` from `extract_nodes_and_edges()` to create a NetworkX functional graph related to the set of seed pages.


In [None]:
seed1_pages = identify_seed_pages(seed0_pages)
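To make the topology matrix behind `identify_seed_pages()` concrete, here is a minimal sketch on toy data. It is not the implementation in `src.utils.create_functional_network`: the `links` mapping and page paths are hypothetical stand-ins for the scraped hyperlinks, and a nested list stands in for the sparse matrix.

```python
# Hypothetical scraped hyperlinks: seed0 page -> pages it links to (toy data)
links = {
    "/seed-a": {"/page-x", "/page-y"},
    "/seed-b": {"/page-x"},
}

rows = sorted(links)  # row entities: seed0 pages
cols = sorted(set().union(*links.values()))  # column entities: linked-to pages

# Dense view of the topology matrix: 1 where the row page links to the column page
matrix = [[int(col in links[row]) for col in cols] for row in rows]

# In this sketch, the seed1 pages are simply the column entities
toy_seed1_pages = cols
print(toy_seed1_pages)
for row, values in zip(rows, matrix):
    print(row, values)
```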

In [None]:
page_view_network = extract_seed_sessions(
    start_date, end_date, seed0_pages, seed1_pages
)
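The filtering rule that `extract_seed_sessions()` is described as applying can be illustrated in plain Python. This is only a sketch over hypothetical toy hits; the real function queries Google BigQuery, and its query and output format are not shown here.

```python
# Toy page hits as (session_id, page_path), in visit order (hypothetical data)
hits = [
    ("s1", "/seed-a"), ("s1", "/page-x"), ("s1", "/page-z"),
    ("s2", "/page-q"), ("s2", "/page-r"),
    ("s3", "/page-y"), ("s3", "/page-z"),
]
seed_pages = {"/seed-a", "/page-x", "/page-y"}  # seed0 and seed1 pages combined

# Sessions that visit at least one seed page
seed_sessions = {session for session, page in hits if page in seed_pages}

# Keep every hit from those sessions, including hits on non-seed pages
kept_hits = [(session, page) for session, page in hits if session in seed_sessions]
print(sorted(seed_sessions))
print(len(kept_hits))
```

Session `s2` never touches a seed page, so all of its hits are dropped, while the non-seed hit on `/page-z` in `s1` is kept because the session as a whole qualifies.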

In [None]:
nodes, edges = extract_nodes_and_edges(page_view_network)
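One plausible way to derive nodes and edges from session data is sketched below on toy input. The edge definition is an assumption, not something this notebook states: an edge here is a move between two consecutive pages in a session, weighted by how often that move occurs. The `sessions` data and the `toy_nodes` and `toy_edges` names are hypothetical.

```python
from collections import Counter

# Toy sessions: ordered page paths per session (hypothetical data)
sessions = {
    "s1": ["/seed-a", "/page-x", "/page-z"],
    "s3": ["/page-y", "/page-z"],
    "s4": ["/seed-a", "/page-x"],
}

# Assumption: an edge is a move between two consecutive pages in one session,
# weighted by the number of times that move occurs across all sessions
edge_weights = Counter()
for pages in sessions.values():
    edge_weights.update(zip(pages, pages[1:]))

toy_nodes = sorted({page for pages in sessions.values() for page in pages})
toy_edges = [(src, dst, weight) for (src, dst), weight in sorted(edge_weights.items())]
print(toy_nodes)
print(toy_edges)
```

Triples of `(source, target, weight)` are a shape that NetworkX accepts through `add_weighted_edges_from`, though whether `create_networkx_graph()` builds the graph that way is not confirmed by this notebook.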

In [None]:
G = create_networkx_graph(nodes, edges)