# Page detector 

This notebook builds a functional graph from page hit session data and uses it to find a list of pages related to a Whole User Journey (WUJ). The approach does not rely on the existing knowledge graph. There are three main steps:

- Set variables
- Create functional network 
- Apply biased random walks

### Requirements: 
- The functional network is based on user movement data, which is retrieved from Google BigQuery. Google BigQuery
  credentials are therefore required to run this notebook.

### Steps to run this script:


* Make sure you have permissions to use BigQuery
* Run steps 1, 2 and 3 and wait for all dependencies to be installed
* Restart the Runtime by going to `Runtime > Restart runtime`
* Run steps 1 and 2
* Proceed to the Page detection section
* Set variables
* Create functional network. This costs money each time you run it, because it makes use of BigQuery.
* Experiment with random walk community detection:
  * Experiment as much as you want. This bit doesn't cost money to run.
  * Set number of steps and repeats
  * See below for guidance on this 


### Guidance on setting steps and repeats parameters:

A random walk performs a set number of steps, where a step is one movement between pages. For example, `A -> B -> C -> B -> D` is a 4-step random walk.

If we performed this random walk again, we might get a different outcome, like `A -> C -> D -> E -> B`. Each run is a repeat, so together these two walks correspond to `steps = 4` and `repeats = 2`.

The results of all the random walks are then unioned, giving `{A, B, C, D, E}`.
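
The example above can be sketched in a few lines of plain Python. This is only an illustration on a made-up five-page graph, not the implementation in `src.utils.randomwalks`: the names `toy_graph`, `random_walk` and `toy_repeat_walks` are hypothetical, and each step simply picks a neighbouring page uniformly at random.

```python
import random

# Toy directed graph (hypothetical): page -> pages reachable in one movement
toy_graph = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["B", "D"],
    "D": ["E"],
    "E": ["B"],
}


def random_walk(graph, start, steps, rng):
    """Return the pages visited by one walk of `steps` movements from `start`."""
    path = [start]
    for _ in range(steps):
        neighbours = graph[path[-1]]
        if not neighbours:  # dead end: stop the walk early
            break
        path.append(rng.choice(neighbours))
    return path


def toy_repeat_walks(graph, start, steps, repeats, seed=0):
    """Run `repeats` independent walks and union the pages they visit."""
    rng = random.Random(seed)
    walks = [random_walk(graph, start, steps, rng) for _ in range(repeats)]
    visited = set().union(*walks)
    return walks, visited


walks, visited = toy_repeat_walks(toy_graph, "A", steps=4, repeats=2)
for walk in walks:
    print(" -> ".join(walk))
print(sorted(visited))
```

Each walk contains `steps + 1` pages (the start page plus one page per movement), and `visited` is the union of every page seen across all repeats.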

With this in mind, here is some guidance:

* If you're interested in retrieving a list of pages ranked by relevance to the WUJ, set `repeats` to a large value like 1000.
  * As the number of repeats increases, the rank of each page begins to stabilise towards a "true" position
  * Large values have the side effect of giving the random walk more opportunities to escape the space of pages related to the WUJ
  * When using a large number of repeats, consider increasing `n_jobs` to speed up processing. This tends to help when `repeats >= 200`, where the speed-up from parallelisation outweighs the overhead of initialising multi-core processing.

* The more `steps` you have, the more movements the random walk will make.

  Typically, if `steps` is set to a large value, e.g. over 100, then the random walk has the opportunity to travel far enough through GOV.UK to visit pages that aren't relevant to the WUJ. On the other hand, irrelevant pages will tend to be assigned a low rank and more steps will contribute to the stabilisation of the rankings.

Overall, `repeats = 1000`, `steps = 100` and `n_jobs = -1`, followed by keeping only the pages ranked in the top X% (for example, the top 5%), tends to work well.
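
As a rough sketch of the "top X%" cut-off, the snippet below keeps the highest-scoring pages from a dictionary of scores. The scores are made up and the helper `top_percent` is hypothetical. It does not reproduce the page-freq path-freq metric used later in this notebook and only shows the thresholding step.

```python
import math


def top_percent(page_scores, percent):
    """Return the highest-scoring pages, keeping the top `percent` of them.

    At least one page is kept whenever there are any pages to rank.
    """
    ranked = sorted(page_scores, key=page_scores.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]


# Hypothetical scores: 40 pages, where /page-i has a score of i
scores = {f"/page-{i}": float(i) for i in range(40)}
shortlist = top_percent(scores, percent=5)
print(shortlist)  # ['/page-39', '/page-38']
```

Rounding up with `math.ceil` is a design choice for this sketch, so that a small graph still returns at least one page.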


## Authentication and imports

### Step 1

In [None]:
!git clone -b sort-repo https://github.com/alphagov/govuk-wuj-network-analysis.git

fatal: destination path 'govuk-wuj-network-analysis' already exists and is not an empty directory.


### Step 2

In [None]:
cd govuk-wuj-network-analysis

/content/govuk-wuj-network-analysis


Run the Step 3 cell (`!pip install -r requirements.txt --ignore-installed`), wait for it to finish, then restart the runtime. After restarting, re-run Steps 1 and 2, skip Step 3, and continue from the Page detection section.

### Step 3

In [None]:
!pip install -r requirements.txt --ignore-installed

# Page detection

In [None]:
!pip install python-dotenv==0.19.2 --quiet

In [None]:
from google.colab import drive, auth
import os
import networkx as nx
from dotenv import load_dotenv
from src.utils.create_functional_network import (
    create_networkx_graph,
    extract_nodes_and_edges,
    extract_seed_sessions,
    identify_seed_pages,
)
import src.utils.randomwalks as rw

drive.mount('/content/gdrive/')

nb_path = '/content/gdrive/Shareddrives/GOV.UK teams/2021-2022/Data labs/14 Network analysis/Page Finding Tool'
load_dotenv(os.path.join(nb_path, '.env'))

# Authenticate the user - follow the link and the prompts to get an authentication token
auth.authenticate_user()

Mounted at /content/gdrive/


## Set variables

`seed0_pages`: a list of pages that are known to be vital to the WUJ. `seed1_pages` are the pages that `seed0_pages` hyperlink to, so it makes sense to choose `seed0_pages` that link out to other pages, such as browse or topic pages.

`start_date`: the functional graph is based on page hit session data. This is the start date for the session hit data.

`end_date`: the functional graph is based on page hit session data. This is the end date for the session hit data.
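
To make the relationship between `seed0_pages` and `seed1_pages` concrete, here is a minimal sketch that collects the links found on a single page. It uses Python's standard-library `html.parser` on a hard-coded HTML snippet rather than the scraping code in this repository. The class `LinkCollector`, the example HTML, and the choice to keep only site-relative paths are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that points to a site-relative path."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: keep only site-relative paths, and skip duplicates
        if href and href.startswith("/") and href not in self.links:
            self.links.append(href)


# Hypothetical HTML standing in for a scraped seed0 page
html = """
<p><a href="/browse/visas-immigration">Visas</a>
<a href="/foreign-travel-advice">Travel advice</a>
<a href="https://example.com/elsewhere">External</a>
<a href="/browse/visas-immigration">Visas again</a></p>
"""

collector = LinkCollector()
collector.feed(html)
seed1_candidates = collector.links
print(seed1_candidates)  # ['/browse/visas-immigration', '/foreign-travel-advice']
```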

In [None]:
seed0_pages = [
    "/guidance/travel-to-england-from-another-country-during-coronavirus-covid-19",
    "/email/subscriptions/single-page/new?topic_id=travel-to-england-from-another-country-during-coronavirus-covid-19",
    "/sign-in/callback",
    "/email/manage"
]

start_date = ["20210803"]

end_date = ["20210803"]

## Create functional network

The first step is to create the functional network. To do this, call four functions, in order: 
- `identify_seed_pages(seed0_pages)`: scrapes the `seed0_pages` and builds a sparse topology matrix whose rows are the `seed0_pages` and whose columns are the pages that the `seed0_pages` hyperlink to. Returns a list of `seed1_pages` (i.e. the column entities).

- `extract_seed_sessions(start_date, end_date, seed0_pages, seed1_pages)`: Using the list of `seed1_pages` from `identify_seed_pages()`, this function queries Google BigQuery for all page hits from sessions that visit at least one seed0 or seed1 page.

- `extract_nodes_and_edges(page_view_network)`: Extracts nodes and edges from the functional network `page_view_network` created via the function `extract_seed_sessions()`. 

- `create_networkx_graph(nodes, edges)`: Combines `nodes` and `edges` from `extract_nodes_and_edges()` to create a NetworkX functional graph related to a set of seed pages.
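
To give a feel for what "nodes and edges" mean in a functional network, the sketch below derives them from a handful of made-up sessions. It is not the logic in `extract_nodes_and_edges()`, which works on the BigQuery results. The `sessions` data and the names `node_sessions` and `edge_weights` are hypothetical.

```python
from collections import Counter

# Hypothetical sessions: each is the ordered list of pages hit in one visit
sessions = [
    ["/", "/search/all", "/email/manage"],
    ["/", "/email/manage"],
    ["/search/all", "/email/manage", "/"],
]

# Nodes: how many sessions visit each page (a page counts once per session)
node_sessions = Counter(page for session in sessions for page in set(session))

# Edges: how many times users moved directly from one page to the next
edge_weights = Counter(
    (source, target)
    for session in sessions
    for source, target in zip(session, session[1:])
)

print(node_sessions["/email/manage"])                  # 3
print(edge_weights[("/search/all", "/email/manage")])  # 2
```

Counts like these could then be attached to a NetworkX graph as node and edge attributes.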


In [None]:
seed1_pages = identify_seed_pages(seed0_pages)





In [None]:
page_view_network = extract_seed_sessions(
    start_date, end_date, seed0_pages, seed1_pages
)

In [None]:
(nodes, edges) = extract_nodes_and_edges(page_view_network)

In [None]:
G = create_networkx_graph(nodes, edges)

## Random walk community detection

In [None]:
# get G in the right format
G = rw.reformat_graph(G)

In [None]:
# get transition matrix of G
# T = rw.get_transition_matrix(G)

In [None]:
# use the unweighted adjacency matrix of G (weight=None gives every edge a weight of 1)
T = nx.adjacency_matrix(G, weight=None)

In [None]:
# execute random walks
results = rw.repeat_random_walks(
    steps=10,
    repeats=10,
    T=T,
    G=G,
    seed_pages=seed0_pages,
    proba=False,
    combine='union',
    level=1,
    n_jobs=1
    )

['/email/subscriptions/single-page/new?topic_id=travel-to-england-from-another-country-during-coronavirus-covid-19'] could not be found in the graph



In [None]:
# rank pages by page-freq path-freq metric
page_scores = rw.page_freq_path_freq_ranking(results)

In [None]:
ranked_pages = rw.add_additional_information(page_scores, G)
ranked_pages.head()

| | page path | document type | document supertype | number of sessions that visit this page | number of sessions where this page is an entrance hit | number of sessions where this page is an exit hit | number of sessions where this page is both an entrance and exit hit | all sessions that visit this page, regardless of the session visiting a seed page | how frequent the page occurs in the whole user journey |
|---|---|---|---|---|---|---|---|---|---|
| 0 | /guidance/travel-to-england-from-another-count... | detailed_guide | guidance and regulation | 65398 | 3395 | 11627 | 46485 | 62421 | 11.0 |
| 1 | /sign-in/callback | no value | other | 8 | 1 | 4 | 3 | 6 | 10.0 |
| 2 | /email/manage | no value | other | 4242 | 324 | 1498 | 1052 | 3837 | 10.0 |
| 3 | / | homepage | other | 111628 | 62311 | 8320 | 40997 | 391279 | 8.0 |
| 4 | /search/all | finder | other | 50883 | 1429 | 2073 | 47381 | 171950 | 5.0 |
