## Ranking output

The random walks currently output a single, unordered set of pages. End users will prefer a ranked list, and the ranking should tend to place pages from the target WUJ above pages outside that WUJ.

### Ranking by page frequency-random walk frequency

#### An example

The most successful random walk method has been to perform multiple random walks and combine the pages visited by each one into a single set of pages.

Each random walk traverses a path of pages. Since we perform multiple random walks, we have multiple paths. Some pages will appear on more paths than others, and some will appear more than once on a single path.
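
To make the idea of "multiple paths" concrete, here is a minimal sketch of repeated unbiased random walks on a toy graph. It is not the implementation in `randomwalks.py`: the graph, the function names `random_walk` and `repeat_walks`, and their parameters are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy graph as an adjacency list; page names are illustrative only.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}


def random_walk(graph, seed_page, steps, rng):
    """Return the path (list of pages) visited by one unbiased random walk."""
    path = [seed_page]
    current = seed_page
    for _ in range(steps):
        neighbours = graph[current]
        if not neighbours:  # dead end: stop the walk early
            break
        current = rng.choice(neighbours)  # every neighbour is equally likely
        path.append(current)
    return path


def repeat_walks(graph, seed_page, steps, repeats, rng_seed=0):
    """Run several walks from the same seed page and return one path per walk."""
    rng = random.Random(rng_seed)
    return [random_walk(graph, seed_page, steps, rng) for _ in range(repeats)]


paths = repeat_walks(graph, seed_page="A", steps=6, repeats=3)

# Combining the paths into one set gives the unranked output described above.
visited = set().union(*paths)
```

Keeping the individual paths, rather than only the combined set, is what makes it possible to count how often and on how many paths each page occurs.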

For example, suppose we perform two random walks that traverse the following paths:

- [A, C, D, C, X, Y, Z] 
- [A, C, B, D, Q, P, M]

Pages A, C and D are common to both paths. However, C occurs twice on the first path, which no other page does. Hence, C should be ranked first, followed by A and D in joint second. The remaining pages are equally ranked at the bottom.

#### Page frequency-path frequency

Inspired by the tf-idf ("term frequency-inverse document frequency") metric from NLP, we create the tf-df metric, "term frequency-document frequency". Translated into random walk parlance, this is "page frequency-path frequency", where "page frequency" is the number of occurrences of a given page on the path taken by a given random walk, and "path frequency" is the number of random walk paths on which a given page occurs at least once.

Mathematically,

The page frequency $\text{pf}(p,r)$ is the number of times page $p$ occurs on the path of a single random walk $r$,

$$\text{pf}(p,r) = f_{p,r}$$

where $f_{p,r}$ is the count of page $p$ on the path of random walk $r$.

The path frequency (written $\text{rwf}$, for "random walk frequency") measures how common a given page is across all the random walks performed, i.e. whether it occurs on many paths or only a few,

$$\text{rwf}(p,R) = |\{r \in R : p \in r\}|$$

where $R$ is the set of paths taken by all random walks and $|\{r \in R : p \in r\}|$ is the number of random walks on which the page $p$ occurs. For instance, in the example above, page C occurs on both random walk paths, so its path frequency is 2.
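
The two definitions above can be sketched directly in Python using the example paths. How the two quantities are combined into one score is not stated here, so `combined_score` below is an assumption made by analogy with tf-idf (a product of the two terms); it is not necessarily what `rw.page_freq_path_freq_ranking` computes.

```python
# The two example paths from the "An example" section.
paths = [
    ["A", "C", "D", "C", "X", "Y", "Z"],
    ["A", "C", "B", "D", "Q", "P", "M"],
]


def pf(page, path):
    """Page frequency: number of times `page` occurs on one path."""
    return path.count(page)


def rwf(page, paths):
    """Path frequency: number of paths on which `page` occurs at least once."""
    return sum(1 for path in paths if page in path)


def combined_score(page, paths):
    # ASSUMPTION: total page frequency over all paths multiplied by the
    # path frequency. This combination is illustrative only.
    return sum(pf(page, path) for path in paths) * rwf(page, paths)


pages = sorted(set().union(*paths))
scores = {page: combined_score(page, paths) for page in pages}

# Highest score first; ties are broken alphabetically for a stable order.
ranking = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Under this assumed combination, C scores 6, A and D each score 4, and every other page scores 1, which reproduces the ordering argued for in the example: C first, A and D joint second, and the remaining pages tied at the bottom.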

The cells below demonstrate this ranking system on the economic recovery WUJ.

In [8]:
import randomwalks as rw
import numpy as np
import pandas as pd
import networkx as nx

In [2]:
# er_pages is a list of pages known to be within the economic recovery WUJ
# this will be used to help evaluate the ranking system

er_pages = pd.read_excel('../../data/processed/2021-11-12 - Economic recovery pages.xlsx', sheet_name='Top pages').pagePathv2.to_list()

# get networkx graph
# note: nx.read_gpickle was removed in networkx 3.0, so this cell assumes networkx 2.x
G = nx.read_gpickle("../../data/processed/functional_session_hit_directed_graph_er.gpickle").to_undirected()

# reformat the graph to make it compliant with existing random walk functions
# i.e. add the path to a name property and set the index to be a number

for index,data in G.nodes(data=True):
    data['properties'] = dict()
    data['properties']['name'] = index


G = nx.convert_node_labels_to_integers(G, first_label=0, ordering='default', label_attribute=None)

# get the unweighted adjacency matrix of G
# (nx.adj_matrix was a deprecated alias, removed in networkx 3.0)
A = nx.adjacency_matrix(G, weight=None)

In [3]:
# set the seeds from where random walks will be initialised
seeds = (
    '/find-a-job',
    '/universal-credit',
    '/government/collections/financial-support-for-businesses-during-coronavirus-covid-19'
)

In [4]:
results = rw.repeat_random_walks(
    steps=100,
    repeats=100,
    T=A,
    G=G,
    seed_pages=seeds,
    proba=False,
    combine='union',
    level=1,
    n_jobs=1,
)

In [9]:
page_scores = rw.page_freq_path_freq_ranking(results)

In [10]:
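# flag pages known to belong to the economic recovery WUJ
page_scores['ER'] = page_scores.pagePath.isin(er_pages)
# highlight rows for ER pages; the same per-row styles are applied to every column
colour = page_scores.ER.map({True: 'background-color: black', False: ''})
page_scores.style.apply(lambda s: colour)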

| | pagePath | tfdf_saliency | tfdf_max | tfdf_mean | ER |
|---|---|---|---|---|---|
| 5060 | /search/all | 68121.0 | 261.0 | 227.07 | False |
| 3307 | / | 65025.0 | 255.0 | 216.75 | False |
| 4315 | /find-a-job | 41209.0 | 203.0 | 137.363333 | True |
| 2046 | /browse/working | 29584.0 | 172.0 | 98.613333 | False |
| 967 | /prove-right-to-work | 22201.0 | 149.0 | 74.003333 | False |
| 5010 | /browse/employing-people | 21025.0 | 145.0 | 70.083333 | False |
| 954 | /universal-credit | 19321.0 | 139.0 | 64.403333 | True |
| 1028 | /contact-jobcentre-plus | 16129.0 | 127.0 | 53.763333 | False |
| 4815 | /jobseekers-allowance | 12769.0 | 113.0 | 42.563333 | True |
| 4460 | /request-copy-criminal-record | 11664.0 | 108.0 | 38.88 | False |
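
One simple way to quantify whether the ranking tends to place WUJ pages near the top is precision at k: the fraction of the top k ranked pages that are known WUJ pages. This notebook does not itself compute that metric, so the sketch below is only a suggestion; the function name `precision_at_k` is hypothetical, and the flags are copied from the `ER` column of the ten rows shown above.

```python
# ER flags for the ten rows displayed above, in rank order.
er_flags = [False, False, True, False, False, False, True, False, True, False]


def precision_at_k(flags, k):
    """Fraction of the top-k ranked pages that belong to the target WUJ."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = flags[:k]
    return sum(top) / len(top)


p_at_3 = precision_at_k(er_flags, 3)
p_at_10 = precision_at_k(er_flags, 10)
```

For the rows displayed, one of the top three pages and three of the top ten are economic recovery pages. Generic hubs such as `/search/all` and `/` occupy the top two positions.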
