**Replication of** `page_view_approach_economic_recovery.ipynb` 

**Produces ranking based on the shortest path to (two) seed pages and degree centrality**


# Exploring the economic recovery whole user journey subgraph

This approach does not rely on the existing knowledge graph. Instead, it uses a functional graph built from page hit session data to find a list of pages related to the economic recovery whole user journey (WUJ).

ASSUMPTIONS: 
- A set of pre-defined pages have been removed from the economic recovery WUJ subgraph
- Any pages with 10 or fewer session hits are removed from the economic recovery WUJ subgraph
- Any pages with a shortest page path length equal to or greater than 3 from both `seed0` pages are removed from the economic recovery WUJ subgraph
- Any pages where accumulated edge weight is equal to or lower than 20 are removed from the economic recovery WUJ subgraph
- Any browse pages not related to the economic recovery WUJ are manually removed from the economic recovery WUJ subgraph

OUTPUT: 
- A CSV file listing the pages related to the economic recovery WUJ, sorted in ascending order of a composite rank made up of the shortest page path length from each `seed0` page and degree centrality.
  - Additional columns to filter on are included: `document type`, `document supertype`, `number of sessions that visit this page`, `number of sessions where this page is an entrance hit`, `number of sessions where this page is an exit hit`, `distance from /browse/working/finding-job`, `distance from /topic/further-education-skills/apprenticeships`, `the number of pages a user moves to/from between this page and another page in the list`

REQUIREMENTS: 
- Create the functional network by running `page_detector.ipynb`. Save `G` as a pickle file in `../../data/processed/functional_session_hit_directed_graph_er.gpickle`
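
`nx.read_gpickle` and `nx.write_gpickle`, which this notebook uses, were removed in NetworkX 3.0. A minimal sketch of the same save and load step with the standard `pickle` module is shown here. The helper names `save_graph` and `load_graph`, the file name and the toy graph are illustrative assumptions, not part of `page_detector.ipynb`.

```python
import pickle
import tempfile
from pathlib import Path

import networkx as nx


def save_graph(graph, path):
    """Pickle a NetworkX graph to `path` (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(graph, f, pickle.HIGHEST_PROTOCOL)


def load_graph(path):
    """Load a pickled NetworkX graph from `path` (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# toy graph standing in for the functional session hit graph
toy = nx.DiGraph()
toy.add_node("/find-a-job", sessionHitsAll=5)
toy.add_edge("/find-a-job", "/universal-credit", edgeWeight=3)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_graph.gpickle"
    save_graph(toy, path)
    restored = load_graph(path)

print(restored)
```

As with any pickle file, only load graphs from a source you trust.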

## Import statements  

In [1]:
import os
from collections import Counter, defaultdict
from operator import itemgetter

# import gspread
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
from neo4j import GraphDatabase
from oauth2client.service_account import ServiceAccountCredentials
from tqdm.notebook import tqdm

## Functions for coercing knowledge graph into NetworkX

In [2]:
def getSubgraph(q, parameters=None):

    '''
    Given a Cypher query q, this function queries the knowledge graph,
    returns the nodes and edges from this query, and uses them to construct
    a networkx graph.

    E.g. getSubgraph(r'MATCH (u:Cid)-[r:HYPERLINKS_TO]->(v:Cid) RETURN *')
         returns the structural graph.

    Optionally, can add in parameters (dictionary), allowing Python variables
    to be integrated into the Cypher query q.

    E.g.
        parameters = {}
        parameters['pages'] = ['a','list','of','stuff']
        q7 = f"""
        MATCH (u:Cid)-[r]-(v:Cid)
        WHERE u.name IN $pages AND v.name in $pages
        RETURN *
        """

        g7 = getSubgraph(q7, parameters)
    '''

    # get credentials
    # add to .secrets: export KG_PWD="<PASSWORD>"
    KG_PWD = os.getenv("KG_PWD")

    # create connection to knowledge graph
    driver = GraphDatabase.driver(
        "bolt+s://knowledge-graph.integration.govuk.digital:7687",
        auth=("neo4j", KG_PWD),
    )

    # run query on knowledge graph
    results = driver.session().run(q, parameters)

    # create networkx graph object
    G = nx.MultiDiGraph()

    # add nodes into networkx graph object
    nodes = list(results.graph()._nodes.values())
    print("Adding nodes\n")
    for node in tqdm(nodes):
        G.add_node(node.id, labels=node._labels, properties=node._properties)

    # add edges into networkx graph object
    rels = list(results.graph()._relationships.values())
    print("Adding edges\n")
    for rel in tqdm(rels):
        G.add_edge(
            rel.start_node.id,
            rel.end_node.id,
            key=rel.id,
            type=rel.type,
            properties=rel._properties,
        )

    return G


def showGraph(g):
    """
    Given a networkx graph g, this function visualises the graph.
    Do not use for a large g.
    """
    print(nx.info(g))
    nx.draw(g)

In [3]:
def getNoOfTruePages(g, economic_pages):
    """
    Calculate a proxy recall metric for the list of pages identified in a
    subgraph (when compared to the ground truth for the economic recovery pages).
    The output is the number of pages in the subgraph list that are present in
    the ground truth list.
    """

    # page paths (node names) in the subgraph
    subgraph_list = [node[0] for node in g.nodes(data=True)]

    # set up the ground truth list
    true_list = list(economic_pages)

    # how many pages are in the subgraph list that are also in the ground truth list
    return (len(true_list)) - (
        len([node for node in true_list if node not in subgraph_list])
    )

In [4]:
def getPagesNotInSubGraph(g):
    """
    Return a list of pages which are in the economic recovery pre-defined list
    (41 in total), but are not in the filtered subgraph list.
    Relies on the global variable `economic_pages` being defined.
    """

    # list of pages manually defined in the economic recovery whole user journey
    true_list = list(economic_pages)

    # list of pages in the filtered subgraph list
    subgraph_list = [node[0] for node in g.nodes(data=True)]

    # the list of pages in the manually defined list not in the subgraph list
    return [node for node in true_list if node not in subgraph_list]

## Pre-defined economic recovery subgraph

The pages (nodes) in the manually defined list for economic recovery (41 pages in total)

In [5]:
# Load worksheet of seed pages
er_seed_pages_ws = pd.read_excel("./2021-11-12 - Economic recovery pages.xlsx", sheet_name="Top pages")
er_seed_pages_ws.shape

FileNotFoundError: [Errno 2] No such file or directory: './2021-11-12 - Economic recovery pages.xlsx'

In [None]:
er_seed_pages_table = er_seed_pages_ws.values.reshape(-1, 1)
er_seed_pages_table.shape

In [10]:
economics_pages_df = pd.DataFrame(er_seed_pages_table).dropna() 

## Exploring the economic recovery functional network

- 10631 nodes and 79985 edges
- The page `/guidance/international-trade-products-and-schemes` which is in the manually defined graph, is **not** in this subgraph. This seems to be an old page (updated 2013) related to exporting agricultural products to the EU
- The distribution of session hits shows that only 25% of the page paths have more than 8 session hits. The maximum number of session hits for a page path is 71667:
    - Min: 1
    - Max: 71667
    - 25th percentile: 1
    - 50th percentile: 2
    - 75th percentile: 8
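
A summary like the one above can be derived from the `sessionHitsAll` node attribute. The following is a minimal sketch on a toy graph with made-up page names and hit counts; in the notebook the same steps would be applied to `g`.

```python
import networkx as nx
import pandas as pd

# toy graph with hypothetical pages and session hit counts
toy = nx.DiGraph()
for page, hits in [("/a", 1), ("/b", 1), ("/c", 2), ("/d", 8), ("/e", 100)]:
    toy.add_node(page, sessionHitsAll=hits)

# one value per page path
session_hits = pd.Series(
    dict(toy.nodes(data="sessionHitsAll")), name="sessionHitsAll"
)

# minimum, quartiles and maximum of the session hit distribution
summary = session_hits.describe()[["min", "25%", "50%", "75%", "max"]]
print(summary)
```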


In [11]:
# import the NetworkX graph object
g = nx.read_gpickle(
#     "../../data/processed/functional_session_hit_directed_graph_er.gpickle"
    "./functional_session_hit_directed_graph_er.gpickle"
)

# number of nodes, number of edges
nx.info(g)



'DiGraph with 10631 nodes and 79985 edges'

In [12]:
# how many of the 'true' economic recovery pages are in the functional graph?
economic_pages = economics_pages_df.values.reshape(-1,).tolist()

getNoOfTruePages(g, economic_pages)

88

In [13]:
# which pages are in the economic recovery list but not in the subgraph list?
true_list = list(economic_pages)
subgraph_list = [node[0] for node in g.nodes(data=True)]
[node for node in true_list if node not in subgraph_list]

['/guidance/international-trade-products-and-schemes',
 '/guidance/international-trade-products-and-schemes',
 '/campaigns/internationalisation-fund-for-english-businesses/']

In [14]:
# list of nodes
nodes_list = list(g.nodes(data=True))

### Remove pages where edge weight is <= 20


Edge weight increases by 1 each time a user session moves from page A to page B. Therefore, if the edge weight is small, we assume the two pages are unlikely to be associated with one another, because we assume that users visit similar pages within the same WUJ during the same session. 
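
To make the edge weight definition concrete, here is a minimal sketch of how such weights could accumulate. It assumes each session is an ordered list of page paths and that every consecutive pair of pages counts as one movement; the sessions are invented, and the actual construction in `page_detector.ipynb` may differ.

```python
import networkx as nx

# hypothetical sessions: each is the ordered list of pages one session visited
sessions = [
    ["/browse/working/finding-job", "/find-a-job", "/universal-credit"],
    ["/browse/working/finding-job", "/find-a-job"],
    ["/find-a-job", "/check-mot-status"],
]

toy = nx.DiGraph()
for pages in sessions:
    # consecutive pairs (page A -> page B) within one session
    for page_a, page_b in zip(pages, pages[1:]):
        if toy.has_edge(page_a, page_b):
            toy[page_a][page_b]["edgeWeight"] += 1
        else:
            toy.add_edge(page_a, page_b, edgeWeight=1)

print(toy["/browse/working/finding-job"]["/find-a-job"]["edgeWeight"])
```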

Edge weight is equal to or less than 10:
- 1473 nodes
- 37/41 
- Pages not in the subgraph list: [`/guidance/recovery-loan-scheme`, 
`/find-driving-instructor-training`, 
`/guidance/international-trade-products-and-schemes`,
`/how-to-claim-universal-credit`]

Edge weight is equal to or less than 20:
- 1253 nodes
- 34/41 
- Pages not in the subgraph list: [`/guidance/international-trade-products-and-schemes`,
 `/guidance/recovery-loan-scheme`,
 `/find-driving-instructor-training`,
 `/how-to-claim-universal-credit`,
 `/government/collections/financial-support-for-businesses-during-coronavirus-covid-19`,
 `/agricultural-skills-and-training`,
 `/guidance/claim-back-statutory-sick-pay-paid-to-employees-due-to-coronavirus-covid-19`]

Edge weight is equal to or less than 30:
- 1042 nodes
- 32/41 
- Pages not in the subgraph list: [`/guidance/international-trade-products-and-schemes`,
 `/guidance/recovery-loan-scheme`,
 `/find-driving-instructor-training`,
 `/how-to-claim-universal-credit`,
 `/government/collections/financial-support-for-businesses-during-coronavirus-covid-19`,
 `/agricultural-skills-and-training`,
 `/guidance/claim-back-statutory-sick-pay-paid-to-employees-due-to-coronavirus-covid-19`]
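
The code that produced the threshold comparison above is not shown in this notebook. One plausible way to run such a sweep, mirroring the edge weight filter used later, is sketched here. The function name `pages_kept`, the toy graph and the toy ground truth list are assumptions for illustration.

```python
from collections import defaultdict

import networkx as nx


def pages_kept(graph, truth, threshold):
    """Return (nodes kept, ground-truth pages kept) after dropping pages whose
    summed outgoing edge weight is <= threshold (hypothetical helper)."""
    out_weight = defaultdict(float)
    for source, _target, weight in graph.edges(data="edgeWeight"):
        out_weight[source] += weight

    # work on a copy so the original graph is left untouched
    pruned = graph.copy()
    pruned.remove_nodes_from([n for n, w in out_weight.items() if w <= threshold])
    return pruned.number_of_nodes(), sum(page in pruned for page in set(truth))


toy = nx.DiGraph()
toy.add_edge("/find-a-job", "/universal-credit", edgeWeight=40)
toy.add_edge("/universal-credit", "/find-a-job", edgeWeight=25)
toy.add_edge("/check-mot-status", "/find-a-job", edgeWeight=15)
toy_truth = ["/find-a-job", "/universal-credit"]

for threshold in (10, 20, 30):
    print(threshold, pages_kept(toy, toy_truth, threshold))
```

Copying the graph for each threshold keeps the runs independent, so the thresholds can be compared side by side.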

### Save the subgraph list of pages that are related to the economic recovery whole user journey

From the above exploration, run this code to get a final subgraph related to the economic recovery whole user journey

In [38]:
# import the NetworkX graph object
g = nx.read_gpickle(
    "./functional_session_hit_directed_graph_er.gpickle"
)

In [40]:
nx.info(g)



'DiGraph with 10631 nodes and 79985 edges'

In [41]:
# remove irrelevant nodes
g.remove_nodes_from(
    [
        "/prove-right-to-work",
        "/",
        "/request-copy-criminal-record",
        "/browse/working/state-pension",
        "/search/all",
        "/search",
        "/brexit",
        "/coronavirus",
        "/report-covid19-result",
    ]
)

In [42]:
nx.info(g)



'DiGraph with 10622 nodes and 67524 edges'

In [43]:
# remove pages with 10 or fewer session hits
remove = [
    node for node, session in g.nodes(data=True) if session["sessionHitsAll"] <= 10
]
g.remove_nodes_from(remove)

In [44]:
# remove pages that have a shortest page path length of >= 3 from both `seed0` pages
shortest_paths_seed0_1 = nx.shortest_path_length(
    g, source="/browse/working/finding-job"
)
shortest_paths_seed0_2 = nx.shortest_path_length(
    g, source="/topic/further-education-skills/apprenticeships"
)

finding_job = {
    k: v
    for k, v in sorted(
        shortest_paths_seed0_1.items(), key=lambda item: item[1], reverse=True
    )
}
apprentice = {
    k: v
    for k, v in sorted(
        shortest_paths_seed0_2.items(), key=lambda item: item[1], reverse=True
    )
}

finding_job_filtered = [node for node, length in finding_job.items() if length >= 3]
apprentice_filtered = [node for node, length in apprentice.items() if length >= 3]

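# NOTE: as written, this removes pages that are >= 3 steps from `/browse/working/finding-job`
# but are not >= 3 steps from the apprenticeships seed; filtering with `in` rather than
# `not in` would match the "both" rule stated above (and would change the counts below)
remove = [node for node in finding_job_filtered if node not in apprentice_filtered]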
g.remove_nodes_from(remove)
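
The assumptions at the top of this notebook describe removing pages that are three or more steps from both seed pages. A minimal sketch of that rule, using a set intersection on a toy graph with hypothetical page names, looks like this. Pages that cannot be reached from a seed do not appear in the dictionary returned by `nx.shortest_path_length`, so they are never counted as far away here.

```python
import networkx as nx

toy = nx.DiGraph()
nx.add_path(toy, ["/seed-a", "/a1", "/a2", "/far"])  # /far is 3 steps from /seed-a
nx.add_path(toy, ["/seed-b", "/b1", "/b2", "/far"])  # /far is 3 steps from /seed-b
nx.add_path(toy, ["/seed-a", "/x1", "/x2", "/near-b"])  # 3 steps from /seed-a ...
toy.add_edge("/seed-b", "/near-b")  # ... but 1 step from /seed-b

dist_a = nx.shortest_path_length(toy, source="/seed-a")
dist_b = nx.shortest_path_length(toy, source="/seed-b")

far_from_a = {node for node, length in dist_a.items() if length >= 3}
far_from_b = {node for node, length in dist_b.items() if length >= 3}

# pages that are >= 3 steps from BOTH seeds
far_from_both = far_from_a & far_from_b
print(sorted(far_from_both))
```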

In [45]:
# remove pages where accumulated outgoing edge weight is equal to or lower than 20
user_movement_weights = []
for node1, node2, edgeWeight in g.edges(data=True):
    case = {"node1": node1, "node2": node2, "edgeWeight": edgeWeight["edgeWeight"]}
    user_movement_weights.append(case)

user_movements_sum = defaultdict(float)

for info in user_movement_weights:
    user_movements_sum[info["node1"]] += info["edgeWeight"]

user_movements_sum = [
    {"node1": node1, "edgeWeight": user_movements_sum[node1]}
    for node1 in user_movements_sum
]

remove = [node["node1"] for node in user_movements_sum if node["edgeWeight"] <= 20]
g.remove_nodes_from(remove)

In [46]:
# remove certain browse pages. Browse pages seem to score highly (when ranking, below),
# regardless of whether they are relevant to the WUJ or not
browse_pages_remove = (
    "/browse/employing-people",
    "/browse/births-deaths-marriages",
    "/browse/citizenship",
    "/browse/driving",
    "/browse/education",
    "/browse/business",
    "/browse/childcare-parenting",
    "/browse/justice",
    "/browse/abroad",
    "/browse/tax",
    "/browse/visas-immigration",
    "/browse/disabilities",
    "/browse/environment-countryside",
    "/browse/housing-local-services",
    "/browse/benefits",
)

g.remove_nodes_from(browse_pages_remove)

In [47]:
nx.info(g)



'DiGraph with 1338 nodes and 31143 edges'

In [49]:
# how many of the economic recovery pages are in the functional graph?
getNoOfTruePages(g, economic_pages)

77

In [50]:
# which pages are in the economic recovery list but not in the subgraph list?
getPagesNotInSubGraph(g)

['/guidance/international-trade-products-and-schemes',
 '/guidance/international-trade-products-and-schemes',
 '/campaigns/internationalisation-fund-for-english-businesses/',
 '/guidance/recovery-loan-scheme',
 '/guidance/recovery-loan-scheme',
 '/how-to-claim-universal-credit',
 '/how-to-claim-universal-credit',
 '/browse/education',
 '/browse/education',
 '/agricultural-skills-and-training',
 '/education/further-and-higher-education-courses-and-qualifications',
 '/agricultural-skills-and-training',
 '/find-driving-instructor-training',
 '/find-driving-instructor-training']

In [51]:
# final subgraph list related to the economic recovery whole user journey
economic_recovery_subgraph = list(g.nodes(data=True))

In [None]:
# save subgraph to file
# nx.write_gpickle(
#     g, "../../data/processed/functional_session_hit_directed_graph_er_final.gpickle"
# )

### Explore communities of the subgraph

Look at adjacent neighbours to identify communities within the WUJ. While some pages within a community seem relevant to one another (e.g. `/find-a-job` and `/contact-jobcentre-plus`), others seem irrelevant (e.g. `/find-a-job` and `/check-mot-status`)

In [52]:
# create a dict of all nodes and their neighbours
dict_of_nodes_and_edges = dict()
for node in g.nodes():
    dict_of_nodes_and_edges[node] = list(nx.neighbors(g, node))

# sort dict based on the length of its values (i.e. top result has the largest number of neighbours)
sorted(dict_of_nodes_and_edges.items(), key=lambda x: len(x[1]), reverse=True)

[('/find-a-job',
  ['/sign-in-universal-credit',
   '/browse/working/finding-job',
   '/advertise-job',
   '/contact-jobcentre-plus',
   '/universal-credit',
   '/apply-apprenticeship',
   '/jobseekers-allowance',
   '/browse/working',
   '/browse/benefits/looking-for-work',
   '/coronavirus/worker-support',
   '/government/organisations/department-for-work-pensions',
   '/government/organisations/department-for-work-pensions/about/recruitment',
   '/find-internship',
   '/chwilio-am-swydd',
   '/national-minimum-wage-rates',
   '/guidance/free-courses-for-jobs',
   '/personal-tax-account',
   '/log-in-register-hmrc-online-services',
   '/skilled-worker-visa',
   '/government/organisations/hm-revenue-customs/about/recruitment',
   '/find-traineeship',
   '/government/collections/kickstart-scheme',
   '/benefits-calculators',
   '/find-coronavirus-support',
   '/check-mot-history',
   '/government/organisations/hm-revenue-customs',
   '/government/organisations',
   '/employment/finding
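
On a directed graph, `nx.neighbors` follows outgoing edges only, so the listing above contains the pages users moved to, not the pages they arrived from. A small sketch on a toy graph (page names chosen for illustration) shows how both directions could be collected if that is wanted.

```python
import networkx as nx

toy = nx.DiGraph()
toy.add_edge("/find-a-job", "/universal-credit")
toy.add_edge("/browse/working", "/find-a-job")

# nx.neighbors on a DiGraph returns successors (outgoing edges) only
outgoing = set(nx.neighbors(toy, "/find-a-job"))

# pages users arrived from
incoming = set(toy.predecessors("/find-a-job"))

# neighbours in either direction
either_direction = outgoing | incoming
print(outgoing, either_direction)
```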

## Ranking the subgraph

We now have a subgraph of 1338 nodes. How do we rank these pages so that the pages most 'relevant' to a WUJ are at the top, while the most 'irrelevant' pages are at the bottom? 

Explored:
- session hits 
- shortest page path lengths 
- sum of the shortest page path lengths
- betweenness centrality
- degree centrality 
- closeness centrality 

An average (non-weighted) rank for each page path was created using:
- shortest page path length from seed0.1: `shortest_seed0_1`
- shortest page path length from seed0.2: `shortest_seed0_2`
- degree centrality: `g_degree`

These three metrics were chosen on the assumption that nodes close to the seed0 nodes are more likely to be associated with the same topic, and therefore the same WUJ. In addition, nodes which are more highly connected (e.g. more outgoing/incoming relationships) should be included in the WUJ, as it is expected that the majority of the pages in the subgraph are relevant.   
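
As a worked illustration of the non-weighted average rank, the sketch below combines three toy orderings (best first). The page names and orderings are made up, and one page is deliberately missing from one ordering, as happens when there is no path from a seed page.

```python
import pandas as pd

# hypothetical orderings, best first; "/c" has no path from seed 2, so it is absent there
by_seed_1 = ["/a", "/b", "/c"]
by_seed_2 = ["/b", "/a"]
by_degree = ["/b", "/c", "/a"]

# one (position, page) record per appearance of a page in an ordering
records = [
    (position, page)
    for ordering in (by_seed_1, by_seed_2, by_degree)
    for position, page in enumerate(ordering)
]

average_rank = (
    pd.DataFrame.from_records(records, columns=["rank", "page"])
    .groupby("page")["rank"]
    .mean()
    .sort_values()
)
print(average_rank)
```

A page that is missing from one ordering is averaged over the orderings it does appear in. Pages with the same path length still receive different positions, which depend on the order in which they are listed.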

The other metrics were not chosen because: 
- session hits: pages with low session hits may be relevant to a WUJ
- sum of the shortest page path lengths: pages may be closer to one seed0 node, but far away from the other seed0 node
- betweenness centrality: pages relevant to a WUJ may not necessarily have many paths that pass through them. It is likely that more 'popular' pages will (i.e. ones with higher session hits)


Caveat: browse pages are likely to be ranked highly, as these are common pages which may be highly connected to other nodes. Therefore, irrelevant browse pages are removed. 


### Order by session hits

In [53]:
sessions = list(
    sorted(
        economic_recovery_subgraph, key=lambda x: x[1]["sessionHitsAll"], reverse=True
    )
)

### Order by shortest page path lengths

In [54]:
# calculate the shortest page path lengths

# for each of the two seed pages, calculate the shortest path length from the seed page to every reachable page
shortest_paths_seed0 = nx.shortest_path_length(g, source="/browse/working/finding-job")
shortest_paths_seed1 = nx.shortest_path_length(
    g, source="/topic/further-education-skills/apprenticeships"
)

In [55]:
shortest_paths_seed0

{'/browse/working/finding-job': 0,
 '/dbs-update-service': 1,
 '/government/collections/sponsorship-information-for-employers-and-educators': 1,
 '/view-right-to-work': 1,
 '/access-to-work/after-you-have-applied': 1,
 '/employment-status': 1,
 '/guidance/dbs-check-requests-guidance-for-employers': 1,
 '/what-different-qualification-levels-mean': 1,
 '/government/publications/help-and-support-for-older-workers/help-and-support-for-older-workers': 1,
 '/become-apprentice/apply-for-an-apprenticeship': 1,
 '/browse/working/redundancies-dismissals': 1,
 '/improve-english-maths-it-skills': 1,
 '/find-internship': 1,
 '/contact-ukvi-inside-outside-uk': 1,
 '/coronavirus/worker-support': 1,
 '/access-to-work': 1,
 '/apply-universal-credit': 1,
 '/work-reference': 1,
 '/job-offers-your-rights': 1,
 '/jobseekers-allowance': 1,
 '/find-a-job': 1,
 '/browse/working/time-off': 1,
 '/government/publications/skilled-worker-visa-shortage-occupations/skilled-worker-visa-shortage-occupations': 1,
 '/to

In [56]:
# seed0.1: shortest_paths_seed0.1: '/browse/working/finding-job'
shortest_seed0_1 = {
    k: v
    for k, v in sorted(
        shortest_paths_seed0.items(), key=lambda item: item[1], reverse=False
    )
}

In [57]:
shortest_seed0_1

{'/browse/working/finding-job': 0,
 '/dbs-update-service': 1,
 '/government/collections/sponsorship-information-for-employers-and-educators': 1,
 '/view-right-to-work': 1,
 '/access-to-work/after-you-have-applied': 1,
 '/employment-status': 1,
 '/guidance/dbs-check-requests-guidance-for-employers': 1,
 '/what-different-qualification-levels-mean': 1,
 '/government/publications/help-and-support-for-older-workers/help-and-support-for-older-workers': 1,
 '/become-apprentice/apply-for-an-apprenticeship': 1,
 '/browse/working/redundancies-dismissals': 1,
 '/improve-english-maths-it-skills': 1,
 '/find-internship': 1,
 '/contact-ukvi-inside-outside-uk': 1,
 '/coronavirus/worker-support': 1,
 '/access-to-work': 1,
 '/apply-universal-credit': 1,
 '/work-reference': 1,
 '/job-offers-your-rights': 1,
 '/jobseekers-allowance': 1,
 '/find-a-job': 1,
 '/browse/working/time-off': 1,
 '/government/publications/skilled-worker-visa-shortage-occupations/skilled-worker-visa-shortage-occupations': 1,
 '/to

In [58]:
# seed0.2: shortest_paths_seed0.2: '/topic/further-education-skills/apprenticeships'
shortest_seed0_2 = {
    k: v
    for k, v in sorted(
        shortest_paths_seed1.items(), key=lambda item: item[1], reverse=False
    )
}

### Order by the sum of the shortest page path lengths

In [59]:
combined_paths = {
    k: finding_job[k] + apprentice[k] for k in set(finding_job) & set(apprentice)
}
combined_paths = {
    k: v
    for k, v in sorted(combined_paths.items(), key=lambda item: item[1], reverse=True)
}
combined_paths

{'/government/publications/lrs-help-and-support/lrs-help-and-support': 8,
 '/employment-tribunals/going-to-a-tribunal-hearing': 8,
 '/government/publications/self-assessment-register-for-self-assessment-and-get-a-tax-return-sa1': 8,
 '/adoption-pay-leave/leave': 7,
 '/adoption-pay-leave/how-to-claim': 7,
 '/government/publications/ordinary-statutory-paternity-pay-and-leave-becoming-a-birth-parent-sc3': 7,
 '/employment-tribunals/after-you-make-a-claim': 7,
 '/trusts-taxes': 7,
 '/government/publications/filtering-rules-for-criminal-record-check-certificates/filtering-rules-for-dbs-certificates-criminal-record-checks': 7,
 '/government/publications/lrs-maintenance-schedule': 7,
 '/raise-grievance-at-work/grievance-procedure': 7,
 '/government/publications/dbs-certificate-disputes-and-fingerprint-consent-forms-and-guidance-af14-af15/dbs-certificate-disputes-and-fingerprint-consent-guidance': 7,
 '/guidance/veterans-uk-armed-forces-pensions-forms': 7,
 '/buy-sell-your-home': 7,
 '/funeral

### Order by centrality metrics

In [60]:
# degree centrality: counts the number of incoming and outgoing relationships from a node - 'most connected'
# the higher the degree, the more central the node is
g_degree = nx.degree_centrality(g)
g_degree = dict(sorted(g_degree.items(), key=itemgetter(1), reverse=True))
g_degree

{'/browse/working': 0.6694091249065071,
 '/find-a-job': 0.5916230366492147,
 '/contact-jobcentre-plus': 0.531787584143605,
 '/jobseekers-allowance': 0.4599850411368736,
 '/apply-apprenticeship': 0.3807030665669409,
 '/sign-in-universal-credit': 0.3096484667165295,
 '/view-prove-immigration-status': 0.3006731488406881,
 '/government/organisations/department-for-work-pensions': 0.3006731488406881,
 '/benefits-calculators': 0.2655198204936425,
 '/universal-credit': 0.262528047868362,
 '/log-in-register-hmrc-online-services': 0.25280478683620045,
 '/state-pension-age': 0.2281226626776365,
 '/access-to-work': 0.2131637995512341,
 '/apply-national-insurance-number': 0.19371727748691098,
 '/search/services': 0.1869857890800299,
 '/national-minimum-wage-rates': 0.18548990276738966,
 '/contact-pension-service': 0.18399401645474944,
 '/browse/working/finding-job': 0.17726252804786835,
 '/contact-ukvi-inside-outside-uk': 0.17651458489154823,
 '/government/organisations': 0.17352281226626776,
 '/c

In [61]:
nx.degree_centrality(g)

{'/view-prove-immigration-status': 0.3006731488406881,
 '/browse/working': 0.6694091249065071,
 '/find-a-job': 0.5916230366492147,
 '/check-state-pension': 0.16454749439042632,
 '/check-state-pension/sign-in/prove-identity': 0.137621540762902,
 '/jobseekers-allowance': 0.4599850411368736,
 '/jobseekers-allowance/eligibility': 0.15033657442034404,
 '/view-right-to-work': 0.14659685863874344,
 '/contact-jobcentre-plus': 0.531787584143605,
 '/government/organisations/disclosure-and-barring-service': 0.137621540762902,
 '/topic/further-education-skills/apprenticeships': 0.1630516080777861,
 '/apply-apprenticeship': 0.3807030665669409,
 '/browse/working/finding-job': 0.17726252804786835,
 '/jobseekers-allowance/apply-new-style-jsa': 0.15856394913986538,
 '/dbs-update-service': 0.12041884816753927,
 '/criminal-record-checks-apply-role': 0.1331338818249813,
 '/new-state-pension': 0.12864622288706057,
 '/government/organisations/department-for-work-pensions': 0.3006731488406881,
 '/apply-natio
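
To show what the degree centrality values represent, this sketch recomputes them by hand on a toy directed graph: each node's in-degree plus out-degree, divided by the number of other nodes. The toy edges are illustrative only.

```python
import networkx as nx

toy = nx.DiGraph()
toy.add_edges_from(
    [
        ("/browse/working", "/find-a-job"),
        ("/find-a-job", "/browse/working"),
        ("/find-a-job", "/universal-credit"),
    ]
)

n = toy.number_of_nodes()
# in-degree + out-degree, divided by the number of other nodes (n - 1)
manual = {node: degree / (n - 1) for node, degree in toy.degree()}
library = nx.degree_centrality(toy)

print(manual)
print(library)
```

Because incoming and outgoing edges are both counted, values on a directed graph can exceed 1, as `/find-a-job` does in this toy example.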

### Overall rank from multiple ranked items

Choose a final ranking method. In this example, `shortest_seed0_1`, `shortest_seed0_2`, `g_degree` have been used.

In [63]:
# create lists in order
sessions_list = [(index, element[0]) for index, element in enumerate(sessions)]
shortest_seed0_1_list = [
    (index, element) for index, element in enumerate(shortest_seed0_1)
]
shortest_seed0_2_list = [
    (index, element) for index, element in enumerate(shortest_seed0_2)
]
combined_paths_list = [(index, element) for index, element in enumerate(combined_paths)]
# g_between_list = [(index, element) for index, element in enumerate(g_between)]
g_degree_list = [(index, element) for index, element in enumerate(g_degree)]
# g_closeness_list = [(index, element) for index, element in enumerate(g_closeness)]

In [64]:
shortest_seed0_1_list 

[(0, '/browse/working/finding-job'),
 (1, '/dbs-update-service'),
 (2,
  '/government/collections/sponsorship-information-for-employers-and-educators'),
 (3, '/view-right-to-work'),
 (4, '/access-to-work/after-you-have-applied'),
 (5, '/employment-status'),
 (6, '/guidance/dbs-check-requests-guidance-for-employers'),
 (7, '/what-different-qualification-levels-mean'),
 (8,
  '/government/publications/help-and-support-for-older-workers/help-and-support-for-older-workers'),
 (9, '/become-apprentice/apply-for-an-apprenticeship'),
 (10, '/browse/working/redundancies-dismissals'),
 (11, '/improve-english-maths-it-skills'),
 (12, '/find-internship'),
 (13, '/contact-ukvi-inside-outside-uk'),
 (14, '/coronavirus/worker-support'),
 (15, '/access-to-work'),
 (16, '/apply-universal-credit'),
 (17, '/work-reference'),
 (18, '/job-offers-your-rights'),
 (19, '/jobseekers-allowance'),
 (20, '/find-a-job'),
 (21, '/browse/working/time-off'),
 (22,
  '/government/publications/skilled-worker-visa-short

In [65]:
len(shortest_seed0_1_list + shortest_seed0_2_list + g_degree_list)

4010

In [66]:
pd.DataFrame.from_records(
        shortest_seed0_1_list + shortest_seed0_2_list + g_degree_list
    )

Unnamed: 0,0,1
0,0,/browse/working/finding-job
1,1,/dbs-update-service
2,2,/government/collections/sponsorship-informatio...
3,3,/view-right-to-work
4,4,/access-to-work/after-you-have-applied
...,...,...
4005,1333,/dismiss-staff
4006,1334,/capital-gains-tax
4007,1335,/government/publications/lrs-organisation-portal
4008,1336,/government/publications/lrs-help-and-support


In [67]:
# take average of rank for each page path (non-weighted)
ranked_items = (
    pd.DataFrame.from_records(
        shortest_seed0_1_list + shortest_seed0_2_list + g_degree_list # this combines 3 lists into one long list
    )
    .groupby(1) # group by column named "1": i.e. by page path
    .mean()
    .round()
    .reset_index()
)
ranked_items = ranked_items.rename(columns={1: "page", 0: "rank"})
ranked_items = ranked_items.sort_values(by=["rank"])

In [68]:
ranked_items

Unnamed: 0,page,rank
435,/find-a-job,12.0
190,/browse/working/finding-job,15.0
1096,/sign-in-universal-credit,23.0
1221,/topic/further-education-skills/apprenticeships,30.0
1262,/universal-credit,31.0
...,...,...
382,/employee-tax-codes,1312.0
764,/guidance/veterans-uk-armed-forces-pensions-forms,1312.0
351,/disciplinary-procedures-and-action-at-work/di...,1329.0
624,/government/publications/lrs-organisation-portal,1335.0


In [69]:
# download as csv file
# ranked_items.to_csv("../../data/processed/ranked_items.csv", index=False)

## Create final output

### Add document type and session hit data to final csv file

Document supertypes as per: https://docs.publishing.service.gov.uk/document-types/content_purpose_supergroup.html


In [77]:
# extract: documentType, sessionHitsAll, entranceHit, exitHit, entranceAndExitHit
df_page = pd.DataFrame([i[0] for i in economic_recovery_subgraph], columns=["page"])
df_info = pd.DataFrame([i[1] for i in economic_recovery_subgraph])
df_all = df_page.join(df_info)

# define a set for news and communication doc types
news_and_comms_doctypes = {
    "medical_safety_alert",
    "drug_safety_update",
    "news_article",
    "news_story",
    "press_release",
    "world_location_news_article",
    "world_news_story",
    "fatality_notice",
    "tax_tribunal_decision",
    "utaac_decision",
    "asylum_support_decision",
    "employment_appeal_tribunal_decision",
    "employment_tribunal_decision",
    "service_standard_report",
    "cma_case",
    "decision",
    "oral_statement",
    "written_statement",
    "authored_article",
    "correspondence",
    "speech",
    "government_response",
    "case_study",
}

# define a set for service doc types
service_doctypes = {
    "completed_transaction",
    "local_transaction",
    "form",
    "calculator",
    "smart_answer",
    "simple_smart_answer",
    "place",
    "licence",
    "step_by_step_nav",
    "transaction",
    "answer",
    "guide",
}

# define a set for guidance and regulation doc types
guidance_and_reg_doctypes = {
    "regulation",
    "detailed_guide",
    "manual",
    "manual_section",
    "guidance",
    "map",
    "calendar",
    "statutory_guidance",
    "notice",
    "international_treaty",
    "travel_advice",
    "promotional",
    "international_development_fund",
    "countryside_stewardship_grant",
    "esi_fund",
    "business_finance_support_scheme",
    "statutory_instrument",
    "hmrc_manual",
    "standard",
}

# define a set for policy and engagement doc types
policy_and_engage_doctypes = {
    "impact_assessment",
    "policy_paper",
    "open_consultation",
    "closed_consultation",
    "consultation_outcome",
    "policy_and_engagement",
}

# define a set for research and statistics doc types
research_and_stats_doctypes = {
    "dfid_research_output",
    "independent_report",
    "research",
    "statistics",
    "national_statistics",
    "statistics_announcement",
    "national_statistics_announcement",
    "official_statistics_announcement",
    "statistical_data_set",
    "official_statistics",
}

# define a set for transparency doc types
transparency_doctypes = {
    "transparency",
    "corporate_report",
    "foi_release",
    "aaib_report",
    "raib_report",
    "maib_report",
}

# loop through document types and create document supertype column
document_type_dict = dict.fromkeys(list(set(df_info["documentType"])))

for docType, docSupertype in document_type_dict.items():
    if docType in news_and_comms_doctypes:
        document_type_dict[docType] = "news and communication"

    elif docType in service_doctypes:
        document_type_dict[docType] = "services"

    elif docType in guidance_and_reg_doctypes:
        document_type_dict[docType] = "guidance and regulation"

    elif docType in policy_and_engage_doctypes:
        document_type_dict[docType] = "policy and engagement"

    elif docType in research_and_stats_doctypes:
        document_type_dict[docType] = "research and statistics"

    elif docType in transparency_doctypes:
        document_type_dict[docType] = "transparency"

    else:
        document_type_dict[docType] = "other"

df_docSuper = pd.DataFrame(
    document_type_dict.items(), columns=["documentType", "documentSupertype"]
)

df_all = pd.merge(df_all, df_docSuper, how="left")

# reorder and rename columns
df_all = df_all[
    [
        "page",
        "documentType",
        "documentSupertype",
        "sessionHitsAll",
        "entranceHit",
        "exitHit",
        "entranceAndExitHit",
    ]
]
df_all = df_all.rename(
    columns={
        "documentType": "document type",
        "documentSupertype": "document supertype",
        "sessionHitsAll": "number of sessions that visit this page",
        "entranceHit": "number of sessions where this page is an entrance hit",
        "exitHit": "number of sessions where this page is an exit hit",
        "entranceAndExitHit": "number of sessions where this page is both an entrance and exit hit",
    }
)
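
The document type to supertype mapping can also be expressed as a lookup table. The sketch below is an alternative formulation, not the code used above: it uses abbreviated, illustrative subsets of the doc type sets and a toy dataframe, and maps anything unlisted to `other`.

```python
import pandas as pd

# abbreviated, illustrative subsets of the doc type sets
supertype_members = {
    "news and communication": {"news_story", "press_release"},
    "services": {"transaction", "guide", "answer"},
    "guidance and regulation": {"detailed_guide", "guidance"},
}

# invert to documentType -> documentSupertype
doc_type_to_supertype = {
    doc_type: supertype
    for supertype, doc_types in supertype_members.items()
    for doc_type in doc_types
}

toy = pd.DataFrame(
    {"documentType": ["transaction", "guidance", "mainstream_browse_page"]}
)
# unlisted document types fall back to "other"
toy["documentSupertype"] = (
    toy["documentType"].map(doc_type_to_supertype).fillna("other")
)
print(toy)
```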

### Add ranking to final csv file

In [78]:
# shortest_seed0_1
df_shortest_seed_0_1 = pd.DataFrame(
    shortest_seed0_1.items(),
    columns=[
        "page",
        "distance from `/browse/working/finding-job` (e.g. the higher the value is, the further away the page is from /browse/working/finding-job)",
    ],
)
df_all = pd.merge(df_all, df_shortest_seed_0_1, how="left")

# shortest_seed0_2
df_shortest_seed_0_2 = pd.DataFrame(
    shortest_seed0_2.items(),
    columns=[
        "page",
        "distance from `/topic/further-education-skills/apprenticeships (e.g. the higher the value is, the further away the page is from /topic/further-education-skills/apprenticeships)",
    ],
)
df_all = pd.merge(df_all, df_shortest_seed_0_2, how="left")

# where NaN, add 'no path' (NaN = no path is present between the source and target nodes)
df_all.fillna("no path", inplace=True)

# g_degree
df_g_degree = pd.DataFrame(
    g_degree.items(),
    columns=[
        "page",
        "the number of pages a user moves to/from between this page and another page in the list (e.g. the higher the value, the more users have moved to/from this page and another page in the list)",
    ],
)
df_all = pd.merge(df_all, df_g_degree, how="left")

### Order final csv by rank 

In [80]:
# page path becomes the index
df_all = df_all.set_index("page")
# order pages by rank (the lower the rank, the better), which is stored in the ranked_items dataframe
df_all = df_all.reindex(index=ranked_items["page"]).reset_index()
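
The reordering step relies on `reindex` returning rows in the order of the labels it is given. A minimal sketch with toy dataframes (names and values are made up) shows the effect.

```python
import pandas as pd

toy_data = pd.DataFrame({"page": ["/a", "/b", "/c"], "sessions": [10, 20, 30]})
toy_ranked = pd.DataFrame({"page": ["/c", "/a", "/b"], "rank": [1.0, 2.0, 3.0]})

# rows come back in the order of toy_ranked["page"]
ordered = (
    toy_data.set_index("page")
    .reindex(index=toy_ranked["page"])
    .reset_index()
)
print(ordered)
```

Any page that appears in the ranking but not in the data would come back as a row of missing values, so it is worth checking that both dataframes cover the same pages.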

In [81]:
df_all

Unnamed: 0,page,document type,document supertype,number of sessions that visit this page,number of sessions where this page is an entrance hit,number of sessions where this page is an exit hit,number of sessions where this page is both an entrance and exit hit,"distance from `/browse/working/finding-job` (e.g. the higher the value is, the further away the page is from /browse/working/finding-job)","distance from `/topic/further-education-skills/apprenticeships (e.g. the higher the value is, the further away the page is from /topic/further-education-skills/apprenticeships)","the number of pages a user moves to/from between this page and another page in the list (e.g. the higher the value, the more users have moved to/from this page and another page in the list)"
0,/find-a-job,transaction,services,71667,18532,4157,44191,1.0,1.0,0.591623
1,/browse/working/finding-job,mainstream_browse_page,other,6709,410,753,304,0.0,1.0,0.177263
2,/sign-in-universal-credit,transaction,services,7527,3403,2042,349,1.0,1.0,0.309648
3,/topic/further-education-skills/apprenticeships,topic,other,7488,415,3189,1185,1.0,0.0,0.163052
4,/universal-credit,guide,services,4140,217,1076,24,1.0,1.0,0.262528
...,...,...,...,...,...,...,...,...,...,...
1333,/employee-tax-codes,guide,services,31,2,2,0,3.0,3.0,0.008227
1334,/guidance/veterans-uk-armed-forces-pensions-forms,detailed_guide,guidance and regulation,45,18,0,0,3.0,4.0,0.005984
1335,/disciplinary-procedures-and-action-at-work/di...,guide,services,26,4,3,0,3.0,3.0,0.005984
1336,/government/publications/lrs-organisation-portal,guidance,guidance and regulation,58,17,16,1,no path,no path,0.002244


In [None]:
df_all.to_csv("../../data/processed/pages_ranked_with_data.csv", index=False)

In [83]:
df_all.to_clipboard()

In [84]:
ranked_items

Unnamed: 0,page,rank
435,/find-a-job,12.0
190,/browse/working/finding-job,15.0
1096,/sign-in-universal-credit,23.0
1221,/topic/further-education-skills/apprenticeships,30.0
1262,/universal-credit,31.0
...,...,...
382,/employee-tax-codes,1312.0
764,/guidance/veterans-uk-armed-forces-pensions-forms,1312.0
351,/disciplinary-procedures-and-action-at-work/di...,1329.0
624,/government/publications/lrs-organisation-portal,1335.0
