# Exploring the economic recovery whole user journey subgraph

This approach does not rely on the existing knowledge graph. Instead, it uses a functional graph built from page hit session data to find a list of pages related to the economic recovery whole user journey (WUJ).

ASSUMPTIONS: 
- A set of pre-defined pages have been removed from the economic recovery WUJ subgraph
- Any pages with 10 or fewer session hits are removed from the economic recovery WUJ subgraph
- Any pages with a shortest page path length equal to or greater than 3 from both `seed0` pages are removed from the economic recovery WUJ subgraph
- Any pages where accumulated edge weight is equal to or lower than 20 are removed from the economic recovery WUJ subgraph
- Any browse pages not related to the economic recovery WUJ are manually removed from the economic recovery WUJ subgraph

OUTPUT: 
- A CSV file containing a list of pages related to the economic recovery WUJ, sorted in ascending order of a composite rank made up of the shortest page path length from each `seed0` page and degree centrality.
  - Additional columns to filter on are included: `document type`, `document supertype`, `number of sessions that visit this page`, `number of sessions where this page is an entrance hit`, `number of sessions where this page is an exit hit`, `distance from /browse/working/finding-job`, `distance from /topic/further-education-skills/apprenticeships`, `the number of pages a user moves to/from between this page and another page in the list`

REQUIREMENTS: 
- Run `step_one_identify_seed_pages.ipynb` to define `seed0` and `seed1` pages
- Run `step_two_extract_page_hits.sql` to extract page hits for sessions that visit at least one `seed0` or `seed1` page
- Run `step_three_extract_nodes_and_edges.sql` to extract nodes and edges 
- Run `step_four_create_networkx_graph.ipynb` to create NetworkX graph of the economic recovery whole user journey

## Import statements  

In [2]:
from neo4j import GraphDatabase
import networkx as nx
import os
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from operator import itemgetter
from collections import defaultdict
import gspread 
from oauth2client.service_account import ServiceAccountCredentials
from collections import Counter

## Functions for coercing knowledge graph into NetworkX

In [3]:
def getSubgraph(q, parameters=None):

    '''
    Given a Cypher query q, this function queries the knowledge graph,
    returns the nodes and edges from this query, and uses them to construct
    a networkx graph.

    E.g. getSubgraph(r'MATCH (u:Cid)-[r:HYPERLINKS_TO]->(v:Cid) RETURN *')
         returns the structural graph.

    Optionally, can add in parameters (dictionary), allowing Python variables
    to be integrated into the Cypher query q.

    E.g.
        parameters = {}
        parameters['pages'] = ['a','list','of','stuff']
        q7 = f"""
        MATCH (u:Cid)-[r]-(v:Cid)
        WHERE u.name IN $pages AND v.name in $pages
        RETURN *
        """

        g7 = getSubgraph(q7, parameters)
    '''

    # get credentials
    # add to .secrets: export KG_PWD="<PASSWORD>"
    KG_PWD = os.getenv("KG_PWD")

    # create connection to knowledge graph
    driver = GraphDatabase.driver(
        "bolt+s://knowledge-graph.integration.govuk.digital:7687",
        auth=("neo4j", KG_PWD),
    )

    # run query on knowledge graph
    results = driver.session().run(q, parameters)

    # create networkx graph object
    G = nx.MultiDiGraph()

    # add nodes into networkx graph object
    nodes = list(results.graph()._nodes.values())
    print("Adding nodes\n")
    for node in tqdm(nodes):
        G.add_node(node.id, labels=node._labels, properties=node._properties)

    # add edges into networkx graph object
    rels = list(results.graph()._relationships.values())
    print("Adding edges\n")
    for rel in tqdm(rels):
        G.add_edge(
            rel.start_node.id,
            rel.end_node.id,
            key=rel.id,
            type=rel.type,
            properties=rel._properties,
        )

    return G


def showGraph(g):
    """
    Given a networkx graph g, this function visualises the graph.
    Do not use for a large g.
    """
    print(nx.info(g))
    nx.draw(g)

In [4]:
def getNoOfTruePages(g):
    """
    Calculate a proxy recall metric for the list of pages identified in a
    subgraph (when compared to the ground truth for the economic recovery pages). 
    The output is the number of pages in the subgraph list that are present in 
    the ground truth list.  
    """
    
    # page paths of the nodes in the subgraph
    subgraph_pages = set(g.nodes())

    # number of ground truth pages that are also in the subgraph
    return len(set(economic_pages) & subgraph_pages)

In [5]:
def getPagesNotInSubGraph(g):
    """
    Return a list of pages which are in the pre-defined economic recovery list
    (41 in total), but are not in the filtered subgraph list.
    """
    
    # list of pages manually defined in the economic recovery whole user journey 
    true_list = list(economic_pages)
    
    # list of pages in the filtered subgraph list
    subgraph_list = [node[0] for node in g.nodes(data=True)]
    
    # the list of pages in the manually defined list not in the subgraph list
    return [node for node in true_list if node not in subgraph_list]
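
Both helpers above reduce to set operations. A minimal sketch on made-up page memberships, where `ground_truth` and `subgraph_pages` are hypothetical stand-ins for `economic_pages` and the node names of `g`:

```python
# Toy stand-ins (hypothetical membership, not the notebook's data):
# `ground_truth` plays the role of `economic_pages`,
# `subgraph_pages` the role of the node names in `g`.
ground_truth = {"/find-a-job", "/education", "/how-to-claim-universal-credit"}
subgraph_pages = {"/find-a-job", "/education", "/check-mot-status"}

# ground truth pages that the subgraph recovered
found = ground_truth & subgraph_pages
# ground truth pages that the subgraph missed
missing = ground_truth - subgraph_pages

# share of the ground truth that is present in the subgraph
recall_proxy = len(found) / len(ground_truth)
print(len(found), sorted(missing), round(recall_proxy, 2))
```

Dividing by the size of the ground truth list turns the raw count into a proportion, which is easier to compare across cut-offs.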

## Pre-defined economic recovery subgraph

The pages (nodes) in the manually defined list for economic recovery (41 pages in total)

In [18]:
GOOGLE_APPLICATION_CREDENTIALS = os.getenv('GOOGLE_APPLICATION_CREDENTIALS')

# Connect to service account
scope = ['https://spreadsheets.google.com/feeds'] 
credentials = ServiceAccountCredentials.from_json_keyfile_name(GOOGLE_APPLICATION_CREDENTIALS, scope) 
gc = gspread.authorize(credentials)

# Import the data from google sheets
spreadsheet_key = '1lLsgQRsl4bXmbyiwrbMNQOR_Zu-FkbqYRUmOOhTYc70' 
book = gc.open_by_key(spreadsheet_key) 
worksheet = book.worksheet('Top pages') 
table = worksheet.get_all_values()

# Convert table data into a dataframe, then build a set of non-empty page paths
df = pd.DataFrame(table[1:], columns=table[0])
economic_pages = set(filter(None, df['pagePathv2']))

## Exploring the economic recovery functional network

- 10631 nodes and 79985 edges
- The page `/guidance/international-trade-products-and-schemes` which is in the manually defined graph, is **not** in this subgraph. This seems to be an old page (updated 2013) related to exporting agricultural products to the EU
- The distribution of session hit data shows that only 25% of the page paths have more than 8 session hits. The maximum number of session hits for a page path is 71667:
    - Min: 1
    - Max: 71667
    - 25th percentile: 1
    - 50th percentile: 2
    - 75th percentile: 8
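
The shape of this distribution can be reproduced on a toy list of session hits (made-up values, not the notebook's data). The sketch uses `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points; this should match the default used by `pandas.DataFrame.describe`, though that is an assumption worth checking against your pandas version.

```python
from statistics import quantiles

# hypothetical session hits per page path, heavily right-skewed
hits = [1, 1, 1, 2, 2, 3, 8, 40, 500]

# quartile cut points (25th, 50th and 75th percentiles)
q1, median, q3 = quantiles(hits, n=4, method="inclusive")

# share of page paths with more session hits than the 75th percentile
share_above_q3 = sum(h > q3 for h in hits) / len(hits)
print(q1, median, q3, max(hits), round(share_above_q3, 2))
```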


In [6]:
# import the NetworkX graph object
g = nx.read_gpickle("../../data/processed/functional_session_hit_directed_graph_er.gpickle")

In [7]:
# number of nodes, number of edges
nx.info(g)

'DiGraph with 10631 nodes and 79985 edges'

In [None]:
# how many of the 'true' economic recovery pages are in the functional graph?
getNoOfTruePages(g)

In [None]:
# which pages are in the economic recovery list but not in the subgraph list? 
true_list = list(economic_pages)
subgraph_list = [node[0] for node in g.nodes(data=True)]
[node for node in true_list if node not in subgraph_list]

In [None]:
# list of nodes
nodes_list = list(g.nodes(data=True))

In [None]:
# plot distribution of session hit data 
x = [y['sessionHitsAll'] for x, y in nodes_list]

plt.hist(x); # , bins = 10
plt.xlabel('Session hits')
plt.ylabel('Count')
plt.title('No. of page paths')

In [None]:
# boxplot to identify outliers
plt.boxplot(x)

In [None]:
# descriptives, and remove outliers in boxplot
plt.boxplot(x, showfliers=False)
pd.DataFrame(x).describe()

In [None]:
# look at distribution of session hit data up to 100
x = [y['sessionHitsAll'] for x, y in nodes_list]
x = sorted(x, reverse=True)
x = x[2:]
plt.hist(x,bins=10,range=(0,100))
plt.xlabel('Session hits')
plt.ylabel('Count')
plt.title('No. of session hits')
plt.show()

## Improving the precision of the subgraph

The recall of the subgraph (10631 nodes) is acceptable. However, we need to improve the precision (i.e. reduce the number of nodes) so that a human can manually trawl through the list of pages, and not be overwhelmed. We assume the pages removed are irrelevant.

- Remove **irrelevant pages**. A number of pages have high session hit counts but are not relevant to the WUJ. These should be predefined and removed. Some are worth removing from every graph (e.g. `/search/all`); others are specific to this WUJ, but patterns emerge, for example pages related to `pensions` and `criminal convictions`.

- Remove pages with **session hits equal to or lower than 10**.

- Remove pages where the **shortest page path length is equal to or greater than 3 from both `seed0` pages**.

- Remove pages where the **accumulated edge weight for the page is equal to or lower than 20**.

This leaves us with a subgraph of: 
- 1338 nodes
- 35/41 nodes in the subgraph, which are also in the manually defined list of pages

### Remove irrelevant pages 

In [None]:
# order pages in descending order related to session hits
sorted(g.nodes(data=True), key=lambda x: x[1]['sessionHitsAll'], reverse=True)

In [None]:
# remove nodes 
g.remove_nodes_from(['/prove-right-to-work', '/', '/request-copy-criminal-record', '/browse/working/state-pension', 
                     '/search/all','/search', '/brexit', '/coronavirus', '/report-covid19-result'])

### Remove pages with session hits equal or lower than 10


Decide on a cut-off X and remove pages with X or fewer session hits. A cut-off of 10 retains a good number of the pre-defined pages (38/41) and drastically reduces the number of nodes in the graph (from 10631 to 2246).
- Equal or lower than 10: 2246 nodes, 38/41, [`/find-driving-instructor-training`, `/guidance/international-trade-products-and-schemes`, `/how-to-claim-universal-credit`]
- Equal or lower than 20: 1150 nodes, 37/41, [`/guidance/recovery-loan-scheme`, `/find-driving-instructor-training`, `/guidance/international-trade-products-and-schemes`, `/how-to-claim-universal-credit`]
- Equal or lower than 30: 1355 nodes, 34/41, [`/guidance/recovery-loan-scheme`, `/find-driving-instructor-training`, `/education`, `/guidance/international-trade-products-and-schemes`, `/agricultural-skills-and-training`, `/government/collections/financial-support-for-businesses-during-coronavirus-covid-19`, `/how-to-claim-universal-credit`]
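
The cells in this section remove nodes from `g` in place, so running the 10, 20 and 30 cut-offs one after another applies them cumulatively to the same graph. A possible alternative, sketched here on a toy graph with made-up session hits, evaluates each cut-off on a copy so that the original graph stays intact. `drop_low_traffic` is a hypothetical helper, not part of the notebook.

```python
import networkx as nx


def drop_low_traffic(graph, threshold):
    """Return a copy of `graph` without nodes whose sessionHitsAll <= threshold."""
    keep = [node for node, data in graph.nodes(data=True)
            if data["sessionHitsAll"] > threshold]
    # subgraph() returns a view; copy() makes it an independent graph
    return graph.subgraph(keep).copy()


# toy graph with made-up page paths and session hits
toy = nx.DiGraph()
toy.add_nodes_from([
    ("/a", {"sessionHitsAll": 5}),
    ("/b", {"sessionHitsAll": 15}),
    ("/c", {"sessionHitsAll": 25}),
    ("/d", {"sessionHitsAll": 35}),
])
toy.add_edges_from([("/a", "/b"), ("/b", "/c"), ("/c", "/d")])

# number of nodes left for each candidate cut-off
sizes = {t: drop_low_traffic(toy, t).number_of_nodes() for t in (10, 20, 30)}
print(sizes)
```

Because each call works on a copy, the cut-offs can be compared in any order without reloading the pickled graph.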

In [None]:
# frequency table for session hits
x = sorted([y['sessionHitsAll'] for x, y in nodes_list], reverse=True)
Counter(x)

#### Equal to or lower than 10 sessions

In [None]:
# pages with 10 or fewer session hits 
remove = [node for node, session in g.nodes(data=True) if session['sessionHitsAll'] <= 10]

# remove nodes with 10 or fewer session hits 
g.remove_nodes_from(remove)

# how many nodes and edges
nx.info(g)

In [None]:
# how many of the economic recovery pages are in the functional graph?
getNoOfTruePages(g)

In [None]:
# which pages are in the economic recovery list but not in the subgraph list? 
getPagesNotInSubGraph(g)

#### Equal to or lower than 20 sessions

In [None]:
# pages with 20 or fewer session hits 
remove = [node for node, session in g.nodes(data=True) if session['sessionHitsAll'] <= 20]

# remove nodes with 20 or fewer session hits 
g.remove_nodes_from(remove)

# how many nodes and edges
nx.info(g)

In [None]:
# how many of the economic recovery pages in the functional graph?
getNoOfTruePages(g)

In [None]:
# which pages are in the economic recovery list but not in the subgraph list? 
getPagesNotInSubGraph(g)

#### Equal to or lower than 30 sessions

In [None]:
# pages with 30 or fewer session hits 
remove = [node for node, session in g.nodes(data=True) if session['sessionHitsAll'] <= 30]

# remove nodes with 30 or fewer session hits 
g.remove_nodes_from(remove)

# how many nodes and edges
nx.info(g)

In [None]:
# how many of the economic recovery pages are in the functional graph?
getNoOfTruePages(g)

In [None]:
# which pages are in the economic recovery list but not in the subgraph list? 
getPagesNotInSubGraph(g)

### Remove pages with max shortest page paths from `seed0` pages >= 3

Identify the shortest page paths from seed0 pages to all other pages. 

Frequency distribution of shortest page path lengths: 
- `seed0.1`: /browse/working/finding-job; {4: 2, 3: 137, 2: 1272, 1: 138, 0: 1}
- `seed0.2`: /topic/further-education-skills/apprenticeships; {4: 5, 3: 392, 2: 1047, 1: 105, 0: 1}


If the shortest page path length is equal to or greater than 3, remove the node from the subgraph:
- 1521 nodes
- 37/41 
- Pages in the manually defined list not in the subgraph: [`/guidance/international-trade-products-and-schemes`,
 `/guidance/recovery-loan-scheme`, `/find-driving-instructor-training`, `/how-to-claim-universal-credit`, `/government/collections/financial-support-for-businesses-during-coronavirus-covid-19`, `/agricultural-skills-and-training`, `/education`]
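
One way to express the rule as worded in this document (drop a page only when it is 3 or more steps from both `seed0` pages) is sketched below on a toy graph. The seed and page names are made up, and treating a page that cannot be reached from a seed as infinitely far away is an assumption of this sketch rather than something the notebook states.

```python
import math
import networkx as nx

toy = nx.DiGraph()
nx.add_path(toy, ["seed_a", "p1", "p2", "p3"])   # p3 is 3 steps from seed_a
nx.add_path(toy, ["seed_b", "p3"])               # but only 1 step from seed_b
nx.add_path(toy, ["seed_b", "q1", "q2", "far"])  # far is 3 steps from seed_b
# "far" cannot be reached from seed_a at all

# shortest path length from each seed to every reachable node
dist_a = nx.single_source_shortest_path_length(toy, "seed_a")
dist_b = nx.single_source_shortest_path_length(toy, "seed_b")


def too_far(node, cutoff=3):
    # a missing key means there is no path; treat that as infinitely far (assumption)
    return (dist_a.get(node, math.inf) >= cutoff
            and dist_b.get(node, math.inf) >= cutoff)


remove = sorted(node for node in toy if too_far(node))
print(remove)
```

In this toy graph `p3` survives because it is close to one seed, while `far` is dropped because it is distant from both.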

In [9]:
# shortest page path length from each `seed0` page to every reachable page
shortest_paths_seed0_1 = nx.shortest_path_length(g, source='/browse/working/finding-job')
shortest_paths_seed0_2 = nx.shortest_path_length(g, source='/topic/further-education-skills/apprenticeships')

In [10]:
# sort nodes in descending order
finding_job = {k: v for k, v in sorted(shortest_paths_seed0_1.items(), key=lambda item: item[1], reverse=True)}

In [11]:
# sort nodes in descending order
apprentice = {k: v for k, v in sorted(shortest_paths_seed0_2.items(), key=lambda item: item[1], reverse=True)}

In [None]:
# frequency table for `seed0.1`
x = sorted([y for x, y in finding_job.items()], reverse=True)
Counter(x)

In [None]:
# frequency table for `seed0.2`
x = sorted([y for x, y in apprentice.items()], reverse=True)
Counter(x)

In [None]:
# extract list of pages that have a shortest page path length of >= 3
finding_job_filtered = [node for node, length in finding_job.items() if length >= 3]
apprentice_filtered = [node for node, length in apprentice.items() if length >= 3]

# extract list of pages that have a shortest page path length of >= 3 from both `seed0` pages
remove=[node for node in finding_job_filtered if node not in apprentice_filtered]

# remove list of pages from g
g.remove_nodes_from(remove)

# number of nodes and edges
nx.info(g)

In [None]:
# how many of the economic recovery pages in the functional graph?
getNoOfTruePages(g)

In [None]:
# which pages are in the economic recovery list but not in the subgraph list? 
getPagesNotInSubGraph(g)

#### Exploration: remove pages with the shortest page paths summed (`seed0.1` + `seed0.2`)

We could also look at the sum of the shortest page path lengths. The caveat is that if the shortest page path length from `seed0.1` to `page A` is `1`, but from `seed0.2` to `page A` it is `6`, the sum would be `7`. Yet `page A` may be very relevant to the WUJ, as indicated by its short path from `seed0.1`.

Therefore, this method has not been chosen. 
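
The caveat can be seen with two made-up pages. The distances below are illustrative only, and the cut-offs (a sum of 6 or more, or 3 or more from both seeds) mirror the ones explored in this notebook.

```python
# hypothetical shortest page path lengths from (seed0.1, seed0.2)
distances = {"page A": (1, 6), "page B": (3, 3)}

for page, (d1, d2) in distances.items():
    print(page, "sum =", d1 + d2, "closest seed =", min(d1, d2))

# a summed cut-off drops page A even though it sits next to seed0.1
dropped_by_sum = [p for p, (d1, d2) in distances.items() if d1 + d2 >= 6]

# requiring both distances to be large keeps page A
dropped_by_both = [p for p, (d1, d2) in distances.items() if d1 >= 3 and d2 >= 3]
print(dropped_by_sum, dropped_by_both)
```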

In [16]:
# sum all pages shortest page paths of both `seed0` nodes to get an overall shortest page path proxy metric
combined_paths = {k: finding_job[k] + apprentice[k] for k in set(finding_job) & set(apprentice)}
combined_paths = {k: v for k, v in sorted(combined_paths.items(), key=lambda item: item[1], reverse=True)}

In [None]:
# remove nodes with a summed shortest page path length of 6 or more
remove = [node for node, length in combined_paths.items() if length >= 6]
g.remove_nodes_from(remove)

In [None]:
# how many of economic recovery pages are in the functional graph?
getNoOfTruePages(g)

In [None]:
# which pages are in the economic recovery list but not in the subgraph list? 
getPagesNotInSubGraph(g)

### Remove pages where edge weight is <= 20


Edge weight increases by 1 each time a user session moves from page A to page B. If the edge weight is small, we therefore assume the two pages are unlikely to be associated with one another, because we expect users to visit similar pages within the same WUJ during the same session.

Edge weight is equal to or less than 10:
- 1473 nodes
- 37/41 
- Pages not in the subgraph list: [`/guidance/recovery-loan-scheme`, 
`/find-driving-instructor-training`, 
`/guidance/international-trade-products-and-schemes`,
`/how-to-claim-universal-credit`]

Edge weight is equal to or less than 20:
- 1253 nodes
- 34/41 
- Pages not in the subgraph list: [`/guidance/international-trade-products-and-schemes`,
 `/guidance/recovery-loan-scheme`,
 `/find-driving-instructor-training`,
 `/how-to-claim-universal-credit`,
 `/government/collections/financial-support-for-businesses-during-coronavirus-covid-19`,
 `/agricultural-skills-and-training`,
 `/guidance/claim-back-statutory-sick-pay-paid-to-employees-due-to-coronavirus-covid-19`]

Edge weight is equal to or less than 30:
- 1042 nodes
- 32/41 
- Pages not in the subgraph list: [`/guidance/international-trade-products-and-schemes`,
 `/guidance/recovery-loan-scheme`,
 `/find-driving-instructor-training`,
 `/how-to-claim-universal-credit`,
 `/government/collections/financial-support-for-businesses-during-coronavirus-covid-19`,
 `/agricultural-skills-and-training`,
 `/guidance/claim-back-statutory-sick-pay-paid-to-employees-due-to-coronavirus-covid-19`]
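
The accumulation used in this section sums `edgeWeight` over each page's outgoing edges, which is the page's weighted out-degree. A minimal sketch of the same idea on a toy graph (made-up pages and weights), using NetworkX's `out_degree(weight=...)`:

```python
import networkx as nx

toy = nx.DiGraph()
toy.add_weighted_edges_from(
    [("/a", "/b", 15), ("/a", "/c", 10), ("/b", "/c", 4), ("/c", "/a", 20)],
    weight="edgeWeight",
)

# total weight of the edges leaving each page
outgoing = dict(toy.out_degree(weight="edgeWeight"))

# pages whose accumulated outgoing weight is 20 or lower
remove = [page for page, total in outgoing.items() if total <= 20]
print(outgoing, remove)
```

One difference to keep in mind: a loop over `g.edges` only creates totals for pages that have at least one outgoing edge, whereas `out_degree` also reports pages with none (total 0), so those would be removed as well.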

In [None]:
# weight (user movement) for each pair of nodes in the graph  
user_movement_weights = []
for node1, node2, edgeWeight in g.edges(data=True):
    case = {'node1': node1, 'node2': node2, 'edgeWeight':edgeWeight['edgeWeight']}
    user_movement_weights.append(case)

# sorted(user_movement_weights, key=itemgetter('edgeWeight'), reverse=True)

# sum the weight for each node 
user_movements_sum = defaultdict(float)

for info in user_movement_weights:
    user_movements_sum[info['node1']] += info['edgeWeight']

user_movements_sum = [{'node1': node1, 'edgeWeight': user_movements_sum[node1]} 
                     for node1 in user_movements_sum]

#sorted(user_movements_sum, key=lambda x: x['edgeWeight'], reverse=True)

#### Remove pages where edge weight is less than or equal to 10 

In [None]:
# remove anything with accumulated edge weight of 10 or lower 
remove = [node['node1'] for node in user_movements_sum if node['edgeWeight'] <= 10]
g.remove_nodes_from(remove)

# number of nodes and edges
nx.info(g)

In [None]:
# how many of economic recovery pages are in the functional graph?
getNoOfTruePages(g)

In [None]:
# which pages are in the economic recovery list but not in the subgraph list? 
getPagesNotInSubGraph(g)

#### Remove pages where edge weight is less than or equal to 20 

In [None]:
# remove anything with accumulated edge weight of 20 or lower 
remove = [node['node1'] for node in user_movements_sum if node['edgeWeight'] <= 20]
g.remove_nodes_from(remove)

# number of nodes and edges
nx.info(g)

In [None]:
# how many of economic recovery pages are in the functional graph?
getNoOfTruePages(g)

In [None]:
# which pages are in the economic recovery list but not in the subgraph list? 
getPagesNotInSubGraph(g)

#### Remove pages where edge weight is less than or equal to 30 

In [None]:
# remove anything with accumulated edge weight of 30 or lower 
remove = [node['node1'] for node in user_movements_sum if node['edgeWeight'] <= 30]
g.remove_nodes_from(remove)

# number of nodes and edges
nx.info(g)

In [None]:
# how many of economic recovery pages are in the functional graph?
getNoOfTruePages(g)

In [None]:
# which pages are in the economic recovery list but not in the subgraph list? 
getPagesNotInSubGraph(g)

### Save the subgraph list of pages that are related to the economic recovery whole user journey

Following the exploration above, run this code to produce the final subgraph related to the economic recovery whole user journey.

In [19]:
# import the NetworkX graph object
g = nx.read_gpickle("../../data/processed/functional_session_hit_directed_graph_er.gpickle")

In [20]:
# remove irrelevant nodes 
g.remove_nodes_from(['/prove-right-to-work', '/', '/request-copy-criminal-record', '/browse/working/state-pension', 
                     '/search/all','/search', '/brexit', '/coronavirus', '/report-covid19-result'])

In [21]:
# remove pages with equal or less than 10 session hits 
remove = [node for node, session in g.nodes(data=True) if session['sessionHitsAll'] <= 10]
g.remove_nodes_from(remove)

In [22]:
# remove pages that have a shortest page path length of >= 3 from both `seed0` pages
shortest_paths_seed0_1 = nx.shortest_path_length(g, source='/browse/working/finding-job')
shortest_paths_seed0_2 = nx.shortest_path_length(g, source='/topic/further-education-skills/apprenticeships')

finding_job = {k: v for k, v in sorted(shortest_paths_seed0_1.items(), key=lambda item: item[1], reverse=True)}
apprentice = {k: v for k, v in sorted(shortest_paths_seed0_2.items(), key=lambda item: item[1], reverse=True)}

finding_job_filtered = [node for node, length in finding_job.items() if length >= 3]
apprentice_filtered = [node for node, length in apprentice.items() if length >= 3]

remove = [node for node in finding_job_filtered if node not in apprentice_filtered]
g.remove_nodes_from(remove)

In [23]:
# remove pages where accumulated edge weight is equal to or lower than 20
user_movement_weights = []
for node1, node2, edgeWeight in g.edges(data=True):
    case = {'node1': node1, 'node2': node2, 'edgeWeight':edgeWeight['edgeWeight']}
    user_movement_weights.append(case)

user_movements_sum = defaultdict(float)

for info in user_movement_weights:
    user_movements_sum[info['node1']] += info['edgeWeight']

user_movements_sum = [{'node1': node1, 'edgeWeight': user_movements_sum[node1]} 
                     for node1 in user_movements_sum]

remove=[node['node1'] for node in user_movements_sum if node['edgeWeight'] <= 20]
g.remove_nodes_from(remove)

In [24]:
# remove certain browse pages. Browse pages seem to score highly (when ranking, below), 
# regardless of whether they are relevant to the WUJ or not
browse_pages_remove = ('/browse/employing-people', '/browse/births-deaths-marriages', 
                       '/browse/citizenship', '/browse/driving', '/browse/education', 
                       '/browse/business', '/browse/childcare-parenting', '/browse/justice', 
                       '/browse/abroad', '/browse/tax', '/browse/visas-immigration', 
                       '/browse/disabilities', '/browse/environment-countryside', 
                       '/browse/housing-local-services', '/browse/benefits')

g.remove_nodes_from(browse_pages_remove)

In [25]:
nx.info(g)

'DiGraph with 1338 nodes and 31143 edges'

In [26]:
# how many of economic recovery pages are in the functional graph?
getNoOfTruePages(g)

35

In [27]:
# which pages are in the economic recovery list but not in the subgraph list? 
getPagesNotInSubGraph(g)

['/find-driving-instructor-training',
 '/agricultural-skills-and-training',
 '/how-to-claim-universal-credit',
 '/guidance/recovery-loan-scheme',
 '/browse/education',
 '/guidance/international-trade-products-and-schemes']

In [28]:
# final subgraph list related to the economic recovery whole user journey
economic_recovery_subgraph = list(g.nodes(data=True))

In [None]:
# save subgraph to file
nx.write_gpickle(g, '../../data/processed/functional_session_hit_directed_graph_er_final.gpickle')

### Explore communities of the subgraph

Look at adjacent neighbours to identify communities within the WUJ. While some pages within a community seem related (e.g. `/find-a-job` and `/contact-jobcentre-plus`), others seem unrelated (e.g. `/find-a-job` and `/check-mot-status`).

In [None]:
# create a dict of all nodes and their neighbours
dict_of_nodes_and_edges = dict()
for node in g.nodes():
    dict_of_nodes_and_edges[node] = list(nx.neighbors(g, node))

# sort dict based on the length of its values (i.e. top result has the largest number of neighbours)
sorted(dict_of_nodes_and_edges.items(), key=lambda x: len(x[1]), reverse=True)

## Ranking the subgraph

We now have a subgraph of 1338 nodes. How do we rank these pages so that the most 'relevant' pages to a WUJ are at the top, while the most 'irrelevant' pages to a WUJ are at the bottom? 

Explored:
- session hits 
- shortest page path lengths 
- sum of the shortest page path lengths
- betweenness centrality
- degree centrality 
- closeness centrality 

An average (non-weighted) rank for each page path was created using:
- shortest page path length from seed0.1: `shortest_seed0_1`
- shortest page path length from seed0.2: `shortest_seed0_2`
- degree centrality: `g_degree`

These three metrics were chosen on the assumption that nodes close to `seed0` nodes are more likely to be associated with the same topic, and therefore the same WUJ. In addition, nodes which are more highly connected (i.e. with more outgoing/incoming relationships) should be included in the WUJ, as the majority of the pages in the subgraph are expected to be relevant.

The other metrics were not chosen, because: 
- session hits: pages with low session hits may be relevant to a WUJ
- sum of the shortest page path lengths: pages may be closer to one seed0 node, but far away from the other seed0 node
- betweenness centrality: pages relevant to a WUJ may not necessarily have many paths that pass through them. It is likely that more 'popular' pages will (i.e. ones with higher session hits)


Caveat: browse pages are likely to be ranked highly, as these are common pages which may be highly connected to other nodes. Therefore, irrelevant browse pages are removed. 


### Order by session hits

In [None]:
sessions = list(sorted(economic_recovery_subgraph, key=lambda x: x[1]['sessionHitsAll'], reverse=True))

### Order by shortest page path lengths

In [None]:
# calculate the shortest page path lengths from each `seed0` page
shortest_paths_seed0_1 = nx.shortest_path_length(g, source='/browse/working/finding-job')
shortest_paths_seed0_2 = nx.shortest_path_length(g, source='/topic/further-education-skills/apprenticeships')

In [None]:
# seed0.1: '/browse/working/finding-job', sorted in ascending order of path length
shortest_seed0_1 = {k: v for k, v in sorted(shortest_paths_seed0_1.items(), key=lambda item: item[1], reverse=False)}

In [None]:
# seed0.2: '/topic/further-education-skills/apprenticeships', sorted in ascending order of path length
shortest_seed0_2 = {k: v for k, v in sorted(shortest_paths_seed0_2.items(), key=lambda item: item[1], reverse=False)}

### Order by the sum of the shortest page path lengths

In [None]:
combined_paths = {k: finding_job[k] + apprentice[k] for k in set(finding_job) & set(apprentice)}
combined_paths = {k: v for k, v in sorted(combined_paths.items(), key=lambda item: item[1], reverse=True)}
combined_paths

### Order by centrality metrics

In [None]:
# betweenness centrality: the number of shortest paths that pass through the node - a 'bridge' between nodes
# higher values indicate higher centrality
g_di = nx.DiGraph(g)
g_di_between = nx.betweenness_centrality(g_di)
g_between = dict(sorted(g_di_between.items(), key=itemgetter(1),reverse=True))
g_between

In [None]:
# degree centrality: counts the number of incoming and outgoing relationships from a node - 'most connected'
# the higher the degree, the more central the node is
g_degree = nx.degree_centrality(g)
g_degree = dict(sorted(g_degree.items(), key=itemgetter(1),reverse=True))
g_degree

In [None]:
# closeness centrality: reciprocal of the average shortest path length between the node and all other reachable nodes
# higher values of closeness indicate higher centrality
g_closeness = nx.closeness_centrality(g)
g_closeness = dict(sorted(g_closeness.items(), key=itemgetter(1),reverse=True))
g_closeness
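
For reference, on a directed graph `nx.degree_centrality` divides each node's in-degree plus out-degree by `n - 1`, so values can exceed 1. A small check on a toy graph with made-up pages:

```python
import networkx as nx

# toy directed graph: /a has two outgoing and two incoming edges
toy = nx.DiGraph([("/a", "/b"), ("/b", "/a"), ("/a", "/c"), ("/d", "/a")])
n = toy.number_of_nodes()

centrality = nx.degree_centrality(toy)

# the same quantity computed by hand: (in-degree + out-degree) / (n - 1)
by_hand = {node: toy.degree(node) / (n - 1) for node in toy}
print(centrality["/a"], centrality["/c"])
```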

### Overall rank from multiple ranked items

Choose a final ranking method. In this example, `shortest_seed0_1`, `shortest_seed0_2`, `g_degree` have been used.

In [None]:
# create lists in order
sessions_list = [(index, element[0]) for index, element in enumerate(sessions)]
shortest_seed0_1_list = [(index, element) for index, element in enumerate(shortest_seed0_1)]
shortest_seed0_2_list = [(index, element) for index, element in enumerate(shortest_seed0_2)]
combined_paths_list = [(index, element) for index, element in enumerate(combined_paths)]
g_between_list = [(index, element) for index, element in enumerate(g_between)]
g_degree_list = [(index, element) for index, element in enumerate(g_degree)]
g_closeness_list = [(index, element) for index, element in enumerate(g_closeness)]

In [None]:
# take average of rank for each page path (non-weighted)
ranked_items = pd.DataFrame.from_records(shortest_seed0_1_list+shortest_seed0_2_list+g_degree_list).groupby(1).mean().round().reset_index()
ranked_items = ranked_items.rename(columns={1: 'page', 0: 'rank'})
ranked_items = ranked_items.sort_values(by=['rank'])
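
The averaging step can be sketched on toy rankings. Each list is already sorted best-first, a page's rank is its position in the list, and the ranks are averaged per page. The page names and orderings are made up, and the final rounding used in the notebook is left out to keep the arithmetic visible.

```python
import pandas as pd

# three hypothetical orderings, best first
by_seed0_1 = ["/a", "/b", "/c"]
by_seed0_2 = ["/b", "/a", "/c"]
by_degree = ["/a", "/c", "/b"]

# one (rank, page) record per page per ordering
records = [(rank, page)
           for ordering in (by_seed0_1, by_seed0_2, by_degree)
           for rank, page in enumerate(ordering)]

# non-weighted mean rank per page, lowest (best) first
ranked = (pd.DataFrame.from_records(records, columns=["rank", "page"])
            .groupby("page", as_index=False)["rank"].mean()
            .sort_values("rank")
            .reset_index(drop=True))
print(ranked)
```

Because positions come from `enumerate`, pages with the same path length still receive different ranks, and a page that is missing from one list is averaged over the lists in which it does appear.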

In [None]:
# download as csv file
ranked_items.to_csv('../../data/processed/ranked_items.csv', index=False)

## Create final output

### Add document type and session hit data to final csv file

Document supertypes as per: https://docs.publishing.service.gov.uk/document-types/content_purpose_supergroup.html


In [29]:
# extract: documentType, sessionHitsAll, entranceHit, exitHit, entranceAndExitHit
df_page = pd.DataFrame([i[0] for i in economic_recovery_subgraph], columns=['page'])
df_info = pd.DataFrame([i[1] for i in economic_recovery_subgraph])
df_all = df_page.join(df_info)

# define a set for news and communication doc types
news_and_comms_doctypes = {'medical_safety_alert', 'drug_safety_update', 'news_article', 
                           'news_story', 'press_release', 'world_location_news_article', 
                           'world_news_story', 'fatality_notice', 
                           'tax_tribunal_decision', 'utaac_decision', 'asylum_support_decision', 
                           'employment_appeal_tribunal_decision', 'employment_tribunal_decision', 
                           'service_standard_report', 'cma_case', 
                           'decision', 'oral_statement', 'written_statement', 'authored_article', 
                           'correspondence', 'speech', 'government_response', 'case_study' 
}

# define a set for service doc types
service_doctypes = {'completed_transaction', 'local_transaction', 'form', 'calculator',
                    'smart_answer', 'simple_smart_answer', 'place', 'licence', 'step_by_step_nav', 
                    'transaction', 'answer', 'guide'
}

# define a set for guidance and regulation doc types
guidance_and_reg_doctypes = {'regulation', 'detailed_guide', 'manual', 'manual_section',
                             'guidance', 'map', 'calendar', 'statutory_guidance', 'notice',
                             'international_treaty', 'travel_advice', 'promotional', 
                             'international_development_fund', 'countryside_stewardship_grant',
                             'esi_fund', 'business_finance_support_scheme', 'statutory_instrument',
                             'hmrc_manual', 'standard'
}

# define a set for policy and engagement doc types
policy_and_engage_doctypes = {'impact_assessment', 'policy_paper', 'open_consultation',
                              'closed_consultation', 'consultation_outcome',
                              'policy_and_engagement'
}

# define a set for research and statistics doc types
research_and_stats_doctypes = {'dfid_research_output', 'independent_report', 'research', 
                               'statistics', 'national_statistics', 'statistics_announcement',
                               'national_statistics_announcement', 'official_statistics_announcement',
                               'statistical_data_set', 'official_statistics'
}

# define a set for transparency doc types
transparency_doctypes = {'transparency', 'corporate_report', 'foi_release', 'aaib_report',
                         'raib_report', 'maib_report'
}

# loop through document types and create document supertype column 
document_type_dict = dict.fromkeys(list(set(df_info['documentType'])))

for docType, docSupertype in document_type_dict.items():
    if docType in news_and_comms_doctypes: 
        document_type_dict[docType] = 'news and communication'
    
    elif docType in service_doctypes:
        document_type_dict[docType] = 'services'
    
    elif docType in guidance_and_reg_doctypes:
        document_type_dict[docType] = 'guidance and regulation'
 
    elif docType in policy_and_engage_doctypes:
        document_type_dict[docType] = 'policy and engagement'
    
    elif docType in research_and_stats_doctypes:
        document_type_dict[docType] = 'research and statistics'
    
    elif docType in transparency_doctypes:
        document_type_dict[docType] = 'transparency'
    
    else: 
        document_type_dict[docType] = 'other' 

df_docSuper = pd.DataFrame(document_type_dict.items(), columns=['documentType', 'documentSupertype'])

df_all = pd.merge(df_all, df_docSuper, how='left')
        
# reorder and rename columns 
df_all = df_all[['page', 'documentType', 'documentSupertype', 'sessionHitsAll', 'entranceHit', 'exitHit', 'entranceAndExitHit']]
df_all = df_all.rename(columns={'documentType': 'document type', 'documentSupertype': 'document supertype', 'sessionHitsAll': 'number of sessions that visit this page', 'entranceHit': 'number of sessions where this page is an entrance hit', 'exitHit': 'number of sessions where this page is an exit hit', 'entranceAndExitHit': 'number of sessions where this page is both an entrance and exit hit'})
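
The `if`/`elif` chain above is a lookup from document type to supertype. A compact alternative, shown here with a shortened, illustrative subset of the document types, inverts the groups into a single dictionary and falls back to `'other'`. The name `supertype_of` and the unlisted type are hypothetical.

```python
# shortened subset of the groups defined above, for illustration only
groups = {
    "news and communication": {"news_story", "press_release", "speech"},
    "services": {"transaction", "answer", "guide"},
    "guidance and regulation": {"detailed_guide", "manual"},
}

# invert the groups: document type -> supertype
supertype_of = {doc_type: supertype
                for supertype, doc_types in groups.items()
                for doc_type in doc_types}

# any document type that is not listed falls back to 'other'
doc_types = ["guide", "press_release", "detailed_guide", "some_unlisted_type"]
supertypes = [supertype_of.get(t, "other") for t in doc_types]
print(supertypes)
```

With pandas, the same dictionary could be applied to a column with `Series.map(supertype_of).fillna('other')`.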


### Add ranking to final csv file

In [None]:
# shortest_seed0_1
df_shortest_seed_0_1 = pd.DataFrame(shortest_seed0_1.items(), columns=['page', 'distance from `/browse/working/finding-job` (e.g. the higher the value is, the further away the page is from /browse/working/finding-job)'])
df_all = pd.merge(df_all, df_shortest_seed_0_1, how='left')

# shortest_seed0_2
df_shortest_seed_0_2 = pd.DataFrame(shortest_seed0_2.items(), columns=['page', 'distance from `/topic/further-education-skills/apprenticeships` (e.g. the higher the value is, the further away the page is from /topic/further-education-skills/apprenticeships)'])
df_all = pd.merge(df_all, df_shortest_seed_0_2, how='left')

# where NaN, add 'no path' (NaN = no path is present between the source and target nodes)
df_all.fillna('no path', inplace = True)

# g_degree
df_g_degree = pd.DataFrame(g_degree.items(), columns=['page', 'the number of pages a user moves to/from between this page and another page in the list (e.g. the higher the value, the more users have moved to/from this page and another page in the list)'])
df_all = pd.merge(df_all, df_g_degree, how='left')

### Order final csv by rank 

In [None]:
df_all = df_all.set_index('page')
df_all = df_all.reindex(index=ranked_items['page']).reset_index()

In [None]:
df_all.to_csv('../../data/processed/pages_ranked_with_data.csv', index=False)