## Ranking output

Currently, the random walks output a single unordered set of pages. End users will prefer a ranked list, and that ranking should tend to place pages from the target whole user journey (WUJ) above pages outside it.

### Ranking by page frequency-random walk frequency

#### An example

The most successful random walk method has been to perform multiple random walks and combine the pages visited by each one into a single set of pages.

Each random walk traverses a path of pages. Since we perform multiple random walks, we have multiple paths. Some pages will appear on more paths than others. Some pages will appear more frequently per path.

For example, suppose you perform two random walks and they traverse the following paths:

- [A, C, D, C, X, Y, Z]
- [A, C, B, D, Q, P, M]

Pages A, C and D are common to both paths. However, C occurs twice on the first path, which no other page does. Hence, C should be ranked first, followed by A and D in joint second. The remaining pages are equally ranked at the bottom.
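
The counts behind this ordering can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and are not part of the project's code.

```python
from collections import Counter

# the two example paths from above
paths = [
    ["A", "C", "D", "C", "X", "Y", "Z"],
    ["A", "C", "B", "D", "Q", "P", "M"],
]

# how often each page occurs on each individual path
counts_per_path = [Counter(path) for path in paths]

# how many paths each page occurs on at least once
paths_per_page = Counter(page for path in paths for page in set(path))

# pages present on every path, and pages repeated within a single path
common_pages = sorted(page for page, n in paths_per_page.items() if n == len(paths))
repeated_pages = sorted(page for c in counts_per_path for page, n in c.items() if n > 1)

print(common_pages)    # ['A', 'C', 'D']
print(repeated_pages)  # ['C']
```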

#### Page frequency-path frequency

Inspired by the tf-idf ("term frequency-inverse document frequency") metric from NLP, we create the tf-df metric, "term frequency-document frequency". Translated into random walk parlance, this is "page frequency-path frequency". Here, "page frequency" is the number of occurrences of a given page on a given path taken by a random walk, and "path frequency" is the number of random walk paths on which a given page occurs at least once.

Mathematically,

$\text{pf}(p,r)$ is the frequency of page $p$ on the path of a single random walk $r$,

$$\text{pf}(p,r) = f_{p,r}$$

where $f_{p,r}$ is the number of times page $p$ occurs on the path of random walk $r$.

The path frequency measures how common a given page is across all the random walks performed, i.e. whether it appears on many paths or only a few,

$$\text{rwf}(p,R) = |\{r \in R : p \in r\}|$$

Where $R$ is the set of paths taken by all random walks and $|\{r \in R : p \in r\}|$ is the number of random walks on which the page $p$ occurs. For instance, in the above example, page C occurs on two random walk paths.
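
As a minimal sketch, the two quantities can be computed directly from a list of paths. The text above does not state how they are combined into one score. The product, maximised over paths, is an assumption made here because it reproduces the ordering in the earlier example (C first, then A and D jointly). The function names `pf`, `rwf` and `score` are hypothetical and are not the API of `src.utils.randomwalks`.

```python
def pf(page, path):
    """Page frequency: number of times `page` occurs on one path."""
    return path.count(page)


def rwf(page, paths):
    """Path (random walk) frequency: number of paths containing `page`."""
    return sum(1 for path in paths if page in path)


def score(page, paths):
    """ASSUMPTION: combine the two as (max pf over all paths) * rwf."""
    return max(pf(page, path) for path in paths) * rwf(page, paths)


paths = [
    ["A", "C", "D", "C", "X", "Y", "Z"],
    ["A", "C", "B", "D", "Q", "P", "M"],
]

pages = sorted({page for path in paths for page in path})
scores = {page: score(page, paths) for page in pages}

# highest score first; ties broken alphabetically for a stable order
ranking = sorted(pages, key=lambda page: (-scores[page], page))
print(ranking[:3])  # ['C', 'A', 'D']
```

Under this assumed combination, C scores 4, A and D score 2, and every other page scores 1.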

The cells below demonstrate this ranking system.

In [1]:
import src.utils.randomwalks as rw
import numpy as np
import pandas as pd
import networkx as nx

In [2]:
# er_pages is a list of pages known to be within the economic recovery WUJ
# this will be used to help evaluate the ranking system

er_pages = pd.read_excel('../../data/processed/2021-11-12 - Economic recovery pages.xlsx', sheet_name='Top pages').pagePathv2.to_list()

# get networkx graph
G = nx.read_gpickle("../../data/processed/functional_session_hit_directed_graph_er.gpickle").to_undirected()

# reformat the graph to make it compliant with existing random walk functions
# i.e. add the path to a name property and set the index to be a number

for index, data in G.nodes(data=True):
    data['properties'] = {'name': index}


G = nx.convert_node_labels_to_integers(G, first_label=0, ordering='default', label_attribute=None)

# get adjacency matrix of G
# (`nx.adj_matrix` is deprecated; `nx.adjacency_matrix` is the replacement)
A = nx.adjacency_matrix(G, weight=None)


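The relabelling step can be tried on a toy graph. This is a minimal sketch with made-up page paths; `G_toy` and its edges are illustrative only. It shows that the page path survives as a node attribute once the labels become integers, which presumably lets row `i` of the adjacency matrix be mapped back to a page.

```python
import networkx as nx

# toy directed graph whose node labels are page paths (illustrative data only)
D = nx.DiGraph()
D.add_edges_from([("/a", "/b"), ("/b", "/c"), ("/c", "/a"), ("/a", "/c")])
G_toy = D.to_undirected()

# store each page path under properties['name'] before relabelling
for node, data in G_toy.nodes(data=True):
    data["properties"] = {"name": node}

# replace the page-path labels with consecutive integers starting at 0
G_toy = nx.convert_node_labels_to_integers(G_toy, first_label=0)

# the page paths are still available as node attributes
names = sorted(data["properties"]["name"] for _, data in G_toy.nodes(data=True))
print(sorted(G_toy.nodes))        # [0, 1, 2]
print(names)                      # ['/a', '/b', '/c']
print(G_toy.number_of_edges())    # 3
```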
In [3]:
# set the seeds from where random walks will be initialised
seeds = (
    '/find-a-job',
    '/universal-credit',
    '/government/collections/financial-support-for-businesses-during-coronavirus-covid-19'
)

In [4]:
results = rw.repeat_random_walks(steps=100, repeats=100, T=A, G=G, seed_pages=seeds, proba=False, combine='union', level=1, n_jobs=1)


In [5]:
page_scores = rw.page_freq_path_freq_ranking(results)

In [59]:
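# flag pages known to belong to the economic recovery WUJ
page_scores['ER'] = page_scores.pagePath.isin(er_pages)
# highlight the known ER pages; keep the Styler in its own variable
# so that `page_scores` remains a DataFrame for the cells below
colour = page_scores.ER.map({True: 'background-color: black', False: ''})
styled_scores = page_scores.style.apply(lambda s: colour)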
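
Highlighting gives a visual check only. One way to put a number on how well known WUJ pages are ranked is precision at k. The sketch below is not part of the project's code: `precision_at_k` is a hypothetical helper, the page lists are made-up examples, and it assumes the ranked list is ordered best-first.

```python
def precision_at_k(ranked_pages, known_pages, k):
    """Fraction of the top-k ranked pages that are in `known_pages`."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_pages[:k]
    if not top_k:
        return 0.0
    known = set(known_pages)
    return sum(page in known for page in top_k) / len(top_k)


# illustrative data only, not real results
ranked = ["/find-a-job", "/universal-credit", "/browse/tax", "/jobseekers-allowance"]
known = ["/find-a-job", "/universal-credit", "/jobseekers-allowance"]

print(precision_at_k(ranked, known, k=2))  # 1.0
print(precision_at_k(ranked, known, k=4))  # 0.75
```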

## Add columns to output

Add the following information to the CSV output:
- document type
- document supertype
- number of sessions that visit this page
- number of sessions where this page is an entrance hit
- number of sessions where this page is an exit hit
- number of sessions where this page is both an entrance and exit hit
- how frequently the page occurs in the whole user journey


In [54]:
# create a df with `pagePath`: `documentType`, `sessionHitsAll`, `entranceHit`, `exitHit`, `entranceAndExitHit`
info_cols = ['documentType', 'sessionHitsAll', 'entranceHit', 'exitHit', 'entranceAndExitHit']
ranked_pages = set(page_scores['pagePath'])  # set for fast membership tests

# keep node attributes only for pages that appear in the ranking
df_dict = {
    info['properties']['name']: [info[col] for col in info_cols]
    for node, info in G.nodes(data=True)
    if info['properties']['name'] in ranked_pages
}
df_info = (
    pd.DataFrame.from_dict(df_dict, orient='index', columns=info_cols)
    .rename_axis('pagePath')
    .reset_index()
)

In [55]:
# create a df with document supertypes
news_and_comms_doctypes = {'medical_safety_alert', 'drug_safety_update', 'news_article',
                           'news_story', 'press_release', 'world_location_news_article',
                           'world_news_story', 'fatality_notice',
                           'tax_tribunal_decision', 'utaac_decision', 'asylum_support_decision',
                           'employment_appeal_tribunal_decision', 'employment_tribunal_decision',
                           'service_standard_report', 'cma_case',
                           'decision', 'oral_statement', 'written_statement', 'authored_article',
                           'correspondence', 'speech', 'government_response', 'case_study'
}

service_doctypes = {'completed_transaction', 'local_transaction', 'form', 'calculator',
                    'smart_answer', 'simple_smart_answer', 'place', 'licence', 'step_by_step_nav', 
                    'transaction', 'answer', 'guide'
}

guidance_and_reg_doctypes = {'regulation', 'detailed_guide', 'manual', 'manual_section',
                             'guidance', 'map', 'calendar', 'statutory_guidance', 'notice',
                             'international_treaty', 'travel_advice', 'promotional', 
                             'international_development_fund', 'countryside_stewardship_grant',
                             'esi_fund', 'business_finance_support_scheme', 'statutory_instrument',
                             'hmrc_manual', 'standard'
}

policy_and_engage_doctypes = {'impact_assessment', 'policy_paper', 'open_consultation',
                              'closed_consultation', 'consultation_outcome',
                              'policy_and_engagement'
}

research_and_stats_doctypes = {'dfid_research_output', 'independent_report', 'research', 
                               'statistics', 'national_statistics', 'statistics_announcement',
                               'national_statistics_announcement', 'official_statistics_announcement',
                               'statistical_data_set', 'official_statistics'
}

transparency_doctypes = {'transparency', 'corporate_report', 'foi_release', 'aaib_report',
                         'raib_report', 'maib_report'
}

# map each supertype label to its set of document types; sets are checked in this order
supertype_lookup = {
    'news and communication': news_and_comms_doctypes,
    'services': service_doctypes,
    'guidance and regulation': guidance_and_reg_doctypes,
    'policy and engagement': policy_and_engage_doctypes,
    'research and statistics': research_and_stats_doctypes,
    'transparency': transparency_doctypes,
}

def get_supertype(doc_type):
    """Return the supertype label for a document type, or 'other' if unmatched."""
    for supertype, doctypes in supertype_lookup.items():
        if doc_type in doctypes:
            return supertype
    return 'other'

document_type_dict = {doc_type: get_supertype(doc_type) for doc_type in set(df_info['documentType'])}

df_docSuper = pd.DataFrame(list(document_type_dict.items()), columns=['documentType', 'documentSupertype'])

In [56]:
# merge dfs: attach page info to the scores, then the supertype for each document type
df_merged = pd.merge(page_scores, df_info, on='pagePath')
df_merged = pd.merge(df_merged, df_docSuper, how='left', on='documentType')

In [None]:
# reorder and rename df columns
df_merged = df_merged[['pagePath', 'documentType', 'documentSupertype', 'sessionHitsAll', 'entranceHit', 'exitHit', 'entranceAndExitHit', 'tfdf_max']]
df_merged = df_merged.rename(columns={
    'pagePath': 'page path',
    'documentType': 'document type',
    'documentSupertype': 'document supertype',
    'sessionHitsAll': 'number of sessions that visit this page',
    'entranceHit': 'number of sessions where this page is an entrance hit',
    'exitHit': 'number of sessions where this page is an exit hit',
    'entranceAndExitHit': 'number of sessions where this page is both an entrance and exit hit',
    'tfdf_max': 'how frequently the page occurs in the whole user journey',
})

# save df
df_merged.to_csv('../../data/processed/pages_ranked_with_data.csv', index=False)