# Step 1: Identify seed 0 and seed 1 pages for economic recovery functional network 

This approach does not rely on the existing knowledge graph. Instead, a functional graph is created from page hit session data and then filtered to find a list of pages related to the economic recovery whole user journey (WUJ). 

The first step is to identify seed 0 and seed 1 pages. Seed 0 pages must be pre-defined and manually entered as `seed0_pages`. Using the GOV.UK mirror, `seed1_pages` are defined as pages linked from `seed0_pages`.
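
As a minimal sketch of how seed 1 pages follow from seed 0 pages, the toy example below applies the same row and column selection to a small dense link table. The page paths and the names `toy_links`, `toy_seed0` and `toy_seed1` are hypothetical and are not part of the GOV.UK mirror.

```python
import pandas as pd

# Hypothetical link table: rows are source pages, columns are destination pages
pages = ["/hub", "/guide-1", "/guide-2", "/unrelated"]
toy_links = pd.DataFrame(0, index=pages, columns=pages)
toy_links.loc["/hub", ["/guide-1", "/guide-2"]] = 1  # /hub links to both guides
toy_links.loc["/unrelated", "/hub"] = 1              # a link that does not start at a seed 0 page

toy_seed0 = ["/hub"]

# Keep only the seed 0 rows, then only the columns with at least one hyperlink
seed0_rows = toy_links.loc[toy_seed0, :]
toy_seed1 = seed0_rows.columns[(seed0_rows != 0).any(axis=0)].tolist()
```

Here `toy_seed1` contains only the pages that `/hub` links to. The link from `/unrelated` is ignored because `/unrelated` is not a seed 0 page.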

ASSUMPTIONS: 
- A copy of the GOV.UK mirror is used from 23-04-2021. The topology sparse matrix therefore only includes page information (e.g. page paths, hyperlinks on page paths) that is true on this date.
- `seed0_pages` are defined as `/topic/further-education-skills` and `/browse/working/finding-job`. These were chosen because they are topic and browse pages, which link to many similar pages. This analysis assumes these are important pages in the economic recovery whole user journey. `seed1_pages` are derived from `seed0_pages`, so the results of this analysis depend on which `seed0_pages` are chosen.

OUTPUT: 
- A topology sparse matrix of `seed0_pages` and `seed1_pages`: `topo_sparse_matrix_all_seeds` 
    - This is saved as a 

In [24]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

from googleapiclient.discovery import build
import io
import pickle
from googleapiclient.http import MediaIoBaseDownload

import pandas as pd

from google.colab import files

## Authentication

In [8]:
# authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

## Create topology sparse matrix 

Using a copy of the GOV.UK mirror from 23-04-2021, a topology sparse matrix is created, where each row is a source URL and each column is a destination URL. A `1` indicates a hyperlink from the source URL to the destination URL. 
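
To illustrate the layout, the sketch below builds a three-page version of such a matrix. It assumes the pickled matrix is a SciPy sparse matrix and that the vertex object is an ordered list of page paths; `toy_vertex`, `toy_matrix` and `toy_topo` are hypothetical names used only for this example.

```python
import pandas as pd
from scipy import sparse

# Hypothetical page paths, standing in for the real vertex list
toy_vertex = ["/a", "/b", "/c"]

# Hyperlinks as (source, destination) index pairs: /a -> /b, /a -> /c, /c -> /a
rows = [0, 0, 2]
cols = [1, 2, 0]
toy_matrix = sparse.csr_matrix(([1, 1, 1], (rows, cols)), shape=(3, 3))

# Label rows (sources) and columns (destinations) with the page paths
toy_topo = pd.DataFrame.sparse.from_spmatrix(
    toy_matrix, index=toy_vertex, columns=toy_vertex
)

# Outgoing links from /a are the non-zero entries in its row
links_from_a = [page for page in toy_vertex if toy_topo.loc["/a", page] != 0]
```

Reading along a row gives the pages a source links to. Reading down a column gives the pages that link to a destination.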

In [11]:
def load_pickle_from_drive(drive_service, file_id):
    """Download a Google Drive file into memory and unpickle it."""
    request = drive_service.files().get_media(fileId=file_id)
    downloaded = io.BytesIO()
    downloader = MediaIoBaseDownload(downloaded, request)
    done = False
    while not done:
        # _ is a placeholder for a progress object that we ignore.
        # (Our file is small, so we skip reporting progress.)
        _, done = downloader.next_chunk()
    downloaded.seek(0)
    return pickle.load(downloaded)

drive_service = build('drive', 'v3')

# import mirror topology matrix
mirror_topology_matrix = load_pickle_from_drive(drive_service, '1oskKqx16S_jIo67-fMJk8WaCenQb6UFH')

# import vertex
vertex = load_pickle_from_drive(drive_service, '1bf9inTVhUygJNm1lgTs2pG87x1RiSfoU')

# combine mirror_topology matrix and vertex to create topology sparse matrix
topo_sparse_matrix = pd.DataFrame.sparse.from_spmatrix(mirror_topology_matrix, index=vertex, columns=vertex)


## Define `seed0_pages` and `seed1_pages`

In [14]:
# Define seed0 pages 
seed0_pages = ['/topic/further-education-skills', '/browse/working/finding-job']

# Only keep seed0 pages in topo_sparse_matrix (rows)
topo_sparse_matrix_seed_0 = topo_sparse_matrix.loc[seed0_pages, :]

# Define seed1 pages (pages that are hyperlinked from seed0)
topo_sparse_matrix_all_seeds = topo_sparse_matrix_seed_0.loc[:, (topo_sparse_matrix_seed_0 != 0).any(axis=0)]

In [None]:
# Save a list of seed1 pages (the destination columns kept in topo_sparse_matrix_all_seeds)
seed1_pages = topo_sparse_matrix_all_seeds.columns.values.tolist()
df = pd.DataFrame(seed1_pages, columns=["column"])
df.to_csv('seed1_economic_recovery.csv', index=False)
files.download('seed1_economic_recovery.csv')
