# Intro & Summary

In [1]:
import os
import pymongo
import yaml

import pandas as pd

## What we did before

This is the previous function used to decide whether a page was a "loved" page:

```
def is_loved_page(page):
    page = page.split('?')[0]
    return any([
        re.compile(r'/foreign-travel-advice/|/find-local-council/|/premises-licence/').match(page),
        page in hmrc_contact_pages_set,
        page in loved_page_paths_set,
        page == '/help',
        page in ['/help/terms-conditions', 
         '/help/about-govuk',
         '/help/accessibility', 
         '/help/privacy-policy',
         '/help/cookies', 
         '/help/update-email-notifications',
         '/help/browsers', 
         '/help/beta'],
        any([pagepath in page for pagepath in loved_smart_answers]), 
        
        page in ['/visit-europe-brexit',
            '/apply-company-tachograph-card',
            '/cymraeg',
            '/guidance/apprenticeship-funding-rules'],
        any([pagepath in page for pagepath in [
            '/food-premises-approval','/marriage-abroad',
            '/guidance/transport-goods-out-of-the-uk-by-road-if-the-uk-leaves-the-eu-without-a-deal-checklist-for-hauliers',
            '/check-british-citizenship','/renew-driving-licence']])
        ])
```


The links on https://www.gov.uk/government/organisations/hm-revenue-customs/contact were the source of `hmrc_contact_pages_set`.
Perhaps we can use a `document_type` or similar this time?

Previously we compared page paths only, which meant expanding each page into a list of every possible page path based on its slugs. If we use content IDs instead, we won't have to do that.

### What this means

- `re.compile(r'/foreign-travel-advice/|/find-local-council/|/premises-licence/').match(page)`
    - `/foreign-travel-advice/*` pages have `links.ordered_related_items`
    - `/find-local-council/*` pages have `links.ordered_related_items`
    - `/premises-licence/*` pages have `links.ordered_related_items`
- `hmrc_contact_pages_set` contains pages from https://www.gov.uk/government/organisations/hm-revenue-customs/contact. They have `document_type = contact`, `details.quick_links` that are displayed in the right-hand sidebar, and links in a field `links.related` that are shown at the bottom of the page
    - another couple of pages have `document_type = contact` and also have `details.quick_links` that are displayed in the right-hand sidebar
- `loved_page_paths_set` was created by looking for `related_mainstream_content.notnull() or ordered_related_items.notnull() or part_of_step_navs.notnull() or quick_links.notnull()`
- `page == '/help'` has `links.ordered_related_items`
    - '/help/terms-conditions' has `links.ordered_related_items`
    - '/help/about-govuk' has `links.ordered_related_items`
    - '/help/accessibility' has `links.ordered_related_items`
    - '/help/privacy-policy' has `links.ordered_related_items`
    - '/help/cookies' does not have related links
    - '/help/update-email-notifications' has `links.ordered_related_items`
    - '/help/browsers' has `links.ordered_related_items`
    - '/help/beta' has `links.ordered_related_items`
- `any([pagepath in page for pagepath in loved_smart_answers])` is based on the set of pages with `document_type = simple_smart_answer`; these have `links.part_of_step_navs` and/or `links.ordered_related_items`
- '/visit-europe-brexit' redirects to '/visit-eu-switzerland-norway-iceland-liechtenstein', which has `links.ordered_related_items` and `links.document_collections`
- '/apply-company-tachograph-card' has `links.ordered_related_items` and `links.document_collections`
- '/cymraeg' has `links.ordered_related_items`
- '/guidance/apprenticeship-funding-rules' has `links.related_mainstream_content`
- '/food-premises-approval' has `links.ordered_related_items`
- '/marriage-abroad' has `links.ordered_related_items`
- '/guidance/transport-goods-out-of-the-uk-by-road-if-the-uk-leaves-the-eu-without-a-deal-checklist-for-hauliers' redirects to '/guidance/transporting-goods-between-great-britain-and-the-eu-by-roro-freight-guidance-for-hauliers'
- '/check-british-citizenship' has `links.ordered_related_items`
- '/renew-driving-licence' has `links.ordered_related_items`

So, to find pages with manually curated links, we should look for:
 - `links.ordered_related_items` exists
 - `links.related_mainstream_content` exists
 - `details.quick_links` exists

The following may mean that suggested links won't be shown:
 - `links.ordered_related_items` exists
 - `links.part_of_step_navs` exists
 - `details.quick_links` exists
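
As a rough illustration of the checks listed above, here is a minimal sketch of a predicate that works on a single content item held as a plain dict. The helper names (`get_nested`, `has_curated_links`) and the `require_non_empty` option are hypothetical and not part of any existing codebase. The field paths use the `expanded_links` prefix that the MongoDB queries later in this notebook use, where the bullet points above say `links`.

```python
# Field paths taken from the notes above. "expanded_links" is an assumption
# carried over from the MongoDB queries in the Workings section.
CURATED_LINK_FIELDS = (
    "expanded_links.ordered_related_items",
    "expanded_links.related_mainstream_content",
    "expanded_links.part_of_step_navs",
    "details.quick_links",
)

_MISSING = object()  # sentinel so that None or [] still count as "present"


def get_nested(document, dotted_path):
    """Return the value at a dotted path, or _MISSING if any key is absent."""
    value = document
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return _MISSING
        value = value[key]
    return value


def has_curated_links(content_item, require_non_empty=False):
    """Hypothetical helper: True if any curated-link field is present.

    With require_non_empty=True, fields holding an empty value (e.g. [])
    are ignored.
    """
    for path in CURATED_LINK_FIELDS:
        value = get_nested(content_item, path)
        if value is _MISSING:
            continue
        if require_non_empty and not value:
            continue
        return True
    return False
```

By default this treats a field as present even when it is empty; passing `require_non_empty=True` would skip pages whose curated-link fields exist but contain nothing.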

# Workings

## Connect to MongoDB instance, look at some pages
Using the guide in [govuk-mongodb-content](https://github.com/alphagov/govuk-mongodb-content).

In [2]:
mongo_client = pymongo.MongoClient("mongodb://localhost:27017/")

In [3]:
content_store_db = mongo_client['content_store']
content_store_collection = content_store_db['content_items']

In [4]:
def read_exclusions_yaml(filepath):
    with open(filepath,'r') as f:
        return yaml.safe_load(f)

In [5]:
BLOCKLIST_DOCUMENT_TYPES = read_exclusions_yaml(
    '../../govuk-related-links-recommender/src/config/document_types_excluded_from_the_topic_taxonomy.yml')[
    'document_types']

In [6]:
CONTENT_ID_PROJECTION = {"content_id": 1}

In [7]:
one_page = content_store_collection.find(
    {"_id": {"$eq": "/government/organisations/hm-revenue-customs/contact/creative-industry-tax-reliefs"}})

In [8]:
page1 = list(one_page)[0]

In [None]:
page1

In [8]:
simple_smart_answers = content_store_collection.find(
    {"document_type": {"$eq": "simple_smart_answer"}},
    {"content_id": 1})

In [None]:
list(simple_smart_answers)

## Get pages that have curated related links
These were previously known as `loved_page_paths_set`: pages that have a field such as `ordered_related_items`.

In [10]:
FILTER_HAS_RELATED_LINKS = {
    "$and": [
        {"$or": [
            # standard related links
            {"expanded_links.ordered_related_items": {"$exists": True}},

            # step-by-step details in the right-hand panel, e.g. /limited-company-formation
            {"expanded_links.part_of_step_navs": {"$exists": True}},

            # related_mainstream_content links, e.g. /guidance/work-out-if-youll-pay-the-scottish-rate-of-income-tax
            {"expanded_links.related_mainstream_content": {"$exists": True}},

            # quick_links, e.g. /government/organisations/hm-revenue-customs/contact/creative-industry-tax-reliefs
            {"details.quick_links": {"$exists": True}},
        ]},
        {"document_type": {"$nin": BLOCKLIST_DOCUMENT_TYPES}},
        {"phase": "live"}]}

pages_with_related_links = content_store_collection.find(FILTER_HAS_RELATED_LINKS, CONTENT_ID_PROJECTION)
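
To see how much each of the four fields contributes to the combined filter, one option is to count matches per field. The sketch below is only an illustration: `count_pages_per_field` is a hypothetical helper, and it assumes the collection object exposes pymongo's `count_documents(filter)` method.

```python
def count_pages_per_field(collection, field_paths, base_filter=None):
    """Count documents in which each field exists.

    `collection` is assumed to behave like a pymongo Collection, i.e. it
    provides count_documents(filter). `base_filter` is an optional extra
    condition (for example {"phase": "live"}) applied to every count.
    """
    counts = {}
    for path in field_paths:
        query = {path: {"$exists": True}}
        if base_filter:
            query = {"$and": [base_filter, query]}
        counts[path] = collection.count_documents(query)
    return counts


# Possible usage against the collection defined earlier in this notebook:
# count_pages_per_field(
#     content_store_collection,
#     ["expanded_links.ordered_related_items",
#      "expanded_links.part_of_step_navs",
#      "expanded_links.related_mainstream_content",
#      "details.quick_links"],
#     base_filter={"phase": "live"},
# )
```

The per-field counts can add up to more than the combined total, because a page may have several of these fields at once.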

In [11]:
list_pages_with_related_links = list(pages_with_related_links)

In [12]:
list_pages_with_related_links[0]

{'_id': '/1619-bursary-fund',
 'content_id': 'f4b96a38-5247-4afd-b554-8a258a0e8c93'}

In [13]:
df_pages_with_related_links = pd.DataFrame(list_pages_with_related_links)

In [14]:
df_pages_with_related_links.head()

| | _id | content_id |
|---|---|---|
| 0 | /1619-bursary-fund | f4b96a38-5247-4afd-b554-8a258a0e8c93 |
| 1 | /30-hours-free-childcare | ddda6dc8-e9de-49db-bbd1-97e3d0bc1e6f |
| 2 | /access-to-elected-office-fund | e12e3c54-b544-4d94-ba1f-9846144374d2 |
| 3 | /addisons-disease-driving | b319d900-b15a-4ced-8a75-d3326f987948 |
| 4 | /adi-part-1-test | f2533b63-0341-4b9a-b37e-a88276b4783e |


In [15]:
df_pages_with_related_links.to_csv('../data/df_pages_with_related_links.csv')

In [16]:
df_pages_with_related_links.shape

(3093, 2)
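
Since the aim is to match on content IDs rather than page paths, the query results could be reduced to a set of IDs for fast membership checks. This is a minimal sketch: `to_content_id_set` and `is_curated` are hypothetical names, and the sample records simply reuse two rows from the output above.

```python
def to_content_id_set(pages):
    """Collect the distinct content IDs from a list of {'_id', 'content_id'} dicts."""
    return {page["content_id"] for page in pages if page.get("content_id")}


def is_curated(content_id, curated_content_ids):
    """Set membership stands in for the old path-based is_loved_page checks."""
    return content_id in curated_content_ids


# Two rows copied from the DataFrame head shown above.
sample_pages = [
    {"_id": "/1619-bursary-fund",
     "content_id": "f4b96a38-5247-4afd-b554-8a258a0e8c93"},
    {"_id": "/30-hours-free-childcare",
     "content_id": "ddda6dc8-e9de-49db-bbd1-97e3d0bc1e6f"},
]
curated_content_ids = to_content_id_set(sample_pages)
print(is_curated("f4b96a38-5247-4afd-b554-8a258a0e8c93", curated_content_ids))
```

In the notebook itself, `list_pages_with_related_links` could be passed in place of `sample_pages`. If several paths happen to share a content ID, the set will be smaller than the number of rows.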

## Explore hmrc_contact_pages_set
Looking at 'metadata/hmrc_contact_pages.json', we had 130 page paths such as /government/organisations/hm-revenue-customs/contact/agent-dedicated-line-debt-management

I think those links sit on https://www.gov.uk/government/organisations/hm-revenue-customs/contact and it looks like the pages have the document type `contact`.



In [23]:
# Note: this matches every live page with document_type "contact",
# not only the HMRC ones.
FILTER_HMRC_CONTACT_PAGES = {
    "$and": [
        {"document_type": {"$eq": "contact"}},
        {"phase": "live"}]}

hmrc_contact_pages = content_store_collection.find(FILTER_HMRC_CONTACT_PAGES, 
                                                  {"content_id": 1,
                                                   "title": 1})

In [24]:
hmrc_contact_pages_list = list(hmrc_contact_pages)

In [25]:
len(hmrc_contact_pages_list)

140

In [None]:
sorted([page['title'] for page in hmrc_contact_pages_list])

In [32]:
import json
import urllib.request

In [34]:
with urllib.request.urlopen('https://www.gov.uk/api/content/government/organisations/hm-revenue-customs/contact') as url:
    list_on_contact_page = json.loads(url.read().decode())

In [38]:
links_on_contact_page = list_on_contact_page['links']['children']

In [39]:
titles_from_query = set([page['title'] for page in hmrc_contact_pages_list])

In [40]:
titles_from_contact_page = set([page['title'] for page in links_on_contact_page])

In [41]:
links_from_query_not_on_page =  titles_from_query - titles_from_contact_page

In [42]:
links_from_query_not_on_page

{'Complain about HMRC',
 'HM Passport Office webchat',
 'Legal Aid Agency customer services',
 'Mineral Oils Reliefs'}

'HM Passport Office webchat' has `details.quick_links`:
https://www.gov.uk/api/content/government/organisations/hm-passport-office/contact/hm-passport-office-webchat

'Legal Aid Agency customer services' has an empty `details.quick_links`:
https://www.gov.uk/api/content/government/organisations/legal-aid-agency/contact/legal-aid-agency-customer-services

In [44]:
titles_from_contact_page - titles_from_query

{'Make a complaint about HMRC', 'Mineral oils reliefs'}
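
Two of the differences above appear to be the same pages under slightly different titles ('Complain about HMRC' vs 'Make a complaint about HMRC', and 'Mineral Oils Reliefs' vs 'Mineral oils reliefs'), which suggests that titles are a fragile key for this comparison. A sketch of the same comparison keyed on content ID follows. It assumes each entry in `links.children` carries a `content_id` field, which has not been checked here, and `diff_by_key` is a hypothetical helper. The sample records and their IDs are made up.

```python
def diff_by_key(left_items, right_items, key="content_id"):
    """Return (only_in_left, only_in_right) as {key value: title} dicts."""
    left = {item[key]: item.get("title") for item in left_items if key in item}
    right = {item[key]: item.get("title") for item in right_items if key in item}
    only_left = {k: left[k] for k in left.keys() - right.keys()}
    only_right = {k: right[k] for k in right.keys() - left.keys()}
    return only_left, only_right


# Made-up records for illustration; the IDs are placeholders, not real content IDs.
from_query = [
    {"content_id": "id-1", "title": "Complain about HMRC"},
    {"content_id": "id-2", "title": "HM Passport Office webchat"},
]
from_contact_page = [
    {"content_id": "id-1", "title": "Make a complaint about HMRC"},
]

only_in_query, only_on_page = diff_by_key(from_query, from_contact_page)
print(only_in_query)
print(only_on_page)
```

With these sample records, the renamed page no longer shows up as a difference and only the webchat entry remains on the query side. In the notebook, `hmrc_contact_pages_list` and `links_on_contact_page` could be passed in instead.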