In this notebook we generate related links for pages using functional data (what people view on GOV.UK), and LLR (the Log-Likelihood Ratio).

LLR measures how often pages A and B are viewed together, relative to how often each page is viewed overall. For example, if page A co-occurs with B and with C at similar rates, but C is less popular sitewide, then C is treated as the more relevant recommendation for A.
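
As a rough illustration of that popularity effect, here is a minimal sketch of the entropy-based LLR formulation described in the blog post linked below. It is not necessarily what the SQL in this notebook computes, and the session counts are made up for the example.

```python
import math


def x_log_x(x):
    """Return x * ln(x), taking 0 * ln(0) to be 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalised Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of session counts.

    k11: sessions with both A and B
    k12: sessions with A but not B
    k21: sessions with B but not A
    k22: sessions with neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


# Hypothetical counts: 10,000 sessions, page A appears in 100 of them.
# B and C each co-occur with A in 50 sessions, but B is far more popular.
llr_b = llr(50, 50, 4950, 4950)  # B appears in 5,000 sessions
llr_c = llr(50, 50, 150, 9750)   # C appears in 200 sessions
print(llr_b, llr_c)
```

With these counts, half of all sessions contain B whether or not they contain A, so B scores close to zero, while the less popular C scores much higher.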

A blog about LLR: http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
Paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.5962&rep=rep1&type=pdf


Some notes on what we do:
- LLR by content ID
- look at page hits sequentially: only link a page to pages viewed AFTER it in the same session
- filter out "finding" pages - `document_type` in ['document_collection', 'finder', 'homepage', 'license_finder', 'mainstream_browse_page', 'organisation', 'search', 'service_manual_homepage', 'service_manual_topic', 'services_and_information', 'taxon', 'topic', 'topical_event']
- filter out 'fatality_notice', 'contact', 'service_sign_in','html_publication', 'calculator', 'completed_transaction' `document_type`s
- filter out embedded links that already exist on pages
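
The sequencing note above can be sketched in plain Python. This is an illustration only: the real work happens in the SQL query, the hits below are invented, and it assumes that any later page in the session counts as "viewed after", not only the immediately following one.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical hits: (session_id, content_id, hit_number)
hits = [
    ("s1", "A", 1), ("s1", "B", 2), ("s1", "A", 3),
    ("s2", "B", 1), ("s2", "C", 2),
]

# Keep the first hit number per (session, page), mirroring min(hitNumber).
first_hit = defaultdict(dict)
for session_id, content_id, hit_number in hits:
    seen = first_hit[session_id]
    seen[content_id] = min(hit_number, seen.get(content_id, hit_number))

# Count ordered pairs: page_2 was first viewed after page_1 in the same session.
pair_counts = defaultdict(int)
for pages in first_hit.values():
    ordered = sorted(pages, key=pages.get)
    for page_1, page_2 in combinations(ordered, 2):
        pair_counts[(page_1, page_2)] += 1

print(dict(pair_counts))
```

In session `s1` the return visit to A is ignored, so only the pair A then B is counted, never B then A.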

# Setup

In [1]:
import os 
import math
from ast import literal_eval
import pandas as pd

In [2]:
# set up pandas, bigquery, data dir stuff
ProjectID = 'govuk-bigquery-analytics'
KEY_DIR = os.getenv("BQ_KEY_DIR")
key_file_path = os.path.join(KEY_DIR, os.listdir(KEY_DIR)[0])

DATA_DIR = os.getenv("DATA_DIR")

# Run LLR Query
Node2Vec used 3 weeks of data, so we use 3 weeks here too, giving LLR the same chance of capturing the functional network

The query sequences pageviews using hitNumber, so we look at pages viewed *after* other pages

You have to sub in `START_DATE` and `END_DATE`
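
One possible way to build the two values is sketched below. `table_suffix_range` is a hypothetical helper, not part of this notebook, and the template string is only a stand-in for the relevant clause of the real query.

```python
from datetime import date, timedelta


def table_suffix_range(end, days=21):
    """Return (START_DATE, END_DATE) as YYYYMMDD strings covering `days` days inclusive."""
    start = end - timedelta(days=days - 1)
    return start.strftime('%Y%m%d'), end.strftime('%Y%m%d')


start_date, end_date = table_suffix_range(date(2019, 3, 10))

# Stand-in for the WHERE clause of the real query template.
template = "_TABLE_SUFFIX BETWEEN '{START_DATE}' AND '{END_DATE}'"
print(template.format(START_DATE=start_date, END_DATE=end_date))
```

Because `BETWEEN` is inclusive at both ends, a 3-week window ending on 10 March 2019 starts on 18 February 2019.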

In [10]:
# Read the query template; the context manager closes the file afterwards
with open('../queries/llr_pages_content_id_split_up_hitNumber.sql', 'r') as file_object:
    QUERY = file_object.read()

In [11]:
print(QUERY)

WITH
  session_pages AS (
  SELECT
    CONCAT(fullVisitorId,"-",CAST(visitId AS STRING)) AS sessionId,
    content_id,
    min(hitNumber) as hitNumber
  FROM (
    SELECT
      fullVisitorId,
      visitId,
      hits.hitNumber as hitNumber,
      hits.page.pagePath AS pagePath,
      (
      SELECT
        value
      FROM
        hits.customDimensions
      WHERE
        index=4) AS content_id,
      (
      SELECT
        value
      FROM
        hits.customDimensions
      WHERE
        index=2) AS document_type
    FROM
      `govuk-bigquery-analytics.87773428.ga_sessions_*` AS sessions
    CROSS JOIN
      UNNEST(sessions.hits) AS hits
    WHERE
      _TABLE_SUFFIX BETWEEN '{START_DATE}'
      AND '{END_DATE}')
  WHERE
    pagePath != '/'
    AND document_type NOT IN ('document_collection',
      'finder',
      'homepage',
      'license_finder',
      'mainstream_browse_page',
      'organisation',
      'search',
      'service_manual_homepage',
      'service_manual_topic',
 

In [12]:
print(QUERY.format(START_DATE='20190215', END_DATE='20190307'))

WITH
  session_pages AS (
  SELECT
    CONCAT(fullVisitorId,"-",CAST(visitId AS STRING)) AS sessionId,
    content_id,
    min(hitNumber) as hitNumber
  FROM (
    SELECT
      fullVisitorId,
      visitId,
      hits.hitNumber as hitNumber,
      hits.page.pagePath AS pagePath,
      (
      SELECT
        value
      FROM
        hits.customDimensions
      WHERE
        index=4) AS content_id,
      (
      SELECT
        value
      FROM
        hits.customDimensions
      WHERE
        index=2) AS document_type
    FROM
      `govuk-bigquery-analytics.87773428.ga_sessions_*` AS sessions
    CROSS JOIN
      UNNEST(sessions.hits) AS hits
    WHERE
      _TABLE_SUFFIX BETWEEN '20190215'
      AND '20190307')
  WHERE
    pagePath != '/'
    AND document_type NOT IN ('document_collection',
      'finder',
      'homepage',
      'license_finder',
      'mainstream_browse_page',
      'organisation',
      'search',
      'service_manual_homepage',
      'service_manual_topic',
      '

In [5]:
df_llr_recs = pd.io.gbq.read_gbq(
    QUERY.format(START_DATE='20190218', END_DATE='20190310'),
    project_id=ProjectID,
    reauth=False,
    verbose=True,
    private_key=key_file_path,
    dialect='standard')



In [6]:
df_llr_recs.to_csv(os.path.join(DATA_DIR, 'llr', 'llr_recs_20190218_20190310.csv.gz'),
                     compression='gzip', index=False)

In [7]:
df_llr_recs.head()

Unnamed: 0,page_1,page_2,page_1_occurrences,co_occurrences,llr_score,rank
0,00015d3f-e7d9-48e8-95ff-ac3f7fa07be3,8ec6ea6e-a259-459a-9e5e-ffab399a2e2e,1,1,778832400.0,1
1,000227a8-f0d2-417d-8ce4-27a18d62d442,f79797b8-93a8-4826-a3fa-91488bfbb73d,30,1,1272172000.0,1
2,000227a8-f0d2-417d-8ce4-27a18d62d442,60556764-7631-11e4-a3cb-005056011aef,30,1,1266232000.0,2
3,000227a8-f0d2-417d-8ce4-27a18d62d442,bd3dad18-b6ed-4f4c-b2d4-1cd4caa9faa8,30,1,1263362000.0,3
4,000227a8-f0d2-417d-8ce4-27a18d62d442,5fe61962-7631-11e4-a3cb-005056011aef,30,1,1263362000.0,3


This is quite a big DataFrame at the moment: I've taken up to 100 links for each page because I don't know how many will be filtered out when embedded links are removed. The data also covers 3 weeks, so the number of occurrences for each page will be fairly high.

In [8]:
print(df_llr_recs.shape)
print(df_llr_recs[df_llr_recs['page_1_occurrences']>1].shape)
print(df_llr_recs[df_llr_recs['page_1_occurrences']>10].shape)
print(df_llr_recs[df_llr_recs['page_1_occurrences']>100].shape)

(4923663, 6)
(4907923, 6)
(4706298, 6)
(2906420, 6)


# Get top n related links, without embedded links
Filter out embedded links that already exist in pages, and get the top n related links for each page

## Get embedded links
Get the embedded links for each page from `content_json_extended.csv.gz` (created in [govuk-network-embedding](https://github.com/alphagov/govuk-network-embedding/blob/2nd-iteration/notebooks/data_extract/extract_links_from_json.ipynb))
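
The sketch below uses invented rows to illustrate the conversion. It assumes the `embedded_links` column stores each list of base paths as a string, which is why `literal_eval` is needed, and that paths with no known content ID are dropped.

```python
from ast import literal_eval

# Hypothetical rows: base_path -> (content_id, embedded_links stored as a string)
rows = {
    "/page-a": ("cid-a", "['/page-b', '/not-in-content-store']"),
    "/page-b": ("cid-b", "[]"),
}

base_path_to_cid = {path: cid for path, (cid, _) in rows.items()}

embedded_cids = {}
for path, (cid, raw_links) in rows.items():
    links = literal_eval(raw_links)  # "['/page-b', ...]" -> ['/page-b', ...]
    # Links to paths with no known content ID are silently dropped.
    embedded_cids[cid] = [base_path_to_cid[link] for link in links
                          if link in base_path_to_cid]

print(embedded_cids)
```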

In [9]:
content_json_file = os.path.join(DATA_DIR, 'content_api', "11-02-19",
                                 "content_json_extended.csv.gz")
content = pd.read_csv(content_json_file, compression="gzip")

In [10]:
content.columns

Index(['base_path', 'body', 'children', 'collection_links', 'content_id',
       'description', 'document_collections', 'document_type',
       'embedded_links', 'field_of_operation', 'finder', 'first_published_at',
       'lead_organisations', 'locale', 'mainstream_browse_pages', 'ministers',
       'ordered_related_items_overrides', 'organisations',
       'pages_part_of_step_nav', 'pages_related_to_step_nav', 'parent',
       'part_of_step_navs', 'people', 'policy_areas',
       'primary_publishing_organisation', 'publishing_app', 'related_guides',
       'related_links', 'related_mainstream', 'related_mainstream_content',
       'related_policies', 'related_statistical_data_sets',
       'related_to_step_navs', 'roles', 'sections', 'speaker', 'title',
       'topical_events', 'topics'],
      dtype='object')

In [11]:
content_base_cid = dict(zip(content.base_path, content.content_id))

In [12]:
content['embedded_links'] = content['embedded_links'].map(literal_eval)

# Map each embedded base path to its content ID, skipping paths we cannot resolve
content['embedded_cids'] = content['embedded_links'].map(
    lambda links: [content_base_cid[link] for link in links if link in content_base_cid])

In [13]:
cid_link_cids = dict(zip(content.content_id, content.embedded_cids))

In [14]:
cid_link_cids_with_emb = {k: v for k, v in cid_link_cids.items() if len(v) > 0}

In [15]:
len(cid_link_cids_with_emb.keys())

2236

## Get titles and base paths
For each content ID get its title and base_path from `content_json_extended.csv.gz` so that humans can assess our links

In [16]:
cids_titles = dict(zip(content.content_id, content.title))

In [17]:
cids_paths = dict(zip(content.content_id, content.base_path))

In [42]:
df_llr_recs.head()

Unnamed: 0,page_1,page_2,page_1_occurrences,co_occurrences,llr_score,rank
0,00015d3f-e7d9-48e8-95ff-ac3f7fa07be3,8ec6ea6e-a259-459a-9e5e-ffab399a2e2e,1,1,778832400.0,1
1,000227a8-f0d2-417d-8ce4-27a18d62d442,f79797b8-93a8-4826-a3fa-91488bfbb73d,30,1,1272172000.0,1
2,000227a8-f0d2-417d-8ce4-27a18d62d442,60556764-7631-11e4-a3cb-005056011aef,30,1,1266232000.0,2
3,000227a8-f0d2-417d-8ce4-27a18d62d442,bd3dad18-b6ed-4f4c-b2d4-1cd4caa9faa8,30,1,1263362000.0,3
4,000227a8-f0d2-417d-8ce4-27a18d62d442,5fe61962-7631-11e4-a3cb-005056011aef,30,1,1263362000.0,3


Create a dictionary `related_links` from `df_llr_recs`. The keys are the content IDs of each page; the values are lists of the recommended links, each with its LLR score, its rank from the SQL, and its co_occurrence count.

In [43]:
related_links = {}

for row in df_llr_recs.itertuples():
    # Group recommendations by the page they were generated for (page_1)
    related_links.setdefault(row.page_1, []).append(
        {'link': row.page_2, 'llr_score': row.llr_score,
         'rank': row.rank, 'co_occurrences': row.co_occurrences})

In [44]:
def get_page_title(content_id):
    try:
        return cids_titles[content_id]
    except KeyError:
        return 'unknown'

def get_page_url(content_id):
    try:
        return f"www.gov.uk{cids_paths[content_id]}"
    except KeyError:
        return 'unknown'

## Remove embedded links
Given a dictionary mapping content IDs to related links and a dictionary mapping content IDs to embedded links, remove from each page's related links any link that is already embedded in that page.

In [45]:
def remove_embedded(related_links_dict, cid_link_cids_with_emb):
    """Drop, in place, any recommended link that is already embedded in the page."""
    for content_id, embedded_links in cid_link_cids_with_emb.items():
        if content_id not in related_links_dict:
            continue
        embedded = set(embedded_links)
        related_links_dict[content_id] = [
            rec for rec in related_links_dict[content_id] if rec['link'] not in embedded]

In [46]:
remove_embedded(related_links, cid_link_cids_with_emb)

## Get top n, add titles and base paths
Using a clean dictionary of related links (no replication of embedded links), iterate through the dictionary:
 - sort the recommended related links by their rank, and take the top n
 - add titles and base paths for the pages and their related links
 - return a DataFrame with all this info

In [13]:
def get_top_n(related_links_dict, n=10):
    pages_links = []
    for page, recs in related_links_dict.items():
        title = get_page_title(page)
        base_path = get_page_url(page)
        sorted_recs = sorted(recs, key=lambda k: k['rank'])[:n]
        for rec in sorted_recs:
            page_link = {"title": title,
                         "link_title": get_page_title(rec['link']),
                         "base_path": base_path,
                         "link_base_path": get_page_url(rec['link']),
                         "page": page}
            page_link.update(rec)
            pages_links.append(page_link)
    return pd.DataFrame(pages_links)
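
Ranks from the SQL can tie (rank 3 appears twice in the sample output above), so cutting at n may keep only some of the links that share the last rank. The sketch below shows this on invented records, using `heapq.nsmallest` as an alternative way to sort and slice; `top_n_by_rank` is a hypothetical helper, not part of the notebook.

```python
import heapq


def top_n_by_rank(recs, n=10):
    """Return the n recommendations with the lowest rank values."""
    return heapq.nsmallest(n, recs, key=lambda rec: rec['rank'])


# Hypothetical recommendations for one page; two of them share rank 3.
recs = [
    {'link': 'x', 'rank': 3},
    {'link': 'y', 'rank': 1},
    {'link': 'z', 'rank': 3},
    {'link': 'w', 'rank': 2},
]

top_3 = top_n_by_rank(recs, n=3)
print([rec['rank'] for rec in top_3])
```

Only one of the two rank-3 links survives the cut, so if ties matter, a secondary sort key such as `llr_score` may be worth adding.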

In [48]:
top_10_df = get_top_n(related_links, n=10)

In [49]:
top_10_df

Unnamed: 0,base_path,co_occurrences,link,link_base_path,link_title,llr_score,page,rank,title
0,www.gov.uk/government/statistics/uk-consumer-p...,1,8ec6ea6e-a259-459a-9e5e-ffab399a2e2e,www.gov.uk/tax-codes,Tax codes,7.788324e+08,00015d3f-e7d9-48e8-95ff-ac3f7fa07be3,1,UK consumer price inflation: Dec 2017
1,www.gov.uk/government/publications/foi-respons...,1,f79797b8-93a8-4826-a3fa-91488bfbb73d,www.gov.uk/government/publications/foi-respons...,FOI responses published by MOD: week commencin...,1.272172e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,1,FOI responses published by MOD: week commencin...
2,www.gov.uk/government/publications/foi-respons...,1,60556764-7631-11e4-a3cb-005056011aef,www.gov.uk/government/publications/foi-respons...,FOI responses released by MOD: week commencing...,1.266232e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,2,FOI responses published by MOD: week commencin...
3,www.gov.uk/government/publications/foi-respons...,1,bd3dad18-b6ed-4f4c-b2d4-1cd4caa9faa8,www.gov.uk/government/publications/foi-respons...,FOI responses published by MOD: week commencin...,1.263362e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,3,FOI responses published by MOD: week commencin...
4,www.gov.uk/government/publications/foi-respons...,1,5fe61962-7631-11e4-a3cb-005056011aef,www.gov.uk/government/publications/foi-respons...,FOI responses released by MOD: week commencing...,1.263362e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,3,FOI responses published by MOD: week commencin...
5,www.gov.uk/government/publications/foi-respons...,1,5feec55f-7631-11e4-a3cb-005056011aef,www.gov.uk/government/publications/foi-respons...,FOI responses released by MOD: week commencing...,1.260554e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,5,FOI responses published by MOD: week commencin...
6,www.gov.uk/government/publications/foi-respons...,1,fd945ca9-1dc2-49c7-83f0-5c0710454244,www.gov.uk/government/publications/foi-respons...,FOI responses published by MOD: week commencin...,1.252478e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,6,FOI responses published by MOD: week commencin...
7,www.gov.uk/government/publications/foi-respons...,1,276417aa-205a-4bfe-bc39-edae27d528c7,www.gov.uk/government/publications/foi-respons...,FOI responses released by MOD: week commencing...,1.249895e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,7,FOI responses published by MOD: week commencin...
8,www.gov.uk/government/publications/foi-respons...,1,e2aed688-aefc-4ba2-ab11-7181b2b27af0,www.gov.uk/government/publications/foi-respons...,FOI responses published by MOD: week commencin...,1.249895e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,7,FOI responses published by MOD: week commencin...
9,www.gov.uk/government/publications/foi-respons...,1,ce6821d0-c340-4541-813e-14ec3ac3a13f,www.gov.uk/government/publications/foi-respons...,FOI responses published by MOD: week commencin...,1.247362e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,9,FOI responses published by MOD: week commencin...


In [50]:
top_10_df.shape

(1081337, 9)

In [51]:
top_10_df.query('base_path != "unknown"').shape

(907645, 9)

# Get related links for colleagues to review

## Top 10 links for the 200 pages with the most occurrences (as page 1 in a page1->page2 pair)

In [52]:
grouped_df = df_llr_recs.groupby('page_1').max()

In [53]:
top_200_pages = grouped_df.sort_values('page_1_occurrences', ascending=False).reset_index()['page_1'][:200].tolist()

In [54]:
def in_top_200(page):
    return page in top_200_pages

In [55]:
top_10_df['is_top_200'] = top_10_df['page'].map(in_top_200)

In [59]:
top_10_df[top_10_df['is_top_200']][
    ['title', 'link_title', 'base_path', 'link_base_path', 'page', 'link',
     'llr_score','rank','co_occurrences'
    ]].to_csv(
    os.path.join(DATA_DIR, 'llr', 
                 'llr_4_top_200_sequenced_filtered20190218_20190310.csv'), index=False)

In [57]:
top_10_df['page'].nunique()

146888

In [58]:
top_10_df.sort_values('llr_score', ascending=True).head(1000)

Unnamed: 0,base_path,co_occurrences,link,link_base_path,link_title,llr_score,page,rank,title,is_top_200
557857,www.gov.uk/government/publications/help-to-buy...,1,632d1ae0-0340-4a23-87b0-595bb596e5f4,www.gov.uk/log-in-register-hmrc-online-services,HMRC services: sign in or register,1.313563e+07,601b8e05-7631-11e4-a3cb-005056011aef,10,Help to Buy (equity loan): participation and r...,False
333688,www.gov.uk/government/publications/the-nationa...,1,fa748fae-3de4-4266-ae85-0797ada3f40c,www.gov.uk/vehicle-tax,Tax your vehicle,2.402697e+07,5d68d742-7631-11e4-a3cb-005056011aef,8,The national crime recording standard (NCRS): ...,False
956012,www.gov.uk/government/news/click-to-accept-jur...,1,f4f2c99e-6d0a-4f2f-b3d0-ff2d00e33887,www.gov.uk/personal-tax-account,Personal tax account: sign in or set up,3.229945e+07,d74f517a-2c49-4354-a54b-8bb8377b4bdf,9,Click to accept jury service in an instant,False
161476,www.gov.uk/government/news/air-safety-informat...,1,b220a437-0d51-4390-9993-63345d0c83ad,www.gov.uk/sign-in-universal-credit,Sign in to your Universal Credit account,3.865114e+07,347df46e-bdc8-4264-9b26-6e808ff812f5,5,Air Safety Information Management System (ASIM...,False
422065,www.gov.uk/government/news/francis-report-on-m...,1,a350d322-669c-4acc-8b54-08c342afd569,www.gov.uk/apply-renew-passport,Apply online for a UK passport,4.326588e+07,5ec50149-7631-11e4-a3cb-005056011aef,9,Francis report on Mid Staffs: government accep...,False
157361,www.gov.uk/government/publications/dwp-debt-ma...,1,b220a437-0d51-4390-9993-63345d0c83ad,www.gov.uk/sign-in-universal-credit,Sign in to your Universal Credit account,4.950627e+07,3332951c-a6c3-4a81-a786-c41a0755c476,9,DWP Debt Management performance data,False
339275,www.gov.uk/government/news/functions-of-clinic...,1,cde068da-883e-42e1-a54d-802ebb7bb57a,www.gov.uk/check-state-pension,Check your State Pension,6.112399e+07,5d889a59-7631-11e4-a3cb-005056011aef,6,Functions of clinical commissioning groups,False
925886,www.gov.uk/government/news/gp-indemnity-develo...,1,f790dc71-386e-4440-9689-31f94e7ac64d,www.gov.uk/universal-credit,Universal Credit,6.401546e+07,cd4fa20a-5f62-46a1-a479-8d1e2f4ebf84,5,GP indemnity: development of state-backed sche...,False
161461,www.gov.uk/government/publications/simple-paym...,1,1a1bc147-c02b-4cad-a415-ccbcecec8cbf,www.gov.uk/benefits-calculators,Benefits calculators,6.552996e+07,347d93b2-cf9e-480f-9eae-5f028ebdcdd1,9,"Simple Payment for benefits, pensions and chil...",False
265108,unknown,1,b220a437-0d51-4390-9993-63345d0c83ad,www.gov.uk/sign-in-universal-credit,Sign in to your Universal Credit account,6.849857e+07,568dea37-80dc-426a-8fa6-bb6de60c6b95,9,unknown,False


In [60]:
top_10_df.shape

(1081337, 10)

In [63]:
top_10_df['base_path'].nunique()

122014

In [62]:
top_10_df[top_10_df['co_occurrences'] > 1].shape

(427877, 10)

In [64]:
top_10_df[top_10_df['co_occurrences'] > 1]['base_path'].nunique()

61607

In [77]:
top_10_df[top_10_df['co_occurrences'] > 2]['base_path'].nunique()

42375

## Calculate `ln(llr_score)` as it's easier to interpret

In [66]:
top_10_df['ln_llr_score'] = top_10_df['llr_score'].map(math.log)
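
As a quick check of the transformation, the sketch below applies the natural log to two scores copied from the outputs in this notebook. Raw scores span several orders of magnitude, and the log compresses them into a much narrower range.

```python
import math

# Scores copied from the outputs in this notebook.
lowest_score = 1.313563e+07
first_row_score = 778832400.0

print(round(math.log(lowest_score), 2))     # 16.39
print(round(math.log(first_row_score), 2))  # 20.47
```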

In [67]:
top_10_df.head()

Unnamed: 0,base_path,co_occurrences,link,link_base_path,link_title,llr_score,page,rank,title,is_top_200,ln_llr_score
0,www.gov.uk/government/statistics/uk-consumer-p...,1,8ec6ea6e-a259-459a-9e5e-ffab399a2e2e,www.gov.uk/tax-codes,Tax codes,778832400.0,00015d3f-e7d9-48e8-95ff-ac3f7fa07be3,1,UK consumer price inflation: Dec 2017,False,20.473306
1,www.gov.uk/government/publications/foi-respons...,1,f79797b8-93a8-4826-a3fa-91488bfbb73d,www.gov.uk/government/publications/foi-respons...,FOI responses published by MOD: week commencin...,1272172000.0,000227a8-f0d2-417d-8ce4-27a18d62d442,1,FOI responses published by MOD: week commencin...,False,20.963991
2,www.gov.uk/government/publications/foi-respons...,1,60556764-7631-11e4-a3cb-005056011aef,www.gov.uk/government/publications/foi-respons...,FOI responses released by MOD: week commencing...,1266232000.0,000227a8-f0d2-417d-8ce4-27a18d62d442,2,FOI responses published by MOD: week commencin...,False,20.959311
3,www.gov.uk/government/publications/foi-respons...,1,bd3dad18-b6ed-4f4c-b2d4-1cd4caa9faa8,www.gov.uk/government/publications/foi-respons...,FOI responses published by MOD: week commencin...,1263362000.0,000227a8-f0d2-417d-8ce4-27a18d62d442,3,FOI responses published by MOD: week commencin...,False,20.957042
4,www.gov.uk/government/publications/foi-respons...,1,5fe61962-7631-11e4-a3cb-005056011aef,www.gov.uk/government/publications/foi-respons...,FOI responses released by MOD: week commencing...,1263362000.0,000227a8-f0d2-417d-8ce4-27a18d62d442,3,FOI responses published by MOD: week commencin...,False,20.957042


In [71]:
top_10_df.ln_llr_score.describe()

count    1.081337e+06
mean     2.154350e+01
std      1.038929e+00
min      1.639084e+01
25%      2.096136e+01
50%      2.116171e+01
75%      2.180348e+01
max      3.200472e+01
Name: ln_llr_score, dtype: float64

In [97]:
top_10_df[top_10_df['co_occurrences'] == 1].shape

(653460, 11)

In [75]:
top_10_df[top_10_df['co_occurrences'] > 1].sort_values('llr_score', ascending=True)

Unnamed: 0,base_path,co_occurrences,link,link_base_path,link_title,llr_score,page,rank,title,is_top_200,ln_llr_score
624430,unknown,2,a350d322-669c-4acc-8b54-08c342afd569,www.gov.uk/apply-renew-passport,Apply online for a UK passport,2.282573e+08,69e7d4b3-1712-4d42-9e3b-b29849f59a30,8,unknown,False,19.245984
580875,www.gov.uk/government/publications/vat-applica...,2,632d1ae0-0340-4a23-87b0-595bb596e5f4,www.gov.uk/log-in-register-hmrc-online-services,HMRC services: sign in or register,2.638685e+08,602ddc33-7631-11e4-a3cb-005056011aef,10,VAT: application for economic operator registr...,False,19.390961
977232,unknown,2,b220a437-0d51-4390-9993-63345d0c83ad,www.gov.uk/sign-in-universal-credit,Sign in to your Universal Credit account,2.856630e+08,de21a8c8-f7e0-454a-a195-f8101ada2819,10,unknown,False,19.470323
170847,www.gov.uk/government/news/dwp-announces-extra...,2,cde068da-883e-42e1-a54d-802ebb7bb57a,www.gov.uk/check-state-pension,Check your State Pension,3.139430e+08,378d4836-f1db-41e9-a7ea-9e2c1d57fe6f,10,DWP announces extra support for armed forces s...,False,19.564722
378524,www.gov.uk/government/news/change-to-passport-...,2,a350d322-669c-4acc-8b54-08c342afd569,www.gov.uk/apply-renew-passport,Apply online for a UK passport,3.173728e+08,5e2cdbec-7631-11e4-a3cb-005056011aef,9,Change to passport service for British nationa...,False,19.575588
402532,www.gov.uk/government/news/69-million-more-for...,2,f790dc71-386e-4440-9689-31f94e7ac64d,www.gov.uk/universal-credit,Universal Credit,3.388015e+08,5e900f66-7631-11e4-a3cb-005056011aef,9,£69 million more for Start-Up Loans and New En...,False,19.640925
561367,unknown,2,a350d322-669c-4acc-8b54-08c342afd569,www.gov.uk/apply-renew-passport,Apply online for a UK passport,3.479888e+08,601f62a1-7631-11e4-a3cb-005056011aef,9,unknown,False,19.667681
478773,www.gov.uk/government/publications/updated-vis...,2,e1067450-7d13-45ff-ada4-5e3dd4025fb7,www.gov.uk/standard-visitor-visa,Standard Visitor visa,3.558906e+08,5f4e295b-7631-11e4-a3cb-005056011aef,10,Updated Visa Fees,False,19.690134
879421,www.gov.uk/government/news/dvla-and-mib-announ...,2,0889f128-e479-465f-b3e1-a3db6a3879cf,www.gov.uk/check-vehicle-tax,Check if a vehicle is taxed,3.561401e+08,be3e8ad3-87ac-4fa0-9e2d-c97d4864335e,10,DVLA and MIB announce the launch of MyLicence ...,False,19.690835
374557,www.gov.uk/government/news/online-payment-opti...,2,e1067450-7d13-45ff-ada4-5e3dd4025fb7,www.gov.uk/standard-visitor-visa,Standard Visitor visa,3.833804e+08,5e17c0b3-7631-11e4-a3cb-005056011aef,10,Online payment option introduced for UK visa a...,False,19.764538


In [76]:
top_10_df.co_occurrences.describe()

count    1.081337e+06
mean     2.118323e+01
std      5.042700e+02
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      3.000000e+00
max      1.847650e+05
Name: co_occurrences, dtype: float64

## Get top 7 links for each page, split into co_occurrences > 1 and co_occurrences = 1
Save both sets to show that pairs with co_occurrences = 1 lead to some bad links

In [78]:
top_7_df = get_top_n(related_links, n=7)

In [87]:
top_7_df[(top_7_df['co_occurrences'] > 1) ].shape


(334638, 9)

In [86]:
top_7_df[(top_7_df['co_occurrences'] > 1) & (top_7_df['base_path'] != 'unknown')].shape
#          (top_7_df['co_occurrences'] < 1000)]

(280220, 9)

In [94]:
top_7_df[(top_7_df['co_occurrences'] > 1) & (top_7_df['co_occurrences'] < 10)].shape

(251551, 9)

In [96]:
top_7_df[(top_7_df['co_occurrences'] > 1) & (top_7_df['base_path'] != 'unknown')][['title', 'link_title', 'base_path', 'link_base_path', 
     'llr_score','co_occurrences'
    ]].to_csv(
    os.path.join(DATA_DIR, 'llr', 
                 'llr_4_top_7_coocurrence_gt_1_sequenced_filtered20190218_20190310.csv'), index=False)

In [101]:
top_7_df[top_7_df['co_occurrences'] == 1].sort_values('llr_score', ascending=False)[
    ['title', 'link_title', 'base_path', 'link_base_path', 'page', 'link',
     'llr_score','rank','co_occurrences'
    ]].head(100).to_csv(
    os.path.join(DATA_DIR, 'llr', 
                 'llr_4_top_7_single_coocurrence_highest_scores_20190218_20190310.csv'), index=False)

In [102]:
top_7_df[top_7_df['co_occurrences'] == 1].sort_values('llr_score', ascending=True)[
    ['title', 'link_title', 'base_path', 'link_base_path', 'page', 'link',
     'llr_score','rank','co_occurrences'
    ]].head(100).to_csv(
    os.path.join(DATA_DIR, 'llr', 
                 'llr_4_top_7_single_coocurrence_lowest_scores_20190218_20190310.csv'), index=False)

In [104]:
top_7_df[(top_7_df['co_occurrences'] > 1)]['page'].nunique()

72213

In [106]:
top_7_df[(top_7_df['co_occurrences'] > 1)]['base_path'].nunique()

61559

## Get top 10 links per page for the most popular pages based on some GA data
i.e. get the top 10 links per page for a specified list of base paths

In [107]:
top_pages_GA = pd.read_csv(os.path.join(DATA_DIR, 'metadata', 'top_pages.csv'))

In [111]:
top_pages = top_pages_GA['Page'].tolist()

In [112]:
def is_top_page(pagepath):
    return pagepath in top_pages

In [116]:
top_10_df['is_top_page'] = top_10_df['base_path'].map(is_top_page)

In [117]:
top_10_df[top_10_df['is_top_page']].shape

(1470, 12)

In [118]:
top_10_df[top_10_df['is_top_page']][
    ['title', 'link_title', 'base_path', 'link_base_path', 'page', 'link',
     'llr_score','rank','co_occurrences'
    ]].to_csv(
    os.path.join(DATA_DIR, 'llr', 
                 'llr_4_top_10_top_GA_pages_20190218_20190310.csv'), index=False)