In this notebook we generate related links for pages using functional data (what people view on GOV.UK) and the log-likelihood ratio (LLR).

LLR looks at how often pages A and B are viewed together, and also at how often each page is viewed in general. If page A co-occurs at similar rates with B and C, but C is less popular sitewide, C is seen as a more relevant recommendation for A.

A blog about LLR: http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
Paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.5962&rep=rep1&type=pdf
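
To make the intuition above concrete, here is a minimal sketch of LLR for a 2x2 table of session counts, using the entropy-based form that is commonly used for Dunning's test. The function names and the toy counts are made up for illustration, and the SQL used later in this notebook may compute or scale the score differently.

```python
import math


def x_log_x(x):
    """Return x * ln(x), treating 0 * ln(0) as 0."""
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    """Unnormalised Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of session counts.

    k11: sessions with both A and B
    k12: sessions with A but not B
    k21: sessions with B but not A
    k22: sessions with neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # rounding can push a true zero slightly negative, so clamp at 0
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


# Toy data: 10,000 sessions, page A is viewed in 100 of them.
# A co-occurs 20 times with B and 20 times with C,
# but B is viewed in 2,000 sessions overall and C in only 100.
score_popular = llr(20, 80, 1980, 7920)  # A with the popular page B
score_niche = llr(20, 80, 80, 9820)      # A with the less popular page C
print(score_popular, score_niche)
```

In this toy table B turns up alongside A exactly as often as it turns up anywhere else, so its score is (close to) zero, while C scores much higher despite having the same co-occurrence count.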


Some notes on what we do:
- LLR by content ID
- look at page hits sequentially: only link to pages viewed AFTER the page in question
- filter out "finding" pages: `document_type` in ['document_collection', 'finder', 'homepage', 'license_finder', 'mainstream_browse_page', 'organisation', 'search', 'service_manual_homepage', 'service_manual_topic', 'services_and_information', 'taxon', 'topic', 'topical_event']
- filter out the `document_type`s 'fatality_notice', 'contact', 'service_sign_in', 'html_publication', 'calculator', 'completed_transaction'
- filter out embedded links that already exist on pages

# Setup

In [1]:
import os 
import math
from ast import literal_eval
import pandas as pd
import json

In [2]:
# set up pandas, bigquery, data dir stuff
ProjectID = 'govuk-bigquery-analytics'
KEY_DIR = os.getenv("BQ_KEY_DIR")
key_file_path = os.path.join(KEY_DIR, os.listdir(KEY_DIR)[0])

DATA_DIR = os.getenv("DATA_DIR")

# Run LLR Query
Node2Vec used 3 weeks of data, so we use 3 weeks here too, to give LLR the same chance of capturing the functional network.

The query sequences pageviews using hitNumber, so we look at pages viewed *after* other pages.

You have to substitute in `START_DATE` and `END_DATE`.
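
As a sketch of how the two dates could be derived rather than typed by hand, the helper below builds a window of whole weeks ending on a given day. `date_window` and `example_query` are hypothetical names, and the stand-in query text is not the real SQL. It also assumes both ends of the window are inclusive, which should be checked against the query.

```python
from datetime import date, timedelta


def date_window(end_date, weeks=3):
    """Return (START_DATE, END_DATE) as YYYYMMDD strings, both ends inclusive."""
    start_date = end_date - timedelta(days=weeks * 7 - 1)
    return start_date.strftime('%Y%m%d'), end_date.strftime('%Y%m%d')


# stand-in for the real SQL file; only the two placeholders are taken from the notebook
example_query = "-- sessions between {START_DATE} and {END_DATE}"

start, end = date_window(date(2019, 3, 14), weeks=3)
print(example_query.format(START_DATE=start, END_DATE=end))
```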

In [11]:
# read the query text; the context manager closes the file for us
with open('../queries/llr_pages_content_id_split_up_hitNumber.sql', 'r') as file_object:
    QUERY = file_object.read()

In [12]:
print(QUERY)

WITH
  -- for all sessions between START_DATE and END_DATE, for each content ID viewed, get the
  -- lowest hitNumber in that session when it was viewed (because we want to see what pages
    -- were viewed after what other pages)
  -- we could try getting all hitNumbers each piece of content was viewed at, e.g. a session
  -- includes page X -> page Y -> page X, does that mean we should recommmend page Y on page X, and vice versa?
  session_pages AS (
  SELECT
    CONCAT(fullVisitorId,"-",CAST(visitId AS STRING)) AS sessionId,
    content_id,
    MIN(hitNumber) AS hitNumber,
    -- is_news_item based on news_and_communications supergroup here https://docs.publishing.service.gov.uk/document-types/content_purpose_supergroup.html
    -- we only want to recommend news items on news item pages, not on any other pages
    MAX(IF(document_type IN ('medical_safety_alert',
          'drug_safety_update',
          'news_article',
          'news_story',
          'press_release',
          'wo

In [13]:
print(QUERY.format(START_DATE='20190222', END_DATE='20190314'))

WITH
  -- for all sessions between START_DATE and END_DATE, for each content ID viewed, get the
  -- lowest hitNumber in that session when it was viewed (because we want to see what pages
    -- were viewed after what other pages)
  -- we could try getting all hitNumbers each piece of content was viewed at, e.g. a session
  -- includes page X -> page Y -> page X, does that mean we should recommmend page Y on page X, and vice versa?
  session_pages AS (
  SELECT
    CONCAT(fullVisitorId,"-",CAST(visitId AS STRING)) AS sessionId,
    content_id,
    MIN(hitNumber) AS hitNumber,
    -- is_news_item based on news_and_communications supergroup here https://docs.publishing.service.gov.uk/document-types/content_purpose_supergroup.html
    -- we only want to recommend news items on news item pages, not on any other pages
    MAX(IF(document_type IN ('medical_safety_alert',
          'drug_safety_update',
          'news_article',
          'news_story',
          'press_release',
          'wo

In [14]:
df_llr_recs = pd.io.gbq.read_gbq(
    QUERY.format(START_DATE='20190222', END_DATE='20190314'),
    project_id=ProjectID,
    reauth=False,
    verbose=True,
    private_key=key_file_path,
    dialect='standard')



In [15]:
df_llr_recs.to_csv(os.path.join(DATA_DIR, 'llr', 'llr_recs_20190222_20190314.csv.gz'),
                     compression='gzip', index=False)

In [3]:
# df_llr_recs = pd.read_csv(os.path.join(DATA_DIR, 'llr', 'llr_recs_20190222_20190314.csv.gz'),
#                      compression='gzip')

In [4]:
df_llr_recs.head()

Unnamed: 0,page_1,page_2,page_1_occurrences,co_occurrences,llr_score,rank
0,0000d0a0-037a-4110-a271-24327f422d06,5f563505-7631-11e4-a3cb-005056011aef,1,1,1212560000.0,1
1,00015d3f-e7d9-48e8-95ff-ac3f7fa07be3,8ec6ea6e-a259-459a-9e5e-ffab399a2e2e,1,1,740109400.0,1
2,000227a8-f0d2-417d-8ce4-27a18d62d442,63fdec18-9da3-4621-9621-8f7424d57f3b,47,2,2389898000.0,1
3,000227a8-f0d2-417d-8ce4-27a18d62d442,276417aa-205a-4bfe-bc39-edae27d528c7,47,2,2378549000.0,2
4,000227a8-f0d2-417d-8ce4-27a18d62d442,5fa6f501-7631-11e4-a3cb-005056011aef,47,2,2360718000.0,3


This is quite a big DataFrame at the moment: I've taken up to 100 links for each page because I don't know how many will be filtered out when embedded links are removed. The data also covers 3 weeks, so the number of occurrences for each page will be fairly high.

In [5]:
print(df_llr_recs.shape)
print(df_llr_recs[df_llr_recs['co_occurrences']>1].shape)
print(df_llr_recs[df_llr_recs['co_occurrences']>10].shape)
print(df_llr_recs[df_llr_recs['co_occurrences']>100].shape)

(4667872, 6)
(1408782, 6)
(237837, 6)
(41982, 6)


In [6]:
df_llr_recs['page_1'].nunique()

147215

# Get top n related links, without embedded links
Filter out embedded links that already exist in pages, and get the top n related links for each page

## Get embedded links
Get the embedded links for each page from `content_json_extended.csv.gz` (created in [govuk-network-embedding](https://github.com/alphagov/govuk-network-embedding/blob/2nd-iteration/notebooks/data_extract/extract_links_from_json.ipynb)).

In [7]:
# get previously extracted content API data
content_json_file = os.path.join(DATA_DIR, 'content_api', "11-02-19",
                                 "content_json_extended.csv.gz")
content = pd.read_csv(content_json_file, compression="gzip")

In [8]:
content.columns

Index(['base_path', 'body', 'children', 'collection_links', 'content_id',
       'description', 'document_collections', 'document_type',
       'embedded_links', 'field_of_operation', 'finder', 'first_published_at',
       'lead_organisations', 'locale', 'mainstream_browse_pages', 'ministers',
       'ordered_related_items_overrides', 'organisations',
       'pages_part_of_step_nav', 'pages_related_to_step_nav', 'parent',
       'part_of_step_navs', 'people', 'policy_areas',
       'primary_publishing_organisation', 'publishing_app', 'related_guides',
       'related_links', 'related_mainstream', 'related_mainstream_content',
       'related_policies', 'related_statistical_data_sets',
       'related_to_step_navs', 'roles', 'sections', 'speaker', 'title',
       'topical_events', 'topics'],
      dtype='object')

In [9]:
# dictionary of base paths to content IDs
content_base_cid = dict(zip(content.base_path, content.content_id))

In [10]:
content['embedded_links'] = content['embedded_links'].map(literal_eval)

# for all the embedded link URLs in a row, get their content IDs if they are in content_base_cid
content['embedded_cids'] = content['embedded_links'].map(
    lambda x: [content_base_cid[l] for l in x if l in content_base_cid])
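
The same mapping on a couple of made-up rows may make the step easier to follow. This is only a sketch: the paths and content IDs are invented, and it assumes `embedded_links` is stored in the CSV as the string form of a Python list, which is why `literal_eval` is needed.

```python
from ast import literal_eval

# hypothetical rows: (base_path, content_id, embedded_links as stored in the CSV)
rows = [
    ('/page-a', 'cid-a', "['/page-b', '/not-in-extract']"),
    ('/page-b', 'cid-b', "[]"),
]

base_path_to_cid = {base_path: cid for base_path, cid, _ in rows}

embedded_cids = {}
for base_path, cid, raw_links in rows:
    links = literal_eval(raw_links)  # string "['/x']" -> list ['/x']
    # keep only links whose base path we can resolve to a content ID
    embedded_cids[cid] = [base_path_to_cid[l] for l in links if l in base_path_to_cid]

print(embedded_cids)
```

Links to pages that are not in the extract are silently dropped, so they cannot be filtered out of the recommendations later.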

In [11]:
# dictionary of content IDs and the content IDs of the embedded links in them
cid_link_cids = dict(zip(content.content_id, content.embedded_cids))

In [12]:
# dictionary of content IDs and the content IDs of the embedded links in them,
# only if there are embedded links
cid_link_cids_with_emb = {k: v for k, v in cid_link_cids.items() if len(v) > 0}

In [13]:
len(cid_link_cids_with_emb.keys())

2236

## Get titles and base paths
For each content ID get its title and base_path from `content_json_extended.csv.gz` so that humans can assess our links

In [14]:
cids_titles = dict(zip(content.content_id, content.title))

In [15]:
cids_paths = dict(zip(content.content_id, content.base_path))

Create a dictionary `related_links` from `df_llr_recs`: the keys are the content IDs of each page, and the values are lists of the recommended links, with their LLR scores, ranks from the SQL, and co_occurrence counts.

In [19]:
df_llr_recs.columns

Index(['page_1', 'page_2', 'page_1_occurrences', 'co_occurrences', 'llr_score',
       'rank'],
      dtype='object')

In [20]:
related_links = {}

for row in df_llr_recs.itertuples():
    if row.page_1 not in related_links:
        related_links[row.page_1] = list()
    related_links[row.page_1].append(
        {'link': row.page_2, 'llr_score': row.llr_score, 'rank': row.rank, 
         'co_occurrences': row.co_occurrences})
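
For reference, the resulting structure looks like this on a few invented rows. The sketch uses `collections.defaultdict` in place of the explicit membership check, and a namedtuple stands in for the rows that `itertuples()` yields. All names and values here are illustrative.

```python
from collections import defaultdict, namedtuple

# stand-in for the rows yielded by df_llr_recs.itertuples()
Rec = namedtuple('Rec', ['page_1', 'page_2', 'llr_score', 'rank', 'co_occurrences'])
toy_rows = [
    Rec('cid-a', 'cid-b', 9.5, 1, 4),
    Rec('cid-a', 'cid-c', 7.0, 2, 2),
    Rec('cid-b', 'cid-a', 3.2, 1, 4),
]

toy_related_links = defaultdict(list)
for row in toy_rows:
    toy_related_links[row.page_1].append(
        {'link': row.page_2, 'llr_score': row.llr_score, 'rank': row.rank,
         'co_occurrences': row.co_occurrences})

print(dict(toy_related_links))
```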

In [21]:
def get_page_title(content_id):
    try:
        return cids_titles[content_id]
    except KeyError:
        return 'unknown'

def get_page_url(content_id):
    try:
        return f"www.gov.uk{cids_paths[content_id]}"
    except KeyError:
        return 'unknown'

## Remove embedded links
Given a dictionary of content ID to related links and a dictionary of content ID to embedded links, remove from each page's related links any link that is already embedded in that page.

In [48]:
def remove_embedded(related_links_dict, cid_link_cids_with_emb):
    # drop, in place, any recommended link that is already embedded in the page
    for content_id, embedded_links in cid_link_cids_with_emb.items():
        if content_id not in related_links_dict:
            continue
        embedded = set(embedded_links)
        related_links_dict[content_id] = [
            recommended_page
            for recommended_page in related_links_dict[content_id]
            if recommended_page['link'] not in embedded]

In [49]:
remove_embedded(related_links, cid_link_cids_with_emb)
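
A small worked example of the filtering, on invented content IDs. `drop_embedded` is a hypothetical, non-mutating variant written just for this illustration. It returns a new dictionary instead of editing the input, which makes the before and after easy to compare.

```python
def drop_embedded(related, embedded):
    """Return a copy of `related` without links already embedded in each page."""
    cleaned = {}
    for content_id, recs in related.items():
        already_linked = set(embedded.get(content_id, []))
        cleaned[content_id] = [rec for rec in recs
                               if rec['link'] not in already_linked]
    return cleaned


toy_related = {
    'cid-a': [{'link': 'cid-b', 'rank': 1}, {'link': 'cid-c', 'rank': 2}],
    'cid-b': [{'link': 'cid-a', 'rank': 1}],
}
# page cid-a already links to cid-b in its body; cid-z has no recommendations
toy_embedded = {'cid-a': ['cid-b'], 'cid-z': ['cid-a']}

toy_cleaned = drop_embedded(toy_related, toy_embedded)
print(toy_cleaned)
```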

## Remove links for some excluded pages
We don't want to generate related links for some pages, so remove them here

In [24]:
PAGES_TO_EXCLUDE = [
    # https://www.gov.uk/contact-student-finance-england
    'd490be5f-1998-4f20-ab52-d3dd5db7fa71',
    # https://www.gov.uk/student-finance-forms
    '37e27ec1-ef3e-4b3c-a2f6-14dc42c4c162',
    # https://www.gov.uk/funding-for-postgraduate-study
    'bcd1365f-8496-40f5-b9cf-06f2493e48c4',
    # https://www.gov.uk/advanced-learner-loan
    '23eee5eb-7e24-4a7f-bf92-112f8c8132bc',
    # https://www.gov.uk/teacher-training-funding
    '708334c4-2855-4d45-b311-72a26b03529a',
    # https://www.gov.uk/extra-money-pay-university
    'eff0d788-3b5e-4090-8b56-aaa0a4bf3f25',
    # https://www.gov.uk/nhs-bursaries
    '6a131bf3-6d52-4512-98cc-348c243fba8f',
    # https://www.gov.uk/travel-grants-students-england
    '05aaf43f-7555-46b3-99e5-a34e48b0eec9',
    # https://www.gov.uk/career-development-loans
    'cd69a882-474a-4dd9-9689-adde2fcd618c',
    # https://www.gov.uk/care-to-learn
    'c1873350-0c82-469d-9f59-4658be95fdd2',
    # https://www.gov.uk/dance-drama-awards
    '04d10810-8620-4edd-a06d-f5ecfc199414',
    # https://www.gov.uk/social-work-bursaries
    'eeeee6fc-5f61-4b8a-bb41-efe12bad9e2c',
    # https://www.gov.uk/student-finance-if-you-suspend-or-leave
    '2585264f-9562-4d4b-bbd0-4dae7b7bc2b1']

In [25]:
def remove_links_from_pages(related_links_dict, pages_to_exclude):
    for content_id in pages_to_exclude:
        try:
            del related_links_dict[content_id]
        except KeyError:
            continue

In [26]:
remove_links_from_pages(related_links, PAGES_TO_EXCLUDE)

In [27]:
len(related_links)

147202

## Get top n, add titles and base paths
Using a clean dictionary of related links (no replication of embedded links), iterate through the dictionary:
 - sort the recommended related links by their rank, and take the top n
 - add titles and base paths for the pages and their related links
 - return a DataFrame with all this info

In [28]:
def get_top_n(related_links_dict, n=10):
    pages_links = []
    for page, recs in related_links_dict.items():
        title = get_page_title(page)
        base_path = get_page_url(page)
        sorted_recs = sorted(recs, key=lambda k: k['rank'])[:n]
        for rec in sorted_recs:
            page_link = {"title": title,
                         "link_title": get_page_title(rec['link']),
                         "base_path": base_path,
                         "link_base_path": get_page_url(rec['link']),
                         "page": page}
            page_link.update(rec)
            pages_links.append(page_link)
    return pd.DataFrame(pages_links)
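
One thing to keep in mind: after embedded links are removed, the ranks for a page may no longer start at 1 or be contiguous, but sorting by rank still picks the best remaining links. A minimal sketch with invented data (`top_n_links` is a hypothetical helper, not part of the notebook):

```python
def top_n_links(recs, n=5):
    """Return the n recommendations with the lowest rank value (1 is best)."""
    return sorted(recs, key=lambda rec: rec['rank'])[:n]


# rank 1 was an embedded link and has already been filtered out
toy_recs = [
    {'link': 'cid-d', 'rank': 4},
    {'link': 'cid-b', 'rank': 2},
    {'link': 'cid-e', 'rank': 5},
    {'link': 'cid-c', 'rank': 3},
]

top_2 = [rec['link'] for rec in top_n_links(toy_recs, n=2)]
print(top_2)
```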

In [29]:
top_5_df = get_top_n(related_links, n=5)

In [30]:
top_5_df

Unnamed: 0,base_path,co_occurrences,link,link_base_path,link_title,llr_score,page,rank,title
0,www.gov.uk/government/news/new-digital-resourc...,1,5f563505-7631-11e4-a3cb-005056011aef,unknown,unknown,1.212560e+09,0000d0a0-037a-4110-a271-24327f422d06,1,New digital resource for charity trustees laun...
1,www.gov.uk/government/statistics/uk-consumer-p...,1,8ec6ea6e-a259-459a-9e5e-ffab399a2e2e,www.gov.uk/tax-codes,Tax codes,7.401094e+08,00015d3f-e7d9-48e8-95ff-ac3f7fa07be3,1,UK consumer price inflation: Dec 2017
2,www.gov.uk/government/publications/foi-respons...,2,63fdec18-9da3-4621-9621-8f7424d57f3b,www.gov.uk/government/publications/foi-respons...,FOI responses published by MOD: week commencin...,2.389898e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,1,FOI responses published by MOD: week commencin...
3,www.gov.uk/government/publications/foi-respons...,2,276417aa-205a-4bfe-bc39-edae27d528c7,www.gov.uk/government/publications/foi-respons...,FOI responses released by MOD: week commencing...,2.378549e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,2,FOI responses published by MOD: week commencin...
4,www.gov.uk/government/publications/foi-respons...,2,5fa6f501-7631-11e4-a3cb-005056011aef,unknown,unknown,2.360718e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,3,FOI responses published by MOD: week commencin...
5,www.gov.uk/government/publications/foi-respons...,2,e2aed688-aefc-4ba2-ab11-7181b2b27af0,www.gov.uk/government/publications/foi-respons...,FOI responses published by MOD: week commencin...,2.357301e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,4,FOI responses published by MOD: week commencin...
6,www.gov.uk/government/publications/foi-respons...,2,5b405ba5-8ea0-4c74-a182-f57528c6c7de,www.gov.uk/government/publications/foi-respons...,FOI responses published by MOD: week commencin...,2.347324e+09,000227a8-f0d2-417d-8ce4-27a18d62d442,5,FOI responses published by MOD: week commencin...
7,unknown,2,bfd51754-d1a0-43e6-9037-f70ef2e06169,www.gov.uk/employment-tribunal-decisions/mr-k-...,Mr K Batkin v LTE Group: 2423799/2017,2.764195e+09,00026382-59b7-4d50-bd2f-04f4b9defba4,1,unknown
8,unknown,2,8826f10f-f2ad-4ef2-bff4-0b538aad0b6c,www.gov.uk/employment-tribunal-decisions/ms-m-...,Ms M Williams v LTE Group: 2424517/2017,2.587720e+09,00026382-59b7-4d50-bd2f-04f4b9defba4,2,unknown
9,unknown,2,c92cbb9c-ef2a-4576-aaf2-929124d36772,unknown,unknown,2.574079e+09,00026382-59b7-4d50-bd2f-04f4b9defba4,3,unknown


In [31]:
top_5_df.shape

(611596, 9)

In [32]:
top_5_df.query('base_path != "unknown"').shape

(503477, 9)

# Get related links for colleagues to review

## Fix LLR score as we forgot to divide it by N
I've updated the SQL so you won't have to do this

In [33]:
# approx_N = df_llr_recs['co_occurrences'].sum()

In [34]:
# approx_N

In [97]:
# top_5_df['llr_score'] = top_5_df['llr_score'].map(lambda x: x/approx_N)

In [35]:
top_5_df.head()

Unnamed: 0,base_path,co_occurrences,link,link_base_path,link_title,llr_score,page,rank,title
0,www.gov.uk/government/news/new-digital-resourc...,1,5f563505-7631-11e4-a3cb-005056011aef,unknown,unknown,1212560000.0,0000d0a0-037a-4110-a271-24327f422d06,1,New digital resource for charity trustees laun...
1,www.gov.uk/government/statistics/uk-consumer-p...,1,8ec6ea6e-a259-459a-9e5e-ffab399a2e2e,www.gov.uk/tax-codes,Tax codes,740109400.0,00015d3f-e7d9-48e8-95ff-ac3f7fa07be3,1,UK consumer price inflation: Dec 2017
2,www.gov.uk/government/publications/foi-respons...,2,63fdec18-9da3-4621-9621-8f7424d57f3b,www.gov.uk/government/publications/foi-respons...,FOI responses published by MOD: week commencin...,2389898000.0,000227a8-f0d2-417d-8ce4-27a18d62d442,1,FOI responses published by MOD: week commencin...
3,www.gov.uk/government/publications/foi-respons...,2,276417aa-205a-4bfe-bc39-edae27d528c7,www.gov.uk/government/publications/foi-respons...,FOI responses released by MOD: week commencing...,2378549000.0,000227a8-f0d2-417d-8ce4-27a18d62d442,2,FOI responses published by MOD: week commencin...
4,www.gov.uk/government/publications/foi-respons...,2,5fa6f501-7631-11e4-a3cb-005056011aef,unknown,unknown,2360718000.0,000227a8-f0d2-417d-8ce4-27a18d62d442,3,FOI responses published by MOD: week commencin...


In [36]:
top_5_df.llr_score.describe()

count    6.115960e+05
mean     1.482019e+10
std      2.629275e+11
min      4.525482e+07
25%      1.327874e+09
50%      1.654574e+09
75%      3.303661e+09
max      7.649506e+13
Name: llr_score, dtype: float64

In [37]:
top_5_df.co_occurrences.describe()

count    611596.000000
mean         27.732658
std         633.974168
min           1.000000
25%           1.000000
50%           1.000000
75%           3.000000
max      185888.000000
Name: co_occurrences, dtype: float64

## Get top 5 links for each page, where co_occurrences > 1, or co_occurrences = 1
Save both sets to show that links with co_occurrences = 1 include some bad recommendations.
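
One possible reason single co-occurrences produce bad links: when two pages are each viewed only once, and in the same session, a standard LLR treats that as strong evidence of association even though it rests on one visit. The sketch below uses a generic entropy-based LLR and invented counts. Its scores are not on the same scale as the `llr_score` column in this notebook.

```python
import math


def x_log_x(x):
    """Return x * ln(x), treating 0 * ln(0) as 0."""
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    """Unnormalised Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """LLR for a 2x2 table: both pages, A only, B only, neither."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


total_sessions = 1_000_000

# two pages each viewed once, both in the same session
single = llr(1, 0, 0, total_sessions - 1)
# two pages each viewed 1,000 times, 50 of those together
repeated = llr(50, 950, 950, total_sessions - 1950)

# 3.84 is the usual 5% critical value for chi-squared with one degree of freedom
print(single > 3.84, repeated > single)
```

The single-session pair clears the usual significance threshold comfortably, which is why a minimum co_occurrences filter is applied on top of the score.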

In [38]:
top_5_df[(top_5_df['co_occurrences'] > 1) ].shape


(258267, 9)

In [39]:
top_5_df[(top_5_df['co_occurrences'] > 1) & (top_5_df['base_path'] != 'unknown')].shape
#          (top_7_df['co_occurrences'] < 1000)]

(213934, 9)

In [40]:
top_5_df[(top_5_df['co_occurrences'] > 1) & (top_5_df['co_occurrences'] < 10)].shape

(192173, 9)

In [41]:
top_5_df[(top_5_df['co_occurrences'] > 1)][[
    'title', 'link_title', 'base_path', 'link_base_path', 'page', 'link',
     'llr_score','rank','co_occurrences'
    ]].to_csv(
    os.path.join(DATA_DIR, 'llr', 
                 'llr_top_5_coocurrence_gt_1_20190222_20190314.csv'), index=False)

In [42]:
top_5_df[(top_5_df['co_occurrences'] > 1) & (top_5_df['base_path']!= 'unknown')][[
    'title', 'link_title', 'base_path', 'link_base_path',
     'llr_score', 'co_occurrences'
    ]].to_csv(
    os.path.join(DATA_DIR, 'llr', 
                 'llr_top_5_coocurrence_gt_1_20190222_20190314_thin.csv'), index=False)

In [43]:
top_5_df[top_5_df['co_occurrences'] == 1].sort_values('llr_score', ascending=False)[[
    'title', 'link_title', 'base_path', 'link_base_path', 'page', 'link',
     'llr_score','rank','co_occurrences'
    ]].head(100).to_csv(
    os.path.join(DATA_DIR, 'llr', 
                 'llr_top_5_single_coocurrence_highest_scores_20190222_20190314.csv'), index=False)

In [44]:
top_5_df[(top_5_df['co_occurrences'] > 1)]['page'].nunique()

71062

In [45]:
top_5_df[(top_5_df['co_occurrences'] > 1)]['base_path'].nunique()

59807

## Get top 5 links per page for the most popular pages based on some GA data
i.e. get the top 5 links per page for a specified list of base paths (list [here](https://docs.google.com/spreadsheets/d/1s-tyTXaWTv0yLD6WyQDjWfD3UWlvHmgaE_GqSFkYPI4/edit#gid=1914380518))

In [69]:
top_pages_GA = pd.read_csv(os.path.join(DATA_DIR, 'metadata', 'top_pages.csv'))

In [70]:
# a set keeps the membership test in is_top_page fast
top_pages = set(top_pages_GA['Page'])

In [71]:
def is_top_page(pagepath):
    return pagepath in top_pages

In [72]:
top_5_df['is_top_page'] = top_5_df['base_path'].map(is_top_page)

In [73]:
top_5_df[top_5_df['is_top_page']].shape

(725, 11)

In [106]:
top_5_df[top_5_df['is_top_page']][[
    'title', 'link_title', 'base_path', 'link_base_path', 'page', 'link',
    'llr_score', 'rank', 'co_occurrences'
    ]].to_csv(
    os.path.join(DATA_DIR, 'llr', 
                 'llr_top_5_top_GA_pages_20190222_20190314.csv'), index=False)

In [107]:
top_5_df[top_5_df['is_top_page']]['co_occurrences'].describe()

count       725.000000
mean       8389.028966
std       14106.852748
min           6.000000
25%        1809.000000
50%        4380.000000
75%        9166.000000
max      185888.000000
Name: co_occurrences, dtype: float64

In [108]:
top_5_df.columns

Index(['base_path', 'co_occurrences', 'link', 'link_base_path', 'link_title',
       'llr_score', 'page', 'rank', 'title', 'ln_llr_score', 'is_top_page'],
      dtype='object')

## Save top 5 links per page for top GA pages as a JSON file for ingestion into Integration

In [109]:
top_5_top_GA_pages_dict = {}

for row in top_5_df[top_5_df['is_top_page']].sort_values('llr_score', ascending=False).itertuples():
    if row.page not in top_5_top_GA_pages_dict:
        top_5_top_GA_pages_dict[row.page] = list()
    top_5_top_GA_pages_dict[row.page].append(row.link)

In [110]:
top_5_top_GA_pages_dict

{'c1f13d41-ed7f-44a3-be11-fd95525ddf40': ['eafed48e-6a19-44cf-993c-c7b9a3a22cfa',
  'e0a879a7-afd2-49fd-ba44-b066cce082cf',
  '9a24dcce-c03a-4523-96e3-893be1c400cd',
  '421504cb-63c4-44fb-94ec-5d0129e1748c',
  '5e169a54-7631-11e4-a3cb-005056011aef'],
 'fa748fae-3de4-4266-ae85-0797ada3f40c': ['0889f128-e479-465f-b3e1-a3db6a3879cf',
  'e41bd8f3-148c-4285-ad16-131c716bc067',
  '5165dcf5-4e9a-4ed0-96a3-e35c6c1060e0',
  'f8490f74-437e-467a-9fa3-0e5f60e0bee0',
  'daca9924-c834-451b-9675-03efb8a3397f'],
 'ad5110e0-fa62-49d3-923f-d50101f12014': ['0889f128-e479-465f-b3e1-a3db6a3879cf',
  'dc57162b-59f4-4d0f-9b83-a67f74ffccf5',
  '4ca3fd7e-efdb-4a24-9a84-60ba3c614588',
  'fb5b3d6b-73ff-4213-8320-f0aa4bbe5e63',
  'fa748fae-3de4-4266-ae85-0797ada3f40c'],
 'dc57162b-59f4-4d0f-9b83-a67f74ffccf5': ['ad5110e0-fa62-49d3-923f-d50101f12014',
  '4a1f8cf7-3779-41d3-9b62-bf0297ff9545',
  '0889f128-e479-465f-b3e1-a3db6a3879cf',
  'fa748fae-3de4-4266-ae85-0797ada3f40c',
  '00cad980-6bce-46e4-b83e-6a4238c25f3b

In [111]:
ingestion_file = os.path.join(DATA_DIR, 'llr', 'llr_top_5_top_GA_pages_20190222_20190314.json')

In [112]:
with open(ingestion_file, 'w') as outfile:
    json.dump(top_5_top_GA_pages_dict, outfile)
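
The file written above maps each page's content ID to a list of link content IDs, ordered by descending LLR score. A sketch of that shape on invented rows (the tuples stand in for DataFrame rows, and all IDs and scores are made up):

```python
import json
from collections import defaultdict

# hypothetical (page, link, llr_score) rows
toy_rows = [
    ('cid-a', 'cid-c', 7.0),
    ('cid-b', 'cid-a', 3.2),
    ('cid-a', 'cid-b', 9.5),
]

toy_dict = defaultdict(list)
# sort once by score so each page's list ends up best-first
for page, link, llr_score in sorted(toy_rows, key=lambda r: r[2], reverse=True):
    toy_dict[page].append(link)

payload = json.dumps(toy_dict)
print(payload)
```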

# A/B Test stuff
Save the top 5 links per page for all pages, where co_occurrences > 1, as a JSON file for ingestion for the A/B test.


In [117]:
top_5_all_pages_dict = {}

for row in top_5_df[top_5_df['co_occurrences'] > 1].sort_values(
    'llr_score', ascending=False).itertuples():
    # use named fields rather than positions, which break if the column order changes
    if row.page not in top_5_all_pages_dict:
        top_5_all_pages_dict[row.page] = list()
    top_5_all_pages_dict[row.page].append(row.link)

In [120]:
ingestion_file_all_pages = os.path.join(DATA_DIR, 'llr', 'llr_top_5_all_pages_20190222_20190314.json')

In [121]:
with open(ingestion_file_all_pages, 'w') as outfile:
    json.dump(top_5_all_pages_dict, outfile)