# Finding potentially-misplaced content

## Context

This notebook is for identifying content that seems to be misplaced. It focuses on the money branch, which we've agreed to explore this quarter for demo purposes.

There are sections for finding misplaced content (with taxon-level metrics and content-level metrics), taking a look at the Brexit taxon, and identifying duplicate taxon names.

## Prepare workspace

This notebook assumes, for now, that your working directory is `/content-similarity-models/google-universal-encoder`.

In [26]:
!pwd

/Users/matthewdray/Documents/content-similarity-models/google-universal-encoder


In [218]:
import numpy as np
import pandas as pd

from sklearn.metrics import pairwise_distances, pairwise_distances_chunked

import altair as alt
from altair import datum

from sklearn.manifold import TSNE
import seaborn as sns

## Read and prepare data

There are two input data sources here:

1. Content items and embeddings
  1a. Embedded clean content (pickle file)
  1b. Labelled data (data frame with a row per content item)
  1c. Embedded sentences (a numpy array of embeddings for content items)
2. Branch homogeneity scores

### 1a. Embedded clean content (pickle file)

This pickle file was created to combine the content data with the embeddings, without the need for the separate numpy array described in section '1c. Embedded sentences'.

In [146]:
embedded_clean_content = pd.read_pickle('../data/embedded_clean_content.pkl')

In [147]:
embedded_clean_content.head()

Unnamed: 0,base_path,content_id,description,document_type,first_published_at,locale,primary_publishing_organisation,publishing_app,title,body,combined_text,embedded_sentences
0,/government/publications/list-of-psychologists...,04a0cc0d-0b9f-45ad-bf57-7c54cbab9df9,list of english speaking psychologists and psy...,guidance,2017-07-21T16:42:00.000+00:00,en,Foreign & Commonwealth Office,whitehall,chile - list of psychologists and psychiatrist...,prepared by british embassy/consulate santiago...,chile - list of psychologists and psychiatrist...,"[0.0535724014043808, 0.002477730857208371, -0...."
1,/government/statistics/uk-labour-market-statis...,a61985b0-d6eb-4cf1-8140-642b9557ce00,employment unemployment economic inactivity cl...,national_statistics,2017-05-17T08:30:02.000+00:00,en,Office for National Statistics,whitehall,uk labour market statistics: may 2017,official statistics are produced impartially a...,uk labour market statistics: may 2017 employme...,"[-0.03585460036993027, 0.02925342693924904, 0...."
2,/government/publications/monitor-remuneration-...,d569ef4b-d632-49a0-9795-6a7ea934b799,agenda and committee papers.,transparency,2015-01-20T10:27:00.000+00:00,en,Monitor,whitehall,monitor: remuneration committee papers october...,papers from the october 2014 meeting of monito...,monitor: remuneration committee papers october...,"[0.04697635397315025, 0.0011109516490250826, 0..."
3,/government/statistical-data-sets/env26-expend...,5e0fee54-7631-11e4-a3cb-005056011aef,annual updates on uk and global biodiversity e...,statistical_data_set,2012-06-17T23:00:00.000+00:00,en,"Department for Environment, Food & Rural Affairs",whitehall,env26 - expenditure on biodiversity,uk and global public sector expenditure on bio...,env26 - expenditure on biodiversity annual upd...,"[0.023101944476366043, 0.02299758419394493, 0...."
4,/government/publications/hmg-spending-moratori...,581cabaf-2ed1-4411-8d1b-35b1bcb559b5,hmg spending moratoria: dfid ict january to ma...,transparency,2015-07-17T16:31:32.000+00:00,en,Department for International Development,whitehall,hmg spending moratoria: dfid ict january to ma...,hmg spending moratoria: dfid ict january to ma...,hmg spending moratoria: dfid ict january to ma...,"[0.028265895321965218, 0.0009081338648684323, ..."


### 1b. Labelled data (csv.gz)
We may also need to read the `labelled.csv` data to create some objects that will be used later. The labelled data is one of the inputs to the `get_homogeneity_scores_taxon.py` script that produces `taxon_homogeneity_df.csv`.

In [274]:
labelled = pd.read_csv(
    '../data/2019-02-11/labelled.csv.gz',
    compression='gzip',
    low_memory=False
)

In [276]:
labelled.head()

Unnamed: 0,base_path,content_id,description,document_type,first_published_at,locale,primary_publishing_organisation,publishing_app,title,body,combined_text,taxon_id,taxon_base_path,taxon_name,level1taxon,level2taxon,level3taxon,level4taxon,level5taxon
0,/government/publications/list-of-psychologists...,04a0cc0d-0b9f-45ad-bf57-7c54cbab9df9,list of english speaking psychologists and psy...,guidance,2017-07-21T16:42:00.000+00:00,en,Foreign & Commonwealth Office,whitehall,chile - list of psychologists and psychiatrist...,prepared by british embassy/consulate santiago...,chile - list of psychologists and psychiatrist...,668cd623-c7a8-4159-9575-90caac36d4b4,/society-and-culture/community-and-society,Community and society,Society and culture,Community and society,,,
1,/government/news/charity-commission-names-furt...,5fa49c52-7631-11e4-a3cb-005056011aef,regulator increases transparency of its work.,press_release,2014-06-04T23:00:00.000+00:00,en,The Charity Commission,whitehall,charity commission names further charities und...,the charity commission has today named further...,charity commission names further charities und...,668cd623-c7a8-4159-9575-90caac36d4b4,/society-and-culture/community-and-society,Community and society,Society and culture,Community and society,,,
2,/government/publications/trust-and-confidence-...,d0341424-12a1-4b4c-9045-2e74ba17f2d5,independent research into trust and confidence...,research,2015-06-25T07:00:00.000+00:00,en,The Charity Commission,whitehall,trust and confidence in the charity commission...,the charity commission commissioned populus to...,trust and confidence in the charity commission...,668cd623-c7a8-4159-9575-90caac36d4b4,/society-and-culture/community-and-society,Community and society,Society and culture,Community and society,,,
3,/government/speeches/william-shawcross-speech-...,9245dfca-4210-41d9-9ffd-7fcc35dc1642,william shawcross asks charities to pull toget...,speech,2016-02-29T12:39:07.000+00:00,en,The Charity Commission,whitehall,william shawcross speech at commission’s publi...,good morning and thank you for joining us here...,william shawcross speech at commission’s publi...,668cd623-c7a8-4159-9575-90caac36d4b4,/society-and-culture/community-and-society,Community and society,Society and culture,Community and society,,,
4,/government/statistics/crime-statistics-focus-...,5fec046a-7631-11e4-a3cb-005056011aef,crime statistics from the crime survey for eng...,national_statistics,2015-03-26T09:30:00.000+00:00,en,Office for National Statistics,whitehall,public perceptions of crime and the police and...,official statistics are produced impartially a...,public perceptions of crime and the police and...,668cd623-c7a8-4159-9575-90caac36d4b4,/society-and-culture/community-and-society,Community and society,Society and culture,Community and society,,,


Each content item can exist in more than one row; it might be tagged to more than one part of the taxonomy.

In [280]:
labelled.shape  # 306k rows

(305703, 19)

In [284]:
labelled.drop_duplicates('content_id').shape  # 208k rows when duplicates removed

(208261, 19)
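To see which content items account for the extra rows, the number of distinct taxons per `content_id` can be counted. The following is a minimal sketch on toy data; `toy_labelled`, `taxons_per_item` and `multi_tagged` are illustrative names, not objects from this notebook.

```python
import pandas as pd

# Toy stand-in for `labelled`: one row per (content item, taxon) pair
toy_labelled = pd.DataFrame({
    'content_id': ['a', 'a', 'b', 'c', 'c', 'c'],
    'taxon_id': ['t1', 't2', 't1', 't1', 't2', 't3'],
})

# Number of distinct taxons each content item is tagged to
taxons_per_item = toy_labelled.groupby('content_id')['taxon_id'].nunique()

# Content items tagged to more than one taxon
multi_tagged = taxons_per_item[taxons_per_item > 1]
print(multi_tagged)
```

The same two lines should apply to `labelled` itself, assuming `content_id` and `taxon_id` are populated for every row.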

Prepare some objects for later visualisation.

In [32]:
taxon_id_to_base_path = dict(zip(labelled['taxon_id'], labelled['taxon_base_path']))

#taxon_id_to_level = dict(zip(labelled['taxon_id'], labelled['level']))

taxon_id_to_level1 = dict(zip(labelled['taxon_id'], labelled['level1taxon']))

Prepare an object containing the relationship between `taxon_id` and `taxon_name`.

In [183]:
taxon_id_name = labelled[['taxon_id', 'taxon_name', 'level1taxon', 'level2taxon', 'level3taxon', 'level4taxon', 'level5taxon']].drop_duplicates()

### 1c. Embedded sentences (numpy array)

A numpy array of embeddings for content items. This is superseded by the embedded clean content file read in section 1a, although the array is still used for filtering later in this notebook.

In [29]:
embedded_sentences = np.load('../data/embedded_sentences2019-02-11.npy')

In [70]:
embedded_sentences.view()

array([[ 0.05357241,  0.00247775, -0.020976  , ..., -0.05675139,
         0.01268296,  0.01008949],
       [-0.04339019, -0.03241241,  0.00900179, ..., -0.04839829,
         0.03903588, -0.0553612 ],
       [-0.01541414,  0.04076389,  0.04761627, ...,  0.03003302,
         0.02379775, -0.06022731],
       ...,
       [ 0.0387876 ,  0.04077478,  0.04639488, ..., -0.05905833,
         0.00860474, -0.04581202],
       [-0.0426819 ,  0.02391836,  0.02524047, ..., -0.01499535,
         0.0110037 , -0.07002866],
       [ 0.0517485 , -0.01900592,  0.00853599, ..., -0.05199422,
        -0.02320959, -0.05730124]], dtype=float32)
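If only the pickle file from section 1a were loaded, a comparable array could in principle be rebuilt from its `embedded_sentences` column. The sketch below uses toy data and assumes each cell holds a list of floats of equal length; the output shown in section 1a does not confirm the cell type, so treat this as an assumption. The names `toy_content` and `embeddings` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `embedded_clean_content`; assumes each cell of
# `embedded_sentences` holds a list of floats of equal length
toy_content = pd.DataFrame({
    'content_id': ['a', 'b', 'c'],
    'embedded_sentences': [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})

# Stack the per-row vectors into a 2-D array, one row per content item
embeddings = np.vstack(toy_content['embedded_sentences'].tolist())
print(embeddings.shape)
```

An array built this way follows the row order of the data frame it came from, so it would only be interchangeable with `embedded_sentences` if that order matches the rows used for filtering.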

### 2. Branch homogeneity scores
Homogeneity scores at a branch level. A Pandas data frame output from the `get_homogeneity_scores_taxon.py` script.

In [34]:
taxon_homogeneity_df = pd.read_csv("../data/taxon_homogeneity_df.csv")

In [35]:
taxon_homogeneity_df.shape

(1265, 9)

In [36]:
taxon_homogeneity_df.head()

Unnamed: 0.1,Unnamed: 0,taxon_id,taxon_size,mean_cosine_score,taxon_base_path,taxon_level,level1taxon,fewer_than_or_equal_5items,more_than_0_5_diversity
0,0,668cd623-c7a8-4159-9575-90caac36d4b4,5166,0.59549,/society-and-culture/community-and-society,2,Society and culture,0,1
1,246,f9e476ef-654d-41ec-97d9-2b6842d4361d,786,0.589025,/society-and-culture/sports-and-leisure,2,Society and culture,0,1
2,48,495afdb6-47be-4df1-8b38-91c8adb1eefc,8136,0.57151,/business-and-industry,1,Business and industry,0,1
3,833,fc5f468f-a3ba-4fde-9c1d-ed2dd17cfd82,31,0.571205,/housing-local-and-community/housing-local-ser...,3,"Housing, local and community",0,1
4,18,b29cf14b-54c6-402c-a9f0-77218602d1af,2333,0.569644,/society-and-culture/arts-and-culture,2,Society and culture,0,1


# Explore misplaced content

At two levels:

1. High-level structure (can we spot taxons that might contain problem content using taxon-level metrics?)
1. Identify problem content within taxons (can we identify and extract problem content from a given taxon?)

## 1. High-level structure

### Cosine similarity vs taxon size

Plot mean cosine similarity against taxon size; coloured by depth.

In [39]:
numcols = 6  # specify the number of columns you want
level1taxons = taxon_homogeneity_df['level1taxon'].unique() 

money = taxon_homogeneity_df[taxon_homogeneity_df.level1taxon == 'Money'].copy()

total_size = money['taxon_size'].sum().astype(str)

money_plot = alt.Chart(money).mark_circle(size=60).encode(
    alt.X(
        'taxon_size:Q',
        scale=alt.Scale(type='log', domain=(1, 10000)),
        axis=alt.Axis(grid=False, title='log(topic_size)')
    ),
    alt.Y(
        'mean_cosine_score:Q',
        scale=alt.Scale(domain=(0, 0.6)),
        axis=alt.Axis(grid=False, title='content diversity score')
    ), 
    #color='taxon_level:N',
    color=alt.Color('taxon_level:N', scale=alt.Scale(scheme='magma')),
    opacity=alt.value(0.8), 
    tooltip=['taxon_base_path']
).properties(
        title='Money' + ", " + total_size).interactive()

In [40]:
#money_plot.display()

In [41]:
money_plot.save('money.html', scale_factor=2.0)

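Are there any level 2 taxons within the money branch with a high mean cosine score? The first filter below looks for taxons with fewer than 300 items and a `mean_cosine_score` above 0.5 and returns nothing; the full listing that follows shows that no level 2 money taxon has a score above 0.5.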

In [37]:
th = taxon_homogeneity_df

th[(th.taxon_level == 2) & (th.level1taxon == 'Money') & (th.taxon_size < 300) & (th.mean_cosine_score > 0.5)]

Unnamed: 0.1,Unnamed: 0,taxon_id,taxon_size,mean_cosine_score,taxon_base_path,taxon_level,level1taxon,fewer_than_or_equal_5items,more_than_0_5_diversity


In [74]:
money_level2 = th[(th.level1taxon == 'Money') & (th.taxon_level == 2)]
money_level2

Unnamed: 0.1,Unnamed: 0,taxon_id,taxon_size,mean_cosine_score,taxon_base_path,taxon_level,level1taxon,fewer_than_or_equal_5items,more_than_0_5_diversity
85,41,28262ae3-599c-4259-ae30-3c83a5ec02a1,522,0.47888,/money/business-tax,2,Money,0,0
125,377,b20215a9-25fb-4fa6-80a3-42e23f5352c2,266,0.461853,/money/dealing-with-hmrc,2,Money,0,0
215,194,35136812-221c-4ba2-8ed1-46e409ca5e10,373,0.438402,/money/tax-evasion-and-avoidance,2,Money,0,0
297,42,a5c88a77-03ba-4100-bd33-7ee2ce602dc8,400,0.419783,/money/personal-tax,2,Money,0,0
485,726,7c4cf197-2dba-4a82-83e2-6c8bb332525c,40,0.382529,/money/court-claims-debt-bankruptcy,2,Money,0,0
511,587,a7b67f4f-1234-4a0c-90ba-eeed4c5183cd,33,0.377888,/money/money-laundering-regulations,2,Money,0,0
892,668,5605545e-03ca-4520-9519-163ea341bc86,50,0.302895,/money/expenses-employee-benefits,2,Money,0,0


### Level 2 homogeneity spread

Cosine similarity for all tagged content in a dotplot.

In [None]:
#ax = sns.stripplot(x = "day", y = "total_bill", data = tips)

## 2. Identify problem content within taxons
Content may have been tagged in the wrong place. How can we identify this? One idea is to calculate the cosine distance between each content item and all the others within a taxon, then inspect the items whose mean distance is above a certain threshold (i.e. they're semantically different to everything else).
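The scores used in this section come from `pairwise_distances` with `metric = 'cosine'`, which returns cosine distance (1 minus cosine similarity), so larger values mean less alike. The sketch below works through the idea by hand with NumPy on toy vectors; `toy_embedded` and `toy_threshold` are illustrative, and the threshold value is arbitrary.

```python
import numpy as np

# Toy embeddings: three similar vectors and one that points elsewhere
toy_embedded = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [1.0, 0.0],
    [0.0, 1.0],  # the odd one out
])

# Cosine similarity: dot product of unit-length vectors
unit = toy_embedded / np.linalg.norm(toy_embedded, axis=1, keepdims=True)
similarity = unit @ unit.T

# Cosine distance is 1 - similarity; larger means less alike
distance = 1.0 - similarity

# Mean distance from each item to every item in the taxon
mean_distance = distance.mean(axis=1)

# Items above an illustrative threshold are candidates for inspection
toy_threshold = 0.5
candidates = np.where(mean_distance > toy_threshold)[0]
print(candidates)
```

Only the last vector is flagged here, because its mean distance to the rest is well above that of the three similar vectors.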

### Example: 'business tax' taxon
This section contains a manual test of extracting misplaced content based on each item's mean cosine distance to the other content in the taxon. This is generalised in a function in the section below.

Store the taxon ID as a variable.

In [42]:
btax_id = '28262ae3-599c-4259-ae30-3c83a5ec02a1'

Filter the embedded sentences (a numpy array) to the rows that match the business tax taxon ID. The row order of `embedded_sentences` and `labelled` is the same, so `labelled` can be used to build the filter.

In [43]:
btax_embedded = embedded_sentences[labelled['taxon_id'] == btax_id]
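Because this filter relies on the two objects sharing the same row order, a quick length check before masking can catch a mismatched pair of files. This is a minimal sketch with toy stand-ins (`toy_labelled`, `toy_embedded`); it checks row counts only and cannot prove the ordering itself is the same.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: row i of the array is assumed to belong to row i of the frame
toy_labelled = pd.DataFrame({'taxon_id': ['t1', 't2', 't1', 't3']})
toy_embedded = np.arange(8, dtype=float).reshape(4, 2)

# A boolean mask only selects the right rows if both objects have the
# same number of rows in the same order
assert len(toy_labelled) == toy_embedded.shape[0]

# Convert the pandas mask to a plain NumPy boolean array before indexing
mask = (toy_labelled['taxon_id'] == 't1').to_numpy()
taxon_embedded = toy_embedded[mask]
print(taxon_embedded)
```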

Get the cosine distance for all content item pairs in the taxon, convert to a Pandas data frame and then get the mean distance for each content item.

In [44]:
btax_dist = pairwise_distances(
    btax_embedded, 
    metric = 'cosine', 
    n_jobs = -1
)

In [45]:
btax_dist_df = pd.DataFrame(btax_dist)

In [46]:
btax_dist_df['mean'] = btax_dist.mean(axis = 1)
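Note that a row mean over the full distance matrix includes each item's distance to itself, which is (near) zero and so pulls the mean down slightly. The sketch below, on a toy matrix, shows one possible alternative that averages over the other items only; `toy_dist` and the two result names are illustrative, and this is not what the notebook's cells compute.

```python
import numpy as np

# Toy symmetric distance matrix with zeros on the diagonal
toy_dist = np.array([
    [0.0, 0.2, 0.4],
    [0.2, 0.0, 0.6],
    [0.4, 0.6, 0.0],
])
n = toy_dist.shape[0]

# Mean over every column, self-distance included (as in the cell above)
mean_with_self = toy_dist.mean(axis=1)

# Mean over the other n - 1 items only; the zero diagonal adds nothing to the sum
mean_without_self = toy_dist.sum(axis=1) / (n - 1)

print(mean_with_self)
print(mean_without_self)
```

For a taxon with several hundred items the two versions differ by a factor of n / (n - 1), which is small, but the difference grows for small taxons.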

In [47]:
btax_dist_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,513,514,515,516,517,518,519,520,521,mean
0,1.192093e-07,5.811087e-01,0.363355,0.395613,0.621481,6.434086e-01,0.289677,0.530680,0.207134,0.350927,...,5.854432e-01,0.505209,0.217435,0.395614,0.237765,0.321701,4.654763e-01,0.299991,1.957843e-01,0.450615
1,5.811087e-01,1.788139e-07,0.654736,0.576303,0.530271,6.266518e-01,0.597192,0.427653,0.501061,0.444977,...,6.465485e-01,0.560747,0.697612,0.502061,0.655768,0.623915,5.118585e-01,0.428566,5.778310e-01,0.580138
2,3.633550e-01,6.547358e-01,0.000000,0.426611,0.544326,5.551578e-01,0.253399,0.467847,0.262523,0.498422,...,6.806681e-01,0.598263,0.327363,0.497435,0.225597,0.265516,5.799253e-01,0.312676,2.386062e-01,0.438918
3,3.956128e-01,5.763029e-01,0.426611,0.000000,0.561196,5.895838e-01,0.290253,0.435382,0.441065,0.483954,...,7.239314e-01,0.557843,0.340588,0.306367,0.411198,0.428849,4.966263e-01,0.362502,4.144944e-01,0.443872
4,6.214815e-01,5.302707e-01,0.544326,0.561196,0.000000,4.022706e-01,0.556397,0.424755,0.601762,0.533905,...,6.275401e-01,0.514643,0.752991,0.367417,0.568614,0.582047,4.397339e-01,0.363483,6.077694e-01,0.555981
5,6.434086e-01,6.266518e-01,0.555158,0.589584,0.402271,1.192093e-07,0.619666,0.345636,0.608109,0.474413,...,6.891668e-01,0.309997,0.744982,0.420848,0.608365,0.475288,3.990849e-01,0.449534,6.333849e-01,0.505678
6,2.896773e-01,5.971916e-01,0.253399,0.290253,0.556397,6.196660e-01,0.000000,0.531329,0.284649,0.440067,...,6.574677e-01,0.515673,0.285938,0.377179,0.309599,0.314286,4.741045e-01,0.263140,2.587251e-01,0.427959
7,5.306804e-01,4.276533e-01,0.467847,0.435382,0.424755,3.456364e-01,0.531329,0.000000,0.487663,0.374371,...,6.128612e-01,0.468699,0.611151,0.319653,0.540137,0.372984,4.767542e-01,0.407424,5.223266e-01,0.441965
8,2.071338e-01,5.010610e-01,0.262523,0.441065,0.601762,6.081095e-01,0.284649,0.487663,0.000000,0.389362,...,6.607015e-01,0.563229,0.262255,0.445377,0.215301,0.299106,5.283373e-01,0.238127,1.458029e-01,0.429075
9,3.509272e-01,4.449770e-01,0.498422,0.483954,0.533905,4.744127e-01,0.440067,0.374371,0.389362,0.000000,...,5.294540e-01,0.323660,0.516627,0.433995,0.463708,0.385228,3.320318e-01,0.388419,3.998640e-01,0.452977


How many content items (rows) have a larger mean distance than the overall mean?

In [48]:
btax_dist_df[btax_dist_df['mean'] > btax_dist.mean()].shape

(212, 523)

Now we can use this information to filter the data frame of labelled content items (`labelled`), leaving us with a data frame of the problem content.

We can start by filtering the `labelled` data so we have only the content items that are in the business tax taxon.

In [49]:
btax_content = labelled[labelled['taxon_id'] == btax_id].reset_index()

In [50]:
btax_content

Unnamed: 0,index,base_path,content_id,description,document_type,first_published_at,locale,primary_publishing_organisation,publishing_app,title,body,combined_text,taxon_id,taxon_base_path,taxon_name,level1taxon,level2taxon,level3taxon,level4taxon,level5taxon
0,115137,/government/publications/hidden-economy-unders...,5fe7f08a-7631-11e4-a3cb-005056011aef,research to help hmrc understand and reduce th...,research,2012-12-07T00:00:00.000+00:00,en,HM Revenue & Customs,whitehall,hidden economy: understanding problems for sma...,research report on ways hm revenue and customs...,hidden economy: understanding problems for sma...,28262ae3-599c-4259-ae30-3c83a5ec02a1,/money/business-tax,Business tax,Money,Business tax,,,
1,115138,/government/publications/duty-on-high-strength...,5ab5791d-1643-4da8-8647-bff155cefe89,details of the government’s reforms to the tax...,policy_paper,2017-11-22T13:37:10.000+00:00,en,HM Treasury,whitehall,duty on high strength ciders: autumn budget 20...,following consultation earlier this year autum...,duty on high strength ciders: autumn budget 20...,28262ae3-599c-4259-ae30-3c83a5ec02a1,/money/business-tax,Business tax,Money,Business tax,,,
2,115139,/government/publications/corporation-tax-refun...,5d644c95-7631-11e4-a3cb-005056011aef,response to a freedom of information request o...,foi_release,2011-11-22T00:00:00.000+00:00,en,HM Revenue & Customs,whitehall,corporation tax refunds between 2006 and 2010,response to a freedom of information request f...,corporation tax refunds between 2006 and 2010 ...,28262ae3-599c-4259-ae30-3c83a5ec02a1,/money/business-tax,Business tax,Money,Business tax,,,
3,115140,/government/news/one-million-schemes-use-new-p...,5e2ac439-7631-11e4-a3cb-005056011aef,over one million employer paye schemes have st...,news_story,2013-05-02T12:32:46.000+00:00,en,HM Revenue & Customs,whitehall,one million schemes use new paye system,the new paye reporting system known as real ti...,one million schemes use new paye system over o...,28262ae3-599c-4259-ae30-3c83a5ec02a1,/money/business-tax,Business tax,Money,Business tax,,,
4,115141,/government/publications/devolution-of-landfil...,b1e9af5d-4613-40a0-9532-e8c55f0a23be,legislation will be made to amend the landfill...,policy_paper,2017-12-07T08:45:10.000+00:00,en,HM Revenue & Customs,whitehall,devolution of landfill tax to wales and the 2 ...,landfill tax will be devolved to wales from 1 ...,devolution of landfill tax to wales and the 2 ...,28262ae3-599c-4259-ae30-3c83a5ec02a1,/money/business-tax,Business tax,Money,Business tax,,,
5,115142,/guidance/stamp-duty-land-tax-cross-border-tra...,33a7c6f3-8c5a-4604-9bce-1b89a4dd40f7,find how to make sure you pay the right tax on...,detailed_guide,2018-03-21T16:27:02.000+00:00,en,HM Revenue & Customs,whitehall,stamp duty land tax: cross-border transactions,there’s no stamp duty land tax ( sdlt ) to pay...,stamp duty land tax: cross-border transactions...,28262ae3-599c-4259-ae30-3c83a5ec02a1,/money/business-tax,Business tax,Money,Business tax,,,
6,115143,/government/consultations/technical-consultati...,de4a00f7-f7a1-4305-b477-8017cd1f2e03,this technical consultation seeks comment on d...,consultation_outcome,2015-11-26T09:30:00.000+00:00,en,HM Revenue & Customs,whitehall,technical consultation on companies excluded f...,the chancellor announced at summer budget 2015...,technical consultation on companies excluded f...,28262ae3-599c-4259-ae30-3c83a5ec02a1,/money/business-tax,Business tax,Money,Business tax,,,
7,115144,/government/publications/diverted-profits-tax-...,627cb593-6304-453a-ad57-765b4212a583,this report sets out how hm revenue and custom...,research,2017-09-13T08:30:00.000+00:00,en,HM Revenue & Customs,whitehall,diverted profits tax yield: methodological note,diverted profits tax ( dpt ) was introduced in...,diverted profits tax yield: methodological not...,28262ae3-599c-4259-ae30-3c83a5ec02a1,/money/business-tax,Business tax,Money,Business tax,,,
8,115145,/government/publications/budget-2016-overview-...,c5d6b5eb-a502-49e6-b4c8-fcb201d96da5,tax policy measures announced at budget 2016.,policy_paper,2016-03-16T18:11:00.000+00:00,en,HM Revenue & Customs,whitehall,budget 2016: overview of tax legislation and r...,this document lists the tax policy measures an...,budget 2016: overview of tax legislation and r...,28262ae3-599c-4259-ae30-3c83a5ec02a1,/money/business-tax,Business tax,Money,Business tax,,,
9,115146,/guidance/changes-to-commodity-codes-in-chapte...,1f461191-e932-47b2-9c9e-134b91760eca,find out about the changes to volume 2 of the ...,detailed_guide,2018-02-16T12:00:00.000+00:00,en,HM Revenue & Customs,whitehall,changes to commodity codes in chapter 40 (tari...,chapter 40 delete commodity code 40121200 00 a...,changes to commodity codes in chapter 40 (tari...,28262ae3-599c-4259-ae30-3c83a5ec02a1,/money/business-tax,Business tax,Money,Business tax,,,


Now return content items from the data frame where the mean cosine distance is above a threshold value. These are the potential problem content items. Simplify the output to four columns of interest.

In [58]:
btax_misplaced = btax_content[['content_id', 'base_path', 'title', 'description']][btax_dist_df['mean'] > 0.65]
btax_misplaced

Unnamed: 0,content_id,base_path,title,description
57,f9e12f0e-bd0d-5361-8d26-bc83bfb34729,/hmrc-internal-manuals/vat-womens-sanitary-pro...,vat women’s sanitary products,guidance on the reduced rate for women's sanit...
66,a211f181-1cc0-45c0-8bb6-0491eb67fc92,/guidance/changes-to-chief-commodity-codes-tar...,changes to chief commodity codes (tariff stop ...,find out the changes to commodity codes in the...
196,6f019571-54be-4344-aede-cebd901c1fe5,/guidance/rates-and-allowances-for-air-passeng...,historic rates for air passenger duty,check which air passenger duty rates apply for...
219,e110d285-20e0-431e-a394-39edabb2b331,/guidance/air-passenger-duty-and-connected-fli...,air passenger duty and connected flights,check which flights to treat as connected for ...
256,eb031ebb-7078-4879-a124-33753c4ca0bd,/guidance/rates-and-allowances-for-air-passeng...,rates for air passenger duty,check which rates of air passenger duty you ne...
278,5f60a446-f47c-403a-aab3-bd83db20cf4f,/guidance/poultry-from-iceland-tariff-quota-no...,poultry from iceland (tariff quota notice 73),check the new tariff quota for poultry from ic...
343,6eb3a99b-9a0b-464a-bb42-c08882c7d857,/government/publications/iso-country-codes,iso country codes,find out the iso country codes.
397,5d5afda3-7631-11e4-a3cb-005056011aef,/government/news/government-to-waive-vat-on-mi...,government to waive vat on military wives’ cha...,chancellor of the exchequer has today announce...
430,0fea02ed-c1c8-4502-a7a2-f0ebebe1ee1c,/guidance/laser-skin-treatment-and-hair-remova...,laser skin treatment and hair removal (tariff ...,check the tariff classification of electrical ...
478,e36ebdbf-b8df-4dc8-beb5-beece2f7b7de,/government/collections/gwe-rwydo-a-sgamiau,gwe-rwydo a sgamiau,cyngor ar ddiogelwch gan gyllid a thollau em i...


Which of these are algo-tagged vs human-tagged?

In [53]:
tag_origin = pd.read_csv('../data/bulk_and_algorithm_tags.tsv', sep = '\t')

In [54]:
tag_origin.head()

Unnamed: 0,content_id,taxon_tag,taxon_id,how_tagged
0,a113dedd-4320-4186-b1af-888437c6aedb,Environment,71d37f3a-7c8c-4128-8763-2fd5b831b9b9,bulk_tag
1,5d8d7f5f-7631-11e4-a3cb-005056011aef,International aid and development,9fb30a53-70fb-4f1c-878b-0064b202d1ba,bulk_tag
2,5e2e074d-7631-11e4-a3cb-005056011aef,Health and social care,8124ead8-8ebc-4faf-88ad-dd5cbcc92ba8,bulk_tag
3,5f50cd4e-7631-11e4-a3cb-005056011aef,Maritime,4a9ab4d7-0d03-4c61-9e16-47787cbf53cd,bulk_tag
4,b60d9e65-dc51-4a5b-8fbc-b6b15652b6ba,World,91b8ef20-74e7-4552-880c-50e6d73c2ff9,bulk_tag


In [59]:
pd.merge(btax_misplaced, tag_origin, on = 'content_id', how = 'inner')

Unnamed: 0,content_id,base_path,title,description,taxon_tag,taxon_id,how_tagged
0,f9e12f0e-bd0d-5361-8d26-bc83bfb34729,/hmrc-internal-manuals/vat-womens-sanitary-pro...,vat women’s sanitary products,guidance on the reduced rate for women's sanit...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1,algorithm_v2.0.0
1,6f019571-54be-4344-aede-cebd901c1fe5,/guidance/rates-and-allowances-for-air-passeng...,historic rates for air passenger duty,check which air passenger duty rates apply for...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1,algorithm_v2.0.0
2,e110d285-20e0-431e-a394-39edabb2b331,/guidance/air-passenger-duty-and-connected-fli...,air passenger duty and connected flights,check which flights to treat as connected for ...,Aviation,51efa3dd-e9bc-42b2-aa26-06bf5f543015,algorithm_v2.0.0
3,e110d285-20e0-431e-a394-39edabb2b331,/guidance/air-passenger-duty-and-connected-fli...,air passenger duty and connected flights,check which flights to treat as connected for ...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1,algorithm_v2.0.0
4,eb031ebb-7078-4879-a124-33753c4ca0bd,/guidance/rates-and-allowances-for-air-passeng...,rates for air passenger duty,check which rates of air passenger duty you ne...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1,algorithm_v2.0.0
5,eb031ebb-7078-4879-a124-33753c4ca0bd,/guidance/rates-and-allowances-for-air-passeng...,rates for air passenger duty,check which rates of air passenger duty you ne...,Dealing with HMRC,b20215a9-25fb-4fa6-80a3-42e23f5352c2,algorithm_v2.0.0
6,6eb3a99b-9a0b-464a-bb42-c08882c7d857,/government/publications/iso-country-codes,iso country codes,find out the iso country codes.,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1,algorithm_v2.0.0
7,6eb3a99b-9a0b-464a-bb42-c08882c7d857,/government/publications/iso-country-codes,iso country codes,find out the iso country codes.,Dealing with HMRC,b20215a9-25fb-4fa6-80a3-42e23f5352c2,algorithm_v2.0.0
8,5d5afda3-7631-11e4-a3cb-005056011aef,/government/news/government-to-waive-vat-on-mi...,government to waive vat on military wives’ cha...,chancellor of the exchequer has today announce...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1,algorithm_v2.0.0
9,e36ebdbf-b8df-4dc8-beb5-beece2f7b7de,/government/collections/gwe-rwydo-a-sgamiau,gwe-rwydo a sgamiau,cyngor ar ddiogelwch gan gyllid a thollau em i...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1,algorithm_v2.0.0
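The inner join above drops any item with no record in the tag origin file; three of the ten items in `btax_misplaced` do not appear in the merged output. A left join with `indicator = True` would keep them and mark them explicitly. This is a sketch on toy data with illustrative names, and whether a missing record means an item was human-tagged is an assumption that the file alone does not confirm.

```python
import pandas as pd

# Toy stand-ins for `btax_misplaced` and `tag_origin`
toy_misplaced = pd.DataFrame({
    'content_id': ['a', 'b', 'c'],
    'title': ['item a', 'item b', 'item c'],
})
toy_tag_origin = pd.DataFrame({
    'content_id': ['a', 'c'],
    'how_tagged': ['algorithm_v2.0.0', 'bulk_tag'],
})

# Keep every misplaced item; `_merge` records where each row was found
merged = pd.merge(
    toy_misplaced,
    toy_tag_origin,
    on = 'content_id',
    how = 'left',
    indicator = True
)

# Items with no bulk or algorithm tagging record
no_record = merged[merged['_merge'] == 'left_only']
print(no_record[['content_id', 'title']])
```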


### Function to get misplaced content

In [137]:
def get_misplaced_content(
    taxon_id = '28262ae3-599c-4259-ae30-3c83a5ec02a1',
    similarity_threshold = 0.65,
    embedded_sentences_data = embedded_sentences,
    labelled_data = labelled
):
    
    """Identify content items that seem out of place in a given taxon.
    The mean cosine distance from each content item to the items in the taxon is calculated.
    Content items are extracted if their mean distance is above a particular threshold (default 0.65).
    """
    
    print('Taxon ID: ', taxon_id)
    print('Similarity threshold:', similarity_threshold)
    
    # Get embeddings for the specified taxon ID
    # (use the function arguments rather than the global objects)
    taxon_embedded = embedded_sentences_data[labelled_data['taxon_id'] == taxon_id]
    
    # Get cosine distances between all content item pairs
    taxon_dist = pairwise_distances(
        taxon_embedded,
        metric = 'cosine', 
        n_jobs = -1
    )
    
    # As dataframe
    taxon_dist_df = pd.DataFrame(taxon_dist)
    
    # Calculate the mean distance for each content item
    taxon_dist_df['mean'] = taxon_dist.mean(axis = 1)
    
    # Get the rows of the labelled data (content items) that match the taxon ID
    taxon_content = labelled_data[labelled_data['taxon_id'] == taxon_id].reset_index()
    
    # Content items whose mean distance is above the threshold
    # (copy so that adding a column does not modify a view)
    columns = ['content_id', 'base_path', 'title', 'description', 'taxon_name']
    misplaced = taxon_content.loc[taxon_dist_df['mean'] > similarity_threshold, columns].copy()
    
    # Add column with taxon_id
    misplaced['taxon_id'] = taxon_id
    
    return misplaced

In [138]:
get_misplaced_content(similarity_threshold = 0.6)

Taxon ID:  28262ae3-599c-4259-ae30-3c83a5ec02a1
Similarity threshold: 0.6


Unnamed: 0,content_id,base_path,title,description,taxon_name,taxon_id
34,c49483d8-2dd2-41a3-a5cd-5f4bf1e761b4,/government/collections/trading-with-the-eu-if...,trading with the eu if the uk leaves without a...,this collection brings together a set of guide...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
57,f9e12f0e-bd0d-5361-8d26-bc83bfb34729,/hmrc-internal-manuals/vat-womens-sanitary-pro...,vat women’s sanitary products,guidance on the reduced rate for women's sanit...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
66,a211f181-1cc0-45c0-8bb6-0491eb67fc92,/guidance/changes-to-chief-commodity-codes-tar...,changes to chief commodity codes (tariff stop ...,find out the changes to commodity codes in the...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
88,5d5af6f1-7631-11e4-a3cb-005056011aef,/government/news/government-to-waive-vat-on-x-...,government to waive vat on x-factor charity si...,chancellor of the exchequer has today announce...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
91,3e9714b9-888f-5ba5-b593-9ff361a8503d,/hmrc-internal-manuals/vat-betting-and-gaming,vat betting and gaming guidance,guidance on determining the liability of suppl...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
95,36b2574b-4aec-46bb-a3e3-0814f2d173b6,/guidance/agricultural-products-from-chile-tar...,agricultural products from chile (tariff quota...,find out the new tariff quotas for agricultura...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
166,6faa9514-950f-49c1-8f01-05e5e6847428,/government/news/sugar-tax-revenue-helps-tackl...,sugar tax revenue helps tackle childhood obesity,soft drinks manufacturers and traders have pai...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
196,6f019571-54be-4344-aede-cebd901c1fe5,/guidance/rates-and-allowances-for-air-passeng...,historic rates for air passenger duty,check which air passenger duty rates apply for...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
219,e110d285-20e0-431e-a394-39edabb2b331,/guidance/air-passenger-duty-and-connected-fli...,air passenger duty and connected flights,check which flights to treat as connected for ...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
225,d7b847d2-8b83-4e53-86ec-d1747e41ed24,/unincorporated-associations,unincorporated associations,unincorporated associations are organisations ...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1


### Get all potentially-misplaced contents from level 2 money taxons

Get the `taxon_id`s for the level 2 taxons from the money branch to iterate over.

In [139]:
level2_money_taxonid = list(money_level2['taxon_id'])
level2_money_taxonid

['28262ae3-599c-4259-ae30-3c83a5ec02a1',
 'b20215a9-25fb-4fa6-80a3-42e23f5352c2',
 '35136812-221c-4ba2-8ed1-46e409ca5e10',
 'a5c88a77-03ba-4100-bd33-7ee2ce602dc8',
 '7c4cf197-2dba-4a82-83e2-6c8bb332525c',
 'a7b67f4f-1234-4a0c-90ba-eeed4c5183cd',
 '5605545e-03ca-4520-9519-163ea341bc86']

Loop over the `taxon_id`s using the function to find misplaced items.

In [270]:
misplaced_items = []
for x in level2_money_taxonid:
  data = get_misplaced_content(taxon_id = x)
  misplaced_items.append(data)

Taxon ID:  28262ae3-599c-4259-ae30-3c83a5ec02a1
Similarity threshold: 0.65
Taxon ID:  b20215a9-25fb-4fa6-80a3-42e23f5352c2
Similarity threshold: 0.65
Taxon ID:  35136812-221c-4ba2-8ed1-46e409ca5e10
Similarity threshold: 0.65
Taxon ID:  a5c88a77-03ba-4100-bd33-7ee2ce602dc8
Similarity threshold: 0.65
Taxon ID:  7c4cf197-2dba-4a82-83e2-6c8bb332525c
Similarity threshold: 0.65
Taxon ID:  a7b67f4f-1234-4a0c-90ba-eeed4c5183cd
Similarity threshold: 0.65
Taxon ID:  5605545e-03ca-4520-9519-163ea341bc86
Similarity threshold: 0.65


Concatenate the list of per-taxon outputs into a single dataframe.

In [271]:
misplaced_items = pd.concat(misplaced_items)
misplaced_items

Unnamed: 0,content_id,base_path,title,description,taxon_name,taxon_id
57,f9e12f0e-bd0d-5361-8d26-bc83bfb34729,/hmrc-internal-manuals/vat-womens-sanitary-pro...,vat women’s sanitary products,guidance on the reduced rate for women's sanit...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
66,a211f181-1cc0-45c0-8bb6-0491eb67fc92,/guidance/changes-to-chief-commodity-codes-tar...,changes to chief commodity codes (tariff stop ...,find out the changes to commodity codes in the...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
196,6f019571-54be-4344-aede-cebd901c1fe5,/guidance/rates-and-allowances-for-air-passeng...,historic rates for air passenger duty,check which air passenger duty rates apply for...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
219,e110d285-20e0-431e-a394-39edabb2b331,/guidance/air-passenger-duty-and-connected-fli...,air passenger duty and connected flights,check which flights to treat as connected for ...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
256,eb031ebb-7078-4879-a124-33753c4ca0bd,/guidance/rates-and-allowances-for-air-passeng...,rates for air passenger duty,check which rates of air passenger duty you ne...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
278,5f60a446-f47c-403a-aab3-bd83db20cf4f,/guidance/poultry-from-iceland-tariff-quota-no...,poultry from iceland (tariff quota notice 73),check the new tariff quota for poultry from ic...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
343,6eb3a99b-9a0b-464a-bb42-c08882c7d857,/government/publications/iso-country-codes,iso country codes,find out the iso country codes.,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
397,5d5afda3-7631-11e4-a3cb-005056011aef,/government/news/government-to-waive-vat-on-mi...,government to waive vat on military wives’ cha...,chancellor of the exchequer has today announce...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
430,0fea02ed-c1c8-4502-a7a2-f0ebebe1ee1c,/guidance/laser-skin-treatment-and-hair-remova...,laser skin treatment and hair removal (tariff ...,check the tariff classification of electrical ...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
478,e36ebdbf-b8df-4dc8-beb5-beece2f7b7de,/government/collections/gwe-rwydo-a-sgamiau,gwe-rwydo a sgamiau,cyngor ar ddiogelwch gan gyllid a thollau em i...,Business tax,28262ae3-599c-4259-ae30-3c83a5ec02a1
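
The index values in the concatenated output (57, 66, 196, ...) are carried over from each per-taxon dataframe, so they may repeat across taxons. A minimal sketch, using made-up toy frames rather than the real output of `get_misplaced_content`, of how `ignore_index=True` would give a fresh, unique index if that matters downstream:

```python
import pandas as pd

# Toy stand-ins for two per-taxon results (hypothetical values).
taxon_a = pd.DataFrame(
    {"content_id": ["a1", "a2"], "taxon_id": ["t1", "t1"]}, index=[57, 66]
)
taxon_b = pd.DataFrame({"content_id": ["b1"], "taxon_id": ["t2"]}, index=[57])

# Default behaviour: the original indices are kept, so 57 appears twice.
kept = pd.concat([taxon_a, taxon_b])

# ignore_index=True discards them and numbers the rows 0..n-1.
fresh = pd.concat([taxon_a, taxon_b], ignore_index=True)

print(kept.index.tolist())   # [57, 66, 57]
print(fresh.index.tolist())  # [0, 1, 2]
```

Keeping the original index is still useful if you want to trace a row back to its position in the per-taxon data.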


In [143]:
misplaced_items.to_csv("~/Desktop/money_level2_misplaced.csv")

#### What other tags do the misplaced items have?

Rename and reorder the columns of the misplaced-items dataframe. The `misplaced_` prefix marks the taxon in which the content item was flagged as misplaced, which keeps it distinct from the item's other tags after merging.

In [302]:
misplaced_items = misplaced_items.rename(columns={
    'taxon_name': 'misplaced_taxon_name',
    'taxon_id': 'misplaced_taxon_id',
    'base_path': 'misplaced_base_path'
}
)

misplaced_items = misplaced_items[[
    'content_id', 'title', 'description',
    'misplaced_taxon_id', 'misplaced_taxon_name', 'misplaced_base_path'
]]

misplaced_items.head()

Unnamed: 0,content_id,title,description,misplaced_taxon_id,misplaced_taxon_name,misplaced_base_path
57,f9e12f0e-bd0d-5361-8d26-bc83bfb34729,vat women’s sanitary products,guidance on the reduced rate for women's sanit...,28262ae3-599c-4259-ae30-3c83a5ec02a1,Business tax,/hmrc-internal-manuals/vat-womens-sanitary-pro...
66,a211f181-1cc0-45c0-8bb6-0491eb67fc92,changes to chief commodity codes (tariff stop ...,find out the changes to commodity codes in the...,28262ae3-599c-4259-ae30-3c83a5ec02a1,Business tax,/guidance/changes-to-chief-commodity-codes-tar...
196,6f019571-54be-4344-aede-cebd901c1fe5,historic rates for air passenger duty,check which air passenger duty rates apply for...,28262ae3-599c-4259-ae30-3c83a5ec02a1,Business tax,/guidance/rates-and-allowances-for-air-passeng...
219,e110d285-20e0-431e-a394-39edabb2b331,air passenger duty and connected flights,check which flights to treat as connected for ...,28262ae3-599c-4259-ae30-3c83a5ec02a1,Business tax,/guidance/air-passenger-duty-and-connected-fli...
256,eb031ebb-7078-4879-a124-33753c4ca0bd,rates for air passenger duty,check which rates of air passenger duty you ne...,28262ae3-599c-4259-ae30-3c83a5ec02a1,Business tax,/guidance/rates-and-allowances-for-air-passeng...


The `labelled` dataframe has one row per content item/taxon tag pair. Keep only the ID and name columns so that it can be merged into the dataframe of misplaced content.

In [305]:
content_id_name = labelled[['content_id', 'taxon_id', 'taxon_name']]

content_id_name = content_id_name.rename(columns = {
    'taxon_id': 'other_taxon_id',
    'taxon_name': 'other_taxon_name'
}
)

content_id_name.sort_values('content_id').head()

Unnamed: 0,content_id,other_taxon_id,other_taxon_name
176658,0000d0a0-037a-4110-a271-24327f422d06,80cb30f4-0361-49ea-ad84-bc10910318bf,Charities and social enterprises
2871,0000d0a0-037a-4110-a271-24327f422d06,668cd623-c7a8-4159-9575-90caac36d4b4,Community and society
232284,00012147-49f6-4e90-be1f-e50bb719f53d,d949275c-88f8-4623-a44b-eb3706651e10,Intellectual property
65198,0001f1a9-3285-4897-baa9-f6663aeb1e8a,f3f4b5d3-49c4-487b-bd5b-be75f11ec8c5,"Government efficiency, transparency and accoun..."
74041,00021930-c266-4312-aeaa-96813c1d8860,3cf97f69-84de-41ae-bc7b-7e2cc238fa58,Environment


Merge and remove redundant rows where misplaced taxon name and other taxon name are the same.

In [313]:
misplaced_content_tags = pd.merge(misplaced_items, content_id_name, on = 'content_id', how = 'left')

misplaced_content_tags = misplaced_content_tags[
    (misplaced_content_tags.misplaced_taxon_name != misplaced_content_tags.other_taxon_name)
]

misplaced_content_tags.head()

Unnamed: 0,content_id,title,description,misplaced_taxon_id,misplaced_taxon_name,misplaced_base_path,other_taxon_id,other_taxon_name
2,a211f181-1cc0-45c0-8bb6-0491eb67fc92,changes to chief commodity codes (tariff stop ...,find out the changes to commodity codes in the...,28262ae3-599c-4259-ae30-3c83a5ec02a1,Business tax,/guidance/changes-to-chief-commodity-codes-tar...,277fd1ce-61c3-46a4-9172-3101fda02111,UK Trade Tariff and classification of goods
3,a211f181-1cc0-45c0-8bb6-0491eb67fc92,changes to chief commodity codes (tariff stop ...,find out the changes to commodity codes in the...,28262ae3-599c-4259-ae30-3c83a5ec02a1,Business tax,/guidance/changes-to-chief-commodity-codes-tar...,1da20ae5-526d-4313-b93d-fed4491f8ed8,Commodity codes and reporting
5,6f019571-54be-4344-aede-cebd901c1fe5,historic rates for air passenger duty,check which air passenger duty rates apply for...,28262ae3-599c-4259-ae30-3c83a5ec02a1,Business tax,/guidance/rates-and-allowances-for-air-passeng...,60677783-056b-484d-89cf-f22a12d1980a,Air passenger duty
7,e110d285-20e0-431e-a394-39edabb2b331,air passenger duty and connected flights,check which flights to treat as connected for ...,28262ae3-599c-4259-ae30-3c83a5ec02a1,Business tax,/guidance/air-passenger-duty-and-connected-fli...,60677783-056b-484d-89cf-f22a12d1980a,Air passenger duty
9,eb031ebb-7078-4879-a124-33753c4ca0bd,rates for air passenger duty,check which rates of air passenger duty you ne...,28262ae3-599c-4259-ae30-3c83a5ec02a1,Business tax,/guidance/rates-and-allowances-for-air-passeng...,b20215a9-25fb-4fa6-80a3-42e23f5352c2,Dealing with HMRC
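
The filter above compares taxon names. Because some taxon names are used by more than one taxon (see the section on duplicate taxon names), comparing `taxon_id`s may be a safer way to drop the self-matches. A sketch on toy data, with hypothetical IDs and names rather than the notebook's real rows:

```python
import pandas as pd

# One misplaced item, flagged in taxon t1 (hypothetical).
misplaced = pd.DataFrame({
    "content_id": ["c1"],
    "misplaced_taxon_id": ["t1"],
    "misplaced_taxon_name": ["Rates"],
})

# All tags for that item; t2 is a different taxon that shares the name "Rates".
tags = pd.DataFrame({
    "content_id": ["c1", "c1", "c1"],
    "other_taxon_id": ["t1", "t2", "t3"],
    "other_taxon_name": ["Rates", "Rates", "Air passenger duty"],
})

merged = pd.merge(misplaced, tags, on="content_id", how="left")

# Filtering on name also drops t2, which is a genuinely different tag.
by_name = merged[merged.misplaced_taxon_name != merged.other_taxon_name]

# Filtering on ID drops only the self-match (t1).
by_id = merged[merged.misplaced_taxon_id != merged.other_taxon_id]

print(by_name.other_taxon_id.tolist())  # ['t3']
print(by_id.other_taxon_id.tolist())    # ['t2', 't3']
```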


In [314]:
misplaced_content_tags.to_csv('~/Desktop/misplaced_content_tags.csv')

## Brexit taxon

The Brexit taxon looks unusual, so this section takes a closer look at it and asks whether it could be split.

### Taxons

Taxons in the Brexit branch: the Brexit taxon itself and its child/grandchild taxons.

In [187]:
brexit_taxons = taxon_id_name[(taxon_id_name.level2taxon == 'Brexit')]
brexit_taxons

Unnamed: 0,taxon_id,taxon_name,level1taxon,level2taxon,level3taxon,level4taxon,level5taxon
92002,d6c2de5d-ef90-45d1-82d4-5f2438369eea,Brexit,Government,Brexit,,,
272400,d7bdaee2-8ea5-460e-b00d-6e9382eb6b61,Brexit guidance for UK citizens,Government,Brexit,Brexit guidance for UK citizens,,


Prepare data on taxon size and cosine score.

In [192]:
taxon_size_score = taxon_homogeneity_df[['taxon_id', 'taxon_size', 'mean_cosine_score']]
taxon_size_score.head()

Unnamed: 0,taxon_id,taxon_size,mean_cosine_score
0,668cd623-c7a8-4159-9575-90caac36d4b4,5166,0.59549
1,f9e476ef-654d-41ec-97d9-2b6842d4361d,786,0.589025
2,495afdb6-47be-4df1-8b38-91c8adb1eefc,8136,0.57151
3,fc5f468f-a3ba-4fde-9c1d-ed2dd17cfd82,31,0.571205
4,b29cf14b-54c6-402c-a9f0-77218602d1af,2333,0.569644


Join the size and score data onto the Brexit taxons.

In [195]:
brexit_taxons_scores = pd.merge(brexit_taxons, taxon_size_score, on = 'taxon_id', how = 'left')
brexit_taxons_scores

Unnamed: 0,taxon_id,taxon_name,level1taxon,level2taxon,level3taxon,level4taxon,level5taxon,taxon_size,mean_cosine_score
0,d6c2de5d-ef90-45d1-82d4-5f2438369eea,Brexit,Government,Brexit,,,,1179,0.433423
1,d7bdaee2-8ea5-460e-b00d-6e9382eb6b61,Brexit guidance for UK citizens,Government,Brexit,Brexit guidance for UK citizens,,,47,0.383072


### Content

Get content that's in the Brexit taxon. The data are labelled with where they are in the taxonomy.

In [209]:
labelled_brexit = labelled[(labelled.level2taxon == 'Brexit')]
labelled_brexit.head()

Unnamed: 0,base_path,content_id,description,document_type,first_published_at,locale,primary_publishing_organisation,publishing_app,title,body,combined_text,taxon_id,taxon_base_path,taxon_name,level1taxon,level2taxon,level3taxon,level4taxon,level5taxon
92002,/government/publications/citizens-rights-admin...,9bea14d9-a73c-4ff7-9956-83602e3e0d8e,details of the uk’s proposed administrative pr...,policy_paper,2017-11-07T13:19:00.000+00:00,en,Home Office,whitehall,citizens' rights: administrative procedures in...,this paper sets out further details on the adm...,citizens' rights: administrative procedures in...,d6c2de5d-ef90-45d1-82d4-5f2438369eea,/government/brexit,Brexit,Government,Brexit,,,
92003,/government/speeches/pm-press-conference-with-...,ac34fc44-11ef-40ad-aee6-69b0be39bed6,the prime minister gave a statement at a joint...,speech,2018-02-16T18:20:00.000+00:00,en,"Prime Minister's Office, 10 Downing Street",whitehall,pm press conference with chancellor merkel: 16...,chancellor merkel ladies and gentlemen we are ...,pm press conference with chancellor merkel: 16...,d6c2de5d-ef90-45d1-82d4-5f2438369eea,/government/brexit,Brexit,Government,Brexit,,,
92004,/government/publications/response-to-the-house...,5fdc0a4d-7631-11e4-a3cb-005056011aef,this command paper sets out the government’s r...,policy_paper,2014-07-22T10:38:27.000+00:00,en,Foreign & Commonwealth Office,whitehall,response to the house of lords european union ...,the government welcomes the european union com...,response to the house of lords european union ...,d6c2de5d-ef90-45d1-82d4-5f2438369eea,/government/brexit,Brexit,Government,Brexit,,,
92005,/eu-withdrawal-act-2018-statutory-instruments/...,2516378a-418b-4004-9c7b-b398fc0d3d61,these instruments amend retained european unio...,statutory_instrument,2018-11-29T16:26:17.000+00:00,en,"Department for Environment, Food & Rural Affairs",specialist-publisher,the common agricultural policy (direct payment...,statutory instrument the common agricultural p...,the common agricultural policy (direct payment...,d6c2de5d-ef90-45d1-82d4-5f2438369eea,/government/brexit,Brexit,Government,Brexit,,,
92006,/eu-withdrawal-act-2018-statutory-instruments/...,d738b3b8-9a3e-4aaa-b1a3-9b121dd90abf,the mutual recognition of protection measures ...,statutory_instrument,2018-10-22T18:44:40.000+00:00,en,Ministry of Justice,specialist-publisher,the mutual recognition of protection measures ...,sifting committees’ recommendation the sifting...,the mutual recognition of protection measures ...,d6c2de5d-ef90-45d1-82d4-5f2438369eea,/government/brexit,Brexit,Government,Brexit,,,


Get the embedded sentences for each content ID.

In [213]:
content_embed = embedded_clean_content[['content_id', 'embedded_sentences']]
content_embed.head()

Unnamed: 0,content_id,embedded_sentences
0,04a0cc0d-0b9f-45ad-bf57-7c54cbab9df9,"[0.0535724014043808, 0.002477730857208371, -0...."
1,a61985b0-d6eb-4cf1-8140-642b9557ce00,"[-0.03585460036993027, 0.02925342693924904, 0...."
2,d569ef4b-d632-49a0-9795-6a7ea934b799,"[0.04697635397315025, 0.0011109516490250826, 0..."
3,5e0fee54-7631-11e4-a3cb-005056011aef,"[0.023101944476366043, 0.02299758419394493, 0...."
4,581cabaf-2ed1-4411-8d1b-35b1bcb559b5,"[0.028265895321965218, 0.0009081338648684323, ..."


Add the embedded sentences to the labelled Brexit content items.

In [217]:
brexit_labelled_embed = pd.merge(labelled_brexit, content_embed, on = 'content_id', how = 'left')
brexit_labelled_embed.head()

Unnamed: 0,base_path,content_id,description,document_type,first_published_at,locale,primary_publishing_organisation,publishing_app,title,body,combined_text,taxon_id,taxon_base_path,taxon_name,level1taxon,level2taxon,level3taxon,level4taxon,level5taxon,embedded_sentences
0,/government/publications/citizens-rights-admin...,9bea14d9-a73c-4ff7-9956-83602e3e0d8e,details of the uk’s proposed administrative pr...,policy_paper,2017-11-07T13:19:00.000+00:00,en,Home Office,whitehall,citizens' rights: administrative procedures in...,this paper sets out further details on the adm...,citizens' rights: administrative procedures in...,d6c2de5d-ef90-45d1-82d4-5f2438369eea,/government/brexit,Brexit,Government,Brexit,,,,"[-0.013846294954419136, 0.017149977385997772, ..."
1,/government/speeches/pm-press-conference-with-...,ac34fc44-11ef-40ad-aee6-69b0be39bed6,the prime minister gave a statement at a joint...,speech,2018-02-16T18:20:00.000+00:00,en,"Prime Minister's Office, 10 Downing Street",whitehall,pm press conference with chancellor merkel: 16...,chancellor merkel ladies and gentlemen we are ...,pm press conference with chancellor merkel: 16...,d6c2de5d-ef90-45d1-82d4-5f2438369eea,/government/brexit,Brexit,Government,Brexit,,,,"[-0.04826006293296814, 0.039077550172805786, -..."
2,/government/publications/response-to-the-house...,5fdc0a4d-7631-11e4-a3cb-005056011aef,this command paper sets out the government’s r...,policy_paper,2014-07-22T10:38:27.000+00:00,en,Foreign & Commonwealth Office,whitehall,response to the house of lords european union ...,the government welcomes the european union com...,response to the house of lords european union ...,d6c2de5d-ef90-45d1-82d4-5f2438369eea,/government/brexit,Brexit,Government,Brexit,,,,"[-0.010840680450201035, -0.006112424191087484,..."
3,/eu-withdrawal-act-2018-statutory-instruments/...,2516378a-418b-4004-9c7b-b398fc0d3d61,these instruments amend retained european unio...,statutory_instrument,2018-11-29T16:26:17.000+00:00,en,"Department for Environment, Food & Rural Affairs",specialist-publisher,the common agricultural policy (direct payment...,statutory instrument the common agricultural p...,the common agricultural policy (direct payment...,d6c2de5d-ef90-45d1-82d4-5f2438369eea,/government/brexit,Brexit,Government,Brexit,,,,"[-0.014112734235823154, -0.01948806457221508, ..."
4,/eu-withdrawal-act-2018-statutory-instruments/...,d738b3b8-9a3e-4aaa-b1a3-9b121dd90abf,the mutual recognition of protection measures ...,statutory_instrument,2018-10-22T18:44:40.000+00:00,en,Ministry of Justice,specialist-publisher,the mutual recognition of protection measures ...,sifting committees’ recommendation the sifting...,the mutual recognition of protection measures ...,d6c2de5d-ef90-45d1-82d4-5f2438369eea,/government/brexit,Brexit,Government,Brexit,,,,"[0.018107034265995026, -0.05084199458360672, 0..."


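### t-SNE

Select the embeddings for content tagged directly to the Brexit taxon, reduce them to two dimensions with t-SNE, and plot the result.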

In [224]:
brexit = embedded_sentences[labelled['taxon_id']=='d6c2de5d-ef90-45d1-82d4-5f2438369eea']

In [225]:
brexit

array([[-0.01384636,  0.01715001,  0.03606272, ..., -0.01534817,
         0.01647961, -0.05993748],
       [-0.04563793,  0.04127745, -0.00644559, ..., -0.03332986,
         0.04976409, -0.06276609],
       [-0.01084065, -0.00611237, -0.01730281, ...,  0.00154343,
         0.05014637, -0.06256321],
       ...,
       [-0.04382474,  0.00842297,  0.0089147 , ..., -0.02628383,
         0.05491083, -0.05879661],
       [-0.00030151, -0.05320191,  0.03599082, ..., -0.04575277,
         0.01496796, -0.04728961],
       [-0.0412386 , -0.02751214,  0.03315298, ..., -0.0589726 ,
        -0.00737035, -0.06045217]], dtype=float32)
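
Indexing `embedded_sentences` with a mask built from `labelled` only works if row *i* of the array corresponds to row *i* of the dataframe. A small sketch on toy data (the array, IDs and shapes here are made up) of selecting rows this way, with a length check as a cheap guard:

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 4 content items with 3-dimensional embeddings.
embeddings = np.arange(12, dtype=np.float32).reshape(4, 3)
labels = pd.DataFrame({"taxon_id": ["brexit", "other", "brexit", "other"]})

# A positional mask is only meaningful if both objects have the same length
# (and the same row order, which this check cannot verify).
assert len(labels) == embeddings.shape[0]

# Convert to a plain boolean array so the dataframe index plays no part.
mask = (labels["taxon_id"] == "brexit").to_numpy()
subset = embeddings[mask]

print(subset.shape)  # (2, 3)
```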

In [226]:
tsne = TSNE(n_components=2, verbose=1, perplexity=20, n_iter=7000)
tsne_results = tsne.fit_transform(brexit)

[t-SNE] Computing 61 nearest neighbors...
[t-SNE] Indexed 1179 samples in 0.021s...
[t-SNE] Computed neighbors for 1179 samples in 1.106s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1179
[t-SNE] Computed conditional probabilities for sample 1179 / 1179
[t-SNE] Mean sigma: 0.213753
[t-SNE] KL divergence after 250 iterations with early exaggeration: 72.629425
[t-SNE] KL divergence after 7000 iterations: 1.003498


In [228]:
df = labelled[labelled['taxon_id']=='d6c2de5d-ef90-45d1-82d4-5f2438369eea'].copy()
df['x-tsne'] = tsne_results[:,0]
df['y-tsne'] = tsne_results[:,1]

In [230]:
brexit_tsne = alt.Chart(df[['x-tsne', 'y-tsne','title']]).mark_circle(size=60).encode(
    alt.X('x-tsne:Q',
         axis=alt.Axis(grid=False)), 
    alt.Y('y-tsne:Q',
         axis=alt.Axis(grid=False)), 
    tooltip=['title:N']
#     size = alt.Size('branch_size:Q'),
#     color=alt.Color('taxon_size:Q', legend=alt.Legend(title="mean taxon size in branch"))
).configure_axis(
    titleFontSize=15).interactive()

brexit_tsne.serve()


Note: if you're in the Jupyter notebook, Chart.serve() is not the best
      way to view plots. Consider using Chart.display().
You must interrupt the kernel to cancel this command.

Serving to http://127.0.0.1:8889/    [Ctrl-C to exit]


127.0.0.1 - - [06/Aug/2019 17:14:47] "GET / HTTP/1.1" 200 -



stopping Server...


## Isolate duplicate taxon names

It turns out that some taxon names are used more than once. This is probably not a problem for a publisher who tags a content item by traversing the tree. It is more likely to be an issue when people search for a taxon by name in the publishing app and select the 'wrong' one.

In [235]:
taxon_homogeneity_df

Unnamed: 0.1,Unnamed: 0,taxon_id,taxon_size,mean_cosine_score,taxon_base_path,taxon_level,level1taxon,fewer_than_or_equal_5items,more_than_0_5_diversity
0,0,668cd623-c7a8-4159-9575-90caac36d4b4,5166,5.954902e-01,/society-and-culture/community-and-society,2,Society and culture,0,1
1,246,f9e476ef-654d-41ec-97d9-2b6842d4361d,786,5.890253e-01,/society-and-culture/sports-and-leisure,2,Society and culture,0,1
2,48,495afdb6-47be-4df1-8b38-91c8adb1eefc,8136,5.715099e-01,/business-and-industry,1,Business and industry,0,1
3,833,fc5f468f-a3ba-4fde-9c1d-ed2dd17cfd82,31,5.712054e-01,/housing-local-and-community/housing-local-ser...,3,"Housing, local and community",0,1
4,18,b29cf14b-54c6-402c-a9f0-77218602d1af,2333,5.696437e-01,/society-and-culture/arts-and-culture,2,Society and culture,0,1
5,38,e491505c-77ae-45b2-84be-8c94b94f6a2b,4917,5.625974e-01,/defence-and-armed-forces,1,Defence and armed forces,0,1
6,133,b297e49c-7da4-4bc1-8714-da80fa0758d3,2332,5.621166e-01,/society-and-culture/equality-rights-and-citiz...,2,Society and culture,0,1
7,58,8a98b827-82ad-49b4-819e-82c208c551c4,9471,5.603833e-01,/government/national-security,2,Government,0,1
8,123,78c2148c-a7cd-448b-8105-9f78ded119d1,1785,5.598941e-01,/government/emergency-preparation-reponse-and-...,2,Government,0,1
9,49,68ad6c84-49fd-4871-95bc-5bd48c0f81e1,2062,5.561067e-01,/regional-and-local-government/wales,2,Regional and local government,0,1


In [316]:
taxon_id_name = labelled[['taxon_id', 'taxon_name']].drop_duplicates()
taxon_id_name.head()

Unnamed: 0,taxon_id,taxon_name
0,668cd623-c7a8-4159-9575-90caac36d4b4,Community and society
5166,d0f1e5a3-c8f4-4780-8678-994f19104b21,Work
7035,f3dcc290-752f-4bbe-b379-9155d919a58d,National Health Service
13699,8f75f298-8126-47d4-8da1-ed67c4ecb39c,Biodiversity and ecosystems
14207,9fb30a53-70fb-4f1c-878b-0064b202d1ba,International aid and development


In [248]:
taxon_info = pd.merge(taxon_homogeneity_df, taxon_id_name, on = 'taxon_id', how = 'left')[['taxon_id', 'taxon_name', 'taxon_base_path', 'taxon_level']]

In [255]:
taxon_duplicates = taxon_info[taxon_info.duplicated(subset=['taxon_name'], keep = False)].sort_values('taxon_name')

In [256]:
taxon_duplicates.to_csv('~/Desktop/taxon_duplicates.csv')
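
The `keep = False` argument matters here: it flags every row that shares a name, rather than only the second and later occurrences. A toy illustration with made-up taxons:

```python
import pandas as pd

# Hypothetical taxons; t1 and t3 share a name.
taxons = pd.DataFrame({
    "taxon_id": ["t1", "t2", "t3"],
    "taxon_name": ["Rates", "Brexit", "Rates"],
})

# Default keep='first' marks only the later repeat.
later_only = taxons[taxons.duplicated(subset=["taxon_name"])]

# keep=False marks all rows in each duplicated group.
all_dupes = taxons[
    taxons.duplicated(subset=["taxon_name"], keep=False)
].sort_values("taxon_name", kind="mergesort")  # stable sort keeps input order within a name

print(later_only.taxon_id.tolist())  # ['t3']
print(all_dupes.taxon_id.tolist())   # ['t1', 't3']
```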