# Get duplicate taxon names

## Context

This notebook explores how to identify content in the 'wrong' taxon. For example, the content could be entirely unrelated to the taxon it's been tagged to, or there might be a more granular taxon that it's better suited to.

## Prepare workspace

This assumes for now that your working directory is `/content-similarity-models/google-universal-encoder`.

In [3]:
import numpy as np
import pandas as pd

## Read and prepare data

There are a few data sets used here:

1. Content items and embeddings

    1a. Embedded clean content (pickle file)
    
    1b. Labelled data (data frame with a row per content item)
    
    1c. Embedded sentences (numpy array; superseded by 1a)
    
2. Taxon homogeneity scores

### 1a. Embedded clean content (pickle file)

This pickle file combines the content data with the embeddings, so the separate numpy array described in section '1c. Embedded sentences' is not needed.

In [3]:
embedded_clean_content = pd.read_pickle('../data/embedded_clean_content.pkl')

In [7]:
embedded_clean_content.iloc[1]

base_path                          /government/statistics/uk-labour-market-statis...
content_id                                      a61985b0-d6eb-4cf1-8140-642b9557ce00
description                        employment unemployment economic inactivity cl...
document_type                                                    national_statistics
first_published_at                                     2017-05-17T08:30:02.000+00:00
locale                                                                            en
primary_publishing_organisation                       Office for National Statistics
publishing_app                                                             whitehall
title                                          uk labour market statistics: may 2017
body                               official statistics are produced impartially a...
combined_text                      uk labour market statistics: may 2017 employme...
embedded_sentences                 [-0.03585460036993027, 0.02925
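
How the pickle file was built is not shown in this notebook. As a minimal sketch, one way to combine content data with embeddings is to store each row of the embedding array in a data frame column. The toy data, the 4-dimensional vectors and the names `content`, `embeddings` and `restored` are illustrative assumptions, as is the premise that row `i` of the array belongs to row `i` of the data frame.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical): two content items, one 4-dimensional vector each.
# Assumption: row i of the array belongs to row i of the data frame.
content = pd.DataFrame({
    'content_id': ['a', 'b'],
    'combined_text': ['first item text', 'second item text'],
})
embeddings = np.array([[0.1, 0.2, 0.3, 0.4],
                       [0.5, 0.6, 0.7, 0.8]], dtype=np.float32)

# Guard against misaligned inputs before combining them.
assert len(content) == len(embeddings)

# Store one vector per row so text and embedding travel together.
content['embedded_sentences'] = list(embeddings)

# Round trip: stack the column back into a 2-D array when a matrix is needed.
restored = np.vstack(content['embedded_sentences'].tolist())
```

A frame like this can then be written with `DataFrame.to_pickle` and read back with `pd.read_pickle`, as in the cell above.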

### 1b. Labelled data (csv.gz)
We may also need to read the `labelled.csv` data to create some objects that will be used later. The labelled data is one of the inputs to the `get_homogeneity_scores_taxon.py` script that produces `taxon_homogeneity_df.csv`.

In [8]:
labelled = pd.read_csv(
    '../data/2019-02-11/labelled.csv.gz',
    compression = 'gzip',
    low_memory = False
)

In [9]:
labelled.iloc[1]

base_path                          /government/news/charity-commission-names-furt...
content_id                                      5fa49c52-7631-11e4-a3cb-005056011aef
description                            regulator increases transparency of its work.
document_type                                                          press_release
first_published_at                                     2014-06-04T23:00:00.000+00:00
locale                                                                            en
primary_publishing_organisation                               The Charity Commission
publishing_app                                                             whitehall
title                              charity commission names further charities und...
body                               the charity commission has today named further...
combined_text                      charity commission names further charities und...
taxon_id                                        668cd623-c7a8-415

Each content item can exist in more than one row; it might be tagged to more than one part of the taxonomy.

In [10]:
labelled.shape  # 306k rows

(305703, 19)

In [11]:
labelled.drop_duplicates('content_id').shape  # 208k rows when duplicates removed

(208261, 19)
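
The gap between the two row counts comes from items tagged to several taxons. A small sketch on toy data (the frame `toy_labelled` and its values are made up for illustration) shows how to count taxons per content item and pick out the multi-tagged ones:

```python
import pandas as pd

# Toy stand-in for `labelled`: item 'a' is tagged to two taxons.
toy_labelled = pd.DataFrame({
    'content_id': ['a', 'a', 'b', 'c'],
    'taxon_id': ['t1', 't2', 't1', 't3'],
})

# Number of distinct taxons each content item is tagged to.
taxons_per_item = toy_labelled.groupby('content_id')['taxon_id'].nunique()

# Items tagged to more than one taxon.
multi_tagged = taxons_per_item[taxons_per_item > 1]
print(multi_tagged)
```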

Prepare some objects for later visualisation.

In [9]:
taxon_id_to_base_path = dict(zip(labelled['taxon_id'], labelled['taxon_base_path']))

#taxon_id_to_level = dict(zip(labelled['taxon_id'], labelled['level']))

taxon_id_to_level1 = dict(zip(labelled['taxon_id'], labelled['level1taxon']))

Prepare an object containing the relationship between `taxon_id`, `taxon_name` and the taxon levels.

In [10]:
taxon_id_name = labelled[
    ['taxon_id', 'taxon_name', 'level1taxon', 'level2taxon',
     'level3taxon', 'level4taxon', 'level5taxon']
].drop_duplicates()
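
The `dict(zip(...))` lookups above silently keep the last value seen for any repeated key, so they are only safe if each `taxon_id` maps to a single value. A quick check on toy data (the ids and paths here are invented) could look like this:

```python
import pandas as pd

# Toy stand-in for `labelled`; ids and base paths are hypothetical.
toy_labelled = pd.DataFrame({
    'taxon_id': ['t1', 't1', 't2'],
    'taxon_base_path': ['/work', '/work', '/education'],
})

# dict(zip(...)) keeps the last value seen for each repeated key.
taxon_id_to_base_path = dict(zip(toy_labelled['taxon_id'],
                                 toy_labelled['taxon_base_path']))

# Check the assumption that every taxon_id has exactly one base path.
paths_per_taxon = toy_labelled.groupby('taxon_id')['taxon_base_path'].nunique()
assert (paths_per_taxon == 1).all()
```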

### 1c. Embedded sentences (numpy array)

A numpy array of embeddings for content items. This is superseded by reading the embedded clean content file in section 1a.

In [6]:
embedded_sentences = np.load('../data/embedded_sentences2019-02-11.npy')

In [7]:
embedded_sentences.view()

array([[ 0.05357241,  0.00247775, -0.020976  , ..., -0.05675139,
         0.01268296,  0.01008949],
       [-0.04339019, -0.03241241,  0.00900179, ..., -0.04839829,
         0.03903588, -0.0553612 ],
       [-0.01541414,  0.04076389,  0.04761627, ...,  0.03003302,
         0.02379775, -0.06022731],
       ...,
       [ 0.0387876 ,  0.04077478,  0.04639488, ..., -0.05905833,
         0.00860474, -0.04581202],
       [-0.0426819 ,  0.02391836,  0.02524047, ..., -0.01499535,
         0.0110037 , -0.07002866],
       [ 0.0517485 , -0.01900592,  0.00853599, ..., -0.05199422,
        -0.02320959, -0.05730124]], dtype=float32)

### 2. Taxon homogeneity scores
Homogeneity scores at taxon level, read from the CSV output of the `get_homogeneity_scores_taxon.py` script into a pandas data frame.

In [2]:
taxon_homogeneity_df = pd.read_csv("../data/taxon_homogeneity_df.csv")


In [14]:
taxon_homogeneity_df.shape

(1265, 9)
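
The script that produces these scores is not shown in this notebook, so the following is an illustration only. One plausible homogeneity-style measure is the mean pairwise cosine similarity between the embeddings of the items in a taxon. Whether the script uses similarity or distance, and how it handles very small taxons, is not confirmed here; the function name `mean_pairwise_cosine_similarity` and the toy vectors are assumptions.

```python
import numpy as np

def mean_pairwise_cosine_similarity(vectors):
    """Mean cosine similarity over all distinct pairs of rows.

    Assumes at least two rows and no all-zero rows.
    """
    vectors = np.asarray(vectors, dtype=float)
    # Normalise rows to unit length so dot products are cosine similarities.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Use the upper triangle only, excluding the diagonal (self-similarity).
    upper = np.triu_indices(len(vectors), k=1)
    return sims[upper].mean()

# Toy taxon with three 2-dimensional embeddings (hypothetical values).
toy_taxon = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 0.0]])
score = mean_pairwise_cosine_similarity(toy_taxon)
```

In the toy taxon, one pair is identical and two pairs are orthogonal, so the score is one third.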

In [1]:
taxon_homogeneity_df.iloc[1]


## Isolate duplicate taxon names

Some taxon names are used more than once. This may not be a problem for a publisher who applies a taxon to a content item by traversing the tree. It is more likely to be an issue when people search for a taxon by name in the content publisher and select the 'wrong' one.

In [57]:
taxon_homogeneity_df

| Unnamed: 0.1 | Unnamed: 0 | taxon_id | taxon_size | mean_cosine_score | taxon_base_path | taxon_level | level1taxon | fewer_than_or_equal_5items | more_than_0_5_diversity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 668cd623-c7a8-4159-9575-90caac36d4b4 | 5166 | 5.954902e-01 | /society-and-culture/community-and-society | 2 | Society and culture | 0 | 1 |
| 1 | 246 | f9e476ef-654d-41ec-97d9-2b6842d4361d | 786 | 5.890253e-01 | /society-and-culture/sports-and-leisure | 2 | Society and culture | 0 | 1 |
| 2 | 48 | 495afdb6-47be-4df1-8b38-91c8adb1eefc | 8136 | 5.715099e-01 | /business-and-industry | 1 | Business and industry | 0 | 1 |
| 3 | 833 | fc5f468f-a3ba-4fde-9c1d-ed2dd17cfd82 | 31 | 5.712054e-01 | /housing-local-and-community/housing-local-ser... | 3 | Housing, local and community | 0 | 1 |
| 4 | 18 | b29cf14b-54c6-402c-a9f0-77218602d1af | 2333 | 5.696437e-01 | /society-and-culture/arts-and-culture | 2 | Society and culture | 0 | 1 |
| 5 | 38 | e491505c-77ae-45b2-84be-8c94b94f6a2b | 4917 | 5.625974e-01 | /defence-and-armed-forces | 1 | Defence and armed forces | 0 | 1 |
| 6 | 133 | b297e49c-7da4-4bc1-8714-da80fa0758d3 | 2332 | 5.621166e-01 | /society-and-culture/equality-rights-and-citiz... | 2 | Society and culture | 0 | 1 |
| 7 | 58 | 8a98b827-82ad-49b4-819e-82c208c551c4 | 9471 | 5.603833e-01 | /government/national-security | 2 | Government | 0 | 1 |
| 8 | 123 | 78c2148c-a7cd-448b-8105-9f78ded119d1 | 1785 | 5.598941e-01 | /government/emergency-preparation-reponse-and-... | 2 | Government | 0 | 1 |
| 9 | 49 | 68ad6c84-49fd-4871-95bc-5bd48c0f81e1 | 2062 | 5.561067e-01 | /regional-and-local-government/wales | 2 | Regional and local government | 0 | 1 |


In [58]:
taxon_id_name = labelled[['taxon_id', 'taxon_name']].drop_duplicates()
taxon_id_name.head()

| Unnamed: 0 | taxon_id | taxon_name |
|---|---|---|
| 0 | 668cd623-c7a8-4159-9575-90caac36d4b4 | Community and society |
| 5166 | d0f1e5a3-c8f4-4780-8678-994f19104b21 | Work |
| 7035 | f3dcc290-752f-4bbe-b379-9155d919a58d | National Health Service |
| 13699 | 8f75f298-8126-47d4-8da1-ed67c4ecb39c | Biodiversity and ecosystems |
| 14207 | 9fb30a53-70fb-4f1c-878b-0064b202d1ba | International aid and development |


In [59]:
# Attach taxon names to the homogeneity scores; keep identifying columns only.
taxon_info = pd.merge(
    taxon_homogeneity_df, taxon_id_name, on='taxon_id', how='left'
)[['taxon_id', 'taxon_name', 'taxon_base_path', 'taxon_level']]
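
A left merge adds rows whenever a `taxon_id` appears more than once in the right-hand frame, which would inflate the duplicate counts that follow. As a hedged sketch on toy data (the frames `scores` and `names` are hypothetical), the `validate` argument of `pd.merge` can guard against this:

```python
import pandas as pd

# Toy stand-ins for the scores and the id-to-name lookup.
scores = pd.DataFrame({'taxon_id': ['t1', 't2'], 'taxon_level': [2, 3]})
names = pd.DataFrame({'taxon_id': ['t1', 't2'], 'taxon_name': ['Work', 'Wales']})

# validate='many_to_one' raises pandas.errors.MergeError if taxon_id
# is not unique in the right-hand frame.
merged = pd.merge(scores, names, on='taxon_id', how='left',
                  validate='many_to_one')

# The merge should not change the number of rows on the left.
assert len(merged) == len(scores)
```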

In [60]:
# keep=False flags every row that shares its taxon_name with another row.
taxon_duplicates = taxon_info[
    taxon_info.duplicated(subset=['taxon_name'], keep=False)
].sort_values('taxon_name')

In [61]:
#taxon_duplicates.to_csv('~/Desktop/taxon_duplicates.csv')
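
To make the duplicates easier to review, the rows can be collapsed to one line per repeated name. The sketch below uses invented toy data (`toy_taxon_info` and its ids, names and paths are illustrative only) and mirrors the `duplicated(..., keep=False)` step above before summarising:

```python
import pandas as pd

# Toy stand-in for `taxon_info`; all values are hypothetical.
toy_taxon_info = pd.DataFrame({
    'taxon_id': ['t1', 't2', 't3', 't4'],
    'taxon_name': ['Wales', 'Work', 'Wales', 'Housing'],
    'taxon_base_path': ['/regional/wales', '/work', '/transport/wales', '/housing'],
    'taxon_level': [2, 1, 3, 1],
})

# keep=False marks every row whose name occurs more than once.
is_dup = toy_taxon_info.duplicated(subset=['taxon_name'], keep=False)
toy_duplicates = toy_taxon_info[is_dup].sort_values('taxon_name')

# One row per duplicated name: how many taxons share it, and where they sit.
summary = (
    toy_duplicates
    .groupby('taxon_name')
    .agg(n_taxons=('taxon_id', 'nunique'),
         base_paths=('taxon_base_path', list))
    .reset_index()
)
print(summary)
```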