# Explore the use of cosine similarity measures on the money branch to improve its structure

# Context

This notebook is for identifying areas of the taxonomy where content is not well sorted and may require further curation. We're focusing on the money branch.

We're looking to explore and flag:

1. Content in the wrong place (it's semantically different to other items in that taxon)
1. Odd taxon structure (content diversity, taxon size and depth)
1. Taxons that need splitting (clusters of closely-related content exist within a taxon)
1. Taxons that need merging (there's a large overlap in content tagging between branches)

# Prepare workspace

This notebook assumes, for now, that your working directory is `/content-similarity-models/google-universal-encoder`.

In [None]:
!pwd

In [None]:
import numpy as np
import pandas as pd

from sklearn.metrics import pairwise_distances, pairwise_distances_chunked

import altair as alt
from altair import datum

# Read and prepare data

## Embedded sentences

In [None]:
embedded_sentences = np.load('../data/embedded_sentences2019-02-11.npy')

## Labelled data
We also need to read the `labelled.csv` data to create some objects that will be used later. The labelled data is one of the inputs to the `get_homogeneity_scores_taxon.py` script that produces `taxon_homogeneity_df.csv`.

In [None]:
labelled = pd.read_csv(
    '../data/2019-02-11/labelled.csv.gz',
    compression='gzip',
    low_memory=False
)

In [None]:
labelled

Prepare objects for later visualisation

In [None]:
taxon_id_to_base_path = dict(zip(labelled['taxon_id'], labelled['taxon_base_path']))

#taxon_id_to_level = dict(zip(labelled['taxon_id'], labelled['level']))

taxon_id_to_level1 = dict(zip(labelled['taxon_id'], labelled['level1taxon']))

In [None]:
taxons = labelled['taxon_id'].unique()

## Branch homogeneity
Read the homogeneity data, which is a Pandas data frame output from the `get_homogeneity_scores_taxon.py` script.

In [None]:
taxon_homogeneity_df = pd.read_csv("../data/taxon_homogeneity_df.csv")

In [None]:
taxon_homogeneity_df.shape

In [None]:
taxon_homogeneity_df.head()

# Explore flagging

## 1. Odd taxon structure

In [None]:
numcols = 6  # specify the number of columns you want
level1taxons = taxon_homogeneity_df['level1taxon'].unique() 


money = taxon_homogeneity_df[taxon_homogeneity_df.level1taxon == 'Money'].copy()

total_size = money['taxon_size'].sum().astype(str)

money_plot = alt.Chart(money).mark_circle(size=60).encode(
    alt.X(
        'taxon_size:Q',
        scale=alt.Scale(type='log', domain=(1, 10000)),
        axis=alt.Axis(grid=False, title='log(topic_size)')
    ),
    alt.Y(
        'mean_cosine_score:Q',
        scale=alt.Scale(domain=(0, 0.6)),
        axis=alt.Axis(grid=False, title='content diversity score')
    ), 
    #color='taxon_level:N',
    color=alt.Color('taxon_level:N', scale=alt.Scale(scheme='magma')),
    opacity=alt.value(0.8), 
    tooltip=['taxon_base_path']
).properties(
        title='Money' + ", " + total_size).interactive()

In [None]:
money_plot.save('money.html', scale_factor=2.0)

## 2. Content in the wrong place
Content may have been tagged in the wrong place. How can we identify this? One idea is to calculate the cosine distance (1 minus the cosine similarity) between each content item and all the others within a taxon, and then inspect the items whose mean distance is above a certain threshold (i.e. they're semantically different to everything else).
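
As a minimal sketch of that idea, the toy example below uses made-up 2-D vectors in place of real sentence embeddings (the array `toy_embeddings` and its values are assumptions for illustration only). It shows that `pairwise_distances` with `metric='cosine'` returns distances, so the item with the largest mean distance is the one least like the rest.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Toy 2-D "embeddings" (hypothetical values): three items point roughly
# the same way, the fourth points somewhere else entirely
toy_embeddings = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [1.0, 0.0],
    [0.0, 1.0],  # the odd one out
])

# Cosine distance = 1 - cosine similarity, so larger means less similar
toy_dist = pairwise_distances(toy_embeddings, metric='cosine')

# Mean distance from each item to every item in the group
# (the zero self-distance on the diagonal is included, as in the cells below)
toy_mean_dist = toy_dist.mean(axis=1)

# The item with the largest mean distance is the candidate for being misplaced
most_distant = int(toy_mean_dist.argmax())
print(toy_mean_dist.round(3), most_distant)
```

Because the diagonal is included in the mean, scores for very small taxons are pulled towards zero, which is worth bearing in mind when comparing taxons of different sizes.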

### Example: 'business tax' taxon

Store the taxon ID as a variable.

In [None]:
btax_id = '28262ae3-599c-4259-ae30-3c83a5ec02a1'

Filter the embedded sentences (a numpy array) to the rows that match the business tax taxon ID. `embedded_sentences` and `labelled` share the same row order, so a boolean mask built from `labelled` can be used to filter the array.
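
The filtering relies on the two objects being aligned row for row. A quick sanity check might look like the sketch below, which uses small stand-in objects (`toy_embedded`, `toy_labelled`) and a hypothetical helper `rows_are_aligned`; none of these names come from the project code. Matching row counts are a necessary condition only, they do not prove the order is the same.

```python
import numpy as np
import pandas as pd


def rows_are_aligned(embeddings, frame):
    """Return True if there is exactly one embedding row per data frame row."""
    return embeddings.shape[0] == len(frame)


# Toy stand-ins for `embedded_sentences` and `labelled` (hypothetical data)
toy_embedded = np.arange(12, dtype=float).reshape(4, 3)
toy_labelled = pd.DataFrame({'taxon_id': ['a', 'b', 'a', 'c']})

assert rows_are_aligned(toy_embedded, toy_labelled)

# Converting the mask to a plain boolean array means the filter depends only
# on row position, not on the pandas index
mask = (toy_labelled['taxon_id'] == 'a').to_numpy()
toy_subset = toy_embedded[mask]
print(toy_subset.shape)
```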

In [None]:
btax_embedded = embedded_sentences[labelled['taxon_id'] == btax_id]

Get the cosine distance for all content item pairs in the taxon, convert to a Pandas data frame and then get the mean distance for each content item.

In [None]:
btax_dist = pairwise_distances(
    btax_embedded, 
    metric = 'cosine', 
    n_jobs = -1
)

In [None]:
btax_dist_df = pd.DataFrame(btax_dist)

In [None]:
btax_dist_df['mean'] = btax_dist.mean(axis = 1)

In [None]:
btax_dist_df

How many content items (rows) have a larger mean distance than the overall mean?

In [None]:
btax_dist_df[btax_dist_df['mean'] > btax_dist.mean()].shape

Now we can use this information to filter the data frame of labelled content items (`labelled`), leaving us with a data frame of the problem content.

We can start by filtering the `labelled` data so we have only the content items that are in the business tax taxon.

In [None]:
btax_content = labelled[labelled['taxon_id'] == btax_id].reset_index()

In [None]:
btax_content

Now return content items from the data frame where the mean cosine distance is above a threshold value (0.65 here). These are the problem content items. Simplify the output to three columns of interest.

In [None]:
btax_content.loc[btax_dist_df['mean'] > 0.65, ['base_path', 'title', 'description']]

## Function to get odd content

In [None]:
def get_misplaced_content(
    taxon_id = '28262ae3-599c-4259-ae30-3c83a5ec02a1',
    similarity_threshold = 0.65,
    embedded_sentences_data = embedded_sentences,
    labelled_data = labelled
):
    
    """Identify content items that seem out of place in a given taxon.
    The mean cosine distance from each content item to every item in the taxon is calculated.
    Content items are returned if their mean distance is above the threshold (default 0.65).
    Note that `similarity_threshold` is applied to distances, so higher means less similar.
    """
    
    print('Taxon ID:', taxon_id)
    print('Distance threshold:', similarity_threshold)
    
    # Boolean mask for the rows (content items) tagged to the specified taxon
    in_taxon = (labelled_data['taxon_id'] == taxon_id).values
    
    # Get embeddings for the specified taxon ID (use the arguments, not the globals)
    taxon_embedded = embedded_sentences_data[in_taxon]
    
    # Get cosine distances between all content item pairs
    taxon_dist = pairwise_distances(
        taxon_embedded, 
        metric = 'cosine', 
        n_jobs = -1
    )
    
    # As dataframe
    taxon_dist_df = pd.DataFrame(taxon_dist)
    
    # Calculate the mean distance for each content item
    taxon_dist_df['mean'] = taxon_dist.mean(axis = 1)
    
    # Get the rows of the labelled data (content items) that match the taxon ID
    taxon_content = labelled_data[in_taxon].reset_index()
    
    # Content items whose mean distance is above the threshold
    misplaced = taxon_content.loc[
        taxon_dist_df['mean'] > similarity_threshold,
        ['content_id', 'base_path', 'title', 'description']
    ]
    
    return misplaced
    

In [None]:
test = get_misplaced_content()

In [None]:
test

## 3. Taxon could be split 
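
This section is not yet explored. One possible starting point, sketched here on synthetic data, is to cluster the embeddings within a taxon and check how well separated the clusters are. The choice of `KMeans` with two clusters and the silhouette score is an assumption for illustration, not a method this notebook has settled on, and `toy_taxon_embedded` is made-up data standing in for one taxon's rows of `embedded_sentences`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)

# Synthetic stand-in for one taxon's embeddings (hypothetical data):
# two tight groups pointing in different directions
group_a = rng.normal(loc=[1.0, 0.0, 0.0], scale=0.05, size=(20, 3))
group_b = rng.normal(loc=[0.0, 1.0, 0.0], scale=0.05, size=(20, 3))
toy_taxon_embedded = np.vstack([group_a, group_b])

# Assign each content item to one of two candidate sub-groups
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    toy_taxon_embedded
)

# Silhouette score near 1 suggests well-separated clusters,
# a score near 0 suggests there is no clear split
split_score = silhouette_score(toy_taxon_embedded, cluster_labels, metric='cosine')
print(round(float(split_score), 3))
```

A taxon with a high score for some small number of clusters would then be a candidate for manual review, with the cluster labels indicating which content items might belong together.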

## 4. Taxon could be merged
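
This section is also still to do. Since the aim is to find taxons with a large overlap in content tagging, one simple option is the Jaccard overlap between the sets of `content_id` values tagged to each pair of taxons. The sketch below is an assumption about how that could be measured, using a made-up data frame (`toy_labelled`) with the same `content_id` and `taxon_id` columns as `labelled`; the helper `jaccard_overlap` is hypothetical.

```python
import pandas as pd

# Toy stand-in for `labelled` (hypothetical data): one row per content item / taxon tag
toy_labelled = pd.DataFrame({
    'content_id': ['c1', 'c2', 'c3', 'c2', 'c3', 'c4', 'c5'],
    'taxon_id':   ['t1', 't1', 't1', 't2', 't2', 't2', 't3'],
})

# Set of content items tagged to each taxon
content_by_taxon = toy_labelled.groupby('taxon_id')['content_id'].apply(set)


def jaccard_overlap(items_a, items_b):
    """Share of content items common to both sets, out of all items in either set."""
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0


overlap_t1_t2 = jaccard_overlap(content_by_taxon['t1'], content_by_taxon['t2'])
overlap_t1_t3 = jaccard_overlap(content_by_taxon['t1'], content_by_taxon['t3'])
print(overlap_t1_t2, overlap_t1_t3)
```

Pairs of taxons with a high overlap would be the ones to inspect as possible merges.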