# content-similarity-models

Experiments and use of content similarity models

A set of experiments and models to generate embeddings for content items on GOV.UK. So far these cover:

  1. doc2vec
  2. Google's universal sentence encoder

## Installing / Getting started

When following this guidance, code is executed in the terminal unless specified otherwise.

Clone this repo using:

```sh
git clone git@github.com:alphagov/content-similarity-models.git
```

## Where to get the data

These models use either `clean.content.csv.gz` or `labelled.csv.gz`, which are produced by the dataprep scripts in alphagov/govuk-taxonomy-supervised-learning.
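The compressed CSVs can be read directly with pandas. The sketch below writes a tiny gzipped sample first so it is self-contained; the column names are hypothetical and may not match the real dataprep output:

```python
import gzip

import pandas as pd

# Hypothetical columns -- the real clean.content.csv.gz schema comes from
# the dataprep scripts and may differ.
sample = "content_id,title,body\nabc123,VAT rates,The standard rate is 20%\n"
with gzip.open("sample.content.csv.gz", "wt", encoding="utf-8") as f:
    f.write(sample)

# pandas infers gzip compression from the .gz extension
df = pd.read_csv("sample.content.csv.gz")
print(df.shape)  # (1, 3)
```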

## Python version

These experiments were conducted using Python 3.6.4.

## Virtual environment

Create a new Python 3.6.4 virtual environment using your favourite virtual environment manager. You may need pyenv to pin the Python version; note that pyenv is installed via Homebrew or the pyenv-installer script, not via pip.

If you are new to Python, an easy way to do this is to use PyCharm Community Edition and open this repo as a project. You can then specify which Python interpreter to use (as explained here).

## Using pip to install necessary packages

Then install the required Python packages:

```sh
pip install -r requirements.txt
```

## How to visualise the embeddings in TensorBoard

After saving checkpoints to the specified log directory, e.g. `universal_embeddings`, run:

```sh
tensorboard --logdir=universal_embeddings
```

then go to `localhost:6006` in your browser. The embeddings can take a long time to render.

You can then experiment with the t-SNE and PCA visualisations and colour the pages by any metadata item.
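The checkpoint-saving step above can be sketched as follows, assuming TensorFlow 2 and random vectors standing in for real model output; the directory name matches the example above, and the tensor name string follows the checkpoint naming convention read by the projector plugin:

```python
import os

import numpy as np
import tensorflow as tf
from tensorboard.plugins import projector

log_dir = "universal_embeddings"
os.makedirs(log_dir, exist_ok=True)

# Random stand-in for real page embeddings (100 pages x 512 dims)
vectors = np.random.rand(100, 512).astype("float32")
embedding_var = tf.Variable(vectors, name="page_embeddings")

# metadata.tsv: one label per row, used to colour points in the projector
with open(os.path.join(log_dir, "metadata.tsv"), "w") as f:
    for i in range(len(vectors)):
        f.write(f"page_{i}\n")

# Save the variable in a checkpoint the projector can read
checkpoint = tf.train.Checkpoint(embedding=embedding_var)
checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))

# Point the projector at the checkpointed tensor and its metadata
config = projector.ProjectorConfig()
emb = config.embeddings.add()
emb.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
emb.metadata_path = "metadata.tsv"
projector.visualize_embeddings(log_dir, config)
```

Running `tensorboard --logdir=universal_embeddings` against this directory then shows the points in the Projector tab, labelled by `metadata.tsv`.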

## Contributing

## License
