Experiments and use of content similarity models
A set of experiments and models that generate embeddings for content items on GOV.UK.
When following this guidance, code is executed in the terminal unless specified otherwise.
Clone this repo by running:
git clone git@github.com:alphagov/content-similarity-models.git
These models use either clean.content.csv.gz or labelled.csv.gz, both of which are produced by the dataprep scripts in alphagov/govuk-taxonomy-supervised-learning.
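Before running an experiment, it can be useful to confirm that one of these files is in place and looks as expected. The following is a minimal sketch using only the standard library. The helper name `preview_gzipped_csv` is illustrative and not part of this repo, UTF-8 encoding is an assumption, and nothing is assumed about the column layout.

```python
import csv
import gzip
from itertools import islice


def preview_gzipped_csv(path, n_rows=5):
    """Return (header, first n_rows data rows) of a gzip-compressed CSV file.

    Hypothetical helper for illustration; assumes the file is UTF-8 encoded.
    """
    # "rt" opens the compressed file in text mode; newline="" lets the csv
    # module handle quoted fields that contain line breaks.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = list(islice(reader, n_rows))
    return header, rows
```

For example, `header, rows = preview_gzipped_csv("labelled.csv.gz")` reads only the start of the file rather than loading it all into memory. If a row holds a very large text field, the csv module's default field size limit may need raising with `csv.field_size_limit`.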
These experiments were conducted using Python 3.6.4.
Create a new Python 3.6.4 virtual environment using your favourite virtual environment manager (you may need pyenv to specify the Python version, which you can get using pip install pyenv).
If you are new to Python, an easy way to do this is to open this repo as a project in PyCharm Community Edition, where you can specify which Python interpreter to use.
Then install required python packages:
pip install -r requirements.txt
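Once the environment is active, a quick check that the interpreter matches the version used for the experiments can save debugging time later. The sketch below is an assumption about how one might do this, not a script from this repo. `version_mismatch` is a hypothetical helper, and comparing only the major and minor version is a design choice made here.

```python
import sys

# Version the experiments were run with, as stated in this README.
EXPECTED_VERSION = (3, 6, 4)


def version_mismatch(actual=None, expected=EXPECTED_VERSION):
    """Return a warning string if major.minor differ from expected, else None."""
    # Default to the running interpreter when no version is passed in.
    actual = tuple(sys.version_info[:3]) if actual is None else tuple(actual)
    if actual[:2] != expected[:2]:
        return "Running Python {}, but the experiments used {}".format(
            ".".join(map(str, actual)), ".".join(map(str, expected))
        )
    return None


if __name__ == "__main__":
    message = version_mismatch()
    print(message or "Python version matches the one used for the experiments")
```

Patch-level differences (for example 3.6.9 instead of 3.6.4) are treated as acceptable in this sketch. Tighten the comparison if exact reproducibility matters.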
After saving checkpoints to a log directory (e.g. universal_embeddings), launch TensorBoard:
tensorboard --logdir=universal_embeddings
Then go to localhost:6006
in your browser. The embeddings can take a long time to render.
You can then experiment with the t-SNE and PCA algorithms for visualisation and colour the pages by any metadata item.
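The TensorBoard projector reads metadata from a tab-separated file with one row per embedded item, in the same order as the embedding vectors. When the file has more than one column, the first row is treated as a header. A minimal sketch of writing such a file follows. The helper `write_projector_metadata` and the column names in the usage note are hypothetical and not taken from this repo.

```python
def write_projector_metadata(path, records, columns):
    """Write a header row plus one TSV row per record.

    `records` is a list of dicts, in the same order as the embedding vectors.
    Hypothetical helper for illustration only.
    """
    if len(columns) < 2:
        # Single-column metadata files are expected to have no header row,
        # so this sketch only handles the multi-column case.
        raise ValueError("provide at least two columns")
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("\t".join(columns) + "\n")
        for record in records:
            # Tabs or newlines inside a value would break the row layout,
            # so collapse every run of whitespace to a single space.
            cells = [" ".join(str(record.get(col, "")).split()) for col in columns]
            handle.write("\t".join(cells) + "\n")
```

For instance, `write_projector_metadata("metadata.tsv", records, ["title", "document_type"])` would let the projector colour points by the illustrative `document_type` column.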