# content-similarity-models

Experiments and use of content similarity models

A set of experiments and models to generate embeddings for content items on GOV.UK. So far these cover:

  1. doc2vec
  2. Google's universal sentence encoder

## Installing / Getting started

When following this guidance, code is executed in the terminal unless specified otherwise.

Clone this repo using:

```sh
git clone git@github.com:alphagov/content-similarity-models.git
```

## Where to get the data

These models use either `clean.content.csv.gz` or `labelled.csv.gz`, which are produced by the dataprep scripts in alphagov/govuk-taxonomy-supervised-learning.
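The compressed CSVs can be read directly with pandas. The sketch below writes a tiny gzipped sample first so it is self-contained; the column names are hypothetical and may not match the real dataprep output:

```python
import gzip

import pandas as pd

# Hypothetical columns -- the real clean.content.csv.gz schema comes from
# the dataprep scripts and may differ.
sample = "content_id,title,body\nabc123,VAT rates,The standard rate is 20%\n"
with gzip.open("sample.content.csv.gz", "wt", encoding="utf-8") as f:
    f.write(sample)

# pandas infers gzip compression from the .gz extension
df = pd.read_csv("sample.content.csv.gz")
print(df.shape)  # (1, 3)
```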

## Python version

These experiments were conducted using Python 3.6.4.

## Virtual environment

Create a new Python 3.6.4 virtual environment using your favourite virtual environment manager. You may need pyenv to pin the Python version; note that pyenv is installed via Homebrew or the pyenv-installer script, not via pip.

If you are new to Python, an easy way to do this is to use PyCharm Community Edition and open this repo as a project. You can then specify which Python interpreter to use (as explained here).

## Using pip to install necessary packages

Then install the required Python packages:

```sh
pip install -r requirements.txt
```

## How to visualise the embeddings in TensorBoard

After saving checkpoints to the specified log directory, e.g. `universal_embeddings`, run:

```sh
tensorboard --logdir=universal_embeddings
```

then go to `localhost:6006` in your browser. The embeddings can take a long time to render.

You can then experiment with the t-SNE and PCA visualisations and colour the pages by any metadata item.
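The checkpoint-saving step above can be sketched as follows, assuming TensorFlow 2 and random vectors standing in for real model output; the directory name matches the example above, and the tensor name string follows the checkpoint naming convention read by the projector plugin:

```python
import os

import numpy as np
import tensorflow as tf
from tensorboard.plugins import projector

log_dir = "universal_embeddings"
os.makedirs(log_dir, exist_ok=True)

# Random stand-in for real page embeddings (100 pages x 512 dims)
vectors = np.random.rand(100, 512).astype("float32")
embedding_var = tf.Variable(vectors, name="page_embeddings")

# metadata.tsv: one label per row, used to colour points in the projector
with open(os.path.join(log_dir, "metadata.tsv"), "w") as f:
    for i in range(len(vectors)):
        f.write(f"page_{i}\n")

# Save the variable in a checkpoint the projector can read
checkpoint = tf.train.Checkpoint(embedding=embedding_var)
checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))

# Point the projector at the checkpointed tensor and its metadata
config = projector.ProjectorConfig()
emb = config.embeddings.add()
emb.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
emb.metadata_path = "metadata.tsv"
projector.visualize_embeddings(log_dir, config)
```

Running `tensorboard --logdir=universal_embeddings` against this directory then shows the points in the Projector tab, labelled by `metadata.tsv`.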

## Contributing

## License
