A database for storing and comparing entity embeddings

cthoyt/embeddingdb

Embedding Database

This package provides a database schema and Python wrapper for storing the embeddings generated through various representation learning packages.

Currently, this package focuses on using a SQL database with SQLAlchemy, but might be extended to use a NoSQL database as an alternative.

Installation

Install embeddingdb from PyPI with:

$ pip install embeddingdb

Alternatively, install the latest development version of embeddingdb directly from GitHub with:

$ pip install git+https://github.com/cthoyt/embeddingdb

For developers, install embeddingdb in development mode from GitHub with:

$ git clone https://github.com/cthoyt/embeddingdb.git
$ cd embeddingdb
$ pip install -e .

Set the environment variable EMBEDDINGDB_CONNECTION to a valid SQLAlchemy connection string for a PostgreSQL instance, as this package uses the PostgreSQL-specific ARRAY type.
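
As a minimal sketch of the requirement above, the following checks that the environment variable is set and names a PostgreSQL dialect before anything tries to connect. The helpers is_postgresql_url and get_connection are hypothetical names used for illustration and are not part of embeddingdb; the check relies only on the general dialect+driver://user:password@host:port/database shape of SQLAlchemy connection strings.

```python
import os
from urllib.parse import urlsplit

ENV_VAR = "EMBEDDINGDB_CONNECTION"


def is_postgresql_url(connection: str) -> bool:
    """Return True if the connection string names the PostgreSQL dialect.

    Hypothetical helper, not part of embeddingdb.
    """
    # The scheme of a SQLAlchemy URL is "dialect" or "dialect+driver"
    scheme = urlsplit(connection).scheme
    dialect = scheme.split("+", 1)[0]
    return dialect == "postgresql"


def get_connection(environ=os.environ) -> str:
    """Read the connection string from the environment and validate it.

    Hypothetical helper, not part of embeddingdb.
    """
    connection = environ.get(ENV_VAR)
    if connection is None:
        raise RuntimeError(f"{ENV_VAR} is not set")
    if not is_postgresql_url(connection):
        raise ValueError(f"{ENV_VAR} must point to a PostgreSQL instance")
    return connection
```

Calling get_connection() with no arguments reads from the real environment; passing a plain dictionary instead makes the check easy to exercise without changing the shell configuration.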

Command Line Interface

This package installs a command line entry point, embeddingdb, that can be used directly from the shell.

Uploading Entity Embeddings

Embeddings for entities can come from various types of representation learning, including network representation learning, knowledge graph embedding, and textual learning, and can be stored in the same database regardless of their origin.

Upload embeddings generated by word2vec by specifying the file path with:

$ embeddingdb upload --fmt word2vec --path ~/path/to/file.txt
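
For orientation, here is a minimal sketch of reading such a file, assuming it uses the standard word2vec text format: a header line with the number of entities and the dimension, followed by one line per entity holding its label and vector components separated by whitespace. The function parse_word2vec_text is a hypothetical name for illustration and is not necessarily how embeddingdb loads the file.

```python
from typing import Dict, Iterable, List


def parse_word2vec_text(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse word2vec text format into a mapping from entity to vector.

    Hypothetical helper, not part of embeddingdb.
    """
    iterator = iter(lines)
    # The first line holds the number of entities and the vector dimension
    header = next(iterator).split()
    count, dimension = int(header[0]), int(header[1])

    embeddings: Dict[str, List[float]] = {}
    for line in iterator:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        entity = parts[0]
        vector = [float(value) for value in parts[1:]]
        if len(vector) != dimension:
            raise ValueError(f"{entity} has {len(vector)} values, expected {dimension}")
        embeddings[entity] = vector

    if len(embeddings) != count:
        raise ValueError(f"header promised {count} entities, found {len(embeddings)}")
    return embeddings
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly.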

Upload embeddings generated by pykeen by specifying the output directory with:

$ embeddingdb upload --fmt keen --path ~/path/to/directory/

Listing Entity Embeddings

After uploading, the collections can be listed with:

$ embeddingdb ls

Analyzing Entity Embeddings' Correlations

One of the motivations for building this repository was to provide a convenient way to compare the embeddings for entities generated through orthogonal embedding techniques. For example, we wanted to know to what extent the embeddings for proteins generated from their sequences with ratvec contained the same information as the embeddings generated from protein-protein interaction networks with pykeen or nrl.

Run the analysis by passing the identifiers of two collections in the database as positional arguments:

$ embeddingdb analyze 1 2
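
The statistic computed by this command is not described here. As an illustration only, one common way to ask whether two embedding collections carry similar information, even when their dimensions differ, is to compute the pairwise cosine similarities among the shared entities in each collection and then correlate the two lists of similarities. The sketch below does this with the standard library; the function names are hypothetical, the vectors are assumed to be non-zero, and this is not necessarily the method that embeddingdb analyze uses.

```python
import math
from itertools import combinations
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def similarity_correlation(
    first: Dict[str, Sequence[float]],
    second: Dict[str, Sequence[float]],
) -> float:
    """Correlate pairwise cosine similarities over the shared entities.

    Hypothetical helper, not part of embeddingdb.
    """
    # Only entities present in both collections can be compared
    shared = sorted(set(first) & set(second))
    if len(shared) < 3:
        raise ValueError("need at least three shared entities")
    pairs = list(combinations(shared, 2))
    sims_first = [cosine(first[a], first[b]) for a, b in pairs]
    sims_second = [cosine(second[a], second[b]) for a, b in pairs]
    return pearson(sims_first, sims_second)
```

A value close to 1 means that entities that are similar in one collection tend to be similar in the other. The comparison never mixes vectors across collections, so the two embedding spaces do not need to share a dimension.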

Running with Docker

After installing Docker, the entire web application can be instantiated with:

$ docker-compose up

Send a GET request to the /test endpoint to instantiate the database and add a test collection.