
Embeddings storage #1

Open
@amercader

Description


The first approach was a separate command that created the embeddings and stored them in the database using pgvector. But to implement Semantic Search we have to index the embeddings in Solr anyway, so to simplify things we dropped pgvector and the embeddings table entirely and just compute the embedding every time a dataset is indexed (in the before_dataset_index plugin hook).
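As a rough illustration of the current approach, a hook shaped like before_dataset_index can compute the embedding inline and attach it to the dict sent to Solr. This is only a sketch: the field name `vector` and the `embed_text` function are invented stand-ins, not the extension's actual names.

```python
# Hypothetical sketch of computing an embedding inside before_dataset_index.
# `embed_text` is a stub: a real implementation would call a sentence
# transformer or an external embeddings API.

def embed_text(text):
    # Stand-in model: one float per word, based on word length.
    return [float(len(word)) for word in text.split()]

def before_dataset_index(data_dict):
    """Add an embedding of the dataset title to the dict indexed in Solr."""
    data_dict["vector"] = embed_text(data_dict.get("title", ""))
    return data_dict

indexed = before_dataset_index({"title": "Air quality data"})
# indexed["vector"] == [3.0, 7.0, 4.0]
```

The cost is that this stub (or the real model/API call) runs on every single indexing pass, which motivates the performance concern below.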

This has the benefit of not having to worry about the embeddings being out of date if you, for instance, update a dataset title. But it is very likely not performant enough, certainly when calling an external API, because the hook is called on each individual dataset, so we can't submit data in bulk.
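The bulk point can be made concrete with a small sketch: a batched client makes one request per chunk of texts, whereas the per-dataset hook forces one request per dataset. `fake_api` below is an invented stand-in that just counts calls.

```python
# Sketch of batched vs per-dataset embedding calls. `fake_api` is a
# hypothetical stand-in for an external embeddings API; it counts how
# many requests are made.

calls = {"n": 0}

def fake_api(texts):
    calls["n"] += 1
    return [[float(len(t))] for t in texts]

def embed_in_batches(texts, batch_size=32):
    """Embed texts in chunks, one API call per chunk."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(fake_api(texts[i:i + batch_size]))
    return vectors

titles = ["dataset %d" % i for i in range(100)]
embed_in_batches(titles)  # 4 API calls (chunks of 32, 32, 32, 4)
# versus 100 calls when each before_dataset_index invocation embeds alone
```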

A probably better option would be to cache the embeddings in the database, creating them beforehand in the after_dataset_create and after_dataset_update hooks. We might not even need pgvector: we could just store them as arrays of floats, or even as strings, which is what we actually send to Solr.
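Storing the vectors as plain strings could look roughly like the following. The helper names are hypothetical, and the comma-separated format is only one plausible encoding of "what we actually send to Solr".

```python
# Sketch, assuming embeddings are cached as text and shipped to Solr as a
# string of comma-separated floats. Both helpers are invented names.

def embedding_to_solr(vector):
    """Serialize a float vector to a comma-separated string."""
    return ",".join(repr(v) for v in vector)

def embedding_from_db(value):
    """Parse a cached string back into a list of floats."""
    return [float(v) for v in value.split(",")]

stored = embedding_to_solr([0.12, -0.5, 1.0])   # "0.12,-0.5,1.0"
assert embedding_from_db(stored) == [0.12, -0.5, 1.0]
```

With this scheme, before_dataset_index would only read the cached value, and the expensive model or API call happens once per create/update rather than on every reindex.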

An additional CLI command to refresh all (or some) embeddings would still be useful, e.g. when changing models.
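A minimal sketch of such a refresh command, using argparse for brevity (a CKAN extension would normally register a Click command group instead). Every name here is hypothetical, including the `all_dataset_ids` lookup.

```python
# Hypothetical refresh command: recompute embeddings for the given dataset
# ids, or for all datasets when none are passed.
import argparse

def all_dataset_ids():
    # Stand-in for a database query returning every dataset id.
    return ["ds-1", "ds-2", "ds-3"]

def refresh_embeddings(dataset_ids=None):
    """Recompute and store embeddings; returns how many were refreshed."""
    ids = dataset_ids or all_dataset_ids()
    for dataset_id in ids:
        pass  # recompute the embedding and update the cached row here
    return len(ids)

def main(argv=None):
    parser = argparse.ArgumentParser(description="Refresh cached embeddings")
    parser.add_argument("ids", nargs="*", help="dataset ids (default: all)")
    args = parser.parse_args(argv)
    return refresh_embeddings(args.ids or None)

main([])  # refreshes all three stub datasets
```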
