Support registered vectors #12492
Conversation
For example, using bpemb vectors (@koaning):

```python
from typing import cast, Callable, Optional
from pathlib import Path
import warnings

from bpemb import BPEmb

from spacy.util import registry
from spacy.vectors import BaseVectors
from spacy.vocab import Vocab
from thinc.api import Ops, get_current_ops
from thinc.backends import get_array_ops
from thinc.types import Floats2d


class BPEmbVectors(BaseVectors):
    def __init__(
        self,
        *,
        strings: Optional[str] = None,
        lang: Optional[str] = None,
        vs: Optional[int] = None,
        dim: Optional[int] = None,
        cache_dir: Optional[Path] = None,
        encode_extra_options: Optional[str] = None,
        model_file: Optional[Path] = None,
        emb_file: Optional[Path] = None,
    ):
        kwargs = {}
        if lang is not None:
            kwargs["lang"] = lang
        if vs is not None:
            kwargs["vs"] = vs
        if dim is not None:
            kwargs["dim"] = dim
        if cache_dir is not None:
            kwargs["cache_dir"] = cache_dir
        if encode_extra_options is not None:
            kwargs["encode_extra_options"] = encode_extra_options
        if model_file is not None:
            kwargs["model_file"] = model_file
        if emb_file is not None:
            kwargs["emb_file"] = emb_file
        self.bpemb = BPEmb(**kwargs)
        self.strings = strings
        self.name = repr(self.bpemb)
        self.n_keys = -1
        self.mode = "BPEmb"
        self.to_ops(get_current_ops())

    def __contains__(self, key):
        return True

    def is_full(self):
        return True

    def add(self, key, *, vector=None, row=None):
        warnings.warn(
            (
                "Skipping BPEmbVectors.add: the bpemb vector table cannot be "
                "modified. Vectors are calculated from bytepieces."
            )
        )
        return -1

    def __getitem__(self, key):
        return self.get_batch([key])[0]

    def get_batch(self, keys):
        keys = [self.strings.as_string(key) for key in keys]
        bp_ids = self.bpemb.encode_ids(keys)
        ops = get_array_ops(self.bpemb.emb.vectors)
        indices = ops.asarray(ops.xp.hstack(bp_ids), dtype="int32")
        lengths = ops.asarray([len(x) for x in bp_ids], dtype="int32")
        vecs = ops.reduce_mean(
            cast(Floats2d, self.bpemb.emb.vectors[indices]), lengths
        )
        return vecs

    @property
    def shape(self):
        return self.bpemb.vectors.shape

    def __len__(self):
        return self.shape[0]

    @property
    def vectors_length(self):
        return self.shape[1]

    @property
    def size(self):
        return self.bpemb.vectors.size

    def to_ops(self, ops: Ops):
        self.bpemb.emb.vectors = ops.asarray(self.bpemb.emb.vectors)


@registry.misc("BPEmbVectors.v1")
def create_bpemb_vectors(
    lang: Optional[str],
    vs: Optional[int],
    dim: Optional[int],
    cache_dir: Optional[Path],
    encode_extra_options: Optional[str],
    model_file: Optional[Path],
    emb_file: Optional[Path],
) -> Callable[[Vocab], BPEmbVectors]:
    def bpemb_vectors_factory(vocab: Vocab) -> BPEmbVectors:
        return BPEmbVectors(
            strings=vocab.strings,
            lang=lang,
            vs=vs,
            dim=dim,
            cache_dir=cache_dir,
            encode_extra_options=encode_extra_options,
            model_file=model_file,
            emb_file=emb_file,
        )

    return bpemb_vectors_factory
```
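The ragged mean in `get_batch` (hstack the subword ids, look up the rows, then average per input key) can be illustrated standalone. This is a hypothetical plain-NumPy sketch of the same pooling logic, not part of the PR:

```python
import numpy as np


def mean_pool(vectors: np.ndarray, subword_ids: list) -> np.ndarray:
    """Average the subword rows for each input key, mirroring the
    flattened-indices + lengths approach used in get_batch."""
    # Flatten all subword ids into one index array, as with xp.hstack.
    indices = np.hstack(subword_ids)
    rows = vectors[indices]
    lengths = [len(ids) for ids in subword_ids]
    out = np.empty((len(subword_ids), vectors.shape[1]), dtype=vectors.dtype)
    start = 0
    for i, n in enumerate(lengths):
        # Each output row is the mean of that key's subword vectors.
        out[i] = rows[start : start + n].mean(axis=0)
        start += n
    return out


table = np.array([[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]])
# First key maps to subwords 0 and 1, second key to subword 2.
pooled = mean_pool(table, [[0, 1], [2]])
# pooled -> [[1.0, 3.0], [4.0, 6.0]]
```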
(At least the immediate CI failures are github glitches.)
Super cool! Thanks for the ping. Just sayin', that bpemb example might serve as a great plugin, if only for the language support, but also because it seems to be a nice tangible example of an implementation. Something like …
Yeah, it would definitely make sense to publish this kind of thing in an extra package, and we could also include support for ….

We'd have to decide whether we want to support internal serialization or not. And I wouldn't know what to call the package, because something like …
Maybe something like ….

The main idea that I have is that we may want to consider embeddings that go beyond words. Maybe sentences? I've made an "embeddings for sklearn" project and the main embeddings that I end up using are the ones from sentence-bert. I also recall using the universal sentence encoder a lot, but supporting ….

Another idea might be to have projects that are specific. The downside of having ….

Another reason to consider the "project-per-project" approach is that …
These methods are needed in various places for training and vector similarity.
I'm not sure I would want to maintain a bunch of small packages, and we can be selective about which ones we would add to our own package, making it a bit like ….

We wouldn't plan to host any embeddings, just as embeddings at least. (As part of a distributed model, if we implement internal serialization: maybe.)
The main open issue here is that we need some sort of …:

```ini
[vectors]
@misc = "spacy.Vectors.v1"
```

To me it feels like this belongs under …. Would it make sense to use …? (A registry other than …)
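For the BPEmb example above, a corresponding config fragment could look roughly like the following. This is a hedged sketch: the section name and parameter values are illustrative assumptions, not settings confirmed by this PR.

```ini
[vectors]
@misc = "BPEmbVectors.v1"
lang = "multi"
vs = 10000
dim = 100
cache_dir = null
encode_extra_options = null
model_file = null
emb_file = null
```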
Will come back to this after #12625 is merged.
This is a nice feature, and I like the idea of maintaining a separate small library with example implementations, like spacy-loggers does!

Only had a few smaller comments.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Only found a few more typos, otherwise this should be good to merge!
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Description
Support registered vectors.
Types of change
Enhancement.
Checklist