
Support registered vectors #12492

Merged
svlandeg merged 22 commits into explosion:develop on Aug 1, 2023

Conversation

adrianeboyd (Contributor)

Description

Support registered vectors.

Types of change

Enhancement.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@adrianeboyd adrianeboyd added enhancement Feature requests and improvements feat / vectors Feature: Word vectors and similarity v3.6 Related to v3.6 labels Mar 31, 2023
@adrianeboyd (Contributor, Author) commented Mar 31, 2023

For example, using bpemb vectors (@koaning):

from typing import cast, Callable, Optional
from pathlib import Path
import warnings
from bpemb import BPEmb
from spacy.strings import StringStore
from spacy.util import registry
from spacy.vectors import BaseVectors
from spacy.vocab import Vocab
from thinc.api import Ops, get_current_ops
from thinc.backends import get_array_ops
from thinc.types import Floats2d

class BPEmbVectors(BaseVectors):
    def __init__(
        self,
        *,
        strings: Optional[StringStore] = None,
        lang: Optional[str] = None,
        vs: Optional[int] = None,
        dim: Optional[int] = None,
        cache_dir: Optional[Path] = None,
        encode_extra_options: Optional[str] = None,
        model_file: Optional[Path] = None,
        emb_file: Optional[Path] = None,
    ):
        kwargs = {}
        if lang is not None:
            kwargs["lang"] = lang
        if vs is not None:
            kwargs["vs"] = vs
        if dim is not None:
            kwargs["dim"] = dim
        if cache_dir is not None:
            kwargs["cache_dir"] = cache_dir
        if encode_extra_options is not None:
            kwargs["encode_extra_options"] = encode_extra_options
        if model_file is not None:
            kwargs["model_file"] = model_file
        if emb_file is not None:
            kwargs["emb_file"] = emb_file
        self.bpemb = BPEmb(**kwargs)
        self.strings = strings
        self.name = repr(self.bpemb)
        self.n_keys = -1
        self.mode = "BPEmb"
        self.to_ops(get_current_ops())

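    # bpemb composes a vector for any string from its byte-pair pieces, so
    # the table behaves as if it contains every key and is always full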
    def __contains__(self, key):
        return True

    def is_full(self):
        return True

    def add(self, key, *, vector=None, row=None):
        warnings.warn(
            (
                "Skipping BPEmbVectors.add: the bpemb vector table cannot be "
                "modified. Vectors are calculated from bytepieces."
            )
        )
        return -1

    def __getitem__(self, key):
        return self.get_batch([key])[0]

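    # encode each key into byte-pair ids, then mean-pool the corresponding
    # subword vectors into a single vector per key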
    def get_batch(self, keys):
        keys = [self.strings.as_string(key) for key in keys]
        bp_ids = self.bpemb.encode_ids(keys)
        ops = get_array_ops(self.bpemb.emb.vectors)
        indices = ops.asarray(ops.xp.hstack(bp_ids), dtype="int32")
        lengths = ops.asarray([len(x) for x in bp_ids], dtype="int32")
        vecs = ops.reduce_mean(cast(Floats2d, self.bpemb.emb.vectors[indices]), lengths)
        return vecs

    @property
    def shape(self):
        return self.bpemb.vectors.shape

    def __len__(self):
        return self.shape[0]

    @property
    def vectors_length(self):
        return self.shape[1]

    @property
    def size(self):
        return self.bpemb.vectors.size

    def to_ops(self, ops: Ops):
        self.bpemb.emb.vectors = ops.asarray(self.bpemb.emb.vectors)


@registry.misc("BPEmbVectors.v1")
def create_bpemb_vectors(
    lang: Optional[str],
    vs: Optional[int],
    dim: Optional[int],
    cache_dir: Optional[Path],
    encode_extra_options: Optional[str],
    model_file: Optional[Path],
    emb_file: Optional[Path],
) -> Callable[[Vocab], BPEmbVectors]:
    def bpemb_vectors_factory(vocab: Vocab) -> BPEmbVectors:
        return BPEmbVectors(
            strings=vocab.strings,
            lang=lang,
            vs=vs,
            dim=dim,
            cache_dir=cache_dir,
            encode_extra_options=encode_extra_options,
            model_file=model_file,
            emb_file=emb_file,
        )

    return bpemb_vectors_factory
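
A minimal usage sketch (illustrative, not part of the PR; it assumes the bpemb English model can be downloaded on first use and uses the class defined above):

from spacy.vocab import Vocab

vocab = Vocab()
vectors = BPEmbVectors(strings=vocab.strings, lang="en")
print(vectors.name)            # repr of the underlying BPEmb model
print(vectors.shape)           # (number of byte-pair pieces, vector dim)
print(vectors["surfboarding"]) # mean of the byte-pair piece vectors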

@adrianeboyd (Contributor, Author)

(At least the immediate CI failures are GitHub glitches.)

@adrianeboyd adrianeboyd marked this pull request as draft March 31, 2023 16:33
@koaning (Contributor) commented Apr 3, 2023

Super cool! Thanks for the ping.

Just sayin': that bpemb example might serve as a great plugin, not only for the language support, but also because it seems like a nice, tangible example of an implementation.

Something like spacy-bpemb?

@adrianeboyd (Contributor, Author)

Yeah, it would definitely make sense to publish this kind of thing in an extra package, and we could also include support for finalfusion and other similar packages. (Do you have any other recommendations? Implementing this for bpemb was not at all difficult.)

We'd have to decide whether we want to support internal serialization or not. And I wouldn't know what to call the package because something like spacy-vectors sounds pretty ambiguous.

@koaning (Contributor) commented Apr 4, 2023

Maybe something like spacy-embeddings? Or spacy-representations?

The main idea that I have is that we may want to consider embeddings that go beyond words. Maybe sentences? I've made an "embeddings for sklearn" project, and the main embeddings that I end up using are the ones from sentence-bert. I also recall using the universal sentence encoder a lot, but supporting tfhub was such a pain that I decided to stop supporting it.

Another idea might be to have projects that are specific. The downside of having spacy-embeddings is that folks might want us to implement/maintain this month's unicorn vectors. It could be nicer to have spacy-bpemb, spacy-sentence-encoders, etc. That way, if people want to implement their own, they are totally welcome to clone one of our example projects. Seems easier to maintain in the long run.

Another reason to consider the "project-per-project" approach is that bpemb (and others) really have been out for a while. And since it's a research project, it could be abandoned in the future. Would we want to host these embeddings on their behalf? Feels tricky.

Commit: "These methods are needed in various places for training and vector similarity."
@adrianeboyd (Contributor, Author)

I'm not sure I'd want to maintain a bunch of small packages; we can be selective about which ones we add to our own package, making it a bit like spacy-loggers.

We wouldn't plan to host any embeddings, at least not as standalone embeddings. (As part of a distributed model, if we implement internal serialization: maybe.)

@adrianeboyd (Contributor, Author)

The main open issue here is that we need some sort of [vectors] section in the config that is auto-filled for earlier models as:

[vectors]
@misc = "spacy.Vectors.v1"

To me it feels like this belongs under [nlp] because it's a clear part of how the vocab is created, but we're worried that auto-filling [nlp] could mask incompatibilities/bugs related to older configs. On the other hand, we do auto-fill [components] when new settings are added to pipeline components, and this doesn't feel that different from that scenario, where it would only be auto-filled with the current default.

Would it make sense to use a top-level [vectors] section instead and have that be auto-filled? Or restrict the auto-filling to [nlp.vectors] (which might be hackier/trickier due to the schema design)?
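
For concreteness, the two placements under discussion would look roughly like this (section names as sketched above; the @misc registry name reflects the current draft, not a final decision):

# Option A: top-level section
[vectors]
@misc = "spacy.Vectors.v1"

# Option B: nested under [nlp]
[nlp.vectors]
@misc = "spacy.Vectors.v1"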

(A registry other than @misc is a separate and probably much easier question.)

@adrianeboyd (Contributor, Author)

We'll come back to this after #12625 is merged.

@adrianeboyd adrianeboyd added v3.7 Related to v3.7 and removed v3.6 Related to v3.6 labels Jul 4, 2023
@adrianeboyd adrianeboyd changed the base branch from master to develop July 4, 2023 13:26
@adrianeboyd adrianeboyd marked this pull request as ready for review July 31, 2023 12:21
@svlandeg (Member) left a comment

This is a nice feature, and I like the idea of maintaining a separate small library with example implementations like spacy-loggers does!

Only had a few smaller comments.

Review threads (resolved): spacy/vocab.pyx (×2), spacy/default_config.cfg
adrianeboyd and others added 2 commits July 31, 2023 17:45
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
@svlandeg (Member) left a comment

Only found a few more typos, otherwise should be good to merge!

Review threads (resolved): spacy/vocab.pyx (×2), website/docs/api/basevectors.mdx, website/docs/usage/embeddings-transformers.mdx (×2)
adrianeboyd and others added 3 commits August 1, 2023 14:45
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
@svlandeg svlandeg merged commit 0fe43f4 into explosion:develop Aug 1, 2023
9 checks passed