Support registered vectors #12492
Conversation
For example, using bpemb vectors (@koaning):

```python
from typing import cast, Callable, Optional
from pathlib import Path
import warnings

from bpemb import BPEmb

from spacy.util import registry
from spacy.vectors import BaseVectors
from spacy.vocab import Vocab
from thinc.api import Ops, get_current_ops
from thinc.backends import get_array_ops
from thinc.types import Floats2d


class BPEmbVectors(BaseVectors):
    def __init__(
        self,
        *,
        strings: Optional[str] = None,
        lang: Optional[str] = None,
        vs: Optional[int] = None,
        dim: Optional[int] = None,
        cache_dir: Optional[Path] = None,
        encode_extra_options: Optional[str] = None,
        model_file: Optional[Path] = None,
        emb_file: Optional[Path] = None,
    ):
        kwargs = {}
        if lang is not None:
            kwargs["lang"] = lang
        if vs is not None:
            kwargs["vs"] = vs
        if dim is not None:
            kwargs["dim"] = dim
        if cache_dir is not None:
            kwargs["cache_dir"] = cache_dir
        if encode_extra_options is not None:
            kwargs["encode_extra_options"] = encode_extra_options
        if model_file is not None:
            kwargs["model_file"] = model_file
        if emb_file is not None:
            kwargs["emb_file"] = emb_file
        self.bpemb = BPEmb(**kwargs)
        self.strings = strings
        self.name = repr(self.bpemb)
        self.n_keys = -1
        self.mode = "BPEmb"
        self.to_ops(get_current_ops())

    def __contains__(self, key):
        return True

    def is_full(self):
        return True

    def add(self, key, *, vector=None, row=None):
        warnings.warn(
            (
                "Skipping BPEmbVectors.add: the bpemb vector table cannot be "
                "modified. Vectors are calculated from bytepieces."
            )
        )
        return -1

    def __getitem__(self, key):
        return self.get_batch([key])[0]

    def get_batch(self, keys):
        keys = [self.strings.as_string(key) for key in keys]
        bp_ids = self.bpemb.encode_ids(keys)
        ops = get_array_ops(self.bpemb.emb.vectors)
        indices = ops.asarray(ops.xp.hstack(bp_ids), dtype="int32")
        lengths = ops.asarray([len(x) for x in bp_ids], dtype="int32")
        vecs = ops.reduce_mean(
            cast(Floats2d, self.bpemb.emb.vectors[indices]), lengths
        )
        return vecs

    @property
    def shape(self):
        return self.bpemb.vectors.shape

    def __len__(self):
        return self.shape[0]

    @property
    def vectors_length(self):
        return self.shape[1]

    @property
    def size(self):
        return self.bpemb.vectors.size

    def to_ops(self, ops: Ops):
        self.bpemb.emb.vectors = ops.asarray(self.bpemb.emb.vectors)


@registry.misc("BPEmbVectors.v1")
def create_bpemb_vectors(
    lang: Optional[str],
    vs: Optional[int],
    dim: Optional[int],
    cache_dir: Optional[Path],
    encode_extra_options: Optional[str],
    model_file: Optional[Path],
    emb_file: Optional[Path],
) -> Callable[[Vocab], BPEmbVectors]:
    def bpemb_vectors_factory(vocab: Vocab) -> BPEmbVectors:
        return BPEmbVectors(
            strings=vocab.strings,
            lang=lang,
            vs=vs,
            dim=dim,
            cache_dir=cache_dir,
            encode_extra_options=encode_extra_options,
            model_file=model_file,
            emb_file=emb_file,
        )

    return bpemb_vectors_factory
```
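The ragged mean in `get_batch` (hstack the subword ids, look up the rows, then average per input key) can be illustrated standalone. This is a hypothetical plain-NumPy sketch of the same pooling logic, not part of the PR:

```python
import numpy as np


def mean_pool(vectors: np.ndarray, subword_ids: list) -> np.ndarray:
    """Average the subword rows for each input key, mirroring the
    flattened-indices + lengths approach used in get_batch."""
    # Flatten all subword ids into one index array, as with xp.hstack.
    indices = np.hstack(subword_ids)
    rows = vectors[indices]
    lengths = [len(ids) for ids in subword_ids]
    out = np.empty((len(subword_ids), vectors.shape[1]), dtype=vectors.dtype)
    start = 0
    for i, n in enumerate(lengths):
        # Each output row is the mean of that key's subword vectors.
        out[i] = rows[start : start + n].mean(axis=0)
        start += n
    return out


table = np.array([[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]])
# First key maps to subwords 0 and 1, second key to subword 2.
pooled = mean_pool(table, [[0, 1], [2]])
# pooled -> [[1.0, 3.0], [4.0, 6.0]]
```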
(At least the immediate CI failures are github glitches.)
Super cool! Thanks for the ping. Just sayin', that bpemb example might serve as a great plugin, if only for the language support, but also because it seems to be a nice tangible example of an implementation. Something like …
Yeah, it would definitely make sense to publish this kind of thing in an extra package, and we could also include support for ….

We'd have to decide whether we want to support internal serialization or not. And I wouldn't know what to call the package, because something like …
Maybe something like ….

The main idea that I have is that we may want to consider embeddings that go beyond words. Maybe sentences? I've made an "embeddings for sklearn" project and the main embeddings that I end up using are the ones from sentence-bert. I also recall using the universal sentence encoder a lot, but supporting ….

Another idea might be to have projects that are specific. The downside of having ….

Another reason to consider the "project-per-project" approach is that …
These methods are needed in various places for training and vector similarity.
I'm not sure I would want to maintain a bunch of small packages, and we can be selective about which ones we would add to our own package, making it a bit like ….

We wouldn't plan to host any embeddings, just as embeddings at least. (As part of a distributed model, if we implement internal serialization: maybe.)
The main open issue here is that we need some sort of …:

```ini
[vectors]
@misc = "spacy.Vectors.v1"
```

To me it feels like this belongs under …. Would it make sense to use …? (A registry other than …)
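For the BPEmb example above, a corresponding config fragment could look roughly like the following. This is a hedged sketch: the section name and parameter values are illustrative assumptions, not settings confirmed by this PR.

```ini
[vectors]
@misc = "BPEmbVectors.v1"
lang = "multi"
vs = 10000
dim = 100
cache_dir = null
encode_extra_options = null
model_file = null
emb_file = null
```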
Will come back to this after #12625 is merged.
This is a nice feature, and I like the idea of maintaining a separate small library with example implementations, like spacy-loggers does!

Only had a few smaller comments.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Only found a few more typos, otherwise this should be good to merge!
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Description
Support registered vectors.
Types of change
Enhancement.
Checklist