Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set Force=True on all set_extensions #54

Closed
wants to merge 7 commits into from

Conversation

tolomaus
Copy link
Contributor

@tolomaus tolomaus commented Jun 1, 2018

to avoid errors like "Extension 'in_s2v' already exists on Token" when multiple spacy+sense2vec pipelines are loaded in the same process (e.g. when creating the spacy+sense2vec pipeline in a mapPartitions call in spark)

@ines
Copy link
Member

ines commented Jul 21, 2018

Thanks a lot and sorry for only getting to this now!

This example (and similar cases) make me wonder whether spaCy should maybe handle this better... forcing an overwrite feels a little unsatisfying, but I can't really think of a better option.

I also just realised that there's potentially another problem here: Let's say you have two s2v models and you want to use them both in the same process, in different pipelines. The default behaviour is pretty unintuitive here: instance 2 of the sense2vec component will overwrite the extension attributes with its getters, so both your pipelines (and all others in the process) will now use the getters of instance 2. One solution would be to rename the attributes for the other pipeline, but that's also not very nice. I might open a separate issue about this on spaCy to discuss how we should handle this.

In the meantime, how would you feel about adding force_overwrite as an argument to the __init__, which is then passed down to the extensions? This would let you solve your problem in Spark, without making overwriting the default behaviour.

@tolomaus
Copy link
Contributor Author

Ah I see what you mean. I guess the optimal architecture should be then to pass in the reference of the sense2vec model into the lambda expression instead of hardcoding it to self which will indeed be the overridden to the latest sense2vec model that was initialized.

As a quick fix the force_overwrite would work for me though. I will always set it to True since I only have one model at a time.

Thanks!

@ines
Copy link
Member

ines commented Jul 21, 2018

Yeah, that would be an option! Alternatively, we could do something like this and keep the loaded vectors in the global scope, keyed by path (or maybe an optional ID assigned by the user). This ID would then be stored on the Doc:

import spacy
from spacy.tokens import Doc, Span

Doc.set_attribute('s2v_model', default=None)
Span.set_extension('s2v', getter=get_s2v)


MODELS = {}


def load_model(model_path):
    if model_path in MODELS:
        return MODELS[model_path]
    return sense2vec.load(model_path)  # load vectors here


def get_s2v(span):
    model = load_model(span.doc._.s2v_model)
    return  # compute the result here


class Sense2VecComponent(object):
    def __init__(self, model_path):
        self.model = model_path
        load_model(model_path)  # preload model

    def __call__(self, doc):
        doc._.s2v_model = self.model
        # etc.
        return doc

The above would then also allow usage like:

nlp1 = spacy.load('en')
nlp2 = spacy.load('de')

s2v1 = Sense2VecComponent('./en_vectors')
s2v2 = Sense2VecComponent('./de_vectors')

@tolomaus
Copy link
Contributor Author

Nice!

@tolomaus
Copy link
Contributor Author

Oops I just noticed that some other commits & PR's I have done a moment ago were added to this PR. I guess I should have used a branch instead of master, sorry for that

@prathameshd9
Copy link

prathameshd9 commented Jun 14, 2019

I have a similar situation and have used "force=True" for all set_extension() but not sure if it has any side effects ?!

@Narquadah
Copy link

Is there an ETA for merging? As this is not yet merged I am using the following as a work around:

if Span.has_extension('in_s2v') and Token.has_extension('in_s2v'):
  Span.remove_extension('in_s2v')
  Token.remove_extension('in_s2v')
  Span.remove_extension('s2v_freq')
  Token.remove_extension('s2v_freq')
  Span.remove_extension('s2v_vec')
  Token.remove_extension('s2v_vec')
  Span.remove_extension('s2v_most_similar')
  Token.remove_extension('s2v_most_similar')

@ines
Copy link
Member

ines commented Oct 29, 2019

I'll close this, since #77 will deal with part of the underlying problem, which is that extension attributes would always refer to the same sense2vec object, even if you're using multiple instances and different vectors.

Aside from that, setting force=True by default isn't usually a good idea. Extensions should only be set once and if they're silently overwritten, this can lead to problems later on.

@ines ines closed this Oct 29, 2019
@petulla
Copy link

petulla commented Jul 9, 2020

Still having this issue. Are any updates planned on the roadmap?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants