
Option to not apply norm form to lexeme during retokenization #6443

Closed
kinghuang opened this issue Nov 24, 2020 · 2 comments · Fixed by #6464
Labels
enhancement (Feature requests and improvements) · feat / doc (Feature: Doc, Span and Token objects)

Comments


kinghuang commented Nov 24, 2020

Feature description

Retokenizer.split takes an attrs argument with attributes to apply to the split tokens. Each attribute is applied either to the resulting token or to the underlying lexeme, depending on whether it's considered context-dependent or context-independent. This is inconvenient for an attribute like NORM, where I might want to set a context-dependent value.

For example, I have some code that splits the token "bfv" into ["bf", "v"]. As part of the split, I declare custom norms for bf and v.

from spacy.lang.en import English

nlp = English()
nlp.vocab.strings.add("butterfly")
nlp.vocab.strings.add("valve")

doc = nlp("bfv")

with doc.retokenize() as retokenizer:
    token = doc[0]
    retokenizer.split(
        token,
        ["bf", "v"],
        heads=[(token, 0), (token, 1)],
        attrs={"NORM": ["butterfly", "valve"]},
    )

However, this sets the norm on the underlying lexeme. So, all other bf and v tokens now have butterfly and valve as their norms.

doc2 = nlp("bf")
print(doc2[0].norm_)
'butterfly'

Norms set on a token's norm_ attribute don't modify the underlying lexeme.

doc3 = nlp("a")
doc3[0].norm_ = "alphabet"
print(doc3[0].norm_)
'alphabet'
doc4 = nlp("a")
print(doc4[0].norm_)
'a'

I would like an option on Retokenizer.split to not apply attributes to the underlying lexeme, so that I can easily set NORMs without affecting it globally.
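In the meantime, a workaround consistent with the token behaviour above is to leave NORM out of attrs and set norm_ on the tokens after the split, which doesn't write through to the lexemes. A minimal sketch, assuming a blank English pipeline:

```python
from spacy.lang.en import English

# Sketch of a workaround: split without passing NORM in attrs, then set
# norm_ on the resulting tokens. Token-level norms are context-dependent
# and do not modify the vocab lexemes.
nlp = English()
doc = nlp("bfv")

with doc.retokenize() as retokenizer:
    token = doc[0]
    retokenizer.split(token, ["bf", "v"], heads=[(token, 0), (token, 1)])

doc[0].norm_ = "butterfly"
doc[1].norm_ = "valve"

print([t.norm_ for t in doc])  # token-level norms: ['butterfly', 'valve']
print(nlp("bf")[0].norm_)      # lexeme norm unaffected: 'bf'
```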

Could the feature be a custom component or spaCy plugin?

No.

@adrianeboyd adrianeboyd added enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects labels Nov 25, 2020

adrianeboyd commented Nov 25, 2020

The retokenizer tries to take the simple route and set all attributes on both tokens and lexemes, but you're right that this doesn't make sense for NORM, which is also the only attribute stored in both places. I think it would make sense to just skip NORM for the lexeme here.
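To make the dual storage concrete, a small sketch (assuming a blank English pipeline) of how the token-level and lexeme-level NORM values interact:

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("abc")

# NORM lives on both objects; the token falls back to the lexeme value.
assert doc[0].norm_ == nlp.vocab["abc"].norm_

# Writing through the token is context-dependent: the lexeme is untouched.
doc[0].norm_ = "alphabet"
assert nlp.vocab["abc"].norm_ == "abc"

# Writing through the lexeme is context-independent: every new doc sees it.
nlp.vocab["abc"].norm_ = "alphabet"
assert nlp("abc")[0].norm_ == "alphabet"
```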

@adrianeboyd adrianeboyd linked a pull request Nov 27, 2020 that will close this issue
@github-actions

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 27, 2021