
Option to not apply norm form to lexeme during retokenization #6443

Closed
kinghuang opened this issue Nov 24, 2020 · 2 comments · Fixed by #6464
Labels
enhancement (Feature requests and improvements) · feat / doc (Feature: Doc, Span and Token objects)

Comments


kinghuang commented Nov 24, 2020

Feature description

Retokenizer.split takes an attrs argument with attributes to apply to the split tokens. Each attribute is applied either to the resulting token or to the underlying lexeme, depending on whether it's considered context-dependent or context-independent. This is inconvenient for an attribute like NORM, where I might want to set a context-dependent value.

For example, I have some code that splits the token "bfv" into ["bf", "v"]. As part of the split, I declare custom norms for bf and v.

from spacy.lang.en import English

nlp = English()
nlp.vocab.strings.add("butterfly")
nlp.vocab.strings.add("valve")

doc = nlp("bfv")

with doc.retokenize() as retokenizer:
    token = doc[0]
    retokenizer.split(
        token,
        ["bf", "v"],
        heads=[(token, 0), (token, 1)],
        attrs={"NORM": ["butterfly", "valve"]},
    )

However, this sets the norm on the underlying lexeme. So, all other bf and v tokens now have butterfly and valve as their norms.

doc2 = nlp("bf")
print(doc2[0].norm_)
'butterfly'

Norms set on a token's norm_ attribute don't modify the underlying lexeme.

doc3 = nlp("a")
doc3[0].norm_ = "alphabet"
print(doc3[0].norm_)
'alphabet'
doc4 = nlp("a")
print(doc4[0].norm_)
'a'

I would like an option on Retokenizer.split to not apply attributes to the underlying lexeme, so that I can easily set NORMs without affecting it globally.
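In the meantime, a workaround consistent with the token behaviour above is to leave NORM out of attrs and set norm_ on the tokens after the split, which doesn't write through to the lexemes. A minimal sketch, assuming a blank English pipeline:

```python
from spacy.lang.en import English

# Sketch of a workaround: split without passing NORM in attrs, then set
# norm_ on the resulting tokens. Token-level norms are context-dependent
# and do not modify the vocab lexemes.
nlp = English()
doc = nlp("bfv")

with doc.retokenize() as retokenizer:
    token = doc[0]
    retokenizer.split(token, ["bf", "v"], heads=[(token, 0), (token, 1)])

doc[0].norm_ = "butterfly"
doc[1].norm_ = "valve"

print([t.norm_ for t in doc])  # token-level norms: ['butterfly', 'valve']
print(nlp("bf")[0].norm_)      # lexeme norm unaffected: 'bf'
```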

Could the feature be a custom component or spaCy plugin?

No.

@adrianeboyd adrianeboyd added enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects labels Nov 25, 2020

adrianeboyd commented Nov 25, 2020

The retokenizer tries to take the simple route and set all attributes on both tokens and lexemes, but you're right that this doesn't make sense for NORM, which is also the only attribute stored in both places. I think it would make sense to just skip NORM for the lexeme here.
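To make the dual storage concrete, a small sketch (assuming a blank English pipeline) of how the token-level and lexeme-level NORM values interact:

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("abc")

# NORM lives on both objects; the token falls back to the lexeme value.
assert doc[0].norm_ == nlp.vocab["abc"].norm_

# Writing through the token is context-dependent: the lexeme is untouched.
doc[0].norm_ = "alphabet"
assert nlp.vocab["abc"].norm_ == "abc"

# Writing through the lexeme is context-independent: every new doc sees it.
nlp.vocab["abc"].norm_ = "alphabet"
assert nlp("abc")[0].norm_ == "alphabet"
```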

@adrianeboyd adrianeboyd linked a pull request Nov 27, 2020 that will close this issue
@github-actions

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 27, 2021