The retokenizer tries to take the simple route and set all attributes on both tokens and lexemes, but you're right that this doesn't make sense for NORM, plus NORM is the only attribute that's stored in both. I think it would make sense to just skip NORM for the lexeme here.
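Until NORM is skipped for the lexeme, one workaround consistent with the behavior described in the issue is to perform the split without NORM in `attrs` and then set `norm_` on the resulting tokens, since that doesn't write through to the shared lexemes. A minimal sketch (spaCy v3 API; the blank English pipeline and the head assignment are illustrative assumptions):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("bfv")

with doc.retokenize() as retokenizer:
    # Split "bfv" into "bf" + "v" without passing NORM in attrs.
    # heads: "v" becomes its own head (the root) and "bf" attaches
    # to it — an illustrative choice for this single-token doc.
    retokenizer.split(doc[0], ["bf", "v"], heads=[(doc[0], 1), (doc[0], 1)])

# Setting norm_ on the tokens directly affects only these tokens,
# not the vocab-wide lexemes for "bf" and "v".
doc[0].norm_ = "butterfly"
doc[1].norm_ = "valve"
```

After this, other occurrences of `bf` and `v` elsewhere keep their default norms, which is the context-dependent behavior the issue asks for.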
Feature description
Retokenizer.split takes in attrs to apply to the split tokens. These attributes are applied either to the resulting token or to the underlying lexeme, depending on whether they're considered context-dependent or context-independent. This is inconvenient for an attribute like NORM, where I might want to set a context-dependent value.

For example, I have some code that splits the token ["bfv"] into ["bf", "v"]. As part of the split, I declared custom norms for bf and v. However, this sets the norms on the underlying lexemes, so all other bf and v tokens now have butterfly and valve as their norms. (Norms set on a token's norm_ attribute don't modify the underlying lexeme.)

I would like an option on Retokenizer.split to not apply attributes to the underlying lexeme, so that I can easily set NORMs without affecting the lexemes globally.

Could the feature be a custom component or spaCy plugin?
No.