Normalize German variants #87

Open
michelole opened this issue Aug 28, 2019 · 3 comments
Labels
P2 High priority issues, a COULD

Comments

@michelole
Member

michelole commented Aug 28, 2019

Normalize spelling variants (e.g., c <-> k, z <-> c) before applying the filtering rules.
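
A minimal sketch of what such a normalization step could look like. The function name `normalize_variants` and the choice to map both "c" and "z" to "k" are assumptions made for illustration, not the project's actual implementation:

```python
# Hypothetical rule set: collapse the c/k/z spelling variants onto "k".
# The target letter is an assumption; any single canonical form would work
# as long as it is applied consistently to both sides of a comparison.
_VARIANT_TABLE = str.maketrans({"c": "k", "z": "k"})


def normalize_variants(text: str) -> str:
    """Lowercase the text and map the variant letters to one canonical form."""
    return text.lower().translate(_VARIANT_TABLE)


# Both spellings collapse to the same string, so a filtering rule that
# compares strings no longer needs to know about the variants.
print(normalize_variants("Carotis") == normalize_variants("Karotis"))
```

Note that a character-level mapping like this also rewrites digraphs such as "ch", which is harmless only if every compared string goes through the same function.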

@michelole michelole added the P1 Higher priority issues, a SHOULD label Aug 28, 2019
@michelole
Member Author

michelole commented Aug 29, 2019

Maybe also normalize everything to k during training to make the models denser?

@michelole michelole changed the title Revisit variants Normalize variants Nov 22, 2019
@michelole michelole changed the title Normalize variants Normalize German variants Nov 22, 2019
@michelole
Member Author

If normalizing before training, cleaning routines should be applied to the GS (at runtime) to avoid false negatives.
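
To illustrate the false-negative concern, here is a rough sketch, assuming GS refers to the gold-standard annotations and that evaluation compares a predicted expansion against a GS entry. The helpers `clean` and `is_match` are hypothetical names, and the c -> k rule is only an example of a cleaning routine:

```python
def clean(text: str) -> str:
    """Hypothetical cleaning routine: lowercase and map "c" to "k"."""
    return text.lower().replace("c", "k")


def is_match(predicted: str, gold: str) -> bool:
    """Compare a prediction with a GS entry after cleaning both at runtime."""
    return clean(predicted) == clean(gold)


# A model trained on normalized text would emit the normalized form,
# while the GS may still contain the original spelling.
predicted, gold = "karotis", "Carotis"

print(predicted == gold)          # raw comparison misses the match
print(is_match(predicted, gold))  # cleaned comparison finds it
```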

@michelole
Member Author

This may actually require annotating the new expansions, since some of them could be considered typos, e.g. "becannt"/"druccausgleich", "karotis"/"kava".

@michelole michelole removed the P1 Higher priority issues, a SHOULD label Dec 3, 2019
@michelole michelole mentioned this issue Dec 3, 2019
@michelole michelole added the P2 High priority issues, a COULD label Dec 3, 2019
michelole added a commit to michelole/acres that referenced this issue Aug 21, 2020
Spelling variants are better handled with a normalization step instead of an exponential increase of expansion candidates, which led to very slow processing and several bugs. This refs bst-mug#87 and closes bst-mug#98.

Also, `get_acro_def_pair_score`, which was originally intended for web-based inputs (i.e. text with acronym-definition pairs), is now removed.
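
The trade-off mentioned in the commit message can be sketched as follows. This is an illustration under stated assumptions, not the removed code: `expand_variants` and `normalize` are hypothetical helpers, and only the c <-> k substitution is modeled.

```python
from itertools import product


def expand_variants(word: str) -> set[str]:
    """Generate every c/k spelling of a word (grows as 2**n for n such letters)."""
    options = [("c", "k") if ch in "ck" else (ch,) for ch in word.lower()]
    return {"".join(choice) for choice in product(*options)}


def normalize(word: str) -> str:
    """Collapse all c/k spellings onto one canonical form instead."""
    return word.lower().replace("c", "k")


variants = expand_variants("druckausgleich")
print(len(variants))                       # three c/k positions, so 2**3 candidates
print(len({normalize(v) for v in variants}))  # every candidate shares one normal form
```

With expansion, each additional variant letter doubles the number of candidates to score, and some of them (such as "druccausgleich") look like typos; with normalization, one string per word is compared.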