[#4525] fix gold.align #4526
Below is a notebook on profiling the old and new
Really appreciate your help @tamuhey! Thanks so much. I've had a fair bit of trouble with the alignment code over time.
I merged a bit hastily here, and missed something. Will make a commit to master to address it.
PR #4526 missed extra lower-casing and spacing normalization.
This has fixed some of the subtok problems I was having (this is great!) but now I have cases where texts aren't being aligned like they were before. At least one problem is when the raw text starts with a whitespace character. A minimal example:
|
Ah, wait, I think I managed to test this at just the wrong point and the failing overall alignment was what Matt fixed above. Now it doesn't throw an error but I get incorrect results with the GoldParse for the same example:
The difference is
|
@adrianeboyd Thanks for catching! I'm trying to fix this bug in #4537.
* Switch from original `_align` to new simpler alignment algorithm from explosion#4526
* Remove alignment normalizations beyond whitespace and lowercasing
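For readers following along, here is a rough sketch of what aligning two tokenizations after whitespace and lowercase normalization can look like. It is not the code from this PR or from spaCy's `gold.align`. The names `normalize`, `token_spans` and `align_tokens` are made up for illustration, and the approach (matching tokens whose character spans coincide in the normalized text) is an assumption chosen for brevity.

```python
from typing import List, Tuple


def normalize(tokens: List[str]) -> List[str]:
    """Lower-case each token and drop any whitespace it contains."""
    return ["".join(tok.lower().split()) for tok in tokens]


def token_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Character span of each token within the concatenated token text."""
    spans = []
    pos = 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def align_tokens(
    tokens_a: List[str], tokens_b: List[str]
) -> Tuple[List[int], List[int]]:
    """Return (a2b, b2a) one-to-one alignments, with -1 where there is none.

    Hypothetical helper: two tokens align only if they cover exactly the
    same non-empty character span of the normalized text.
    """
    norm_a, norm_b = normalize(tokens_a), normalize(tokens_b)
    if "".join(norm_a) != "".join(norm_b):
        raise ValueError("tokenizations differ after normalization")

    spans_a, spans_b = token_spans(norm_a), token_spans(norm_b)
    # Whitespace-only tokens normalize to an empty span and never align.
    span_to_b = {s: j for j, s in enumerate(spans_b) if s[0] != s[1]}

    a2b = [-1] * len(tokens_a)
    b2a = [-1] * len(tokens_b)
    for i, span in enumerate(spans_a):
        j = span_to_b.get(span)
        if j is not None:
            a2b[i] = j
            b2a[j] = i
    return a2b, b2a


# A leading whitespace token and a split contraction both end up unaligned.
a2b, b2a = align_tokens([" ", "I", "can't", "go"], ["i", "ca", "n't", "go"])
print(a2b)  # [-1, 0, -1, 3]
print(b2a)  # [1, -1, -1, 3]
```

In this sketch a leading whitespace token simply gets no alignment and does not shift the tokens after it, which is the kind of case raised earlier in the thread.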
from #4525
gold.align
code simpler and more efficient
Types of change
bug fix and refactor.
Checklist