Adding all bible book names as abbreviations #12
…y in bible citations.
Thank you, Jake! Can I kindly ask you to restructure the added abbreviations by weaving them into the existing list of abbreviations, instead of changing their order completely? Also, I can already spot several tokens in this change that are not strictly abbreviations, such as Matt, Song, or Psalm. If a sentence were to end with one of those words, that sentence would not be split. Abbreviations really should only be tokens that would hardly ever be used as proper words. Otherwise, setting those abbreviations would cause missed sentence splits in many out-of-domain texts (i.e., non-bible texts, in the case of this suggested change).
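To illustrate the trade-off described above, here is a minimal sketch of a toy splitter. It is not syntok's implementation: the function `split_sentences` and both abbreviation sets are hypothetical and exist only to show how listing a regular word such as "Song" as an abbreviation suppresses a legitimate sentence break.

```python
import re

def split_sentences(text, abbreviations):
    """Toy splitter (an illustration, not syntok's algorithm).

    Breaks after '.', '!' or '?' when followed by whitespace and an
    uppercase letter, unless the word before a '.' is a listed abbreviation.
    """
    sentences, start = [], 0
    for match in re.finditer(r"(\w+)([.!?])\s+(?=[A-Z])", text):
        word, mark = match.group(1), match.group(2)
        if mark == "." and word in abbreviations:
            continue  # treated as an abbreviation, so no split here
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = "We sang an old Song. Then we went home."

# Hypothetical abbreviation lists: one strict, one that also lists "Song".
strict = split_sentences(text, {"Gen", "Deut"})
loose = split_sentences(text, {"Gen", "Deut", "Song"})

print(strict)  # two sentences
print(loose)   # one unsplit sentence
```

With the strict list the text is split after "Song."; once "Song" is listed as an abbreviation, the same text comes back as a single unsplit sentence.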
@fnl Thank you for the suggestion, I just applied it. Yeah, it's an interesting question which tokens to include in this list and which to leave out.
But it gets split into 3 sentences:
Hey @fnl, just wanted to check in and see if you'd merge this in?
Hi Jake, sorry, this had dropped off my radar.
Interesting suggestion, what would the default value of
I guess it depends on your definition of "domain-specific" :) In our case, we see a pretty wide corpus of English text, and the Bible names appeared to be a very common thing in the language.
Although this PR is old, I still think the right way to fix this issue is to implement #16 instead of adding more special-case tokens. Sadly, life and work haven’t allowed me to work on that enhancement yet.
Hi Jake, sorry for the long silence here. It took some time for me to address this, but with the latest version bump to 1.4.1, syntok now correctly handles citations in quotes at the beginning of sentences, too ("Bible style"). See b74e65e for details on how I enabled this; I even stole your test case. Note that there is a difference from your solution, hopefully for the better. The bible citations are now segmented as wholly separate sentences, as they semantically do not belong to the following sentence. They would belong to the preceding sentence if anything, but that is a bit too annoying to solve, so simply treating them as stand-alone sentences made the most sense.
Thank you, that makes a lot of sense!
We've seen a lot of bible citations in our data, and wanted to add a comprehensive list of bible book names as abbreviations. This makes syntok do a better job splitting bible citations of the form