Adding all bible book names as abbreviations #12

Closed
wants to merge 3 commits into from

Conversation

@jakepoz jakepoz commented Jul 27, 2020

We've seen a lot of Bible citations in our data and wanted to add a comprehensive list of Bible book names as abbreviations. This helps syntok do a better job of splitting text that contains Bible citations of the following form:

This is not a real quote? (Phil. 4:8) No, it's not.

@fnl
Owner

fnl commented Aug 3, 2020

Thank you, Jake!

Can I kindly ask you to restructure the added abbreviations by weaving them into the existing list of abbreviations, instead of changing their order completely?
The current integration makes it impossible for me to spot which new abbreviations this change is in fact trying to add.

Also, I can already spot several tokens in this change that are not strictly abbreviations, such as Matt, Song, or Psalm. If a sentence were to end with one of those words, that sentence would not be split. Abbreviations really should only be tokens that would hardly ever be used as proper words. Otherwise, adding them would cause missed sentence splits (false negatives) in many out-of-domain texts (i.e., non-Bible texts, in the case of this change).

@jakepoz
Author

jakepoz commented Aug 4, 2020

@fnl Thank you for the suggestion, just applied it.

Yeah, it's an interesting question which things to include in this list and which to leave out.
We see a pretty broad range of content come through our system, and I did notice that Bible citations of this sort were often split incorrectly. They seem to always have the following form:

... is a great joy, a prized possession. (Isa. 33:6) Text continues...

But it gets split into 3 sentences:

is a great joy, a prized possession.
(Isa.
33:6) Text continues

@jakepoz
Author

jakepoz commented Feb 19, 2021

Hey @fnl, just wanted to check in and see whether you'd merge this.

@fnl
Owner

fnl commented Feb 22, 2021

Hi Jake, sorry, this had dropped off my radar.
From the unit test and your comments above, I see that the cases that matter to you all follow a very regular structure, namely /\([A-Z][a-z]*\. [0-9]+:[0-9]+\)/
In the predecessor of syntok (that is, in segtok), I supported avoiding sentence segmentation inside parentheses.
That is also important to avoid over-splitting in scientific citations, for example (F. Leitner, 2021).
As preventing segmentation inside parentheses with short token sequences would be a more generic solution than adding domain-specific abbreviations, my question for you is:
Assuming syntok would not split within parentheses containing fewer than n (user-configurable) tokens, would that resolve your segmentation issues, too?
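
To make the proposed rule concrete, here is a minimal sketch of "do not split inside short parentheticals". It is not syntok's or segtok's implementation: `split_sentences`, `max_paren_tokens`, the default of 5, and the naive "break after `.`, `!`, or `?` plus whitespace" boundary rule are all assumptions made up for illustration.

```python
import re
from typing import List

# Hypothetical, naive boundary rule: whitespace that follows ., ! or ?
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
# Non-nested parenthesized spans
_PAREN = re.compile(r"\([^()]*\)")


def split_sentences(text: str, max_paren_tokens: int = 5) -> List[str]:
    """Split at naive boundaries, but never inside a parenthetical
    that holds fewer than max_paren_tokens whitespace-separated tokens."""
    # Character ranges of the short parentheticals that must stay intact.
    protected = [
        (m.start(), m.end())
        for m in _PAREN.finditer(text)
        if len(m.group()[1:-1].split()) < max_paren_tokens
    ]
    sentences, start = [], 0
    for m in _BOUNDARY.finditer(text):
        if any(lo < m.start() < hi for lo, hi in protected):
            continue  # candidate boundary lies inside a protected span
        sentences.append(text[start:m.start()])
        start = m.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences


print(split_sentences("This is not a real quote? (Phil. 4:8) No, it's not."))
print(split_sentences("See earlier work (F. Leitner, 2021) for details."))
```

With the assumed default, `(Phil. 4:8)` and `(F. Leitner, 2021)` each count as fewer than five tokens, so the period inside them is not treated as a sentence boundary. Setting `max_paren_tokens=2` in this sketch brings the original over-splitting back.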

@jakepoz
Author

jakepoz commented Feb 23, 2021

Interesting suggestion, what would the default value of n be?

I guess it depends on your definition of "domain-specific" :) In our case, we see a pretty wide corpus of English text, and the Bible names appeared to be a very common thing in the language.

@fnl
Owner

fnl commented Dec 17, 2021

Although this PR is old, I still think the right way to fix this issue is to implement #16 instead of adding more special-case tokens. Sadly, life and work haven’t allowed me to work on that enhancement yet.

@fnl
Owner

fnl commented Jan 26, 2022

Hi Jake, sorry for the long silence here. It took some time for me to address this, but with the latest version bump to 1.4.1, syntok now correctly handles citations in quotes at the beginning of sentences, too ("Bible style"). See b74e65e for details on how I enabled this; I even stole your test case.

Note that there is a difference from your solution, hopefully for the better. The Bible citations are now segmented as wholly separate sentences, as they semantically do not belong to the following sentence. If anything, they would belong to the preceding sentence, but that is a bit too annoying to solve, so simply treating them as stand-alone sentences made the most sense.
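
As a rough illustration of the stand-alone behavior described above, the sketch below isolates each citation as its own segment using the pattern quoted earlier in this thread. It is not the code from b74e65e and does not use syntok; `isolate_citations` is a hypothetical helper, and it only separates citations from the surrounding text without doing any other sentence splitting.

```python
import re

# Pattern quoted earlier in this thread for "(Book. chapter:verse)" citations.
CITATION = re.compile(r"\([A-Z][a-z]*\. [0-9]+:[0-9]+\)")


def isolate_citations(text):
    """Return the text in chunks, with every matching citation as its own chunk."""
    # Wrapping the pattern in a capturing group makes re.split keep the matches.
    parts = re.split("(" + CITATION.pattern + ")", text)
    return [part.strip() for part in parts if part.strip()]


example = "... is a great joy, a prized possession. (Isa. 33:6) Text continues..."
for chunk in isolate_citations(example):
    print(chunk)
```

For the example from this thread, the sketch yields three chunks: the preceding text, `(Isa. 33:6)` on its own, and the following text.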

@fnl fnl closed this Jan 26, 2022
@jakepoz
Author

jakepoz commented Jan 27, 2022

Thank you, that makes a lot of sense!
