Adding all bible book names as abbreviations #12

jakepoz · 2020-07-27T21:55:59Z

We've seen a lot of bible citations in our data, and wanted to add a comprehensive list of bible book names as abbreviations. This makes syntok do a better job splitting bible citations of the form:

This is not a real quote? (Phil. 4:8) No, it's not.

…y in bible citations.

fnl · 2020-08-03T21:33:02Z

Thank you, Jake!

Can I kindly ask you to restructure the added abbreviations by weaving them into the existing list of abbreviations, instead of changing their order completely?
The current integration is making it impossible for me to spot which new abbreviations this change is in fact trying to add.

Also, I can already spot several tokens in this change that are not strictly abbreviations, such as Matt, Song, or Psalm. If a sentence were to end with one of those words, that sentence would not be split. Abbreviations really should only be tokens that would hardly ever be used as proper words. Otherwise, setting those abbreviations would cause missed splits in many out-of-domain texts (i.e., non-bible texts, in the case of this suggested change).
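
To make the concern concrete, here is a minimal sketch of a toy splitter. It is not syntok's implementation; `naive_split` and its `abbreviations` parameter are hypothetical names used only for illustration. It shows how listing a real word such as Song as an abbreviation suppresses a legitimate sentence boundary.

```python
import re


def naive_split(text, abbreviations):
    """Toy splitter (an illustration, not syntok): split after '.', '?' or '!'
    followed by whitespace, unless the word before a '.' is listed as an
    abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.?!]\s+", text):
        words = text[start:match.start()].split()
        last_word = words[-1] if words else ""
        if match.group()[0] == "." and last_word in abbreviations:
            continue  # the period is assumed to belong to the abbreviation
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences


text = "We sang a Song. Then we went home."
print(naive_split(text, abbreviations=set()))
print(naive_split(text, abbreviations={"Song"}))
```

With an empty abbreviation set the text is split into two sentences; once Song is listed, the boundary after it is never considered and the whole text comes back as a single sentence.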

jakepoz · 2020-08-04T00:05:34Z

@fnl Thank you for the suggestion, just applied it.

Yeah, it's an interesting case on which things to include in this list, and which to not include.
We see a pretty broad range of content come through our system, and I did notice that bible citations of this sort were often split incorrectly. They seem to always have the following form:

... is a great joy, a prized possession. (Isa. 33:6) Text continues...

But it gets split into 3 sentences:

is a great joy, a prized possession.
(Isa.
33:6) Text continues

jakepoz · 2021-02-19T22:56:10Z

Hey @fnl Just wanted to check in and see if you'd merge this in?

fnl · 2021-02-22T22:28:57Z

Hi Jake, sorry, this had dropped off my radar.
From the unit test and your comments above, I see that the cases that matter to you all follow a very regular structure, namely /\([A-Z][a-z]*\. [0-9]+:[0-9]+\)/.
In the former version of syntok (that is, in segtok), I supported avoiding sentence segmentation inside parentheses.
That is important to avoid over-splitting in scientific citations, too, for example (F. Leitner, 2021).
As preventing segmentation inside parentheses with short token sequences would be a more generic solution than adding domain-specific abbreviations, my question for you is:
Assuming syntok would not split within parentheses with fewer than n (user-configurable) tokens inside, would that resolve your segmentation issues, too?
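
A rough sketch of that proposal, using only the standard library: it is not how syntok or segtok implement it. The function names and the `max_tokens` parameter are hypothetical, and the default of 5 is an arbitrary placeholder, since no default for n has been decided here.

```python
import re

# The citation pattern quoted above, e.g. "(Isa. 33:6)".
CITATION = re.compile(r"\([A-Z][a-z]*\. [0-9]+:[0-9]+\)")


def protected_spans(text, max_tokens):
    """Spans of parenthesized groups holding fewer than max_tokens
    whitespace-separated tokens."""
    spans = []
    for match in re.finditer(r"\([^()]*\)", text):
        inner = match.group()[1:-1]
        if len(inner.split()) < max_tokens:
            spans.append(match.span())
    return spans


def split_outside_parentheses(text, max_tokens=5):
    """Toy splitter: break after '.', '?' or '!' plus whitespace, but never
    inside a protected (short) parenthesized group."""
    spans = protected_spans(text, max_tokens)
    sentences, start = [], 0
    for match in re.finditer(r"[.?!]\s+", text):
        if any(begin <= match.start() < end for begin, end in spans):
            continue  # terminal lies inside short parentheses: do not split
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences


bible = "It is a prized possession. (Isa. 33:6) Text continues here."
science = "That matters for citations (F. Leitner, 2021), too."
print(split_outside_parentheses(bible))
print(split_outside_parentheses(bible, max_tokens=0))  # protection disabled
print(split_outside_parentheses(science))
```

In this sketch the citation stays attached to the text that follows it, because a closing parenthesis is not treated as a sentence terminal. Setting `max_tokens=0` disables the protection and reproduces the three-way split described earlier in the thread.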

jakepoz · 2021-02-23T00:20:12Z

Interesting suggestion. What would the default value of n be?

I guess it depends on your definition of "domain-specific" :) In our case, we see a pretty wide corpus of English text, and the Bible names appeared to be a very common thing in the language.

fnl · 2021-12-17T12:19:03Z

While this PR is old, I still think the right way to fix this issue is to implement #16 instead of adding more special-case tokens. Sadly, life and work haven't allowed me to work on that enhancement yet.

fnl · 2022-01-26T23:25:58Z

Hi Jake, sorry for the long silence here. It took some time for me to address this, but with the latest version bump to 1.4.1, syntok now correctly handles citations in quotes at the beginning of sentences, too ("Bible style"). See b74e65e for details on how I enabled this; I even stole your test case.

Note that there is a difference to your solution, hopefully for the better. The bible citations are segmented as wholly separate sentences now, as they semantically do not belong to the following sentence. They would belong to the preceding sentence if anything, but that is a bit too annoying to solve, so simply treating them as stand-alone sentences made the most sense.
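
The intended shape of the result can be illustrated with a small standalone sketch. This is not the code from b74e65e and does not call syntok; `isolate_citations` is a hypothetical helper that only isolates citations matching the regular pattern discussed earlier in this thread.

```python
import re

# Capturing group so that re.split keeps the citation itself in the output.
CITATION = re.compile(r"(\([A-Z][a-z]*\. [0-9]+:[0-9]+\))")


def isolate_citations(text):
    """Return the text pieces with every matching citation as its own item."""
    parts = [part.strip() for part in CITATION.split(text)]
    return [part for part in parts if part]


example = "This is not a real quote? (Phil. 4:8) No, it's not."
for sentence in isolate_citations(example):
    print(sentence)
```

For the example from the PR description this yields three items, with the citation standing alone between the preceding and the following sentence.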

jakepoz · 2022-01-27T00:09:22Z

Thank you, that makes a lot of sense!

Adding all bible book names as abbreviations, as they show up commonl…

725f62d

…y in bible citations.

jakepoz added 2 commits August 3, 2020 16:52

Interleaving the new abbreviation for cleaner PR

87fbc26

Cleaning up the list a little bit more

3cd50d8

fnl mentioned this pull request Dec 16, 2021

Do not segment inside parenthesis #16

Closed

fnl closed this Jan 26, 2022
