Improve sentence count #21
Comments
I've realised that only counting ".", "!", or "?" when they are followed by a blank space might skip sentences at the end of paragraphs (depending on how paragraph breaks are coded in the XML?).
It's generally OK, but if an "e.g." or a particularly messy sentence is involved, the count could be off by about 1-2 sentences.
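One possible way around the end-of-paragraph problem is to accept a terminator when it is followed by any whitespace or by the end of the text. The snippet below is only a sketch: the names `SENTENCE_END` and `count_sentence_ends` are made up for illustration, and it assumes paragraphs reach the function either one at a time or separated by whitespace such as a newline.

```python
import re

# Hypothetical pattern: a terminator counts only if whitespace or the end
# of the text follows it, so "934.2" is ignored but a final "?" is kept.
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")

def count_sentence_ends(text: str) -> int:
    """Count terminators followed by whitespace or the end of the text."""
    return len(SENTENCE_END.findall(text.strip()))

print(count_sentence_ends("It costs 934.2 units. Is that ok?"))
```

Note that the last full stop of a mid-sentence "e.g. " is still followed by a space, so this pattern alone would still overcount by one in that case.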
Currently, the code counts BIS (broader impact statement) sentences by simply counting the number of final punctuation markers (. ! ?). This is not perfect because strings like "e.g." or "1.5 gallons" incorrectly add to the sentence count.
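To make the overcount concrete, here is a minimal sketch of that kind of counting. It is not the project's actual code; `naive_sentence_count` and the sample string are assumptions for illustration only.

```python
import re

def naive_sentence_count(text: str) -> int:
    """Treat every '.', '!' or '?' as the end of one sentence."""
    return len(re.findall(r"[.!?]", text))

# Two real sentences, but "1.5" adds one extra mark and "e.g." adds two.
sample = "We used 1.5 gallons of water, e.g. for cooling. It worked!"
print(naive_sentence_count(sample))
```

On this sample the function reports 5 even though there are only 2 sentences.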
Ideally, the script would take these exceptional cases into account and reflect this in the final count.
There is an easy fix for the two most common occurrences, "e.g." and "i.e.": subtract 2 from the sentence count for every separate occurrence of either substring in the BIS text, since each one contains two full stops. More complex solutions might be (1) to automatically dismiss any sentence that is shorter than X characters (around 3-10 seems appropriate) and/or (2) to count ".", "!", or "?" only if they are followed by a blank space (that is, count ". ", "! " and "? "). These would help exclude rarer false positives, for example tables, lists or numerical values that include full stops (such as "934.2" or "1. Computation Cost, 2. Training Data").
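A rough sketch of the easy fix and of option (1) could look like the following. All names here (`ABBREVIATIONS`, `adjusted_count`, `count_with_min_length`) and the default of 5 characters are assumptions for illustration, not part of the existing script.

```python
import re

# Hypothetical list of abbreviations, each containing two full stops.
ABBREVIATIONS = ("e.g.", "i.e.")

def adjusted_count(text: str) -> int:
    """Naive terminator count minus 2 for each 'e.g.' or 'i.e.'."""
    raw = len(re.findall(r"[.!?]", text))
    hits = sum(text.count(abbr) for abbr in ABBREVIATIONS)
    return max(raw - 2 * hits, 0)

def count_with_min_length(text: str, min_chars: int = 5) -> int:
    """Split after a terminator plus whitespace, then drop short pieces."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return sum(1 for piece in pieces if len(piece) >= min_chars)

print(adjusted_count("We tested two models, e.g. A and B. Both worked, i.e. no errors!"))
print(count_with_min_length("Results follow. 1. Computation cost matters."))
```

Both heuristics have blind spots: the substring match is case-sensitive, so "E.g." is missed; a sentence that ends in "e.g." is undercounted; and a minimum length also drops genuine short sentences such as "Yes.".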