Improve sentence count #21

Open
paulsedille opened this issue Apr 2, 2021 · 3 comments

Comments

@paulsedille
Collaborator

Currently, the code counts BIS (broader impact statement) sentences by simply counting the number of final punctuation markers (. ! ?). This is not perfect because strings like "e.g." or "1.5 gallons" incorrectly add to the sentence count.

Ideally, the script would take these exceptional cases into account and reflect this in the final count.

There is an easy fix for the two most common occurrences, "e.g." and "i.e.": subtract 2 from the sentence count for every separate occurrence of either substring in the BIS text. More complex solutions might be (1) to automatically dismiss any sentence that is shorter than X characters (around 3-10 seems appropriate) and/or (2) to only count ".", "!", or "?" if they are followed by a blank space (that is, count ". ", "! " and "? "). This would help exclude rarer false positives, for example tables, lists or numerical values that include full stops (such as "934.2" or "1. Computation Cost, 2. Training Data").
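A minimal sketch of the easy fix, for illustration only. The function names `naive_count` and `corrected_count` and the `ABBREVIATIONS` tuple are hypothetical and not taken from the project's code; the sketch assumes the current count is equivalent to counting every ".", "!" and "?" in the text.

```python
import re

# Abbreviations named in this issue; the tuple itself is a hypothetical helper.
ABBREVIATIONS = ("e.g.", "i.e.")


def naive_count(text: str) -> int:
    """Count every '.', '!' or '?' as a sentence end."""
    return len(re.findall(r"[.!?]", text))


def corrected_count(text: str) -> int:
    """Subtract 2 per abbreviation occurrence, since each contains two full stops."""
    false_hits = sum(2 * text.count(abbr) for abbr in ABBREVIATIONS)
    return naive_count(text) - false_hits


sample = "Models can fail, e.g. on rare inputs. This matters!"
print(naive_count(sample))      # 4
print(corrected_count(sample))  # 2
```

Note that `str.count` is case-sensitive, so a capitalised "E.g." would not be corrected, and decimals such as "1.5 gallons" are still counted as a sentence end by this version.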

@paulsedille
Collaborator Author

I've realised that only counting ".", "!", or "?" when they are followed by a blank space might skip sentences at the end of paragraphs (depending on how that is coded in the XML?).
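One possible way around the end-of-paragraph problem, sketched under the assumption that each BIS is available as a plain string: count a terminator if it is followed by any whitespace or by the end of the text. The names `SENTENCE_END` and `count_sentences` are hypothetical.

```python
import re

# A terminator counts only when followed by whitespace or the end of the text.
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def count_sentences(text: str) -> int:
    """Count terminators that are followed by whitespace or end the text."""
    return len(SENTENCE_END.findall(text))


print(count_sentences("It costs 934.2 units. Is that a lot?"))  # 2
```

This ignores the full stop in "934.2" and still counts the final "?" with nothing after it. It does not fully handle "e.g." (the second full stop is followed by a space, so it adds 1) or list markers such as "1. Computation Cost".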

@earlng
Owner

earlng commented Apr 3, 2021

Following a suggestion here, I think we can use the nltk package instead of re for the sentence count. (Documentation here.)

The issue is that for cases that include e.g. or i.e. it still double counts them. But it does take into consideration decimal points, so I consider it an improvement.

@earlng
Owner

earlng commented Apr 3, 2021

It's generally ok. But if an "e.g." or a particularly messy sentence is involved, it could be off by about 1-2 sentences.

earlng added a commit that referenced this issue Apr 3, 2021