Unable to filter out single letter and dot using abbreviation_patterns #174

Closed
comodoro opened this issue Mar 19, 2022 · 3 comments
Labels: bug, discussion, invalid, rules
@comodoro
Contributor

comodoro commented Mar 19, 2022

The regular expressions I added to abbreviation_patterns to filter out sentences containing a single letter followed by a dot do not filter them out. The ones I have tried are "\\b\\p{Latin}\\.", "\\b[^\\W]\\.", and "\\b[bcčdďeéěfghjlmnňpqrřštťwxyýžBCČDĎEĚFGHJLMNŇPQRŘŠTŤWXYÝŽ]\\b". The last one does not even work for standalone enumerated letters (see the second sentence below).

Examples of sentences that make it into the result but should not:
G. H. Bondy mu jde v ústrety.
C D G D. špatně.
Francis J. Mulberry.
Svými začátky sem náleží i O. Fischer.

The whole rule file, renamed to txt for attaching here, is
cs.toml.txt

Sample file with these and other examples:
sample.txt
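
For anyone who wants to sanity-check a pattern like these outside the extractor, here is a minimal sketch in Python. It is only an approximation: it uses Python's `re` module, not whatever regex engine the extractor itself uses, so it covers only the `\b[^\W]\.` pattern (Python's `re` has no `\p{Latin}`). The helper name `is_rejected` and the extra control sentence are made up for illustration.

```python
import re

# One of the patterns from the report, written without the TOML string escaping.
# Assumption: Python's `re` stands in for the extractor's own regex engine here.
SINGLE_LETTER_DOT = re.compile(r"\b[^\W]\.")


def is_rejected(sentence: str) -> bool:
    """Return True if the sentence contains a one-character word directly followed by a dot."""
    return SINGLE_LETTER_DOT.search(sentence) is not None


# The four sentences quoted in the report.
reported = [
    "G. H. Bondy mu jde v ústrety.",
    "C D G D. špatně.",
    "Francis J. Mulberry.",
    "Svými začátky sem náleží i O. Fischer.",
]

# Hypothetical control sentence that contains no single-letter abbreviation.
control = "Toto je obyčejná věta."

for sentence in reported + [control]:
    print(is_rejected(sentence), sentence)
```

Note that `[^\W]` is just another way of writing `\w`, so in this sketch the pattern also matches a single digit or underscore followed by a dot, not only letters.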

MichaelKohler added this to In Progress in Overview on Mar 19, 2022
MichaelKohler self-assigned this on Mar 19, 2022
@MichaelKohler
Member

How are you running the extraction? In both the automated tests and a run of extract-file on your sample.txt, none of those sentences get accepted. Are you passing the -l cs argument correctly? Are you on the latest version of Sentence Extractor?

@comodoro
Contributor Author

Sorry, it was very likely a case of appending to the same file. I was using

cargo run -- extract-file -l cs -d /mnt/d/shared/speech/language/all/ >> /mnt/d/shared/speech/language/sentences.txt

as per the README and did not notice the >> or that the file kept growing. I would swear that some of the regular expressions were there from the start, but apparently not. I have just rerun the whole thing (180 MB) and the problem no longer seems to be present.
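
To illustrate how appending can keep sentences from an earlier run in the output, here is a small sketch in Python rather than shell. It is a simulation only: `run_extraction` is a made-up stand-in for one extractor run, the sentence lists are invented, and the file modes `"a"` and `"w"` play the roles of `>>` and `>`.

```python
import tempfile
from pathlib import Path


def run_extraction(sentences, out_path, mode):
    """Hypothetical stand-in for one extractor run whose output is redirected to out_path."""
    with open(out_path, mode, encoding="utf-8") as out:
        for sentence in sentences:
            out.write(sentence + "\n")


def simulate(mode):
    """Run two simulated extractions into the same file and return its final lines."""
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "sentences.txt"
        # First run: rules without the abbreviation pattern let this sentence through.
        run_extraction(["Francis J. Mulberry.", "Toto je věta."], out_path, mode)
        # Second run: stricter rules, so the abbreviation sentence is no longer emitted.
        run_extraction(["Toto je věta."], out_path, mode)
        return out_path.read_text(encoding="utf-8").splitlines()


appended = simulate("a")     # behaves like `>>`: output of both runs accumulates
overwritten = simulate("w")  # behaves like `>`: only the last run survives

print(appended)
print(overwritten)
```

In the appended case the sentence from the first run is still in the file after the second run, which looks exactly like the new rule not being applied.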

@MichaelKohler
Member

Oh, I see. I might take a look at the README to see whether appending really makes sense in that case. Happy you figured it out :)

MichaelKohler added the invalid label and removed the needs debugging label on Mar 22, 2022