The regular expressions in abbreviation_patterns that are intended to filter out sentences containing a single letter followed by a dot do not do that. The ones I have tried are "\\b\\p{Latin}\\.", "\\b[^\\W]\\.", and "\\b[bcčdďeéěfghjlmnňpqrřštťwxyýžBCČDĎEĚFGHJLMNŇPQRŘŠTŤWXYÝŽ]\\b". The last one does not even work for standalone enumerated letters (see the second sentence below).
Sentences that make it into the result but shouldn't include, for example:
G. H. Bondy mu jde v ústrety.
C D G D. špatně.
Francis J. Mulberry.
Svými začátky sem náleží i O. Fischer.
The whole rule file, renamed to txt for attaching here, is cs.toml.txt
Sample file with these and other examples: sample.txt
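As a quick sanity check outside Sentence Extractor, the patterns can be tried against the example sentences directly. The following is only a sketch in Python, and it rests on some assumptions: Python's built-in re module is not the regex engine Sentence Extractor uses, so the results are indicative rather than conclusive; the \p{Latin} variant is left out because re has no \p{...} classes; and is_rejected is a hypothetical helper, not part of the tool.

```python
import re

# Two of the patterns from the report, as Python raw strings
# (the TOML escaping "\\b" becomes a single backslash here).
PATTERNS = [
    r"\b[^\W]\.",
    r"\b[bcčdďeéěfghjlmnňpqrřštťwxyýžBCČDĎEĚFGHJLMNŇPQRŘŠTŤWXYÝŽ]\b",
]

# Example sentences from the report.
SENTENCES = [
    "G. H. Bondy mu jde v ústrety.",
    "C D G D. špatně.",
    "Francis J. Mulberry.",
    "Svými začátky sem náleží i O. Fischer.",
]


def is_rejected(sentence, patterns=PATTERNS):
    """Hypothetical helper: True if any pattern matches anywhere in the sentence."""
    return any(re.search(p, sentence) for p in patterns)


for s in SENTENCES:
    print(is_rejected(s), s)
```

If a pattern matches here but the sentence still shows up in the output, the cause is more likely in how the extraction is run than in the pattern itself.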
How are you running the extraction? In both the automated tests and a run of extract-file on your sample.txt, none of those sentences get accepted. Are you passing the -l cs argument correctly? Are you on the latest version of Sentence Extractor?
Sorry, it was very likely a case of appending to the same file. I was using
cargo run -- extract-file -l cs -d /mnt/d/shared/speech/language/all/ >> /mnt/d/shared/speech/language/sentences.txt
as per the README and did not notice the >> or that the file kept growing, so sentences from earlier runs stayed in the output. I would swear that some of the regular expressions were there from the start, but apparently not. I have just rerun the whole thing (180MB) and the problem no longer seems to be present.
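The effect of appending can be reproduced in a few lines. This is a minimal sketch in Python rather than shell, assuming that the file modes "a" and "w" stand in for the shell redirections >> and >; the helper names and the two sample runs are hypothetical, with the old run standing for output produced before the rule was added.

```python
import os
import tempfile


def write_results(path, sentences, mode):
    """Write one sentence per line; "a" appends like >>, "w" truncates like >."""
    with open(path, mode, encoding="utf-8") as f:
        for s in sentences:
            f.write(s + "\n")


def read_results(path):
    """Return the file contents as a list of lines."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


# Hypothetical runs: the first predates the abbreviation rule, the second does not.
old_run = ["Francis J. Mulberry.", "Bondy mu jde v ústrety."]
new_run = ["Bondy mu jde v ústrety."]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sentences.txt")
    write_results(path, old_run, "w")   # earlier run
    write_results(path, new_run, "a")   # rerun, appending to the same file
    appended = read_results(path)
    write_results(path, new_run, "w")   # rerun, overwriting the file
    truncated = read_results(path)

print("Francis J. Mulberry." in appended)
print("Francis J. Mulberry." in truncated)
```

With appending, the sentence rejected by the newer rule is still in the file because it was written by the earlier run; overwriting (or deleting the file first) leaves only the current results.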