feat: enable grouping broken paragraphs in partition_text #456
MthwRobinson merged 9 commits into main
Couple of comments: I'll try this out on some real docs that tend to have blank lines between paragraphs and see whether reasonable elements come out. It's not obvious to me that they would. E.g., https://www.ietf.org/rfc/rfc1918.txt. Also, it'd be nice if

EDIT: now that I've read the description, this PR is mainly addressing a separate issue around page breaks only, so you can disregard the above as a separate TODO.
qued left a comment:
LGTM once the formatting issues are resolved. See also @cragwolfe's comments though.

Now updated to use regex splitters, per Crag's comment.
Summary

Closes #406. Adds a `group_broken_paragraphs` cleaning brick for grouping together sentences that have a page break for formatting purposes. This is common in `.txt` files. Also adds a `paragraph_grouper` kwarg to `partition_text` to allow passing in a paragraph grouping callable.

Testing

The following code should produce two `NarrativeText` elements. Without the paragraph grouper, `partition_text` produces four elements due to the line break.
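
The grouping idea can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation: `group_broken_paragraphs_sketch`, `split_into_blocks`, and `SAMPLE_TEXT` are hypothetical names, and the sketch assumes that a blank line marks a real paragraph boundary while a single line break is formatting only. `split_into_blocks` is a stand-in for the splitting step and returns plain strings rather than `NarrativeText` elements.

```python
import re

# Assumption: one or more blank lines separate real paragraphs.
PARAGRAPH_BOUNDARY = re.compile(r"\n\s*\n")
# Assumption: any remaining line break is only there for formatting.
LINE_BREAK = re.compile(r"\s*\n\s*")

# Illustrative input: two paragraphs, each broken across two lines.
SAMPLE_TEXT = """The big brown fox
was walking down the lane.

At the end of the lane, the
fox met a bear."""


def group_broken_paragraphs_sketch(text: str) -> str:
    """Join lines within each paragraph, keeping blank-line boundaries."""
    paragraphs = PARAGRAPH_BOUNDARY.split(text.strip())
    joined = [LINE_BREAK.sub(" ", p.strip()) for p in paragraphs if p.strip()]
    return "\n\n".join(joined)


def split_into_blocks(text: str, paragraph_grouper=None) -> list:
    """Hypothetical stand-in for the splitting step: one block per line break."""
    if paragraph_grouper is not None:
        text = paragraph_grouper(text)
    return [block for block in LINE_BREAK.split(text) if block.strip()]


ungrouped = split_into_blocks(SAMPLE_TEXT)
grouped = split_into_blocks(
    SAMPLE_TEXT, paragraph_grouper=group_broken_paragraphs_sketch
)
print(len(ungrouped), len(grouped))
```

Passing the grouper as a callable mirrors the `paragraph_grouper` kwarg described above: the splitting step stays unchanged, and callers opt in to the grouping behavior.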