feat: enable grouping broken paragraphs in partition_text #456

Merged
MthwRobinson merged 9 commits into main from feat/group-paragraphs
Apr 6, 2023

Conversation

@MthwRobinson
Contributor

Summary

Closes #406. Adds a group_broken_paragraphs cleaning brick that groups together sentences split by a page break inserted for formatting purposes, which is common in .txt files. Also adds a paragraph_grouper kwarg to partition_text so that a paragraph-grouping callable can be passed in.

Testing

The following code should produce two NarrativeText elements. Without the paragraph grouper, partition_text produces four elements due to the line breaks.

from unstructured.partition.text import partition_text
from unstructured.cleaners.core import group_broken_paragraphs


text = """The big brown fox
was walking down the lane.

At the end of the lane, the
fox met a bear."""

partition_text(text=text, paragraph_grouper=group_broken_paragraphs)
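
For readers who want a feel for what a paragraph grouper does, here is a minimal, self-contained sketch. It is not the library's implementation: the function name group_paragraphs_sketch and both regex patterns are hypothetical, and the real group_broken_paragraphs may handle more cases. The sketch assumes a blank line marks a paragraph boundary and a single line break inside a paragraph is only formatting.

```python
import re

# Hypothetical patterns for illustration only; the library's patterns may differ.
PARAGRAPH_BOUNDARY = re.compile(r"\s*\n\s*\n\s*")  # one or more blank lines
INNER_LINE_BREAK = re.compile(r"\s*\n\s*")  # a line break inside a paragraph


def group_paragraphs_sketch(text: str) -> str:
    """Join lines within each paragraph, keeping blank lines as separators."""
    paragraphs = PARAGRAPH_BOUNDARY.split(text.strip())
    # Replace each inner line break with a single space.
    joined = [INNER_LINE_BREAK.sub(" ", p) for p in paragraphs if p.strip()]
    return "\n\n".join(joined)


text = """The big brown fox
was walking down the lane.

At the end of the lane, the
fox met a bear."""

grouped = group_paragraphs_sketch(text)
print(grouped.split("\n\n"))  # a list with two paragraph strings
```

Any callable with this shape (text in, regrouped text out) is the kind of thing the paragraph_grouper kwarg is described as accepting.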

@MthwRobinson MthwRobinson requested a review from qued April 6, 2023 16:04
@cragwolfe
Contributor

cragwolfe commented Apr 6, 2023

Couple of comments:

I'll try this out on some real docs that tend to have blank lines between paragraphs and see whether reasonable elements come out. It's not obvious to me that they would. E.g., https://www.ietf.org/rfc/rfc1918.txt and https://www.ietf.org/rfc/rfc1918.txt.

Also, it'd be nice if partition_text() could automatically do the right thing in most cases, while still allowing the user to choose a certain splitting pattern if desired. I.e., paragraphs in the examples above are NarrativeText items, and in the case of a phone book (a bunch of single lines), each line is its own element (not worrying about the type of element too much). As an implementation note, it seems like it would be more straightforward to choose between two regex splitters when partitioning, adding a "blank line" one after https://github.com/Unstructured-IO/unstructured/blob/3467a27/unstructured/nlp/patterns.py#L71, rather than relying on a cleaning brick to do so after the fact (though the cleaning brick is a nice-to-have regardless).
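
A rough sketch of the "choose between two regex splitters" idea, purely for illustration. The pattern names, the function split_text_sketch, and the selection heuristic (use the blank-line splitter whenever the text contains a blank line) are all assumptions, not what patterns.py or partition_text actually does.

```python
import re
from typing import List

# Hypothetical splitters; not the patterns defined in unstructured/nlp/patterns.py.
BLANK_LINE_SPLIT = re.compile(r"\n\s*\n")  # paragraphs separated by blank lines
SINGLE_LINE_SPLIT = re.compile(r"\n")  # every line is its own chunk


def split_text_sketch(text: str) -> List[str]:
    """Pick a splitter based on whether the text contains blank lines."""
    stripped = text.strip()
    # Assumed heuristic: blank lines present means they delimit paragraphs.
    if BLANK_LINE_SPLIT.search(stripped):
        splitter = BLANK_LINE_SPLIT
    else:
        splitter = SINGLE_LINE_SPLIT
    return [chunk.strip() for chunk in splitter.split(stripped) if chunk.strip()]


phone_book = "Alice 555-0100\nBob 555-0101\nCarol 555-0102"
print(len(split_text_sketch(phone_book)))  # one chunk per line
```

With this heuristic a phone-book style file yields one chunk per line, while a file with blank lines between paragraphs yields one chunk per paragraph, with any inner line breaks left in place.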

EDIT: now that I've read the description, this PR is mainly addressing a separate issue around page breaks only, so you can disregard the above; it's a separate TODO.

Contributor

@qued qued left a comment

LGTM once the formatting issues are resolved. See also @cragwolfe's comments, though.

@MthwRobinson
Contributor Author

Now updated to use regex splitters, per Crag's comment.

@MthwRobinson MthwRobinson enabled auto-merge (squash) April 6, 2023 18:17
@MthwRobinson MthwRobinson merged commit c99c099 into main Apr 6, 2023
@MthwRobinson MthwRobinson deleted the feat/group-paragraphs branch April 6, 2023 18:35
Development

Successfully merging this pull request may close these issues.

Group together blocks of text separated by line breaks in partition_text
