
feat: chunk elements based on titles#1222

Merged
MthwRobinson merged 17 commits into main from feat/title-chunking on Aug 29, 2023

@MthwRobinson
Contributor

Summary

An initial pass on smart chunking for RAG applications. Breaks a document into sections based on the presence of Title elements. Also starts a new section under the following conditions:

  • If metadata changes, indicating a change in section or page or a switch to processing attachments. If multipage_sections=True, sections can span pages. multipage_sections defaults to True.
  • If the length of the section exceeds new_after_n_chars characters. The default is 1500. The chunking function does not split individual elements, so a section can still exceed that threshold if a single element is over new_after_n_chars characters, which could occur with a long NarrativeText element.
  • Conversely, sections under combine_under_n_chars characters are combined. The default is 500.
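To make the rules above concrete, here is a minimal, self-contained sketch of the sectioning logic. It is not the code in this PR: the `Element` dataclass and `sketch_chunk_by_title` are hypothetical stand-ins, only page changes are modeled as a "metadata change", and the exact comparison operators are an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element (not the library class)."""

    category: str  # e.g. "Title" or "NarrativeText"
    text: str
    page_number: Optional[int] = None


def sketch_chunk_by_title(
    elements: List[Element],
    multipage_sections: bool = True,
    combine_under_n_chars: int = 500,
    new_after_n_chars: int = 1500,
) -> List[List[Element]]:
    """Group elements into sections; individual elements are never split."""
    sections: List[List[Element]] = []
    current: List[Element] = []
    for el in elements:
        length = sum(len(e.text) for e in current)
        # Assumption: a page change is the only metadata change modeled here.
        page_changed = (
            not multipage_sections
            and bool(current)
            and el.page_number != current[-1].page_number
        )
        # A Title only breaks once the running section is long enough,
        # which is how short sections end up combined.
        title_break = el.category == "Title" and length > combine_under_n_chars
        too_long = length > new_after_n_chars
        if current and (title_break or page_changed or too_long):
            sections.append(current)
            current = []
        current.append(el)
    if current:
        sections.append(current)
    return sections
```

Because the length check happens before an element is appended, a single very long element still lands whole in one section, which matches the caveat about exceeding new_after_n_chars.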

Testing

from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_by_title(elements)

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()

@MthwRobinson MthwRobinson requested a review from qued August 28, 2023 17:02
class Section(Text):
    """A section of text consisting of a combination of elements."""

    category = "Section"
Contributor

it feels odd that this is just another Element type since all the ones we have so far (and planned) are "granular." that being said, i guess it has to derive from an Element to be immediately transformable by cleaning or staging bricks.

what do you think of the name CompositeElement instead of Section?

PS, i think eventually (soon, even) there could be a metadata field that is the base64 gzipped data of the original elements in this element, so the transformation to whatever we call this type could be reversible, plus retaining the full granular element data could still be useful for end user applications. not a blocker for this PR, of course.
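As a rough illustration of the reversible-metadata idea in the comment above, the sketch below round-trips a list of element dictionaries through JSON, gzip, and base64 using only the standard library. The function names and the dictionary shape are assumptions for illustration; nothing here is part of this PR.

```python
import base64
import gzip
import json
from typing import Dict, List


def pack_elements(element_dicts: List[Dict]) -> str:
    """Serialize element dicts to a base64 string of gzipped JSON."""
    raw = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def unpack_elements(packed: str) -> List[Dict]:
    """Reverse pack_elements and recover the original element dicts."""
    raw = gzip.decompress(base64.b64decode(packed))
    return json.loads(raw.decode("utf-8"))


# Hypothetical element dicts; the keys are illustrative only.
originals = [
    {"type": "Title", "text": "Overview"},
    {"type": "NarrativeText", "text": "Some body text."},
]
packed = pack_elements(originals)
restored = unpack_elements(packed)
```

The packed value is a plain ASCII string, so it could sit in a metadata field and survive JSON serialization of the composite element.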

Contributor

also, thank you for getting the ball rolling here!

Contributor Author

Just updated to CompositeElement. I like that better than Section since section is already a metadata field. And yeah reasoning was to make the outputs immediately consumable by the functions we already have that expect Element objects. Zipping the original elements and maintaining that also sounds great!
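The point about deriving from an existing element type can be sketched as follows. The `Text` class here is a hypothetical minimal stand-in, not the library's class, and the `category` value and `collapse_whitespace` helper are assumptions. The sketch only shows that a subclass is accepted anywhere the base type is expected.

```python
from typing import Callable


class Text:
    """Hypothetical minimal stand-in for a text element base class."""

    category = "UncategorizedText"

    def __init__(self, text: str) -> None:
        self.text = text

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        """Run each cleaner over the element's text, in order."""
        for cleaner in cleaners:
            self.text = cleaner(self.text)


class CompositeElement(Text):
    """A chunk made of several elements; inherits everything from Text."""

    category = "CompositeElement"  # assumed value, for illustration


def collapse_whitespace(text: str) -> str:
    """Hypothetical cleaner: squeeze runs of whitespace into single spaces."""
    return " ".join(text.split())


chunk = CompositeElement("Heading\n\nBody   text")
chunk.apply(collapse_whitespace)  # works because CompositeElement is a Text
```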

Contributor

@qued qued left a comment

LGTM although Crag's comment is good to think about. I also have the thought that this logic seems similar to an initial strategy we want to try for capturing hierarchy, and maybe the logic can be shared with it once that gets added in the next few weeks.

@MthwRobinson MthwRobinson enabled auto-merge (squash) August 29, 2023 15:33
@MthwRobinson MthwRobinson merged commit f6a745a into main Aug 29, 2023
@MthwRobinson MthwRobinson deleted the feat/title-chunking branch August 29, 2023 16:04
@cragwolfe
Contributor

similar to an initial strategy we want to try for capturing hierarchy,

imo, hierarchy-per-granular element is a bit different. chunking could use hierarchy better in the future (when it is included in element metadata), but chunking itself is strictly a downstream consumer of elements (which may contain additional hierarchy metadata info) imo.

cragwolfe pushed a commit that referenced this pull request Sep 11, 2023
### Summary

Partial solution to #1185.
Related to #1222.
Creates a decorator from the `chunk_by_title` cleaning brick.
Breaks a document into sections based on the presence of Title elements.
Also starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is 1500. The **chunking function does not split individual
elements**, so it's possible for a section to exceed that threshold if
an individual element is over `new_after_n_chars` characters, which
could occur with a long NarrativeText element.

Combines sections under these conditions:
- Sections under `combine_under_n_chars` characters are combined. The
default is 500.

### Testing

from unstructured.partition.html import partition_html

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
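
One way a `chunking_strategy` keyword like the one above could be layered onto a partition function is with a decorator that post-processes the returned elements. The sketch below is an assumption about the general pattern, not the code from the referenced commit: `add_chunking_strategy`, `toy_chunker`, and `toy_partition` are hypothetical names, and elements are plain strings to keep it self-contained.

```python
import functools
from typing import Callable, List, Optional


def add_chunking_strategy(chunker: Callable[[List[str]], List[str]]):
    """Hypothetical decorator factory that adds a chunking_strategy kwarg."""

    def decorator(partition_func):
        @functools.wraps(partition_func)
        def wrapper(*args, chunking_strategy: Optional[str] = None, **kwargs):
            elements = partition_func(*args, **kwargs)
            # Only chunk when explicitly requested; otherwise pass through.
            if chunking_strategy == "by_title":
                return chunker(elements)
            return elements

        return wrapper

    return decorator


def toy_chunker(elements: List[str]) -> List[str]:
    """Toy rule: a line starting with '# ' opens a new chunk."""
    chunks: List[str] = []
    for el in elements:
        if el.startswith("# ") or not chunks:
            chunks.append(el)
        else:
            chunks[-1] += "\n" + el
    return chunks


@add_chunking_strategy(toy_chunker)
def toy_partition(text: str) -> List[str]:
    """Toy partitioner: one element per non-empty line."""
    return [line for line in text.splitlines() if line.strip()]
```

Calling `toy_partition(doc)` returns the granular elements unchanged, while `toy_partition(doc, chunking_strategy="by_title")` returns the combined chunks, so existing callers are unaffected.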