feat: chunk elements based on titles by MthwRobinson · Pull Request #1222 · Unstructured-IO/unstructured

MthwRobinson · 2023-08-28T17:02:53Z

Summary

An initial pass on smart chunking for RAG applications. Breaks a document into sections based on the presence of Title elements. Also starts a new section under the following conditions:

If metadata changes, indicating a change in section or page or a switch to processing attachments. If multipage_sections=True, sections can span pages. multipage_sections defaults to True.
If the length of the section exceeds new_after_n_chars characters. The default is 1500. The chunking function does not split individual elements, so a section can exceed that threshold if an individual element is over new_after_n_chars characters, which could occur with a long NarrativeText element.
Sections under combine_under_n_chars characters are combined. The default is 500.
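
For readers who want the rules above in one place, here is a rough sketch of the splitting logic. It is not the code in this PR: `Element`, `_metadata_differs`, and `chunk_sketch` are hypothetical stand-ins, and treating the combine rule as "a Title only opens a new section once the current one is longer than combine_under_n_chars" is one plausible reading of the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""

    text: str
    category: str = "NarrativeText"
    metadata: Dict[str, object] = field(default_factory=dict)


def _metadata_differs(a: Element, b: Element, multipage_sections: bool) -> bool:
    # Assumption: only "section" and "page_number" are compared here.
    # Page changes are ignored when sections are allowed to span pages.
    keys = ["section"] if multipage_sections else ["section", "page_number"]
    return any(a.metadata.get(k) != b.metadata.get(k) for k in keys)


def chunk_sketch(
    elements: List[Element],
    multipage_sections: bool = True,
    combine_under_n_chars: int = 500,
    new_after_n_chars: int = 1500,
) -> List[str]:
    sections: List[List[Element]] = []
    current: List[Element] = []
    for el in elements:
        length = sum(len(e.text) for e in current)
        # A Title opens a new section unless the current one is still small.
        title_break = el.category == "Title" and length > combine_under_n_chars
        metadata_break = bool(current) and _metadata_differs(
            current[-1], el, multipage_sections
        )
        # Elements are never split, so a single oversized element can push
        # a section past new_after_n_chars before this check fires.
        size_break = length > new_after_n_chars
        if current and (title_break or metadata_break or size_break):
            sections.append(current)
            current = []
        current.append(el)
    if current:
        sections.append(current)
    return ["\n\n".join(e.text for e in section) for section in sections]
```

Calling `chunk_sketch(elements, multipage_sections=False)` on elements with differing `page_number` metadata should yield one chunk per page, while the default lets a section run across the page boundary.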

Testing

from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_by_title(elements)

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()

cragwolfe · 2023-08-29T06:03:02Z

unstructured/documents/elements.py

+class Section(Text):
+    """A section of text consisting of a combination of elements."""
+
+    category = "Section"


it feels odd that this is just another Element type since all the ones we have so far (and planned) are "granular." that being said, i guess it has to derive from an Element to be immediately transformable by cleaning or staging bricks.

what do you think of the name CompositeElement instead of Section?

PS, i think eventually (soon, even) there could be a metadata field that is the base64 gzipped data of the original elements in this element, so the transformation to whatever we call this type could be reversible, plus retaining the full granular element data could still be useful for end user applications. not a blocker for this PR, of course.

also, thank you for getting the ball rolling here!
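
To make the reversible-metadata idea concrete, here is a minimal sketch of the round trip using only the standard library. Nothing here is part of this PR: `pack_elements`, `unpack_elements`, and the `orig_elements` key are hypothetical names, and it assumes the original elements can be serialized to JSON-compatible dicts.

```python
import base64
import gzip
import json
from typing import Dict, List


def pack_elements(elements: List[Dict[str, object]]) -> str:
    """Serialize element dicts to JSON, gzip them, and base64-encode the result."""
    raw = json.dumps(elements).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def unpack_elements(packed: str) -> List[Dict[str, object]]:
    """Reverse pack_elements and return the original element dicts."""
    raw = gzip.decompress(base64.b64decode(packed))
    return json.loads(raw.decode("utf-8"))


originals = [
    {"type": "Title", "text": "Overview"},
    {"type": "NarrativeText", "text": "Some body text."},
]
# Hypothetical metadata key; the real field name was not decided in this thread.
composite_metadata = {"orig_elements": pack_elements(originals)}
restored = unpack_elements(composite_metadata["orig_elements"])
```

The packed value is a plain ASCII string, so it could sit in a JSON metadata field alongside the other keys.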

Just updated to CompositeElement. I like that better than Section since section is already a metadata field. And yeah reasoning was to make the outputs immediately consumable by the functions we already have that expect Element objects. Zipping the original elements and maintaining that also sounds great!
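
A small illustration of the reasoning about deriving from the existing element type: code written against the base class accepts the combined type unchanged. The `Text` class, `collapse_whitespace`, and `clean_all` below are simplified stand-ins written for this example, not the library's definitions.

```python
from typing import Callable, List


class Text:
    """Simplified stand-in for the library's base text element."""

    category = "UncategorizedText"

    def __init__(self, text: str) -> None:
        self.text = text


class CompositeElement(Text):
    """A section of text consisting of a combination of elements."""

    category = "CompositeElement"


def collapse_whitespace(text: str) -> str:
    # Hypothetical cleaner: squeeze every run of whitespace into one space.
    return " ".join(text.split())


def clean_all(elements: List[Text], cleaner: Callable[[str], str]) -> List[Text]:
    # Written against Text, so it also accepts any subclass.
    for element in elements:
        element.text = cleaner(element.text)
    return elements


chunk = CompositeElement("A  title\n\nSome   body text")
cleaned = clean_all([chunk], collapse_whitespace)
```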

qued

LGTM although Crag's comment is good to think about. I also have the thought that this logic seems similar to an initial strategy we want to try for capturing hierarchy, and maybe the logic can be shared with it once that gets added in the next few weeks.

cragwolfe · 2023-08-29T17:06:23Z

> similar to an initial strategy we want to try for capturing hierarchy,

imo, hierarchy-per-granular element is a bit different. chunking could use hierarchy better in the future (when it is included in element metadata), but chunking itself is strictly a downstream consumer of elements (which may contain additional hierarchy metadata info) imo.

### Summary

Partial solution to #1185. Related to #1222. Creates a decorator from the `chunk_by_title` cleaning brick. Breaks a document into sections based on the presence of Title elements. Also starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters. The default is 1500. The **chunking function does not split individual elements**, so it's possible for a section to exceed that threshold if an individual element is over `new_after_n_chars` characters, which could occur with a long NarrativeText element.

Combines sections under these conditions:

- Sections under `combine_under_n_chars` characters are combined. The default is 500.

### Testing

from unstructured.partition.html import partition_html

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
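
As a rough picture of how a decorator can add a `chunking_strategy` keyword to a partition function, here is a self-contained sketch. It does not reproduce the decorator from #1304: `with_chunking`, `group_by_heading`, and `partition_lines` are hypothetical, and plain strings stand in for elements.

```python
import functools
from typing import Callable, List, Optional


def group_by_heading(lines: List[str]) -> List[str]:
    """Hypothetical stand-in chunker: a line starting with '# ' opens a new chunk."""
    chunks: List[List[str]] = []
    for line in lines:
        if line.startswith("# ") or not chunks:
            chunks.append([])
        chunks[-1].append(line)
    return ["\n".join(chunk) for chunk in chunks]


def with_chunking(partition_func: Callable[..., List[str]]) -> Callable[..., List[str]]:
    """Wrap a partition function so it accepts a chunking_strategy keyword."""

    @functools.wraps(partition_func)
    def wrapper(*args, chunking_strategy: Optional[str] = None, **kwargs) -> List[str]:
        elements = partition_func(*args, **kwargs)
        # Post-process only when the caller opts in; otherwise pass through.
        if chunking_strategy == "by_title":
            return group_by_heading(elements)
        return elements

    return wrapper


@with_chunking
def partition_lines(text: str) -> List[str]:
    """Toy partitioner: one element per non-empty line."""
    return [line for line in text.splitlines() if line.strip()]
```

Keeping the chunking step in a wrapper means the partition function itself stays unaware of chunking, and the default call path returns the granular elements as before.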

MthwRobinson added 13 commits August 27, 2023 14:18

add chunking directory

69c5cd5

separate based on titles

55a884c

function and test for chunk by title

e3846f4

add handling for regex metadata

f3989be

work on list metadata

6e0075d

finish with metadata

598bc8e

check for change in metadata

cb1692d

split when metadata changes

cf464dc

add multipage sections

41f46dd

add new after and combine under chars

a07e98c

add doc strings

f1f89fd

update docs

7924788

changelog and verison

4e015ca

MthwRobinson requested a review from qued August 28, 2023 17:02

MthwRobinson added 2 commits August 28, 2023 13:03

Merge branch 'main' into feat/title-chunking

550941b

Merge branch 'main' into feat/title-chunking

b5c12a2

cragwolfe reviewed Aug 29, 2023

qued approved these changes Aug 29, 2023

MthwRobinson and others added 2 commits August 29, 2023 11:30

Merge branch 'main' into feat/title-chunking

ee17a24

Section -> CompositeElement

0046269

MthwRobinson enabled auto-merge (squash) August 29, 2023 15:33

MthwRobinson merged commit f6a745a into main Aug 29, 2023

MthwRobinson deleted the feat/title-chunking branch August 29, 2023 16:04

This was referenced Aug 29, 2023

feat/min and max partition made available to all partitions #1185

Closed

feat: support for CompositeElement inverse operations #1286

Closed

Coniferish mentioned this pull request Sep 5, 2023

chunk_by_title decorator #1304

Merged

MthwRobinson merged 17 commits into main from feat/title-chunking

3 participants
