feat: chunk elements based on titles #1222
unstructured/documents/elements.py
Outdated
```python
class Section(Text):
    """A section of text consisting of a combination of elements."""

    category = "Section"
```
it feels odd that this is just another Element type since all the ones we have so far (and planned) are "granular." that being said, i guess it has to derive from an Element to be immediately transformable by cleaning or staging bricks.
what do you think of the name CompositeElement instead of Section?
PS, i think eventually (soon, even) there could be a metadata field that is the base64 gzipped data of the original elements in this element, so the transformation to whatever we call this type could be reversible, plus retaining the full granular element data could still be useful for end user applications. not a blocker for this PR, of course.
also, thank you for getting the ball rolling here!
Just updated to CompositeElement. I like that better than Section since section is already a metadata field. And yeah, the reasoning was to make the outputs immediately consumable by the functions we already have that expect Element objects. Zipping the original elements and maintaining that also sounds great!
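A rough sketch of the reversible "base64 gzipped" idea discussed above, using only the standard library. The helper names `pack_elements` and `unpack_elements` and the dict shape of each element are assumptions for illustration; nothing here is part of this PR.

```python
import base64
import gzip
import json


def pack_elements(element_dicts):
    """Serialize a list of element dicts into a base64-encoded gzip string."""
    raw = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def unpack_elements(packed):
    """Invert pack_elements and return the original list of dicts."""
    raw = gzip.decompress(base64.b64decode(packed))
    return json.loads(raw.decode("utf-8"))


# Hypothetical granular elements that were merged into one composite element.
originals = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "Some body text."},
]
packed = pack_elements(originals)    # plain ASCII string, safe to store in metadata
restored = unpack_elements(packed)   # round-trips back to the original dicts
```

Storing the packed string in a metadata field would let a consumer recover the granular elements from the combined one, at the cost of a larger payload.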
qued left a comment
LGTM although Crag's comment is good to think about. I also have the thought that this logic seems similar to an initial strategy we want to try for capturing hierarchy, and maybe the logic can be shared with it once that gets added in the next few weeks.
imo, hierarchy-per-granular element is a bit different. chunking could use hierarchy better in the future (when it is included in element metadata), but chunking itself is strictly a downstream consumer of elements (which may contain additional hierarchy metadata info) imo.
### Summary

Partial solution to #1185. Related to #1222. Creates a decorator from the `chunk_by_title` cleaning brick. Breaks a document into sections based on the presence of Title elements. Also starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters. The default is 1500. The **chunking function does not split individual elements**, so it's possible for a section to exceed that threshold if an individual element is over `new_after_n_chars` characters, which could occur with a long NarrativeText element.

Combines sections under these conditions:

- Sections under `combine_under_n_chars` characters are combined. The default is 500.

### Testing

```python
from unstructured.partition.html import partition_html

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")
for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
```
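The sectioning rules above can be sketched without the library. This is a simplified illustration, not the PR's implementation: the `El` dataclass and `sketch_chunk_by_title` are hypothetical names, only the page number stands in for "metadata changes", and small sections are combined by letting a Title start a new section only once the current one has reached `combine_under_n_chars`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class El:
    """Hypothetical stand-in for a partitioned element."""
    category: str
    text: str
    page_number: Optional[int] = None


def text_len(section: List[El]) -> int:
    """Total number of characters of text in a section."""
    return sum(len(e.text) for e in section)


def sketch_chunk_by_title(
    elements: List[El],
    multipage_sections: bool = True,
    combine_under_n_chars: int = 500,
    new_after_n_chars: int = 1500,
) -> List[List[El]]:
    sections: List[List[El]] = []
    current: List[El] = []
    for el in elements:
        length = text_len(current)
        # A page change only forces a break when sections may not span pages.
        page_changed = (
            not multipage_sections
            and bool(current)
            and el.page_number != current[-1].page_number
        )
        # A Title opens a new section unless the current one is still small.
        title_break = el.category == "Title" and length >= combine_under_n_chars
        too_long = length > new_after_n_chars
        if current and (page_changed or title_break or too_long):
            sections.append(current)
            current = []
        # Elements are appended whole, so one long element can push a
        # section past new_after_n_chars.
        current.append(el)
    if current:
        sections.append(current)
    return sections


sample = [
    El("Title", "A", 1),
    El("NarrativeText", "x" * 600, 1),
    El("Title", "B", 1),
    El("NarrativeText", "y" * 10, 1),
    El("Title", "C", 2),
    El("NarrativeText", "z" * 2000, 2),
    El("NarrativeText", "w" * 5, 2),
]
sections = sketch_chunk_by_title(sample)
```

In `sample`, the short section under "B" absorbs the section under "C" when sections may span pages, and the 2000-character element is kept whole even though it exceeds `new_after_n_chars`.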
Summary
An initial pass on smart chunking for RAG applications. Breaks a document into sections based on the presence of `Title` elements. Also starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to `True`.
- If the length of the section exceeds `new_after_n_chars` characters. The default is `1500`. The chunking function does not split individual elements, so it's possible for a section to exceed that threshold if an individual element is over `new_after_n_chars` characters, which could occur with a long `NarrativeText` element.

Combines sections under these conditions:

- Sections under `combine_under_n_chars` characters are combined. The default is `500`.

Testing