Table Title and Table content separate chunks: Merge contents of parent_id and element.id #3012
Comments
Hi, did you find any solution to this? I am having the same problem and would like the table title and content to be in the same chunk to provide appropriate context to the content.
+1, good question!
If a […] So one approach is to increase […] A chunker that did exactly what you're asking for would be a different chunker; that is, it would not just be a configuration of an existing chunker. I think the spec you're asking for is: […]
A more "pragmatic" approach might be to do partitioning and chunking in separate steps, and combine each Title with the element that follows it in between:

    from typing import Iterable, Iterator

    from unstructured.chunking.basic import chunk_elements
    from unstructured.documents.elements import Element, Title
    from unstructured.partition.auto import partition

    elements = partition(file)

    def combine_title_elements(elements: Iterable[Element]) -> Iterator[Element]:
        title = None
        for e in elements:
            # -- case where Title immediately follows a Title --
            if isinstance(e, Title):
                if title:
                    yield title
                title = e
            # -- case when prior element was a title --
            elif title:
                yield combine_title_with_element_fn_you_wrote_yourself(title, e)
                title = None
            # -- "normal" case when prior element was not a title --
            else:
                yield e
        # -- handle case when last element is a Title --
        if title:
            yield title

    chunks = chunk_elements(combine_title_elements(elements))
Hi @scanny, I'm interested in your code. What is the combine_title_with_element_fn_you_wrote_yourself function? Can you provide the full code for it? Thanks.
That's the function you write yourself, to combine those elements in whatever way suits your purposes. It could be as simple as:

    def combine_title_with_element(title_element: Title, next_element: Element) -> Element:
        next_element.text = f"{title_element.text} {next_element.text}".strip()
        return next_element

but you may also want to make some adjustments to the metadata, depending on your needs.
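To see how the two snippets above fit together without installing anything, here is a minimal, self-contained sketch. The `Element` and `Title` classes are simplified stand-ins invented for illustration, not the library's real classes, and recording the title under a `merged_title_id` metadata key is only one assumption about what "adjustments to the metadata" might mean in practice.

```python
from dataclasses import dataclass, field, replace
from typing import Iterable, Iterator, Optional


# -- simplified stand-ins for illustration only, NOT the real library classes --
@dataclass
class Element:
    text: str
    id: str = ""
    metadata: dict = field(default_factory=dict)


class Title(Element):
    pass


def combine_title_with_element(title: Title, nxt: Element) -> Element:
    """Return a copy of `nxt` whose text is prefixed with the title text."""
    metadata = dict(nxt.metadata)  # copy so the input element is not mutated
    # -- the title no longer exists as its own element, so drop the link to it --
    if metadata.get("parent_id") == title.id:
        del metadata["parent_id"]
    metadata["merged_title_id"] = title.id  # hypothetical key, an assumption
    return replace(nxt, text=f"{title.text}\n{nxt.text}".strip(), metadata=metadata)


def combine_title_elements(elements: Iterable[Element]) -> Iterator[Element]:
    title: Optional[Title] = None
    for e in elements:
        if isinstance(e, Title):
            # -- a Title directly after a Title: emit the earlier one unchanged --
            if title is not None:
                yield title
            title = e
        elif title is not None:
            yield combine_title_with_element(title, e)
            title = None
        else:
            yield e
    # -- a Title at the very end has nothing to merge with --
    if title is not None:
        yield title


elements = [
    Title("RAG Evaluation: RAGAS", id="t1"),
    Element(
        "Model Context Recall Context Precision Faithfulness",
        id="e1",
        metadata={"parent_id": "t1", "page_number": 15},
    ),
    Element("Unrelated paragraph", id="e2"),
    Title("Trailing title", id="t2"),
]
combined = list(combine_title_elements(elements))
```

The sketch joins title and body with a newline rather than a space so the merged text keeps the title on its own line; either separator works with the generator.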
Thanks, @scanny. I guess chunk_elements function is
@huangpan2507 Sounds like a different question related to PDFs. Best to ask that as a separate issue or on the Unstructured Community Slack channel.
Thanks for your response. OK, I will post an issue on that channel.
Hi,
I am using partition and chunk_by_title to chunk my PDFs. It generally works, but when I investigated the chunks I saw that if there is a Table in one of my documents, the title of the table is always one chunk and the actual content of the table is a separate chunk, which I think is not optimal.
E.g. see this example with a pptx-file:
Prints:
+++++++++++++++++++++++++
RAG Evaluation: RAGAS
{'file_directory': '...', 'filename': '301123_genai_präsentation.pptx', 'filetype': '...', 'last_modified': '2023-11-30T10:26:30', 'page_number': 15, 'source': '301123_genai_präsentation.pptx', 'source_documents': '301123_genai_präsentation.pptx', 'page': 15}
+++++++++++++++++++++++++
Retrieval Generation
Model Context Recall Context Precision Faithfulness
Llama 2-Chat 0.86 0.58 0.91
LeoLM-Chat 0.86 0.58 0.81
LeoLM-Mistral-Chat 0.86 0.58 0.87
EM German Leo Mistral 0.86 0.58 0.82
Llama-German-Assistant 0.86 0.58 0.91
{'file_directory': '...', 'filename': '301123_genai_präsentation.pptx', 'last_modified': '2023-11-30T10:26:30', 'page_number': 15, 'parent_id': 'a9e22a24894f5c1dbe9b0b66251bbbc2', 'filetype': '...', 'source': '301123_genai_präsentation.pptx', 'source_documents': '301123_genai_präsentation.pptx', 'page': 15}
Question
I see a parent_id key in the second output. How can I merge the content of the first output (the table heading) with the second output, so that I have everything in one chunk:
RAG Evaluation: RAGAS
Retrieval Generation
Model Context Recall Context Precision Faithfulness
Llama 2-Chat 0.86 0.58 0.91
LeoLM-Chat 0.86 0.58 0.81
LeoLM-Mistral-Chat 0.86 0.58 0.87
EM German Leo Mistral 0.86 0.58 0.82
Llama-German-Assistant 0.86 0.58 0.91
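One reading of this question is that an element's `parent_id` names the `id` of its title, so the merge could be keyed on that link instead of on element order. The following is a minimal sketch under that assumption. It uses plain dictionaries with hypothetical `id`, `text`, and `parent_id` keys rather than the library's own element objects, and the `table-1` id is made up for the example.

```python
from typing import Dict, List


def merge_children_into_parents(elements: List[dict]) -> List[dict]:
    """Fold each element's text into the element named by its parent_id.

    Assumes a parent appears before its children. An element whose parent
    is not found (or that has no parent_id) is kept as its own record.
    """
    merged: Dict[str, dict] = {}  # id -> output record, in first-seen order
    for el in elements:
        parent = merged.get(el.get("parent_id"))
        if parent is not None:
            parent["text"] = f"{parent['text']}\n{el['text']}"
        else:
            merged[el["id"]] = dict(el)  # copy so the input is not mutated
    return list(merged.values())


elements = [
    {"id": "a9e22a24894f5c1dbe9b0b66251bbbc2", "text": "RAG Evaluation: RAGAS"},
    {
        "id": "table-1",  # hypothetical id for illustration
        "text": "Retrieval Generation\nModel Context Recall Context Precision Faithfulness",
        "parent_id": "a9e22a24894f5c1dbe9b0b66251bbbc2",
    },
]
result = merge_children_into_parents(elements)
```

Because a merged child is not kept as its own record, an element whose parent was itself merged stays separate; whether that is acceptable depends on how deep the nesting in the document goes.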
Here is the full code: