bug/chunk_by_title disregarding combine_text_under_n_chars #2699
Comments
@georearl A This is what we're seeing in the chunk stream: chunk, table, chunk, table, etc. Note that Closing for now as no bug is evident here, but feel free to ask any other questions about this if you need more clarification :)
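One way to check for the alternating chunk, table, chunk, table pattern mentioned above is to list the type and length of every chunk the chunker returned. This is a minimal sketch, not library code: `describe_chunks` and `count_by_type` are hypothetical helper names, and the only assumption about a chunk is that it exposes a `.text` attribute.

```python
def describe_chunks(chunks):
    """Return a (type name, character count) pair for every chunk, in order."""
    return [(type(chunk).__name__, len(chunk.text)) for chunk in chunks]


def count_by_type(chunks):
    """Count how many chunks of each type appear in the stream."""
    counts = {}
    for chunk in chunks:
        name = type(chunk).__name__
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Calling `describe_chunks(chunks)` on the result of `chunk_by_title(...)` should show whether table chunks sit between the text chunks, which would explain why the text on either side was not combined.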
Thank you. Really appreciate it. Possible follow-up question: if you have a table in a file, say a Word file, and it crosses multiple pages, say 20 pages, but headings aren't repeated on each page, do you chunk the table but pull forward the headings from page 1?
This will likely be file-format dependent somewhat, but in the DOCX case you mentioned, the
Hi @scanny, is there a way to avoid this behaviour? I tried modifying the element type before chunking and removing text_as_html and table_as_cells from the metadata, but this didn't work. I guess there is some underlying table identifier in the metadata that I cannot see.
@LucasOliveira44 it would be better to open a new issue for this as it's not strictly related to the original question. If you add a new issue I'll see it and be happy to respond. The short answer though is you'll need to change the element type. An element that returns You could potentially filter the element stream before passing it along to chunking and convert But I'd like to discuss further because this isn't the first time it's come up. If you'll open a fresh issue I can say more.
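The idea of filtering the element stream before chunking can be sketched generically. The comment above does not say which element type to convert to, so this sketch leaves that to the caller: `convert_before_chunking`, `matches`, and `replacement_for` are all hypothetical names, not part of the library.

```python
def convert_before_chunking(elements, matches, replacement_for):
    """Yield elements in their original order, swapping out the matched ones.

    `matches` is a caller-supplied predicate that picks the elements to change.
    `replacement_for` is a caller-supplied function that builds the substitute.
    """
    for element in elements:
        yield replacement_for(element) if matches(element) else element
```

The result would then be passed to the chunker in place of the original list, for example `chunk_by_title(list(convert_before_chunking(elements, matches, replacement_for)))`.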
Done @scanny, you can find it here: #2990 (comment)
Describe the bug
I am partitioning and then chunking an HTML file. The HTML has 12357 chars including spaces, but even with very large values for max_characters, combine_text_under_n_chars and new_after_n_chars it still gives me 9 chunks.
To Reproduce
Use an HTML-based document, roughly one page in length with numerous titles, then partition and chunk as follows.
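from unstructured.chunking.title import chunk_by_title
from unstructured.partition.html import partition_html
elements = partition_html(file=bytes_io)  # bytes_io: file-like object holding the HTML
chunks = chunk_by_title(elements, multipage_sections=True, new_after_n_chars=100000, combine_text_under_n_chars=75000, max_characters=100000)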
Expected behavior
Given the parameter values I would expect a single chunk. This is just exploratory to understand an issue. In the real scenario I wouldn't use these values or desire a single chunk.
Additional context
We are using version 0.12.6