Enhance Camelot's Table Extraction to Exclude Specific Rows Based on Alignment Issues #504
Comments
Hey all! We are building a maintained fork at pypdf_table_extraction. You are welcome to check it out and contribute there.
I have the same problem.
@rodfloripa I never found a solution, so I handled all of this while post-processing the data.
Could you open an issue at https://github.com/py-pdf/pypdf_table_extraction?
Have you tried setting table regions?
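If the unwanted header and footer rows always sit in the same place on the page, restricting extraction to the table body is one option. Camelot's `read_pdf` accepts `table_areas` (and `table_regions`), each a list of `"x1,y1,x2,y2"` strings in PDF coordinates, with (x1, y1) the top-left and (x2, y2) the bottom-right corner. The following is only a minimal sketch under that assumption: `area_string`, `read_table_body`, the file name, and the coordinates are hypothetical and would need to be read off the actual PDF.

```python
def area_string(x1: float, y1: float, x2: float, y2: float) -> str:
    """Format a bounding box as the "x1,y1,x2,y2" string Camelot expects.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    in PDF coordinates (origin at the bottom-left of the page).
    """
    if x2 <= x1 or y1 <= y2:
        raise ValueError("expected x1 < x2 and y1 > y2 (top-left, bottom-right)")
    return f"{x1:g},{y1:g},{x2:g},{y2:g}"


def read_table_body(path: str, pages: str, area: str):
    """Extract only the given area, leaving header/footer rows outside it."""
    import camelot  # imported lazily so area_string works without Camelot

    return camelot.read_pdf(path, pages=pages, flavor="lattice", table_areas=[area])


# Hypothetical coordinates that cover the table body but stop short of the
# misaligned first row (above y=700) and the footer row (below y=80).
body_area = area_string(30, 700, 580, 80)

# Hypothetical usage; "statement.pdf" is a placeholder file name.
# tables = read_table_body("statement.pdf", pages="1-4", area=body_area)
```

This only helps when the header and footer positions are stable across pages; if they move, the area would have to be computed per page or the rows removed after extraction.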
I am using Camelot for table extraction in PDF documents, which generally works well for my needs. However, I've encountered a recurring issue where the first and last rows of tables cause problems during the extraction process, primarily due to their alignment. These rows often differ in format from the rest of the table, affecting the consistency and accuracy of the extracted data. Currently, Camelot does not seem to offer a direct way to exclude specific rows based on their characteristics or alignment.
This feature would be incredibly beneficial for scenarios where table headers or footers consistently deviate in style or alignment from the main table body, leading to extraction inaccuracies. A parameter or method to specify rows to ignore (by index or pattern recognition) during extraction could significantly improve the utility and flexibility of Camelot for users facing similar challenges.
Is there an existing solution or workaround to address this issue, or could this functionality be considered for future updates?
Details:
[Screenshot: page 1]
[Screenshot: page 2 (the table is long and spans pages 2, 3, and 4 in some PDFs)]
As you can see, the first and last rows cause Camelot to produce 11 columns for this DataFrame when there are actually 10. My PDFs sometimes contain such footers (the last row of the table on the page) and extra header lines (the first row), which I do not want extracted; my real header comes after that first row.
I have already tried adjusting line_tol, joint_tol, split_text, line_scale, shift_text, etc. This works for smaller misalignments, as in the first screenshot (page 1), but it fails for the case in the second screenshot.
Here is my table-appending function, which builds a single result_df for long tables:
```python
def append_tables_to_dataframe(tables):
    try:
        df_list = []
```
Is there any way to match on the text that appears in those rows, or on anything else, so that the first and last rows are excluded during extraction and Camelot focuses on the main table (the data we are actually interested in)?
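One possible workaround, in the spirit of the comment about handling this after extraction, is to drop leading and trailing rows that contain known header or footer text and then remove any column left completely empty. This is a minimal sketch, not a Camelot feature: it assumes each table is available as a list of rows of strings (for example via `table.df.values.tolist()`), and `strip_edge_rows`, `drop_empty_columns`, and the marker strings are hypothetical names chosen for illustration.

```python
from typing import List, Sequence

Row = List[str]


def strip_edge_rows(rows: List[Row], markers: Sequence[str]) -> List[Row]:
    """Drop leading and trailing rows that contain any of the marker texts."""

    def is_noise(row: Row) -> bool:
        text = " ".join(cell.strip() for cell in row).lower()
        return any(marker.lower() in text for marker in markers)

    start, end = 0, len(rows)
    while start < end and is_noise(rows[start]):
        start += 1
    while end > start and is_noise(rows[end - 1]):
        end -= 1
    return rows[start:end]


def drop_empty_columns(rows: List[Row]) -> List[Row]:
    """Remove columns whose cells are blank in every remaining row."""
    if not rows:
        return rows
    width = max(len(row) for row in rows)
    keep = [
        i for i in range(width)
        if any(i < len(row) and row[i].strip() for row in rows)
    ]
    return [[row[i] if i < len(row) else "" for i in keep] for row in rows]


# Hypothetical extraction result: the misaligned first and last rows
# introduce a stray fourth column that the real table does not have.
rows = [
    ["", "", "", "Statement of account"],
    ["Date", "Item", "Amount", ""],
    ["2024-01-01", "Widget", "10.00", ""],
    ["", "", "", "Page 1 of 3"],
]
markers = ["statement of account", "page 1 of"]

cleaned = drop_empty_columns(strip_edge_rows(rows, markers))
```

Only rows at the edges are removed, so a data row that happens to contain a marker in the middle of the table is kept. Applying this to each page's table before concatenating should keep the column count consistent, provided the stray column is empty in the real data rows.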
I hope this makes sense; if not, let me know and I will happily explain more. If you are able to add this to Camelot, it would make the library more powerful.
Thanks