Tabula py Ignores an entire column if it's blank and if it does not contain headers? #348

Closed
3 tasks done
phantom2152 opened this issue Apr 11, 2023 · 1 comment

Comments

@phantom2152

Summary of your issue

tabula-py drops an entire column from the output when the column is completely blank and the page has no header row.

Check list before submit

  • Did you read FAQ?

  • (Optional, but really helpful) Your PDF URL: https://drive.google.com/file/d/1-whMOIcCahbjZYxOtsDYQXr9s7Tlh_mG/view?usp=sharing

  • Paste the output of import tabula; tabula.environment_info() on Python REPL:
    Python version: 3.10.9 (tags/v3.10.9:1dd9be6, Dec 6 2022, 20:01:21) [MSC v.1934 64 bit (AMD64)]
    Java version: java version "17.0.6" 2023-01-17 LTS
    Java(TM) SE Runtime Environment (build 17.0.6+9-LTS-190)
    Java HotSpot(TM) 64-Bit Server VM (build 17.0.6+9-LTS-190, mixed mode, sharing)
    tabula-py version: 2.7.0
    platform: Windows-10-10.0.19044-SP0

What did you do when you faced the problem?

I tried different settings such as lattice=True and stream=True, but neither worked.

Code:

import tabula


def pdf_parser_to_table(pdf_file_path, csv_file_path):
    """Extract tables from a digital PDF and write them to a CSV file."""
    tabula.convert_into(
        pdf_file_path,
        csv_file_path,
        output_format="csv",
        pages="all",
        stream=True,
    )

Expected behavior:

I expected the blank column to be kept in the output, with its cells filled with NA values or empty strings.

Actual behavior:

When a table spans multiple pages and the header appears only on the first page, blank columns on the later pages are dropped, and the columns next to them shift into the blank column's position.

On the first page, where the header is present, the last name column is preserved in the output CSV file even though it is completely empty (screenshot: tmp_1681215223377).

On the second page, which has no header, the email column that follows the last name column is shifted into the last name column's position (screenshot: tmp_1681215366456). How can I avoid this and preserve an empty column in cases like this?

Related Issues:

@chezou
Owner

chezou commented Apr 23, 2023

Looking at the PDF, it should work with lattice=True, as follows:

import tabula
fname = "MOCK_DATA.pdf"
tabula.convert_into(fname, "test.csv", pages="all", lattice=True)
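One way to confirm that the empty column survives on every page is to count the fields in each row of the generated CSV. The following is a minimal sketch using only the standard library; the helper name field_count_histogram is hypothetical and not part of tabula-py. If every row reports the same number of fields, no column was dropped or shifted.

```python
import csv
from collections import Counter
from typing import Iterable


def field_count_histogram(lines: Iterable[str]) -> Counter:
    """Map "number of fields in a row" to "how many rows have that count".

    `lines` can be an open text file or any iterable of CSV lines.
    """
    return Counter(len(row) for row in csv.reader(lines))


# Usage against the file written by convert_into (path taken from the snippet above):
#
# with open("test.csv", newline="", encoding="utf-8") as handle:
#     print(field_count_histogram(handle))
```

More than one key in the result means some rows have fewer fields than others, which is the symptom described in this issue.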


If you need to keep characters from spilling over from the email column into the cc column, you have to adjust the columns option.
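As a rough sketch of what adjusting the columns option could look like: tabula-py accepts a columns keyword, a list of x coordinates (in PDF points) marking column boundaries. My understanding is that these boundaries are applied in stream mode rather than lattice mode, and that guess=False is usually set alongside them; treat both as assumptions to verify against the tabula-py documentation. The helper names and the coordinate values in the usage comment are hypothetical placeholders, since the real boundaries have to be measured from the PDF.

```python
from typing import Iterable, List


def normalise_column_edges(edges: Iterable[float]) -> List[float]:
    """Sort boundary x coordinates and reject duplicates."""
    cleaned = sorted(float(edge) for edge in edges)
    for left, right in zip(cleaned, cleaned[1:]):
        if left == right:
            raise ValueError(f"duplicate column boundary: {left}")
    return cleaned


def convert_with_fixed_columns(pdf_path: str, csv_path: str,
                               edges: Iterable[float]) -> None:
    """Write all pages to CSV using explicit column boundaries."""
    # Imported lazily so the validation helper works without Java installed.
    import tabula

    tabula.convert_into(
        pdf_path,
        csv_path,
        output_format="csv",
        pages="all",
        stream=True,   # assumption: boundaries are honoured in stream mode
        guess=False,   # assumption: disable table-area guessing
        columns=normalise_column_edges(edges),
    )


# Usage (the coordinates are placeholders, not measured from MOCK_DATA.pdf):
#
# convert_with_fixed_columns("MOCK_DATA.pdf", "test.csv", [95.0, 180.0, 270.0])
```

With fixed boundaries, a column that is blank on a headerless page should still occupy its own position in each row, because the split no longer depends on text being present in that column.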

@chezou chezou closed this as completed Apr 23, 2023