Tabula py Ignores an entire column if it's blank and if it does not contain headers? #348

phantom2152 · 2023-04-11T12:19:57Z

Summary of your issue

tabula-py drops a whole column from the output if every cell in that column is blank and the page has no header row.

Check list before submit

Did you read FAQ?
(Optional, but really helpful) Your PDF URL: https://drive.google.com/file/d/1-whMOIcCahbjZYxOtsDYQXr9s7Tlh_mG/view?usp=sharing
Paste the output of import tabula; tabula.environment_info() on Python REPL:
Python version:
    3.10.9 (tags/v3.10.9:1dd9be6, Dec 6 2022, 20:01:21) [MSC v.1934 64 bit (AMD64)]
Java version:
    java version "17.0.6" 2023-01-17 LTS
    Java(TM) SE Runtime Environment (build 17.0.6+9-LTS-190)
    Java HotSpot(TM) 64-Bit Server VM (build 17.0.6+9-LTS-190, mixed mode, sharing)
tabula-py version: 2.7.0
platform: Windows-10-10.0.19044-SP0

What did you do when you faced the problem?

Tried different settings such as lattice=True and stream=True, but neither worked.

Code:

import tabula

def pdf_parser_to_table(pdf_file_path, csv_file_path):
    """Extract tables from a digital PDF and write them to a CSV file."""
    tabula.convert_into(pdf_file_path, csv_file_path,
                        output_format="csv", pages="all", stream=True)

Expected behavior:

Expected the cells of the blank column to be filled with NaN values or empty strings.

Actual behavior:

When a table spans multiple pages and the header appears only on the first page, blank columns on the later pages are dropped, and the columns next to them are shifted into the position of the blank column.

On the first page, where the header is present, the last name column is preserved in the output CSV file even though it is completely empty.

On the second page, which has no header, the email column next to the last name column is shifted into the position of the last name column. How can I avoid this and preserve an empty column even in cases like this?
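
A rough way to confirm the shift in the generated CSV is to count the fields in each row and flag the rows that are narrower than the header row. This is only a sketch: the function name `find_short_rows` and the sample rows are made up for illustration, and it assumes the first row of the CSV is the header taken from page 1.

```python
import csv
import io


def find_short_rows(csv_text):
    """Return (row_number, field_count) for every row narrower than the first row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    expected = len(rows[0])  # width of the header row
    return [
        (number, len(row))
        for number, row in enumerate(rows[1:], start=2)
        if len(row) < expected
    ]


# Made-up sample: row 2 keeps the empty last_name cell, row 3 has lost it.
sample = (
    "id,first_name,last_name,email\n"
    "1,Ann,,ann@example.com\n"
    "2,Bob,bob@example.com\n"
)
print(find_short_rows(sample))
```

Rows reported here are the ones where a blank column was most likely dropped during extraction.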

Related Issues:

chezou · 2023-04-23T00:54:40Z

Looking at the PDF, it should work with lattice=True, as follows:

import tabula
fname = "MOCK_DATA.pdf"
tabula.convert_into(fname, "test.csv",pages='all', lattice=True)

If you need to keep characters from the email column from spilling into the cc column, you have to adjust the columns option.
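
As a hedged sketch of what adjusting the columns option could look like: `columns` takes the x coordinates of the column boundaries in PDF points. The margin and widths below are hypothetical and were not measured from MOCK_DATA.pdf, and both helper names are invented for this example. It also assumes that explicit boundaries are applied in stream mode with `guess=False`, which is worth verifying against your tabula-py version.

```python
def boundaries_from_widths(left_margin, widths):
    """Convert a left margin and column widths (PDF points) into the
    x coordinates of the interior column boundaries."""
    edges = []
    x = left_margin
    for width in widths[:-1]:  # the last column needs no right-hand boundary
        x += width
        edges.append(x)
    return edges


def read_with_fixed_columns(pdf_path, boundaries):
    """Read every page using fixed column boundaries instead of guessed ones."""
    # Imported lazily so the helper above can be used without Java or tabula-py.
    import tabula

    return tabula.read_pdf(
        pdf_path, pages="all", stream=True, guess=False, columns=boundaries
    )


# Hypothetical layout: 50 pt left margin and five columns of these widths.
boundaries = boundaries_from_widths(50, [40, 90, 90, 160, 60])
print(boundaries)
```

Calling `read_with_fixed_columns("MOCK_DATA.pdf", boundaries)` would then return a list of DataFrames. Because the boundaries are fixed, a blank column should keep its position on pages without a header, provided the coordinates match the real layout.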

chezou closed this as completed Apr 23, 2023
