Paste the output of import tabula; tabula.environment_info() on Python REPL:
Python version: 3.10.9 (tags/v3.10.9:1dd9be6, Dec 6 2022, 20:01:21) [MSC v.1934 64 bit (AMD64)]
Java version: java version "17.0.6" 2023-01-17 LTS
Java(TM) SE Runtime Environment (build 17.0.6+9-LTS-190)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.6+9-LTS-190, mixed mode, sharing)
tabula-py version: 2.7.0
platform: Windows-10-10.0.19044-SP0
What did you do when you faced the problem?
Tried different settings such as lattice=True and stream=True, but neither worked.
Code:
import tabula

def pdf_parser_to_table(pdf_file_path, csv_file_path):
    """Extract tables from a digital PDF and write them to a CSV file."""
    # stream=True infers columns from whitespace rather than ruling lines
    tabula.convert_into(pdf_file_path, csv_file_path,
                        output_format="csv", pages="all", stream=True)
Expected behavior:
Expected the blank cells to be filled with NA values or empty strings.
Actual behavior:
When a table spans multiple pages and the header appears only on the first page, blank columns on the later pages are dropped, and the columns to their right shift left into the blank column's position.
On the first page, where the header is present, the last name column is preserved in the output CSV file even though it is completely empty.
On the second page, which has no header, the email column next to the last name column is shifted into the last name column's position. How can I avoid this and preserve an empty column in cases like this?
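To make the shift concrete, here is a minimal post-processing sketch in plain Python. It is not a tabula-py feature: `realign_rows` is a hypothetical helper, the sample data is invented to mimic the output described above, and it assumes that rows from the header-less pages come out exactly one cell short and that the position of the dropped column is known in advance.

```python
import csv
import io


def realign_rows(rows, expected_width, blank_index):
    """Re-insert an empty cell into rows that lost a blank column.

    rows: list of lists of strings (for example from csv.reader).
    expected_width: number of columns in the header row.
    blank_index: position of the dropped column (assumed to be known).
    """
    fixed = []
    for row in rows:
        if len(row) == expected_width - 1:
            # Row is one cell short: assume the blank column was dropped here.
            row = row[:blank_index] + [""] + row[blank_index:]
        fixed.append(row)
    return fixed


# Invented sample: page 1 keeps the empty "last name" column,
# page 2 (no header) has lost it, so "email" slid one position left.
sample = (
    "first name,last name,email\n"
    "Ann,,ann@example.com\n"
    "Bob,bob@example.com\n"
)
rows = list(csv.reader(io.StringIO(sample)))
fixed = realign_rows(rows, expected_width=len(rows[0]), blank_index=1)
for row in fixed:
    print(row)
```

This only repairs the CSV after extraction. tabula-py also documents a `columns` option (a list of X coordinates for column boundaries) that may be worth trying together with stream=True, although whether it preserves the empty column for this particular PDF is untested here.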
Related Issues:
Summary of your issue
tabula-py drops an entire column from the output if the column is completely blank and the page has no header row.
Check list before submit
Did you read FAQ?
(Optional, but really helpful) Your PDF URL: https://drive.google.com/file/d/1-whMOIcCahbjZYxOtsDYQXr9s7Tlh_mG/view?usp=sharing