tabula.io.read_pdf argument "pandas_options" is being changed inside the function #338

vaghinak-vardanyan · 2023-02-14T21:44:20Z

Summary of your issue

When the "pandas_options" argument is passed to tabula.io.read_pdf, the function modifies that dictionary in place, which causes unexpected behavior. If you pass the same dictionary to a second call, it is already empty, so the options you specified are silently ignored.

Check list before submit

[x] Did you read FAQ?
(Optional, but really helpful) Your PDF URL: ?
[x] Paste the output of import tabula; tabula.environment_info() on Python REPL: ?

Python version:
3.9.16 (main, Dec 7 2022, 01:12:08)
[GCC 11.3.0]
Java version:
openjdk version "1.8.0_352"
OpenJDK Runtime Environment (build 1.8.0_352-8u352-ga-1~22.04-b08)
OpenJDK 64-Bit Server VM (build 25.352-b08, mixed mode)
tabula-py version: 2.6.0
platform: Linux-5.15.0-48-generic-x86_64-with-glibc2.35
uname:
uname_result(system='Linux', node='vaghinak-pc', release='5.15.0-48-generic', version='#54-Ubuntu SMP Fri Aug 26 13:26:29 UTC 2022', machine='x86_64')
linux_distribution: ('Ubuntu', '22.04', 'jammy')
mac_ver: ('', ('', '', ''), '')

If not possible to execute tabula.environment_info(), please answer following questions manually.

Paste the output of python --version command on your terminal: ?
Paste the output of java -version command on your terminal: ?
Does the java -h command work? Ensure your java command is included in PATH.
Write your OS and its version: ?

What did you do when you faced the problem?

Code:


import tabula

file_path = "path/to/pdf"
pandas_options = {"header": None}

first_part = tabula.read_pdf(
    file_path,
    pages=1,
    pandas_options=pandas_options,
    area=(160.0, 10.0, 500.0, 250.0),
    columns=[42.0, 170.0, 210.0],
    lattice=True
)

# this time the first row is interpreted as the column header
second_part = tabula.read_pdf(
    file_path,
    pages=1,
    pandas_options=pandas_options,
    area=(160.0, 280.0, 490.0, 480.0),
    columns=[282.0, 410.0, 450.0],
    lattice=True
)

print(first_part[0].columns)
# different result
print(second_part[0].columns)

Expected behavior:


RangeIndex(start=0, stop=3, step=1)
RangeIndex(start=0, stop=3, step=1)

Actual behavior:


RangeIndex(start=0, stop=3, step=1)
Index(['Myanmar - Long Grain Parboiled Rice', 'Unnamed: 0', 'Unnamed: 1'], dtype='object')
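The difference between the two results is consistent with the options dictionary being consumed in place. The following is a minimal sketch of that mechanism in plain Python. It does not use tabula-py, and parse_rows is a hypothetical stand-in rather than the library's actual code; it only assumes that the reader removes keys from the dictionary it receives.

```python
def parse_rows(rows, options=None):
    """Hypothetical stand-in for a reader that consumes its options dict."""
    if options is None:
        options = {}
    # pop() removes the key from the caller's dict, not from a private copy.
    header = options.pop("header", 0)
    if header is None:
        # No header row: number the columns and keep every row as data.
        return {"columns": list(range(len(rows[0]))), "data": rows}
    # Otherwise the chosen row becomes the column names.
    return {"columns": rows[header], "data": rows[header + 1:]}


rows = [["a", "b"], ["1", "2"]]
opts = {"header": None}

first = parse_rows(rows, opts)   # sees {"header": None}
second = parse_rows(rows, opts)  # sees {} and falls back to header=0

print(first["columns"])   # [0, 1]
print(second["columns"])  # ['a', 'b']
print(opts)               # {}
```

The second call behaves as if no options were given, which mirrors the first data row being promoted to column names in the output above.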

Related Issues:


vaghinak-vardanyan · 2023-02-14T21:53:49Z

I can make the appropriate changes if we agree on this.
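One possible shape for such a change, sketched in plain Python: copy the dictionary at the top of the function so the caller's object is never touched. The name parse_rows_fixed is hypothetical, and this is an assumption about how a fix could look, not a description of the change that was eventually merged.

```python
import copy


def parse_rows_fixed(rows, options=None):
    """Hypothetical reader that works on a private copy of its options."""
    # deepcopy also protects nested values (e.g. a dtype mapping) from mutation.
    options = copy.deepcopy(options) if options is not None else {}
    header = options.pop("header", 0)  # only the local copy loses the key
    if header is None:
        return {"columns": list(range(len(rows[0]))), "data": rows}
    return {"columns": rows[header], "data": rows[header + 1:]}


rows = [["a", "b"], ["1", "2"]]
opts = {"header": None}

first = parse_rows_fixed(rows, opts)
second = parse_rows_fixed(rows, opts)

print(first["columns"])   # [0, 1]
print(second["columns"])  # [0, 1]
print(opts)               # {'header': None}
```

On an affected version, a caller-side workaround along the same lines would be to pass a fresh copy on every call, for example pandas_options=dict(pandas_options). This assumes only top-level keys are modified.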

chezou · 2023-02-14T22:24:10Z

Thank you for recreating the issue!

I agree this is a bug, and I'd appreciate it if you could send a PR for it 😄

chezou · 2023-02-21T01:11:45Z

Released on v2.7.0.

chezou added the bug label Feb 14, 2023

chezou mentioned this issue Feb 20, 2023

fix: do not break pandas_options #339

Merged

7 tasks

chezou closed this as completed in #339 Feb 20, 2023
