tabula.io.read_pdf argument "pandas_options" is being changed inside the function #338

Closed
5 tasks
vaghinak-vardanyan opened this issue Feb 14, 2023 · 3 comments · Fixed by #339

@vaghinak-vardanyan

Summary of your issue

When the "pandas_options" argument is passed to tabula.io.read_pdf, the function modifies the dict in place, which causes unexpected behavior. If you call the function a second time with the same dict, pandas_options is already empty, so the options you specified are silently ignored.
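The effect can be reproduced without a PDF. The following is a minimal sketch, assuming the reader consumes its options with `dict.pop`; `read_tables` is a hypothetical stand-in and not tabula-py's actual code. It also shows a caller-side workaround: passing a fresh copy of the dict on every call.

```python
def read_tables(raw_rows, pandas_options=None):
    """Hypothetical stand-in for a reader that consumes its options in place."""
    if pandas_options is None:
        pandas_options = {}
    # pop() removes the key from the caller's dict, not from a private copy.
    header = pandas_options.pop("header", "infer")
    if header is None:
        # No header row: number the columns and keep every row as data.
        return {"columns": list(range(len(raw_rows[0]))), "rows": raw_rows}
    # Default behavior: treat the first row as column names.
    return {"columns": raw_rows[0], "rows": raw_rows[1:]}


rows = [["a", "b"], ["1", "2"]]
options = {"header": None}

first = read_tables(rows, pandas_options=options)
# By now the first call has emptied `options`, so the default applies here.
second = read_tables(rows, pandas_options=options)

# Caller-side workaround: hand over a fresh shallow copy on every call.
options = {"header": None}
third = read_tables(rows, pandas_options=dict(options))
fourth = read_tables(rows, pandas_options=dict(options))
```

With the shared dict, `first` and `second` disagree about the columns. With `dict(options)` on each call, the original dict is left intact and both results match.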

Check list before submit

  • [x] Did you read FAQ?

  • (Optional, but really helpful) Your PDF URL: ?

  • [x] Paste the output of import tabula; tabula.environment_info() on Python REPL: ?

Python version:
3.9.16 (main, Dec 7 2022, 01:12:08)
[GCC 11.3.0]
Java version:
openjdk version "1.8.0_352"
OpenJDK Runtime Environment (build 1.8.0_352-8u352-ga-1~22.04-b08)
OpenJDK 64-Bit Server VM (build 25.352-b08, mixed mode)
tabula-py version: 2.6.0
platform: Linux-5.15.0-48-generic-x86_64-with-glibc2.35
uname:
uname_result(system='Linux', node='vaghinak-pc', release='5.15.0-48-generic', version='#54-Ubuntu SMP Fri Aug 26 13:26:29 UTC 2022', machine='x86_64')
linux_distribution: ('Ubuntu', '22.04', 'jammy')
mac_ver: ('', ('', '', ''), '')

If not possible to execute tabula.environment_info(), please answer following questions manually.

  • Paste the output of python --version command on your terminal: ?
  • Paste the output of java -version command on your terminal: ?
  • Does java -h command work well?; Ensure your java command is included in PATH
  • Write your OS and its version: ?

What did you do when you faced the problem?

Code:

import tabula

file_path = "path/to/pdf"
pandas_options = {"header": None}

first_part = tabula.read_pdf(
    file_path,
    pages=1,
    pandas_options=pandas_options,
    area=(160.0, 10.0, 500.0, 250.0),
    columns=[42.0, 170.0, 210.0],
    lattice=True
)

# this time the first row is interpreted as the column header
second_part = tabula.read_pdf(
    file_path,
    pages=1,
    pandas_options=pandas_options,
    area=(160.0, 280.0, 490.0, 480.0),
    columns=[282.0, 410.0, 450.0],
    lattice=True
)

print(first_part[0].columns)
# different result
print(second_part[0].columns)

Expected behavior:

RangeIndex(start=0, stop=3, step=1)
RangeIndex(start=0, stop=3, step=1)

Actual behavior:

RangeIndex(start=0, stop=3, step=1)
Index(['Myanmar - Long Grain Parboiled Rice', 'Unnamed: 0', 'Unnamed: 1'], dtype='object')

Related Issues:

@vaghinak-vardanyan
Author

I can make the appropriate changes if we agree on this.
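One possible shape for such a change is a defensive copy at the top of the function, so that anything popped later stays local to the call. This is only a sketch under that assumption: `extract_tables` is a hypothetical stand-in, and the fix that was actually merged may look different.

```python
import copy


def extract_tables(raw_rows, pandas_options=None):
    """Hypothetical reader that never modifies the caller's options."""
    # Work on a private copy so the pop below stays local to this call.
    options = copy.deepcopy(pandas_options) if pandas_options else {}
    header = options.pop("header", "infer")
    if header is None:
        # No header row: number the columns and keep every row as data.
        columns = list(range(len(raw_rows[0])))
        return {"columns": columns, "rows": raw_rows}
    # Default behavior: treat the first row as column names.
    return {"columns": raw_rows[0], "rows": raw_rows[1:]}


shared_options = {"header": None}
rows = [["x", "y", "z"], ["1", "2", "3"]]

# Both calls reuse the same dict and now see the same options.
left = extract_tables(rows, pandas_options=shared_options)
right = extract_tables(rows, pandas_options=shared_options)
```

`copy.deepcopy` is used here rather than a shallow copy so that nested values, such as a list of column names, are also protected from in-place changes.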

@chezou
Owner

chezou commented Feb 14, 2023

Thank you for recreating the issue!

I agree this is a bug, and I'd appreciate it if you could send a PR for it 😄

@chezou
Owner

chezou commented Feb 21, 2023

Released in v2.7.0.
