pandas.io.common.CParserError: Error tokenizing data. #17

Closed
kierank opened this issue Feb 22, 2017 · 5 comments

kierank commented Feb 22, 2017

Hi,

We get the following error when parsing a certain PDF file from a URL.
This is using the latest tabula-py from git.

The URL is https://resource.holdan.co.uk/Holdan/gbp/BMD.pdf

Traceback (most recent call last):
  File "test.py", line 8, in <module>
    df = tabula.read_pdf(url, pages="all")
  File "/usr/local/lib/python2.7/dist-packages/tabula/wrapper.py", line 69, in read_pdf_table
    return pd.read_csv(io.BytesIO(output), encoding = encoding)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 401, in _read
    data = parser.read()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 939, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1508, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 848, in pandas.parser.TextReader.read (pandas/parser.c:9977)
  File "pandas/parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10235)
  File "pandas/parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas/parser.c:10963)
  File "pandas/parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10834)
  File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:25978)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 7 fields in line 270, saw 10

chezou (Owner) commented Feb 23, 2017

This seems to be related to #2. I guess you should set the area option (see https://github.com/chezou/tabula-py#how-can-i-ignore-useless-area). Could you show me your command with its options?

I'd like to know the result of extracting with tabula-java. Could you try it?
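
For readers following along, restricting extraction to one region of a page might look roughly like the sketch below. It assumes tabula-py and a Java runtime are installed and that `area` is given as (top, left, bottom, right) in PDF points. The coordinates and the helper names `checked_area` and `read_region` are hypothetical placeholders, not values measured from the PDF in this issue.

```python
def checked_area(top, left, bottom, right):
    """Return (top, left, bottom, right) after a basic sanity check."""
    if not (0 <= top < bottom and 0 <= left < right):
        raise ValueError("expected 0 <= top < bottom and 0 <= left < right")
    return (top, left, bottom, right)


def read_region(pdf_path, area, pages=1):
    """Extract only the part of the page(s) inside `area`."""
    import tabula  # imported lazily: needs tabula-py and Java at call time
    return tabula.read_pdf(pdf_path, pages=pages, area=area)


# Placeholder coordinates (assumption): measure the real table region first.
area = checked_area(100, 30, 400, 560)
# df = read_region("BMD.pdf", area)  # uncomment once tabula-py is available
```

Limiting the area to a single table keeps every extracted row at the same column count, which is what `pd.read_csv` expects.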

kierank (Author) commented Mar 2, 2017

We run this script:

```python
#!/usr/bin/env python
import tabula
import re

url = "https://resource.holdan.co.uk/Holdan/gbp/BMD.pdf"
df = tabula.read_pdf(url, pages="all")
```

It seems to work in the vanilla tabula web interface.

chezou (Owner) commented Mar 3, 2017

As mentioned in #2, although tabula-java exports multiple tables with different numbers of columns, there is no delimiter between the tables in its output. So with the current version of tabula-py, you should specify each table's area.
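
To illustrate why the concatenated output trips up `pd.read_csv` ("Expected 7 fields ... saw 10"), here is a minimal standard-library sketch that groups consecutive CSV rows by field count. It is only a heuristic for inspecting such output, not what tabula-py does internally, and the `sample` text is a made-up stand-in for real tabula-java output.

```python
import csv
import io
from itertools import groupby


def split_by_field_count(csv_text):
    """Group consecutive non-empty CSV rows that share the same number of fields."""
    rows = (row for row in csv.reader(io.StringIO(csv_text)) if row)
    return [list(group) for _, group in groupby(rows, key=len)]


# Hypothetical stand-in: a 3-column table followed directly by a 5-column table.
sample = "a,b,c\n1,2,3\nx,y,z,w,v\n4,5,6,7,8\n"
tables = split_by_field_count(sample)
```

Note that two adjacent tables with the same number of columns would be merged by this approach, so specifying areas remains the more reliable route.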

kierank (Author) commented Mar 9, 2017

Ok, thanks for the help.

Kieran

kierank closed this as completed Mar 9, 2017

chezou (Owner) commented May 24, 2017

@kierank Just FYI: I released tabula-py v0.7.0, which includes the multiple_tables option (#34).

Any feedback is welcome!
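
A rough sketch of how the option might be used on the PDF from this issue, assuming tabula-py v0.7.0 or later and a Java runtime, and assuming `multiple_tables=True` returns a list with one DataFrame per detected table. The helper names `read_all_tables` and `table_shapes` are hypothetical.

```python
def read_all_tables(pdf_path):
    """Return the tables found in the PDF, one DataFrame per table (assumed)."""
    import tabula  # imported lazily: needs tabula-py >= 0.7.0 and Java at call time
    return tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)


def table_shapes(tables):
    """Report (rows, columns) for each extracted table."""
    return [table.shape for table in tables]


# Usage (requires network access, tabula-py and Java):
# url = "https://resource.holdan.co.uk/Holdan/gbp/BMD.pdf"
# print(table_shapes(read_all_tables(url)))
```

Printing the shapes is a quick way to confirm that tables with different column counts now come back separately instead of failing in a single `pd.read_csv` call.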
