pandas.io.common.CParserError: Error tokenizing data. #17

kierank · 2017-02-22T20:25:39Z

Hi,

We get the following error when parsing a particular PDF file from a URL.
This is with the latest tabula-py from git.

The URL is https://resource.holdan.co.uk/Holdan/gbp/BMD.pdf

Traceback (most recent call last):
  File "test.py", line 8, in <module>
    df = tabula.read_pdf(url, pages="all")
  File "/usr/local/lib/python2.7/dist-packages/tabula/wrapper.py", line 69, in read_pdf_table
    return pd.read_csv(io.BytesIO(output), encoding = encoding)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 401, in _read
    data = parser.read()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 939, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1508, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 848, in pandas.parser.TextReader.read (pandas/parser.c:9977)
  File "pandas/parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10235)
  File "pandas/parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas/parser.c:10963)
  File "pandas/parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10834)
  File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:25978)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 7 fields in line 270, saw 10


chezou · 2017-02-23T08:59:38Z

It seems to be related to #2. I guess you should set the `area` option (see https://github.com/chezou/tabula-py#how-can-i-ignore-useless-area ). Could you show me your command with its options?

I'd like to know the result of extracting with tabula-java. Could you try it?
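
For anyone landing here later, a minimal sketch of what setting the `area` option could look like. The helper name `read_region` and the coordinates in the usage comment are made up for illustration; `area` is given as `[top, left, bottom, right]`, and the right values depend entirely on the PDF's layout.

```python
def read_region(pdf_path, area, pages="all"):
    """Extract tables only from one region of each page.

    `area` is [top, left, bottom, right]; the caller has to work out
    values that fit the PDF in question.
    """
    # Imported lazily: tabula-py needs a Java runtime to do any work.
    import tabula

    return tabula.read_pdf(pdf_path, pages=pages, area=area)


# Hypothetical usage; these coordinates are placeholders, not measured
# from BMD.pdf:
# df = read_region("BMD.pdf", area=[50, 0, 750, 600])
```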

kierank · 2017-03-02T19:53:01Z

We run this script:

`#!/usr/bin/env python
import tabula
import re

url = "https://resource.holdan.co.uk/Holdan/gbp/BMD.pdf"
df = tabula.read_pdf(url, pages="all")`

It seems to work in the vanilla tabula web interface.

chezou · 2017-03-03T22:38:12Z

As mentioned in #2, although tabula-java exports multiple tables with different numbers of columns, there is no delimiter between the tables in its output. So with the current version of tabula-py, you should specify each table's area.
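
To illustrate why the parser reports "Expected 7 fields in line 270, saw 10": the CSV output is one stream in which the column count changes partway through. The sketch below is not what tabula-py does internally; it is a standard-library heuristic (the function name `split_by_field_count` is my own) that splits such a stream wherever the number of fields changes. Two adjacent tables with the same column count would be merged by it.

```python
import csv
import io
from itertools import groupby


def split_by_field_count(csv_text):
    """Split CSV text into blocks of consecutive rows that share a field count."""
    # Skip completely empty lines, which csv.reader yields as [].
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    # groupby only groups *consecutive* rows, which is what we want here.
    return [list(block) for _, block in groupby(rows, key=len)]


# Two small tables concatenated without any delimiter between them.
sample = "a,b,c\n1,2,3\nw,x,y,z\n4,5,6,7\n"
blocks = split_by_field_count(sample)
for block in blocks:
    print(len(block[0]), "fields:", block)
```

Each block could then be handed to a CSV parser separately, since within a block every row has the same width.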

kierank · 2017-03-09T00:08:47Z

Ok, thanks for the help.

Kieran

chezou · 2017-05-24T00:49:36Z

@kierank Just FYI: I released tabula-py v0.7.0, which includes a `multiple_tables` option (#34).

Any feedback is welcome!
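
A minimal sketch of the original script using that option, assuming tabula-py v0.7.0 or later. The wrapper name `read_all_tables` is only for illustration; with `multiple_tables=True` the result should be a list of DataFrames, one per detected table, rather than a single DataFrame.

```python
def read_all_tables(pdf_path):
    """Read every table in the PDF separately instead of as one CSV stream."""
    # Imported lazily: tabula-py needs a Java runtime to do any work.
    import tabula

    return tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)


# Usage with the URL from this issue (requires network access and Java):
# tables = read_all_tables("https://resource.holdan.co.uk/Holdan/gbp/BMD.pdf")
# for table in tables:
#     print(table.shape)
```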

kierank closed this as completed Mar 9, 2017
