CParserError: Error tokenizing data. C error: Expected 2 fields in line 733, saw 3 #12

alonsopg · 2016-12-23T19:14:27Z

I am trying to extract the tables from a number of pdf documents:

In:

from tabula import read_pdf_table
pdf_table = read_pdf_table("../file.pdf", pages="all")

Out:


---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-31-c86da9ee0350> in <module>()
      1 from tabula import read_pdf_table
----> 2 pdf_table = read_pdf_table("../file.pdf", pages="all")
      3 type(pdf_table)

/usr/local/lib/python3.5/site-packages/tabula/wrapper.py in read_pdf_table(input_path, options, pages, guess, area, spreadsheet, password, nospreadsheet, silent)
    100         return
    101 
--> 102     return pd.read_csv(io.BytesIO(output))

/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    643                     skip_blank_lines=skip_blank_lines)
    644 
--> 645         return _read(filepath_or_buffer, kwds)
    646 
    647     parser_f.__name__ = name

/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    398         return parser
    399 
--> 400     data = parser.read()
    401     parser.close()
    402     return data

/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
    936                 raise ValueError('skipfooter not supported for iteration')
    937 
--> 938         ret = self._engine.read(nrows)
    939 
    940         if self.options.get('as_recarray'):

/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
   1503     def read(self, nrows=None):
   1504         try:
-> 1505             data = self._reader.read(nrows)
   1506         except StopIteration:
   1507             if self._first_chunk:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:9884)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10142)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:10870)()

pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10741)()

pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:25878)()

CParserError: Error tokenizing data. C error: Expected 2 fields in line 733, saw 3

I tried setting the sep parameter to \t, but it did not work. What can I do?


chezou · 2016-12-24T00:06:24Z

If there are multiple tables in the file, you should specify the page number with the pages option. This might be related to #2.
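
To illustrate why several tables in one file can trigger this error, here is a minimal sketch using only the standard library. It assumes the extracted CSV text simply contains a narrow table followed by a wider one. The sample data and the helper name `first_width_mismatch` are made up for illustration and are not part of tabula-py or pandas.

```python
import csv
import io

# Made-up CSV text standing in for extractor output: a 2-column table
# followed directly by a 3-column table (an assumption for illustration).
SAMPLE_CSV = "name,value\na,1\nb,2\nitem,qty,price\nx,3,9.5\n"


def first_width_mismatch(csv_text):
    """Return (line_number, expected, seen) for the first row that has more
    fields than the first row, or None if no such row exists.

    Only wider rows are reported, because that is the situation the
    "Expected N fields ... saw M" message describes.
    """
    expected = None
    for line_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if expected is None:
            expected = len(row)  # the first row fixes the expected width
        elif len(row) > expected:
            return line_number, expected, len(row)
    return None


print(first_width_mismatch(SAMPLE_CSV))  # (4, 2, 3)
```

Running a check like this on the raw CSV text shows where the second table begins, which can help when choosing the pages value.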

alonsopg · 2016-12-27T18:39:03Z

Thanks for the help @chezou, I tried this:
In:

from tabula import read_pdf_table
pdf_table = read_pdf_table("../file.pdf", pages="45")
pdf_table

Out:


 	dic 27 	2016 12:32:04 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
0 	ADVERTENCIA: java.lang.ArrayIndexOutOfBoundsEx... 	NaN
1 	java.lang.ArrayIndexOutOfBoundsException: 5 	NaN
2 	\tat java.awt.geom.Path2DFloatFloatTxIterator.cur... 	NaN
3 	\tat technology.tabula.ObjectExtractor.strokeO... 	NaN
4 	\tat technology.tabula.ObjectExtractor.strokeP... 	NaN
5 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
6 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
7 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
8 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
9 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
10 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
11 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
12 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
13 	\tat technology.tabula.ObjectExtractor.drawPag... 	NaN
14 	\tat technology.tabula.ObjectExtractor.extract... 	NaN
15 	\tat technology.tabula.PageIterator.next(PageI... 	NaN
16 	\tat technology.tabula.CommandLineApp.extractT... 	NaN
17 	\tat technology.tabula.CommandLineApp.main(Com... 	NaN
18 	dic 27 	2016 12:32:04 PM org.apache.pdfbox.util.PDFSt...
19 	ADVERTENCIA: java.lang.ArrayIndexOutOfBoundsEx... 	NaN
20 	java.lang.ArrayIndexOutOfBoundsException: 10 	NaN
21 	\tat java.awt.geom.Path2DFloatFloatTxIterator.cur... 	NaN
22 	\tat technology.tabula.ObjectExtractor.strokeO... 	NaN
23 	\tat technology.tabula.ObjectExtractor.strokeP... 	NaN
24 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
25 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
26 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
27 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
28 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
29 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
30 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
31 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
32 	\tat technology.tabula.ObjectExtractor.drawPag... 	NaN
33 	\tat technology.tabula.ObjectExtractor.extract... 	NaN
34 	\tat technology.tabula.PageIterator.next(PageI... 	NaN
35 	\tat technology.tabula.CommandLineApp.extractT... 	NaN
36 	\tat technology.tabula.CommandLineApp.main(Com... 	NaN
37 	ATOS DO PODER EXECUTIVO 	NaN
38 	ADMINISTRAÇÃO DIRETA 	NaN
39 	DECRETOS 	NaN

And:

In:

from tabula import read_pdf_table
pdf_table = read_pdf_table("../file.pdf", pages="5")
pdf_table

Out:

CParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 6

As you can see, I specified the pages parameter. Any idea how to proceed? Thanks!

chezou · 2016-12-29T06:45:29Z

Could you try with tabula-java?
https://github.com/tabulapdf/tabula-java

If page 45 of your PDF includes multiple tables or has merged cells, tabula-py will probably fail. If you use the area option, you might be able to extract the table.

Anyway, I can't guess any further without your PDF.

chezou · 2017-01-09T04:32:46Z

I found the cause of this error and fixed it: f1db4ef

@alonsopg Could you upgrade your tabula-py?

chezou · 2017-01-15T10:48:48Z

@alonsopg Did the updated version solve your problem? If so, I would like to close this issue.

RAHAAMA · 2018-02-26T14:53:19Z

I have the same problem with pages="all". Could anyone help me?

chezou · 2018-02-26T14:55:58Z

@RAHAAMA Set multiple_tables=True.
https://github.com/chezou/tabula-py#i-faced-cparsererror-how-can-i-extract-multiple-tables
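
A minimal sketch of that suggestion, assuming a tabula-py version in which `tabula.read_pdf` accepts `multiple_tables=True` and returns a list of tables rather than a single DataFrame. The wrapper name `extract_tables` and its `reader` parameter are hypothetical, added only so the call can be tried without Java or a PDF at hand.

```python
def extract_tables(pdf_path, pages="all", reader=None):
    """Read every table from pdf_path and return them as a list.

    `reader` is a hypothetical hook for illustration; when omitted,
    tabula.read_pdf is used (this needs tabula-py and Java installed).
    """
    if reader is None:
        import tabula  # imported lazily so the sketch loads without tabula-py
        reader = tabula.read_pdf
    # multiple_tables=True asks for one table per detected region instead of
    # a single frame built from all of the extracted rows.
    tables = reader(pdf_path, pages=pages, multiple_tables=True)
    return list(tables)
```

Usage would look like `tables = extract_tables("../file.pdf")`, followed by a loop over `tables` to inspect each one.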

RAHAAMA · 2018-02-26T16:12:36Z

@chezou Thank you. There is another problem with multiple tables: I have a PDF prepared in two languages, so each page has two columns (English and French). When I try to extract the tables, it treats all of the text as a table. Is there any suggestion for this problem?

RAHAAMA · 2018-02-26T16:16:50Z

b'Skipping line 28: expected 2 fields, saw 4\nSkipping line 29: expected 2 fields, saw 4\nSkipping line 30: expected 2 fields, saw 4\nSkipping line 31: expected 2 fields, saw 4\nSkipping line 32: expected 2 fields, saw 4\nSkipping line 33: expected 2 fields, saw 4\nSkipping line 34: expected 2 fields, saw 4\nSkipping line 35: expected 2 fields, saw 4\nSkipping line 36: expected 2 fields, saw 4\nSkipping line 37: expected 2 fields, saw 4\nSkipping line 38: expected 2 fields, saw 4\nSkipping line 39: expected 2 fields, saw 4\nSkipping line 40: expected 2 fields, saw 4\nSkipping line

I also got the warnings above. I have set pandas_options={'error_bad_lines': False}.
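
Each "Skipping line N" warning means that row was dropped from the result, so the wider rows are lost rather than parsed. As a rough sketch of an alternative way to inspect such output, the standard-library code below groups consecutive rows by field count instead of discarding them. The sample text and the helper name `split_by_width` are assumptions for illustration, not part of tabula-py or pandas.

```python
import csv
import io


def split_by_width(csv_text):
    """Group consecutive CSV rows that share the same field count.

    Returns a list of tables, each table being a list of rows. A new table
    starts whenever the field count changes. Blank lines are ignored.
    """
    tables = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # csv.reader yields [] for blank lines
        if tables and len(tables[-1][0]) == len(row):
            tables[-1].append(row)
        else:
            tables.append([row])
    return tables


# Made-up sample: two 2-field rows followed by two 4-field rows.
sample = "en,fr\nhello,bonjour\na,b,c,d\ne,f,g,h\n"
for table in split_by_width(sample):
    print(len(table[0]), "fields:", table)
```

With this sample, the 4-field rows end up in their own group instead of being skipped.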

alonsopg closed this as completed Jan 17, 2017
