CParserError: Error tokenizing data. C error: Expected 2 fields in line 733, saw 3 #12

Closed
alonsopg opened this issue Dec 23, 2016 · 9 comments

alonsopg commented Dec 23, 2016

I am trying to extract the tables from a number of pdf documents:

In:

from tabula import read_pdf_table
pdf_table = read_pdf_table("../file.pdf", pages="all")

Out:


---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-31-c86da9ee0350> in <module>()
      1 from tabula import read_pdf_table
----> 2 pdf_table = read_pdf_table("../file.pdf", pages="all")
      3 type(pdf_table)

/usr/local/lib/python3.5/site-packages/tabula/wrapper.py in read_pdf_table(input_path, options, pages, guess, area, spreadsheet, password, nospreadsheet, silent)
    100         return
    101 
--> 102     return pd.read_csv(io.BytesIO(output))

/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    643                     skip_blank_lines=skip_blank_lines)
    644 
--> 645         return _read(filepath_or_buffer, kwds)
    646 
    647     parser_f.__name__ = name

/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    398         return parser
    399 
--> 400     data = parser.read()
    401     parser.close()
    402     return data

/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
    936                 raise ValueError('skipfooter not supported for iteration')
    937 
--> 938         ret = self._engine.read(nrows)
    939 
    940         if self.options.get('as_recarray'):

/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
   1503     def read(self, nrows=None):
   1504         try:
-> 1505             data = self._reader.read(nrows)
   1506         except StopIteration:
   1507             if self._first_chunk:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:9884)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10142)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:10870)()

pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:10741)()

pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:25878)()

CParserError: Error tokenizing data. C error: Expected 2 fields in line 733, saw 3

I tried setting the sep parameter to \t, but that did not work. What can I do?
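
The error message says one row of the intermediate CSV has more fields than the parser expected. One way to locate such rows is to count fields per row before handing the text to pandas. This is a minimal sketch using only the standard library; the function name `find_ragged_rows` and the sample text are hypothetical and not part of tabula-py, and treating the first row as the reference width is an assumption about how the parser settles on its expected count.

```python
import csv
import io


def find_ragged_rows(csv_text, delimiter=","):
    """Return (expected_width, [(line_no, width), ...]).

    expected_width is the field count of the first non-blank row (assumed
    reference); the list holds every later row whose count differs from it.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    expected = None
    ragged = []
    for row in reader:
        if not row:  # csv.reader yields [] for blank lines
            continue
        if expected is None:
            expected = len(row)
        elif len(row) != expected:
            # reader.line_num is the physical line just consumed
            ragged.append((reader.line_num, len(row)))
    return expected, ragged


# Hypothetical sample: line 3 carries one field too many
sample = "name,value\na,1\nb,2,extra\nc,3\n"
expected, ragged = find_ragged_rows(sample)
print(expected, ragged)
```

If the reported rows come from a second table or from log text mixed into the output, changing `sep` will not help, because the rows really do have different shapes.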

Owner

chezou commented Dec 24, 2016

If the file contains multiple tables, you should specify the page number with the pages option. This might be related to #2.
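
A rough sketch of that advice: call the reader once per page and record which pages fail, so a single problematic page does not abort the whole run. The helper name `extract_pages_individually` is hypothetical. It assumes only that the reader accepts a path and a `pages` keyword, as `read_pdf_table` does in this thread. A stand-in reader is used here so the snippet runs without a PDF or Java.

```python
def extract_pages_individually(read_table, pdf_path, page_numbers):
    """Call read_table once per page and separate successes from failures.

    read_table: any callable taking (path, pages=...); assumed to behave
    like the read_pdf_table function shown in this thread.
    """
    tables, failures = {}, {}
    for page in page_numbers:
        try:
            tables[page] = read_table(pdf_path, pages=str(page))
        except Exception as exc:  # broad on purpose: the parser error class varies
            failures[page] = repr(exc)
    return tables, failures


def fake_reader(path, pages):
    """Stand-in for the real reader (illustration only): page 5 fails."""
    if pages == "5":
        raise ValueError("Expected 2 fields in line 5, saw 6")
    return "table from page " + pages


tables, failures = extract_pages_individually(fake_reader, "../file.pdf", [4, 5, 6])
print(sorted(tables), sorted(failures))
```

With the real reader passed in place of `fake_reader`, the `failures` dictionary shows which pages need a closer look.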

Author

alonsopg commented Dec 27, 2016

Thanks for the help @chezou, I tried this:
In:

from tabula import read_pdf_table
pdf_table = read_pdf_table("../file.pdf", pages="45")
pdf_table

Out:


 	dic 27 	2016 12:32:04 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
0 	ADVERTENCIA: java.lang.ArrayIndexOutOfBoundsEx... 	NaN
1 	java.lang.ArrayIndexOutOfBoundsException: 5 	NaN
2 	\tat java.awt.geom.Path2DFloatFloatTxIterator.cur... 	NaN
3 	\tat technology.tabula.ObjectExtractor.strokeO... 	NaN
4 	\tat technology.tabula.ObjectExtractor.strokeP... 	NaN
5 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
6 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
7 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
8 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
9 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
10 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
11 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
12 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
13 	\tat technology.tabula.ObjectExtractor.drawPag... 	NaN
14 	\tat technology.tabula.ObjectExtractor.extract... 	NaN
15 	\tat technology.tabula.PageIterator.next(PageI... 	NaN
16 	\tat technology.tabula.CommandLineApp.extractT... 	NaN
17 	\tat technology.tabula.CommandLineApp.main(Com... 	NaN
18 	dic 27 	2016 12:32:04 PM org.apache.pdfbox.util.PDFSt...
19 	ADVERTENCIA: java.lang.ArrayIndexOutOfBoundsEx... 	NaN
20 	java.lang.ArrayIndexOutOfBoundsException: 10 	NaN
21 	\tat java.awt.geom.Path2DFloatFloatTxIterator.cur... 	NaN
22 	\tat technology.tabula.ObjectExtractor.strokeO... 	NaN
23 	\tat technology.tabula.ObjectExtractor.strokeP... 	NaN
24 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
25 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
26 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
27 	\tat org.apache.pdfbox.util.operator.pagedrawe... 	NaN
28 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
29 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
30 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
31 	\tat org.apache.pdfbox.util.PDFStreamEngine.pr... 	NaN
32 	\tat technology.tabula.ObjectExtractor.drawPag... 	NaN
33 	\tat technology.tabula.ObjectExtractor.extract... 	NaN
34 	\tat technology.tabula.PageIterator.next(PageI... 	NaN
35 	\tat technology.tabula.CommandLineApp.extractT... 	NaN
36 	\tat technology.tabula.CommandLineApp.main(Com... 	NaN
37 	ATOS DO PODER EXECUTIVO 	NaN
38 	ADMINISTRAÇÃO DIRETA 	NaN
39 	DECRETOS 	NaN

And:

In:

from tabula import read_pdf_table
pdf_table = read_pdf_table("../file.pdf", pages="5")
pdf_table

Out:

CParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 6

As you can see, I specified the pages parameter. Any idea how to proceed? Thanks!

Owner

chezou commented Dec 29, 2016

Could you try with tabula-java?
https://github.com/tabulapdf/tabula-java

If page 45 of your PDF includes multiple tables or has merged cells, tabula-py will likely fail. If you use the area option, you might be able to extract the table.

Anyway, I can't guess any further without your PDF.
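
To illustrate the area option: the region is commonly given as top, left, bottom, right in PDF points (72 points per inch), measured from the top-left corner of the page. The helper below is a hypothetical sketch that converts a region measured in inches into that form. The coordinate order is an assumption based on tabula-java's area flag, so check it against the version you have installed.

```python
POINTS_PER_INCH = 72.0  # PDF user space uses 72 points per inch


def area_in_points(top_in, left_in, bottom_in, right_in):
    """Convert a region in inches (from the page's top-left corner)
    into [top, left, bottom, right] in points."""
    if bottom_in <= top_in or right_in <= left_in:
        raise ValueError("bottom/right must be greater than top/left")
    return [round(v * POINTS_PER_INCH, 2)
            for v in (top_in, left_in, bottom_in, right_in)]


# Hypothetical region: starts 1 inch from the top and 0.5 inch from the left
area = area_in_points(1.0, 0.5, 5.0, 8.0)
print(area)  # [72.0, 36.0, 360.0, 576.0]
```

The result could then be passed along the lines of `read_pdf_table("../file.pdf", pages="45", area=area)`. The `area` parameter appears in the function signature in the traceback above, but the exact format it accepts is not confirmed in this thread.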

Owner

chezou commented Jan 9, 2017

I found the cause of this error and fixed it in f1db4ef.

@alonsopg Could you upgrade your tabula-py?

Owner

chezou commented Jan 15, 2017

@alonsopg Was your problem solved by the updated version? If so, I would like to close this issue.


RAHAAMA commented Feb 26, 2018

I have the same problem with pages="all". Could anyone help me?

Owner

chezou commented Feb 26, 2018


RAHAAMA commented Feb 26, 2018

@chezou Thank you. There is another problem with multiple tables: I have a PDF prepared in two languages, laid out as two columns (English and French). When I try to extract the tables, it treats all of the text as a table. Is there any suggestion for this problem?


RAHAAMA commented Feb 26, 2018

b'Skipping line 28: expected 2 fields, saw 4\nSkipping line 29: expected 2 fields, saw 4\nSkipping line 30: expected 2 fields, saw 4\nSkipping line 31: expected 2 fields, saw 4\nSkipping line 32: expected 2 fields, saw 4\nSkipping line 33: expected 2 fields, saw 4\nSkipping line 34: expected 2 fields, saw 4\nSkipping line 35: expected 2 fields, saw 4\nSkipping line 36: expected 2 fields, saw 4\nSkipping line 37: expected 2 fields, saw 4\nSkipping line 38: expected 2 fields, saw 4\nSkipping line 39: expected 2 fields, saw 4\nSkipping line 40: expected 2 fields, saw 4\nSkipping line

I also got the warnings above after setting pandas_options={'error_bad_lines': False}.
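
Skipping bad lines silently drops every row that is wider than expected, which is what the warnings above report. If those rows matter, an alternative is to pad all rows to the widest row before parsing. This is a minimal standard-library sketch; `pad_rows` and the sample text are hypothetical. Note also that newer pandas releases replace `error_bad_lines` with `on_bad_lines`.

```python
import csv
import io


def pad_rows(csv_text, delimiter=","):
    """Pad every non-blank row with empty fields up to the widest row,
    so that no row has more fields than any other."""
    rows = [r for r in csv.reader(io.StringIO(csv_text), delimiter=delimiter) if r]
    width = max((len(r) for r in rows), default=0)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for r in rows:
        writer.writerow(r + [""] * (width - len(r)))
    return out.getvalue()


# Hypothetical sample: a 2-field header followed by a 4-field row
print(pad_rows("a,b\n1,2,3,4\n5,6\n"))
```

The padded header row ends up with empty column names, so reading the result with `header=None` is probably the safer choice. Padding only makes the text parseable; it does not separate two tables that were merged into one output.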

3 participants