can not parse Chinese pdf document #161

yuyangyoung · 2018-10-20T08:17:45Z

This is the file I want to parse.
W020180518365531252048.pdf
PdfReadWarning: Illegal character in Name Object [generic.py:489]
Traceback (most recent call last):
  File "D:\Python\Python36\lib\site-packages\IPython\core\interactiveshell.py", line 2961, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    tables = camelot.read_pdf("W020180518365531252048.pdf")
  File "D:\Python\Python36\lib\site-packages\camelot\io.py", line 91, in read_pdf
    tables = p.parse(flavor=flavor, **kwargs)
  File "D:\Python\Python36\lib\site-packages\camelot\handlers.py", line 141, in parse
    self._save_page(self.filename, p, tempdir)
  File "D:\Python\Python36\lib\site-packages\camelot\handlers.py", line 95, in _save_page
    layout, dim = get_page_layout(fpath)
  File "D:\Python\Python36\lib\site-packages\camelot\utils.py", line 586, in get_page_layout
    document = PDFDocument(parser)
  File "D:\Python\Python36\lib\site-packages\pdfminer.six-20170720-py3.6.egg\pdfminer\pdfdocument.py", line 566, in __init__
    xref.load(parser)
  File "D:\Python\Python36\lib\site-packages\pdfminer.six-20170720-py3.6.egg\pdfminer\pdfdocument.py", line 195, in load
    (_, obj) = parser.nextobject()
  File "D:\Python\Python36\lib\site-packages\pdfminer.six-20170720-py3.6.egg\pdfminer\psparser.py", line 606, in nextobject
    raise PSSyntaxError('Invalid dictionary construct: %r' % objs)
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'Type', /'Font', /'Subtype', /'Type0', /'BaseFont', /b"b'", /"ABCDEE+\xb7\xc2\xcb\xce'", /'Encoding', /'Identity-H', /'DescendantFonts', PDFObjRef:6, /'ToUnicode', PDFObjRef:12]

Langdi · 2018-10-21T20:15:45Z

You should probably include the document you were trying to parse. Worked fine for me on a Chinese pdf.

yuyangyoung · 2018-10-21T23:31:04Z

W020180518365531252048.pdf
This is the file I want to parse.

vinayak-mehta · 2018-10-22T22:25:50Z

Thanks for the report, @yuyangyoung! I was not able to find a bug report for this syntax error on the pdfminer GitHub repo. From the error name, my guess is that the PostScript inside the PDF has incorrect syntax. I was able to fix the file using Ghostscript and parse tables from it; you can check out this SO answer for how to do that. Basically, do this:

$ gs -o W020180518365531252048_repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress W020180518365531252048.pdf
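
The same repair-then-parse flow could also be driven from Python. This is only a minimal sketch: it assumes Ghostscript is installed and reachable as `gs` on the PATH (on Windows the console executable is usually named `gswin64c` instead), and the helper names `build_gs_command` and `repair_and_read` are hypothetical, not part of camelot.

```python
import subprocess


def build_gs_command(src, dst, gs_executable="gs"):
    """Build the Ghostscript command line shown in the comment above."""
    return [
        gs_executable,
        "-o", dst,                  # output file
        "-sDEVICE=pdfwrite",        # re-emit the document as a fresh PDF
        "-dPDFSETTINGS=/prepress",
        src,
    ]


def repair_and_read(src, dst, gs_executable="gs", **read_pdf_kwargs):
    """Rewrite src with Ghostscript, then parse the repaired copy with camelot."""
    import camelot  # imported lazily so build_gs_command works without camelot

    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_gs_command(src, dst, gs_executable), check=True)
    return camelot.read_pdf(dst, **read_pdf_kwargs)


# Example call (requires Ghostscript and camelot to be installed):
# tables = repair_and_read(
#     "W020180518365531252048.pdf",
#     "W020180518365531252048_repaired.pdf",
# )
```

Writing the repaired copy to a separate file keeps the original PDF untouched, which helps when comparing results before and after the fix.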

Eichee · 2018-11-17T14:32:10Z

I have run into the same problem. Have you already solved it? Thanks!

vinayak-mehta · 2018-11-17T15:55:55Z

@Eichee If you face the same error as @yuyangyoung, you can apply the fix I mentioned in the comment above and see whether it resolves the error.

yuyangyoung · 2018-11-17T16:00:49Z

@vinayak-mehta Thank you, fixing the error that way works. But another PDF gives the same error, so every time I want to get a table, I must first fix the PDF. Does that mean all Chinese PDFs are wrong?

vinayak-mehta · 2018-11-17T16:03:47Z

It does not mean that all Chinese PDFs are wrong. Are all your PDFs from the same source? It is probably an error in the way the PDFs are being generated at the source. If you already have all your PDFs in a directory, you can fix all of them at once using the fix mentioned above and some Unix wildcard magic.
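
For anyone not on a Unix shell (the traceback above comes from Windows), the batch repair could be sketched in Python instead. This assumes Ghostscript is available as `gs` on the PATH; `repair_directory` and the `_repaired` suffix are illustrative choices, not anything camelot provides.

```python
import subprocess
from pathlib import Path


def repair_directory(src_dir, out_dir, run=subprocess.run):
    """Run the Ghostscript fix on every *.pdf in src_dir, writing to out_dir.

    `run` is injectable so the loop can be dry-run without Ghostscript.
    Returns the output paths in sorted input order.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in sorted(src_dir.glob("*.pdf")):
        dst = out_dir / f"{src.stem}_repaired.pdf"
        # Same flags as the single-file command mentioned above.
        run(
            ["gs", "-o", str(dst), "-sDEVICE=pdfwrite",
             "-dPDFSETTINGS=/prepress", str(src)],
            check=True,
        )
        outputs.append(dst)
    return outputs
```

Writing into a separate output directory means a second run will not pick up the already repaired files as new inputs.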

yuyangyoung · 2018-11-17T16:07:33Z

@vinayak-mehta Thanks. They are all from the same source, the Chinese SEC. However, I can easily extract the tables with pdfplumber, so could something be changed in camelot?

vinayak-mehta · 2018-11-17T22:19:31Z

Oh, does pdfplumber not raise the same error? (Since both camelot and pdfplumber use pdfminer internally)

I'll take a look.

Eichee · 2018-11-18T06:45:49Z

@vinayak-mehta I have fixed all of my PDFs, and now the tables are extracted without any error. Wonderful!!

yuyangyoung · 2018-11-18T18:10:56Z

@vinayak-mehta Yes, pdfplumber does not raise the same error, and its result is better than what camelot produces from the repaired PDF.
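
For reference, extracting tables with pdfplumber might look roughly like the sketch below. It uses pdfplumber's `open`, `pages`, and `extract_tables`; the helper names `clean_table` and `extract_tables_with_pdfplumber` are hypothetical, and the default table settings are assumed to be good enough for these files.

```python
def clean_table(table):
    """Replace None cells (which pdfplumber returns for empty cells) with ''."""
    return [[cell if cell is not None else "" for cell in row] for row in table]


def extract_tables_with_pdfplumber(path):
    """Return every table pdfplumber finds, as lists of rows of strings."""
    import pdfplumber  # lazy import: only needed when actually reading a PDF

    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns one list of rows per detected table.
            for table in page.extract_tables():
                tables.append(clean_table(table))
    return tables


# Example call (requires pdfplumber to be installed):
# tables = extract_tables_with_pdfplumber("W020180518365531252048.pdf")
```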

outsiderJiumao · 2019-02-04T03:38:18Z

It seems that PyPDF2 cannot handle Chinese characters at the moment, and camelot uses PyPDF2 to extract each page from the PDF. I guess that may be the reason.

huntzhan · 2019-03-24T11:34:30Z

@outsiderJiumao Thanks for the insight. Wondering if you have come up with a solution.

mzhadigerov · 2021-08-12T09:55:14Z

After applying the fix mentioned above, I get an empty object. It could not retrieve the tables.

vinayak-mehta closed this as completed Oct 22, 2018

vinayak-mehta mentioned this issue Nov 17, 2018

Error in encoding type #205

Closed

cqluohong mentioned this issue Jul 1, 2019

Can't extract all the tables on each page #349

Closed
