can not parse Chinese pdf document #161

Closed
yuyangyoung opened this issue Oct 20, 2018 · 14 comments

Comments

@yuyangyoung

yuyangyoung commented Oct 20, 2018

This is the file I want to parse.
W020180518365531252048.pdf
PdfReadWarning: Illegal character in Name Object [generic.py:489]
Traceback (most recent call last):
  File "D:\Python\Python36\lib\site-packages\IPython\core\interactiveshell.py", line 2961, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    tables = camelot.read_pdf("W020180518365531252048.pdf")
  File "D:\Python\Python36\lib\site-packages\camelot\io.py", line 91, in read_pdf
    tables = p.parse(flavor=flavor, **kwargs)
  File "D:\Python\Python36\lib\site-packages\camelot\handlers.py", line 141, in parse
    self._save_page(self.filename, p, tempdir)
  File "D:\Python\Python36\lib\site-packages\camelot\handlers.py", line 95, in _save_page
    layout, dim = get_page_layout(fpath)
  File "D:\Python\Python36\lib\site-packages\camelot\utils.py", line 586, in get_page_layout
    document = PDFDocument(parser)
  File "D:\Python\Python36\lib\site-packages\pdfminer.six-20170720-py3.6.egg\pdfminer\pdfdocument.py", line 566, in __init__
    xref.load(parser)
  File "D:\Python\Python36\lib\site-packages\pdfminer.six-20170720-py3.6.egg\pdfminer\pdfdocument.py", line 195, in load
    (_, obj) = parser.nextobject()
  File "D:\Python\Python36\lib\site-packages\pdfminer.six-20170720-py3.6.egg\pdfminer\psparser.py", line 606, in nextobject
    raise PSSyntaxError('Invalid dictionary construct: %r' % objs)
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'Type', /'Font', /'Subtype', /'Type0', /'BaseFont', /b"b'", /"ABCDEE+\xb7\xc2\xcb\xce'", /'Encoding', /'Identity-H', /'DescendantFonts', PDFObjRef:6, /'ToUnicode', PDFObjRef:12]

@Langdi

Langdi commented Oct 21, 2018

You should probably attach the document you were trying to parse. Camelot worked fine for me on a Chinese PDF.

@yuyangyoung
Author

W020180518365531252048.pdf
This is the file I want to parse.

@vinayak-mehta
Contributor

Thanks for the report, @yuyangyoung! I could not find a bug report for this syntax error on the pdfminer GitHub repo. Judging from the error name, my guess is that the PostScript inside the PDF has incorrect syntax. I was able to repair the file with Ghostscript and then parse tables from it; you can check out this SO answer for how to do that. Basically, do this:

$ gs -o W020180518365531252048_repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress W020180518365531252048.pdf
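
For anyone who would rather drive the repair from Python, here is a minimal sketch that wraps the same Ghostscript command and then passes the repaired copy to `camelot.read_pdf`. The helper names (`build_repair_command`, `repair_pdf`, `read_tables_after_repair`) are made up for illustration and are not part of camelot. It assumes a `gs` executable is on the PATH; on Windows the console binary is usually called `gswin64c`, which can be passed as `gs_binary`.

```python
import subprocess
from pathlib import Path


def build_repair_command(src, dst, gs_binary="gs"):
    """Return the Ghostscript argument list that rewrites `src` into `dst`."""
    return [
        gs_binary,
        "-o", str(dst),             # output file
        "-sDEVICE=pdfwrite",        # re-emit the document as a fresh PDF
        "-dPDFSETTINGS=/prepress",  # same preset as the command above
        str(src),
    ]


def repair_pdf(src, dst, gs_binary="gs"):
    """Run Ghostscript and return the path of the repaired file."""
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_repair_command(src, dst, gs_binary), check=True)
    return Path(dst)


def read_tables_after_repair(src, dst, gs_binary="gs"):
    """Repair `src`, then hand the repaired copy to camelot."""
    import camelot  # imported lazily so the helpers above work without it

    return camelot.read_pdf(str(repair_pdf(src, dst, gs_binary)))
```

Usage would look something like `read_tables_after_repair("W020180518365531252048.pdf", "W020180518365531252048_repaired.pdf")`.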

@Eichee

Eichee commented Nov 17, 2018

I have run into the same problem. Have you already solved it? Thanks!

@vinayak-mehta
Contributor

@Eichee If you are seeing the same error that @yuyangyoung reported, try the Ghostscript fix from my comment above and see whether it resolves it.

@yuyangyoung
Author

@vinayak-mehta Thank you, that fixes the error. But another PDF raises the same error, so every time I want to extract a table I have to repair the PDF first. Does that mean all Chinese PDFs are wrong?

@vinayak-mehta
Contributor

It does not mean that all Chinese PDFs are wrong. Are all your PDFs from the same source? It is probably an error in how the PDFs are generated at the source. If you already have all your PDFs in a directory, you can repair them all at once using the fix mentioned above and some Unix wildcard magic.
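
As a rough Python equivalent of that wildcard approach, the sketch below loops over every `*.pdf` in a directory and runs the Ghostscript command from earlier in the thread on each one. The function names and the `_repaired` output suffix are illustrative choices, and the `run` parameter exists only so the loop can be exercised without Ghostscript installed. It assumes `gs` is on the PATH.

```python
import subprocess
from pathlib import Path


def repaired_path(src):
    """Map `report.pdf` to `report_repaired.pdf` in the same directory."""
    src = Path(src)
    return src.with_name(src.stem + "_repaired" + src.suffix)


def repair_directory(directory, run=subprocess.run, gs_binary="gs"):
    """Repair every PDF in `directory`; return the list of files written."""
    written = []
    for src in sorted(Path(directory).glob("*.pdf")):
        if src.stem.endswith("_repaired"):
            continue  # skip output left over from an earlier run
        dst = repaired_path(src)
        run(
            [gs_binary, "-o", str(dst), "-sDEVICE=pdfwrite",
             "-dPDFSETTINGS=/prepress", str(src)],
            check=True,  # stop at the first file Ghostscript cannot process
        )
        written.append(dst)
    return written
```

Calling `repair_directory("pdfs")` would leave the originals untouched and write a `*_repaired.pdf` next to each of them.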

@yuyangyoung
Author

@vinayak-mehta Thanks. They are all from the same source, the Chinese SEC, but I can easily extract the tables with pdfplumber. Could you change something on camelot's side?

@vinayak-mehta
Contributor

Oh, does pdfplumber not raise the same error? (Since both camelot and pdfplumber use pdfminer internally)

I'll take a look.

@Eichee

Eichee commented Nov 18, 2018

@vinayak-mehta I have repaired all of my PDFs, and table extraction now runs without errors. Wonderful!!

@yuyangyoung
Author

@vinayak-mehta Yes, pdfplumber does not raise the same error, and its result is better than what camelot produces from the repaired PDF.
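
For readers weighing the two libraries, one possible way to combine them is to try camelot first and fall back to pdfplumber when parsing raises. The sketch below is only an illustration of that idea, not something proposed in this thread: `extract_with_fallback`, `camelot_tables`, and `pdfplumber_tables` are hypothetical helper names, and both libraries are assumed to be installed.

```python
def extract_with_fallback(path, extractors):
    """Try each (name, function) pair in order; return the first result."""
    errors = []
    for name, extract in extractors:
        try:
            return name, extract(path)
        except Exception as exc:  # broad on purpose: parser errors vary by library
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


def camelot_tables(path):
    """Tables from camelot (first page by default) as lists of rows."""
    import camelot  # lazy import: only needed if this extractor is used

    return [table.df.values.tolist() for table in camelot.read_pdf(path)]


def pdfplumber_tables(path):
    """Tables from every page via pdfplumber, as lists of rows."""
    import pdfplumber  # lazy import: only needed if this extractor is used

    with pdfplumber.open(path) as pdf:
        return [table for page in pdf.pages for table in page.extract_tables()]


DEFAULT_EXTRACTORS = [
    ("camelot", camelot_tables),
    ("pdfplumber", pdfplumber_tables),
]
```

`extract_with_fallback("W020180518365531252048.pdf", DEFAULT_EXTRACTORS)` would return the name of the library that succeeded together with its tables. Note that the fallback only triggers on an exception; an extractor that returns an empty result is still treated as a success.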

@outsiderJiumao

It seems that PyPDF2 cannot handle Chinese characters yet, and camelot uses PyPDF2 to extract single pages from the PDF. I guess that may be the reason.

@huntzhan

@outsiderJiumao Thanks for the insight. Have you come up with a solution?

@mzhadigerov

After applying the fix mentioned above, I get an empty object. It could not retrieve any tables.
