can not parse Chinese pdf document #161
Comments
You should probably include the document you were trying to parse. It worked fine for me on a Chinese PDF.
W020180518365531252048.pdf
Thanks for the report @yuyangyoung! I could not find an existing bug report for this syntax error on the pdfminer GitHub repo. Judging from the error name, my guess is that the PostScript inside the PDF has incorrect syntax. I was able to repair the file with Ghostscript and then parse tables from it; you can check out this SO answer for how to do that. Basically, run:
$ gs -o W020180518365531252048_repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress W020180518365531252048.pdf
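For anyone who would rather drive the repair from Python than from the shell, here is a minimal sketch of the same Ghostscript call. The helper names `build_gs_command` and `repair_pdf` are made up for illustration, and the sketch assumes a `gs` executable is on the PATH (on Windows the console binary is usually named differently, e.g. `gswin64c`, so adjust as needed).

```python
import subprocess
from pathlib import Path


def build_gs_command(src, dst):
    """Return the Ghostscript argv that rewrites src into dst.

    Same flags as the shell command above; "gs" on PATH is an assumption.
    """
    return [
        "gs",
        "-o", str(dst),
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        str(src),
    ]


def repair_pdf(src):
    """Rewrite src with Ghostscript and return the path of the repaired copy."""
    src = Path(src)
    dst = src.with_name(src.stem + "_repaired.pdf")
    # check=True raises CalledProcessError if gs exits with a non-zero status.
    subprocess.run(build_gs_command(src, dst), check=True)
    return dst
```

Usage would then be something like `repaired = repair_pdf("W020180518365531252048.pdf")` followed by `camelot.read_pdf(str(repaired))`.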
I have run into the same problem. Have you solved it yet? Thanks!
@Eichee If you are facing the same error that @yuyangyoung faced, you can apply the fix I mentioned in the comment above and see whether it resolves the error.
@vinayak-mehta Thank you, fixing the file that way works. But another PDF gives the same error, so every time I want to extract a table I have to repair the PDF first. Does that mean all Chinese PDFs are malformed?
It does not mean that all Chinese PDFs are wrong. Are all your PDFs from the same source? It is probably an error in the way the PDFs are generated at the source. If you already have all your PDFs in one directory, you can repair all of them at once using the fix mentioned above and some Unix wildcard magic.
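If shell wildcards are not available (the traceback in this issue is from Windows), the batch repair could be sketched in Python roughly like this. `plan_repairs`, `repair_all` and the `_repaired` suffix are illustrative choices, not anything camelot provides, and `gs` on the PATH is again an assumption.

```python
import subprocess
from pathlib import Path

SUFFIX = "_repaired"  # naming convention borrowed from the command above


def plan_repairs(directory):
    """Return (source, destination) pairs for PDFs that still need repairing."""
    pairs = []
    for src in sorted(Path(directory).glob("*.pdf")):
        if src.stem.endswith(SUFFIX):
            continue  # this file is the output of an earlier run
        dst = src.with_name(src.stem + SUFFIX + ".pdf")
        if dst.exists():
            continue  # already repaired, do not redo the work
        pairs.append((src, dst))
    return pairs


def repair_all(directory):
    """Run Ghostscript on every pending PDF and return the repaired paths."""
    repaired = []
    for src, dst in plan_repairs(directory):
        subprocess.run(
            ["gs", "-o", str(dst), "-sDEVICE=pdfwrite",
             "-dPDFSETTINGS=/prepress", str(src)],
            check=True,  # stop at the first file Ghostscript cannot rewrite
        )
        repaired.append(dst)
    return repaired
```

Keeping the planning step separate from the Ghostscript call makes it easy to print the pairs first and check what would be rewritten before touching any files.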
@vinayak-mehta Thanks. They are all from the same source, the Chinese SEC, but I can easily get the tables with pdfplumber. Could you change something in camelot so it handles them too?
Oh, does pdfplumber not raise the same error? (Both camelot and pdfplumber use pdfminer internally.) I'll take a look.
@vinayak-mehta I have repaired all of my PDFs and now there is no error when extracting tables. Wonderful!!
@vinayak-mehta Yes, pdfplumber does not raise the same error, and its result is better than what camelot produces from the repaired PDF.
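Until the root cause is tracked down, one workaround is to try camelot first and fall back to pdfplumber when it raises. A rough sketch follows; `first_successful`, `camelot_tables` and `pdfplumber_tables` are hypothetical helper names, and both libraries are imported lazily so the helper itself has no dependencies.

```python
def first_successful(path, extractors):
    """Try each (name, extractor) pair in order.

    Returns (name, result) for the first extractor that does not raise.
    """
    errors = []
    for name, extract in extractors:
        try:
            return name, extract(path)
        except Exception as exc:  # broad on purpose: parser errors vary by library
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


def camelot_tables(path):
    import camelot  # lazy import; needs camelot installed
    # read_pdf only looks at page 1 unless a pages argument is passed
    return [table.df for table in camelot.read_pdf(path)]


def pdfplumber_tables(path):
    import pdfplumber  # lazy import; needs pdfplumber installed
    with pdfplumber.open(path) as pdf:
        # one list of rows per table, collected across all pages
        return [table for page in pdf.pages for table in page.extract_tables()]
```

It could be called as `first_successful("W020180518365531252048.pdf", [("camelot", camelot_tables), ("pdfplumber", pdfplumber_tables)])`. Note that the two extractors return different shapes (DataFrames versus lists of rows), so the caller has to check which name came back.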
It seems that PyPDF2 cannot handle Chinese characters at the moment, and camelot uses PyPDF2 to extract a single page from the PDF. I guess that may be the reason.
@outsiderJiumao Thanks for the insight. Wondering if you have come up with a solution.
After applying the fix mentioned above, I get an empty object. It could not retrieve the tables.
This is the file I want to parse.
W020180518365531252048.pdf
PdfReadWarning: Illegal character in Name Object [generic.py:489]
Traceback (most recent call last):
File "D:\Python\Python36\lib\site-packages\IPython\core\interactiveshell.py", line 2961, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
tables = camelot.read_pdf("W020180518365531252048.pdf")
File "D:\Python\Python36\lib\site-packages\camelot\io.py", line 91, in read_pdf
tables = p.parse(flavor=flavor, **kwargs)
File "D:\Python\Python36\lib\site-packages\camelot\handlers.py", line 141, in parse
self._save_page(self.filename, p, tempdir)
File "D:\Python\Python36\lib\site-packages\camelot\handlers.py", line 95, in _save_page
layout, dim = get_page_layout(fpath)
File "D:\Python\Python36\lib\site-packages\camelot\utils.py", line 586, in get_page_layout
document = PDFDocument(parser)
File "D:\Python\Python36\lib\site-packages\pdfminer.six-20170720-py3.6.egg\pdfminer\pdfdocument.py", line 566, in __init__
xref.load(parser)
File "D:\Python\Python36\lib\site-packages\pdfminer.six-20170720-py3.6.egg\pdfminer\pdfdocument.py", line 195, in load
(_, obj) = parser.nextobject()
File "D:\Python\Python36\lib\site-packages\pdfminer.six-20170720-py3.6.egg\pdfminer\psparser.py", line 606, in nextobject
raise PSSyntaxError('Invalid dictionary construct: %r' % objs)
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'Type', /'Font', /'Subtype', /'Type0', /'BaseFont', /b"b'", /"ABCDEE+\xb7\xc2\xcb\xce'", /'Encoding', /'Identity-H', /'DescendantFonts', PDFObjRef:6, /'ToUnicode', PDFObjRef:12]
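A note on the `BaseFont` value in that error: the bytes after `ABCDEE+` look like a Chinese font name written raw into the PDF name object. Assuming they are GBK-encoded (a guess based on the source of the file, not something the traceback states), they can be decoded as in the sketch below. The second helper, `escape_pdf_name`, is a hypothetical illustration of the `#xx` escaping that PDF name objects use for bytes outside printable ASCII, which the generator of this file apparently did not apply.

```python
raw = b"ABCDEE+\xb7\xc2\xcb\xce"  # BaseFont bytes copied from the error message

# Split off the subset tag ("ABCDEE") from the actual font name.
prefix, _, name_bytes = raw.partition(b"+")
font_name = name_bytes.decode("gbk")  # GBK is an assumption

PDF_DELIMITERS = b"()<>[]{}/%#"


def escape_pdf_name(data):
    """Write bytes as a PDF name body, using #xx for anything not plain ASCII."""
    out = []
    for byte in data:
        if byte < 33 or byte > 126 or byte in PDF_DELIMITERS:
            out.append(f"#{byte:02X}")  # two-digit hex escape
        else:
            out.append(chr(byte))
    return "".join(out)


print(prefix.decode("ascii"), font_name)
print(escape_pdf_name(raw))
```

If the GBK guess is right, the font is 仿宋 (FangSong), and both the PyPDF2 "Illegal character in Name Object" warning and the pdfminer `PSSyntaxError` would be reactions to those unescaped bytes rather than to Chinese text as such.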