Failure to parse a PDF file from https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf #118

pombredanne · 2015-08-14T11:26:40Z

The file at https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf cannot be parsed.
I verified this with both the latest PyPI version and the HEAD version.
This small snippet reproduces the error:

wget https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf
python -c "from pdfminer.pdfparser import PDFParser;p=PDFParser(open('5756M-PG101-R.pdf','rb'));from pdfminer.pdfdocument import PDFDocument;PDFDocument(p)" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "[...]/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 575, in __init__
    self._initialize_password(password)
  File "[...]/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 598, in _initialize_password
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={u'EncryptMetadata': False, u'CF': {u'StdCF': {u'Length': 16, u'CFM': /V2, u'AuthEvent': /DocOpen}}, u'O': '\xc6\xa4\xb4%\xed\xda\xe8\x7f&\xd2\x97\x840y\xc7\xbe!N\xdb\xfbw\x0f\x04\xb3iZTn\n\xc3\x93c', u'Filter': /Standard, u'P': -1324, u'Length': 128, u'R': 4, u'U': '\xf3\xa1\xeb\xa5\x19\x8a\x15%\x001\x13CenHO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', u'V': 4, u'StmF': /StdCF, u'StrF': /StdCF}
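
For readers unfamiliar with the `param` dictionary in the error message, here is a minimal sketch that summarizes what those values mean under the PDF standard security handler. The values are copied from the traceback; the helper name `describe_encryption` and the permission labels are hypothetical, chosen for illustration only, and the script does not use pdfminer at all.

```python
# Values copied from the `param` dictionary in the traceback above.
ENCRYPT_PARAMS = {
    "Filter": "Standard",
    "V": 4,
    "R": 4,
    "Length": 128,
    "P": -1324,
    "StmF": "StdCF",
    "StrF": "StdCF",
    "CF": {"StdCF": {"CFM": "V2", "Length": 16, "AuthEvent": "DocOpen"}},
    "EncryptMetadata": False,
}

# Crypt filter methods that name a cipher when V is 4.
CFM_NAMES = {"V2": "RC4", "AESV2": "AES-128 (CBC)"}

# User access permission bits in /P (1-based bit positions).
# The labels are illustrative names, not identifiers from any library.
PERMISSION_BITS = {
    3: "print",
    4: "modify",
    5: "copy",
    6: "annotate",
    9: "fill_forms",
    10: "extract_for_accessibility",
    11: "assemble",
    12: "print_high_quality",
}


def describe_encryption(params):
    """Summarize the security handler settings (hypothetical helper)."""
    filters = params.get("CF", {})
    stream_filter = filters.get(params.get("StmF"), {})
    cfm = stream_filter.get("CFM")
    # /P is a signed 32-bit integer; mask it so single bits can be tested.
    p = params["P"] & 0xFFFFFFFF
    allowed = sorted(
        name for bit, name in PERMISSION_BITS.items() if p & (1 << (bit - 1))
    )
    return {
        "version": params["V"],
        "revision": params["R"],
        "key_bits": params["Length"],
        "stream_cipher": CFM_NAMES.get(cfm, "unknown"),
        "allowed": allowed,
    }


if __name__ == "__main__":
    print(describe_encryption(ENCRYPT_PARAMS))
```

In other words, the file appears to use the version 4 (crypt filter) scheme with RC4 and a 128-bit key, which is the scheme the installed pdfminer reports as an unknown algorithm.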

pombredanne · 2015-08-14T11:32:14Z

Note that on Linux using:

 wget https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf
 pdfseparate -f 1 -l 1 5756M-PG101-R.pdf  5756M-PG101-R-p1.pdf

creates a small single-page PDF that has the same issue as the full document.

pombredanne · 2015-09-03T06:14:52Z

@euske any hint of where I could start to help?

chid · 2016-02-02T23:56:48Z

@pombredanne This now works in the current version of pdfminer.

pombredanne · 2016-02-03T09:23:17Z

@chid It does not work for me on Ubuntu 14.04 LTS with Python 2.7.6.
Note that I tested with both HEAD and PyPI, and both still fail for me.
Which environment do you use?

(tmp)pombreda@COMPUTER:~/tmp/pdfminer$ python -c "from pdfminer.pdfparser import PDFParser;p=PDFParser(open('5756M-PG101-R.pdf','rb'));from pdfminer.pdfdocument import PDFDocument;PDFDocument(p)" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/pombreda/tmp/pdfminer/pdfminer/pdfminer/pdfdocument.py", line 575, in __init__
    self._initialize_password(password)
  File "/home/pombreda/tmp/pdfminer/pdfminer/pdfminer/pdfdocument.py", line 598, in _initialize_password
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={u'EncryptMetadata': False, u'CF': {u'StdCF': {u'Length': 16, u'CFM': /V2, u'AuthEvent': /DocOpen}}, u'O': '\xc6\xa4\xb4%\xed\xda\xe8\x7f&\xd2\x97\x840y\xc7\xbe!N\xdb\xfbw\x0f\x04\xb3iZTn\n\xc3\x93c', u'Filter': /Standard, u'P': -1324, u'Length': 128, u'R': 4, u'U': '\xf3\xa1\xeb\xa5\x19\x8a\x15%\x001\x13CenHO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', u'V': 4, u'StmF': /StdCF, u'StrF': /StdCF}

chid · 2016-02-04T03:25:37Z

I'm on Windows on Python 2.7.10. You could try removing the PDF protection with qpdf first.

Edit: I just tried it on Raspbian and it works fine:

 sudo pip install --upgrade https://github.com/euske/pdfminer/zipball/master
 wget https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf
 python -c "from pdfminer.pdfparser import PDFParser;p=PDFParser(open('5756M-PG101-R.pdf','rb'));from pdfminer.pdfdocument import PDFDocument;PDFDocument(p)"
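
The qpdf suggestion above could be scripted roughly as follows. This is only a sketch: it assumes the `qpdf` command-line tool is installed and on `PATH`, and that the file opens without a user password. The function names are hypothetical; `--decrypt` is qpdf's option for writing an unencrypted copy. Note that `shutil.which` requires Python 3.3 or later.

```python
import shutil
import subprocess


def build_qpdf_decrypt_command(src, dst):
    """Return the qpdf command line that writes a decrypted copy of src to dst."""
    return ["qpdf", "--decrypt", src, dst]


def decrypt_with_qpdf(src, dst):
    """Run qpdf if it is installed.

    Returns True when qpdf ran successfully and False when qpdf is not
    on PATH. Raises subprocess.CalledProcessError on a non-zero exit status.
    """
    if shutil.which("qpdf") is None:
        return False
    subprocess.check_call(build_qpdf_decrypt_command(src, dst))
    return True


if __name__ == "__main__":
    ok = decrypt_with_qpdf("5756M-PG101-R.pdf", "5756M-PG101-R-decrypted.pdf")
    print("decrypted copy written" if ok else "qpdf not found on PATH")
```

The decrypted copy could then be passed to `PDFParser` in place of the original file.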

pombredanne · 2016-02-04T10:11:31Z

@chid Thanks, but that's very weird. For https://github.com/nexB/scancode-toolkit I cannot afford to mandate a special version of Python 2.7 on Ubuntu (it comes built in), and I support Windows, Linux, and Mac. qpdf could be an option, but it is a native tool rather than Python, which I like to avoid when possible (even though it is cross-platform).
That said, this means that the problem lies somewhere in the Python stdlib... Any idea where? If so, it could be patched fairly easily in pdfminer.
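
Since the same file behaves differently on two machines, one way to narrow this down could be to print a small environment report on each and compare the output. The sketch below uses only the standard library. Checking for the third-party `Crypto` (PyCrypto) package is an assumption that an optional crypto package might differ between the two setups; nothing in this thread confirms that. The function names are hypothetical.

```python
import platform
import sys


def module_available(name):
    """Return True if `name` can be imported in this interpreter."""
    try:
        __import__(name)
    except Exception:
        # Any failure to import counts as "not available".
        return False
    return True


def environment_report(optional_modules=("Crypto", "pdfminer")):
    """Collect details worth comparing between a working and a failing machine."""
    report = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }
    # Assumption: these modules are candidates for the difference, not a diagnosis.
    for name in optional_modules:
        report["has_" + name] = module_available(name)
    return report


if __name__ == "__main__":
    for key, value in sorted(environment_report().items()):
        print("%s: %s" % (key, value))
```

Running this on both the Ubuntu and the Raspbian or Windows setups and diffing the output would show whether the interpreters or the installed packages differ.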

chid · 2016-02-05T06:08:54Z

I might have a go at it on Ubuntu with the default Python and see if it works.

pombredanne mentioned this issue Aug 14, 2015

Scan fails on PDF file nexB/scancode-toolkit#56

Closed

pombredanne added a commit to nexB/scancode-toolkit that referenced this issue Aug 14, 2015

Workaround and test for #56 and euske/pdfminer#118

8d78940

pombredanne added a commit to nexB/scancode-toolkit that referenced this issue Nov 24, 2015

Workaround and test for #56 and euske/pdfminer#118

7c31351

pombredanne added a commit to nexB/scancode-toolkit that referenced this issue Dec 7, 2015

Revert "Workaround and test for #56 and euske/pdfminer#118"

f87d2a8

This reverts commit 7c31351.
