This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

Failure to parse a PDF file from https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf #118

Open
pombredanne opened this issue Aug 14, 2015 · 7 comments

Comments

@pombredanne

The file at https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf fails to parse.
I reproduced this with both the latest PyPI release and the HEAD version.
This small snippet reproduces the error:

wget https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf
python -c "from pdfminer.pdfparser import PDFParser;p=PDFParser(open('5756M-PG101-R.pdf','rb'));from pdfminer.pdfdocument import PDFDocument;PDFDocument(p)" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "[...]/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 575, in __init__
    self._initialize_password(password)
  File "[...]/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 598, in _initialize_password
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={u'EncryptMetadata': False, u'CF': {u'StdCF': {u'Length': 16, u'CFM': /V2, u'AuthEvent': /DocOpen}}, u'O': '\xc6\xa4\xb4%\xed\xda\xe8\x7f&\xd2\x97\x840y\xc7\xbe!N\xdb\xfbw\x0f\x04\xb3iZTn\n\xc3\x93c', u'Filter': /Standard, u'P': -1324, u'Length': 128, u'R': 4, u'U': '\xf3\xa1\xeb\xa5\x19\x8a\x15%\x001\x13CenHO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', u'V': 4, u'StmF': /StdCF, u'StrF': /StdCF}
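
The `param` dump in the error is the PDF's /Encrypt dictionary. To make it easier to read, here is a minimal sketch that summarises such a dictionary using the meanings the PDF specification gives to /V, /R, /Length and /CFM. It is not pdfminer's code: `describe_encryption`, `CFM_CIPHERS` and `ISSUE_PARAM` are hypothetical names, and PDF names such as /V2 are written as plain strings for simplicity.

```python
# Crypt filter methods (/CFM) defined for the standard security handler.
CFM_CIPHERS = {
    "None": "no encryption",
    "V2": "RC4",
    "AESV2": "AES-128 in CBC mode",
}


def describe_encryption(param):
    """Return a one-line summary of a PDF /Encrypt dictionary.

    `param` is a plain dict; PDF names such as /V2 are written here as
    plain strings ("V2"), which is a simplification for illustration.
    """
    version = param.get("V", 0)
    revision = param.get("R")
    key_bits = param.get("Length", 40)  # /Length defaults to 40 bits

    if version == 1:
        cipher = "RC4"
        key_bits = 40  # V=1 always uses a 40-bit key
    elif version == 2:
        cipher = "RC4"
    elif version == 4:
        # V=4 delegates the cipher choice to the crypt filter named by /StmF.
        filter_name = param.get("StmF", "Identity")
        crypt_filter = param.get("CF", {}).get(filter_name, {})
        method = crypt_filter.get("CFM", "None")
        cipher = CFM_CIPHERS.get(method, "unrecognised method %s" % method)
    else:
        cipher = "unrecognised algorithm"

    return "V=%s, R=%s: %s, %d-bit key" % (version, revision, cipher, key_bits)


# Values copied from the traceback above (binary /O and /U entries omitted).
ISSUE_PARAM = {
    "V": 4, "R": 4, "Length": 128, "P": -1324,
    "Filter": "Standard", "StmF": "StdCF", "StrF": "StdCF",
    "EncryptMetadata": False,
    "CF": {"StdCF": {"Length": 16, "CFM": "V2", "AuthEvent": "DocOpen"}},
}

print(describe_encryption(ISSUE_PARAM))
```

For this file the summary is `V=4, R=4: RC4, 128-bit key`, so the document uses the crypt-filter form (V=4) of the standard security handler with an RC4 filter.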
@pombredanne
Author

Note that on Linux using:

 wget https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf
 pdfseparate -f 1 -l 1 5756M-PG101-R.pdf  5756M-PG101-R-p1.pdf

creates a small single-page PDF that has the same issue as the full document.

pombredanne added a commit to nexB/scancode-toolkit that referenced this issue Aug 14, 2015
@pombredanne
Author

@euske any hint of where I could start to help?

pombredanne added a commit to nexB/scancode-toolkit that referenced this issue Nov 24, 2015
pombredanne added a commit to nexB/scancode-toolkit that referenced this issue Dec 7, 2015
@chid

chid commented Feb 2, 2016

@pombredanne This works now in the current version of pdfminer.

@pombredanne
Author

@chid It does not work for me on Ubuntu 14.04 LTS with Python 2.7.6.
Note that I tested with both HEAD and the PyPI release, and both still fail for me.
Which environment do you use?

(tmp)pombreda@COMPUTER:~/tmp/pdfminer$ python -c "from pdfminer.pdfparser import PDFParser;p=PDFParser(open('5756M-PG101-R.pdf','rb'));from pdfminer.pdfdocument import PDFDocument;PDFDocument(p)" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/pombreda/tmp/pdfminer/pdfminer/pdfminer/pdfdocument.py", line 575, in __init__
    self._initialize_password(password)
  File "/home/pombreda/tmp/pdfminer/pdfminer/pdfminer/pdfdocument.py", line 598, in _initialize_password
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={u'EncryptMetadata': False, u'CF': {u'StdCF': {u'Length': 16, u'CFM': /V2, u'AuthEvent': /DocOpen}}, u'O': '\xc6\xa4\xb4%\xed\xda\xe8\x7f&\xd2\x97\x840y\xc7\xbe!N\xdb\xfbw\x0f\x04\xb3iZTn\n\xc3\x93c', u'Filter': /Standard, u'P': -1324, u'Length': 128, u'R': 4, u'U': '\xf3\xa1\xeb\xa5\x19\x8a\x15%\x001\x13CenHO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', u'V': 4, u'StmF': /StdCF, u'StrF': /StdCF}
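
Since the same file appears to parse on one machine and fail on another, a side-by-side environment report may help narrow things down. This is a minimal sketch: `describe_environment` is a hypothetical helper, and checking for `Crypto` (PyCrypto) is only an assumption about an optional package that might differ between machines. Nothing in this thread confirms it is involved.

```python
import importlib
import platform


def describe_environment(modules=("pdfminer", "Crypto")):
    """Collect interpreter details and whether each named module imports.

    The default module list is an assumption: "Crypto" is included only as
    a candidate optional dependency that could differ between machines.
    """
    report = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "modules": {},
    }
    for name in modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # None marks a module that cannot be imported here.
            report["modules"][name] = None
        else:
            report["modules"][name] = getattr(
                module, "__version__", "unknown version")
    return report


if __name__ == "__main__":
    env = describe_environment()
    print("Python %s (%s) on %s" % (
        env["python"], env["implementation"], env["platform"]))
    for name, version in sorted(env["modules"].items()):
        print("%s: %s" % (name, version or "not importable"))
```

Running it on both the failing and the working machine and comparing the output would show whether the interpreter or the installed packages differ.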

@chid

chid commented Feb 4, 2016

I'm on Windows on Python 2.7.10. You could try removing the PDF protection with qpdf first.
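
The qpdf workaround could be scripted from Python along these lines. This is a sketch only: it assumes the `qpdf` binary is on the PATH, uses qpdf's `--decrypt` and `--password=` options, and the helper names and the output filename are hypothetical.

```python
import subprocess


def qpdf_decrypt_command(src, dst, password=None):
    """Build the argument list for `qpdf --decrypt`."""
    cmd = ["qpdf", "--decrypt"]
    if password is not None:
        # Only needed when the PDF requires a password to open.
        cmd.append("--password=" + password)
    cmd += [src, dst]
    return cmd


def decrypt_pdf(src, dst, password=None):
    """Write a decrypted copy of `src` to `dst` by running qpdf.

    Raises subprocess.CalledProcessError if qpdf exits with a non-zero
    status, and OSError if the qpdf binary cannot be found.
    """
    subprocess.check_call(qpdf_decrypt_command(src, dst, password))
    return dst


if __name__ == "__main__":
    # The output filename is a made-up example.
    decrypt_pdf("5756M-PG101-R.pdf", "5756M-PG101-R-decrypted.pdf")
```

The decrypted copy can then be passed to `PDFParser` in place of the original file.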

Edit: I just tried it on Raspbian and it works fine:

 sudo pip install --upgrade https://github.com/euske/pdfminer/zipball/master
 wget https://www.broadcom.com/collateral/pg/5756M-PG101-R.pdf
 python -c "from pdfminer.pdfparser import PDFParser;p=PDFParser(open('5756M-PG101-R.pdf','rb'));from pdfminer.pdfdocument import PDFDocument;PDFDocument(p)"

@pombredanne
Author

@chid Thanks, but that's very weird. For https://github.com/nexB/scancode-toolkit I cannot afford to mandate a special version of Python 2.7 on Ubuntu (it comes built in), and I support Windows, Linux, and Mac. qpdf could be an option, but it is a native tool rather than Python, which I like to avoid when possible (even though it is cross-platform).
That said, this would mean that the problem lies somewhere in the Python stdlib... Any idea where? If so, it could probably be patched fairly easily in pdfminer.

@chid

chid commented Feb 5, 2016

I might have a go at it on Ubuntu with the default Python and see if it works.
