This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

Dumping objects when the xref table is corrupted #221

Open
davidtr1037 opened this issue Jun 5, 2018 · 1 comment

Comments

@davidtr1037

Currently, if the xref table in a PDF is corrupted (wrong offsets), dumppdf.py fails to extract the embedded files.

For example, running the following command:

dumppdf.py -E /tmp/ sample.pdf

produces this traceback:

Traceback (most recent call last):
  File "/usr/local/bin/dumppdf.py", line 268, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/local/bin/dumppdf.py", line 265, in main
    dumpall=dumpall, codec=codec, extractdir=extractdir)
  File "/usr/local/bin/dumppdf.py", line 191, in extractembedded
    obj = doc.getobj(objid)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfdocument.py", line 681, in getobj
    stream = stream_value(self.getobj(strmid))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfdocument.py", line 694, in getobj
    raise PDFObjectNotFound(objid)
pdfminer.pdftypes.PDFObjectNotFound: 1

Possible solution:
When computing an object's offset (in PDFXref.load), you could check whether the offset given by the xref table makes sense, for instance by checking whether the data at that offset starts with "<objid> 0 obj". If it does not, you could search the whole file for that pattern and use the offset where it is found instead.
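
A rough sketch of that idea in standalone Python, working on the raw file bytes rather than on pdfminer's internals. The function names (`offset_is_valid`, `recover_offset`) are made up for illustration and are not part of pdfminer. Picking the last match in the file is an assumption too, based on the fact that incremental updates append newer versions of an object at the end. The pattern could also appear by chance inside stream data, so this is a heuristic and not a robust repair.

```python
import re
from typing import Optional


def _header_pattern(objid: int, genno: int = 0) -> "re.Pattern[bytes]":
    # Matches "<objid> <genno> obj". The lookbehind stops "1 0 obj"
    # from matching inside "11 0 obj".
    return re.compile(rb"(?<![0-9])%d\s+%d\s+obj\b" % (objid, genno))


def offset_is_valid(data: bytes, offset: int, objid: int, genno: int = 0) -> bool:
    """Return True if the object header starts exactly at `offset`."""
    if not 0 <= offset < len(data):
        return False
    # Reject offsets that land in the middle of a longer number.
    if offset > 0 and data[offset - 1:offset].isdigit():
        return False
    return _header_pattern(objid, genno).match(data, offset) is not None


def recover_offset(data: bytes, xref_offset: int, objid: int,
                   genno: int = 0) -> Optional[int]:
    """Return a usable offset for the object, or None if none is found.

    Trusts the xref offset when it points at the expected header;
    otherwise scans the whole file and returns the last match
    (assumption: later definitions supersede earlier ones).
    """
    if offset_is_valid(data, xref_offset, objid, genno):
        return xref_offset
    matches = [m.start() for m in _header_pattern(objid, genno).finditer(data)]
    return matches[-1] if matches else None
```

With something like this, the xref loader would keep the table's offset when it checks out and only pay for the full-file scan when the header is not where the table claims it is.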

@RolandColored

Maybe it's possible to fix the corrupt PDF using a different tool beforehand. The question is whether it makes sense to start handling every possible edge case, or whether we should just assume valid PDF files.
