Dumping objects when the xref table is corrupted #221

davidtr1037 · 2018-06-05T11:16:00Z

Currently, if the xref table in the PDF is corrupted (wrong offsets), then dumppdf.py fails to extract files.

For example, running the following command:

dumppdf.py -E /tmp/ sample.pdf

We get:

Traceback (most recent call last):
  File "/usr/local/bin/dumppdf.py", line 268, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/local/bin/dumppdf.py", line 265, in main
    dumpall=dumpall, codec=codec, extractdir=extractdir)
  File "/usr/local/bin/dumppdf.py", line 191, in extractembedded
    obj = doc.getobj(objid)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfdocument.py", line 681, in getobj
    stream = stream_value(self.getobj(strmid))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfdocument.py", line 694, in getobj
    raise PDFObjectNotFound(objid)
pdfminer.pdftypes.PDFObjectNotFound: 1

Possible solution:
When computing an object's offset (in PDFXref.load), you can check whether the offset given by the xref table makes sense, for example by checking that the data at that offset starts with "<objid> 0 obj". If it does not, you can search the whole file for that pattern and use the offset where it is found instead.
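
A minimal sketch of that idea, working on the raw bytes of the file rather than on pdfminer's internals. The helper names (`object_header_at`, `find_object_offset`, `resolve_offset`) are hypothetical and not part of pdfminer. Taking the last match when several exist is an assumption based on PDF incremental updates appending newer object versions, and the pattern could in principle also match bytes inside a stream, so treat this as a heuristic:

```python
import re


def _header_pattern(objid, genno=0):
    # "<objid> <genno> obj", not preceded by another digit, so that
    # object 1 does not match inside the header of object 11.
    return re.compile(rb"(?<![0-9])%d\s+%d\s+obj\b" % (objid, genno))


def object_header_at(data, offset, objid, genno=0):
    """Return True if an object header for objid starts exactly at offset."""
    if offset < 0 or offset >= len(data):
        return False
    return _header_pattern(objid, genno).match(data, offset) is not None


def find_object_offset(data, objid, genno=0):
    """Scan the whole file for the object header; return its offset or None."""
    matches = list(_header_pattern(objid, genno).finditer(data))
    # Assumption: the last definition in the file is the current one.
    return matches[-1].start() if matches else None


def resolve_offset(data, objid, xref_offset, genno=0):
    """Trust the xref offset if it looks valid, otherwise fall back to a scan."""
    if object_header_at(data, xref_offset, objid, genno):
        return xref_offset
    return find_object_offset(data, objid, genno)
```

With something like this, a wrong xref entry would be replaced by the scanned offset, and `None` would still signal that the object really is missing from the file.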

RolandColored · 2018-06-20T14:38:18Z

Maybe it's possible to fix the corrupt PDF with a different tool beforehand. The question is whether it makes sense to start dealing with all possible edge cases, or to just assume valid PDF files.
