PDFDocument is slow, any way to speed it up? #283

typhoon71 · 2020-01-06T22:35:45Z

I'm currently doing this:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

fp = open(pdf_fpath, 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)

With the pdfs I'm working with, PDFDocument is really too slow. Is there any way to speed it up?

I often see the suggestion of "giving -n option which turns off automatic layout analysis", but I have no idea how to do that from Python.

I "only" need to read xrefs and objects/streams, I don't care how the pdf would be rendered/pages.
Can anyone help please? Thanks.

typhoon71 · 2020-01-07T18:26:07Z

From that example:

parser = PDFParser(in_file)
doc = PDFDocument(parser)

was exactly where I timed the slowdown.

 doc = PDFDocument(parser)

is time consuming, like 99%.

If I understand your suggestion, I should make my own version of PDFDocument?
One that does not analyze the layout?

No time for it right now, but maybe I could in the future.

Actually I was hoping for some kind of switch for PDFDocument... but I found nothing.

pietermarsman · 2020-01-07T20:56:10Z

I'm sorry. My comment was short, not clear and partially wrong. I've deleted it.

Layout analysis is not used during the parsing of the PDF. It is only used when you interpret the document, e.g. with TextConverter or HTMLConverter.

Could you share an example PDF that PDFDocument is really slow on?

typhoon71 · 2020-01-09T20:54:57Z

Sadly I can't share those PDFs.
The thing is that once I got PDFDocument initialized, everything else is really fast.
Maybe it's decompressing the PDF; that could be what is slowing it down.
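
When the files can't be shared, one way to test a guess like this is to profile the constructor call and look at which functions dominate. The following is only a rough sketch using the standard library's cProfile; `profile_call` is a hypothetical helper name, not part of pdfminer, and the commented usage assumes the same `pdf_fpath` variable as in the first post.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is a text table of the
    `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Collect the report in a string instead of printing to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Intended use with the snippet from the first post (needs pdfminer.six):
#
#     from pdfminer.pdfparser import PDFParser
#     from pdfminer.pdfdocument import PDFDocument
#
#     with open(pdf_fpath, "rb") as fp:
#         doc, report = profile_call(PDFDocument, PDFParser(fp))
#     print(report)
```

If decompression calls sit near the top of the report, that would support the decompression guess; if other functions dominate, the report at least gives function names that can be posted here without sharing the PDFs.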

hiDaDeng · 2021-12-11T03:14:37Z

import fitz  # PyMuPDF

filepath = "C:\\user\\docs\\aPDFfile.pdf"

text = ''
with fitz.open(filepath) as doc:
    for page in doc:
        text += page.get_text()
print(text)
