
Issue/issue 299 #315 (Merged)

14 commits merged into develop from issue/issue-299 on Feb 7, 2020
Conversation

jvalls-axa (Contributor):

Issue 299 Part I (PdfMiner)

This PR is Part I of #299 (Heap out of memory in PdfMiner extraction & dumppdf extraction).

Changes:

  • Use xml-stream to read the pdfminer XML output file (see the sketch after this list)
  • PDFs are extracted in batches (slots) of 500 pages
  • Allow Node objects to use up to 4 GB of memory via --max-old-space-size=4096 (temporary fix)
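
Below is a minimal sketch of the streaming approach described above, assuming the pdfminer XML output consists of <page> elements; the function name, counters, and log message are illustrative and not taken from this PR's code.

```ts
// Sketch only: streaming the pdfminer XML output with xml-stream instead of
// parsing it all at once. Names below (parsePdfminerXml, BATCH_SIZE) are
// illustrative assumptions, not the actual Parsr implementation.
import * as fs from 'fs';
// xml-stream ships without TypeScript typings, hence the plain require.
const XmlStream = require('xml-stream');

const BATCH_SIZE = 500; // pages per extraction slot, as described in this PR

function parsePdfminerXml(xmlFile: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const xml = new XmlStream(fs.createReadStream(xmlFile));
    let totalPages = 0;

    // Each fully parsed <page> triggers this callback, so only one page's
    // worth of data needs to be handled at a time instead of the whole file.
    xml.on('endElement: page', () => {
      totalPages += 1;
      if (totalPages % BATCH_SIZE === 0) {
        // Progress feedback every 500 pages, as mentioned under "Improvements".
        console.log(`Processed ${totalPages} pages so far`);
      }
    });

    xml.on('end', () => resolve(totalPages));
    xml.on('error', (err: Error) => reject(err));
  });
}
```

The temporary --max-old-space-size=4096 workaround is a standard Node.js flag that raises the old-generation heap limit to roughly 4 GB; it is passed on the command line, e.g. node --max-old-space-size=4096 <entry point>.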

Improvements:

  • Better feedback when running huge documents (logs are displayed every 500 pages)
  • Faster extraction time using PdfMnier + Xml-Stream

TODO:

  • Fix 'dumppdf.py' Heap out of memory

xml2js vs xml-stream: extraction time 17 min vs 9 min
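
For contrast, a rough sketch of the xml2js-style approach the PR moves away from (not the actual Parsr code): the entire XML file is read and parsed into a single in-memory object, which is what pushes large documents over the default Node heap limit. The file name and the result.pages.page shape are assumptions.

```ts
import * as fs from 'fs';
import { parseString } from 'xml2js';

// The whole pdfminer XML output is loaded and converted into one JavaScript
// object; for documents with thousands of pages this object alone can exceed
// the default Node.js heap, hence the out-of-memory errors in #299.
fs.readFile('pdfminer-output.xml', 'utf8', (readErr, xmlData) => {
  if (readErr) { throw readErr; }
  parseString(xmlData, (parseErr, result) => {
    if (parseErr) { throw parseErr; }
    const pages = result.pages.page; // full document held in memory here
    console.log(`Parsed ${pages.length} pages`);
  });
});
```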

The reviewed code excerpt (note the hard-coded batch size of 500 passed to extractFile):

```ts
// (diff context; the surrounding XmlStream import is truncated in this excerpt)
const startTime: number = Date.now();
const extractFont = extractImagesAndFonts(repairedPdf);
const pdfminerExtract = this.extractFile(repairedPdf, 1, 500, totalPages);
```
peter-vandenabeele-axa commented on Feb 6, 2020:
Maybe place the value of 500 as a clearly visible constant at the top of the file, or even make it a config parameter with 500 (or some other value) as the default?
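
A minimal sketch of what this suggestion could look like; the constant name, config shape, and accessor are hypothetical and not part of this PR:

```ts
// Hypothetical illustration of the reviewer's suggestion; not actual Parsr code.
// A clearly visible default at the top of the file...
const DEFAULT_PDFMINER_BATCH_SIZE = 500;

// ...optionally overridden by a config parameter (config shape is assumed).
interface ExtractorConfig {
  pdfminerBatchSize?: number;
}

function getBatchSize(config: ExtractorConfig): number {
  return config.pdfminerBatchSize !== undefined
    ? config.pdfminerBatchSize
    : DEFAULT_PDFMINER_BATCH_SIZE;
}

// The call site reviewed above would then become something like:
// const pdfminerExtract = this.extractFile(repairedPdf, 1, getBatchSize(config), totalPages);
```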

peter-vandenabeele-axa commented:

Nice :-)

  • The description of this PR has a typo: PdfMnier.
  • How was the value of 500 (the batch size for PDF parsing) determined/optimized?
    Once you have made the effort to split the work into smaller batches, I would think that taking e.g. 100 pages per batch (instead of 500) would further reduce the risk of out-of-memory?
    Maybe making it a config parameter can resolve my question (a user can then further reduce it if 500 or 100 is still too large for certain use cases?)

```ts
} else if (jsonObj.figure != null) {
  this.figure = [new PdfminerFigure(jsonObj.figure)];
}
/*this.line = jsonObj.line;
```

Review comment:
Is it useful to keep this commented-out code in the commit?

```diff
@@ -25,18 +25,26 @@ export class PdfminerPage {
   };
   public textbox: PdfminerTextbox[];
   public figure: PdfminerFigure[];
-  public line: object[];
+  /*public line: object[];
```

Review comment:
Is it useful to keep this commented-out code in the commit?

jvalls-axa (Contributor, Author) replied:

> Nice :-)
>
>   • The description of this PR has a typo: PdfMnier.
>   • How was the value of 500 (the batch size for PDF parsing) determined/optimized?
>     Once you have made the effort to split the work into smaller batches, I would think that taking e.g. 100 pages per batch (instead of 500) would further reduce the risk of out-of-memory?
>     Maybe making it a config parameter can resolve my question (a user can then further reduce it if 500 or 100 is still too large for certain use cases?)

Hi @peter-vandenabeele-axa, and thanks for your feedback!! 👍

I used 500 pages because, after a lot of tests with some PDFs, it is the largest number of pages supported without an 'out of memory' error and without committing '--max-old-space-size' here.

Our goal is to allow Parsr to run properly with huge PDFs (2K pages or more) and avoid the use of '--max-old-space-size'.

But anyway, adding a 'batch max pages' option to the configuration file makes a lot of sense!

peter-vandenabeele-axa replied:

> I used 500 pages because, after a lot of tests with some PDFs, it is the largest number of pages supported without an 'out of memory' error and without committing '--max-old-space-size' here.
>
> Our goal is to allow Parsr to run properly with huge PDFs (2K pages or more) and avoid the use of '--max-old-space-size'.
>
> But anyway, adding a 'batch max pages' option to the configuration file makes a lot of sense!

Yes, I would suggest so. I am a bit afraid that other users or use cases will pop up where the "complexity per page" is larger, so they will hit OOM at a smaller number of pages.

I vaguely remember this happening in real life, when a physical printer would drop a PDF print job at a very small number of pages (10 or so), e.g. when the PDF was full of large, detailed images that consumed many MBs per image.

marianorodriguez (Contributor) left a review:

  • The pdfminer extractor is now working with 2000+ pages.
  • Execution now crashes in TableDetectionScript. We will fix this in another PR.

@marianorodriguez marianorodriguez merged commit daa03ea into develop Feb 7, 2020
@marianorodriguez marianorodriguez deleted the issue/issue-299 branch February 7, 2020 12:10
peter-vandenabeele-axa commented: [image]