
Issue/issue 299 #315 (Merged)

14 commits merged into develop from issue/issue-299 on Feb 7, 2020
Conversation

jvalls-axa (Contributor):

Issue 299 Part I (PdfMiner)

This PR is Part I of #299 (Heap out of memory in PdfMiner extraction & dumppdf extraction).

Changes:

  • Use xml-stream to read the pdfminer XML output file (see the sketch after this list)
  • PDFs are extracted in batches (slots) of 500 pages
  • Allow Node objects to use up to 4 GB of memory via --max-old-space-size=4096 (temporary fix)
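
Below is a minimal sketch of the streaming approach described above, assuming the pdfminer XML output consists of <page> elements; the function name, counters, and log message are illustrative and not taken from this PR's code.

```ts
// Sketch only: streaming the pdfminer XML output with xml-stream instead of
// parsing it all at once. Names below (parsePdfminerXml, BATCH_SIZE) are
// illustrative assumptions, not the actual Parsr implementation.
import * as fs from 'fs';
// xml-stream ships without TypeScript typings, hence the plain require.
const XmlStream = require('xml-stream');

const BATCH_SIZE = 500; // pages per extraction slot, as described in this PR

function parsePdfminerXml(xmlFile: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const xml = new XmlStream(fs.createReadStream(xmlFile));
    let totalPages = 0;

    // Each fully parsed <page> triggers this callback, so only one page's
    // worth of data needs to be handled at a time instead of the whole file.
    xml.on('endElement: page', () => {
      totalPages += 1;
      if (totalPages % BATCH_SIZE === 0) {
        // Progress feedback every 500 pages, as mentioned under "Improvements".
        console.log(`Processed ${totalPages} pages so far`);
      }
    });

    xml.on('end', () => resolve(totalPages));
    xml.on('error', (err: Error) => reject(err));
  });
}
```

The temporary --max-old-space-size=4096 workaround is a standard Node.js flag that raises the old-generation heap limit to roughly 4 GB; it is passed on the command line, e.g. node --max-old-space-size=4096 <entry point>.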

Improvements:

  • Better feedback when running huge documents (logs are displayed every 500 pages)
  • Faster extraction time using PdfMnier + Xml-Stream

TODO:

  • Fix 'dumppdf.py' Heap out of memory

xml2js vs xml-stream: extraction time 17 min vs 9 min
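
For contrast, a rough sketch of the xml2js-style approach the PR moves away from (not the actual Parsr code): the entire XML file is read and parsed into a single in-memory object, which is what pushes large documents over the default Node heap limit. The file name and the result.pages.page shape are assumptions.

```ts
import * as fs from 'fs';
import { parseString } from 'xml2js';

// The whole pdfminer XML output is loaded and converted into one JavaScript
// object; for documents with thousands of pages this object alone can exceed
// the default Node.js heap, hence the out-of-memory errors in #299.
fs.readFile('pdfminer-output.xml', 'utf8', (readErr, xmlData) => {
  if (readErr) { throw readErr; }
  parseString(xmlData, (parseErr, result) => {
    if (parseErr) { throw parseErr; }
    const pages = result.pages.page; // full document held in memory here
    console.log(`Parsed ${pages.length} pages`);
  });
});
```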

The reviewed code excerpt (note the hard-coded batch size of 500 passed to extractFile):

```ts
// (diff context; the surrounding XmlStream import is truncated in this excerpt)
const startTime: number = Date.now();
const extractFont = extractImagesAndFonts(repairedPdf);
const pdfminerExtract = this.extractFile(repairedPdf, 1, 500, totalPages);
```
peter-vandenabeele-axa commented on Feb 6, 2020:
Maybe place the value of 500 as a clearly visible constant at the top of the file, or even make it a config parameter with 500 (or some other value) as the default?
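
A minimal sketch of what this suggestion could look like; the constant name, config shape, and accessor are hypothetical and not part of this PR:

```ts
// Hypothetical illustration of the reviewer's suggestion; not actual Parsr code.
// A clearly visible default at the top of the file...
const DEFAULT_PDFMINER_BATCH_SIZE = 500;

// ...optionally overridden by a config parameter (config shape is assumed).
interface ExtractorConfig {
  pdfminerBatchSize?: number;
}

function getBatchSize(config: ExtractorConfig): number {
  return config.pdfminerBatchSize !== undefined
    ? config.pdfminerBatchSize
    : DEFAULT_PDFMINER_BATCH_SIZE;
}

// The call site reviewed above would then become something like:
// const pdfminerExtract = this.extractFile(repairedPdf, 1, getBatchSize(config), totalPages);
```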

peter-vandenabeele-axa commented:

Nice :-)

  • The description of this PR has a typo: PdfMnier.
  • How was the value of 500 (the batch size for PDF parsing) determined/optimized?
    Once you have made the effort to split the work into smaller batches, I would think that taking e.g. 100 pages per batch (instead of 500) would further reduce the risk of out-of-memory?
    Maybe making it a config parameter can resolve my question (a user can then further reduce it if 500 or 100 is still too large for certain use cases?)

```ts
} else if (jsonObj.figure != null) {
  this.figure = [new PdfminerFigure(jsonObj.figure)];
}
/*this.line = jsonObj.line;
```

Review comment:
Is it useful to keep this commented-out code in the commit?

```diff
@@ -25,18 +25,26 @@ export class PdfminerPage {
   };
   public textbox: PdfminerTextbox[];
   public figure: PdfminerFigure[];
-  public line: object[];
+  /*public line: object[];
```

Review comment:
Is it useful to keep this commented-out code in the commit?

jvalls-axa (Contributor, Author) replied:

> Nice :-)
>
>   • The description of this PR has a typo: PdfMnier.
>   • How was the value of 500 (the batch size for PDF parsing) determined/optimized?
>     Once you have made the effort to split the work into smaller batches, I would think that taking e.g. 100 pages per batch (instead of 500) would further reduce the risk of out-of-memory?
>     Maybe making it a config parameter can resolve my question (a user can then further reduce it if 500 or 100 is still too large for certain use cases?)

Hi @peter-vandenabeele-axa, and thanks for your feedback!! 👍

I used 500 pages because, after a lot of tests with some PDFs, it is the largest number of pages supported without an 'out of memory' error and without committing '--max-old-space-size' here.

Our goal is to allow Parsr to run properly with huge PDFs (2K pages or more) and avoid the use of '--max-old-space-size'.

But anyway, adding a 'batch max pages' option to the configuration file makes a lot of sense!

peter-vandenabeele-axa replied:

> I used 500 pages because, after a lot of tests with some PDFs, it is the largest number of pages supported without an 'out of memory' error and without committing '--max-old-space-size' here.
>
> Our goal is to allow Parsr to run properly with huge PDFs (2K pages or more) and avoid the use of '--max-old-space-size'.
>
> But anyway, adding a 'batch max pages' option to the configuration file makes a lot of sense!

Yes, I would suggest so. I am a bit afraid that other users or use cases will pop up where the "complexity per page" is larger, so they will hit OOM at a smaller number of pages.

I vaguely remember this happening in real life, when a physical printer would drop a PDF print job at a very small number of pages (10 or so), e.g. when the PDF was full of large, detailed images that consumed many MBs per image.

marianorodriguez (Contributor) left a review:

  • The pdfminer extractor is now working with 2000+ pages.
  • Execution now crashes in TableDetectionScript. We will fix this in another PR.

@marianorodriguez marianorodriguez merged commit daa03ea into develop Feb 7, 2020
@marianorodriguez marianorodriguez deleted the issue/issue-299 branch February 7, 2020 12:10
peter-vandenabeele-axa commented: [image]