Issue/issue 299 #315
Conversation
…d filtering elements
…sing slots of 500 pages and parse xml output using xml-stream
…when it's really needed
…n some documents lines will not be splitted into words properly
# Conflicts:
#	package-lock.json
);
const startTime: number = Date.now();
const extractFont = extractImagesAndFonts(repairedPdf);
const pdfminerExtract = this.extractFile(repairedPdf, 1, 500, totalPages);
Maybe place the value of 500 as a clearly visible constant at the top of the file, or even as a config parameter with 500 (or some other value) as the default?
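The suggestion above could look something like the following sketch. The constant and config field names (`DEFAULT_BATCH_MAX_PAGES`, `batchMaxPages`) are hypothetical, not part of the PR:

```typescript
// Hypothetical sketch of the reviewer's suggestion: lift the
// hard-coded 500 into a named constant that a config value can override.
const DEFAULT_BATCH_MAX_PAGES = 500;

interface ExtractorConfig {
  // Assumed config field; the actual Parsr config schema may differ.
  batchMaxPages?: number;
}

function resolveBatchSize(config: ExtractorConfig): number {
  // Fall back to the default when the config does not set a value.
  return config.batchMaxPages ?? DEFAULT_BATCH_MAX_PAGES;
}

// Usage:
const fromDefaults = resolveBatchSize({});
const fromConfig = resolveBatchSize({ batchMaxPages: 200 });
```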
Nice :-)
server/src/types/PdfminerPage.ts
Outdated
} else if (jsonObj.figure != null) {
  this.figure = [new PdfminerFigure(jsonObj.figure)];
}
/*this.line = jsonObj.line;
Is it useful to keep this commented-out code in the commit?
server/src/types/PdfminerPage.ts
Outdated
@@ -25,18 +25,26 @@ export class PdfminerPage {
};
public textbox: PdfminerTextbox[];
public figure: PdfminerFigure[];
public line: object[];
/*public line: object[];
Is it useful to keep this commented-out code in the commit?
Hi @peter-vandenabeele-axa and thanks for your feedback !! 👍 I used 500 pages because, after a lot of tests with various PDFs, it is the largest number of pages supported with no 'out of memory' error and without the '--max-old-space-size' flag committed here. Our goal is to allow Parsr to run properly with huge PDFs (2K pages or more) and avoid '--max-old-space-size' usage. But anyway, adding a 'batch max pages' option to the configuration file makes a lot of sense!
Yes, I would suggest so. I am a bit afraid that other users or use cases will pop up where the "complexity per page" is larger, so it will hit OOM at a smaller number of pages. I vaguely remember this in real life, when a physical printer would drop a PDF print job at a very small number of pages (10 or so), e.g. when the PDF was full of large, detailed images that consumed many MBs per image.
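The batching described in this thread (processing a document in slots of up to 500 pages) can be sketched as a small helper. This is illustrative only, not the PR's actual code; the function name `pageSlots` is made up:

```typescript
// Illustrative sketch: split a document's page range into fixed-size
// slots (e.g. 500 pages each), so each extraction call stays within
// memory limits. Pages are 1-indexed, ranges are inclusive.
function pageSlots(totalPages: number, slotSize: number): Array<[number, number]> {
  const slots: Array<[number, number]> = [];
  for (let start = 1; start <= totalPages; start += slotSize) {
    slots.push([start, Math.min(start + slotSize - 1, totalPages)]);
  }
  return slots;
}

// Usage: a 1200-page PDF with a 500-page slot size yields three slots.
const slots = pageSlots(1200, 500);
```

Each slot would then be handed to the extractor in turn, so only one slot's parsed output needs to live in memory at a time.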
- pdfminer extractor is now working with 2000+ pages
- Execution now crashes in TableDetectionScript. We will fix this in another PR
Issue 299 Part I (PdfMiner)

This PR is part I of #299 -> Heap out of memory in PdfMiner extraction & dumppdf extraction.

Changes:
- --max-old-space-size=4096 (temporary fix)

Improvements:

TODO:

xml2js vs xml-stream: 17 min vs 9 min
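The timing difference above reflects the two parsing models: xml2js builds the whole document tree in memory, while a streaming parser like xml-stream lets you handle each element as soon as it is complete, so memory stays bounded. The following is a minimal sketch of that streaming pattern using a hand-rolled generator, not the real xml-stream API:

```typescript
// Illustrative only: the streaming pattern that an event-based XML
// parser enables, shown with a toy generator. It scans incoming text
// chunks and yields each complete <page>...</page> element as soon as
// its closing tag arrives, instead of buffering the whole document.
function* pages(xmlChunks: Iterable<string>): Generator<string> {
  const closeTag = "</page>";
  let buffer = "";
  for (const chunk of xmlChunks) {
    buffer += chunk;
    let end: number;
    // Emit every complete page currently in the buffer, then keep
    // only the unfinished remainder for the next chunk.
    while ((end = buffer.indexOf(closeTag)) !== -1) {
      yield buffer.slice(0, end + closeTag.length);
      buffer = buffer.slice(end + closeTag.length);
    }
  }
}

// Usage: pages arrive split across arbitrary chunk boundaries.
const extracted = Array.from(pages(["<page>a</pa", "ge><page>b</page>"]));
```

A real implementation would use the parser's element events (and proper XML handling) rather than string search, but the memory behavior is the same: each page can be processed and discarded before the next one is parsed.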