Thumbnailing Large/Complex pdfs can blow heap #23384

Closed
wezell opened this issue Nov 17, 2022 · 4 comments · Fixed by #23385
@wezell
Contributor

wezell commented Nov 17, 2022

We generate PDF thumbnails with an Apache library called PDFBox. When we thumbnail large or complex PDFs, it can overload the server and exhaust the JVM heap. This is a known issue with the library and something they have addressed in later versions.

It is easy to reproduce: when I try to thumbnail the PDF below locally, I get an OOM exception. At a minimum, we should be able to thumbnail PDFs without blowing heap, regardless of how long it takes.
Here is a fat PDF that blows up:
neuroscience2009.pdf

wezell added a commit that referenced this issue Nov 17, 2022
wezell added a commit that referenced this issue Nov 17, 2022
wezell added a commit that referenced this issue Nov 17, 2022
@wezell
Contributor Author

wezell commented Nov 17, 2022

What this does:

  • Adds a 10-slot semaphore so that only 10 Image Filtering threads run at a time. The limit can be raised as needed by setting IMAGE_GENERATION_SIMULTANEOUS_REQUESTS to a higher number.
  • Updates PDFBox to version 2.0.27
  • Limits the memory available for resizing PDFs
  • Calls setSubsamplingAllowed(true), which lets PDFBox use less memory
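
For readers who want to see the shape of the semaphore change, here is a minimal sketch of the idea using only the JDK. It is not the code from the linked pull request: the class name `ImageGenerationLimiter`, its methods, and the way the limit is passed in are all assumptions made for illustration, and the PDFBox-side settings are not shown.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper, not the actual dotCMS class: caps how many
// image-generation tasks may run at the same time.
final class ImageGenerationLimiter {

    // Assumed default, mirroring the 10 slots described in the comment above.
    static final int DEFAULT_SIMULTANEOUS_REQUESTS = 10;

    private final Semaphore slots;

    ImageGenerationLimiter(int simultaneousRequests) {
        // Fair mode hands out slots in arrival order, so no caller starves.
        this.slots = new Semaphore(simultaneousRequests, true);
    }

    // Blocks until a slot is free, runs the task, then always frees the slot.
    <T> T run(Supplier<T> task) {
        try {
            slots.acquire();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a slot", e);
        }
        try {
            return task.get();
        } finally {
            // Released even if the task throws (for example on an OOM-prone PDF).
            slots.release();
        }
    }

    // Number of slots that are currently unused.
    int availableSlots() {
        return slots.availablePermits();
    }
}
```

A caller would wrap the expensive work, for example `limiter.run(() -> renderThumbnail(file))`, where `renderThumbnail` is a hypothetical method standing in for the real thumbnail code. Extra requests wait for a slot instead of piling more rendering work onto the heap, which trades latency for stability.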

@wezell wezell added the LTS : Next and Release : 23.01 labels Nov 17, 2022
@wezell wezell linked a pull request Nov 17, 2022 that will close this issue
@jcastro-dotcms jcastro-dotcms self-assigned this Nov 29, 2022
jcastro-dotcms pushed a commit that referenced this issue Nov 29, 2022
jcastro-dotcms pushed a commit that referenced this issue Nov 29, 2022
jcastro-dotcms pushed a commit that referenced this issue Nov 29, 2022
fmontes pushed a commit that referenced this issue Nov 30, 2022
* #23384 stops oom with pdfs

* #23384 sonarqube cleanup

* #23384 remove unneeded class
@fmontes fmontes reopened this Nov 30, 2022
@yolabingo
Contributor

yolabingo commented Dec 6, 2022

Related cloud support ticket: https://dotcms.zendesk.com/agent/tickets/109178

@rjvelazco
Contributor

Passed Internal QA

Docker Image: 23.01_a953eb1e_SNAPSHOT

Note For QA

It may take a while; it took 30 minutes before I could see the thumbnail.

@bryanboza
Member

Fixed. Tested locally: it takes about 10 to 15 minutes to show the thumbnail, which is not great, but it no longer kills the server, so we are OK for now.

Passed QA. Tested on release-23.01 // Docker // FF

@fmontes fmontes closed this as completed Dec 26, 2022
erickgonzalez added a commit that referenced this issue Jan 30, 2023
@erickgonzalez erickgonzalez added the Release : 21.06.13 label Feb 1, 2023
erickgonzalez added a commit that referenced this issue Mar 17, 2023
@erickgonzalez erickgonzalez added Release : 22.03.5 and removed Next LTS Release labels Mar 17, 2023
@erickgonzalez erickgonzalez removed the LTS : Next label Apr 3, 2023