Skip to content

OPENNLP-1821: Prevent OutOfMemory Due To Huge Array Allocation #1022

Merged
mawiesne merged 8 commits intoapache:mainfrom
subbudvk:subbudvk-patch-3
Apr 27, 2026
Merged

OPENNLP-1821: Prevent OutOfMemory Due To Huge Array Allocation #1022
mawiesne merged 8 commits intoapache:mainfrom
subbudvk:subbudvk-patch-3

Conversation

@subbudvk
Copy link
Copy Markdown
Contributor

Description

getOutcomes(), getOutcomePatterns(), and getPredicates() in AbstractModelReader
read a 32-bit integer from the binary stream and use it directly as an array size
with no bounds check. A malformed model file with any count field set to
Integer.MAX_VALUE causes an OutOfMemoryError at allocation time, before any
model data is validated.

Fix

Added a MAX_ENTRIES = 10_000_000 limit. All three methods now throw
InvalidFormatException if the count field is negative or exceeds the limit.

Copy link
Copy Markdown
Contributor

@rzo1 rzo1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good to me, but can we make MAX_ENTRIES configurable via a system property? In addition, it might be worth to apply the same to ModelParameterChunker.readUTF() and TwoPassDataIndexer (EventStream.read()), no?

@subbudvk
Copy link
Copy Markdown
Contributor Author

Overall, looks good to me, but can we make MAX_ENTRIES configurable via a system property? In addition, it might be worth to apply the same to ModelParameterChunker.readUTF() and TwoPassDataIndexer (EventStream.read()), no?

I thought this limit is sufficient and we don't need more customisation here. Do we need a system property?

@subbudvk
Copy link
Copy Markdown
Contributor Author

subbudvk commented Apr 26, 2026

I will add the restriction in ModelParameterChunker but TwoPassDataIndexer usage seems not user controlled as it's coming from a temp file. @rzo1

@rzo1
Copy link
Copy Markdown
Contributor

rzo1 commented Apr 26, 2026

We never know what people are doing in the wild, so a config option would be the safe way, imho.

@subbudvk
Copy link
Copy Markdown
Contributor Author

Overall, looks good to me, but can we make MAX_ENTRIES configurable via a system property? In addition, it might be worth to apply the same to ModelParameterChunker.readUTF() and TwoPassDataIndexer (EventStream.read()), no?

@rzo1 : Fixed. Please review

@subbudvk subbudvk requested a review from rzo1 April 26, 2026 15:07
@rzo1 rzo1 requested a review from mawiesne April 26, 2026 15:54
@rzo1 rzo1 changed the title Prevent OOM/DoS in AbstractModelReader OPENNLP-1821 - Prevent OutOfMemory Due To Huge Array Allocation Apr 26, 2026
@rzo1 rzo1 requested review from atarora and jzonthemtn April 26, 2026 19:11
@mawiesne mawiesne changed the title OPENNLP-1821 - Prevent OutOfMemory Due To Huge Array Allocation OPENNLP-1821: Prevent OutOfMemory Due To Huge Array Allocation Apr 26, 2026
Copy link
Copy Markdown
Contributor

@mawiesne mawiesne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx @subbudvk - I've left several comments and one (open) question.

@subbudvk subbudvk requested a review from mawiesne April 27, 2026 03:42
@rzo1
Copy link
Copy Markdown
Contributor

rzo1 commented Apr 27, 2026

I did a search for the same pattern in the morning and found HeadRules (and AncoraSpanishHeadRules) : What do you guys think about HeadRules#199 (` AncoraSpanishHeadRules#215)?

Judging from HEAD_RULES_MODEL_ENTRY_NAME in ParserModel it could be reachable from loading a .bin?

Regardless of that: Checkstyle is currently failing.

@mawiesne mawiesne added the java Pull requests that update Java code label Apr 27, 2026
@mawiesne
Copy link
Copy Markdown
Contributor

mawiesne commented Apr 27, 2026

Thx @subbudvk for the recent contributions. Really looking forward for the next round of releases with your name in the contributors list.

@mawiesne mawiesne merged commit 96a073f into apache:main Apr 27, 2026
9 checks passed
@subbudvk
Copy link
Copy Markdown
Contributor Author

Thanks @mawiesne @rzo1

Thx @subbudvk for the recent contributions. Really looking forward for the next round of releases with your name in the contributors list.

Looking forward to make more meaningful contributions!

rzo1 pushed a commit that referenced this pull request Apr 27, 2026
* Fix : Prevent OOM/DoS from Crafted Inputs

* Customizable entry code in OpenNLP

* Use Max_Entries Declared to prevent OOM

* Use correct exception in fix for OOM

(cherry picked from commit 96a073f)
@subbudvk
Copy link
Copy Markdown
Contributor Author

subbudvk commented Apr 27, 2026 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

java Pull requests that update Java code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants