feat: Add Google RE2/J linear time regular expression as alternative to Java regex#19514

Open
vivek807 wants to merge 2 commits into apache:master from deep-bi:feature/CWE-1333-Inefficient-Regular-Expression-Complexity

Conversation

@vivek807
Contributor

Fixes #19513.

Description

Adds Google's RE2/J linear-time regular expression engine as an alternative to Java's built-in regex engine. The engine is selected with a new configuration property:

druid.regex.engine=JAVA

Supported values:

Value | Description
JAVA  | Uses Java's built-in java.util.regex.Pattern engine.
RE2J  | Uses Google's RE2/J regex engine with linear-time matching guarantees.

Default value:

druid.regex.engine=JAVA
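
As a rough illustration of how such a property value could be mapped to an engine choice, here is a minimal sketch. The `RegexEngine` enum and its `fromProperty` method are hypothetical names used only for this example, and the case-insensitive parsing is an assumption; the PR's actual classes and parsing rules may differ.

```java
import java.util.Locale;

// Hypothetical enum for illustration; not taken from the PR's code.
enum RegexEngine {
    JAVA,
    RE2J;

    /**
     * Maps a raw druid.regex.engine value to an engine.
     * A missing or blank value falls back to JAVA, the documented default.
     * Accepting lower-case input is an assumption of this sketch.
     */
    static RegexEngine fromProperty(String value) {
        if (value == null || value.trim().isEmpty()) {
            return JAVA;
        }
        try {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                "Unsupported druid.regex.engine value [" + value + "]; expected JAVA or RE2J", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromProperty(null));
        System.out.println(fromProperty("RE2J"));
    }
}
```

Rejecting unknown values up front, rather than silently falling back to the default, keeps a typo in the property from quietly disabling the safer engine.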

RE2/J engine

Setting:

druid.regex.engine=RE2J

enables the RE2/J regex engine for regex input formats used by ingestion tasks.

RE2/J helps protect against catastrophic backtracking and Regular Expression Denial of Service (ReDoS) attacks by guaranteeing that matching time grows linearly with input length.

Compatibility differences

RE2/J does not support all Java regex features.

Unsupported or partially supported features include:

  • backreferences
  • lookahead and lookbehind assertions
  • backtracking-dependent constructs such as possessive quantifiers and atomic groups

Patterns using unsupported constructs will fail during regex compilation.
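
To make the compatibility gap concrete, the sketch below uses only java.util.regex and shows two patterns that rely on constructs from the list above, plus a rewrite of the lookbehind case that sticks to plain capture groups. The class and method names are hypothetical, and the rewrite is one possible approach rather than something prescribed by this PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
final class JavaOnlyRegexFeatures {
    // Backreference: \1 must repeat exactly what group 1 captured.
    static boolean hasRepeatedWord(String s) {
        return Pattern.compile("\\b(\\w+) \\1\\b").matcher(s).find();
    }

    // Lookbehind: match digits only when preceded by "id=", without consuming "id=".
    static String idValue(String s) {
        Matcher m = Pattern.compile("(?<=id=)\\d+").matcher(s);
        return m.find() ? m.group() : null;
    }

    // Same extraction without lookbehind: consume "id=" and read capture group 1.
    static String idValuePortable(String s) {
        Matcher m = Pattern.compile("id=(\\d+)").matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(hasRepeatedWord("the the cat"));
        System.out.println(idValue("user id=42 ok"));
        System.out.println(idValuePortable("user id=42 ok"));
    }
}
```

Per the list above, the first two patterns would be expected to be rejected at compile time under RE2J, while the capture-group form avoids the unsupported constructs. Backreferences generally have no direct equivalent and need the comparison moved into application code.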

Example of catastrophic backtracking

The following Java regex may cause catastrophic backtracking:

^(.*a){20}$

against input such as:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaX

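The trailing X means no match exists, so the Java engine tries every way of splitting the a's among the 20 repetitions before it can report failure. RE2J does not backtrack, so it avoids this issue.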
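
The behavior of the JAVA engine on this pattern can be observed without hanging a JVM by capping the time spent in the matcher. The sketch below is an assumption-laden illustration, not part of this PR: `BacktrackingDemo`, `DeadlineCharSequence`, and `findWithin` are hypothetical names, and the deadline check piggybacks on the `charAt` calls that java.util.regex makes while matching.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex is used.
final class BacktrackingDemo {
    /** Thrown by the wrapper below once the time budget is used up. */
    static final class BudgetExceeded extends RuntimeException {
        BudgetExceeded() {
            super("regex time budget exceeded");
        }
    }

    /** CharSequence wrapper that aborts matching after a deadline. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The matcher reads input through charAt, so this runs on every step.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new BudgetExceeded();
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Returns the find() result, or null if the budget ran out first. */
    static Boolean findWithin(String regex, String input, long budgetMillis) {
        Pattern pattern = Pattern.compile(regex);
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        try {
            return pattern.matcher(new DeadlineCharSequence(input, deadline)).find();
        } catch (BudgetExceeded e) {
            return null;
        }
    }

    public static void main(String[] args) {
        StringBuilder input = new StringBuilder();
        for (int i = 0; i < 40; i++) {
            input.append('a');
        }
        input.append('X');
        // Prints null when the 200 ms budget is exhausted before an answer is found.
        System.out.println(findWithin("^(.*a){20}$", input.toString(), 200));
    }
}
```

A time cap like this only bounds the damage of a slow pattern; it does not make matching linear, which is the property RE2J is meant to provide.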

Performance considerations

  • JAVA supports a broader regex syntax, including backreferences and lookaround assertions.
  • RE2J provides safer and more predictable runtime characteristics.
  • For trusted internal ingestion specs, JAVA may be preferred for compatibility.
  • For externally supplied regex patterns, RE2J is recommended.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@vivek807 vivek807 force-pushed the feature/CWE-1333-Inefficient-Regular-Expression-Complexity branch 2 times, most recently from 76edf35 to 2010eeb Compare May 25, 2026 08:33
Member

@FrankChen021 FrankChen021 left a comment

Severity | Findings
P0       | 0
P1       | 0
P2       | 1
P3       | 1
Total    | 2

Reviewed 21 of 21 changed files.


This is an automated review by Codex GPT-5.5

Comment thread docs/ingestion/data-formats.md Outdated
@vivek807 vivek807 force-pushed the feature/CWE-1333-Inefficient-Regular-Expression-Complexity branch 2 times, most recently from 6fad2f1 to cdf47e9 Compare May 26, 2026 05:23
@vivek807 vivek807 requested a review from FrankChen021 May 26, 2026 06:20
@vivek807 vivek807 force-pushed the feature/CWE-1333-Inefficient-Regular-Expression-Complexity branch from cdf47e9 to 8633e33 Compare May 26, 2026 06:27
@vivek807 vivek807 force-pushed the feature/CWE-1333-Inefficient-Regular-Expression-Complexity branch from 8633e33 to f1af702 Compare May 26, 2026 06:55
@vivek807 vivek807 force-pushed the feature/CWE-1333-Inefficient-Regular-Expression-Complexity branch from f1af702 to d41d952 Compare May 26, 2026 07:14
Member

@FrankChen021 FrankChen021 left a comment

I reviewed the follow-up. The FlattenSpec heading has been restored, so no inline reply is needed.

Reviewed 24 of 24 changed files.


This is an automated review by Codex GPT-5.5

@vivek807 vivek807 requested a review from FrankChen021 May 27, 2026 03:53
Member

@FrankChen021 FrankChen021 left a comment

I reviewed the follow-up. The regex engine wiring concern is handled: the module is now present in startup injection and the ingestion-facing service paths touched by the follow-up, including indexer, middle manager, overlord/sampler, peon, and coordinator standalone wiring.

Reviewed 24 of 24 changed files.


This is an automated review by Codex GPT-5.5

Development

Successfully merging this pull request may close these issues.

CWE-1333: Inefficient Regular Expression Complexity

2 participants