Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stack overflow in RegExp for long string #12462

Merged
merged 5 commits into from
Aug 17, 2023
Merged

Conversation

slow-J
Copy link
Contributor

@slow-J slow-J commented Jul 25, 2023

Description

Removed recursion from parseUnionExp, parseInterExp and parseConcatExp methods.
This prevents hitting stack overflow errors on long string inputs. Added a unit test to demonstrate.

Closes #11537

This prevents hitting stack overflow errors on long string inputs.
Closes apache#11537
@@ -30,6 +30,7 @@
package org.apache.lucene.util.automaton;

import java.io.IOException;
import java.util.ArrayDeque;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using ArrayDeque because we expect it to be faster than a Stack?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using ArrayDeque because Stack is in forbidden apis, see message:

java.util.Stack @ Use more modern java.util.ArrayDeque as it is not 
synchronized

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stack is old and not preferred. Use Deque :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, now using Deque type.

@@ -1067,22 +1068,44 @@ private boolean check(int flag) {
}

final RegExp parseUnionExp() throws IllegalArgumentException {
Copy link
Contributor

@stefanvodita stefanvodita Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These three methods look similar. Is it worth trying to extract out their bodies to a common method?
This new method would take 3 functions as arguments, one to call in the do-while loop, one for the do-while condition, and one to call in the while loop after.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, I will try this and submit a commit with this approach.

Copy link
Contributor

@stefanvodita stefanvodita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making the changes!

final RegExp iterativeParseExp(
Supplier<RegExp> gather, BooleanSupplier stop, MakeRegexGroup reduce)
throws IllegalArgumentException {
Deque<RegExp> regExpStack = new ArrayDeque<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need stack/deque even?

If we need to further reduce call stack, then I think we need a stack that is shared across function calls and some more rewrite.
But here I don't think we need stack and do any FIFO operations? Should be just:

  1. parse all the sub component
  2. reduce them

So why not:

RegExp res = null;
do {
  RegExp e = gather.get();
  if (res == null) {
    res = e;
  } else {
    res = reduce.get(flags, res, e);
  }
while (stop.getAsBoolean());

I think this may alter the result a bit by changing it from a | (b | (c | d)) to ((a | b) | c) | d, but for union intersect and concat the affiliation shouldn't affect the correctness?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! In these cases, reduce is an associative function. If we do this, maybe we can rename it to associativeReduce or something similar to make this extra assumption obvious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion Patrick and Stefan.

As long as we do the suggested reduce.get(flags, res, e) instead of reduce.get(flags, e, res) the result should be correct.

Will rename reduce to highlight it's associativeness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new commit with the changes.

@slow-J slow-J requested a review from zhaih August 14, 2023 10:07
Copy link
Contributor

@zhaih zhaih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@zhaih zhaih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slow-J Please add an entry to CHANGES.txt, I'll merge and backport this one then :)
Thank you!

@slow-J slow-J requested a review from zhaih August 16, 2023 08:32
@zhaih zhaih merged commit fb81833 into apache:main Aug 17, 2023
4 checks passed
@zhaih
Copy link
Contributor

zhaih commented Aug 17, 2023

Merged and backported, thank you @slow-J !

jpountz added a commit to jpountz/elasticsearch that referenced this pull request Sep 6, 2023
Parsing regexps no longer raises stack overflows thanks to apache/lucene#12462.
jpountz added a commit to elastic/elasticsearch that referenced this pull request Sep 6, 2023
Parsing regexps no longer raises stack overflows thanks to apache/lucene#12462.
@slow-J slow-J deleted the idea_rebase branch December 12, 2023 19:47
reschke added a commit to reschke/jackrabbit-oak-lucene that referenced this pull request Mar 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

StackOverflow when RegExp encounters a very large string [LUCENE-10501]
4 participants