Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Push and pop OutputAccumulator as IntersectTermsEnumFrames are pushed and popped #12900

Merged
merged 9 commits into from
Dec 12, 2023

Conversation

gf2121
Copy link
Contributor

@gf2121 gf2121 commented Dec 10, 2023

Relate to : #12895

@mikemccand
Copy link
Member

OK I opened #12901 to create a better BWC test revealing these read-time exceptions -- can we do that, first, and then confirm #12900 indeed fixes it, then push the test, then push the fix? I can try to work on creating that test case, but won't have much time until early tomorrow AM. If anyone else wants to crack it, feel free!

@mikemccand
Copy link
Member

I've confirmed the new (failing) BWC test from #12901 now passes with this PR. I'll review ...

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gf2121! I left a few head scratching comments ...

@@ -2245,7 +2244,6 @@ public void testFailOpenOldIndex() throws IOException {
// #12895: test on a carefully crafted 9.8.0 index (from a small contiguous subset
// of wikibigall unique terms) that shows the read-time exception of
// IntersectTermsEnum (used by WildcardQuery)
@Ignore("re-enable once we merge #12900")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay :)

@@ -89,6 +89,9 @@ final class IntersectTermsEnumFrame {

final ByteArrayDataInput bytesReader = new ByteArrayDataInput();

// Cumulative outputs so far
BytesRef[] outputPrefix;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's sort of strange to have a separate array for tracking the accumulated BytesRef outputs when this is the purpose of the outputAccumulator? Wouldn't we be able to push/pop as we push/pop the frames, into the single accumulator, so we don't need make extra arrays as we push deeper into the FST?

We could assert when we pop that the output we are popping exactly matches (.equals()) the FST.Arc's output from that frame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we be able to push/pop as we push/pop the frames, into the single accumulator, so we don't need make extra arrays as we push deeper into the FST?

I'm not sure. Looking at pushFrame, it seems like one single frame could be related to more than one FST#Arc. Maybe we need to record the number of outputs pushed into the accumulator when pushing a frame ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea to avoid the array copy. Implemented in e387ec9.

@@ -198,6 +204,7 @@ private IntersectTermsEnumFrame pushFrame(int state) throws IOException {
}

f.arc = arc;
f.outputPrefix = outputAccumulator.bytesRefs();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm shouldn't we sometimes set this outputPrefix back to null? The frames are re-used, so we could push deep into the FST, where there is an outputPrefix, then pop back out and push into a new part of the FST that does not have an outputPrefix and illegally share / use the old leftover stale outputPrefix?

Copy link
Contributor Author

@gf2121 gf2121 Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

push into a new part of the FST that does not have an outputPrefix and illegally share / use the old leftover stale outputPrefix?

I thought the outputPrefix will be always overwritten by pushFrame when we push into a new part of the FST. Did i miss something ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the outputPrefix will be always overwritten by pushFrame when we push into a new part of the FST. Did i miss something ?

Ahh OK you are right -- and we could simply rewrite to empty BytesRef array which means the same thing as null. Good.

@@ -495,7 +495,7 @@ public boolean seekExact(BytesRef target) throws IOException {
targetUpto = 0;
outputAccumulator.push(arc.nextFinalOutput());
currentFrame = pushFrame(arc, 0);
outputAccumulator.pop();
outputAccumulator.pop(arc.nextFinalOutput());
Copy link
Contributor Author

@gf2121 gf2121 Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like another potential bug : If arc.nextFinalOutput() == NO_OUTPUT, we will not add the output into the accumulator but we pop one out.

Not sure why we have not seen any bug report for this. Maybe our output format protect it implicitly? This is in SegmentTermsEnum so it could possibly affect merge. I'll dig to see if it is a real bug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! And the new check you added in pop will prevent this bug, right?

But I also think this was (luckily!?) not a bug, because this case is when the STE is not yet seek'd to any term, so we are starting at the FST entry arc / root block, and the empty string FST input must always have a final output (the "root block"), which will never be NO_OUTPUT. Are there any other places in STE where we were blindly popping like this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the other places where we "blindly" pop are fine -- we conditionalize by arc.isFinal() check, and we know FST for BlockTree terms index will always have a nextFinalOutput on final nodes since it holds the floor data for that terms block. And there there is one other place in seekCeil where we also rely on empty string having a final output, which is fine. So net/net I don't think this was a bug, but it smelled close to one!

@@ -198,9 +199,11 @@ private IntersectTermsEnumFrame pushFrame(int state) throws IOException {
}

f.arc = arc;
f.outputNum = outputAccumulator.outputCount() - initOutputCount;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK we need this because a single frame might traverse multiple FST arcs, so we need to accumulate possibly / often more than one output.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! I like the added safety of confirming on pop that we are in fact poping what we think we are (except when we pop multiple items). And it's cleaner than the first cut by simple using the output accumulator consistently.

I think we should merge this to main, 9x and 99x and let's let CI chew on it for a bit (day or so) before cutting 9.9.1?

@gf2121
Copy link
Contributor Author

gf2121 commented Dec 12, 2023

Thanks for review and great advice @mikemccand !

I think we should merge this to main, 9x and 99x and let's let CI chew on it for a bit (day or so) before cutting 9.9.1?

+1. I'll merge and backport soon.

@jpountz
Copy link
Contributor

jpountz commented Dec 12, 2023

FWIW I confirmed that this change makes the new test in #12925 pass.

@gf2121 gf2121 changed the title IntersectTermsEnum should accumulate from output prefix instead of current output Push and pop OutputAccumulator as IntersectTermsEnumFrames are pushed and popped Dec 12, 2023
@gf2121 gf2121 merged commit 05b14e2 into apache:main Dec 12, 2023
4 checks passed
asfgit pushed a commit that referenced this pull request Dec 12, 2023
asfgit pushed a commit that referenced this pull request Dec 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants