Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9634: Fix highlighting of extended intervals matched using offset #16

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

zacharymorn
Copy link
Contributor

This PR is currently in draft state

Description

Fix highlighting of extended intervals matched using offset

Proposed Solution

In OffsetsFromMatchIterator, retrieves ExtendedIntervalsSource and its before / after position values, and use them to adjust highlight offset range matched by offset

Tests

  • gradle precommit currently failing for nocommit comment
  • passed previously failed test
  • passed new tests

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

@@ -250,6 +250,10 @@ public MatchesIterator getSubMatches() throws IOException {

@Override
public Query getQuery() {
return queue.top().getQuery();
if (queue.size() > 0) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is needed to fix two other tests that would throw NPE from here when matchesIterator.getQuery() is called in OffsetsFromMatchIterator

Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was away for the weekend -- sorry, Zach. I'm not sure if I'm a fan of reanalyzing the token stream... This would be always my last-resort option, to be honest.

Maybe we could somehow augment the matches API so that those "correct" before/after offsets are returned or somehow provided instead? @romseygeek would be an authoritative source to ask here, I'm just a humble user of this awesome function - Alan knows the internals.

@zacharymorn
Copy link
Contributor Author

No problem Dawid and thanks for the feedback! I agree that re-analyzing the token stream does seems to be wasteful if the index already has position and offset information. I think the central issue here is that the current Intervals#extend API here only takes before / after values based on position instead of offset. But on the other hand, additional API that takes in / provides offset based before / after values may not be very meaningful to most client applications (I think), as offset varies based on each token length. So highlighting offset range computed based on non-positional APIs may always require adjustment here if the before / after values are provided based on position (this may also include OffsetsFromTokens and OffsetsFromTokens strategies as well).

* @param before how many positions to extend before the delegated interval
* @param after how many positions to extend after the delegated interval

However, I'm also wondering if the position to / from offset mappings were already computed / stored for other purposes and readily available in OffsetRetrievalStrategy classes? If so, we could skip the re-analysis as well.

Alan, could you please let me know if you have any pointer here? I'm happy to try out different approaches to find the best solution here.

@zacharymorn
Copy link
Contributor Author

Hi @romseygeek, just want to circle back to this PR to see if you could provide any guidance here?

Copy link

github-actions bot commented Jan 9, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants