Returning the match indexes along with results? #3

Open
mfkp opened this issue Mar 11, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@mfkp
Contributor

mfkp commented Mar 11, 2022

Hello, first of all, thanks for publishing this, looks to be a very interesting gem.

I'm wondering if it would be possible to return the matched text (or index of matched characters) along with the results? This would be useful in cases where a fuzzy match returns results, and I would want to highlight the matching text (or just show a snippet of text around the matching text for context) in the results.

In the readme it says:

"You may have noticed that search method returns only documents ids. This is by design. The documents themselves are not stored in the index."

So maybe this is not possible, but I figured I would ask anyway:

From your example:

brother = {
  imdb_id: "tt0118767",
  type: "/crime/Russia",
  title: "Brother",
  description: "An ex-soldier with a personal honor code enters the family crime business in St. Petersburg, Russia.",
  duration: 99,
  rating: 7.9,
  release_date: Date.parse("December 12, 1997")
}

Right now, this is what the search returns:

index.search('bersonal coder', fuzzy_distance: 1)
=> ["tt0118767"]

It would be great to return something like:

index.search('bersonal coder', fuzzy_distance: 1)
=> 
[{
  "tt0118767": {
    match_ranges: [[21, 28], [36, 39]]
  }
}]

That way I could display the results in my search listing like:

An ex-soldier with a personal honor code enters the family crime business in St. Petersburg, Russia.
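
For illustration, here is a minimal sketch of how those ranges could be applied on the Ruby side. It assumes the proposed (not yet existing) match_ranges are inclusive, non-overlapping character offsets into the description, as in the example above. The `highlight` helper and the `<mark>` markers are hypothetical names picked for this sketch, not part of Tantiny.

```ruby
# Hypothetical helper: wraps each inclusive [start, finish] character range
# of `text` in opening/closing markers. Assumes ranges do not overlap.
def highlight(text, ranges, open: "<mark>", close: "</mark>")
  result = +""
  cursor = 0
  ranges.sort.each do |start, finish|
    # Copy the untouched text before the match, then the wrapped match.
    result << text[cursor...start] << open << text[start..finish] << close
    cursor = finish + 1
  end
  # Append whatever follows the last match.
  result << text[cursor..]
end

description = "An ex-soldier with a personal honor code enters the family crime business in St. Petersburg, Russia."

# Ranges copied from the proposed return value above.
puts highlight(description, [[21, 28], [36, 39]])
# An ex-soldier with a <mark>personal</mark> honor <mark>code</mark> enters the family crime business in St. Petersburg, Russia.
```

In a real view the text would also need HTML escaping before the markers are inserted; that is left out here to keep the sketch short.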

I haven't dug into the source yet to see if this is possible, but I figure you'd know the limitations better and might be able to provide some input on whether this is feasible.

@mfkp
Contributor Author

mfkp commented Mar 11, 2022

I'm guessing somewhere around here, we would need to have the ability to add STORED as an option on text fields:

tantiny/src/index.rs

Lines 70 to 71 in 0320b41

let options = TextOptions::default()
    .set_indexing_options(indexing);

https://github.com/quickwit-oss/tantivy/blob/main/src/schema/text_options.rs#L27-L32

@mfkp
Contributor Author

mfkp commented Mar 11, 2022

Here's an example I found of using highlighted snippets:

https://github.com/quickwit-oss/tantivy/blob/eca6628b3cb6dbfdc75a441889367aa1fd58c2e1/examples/snippet.rs

baygeldin added the enhancement (New feature or request) label Mar 11, 2022
@baygeldin
Owner

I haven't thought about this, but it's a valid use case. It's definitely feasible, but it would significantly change the API, so the correct approach requires more thought.

First of all, as you correctly said, for this to work the text fields should be stored. I don't want it to be the default behavior, because storing fields in the index takes space and reading stored fields is not free (as the Tantivy documentation puts it: "Reading the stored fields of a document is relatively slow. (100 microsecs)"). So this should be opt-in, maybe via a stored option for text fields.

As far as performance goes, I would also make calculating ranges optional (e.g. index.search(query, with_match_ranges: true)).

Also, note that in your example there is only one text field, but there might be more, so we need to return match_ranges for every stored text field:

index.search('bersonal coder', fuzzy_distance: 1)
=> 
[
  {
    id: "tt0118767",
    match_ranges: {
      text_field_1: [[21, 28]],
      text_field_2: [[36, 39]],
    }
  }
]

And speaking of the search method: since we need to return additional metadata with every document, it would be best to create a new entity (e.g. Tantiny::SearchResult) that would contain this metadata along with the document ids, instead of returning an arbitrary hash.
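
A rough sketch of what such an entity could look like, purely as an assumption: Tantiny::SearchResult does not exist in the gem, and the `ranges_for` helper and field names below are placeholders taken from or invented around the example above.

```ruby
module Tantiny
  # Hypothetical value object, named after the suggestion above;
  # not an existing Tantiny class as of this discussion.
  SearchResult = Struct.new(:id, :match_ranges, keyword_init: true) do
    # Ranges for one stored text field, or an empty array if the field
    # has no matches (or ranges were not requested at all).
    def ranges_for(field)
      (match_ranges || {}).fetch(field.to_sym, [])
    end
  end
end

result = Tantiny::SearchResult.new(
  id: "tt0118767",
  match_ranges: { text_field_1: [[21, 28]], text_field_2: [[36, 39]] }
)

result.id                        # => "tt0118767"
result.ranges_for(:text_field_1) # => [[21, 28]]
result.ranges_for(:title)        # => []
```

Leaving match_ranges nil when ranges were not requested would keep the cheap, ids-only path unchanged for callers that only read `id`.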

That being said, modifying the source in your fork for your specific use case shouldn't be very difficult. You would need to add the STORED index option, and modify the search method both in Rust and Ruby.
