Returning the match indexes along with results? #3

mfkp · 2022-03-11T01:17:41Z

Hello, first of all, thanks for publishing this, looks to be a very interesting gem.

I'm wondering if it would be possible to return the matched text (or index of matched characters) along with the results? This would be useful in cases where a fuzzy match returns results, and I would want to highlight the matching text (or just show a snippet of text around the matching text for context) in the results.

In the readme it says:

"You may have noticed that search method returns only documents ids. This is by design. The documents themselves are not stored in the index."

So maybe this is not possible, but I figured I would ask anyway:

From your example:

brother = {
  imdb_id: "tt0118767",
  type: "/crime/Russia",
  title: "Brother",
  description: "An ex-soldier with a personal honor code enters the family crime business in St. Petersburg, Russia.",
  duration: 99,
  rating: 7.9,
  release_date: Date.parse("December 12, 1997")
}

Right now, this is how the search returns:

index.search('bersonal coder', fuzzy_distance: 1)
=> ["tt0118767"]

It would be great to return something like:

index.search('bersonal coder', fuzzy_distance: 1)
=> 
[{
  "tt0118767": {
    match_ranges: [[21, 28], [36, 39]]
  }
}]

That way I could display the results in my search listing like:

An ex-soldier with a personal honor code enters the family crime business in St. Petersburg, Russia.

I haven't dug into the source yet to see if this is possible, but I figure you'd know the limitations better and might be able to provide some input on whether this is feasible.

mfkp · 2022-03-11T01:36:05Z

I'm guessing somewhere around here, we would need to have the ability to add STORED as an option on text fields:

tantiny/src/index.rs

Lines 70 to 71 in 0320b41

    let options = TextOptions::default()
        .set_indexing_options(indexing);

https://github.com/quickwit-oss/tantivy/blob/main/src/schema/text_options.rs#L27-L32

mfkp · 2022-03-11T04:11:53Z

Here's an example I found of using highlighted snippets:

https://github.com/quickwit-oss/tantivy/blob/eca6628b3cb6dbfdc75a441889367aa1fd58c2e1/examples/snippet.rs

baygeldin · 2022-03-11T19:28:52Z

I haven't thought about this, but it's a valid use case. It's definitely feasible, but it would significantly change the API, so the correct approach requires more thought.

First of all, as you correctly said, for this to work the text fields should be stored. I don't want that to be the default behavior, because storing fields in the index takes space and reading stored fields is not free (as the Tantivy documentation puts it: "Reading the stored fields of a document is relatively slow. (100 microsecs)"). So this should be opt-in, maybe via a stored option for text fields.

As far as performance goes, I would also make calculating the ranges optional (e.g. index.search(query, with_match_ranges: true)).

Also, note that in your example there is only one text field, but there might be more, so we need to return match_ranges for every stored text field:

index.search('bersonal coder', fuzzy_distance: 1)
=> 
[
  {
    id: "tt0118767",
    match_ranges: {
      text_field_1: [[21, 28]],
      text_field_2: [[36, 39]],
    }
  }
]

And speaking of the search method: since we need to return additional metadata with every document, it would be best to create a new entity (e.g. Tantiny::SearchResult) that would contain this metadata along with the document ids, instead of returning an arbitrary hash.
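As a rough illustration of that idea, a result entity could be a small value object holding the document id and the per-field ranges. This is only a sketch under assumptions: the gem does not define such a class today, so `SearchResult` (which would presumably live under the Tantiny namespace), `ranges_for`, and `matched_fields` are all hypothetical names, and the field names are taken from the example above.

```ruby
# Hypothetical value object; Tantiny does not currently define it.
# `match_ranges` maps a field name to a list of [first, last] offsets.
SearchResult = Struct.new(:id, :match_ranges, keyword_init: true) do
  # Ranges for one field, or an empty list if the field did not match
  # or ranges were not requested at all.
  def ranges_for(field)
    (match_ranges || {}).fetch(field, [])
  end

  # Names of the fields that contain at least one match.
  def matched_fields
    (match_ranges || {}).reject { |_field, ranges| ranges.empty? }.keys
  end
end

results = [
  SearchResult.new(
    id: "tt0118767",
    match_ranges: { text_field_1: [[21, 28]], text_field_2: [[36, 39]] }
  )
]

results.each do |result|
  puts "#{result.id}: #{result.matched_fields.join(', ')}"
end
```

Returning an object rather than a bare hash leaves room to add more metadata later without breaking callers that only read `id`.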

That being said, modifying the source in your fork for your specific use case shouldn't be very difficult. You would need to add the STORED index option and modify the search method in both Rust and Ruby.

baygeldin added the enhancement (New feature or request) label · Mar 11, 2022
