[Search] Enable results highlighting for textual fields #950

frascuchon · 2022-01-13T13:24:33Z

Find a way to return highlighting info with search results. We can exploit Elasticsearch highlighting, keeping in mind that, for text classification, highlighting across multiple fields is not trivial when the query is not a plain default-field string search.

For example, given a dataset with records including inputs.subject and inputs.body, the query:

size weight

could match terms in inputs.subject, inputs.body or both.

But if we select a field in the query:

inputs.subject: (size weight)

the highlight results could still include info for both the subject and body fields, so the highlighting information would not be entirely accurate.
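To make the two cases concrete, here is a minimal sketch of how the request body could look in the Elasticsearch query DSL. It only builds a dictionary and sends nothing. The helper name `build_search_body` is hypothetical, and the field names are the ones from the example above. `require_field_match` is a documented highlight option (default `true`) that restricts highlighting to fields holding a query match; whether it fully addresses the concern for this index mapping has not been verified here.

```python
from typing import Any, Dict, List


def build_search_body(
    query_text: str,
    highlight_fields: List[str],
    require_field_match: bool = True,
) -> Dict[str, Any]:
    """Build an Elasticsearch search body with highlighting (hypothetical helper)."""
    return {
        # query_string accepts both "size weight" and "inputs.subject: (size weight)"
        "query": {"query_string": {"query": query_text}},
        "highlight": {
            # When True, only fields that hold a query match are highlighted.
            "require_field_match": require_field_match,
            "pre_tags": ["<em>"],
            "post_tags": ["</em>"],
            # An empty dict per field means "use the default highlighter settings".
            "fields": {field: {} for field in highlight_fields},
        },
    }


# Assumed field names, taken from the example in this issue.
body = build_search_body(
    "inputs.subject: (size weight)",
    ["inputs.subject", "inputs.body"],
)
```

The resulting `body` could then be passed to a search call of whichever Elasticsearch client is in use.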

dcfidalgo · 2022-01-13T18:54:17Z

I think we could live with this "additional" highlighting info, as long as the highlights of the original query are present.

frascuchon · 2022-02-22T19:48:24Z

Since the computed highlighting won't be totally accurate, and for the sake of performance, we can return only the relevant search terms that make the record match.

What do you think, @dvsrepo @dcfidalgo?

The consumer (UI or client) could build the needed highlighting from those keywords.
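A minimal sketch of what building the highlighting from keywords could look like on the consumer side, assuming whole-word, case-insensitive matching and HTML output. The function name `highlight_keywords` and the `<mark>` tag are illustrative choices, not something taken from the actual UI code.

```python
import html
import re
from typing import List


def highlight_keywords(text: str, keywords: List[str], tag: str = "mark") -> str:
    """Wrap whole-word keyword matches in an HTML tag (hypothetical helper)."""
    if not keywords:
        return html.escape(text)
    # Longest keywords first, so "listening" is preferred over "listen".
    alternatives = "|".join(
        re.escape(k) for k in sorted(keywords, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the untouched text and the match separately, so the
        # only markup in the output is the tag added here.
        parts.append(html.escape(text[last:match.start()]))
        parts.append(f"<{tag}>{html.escape(match.group(0))}</{tag}>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Escaping each segment before adding the tags keeps any markup already present in the record text from being rendered.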

dcfidalgo · 2022-02-23T15:04:12Z

Could you maybe provide an example of a query, record, and the returned terms you have in mind? Maybe we can also have a quick call about that.

frascuchon · 2022-02-23T16:42:22Z

PR #1201 implements what I propose.

The query list* will match the following record texts:

- The listener can be implemente....
- Last listen in spotify was .....
- People is listening som.....
- Google is listed as the comp....

And the record keywords suggested in the response should be:

- ["listener"]
- ["listen"]
- ["listening"]
- ["listed"]

Ready to talk if you prefer.
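The list* example above can be reproduced with a small sketch. It assumes simple word tokenization and shell-style wildcard matching via Python's `fnmatch`; this is only an approximation for illustration, not how PR #1201 or Elasticsearch derives the terms. `search_keywords` is a hypothetical helper name.

```python
import re
from fnmatch import fnmatchcase
from typing import List


def search_keywords(query: str, text: str) -> List[str]:
    """Return the lowercased tokens of `text` that match any query term."""
    patterns = [term.lower() for term in query.split()]
    keywords: List[str] = []
    for token in re.findall(r"\w+", text.lower()):
        matches = any(fnmatchcase(token, pattern) for pattern in patterns)
        if matches and token not in keywords:  # keep first-seen order, no repeats
            keywords.append(token)
    return keywords


records = [
    "The listener can be",
    "Last listen in spotify was",
    "People is listening",
    "Google is listed as the",
]
keywords_per_record = [search_keywords("list*", text) for text in records]
```

The record texts here are shortened versions of the truncated examples above.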

dcfidalgo · 2022-02-24T10:16:36Z

Yes, let's have a discussion about this. One comment up front: I'm not sure what the workflow of our WS for token classification will be, but it might be helpful to also return the start and end character indexes of the returned keywords. I'm not sure if this is technically possible.
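On the character-index question, here is a minimal sketch of how offsets could be derived from the keywords, assuming they are matched as whole words, case-insensitively, against the raw record text. `keyword_spans` is a hypothetical helper and says nothing about what the backend can return.

```python
import re
from typing import List, Tuple


def keyword_spans(text: str, keywords: List[str]) -> List[Tuple[str, int, int]]:
    """Return (matched_text, start, end) for each whole-word keyword match.

    `end` is exclusive, so text[start:end] == matched_text.
    """
    if not keywords:
        return []
    # Longest keywords first, so longer alternatives win over their prefixes.
    alternatives = "|".join(
        re.escape(k) for k in sorted(keywords, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return [(m.group(0), m.start(), m.end()) for m in pattern.finditer(text)]


spans = keyword_spans("Google is listed as", ["listed"])
```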

* chore: include search_keywords in client records
* chore: signatures
* feat: include search_records as part of client records
* fix: add highlight on dataset scan
* test: add missing tests
* test: estabilize tests
* Apply suggestions from code review
  Co-authored-by: David Fidalgo <david@recogn.ai>
* test: try to fix push to hf hub
  Co-authored-by: David Fidalgo <david@recogn.ai>

* chore: include search_keywords in client records
* chore: signatures
* feat: include search_records as part of client records
* fix: add highlight on dataset scan
* test: add missing tests
* test: estabilize tests
* Apply suggestions from code review
  Co-authored-by: David Fidalgo <david@recogn.ai>
* test: try to fix push to hf hub
  Co-authored-by: David Fidalgo <david@recogn.ai>
(cherry picked from commit 0678043)

* feat(#950): using record search_keywords for highlighting
* feat(highlight): keywords for text-classification
* feat(highlight): keywords for text2text
* test: update snapshots
* fix: compute search keywords for old datasets, based on 'words' field

* fix(hightlight): merge adjacent terms
* refactor: apply multi-keyword search in text
* feat: merge highlighted phrases
* feat: parse keywords as entire words
* fix(highlight): include hightlight info at visual token level
* chore: lint
* fix: escape html before apply highligh span
* fix calc whitespace for highlighted entities
* fix: parse middle highlighted tokens
* refactor: highligh on inputs
* test: update tests
Co-authored-by: LeireA <leire@recogn.ai>

* feat(search): indexing keyword fields with textual info
* fix: properties -> fields
(cherry picked from commit ab80cbc)
ci: include es 8.0 in build process (#1286)
* ci: include es 8.0 in build process
* chore: wip
* fix(es-mapping): avoid nested multi-fields in mapping
* fix(search): use id for default sorting
(cherry picked from commit 2eb5276)
fix(#1286): backward comp. sorting by id (#1304)
* fix(search): backward comp. sorting by id
* fix: error normalizing sort
* chore: dockerfile
(cherry picked from commit a3b0552)

* feat(#950): using record search_keywords for highlighting
* feat(highlight): keywords for text-classification
* feat(highlight): keywords for text2text
* test: update snapshots
* fix: compute search keywords for old datasets, based on 'words' field
(cherry picked from commit 9e11933)
fix(#950): improve highlight for multi terms searches (#1278)
* fix(hightlight): merge adjacent terms
* refactor: apply multi-keyword search in text
* feat: merge highlighted phrases
* feat: parse keywords as entire words
* fix(highlight): include hightlight info at visual token level
* chore: lint
* fix: escape html before apply highligh span
* fix calc whitespace for highlighted entities
* fix: parse middle highlighted tokens
* refactor: highligh on inputs
* test: update tests
Co-authored-by: LeireA <leire@recogn.ai>
(cherry picked from commit 3a32334)

* feat(#950): using record search_keywords for highlighting
* feat(highlight): keywords for text-classification
* feat(highlight): keywords for text2text
* test: update snapshots
* fix: compute search keywords for old datasets, based on 'words' field
(cherry picked from commit 9e11933)
fix(#950): improve highlight for multi terms searches (#1278)
* fix(hightlight): merge adjacent terms
* refactor: apply multi-keyword search in text
* feat: merge highlighted phrases
* feat: parse keywords as entire words
* fix(highlight): include hightlight info at visual token level
* chore: lint
* fix: escape html before apply highligh span
* fix calc whitespace for highlighted entities
* fix: parse middle highlighted tokens
* refactor: highligh on inputs
* test: update tests
Co-authored-by: LeireA <leire@recogn.ai>
(cherry picked from commit 3a32334)
chore(#1235): fix highlight function name in explain (#1316)
Closes #1315
(cherry picked from commit 41b3321)