[Search] Enable results highlighting for textual fields #950

Closed
frascuchon opened this issue Jan 13, 2022 · 5 comments · Fixed by #1235
Labels
area: server Indicates that an issue or pull request is related to the server type: enhancement Indicates new feature requests
Comments

@frascuchon
Member

Find a way to return highlighting info with search results. We can exploit Elasticsearch highlighting, keeping in mind that, for text classification, highlighting across multiple fields is not trivial when the query is not a default string search.

For example, given a dataset with records including inputs.subject and inputs.body, the query:

size weight

could match terms in inputs.subject, inputs.body or both.

But, if we select a field in query:

inputs.subject: (size weight)

the highlight results could still include info for both fields, subject and body, so the highlighting information would not be fully accurate.
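To make the two cases concrete, here is a minimal sketch of the request bodies involved. It only builds plain dictionaries using the standard Elasticsearch `query_string` query and `highlight` section; the helper name `build_search_body` and the `inputs.*` field pattern are assumptions for illustration, not the project's actual code. Whether `require_field_match` is enough to avoid the extra highlights depends on the index mapping, so treat it as something to try rather than a confirmed fix.

```python
from typing import Any, Dict, Sequence


def build_search_body(
    query_text: str,
    highlight_fields: Sequence[str] = ("inputs.*",),  # assumed field pattern
    require_field_match: bool = True,
) -> Dict[str, Any]:
    """Build a search body with a query_string query and a highlight section.

    Hypothetical helper: it only assembles the JSON-like dict, it does not
    talk to Elasticsearch.
    """
    return {
        "query": {"query_string": {"query": query_text}},
        "highlight": {
            # When true, only fields that hold a query match are highlighted.
            "require_field_match": require_field_match,
            "fields": {name: {} for name in highlight_fields},
        },
    }


# Default string search: terms may match in subject, body, or both.
default_body = build_search_body("size weight")

# Field-scoped search: only inputs.subject is queried.
scoped_body = build_search_body("inputs.subject:(size weight)")
```
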

@frascuchon frascuchon added type: enhancement Indicates new feature requests app area: server Indicates that an issue or pull request is related to the server labels Jan 13, 2022
@frascuchon frascuchon added this to To do in Release via automation Jan 13, 2022
@dcfidalgo
Contributor

I think we could live with this "additional" highlighting info, as long as the highlights of the original query are present.

@frascuchon frascuchon assigned leiyre and unassigned dvsrepo and dcfidalgo Jan 17, 2022
@frascuchon
Member Author

Since the computed highlighting won't be totally accurate, and for the sake of performance, we can return only the relevant search terms that make the record match.

What do you think, @dvsrepo @dcfidalgo?

The client side (UI and client) could build the needed highlighting from those keywords.
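As a rough illustration of what building the highlighting on the client side could look like, the sketch below wraps whole-word keyword matches in a span and escapes everything else as HTML. The function name `highlight_keywords` and the `highlight` CSS class are assumptions made up for this example; the real UI code may work quite differently.

```python
import html
import re
from typing import Iterable


def highlight_keywords(text: str, keywords: Iterable[str], css_class: str = "highlight") -> str:
    """Return HTML where whole-word keyword matches are wrapped in a span.

    Hypothetical helper: every piece of the original text is HTML-escaped,
    so only the inserted span tags are raw markup.
    """
    words = [k for k in keywords if k]
    if not words:
        return html.escape(text)
    # Longest keywords first so the alternation prefers the longest match.
    words.sort(key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    parts = []
    last = 0
    for match in pattern.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        parts.append(f'<span class="{css_class}">{html.escape(match.group(0))}</span>')
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


rendered = highlight_keywords("Size & weight <b>", ["size", "weight"])
```

Escaping before inserting the span keeps record text such as `<b>` from being interpreted as markup once the result is rendered.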

@dcfidalgo
Contributor

Could you maybe provide an example of a query, record, and the returned terms you have in mind? Maybe we can also have a quick call about that.

@frascuchon
Member Author

frascuchon commented Feb 23, 2022

PR #1201 implements what I propose.

The query list* will match the following record texts:

- The listener can be implemente....
- Last listen in spotify was .....
- People is listening som.....
- Google is listed as the comp....

And the record keywords returned in the response should be:

- ["listener"]
- ["listen"]
- ["listening"]
- ["listed"]
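The mapping from query to keywords in the example above can be mimicked locally with a small sketch. It treats `*` as "any run of word characters" and collects the distinct words that match; this is only an approximation for illustration, since the server derives keywords from the search engine rather than from a regular expression, and the name `matching_keywords` is hypothetical.

```python
import re
from typing import List


def matching_keywords(text: str, wildcard_query: str) -> List[str]:
    """Return distinct lowercase words in `text` that match a wildcard term.

    Hypothetical helper: `*` stands for zero or more word characters.
    """
    # Escape the term, then turn the escaped "*" back into a word wildcard.
    body = re.escape(wildcard_query).replace(r"\*", r"\w*")
    pattern = re.compile(r"\b" + body + r"\b", re.IGNORECASE)
    found: List[str] = []
    for match in pattern.finditer(text):
        word = match.group(0).lower()
        if word not in found:
            found.append(word)
    return found


records = [
    "The listener can be implemente....",
    "Last listen in spotify was .....",
    "People is listening som.....",
    "Google is listed as the comp....",
]
keywords_per_record = [matching_keywords(text, "list*") for text in records]
```
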

Ready to talk if you prefer.

@dcfidalgo
Contributor

Yes, let's have a discussion about this. One comment in advance: I'm not sure what the workflow of our WS for token classification will be, but it might be helpful to also return the start and end character indexes of the returned keywords. I'm not sure whether this is technically possible.
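For reference, if only the keywords are returned, a client can still recover character offsets itself. The sketch below is one possible way to do that, assuming whole-word, case-insensitive matching; the function name `keyword_spans` is hypothetical and this says nothing about whether the server can return offsets directly.

```python
import re
from typing import Iterable, List, Tuple


def keyword_spans(text: str, keywords: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (matched_text, start, end) for each whole-word keyword match.

    Hypothetical helper: `end` is exclusive, so text[start:end] == matched_text.
    """
    spans: List[Tuple[str, int, int]] = []
    for keyword in keywords:
        if not keyword:
            continue
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            spans.append((match.group(0), match.start(), match.end()))
    # Order by position in the text so spans can be consumed left to right.
    return sorted(spans, key=lambda span: span[1])


spans = keyword_spans("Google is listed as the comp....", ["listed"])
```
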

frascuchon added a commit that referenced this issue Feb 25, 2022
* chore: include search_keywords in client records

* chore: signatures

* feat: include search_records as part of client records

* fix: add highlight on dataset scan

* test: add missing tests

* test: estabilize tests

* Apply suggestions from code review

Co-authored-by: David Fidalgo <david@recogn.ai>

* test: try to fix push to hf hub

Co-authored-by: David Fidalgo <david@recogn.ai>
frascuchon added a commit that referenced this issue Mar 4, 2022
* chore: include search_keywords in client records

* chore: signatures

* feat: include search_records as part of client records

* fix: add highlight on dataset scan

* test: add missing tests

* test: estabilize tests

* Apply suggestions from code review

Co-authored-by: David Fidalgo <david@recogn.ai>

* test: try to fix push to hf hub

Co-authored-by: David Fidalgo <david@recogn.ai>

(cherry picked from commit 0678043)
Release automation moved this from In progress to Ready to DEV QA Mar 16, 2022
frascuchon added a commit that referenced this issue Mar 16, 2022
* feat(#950): using record search_keywords for highlighting

* feat(highlight): keywords for text-classification

* feat(highlight): keywords for text2text

* test: update snapshots

* fix: compute search keywords for old datasets, based on 'words' field
frascuchon added a commit that referenced this issue Mar 24, 2022
* fix(hightlight): merge adjacent terms

* refactor: apply multi-keyword search in text

* feat: merge highlighted phrases

* feat: parse keywords as entire words

* fix(highlight): include hightlight info at visual token level

* chore: lint

* fix: escape html before apply highligh span

* fix calc whitespace for highlighted entities

* fix: parse middle highlighted tokens

* refactor: highligh on inputs

* test: update tests

Co-authored-by: LeireA <leire@recogn.ai>
@frascuchon frascuchon moved this from Ready to DEV QA to Ready to Release QA in Release Mar 25, 2022
frascuchon added a commit that referenced this issue Mar 25, 2022
* feat(search): indexing keyword fields with textual info

* fix: properties -> fields

(cherry picked from commit ab80cbc)

ci: include es 8.0 in build process (#1286)

* ci: include es 8.0 in build process

* chore: wip

* fix(es-mapping): avoid nested multi-fields in mapping

* fix(search): use id for default sorting

(cherry picked from commit 2eb5276)

fix(#1286): backward comp. sorting by id (#1304)

* fix(search): backward comp. sorting by id

* fix: error normalizing sort

* chore: dockerfile

(cherry picked from commit a3b0552)
frascuchon added a commit that referenced this issue Mar 25, 2022
* feat(#950): using record search_keywords for highlighting

* feat(highlight): keywords for text-classification

* feat(highlight): keywords for text2text

* test: update snapshots

* fix: compute search keywords for old datasets, based on 'words' field

(cherry picked from commit 9e11933)

fix(#950):  improve highlight for multi terms searches (#1278)

* fix(hightlight): merge adjacent terms

* refactor: apply multi-keyword search in text

* feat: merge highlighted phrases

* feat: parse keywords as entire words

* fix(highlight): include hightlight info at visual token level

* chore: lint

* fix: escape html before apply highligh span

* fix calc whitespace for highlighted entities

* fix: parse middle highlighted tokens

* refactor: highligh on inputs

* test: update tests

Co-authored-by: LeireA <leire@recogn.ai>
(cherry picked from commit 3a32334)

chore(#1235): fix highlight function name in explain (#1316)

Closes #1315

(cherry picked from commit 41b3321)
@frascuchon frascuchon moved this from Ready to Release QA to Approved Release QA in Release Mar 28, 2022

4 participants