
Use fuzzy search for collections search #3092

Merged Jun 19, 2023 (3 commits)


tillprochaska
Contributor

The collections search currently returns exact matches only. Users, however, expect the search to be fuzzy by default, e.g. "Romania" should also match "Romanian". By changing the query type from query_string to multi_match, we can enable fuzzy search by default.

The advantage of using fuzzy search is that it is applied at search time, i.e. we do not need to change the index configuration or re-index contents.
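To illustrate the change in query type, here is a minimal sketch of the two query bodies as plain Python dicts. The helper names and the field list (`label`, `summary`) are hypothetical and not taken from the PR; `fuzziness` is a standard `multi_match` parameter, but the value actually used in the PR is not shown here, so `"AUTO"` is only a placeholder.

```python
from typing import Any, Dict, List

# Hypothetical field list, for illustration only.
SEARCH_FIELDS: List[str] = ["label", "summary"]


def build_exact_query(text: str) -> Dict[str, Any]:
    """Previous style: query_string, which matches terms exactly by default."""
    return {"query_string": {"query": text, "fields": SEARCH_FIELDS}}


def build_fuzzy_query(text: str, fuzziness: str = "AUTO") -> Dict[str, Any]:
    """New style: multi_match with fuzziness applied at search time."""
    return {
        "multi_match": {
            "query": text,
            "fields": SEARCH_FIELDS,
            # Evaluated per query, so no index or mapping change is needed.
            "fuzziness": fuzziness,
        }
    }


query = build_fuzzy_query("Romania")
```

Because the fuzziness lives entirely in the query body, switching between the two builders would not require touching the index.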

The main disadvantage is that fuzzy search doesn't cover all the variations that proper stemming would. For example, the following variations are not matched:

  • Contracts/contracting
  • Company/companies
  • Malta/Maltese
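A rough way to see why these pairs are missed: fuzzy matching allows at most two edits per term, and each pair above is three edits apart. The sketch below uses plain Levenshtein distance on lowercased terms as an approximation (an assumption; the search engine's own distance measure also counts transpositions as a single edit, which makes no difference for these pairs).

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character inserts, deletes and substitutions
    needed to turn `a` into `b` (classic dynamic-programming version)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


pairs = [
    ("romania", "romanian"),
    ("contracts", "contracting"),
    ("company", "companies"),
    ("malta", "maltese"),
]
distances = {pair: levenshtein(*pair) for pair in pairs}
for (left, right), dist in distances.items():
    print(f"{left} / {right}: {dist}")
```

"romania" and "romanian" differ by a single edit and so fall within the limit, while the other three pairs each need three edits.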

Also, fuzzy search cannot be combined with prefix search (for example "Open" does not match "OpenContracting").

Fix #2109

This will allow the following edit distances based on term length:

0..2: must match exactly
3: one edit allowed
>3: two edits allowed
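The thresholds above can be written out as a small function. This is only an illustration of the rule as stated; `allowed_edits` is a hypothetical helper, not code from the PR. As far as I can tell, these thresholds are what a fuzziness setting of `AUTO:3,4` would produce, but the exact value used is not shown here.

```python
def allowed_edits(term: str) -> int:
    """Maximum number of edits tolerated for a term of the given length,
    following the thresholds listed above."""
    length = len(term)
    if length <= 2:
        return 0  # 0..2 characters: must match exactly
    if length == 3:
        return 1  # exactly 3 characters: one edit allowed
    return 2      # more than 3 characters: two edits allowed


examples = {term: allowed_edits(term) for term in ["eu", "tax", "open", "romania"]}
```

Very short terms stay exact, which avoids flooding results with near-matches for two-letter queries.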
@tillprochaska tillprochaska linked an issue May 26, 2023 that may be closed by this pull request
Contributor

@stchris stchris left a comment


Nifty change. Thanks for the exhaustive description of what the code does and the test coverage 👍

@Rosencrantz Rosencrantz merged commit e9884e7 into develop Jun 19, 2023
1 check passed
Development

Successfully merging this pull request may close these issues.

Improve dataset search
3 participants