Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search revamp #43

Merged
merged 2 commits into from
Oct 7, 2021
Merged

Search revamp #43

merged 2 commits into from
Oct 7, 2021

Conversation

dkunitsk
Copy link

Signed-off-by: Dmitriy Kunitskiy dkunitskiy@lyft.com

Signed-off-by: Dmitriy Kunitskiy <dkunitskiy@lyft.com>
@dkunitsk dkunitsk requested a review from a team as a code owner September 22, 2021 01:48
Signed-off-by: Dmitriy Kunitskiy <dkunitskiy@lyft.com>
## Summary

Revamp search with a cleaner backend implementation, more robust queries, better customizability, and
making use of modern v7 ES features.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you considered moving directly to opensearch (fork of ES from AWS) instead of ES?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do know it exists but I didn't give it much consideration. My worry was that it'd be too new/different from ES and harder to find resources for, especially documentation which is very good for classic ES. And also I didn't want to accidentally write anything opensearch-specific. But perhaps there's a lot that I'm misunderstanding.

whereas for multi-valued fields we'd give users an options to choose AND or OR
(a user would choose AND/OR to apply the filters "product_id,user_id" to the "column" field)

#### Negation filtering
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we builds the new search UI, one constraint we faced is that it supports other search proxy (e.g atlas) which doesn't support all the primitives of a lucene based search system does like ES. I wonder whether we should just remove that atlas search proxy. WDYT @mgorsk1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are definitely not looking back to using Atlas Search API but there are other Atlas users which might be impacted by removing search proxy.

Arguments for it would be simplicity of Amundsen implementation and aligned search experience but it might hurt the simplicity of Amundsen and Atlas deployment if some users are satisfied with Atlas search and don't want to go into trouble of setting up ES & running databuilder search sync.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a million percent in favor of removing. I'll run some informal polling first, of course, but I think the burden of maintaining it probably far exceeds the combined user burden (if any) of spinning up elasticsearch.

This is not to mention that the current Atlas proxy supports only a fraction of what's supported by the ES proxy. No user/dashboard/feature search, for starters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are fine removing it as long as we provide some transition period so users have comfortable time for switching into new approach. We'd promote the approach that is already outlined in docs (using databuilder search extractor).

As above, we want to avoid going down the rabbit hole of full-blown logical expressions. Likely, we'd implement
this on the frontend as a separate set of filter boxes below the current set. Stay tuned for designs.

#### Type-ahead suggestions for filters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ES v7 provides search as you type etc feature which could support type-ahead.

It also uses a different (and incorrect) elasticsearch [query_string](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html) command.
We plan to unify the implementations, so both use the same top-level elasticsearch command either with or without an additional filter specifier.

#### Replace JSON in queries/mappings with Elasticsearch library objects
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense.

- exact matches with spelling mistakes (e.g. "ordered_iems" instead of "ordered_items")
- analyzed matches where a user inputs keywords (e.g. "returns") and looks to find the most relevant entities

Elasticsearch uses [compound queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/compound-queries.html) to combine multiple subqueries together. The most common is [Boolean](https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-bool-query.html).
Copy link
Contributor

@mgorsk1 mgorsk1 Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, we changed search to use boolean queries in elasticsearch. Other benefit, not mentioned in rfc, is that if we move filters (advanced search) to filter part of such query it will be cached by ES (if one user searches for pii tag using filters subsequent searches with just this filter are not hitting es storage but ram which makes things quicker)

Here, we’ll list a handful of optimization directions and parameters we’ll experiment with:

##### Fuzziness
Allows for misspellings by specifying a maximum edit distance. See [docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a nit but I think match query with fuzziness parameters would be a better fit (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html). This fuzzy query comes from term queries that operate on keyword type which is not analyzed and shouldn't be used for full-text (keyword is also case-sensitive which should be irrelevant for us).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're right. I was not very clear on the difference between these two so thank you.

Allows for misspellings by specifying a maximum edit distance. See [docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html).

##### Synonyms
This will require tuning and validation, but some users may get significant mileage from a hand-generated synonyms dictionary file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, we also use synonyms for Amundsen indices. Apart from generic synonym definitions users might introduce company-specific lingo. In our case users know the same data sources by different names - some of them are technical, some legacy, etc and we introduced synonyms to facilitate this.

In case someone searches for a phrase out of an entity description, this subquery will take word order into consideration. [Docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html).

##### Index Fields in More Ways
Experimenting with multiple analyzers (i.e. independently analyzing an input field via multiple analyzers) has potential benefits like normalizing stemmings and filtering stop words.
Copy link
Contributor

@mgorsk1 mgorsk1 Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, synonyms can also be one of multi fields. Stop words is something that would make sense for (for example) description fields.


These include (at least):
- exact matches
- exact matches with spelling mistakes (e.g. "ordered_iems" instead of "ordered_items")
Copy link
Contributor

@mgorsk1 mgorsk1 Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a nit but we might not really need different query for spelling mistakes cause full-text fuzziness automatically promotes exact matches over those which are just within expected levenshtein distance but not exact matches.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense!

## UI/UX Details

Although the bulk of proposed changes will be to search result quality and relevance, we're also proposing
a handful of UI/UX changes to filters. We're currently working on designs and will share those as soon as they're
Copy link
Contributor

@mgorsk1 mgorsk1 Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One idea that came up multiple times in our user sessions is better feedback on aggregated search results.

For example if after looking for transactions data and Amundsen returns 1902 results it's still confusing for some users and they'd like to know top 5 schemas from which those results are coming to get bird's eye view on the query made by user. It's just one example but introducing aggregates into search revamp could be also part of UX/UI change.

Very quick idea of what I have in mind is ebay search:
image

after looking for iphone I already see top Networks from which iphoines are available on ebay (bottom-left corner). In our case the UI might be different but the main idea is similar.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very solid direction. Would be a huge improvement, and we hear the same feedback around our parts. At the moment I'm not sure we can fit this into the scope of our proposed enhancements, but let me run this by a few folks. At the very least, we can plan ahead and think about how to add this down the road so we don't make suboptimal choices now.

(a user would choose AND/OR to apply the filters "product_id,user_id" to the "column" field)

#### Negation filtering
Another highly requested feature is specifying negative filters - e.g. "schema is not temp, tmp, or temporary".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from negation, positive boosting is also valid use-case. In ING we hard-coded it on query level (which could be something configurable in search proxy) to promote tables coming from official data sources. So if the table has ing_official_data tag it's relevance is boosted in the search results.

This could be something simple like configuration option - list of maps with field, value, boost_factor properties.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! We do something very very similar at Lyft. I also left this out of the RFC (since it's more of a user-configurable feature). Our plan, once queries are composed of extendable native python objects and not templated JSON, is to use Rescoring for this. I.e. you'd provide a rescore query to match docs that have the ing_official_data tag and then provide coefficients of how to combine the score of the rescore query vs the original query.

FYI rescoring isn't natively supported by the python lib but it can be injected as an extra, see issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm the idea with rescoring is fine idea but limiting in a sense that it doesn't span all of the results but only top N (10 by default) from each shard (afair there is only 1 shard by default in our mappings and our use case will very rarely require more than that). so negating or promoting would only touch top 10 of results from the original es scoring.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had figured this was not much of a limitation as one could adjust N to a point at which one can say "hey, if the result wasn't in the top 500 to begin with, maybe it shouldn't get boosted at all". But we're not married to this implementation - we're looking to let users tweak scoring according to their custom logic in a clean and extendable way (that plays well with the original document score instead of just overriding it) and we're open to suggestions.

The requests made to Elasticsearch are constructed with either hardcoded (in the case of mappings) or templated JSON.
This makes it more difficult to compose and understand requests,
but more importantly it makes it very difficult to extend OSS code.
The Elasticsearch-dsl library includes classes for constructing requests by putting together native python objects.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one aspect that could be also raised is how we approach mapping definitions.

Now we have separate mappings for each type of object (table,feature,user,dashboard) and I think the refactor could also improve on that. We can have one mapping (or even better - template) that would span all the fields for all objects. It's a common pattern and it would make reusability of new analyzers/tokenizers/filters much easier.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an excellent suggestion and my instinct is strongly in favor. Let me think on this and draw up the pros/cons and how to best implement.

rfcs/000-search-revamp.md Show resolved Hide resolved
rfcs/000-search-revamp.md Show resolved Hide resolved
@dkunitsk
Copy link
Author

@mgorsk1 @feng-tao putting this in final comment period. We're getting started on the implementation. I want to settle on the ES v7 question. One thought is we could avoid this for now by keeping the core functionality both v6 and v7 compatible and putting the more interesting features I outlined behind some config flags.

I don't love it though - all things being equal, I'd rather make a clean break. But that's not without pain so I'm open to a hybrid approach.

@dkunitsk dkunitsk added the Status: Final Comment Period (FCP) On final comment period (seven days) label Sep 30, 2021
@feng-tao
Copy link
Member

actually are we talking about single ES index for all types? I didn't have time to read through all discussions, but if this is the case, we discuss more in details.

@mgorsk1
Copy link
Contributor

mgorsk1 commented Oct 1, 2021

the proposal was single index template and keep multiple indices (all based on the same template). The template would follow Elastic Common Schema approach for naming fields and would centralize configuration of analyzers/filters. There are fields across all indices that are common (like display name, name, description, tags, badges) for which we define mapping N times (N - amount of different resources like Table, ML Feature, Dashboard etc) even though they are named the same and serve same purpose. Templates seem like a good fit for unifying mapping approach imo.

@dkunitsk
Copy link
Author

dkunitsk commented Oct 1, 2021

@feng-tao @mgorsk1 Hey, we did think more on this. Quite agree on the problem @mgorsk1 pointed out - we specify the common fields N times, with random differences here and there.

I am not 100% convinced that introducing templates is worth it - seems like a bit more overhead and more place for error. Instead, we were thinking we'd make some base entity object class, which will be possible once we rewrite the hardcoded JSON mappings (not extendable or composable) into Py library object.

But, confession: I have not used templates before so I'm not an expert on that. And my first time googling Elastic Common Schema was after seeing it in the comment above 😄 . So please LMK if you believe templates are worth a second look.

@mgorsk1
Copy link
Contributor

mgorsk1 commented Oct 1, 2021

I understand how using python classes might also be used with this revamp but that really doesn't change the approach just moves from using json mappings to using python classes which in turn is something I feel more python-native than elastic-native. I tried this approach and indeed it feels like ORM so is attractive from developers perspective but not quite the same thing and it's definitely more cumbersome to fine-tune es mapping than using raw mapping definition. Imo looking into templates is worth second look and it's not scary at all!

I am not really sure what it means that introducing templates introduces room for more error. For me it's the opposite - having one place enforces unification and singular approach for common fields, mapping settings etc. instead of repeating it multiple times. It's also something Elastic ecosystem promotes for years now with templates being at the heart of every subproject (metricbeat, filebeat, etc.).

To be fair there is one downside of this - it would enforce naming convention for es indices - templates rely on the fact that index name follow naming convention. Template json is almost 1:1 same as mapping json but you define index_patterns: [] attribute with list of wildcard patterns for index names. For example index_patterns: ['discovery*'] would be automatically applied when index with discovery prefix in it's name is created - discovery_table_search_index_20211001, discovery_dashboard_search_index_20211001, discovery_mlfeature_search_index_20211001 etc.

But imo that's minor as we can enforce prefix and leave rest for the users.

@dkunitsk
Copy link
Author

dkunitsk commented Oct 1, 2021

@mgorsk1 I think my concern is something like this: we'll still need some definition (in code) of the common fields and their types, right? Call this M. And, in each databuilder run, we can't assume the right template(s) already exist in Elasticsearch.

So then, in a databuilder run, instead of creating a new index based off M, we first create a new template based off M and then create a new index using that template. So I'm not sure I understand where the benefit is to be found. And the source for error I was referring to is the new opportunity of creating an index that didn't use the right template.

Perhaps I am misunderstanding a few things and just need a bit more nudging.

@mgorsk1 let's take the rest of this conversation to the comment you made originally regarding unifying the indices. This way other folks who are less interested in this detail won't get lost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Active Status: Final Comment Period (FCP) On final comment period (seven days)
Projects
None yet
3 participants