Search revamp #43

dkunitsk · 2021-09-22T01:48:25Z

Signed-off-by: Dmitriy Kunitskiy dkunitskiy@lyft.com

Signed-off-by: Dmitriy Kunitskiy <dkunitskiy@lyft.com>

feng-tao · 2021-09-22T05:16:01Z

rfcs/000-search-revamp.md

+## Summary
+
+Revamp search with a cleaner backend implementation, more robust queries, better customizability, and
+making use of modern v7 ES features.  


have you considered moving directly to opensearch (fork of ES from AWS) instead of ES?

I do know it exists but I didn't give it much consideration. My worry was that it'd be too new/different from ES and harder to find resources for, especially documentation which is very good for classic ES. And also I didn't want to accidentally write anything opensearch-specific. But perhaps there's a lot that I'm misunderstanding.

feng-tao · 2021-09-22T05:19:05Z

rfcs/000-search-revamp.md

+whereas for multi-valued fields we'd give users an options to choose AND or OR 
+(a user would choose AND/OR to apply the filters "product_id,user_id" to the "column" field)
+
+#### Negation filtering


when we builds the new search UI, one constraint we faced is that it supports other search proxy (e.g atlas) which doesn't support all the primitives of a lucene based search system does like ES. I wonder whether we should just remove that atlas search proxy. WDYT @mgorsk1

We are definitely not looking back to using Atlas Search API but there are other Atlas users which might be impacted by removing search proxy.

Arguments for it would be simplicity of Amundsen implementation and aligned search experience but it might hurt the simplicity of Amundsen and Atlas deployment if some users are satisfied with Atlas search and don't want to go into trouble of setting up ES & running databuilder search sync.

I am a million percent in favor of removing. I'll run some informal polling first, of course, but I think the burden of maintaining it probably far exceeds the combined user burden (if any) of spinning up elasticsearch.

This is not to mention that the current Atlas proxy supports only a fraction of what's supported by the ES proxy. No user/dashboard/feature search, for starters.

I think we are fine removing it as long as we provide some transition period so users have comfortable time for switching into new approach. We'd promote the approach that is already outlined in docs (using databuilder search extractor).

feng-tao · 2021-09-22T05:19:43Z

rfcs/000-search-revamp.md

+As above, we want to avoid going down the rabbit hole of full-blown logical expressions. Likely, we'd implement
+this on the frontend as a separate set of filter boxes below the current set. Stay tuned for designs. 
+
+#### Type-ahead suggestions for filters


ES v7 provides search as you type etc feature which could support type-ahead.

feng-tao · 2021-09-22T05:20:45Z

rfcs/000-search-revamp.md

+It also uses a different (and incorrect) elasticsearch [query_string](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html) command. 
+We plan to unify the implementations, so both use the same top-level elasticsearch command either with or without an additional filter specifier. 
+
+#### Replace JSON in queries/mappings with Elasticsearch library objects


make sense.

mgorsk1 · 2021-09-22T06:48:40Z

rfcs/000-search-revamp.md

+- exact matches with spelling mistakes (e.g. "ordered_iems" instead of "ordered_items")
+- analyzed matches where a user inputs keywords (e.g. "returns") and looks to find the most relevant entities
+
+Elasticsearch uses [compound queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/compound-queries.html) to combine multiple subqueries together. The most common is  [Boolean](https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-bool-query.html).


+1, we changed search to use boolean queries in elasticsearch. Other benefit, not mentioned in rfc, is that if we move filters (advanced search) to filter part of such query it will be cached by ES (if one user searches for pii tag using filters subsequent searches with just this filter are not hitting es storage but ram which makes things quicker)

mgorsk1 · 2021-09-22T06:50:45Z

rfcs/000-search-revamp.md

+Here, we’ll list a handful of optimization directions and parameters we’ll experiment with:
+
+##### Fuzziness
+Allows for misspellings by specifying a maximum edit distance. See [docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html).


just a nit but I think match query with fuzziness parameters would be a better fit (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html). This fuzzy query comes from term queries that operate on keyword type which is not analyzed and shouldn't be used for full-text (keyword is also case-sensitive which should be irrelevant for us).

Yes you're right. I was not very clear on the difference between these two so thank you.

mgorsk1 · 2021-09-22T06:52:44Z

rfcs/000-search-revamp.md

+Allows for misspellings by specifying a maximum edit distance. See [docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html).
+
+##### Synonyms 
+This will require tuning and validation, but some users may get significant mileage from a hand-generated synonyms dictionary file. 


+1, we also use synonyms for Amundsen indices. Apart from generic synonym definitions users might introduce company-specific lingo. In our case users know the same data sources by different names - some of them are technical, some legacy, etc and we introduced synonyms to facilitate this.

mgorsk1 · 2021-09-22T06:53:10Z

rfcs/000-search-revamp.md

+In case someone searches for a phrase out of an entity description, this subquery will take word order into consideration. [Docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html).
+
+##### Index Fields in More Ways
+Experimenting with multiple analyzers (i.e. independently analyzing an input field via multiple analyzers) has potential benefits like normalizing stemmings and filtering stop words. 


+1, synonyms can also be one of multi fields. Stop words is something that would make sense for (for example) description fields.

mgorsk1 · 2021-09-22T06:54:12Z

rfcs/000-search-revamp.md

+
+These include (at least):
+- exact matches
+- exact matches with spelling mistakes (e.g. "ordered_iems" instead of "ordered_items")


just a nit but we might not really need different query for spelling mistakes cause full-text fuzziness automatically promotes exact matches over those which are just within expected levenshtein distance but not exact matches.

That makes sense!

mgorsk1 · 2021-09-22T07:05:52Z

rfcs/000-search-revamp.md

+## UI/UX Details
+
+Although the bulk of proposed changes will be to search result quality and relevance, we're also proposing
+a handful of UI/UX changes to filters. We're currently working on designs and will share those as soon as they're


One idea that came up multiple times in our user sessions is better feedback on aggregated search results.

For example if after looking for transactions data and Amundsen returns 1902 results it's still confusing for some users and they'd like to know top 5 schemas from which those results are coming to get bird's eye view on the query made by user. It's just one example but introducing aggregates into search revamp could be also part of UX/UI change.

Very quick idea of what I have in mind is ebay search:

after looking for iphone I already see top Networks from which iphoines are available on ebay (bottom-left corner). In our case the UI might be different but the main idea is similar.

Very solid direction. Would be a huge improvement, and we hear the same feedback around our parts. At the moment I'm not sure we can fit this into the scope of our proposed enhancements, but let me run this by a few folks. At the very least, we can plan ahead and think about how to add this down the road so we don't make suboptimal choices now.

mgorsk1 · 2021-09-22T07:07:46Z

rfcs/000-search-revamp.md

+(a user would choose AND/OR to apply the filters "product_id,user_id" to the "column" field)
+
+#### Negation filtering
+Another highly requested feature is specifying negative filters - e.g. "schema is not temp, tmp, or temporary".


Apart from negation, positive boosting is also valid use-case. In ING we hard-coded it on query level (which could be something configurable in search proxy) to promote tables coming from official data sources. So if the table has ing_official_data tag it's relevance is boosted in the search results.

This could be something simple like configuration option - list of maps with field, value, boost_factor properties.

Yes! We do something very very similar at Lyft. I also left this out of the RFC (since it's more of a user-configurable feature). Our plan, once queries are composed of extendable native python objects and not templated JSON, is to use Rescoring for this. I.e. you'd provide a rescore query to match docs that have the ing_official_data tag and then provide coefficients of how to combine the score of the rescore query vs the original query.

FYI rescoring isn't natively supported by the python lib but it can be injected as an extra, see issue.

hm the idea with rescoring is fine idea but limiting in a sense that it doesn't span all of the results but only top N (10 by default) from each shard (afair there is only 1 shard by default in our mappings and our use case will very rarely require more than that). so negating or promoting would only touch top 10 of results from the original es scoring.

I had figured this was not much of a limitation as one could adjust N to a point at which one can say "hey, if the result wasn't in the top 500 to begin with, maybe it shouldn't get boosted at all". But we're not married to this implementation - we're looking to let users tweak scoring according to their custom logic in a clean and extendable way (that plays well with the original document score instead of just overriding it) and we're open to suggestions.

mgorsk1 · 2021-09-22T07:13:19Z

rfcs/000-search-revamp.md

+The requests made to Elasticsearch are constructed with either hardcoded (in the case of mappings) or templated JSON. 
+This makes it more difficult to compose and understand requests, 
+but more importantly it makes it very difficult to extend OSS code. 
+The Elasticsearch-dsl library includes classes for constructing requests by putting together native python objects.


I think one aspect that could be also raised is how we approach mapping definitions.

Now we have separate mappings for each type of object (table,feature,user,dashboard) and I think the refactor could also improve on that. We can have one mapping (or even better - template) that would span all the fields for all objects. It's a common pattern and it would make reusability of new analyzers/tokenizers/filters much easier.

This is an excellent suggestion and my instinct is strongly in favor. Let me think on this and draw up the pros/cons and how to best implement.

rfcs/000-search-revamp.md

dkunitsk · 2021-09-30T01:00:06Z

@mgorsk1 @feng-tao putting this in final comment period. We're getting started on the implementation. I want to settle on the ES v7 question. One thought is we could avoid this for now by keeping the core functionality both v6 and v7 compatible and putting the more interesting features I outlined behind some config flags.

I don't love it though - all things being equal, I'd rather make a clean break. But that's not without pain so I'm open to a hybrid approach.

feng-tao · 2021-09-30T22:16:15Z

actually are we talking about single ES index for all types? I didn't have time to read through all discussions, but if this is the case, we discuss more in details.

mgorsk1 · 2021-10-01T05:34:22Z

the proposal was single index template and keep multiple indices (all based on the same template). The template would follow Elastic Common Schema approach for naming fields and would centralize configuration of analyzers/filters. There are fields across all indices that are common (like display name, name, description, tags, badges) for which we define mapping N times (N - amount of different resources like Table, ML Feature, Dashboard etc) even though they are named the same and serve same purpose. Templates seem like a good fit for unifying mapping approach imo.

dkunitsk · 2021-10-01T05:44:50Z

@feng-tao @mgorsk1 Hey, we did think more on this. Quite agree on the problem @mgorsk1 pointed out - we specify the common fields N times, with random differences here and there.

I am not 100% convinced that introducing templates is worth it - seems like a bit more overhead and more place for error. Instead, we were thinking we'd make some base entity object class, which will be possible once we rewrite the hardcoded JSON mappings (not extendable or composable) into Py library object.

But, confession: I have not used templates before so I'm not an expert on that. And my first time googling Elastic Common Schema was after seeing it in the comment above 😄 . So please LMK if you believe templates are worth a second look.

mgorsk1 · 2021-10-01T05:51:52Z

I understand how using python classes might also be used with this revamp but that really doesn't change the approach just moves from using json mappings to using python classes which in turn is something I feel more python-native than elastic-native. I tried this approach and indeed it feels like ORM so is attractive from developers perspective but not quite the same thing and it's definitely more cumbersome to fine-tune es mapping than using raw mapping definition. Imo looking into templates is worth second look and it's not scary at all!

I am not really sure what it means that introducing templates introduces room for more error. For me it's the opposite - having one place enforces unification and singular approach for common fields, mapping settings etc. instead of repeating it multiple times. It's also something Elastic ecosystem promotes for years now with templates being at the heart of every subproject (metricbeat, filebeat, etc.).

To be fair there is one downside of this - it would enforce naming convention for es indices - templates rely on the fact that index name follow naming convention. Template json is almost 1:1 same as mapping json but you define index_patterns: [] attribute with list of wildcard patterns for index names. For example index_patterns: ['discovery*'] would be automatically applied when index with discovery prefix in it's name is created - discovery_table_search_index_20211001, discovery_dashboard_search_index_20211001, discovery_mlfeature_search_index_20211001 etc.

But imo that's minor as we can enforce prefix and leave rest for the users.

dkunitsk · 2021-10-01T06:00:45Z

@mgorsk1 I think my concern is something like this: we'll still need some definition (in code) of the common fields and their types, right? Call this M. And, in each databuilder run, we can't assume the right template(s) already exist in Elasticsearch.

So then, in a databuilder run, instead of creating a new index based off M, we first create a new template based off M and then create a new index using that template. So I'm not sure I understand where the benefit is to be found. And the source for error I was referring to is the new opportunity of creating an index that didn't use the right template.

Perhaps I am misunderstanding a few things and just need a bit more nudging.

@mgorsk1 let's take the rest of this conversation to the comment you made originally regarding unifying the indices. This way other folks who are less interested in this detail won't get lost.

Search revamp

cf12900

Signed-off-by: Dmitriy Kunitskiy <dkunitskiy@lyft.com>

dkunitsk requested a review from a team as a code owner September 22, 2021 01:48

typo

5f7938c

Signed-off-by: Dmitriy Kunitskiy <dkunitskiy@lyft.com>

dkunitsk added the Status: Active label Sep 22, 2021

feng-tao reviewed Sep 22, 2021

View reviewed changes

mgorsk1 reviewed Sep 22, 2021

View reviewed changes

dkunitsk added the Status: Final Comment Period (FCP) On final comment period (seven days) label Sep 30, 2021

feng-tao approved these changes Oct 6, 2021

View reviewed changes

dkunitsk merged commit 1cf1879 into master Oct 7, 2021

feng-tao mentioned this pull request Feb 3, 2022

feat: more flexible ES search proxy implementation amundsen-io/amundsen#1649

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Search revamp #43

Search revamp #43

dkunitsk commented Sep 22, 2021

feng-tao Sep 22, 2021

dkunitsk Sep 22, 2021

feng-tao Sep 22, 2021

mgorsk1 Sep 22, 2021

dkunitsk Sep 22, 2021

mgorsk1 Sep 22, 2021

feng-tao Sep 22, 2021

feng-tao Sep 22, 2021

mgorsk1 Sep 22, 2021 •

edited

mgorsk1 Sep 22, 2021

dkunitsk Sep 22, 2021

mgorsk1 Sep 22, 2021

mgorsk1 Sep 22, 2021 •

edited

mgorsk1 Sep 22, 2021 •

edited

dkunitsk Sep 22, 2021

mgorsk1 Sep 22, 2021 •

edited

dkunitsk Sep 22, 2021

mgorsk1 Sep 22, 2021

dkunitsk Sep 22, 2021

mgorsk1 Sep 22, 2021

dkunitsk Sep 22, 2021

mgorsk1 Sep 22, 2021

dkunitsk Sep 22, 2021

dkunitsk commented Sep 30, 2021

feng-tao commented Sep 30, 2021

mgorsk1 commented Oct 1, 2021 •

edited

dkunitsk commented Oct 1, 2021 •

edited

mgorsk1 commented Oct 1, 2021 •

edited

dkunitsk commented Oct 1, 2021 •

edited

Search revamp #43

Search revamp #43

Conversation

dkunitsk commented Sep 22, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mgorsk1 Sep 22, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mgorsk1 Sep 22, 2021 • edited

Choose a reason for hiding this comment

mgorsk1 Sep 22, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mgorsk1 Sep 22, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dkunitsk commented Sep 30, 2021

feng-tao commented Sep 30, 2021

mgorsk1 commented Oct 1, 2021 • edited

dkunitsk commented Oct 1, 2021 • edited

mgorsk1 commented Oct 1, 2021 • edited

dkunitsk commented Oct 1, 2021 • edited

mgorsk1 Sep 22, 2021 •

edited

mgorsk1 Sep 22, 2021 •

edited

mgorsk1 Sep 22, 2021 •

edited

mgorsk1 Sep 22, 2021 •

edited

mgorsk1 commented Oct 1, 2021 •

edited

dkunitsk commented Oct 1, 2021 •

edited

mgorsk1 commented Oct 1, 2021 •

edited

dkunitsk commented Oct 1, 2021 •

edited