Search revamp #43
Signed-off-by: Dmitriy Kunitskiy <dkunitskiy@lyft.com>
## Summary

Revamp search with a cleaner backend implementation, more robust queries, better customizability, and making use of modern v7 ES features.
Have you considered moving directly to OpenSearch (the fork of ES from AWS) instead of ES?
I do know it exists but I didn't give it much consideration. My worry was that it'd be too new/different from ES and harder to find resources for, especially documentation which is very good for classic ES. And also I didn't want to accidentally write anything opensearch-specific. But perhaps there's a lot that I'm misunderstanding.
whereas for multi-valued fields we'd give users an option to choose AND or OR
(a user would choose AND/OR to apply the filters "product_id,user_id" to the "column" field)

#### Negation filtering
When we built the new search UI, one constraint we faced is that it supports other search proxies (e.g. Atlas), which don't support all the primitives that a Lucene-based search system like ES does. I wonder whether we should just remove the Atlas search proxy. WDYT @mgorsk1
We are definitely not looking back to using the Atlas Search API, but there are other Atlas users who might be impacted by removing the search proxy.
Arguments for removal would be simplicity of the Amundsen implementation and an aligned search experience, but it might hurt the simplicity of Amundsen and Atlas deployment if some users are satisfied with Atlas search and don't want to go to the trouble of setting up ES & running the databuilder search sync.
I am a million percent in favor of removing. I'll run some informal polling first, of course, but I think the burden of maintaining it probably far exceeds the combined user burden (if any) of spinning up elasticsearch.
This is not to mention that the current Atlas proxy supports only a fraction of what's supported by the ES proxy. No user/dashboard/feature search, for starters.
I think we are fine removing it as long as we provide some transition period so users have a comfortable amount of time to switch to the new approach. We'd promote the approach that is already outlined in the docs (using the databuilder search extractor).
As above, we want to avoid going down the rabbit hole of full-blown logical expressions. Likely, we'd implement this on the frontend as a separate set of filter boxes below the current set. Stay tuned for designs.

#### Type-ahead suggestions for filters
ES v7 provides a search-as-you-type feature (among others) that could support type-ahead.
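As a rough sketch of what that could look like for filter suggestions (the `name` field and index layout here are illustrative, not Amundsen's actual schema):

```python
# Hypothetical mapping fragment: v7's search_as_you_type field type
# auto-generates ._2gram / ._3gram subfields optimized for prefix matching.
mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "search_as_you_type"},
        }
    }
}

# A multi_match query of type bool_prefix targets those generated subfields
# to serve suggestions as the user types.
suggest_query = {
    "query": {
        "multi_match": {
            "query": "ordered_it",
            "type": "bool_prefix",
            "fields": ["name", "name._2gram", "name._3gram"],
        }
    }
}
```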
It also uses a different (and incorrect) elasticsearch [query_string](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html) command. We plan to unify the implementations, so both use the same top-level elasticsearch command, either with or without an additional filter specifier.

#### Replace JSON in queries/mappings with Elasticsearch library objects
Makes sense.
- exact matches with spelling mistakes (e.g. "ordered_iems" instead of "ordered_items")
- analyzed matches where a user inputs keywords (e.g. "returns") and looks to find the most relevant entities

Elasticsearch uses [compound queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/compound-queries.html) to combine multiple subqueries together. The most common is [Boolean](https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-bool-query.html).
+1, we changed search to use boolean queries in Elasticsearch. Another benefit, not mentioned in the RFC, is that if we move filters (advanced search) to the `filter` part of such a query, it will be cached by ES (if one user searches for the `pii` tag using filters, subsequent searches with just that filter don't hit ES storage but RAM, which makes things quicker).
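A minimal sketch of the bool-query shape being described, with the scored full-text part in `must` and the cacheable exact filters in `filter` (field names like `name` and `tags` are illustrative):

```python
# Sketch: full-text clauses go in "must" (contribute to scoring), while exact
# filters go in "filter" (filter context: no scoring, cacheable by ES).
def build_search_body(term, tag_filter=None):
    body = {
        "query": {
            "bool": {
                "must": [{"match": {"name": {"query": term}}}],
                "filter": [],
            }
        }
    }
    if tag_filter:
        # Filter clauses don't affect relevance and ES can cache them
        # across requests that reuse the same filter.
        body["query"]["bool"]["filter"].append({"term": {"tags": tag_filter}})
    return body

body = build_search_body("transactions", tag_filter="pii")
```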
Here, we’ll list a handful of optimization directions and parameters we’ll experiment with:

##### Fuzziness
Allows for misspellings by specifying a maximum edit distance. See [docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html).
Just a nit, but I think a [match](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html) query with `fuzziness` parameters would be a better fit. The `fuzzy` query comes from the family of `term` queries, which operate on the `keyword` type; that type is not analyzed and shouldn't be used for full-text search (`keyword` is also case-sensitive, which should be irrelevant for us).
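To make the contrast concrete, here is a sketch of both request shapes (the `name` field is illustrative):

```python
# Term-level approach: "fuzzy" runs against the raw, unanalyzed keyword value.
fuzzy_term_query = {
    "query": {"fuzzy": {"name.keyword": {"value": "ordered_iems"}}}
}

# Full-text approach: "match" analyzes the input first, and the fuzziness
# parameter tolerates misspellings on the analyzed terms.
match_with_fuzziness = {
    "query": {
        "match": {
            "name": {
                "query": "ordered_iems",
                # "AUTO" scales the max edit distance with term length.
                "fuzziness": "AUTO",
            }
        }
    }
}
```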
Yes you're right. I was not very clear on the difference between these two so thank you.
Allows for misspellings by specifying a maximum edit distance. See [docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html).

##### Synonyms
This will require tuning and validation, but some users may get significant mileage from a hand-generated synonyms dictionary file.
+1, we also use synonyms for Amundsen indices. Apart from generic synonym definitions, users might introduce company-specific lingo. In our case, users know the same data sources by different names - some of them technical, some legacy, etc. - and we introduced synonyms to facilitate this.
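A sketch of how such a hand-maintained synonyms list could be wired into index settings via a custom analyzer (the analyzer name and the synonym pairs are illustrative):

```python
# Sketch of index settings: a synonym token filter feeding a custom analyzer,
# so "txn" and "transaction" (for example) match the same documents.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "amundsen_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "dwh, warehouse, datalake",
                        "txn, transaction",
                    ],
                }
            },
            "analyzer": {
                "synonym_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "amundsen_synonyms"],
                }
            },
        }
    }
}
```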
In case someone searches for a phrase out of an entity description, this subquery will take word order into consideration. [Docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html).

##### Index Fields in More Ways
Experimenting with multiple analyzers (i.e. independently analyzing an input field via multiple analyzers) has potential benefits like normalizing stemmings and filtering stop words.
+1, synonyms can also be one of the multi-fields. Stop words are something that would make sense for, e.g., `description` fields.
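One way to index a single input field in several ways at once is multi-fields; a sketch, assuming a `description` field and a separately defined custom synonym analyzer (names are illustrative):

```python
# Sketch: one stored field, analyzed multiple ways via multi-fields.
#   .raw      -> unanalyzed keyword for exact matches
#   .stemmed  -> built-in English analyzer (stemming + stop words)
#   .synonyms -> a custom analyzer assumed to exist in the index settings
description_mapping = {
    "description": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
            "raw": {"type": "keyword"},
            "stemmed": {"type": "text", "analyzer": "english"},
            "synonyms": {"type": "text", "analyzer": "synonym_analyzer"},
        },
    }
}
```

Queries can then target whichever view fits the subquery, e.g. `description.raw` for exact matches and `description.stemmed` for keyword search.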
These include (at least):
- exact matches
- exact matches with spelling mistakes (e.g. "ordered_iems" instead of "ordered_items")
Just a nit, but we might not really need a different query for spelling mistakes, since full-text fuzziness automatically promotes exact matches over matches that are merely within the expected Levenshtein distance.
That makes sense!
## UI/UX Details

Although the bulk of proposed changes will be to search result quality and relevance, we're also proposing a handful of UI/UX changes to filters. We're currently working on designs and will share those as soon as they're
One idea that came up multiple times in our user sessions is better feedback on aggregated search results. For example, if after looking for `transactions` data Amundsen returns 1902 results, that's still confusing for some users; they'd like to know the top 5 schemas those results come from, to get a bird's-eye view of their query. It's just one example, but introducing aggregates into the search revamp could also be part of the UX/UI change.
A very quick idea of what I have in mind is eBay search: after searching for `iphone`, I already see the top Networks from which iPhones are available on eBay (bottom-left corner). In our case the UI might be different, but the main idea is similar.
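The "top 5 schemas" idea maps naturally onto a terms aggregation attached to the search request; a sketch, assuming an illustrative `schema` keyword field:

```python
# Sketch: alongside the hits, ask ES to report the 5 most common schema
# values among matching documents, giving users the bird's-eye view.
search_body = {
    "query": {"match": {"name": "transactions"}},
    "aggs": {
        "top_schemas": {
            "terms": {"field": "schema", "size": 5}
        }
    },
}
```

The response would then carry a `top_schemas` bucket list next to the normal hits, which the frontend could render as facet counts.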
Very solid direction. Would be a huge improvement, and we hear the same feedback around our parts. At the moment I'm not sure we can fit this into the scope of our proposed enhancements, but let me run this by a few folks. At the very least, we can plan ahead and think about how to add this down the road so we don't make suboptimal choices now.
(a user would choose AND/OR to apply the filters "product_id,user_id" to the "column" field)

#### Negation filtering
Another highly requested feature is specifying negative filters - e.g. "schema is not temp, tmp, or temporary".
Apart from negation, positive boosting is also a valid use case. At ING we hard-coded it at the query level (it could be something configurable in the search proxy) to promote tables coming from official data sources. So if a table has the `ing_official_data` tag, its relevance is boosted in the search results.
This could be something simple like a configuration option: a list of maps with `field`, `value`, `boost_factor` properties.
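A sketch of how such a config list could translate into the query, using bool `should` clauses so matching docs get a score bump without filtering anything out (the config entries and field names are illustrative):

```python
# Illustrative boost configuration: list of maps, as suggested above.
BOOST_CONFIG = [
    {"field": "tags", "value": "ing_official_data", "boost_factor": 2.0},
]

def apply_boosts(query, boosts=BOOST_CONFIG):
    # "should" clauses are optional: they raise the score of matching docs
    # but never exclude non-matching ones.
    should = [
        {"term": {b["field"]: {"value": b["value"], "boost": b["boost_factor"]}}}
        for b in boosts
    ]
    return {"query": {"bool": {"must": [query], "should": should}}}

boosted = apply_boosts({"match": {"name": "transactions"}})
```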
Yes! We do something very similar at Lyft. I also left this out of the RFC (since it's more of a user-configurable feature). Our plan, once queries are composed of extendable native Python objects and not templated JSON, is to use rescoring for this. I.e., you'd provide a `rescore` query to match docs that have the `ing_official_data` tag and then provide coefficients for how to combine the score of the `rescore` query with the original query's score.
FYI, rescoring isn't natively supported by the Python lib, but it can be injected as an `extra`; see the issue.
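For concreteness, a sketch of the rescore section as it would appear in the raw request body (the tag name comes from the comment above; the weights and window size are illustrative):

```python
# Sketch: re-rank the top window_size hits per shard, blending the original
# score with a score bump for docs carrying the tag.
request_body = {
    "query": {"match": {"name": "transactions"}},
    "rescore": {
        "window_size": 500,  # how many top hits per shard get re-ranked
        "query": {
            "rescore_query": {"term": {"tags": "ing_official_data"}},
            "query_weight": 1.0,          # weight of the original score
            "rescore_query_weight": 1.5,  # weight of the rescore match
        },
    },
}
```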
Hm, the idea of rescoring is fine but limiting, in the sense that it doesn't span all of the results, only the top N (10 by default) from each shard (AFAIR there is only 1 shard by default in our mappings, and our use case will very rarely require more than that). So negating or promoting would only touch the top 10 results from the original ES scoring.
I had figured this was not much of a limitation as one could adjust N to a point at which one can say "hey, if the result wasn't in the top 500 to begin with, maybe it shouldn't get boosted at all". But we're not married to this implementation - we're looking to let users tweak scoring according to their custom logic in a clean and extendable way (that plays well with the original document score instead of just overriding it) and we're open to suggestions.
The requests made to Elasticsearch are constructed with either hardcoded (in the case of mappings) or templated JSON. This makes it more difficult to compose and understand requests, but more importantly it makes it very difficult to extend OSS code. The Elasticsearch-dsl library includes classes for constructing requests by putting together native python objects.
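The composability contrast can be sketched without the library itself; elasticsearch-dsl's `Q`/`Search` objects play the role the small helper function plays here (the template string and field names are illustrative):

```python
import json

# Current style: a JSON string template filled by substitution -- the request
# structure is opaque until rendered, so it's hard for OSS users to extend.
TEMPLATE = '{"query": {"match": {"name": {"query": "%(term)s"}}}}'
templated = json.loads(TEMPLATE % {"term": "transactions"})

# Composable style: build the request from small functions/objects; callers
# can inspect and extend the structure before it's serialized and sent.
def match(field, term):
    return {"match": {field: {"query": term}}}

composed = {"query": match("name", "transactions")}

# Both construct the same request body.
assert templated == composed
```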
I think one aspect that could also be raised is how we approach mapping definitions.
Right now we have separate mappings for each type of object (table, feature, user, dashboard), and I think the refactor could improve on that too. We could have one mapping (or even better, a template) that spans all the fields for all objects. It's a common pattern, and it would make reusing new analyzers/tokenizers/filters much easier.
This is an excellent suggestion and my instinct is strongly in favor. Let me think on this and draw up the pros/cons and how to best implement.
@mgorsk1 @feng-tao putting this in final comment period. We're getting started on the implementation. I want to settle on the ES v7 question. One thought is we could avoid this for now by keeping the core functionality both v6 and v7 compatible and putting the more interesting features I outlined behind some config flags. I don't love it though - all things being equal, I'd rather make a clean break. But that's not without pain so I'm open to a hybrid approach.
Actually, are we talking about a single ES index for all types? I didn't have time to read through all the discussions, but if that's the case, we should discuss in more detail.
The proposal was a single index template, keeping multiple indices (all based on the same template). The template would follow the Elastic Common Schema approach for naming fields and would centralize configuration of analyzers/filters. There are fields common across all indices (like display name, name, description, tags, badges) for which we define the mapping N times (N = the number of different resources, like Table, ML Feature, Dashboard, etc.) even though they are named the same and serve the same purpose. Templates seem like a good fit for unifying the mapping approach imo.
@feng-tao @mgorsk1 Hey, we did think more on this. Quite agree on the problem @mgorsk1 pointed out - we specify the common fields N times, with random differences here and there. I am not 100% convinced that introducing templates is worth it - it seems like a bit more overhead and more room for error. Instead, we were thinking we'd make some base entity object class, which will be possible once we rewrite the hardcoded JSON mappings (not extendable or composable) into Python library objects. But, confession: I have not used templates before, so I'm not an expert on that. And my first time googling Elastic Common Schema was after seeing it in the comment above 😄. So please LMK if you believe templates are worth a second look.
I understand how using Python classes might also be used with this revamp, but that really doesn't change the approach - it just moves from JSON mappings to Python classes, which in turn is something I feel is more Python-native than Elastic-native. I tried this approach and indeed it feels like an ORM, so it's attractive from a developer's perspective, but it's not quite the same thing, and it's definitely more cumbersome to fine-tune an ES mapping that way than using a raw mapping definition. Imo templates are worth a second look and they're not scary at all! I am not really sure what it means that introducing templates introduces room for more error. For me it's the opposite - having one place enforces unification and a singular approach for common fields, mapping settings, etc. instead of repeating it multiple times. It's also something the Elastic ecosystem has promoted for years now, with templates being at the heart of every subproject (metricbeat, filebeat, etc.). To be fair, there is one downside to this - it would enforce a naming convention for ES indices, since templates rely on index names following that convention. Template JSON is almost 1:1 the same as mapping JSON, but you additionally define the index patterns the template applies to. But imo that's minor, as we can enforce a prefix and leave the rest to the users.
@mgorsk1 I think my concern is something like this: we'll still need some definition (in code) of the common fields and their types, right? Call this M. And, in each databuilder run, we can't assume the right template(s) already exist in Elasticsearch. So then, in a databuilder run, instead of creating a new index based off M, we first create a new template based off M and then create a new index using that template. So I'm not sure I understand where the benefit is to be found. And the source of error I was referring to is the new opportunity to create an index that didn't use the right template. Perhaps I am misunderstanding a few things and just need a bit more nudging. @mgorsk1 let's take the rest of this conversation to the comment you made originally regarding unifying the indices. This way other folks who are less interested in this detail won't get lost.
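To ground the template discussion, a sketch of what such a single template could look like; the `amundsen_*` pattern and the field list are illustrative, not a settled design:

```python
# Sketch: one index template whose index_patterns covers all resource indices,
# so common fields ("name", "description", "tags", "badges", ...) are defined
# once instead of N times.
amundsen_template = {
    "index_patterns": ["amundsen_*"],  # e.g. amundsen_table, amundsen_user, ...
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "description": {"type": "text"},
            "tags": {"type": "keyword"},
            "badges": {"type": "keyword"},
        }
    },
}

# It would be registered once via the indices API (call shown as a comment,
# since it needs a live cluster), e.g.:
#   es.indices.put_template(name="amundsen", body=amundsen_template)
# after which any newly created index matching "amundsen_*" picks it up.
```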