Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(elasticsearch): advanced query, identity autocomplete, exact match weight #7354

Merged

Conversation

david-leifker
Copy link
Collaborator

  • Restore advanced field level queries using /q for backwards compatibility
  • Enable autocomplete on corp users/groups
  • Documentation updates around advanced queries
  • Add additional weight to exact vs prefix match

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added docs Issues and Improvements to docs product PR or Issue related to the DataHub UI/UX labels Feb 16, 2023
@david-leifker david-leifker added platform PR-s that make changes to core parts of the platform and removed product PR or Issue related to the DataHub UI/UX labels Feb 16, 2023
@david-leifker
Copy link
Collaborator Author

Need to look at that search test closer. Probably needs addressing.

@david-leifker david-leifker marked this pull request as draft February 16, 2023 23:18
@xiphl
Copy link
Contributor

xiphl commented Feb 17, 2023

Thanks for updating the search guide! We specifically covered the ways to search as part of our on boarding for new users, and it will be very helpful to have examples

@david-leifker david-leifker marked this pull request as ready for review February 19, 2023 03:24
return finalQuery.should().size() > 0 ? Optional.of(finalQuery) : Optional.empty();
}

private static QueryBuilder getPhraseQuery(@Nonnull EntitySpec entitySpec, String query) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead code?

@david-leifker david-leifker merged commit 7bbabe0 into datahub-project:master Feb 20, 2023
BrowsePath is stored as a complete string, for instance ```/datasets/prod/hive/SampleKafkaDataset```, hence the need for wildcards on both ends of the term to return a result.
If you want to:

- Exact match on term or phrase
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!!!

@@ -32,6 +32,9 @@


public class SearchQueryBuilder {

public static final String STRUCTURED_QUERY_PREFIX = "\\\\/q ";
public static final float EXACT_MATCH_BOOST_FACTOR = 10.0f;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exact match on ANY field?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a multiplier, so the annotation * this factor.

BoolQueryBuilder finalQuery = QueryBuilders.boolQuery();
if (query.contains("\"")) {
finalQuery.should(QueryBuilders.termQuery("urn", query.replaceAll("\"", ""))
.boost(Float.parseFloat((String) PRIMARY_URN_SEARCH_PROPERTIES.get("boostScore")) * EXACT_MATCH_BOOST_FACTOR)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait wait so this on occurs if the query contains " " quotations (at least one) and only applies to the urn field?

Is that really what we want? My experience is that exact match on NAME fields is not very good, which may or may not be in the URN at all.

Also, do users really want to have to put quotes in order to do exact match? If I search for a Table with the exact name I expect to see it - quotes or no quotes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The urn field is not part of any of the models' search specs. It is always treated as an addition added everywhere. This would only be used in a quoted urn, exact match. Below is where we handle the other fields.
More important for the non-urn field, the quotes become important when considering synonyms, stemming, and other tokenization steps should not be considered. Fulltext search is not the same as a query where exact matches are implicit, both the field and the search terms are manipulated for getting relevant results over exact match.

finalQuery.should(QueryBuilders.matchPhrasePrefixQuery(fieldSpec.getFieldName() + ".delimited", query)
.boost((float) fieldSpec.getBoostScore())
.queryName(fieldSpec.getFieldName())); // less than exact
if (query.contains("\"")) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like kinda opaque functionality.

If the user provides quotes for a field, we automatically hit the keyword index?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.

.queryName(fieldSpec.getFieldName())); // less than exact
if (query.contains("\"")) {
finalQuery.should(QueryBuilders.termQuery(fieldSpec.getFieldName() + ".keyword", query.replaceAll("\"", ""))
.boost(Float.parseFloat((String) PRIMARY_URN_SEARCH_PROPERTIES.get("boostScore")) * EXACT_MATCH_BOOST_FACTOR)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit confused, why are we using PRIMARY_URN_SEARCH_PROPERTIES here?

Also, can we factor out the constants?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err, this should. not be based on urn properties. Fixing.

.queryName(fieldSpec.getFieldName())); // less than exact
if (query.contains("\"")) {
finalQuery.should(QueryBuilders.termQuery(fieldSpec.getFieldName() + ".keyword", query.replaceAll("\"", ""))
.boost(Float.parseFloat((String) PRIMARY_URN_SEARCH_PROPERTIES.get("boostScore")) * EXACT_MATCH_BOOST_FACTOR)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So again expressing what I mentioned above, boosting cases where there are quotations in the query string I don't think will be sufficient for what users expect. They expect that searching for a name of a Table by exact string will find that table, with or without quotations

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove the quote conditional

highlightBuilder.field("urn.delimited");
_defaultQueryFieldNames.stream()
.flatMap(fieldName -> Stream.of(fieldName, fieldName + ".*")).distinct()
.forEach(highlightBuilder::field);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we verify the column name highlighting still works after this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

_defaultQueryFieldNames.forEach(fieldName -> highlightBuilder
.field(fieldName)
.field(fieldName + ".*"));
highlightBuilder.field("urn.delimited");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is removing this safe? Is there history showing why we added this initially?

@@ -288,6 +293,12 @@ private List<MatchedField> extractMatchedFields(@Nonnull Map<String, HighlightFi
highlightedFieldNamesAndValues.get(fieldName.get()).add(fieldValue.string());
}
}
// fallback matched query, non-analyzed field
for (String queryName : hit.getMatchedQueries()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this change? I'm not familiar enough

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For exact match, keyword fields + term queries are not analyzed, therefore they do not return highlighted fields. This is to populate which field is a matching when the field is unanalyzed. Let me see if I can populate the value. The queries I've seen that hit this were only returning the urn, however if the hit does have the matched field, it could be used to fill the value in.

}

@Test
public void testPrefixVsExact() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome tests!

@@ -20,6 +20,7 @@ record CorpGroupInfo {
@Searchable = {
"fieldType": "TEXT_PARTIAL"
"queryByDefault": true,
"enableAutocomplete": true,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this previously enabled and now it is removed? Curious about the reasoning for including this? Or did a customer ask for it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're saying this was enabled previously and removed? I think it is natural to search for the display name of the group. I added this specifically because a customer was running a search and I'd suggested autocomplete for users/groups. I looked to write a test for this (in this pr in fact) and it didn't work because of missing annotations.

@@ -90,6 +91,7 @@ record CorpUserInfo includes CustomProperties {
@Searchable = {
"fieldType": "TEXT_PARTIAL",
"queryByDefault": true,
"enableAutocomplete": true,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this removed? I recall we used to have this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't remove it, so makes sense to add it, especially given that I am suggesting enabling autocomplete for users/groups. Also I added tests for autocomplete on users/groups to ensure it works as expected.

@@ -11,7 +11,10 @@ record CorpGroupKey {
* The URL-encoded name of the AD/LDAP group. Serves as a globally unique identifier within DataHub.
*/
@Searchable = {
"fieldType": "TEXT_PARTIAL"
"fieldType": "TEXT_PARTIAL",
"queryByDefault": true,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks!

oleg-ruban pushed a commit to RChygir/datahub that referenced this pull request Feb 28, 2023
yoonhyejin pushed a commit that referenced this pull request Mar 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Issues and Improvements to docs platform PR-s that make changes to core parts of the platform
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants