Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: Use exact attributes for data source extraction #99874

Merged
merged 2 commits into from Sep 27, 2023

Conversation

luigidellaquila
Copy link
Contributor

Fixes #99183

ValueSources will now use exact attributes (ie. exact subfields) when available.
This will make the extraction faster (eg. using a KEYWORD subfield instead of loading a TEXT value from _source) and will avoid failures in case the original field is not available at all, eg. in case of synthetic source.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

@elasticsearchmachine
Copy link
Collaborator

Hi @luigidellaquila, I've created a changelog YAML for you.

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@astefan
Copy link
Contributor

astefan commented Sep 25, 2023

The PR looks good, but:

  • it is not clear to me what is the range of use cases this change should cover. I see you have modified a single test where the field starts to be visible in the results in case the source in synthetic. If the range of use cases is more broad, please add more tests.
  • the second point is probably similar to the first one. From the test you changed, if job has two exact sub-fields this query from test | where job like \"P*\" will behave differently. If there is a single exact field then the query works and returns something that makes sense and no warnings. If there are two exact sub-fields then the query doesn't return anything and the warning message (in headers) says: "Field [job] cannot be retrieved, it is unsupported or not indexed; returning null". I don't fully agree with this message because the fact that there are two sub-fields it doesn't make this field unsupported, just that the user needs to be more specific on which sub-field the query should use. At that point, if the information is more accurate in the warning message, the user can change the query to from test | where job.raw1 like \"P*\" and the query succeeds at that point.

@astefan astefan self-requested a review September 25, 2023 15:19
@luigidellaquila
Copy link
Contributor Author

Thank you for checking it @astefan

it is not clear to me what is the range of use cases this change should cover. I see you have modified a single test where the field starts to be visible in the results in case the source in synthetic. If the range of use cases is more broad, please add more tests.

The problem is very specific to the situation with synthetic source, a text field and a keyword subfield. All the other cases are not supported yet (extracting from source types different from text) or already managed (no exact subfield, that just returns null).

if job has two exact sub-fields this query from test | where job like "P*" will behave differently. [...] I don't fully agree with this message because the fact that there are two sub-fields it doesn't make this field unsupported, just that the user needs to be more specific on which sub-field the query should use

There is a generic problem with exact fields in QL: EsField.getExactInfo() only tells us if there is an exact subfield, but it takes into account only a "formal exactness", ie. that the subfield is KEYWORD and that there is only one, and does not consider other factors (eg. the fact that the subfield could be truncated or not represent the original value exactly). EsField.getExactInfo/field() is used in a few other places (eg. in the optimizer), so this specific problem is not strictly related to this PR, that is using the available API correctly; it's a problem of the API itself, that needs to be fixed in QL, and it doesn't seem trivial, so it definitely deserves separate investigation and a dedicated PR.

Even if I try to extract information from EsField.getProperties(), I'll still not have enough information to give the user correct suggestions. None of the subfields could be a valid one if all of them have a different content than the original field; a query like from test | where job.raw1 like \"P*\" could have a completely different meaning from the original one; the original value of job could be completely lost (ie. not indexed as the message says) and there could just be no way to query it.

I still have doubts about that message "Field [job] cannot be retrieved, it is unsupported or not indexed; returning null", there are so many corner cases that can fail, that it is unsupported or not indexed seems a bit guessing. We could decide to make it just "Field [job] cannot be retrieved; returning null", but it risks to be even less informative for the end user in the most common cases.

@astefan
Copy link
Contributor

astefan commented Sep 26, 2023

so it definitely deserves separate investigation and a dedicated PR.

agreed

I know this scenario is very open ended and I hope we can discuss this in a wider audience.
There is another aspect here: the information about the exact variants... should be a warning or an error message. No need to answer now, just something more to discuss.

Please, open an issue about improving the way we deal with exact variants of non-exact fields, plus the warning vs. error message aspect and link it to this PR.

@luigidellaquila
Copy link
Contributor Author

Done #99899

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@luigidellaquila luigidellaquila merged commit be29666 into elastic:main Sep 27, 2023
12 checks passed
piergm pushed a commit to piergm/elasticsearch that referenced this pull request Oct 2, 2023
@luigidellaquila luigidellaquila deleted the esql/fix_99183 branch January 30, 2024 09:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >bug Team:QL (Deprecated) Meta label for query languages team v8.11.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: Error when loading text fields from indexes with synthetic source
4 participants