Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EQL: Replace ` (back-tick) with ' (single quote) #62663

Closed
costin opened this issue Sep 18, 2020 · 10 comments
Closed

EQL: Replace ` (back-tick) with ' (single quote) #62663

costin opened this issue Sep 18, 2020 · 10 comments

Comments

@costin
Copy link
Member

costin commented Sep 18, 2020

Currently ` is used for quoting identifiers since ' was used for string declaration. Since #62458, ' has been freed (is currently deprecated) so we could replace ` with '.

This is a breaking change however considering only 51 or 2.25% of our own Python EQL rules are affected per (#61659), the impact of this change is insignificant.
It thus makes sense for Elasticsearch EQL to do it now before entering beta in 7.10 otherwise ' cannot be repurposed afterwards.

Initial discussions took place in this thread #61659 (comment) and #62638 (review)

Further more, if there are concerns on the Python EQL front (which operators under completely different constrains, has a different release cycle, versions, etc..) why not proceed with a release that deprecated/disables the ' before changing their meaning?

@rw-access
Copy link
Contributor

rw-access commented Sep 18, 2020

Since #62458, ' has been freed (is currently deprecated) so we could replace ` with '.

I think we need to continue to consider both Endgame users and Elasticsearch users. This hasn't been gone through a single release in either as deprecated feature. We deprecated it in a PR a few weeks ago, and are already trying to reuse it. From the user's POV, they'll see inconsistent behavior between Endgame and Elasticsearch, but also between Elasticsearch 7.9 and 7.10. If we leave it deprecated in 7.10, then we can later reuse it.

But I still don't think ' is a good choice for escaped identifiers:

I think this issue assumes agreement that ' is desirable, but I still don't understand why ' is more desirable than ` for identifiers. In most languages (programming, query, serialization) that I've encountered, ' is used for a single character literal or for a full string. SQL seems to be exception more than the norm, but even then the usage of ' and " is flipped.

I think we haven't had too many use cases of identifiers that don't match our existing regex: (LETTER | '_' | '@') (LETTER | DIGIT | '_')*. Do we have a guess at what those edge cases are? I think the biggest offender is the hyphen -. If we don't have pressing issues, why not leave the option open to decide more carefully for 7.11?

Since ' seems to have a very strong association with strings in many languages, I think the character ` is more appropriate for defining a field name with special characters. ` has less implicit associations and brings less baggage to EQL.

I've heard it said in these discussions that ` and ' have been confused for each other. But I struggle to see how this is the case for people writing a query in a query language.

I think we should go with the most intuitive decision for past, present, and future users of EQL. In my mind ' is an unintuitive choice for escaping field names. And with the prevalence of Slack and Markdown, the distinction between ` and ' will only be understood more.

@astefan
Copy link
Contributor

astefan commented Sep 21, 2020

Backtick is not a widely used symbol and definitely not more common than '. If this one (`) is available and I have to choose between it and ' I'll always choose '. Regarding Python EQL and already existent users outside ES EQL, a deprecation path in those products release cycle could make the backtick - single quote change. ES EQL is still in experimental phase and I think we should take advantage of this situation and not prolong its switch to Beta status unnecessarily. If we have a set of good features and a stable grammar in place, a Beta status could mean more interested users in taking ES EQL for a spin and provide useful feedback.

@matriv
Copy link
Contributor

matriv commented Sep 21, 2020

Let me add some more thoughts as well.

  • The single quote ' is currently used rarely to identify strings
  • If a user issues a query with ' around a string literal, e.g.: process where a = 'foo' then the foo will be parsed as a identifier and:
    • if foo is a valid identifier, an exception will be thrown since we cannot have comparisons with identifiers both to the left and to the right.
    • if foo is an invalid identifier, an exception will be thrown stating this.
      This doesn't seem as a dangerous behaviour since it doesn't lead to a correct query with wrong behaviour.

It seems quite inconvenient for the user to receive a message mentioning a wrong identifier for something that until now it was considered a string literal and I have a preference for the more clear 2 step approach (1. error message for single quotes around string, 2. replacement of backquotes with single quote for identifiers), but based on @astefan 's comment
If we lock ourselves in the situation that we have to wait replacing the backquote with the single quote for identifiers until next major release, I'd rather proceed with this breaking change now, since it's safe enough (no wrong results for deprecated syntax) and end up with a more clear grammar.

@rw-access
Copy link
Contributor

rw-access commented Sep 21, 2020

My biggest issue is that people have almost universally come to expect ' to mean strings or characters.
But people don't have expectations of the `, and its meaning is varied. ` usually has something to do with strings, like string interpolation or string escaping, if it's used at all.

If we went forward with ', then I think we have to justify why we are deliberately making what seems like an unintuitive choice. Even though the numbers for usage of ' for strings with Endgame is low (both by employees and by users), I think it's interesting that both were used, even in contexts where ' provided no utility over ". Did users read the syntax guide and realize that both were acceptable? Or did some pick ' out of habit? My point: it seems like that preference or association between ' and strings is already there.

We can stick with our removal of ' to remove options and direct users to a single syntax, but I don't think we should go against the grain here. We should always make these grammar/syntax decisions based on what's desirable, not just what's available.

@rw-access
Copy link
Contributor

rw-access commented Sep 21, 2020

We also have more options than just ' and `.

One more option that comes to mind is using backslashes to escape characters. This \<escape> would just be an extension of our existing syntax for fields: field\-with\-hyphens.

If you do that, it's crystal clear what is a string and what is a field:

  • If it's quoted at all, it's a string. If it's unquoted, it's a field.

We could extend this syntax:

// make @timestamp not require escaping, since @ has no other meaning
IDENTIFIER
: (LETTER | '_' | '@') (LETTER | DIGIT | '_')*
;

fragment IDENTIFIER_ESCAPE
    // prohibit unnecessary alpha escapes, so we can later extend these (such as \uXXXX and \UXXXXXXXX)
    : [\\] ([^A-Za-z\b\t\n\f\r]|[btnfr])
    ;

 IDENTIFIER 
     : (LETTER | '_' | '@' | IDENTIFIER_ESCAPE) (LETTER | DIGIT | '_' | IDENTIFIER_ESCAPE)* 
     ; 

@costin
Copy link
Member Author

costin commented Sep 21, 2020

between Elasticsearch 7.9 and 7.10. If we leave it deprecated in 7.10, then we can later reuse it.

That is incorrect. A beta means the feature set is locked but the implementation or some minor features not fully implemented. For the whole duration of the 7.x line.

If ` backticks are used for field identifiers in a beta, they should be available in the GA as well.
If ' are deprecated in the beta, we cannot add them with a different meaning in GA.

The whole reason EQL in Elasticsearch 7.9 is experimental is to allow such breaking changes to occur.
If Elasticsearch EQL is experimental, the whole stack on top of it will be experimental also.

In most languages (programming, query, serialization) that I've encountered, ' is used for a single character literal or for a full string. SQL seems to be exception more than the norm...

I don't follow...
The vast majority of programming languages have strict rules for identifiers and thus do not require quoting in the first place.

Further more in practice, in EQL " is used heavily for string declaration not '; are you suggesting we use " for identifiers and ' for strings?

I've heard it said in these discussions that ` and ' have been confused for each other. But I struggle to see how this is the case for people writing a query in a query language.

User experience? Back-ticks are rarely used in declarative/query languages as oppose to single quotes which are. We have already experienced this internally where folks on the team used single quotes instead of backticks to define a field name in an EQL query which obviously failed.

SQL might be quirky but it's the most popular declarative language and serves as a good analogy on when to use one over the other. Based on the Markdown/Slack conventions (which are markup* languages and have no notion of evaluation, identifiers, etc...), I would expect '`' to be used for raw string declaration (don't escape anything) as oppose to quoting identifiers.

Since quoting identifiers is a best practice and virtually all our queries should use, picking a known character is desirable over a less known one that is visually similar.

One more option that comes to mind is using backslashes to escape characters. This <escape> would just be an extension of our existing syntax for fields: field-with-hyphens.

That's a step backwards since instead of just quoting a field regardless of its value, one must now escape each problematic character in the name.

'this.is.a.subfield.of.an.object' becomes this\.is\.a\.subfield\.of\.an\.object

@rw-access
Copy link
Contributor

rw-access commented Sep 21, 2020

We would maintain our . handling, so that last change would be unnecessary. this.is.a.subfield.of.an.object is still supported.

This wouldn't be breaking at all, but would merely extend the regex for identifier matching. And the "quotes mean strings" approach will make our lives and our user's lives easier.

@costin
Copy link
Member Author

costin commented Sep 22, 2020

Pick a different special character, say ' '
'this is a special field' vs this\ a\ special\ field. Not only it will not work but it's unreadable.
As a user I don't know where if that's a regex or escaped field name and if so where does it start.

Further more regex in parser are discouraged due to their greedy nature which leads to ambiguity.

And the "quotes mean strings" approach will make our lives and our user's lives easier.

They already do.. Not sure what you mean.

@rw-access
Copy link
Contributor

rw-access commented Sep 22, 2020

By quotes mean strings, I'm referring to the existing bias in all sorts of languages for quote characters (both ' and ") to be used for string literals. But we're going against that bias/expectation if we use ' for fields instead. Meanwhile, ` doesn't have that problem, because it's biggest association (if any) is "inline code". And my reason for mentioning Markdown and Slack earlier was to illustrate how the usage of ` is only becoming more widespread and that ` must be easy enough to distinguish from ' if so many people already use it.

@costin
Copy link
Member Author

costin commented Sep 28, 2020

After discussing this topic, we concluded to:

  • keep using ` (backticks) for quoting identifiers
  • not use ' (single quote) for quoting identifiers. This means ' remains currently unused in the grammar.

Moving forward, the usage ` inside a quoted identifier needs to be checked. That is, what happens when using quoting an identifier that contains backticks...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants