Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EQL: changes to strings and regular expressions #61659

Closed
rw-access opened this issue Aug 27, 2020 · 11 comments
Closed

EQL: changes to strings and regular expressions #61659

rw-access opened this issue Aug 27, 2020 · 11 comments
Assignees
Labels
:Analytics/EQL EQL querying Team:QL (Deprecated) Meta label for query languages team team-discuss

Comments

@rw-access
Copy link
Contributor

rw-access commented Aug 27, 2020

Bringing discussion from the 2020-08-27 meeting here

We've talked about changes to the EQL syntax, particularly around quoting (escaping fields: #57194 and #51443). And if we want to make changes to the language, we should do it now while we're experimental.

In EQL, there are four options for quoting literal strings:

  • "using double quotes" which is an escaped string, and requires " to be escaped
  • 'using single quotes' which is an escaped string, and allows for literal "
  • ?"using literal double quotes" which is a raw string, no literal " allowed
  • ?'using literal single quotes' which is a raw string, no literal ' allowed

I did some analysis from our internal repository of EQL rules to breakdown our usage of different types of strings. Almost 98% percent of our strings are in the " form.

type count
" 2389
' 56
?" 21
?' 8

Of those, all usages of ? were used in the match function. There were no other instances of those string types but in regular expressions. If we want to remove three of the four string permutations (keeping "), now seems like right time. Although, I do think it is very convenient to have an option for unescaped strings. Escaping can be cumbersome, and I think an unescaped form of a string would help our users. I don't know if ?" is the ideal form, or if something else is bettter.

For regular expressions, I think that a new syntax could make a lot of sense with our users, and we could better align with Lucene and Javascript syntax, using /.../. I believe a grammar would look like this /([^/\\]|\\/|\\.)*/. This was well received by Intelligence and Analytics team (owners of detection-rules and the EQL rules for Endgame).

If we change the accepted syntax for regular expressions, we need to decide on the appropriate contexts. I think this is the most obvious:

  • match(some.field, /some.regular.expression/)

We could potentially allow these as well, but I can see arguments for and against. I'm not sure where I lean.

  • some.field == /some.regular.expression/ equivalent to the above
  • some.field != /some.regular.expression/ inverse of the above
@rw-access rw-access added team-discuss :Analytics/EQL EQL querying Team:QL (Deprecated) Meta label for query languages team labels Aug 27, 2020
@rw-access rw-access self-assigned this Aug 27, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-ql (:Query Languages/EQL)

@rw-access
Copy link
Contributor Author

Related, I think we should also support \uXXXX and \UXXXXXXXX escapes, like JSON.

@costin
Copy link
Member

costin commented Sep 2, 2020

Ross, thanks for the detailed analysis.

quoting

Considering the numbers, my preference is to keep only double quotes " moving forward.

regexp

Since regular expression declaration occurs only within match function, why not mention the string as a regular expression. Introducing a new syntax for a parameter declaration for a function inside the language seems overkill to me.
We already have the function for matching, some.field == /some.regex/ is just syntactic sugar,

The vast majority of regular expressions do not use " so there should be little difference between match(/some.regex/) or match("some.regex"). Further more by scoping the string to the match function, we have flexibility in the supported format such as Unicode escaping. That requires a separate discussion by itself since regular expressions on variable width encoding is not widely supported; Lucene might be limited only to ASCII ones.

@tsg
Copy link

tsg commented Sep 3, 2020

Regarding quoting, from what I understood, one advantage of standardizing only on " for literal strings is that we free up ' to use for field escaping, instead of the backtick `. This would be consistent with SQL, which might help people that know SQL well.

@rw-access
Copy link
Contributor Author

This doesn't immediately free up ' for quoting field names, though. The reason for that is because we should avoid reinterpreting existing syntax with new semantics. Otherwise currently valid EQL will remain syntactically valid but will work drastically differently.

If we want to reuse ', we should first deprecate it, then after we are confident that it's no longer used, we can reintroduce meaning to ' strings.

So in the near term, this doesn't solve that problem. But I'm also not convinced that ' is a good choice for quoting field names, and haven't heard an argument for choosing ' over ` or vice versa.

@rw-access
Copy link
Contributor Author

If/when we reintroduce ', I personally think the best usage is for literal I escaped strings. I expect this use case to be very common, as doubling up backslashes for file paths gets annoying. Ruby, YAML, and TOML are a few languages that come to mine with those semantics.

If we remove ?" and ?', then we'll have no syntax left to allow for unescaped strings, so we should keep that user experience in mind. We could reintroduce ' with that meaning later, but that would mean there's a gap time frame with no way to express an unescaped string.

@matriv matriv self-assigned this Sep 3, 2020
@costin
Copy link
Member

costin commented Sep 3, 2020

After discussing this issue today, we concluded to move forward with using " for escaped/regular string declarations and remove the rest of the usages.
Since regexp or path comparisons are common this results in strings with \ or / which do require escaping and we should strive in supporting this scenario, that is allow raw/unescaped strings.
@matriv suggested switching single quotes (') with backticks (```) however the two can fairly close and thus might end up confusing folks.
Another proposal is to use multiple double quotes : "" my raw string "", `"" some string with \ , / and " ""` as it doesn't change the chars being used.

@matriv
Copy link
Contributor

matriv commented Sep 10, 2020

It's easy to log a deprecation message for the usage of ' around string literals, and point to the usage of ".
If we plan to use the the single quote ' in the place of the backquote, to mark identifiers (field names), I'd personally suggest (keeping in mind that EQL is currently experimental) to avoid the deprecation step and directly make this breaking change, otherwise I think it's more troublesome to deprecate a usage and then do a breaking change anyways.

@rw-access
Copy link
Contributor Author

We shouldn't change the meaning of existing syntax overnight. I think we need to go to a deprecation path first before we assign new semantics. That way nobody ends up surprised and confusing by breaking changes.

I still think we can avoid that hassle by going with ` over '.

@costin
Copy link
Member

costin commented Sep 10, 2020

Keep in mind that a beta release locks the language, meaning ' will remain deprecated, and backquote will be used for quoting (which SIEM is currently investigating).
Experimental gives us that freedom which looks to impact 2.25% queries based on our own data.
It's not ideal but the alternative is worse.

matriv added a commit to matriv/elasticsearch that referenced this issue Sep 16, 2020
The usage of single quotes to wrap a string literal is forbidden
and an error encouraging the user to user double quotes is returned.

Tests are properly adjusted.

Related to elastic#61659
matriv added a commit that referenced this issue Sep 16, 2020
The usage of single quotes to wrap a string literal is forbidden
and an error encouraging the user to user double quotes is returned.

Tests are properly adjusted.

Relates to #61659
matriv added a commit that referenced this issue Sep 17, 2020
The usage of single quotes to wrap a string literal is forbidden
and an error encouraging the user to user double quotes is returned.

Tests are properly adjusted.

Relates to #61659

(cherry picked from commit 8be400b)
matriv added a commit to matriv/elasticsearch that referenced this issue Sep 17, 2020
Use triple doulbe quotes enclosing a string literal to interpret it
as unescaped, in order to use `?` for marking query params and avoid
user confusion.

Relates to elastic#61659
matriv added a commit to matriv/elasticsearch that referenced this issue Sep 18, 2020
Since `'` is no longer supported for defining string literals,
we replace the backquote [`] with single quote ['] for the definition of
identifiers. The backquote is not a commonly used symbol in languages
and many times, especially with some fonts, can easily be mistaken and
cause confusion.

Relates to elastic#61659
matriv added a commit that referenced this issue Oct 2, 2020
Use triple double quotes enclosing a string literal to interpret it
as unescaped, in order to use `?` for marking query params and avoid
user confusion. `?` also usually implies regex expressions.

Any character inside the `"""` beginning-closing markings is considered
raw and the only thing that is not permitted is the `"""` sequence itself.
If a user wants to use that, needs to resort to the normal `"` string literal
and use proper escaping.

Relates to #61659
matriv added a commit that referenced this issue Oct 2, 2020
…3174)

Use triple double quotes enclosing a string literal to interpret it
as unescaped, in order to use `?` for marking query params and avoid
user confusion. `?` also usually implies regex expressions.

Any character inside the `"""` beginning-closing markings is considered
raw and the only thing that is not permitted is the `"""` sequence itself.
If a user wants to use that, needs to resort to the normal `"` string literal
and use proper escaping.

Relates to #61659

(cherry picked from commit d87c2ca)
@costin
Copy link
Member

costin commented Oct 2, 2020

Superseded by #62652 (in particular #62663, #62645 and #62941).

@costin costin closed this as completed Oct 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/EQL EQL querying Team:QL (Deprecated) Meta label for query languages team team-discuss
Projects
None yet
Development

No branches or pull requests

5 participants