Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EQL: Remove pattern == "wild*card" #62651

Closed
costin opened this issue Sep 18, 2020 · 14 comments
Closed

EQL: Remove pattern == "wild*card" #62651

costin opened this issue Sep 18, 2020 · 14 comments
Labels
:Analytics/EQL EQL querying >enhancement Team:QL (Deprecated) Meta label for query languages team

Comments

@costin
Copy link
Member

costin commented Sep 18, 2020

Currently EQL supports the following pattern for doing wildcard comparisons:
field == "wild*card" which translates to wildcard(field, "wild*card").
That is making a comparison (== or !=) to a string that contains * translates to a wildcard check.

This is convenient however it has a few side-effects:

  1. one cannot make a comparison to a string containing *.
  2. the comparison/wildcard inherits the semantics of equals such as case-insensitivity.
  3. makes it hard to reason whether an expression is string equality or wildcard comparison:
    foo == concat("wild", "*", "card")
  4. for more than one pattern, the pattern ends up being more verbose:
    foo == "wild*" or foo == "*card" vs wildcard(foo, "wild*", "*card")

It would be good to get some numbers to see how often this pattern is being used.
My proposal is to deprecate/remove this syntactic sugar and promote explicit use of wildcard for clear semantics and clarity in intent.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-ql (:Query Languages/EQL)

@elasticmachine elasticmachine added the Team:QL (Deprecated) Meta label for query languages team label Sep 18, 2020
@rw-access
Copy link
Contributor

It would be good to get some numbers to see how often this pattern is being used.

If I were to take a SWAG, I would say approximately 95%+ of queries use this behavior currently.

@rw-access
Copy link
Contributor

rw-access commented Sep 18, 2020

I got some more stats

  • 84% of our production rules use wildcards, averaging 2.5 per rule. There are actually many more, but under the hood in the EQL engine for python, adjacent usages with an or get compressed into one. I don't have numbers for Endgame users or other EQL in the wild.

The ease of wildcards was intentionally baked into EQL from day one, so I think we should approach it very carefully before we make plans to change this behavior. And it could be good to share some context to better understand the expectations and desires of EQL users and rule writers.

When writing detections, you often don't have the full text and know exactly what you're looking for, so wildcards are important. Wildcards are so prevalent that we deliberately made them first class, letting == and != work for wildcard comparisons.

I think it shows success that we've made it this long without needing to express a literal asterisk in a string comparison. All of these concerns have been hypothetical, and that really goes to show the value of and asterisk as a wildcard character. And we aren't the only database/query language to recognize how its importance outweighs the utility of a literal asterisk.

Addressing some of the side-affects you mentioned:

  1. one cannot make a comparison to a string containing *.

There are workarounds to do this, but I will concede that they aren't ideal. One example: some_field in ("string with literal * asterisk")

  1. the comparison/wildcard inherits the semantics of equals such as case-insensitivity.

Agreed, but I'm not sure how this is a unique side-effect

  1. makes it hard to reason whether an expression is string equality or wildcard comparison

String equality is semantically identical to a wildcard comparison but without any wildcard characters, so I don't see an issue here. If the example you provided, foo == concat("wild", "*", "card") is converted to a wildcard under the hood, then I would argue that it's a bug. You could make an argument that it's Undefined Behavior for EQL, but there is no documentation to suggest that this would be a wildcard comparison. All of the examples using wildcards use literal strings. Neither the original documentation and the elasticsearch documentation makes no mention of wildcards for folded values.

  1. for more than one pattern, the pattern ends up being more verbose
    foo == "wild*" or foo == "*card" vs wildcard(foo, "wild*", "*card")

Yes, but this sounds like a feature not a bug.

Outstanding concerns
The biggest problem I have is that this is the worst kind of breaking change: a quiet one.

My proposal is to deprecate/remove this syntactic sugar and promote explicit use of wildcard for clear semantics and clarity in intent.

It's impossible to deprecate this syntactic sugar, without coming up with a new character for quotes. Not to be a broken record, but we can't quietly (aka no messages) change the meaning of existing syntax. Otherwise, users will have zero idea why queries that used to work no longer work.

TL;DR
If I had to sum up this approach in a quick sentence it would be this:
"It's a feature, not a bug."

@astefan
Copy link
Contributor

astefan commented Sep 21, 2020

I agree that this is an important feature @rw-access

The ease of wildcards was intentionally baked into EQL from day one, so I think we should approach it very carefully before we make plans to change this behavior. And it could be good to share some context to better understand the expectations and desires of EQL users and rule writers.

When writing detections, you often don't have the full text and know exactly what you're looking for, so wildcards are important. Wildcards are so prevalent that we deliberately made them first class, letting == and != work for wildcard comparisons.

but I cannot go over the inconsistency it adds to the language. The examples @costin provided show that a single use of the language could translate into two separate behaviors that cannot be reasoned about in a deterministic way. Is the user intention when writing foo == concat("wild", "*", "card") either to be an exact match or a wildcard match? It could be that the user really wants to search for wild*card (exact match), even though the vast majority of other uses aim for the second way of using wildcards and one could argue that we make an exception in this case. But still, the language needs to be consistent and deterministic. Also, just like the other features that we want to evaluate at the eql language, proposed changes could be deprecated in other eql implementations following those products deprecation flow and procedures, but for ES EQL and its experimental current status, these could go in without any notice or warnings.

@matriv
Copy link
Contributor

matriv commented Sep 21, 2020

This is a difficult decision, because unlike replacing backquote with single quote were we have a clear behaviour and an error message, if we make this change users could come up with wrong results. On the other hand, I'm not fan of such syntactic sugar because although it can ease the life of a user in the common case, it can have ambiguous semantics as for example in the case of foo == concat('wild', '*', 'card'). Is it a way to bypass the translation to wildcard() function or is still considered a wildcard? What about foo == ?"wild*card"? What's happening for sequence matching keys? if we have an event/doc with value for key= "wild*card", and another event/doc with key="wildXXXXcard" should they match?

Personally I'd always prefer clear semantics to user convenience. Imho. the == operator is the strongest operation for a filter in a query language, and it shouldn't have any syntactic sugar attached that translates it to something else. If we had to design the language from scratch, I'd rather have an operator like ~= as syntactic sugar, which translates the * to a wildcard query, than changing the semantics of EQUAL based on the the string literal.

Taking into consideration the fact that the change of this behaviour is "silent" and can lead to unexpected results, If we decide to make the change, and I'd vote towards it, let's make it now in the experimental stage, otherwise it will be impossible.

@costin
Copy link
Member Author

costin commented Sep 22, 2020

tl;dr

== is arguably the most important operator in bool clauses. Its semantics and properties are well understood yet EQL breaks them.
I'd like to get agreement that his is a bug and needs changing. If not, how do we resolve the inconsistencies brought by it.

I don't want to get side-tracked by the actual details and timeline (deprecation, compatibility flags, etc...).
That is a separate discussion especially considering the high usage of ==.

To your point on:

If the example you provided, foo == concat("wild", "*", "card") is converted to a wildcard under the hood, then I would argue that it's a bug. You could make an argument that it's Undefined Behavior for EQL, but there is no documentation to suggest that this would be a wildcard comparison.

// string eq
"abc" == concat("ab", "c")  // TRUE 
"abc" == concat("a", "bc")  // TRUE
// match eq
"abc" == "ab*"              // TRUE
// string eq
"abc" == concat("ab", "*")  // FALSE

The above is basic math using an operator association and commutativity on literals.

With fields it becomes even more confusing:

// string eq
field == "abc"              // TRUE
field == concat("ab", "c")  // TRUE

// match eq
field == "abc*"             // TRUE
// string eq
field == concat("ab", "c*") // FALSE 

Which end up breaking transitivity:

// assume field == "ab*"
"ab" == "ab*"               // TRUE
"ab*" == field              // TRUE
"ab" == field               //FALSE

This is a big deal. Essentially == is not equals.
So why reuse the operator in the first place if it has different semantics?

Proposal

Introduce a separate operator, say ~= or e= (eql equals) that takes over the current == semantics, such as identifying the wildcard pattern and being case insensitive plus whatever other semantic we want to add moving forward.
This frees the == operator to mean exact equality and nothing else, just like the rest of the operators.

Pros:

  • == means equals just like everybody expects. No gotchas, footnotes or trappy behavior
  • == is aligned with the rest of the operators including >=, <=
  • the custom EQL semantics are captured by a custom operator

Cons:

  • breaking compatibility with existing queries. Requires 80-90% rewrites

@rw-access
Copy link
Contributor

I surveyed the primary users (@elastic/security-intelligence-analytics)of EQL, and it was unanimous (12/12) that == "wild*card" is a feature and not a bug.

This was a deliberate design decision and is one that is expected and desired. Maybe the == operator isn't interpreted as literally as you're proposing. But there appears to be no cognitive dissonance by the people that are already using the I didn't ask ask for further analysis.

We can talk more about this offline, but I think == "some*wildcard" should be here to stay.

@costin
Copy link
Member Author

costin commented Sep 23, 2020

A few more examples of == breaking down:

"a"  > "A"   // TRUE
"a" == "A"   // TRUE
"a" <= "A"   // FALSE
"a" in ("A")   // FALSE
"ab" == "a*"   // TRUE
"ab" <= "a*"   // FALSE
"ab" in ("a*") // FALSE

@jpountz
Copy link
Contributor

jpountz commented Sep 23, 2020

I find the semantics of == to be surprising too. While I understand how important wildcard matching is for users of EQL, I share Costin's concerns around giving == semantics that do not meet the requirements of an equivalence relation and are also very different from major query languages.

@tsg
Copy link

tsg commented Sep 23, 2020

To me, this is a hard decision. On one hand, I agree that == in the current form is surprising in the examples that Costin gave. If we release it like that in 7.10, we're stuck with it and might become a source of constant problems, so I totally understand the concern.

On the other hand, if we do this change and then all our example rules in the docs use ~= and we need to tell people "make sure you always use ~= instead of == or you might miss events", I think we made the EQL usability for security rules worse and we're affecting its chances for adoption. We don't expect, say, <= to be used often with strings, so while the semantics will be confusing, it might not matter all that much in practice. I'm also not sure if going as close as possible to SQL is what is best for us, in the end, we started a new language because the existing ones weren't a great fit.

So I, for one, don't have a clear opinion yet. I'm trying to speak to more people to understand how offputting having to always use ~= would be to security users.

@mark-dufresne
Copy link

mark-dufresne commented Sep 23, 2020

I'll echo @rw-access's comments, speaking for the security user. Elastic's Security Solution is better if we leave this pattern in place.

Wildcards are extensively employed by security users. I'm unaware of users disapproving of the behavior in place today. They've appreciated the simplicity, readability, and ease of use in many EQL features, including our treatment of wildcards. Echo'ing @tsg, these usability factors with EQL and Detection Rules directly support one of the Solution's primary differentiators. We did it this way for a good reason. I'd expect but don't want to assume that the other Solutions would agree.

For what it's worth, Splunk does it like we do today (@rw-access edit: for both wildcards and case-insensitivity by default. Also, Microsoft KQL has this behavior for wildcards, although their usage is limited to prefix queries). They allow users to search for asterisks via character escaping. This implies they came to a similar conclusion about what users expect and want.

@rw-access
Copy link
Contributor

rw-access commented Sep 23, 2020

There are some bugs here:

A few more examples of == breaking down:

"a"  > "A"   // TRUE --> this isn't ==, so irrelevant. this is more about case sensitivity than equality
"a" == "A"   // TRUE
"a" <= "A"   // FALSE --> we just need to define whether this operator was case-sensitive or insensitive. my impression was that this operator is always sensitive. 
"a" in ("A")   // FALSE --> this is a bug. should be TRUE.
"ab" == "a*"   // TRUE
"ab" <= "a*"   // FALSE --> isn't == so wildcards aren't expanded. behaves differently and in a well defined way.
"ab" in ("a*") // FALSE --> correct, but we could resolve this inconsistency and make this TRUE

All string comparisons for Python EQL (and other Elastic Endgame implementations) are case-insensitive. This == functions, in, startsWith and other comparison functions.

Originally, range queries (<, <=, >=, >) were case-sensitive (feature or bug, up for debate), but at one point, we said in a sync that this exception was a bug, so #56771 and endgameinc/eql#30 were opened. Looks like we've double backed, but that's fine.

I don't see that as an issue, just a matter of defining behavior. Comparing strings lexicographically is a pretty rare, but does exist. I don't think we need to overextend for the one use case.

@costin
Copy link
Member Author

costin commented Sep 23, 2020

First off, thanks for engaging folks!

I recognize the security focus of EQL however the vision for Elasticsearch EQL is to grow beyond that and cater to any user of Elasticsearch.

@mark-dufresne
I understand that wildcards and case insensitive string comparison are extremely common and a core feature. My intent is not to remove this functionality rather make it consistent with the rest of the languages and semantics in Elasticsearch.

For what it's worth, Splunk does it like we do today (@rw-access edit: for both wildcards and case-insensitivity by default.

I'm not too familiar with Splunk; my understanding is the context/command/pipe defines the semantics. The wildcard you linked to is applied only inside a search context while it is unsupported in where. Same seems to apply for operators.

Microsoft KQL has this behavior for wildcards

Unless I'm looking at the wrong language, the documentation indicates == to be exact and case sensitive.
In fact for case insensitive comparison one should use =~ and !~. Which is what I'm arguing for.

@rw-access

"a"  > "A"   // TRUE --> this isn't ==, so irrelevant.
"a" == "A"   // TRUE

The > and == are crucial: two invariants that contradict each other.

"a" <= "A"   // FALSE --> we just need to define whether this operator was case-sensitive or insensitive. my >impression was that this operator is always sensitive. 

<= is already defined as < or ==. Not in EQL:

"a" < "A" or "a" == "A" // TRUE 
"a" <= "A" // FALSE

Same for wildcards:

"ab" <= "a*"   // FALSE
--> isn't == so wildcards aren't expanded. behaves differently and in a well defined way.

<= not being equivalent to < or = is a bug.

"ab" in ("a*") // FALSE --> correct, but we could resolve this inconsistency and make this TRUE

I've used in on purpose since you pointed to it above due to its "exact" nature, as a workaround for comparing strings containing *:

There are workarounds to do this, but I will concede that they aren't ideal. One example: some_field in ("string with literal * asterisk")

Another example of unexpected behavior that surprises even power users.

but at one point, we said in a sync that this exception was a bug, so #56771 and endgameinc/eql#30 were opened. Looks like we've double backed, but that's fine.

This is inaccurate.
The == semantics were always an issue hence why we discussed them at length and we still do since invariants get broken. It's the reason why the case_sensitive parameter was added which is now being removed.
Case sensitive comparisons were never implemented in Elasticsearch EQL; further more we agreed to keep them as such - see #61883 and #62255.
I'm not sure what makes you think otherwise...

Moving forward

Folks, inside QL we've given quite a lot of thought to this topic, starting with #54411 in March. Yet despite our best efforts, so far we haven't found a solid solution.

What's clear to me is EQL should out of the box provide case-insensitive equality check with wildcard support.

Hence why I propose the following:

  1. Introduce a separate operator that offers this semantics. Whether it's ~=/~! or =~/!~ remains to be decided. This will perform string case-insensitive comparison and support * expansion.
  2. Make == and != be exact, meaning case-sensitive and unaware of any wildcard.
  3. Introduce a compatibility flag (name to be decided) that will make equal/non-equal an alias to the operator in step 1. The purpose of this flag is to allow existing rules and users to migrate to the new operator.

Pro:

  • ==, >=, <= etc work consistently and as expected, no invariants are broken.
  • =~/!~ are independent from the other operators and do not combine with them. However if we desire, we have the option of introducing others (e.g. >~).
  • existing users can easily migrate to Elasticsearch

Cons:

  • existing users and queries have to be changed. However the compatibility flag & co (deprecation messages + query analysis, etc..) can help with the transition.
"a"  > "A" // TRUE
"a" >= "A" // TRUE
"a" == "A" // FALSE

"a" =~ "A" // TRUE

"ab" == "aB" // FALSE
"ab" == "a*" // FALSE
"ab" =~ "a*" // TRUE
"ab" =~ "aB" // TRUE

Let me know you think.

@costin
Copy link
Member Author

costin commented Sep 28, 2020

Thanks everyone for the discussion.
After further discussing the ticket offline, it was decided to go ahead and make EQL case sensitive/exact by default and move the case-insensitive equality and wildcard pattern matching to a separate operator.
See #62949 for more information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/EQL EQL querying >enhancement Team:QL (Deprecated) Meta label for query languages team
Projects
None yet
Development

No branches or pull requests

8 participants