
Grok: Maybe allow defining the type *in* the pattern definition #1859

Closed

jordansissel opened this issue Oct 7, 2014 · 7 comments

@jordansissel
Contributor

Problem: Many users write %{NUMBER:bytes} in grok and are then confused when Elasticsearch won't run statistics or other numeric aggregations on the field. The cause is that grok captures strings by default, so Elasticsearch is sent a string and maps 'bytes' as a string - and this is confusing.

I'm tired of users tripping over this problem. I would be willing to add a feature to grok that allowed you to define the 'type' of a pattern inside the pattern definition.

Background: In a grok patterns file, you can define a pattern with NAME PATTERN syntax (name of pattern, space, the regexp pattern).

Proposal: Allow the type to accompany the NAME.

By way of example, if we were to fix this NUMBER problem permanently, we would define the new pattern like this:

NUMBER:float (%{BASE10NUM})

The new syntax is NAME:TYPE REGEXP and is backwards-compatible with the old syntax (the :TYPE is optional and defaults to string if not provided).

This would let us define patterns with their respective types, so that %{NUMBER:bytes} is captured as a numeric type in Elasticsearch.
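
To make the backwards compatibility concrete, a patterns file under this proposal could mix old and new definitions. This is a sketch only - nothing here is implemented, and the POSINT:int line is just extra illustration:

```
# Existing syntax, unchanged; captured values stay strings:
WORD \b\w+\b

# Proposed NAME:TYPE REGEXP syntax; captures would be coerced to the declared type:
NUMBER:float (%{BASE10NUM})
POSINT:int \b(?:[1-9][0-9]*)\b
```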

It's not clear this will solve everything, though: in some cases, like 'bytes', the value is never fractional, so users writing %{NUMBER:bytes} and getting a float may still be confused because they wanted a long type in Elasticsearch.
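
For comparison, the case-by-case workarounds today are per-capture coercion in the grok filter and mutate's convert. A rough sketch, with field names only for illustration:

```
filter {
  grok {
    # ':int' coerces the captured value at match time (':float' also works)
    match => [ "message", "%{NUMBER:bytes:int}" ]
  }

  # Or convert after the fact (either approach alone is enough):
  # mutate {
  #   convert => [ "bytes", "integer" ]
  # }
}
```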

Thoughts?

@untergeek
Member

Can it be done? Yes.
Will it be worth doing? That's hard to say.

If adding this functionality sacrifices a substantial amount of performance to obtain the desired simplicity, I am not in favor of it. That time is better spent writing thorough docs, examples, and how-to guides that explain how to get the desired outcome and why it works that way. In fact, I will take this as a blog post suggestion for the Elasticsearch site right now, while I have some time.

If adding this functionality can be done without too much cost in performance, then it might be worth pursuing. The next cost/benefit thought is how much complexity this might add to testing and the code base for grok.

@suyograo
Contributor

suyograo commented Oct 7, 2014

+1 on this, although we would need to measure the performance of coercing everything to a type (since the user may never use it). This may actually be OK given the performance improvements we made in the grok library last month.

There are cases like %{NUMBER:statuscode}, which is used to match the status code in Apache logs. You actually do not want to convert this to a number, because you want to run a terms aggregation on it. It seems like we need an override in such cases.

Another common example is %{IPV4:ip}. The user expects the IPV4 pattern to map to the corresponding Elasticsearch mapping type at extraction time, which is a reasonable assumption. It would be good to allow for this.
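
Right now the user has to wire that up by hand in an index template, something along these lines (a sketch in 1.x syntax; the template name and field name are only illustrative):

```
PUT /_template/logstash-ip-sketch
{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "properties": {
        "ip": { "type": "ip" }
      }
    }
  }
}
```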

@jordansissel
Contributor Author

Another common example is %{IPV4:ip}. User is expecting to map IPV4 pattern to elasticsearch

Indeed! I forget if I filed a ticket, but this (IP addresses) is related to a recent idea to let Logstash act as the dynamic-mapping layer for logstash indices in Elasticsearch (instead of using Elasticsearch's dynamic mapping), or to improve dynamic mapping in ES so that Logstash can use it correctly for IP addresses (a better template? something...)

@untergeek
Member

We could try to turn on automatic IP field detection in the default template. I don't know how much slowness that might introduce, but if minimal it would eliminate the need for IP-type coercion.

@avleen
Contributor

avleen commented Oct 9, 2014

@suyograo We set the Apache response code to a long in our mapping template and can still run a terms aggregation on it. We have other fields mapped as short, and the terms agg still works on those. Why wouldn't it?
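
Something like this works fine against a numeric field for us (sketch; index and field names are illustrative):

```
GET /logstash-*/_search
{
  "size": 0,
  "aggs": {
    "status_codes": {
      "terms": { "field": "response" }
    }
  }
}
```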

@untergeek
Member

@avleen Since HTTP response codes map to key/value pairs, e.g. "404" = "not found", it doesn't make sense to do mathematical operations on key names like this. Storing this as a long seems like wasted space, since storing millions of long values costs more than storing shorts (it's too bad the values go above 255, or byte would suffice). You should always store numbers in the smallest suitable type.
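
In the mapping template that's just (sketch; field name illustrative):

```
"properties": {
  "response": { "type": "short" }
}
```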

Even so, I store this "number" as a string in my own setup.

@jordansissel
Contributor Author

For Logstash 1.5.0, we've moved all plugins to individual repositories, so I have moved this issue to logstash-plugins/logstash-filter-grok#26. Let's continue the discussion there! :)
