
Grok: Maybe allow defining the type *in* the pattern definition #1859

Closed

jordansissel opened this issue Oct 7, 2014 · 7 comments

@jordansissel
Contributor

Problem: Many users write %{NUMBER:bytes} in grok and are then confused when Elasticsearch won't run statistics or other numeric aggregations on the field. The cause is that grok captures strings by default, so Elasticsearch is sent a string and maps 'bytes' as a string - and this is confusing.

I'm tired of users tripping over this problem. I would be willing to add a feature to grok that allowed you to define the 'type' of a pattern inside the pattern definition.

Background: In a grok patterns file, you can define a pattern with NAME PATTERN syntax (name of pattern, space, the regexp pattern).

Proposal: Allow the type to accompany the NAME.

By way of example, if we were to fix this NUMBER problem permanently, we would define the new pattern like this:

NUMBER:float (%{BASE10NUM})

The new syntax is NAME:TYPE REGEXP and is backwards-compatible with the old syntax (the :TYPE is optional and defaults to string if not provided).

This would let us define patterns with their respective types, so that %{NUMBER:bytes} is captured as a numeric type in Elasticsearch.
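
To make the backwards compatibility concrete, a patterns file under this proposal could mix old and new definitions. This is a sketch only - nothing here is implemented, and the POSINT:int line is just extra illustration:

```
# Existing syntax, unchanged; captured values stay strings:
WORD \b\w+\b

# Proposed NAME:TYPE REGEXP syntax; captures would be coerced to the declared type:
NUMBER:float (%{BASE10NUM})
POSINT:int \b(?:[1-9][0-9]*)\b
```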

It's not clear this will solve everything, though: in some cases, like 'bytes', the value is never fractional, so users writing %{NUMBER:bytes} and getting a float may still be confused because they wanted a long type in Elasticsearch.
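
For comparison, the case-by-case workarounds today are per-capture coercion in the grok filter and mutate's convert. A rough sketch, with field names only for illustration:

```
filter {
  grok {
    # ':int' coerces the captured value at match time (':float' also works)
    match => [ "message", "%{NUMBER:bytes:int}" ]
  }

  # Or convert after the fact (either approach alone is enough):
  # mutate {
  #   convert => [ "bytes", "integer" ]
  # }
}
```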

Thoughts?

@untergeek
Member

Can it be done? Yes.
Will it be worth doing? That's hard to say.

If adding this functionality sacrifices a substantial amount of performance to obtain the desired simplicity, I am not in favor of it. That time is better spent writing thorough docs, examples, and how-to guides that explain how to get the desired outcome and why it works that way. In fact, I will take this as a blog post suggestion for the Elasticsearch site right now, while I have some time.

If adding this functionality can be done without too much cost in performance, then it might be worth pursuing. The next cost/benefit thought is how much complexity this might add to testing and the code base for grok.

@suyograo
Contributor

suyograo commented Oct 7, 2014

+1 on this, although we would need to measure the performance of coercing everything to a type (since the user may never use it). This may actually be OK given the performance improvements we made in the grok library last month.

There are cases like %{NUMBER:statuscode}, which is used to match the status code in Apache logs. You actually do not want to convert this to a number, because you want to run a terms aggregation on it. It seems like we need an override in such cases.

Another common example is %{IPV4:ip}. The user expects the IPV4 pattern to map to the corresponding Elasticsearch mapping type at extraction time, which is a reasonable assumption. It would be good to allow for this.
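
Right now the user has to wire that up by hand in an index template, something along these lines (a sketch in 1.x syntax; the template name and field name are only illustrative):

```
PUT /_template/logstash-ip-sketch
{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "properties": {
        "ip": { "type": "ip" }
      }
    }
  }
}
```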

@jordansissel
Contributor Author

Another common example is %{IPV4:ip}. User is expecting to map IPV4 pattern to elasticsearch

Indeed! I forget if I filed a ticket, but this (IP addresses) is related to a recent idea to let Logstash act as the dynamic-mapping layer for logstash indices in Elasticsearch (instead of using Elasticsearch's dynamic mapping), or to improve dynamic mapping in ES so that Logstash can use it correctly for IP addresses (a better template? something...)

@untergeek
Member

We could try to turn on automatic IP field detection in the default template. I don't know how much slowness that might introduce, but if minimal it would eliminate the need for IP-type coercion.

@avleen
Contributor

avleen commented Oct 9, 2014

@suyograo We set the Apache response code to a long in our mapping template and can still run a terms aggregation on it. We have other fields mapped as short, and the terms agg still works on those. Why wouldn't it?
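
Something like this works fine against a numeric field for us (sketch; index and field names are illustrative):

```
GET /logstash-*/_search
{
  "size": 0,
  "aggs": {
    "status_codes": {
      "terms": { "field": "response" }
    }
  }
}
```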

@untergeek
Member

@avleen Since HTTP response codes map to key/value pairs, e.g. "404" = "not found", it doesn't make sense to do mathematical operations on key names like this. Storing this as a long seems like wasted space, since storing millions of long values costs more than storing shorts (it's too bad the values go above 255, or byte would suffice). You should always store numbers in the smallest suitable type.
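
In the mapping template that's just (sketch; field name illustrative):

```
"properties": {
  "response": { "type": "short" }
}
```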

Even so, I store this "number" as a string in my own setup.

@jordansissel
Contributor Author

For Logstash 1.5.0, we've moved all plugins to individual repositories, so I have moved this issue to logstash-plugins/logstash-filter-grok#26. Let's continue the discussion there! :)
