Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fieldref StringInterpolation and Fixnum/Float values #5114

Open
ph opened this issue Apr 13, 2016 · 19 comments
Open

Fieldref StringInterpolation and Fixnum/Float values #5114

ph opened this issue Apr 13, 2016 · 19 comments

Comments

@ph
Copy link
Contributor

ph commented Apr 13, 2016

In 2.3 we have introduced a change to deal with reference values when applying string interpolation for fieldref (see #4592). This change broke backward compatibility with existing configuration that were using the add_field function to copy Fixnum and Float values (#4961) into new fields.

Before 2.3, the string interpolation wasn't doing pessimistic conversion of the value into a string and was keeping the original type.

Example:

e = LogStash::Event.new({ "data" => 0.9)

# LS < 2.3
e.sprintf("%{data}") => 0.9

# LS > 2.3
e.sprintf("%{data}") => "0.9"

Since we are explicitly using quoted string to use the fieldref reference, I think its fair to expect it to behave like the Kernel.sprintf method and always return a string. (The underlying method is actually called sprintf)

I think for 5.0 we should make sprintf always return a string and this is already the case in the current Java Event implementation. The downside of this, people will actually need to manually convert the value using the mutate filter if they really need a specific type.

mutate
       {
         convert => { "src_ip_coords" => "float" }
       } 

In #4961 @colinsurprenant also suggested to support non-quoted string to keep the original value, this could be something to investigate.

@ph
Copy link
Contributor Author

ph commented Apr 13, 2016

@colinsurprenant @arizonawayfarer I was a bit curious and I did a bit more investigation, this is really a regression but not from the version we were thinking, it was in the 1.5.X series when I refactored the string interpolation.

1.4.2 was returning a string.
1.5.X was returning a float.
2.2.X was returning a float
2.3.0 was returning a string.

So for me the correct behavior is we should always return a string, so I dont think we should revert that change. WDYT?

@arizonawayfarer
Copy link

From a performance standpoint, wouldn't forcing users to explicitly convert strings back to their original datatype add unnecessary processing overhead? Why not just keep the original datatype, unless explicitly converted to a string using a mutate block??

Converting a float to a string and then back to a float seems silly, in my opinion. Although, I have to admit that I'm getting confused again. I don't know what the goal of converting everything to a string is/was.

@colinsurprenant
Copy link
Contributor

I think that from config standpoint we should support both the forced-string and original-value-type. The "%{[foo][bar]}" syntax should always return a string IMO, but we should also support a syntax to avoid post type conversion, as noted by @arizonawayfarer, using maybe %{[foo][bar]} (no quotes) or just a fieldref [foo][bar] ?

@arizonawayfarer
Copy link

@colinsurprenant That makes sense, and I totally agree.

As to how the field references are made.. I feel like removing %{} from the reference would make things easier to read, and people might appreciate the byte savings when transferring large amounts of configs? Also, why have superfluous decoration if it's not needed? Just makes things simpler, yeah?

@ph ph added the enhancement label Apr 13, 2016
@ph
Copy link
Contributor Author

ph commented Apr 13, 2016

I am +1 for removing the need to recast the value itself.
We will need to change the syntax parser to make it work and make sure it doesn't conflict when parsing array of values or hashes. I remember there is some tricky parsing rules for them.

So the for the record.

# This would keep the original type.
mutate {
add_field => ["src_ip_coords", [src_ip_geoip][longitude]]
}
# this wont and will return a string.
mutate {
add_field => ["src_ip_coords", "%{[src_ip_geoip][longitude]}"]
}

@arizonawayfarer
Copy link

I think, if I understood @colinsurprenant it would look more like this:

# This would keep the original type.
mutate {
   add_field => ["src_ip_coords", [src_ip_geoip][longitude]]
}

# this wont and will return a string.
mutate {
   add_field => ["src_ip_coords", "[src_ip_geoip][longitude]"]
}

@arizonawayfarer
Copy link

Probably makes sense to keep the curly braces, I feel like parsing that would be horrible.

@colinsurprenant
Copy link
Contributor

yes, we should keep the string interpolation syntax as-is, but with the quotes, as we have today "%{[foo][bar]}". I am not not sure about the necessity of adding quoted fileref as in "[src_ip_geoip][longitude]" but having a bare fieldref to denote the original value and type and is [src_ip_geoip][longitude] make sense IMO.

Recap:

In your previous example @ph I'd just add quotes to "%{[src_ip_geoip][longitude]}".

And @arizonawayfarer I wouldn't support quoted fieldref like "[src_ip_geoip][longitude]"

@ph
Copy link
Contributor Author

ph commented Apr 14, 2016

@colinsurprenant oops, yup just updated my comment with the quotes.

Supporting [src_ip_geoip][longitude] in the config syntax is doable, but we would need to enrich how the data is send from the config parsing to the plugins itself. Because right now its the plugin responsability to call sprintf on the event and the plugin config hash only deal with primitives (minor the password case). So we need a way to communicate it to the string interpolation code to keep the original type.

To make it simple, instead of sending a String we a send FieldRef object to the plugin and the string interpolation can use it to know how it should do things.

Concerning the 2.3.X issue, do we agree that the keeping the original type was a bug and we should not fix it?

@colinsurprenant
Copy link
Contributor

I'm on the fence but I think that since from 1.5 to 2.2 we were returning the original type using "%{[foo][bar]}" this is now pretty much the established and anticipated behaviour, be it a bug or not? I think we should return to this behaviour in the next 2.3 minor release but make it always return a string in 5.0 as discussed above?

@jsvd
Copy link
Member

jsvd commented Apr 14, 2016

my vote for the following plan:

for 2.3:
revert back to "%{[foo][bar]}" not making string coercion

for 5.0:
host => "%{[foo][bar]}" should return a string, that's the principle of least surprise.

To do a field reference without a string, either port => %{[foo][bar]} or port => [foo][bar] is ok for me.

@suyograo
Copy link
Contributor

I think it is very confusing for our end users who are not developers to understand using "" means getting back string and not using "" means getting back the correct type. If I have a port number in my event field [port], and if I am a non-developer, I'd expect it to just work if I'd use "es_port" => "[port]" or if use "es_port" => [port] in the config. I think the type should be honoured here regardless of wrapping this in a quote, IMO. If the user explicitly uses "host:%{[port]}" or "%{[host]}", then it is clear to return string because they are using the interpolation syntax %[fieldref].

In 2.3.2, we should go back to 2.x behaviour to not break compatibility. And stick to the same strategy in 5.0 as well.

@ph ph removed the v5.0.0-alpha2 label Apr 14, 2016
@suyograo
Copy link
Contributor

I like what @arizonawayfarer proposed:

This would keep the original type.
mutate {
add_field => ["src_ip_coords", [src_ip_geoip][longitude]]
}
this wont and will return a string.
mutate {
add_field => ["src_ip_coords", "%{[src_ip_geoip][longitude]}"]
}

@colinsurprenant
Copy link
Contributor

colinsurprenant commented Apr 14, 2016

@suyograo IMO the principle of least surprise is to avoid auto-magic type conversions which will, yes, require users to learn on the usage of using quotes or not. To make things simpler in the future (5.x) I think the string interpolation template syntax "%{...}" should only and always be used with quotes and always return a string. We really want to avoid the confusion, as I noted above, between the behaviours of "%{[bar]}" and "foo-%{[bar]}" where if [bar] is a numeric in the former we return a numeric and the later a string.

Now to get the original type I think we should only rely on using the current fieldref syntax and support add_field => ["src_ip_coords", [src_ip_geoip][longitude]]. The problem we have is that we mainly use fieldrefs strings in our configs, like "[src_ip_geoip][longitude]" - so I think both [src_ip_geoip][longitude] and "[src_ip_geoip][longitude]" should be equivalent and always just return the fieldref value with original type, no string-ification.

Using this strategy makes it pretty consistent: use "%{...}" syntax to produce a string and fieldref to get original value and type.

@jordansissel
Copy link
Contributor

What's the scope of users impacted by the fixed bug (which can be viewed as a feature break)?

For the 2.3 branch:

  • If lots of users are affected by this, we can revert the bug fix and restore the accidental special-case feature of add_field => { "bar" => "%{foo}" } copying the value (including type) instead of setting it as a string.
  • If only a few users are affected, we maybe should leave the fix as-is.

My instinct is to call it what it is, a bug, and leave it "fixed" - but that conclusion ignores one of my favorite lines "If a user has a bad time, it's a bug" and that I greatly care about upgrade experience.

@colinsurprenant proposed that we revert the "bug fix" for 2.3 only because users upgrading to 2.3 could have an upsetting experience. I agree that upgrade experience matters.

Regardless of what we do for 2.3, my understanding from this report is that some folks want a way to copy a field, so we should make that copying possible.

@arizonawayfarer
Copy link

arizonawayfarer commented Apr 14, 2016

Perhaps there should be a copy_field configuration option for filters? I apologize, as this is probably going to fall way out of the scope of this discussion. I have a lot of things floating around in my head.

When considering the geoip filter, you currently have to do this in order to use the heatmap visualization in kibana:

geoip {
    source => "src_ip"
    target => "src_ip_geoip"
    database => "C:\logstash\maxmind\GeoLiteCity.dat"
    add_field => ["src_ip_coords", "%{[src_ip_geoip][longitude]}"]
    add_field => ["src_ip_coords", "%{[src_ip_geoip][latitude]}"]
} 

Knowing what I know now, that this should never have worked, seems a little silly to me. As a user, I never realized string interpolation was happening until I started investigating this issue. I assumed that data types were being preserved along the way, and that if I really wanted those to be strings, I would have needed to use a mutate block. I can't recall seeing anything about the add_field option automatically converting stuff to strings. Although, with the deeper understanding I have now it makes total sense. However, in the context of the geoip filter it didn't click for me.

I would totally be behind something like this:

geoip {
    source => "src_ip"
    target => "src_ip_geoip"
    database => "C:\logstash\maxmind\GeoLiteCity.dat"
    copy_field => ["src_ip_coords", %{[src_ip_geoip][longitude]}]
    copy_field => ["src_ip_coords", %{[src_ip_geoip][latitude]}]
} 

The way I'm kind of imagining how this would work, add_field always converts to string, and copy_field retains datatypes? Something like that.

I don't know, there are literally 9000 ways to approach this hehe

I also think efforts should be made to reduce config bloat, having to explicitly mutate the fields to their original datatype would make an already big configuration file even bigger. Again, knowing what I know now, this is what would have been expected, and it just doesn't seem right.

geoip {
    source => "src_ip"
    target => "src_ip_geoip"
    database => "C:\logstash\maxmind\GeoLiteCity.dat"
    add_field => ["src_ip_coords", "%{[src_ip_geoip][longitude]}"]
    add_field => ["src_ip_coords", "%{[src_ip_geoip][latitude]}"]
} 
mutate
{
    convert => { "src_ip_coords" => "float" }
} 

I would imagine that for larger arrays, that would be an expensive operation.

@jordansissel
Copy link
Contributor

you currently have to do this in order to use the heatmap visualization in kibana

Geoip filter provides a location field that contains [longitude,latitude]. You don't need to do what you're doing. From the geoip filter docs:

Starting with version 1.3.0 of Logstash, a [geoip][location] field is created if the GeoIP lookup returns a latitude and longitude. The field is stored in GeoJSON format. Additionally, the default Elasticsearch template provided with the elasticsearch output maps the [geoip][location] field to an Elasticsearch geo_point.

@arizonawayfarer
Copy link

Oh wow, I must have been looking at some old docs when I started mucking with it. I'm an idiot hehe

@arichiardi
Copy link

arichiardi commented Oct 11, 2020

Hi there and sorry to barge in. I came across this conversation cause I spent some time figuring out why my %{number} were always returning a string.

For plugin authors, what is the current way to get a primitive type when using event.sprintf? Is there any?

From the source it seems the value is always returned as string.

https://www.rubydoc.info/gems/logstash-event/LogStash/Event:sprintf

Do I read it right?

Edit: Decided to use event.get and Reference syntax now for making sure types are preserved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants