Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add auto_type_convert option for parser.rb #151

Closed
wants to merge 6 commits into from

Conversation

y-ken
Copy link
Member

@y-ken y-ken commented Jul 25, 2013

summary

We have a problem to analyzing int/float columns with TreasureData or mongoDB which parsed from file with in_tail.
I expect 123 to be integer, 123.45 to be float value but in_tail makes them string type.

On using fluent-plugin-typecast, It could convert types by manually.
It is not cool to convert type (cast) by manually. it have to update whenever log format has changed.
it takes much cost of maintenance.

Thus, I have add auto type conversion option for parser.rb.

configuration sample

<source>
  type tail
  format ltsv
  time_format %d/%b/%Y:%H:%M:%S %z
  path /var/log/httpd/access_log
  tag debug.apache.access
  auto_type_convert yes
</source>

sample results

the difference seems in status, size, response_time.

$ diff -u before.json after.json
--- before.json 2013-07-25 22:46:45.000000000 +0900
+++ after.json  2013-07-25 22:47:09.000000000 +0900
@@ -1,3 +1,2 @@
-{"domain":"127.0.0.1","host":"127.0.0.1","server":"127.0.0.1","ident":"-","user":"-","method":"GET","path":"/","protocol":"HTTP/1.1","status":"404","size":"198","referer":"-","agent":"Mozilla","response_time":"398","cookie":"-","set_cookie":"-"}
-
+{"domain":"127.0.0.1","host":"127.0.0.1","server":"127.0.0.1","ident":"-","user":"-","method":"GET","path":"/","protocol":"HTTP/1.1","status":404,"size":198,"referer":"-","agent":"Mozilla","response_time":398,"cookie":"-","set_cookie":"-"}

…log and so on.

I have add auto type conversion option for parser.

It is not cool to manualy type conversion by manualy.
it have to update whenever log format has changed.

#### configuration sample

```
<source>
  type tail
  format ltsv
  time_format %d/%b/%Y:%H:%M:%S %z
  path /var/log/httpd/access_log
  tag debug.apache.access
  auto_type_convert yes
</source>
```
#### sample results

the difference seems in status, size, response_time.

```
#auto_type_convert yes
2013-07-25 01:12:48 +0900 debug.apache.access:
{"domain":"127.0.0.1","host":"127.0.0.1","server":"127.0.0.1","ident":"-
","user":"-","me
thod":"GET","path":"/","protocol":"HTTP/1.1","status":404,"size":198,"re
ferer":"-","agent":"Mozilla","response_time":398,"cookie":"-","set_cooki
e":"-"}

#auto_type_convert no (or undefined)
2013-07-25 01:12:48 +0900 debug.apache.access:
{"domain":"127.0.0.1","host":"127.0.0.1","server":"127.0.0.1","ident":"-
","user":"-","me
thod":"GET","path":"/","protocol":"HTTP/1.1","status":"404","size":"198"
,"referer":"-","agent":"Mozilla","response_time":"398","cookie":"-","set
_cookie":"-"}
```
it is twice as fast!

```
$ bundle exec ruby test.rb
       user     system      total        real
   0.000000   0.000000   0.000000 (  0.000033)
   0.000000   0.000000   0.000000 (  0.000016)
```
@y-ken
Copy link
Member Author

y-ken commented Jul 25, 2013

benchmark result

I have benchmarked the type conversion speed because it is very important to parse as fast as possible.
The result, Integer(value) rescue false is faster than value.to_i.to_s implementation.

benchmark ken$ bundle exec ruby test.rb
       user     system      total        real
   0.000000   0.000000   0.000000 (  0.000032)
   0.000000   0.000000   0.000000 (  0.000033)
   0.000000   0.000000   0.000000 (  0.000016)
benchmark ken$ bundle exec ruby test.rb
       user     system      total        real
   0.000000   0.000000   0.000000 (  0.000032)
   0.000000   0.000000   0.000000 (  0.000029)
   0.000000   0.000000   0.000000 (  0.000010)
benchmark ken$ bundle exec ruby test.rb
       user     system      total        real
   0.000000   0.000000   0.000000 (  0.000032)
   0.000000   0.000000   0.000000 (  0.000029)
   0.000000   0.000000   0.000000 (  0.000011)
benchmark ken$ bundle exec ruby test.rb
       user     system      total        real
   0.000000   0.000000   0.000000 (  0.000031)
   0.000000   0.000000   0.000000 (  0.000019)
   0.000000   0.000000   0.000000 (  0.000009)

benchmark code

require 'benchmark'

def convert_type1(record)
  record.each do |index,value|
    if value == value.to_i.to_s
      record[index] = value.to_i
    elsif value == value.to_f.to_s
      record[index] = value.to_f
    end
  end
  return record
end

def convert_type2(record)
  record.each do |index,value|
    if (Integer(value) rescue false)
      record[index] = value.to_i
    elsif (Float(value) rescue false)
      record[index] = value.to_f
    end
  end
  return record
end

def convert_type3(record)
  record.each do |index,value|
    if (int_value = Integer(value) rescue false)
      record[index] = int_value
    elsif (float_value = Float(value) rescue false)
      record[index] = float_value
    end
  end
  return record
end

record = {"int"=>"123","float"=>"12.34"}
n = 1000000
Benchmark.bm do |x|
  x.report { convert_type1(record) }
  x.report { convert_type2(record) }
  x.report { convert_type3(record) }
end

@ghost ghost assigned repeatedly Jul 25, 2013
@y-ken
Copy link
Member Author

y-ken commented Jul 25, 2013

I have wrote a blog about this pull-req.
http://y-ken.hatenablog.com/entry/fluentd-parser-auto-type-conversion

@@ -95,6 +95,7 @@ class ValuesParser
config_param :keys, :string
config_param :time_key, :string, :default => nil
config_param :time_format, :string, :default => nil
config_param :auto_type_convert, :string, :default => nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use string type?

On using yes/no or true/false, it should use :bool instead of :string
@y-ken
Copy link
Member Author

y-ken commented Jul 25, 2013

Why use string type?

Oops, I've fixed. y-ken@64e21a4
It's sorry for my misunderstood usage for config_param parameter.

@repeatedly
Copy link
Member

Maybe, your benchmark is wrong.
Please try follwoing code:

require 'benchmark'

def convert_type1(record)
  record.each do |key, value|
    if value == value.to_i.to_s
      record[key] = value.to_i
    elsif value == value.to_f.to_s
      record[key] = value.to_f
    end
  end
  record
end

def convert_type2(record)
  record.each do |key, value|
    if (Integer(value) rescue false)
      record[key] = value.to_i
    elsif (Float(value) rescue false)
      record[key] = value.to_f
    end
  end
  record
end

def convert_type3(record)
  record.each do |key, value|
    if (int_value = Integer(value) rescue false)
      record[key] = int_value
    elsif (float_value = Float(value) rescue false)
      record[key] = float_value
    end
  end
  record
end

record = {"int" => "123", "float" => "12.34", "str" => "13498734hoge", "msg" => "/path/to"}
n = 10000
Benchmark.bmbm do |x|
  x.report { n.times { convert_type1(record.dup) } }
  x.report { n.times { convert_type2(record.dup) } }
  x.report { n.times { convert_type3(record.dup) } }
end

In my env, result is below:

% ruby --version
ruby 1.9.3p385 (2013-02-06 revision 39114) [x86_64-darwin12.2.1]
% ruby b.rb
Rehearsal ------------------------------------
   0.070000   0.000000   0.070000 (  0.068451)
   0.560000   0.020000   0.580000 (  0.577230)
   0.560000   0.020000   0.580000 (  0.582787)
--------------------------- total: 1.230000sec

       user     system      total        real
   0.070000   0.000000   0.070000 (  0.068951)
   0.540000   0.020000   0.560000 (  0.560255)
   0.530000   0.020000   0.550000 (  0.555380)

% ruby --version
ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.4.0]
% ruby b.rb
Rehearsal ------------------------------------
   0.080000   0.000000   0.080000 (  0.077255)
   0.280000   0.010000   0.290000 (  0.286506)
   0.260000   0.010000   0.270000 (  0.273608)
--------------------------- total: 0.640000sec

       user     system      total        real
   0.080000   0.000000   0.080000 (  0.075255)
   0.260000   0.010000   0.270000 (  0.267902)
   0.280000   0.000000   0.280000 (  0.276650)

@y-ken
Copy link
Member Author

y-ken commented Jul 25, 2013

thank you.
The result, Integer(value) rescue false is slower than value.to_i.to_s implementation.

$ ruby --version
ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-darwin12.2.0]

$ bundle exec ruby test.rb
Rehearsal ------------------------------------
   0.200000   0.000000   0.200000 (  0.198400)
   1.530000   0.030000   1.560000 (  1.565692)
   1.530000   0.030000   1.560000 (  1.555802)
--------------------------- total: 3.320000sec

       user     system      total        real
   0.200000   0.000000   0.200000 (  0.197015)
   1.450000   0.030000   1.480000 (  1.481821)
   1.460000   0.030000   1.490000 (  1.493970)

@y-ken
Copy link
Member Author

y-ken commented Jul 26, 2013

On adding this test, convert_type4 is same speed or faster a little than convert_type1.
I'm planning to update convert_type like following convert_type4 code.
How do you think to use convert_type4 instead of convert_type1?

def convert_type4(record)
  record.each do |index,value|
    if value == (int_value = value.to_i).to_s
      record[index] = int_value
    elsif value == (float_value = value.to_f).to_s
      record[index] = float_value
    end
  end
  return record
end

results

$ ruby --version
ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-darwin12.2.0]

$ bundle exec ruby test.rb
Rehearsal ------------------------------------
   0.200000   0.000000   0.200000 (  0.199526)
   1.560000   0.030000   1.590000 (  1.596019)
   1.580000   0.030000   1.610000 (  1.606314)
   0.200000   0.000000   0.200000 (  0.197653)
--------------------------- total: 3.600000sec

       user     system      total        real
   0.200000   0.000000   0.200000 (  0.200536)
   1.500000   0.030000   1.530000 (  1.537644)
   1.450000   0.030000   1.480000 (  1.484264)
   0.190000   0.000000   0.190000 (  0.193165)

code

require 'benchmark'

def convert_type1(record)
  record.each do |index,value|
    if value == value.to_i.to_s
      record[index] = value.to_i
    elsif value == value.to_f.to_s
      record[index] = value.to_f
    end
  end
  return record
end

def convert_type2(record)
  record.each do |index,value|
    if (Integer(value) rescue false)
      record[index] = value.to_i
    elsif (Float(value) rescue false)
      record[index] = value.to_f
    end
  end
  return record
end

def convert_type3(record)
  record.each do |index,value|
    if (int_value = Integer(value) rescue false)
      record[index] = int_value
    elsif (float_value = Float(value) rescue false)
      record[index] = float_value
    end
  end
  return record
end

def convert_type4(record)
  record.each do |index,value|
    if value == (int_value = value.to_i).to_s
      record[index] = int_value
    elsif value == (float_value = value.to_f).to_s
      record[index] = float_value
    end
  end
  return record
end

record = {"int" => "123", "float" => "12.34", "str" => "13498734hoge", "msg" => "/path/to"}
n = 10000
Benchmark.bmbm do |x|
  x.report { n.times { convert_type1(record.dup) } }
  x.report { n.times { convert_type2(record.dup) } }
  x.report { n.times { convert_type3(record.dup) } }
  x.report { n.times { convert_type4(record.dup) } }
end

@y-ken
Copy link
Member Author

y-ken commented Jul 26, 2013

I have updated the code and also confirmed rake test has passed. y-ken@7313cdf

speed up for convert type implementation.
It has mistaken for my previous benchmark.
@@ -126,6 +128,17 @@ def values_map(values)

return time, record
end

def convert_type(record)
record.each do |index,value|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable name index is wrong. It's not index.
field? name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quite agree. It should use field, name or key.

@y-ken
Copy link
Member Author

y-ken commented Jul 26, 2013

Variable name index is wrong. It's not index. field? name?

Thank you @tagomoris . I quite agree. It should use field, name or key.
I have modified it as key as like official document.
http://doc.ruby-lang.org/ja/1.9.3/class/Hash.html

@y-ken
Copy link
Member Author

y-ken commented Aug 30, 2013

It may results unexpected conversion and it should use non-missin-critical situation.
On adding this option as official, it may raise support cost and/or decrease a feeling of trust for fluentd.
Thus, I'd be better to write this function as meta plugin like below.

Would you please give me a opinion for this issue?

# input sample
<source>
  type autocast
  <config>
    type tail
    format ltsv
    time_format %d/%b/%Y:%H:%M:%S %z
    path /var/log/httpd/access_log
    tag apache.access
  </config>
</source>

# output(filter) sample
<match some.foo>
  type autocast
  <config>
    type mongo
    database fluent
    collection test
  </config>
</match>

@repeatedly
Copy link
Member

Auto casting is one of type conversion.
So type_convert plugin is more general than 'autocast' plugin.

Example:

<match **>
  type type_convert
  auto_convert true # like autocast
</match>

<match **>
  type type_convert
  convert_rules key1:float, key2:integer, key3:time
</match

What do you think?

@repeatedly
Copy link
Member

Now parser has types option and this approach is better to avoid unexpected behaviour.

@repeatedly repeatedly closed this Dec 23, 2013
@y-ken
Copy link
Member Author

y-ken commented Dec 24, 2013

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants