Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a caching mechanism for the #sprintf calls #3425

Closed
wants to merge 7 commits into from

Conversation

Projects
None yet
5 participants
@ph
Copy link
Member

commented Jun 11, 2015

This PR allow the Event class to compile a sprintf the first time and
reuse the template for the other calls. This strategy speeds up considerably
calls that uses date formatting or use fieldref, its a bit slower if you
don't do any string manipulation.

You can run the benchmark of this new version against the old version with
rake benchmark.

Benchmark

Complex cached: Event#sprintf:   142354.5 i/s
Complex No cache: Event#old_sprintf:    84337.8 i/s - 1.69x slower

Date only cached: Event#sprintf:   247996.9 i/s
Date only No cache: Event#old_sprintf:   131059.5 i/s - 1.89x slower

key only cached: Event#sprintf:  1008076.0 i/s
key only No cache: Event#old_sprintf:   587649.2 i/s - 1.72x slower

string only No cache: Event#old_sprintf:  5067418.7 i/s
string only cached: Event#sprintf:  3771866.2 i/s - 1.34x slower

New event on each iteration

Complex cached: Event#sprintf:    95704.8 i/s
Complex No cache: Event#old_sprintf:    68501.5 i/s - 1.40x slower

Date only cached: Event#sprintf:   162363.2 i/s
Date only No cache: Event#old_sprintf:   102311.1 i/s - 1.59x slower

string only No cache: Event#old_sprintf:   569356.7 i/s
string only cached: Event#sprintf:   537009.0 i/s - 1.06x slower

key only cached: Event#sprintf:   294911.4 i/s
key only No cache: Event#old_sprintf:   230412.7 i/s - 1.28x slower
Add a caching mechanism for the #sprintf calls
This PR allow the Event class to compile a `sprintf` the first time and
reuse the template for the other calls. This strategy speeds up considerably
calls that uses date formatting or use fieldref its a bit slower if you
dont do any string manipulation.

@ph ph changed the title Add a caching mechanism for the #sprintf calls [WIP] Add a caching mechanism for the #sprintf calls Jun 11, 2015

@ph

This comment has been minimized.

Copy link
Member Author

commented Jun 11, 2015

Not sure why the string is slower in that case, It need more investigation.

@suyograo suyograo added the v1.5.2 label Jun 11, 2015

@suyograo

This comment has been minimized.

Copy link
Member

commented Jun 11, 2015

@ph can you open an issue for this please?

@ph

This comment has been minimized.

Copy link
Member Author

commented Jun 11, 2015

@suyograo I've marked this PR as [WIP] ?:)

@colinsurprenant

This comment has been minimized.

Copy link
Contributor

commented Jun 11, 2015

Just tested with the feature/java_backend branch bench config and there is an interpolation error, there's probably a missing spec for the mixed case. Here's the output:

echo '{"foo":"testfoo","bar":"testbar"}' | bin/logstash -f tmp/t.conf
Logstash startup completed
{"@version":"1","@timestamp":"2015-06-11T20:29:21.755Z","host":"colin-mbp13r.lan","fields":{"date":"2015-06-11:20:29:21","mixed":"foo %{+YYYY-MM-dd} 2015-06-11testfoo","foo":"testfoo","bar":"testbar"}}
Logstash shutdown completed

Note the foo %{+YYYY-MM-dd} in the output.

Here's the config

input {
  stdin {
    codec => json_lines
  }
}

filter {
  noop {
    add_field => ["[fields][date]",  "%{+YYYY-MM-dd:HH:mm:ss}"]
    add_field => ["[fields][mixed]",  "foo %{+YYYY-MM-dd} %{foo}"]
    add_field => ["[fields][foo]",  "%{foo}"]
    add_field => ["[fields][bar]",  "%{bar}"]
    remove_field => ["foo"]
    remove_field => ["bar"]
  }
}

output {
  stdout { codec => json_lines }
}
@ph

This comment has been minimized.

Copy link
Member Author

commented Jun 11, 2015

@colinsurprenant I can reproduce it! ty for finding this

@ph

This comment has been minimized.

Copy link
Member Author

commented Jun 12, 2015

@colinsurprenant I've added the test case and fixed the issue you can rerun your tests on this branch.


Benchmark.ips do |x|
x.config(options)
event = LogStash::Event.new("foo" => "bar",

This comment has been minimized.

Copy link
@jordansissel

jordansissel Jun 12, 2015

Contributor

Can we try to use a new Event object on each iteration? I wonder how much impact using the same object every time is masking the performance

This comment has been minimized.

Copy link
@ph

ph Jun 12, 2015

Author Member

I have updated the benchmark in the PR that create a new Event on each iteration.

@colinsurprenant

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2015

here are the results of the benchmark I use for the Java POC which tests both accessors and sprintf:

1- config

input { stdin { codec => json_lines } }

filter {
  noop {
    add_field => ["[fields][date]", "%{+YYYY-MM-dd:HH:mm:ss}"]
    add_field => ["[fields][mixed]", "foo %{+YYYY-MM-dd} %{foo}"]
    add_field => ["[fields][foo]", "%{foo}"]
    add_field => ["[fields][bar]", "%{bar}"]
    remove_field => ["foo"]
    remove_field => ["bar"]
  }
}

output { stdout { codec => json_lines } }

2- results summary

  • the interesting number is the average tps and represent the number of json lines per seconds. We see about ~20% improvement! 👍 👍
branch runtime tps average tps
master 02:34 [9.18k/s] [9.89k/s]
pr/3425 02:26 [11.1k/s] [ 12k/s]

3- results details

  • master
echo '{"foo":"testfoo","bar":"testbar"}' | bin/logstash -f tmp/t.conf; yes '{"foo":"testfoo","bar":"testbar"}' | bin/logstash -f tmp/t.conf | pv -Wlart > /dev/null
Logstash startup completed
{"@version":"1","@timestamp":"2015-06-12T20:09:42.507Z","host":"colin-mbp13r-2.local","fields":{"date":"2015-06-12:20:09:42","mixed":"foo 2015-06-12 testfoo","foo":"testfoo","bar":"testbar"}}
Logstash shutdown completed
02:34 [9.18k/s] [9.89k/s]
  • pr/3425
echo '{"foo":"testfoo","bar":"testbar"}' | bin/logstash -f tmp/t.conf; yes '{"foo":"testfoo","bar":"testbar"}' | bin/logstash -f tmp/t.conf | pv -Wlart > /dev/null
Logstash startup completed
{"@version":"1","@timestamp":"2015-06-12T20:05:19.980Z","host":"colin-mbp13r-2.local","fields":{"date":"2015-06-12:20:05:19","mixed":"foo 2015-06-12 testfoo","foo":"testfoo","bar":"testbar"}}
Logstash shutdown completed
02:26 [11.1k/s] [  12k/s]

@ph ph changed the title [WIP] Add a caching mechanism for the #sprintf calls Add a caching mechanism for the #sprintf calls Jun 13, 2015

@ph ph removed the work in progress label Jun 13, 2015

@ph

This comment has been minimized.

Copy link
Member Author

commented Jun 13, 2015

Ready for review!

end

def self.evaluate(event, format)
INSTANCE.evaluate(event, format)

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 16, 2015

Contributor

any particular reason not to use the module singleton idiom?

module StringInterpolation
  extend self
 ...

This comment has been minimized.

Copy link
@ph

ph Jun 16, 2015

Author Member

I think it make sense to use it when you have more than 1 method.

This comment has been minimized.

Copy link
@ph

ph Jun 16, 2015

Author Member

Also this is a class not a module

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 16, 2015

Contributor

in Ruby, using extend self on a module is the most elegant way to make a singleton class and you don't have to manage an INSTANCE variable.

module MySingleton
  extend self

  def public_function
    private_function
    puts(@instance_var)
  end

  private

  def private_function
    @instance_var ||= "foobar"
  end
end

There is always only one instance of MySingleton and you cannot call new on it thus you do not define the initialize method. You can use instance variables, but they must be initialized outside a constructor.

This comment has been minimized.

Copy link
@ph

ph Jun 16, 2015

Author Member

There is no initialize in the StringInterpolation class so I can certainly do that.

else
compiled = CACHE.get_or_default(template, nil) || CACHE.put(template, compile_template(template))
compiled.evaluate(event)
end

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 16, 2015

Contributor

I would suggest the more concise

return template if not_cachable?(template)

compiled = CACHE.get_or_default(template, nil) || CACHE.put(template, compile_template(template))
compiled.evaluate(event)

This comment has been minimized.

Copy link
@ph

ph Jun 16, 2015

Author Member

+1

CACHE = ThreadSafe::Cache.new
TEMPLATE_TAG_REGEXP = /%\{[^}]+\}/

def evaluate(event, template)

This comment has been minimized.

Copy link
@ph

ph Jun 16, 2015

Author Member

@colinsurprenant I think this should be protected since it's a singleton ,what do you think?

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 16, 2015

Contributor

meh. I don't use protected.

This comment has been minimized.

Copy link
@ph

ph Jun 16, 2015

Author Member

The Singleton make the previous comment invalid.


private
def not_cachable?(template)
template.index("%").nil?

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 16, 2015

Contributor

this is food for thoughts: since this is static, we could annotate these template at config "compile time" and avoid doing this costly string index check at every interpolation

nodes = Template.new

position = 0
matches = template.to_enum(:scan, TEMPLATE_TAG_REGEXP).map { |m| $~ }

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 16, 2015

Contributor

TIL to_enum

position = 0
matches = template.to_enum(:scan, TEMPLATE_TAG_REGEXP).map { |m| $~ }

if matches

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 16, 2015

Contributor

I don't think matches can be nil here

else
nodes
end
end

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 16, 2015

Contributor

this is a personal preference but I prefer:

nodes.size == 1 ? nodes.first : nodes
t = event.timestamp
if t.nil?
raise LogStash::Error, "Unable to format in string \"#{@format}\", #{LogStash::Event::TIMESTAMP} field not found"
end

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 16, 2015

Contributor

I prefer

raise LogStash::Error, "Unable to format in string \"#{@format}\", #{LogStash::Event::TIMESTAMP} field not found" unless t

if t.nil?
raise LogStash::Error, "Unable to format in string \"#{@format}\", #{LogStash::Event::TIMESTAMP} field not found"
end

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 16, 2015

Contributor

I prefer

raise LogStash::Error, "Unable to format in string \"#{@format}\", #{LogStash::Event::TIMESTAMP} field not found" unless t

This comment has been minimized.

Copy link
@ph

ph Jun 16, 2015

Author Member

if the line is short < 100, I use .... if @test When it's too long I prefer to use it first, so when I read the code my brain just skip it without reading the line.

I have a hard limit on width on my vim setup even in 2015.. 🐒

This comment has been minimized.

Copy link
@colinsurprenant
@colinsurprenant

This comment has been minimized.

Copy link
Contributor

commented Jun 16, 2015

about the benchmarking: for the scope of this PR I would split the benchmarking code from the actual sprintf code and make another PR just for that. I like the idea of having a benchmark directory and associated rake task(s) but I think we should discuss this in the larger context and agree to the idea before committing to this.

@ph

This comment has been minimized.

Copy link
Member Author

commented Jun 16, 2015

+1 on creating another PR with the benchmark stuff.

@ph

This comment has been minimized.

Copy link
Member Author

commented Jun 17, 2015

I have extracted the benchmark logic into his own PR at #3457

ph added some commits Jun 17, 2015

@ph

This comment has been minimized.

Copy link
Member Author

commented Jun 17, 2015

@colinsurprenant I have adressed all your concerns, can you look at it again?

def evaluate(event)
t = event.timestamp
raise LogStash::Error, "Unable to format in string \"#{@format}\", #{LogStash::Event::TIMESTAMP} field not found" unless t
t.to_i.to_s

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 17, 2015

Contributor

AFAIK t can never be nil, an event should always return a timestamp

This comment has been minimized.

Copy link
@ph

ph Jun 17, 2015

Author Member

Even if the user remove the field with mutate? I've seen that case last week.

This comment has been minimized.

Copy link
@ph

ph Jun 17, 2015

Author Member

In some cases, timestamp could be nil if the user remove it.
I think it make sense for Event#timestamp to always return a valid TimeStamp, if the user deletes the key we should return the current time.

Its out of scope for the current PR, I'll create an issue to discuss that.

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 17, 2015

Contributor

got it. see comment below.

def evaluate(event)
t = event.timestamp

raise LogStash::Error, "Unable to format in string \"#{@format}\", #{LogStash::Event::TIMESTAMP} field not found" unless t

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 17, 2015

Contributor

same as above: t should never be nil

@colinsurprenant

This comment has been minimized.

Copy link
Contributor

commented Jun 17, 2015

Oh, btw, this is text-book worthy of a great refactoring, good job :)

@colinsurprenant

This comment has been minimized.

Copy link
Contributor

commented Jun 17, 2015

2 notes, as discussed verbally:

1- about the Event#timestamp - I now understand there is currently a situation where if a mutate is done to remove the @timestamp field, Event#timestamp can return nil, so let's keep that sanity check here but let's also open an issue to fix that in Event. IMO Event#timestamp should never return nil regardless of @timestamp; the actual field can be removed but Event#timestamp should always return a valid Timestamp. Add a note to circle back here to remove that check once this is fixed.

2- In the inline comments I noted that some checks/annotations could be done at config "compile" time and avoid costly string checks at every interpolation call. We should open a separate issue for this. In fact, if you could do the same kind of refactoring to config_ast.rb that would be really cool ;)

@ph

This comment has been minimized.

Copy link
Member Author

commented Jun 17, 2015

@colinsurprenant do I have a LGTM?

value = event[@key]

case value
when nil

This comment has been minimized.

Copy link
@colinsurprenant

colinsurprenant Jun 18, 2015

Contributor

since this case statement selects base on class, I would suggest to use NilClass here instead of the nil value. both will work fine but I think NilClass will be more consistent.

This comment has been minimized.

Copy link
@ph

ph Jun 18, 2015

Author Member

For consistency in this case statement I would agree with your comment, but I don't think it is usual in the ruby world to compare against NilClass, I did a quick code search on github and I only find people adding helpers method the NilClass so using the class directly is uncommon. The added helpers are mostly related to some sort of serialization..

just for reference:

irb(main):002:0> nil.class
=> NilClass
@colinsurprenant

This comment has been minimized.

Copy link
Contributor

commented Jun 18, 2015

other than my NilClass suggestion, LGTM. Ran core tests and default plugins tests, all pass.
:shipit:

@elasticsearch-bot

This comment has been minimized.

Copy link

commented Jun 18, 2015

Merged sucessfully into master 1.5!

ph added a commit that referenced this pull request Jun 18, 2015

Add a caching mechanism for the #sprintf calls
This PR allow the Event class to compile a `sprintf` the first time and
reuse the template for the other calls. This strategy speeds up considerably
calls that uses date formatting or use fieldref its a bit slower if you
dont do any string manipulation.

Fixes #3425

ph added a commit that referenced this pull request Jun 18, 2015

ph added a commit that referenced this pull request Jun 18, 2015

ph added a commit that referenced this pull request Jun 18, 2015

ph added a commit that referenced this pull request Jun 18, 2015

ph added a commit that referenced this pull request Jun 18, 2015

@ph ph closed this in fcf8d78 Jun 18, 2015

ph added a commit that referenced this pull request Jun 18, 2015

ph added a commit that referenced this pull request Jun 18, 2015

ph added a commit that referenced this pull request Jun 18, 2015

ph added a commit that referenced this pull request Jun 18, 2015

ph added a commit that referenced this pull request Jun 18, 2015

@ph

This comment has been minimized.

Copy link
Member Author

commented Jun 18, 2015

Damn I should have squashed this :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.