This repository has been archived by the owner on Aug 3, 2020. It is now read-only.
[FLINK-15171] fix issue with netty shuffle buffer allocation skewing benchmark results #43
This is a follow-up to the upstream FLINK-15171 PR.

Currently most of the benchmarks use a single `FlinkEnvironmentContext` for running test jobs. While running the `SerializationFrameworkMiniBenchmarks` suite, I found that quite a lot of time is still spent on cluster initialization inside the benchmarking code itself. The following image shows `SerializationFrameworkMiniBenchmarks.serializerTuple` running under async-profiler. The giant hill on the left side is actually nothing more than netty shuffle buffer allocation (1gb by default), and it takes quite a lot of time.
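To put that allocation cost in perspective, here is a back-of-the-envelope sketch. It assumes Flink's default 32kb network buffer segment size (`taskmanager.memory.segment-size`), which is not stated in this PR: a 1gb pool is carved into tens of thousands of buffers up front, while an 8m pool needs only a few hundred.

```java
public class ShuffleBufferMath {
    public static void main(String[] args) {
        // Flink's default network buffer segment size is 32kb.
        final long segmentSize = 32L * 1024;
        // Default netty shuffle buffer pool allocated at cluster start: 1gb.
        final long defaultPool = 1024L * 1024 * 1024;
        // Pool size proposed in this PR: 8m.
        final long proposedPool = 8L * 1024 * 1024;

        // Number of buffers eagerly allocated in each case.
        System.out.println("default:  " + defaultPool / segmentSize + " buffers");  // 32768
        System.out.println("proposed: " + proposedPool / segmentSize + " buffers"); // 256
    }
}
```

Allocating 128x fewer buffers up front is what removes the initialization hill from the flame graph.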
I propose lowering the shuffle buffer size in the `FlinkEnvironmentContext` from the default 1gb to something more reasonable like 8m, which eliminates this skew in the benchmark results. After this PR, the flame graph for `SerializationFrameworkMiniBenchmarks.serializerTuple` looks much more representative, and the cluster init that previously took 10% of the time is gone.

Another option would be to rework the `FlinkEnvironmentContext` so that it directly invokes the `LocalExecutor`, starting the `MiniCluster` once for the whole microbenchmark suite. However, that looks like it may require changes in the upstream `LocalExecutor` code.