This repository has been archived by the owner on Aug 3, 2020. It is now read-only.

[FLINK-15171] fix issue with netty shuffle buffer allocation skewing benchmark results #43

Closed
wants to merge 1 commit

Conversation

shuttie
Contributor

@shuttie shuttie commented Dec 18, 2019

This is a follow-up to the upstream FLINK-15171 PR.

Currently most of the benchmarks use a single FlinkEnvironmentContext for running test jobs. While running the SerializationFrameworkMiniBenchmarks suite, I found that quite a lot of time is still spent on cluster initialization inside the benchmarking code itself. The following image shows SerializationFrameworkMiniBenchmarks.serializerTuple running under async-profiler:

[flame graph: async-profiler output for SerializationFrameworkMiniBenchmarks.serializerTuple]

The giant hill on the left side is actually nothing more than netty shuffle buffer allocation (1 GB by default), and it takes quite a lot of time:

[flame graph zoom: netty shuffle buffer allocation during cluster initialization]

I propose lowering the shuffle buffer size in the FlinkEnvironmentContext from the default 1 GB to something more reasonable like 8 MB, which eliminates this skew in the benchmark results.
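
For illustration, here is a minimal sketch of how a smaller network memory budget could be configured for a local environment. The option keys (`taskmanager.network.memory.min`/`max`) and the class name are assumptions for the Flink version in use at the time, not the actual change in this PR:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SmallShuffleEnv {
    public static StreamExecutionEnvironment create() {
        Configuration conf = new Configuration();
        // Assumed option keys: shrink the netty shuffle memory from the 1 GB default to 8 MB.
        conf.setString("taskmanager.network.memory.min", "8mb");
        conf.setString("taskmanager.network.memory.max", "8mb");
        // Local environment backed by an embedded MiniCluster, as the benchmarks use.
        return StreamExecutionEnvironment.createLocalEnvironment(4, conf);
    }
}
```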

After this PR, the flame graph for SerializationFrameworkMiniBenchmarks.serializerTuple looks much more representative, and the cluster initialization that previously took 10% of the time is gone:

[flame graph: SerializationFrameworkMiniBenchmarks.serializerTuple after lowering the shuffle buffer size]

Another option would be to rework the FlinkEnvironmentContext so that it directly invokes the LocalExecutor, starting the MiniCluster once for the whole microbenchmark suite. However, it looks like that may require changes in the upstream LocalExecutor code.
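
A rough sketch of what "start the MiniCluster once per suite" could look like, assuming direct use of the MiniCluster API rather than the LocalExecutor; the holder class, task manager count, and slot count are illustrative:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.minicluster.MiniCluster;
import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;

// Hypothetical holder that keeps a single MiniCluster alive across benchmark invocations.
public class SharedMiniCluster {
    private static MiniCluster cluster;

    public static synchronized MiniCluster get() throws Exception {
        if (cluster == null) {
            MiniClusterConfiguration cfg = new MiniClusterConfiguration.Builder()
                    .setConfiguration(new Configuration())
                    .setNumTaskManagers(1)
                    .setNumSlotsPerTaskManager(4)
                    .build();
            cluster = new MiniCluster(cfg);
            cluster.start(); // pays the shuffle buffer allocation cost only once per suite
        }
        return cluster;
    }
}
```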

@pnowojski
Contributor

Thanks for reporting this. You are right, but we are already aware of this issue, and we didn't want to change the number of buffers in the middle of a couple of performance regression investigations.

Something like 1000 buffers should be more than enough, but that's a story for another time. Xintong Song, could you for now configure the benchmarks to always use the same number of buffers as before? After the release, we can decrease the number of buffers to something more sane (that would decrease the startup overhead, but it would also show up as a "false" performance improvement, so I don't want this to happen while we are investigating this and FLINK-15104).

I'm closing this for now, as we are planning to just set a fixed number of buffers instead of configuring the min/max in MB. We will definitely fix it once https://issues.apache.org/jira/browse/FLINK-15103, https://issues.apache.org/jira/browse/FLINK-15104 and https://issues.apache.org/jira/browse/FLINK-15171 are resolved.
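
For reference, a sketch of the "fixed number of buffers" approach described above, assuming the legacy `taskmanager.network.numberOfBuffers` option; the key and the count of 1000 are taken from the comment above, not from an actual commit:

```java
import org.apache.flink.configuration.Configuration;

public class FixedBufferConfig {
    public static Configuration create() {
        Configuration conf = new Configuration();
        // Assumed legacy option key: pins the buffer count rather than sizing network memory in MB.
        conf.setInteger("taskmanager.network.numberOfBuffers", 1000);
        return conf;
    }
}
```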

@pnowojski pnowojski closed this Dec 19, 2019