Generate mixed workload with Get, Put, Seek in db_bench #4788

Closed
wants to merge 5 commits into from

@zhichao-cao zhichao-cao commented Dec 17, 2018

Based on specific workload models (key access distribution, value size distribution, iterator scan length distribution, and QPS variation), the MixGraph benchmark generates a synthetic workload that follows these distributions and can therefore reflect real-world workload characteristics.

After a user enables tracing, they get a trace file. Analyzing the trace file with the trace_analyzer tool produces a set of statistics files. The *_accessed_key_stats.txt, *-accessed_value_size_distribution.txt, *-iterator_length_distribution.txt, and *-qps_stats.txt files are the main inputs for model fitting in Matlab. From the fitted models, the user obtains the parameters of the workload distributions (the modeling details are described here).

The key access distribution follows the two-term power model. The probability density function is f(x) = a*x^{b} + c. The corresponding db_bench parameters are key_dist_a, key_dist_b, and key_dist_c.

The value size distribution and the iterator scan length distribution both follow the Generalized Pareto Distribution. The probability density function is f(x) = (1/sigma)*(1 + k*(x-theta)/sigma)^{-1-1/k}. The parameters are value_k, value_theta, value_sigma and iter_k, iter_theta, iter_sigma. For more information about the Generalized Pareto Distribution, see the wiki and Matlab pages.
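
As a rough illustration of how the Generalized Pareto parameters turn into value sizes or scan lengths, here is a minimal Python sketch of the density and of inverse-CDF sampling. The function names `gpd_pdf` and `gpd_sample` are hypothetical and are not taken from db_bench; the sketch assumes k != 0 and is not necessarily how db_bench implements its sampling.

```python
import random


def gpd_pdf(x, k, sigma, theta=0.0):
    """Generalized Pareto density for k != 0 (hypothetical helper)."""
    z = (x - theta) / sigma
    if z < 0 or 1.0 + k * z <= 0:
        return 0.0  # x lies outside the support of the distribution
    return (1.0 / sigma) * (1.0 + k * z) ** (-1.0 - 1.0 / k)


def gpd_sample(u, k, sigma, theta=0.0):
    """Map a uniform u in [0, 1) to a GPD value by inverting the CDF
    F(x) = 1 - (1 + k*(x - theta)/sigma)^(-1/k)."""
    return theta + sigma * ((1.0 - u) ** (-k) - 1.0) / k


if __name__ == "__main__":
    rng = random.Random(42)
    # Parameter values borrowed from the example command line in this PR.
    sizes = [gpd_sample(rng.random(), k=0.9233, sigma=226.4092) for _ in range(5)]
    print([round(s, 1) for s in sizes])
```

With theta left at 0, every sample is non-negative and small values are the most likely, which matches the heavy-tailed shape the model is meant to capture.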

The QPS follows a diurnal pattern, so a sine function is a good model for it: F(x) = sine_a*sin(sine_b*x + sine_c) + sine_d. The trace_analyzer reports the average QPS in its printed results, which is sine_d. After fitting the "*-qps_stats.txt" data to the model in Matlab, the user gets sine_a, sine_b, and sine_c. With these four parameters, the user can control the QPS variation, including its period, its average, and the amplitude of its changes.
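
To make the role of the four sine parameters concrete, the following Python sketch evaluates the model and derives its period. The helper names are hypothetical, and the unit of x is an assumption left to whatever unit was used when fitting the data.

```python
import math


def sine_qps(x, sine_a, sine_b, sine_c, sine_d):
    """Target QPS at position x: F(x) = a*sin(b*x + c) + d."""
    return sine_a * math.sin(sine_b * x + sine_c) + sine_d


def sine_period(sine_b):
    """Length of one full QPS cycle, in the same unit as x."""
    return 2.0 * math.pi / abs(sine_b)


if __name__ == "__main__":
    # Values from the example command line in this PR; sine_c is assumed to be 0.
    a, b, c, d = 15000, 1, 0, 20000
    print("average QPS:", d)
    print("peak QPS:", d + a, "trough QPS:", d - a)
    print("period:", sine_period(b))
    print("QPS at x = 0:", sine_qps(0, a, b, c, d))
```

Under these assumptions, sine_d sets the average, sine_a sets how far the rate swings above and below it, sine_b sets how quickly the cycle repeats, and sine_c shifts where the cycle starts.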

To use the benchmark, the user can specify parameters such as the following:

-benchmarks="mixgraph" -key_dist_a=0.002312 -key_dist_b=0.3467 -value_k=0.9233 -value_sigma=226.4092 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.7 -mix_put_ratio=0.25 -mix_seek_ratio=0.05 -sine_mix_rate_interval_milliseconds=500 -sine_a=15000 -sine_b=1 -sine_d=20000
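
The mix ratios in the command above can be read as probabilities for choosing the next operation. A minimal Python sketch of that idea follows; `pick_operation` is a hypothetical helper for illustration and is not db_bench code.

```python
import random


def pick_operation(u, get_ratio, put_ratio, seek_ratio):
    """Choose an operation from a uniform u in [0, 1) according to the ratios.
    The ratios are normalized here, so they do not have to sum to exactly 1."""
    u *= get_ratio + put_ratio + seek_ratio
    if u < get_ratio:
        return "get"
    if u < get_ratio + put_ratio:
        return "put"
    return "seek"


if __name__ == "__main__":
    rng = random.Random(7)
    counts = {"get": 0, "put": 0, "seek": 0}
    for _ in range(100000):
        counts[pick_operation(rng.random(), 0.7, 0.25, 0.05)] += 1
    print(counts)  # shares should be close to 70% / 25% / 5%
```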

@facebook-github-bot facebook-github-bot left a comment

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@sagar0 sagar0 left a comment

Thanks @zhichao-cao .
Could you provide more information about the distributions that you added in the summary section, so that RocksDB users can know what type of models/equations/distributions are used to generate the workload, without delving into the code?

sagar0 commented Jan 4, 2019

I believe it is quite difficult for users to come up with values for these parameters. Is there an easy way a user can figure out what values to provide for these parameters based on some other data? (say from trace_analyzer?)

@zhichao-cao
Copy link
Contributor Author

I believe it is quite difficult for users to come up with values for these parameters. Is there an easy way a user can figure out what values to provide for these parameters based on some other data? (say from trace_analyzer?)

The trace analyzer provides the statistics files for users. Users need to use Matlab to fit the statistics to the models. The fitting functions are complex, and I think using the well-developed toolboxes in Matlab is a better way for users to figure out these parameters. I have written the instructions in the intro of "Tracing, analyzing, and Modeling" on the RocksDB wiki, which covers the files generated by trace_analyzer and how these files can be used in Matlab for model fitting. The Matlab scripts are also listed there.

@facebook-github-bot
Copy link
Contributor

@zhichao-cao has updated the pull request. Re-import the pull request

sagar0 commented Jan 4, 2019

I believe it is quite difficult for users to come up with values for these parameters. Is there an easy way a user can figure out what values to provide for these parameters based on some other data? (say from trace_analyzer?)

The trace analyzer provides the statistics files for users. Users need to use Matlab to fit the statistics to the models. The fitting functions are complex, and I think using the well-developed toolboxes in Matlab is a better way for users to figure out these parameters. I have written the instructions in the intro of "Tracing, analyzing, and Modeling" on the RocksDB wiki, which covers the files generated by trace_analyzer and how these files can be used in Matlab for model fitting. The Matlab scripts are also listed there.

That's great. I haven't seen the "Model the Workloads" section on that wiki page before.

@sagar0 sagar0 changed the title from "Generate the mix workload with Get, Put, Seek in db_bench" to "Generate mix workload with Get, Put, Seek in db_bench" Jan 8, 2019
@facebook-github-bot facebook-github-bot left a comment

@sagar0 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@sagar0 sagar0 changed the title from "Generate mix workload with Get, Put, Seek in db_bench" to "Generate mixed workload with Get, Put, Seek in db_bench" Jan 8, 2019
@sagar0 sagar0 left a comment

Thanks, lgtm.
Let's get this in and you can iterate on it.

xkszltl commented Jan 24, 2019

char value_buffer[2 * value_max];

This appears to be a variable-length array, which is a GNU extension.

sagar0 commented Jan 24, 2019

I wonder why the appveyor build didn't catch this 😕 .

facebook-github-bot pushed a commit that referenced this pull request Jan 28, 2019
Summary:
In the MixGraph benchmark of db_bench (#4788), a char array is sized with a value taken from user input, which can cause build errors on some platforms. In addition, the msg char array can be smaller than the data printed into it, so its size is extended from 100 to 256.

Tested with make check.
Pull Request resolved: #4918

Differential Revision: D13844298

Pulled By: sagar0

fbshipit-source-id: 33c4809c5c4438f0a9f7b289d3f42e20c545bbab
facebook-github-bot pushed a commit that referenced this pull request Nov 6, 2019
Summary:
In the previous PR #4788, users can use the db_bench mixgraph benchmark to generate a workload modeled on the social graph. Keys are generated based on key access hotness. In this PR, users can further model the key-range hotness and fit it to a two-term exponential distribution. First, the user cuts the whole key space into small key ranges (e.g., key ranges of equal size, with the number of key ranges equal to the number of SST files). Then, the user calculates the average access count per key in each key range as the key-range hotness. Next, the user fits the key-range hotness to the two-term exponential distribution (f(x) = a*exp(b*x) + c*exp(d*x)) and obtains the values of a, b, c, and d. These are the db_bench parameters keyrange_dist_a, keyrange_dist_b, keyrange_dist_c, and keyrange_dist_d. Finally, the user can run db_bench by specifying these parameters.
For example:
`./db_bench --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=268435456 -key_dist_a=0.002312 -key_dist_b=0.3467 -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=350 -sine_b=0.0105 -sine_d=50000 --perf_level=2 -reads=1000000 -num=5000000 -key_size=48`
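
The key-range hotness computation described above can be sketched in Python as follows. The helper names and the input layout (a list of per-key access counts ordered by key) are assumptions made for illustration; the fitting step itself is not shown.

```python
import math


def keyrange_hotness(access_counts, keyrange_num):
    """Split per-key access counts (ordered by key) into keyrange_num
    contiguous ranges of near-equal size and return the average access
    count per key in each range."""
    n = len(access_counts)
    if keyrange_num <= 0 or n < keyrange_num:
        raise ValueError("need at least one key per key range")
    hotness = []
    for i in range(keyrange_num):
        start = i * n // keyrange_num
        end = (i + 1) * n // keyrange_num
        chunk = access_counts[start:end]
        hotness.append(sum(chunk) / len(chunk))
    return hotness


def two_term_exp(x, a, b, c, d):
    """Two-term exponential model: f(x) = a*exp(b*x) + c*exp(d*x)."""
    return a * math.exp(b * x) + c * math.exp(d * x)


if __name__ == "__main__":
    print(keyrange_hotness([4, 2, 0, 0, 1, 1], 3))
    # Model values for the first few key ranges, using the example parameters.
    print([round(two_term_exp(x, 14.18, -2.917, 0.0164, -0.08082), 4) for x in range(3)])
```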
Pull Request resolved: #5953

Test Plan: run db_bench with different parameters and checked the results.

Differential Revision: D18053527

Pulled By: zhichao-cao

fbshipit-source-id: 171f8b3142bd76462f1967c58345ad7e4f84bab7
merryChris pushed a commit to merryChris/rocksdb that referenced this pull request Nov 18, 2019