
ARROW-2577: [Plasma] Add asv benchmarks for plasma #2038

Closed
wants to merge 7 commits

Conversation

pcmoritz (Contributor):

This adds some initial ASV benchmarks for plasma:

  • Put latency
  • Get latency
  • Put throughput for 1KB, 10KB, 100KB, 1MB, 10MB, 100MB

It also includes some minor code restructuring to expose the start_plasma_store method.
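For context, a put/get latency benchmark along these lines is what the diff adds; this is a sketch reconstructed from the snippets quoted in the review below (the start_plasma_store helper and the three-argument plasma.connect call are taken from those snippets), not the exact merged code:

import numpy as np
from pyarrow import plasma

class SimplePlasmaLatency(object):
    """Time a small put and a put/get roundtrip against a local plasma store."""

    def setup(self):
        # start_plasma_store is a context manager yielding (socket_name, process)
        self.plasma_store_ctx = plasma.start_plasma_store(10**9)
        plasma_store_name, p = self.plasma_store_ctx.__enter__()
        self.plasma_client = plasma.connect(plasma_store_name, "", 64)

    def teardown(self):
        self.plasma_store_ctx.__exit__(None, None, None)

    def time_plasma_put(self):
        self.plasma_client.put(1)

    def time_plasma_putget(self):
        oid = self.plasma_client.put(1)
        self.plasma_client.get(oid)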

pitrou (Member) left a comment:

Looks nice :-) Some comments below.

plasma_store_name, p = self.plasma_store_ctx.__enter__()
self.plasma_client = plasma.connect(plasma_store_name, "", 64)

self.data_1kb = np.random.randn(1000 // 8)  # 125 float64 values ~= 1 KB
pitrou (Member):

You should use parametrization for the various sizes (you can look at the conversion benchmarks to see how that is done, or see docs). Also I don't think we should test so many sizes. Testing a very small size (1kb) and a large-ish size (10mb) sounds sufficient.
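For reference, asv parametrization is done through class-level params / param_names attributes, and the parameter value is passed to setup and to each benchmark method. A minimal sketch of how that could look here, reusing the illustrative setup from above with the two suggested sizes (not the exact merged code):

import numpy as np
from pyarrow import plasma

class SimplePlasmaThroughput(object):
    """Put throughput for a small and a large-ish buffer."""

    params = [1000, 10000000]      # payload sizes in bytes
    param_names = ['size']

    def setup(self, size):
        self.plasma_store_ctx = plasma.start_plasma_store(10**9)
        plasma_store_name, p = self.plasma_store_ctx.__enter__()
        self.plasma_client = plasma.connect(plasma_store_name, "", 64)
        self.data = np.random.randn(size // 8)   # size // 8 float64 values ~= size bytes

    def teardown(self, size):
        self.plasma_store_ctx.__exit__(None, None, None)

    def time_plasma_put_data(self, size):
        self.plasma_client.put(self.data)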

pcmoritz (Contributor, Author):

done

from . import common


class PlasmaBenchmarks(object):
pitrou (Member):

From a high-level point of view, what do we want to benchmark here? The plasma client, or client + store? You may want to choose an explicit timer that reflects that decision (see the timer attribute here). Also add a docstring :-)

pcmoritz (Contributor, Author):

I added a docstring; not sure what you mean by the timer attribute, the default seems fine to me :)

pitrou (Member):

I mean that if you want to time the client overhead, you need a timer that measures the elapsed CPU time of the current process (but that is asv's default, AFAICT). If OTOH you want to measure the whole client + server roundtrip, you need a timer that measures elapsed wall clock time.

pcmoritz (Contributor, Author):

Oh I see, thanks for pointing that out. I do want to measure the roundtrip wall-clock time, and I think timeit, which is the default for asv, does that (according to https://docs.python.org/3.0/library/timeit.html).

pitrou (Member):

Actually, according to https://asv.readthedocs.io/en/latest/writing_benchmarks.html#timing the default is process CPU time (by way of time.process_time). Which makes sense in most cases but not here :-)

pcmoritz (Contributor, Author):

Thanks, I didn't realize they were overriding the timeit default; should be fixed now :)
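Presumably the fix amounts to overriding asv's timer attribute on the benchmark classes so that elapsed wall-clock time is measured instead of process CPU time, e.g.:

import timeit

class PlasmaBenchmarks(object):
    """Benchmarks of the plasma client + store roundtrip."""

    # asv's default timer is process CPU time (time.process_time), which would
    # miss time spent in the separate plasma store process; measure wall-clock
    # time instead.
    timer = timeit.default_timer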

@codecov-io commented:

Codecov Report

Merging #2038 into master will increase coverage by 0.02%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #2038      +/-   ##
==========================================
+ Coverage   87.44%   87.47%   +0.02%     
==========================================
  Files         189      178      -11     
  Lines       29368    28595     -773     
==========================================
- Hits        25682    25014     -668     
+ Misses       3686     3581     -105
Impacted Files Coverage Δ
cpp/src/arrow/util/thread-pool-test.cc 98.87% <0%> (-0.57%) ⬇️
rust/src/bitmap.rs
rust/src/builder.rs
rust/src/memory.rs
rust/src/error.rs
rust/src/buffer.rs
rust/src/memory_pool.rs
rust/src/datatypes.rs
rust/src/list.rs
rust/src/array.rs
... and 2 more


@pcmoritz closed this in 75acaba May 14, 2018
@pcmoritz deleted the plasma-asv branch May 14, 2018 23:58
pcmoritz (Contributor, Author):

@pitrou Is there a way to get all the numbers for each parameter from a command line run? For me it only shows the first one and the rest is omitted with ... (at the very right):

ubuntu@ip-172-31-49-70:~/arrow/python$ asv run --python=same
· Discovering benchmarks
· Running 14 total benchmarks (1 commits * 1 environments * 14 benchmarks)
[  0.00%] ·· Building for existing-py_home_ubuntu_anaconda3_bin_python
[  0.00%] ·· Benchmarking existing-py_home_ubuntu_anaconda3_bin_python
[  7.14%] ··· Running array_ops.ScalarAccess.time_as_py                                                                                                                        9.41ms
[ 14.29%] ··· Running array_ops.ScalarAccess.time_getitem                                                                                                                     41.09ms
[ 21.43%] ··· Running convert_builtins.ConvertArrayToPyList.time_convert                                                                                                  41.60ms;...
[ 28.57%] ··· Running convert_builtins.ConvertPyListToArray.time_convert                                                                                                  15.41ms;...
[ 35.71%] ··· Running convert_builtins.InferPyListToArray.time_infer                                                                                                      19.75ms;...
[ 42.86%] ··· Running convert_pandas.PandasConversionsFromArrow.time_to_series                                                                                             1.28ms;...
[ 50.00%] ··· Running convert_pandas.PandasConversionsToArrow.time_from_series                                                                                           594.60μs;...
[ 57.14%] ··· Running convert_pandas.ZeroCopyPandasRead.time_deserialize_from_buffer                                                                                         609.45μs
[ 64.29%] ··· Running convert_pandas.ZeroCopyPandasRead.time_deserialize_from_components                                                                                     597.48μs
[ 71.43%] ··· Running microbenchmarks.PandasObjectIsNull.time_PandasObjectIsNull                                                                                           2.76ms;...
[ 78.57%] ··· Running plasma.SimplePlasmaLatency.time_plasma_put                                                                                                             500.94ms
[ 85.71%] ··· Running plasma.SimplePlasmaLatency.time_plasma_putget                                                                                                          691.88ms
[ 92.86%] ··· Running plasma.SimplePlasmaThroughput.time_plasma_put_data                                                                                                 562.55μs;...
[100.00%] ··· Running streaming.StreamReader.time_read_to_dataframe                                                                                                      232.14ms;...

pitrou (Member) commented May 15, 2018:

@pcmoritz The -e flag should do it, IIRC.
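i.e., presumably something along the lines of (per the suggestion above; untried here):

asv run --python=same -e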
