Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3170: [C++] Experimental readahead spooler #2492

Closed
wants to merge 2 commits into from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Aug 29, 2018

No description provided.

@pitrou pitrou force-pushed the ARROW-501-readahead branch 3 times, most recently from 2f11399 to 3cbaa80 Compare August 29, 2018 11:48
@codecov-io
Copy link

Codecov Report

Merging #2492 into master will decrease coverage by 0.06%.
The diff coverage is 96.63%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2492      +/-   ##
==========================================
- Coverage   86.28%   86.21%   -0.07%     
==========================================
  Files         308      311       +3     
  Lines       47099    47338     +239     
==========================================
+ Hits        40639    40814     +175     
- Misses       6388     6450      +62     
- Partials       72       74       +2
Impacted Files Coverage Δ
cpp/src/arrow/io/readahead.h 100% <100%> (ø)
cpp/src/plasma/test/client_tests.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/test-util.h 76.25% <100%> (-0.49%) ⬇️
cpp/src/arrow/io/readahead.cc 93.68% <93.68%> (ø)
cpp/src/arrow/io/io-readahead-test.cc 98.49% <98.49%> (ø)
go/arrow/math/int64_avx2_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/float64_avx2_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/memory/memory_avx2_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/uint64_avx2_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/uint64_amd64.go 33.33% <0%> (-33.34%) ⬇️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef21ec9...3cbaa80. Read the comment docs.

std::shared_ptr<Buffer> buffer;
ABORT_NOT_OK(AllocateBuffer(data.length(), &buffer));
memcpy(buffer->mutable_data(), data.data(), data.length());
return std::make_shared<BufferReader>(std::move(buffer));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not return buffer? Shouldn't RVO take care of not duplicating the pointer twice?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I'm returning a BufferReader, not a Buffer ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah!

~ReadaheadSpooler();

/// Configure zero-padding at beginning and end of buffers (default 0 bytes).
int64_t GetLeftPadding();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add more descriptive comments about what the paddings do?

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

std::shared_ptr<Buffer> buffer;
ABORT_NOT_OK(AllocateBuffer(data.length(), &buffer));
memcpy(buffer->mutable_data(), data.data(), data.length());
return std::make_shared<BufferReader>(std::move(buffer));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah!

@pitrou pitrou requested a review from wesm August 29, 2018 15:37
@wesm
Copy link
Member

wesm commented Aug 29, 2018

I am working on a review, probably will finish later today

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks mostly OK to me. My main question is how we could begin to employ asynchronous IO in some of our performance critical paths, like reading Parquet files etc. I suppose after the parquet-cpp merge we can begin to explore this question more quickly

@@ -26,6 +26,7 @@ if (ARROW_HDFS AND NOT ARROW_BOOST_HEADER_ONLY)
endif()

ADD_ARROW_TEST(io-memory-test)
ADD_ARROW_TEST(io-readahead-test)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might consider putting all async, buffering, etc. tests in a single test module to help avoid having a too large number of test executables longer term

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather have separate test modules. array-test.cc is unreadable for me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this means 1) better parallelization of tests 2) when running tests for a module I don't endure the costs of testing other modules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. There's other tools we can use to help with this if testing link times become an issue, like linking multiple *-test.cc files into a single test executable (I think that should work...)

class Status;

namespace io {
namespace internal {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be useful to develop a public API for this that is useful for people that use Arrow as a library

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes... OTOH, this is experimental and the API is likely to change depending on our internal uses. Do you think it should be made public?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now let's develop this for our own uses and then we can always promote to public API later

@@ -38,4 +39,5 @@ install(FILES
hdfs.h
interfaces.h
memory.h
readahead.h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May consider creating a module with many async interfaces, arrow/io/async.h

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not conflate terminology here. There's nothing async in this class: the Read method is blocking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

return pos;
}

static void AssertEventualPosition(std::shared_ptr<RandomAccessFile> file,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to get in habit of passing RandomAccessFile* unless you need the shared_ptr<T>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... I thought our policy was to pass a managed pointer most of the time?
This is just a private test helper regardless :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, not a big deal here, but in general utility code if a function doesn't need to return ownership of shared_ptr<T> then passing T* makes it more versatile

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, updated.

/// \brief EXPERIMENTAL: Create a readahead spooler wrapping the given input stream.
///
/// The spooler launches a background thread that reads up to a given number
/// of fixed-size blocks in advance from the underlying stream.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a bit more documentation about what left padding and right padding are

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

int64_t GetRightPadding();
void SetRightPadding(int64_t size);

/// \brief Close the spooler. This implicitly closes the underlying input stream.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You think closing the InputStream is the correct thing (curious)? If we were more consistent about using unique_ptr I would say we should pass the InputStream to the ctor as that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the InputStream is left in an undefined state, since its position has advanced more than the user has read, so I don't think it's useful to keep it open.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point

/// Otherwise, the buffer will contain at most read_size bytes in addition
/// to the configured padding (short reads are possible at the end of a file).
// How do we allow reusing the buffer in ReadaheadBuffer? perhaps by using
// a caching memory pool?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be too complicated; we need an API to "return" a buffer to the pool so that it will be popped off a deque in the future. This would mean that we could end up with queue_size + 1 buffers but that's true anyway in general

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is the solution with an explicit ReadaheadSpooler API. The other solution would be to have a caching memory pool that puts freed buffers (up to a given number) in a freelist.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. We could either create a memory pool-compatible API or define some kind of explicit TemporaryBufferQueue to coordinate different kinds of consumers and producers

private:
static constexpr int64_t kDefaultReadSize = 1 << 20; // 1 MB

class ARROW_NO_EXPORT Impl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: name this ReadaheadSpoolerImpl

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even for a nested class like this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK to leave as is

return read_status_;
}
if (eof_) {
// XXX maybe we need a EOF status code
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, can open an issue to consider in more detail. I think it's OK to return null for now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I'll remove the comment.


while (true) {
if (please_close_) {
goto eof;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't seen goto in this codebase yet, but if it's the best way to write this code it is ok with me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's useful for cleanup code at the end of a function like this. I know C++ uses RAII when possible but it would be unwieldy here to define a local class just for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem

@wesm wesm changed the title ARROW-501: [C++] Experimental readahead spooler ARROW-3170: [C++] Experimental readahead spooler Sep 4, 2018
@wesm
Copy link
Member

wesm commented Sep 4, 2018

I opened ARROW-3170 as this is not exactly what is scoped there

@wesm
Copy link
Member

wesm commented Sep 7, 2018

Anything more you wanted to do on this? Can merge otherwise

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants