-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3170: [C++] Experimental readahead spooler #2492
Conversation
2f11399
to
3cbaa80
Compare
Codecov Report
@@ Coverage Diff @@
## master #2492 +/- ##
==========================================
- Coverage 86.28% 86.21% -0.07%
==========================================
Files 308 311 +3
Lines 47099 47338 +239
==========================================
+ Hits 40639 40814 +175
- Misses 6388 6450 +62
- Partials 72 74 +2
Continue to review full report at Codecov.
|
std::shared_ptr<Buffer> buffer; | ||
ABORT_NOT_OK(AllocateBuffer(data.length(), &buffer)); | ||
memcpy(buffer->mutable_data(), data.data(), data.length()); | ||
return std::make_shared<BufferReader>(std::move(buffer)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not return buffer
? Shouldn't RVO take care of not duplicating the pointer twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I'm returning a BufferReader, not a Buffer ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah!
cpp/src/arrow/io/readahead.h
Outdated
~ReadaheadSpooler(); | ||
|
||
/// Configure zero-padding at beginning and end of buffers (default 0 bytes). | ||
int64_t GetLeftPadding(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add more descriptive comments about what the paddings do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM
std::shared_ptr<Buffer> buffer; | ||
ABORT_NOT_OK(AllocateBuffer(data.length(), &buffer)); | ||
memcpy(buffer->mutable_data(), data.data(), data.length()); | ||
return std::make_shared<BufferReader>(std::move(buffer)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah!
I am working on a review, probably will finish later today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks mostly OK to me. My main question is how we could begin to employ asynchronous IO in some of our performance critical paths, like reading Parquet files etc. I suppose after the parquet-cpp merge we can begin to explore this question more quickly
@@ -26,6 +26,7 @@ if (ARROW_HDFS AND NOT ARROW_BOOST_HEADER_ONLY) | |||
endif() | |||
|
|||
ADD_ARROW_TEST(io-memory-test) | |||
ADD_ARROW_TEST(io-readahead-test) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might consider putting all async, buffering, etc. tests in a single test module to help avoid having a too large number of test executables longer term
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather have separate test modules. array-test.cc is unreadable for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this means 1) better parallelization of tests 2) when running tests for a module I don't endure the costs of testing other modules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. There's other tools we can use to help with this if testing link times become an issue, like linking multiple *-test.cc
files into a single test executable (I think that should work...)
class Status; | ||
|
||
namespace io { | ||
namespace internal { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be useful to develop a public API for this that is useful for people that use Arrow as a library
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes... OTOH, this is experimental and the API is likely to change depending on our internal uses. Do you think it should be made public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now let's develop this for our own uses and then we can always promote to public API later
@@ -38,4 +39,5 @@ install(FILES | |||
hdfs.h | |||
interfaces.h | |||
memory.h | |||
readahead.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May consider creating a module with many async interfaces, arrow/io/async.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not conflate terminology here. There's nothing async in this class: the Read
method is blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
return pos; | ||
} | ||
|
||
static void AssertEventualPosition(std::shared_ptr<RandomAccessFile> file, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to get in habit of passing RandomAccessFile*
unless you need the shared_ptr<T>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah... I thought our policy was to pass a managed pointer most of the time?
This is just a private test helper regardless :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, not a big deal here, but in general utility code if a function doesn't need to return ownership of shared_ptr<T>
then passing T*
makes it more versatile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, updated.
/// \brief EXPERIMENTAL: Create a readahead spooler wrapping the given input stream. | ||
/// | ||
/// The spooler launches a background thread that reads up to a given number | ||
/// of fixed-size blocks in advance from the underlying stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use a bit more documentation about what left padding and right padding are
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
int64_t GetRightPadding(); | ||
void SetRightPadding(int64_t size); | ||
|
||
/// \brief Close the spooler. This implicitly closes the underlying input stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You think closing the InputStream is the correct thing (curious)? If we were more consistent about using unique_ptr I would say we should pass the InputStream to the ctor as that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the InputStream is left in an undefined state, since its position has advanced more than the user has read, so I don't think it's useful to keep it open.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point
/// Otherwise, the buffer will contain at most read_size bytes in addition | ||
/// to the configured padding (short reads are possible at the end of a file). | ||
// How do we allow reusing the buffer in ReadaheadBuffer? perhaps by using | ||
// a caching memory pool? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be too complicated; we need an API to "return" a buffer to the pool so that it will be popped off a deque in the future. This would mean that we could end up with queue_size + 1
buffers but that's true anyway in general
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is the solution with an explicit ReadaheadSpooler API. The other solution would be to have a caching memory pool that puts freed buffers (up to a given number) in a freelist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. We could either create a memory pool-compatible API or define some kind of explicit TemporaryBufferQueue
to coordinate different kinds of consumers and producers
private: | ||
static constexpr int64_t kDefaultReadSize = 1 << 20; // 1 MB | ||
|
||
class ARROW_NO_EXPORT Impl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: name this ReadaheadSpoolerImpl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even for a nested class like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK to leave as is
cpp/src/arrow/io/readahead.cc
Outdated
return read_status_; | ||
} | ||
if (eof_) { | ||
// XXX maybe we need a EOF status code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, can open an issue to consider in more detail. I think it's OK to return null for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, I'll remove the comment.
|
||
while (true) { | ||
if (please_close_) { | ||
goto eof; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't seen goto
in this codebase yet, but if it's the best way to write this code it is ok with me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's useful for cleanup code at the end of a function like this. I know C++ uses RAII when possible but it would be unwieldy here to define a local class just for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem
f38562c
to
6f72e54
Compare
I opened ARROW-3170 as this is not exactly what is scoped there |
Anything more you wanted to do on this? Can merge otherwise |
6f72e54
to
cbbd3db
Compare
No description provided.