PARQUET-1422: [C++] Use common Arrow IO interfaces throughout codebase #4404
Conversation
@majetideepak I'm sensitive to how this might impact your use of these classes, so if something doesn't build right with these changes, please let me know and I'll fix it. The only APIs that changed were relatively internal ones.
cpp/src/parquet/deprecated_io.cc (outdated):

ParquetInputWrapper::~ParquetInputWrapper() {
  if (!closed_) {
    source_->Close();
Probably safe to put a try-catch block here, since this can throw inside a destructor.
In the old form:
~SerializedFile() override {
  try {
    Close();
  } catch (...) {
  }
}
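A minimal, self-contained sketch of the pattern being suggested (the class and function names here are illustrative stand-ins, not the actual Parquet wrapper classes): the destructor calls Close() inside a try-catch so that an I/O failure during cleanup never escapes a destructor, which would call std::terminate during stack unwinding.

```cpp
#include <stdexcept>

// Stand-in for a source whose Close() can throw.
struct ThrowingSource {
  bool closed = false;
  void Close() {
    closed = true;
    throw std::runtime_error("I/O error on close");
  }
};

// Stand-in for a wrapper like ParquetInputWrapper.
struct InputWrapper {
  ThrowingSource* source_;
  bool closed_ = false;
  explicit InputWrapper(ThrowingSource* s) : source_(s) {}
  ~InputWrapper() {
    if (!closed_) {
      try {
        source_->Close();  // may throw; must not escape a destructor
      } catch (...) {
        // Swallow: an exception leaving a destructor during unwinding aborts.
      }
      closed_ = true;
    }
  }
};

// Destroying the wrapper must not propagate the exception from Close().
bool DestroyWrapperSafely(ThrowingSource* s) {
  { InputWrapper w(s); }  // destructor runs here; exception is swallowed
  return s->closed;
}
```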
@wesm Thanks for the heads-up. The deprecated API should help us for now.
Regarding your open question on Parquet semantics of handling the ownership of files, I think we can leave the file ownership to the client.
Since this PR deprecates the RandomAccessSource and OutputStream APIs, the following APIs must be deprecated as well.
std::unique_ptr<ParquetFileReader> ParquetFileReader::Open(
std::unique_ptr<RandomAccessSource> source, const ReaderProperties& props,
const std::shared_ptr<FileMetaData>& metadata);
std::unique_ptr<ParquetFileWriter> ParquetFileWriter::Open(
const std::shared_ptr<OutputStream>& sink,
const std::shared_ptr<schema::GroupNode>& schema,
const std::shared_ptr<WriterProperties>& properties,
const std::shared_ptr<const KeyValueMetadata>& key_value_metadata);
CC: @AnatoliShein
I just saw that you have
Super weird failure in Travis CI
It looks like Cython just released on conda-forge https://github.com/conda-forge/cython-feedstock -- @pitrou @kszucs have you seen other issues?
Force-pushed from f23a468 to 077da2d
@majetideepak I addressed your comments and added unit tests for the wrapper classes as an extra security measure.
I reviewed the Arrow changes, skipped through the Parquet changes for the most part.
cpp/src/arrow/io/buffered.cc (outdated):

int64_t total_avail = bytes_buffered_;

if (raw_read_bound_ > 0) {
  total_avail += raw_read_bound_ - raw_read_total_;
I don't understand why you're doing this here. total_avail is what's left in the buffer, not in the whole stream.
In Parquet, it is peeking ahead into the available bytes in the raw stream to look for a data page header. So the total number of bytes available to peek is the number of bytes buffered plus any known additional bytes in the raw stream (as indicated by the bound parameter).
But that's only if raw_read_bound_ > 0. Otherwise you're only considering the number of buffered bytes... That seems inconsistent.
Well, we are in a difficult place with respect to finding the next data page in the stream. Can you look at where Peek is used in parquet/column_reader.cc so we can focus on addressing the core issue?
If raw_read_bound_ is not set, then total_avail should be treated as unbounded, does that sound reasonable?
Right.
Cleaned this up and added tests for the unbounded case
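The semantics agreed above can be sketched as follows. This is a minimal sketch, not Arrow's actual implementation: the names mirror the discussion, and the use of a negative raw_read_bound as the "not set" sentinel is my assumption for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Bytes available to Peek: the buffered bytes plus whatever remains below
// the raw-read bound. If no bound was set (negative here, by assumption),
// the stream is treated as unbounded.
int64_t TotalAvailable(int64_t bytes_buffered, int64_t raw_read_bound,
                       int64_t raw_read_total) {
  if (raw_read_bound < 0) {
    return std::numeric_limits<int64_t>::max();  // unbounded
  }
  return bytes_buffered + (raw_read_bound - raw_read_total);
}

// A Peek request is clamped to what is actually available.
int64_t ClampPeek(int64_t nbytes, int64_t bytes_buffered,
                  int64_t raw_read_bound, int64_t raw_read_total) {
  return std::min(
      nbytes, TotalAvailable(bytes_buffered, raw_read_bound, raw_read_total));
}
```

With a bound of 100 and 40 bytes already consumed from the raw stream, 10 buffered bytes give 70 peekable bytes; with no bound, a Peek is limited only by the requested size.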
cpp/src/arrow/io/buffered.cc (outdated):

// Read more data when buffer has insufficient left
if (nbytes > bytes_buffered_) {
  // Read as much as possible to fill the buffer, but not past stream end
  int64_t read_size = std::min(nbytes - bytes_buffered_, total_avail);
Is the min needed? There's already nbytes = std::min(nbytes, total_avail) above.
Addressed in cleanup
cpp/src/arrow/io/buffered.cc (outdated):

Status Peek(int64_t nbytes, util::string_view* out) {
  int64_t total_avail = bytes_buffered_;

  if (raw_read_bound_ > 0) {
raw_read_bound >= 0 below.
Will fix
cpp/src/arrow/io/interfaces.cc (outdated):

-  return util::string_view(nullptr, 0);
+Status InputStream::Peek(int64_t ARROW_ARG_UNUSED(nbytes), util::string_view* out) {
+  *out = util::string_view(nullptr, 0);
+  return Status::OK();
Why not return NotImplemented here now that we're returning a Status?
Well, it didn't fail before. Do you think it should fail?
Because it's not able to peek (it returns an empty string_view with a nullptr). Previously we weren't returning a Status, so we couldn't fail explicitly.
done
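A minimal sketch of the behavior agreed on here, using simplified stand-ins for arrow::Status and the output parameter (the real Arrow types and signatures differ): a base InputStream that cannot look ahead now fails explicitly instead of silently yielding an empty view.

```cpp
#include <cstdint>
#include <string>

// Simplified stand-in for arrow::Status, illustration only.
struct Status {
  bool ok;
  std::string message;
  static Status OK() { return {true, ""}; }
  static Status NotImplemented(const std::string& msg) { return {false, msg}; }
};

// Simplified stand-in for the base stream class: the default Peek fails
// explicitly, and subclasses that support look-ahead override it.
struct InputStream {
  virtual Status Peek(int64_t /*nbytes*/, std::string* out) {
    out->clear();
    return Status::NotImplemented("Peek not implemented on this stream");
  }
  virtual ~InputStream() = default;
};
```

Callers can then distinguish "this stream cannot peek" from "the stream is positioned at its end", which the old empty-string_view return conflated.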
cpp/src/arrow/io/memory.cc (outdated):

   if (!is_open_) {
-    return {};
+    *out = {};
+    return Status::OK();
Should return Status::Invalid like other methods.
Same question as above
Same answer: not returning a Status meant we couldn't fail explicitly :-)
done
I'll address the comments and get the CI passing tomorrow. Would be great to get this merged later tomorrow or Friday so that other Parquet PRs can rebase if necessary.
Force-pushed from 077da2d to f264e05
I think I've addressed all feedback. I'm going to merge this when the build is green if there are no objections.
Oh, glib needs some fixing.
constexpr int64_t kDefaultOutputStreamSize = 1024;

std::shared_ptr<::arrow::io::BufferOutputStream> CreateOutputStream(
    ::arrow::MemoryPool* pool = ::arrow::default_memory_pool());
It seems that PARQUET_EXPORT is missing.
bloom_filter-test.cc.obj : error LNK2019: unresolved external symbol "class std::shared_ptr<class arrow::io::BufferOutputStream> __cdecl parquet::CreateOutputStream(class arrow::MemoryPool *)" (?CreateOutputStream@parquet@@YA?AV?$shared_ptr@VBufferOutputStream@io@arrow@@@std@@PEAVMemoryPool@arrow@@@Z) referenced in function "private: virtual void __cdecl parquet::test::BasicTest_TestBloomFilter_Test::TestBody(void)" (?TestBody@BasicTest_TestBloomFilter_Test@test@parquet@@EEAAXXZ)
Ah thanks. Fixing
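For context, a hedged sketch of why the missing PARQUET_EXPORT produced the LNK2019 error above (the macro logic here is illustrative, not the actual Parquet header): on Windows shared-library builds, symbols are only visible across the DLL boundary if marked with __declspec(dllexport); on ELF platforms the analogous marker is default visibility.

```cpp
// Illustrative export macro. The real PARQUET_EXPORT also switches to
// __declspec(dllimport) for consumers of the DLL; that branch is omitted
// here so the snippet compiles standalone on any platform.
#if defined(_WIN32)
#  define PARQUET_EXPORT __declspec(dllexport)
#else
#  define PARQUET_EXPORT __attribute__((visibility("default")))
#endif

// Without the macro on Windows, this symbol would be invisible to other
// binaries, and any caller would fail to link with an unresolved external.
PARQUET_EXPORT int CreateAnswer() { return 42; }
```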
Force-pushed from ddcbff3 to 4c80f22
cpp/src/parquet/deprecated_io.h (outdated):

// ----------------------------------------------------------------------
// Wrapper classes

class ParquetInputWrapper : public ::arrow::io::RandomAccessFile {
PARQUET_EXPORT is missing here too.
cpp/src/parquet/deprecated_io.h (outdated):

  bool closed_;
};

class ParquetOutputWrapper : public ::arrow::io::OutputStream {
ditto.
… to be able to return Status
Force-pushed from 199b1b2 to f010a8e
I think I have it sorted out now -- tried actually building on Windows finally =) Rebased and will await CI.
Travis CI build is passing: https://travis-ci.org/wesm/arrow/builds/539704609. Merging.
@wesm: @czxrrr was porting this new API to Vertica and he discovered that the
class PARQUET_EXPORT BufferedInputStream : public InputStream {
 public:
  BufferedInputStream(::arrow::MemoryPool* pool, int64_t buffer_size,
                      RandomAccessSource* source, int64_t start, int64_t end);
The Arrow version takes an InputStream instead of RandomAccessSource. This is an issue since InputStream does not have ReadAt.
Well, yes, there's a reason it's called BufferedInputStream ;-) Feel free to work on a BufferedRandomFile (also good luck finding the right design).
Can you open a new JIRA issue about investigating this?
std::shared_ptr<::arrow::io::BufferedInputStream> stream;
PARQUET_THROW_NOT_OK(source->Seek(start));
PARQUET_THROW_NOT_OK(::arrow::io::BufferedInputStream::Create(
    buffer_size_, pool_, source, &stream, num_bytes));
The ArrowInputFile is a RandomAccessFile. However, ::arrow::io::BufferedInputStream takes an InputStream. Casting RandomAccessFile to InputStream is incorrect, since ::arrow::io::BufferedInputStream::Peek causes the RandomAccessFile offset to change.
Generally speaking, when using a buffering layer, the buffering layer owns the underlying raw file. Using both at once is incorrect. This is not Arrow-specific, but happens in any language (Python, C, etc.).
This requirement was not explicitly spelled out before; it only behaved as you wanted "by accident" (I think). Let's open a new JIRA and see what can be done.
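The ownership problem discussed above can be shown with a toy model (these are illustrative classes, not Arrow's): once a buffering layer reads ahead from the raw file, the raw position no longer matches the logical position seen through the buffer, so accessing the raw file directly is incorrect.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Toy raw stream with an explicit position.
struct RawStream {
  std::string data;
  int64_t pos = 0;
  std::string Read(int64_t n) {
    n = std::min<int64_t>(n, static_cast<int64_t>(data.size()) - pos);
    std::string out = data.substr(pos, n);
    pos += n;
    return out;
  }
};

// Toy buffered layer: Peek fills its buffer by reading from the raw
// stream, advancing the RAW position while the LOGICAL position of the
// buffered stream stays put.
struct BufferedStream {
  RawStream* raw;
  std::string buffer;  // bytes read ahead but not yet consumed
  std::string Peek(int64_t n) {
    if (static_cast<int64_t>(buffer.size()) < n) {
      buffer += raw->Read(n - static_cast<int64_t>(buffer.size()));
    }
    return buffer.substr(
        0, std::min<int64_t>(n, static_cast<int64_t>(buffer.size())));
  }
};
```

After a Peek of 4 bytes on "abcdefgh", the raw position is already at 4 even though nothing was logically consumed, so a direct raw Read would silently skip the buffered bytes. This is why the buffering layer must own the underlying file.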
This is a long overdue unification of platform code that wasn't possible until after the monorepo merge that occurred last year. This should also permit us to take a more consistent approach with regards to asynchronous IO.
A backwards compatibility layer is provided for the now deprecated parquet::RandomAccessSource and parquet::OutputStream classes. Some incidental changes were required to get things to work:

- arrow::io::InputStream::Peek needed to have its API changed to return Status, because of the next point
- arrow::io::BufferedInputStream::Peek will expand the buffer if a Peek is requested that is larger than the buffer. The idea is that it should be possible to "look ahead" in the stream without altering the stream position. This is needed as part of finding the next data header (which can be large or small depending on statistics size, etc.) in a Parquet stream
- Added an [] operator to Buffer to facilitate testing

Some outstanding questions:

- Close is not appropriate. I've attempted to preserve this logic by having Close called in the destructors of the wrapper classes in parquet/deprecated_io.h
- An issue I ran into