
PARQUET-1422: [C++] Use common Arrow IO interfaces throughout codebase #4404

Closed · wants to merge 18 commits

Conversation

@wesm (Member) commented May 29, 2019:

This is a long-overdue unification of platform code that wasn't possible until after the monorepo merge last year. It should also let us take a more consistent approach to asynchronous IO.

A backwards-compatibility layer is provided for the now-deprecated parquet::RandomAccessSource and parquet::OutputStream classes.

Some incidental changes were required to get things to work:

  • ARROW-5428: Adding a "read extent" option to BufferedInputStream to limit the extent of bytes read from the underlying raw stream
  • arrow::io::InputStream::Peek needed to have its API changed to return Status, because of the next point
  • arrow::io::BufferedOutputStream::Peek will expand the buffer if a Peek is requested that is larger than the buffer. The idea is that it should be possible to "look ahead" in the stream without altering the stream position. This is needed as part of finding the next data header (which can be large or small depending on statistics size, etc.) in a Parquet stream
  • Added an operator[] to Buffer to facilitate testing
  • Some continued "flattening" of the "parquet/util" directory to be simpler

Some outstanding questions:

  • The Parquet reader and writer classes assumed exclusive ownership of the file handles, and they are closed when the Parquet file is closed. Arrow files are shared, and so calling Close is not appropriate. I've attempted to preserve this logic by having Close called in the destructors of the wrapper classes in parquet/deprecated_io.h

An issue I ran into:

  • Changes in d82ac40 introduced a unit test with meaningful trailing whitespace, which my editor strips away. I've commented out the offending test and will open a JIRA about fixing it

@wesm (Member Author) commented May 29, 2019:

@majetideepak I'm sensitive to how this might impact your use of these classes, so if something doesn't build right with these changes please let me know and I'll fix it. The only APIs that changed were relatively internal ones.


ParquetInputWrapper::~ParquetInputWrapper() {
  if (!closed_) {
    source_->Close();
Contributor:

Probably safe to put a try-catch block here since this can throw inside a destructor.
In the old form:

~SerializedFile() override {
    try {
      Close();
    } catch (...) {
    }
  }

@majetideepak (Contributor) left a comment:

@wesm Thanks for the heads-up. The deprecated API should help us for now.
Regarding your open question on Parquet semantics of handling the ownership of files, I think we can leave the file ownership to the client.
Since this PR deprecates the RandomAccessSource and OutputStream APIs, the following APIs must be deprecated as well:

std::unique_ptr<ParquetFileReader> ParquetFileReader::Open(
    std::unique_ptr<RandomAccessSource> source, const ReaderProperties& props,
    const std::shared_ptr<FileMetaData>& metadata);
std::unique_ptr<ParquetFileWriter> ParquetFileWriter::Open(
    const std::shared_ptr<OutputStream>& sink,
    const std::shared_ptr<schema::GroupNode>& schema,
    const std::shared_ptr<WriterProperties>& properties,
    const std::shared_ptr<const KeyValueMetadata>& key_value_metadata);

CC: @AnatoliShein

@majetideepak (Contributor) commented:

I just saw that you have ARROW_DEPRECATED("Use arrow::io::RandomAccessFile version") for ParquetFileReader::Open(). You have to add a similar one for ParquetFileWriter::Open()

@wesm (Member Author) commented May 29, 2019:

Super weird failure in Travis CI:

[ 11%] Building CXX object CMakeFiles/_orc.dir/_orc.cpp.o
/home/travis/build/apache/arrow/python/build/temp.linux-x86_64-2.7/_orc.cpp: In function 'PyCodeObject* __Pyx_createFrameCodeObject(const char*, const char*, int)':
/home/travis/build/apache/arrow/python/build/temp.linux-x86_64-2.7/_orc.cpp:5770:1: error: label 'bad' defined but not used [-Werror=unused-label]
 bad:
 ^~~
cc1plus: all warnings being treated as errors

It looks like a new Cython was just released on conda-forge (https://github.com/conda-forge/cython-feedstock) -- @pitrou @kszucs have you seen other issues?

@wesm (Member Author) commented May 29, 2019:

@majetideepak I addressed your comments and added unit tests for the wrapper classes as an extra safety measure.

@pitrou (Member) left a comment:

I reviewed the Arrow changes and skimmed through the Parquet changes for the most part.

int64_t total_avail = bytes_buffered_;

if (raw_read_bound_ > 0) {
  total_avail += raw_read_bound_ - raw_read_total_;
Member:

I don't understand why you're doing this here. total_avail is what's left in the buffer, not in the whole stream.

Member Author:

In Parquet, it is peeking ahead into the available bytes in the raw stream to look for a data page header. So the total number of bytes available to peek is the number of bytes buffered plus any known additional bytes in the raw stream (as indicated by the bound parameter).

Member:

But that's only if raw_read_bound_ > 0. Otherwise you're only considering the number of buffered bytes... That seems inconsistent.

Member Author:

Well, we are in a difficult place with respect to finding the next data page in the stream. Can you look at where Peek is used in parquet/column_reader.cc so we can focus on addressing the core issue?

Member Author:

If raw_read_bound_ is not set, then the total_avail should be treated as unbounded, does that sound reasonable?

Member:

Right.

Member Author:

Cleaned this up and added tests for the unbounded case

// Read more data when buffer has insufficient left
if (nbytes > bytes_buffered_) {
  // Read as much as possible to fill the buffer, but not past stream end
  int64_t read_size = std::min(nbytes - bytes_buffered_, total_avail);
Member:

Is the min needed? There's already nbytes = std::min(nbytes, total_avail) above.

Member Author:

addressed in cleanup

Status Peek(int64_t nbytes, util::string_view* out) {
  int64_t total_avail = bytes_buffered_;

  if (raw_read_bound_ > 0) {
Member:

This should be raw_read_bound_ >= 0 below.

Member Author:

Will fix

Status InputStream::Peek(int64_t ARROW_ARG_UNUSED(nbytes), util::string_view* out) {
  *out = util::string_view(nullptr, 0);
  return Status::OK();
Member:

Why not return NotImplemented here now that we're returning a Status?

Member Author:

Well, it didn't fail before. Do you think it should fail?

Member:

Because it's not able to peek (it returns an empty string_view with a nullptr). Previously we weren't returning a Status, so we couldn't fail explicitly.

Member Author:

done

if (!is_open_) {
  *out = {};
  return Status::OK();
Member:

Should return Status::Invalid like other methods.

Member Author:

Same question as above

Member:

Same answer: not returning a Status meant we couldn't fail explicitly :-)

Member Author:

done

@wesm (Member Author) commented May 29, 2019:

I'll address the comments and get CI passing tomorrow. It would be great to get this merged later tomorrow or Friday so that other Parquet PRs can rebase if necessary.

@wesm (Member Author) commented May 30, 2019:

I think I've addressed all the feedback. I'm going to merge this when the build is green if there are no objections.

@wesm (Member Author) commented May 30, 2019:

Oh, glib needs some fixing

constexpr int64_t kDefaultOutputStreamSize = 1024;

std::shared_ptr<::arrow::io::BufferOutputStream> CreateOutputStream(
    ::arrow::MemoryPool* pool = ::arrow::default_memory_pool());
Member:

It seems that PARQUET_EXPORT is missing.

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/24938354/job/8eqg024oo1mim9y8#L3103

bloom_filter-test.cc.obj : error LNK2019: unresolved external symbol "class std::shared_ptr<class arrow::io::BufferOutputStream> __cdecl parquet::CreateOutputStream(class arrow::MemoryPool *)" (?CreateOutputStream@parquet@@YA?AV?$shared_ptr@VBufferOutputStream@io@arrow@@@std@@PEAVMemoryPool@arrow@@@Z) referenced in function "private: virtual void __cdecl parquet::test::BasicTest_TestBloomFilter_Test::TestBody(void)" (?TestBody@BasicTest_TestBloomFilter_Test@test@parquet@@EEAAXXZ)

Member Author:

Ah thanks. Fixing

// ----------------------------------------------------------------------
// Wrapper classes

class ParquetInputWrapper : public ::arrow::io::RandomAccessFile {
Member:

PARQUET_EXPORT is missing here too.

bool closed_;
};

class ParquetOutputWrapper : public ::arrow::io::OutputStream {
Member:

ditto.

@wesm (Member Author) commented May 31, 2019:

I think I have it sorted out now -- tried actually building on Windows finally =)

Rebased and will await CI

@wesm (Member Author) commented May 31, 2019:

Travis CI build is passing: https://travis-ci.org/wesm/arrow/builds/539704609. Merging

@wesm wesm closed this in ff2ee42 May 31, 2019
@wesm wesm deleted the parquet-use-arrow-io branch May 31, 2019 15:18
@majetideepak (Contributor) commented:

@wesm: @czxrrr was porting this new API to Vertica and discovered that arrow::io::BufferedInputStream::Peek() is not compatible with Parquet's implementation in this change.
The issue is that arrow::io::BufferedInputStream::Peek() (https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/buffered.cc#L291) uses Read, which ends up modifying the raw stream's offset. Parquet's implementation here uses ReadAt, which is missing from Arrow's API.
What are your thoughts on fixing this?

@pitrou (Member) commented Aug 8, 2019:

BufferedInputStream takes an input stream, not a random-access file, so it cannot call ReadAt.

class PARQUET_EXPORT BufferedInputStream : public InputStream {
 public:
  BufferedInputStream(::arrow::MemoryPool* pool, int64_t buffer_size,
                      RandomAccessSource* source, int64_t start, int64_t end);
Contributor:

The Arrow version takes an InputStream instead of RandomAccessSource. This is an issue since InputStream does not have ReadAt.

Member:

Well, yes, there's a reason it's called BufferedInputStream ;-)
Feel free to work on a BufferedRandomFile (also good luck finding the right design).

@wesm (Member Author) commented Aug 8, 2019:

Can you open a new JIRA issue about investigating this?

std::shared_ptr<::arrow::io::BufferedInputStream> stream;
PARQUET_THROW_NOT_OK(source->Seek(start));
PARQUET_THROW_NOT_OK(::arrow::io::BufferedInputStream::Create(
    buffer_size_, pool_, source, &stream, num_bytes));
Contributor:

The ArrowInputFile is a RandomAccessFile, but ::arrow::io::BufferedInputStream takes an InputStream. Treating the RandomAccessFile as an InputStream is incorrect, since ::arrow::io::BufferedInputStream::Peek causes the RandomAccessFile offset to change.

Member:

Generally speaking, when using a buffering layer, the buffering layer owns the underlying raw file. Using both at once is incorrect. This is not Arrow-specific, but happens in any language (Python, C, etc.).

@wesm (Member Author) commented Aug 8, 2019:

This requirement was not explicitly spelled out before; it only behaved as you wanted "by accident" (I think). Let's open a new JIRA and see what can be done.
