
PARQUET-1508: [C++] Read ByteArray data directly into arrow::BinaryBuilder and BinaryDictionaryBuilder. Refactor encoders/decoders to use cleaner virtual interfaces #3492

Closed
wants to merge 7 commits

Conversation

@wesm (Member) commented Jan 26, 2019

This patch ended up being a bit of a bloodbath, but it sorted out a number of technical debt problems.

Summary:

  • Add type-specific virtual encoder and decoder interfaces such as ByteArrayEncoder and ByteArrayDecoder -- this enables adding new encoder or decoder methods without conflicting with the other types (a sketch of the idea follows this list). This was very hard to do before because all types shared a common class template such as PlainDecoder<ByteArrayType>
  • Encoder and decoder implementations now all live in a single encoding.cc compilation unit; performance should be unchanged (I will check to make sure)
  • Add BYTE_ARRAY decoder methods that write into ChunkedBinaryBuilder or BinaryDictionaryBuilder. This unblocks the long-desired direct-to-categorical Parquet reads
  • Altered RecordReader to decode BYTE_ARRAY values directly into ChunkedBinaryBuilder. More work will be required to expose DictionaryArray reads in a sane way
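
For context, here is a minimal sketch of what such a type-specific virtual decoder interface can look like, with DecodeArrow methods that append straight into an Arrow builder. This is illustrative only: the exact method names, parameter lists, and headers below are assumptions for the sketch, not the actual parquet-cpp declarations.

// Illustrative sketch only; signatures are assumptions, not parquet-cpp's API.
#include <cstdint>

#include "arrow/builder.h"   // arrow::BinaryBuilder, arrow::BinaryDictionaryBuilder
#include "parquet/types.h"   // parquet::ByteArray

namespace parquet {

// Type-agnostic part of the decoder interface.
class Decoder {
 public:
  virtual ~Decoder() = default;
  virtual void SetData(int num_values, const uint8_t* data, int len) = 0;
};

// Type-specific interface: new methods can be added here without touching
// the decoders for the other physical types.
class ByteArrayDecoder : public Decoder {
 public:
  // Classic API: materialize raw ByteArray views into a caller-provided buffer.
  virtual int Decode(ByteArray* buffer, int max_values) = 0;

  // Append decoded values (honoring a validity bitmap) directly into an
  // Arrow builder, avoiding an intermediate ByteArray pass.
  virtual int DecodeArrow(int num_values, int null_count,
                          const uint8_t* valid_bits, int64_t valid_bits_offset,
                          arrow::BinaryBuilder* out) = 0;
  virtual int DecodeArrow(int num_values, int null_count,
                          const uint8_t* valid_bits, int64_t valid_bits_offset,
                          arrow::BinaryDictionaryBuilder* out) = 0;
};

// Concrete implementations (e.g. a PLAIN decoder and a dictionary decoder)
// can then live entirely inside encoding.cc:
//   class PlainByteArrayDecoder : public ByteArrayDecoder { ... };

}  // namespace parquet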

Along the way I've decided I want to eradicate all instances of extern template class from the codebase. It's insanely brittle with the different visibility rules in MSVC, gcc, and clang (no kidding, gcc and clang do different things). I'll refactor the other parts of the codebase that use it later
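
For reference, here is a rough sketch of the extern-template pattern being removed; the types, class body, and placement of export attributes below are simplified stand-ins, not parquet-cpp's exact code.

// Rough sketch of the extern-template pattern; simplified, not the exact code.
#include <cstdint>

struct ByteArrayType;  // placeholder for the real physical-type tag

// In a public header: one class template shared by all physical types.
template <typename DType>
class PlainDecoder {
 public:
  void SetData(int num_values, const uint8_t* data, int len);
  // ...
};

// Also in the header: suppress implicit instantiation in every translation
// unit and rely on a single explicit instantiation in one .cc file. Where a
// dllexport/visibility attribute has to be attached for this to link cleanly
// differs between MSVC, gcc, and clang, which is what makes it brittle.
extern template class PlainDecoder<ByteArrayType>;

// In exactly one .cc file (e.g. encoding.cc), the matching definition:
//   template class PlainDecoder<ByteArrayType>;

With the virtual interfaces described above, the concrete templates can stay inside encoding.cc and only the abstract interfaces need to be visible to callers, so these explicit instantiation declarations are no longer needed in headers.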

@wesm (Member, Author) commented Jan 26, 2019

cc @xhochy @majetideepak for your review. This work does not impact any public APIs (the encoders/decoders are non-public)

@wesm (Member, Author) commented Jan 26, 2019

This yields maybe a 5-10% speedup in binary reads. I'm going to see if I can add some ASV benchmarks for this.

setup

import pandas as pd
import numpy as np

import pyarrow as pa
import pyarrow.parquet as pq

import pandas.util.testing as tm

# ~100 MB of 10-character strings drawn from 100,000 unique values
string_length = 10
uniqueness = 100000
total_approx_bytes = 100000000

unique_values = [tm.rands(string_length) for i in range(uniqueness)]
values = unique_values * (total_approx_bytes // uniqueness // string_length)

table = pa.Table.from_arrays([pa.array(values)], names=['f1'])

def _write_table(table):
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink, compression=None)
    return sink.getvalue()

buf = _write_table(table)

0.12:

In [3]: timeit pq.read_table(buf)
306 ms ± 7.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

this patch:

In [2]: timeit pq.read_table(buf)
276 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(gcc 8.2.0)

@wesm commented on an inline diff:

      return result;
    } else {
      return nullptr;
    }

This is kind of a hack. I didn't want to obsess over this class since we also need to holistically look at the code duplication between RecordReader and ColumnReader, a known area of technical debt.

@wesm (Member, Author) commented Jan 27, 2019

We're having some failing macOS builds in multiple PRs, so I don't think it's caused by this:

-- Configuring done

CMake Warning (dev):
  Policy CMP0068 is not set: RPATH settings on macOS do not affect
  install_name.  Run "cmake --help-policy CMP0068" for policy details.  Use
  the cmake_policy command to set the policy and suppress this warning.

  For compatibility with older versions of CMake, the install_name fields for
  the following targets are still affected by RPATH settings:

   arrow_python_shared
   arrow_shared
   arrow_testing_shared
   parquet_shared
   plasma_shared

This warning is for project developers.  Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /Users/travis/build/apache/arrow/cpp-build
ninja: error: manifest 'build.ninja' still dirty after 100 tries

@xhochy (Member) commented Jan 27, 2019

@wesm Looks like this is going in the right direction. I have no high-level comments, only nitpicks. As there are still some commented-out sections, I will wait a bit until the code has settled.

@wesm (Member, Author) commented Jan 27, 2019

I'll get the build passing and do a quick pass to make any remaining changes I want to make. Some of the commented-out methods are there as placeholders for the person who is going to work on getting end-to-end Categorical reads working. I'll just remove them for now for cleanliness.

@xhochy (Member) commented Jan 27, 2019

> Some of the commented-out methods are there as placeholders for the person who is going to work on getting end-to-end Categorical reads working. I'll just remove them for now for cleanliness.

Just leave them there and add the ticket number. They can be helpful; I was just irritated by the commented-out functions without any explanatory text.

Commits added to this pull request:

…bility
Refactor DictionaryDecoder to be more easily extensible
Add missing file
windows hell
Do not use __declspec(dllexport) when building parquet_static. does not fix problems though
gcc does not like using the DictEncoder name inside EncoderTraits
Change-Id: I67b51494da3e76f8e856a672cdfd29c9e257efbe
Add Ben's windows fixes
Refactor encoders to use virtual interface, with ability to add additional methods for ByteArray, FixedLenByteArray
Factor decoder code into encoding.cc
Tests passing again
Make DictEncoder templated too for consistency
Draft direct read into ChunkedBinaryBuilder
Finish implementing DecodeArrow methods for ByteArrayType
Fix release build
Use dynamic_cast in a couple places that MSVC doesn't like
Don't export interfaces that we don't need to
debug
Change-Id: Ida18b93c1ea1e51216ee007b258635fcac6f54b5
Clearer names, comments, fix segfault that appeared when ZSTD was turned off
IWYU cleaning
Revert changes to schema.cc to fix msvc build
…emented DecodeArrowNonNull method for DictionaryBuilder
Change-Id: Id75eed26fd057aaf3f7ba6d8664c240a33a6cbd7
Change-Id: Icd4fdd9c9f6358a76e25d0eeb994773a707e6e3b
@wesm (Member, Author) commented Jan 27, 2019

I'm done making code changes, aside from getting the build to pass. I didn't get the Windows failure on VS 2015, but it seems to happen on VS 2017, so I'll look into that. I think I got a bit overzealous fixing IWYU issues.

@codecov-io commented
Codecov Report

Merging #3492 into master will decrease coverage by 0.02%.
The diff coverage is 90.03%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3492      +/-   ##
==========================================
- Coverage   88.18%   88.15%   -0.03%     
==========================================
  Files         635      635              
  Lines       79824    80082     +258     
  Branches     1069     1069              
==========================================
+ Hits        70392    70599     +207     
- Misses       9321     9368      +47     
- Partials      111      115       +4
Impacted Files Coverage Δ
cpp/src/parquet/column_writer.h 70.58% <ø> (ø) ⬆️
cpp/src/parquet/bloom_filter.cc 91.13% <ø> (ø) ⬆️
cpp/src/parquet/metadata.h 100% <ø> (ø) ⬆️
cpp/src/parquet/file_reader.cc 97.38% <ø> (ø) ⬆️
cpp/src/parquet/column_writer-test.cc 99.56% <ø> (ø) ⬆️
cpp/src/parquet/file_writer.h 100% <ø> (ø) ⬆️
cpp/src/parquet/column_reader.h 91.05% <ø> (ø) ⬆️
cpp/src/parquet/schema.cc 90.22% <ø> (ø) ⬆️
cpp/src/parquet/arrow/writer.h 100% <ø> (ø) ⬆️
cpp/src/parquet/arrow/schema.cc 91.85% <ø> (ø) ⬆️
... and 45 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f6e0c2...4bafc55. Read the comment docs.

@wesm (Member, Author) commented Jan 28, 2019

@xhochy please go ahead and merge this if it looks OK to you. I'll keep pushing to get end-to-end categorical read/write in time for 0.13.

I opened https://issues.apache.org/jira/browse/ARROW-4398 about adding benchmarks for this

@wesm (Member, Author) commented Jan 28, 2019

@xhochy any further thoughts about this? Parquet is technically review-then-commit, so I would prefer to have someone +1 the patch.

@xhochy (Member) left a review:
Looks good, would be nice to have one more comment. Feel free to merge afterwards.

Review thread on cpp/src/parquet/arrow/record_reader.cc (resolved)
xhochy and others added 2 commits January 29, 2019 10:24
Co-Authored-By: wesm <wesm@users.noreply.github.com>
Change-Id: Ic7faef17bc54e81d4975eece5d2f7d316d87f03d
@wesm closed this in 3d435e4 on Jan 29, 2019