PARQUET-1508: [C++] Read ByteArray data directly into arrow::BinaryBuilder and BinaryDictionaryBuilder. Refactor encoders/decoders to use cleaner virtual interfaces #3492
Conversation
cc @xhochy @majetideepak for your review. This work does not impact any public APIs (the encoders/decoders are non-public)
This yields maybe a 5-10% speedup in binary reads. I'm going to see if I can add some ASV benchmarks for this.

Setup:

```python
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pandas.util.testing as tm
string_length = 10
uniqueness = 100000
total_approx_bytes = 100000000
unique_values = [tm.rands(string_length) for i in range(uniqueness)]
values = unique_values * (total_approx_bytes // uniqueness // string_length)
table = pa.Table.from_arrays([pa.array(values)], names=['f1'])
def _write_table(table):
sink = pa.BufferOutputStream()
pq.write_table(table, sink, compression=None)
return sink.getvalue()
buf = _write_table(table)
```

0.12:

this patch:

(gcc 8.2.0)
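The read path this patch speeds up can be timed against the buffer produced above. A minimal harness might look like the following (a hypothetical sketch, not from the original thread; it reuses the `buf` created by the setup code):

```python
import timeit

import pyarrow as pa
import pyarrow.parquet as pq

def read_table():
    # Reading the buffer back exercises the ByteArray decode path that
    # this patch routes directly into arrow::BinaryBuilder.
    return pq.read_table(pa.BufferReader(buf))

n_iters = 10
elapsed = timeit.timeit(read_table, number=n_iters)
print("mean read time: %.3f s" % (elapsed / n_iters))
```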
```cpp
    return result;
  } else {
    return nullptr;
  }
```
This is kind of a hack. I didn't want to obsess over this class, since we also need to holistically look at the code duplication between RecordReader and ColumnReader, a known area of technical debt.
We're having some failing macOS builds in multiple PRs, so I don't think it's caused by this.
@wesm Looks like this is going in the right direction. I have no high-level comments, only nitpicking. As there are still some commented-out sections, I will wait a bit until the code has settled.
I'll get the build passing and do a quick pass through to make any more changes that I want to make. Some of the commented-out methods are there as placeholders for the person who is going to work on getting end-to-end Categorical reads working. I'll just remove them for now for the sake of cleanliness.
Just leave them there and add the ticket number. They can be helpful; I was just irritated by the commented-out functions without any explanatory text.
…bility
Refactor DictionaryDecoder to be more easily extensible
Add missing file
windows hell
Do not use __declspec(dllexport) when building parquet_static. does not fix problems though
gcc does not like using the DictEncoder name inside EncoderTraits
Change-Id: I67b51494da3e76f8e856a672cdfd29c9e257efbe
Add Ben's windows fixes
Refactor encoders to use virtual interface, with ability to add additional methods for ByteArray, FixedLenByteArray
Factor decoder code into encoding.cc
Tests passing again
Make DictEncoder templated too for consistency
Draft direct read into ChunkedBinaryBuilder
Finish implementing DecodeArrow methods for ByteArrayType
Fix release build
Use dynamic_cast in a couple places that MSVC doesn't like
Don't export interfaces that we don't need to debug
Change-Id: Ida18b93c1ea1e51216ee007b258635fcac6f54b5
Clearer names, comments, fix segfault that appeared when ZSTD was turned off
IWYU cleaning
Revert changes to schema.cc to fix msvc build
…emented DecodeArrowNonNull method for DictionaryBuilder
Change-Id: Id75eed26fd057aaf3f7ba6d8664c240a33a6cbd7
I'm done making code changes outside of getting the build to pass. I didn't get the Windows failure on VS 2015, but it seems to happen on VS 2017, so I'll look into that. I think I got a bit overzealous fixing IWYU issues.
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #3492      +/-   ##
==========================================
- Coverage   88.18%   88.15%   -0.03%
==========================================
  Files         635      635
  Lines       79824    80082     +258
  Branches     1069     1069
==========================================
+ Hits        70392    70599     +207
- Misses       9321     9368      +47
- Partials      111      115       +4
```
Continue to review full report at Codecov.
@xhochy please go ahead and merge this if it looks OK to you. I'll keep pushing to get end-to-end categorical read/write in time for 0.13. I opened https://issues.apache.org/jira/browse/ARROW-4398 about adding benchmarks for this
@xhochy any further thoughts about this? Parquet is technically review-then-commit, so I would prefer to have someone +1 the patch
Looks good, would be nice to have one more comment. Feel free to merge afterwards.
Co-Authored-By: wesm <wesm@users.noreply.github.com>
This patch ended up being a bit of a bloodbath, but it sorted out a number of technical debt problems.

Summary:

- There are now distinct virtual encoder and decoder interfaces for each Parquet type, e.g. `ByteArrayEncoder` and `ByteArrayDecoder` -- this enables adding new encoder or decoder methods without conflicting with the other types. This was very hard to do before because all types shared a common template such as `PlainDecoder<ByteArrayType>` (see the sketch at the end of this description)
- Encoder and decoder implementations now live in the `encoding.cc` compilation unit; performance should be unchanged (I will check to make sure)
- ByteArray data can be decoded directly into `ChunkedBinaryBuilder` or `BinaryDictionaryBuilder`. This unblocks the long-desired direct-to-categorical Parquet reads
- The Arrow read path now uses `ChunkedBinaryBuilder`. More work will be required to expose DictionaryArray reads in a sane way

Along the way I've decided I want to eradicate all instances of `extern template class` from the codebase. It's insanely brittle with different visibility rules in MSVC, gcc, AND clang (no kidding, gcc and clang do different things). I'll refactor the other parts of the codebase that use them later.
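To make the per-type layering above concrete, here is a rough sketch of the decoder side with a direct-to-builder decode hook. This is an illustrative reconstruction, not the code from the patch; the `ByteArray`/`ByteArrayType` stand-ins and the exact method signatures are assumptions based on the summary and the PR title:

```cpp
#include <cstdint>

#include "arrow/builder.h"  // arrow::BinaryBuilder, arrow::BinaryDictionaryBuilder

// Simplified stand-in for parquet::ByteArray (the real definition lives in
// parquet/types.h).
struct ByteArray {
  uint32_t len;
  const uint8_t* ptr;
};

// Stand-in for the ByteArrayType traits class used by the templates.
struct ByteArrayType {
  using c_type = ByteArray;
};

// Common decode interface shared by all physical types.
template <typename DType>
class TypedDecoder {
 public:
  using T = typename DType::c_type;
  virtual ~TypedDecoder() = default;
  virtual void SetData(int num_values, const uint8_t* data, int len) = 0;
  virtual int Decode(T* buffer, int max_values) = 0;
};

// ByteArray-specific interface: new methods can be added here without
// conflicting with the decoders for the other physical types, which was
// very hard when everything went through PlainDecoder<ByteArrayType>.
class ByteArrayDecoder : public TypedDecoder<ByteArrayType> {
 public:
  // Append decoded values directly into an Arrow builder, honoring a
  // validity bitmap, instead of materializing ByteArray structs first.
  virtual int DecodeArrow(int num_values, int null_count,
                          const uint8_t* valid_bits, int64_t valid_bits_offset,
                          arrow::BinaryBuilder* builder) = 0;

  // Variant for data known to contain no nulls, decoding into a
  // dictionary builder for direct-to-categorical reads.
  virtual int DecodeArrowNonNull(int num_values,
                                 arrow::BinaryDictionaryBuilder* builder) = 0;
};
```

The design point is that a method like `DecodeArrow` only has to exist on `ByteArrayDecoder`; under the old shared-template design, adding it would have forced every physical type to carry the same method.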