
PARQUET-1508: [C++] Read ByteArray data directly into arrow::BinaryBuilder and BinaryDictionaryBuilder. Refactor encoders/decoders to use cleaner virtual interfaces #3492

Closed
wants to merge 7 commits

Conversation

@wesm (Member) commented Jan 26, 2019

This patch ended up being a bit of a bloodbath, but it sorted out a number of technical debt problems.

Summary:

  • Add type-specific virtual encoder and decoder interfaces such as ByteArrayEncoder and ByteArrayDecoder -- this enables adding new encoder or decoder methods without conflicting with the other types (a sketch of the idea follows this list). This was very hard to do before because all types shared a common class template such as PlainDecoder<ByteArrayType>
  • Encoder and decoder implementations now all live in a single encoding.cc compilation unit; performance should be unchanged (I will check to make sure)
  • Add BYTE_ARRAY decoder methods that write into ChunkedBinaryBuilder or BinaryDictionaryBuilder. This unblocks the long-desired direct-to-categorical Parquet reads
  • Altered RecordReader to decode BYTE_ARRAY values directly into ChunkedBinaryBuilder. More work will be required to expose DictionaryArray reads in a sane way
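
For context, here is a minimal sketch of what such a type-specific virtual decoder interface can look like, with DecodeArrow methods that append straight into an Arrow builder. This is illustrative only: the exact method names, parameter lists, and headers below are assumptions for the sketch, not the actual parquet-cpp declarations.

// Illustrative sketch only; signatures are assumptions, not parquet-cpp's API.
#include <cstdint>

#include "arrow/builder.h"   // arrow::BinaryBuilder, arrow::BinaryDictionaryBuilder
#include "parquet/types.h"   // parquet::ByteArray

namespace parquet {

// Type-agnostic part of the decoder interface.
class Decoder {
 public:
  virtual ~Decoder() = default;
  virtual void SetData(int num_values, const uint8_t* data, int len) = 0;
};

// Type-specific interface: new methods can be added here without touching
// the decoders for the other physical types.
class ByteArrayDecoder : public Decoder {
 public:
  // Classic API: materialize raw ByteArray views into a caller-provided buffer.
  virtual int Decode(ByteArray* buffer, int max_values) = 0;

  // Append decoded values (honoring a validity bitmap) directly into an
  // Arrow builder, avoiding an intermediate ByteArray pass.
  virtual int DecodeArrow(int num_values, int null_count,
                          const uint8_t* valid_bits, int64_t valid_bits_offset,
                          arrow::BinaryBuilder* out) = 0;
  virtual int DecodeArrow(int num_values, int null_count,
                          const uint8_t* valid_bits, int64_t valid_bits_offset,
                          arrow::BinaryDictionaryBuilder* out) = 0;
};

// Concrete implementations (e.g. a PLAIN decoder and a dictionary decoder)
// can then live entirely inside encoding.cc:
//   class PlainByteArrayDecoder : public ByteArrayDecoder { ... };

}  // namespace parquet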

Along the way I've decided I want to eradicate all instances of extern template class from the codebase. It's insanely brittle with the different visibility rules in MSVC, gcc, and clang (no kidding, gcc and clang do different things). I'll refactor the other parts of the codebase that use it later
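
For reference, here is a rough sketch of the extern-template pattern being removed; the types, class body, and placement of export attributes below are simplified stand-ins, not parquet-cpp's exact code.

// Rough sketch of the extern-template pattern; simplified, not the exact code.
#include <cstdint>

struct ByteArrayType;  // placeholder for the real physical-type tag

// In a public header: one class template shared by all physical types.
template <typename DType>
class PlainDecoder {
 public:
  void SetData(int num_values, const uint8_t* data, int len);
  // ...
};

// Also in the header: suppress implicit instantiation in every translation
// unit and rely on a single explicit instantiation in one .cc file. Where a
// dllexport/visibility attribute has to be attached for this to link cleanly
// differs between MSVC, gcc, and clang, which is what makes it brittle.
extern template class PlainDecoder<ByteArrayType>;

// In exactly one .cc file (e.g. encoding.cc), the matching definition:
//   template class PlainDecoder<ByteArrayType>;

With the virtual interfaces described above, the concrete templates can stay inside encoding.cc and only the abstract interfaces need to be visible to callers, so these explicit instantiation declarations are no longer needed in headers.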

@wesm (Member, Author) commented Jan 26, 2019

cc @xhochy @majetideepak for your review. This work does not impact any public APIs (the encoders/decoders are non-public)

@wesm (Member, Author) commented Jan 26, 2019

This yields maybe a 5-10% speedup in binary reads. I'm going to see if I can add some ASV benchmarks for this.

setup

import pandas as pd
import numpy as np

import pyarrow as pa
import pyarrow.parquet as pq

import pandas.util.testing as tm

# ~100 MB of 10-character strings drawn from 100,000 unique values
string_length = 10
uniqueness = 100000
total_approx_bytes = 100000000

unique_values = [tm.rands(string_length) for i in range(uniqueness)]
values = unique_values * (total_approx_bytes // uniqueness // string_length)

table = pa.Table.from_arrays([pa.array(values)], names=['f1'])

def _write_table(table):
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink, compression=None)
    return sink.getvalue()

buf = _write_table(table)

0.12:

In [3]: timeit pq.read_table(buf)
306 ms ± 7.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

this patch:

In [2]: timeit pq.read_table(buf)
276 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(gcc 8.2.0)

@wesm commented on an inline diff:

      return result;
    } else {
      return nullptr;
    }

This is kind of a hack. I didn't want to obsess over this class since we also need to holistically look at the code duplication between RecordReader and ColumnReader, a known area of technical debt.

@wesm (Member, Author) commented Jan 27, 2019

We're having some failing macOS builds in multiple PRs, so I don't think it's caused by this:

-- Configuring done

CMake Warning (dev):
  Policy CMP0068 is not set: RPATH settings on macOS do not affect
  install_name.  Run "cmake --help-policy CMP0068" for policy details.  Use
  the cmake_policy command to set the policy and suppress this warning.

  For compatibility with older versions of CMake, the install_name fields for
  the following targets are still affected by RPATH settings:

   arrow_python_shared
   arrow_shared
   arrow_testing_shared
   parquet_shared
   plasma_shared

This warning is for project developers.  Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /Users/travis/build/apache/arrow/cpp-build
ninja: error: manifest 'build.ninja' still dirty after 100 tries

@xhochy (Member) commented Jan 27, 2019

@wesm Looks like this is going in the right direction. I have no high-level comments, only nitpicks. As there are still some commented-out sections, I will wait a bit until the code has settled.

@wesm (Member, Author) commented Jan 27, 2019

I'll get the build passing and do a quick pass to make any remaining changes I want to make. Some of the commented-out methods are there as placeholders for the person who is going to work on getting end-to-end Categorical reads working. I'll just remove them for now for cleanliness.

@xhochy (Member) commented Jan 27, 2019

> Some of the commented-out methods are there as placeholders for the person who is going to work on getting end-to-end Categorical reads working. I'll just remove them for now for cleanliness.

Just leave them there and add the ticket number. They can be helpful; I was just irritated by the commented-out functions without any explanatory text.

Commits added to this pull request:

…bility
Refactor DictionaryDecoder to be more easily extensible
Add missing file
windows hell
Do not use __declspec(dllexport) when building parquet_static. does not fix problems though
gcc does not like using the DictEncoder name inside EncoderTraits
Change-Id: I67b51494da3e76f8e856a672cdfd29c9e257efbe
Add Ben's windows fixes
Refactor encoders to use virtual interface, with ability to add additional methods for ByteArray, FixedLenByteArray
Factor decoder code into encoding.cc
Tests passing again
Make DictEncoder templated too for consistency
Draft direct read into ChunkedBinaryBuilder
Finish implementing DecodeArrow methods for ByteArrayType
Fix release build
Use dynamic_cast in a couple places that MSVC doesn't like
Don't export interfaces that we don't need to
debug
Change-Id: Ida18b93c1ea1e51216ee007b258635fcac6f54b5
Clearer names, comments, fix segfault that appeared when ZSTD was turned off
IWYU cleaning
Revert changes to schema.cc to fix msvc build
…emented DecodeArrowNonNull method for DictionaryBuilder
Change-Id: Id75eed26fd057aaf3f7ba6d8664c240a33a6cbd7
Change-Id: Icd4fdd9c9f6358a76e25d0eeb994773a707e6e3b
@wesm (Member, Author) commented Jan 27, 2019

I'm done making code changes, aside from getting the build to pass. I didn't get the Windows failure on VS 2015, but it seems to happen on VS 2017, so I'll look into that. I think I got a bit overzealous fixing IWYU issues.

@codecov-io commented
Codecov Report

Merging #3492 into master will decrease coverage by 0.02%.
The diff coverage is 90.03%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3492      +/-   ##
==========================================
- Coverage   88.18%   88.15%   -0.03%     
==========================================
  Files         635      635              
  Lines       79824    80082     +258     
  Branches     1069     1069              
==========================================
+ Hits        70392    70599     +207     
- Misses       9321     9368      +47     
- Partials      111      115       +4
Impacted Files Coverage Δ
cpp/src/parquet/column_writer.h 70.58% <ø> (ø) ⬆️
cpp/src/parquet/bloom_filter.cc 91.13% <ø> (ø) ⬆️
cpp/src/parquet/metadata.h 100% <ø> (ø) ⬆️
cpp/src/parquet/file_reader.cc 97.38% <ø> (ø) ⬆️
cpp/src/parquet/column_writer-test.cc 99.56% <ø> (ø) ⬆️
cpp/src/parquet/file_writer.h 100% <ø> (ø) ⬆️
cpp/src/parquet/column_reader.h 91.05% <ø> (ø) ⬆️
cpp/src/parquet/schema.cc 90.22% <ø> (ø) ⬆️
cpp/src/parquet/arrow/writer.h 100% <ø> (ø) ⬆️
cpp/src/parquet/arrow/schema.cc 91.85% <ø> (ø) ⬆️
... and 45 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f6e0c2...4bafc55. Read the comment docs.

@wesm (Member, Author) commented Jan 28, 2019

@xhochy please go ahead and merge this if it looks OK to you. I'll keep pushing to get end-to-end categorical read/write in time for 0.13.

I opened https://issues.apache.org/jira/browse/ARROW-4398 about adding benchmarks for this

@wesm (Member, Author) commented Jan 28, 2019

@xhochy any further thoughts about this? Parquet is technically review-then-commit, so I would prefer to have someone +1 the patch.

@xhochy (Member) left a review:
Looks good, would be nice to have one more comment. Feel free to merge afterwards.

Review thread on cpp/src/parquet/arrow/record_reader.cc (resolved)
xhochy and others added 2 commits January 29, 2019 10:24
Co-Authored-By: wesm <wesm@users.noreply.github.com>
Change-Id: Ic7faef17bc54e81d4975eece5d2f7d316d87f03d
@wesm closed this in 3d435e4 on Jan 29, 2019