PARQUET-1523: [C++] Vectorize Comparator interface, remove virtual calls on inner loop. Refactor Statistics to not require PARQUET_EXTERN_TEMPLATE #4233

Closed
wants to merge 9 commits into from

Member

@wesm wesm commented May 1, 2019

This patch supersedes #3752

I took the liberty of consolidating the comparator code with the statistics code since the two things are effectively inseparable. I also renamed the statistics classes for clarity, since "Statistics" is clearer than "RowGroupStatistics" -- the "scope" of the statistics need not be limited to a row group.

I apologize for the size of the diff; it is largely the result of moving code around and shuffling code from header files into parquet/statistics.cc
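
To illustrate the idea in the PR title, here is a minimal sketch, not the actual Parquet classes: a hypothetical `MinMaxComparator` interface makes one virtual call per batch, and the comparisons inside the loop are plain operators resolved at compile time, so the compiler is free to inline them and possibly vectorize the loop. All names below are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical interface, for illustration only: one virtual call per batch
// instead of one virtual Compare() call per value.
template <typename T>
class MinMaxComparator {
 public:
  virtual ~MinMaxComparator() = default;
  // Returns {min, max} of values[0..length). Requires length > 0.
  virtual std::pair<T, T> GetMinMax(const T* values, int64_t length) const = 0;
};

template <typename T>
class SignedMinMaxComparator final : public MinMaxComparator<T> {
 public:
  std::pair<T, T> GetMinMax(const T* values, int64_t length) const override {
    assert(length > 0);
    T min = values[0];
    T max = values[0];
    // The comparisons below are plain operators on T, so the loop body
    // contains no virtual dispatch.
    for (int64_t i = 1; i < length; ++i) {
      if (values[i] < min) min = values[i];
      if (max < values[i]) max = values[i];
    }
    return {min, max};
  }
};
```

A caller holding a `const MinMaxComparator<int64_t>&` pays for the virtual dispatch once per batch rather than once per element.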

Member Author

wesm commented May 1, 2019

I wanted to get this up for feedback in case I did something deeply offensive -- I will add doxygen comments to the header files before merging this.

I also need to run the benchmarks to see whether there is any impact (this should be faster, but I'm not sure how much the benchmarks exercise the stats on the write path).

Member Author

wesm commented May 1, 2019

I'll fix the CI issues tomorrow.

Write performance benchmarks

before

--------------------------------------------------------------------------------------------------------------
Benchmark                                                                       Time           CPU Iterations
--------------------------------------------------------------------------------------------------------------
BM_WriteInt64Column<Repetition::REQUIRED>/1048576                         6112498 ns    6112346 ns        117   327.207MB/s
BM_WriteInt64Column<Repetition::OPTIONAL>/1048576                         8175286 ns    8175149 ns         86   244.644MB/s
BM_WriteInt64Column<Repetition::REPEATED>/1048576                        11043673 ns   11043294 ns         64   181.105MB/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1048576    6491432 ns    6491125 ns        107   308.113MB/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1048576    8357793 ns    8357599 ns         84   239.303MB/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1048576   11278808 ns   11278491 ns         63   177.329MB/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::LZ4>/1048576       6614531 ns    6614423 ns        105    302.37MB/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1048576       8529249 ns    8528994 ns         81   234.494MB/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::LZ4>/1048576      11289507 ns   11289159 ns         62   177.161MB/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1048576      6566587 ns    6565922 ns        106   304.603MB/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1048576      8647931 ns    8647881 ns         81   231.271MB/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::ZSTD>/1048576     11137814 ns   11137516 ns         62   179.573MB/s

after

--------------------------------------------------------------------------------------------------------------
Benchmark                                                                       Time           CPU Iterations
--------------------------------------------------------------------------------------------------------------
BM_WriteInt64Column<Repetition::REQUIRED>/1048576                         4245946 ns    4245849 ns        166   471.048MB/s
BM_WriteInt64Column<Repetition::OPTIONAL>/1048576                         6416929 ns    6416840 ns        108    311.68MB/s
BM_WriteInt64Column<Repetition::REPEATED>/1048576                         9377082 ns    9376758 ns         75   213.293MB/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1048576    4551200 ns    4551132 ns        156   439.451MB/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1048576    6406567 ns    6406357 ns        109    312.19MB/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1048576    9436338 ns    9436170 ns         74    211.95MB/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::LZ4>/1048576       4488158 ns    4487987 ns        158   445.634MB/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1048576       6364199 ns    6364079 ns        109   314.264MB/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::LZ4>/1048576       9488606 ns    9488350 ns         75   210.785MB/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1048576      4764453 ns    4764293 ns        149   419.789MB/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1048576      6514745 ns    6514404 ns        110   307.012MB/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::ZSTD>/1048576      9441827 ns    9441712 ns         76   211.826MB/s

So write throughput is roughly 18-47% higher with the vectorized stats (about 15-32% less time per iteration).
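
For reference, the relative change for any row can be recomputed from the two tables above. This is a small sketch; the helper names are hypothetical and are not part of the benchmark suite.

```cpp
#include <cassert>

// Percentage increase in throughput when going from before_mb_s to after_mb_s.
double ThroughputGainPercent(double before_mb_s, double after_mb_s) {
  assert(before_mb_s > 0.0);
  return (after_mb_s / before_mb_s - 1.0) * 100.0;
}

// Percentage reduction in wall time per iteration.
double TimeReductionPercent(double before_ns, double after_ns) {
  assert(before_ns > 0.0);
  return (1.0 - after_ns / before_ns) * 100.0;
}
```

For the uncompressed REQUIRED row, `ThroughputGainPercent(327.207, 471.048)` is about 44% and `TimeReductionPercent(6112498, 4245946)` is about 30.5%. For the uncompressed REPEATED row the same calls give about 18% and 15%.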

case Type::FLOAT:
case Type::DOUBLE:
return SortOrder::SIGNED;
case Type::BYTE_ARRAY:
case Type::FIXED_LEN_BYTE_ARRAY:
return SortOrder::UNSIGNED;
case Type::INT96:
return SortOrder::UNKNOWN;
Member Author

This change appeared untested in the prior iteration of the code; @majetideepak I think this is the correct change but please confirm

Contributor

The current spec (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L821) says it is undefined. So this change is incorrect.

Member Author

What should be the default comparator, then? Prior to this patch the default was SIGNED

Member Author

Actually, I don't think this is true. I'm struggling to figure out how Int96 statistics were working at all prior to this patch. Where is the Int96 comparator created? Other types go through Comparator::Make but that does not support Int96

std::shared_ptr<Comparator> Comparator::Make(const ColumnDescriptor* descr) {

Contributor

@majetideepak majetideepak May 1, 2019

We don't create the TypedStats object at all for INT96. We don't compute statistics as a result and we never invoke Comparator::Make for INT96.

if (properties->statistics_enabled(descr_->path()) &&
(SortOrder::UNKNOWN != descr_->sort_order())) {
page_statistics_ = std::unique_ptr<TypedStats>(new TypedStats(descr_, allocator_));
chunk_statistics_ = std::unique_ptr<TypedStats>(new TypedStats(descr_, allocator_));
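
A self-contained sketch of that guard, using simplified stand-in enums rather than the real Parquet headers. The names `DefaultSortOrder` and `ShouldCollectStatistics` are hypothetical, and the SIGNED mapping for BOOLEAN, INT32 and INT64 is an assumption not shown in the excerpt above.

```cpp
#include <cassert>

// Simplified stand-ins for the Parquet enums; not the real headers.
enum class Type {
  BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
};
enum class SortOrder { SIGNED, UNSIGNED, UNKNOWN };

// Default sort order per physical type; INT96 has no defined ordering.
SortOrder DefaultSortOrder(Type type) {
  switch (type) {
    case Type::BOOLEAN:  // assumption: not shown in the excerpt
    case Type::INT32:    // assumption: not shown in the excerpt
    case Type::INT64:    // assumption: not shown in the excerpt
    case Type::FLOAT:
    case Type::DOUBLE:
      return SortOrder::SIGNED;
    case Type::BYTE_ARRAY:
    case Type::FIXED_LEN_BYTE_ARRAY:
      return SortOrder::UNSIGNED;
    case Type::INT96:
      return SortOrder::UNKNOWN;
  }
  return SortOrder::UNKNOWN;
}

// Statistics are only collected when enabled and the ordering is defined,
// so no comparator is ever requested for INT96.
bool ShouldCollectStatistics(bool statistics_enabled, Type type) {
  return statistics_enabled && DefaultSortOrder(type) != SortOrder::UNKNOWN;
}
```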

Member Author

Ah OK. I will fix. Thanks

Member Author

Fixed

Member Author

wesm commented May 2, 2019

@pitrou @lidavidm I fixed some (valid) clang warnings that aren't caught in our CI for some reason -- I guess we don't build Flight with clang yet

Member Author

wesm commented May 2, 2019

I think this patch is ready to go; I'm just addressing the last CI issues.

Contributor

@majetideepak majetideepak left a comment

Can we remove all the INT96-related Statistics code? That way we get an error if we somehow end up creating a statistics object for INT96.
The Compare can stay, since we use it to compare test output.
The rest looks good to me.

MAKE_STATS(BOOLEAN, BooleanType);
MAKE_STATS(INT32, Int32Type);
MAKE_STATS(INT64, Int64Type);
MAKE_STATS(INT96, Int96Type);
Contributor

Should we remove this?

Member Author

done

using BoolStatistics = TypedStatistics<BooleanType>;
using Int32Statistics = TypedStatistics<Int32Type>;
using Int64Statistics = TypedStatistics<Int64Type>;
using Int96Statistics = TypedStatistics<Int96Type>;
Contributor

Here as well.

Member Author

done


ASSERT_THROW(Comparator::Make(&descr), ParquetException);

NodePtr int96_node = PrimitiveNode::Make("Unknown", Repetition::REQUIRED, Type::INT96);
Contributor

Remove?

Member Author

done

return std::make_shared<TypedComparatorImpl<Int32Type>>();
case Type::INT64:
return std::make_shared<TypedComparatorImpl<Int64Type>>();
case Type::INT96:
Contributor

Remove?

Member Author

I'm leaving this so it's possible to instantiate the Comparator

return std::make_shared<TypedComparatorImpl<Int32Type, false>>();
case Type::INT64:
return std::make_shared<TypedComparatorImpl<Int64Type, false>>();
case Type::INT96:
Contributor

Remove?

Member Author

I'm leaving this so it's possible to instantiate the Comparator

return std::make_shared<TypedStatisticsImpl<Int32Type>>(descr, pool);
case Type::INT64:
return std::make_shared<TypedStatisticsImpl<Int64Type>>(descr, pool);
case Type::INT96:
Contributor

Remove?

Member Author

done

MAKE_STATS(BOOLEAN, BooleanType);
MAKE_STATS(INT32, Int32Type);
MAKE_STATS(INT64, Int64Type);
MAKE_STATS(INT96, Int96Type);
Contributor

Remove?

Member Author

done

@wesm wesm force-pushed the PARQUET-1523-vectorize-comparator branch from 3b82f9a to a1f2f38 Compare May 2, 2019 19:16
Member Author

@wesm wesm left a comment

Thanks @majetideepak for the review -- I addressed your comments and will merge once the CI is happy


int64_t valid_bits_offset, T* out_min, T* out_max) override {
::arrow::internal::BitmapReader valid_bits_reader(valid_bits, valid_bits_offset,
length);
T min = values[0];
Contributor

If the first element is null, this value is undefined. I think min and max need to be initialized with the largest and lowest values of T, respectively.

Member Author

This is adapted from the original version. If it's wrong here, it's wrong there, and we should open an issue to fix separately

https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L240

Contributor

Will do once this is merged, so that I can point to the proper line.

Contributor

values can never be null. All nulls are encoded in the definition levels. It is also guaranteed that there is at least one element since there is a check in the writer to create statistics only if there is at least one row.

Member Author

This is in the Spaced code path (for Arrow?), so it's possible there is a logical error somewhere

Contributor

ah! I didn't notice that this is from the Spaced code path. Thanks! It should be handled then.

Contributor

By null, I mean a logical Arrow::Array null: if the bit in the validity bitmap is not set, the corresponding cell in the data array is undefined (or implementation-defined).

You should check that there is at least one valid row (if not done already).
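
A minimal sketch of the initialization suggested in this thread, assuming an LSB-first validity bitmap as Arrow uses. `GetMinMaxSpaced` and `BitIsSet` are hypothetical stand-ins written for this example, not the functions in parquet/statistics.cc, and the sketch only covers arithmetic types.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Returns true if bit i is set in an LSB-first validity bitmap.
inline bool BitIsSet(const uint8_t* bits, int64_t i) {
  return ((bits[i / 8] >> (i % 8)) & 1) != 0;
}

// Hypothetical helper: computes min/max over the valid slots only.
// Returns false (leaving the outputs untouched) when no slot is valid.
template <typename T>
bool GetMinMaxSpaced(const T* values, int64_t length, const uint8_t* valid_bits,
                     int64_t valid_bits_offset, T* out_min, T* out_max) {
  // Start from the extremes so that values[0] is never read when it is null.
  T min = std::numeric_limits<T>::max();
  T max = std::numeric_limits<T>::lowest();
  bool any_valid = false;
  for (int64_t i = 0; i < length; ++i) {
    if (!BitIsSet(valid_bits, valid_bits_offset + i)) continue;  // null slot
    any_valid = true;
    if (values[i] < min) min = values[i];
    if (values[i] > max) max = values[i];
  }
  if (any_valid) {
    *out_min = min;
    *out_max = max;
  }
  return any_valid;
}
```

Returning a flag is one way to cover the "at least one valid row" check; the caller can skip the statistics update when it is false.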

template <>
struct CompareHelper<Int32Type, false> {
static inline bool Compare(int type_length, int32_t a, int32_t b) {
const uint32_t ua = a;
Contributor

Are you sure this is what you want? How is this handled on the writer side?

Member Author

This is replicated from the original location; if it's wrong, I'm not fixing it here

https://github.com/apache/arrow/blob/master/cpp/src/parquet/util/comparison.cc#L68

Contributor

@fsaintjacques Can you please elaborate on your concern about the writer side? I authored this code originally.
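
For readers following the thread, a small sketch of what the snippet under discussion appears to do: the `int32_t` operands are converted to `uint32_t` before comparing, presumably to implement the unsigned sort order. The function names below are hypothetical and are not the `CompareHelper` code itself.

```cpp
#include <cassert>
#include <cstdint>

// Signed order: compares the values as ordinary signed integers.
inline bool LessSigned(int32_t a, int32_t b) { return a < b; }

// Unsigned order: converts both operands to uint32_t before comparing,
// so negative values sort after all non-negative ones.
inline bool LessUnsigned(int32_t a, int32_t b) {
  return static_cast<uint32_t>(a) < static_cast<uint32_t>(b);
}
```

With these definitions `LessSigned(-1, 1)` is true while `LessUnsigned(-1, 1)` is false, because -1 converts to the largest `uint32_t` value. The min/max written into statistics therefore depend on which order the column uses.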

Member Author

wesm commented May 2, 2019

Merging now

@wesm wesm closed this in 250e97c May 2, 2019
@wesm wesm deleted the PARQUET-1523-vectorize-comparator branch May 2, 2019 20:16