-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-1523: [C++] Vectorize Comparator interface, remove virtual calls on inner loop. Refactor Statistics to not require PARQUET_EXTERN_TEMPLATE #4233
Conversation
I wanted to get this up for feedback in case I did something deeply offensive -- I will add doxygen comments to the header files before merging this. I also need to run the benchmarks to see if there is impact (this should be faster, but I'm not sure how much the benchmarks exercise the stats on the write path) |
I'll fix the CI issues tomorrow. Write performance benchmarks before
after
So anywhere from 10-40% faster with the vectorized stats |
cpp/src/parquet/types.cc
Outdated
case Type::FLOAT: | ||
case Type::DOUBLE: | ||
return SortOrder::SIGNED; | ||
case Type::BYTE_ARRAY: | ||
case Type::FIXED_LEN_BYTE_ARRAY: | ||
return SortOrder::UNSIGNED; | ||
case Type::INT96: | ||
return SortOrder::UNKNOWN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change appeared untested in the prior iteration of the code; @majetideepak I think this is the correct change but please confirm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current spec (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L821) says it is undefined. So this change is incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should be the default comparator, then? Prior to this patch the default was SIGNED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I don't think this is true. I'm struggling to figure out how Int96 statistics were working at all prior to this patch. Where is the Int96 comparator created? Other types go through Comparator::Make
but that does not support Int96
arrow/cpp/src/parquet/util/comparison.cc
Line 28 in 2a4e48e
std::shared_ptr<Comparator> Comparator::Make(const ColumnDescriptor* descr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't create the TypedStats
object at all for INT96
. We don't compute statistics as a result and we never invoke Comparator::Make
for INT96.
arrow/cpp/src/parquet/column_writer.cc
Lines 647 to 650 in 2a4e48e
if (properties->statistics_enabled(descr_->path()) && | |
(SortOrder::UNKNOWN != descr_->sort_order())) { | |
page_statistics_ = std::unique_ptr<TypedStats>(new TypedStats(descr_, allocator_)); | |
chunk_statistics_ = std::unique_ptr<TypedStats>(new TypedStats(descr_, allocator_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK. I will fix. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
71440b7
to
9e35d85
Compare
This patch is ready to go I think, I'm just addressing the last CI issues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove all the INT96
related Statistics
code? It helps to error if we somehow end up creating a statistics object for INT96.
The Compare
can stay since we are using that to compare test output.
The rest looks good to me.
cpp/src/parquet/statistics.cc
Outdated
MAKE_STATS(BOOLEAN, BooleanType); | ||
MAKE_STATS(INT32, Int32Type); | ||
MAKE_STATS(INT64, Int64Type); | ||
MAKE_STATS(INT96, Int96Type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cpp/src/parquet/statistics.h
Outdated
using BoolStatistics = TypedStatistics<BooleanType>; | ||
using Int32Statistics = TypedStatistics<Int32Type>; | ||
using Int64Statistics = TypedStatistics<Int64Type>; | ||
using Int96Statistics = TypedStatistics<Int96Type>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cpp/src/parquet/statistics-test.cc
Outdated
|
||
ASSERT_THROW(Comparator::Make(&descr), ParquetException); | ||
|
||
NodePtr int96_node = PrimitiveNode::Make("Unknown", Repetition::REQUIRED, Type::INT96); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
return std::make_shared<TypedComparatorImpl<Int32Type>>(); | ||
case Type::INT64: | ||
return std::make_shared<TypedComparatorImpl<Int64Type>>(); | ||
case Type::INT96: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaving this so it's possible to instantiate the Comparator
return std::make_shared<TypedComparatorImpl<Int32Type, false>>(); | ||
case Type::INT64: | ||
return std::make_shared<TypedComparatorImpl<Int64Type, false>>(); | ||
case Type::INT96: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaving this so it's possible to instantiate the Comparator
cpp/src/parquet/statistics.cc
Outdated
return std::make_shared<TypedStatisticsImpl<Int32Type>>(descr, pool); | ||
case Type::INT64: | ||
return std::make_shared<TypedStatisticsImpl<Int64Type>>(descr, pool); | ||
case Type::INT96: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cpp/src/parquet/statistics.cc
Outdated
MAKE_STATS(BOOLEAN, BooleanType); | ||
MAKE_STATS(INT32, Int32Type); | ||
MAKE_STATS(INT64, Int64Type); | ||
MAKE_STATS(INT96, Int96Type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
3b82f9a
to
a1f2f38
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @majetideepak for the review -- I addressed your comments and will merge once the CI is happy
cpp/src/parquet/statistics-test.cc
Outdated
|
||
ASSERT_THROW(Comparator::Make(&descr), ParquetException); | ||
|
||
NodePtr int96_node = PrimitiveNode::Make("Unknown", Repetition::REQUIRED, Type::INT96); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
return std::make_shared<TypedComparatorImpl<Int32Type>>(); | ||
case Type::INT64: | ||
return std::make_shared<TypedComparatorImpl<Int64Type>>(); | ||
case Type::INT96: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaving this so it's possible to instantiate the Comparator
return std::make_shared<TypedComparatorImpl<Int32Type, false>>(); | ||
case Type::INT64: | ||
return std::make_shared<TypedComparatorImpl<Int64Type, false>>(); | ||
case Type::INT96: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaving this so it's possible to instantiate the Comparator
cpp/src/parquet/statistics.cc
Outdated
return std::make_shared<TypedStatisticsImpl<Int32Type>>(descr, pool); | ||
case Type::INT64: | ||
return std::make_shared<TypedStatisticsImpl<Int64Type>>(descr, pool); | ||
case Type::INT96: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cpp/src/parquet/statistics.cc
Outdated
MAKE_STATS(BOOLEAN, BooleanType); | ||
MAKE_STATS(INT32, Int32Type); | ||
MAKE_STATS(INT64, Int64Type); | ||
MAKE_STATS(INT96, Int96Type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cpp/src/parquet/statistics.cc
Outdated
MAKE_STATS(BOOLEAN, BooleanType); | ||
MAKE_STATS(INT32, Int32Type); | ||
MAKE_STATS(INT64, Int64Type); | ||
MAKE_STATS(INT96, Int96Type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cpp/src/parquet/statistics.h
Outdated
using BoolStatistics = TypedStatistics<BooleanType>; | ||
using Int32Statistics = TypedStatistics<Int32Type>; | ||
using Int64Statistics = TypedStatistics<Int64Type>; | ||
using Int96Statistics = TypedStatistics<Int96Type>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
int64_t valid_bits_offset, T* out_min, T* out_max) override { | ||
::arrow::internal::BitmapReader valid_bits_reader(valid_bits, valid_bits_offset, | ||
length); | ||
T min = values[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the first element is null, this is undefined. I think it needs to be initialized with respective T::max and T::lowest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is adapted from the original version. If it's wrong here, it's wrong there, and we should open an issue to fix separately
https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L240
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do once this is merged so I'll be able to point the proper line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values can never be null. All nulls are encoded in the definition levels. It is also guaranteed that there is at least one element since there is a check in the writer to create statistics only if there is at least one row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in the Spaced code path (for Arrow?), so it's possible there is a logical error somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah! I didn't notice that this is from the Spaced code path. Thanks! It should be handled then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By null, I mean logical Arrow::Array null, e.g. the bit in the validity bitmap is not set, then the corresponding cell in the data array is undefined (or implementation defined).
You should check that there is at least one valid row (if not done already).
template <> | ||
struct CompareHelper<Int32Type, false> { | ||
static inline bool Compare(int type_length, int32_t a, int32_t b) { | ||
const uint32_t ua = a; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this is what you want? How is this handled on the writer side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is replicated from the original location, if it's wrong I'm not fixing it here
https://github.com/apache/arrow/blob/master/cpp/src/parquet/util/comparison.cc#L68
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fsaintjacques Can you please elaborate your concern here with the writer side? I authored this code originally.
Merging now |
This patch supersedes #3752
I took the liberty of consolidating the comparator code with the statistics code since the two things are effectively inseparable. I also renamed the statistics classes for clarity, since "Statistics" is clearer than "RowGroupStatistics" -- the "scope" of the statistics need not be limited to a row group.
I apologize for the size of the diff; it is largely the result of moving code around and shuffling code from header files into parquet/statistics.cc