ARROW-12513: [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls #10729
Conversation
cpp/src/parquet/statistics.cc (Outdated)

      return;
    }

    ::arrow::compute::ExecContext ctx(pool_);
we probably want to turn off threading.
Done.
cpp/src/parquet/statistics.cc (Outdated)

    }

    ::arrow::compute::ExecContext ctx(pool_);
    PARQUET_ASSIGN_OR_THROW(auto referenced_indices,
Please spell out the auto types here and below; it's not really clear what the output type is.
I removed all instances of auto unless it was blatantly obvious (e.g. returning from make_shared). Although in this spot it is Datum, which isn't much clearer :)
cpp/src/parquet/statistics.h (Outdated)

    @@ -291,6 +291,9 @@ class TypedStatistics : public Statistics {
      /// arrow::BinaryArray
      virtual void Update(const ::arrow::Array& values) = 0;

      virtual void UpdateArrowDictionary(const ::arrow::Array& indices,
Can this be moved to the caller instead? It would be nice to avoid further coupling of Arrow with the default parquet library.
I can move the "compute referenced values" logic to the caller so that statistics.cc doesn't depend on ::arrow::compute, but I think I still need a second Update method here because the null count comes from the indices array and the min/max from the values array.
Alternatively, I could change the existing Update method to take an optional null_count parameter (where -1 flags to use the old behavior and grab it from the array). I think I'll do this latter approach.
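The distinction being made here can be sketched with a minimal, self-contained model of a dictionary-encoded array. This is plain stdlib C++, not Arrow's actual API; DictArray, NullCount, and MinMax are hypothetical names for illustration only:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Illustrative model of a dictionary-encoded array: 'dictionary' holds the
// distinct values; 'indices' holds one entry per row, where std::nullopt
// stands in for a null row.
struct DictArray {
  std::vector<std::string> dictionary;
  std::vector<std::optional<int32_t>> indices;
};

// The null count must come from the indices: the dictionary itself never
// contains nulls.
int64_t NullCount(const DictArray& a) {
  return std::count(a.indices.begin(), a.indices.end(), std::nullopt);
}

// Min/max must be computed over the *referenced* dictionary entries only;
// unreferenced dictionary entries must not influence the statistics.
std::pair<std::string, std::string> MinMax(const DictArray& a) {
  std::optional<std::string> mn, mx;
  for (const auto& idx : a.indices) {
    if (!idx) continue;
    const std::string& v = a.dictionary[*idx];
    if (!mn || v < *mn) mn = v;
    if (!mx || v > *mx) mx = v;
  }
  return {*mn, *mx};
}
```

With dictionary {"b", "a", "zzz"} and indices {0, null, 1, 0}, the null count is 1 (from the indices) and min/max are "a"/"b"; the unreferenced entry "zzz" never affects the statistics, which is exactly the bug being fixed.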
I think making Update take an optional null_count parameter actually makes the most sense. Not sure if you are subscribed to dev@parquet, but the null_count statistics are incorrect for repeated fields (https://issues.apache.org/jira/browse/PARQUET-2067) and I think the fix will involve passing along the null count as well.
I ended up just exposing IncrementNullCount and IncrementNumValues, and then added a bool update_counts parameter to Update.
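The shape of that final API can be sketched as follows. This is a hypothetical stand-in using int values for brevity; the real interface is parquet::TypedStatistics in cpp/src/parquet/statistics.h and differs in detail:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of a statistics accumulator with the API shape described above.
class SketchStatistics {
 public:
  // When update_counts is false, the caller has already accounted for nulls
  // and values (e.g. from the dictionary indices) via the increment methods,
  // and Update only folds in min/max from the referenced values.
  void Update(const std::vector<int>& values, bool update_counts) {
    if (update_counts) {
      IncrementNumValues(static_cast<int64_t>(values.size()));
    }
    for (int v : values) {
      if (!has_min_max_ || v < min_) min_ = v;
      if (!has_min_max_ || v > max_) max_ = v;
      has_min_max_ = true;
    }
  }
  void IncrementNullCount(int64_t n) { null_count_ += n; }
  void IncrementNumValues(int64_t n) { num_values_ += n; }

  int64_t null_count() const { return null_count_; }
  int64_t num_values() const { return num_values_; }
  int min() const { return min_; }
  int max() const { return max_; }

 private:
  bool has_min_max_ = false;
  int min_ = 0;
  int max_ = 0;
  int64_t null_count_ = 0;
  int64_t num_values_ = 0;
};
```

In the dictionary path the caller would derive the counts from the indices (IncrementNullCount / IncrementNumValues) and then call Update with the referenced values and update_counts=false, so min/max and counts come from the right arrays.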
cpp/src/parquet/statistics.cc (Outdated)

    ::arrow::compute::ExecContext ctx(pool_);
    PARQUET_ASSIGN_OR_THROW(auto referenced_indices,
                            ::arrow::compute::Unique(indices, &ctx));
    PARQUET_ASSIGN_OR_THROW(
Does this allocate a whole new array? Maybe we should file a follow-up JIRA to make this more efficient.
In particular, it seems like it could be more efficient to get a map of dictionary index to sort order (I think we already have a kernel for this), recomputed if necessary before calculating the statistics whenever there are new entries, and then iterate through the indices doing the comparison in index space.
I'll create a follow-up. Another approach I considered was computing the unique indices and then passing those down as a selection filter to the GetMinMax function. My goal was to minimize changes to the existing code.
Or, since we have all these compute kernels now, just use ::arrow::compute::MinMax and pass the min & max into the statistics directly.
If we have MinMax and it works for dictionary arrays that sounds like a good idea to me.
First, I'm not sure if it was clear, but there are actually two array allocations here. First it creates a unique array, which is a (hopefully small) subset of the indices. Second, it creates a referenced-values array, which is a (probably not much smaller) subset of the values. Sadly, the MinMax functions don't yet support strings. There are kernels (I think) to get a sort order, but I can't see how that will help because they don't take selection vectors yet. We could take the output from Unique and improve Comparator to take a selection vector; this would save the second allocation. However, I'm reluctant to undertake that work when I think the eventual solution will be for the MinMax kernels to support all types. I've created PARQUET-2068 for follow-up.
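The two allocations under discussion can be modeled with stdlib stand-ins for the Unique and Take kernels. This is illustrative only; the real kernels live in ::arrow::compute and return Datum results:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Stand-in for ::arrow::compute::Unique over the (non-null) indices:
// allocates a hopefully small array of the distinct indices.
std::vector<int32_t> Unique(const std::vector<int32_t>& indices) {
  std::set<int32_t> seen(indices.begin(), indices.end());
  return {seen.begin(), seen.end()};  // first allocation
}

// Stand-in for ::arrow::compute::Take over the dictionary values:
// allocates a second array containing only the referenced values.
std::vector<std::string> Take(const std::vector<std::string>& values,
                              const std::vector<int32_t>& idx) {
  std::vector<std::string> out;  // second allocation
  out.reserve(idx.size());
  for (int32_t i : idx) out.push_back(values[i]);
  return out;
}
```

With dictionary {"m", "a", "z"} and indices {0, 1, 1, 0}, Unique yields {0, 1} and Take yields {"m", "a"}; running min/max over that second array gives "a"/"m" while the unreferenced "z" is excluded, at the cost of the two temporary arrays discussed above.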
    // Nulls will be inserted starting at index 0. If there are any null
    // values then start will not be the true min. However, the dictionary
    // array will always contain all the values even if they aren't all used.
    void GenerateRange(int num_nulls, char start, char end) {
This seems overly complex. Could you construct the arrays in question with JSON and then combine them into a single test case that verifies all of the statistics?
Yes, I had grander things in mind and then didn't need them. I simplified the test considerably and added cases for some of the other binary types.
This also fixes: ARROW-11634
And apparently: PARQUET-1783. A triple whammy.
@westonpace this looks OK to me; I think the only real blocker is not putting this in the parquet statistics class but moving it to the call site.
cpp/src/parquet/column_writer.cc (Outdated)

    @@ -1490,7 +1490,8 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDictionary(
      // TODO(wesm): If some dictionary values are unobserved, then the
This comment should be deleted. I also think you need to move this into WriteIndicesChunk above (you may need to adjust your test so the batch size and page size are small enough to generate multiple pages to catch this; I actually guess this is a different bug, though).
I've deleted the comment, but I don't yet understand the point about WriteIndicesChunk; I will need to look at this again later. Is it a performance concern (don't make these extra allocations for the entire array all at once, but make them many times at the batch level to limit total memory use) or a functional concern (maybe statistics are stored on a per-batch basis)?
Functional concern (unfortunately, I think the performance would be worse since it will require extra allocations for each batch). Statistics are stored at two levels: row group (column chunk) and page level. The batching is done here and in other locations in the code to get some level of vectorization without making any individual page too large. I might have traced the code incorrectly, but the current location for updating statistics will only produce correct statistics at the row-group level. It seems like this is an orthogonal bug, so you can maybe do it as a follow-up.
Sorry for taking so long to get back to this, but I dug into this a bit today and I can't find any page statistics, either in the code or in the thrift definition. Can you give me a pointer to the statistics objects you think might be incorrect?
There is a PageEncodingStats which has a count, but that appears to be counting the number of pages, not the rows.
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L530 and https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L584. I thought I traced the code and we currently populate them, but I could be mistaken.
It will also be important if we ever implement ColumnIndex writing.
Sorry for the delay. I've expanded the unit test to verify page-level statistics. It appears that we recompute the null count in WriteIndicesChunk, and do so based on def_levels, which are calculated on a completely different code path (MultipathLevelBuilderResult). So the concern about indices vs. dictionary does not apply and the null counts are correct.
I did notice we don't encode page-level min/max stats at all. I'm not sure if that is a bug or not (although, if so, I'd tackle it as a separate PR/JIRA). So, if the unit test seems OK and the logic above seems valid, then I think this is good.
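For reference, recovering the null count from definition levels can be sketched as follows. This is a simplified model for a non-repeated optional leaf column; as the PARQUET-2067 discussion earlier in this thread suggests, repeated fields are more subtle because a low definition level may mark an empty or null list rather than a null leaf value:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// For a non-repeated optional leaf column, an entry is null exactly when
// its definition level is below the column's maximum definition level.
int64_t NullCountFromDefLevels(const std::vector<int16_t>& def_levels,
                               int16_t max_def_level) {
  int64_t nulls = 0;
  for (int16_t d : def_levels) {
    if (d < max_def_level) ++nulls;
  }
  return nulls;
}
```

This is why a per-chunk null count derived from def_levels is independent of the dictionary indices: the levels are produced on the level-building code path, not from the index array.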
    auto metadata = LoadWrittenMetadata();
    auto stats = metadata->RowGroup(0)->ColumnChunk(0)->statistics();
    ASSERT_TRUE(stats->HasMinMax());
    ASSERT_EQ(stats->EncodeMin(), "f");
nit: min and max should probably use EXPECT_EQ instead.
Fixed.
    void CheckMinMax() {
      GenerateRange(5, 'a', 'z');
      WriteToBuffer();
      auto metadata = LoadWrittenMetadata();
Please be generally careful with the use of auto. The style guide recommends it only for saving space.
For this PR I've backed away from auto. I'll have to check myself going forward as my inclination had been "whenever possible".
I think I might be the only person who calls people on it; I think most active developers use auto a lot.
Force-pushed 57c4ffa to 3a2912e.
@emkornfield Thanks for your quick review. I believe I've addressed your points (except possibly for the WriteIndicesChunk note). I don't think this will make the cutoff for RC0, but hopefully we can get it in if there is a follow-up RC.
Force-pushed c1d39be to 7f9fd68.
I modified the test to cover the case where there are multiple row groups and I added a small optimization (it will only call
LGTM modulo placement of the stats update (inside per-page processing or not).
Force-pushed 7f9fd68 to 73123f4.
Also, it seems we were not writing page statistics at all for data page V2. I added it back in, but wasn't sure if that was intentionally disabled for any reason.
CI failures appear unrelated. I'll merge this tomorrow assuming no concerns about #10729 (comment).
…iven a dictionary encoded array
…ces to column_writer. Added TODO to later optimize
…stics to NOT use threads
… test the coverage of that path. Added an if check to avoid calling if we know we are going to take everything
Force-pushed d87fc29 to 62043fa.
Forgot about this. Rebasing and then merging on green.
Forgot to comment: these changes look fine. Thanks for tracing down the paths. I don't know why stats for data page V2 would have been disabled.
ARROW-12513: [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls

This fixes two issues:
* The null_count must be obtained from the indices array and not the values array.
* The min/max should be based on referenced values and not all values in the values array.

Note: This further adds a dependency from parquet onto arrow::compute (I use it both to compute the unique indices and to take the referenced values). This dependency already existed (column_writer.cc relies on arrow::compute::Cast) so I'm pretty sure this isn't a problem. Related: ARROW-8891

Closes apache#10729 from westonpace/bugfix/ARROW-12513--c-parquet-parquet-writer-always-puts-null_cou

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>