
ARROW-12513: [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls #10729

Conversation

westonpace
Copy link
Member

This fixes two issues.

  • The null_count must be obtained from the indices array and not the values array
  • The min/max should be based on referenced values and not all values in the values array

Note: This further adds a dependency from parquet onto arrow::compute (I use it both to compute the unique indices and to take the referenced values). This dependency already existed (column_writer.cc relies on arrow::compute::Cast) so I'm pretty sure this isn't a problem. Related: ARROW-8891
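The approach, roughly (a sketch under assumed variable names, not the exact patch): take the null count from the indices, then restrict the min/max computation to the dictionary entries that are actually referenced.

```cpp
// Sketch only: `indices`, `dictionary`, and `pool_` are assumed to be in
// scope; the exact names and call sites in the patch may differ.
::arrow::compute::ExecContext ctx(pool_);

// The null count must come from the indices, not from the dictionary values.
int64_t null_count = indices.null_count();

// Min/max should only consider dictionary entries that are actually used.
PARQUET_ASSIGN_OR_THROW(std::shared_ptr<::arrow::Array> referenced_indices,
                        ::arrow::compute::Unique(indices, &ctx));
PARQUET_ASSIGN_OR_THROW(
    ::arrow::Datum referenced_values,
    ::arrow::compute::Take(dictionary, referenced_indices,
                           ::arrow::compute::TakeOptions::Defaults(), &ctx));
// referenced_values.make_array() is then what the min/max update operates on.
```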

return;
}

::arrow::compute::ExecContext ctx(pool_);
Contributor

we probably want to turn off threading.

Member Author

Done.
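(For reference, a minimal sketch of what that change looks like, assuming the set_use_threads setter on ExecContext:)

```cpp
::arrow::compute::ExecContext ctx(pool_);
ctx.set_use_threads(false);  // avoid spawning threaded work from the writer
```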

}

::arrow::compute::ExecContext ctx(pool_);
PARQUET_ASSIGN_OR_THROW(auto referenced_indices,
Contributor

Please spell out the auto types here and below; it's not really clear what the output type is.

Member Author

I removed all instances of auto unless it was blatantly obvious (e.g. returning from make_shared). Although in this spot it is Datum, which isn't much clearer :)

@@ -291,6 +291,9 @@ class TypedStatistics : public Statistics {
/// arrow::BinaryArray
virtual void Update(const ::arrow::Array& values) = 0;

virtual void UpdateArrowDictionary(const ::arrow::Array& indices,
Contributor

Can this be moved to the caller instead? It would be nice to try to avoid more coupling of Arrow with the default parquet library.

Member Author

I can move the "compute referenced values" logic to the caller so that statistics.cc doesn't depend on ::arrow::compute but I think I still need a second Update method here because the null count comes from the indices array and the min/max from the values array.

Alternatively, I could change the existing Update method to take an optional null_count parameter (where -1 flags to use the old behavior and grab from the array). I think I'll do this latter approach.

Contributor

I think making Update take an optional null_count parameter actually makes the most sense. Not sure if you are subscribed to dev@parquet, but the null_count statistics are incorrect for repeated fields (https://issues.apache.org/jira/browse/PARQUET-2067) and I think the fix will involve passing along the null count as well.

Member Author

I ended up just exposing IncrementNullCount and IncrementNumValues and then added bool update_counts to Update.
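A hypothetical call-site sketch of that shape (names follow the comment above; the exact signatures in the patch may differ):

```cpp
// Counts are taken from the indices; min/max from the referenced values.
// referenced_values is assumed to be the ::arrow::Array of dictionary
// entries actually used by this chunk.
stats->IncrementNumValues(indices.length() - indices.null_count());
stats->IncrementNullCount(indices.null_count());
stats->Update(referenced_values, /*update_counts=*/false);
```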

::arrow::compute::ExecContext ctx(pool_);
PARQUET_ASSIGN_OR_THROW(auto referenced_indices,
::arrow::compute::Unique(indices, &ctx));
PARQUET_ASSIGN_OR_THROW(
Contributor

This allocates a whole new array? Maybe we should file a follow-up JIRA to make this more efficient.

In particular, it seems like it could be more efficient to get a map of dictionary index to sort order (I think we already have a kernel for this), refreshing it before calculating the statistics whenever there are new entries, and then iterate through the indices doing the comparison in index space.

Member Author

I'll create a follow-up. Another approach I considered was computing the unique indices and then passing those down as a selection filter to the GetMinMax function. My goal was to minimize changes to the existing code.

Or, since we have all these compute kernels now, just use ::arrow::compute::MinMax and pass the min & max into the statistics directly.
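A rough sketch of that alternative (hypothetical usage; as noted further down, the MinMax kernel does not yet cover strings, so it only applies to types it supports):

```cpp
PARQUET_ASSIGN_OR_THROW(::arrow::Datum min_max,
                        ::arrow::compute::MinMax(referenced_values));
// min_max wraps a struct scalar with "min" and "max" fields that could be
// fed into the statistics directly.
```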

Contributor

If we have MinMax and it works for dictionary arrays that sounds like a good idea to me.

Member Author

First, I'm not sure if it was clear, but there are actually two array allocations here: it creates a unique array which is a (hopefully small) subset of the indices, and then a referenced-values array which is a (probably not much smaller) subset of the values. Sadly the MinMax functions don't yet support strings. There are kernels (I think) to get a sort order, but I can't see how that will help because they don't take selection vectors yet.

We could take the output from Unique and improve Comparator to take a selection vector; this would save the second allocation. However, I'm reluctant to undertake that work when I think the eventual solution will be for the MinMax kernels to support all types. I've created PARQUET-2068 for follow-up.

// Nulls will be inserted starting at index 0. If there are any null
// values then start will not be the true min. However, the dictionary
// array will always contain all the values even if they aren't all used.
void GenerateRange(int num_nulls, char start, char end) {
Contributor

This seems overly complex; could you construct the arrays in question with JSON and then combine them into a single test case that verifies all of the statistics?
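For example, something along these lines with Arrow's test helpers (values are illustrative only):

```cpp
#include "arrow/testing/gtest_util.h"

// The dictionary holds "a".."c" but only "b" and "c" are referenced, and two
// indices are null, so the expected stats are min="b", max="c", null_count=2.
std::shared_ptr<::arrow::Array> dict_array = ::arrow::DictArrayFromJSON(
    ::arrow::dictionary(::arrow::int32(), ::arrow::utf8()),
    /*indices_json=*/"[1, null, 2, null, 1]",
    /*dictionary_json=*/R"(["a", "b", "c"])");
```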

Member Author

Yes, I had grander things in mind and then didn't need them. I simplified the test considerably and added cases for some of the other binary types.

@emkornfield
Contributor

This also fixes: ARROW-11634

@emkornfield
Contributor

And apparently PARQUET-1783 as well: a triple whammy.

@emkornfield
Contributor

@westonpace this looks OK to me. I think the only real blocker is not putting this in the parquet statistics class but moving it to the call site.

@@ -1490,7 +1490,8 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDictionary(
// TODO(wesm): If some dictionary values are unobserved, then the
Contributor

This comment should be deleted. I also think you need to move this into WriteIndicesChunk above. (You probably need to adjust your test to have a batch size and page size small enough to generate multiple pages to catch this; I actually guess this is a different bug, though.)

Member Author

I've deleted the comment but I don't yet understand the point about WriteIndicesChunk. I will need to look at this again later. Is it a performance concern (don't make these extra allocations for the entire array all at once, but make them many times at the batch level to limit total memory use) or a functional concern (maybe statistics are stored on a per-batch basis)?

Contributor

Functional concern (unfortunately I think the performance would be worse since it will require extra allocations for each batch). Statistics are stored at two levels: row group (column chunk) and page level. The batching is done here and in other locations in the code to get some level of vectorization without making any individual page too large. I might have traced the code incorrectly, but the current location for updating statistics will only produce correct statistics at the row group level. It seems like this is an orthogonal bug, so you can maybe do it as a follow-up.

Member Author

Sorry for taking so long to get back to this, but I dug into this a bit today and I can't find any page statistics, either in the code or the Thrift definition. Can you give me a pointer to the statistics objects you think might be incorrect?

There is a PageEncodingStats which has a count but that appears to be counting the number of pages and not counting the rows.

Contributor

It will also be important if we ever implement ColumnIndex writing.

Member Author

Sorry for the delay. I've expanded the unit test to verify page level statistics. It appears that we recompute the null count in WriteIndicesChunk and do so based on def_levels which appears to be calculated on a completely different code path (MultipathLevelBuilderResult). So the concern about indices vs. dictionary does not apply and the null counts are correct.

I did notice we don't encode page level min/max stats at all. I'm not sure if that is a bug or not (although, if so, I'd tackle that as a separate PR/JIRA). So, if the unit test seems ok and the logic above seems valid then I think this is good.

auto metadata = LoadWrittenMetadata();
auto stats = metadata->RowGroup(0)->ColumnChunk(0)->statistics();
ASSERT_TRUE(stats->HasMinMax());
ASSERT_EQ(stats->EncodeMin(), "f");
Contributor

nit: min and max should probably use EXPECT_EQ instead.

Member Author

Fixed.

void CheckMinMax() {
GenerateRange(5, 'a', 'z');
WriteToBuffer();
auto metadata = LoadWrittenMetadata();
Contributor

Please be generally careful with the use of auto. The style guide recommends it just for saving space.

Member Author

For this PR I've backed away from auto. I'll have to check myself going forward as my inclination had been "whenever possible".

Contributor

I think I might be the only person that calls people on it; I think most active developers use auto a lot.

@westonpace force-pushed the bugfix/ARROW-12513--c-parquet-parquet-writer-always-puts-null_cou branch from 57c4ffa to 3a2912e on July 17, 2021 02:01
@westonpace
Member Author

@emkornfield Thanks for your quick review. I believe I've addressed your points (except possibly for the WriteIndicesChunk note). I don't think this will make the cutoff for RC0 but hopefully we can get it in if there is a follow-up RC.

@westonpace force-pushed the bugfix/ARROW-12513--c-parquet-parquet-writer-always-puts-null_cou branch from c1d39be to 7f9fd68 on July 28, 2021 03:27
@westonpace
Member Author

I modified the test to cover the case where there are multiple row groups, and I added a small optimization (it will only call compute::Take if the number of referenced indices is smaller than the size of the dictionary). There is one outstanding comment on column chunk level statistics vs. page statistics, but if that is resolved (and CI remains green) I will go ahead and merge.
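A sketch of that optimization (illustrative names; the Update signature follows the earlier discussion in this thread):

```cpp
if (referenced_indices->length() < dictionary->length()) {
  // Only a subset of the dictionary is referenced: materialize just that
  // subset before updating min/max.
  PARQUET_ASSIGN_OR_THROW(
      ::arrow::Datum referenced_values,
      ::arrow::compute::Take(dictionary, referenced_indices,
                             ::arrow::compute::TakeOptions::Defaults(), &ctx));
  stats->Update(*referenced_values.make_array(), /*update_counts=*/false);
} else {
  // Every dictionary entry is referenced; use the dictionary as-is.
  stats->Update(*dictionary, /*update_counts=*/false);
}
```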

@emkornfield
Contributor

LGTM modulo placement of the stats update (inside per-page processing or not).

@westonpace force-pushed the bugfix/ARROW-12513--c-parquet-parquet-writer-always-puts-null_cou branch from 7f9fd68 to 73123f4 on August 3, 2021 06:21
@westonpace
Member Author

Also, it seems we were not writing page statistics at all for data page V2. I added them back in but wasn't sure if that was intentionally disabled for some reason.

@westonpace
Member Author

CI failures appear unrelated. I'll merge this tomorrow assuming no concerns about #10729 (comment)

@westonpace force-pushed the bugfix/ARROW-12513--c-parquet-parquet-writer-always-puts-null_cou branch from d87fc29 to 62043fa on August 9, 2021 19:37
@westonpace
Member Author

Forgot about this. Rebasing and then merging on green.

@emkornfield
Contributor

Forgot to comment: these changes look fine. Thanks for tracing down the code paths. I don't know why stats for data page V2 would have been disabled.

@westonpace closed this in 64ed8ae on Aug 9, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Aug 17, 2021
ARROW-12513: [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls

Closes apache#10729 from westonpace/bugfix/ARROW-12513--c-parquet-parquet-writer-always-puts-null_cou

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace deleted the bugfix/ARROW-12513--c-parquet-parquet-writer-always-puts-null_cou branch on January 6, 2022 08:16