
ARROW-12513: [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls #10729

Conversation

westonpace
Copy link
Member

This fixes two issues.

  • The null_count must be obtained from the indices array and not the values array
  • The min/max should be based on referenced values and not all values in the values array

Note: This further adds a dependency from parquet onto arrow::compute (I use it both to compute the unique indices and to take the referenced values). This dependency already existed (column_writer.cc relies on arrow::compute::Cast) so I'm pretty sure this isn't a problem. Related: ARROW-8891
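The approach, roughly (a sketch under assumed variable names, not the exact patch): take the null count from the indices, then restrict the min/max computation to the dictionary entries that are actually referenced.

```cpp
// Sketch only: `indices`, `dictionary`, and `pool_` are assumed to be in
// scope; the exact names and call sites in the patch may differ.
::arrow::compute::ExecContext ctx(pool_);

// The null count must come from the indices, not from the dictionary values.
int64_t null_count = indices.null_count();

// Min/max should only consider dictionary entries that are actually used.
PARQUET_ASSIGN_OR_THROW(std::shared_ptr<::arrow::Array> referenced_indices,
                        ::arrow::compute::Unique(indices, &ctx));
PARQUET_ASSIGN_OR_THROW(
    ::arrow::Datum referenced_values,
    ::arrow::compute::Take(dictionary, referenced_indices,
                           ::arrow::compute::TakeOptions::Defaults(), &ctx));
// referenced_values.make_array() is then what the min/max update operates on.
```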

return;
}

::arrow::compute::ExecContext ctx(pool_);
Contributor

we probably want to turn off threading.

Member Author

Done.
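(For reference, a minimal sketch of what that change looks like, assuming the set_use_threads setter on ExecContext:)

```cpp
::arrow::compute::ExecContext ctx(pool_);
ctx.set_use_threads(false);  // avoid spawning threaded work from the writer
```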

}

::arrow::compute::ExecContext ctx(pool_);
PARQUET_ASSIGN_OR_THROW(auto referenced_indices,
Contributor

Please spell out the auto types here and below; it's not really clear what the output type is.

Member Author

I removed all instances of auto unless it was blatantly obvious (e.g. returning from make_shared). Although in this spot it is Datum, which isn't much clearer :)

@@ -291,6 +291,9 @@ class TypedStatistics : public Statistics {
/// arrow::BinaryArray
virtual void Update(const ::arrow::Array& values) = 0;

virtual void UpdateArrowDictionary(const ::arrow::Array& indices,
Contributor

Can this be moved to the caller instead? It would be nice to try to avoid more coupling of Arrow with the default parquet library.

Member Author

I can move the "compute referenced values" logic to the caller so that statistics.cc doesn't depend on ::arrow::compute but I think I still need a second Update method here because the null count comes from the indices array and the min/max from the values array.

Alternatively, I could change the existing Update method to take an optional null_count parameter (where -1 flags to use the old behavior and grab from the array). I think I'll do this latter approach.

Contributor

I think making Update take an optional null_count parameter actually makes the most sense. Not sure if you are subscribed to dev@parquet, but the null_count statistics are incorrect for repeated fields (https://issues.apache.org/jira/browse/PARQUET-2067) and I think the fix will involve passing along the null count as well.

Member Author

I ended up just exposing IncrementNullCount and IncrementNumValues and then added bool update_counts to Update.
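A hypothetical call-site sketch of that shape (names follow the comment above; the exact signatures in the patch may differ):

```cpp
// Counts are taken from the indices; min/max from the referenced values.
// referenced_values is assumed to be the ::arrow::Array of dictionary
// entries actually used by this chunk.
stats->IncrementNumValues(indices.length() - indices.null_count());
stats->IncrementNullCount(indices.null_count());
stats->Update(referenced_values, /*update_counts=*/false);
```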

::arrow::compute::ExecContext ctx(pool_);
PARQUET_ASSIGN_OR_THROW(auto referenced_indices,
::arrow::compute::Unique(indices, &ctx));
PARQUET_ASSIGN_OR_THROW(
Contributor

This allocates a whole new array? Maybe we should file a follow-up JIRA to make this more efficient.

In particular, it seems like it could be more efficient to get a map of dictionary index to sort order (I think we already have a kernel for this), refreshing it before calculating the statistics whenever there are new entries, and then iterate through the indices doing the comparison in index space.

Member Author

I'll create a follow-up. Another approach I considered was computing the unique indices and then passing those down as a selection filter to the GetMinMax function. My goal was to minimize changes to the existing code.

Or, since we have all these compute kernels now, just use ::arrow::compute::MinMax and pass the min & max into the statistics directly.
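A rough sketch of that alternative (hypothetical usage; as noted further down, the MinMax kernel does not yet cover strings, so it only applies to types it supports):

```cpp
PARQUET_ASSIGN_OR_THROW(::arrow::Datum min_max,
                        ::arrow::compute::MinMax(referenced_values));
// min_max wraps a struct scalar with "min" and "max" fields that could be
// fed into the statistics directly.
```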

Contributor

If we have MinMax and it works for dictionary arrays that sounds like a good idea to me.

Member Author

First, I'm not sure if it was clear, but there are actually two array allocations here: it creates a unique array which is a (hopefully small) subset of the indices, and then a referenced-values array which is a (probably not much smaller) subset of the values. Sadly the MinMax functions don't yet support strings. There are kernels (I think) to get a sort order, but I can't see how that will help because they don't take selection vectors yet.

We could take the output from Unique and improve Comparator to take a selection vector; this would save the second allocation. However, I'm reluctant to undertake that work when I think the eventual solution will be for the MinMax kernels to support all types. I've created PARQUET-2068 for follow-up.

// Nulls will be inserted starting at index 0. If there are any null
// values then start will not be the true min. However, the dictionary
// array will always contain all the values even if they aren't all used.
void GenerateRange(int num_nulls, char start, char end) {
Contributor

This seems overly complex; could you construct the arrays in question with JSON and then combine them into a single test case that verifies all of the statistics?
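For example, something along these lines with Arrow's test helpers (values are illustrative only):

```cpp
#include "arrow/testing/gtest_util.h"

// The dictionary holds "a".."c" but only "b" and "c" are referenced, and two
// indices are null, so the expected stats are min="b", max="c", null_count=2.
std::shared_ptr<::arrow::Array> dict_array = ::arrow::DictArrayFromJSON(
    ::arrow::dictionary(::arrow::int32(), ::arrow::utf8()),
    /*indices_json=*/"[1, null, 2, null, 1]",
    /*dictionary_json=*/R"(["a", "b", "c"])");
```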

Member Author

Yes, I had grander things in mind and then didn't need them. I simplified the test considerably and added cases for some of the other binary types.

@emkornfield
Contributor

This also fixes: ARROW-11634

@emkornfield
Contributor

And apparently PARQUET-1783 as well: a triple whammy.

@emkornfield
Contributor

@westonpace this looks OK to me. I think the only real blocker is not putting this in the parquet statistics class but moving it to the call site.

@@ -1490,7 +1490,8 @@ Status TypedColumnWriterImpl<DType>::WriteArrowDictionary(
// TODO(wesm): If some dictionary values are unobserved, then the
Contributor

This comment should be deleted. I also think you need to move this into WriteIndicesChunk above. (You probably need to adjust your test to have a batch size and page size small enough to generate multiple pages to catch this; I actually guess this is a different bug, though.)

Member Author

I've deleted the comment but I don't yet understand the point about WriteIndicesChunk. I will need to look at this again later. Is it a performance concern (don't make these extra allocations for the entire array all at once, but make them many times at the batch level to limit total memory use) or a functional concern (maybe statistics are stored on a per-batch basis)?

Contributor

Functional concern (unfortunately I think the performance would be worse since it will require extra allocations for each batch). Statistics are stored at two levels: row group (column chunk) and page level. The batching is done here and in other locations in the code to get some level of vectorization without making any individual page too large. I might have traced the code incorrectly, but the current location for updating statistics will only produce correct statistics at the row group level. It seems like this is an orthogonal bug, so you can maybe do it as a follow-up.

Member Author

Sorry for taking so long to get back to this, but I dug into this a bit today and I can't find any page statistics, either in the code or the Thrift definition. Can you give me a pointer to the statistics objects you think might be incorrect?

There is a PageEncodingStats which has a count but that appears to be counting the number of pages and not counting the rows.

Contributor

It will also be important if we ever implement ColumnIndex writing.

Member Author

Sorry for the delay. I've expanded the unit test to verify page level statistics. It appears that we recompute the null count in WriteIndicesChunk and do so based on def_levels which appears to be calculated on a completely different code path (MultipathLevelBuilderResult). So the concern about indices vs. dictionary does not apply and the null counts are correct.

I did notice we don't encode page level min/max stats at all. I'm not sure if that is a bug or not (although, if so, I'd tackle that as a separate PR/JIRA). So, if the unit test seems ok and the logic above seems valid then I think this is good.

auto metadata = LoadWrittenMetadata();
auto stats = metadata->RowGroup(0)->ColumnChunk(0)->statistics();
ASSERT_TRUE(stats->HasMinMax());
ASSERT_EQ(stats->EncodeMin(), "f");
Contributor

nit: min and max should probably use EXPECT_EQ instead.

Member Author

Fixed.

void CheckMinMax() {
GenerateRange(5, 'a', 'z');
WriteToBuffer();
auto metadata = LoadWrittenMetadata();
Contributor

Please be generally careful with the use of auto. The style guide recommends it just for saving space.

Member Author

For this PR I've backed away from auto. I'll have to check myself going forward as my inclination had been "whenever possible".

Contributor

I think I might be the only person that calls people on it; I think most active developers use auto a lot.

@westonpace force-pushed the bugfix/ARROW-12513--c-parquet-parquet-writer-always-puts-null_cou branch from 57c4ffa to 3a2912e on July 17, 2021 02:01
@westonpace
Member Author

@emkornfield Thanks for your quick review. I believe I've addressed your points (except possibly for the WriteIndicesChunk note). I don't think this will make the cutoff for RC0 but hopefully we can get it in if there is a follow-up RC.

@westonpace force-pushed the bugfix/ARROW-12513--c-parquet-parquet-writer-always-puts-null_cou branch from c1d39be to 7f9fd68 on July 28, 2021 03:27
@westonpace
Member Author

I modified the test to cover the case where there are multiple row groups, and I added a small optimization (it will only call compute::Take if the number of referenced indices is smaller than the size of the dictionary). There is one outstanding comment on column chunk level statistics vs. page statistics, but if that is resolved (and CI remains green) I will go ahead and merge.
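A sketch of that optimization (illustrative names; the Update signature follows the earlier discussion in this thread):

```cpp
if (referenced_indices->length() < dictionary->length()) {
  // Only a subset of the dictionary is referenced: materialize just that
  // subset before updating min/max.
  PARQUET_ASSIGN_OR_THROW(
      ::arrow::Datum referenced_values,
      ::arrow::compute::Take(dictionary, referenced_indices,
                             ::arrow::compute::TakeOptions::Defaults(), &ctx));
  stats->Update(*referenced_values.make_array(), /*update_counts=*/false);
} else {
  // Every dictionary entry is referenced; use the dictionary as-is.
  stats->Update(*dictionary, /*update_counts=*/false);
}
```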

@emkornfield
Contributor

LGTM modulo placement of the stats update (inside per-page processing or not).

@westonpace force-pushed the bugfix/ARROW-12513--c-parquet-parquet-writer-always-puts-null_cou branch from 7f9fd68 to 73123f4 on August 3, 2021 06:21
@westonpace
Member Author

Also, it seems we were not writing page statistics at all for data page V2. I added them back in but wasn't sure if that was intentionally disabled for some reason.

@westonpace
Member Author

CI failures appear unrelated. I'll merge this tomorrow assuming no concerns about #10729 (comment)

@westonpace force-pushed the bugfix/ARROW-12513--c-parquet-parquet-writer-always-puts-null_cou branch from d87fc29 to 62043fa on August 9, 2021 19:37
@westonpace
Member Author

Forgot about this. Rebasing and then merging on green.

@emkornfield
Contributor

Forgot to comment: these changes look fine. Thanks for tracing down the code paths. I don't know why stats for data page V2 would have been disabled.

@westonpace closed this in 64ed8ae on Aug 9, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Aug 17, 2021
ARROW-12513: [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls

Closes apache#10729 from westonpace/bugfix/ARROW-12513--c-parquet-parquet-writer-always-puts-null_cou

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace deleted the bugfix/ARROW-12513--c-parquet-parquet-writer-always-puts-null_cou branch on January 6, 2022 08:16