ARROW-10438: [C++][Dataset] Partitioning::Format on nulls #9323

westonpace · 2021-01-26T03:37:35Z

Tested and added support for partitioning with nulls.

I had to make some changes to the hash kernels. You can now specify how you want DictionaryEncode to treat nulls. The MASK option will continue the current behavior (null not in dictionary, null value in indices) and the ENCODE option will put null in the dictionary and there will be no null values in the indices array.
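To make the difference between the two options concrete, here is a minimal sketch in plain Python. It is not the Arrow kernel: `NullEncoding` and `dictionary_encode` are hypothetical names used only for illustration, and Python's `None` stands in for an Arrow null.

```python
from enum import Enum


class NullEncoding(Enum):
    """Hypothetical stand-in for the two null-handling options."""
    MASK = "mask"      # null stays out of the dictionary; its index is null
    ENCODE = "encode"  # null gets a dictionary slot; no index is null


def dictionary_encode(values, null_encoding=NullEncoding.MASK):
    """Return (dictionary, indices) for a list of hashable values."""
    dictionary = []
    positions = {}  # value -> position in dictionary
    indices = []
    for value in values:
        if value is None and null_encoding is NullEncoding.MASK:
            # MASK: keep the null in the indices, not in the dictionary.
            indices.append(None)
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


masked = dictionary_encode(["a", None, "b", "a"], NullEncoding.MASK)
encoded = dictionary_encode(["a", None, "b", "a"], NullEncoding.ENCODE)
```

With MASK the dictionary holds only `"a"` and `"b"` and the second index is null; with ENCODE the dictionary gains a null entry and every index is a valid integer, which is what lets a null act as a grouping key.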

Partitioning on nulls will depend on the partitioning scheme.

For directory partitioning, null is allowed on inner fields, but it is not allowed on an outer field if an inner field is defined. In other words, if the schema is a(int32), b(int32), c(int32), then the following are allowed:

/ (a=null, b=null, c=null)
/32 (a=32, b=null, c=null)
/32/57 (a=32, b=57, c=null)

There is no way to specify a=null, b=57, c=null. This does mean that partition directories can contain a mix of files and nested partition directories (e.g. /32 might contain file.parquet and the directory /57). Alternatively we could just forbid nulls in the directory partitioning scheme.
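The rule above can be sketched in plain Python, assuming partition values arrive in schema order with `None` standing in for null. `format_directory_path` is a hypothetical helper for illustration, not the Arrow implementation.

```python
def format_directory_path(values):
    """Format partition values (in schema order) as a directory path.

    Trailing nulls are dropped; a null followed by a non-null value
    cannot be represented and raises ValueError.
    """
    segments = []
    seen_null = False
    for value in values:
        if value is None:
            seen_null = True
            continue
        if seen_null:
            # An outer field was null while an inner field is defined.
            raise ValueError("null outer field with a non-null inner field")
        segments.append(str(value))
    return "/" + "/".join(segments)


paths = [
    format_directory_path([None, None, None]),
    format_directory_path([32, None, None]),
    format_directory_path([32, 57, None]),
]
```

Calling `format_directory_path([None, 57, None])` raises, which mirrors the case that has no directory representation.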

For the hive scheme we need to be compatible with other tools that read/write hive. Those tools use a fallback value which defaults to __HIVE_DEFAULT_PARTITION__. So by default you would have directories that look like...

/a=__HIVE_DEFAULT_PARTITION__/b=__HIVE_DEFAULT_PARTITION__/c=__HIVE_DEFAULT_PARTITION__

The null fallback value is configurable as a string passed to HivePartitioning::HivePartitioning or HivePartitioning::MakeFactory.
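As an illustration of the fallback behaviour, here is a plain-Python sketch. `format_hive_path` and `parse_hive_segment` are hypothetical helpers, not the `HivePartitioning` API; only the default fallback string comes from the description above.

```python
HIVE_DEFAULT_FALLBACK = "__HIVE_DEFAULT_PARTITION__"


def format_hive_path(keys, null_fallback=HIVE_DEFAULT_FALLBACK):
    """Format (name, value) pairs as hive-style segments.

    A None value is written as the fallback string.
    """
    segments = [
        f"{name}={null_fallback if value is None else value}"
        for name, value in keys
    ]
    return "/" + "/".join(segments)


def parse_hive_segment(segment, null_fallback=HIVE_DEFAULT_FALLBACK):
    """Parse one 'name=value' segment; the fallback string maps back to None."""
    name, _, raw = segment.partition("=")
    return name, (None if raw == null_fallback else raw)


default_path = format_hive_path([("a", None), ("b", None), ("c", None)])
custom_path = format_hive_path([("a", 32), ("b", None)], null_fallback="xyz")
```

Writer and reader have to agree on the fallback string: a segment written with `xyz` is read back as the literal string `"xyz"` by a reader that still expects the default.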

ARROW-11649 has been created for extending this null fallback configuration to R.

github-actions · 2021-01-26T03:37:59Z

https://issues.apache.org/jira/browse/ARROW-10438

jorisvandenbossche

(didn't yet take a detailed look, just a few quick questions)

What currently happens if you use directory partitioning and you have null values?
And should we give the user a null_fallback option there as well?

jorisvandenbossche · 2021-02-12T16:05:22Z

python/pyarrow/tests/test_dataset.py

@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():

 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])


What does null_fallback=None mean? (based on the docstring above it seems it can only be a string?)

In that case it does not pass in anything when it creates the partitioning and ensures that it defaults to the correct default value.
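The pattern described in this reply can be sketched as follows. `make_hive_partitioning` is a hypothetical stand-in for the real factory, used only so the example runs without pyarrow.

```python
HIVE_DEFAULT_FALLBACK = "__HIVE_DEFAULT_PARTITION__"


def make_hive_partitioning(null_fallback=HIVE_DEFAULT_FALLBACK):
    """Hypothetical stand-in for a partitioning factory; returns its config."""
    return {"flavor": "hive", "null_fallback": null_fallback}


def build_partitioning(null_fallback=None):
    # Forward the argument only when the test supplies one, so the None
    # case exercises the factory's own default value.
    kwargs = {}
    if null_fallback is not None:
        kwargs["null_fallback"] = null_fallback
    return make_hive_partitioning(**kwargs)


default_case = build_partitioning(None)
custom_case = build_partitioning("xyz")
```

In this sketch `None` never reaches the factory, so the parameter itself can stay string-only, as the docstring says.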

jorisvandenbossche · 2021-02-12T16:05:55Z

python/pyarrow/tests/test_dataset.py

 ])
-def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning,
-                                                  partition_keys):
+def test_open_dataset_partitioned_dictionary_type(


You added this to a test that is specifically about reading partitioned datasets while inferring the partition fields as dictionary. Which is fine (as this case also needs to be able to handle that), but this should also work (and so be tested) in the default case not inferring dictionary type?
And should we also have a test for the writing part? (this one only tests reading)

I added a parameter so this test will now cover both inferred dictionary and normal reading. I renamed it to just test_partition_discovery since the test case now covers several methods of discovery. I've also added tests for writing and a few other situations.

bkietz

This is looking great, thanks for working on this! A few smaller comments follow and a large reversion for which I've included a suggestion PR.

bkietz · 2021-02-19T16:20:47Z

cpp/src/arrow/compute/kernels/vector_hash.cc

@@ -147,6 +152,8 @@ class ValueCountsAction final : ActionBase {
    }
  }

+  bool ShouldEncodeNulls() { return true; }


Suggested change

-  bool ShouldEncodeNulls() { return true; }
+  constexpr bool ShouldEncodeNulls() const { return true; }

This appears to fail on GCC 6.3. See https://github.com/apache/arrow/pull/9323/checks?check_run_id=1956850616 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66297

Suggestions?

bkietz

Please rebase to pick up the fix for https://issues.apache.org/jira/browse/ARROW-11724
(should resolve the appveyor build failure)

bkietz

A few concerns remain but they can/probably should be addressed in follow up; this is good to go.

Thanks for working on this!

bkietz · 2021-02-23T20:38:40Z

cpp/src/arrow/dataset/partition_test.cc

+                       const RecordBatchVector& expected_batches,
+                       const std::vector<Expression>& expected_expressions) {
+    ASSERT_OK_AND_ASSIGN(auto partition_results, partitioning->Partition(full_batch));
+    std::shared_ptr<RecordBatch> rest = full_batch;


Unused:

Suggested change

-    std::shared_ptr<RecordBatch> rest = full_batch;

bkietz · 2021-02-24T14:36:02Z

cpp/src/arrow/dataset/partition.cc

-  if (field->type()->id() == Type::DICTIONARY) {
+  if (!key.value.has_value()) {
+    return is_null(field_ref(field->name()));
+  } else if (field->type()->id() == Type::DICTIONARY) {


nit:

Suggested change

-  } else if (field->type()->id() == Type::DICTIONARY) {
+  }
+  if (field->type()->id() == Type::DICTIONARY) {

bkietz · 2021-02-24T14:39:00Z

cpp/src/arrow/dataset/partition.cc

@@ -625,8 +650,27 @@ class StructDictionary {

 private:
  Status AddOne(Datum column, std::shared_ptr<Int32Array>* fused_indices) {
+    if (column.type()->id() == Type::DICTIONARY) {
+      if (column.null_count() != 0) {
+        // TODO(ARROW-11732) Optimize this by allowign DictionaryEncode to transfer a


Suggested change

-        // TODO(ARROW-11732) Optimize this by allowign DictionaryEncode to transfer a
+        // TODO(ARROW-11732) Optimize this by allowing DictionaryEncode to transfer a

bkietz · 2021-02-24T14:40:48Z

cpp/src/arrow/dataset/partition.cc

+    auto field_index = GetOrInsertField(name);
+    if (repr.has_value()) {
+      return InsertRepr(field_index, *repr);
+    } else {
+      return Status::OK();
+    }


Suggested change

-    auto field_index = GetOrInsertField(name);
-    if (repr.has_value()) {
-      return InsertRepr(field_index, *repr);
-    } else {
-      return Status::OK();
-    }
+    if (repr.has_value()) {
+      auto field_index = GetOrInsertField(name);
+      return InsertRepr(field_index, *repr);
+    }
+    return Status::OK();

bkietz · 2021-02-24T14:45:00Z

cpp/src/arrow/compute/kernels/vector_hash_test.cc

+  CheckUnique<NullType, std::nullptr_t>(null(), {nullptr, nullptr}, {false, true},
+                                        {nullptr}, {false});
+  CheckUnique<NullType, std::nullptr_t>(null(), {}, {}, {}, {});


Suggested change

-  CheckUnique<NullType, std::nullptr_t>(null(), {nullptr, nullptr}, {false, true},
-                                        {nullptr}, {false});
-  CheckUnique<NullType, std::nullptr_t>(null(), {}, {}, {}, {});
+  CheckUnique(ArrayFromJSON(null(), "[null, null, null]"), ArrayFromJSON(null(), "[null]"));
+  CheckUnique(ArrayFromJSON(null(), "[]"), ArrayFromJSON(null(), "[]"));

bkietz · 2021-02-24T15:30:49Z

cpp/src/arrow/dataset/partition_test.cc

@@ -282,9 +348,16 @@ TEST_F(TestPartitioning, HivePartitioningFormat) {
                    equal(field_ref("alpha"), literal(0))),
               "alpha=0/beta=3.25");
  AssertFormat(equal(field_ref("alpha"), literal(0)), "alpha=0");


Since a null-valued partition key is semantically equivalent to an absent one, we should ensure they format identically. I've created ARROW-11762 for follow-up.

Closes apache#9323 from westonpace/feature/arrow-10438

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>

github-actions bot added Component: C++ Component: Parquet labels Jan 26, 2021

bkietz self-requested a review January 26, 2021 20:51

bkietz requested changes Jan 26, 2021

westonpace force-pushed the feature/arrow-10438 branch from 5f01d7b to aa20ad7 Compare February 1, 2021 19:50

bkietz reviewed Feb 4, 2021

github-actions bot added the Component: Python label Feb 5, 2021

westonpace force-pushed the feature/arrow-10438 branch from 299e62e to 03a848c Compare February 11, 2021 23:43

jorisvandenbossche reviewed Feb 12, 2021

jorgecarleitao force-pushed the master branch 2 times, most recently from d4608a9 to 356c300 Compare February 14, 2021 12:09

westonpace force-pushed the feature/arrow-10438 branch from f815940 to 2140ab1 Compare February 15, 2021 21:13

westonpace marked this pull request as ready for review February 16, 2021 19:42

westonpace requested review from jorisvandenbossche and bkietz February 16, 2021 19:42

bkietz requested changes Feb 19, 2021

bkietz requested changes Feb 23, 2021

westonpace added 11 commits February 23, 2021 08:44

Merge/rebase

a4202a9

WIP commit

53853b6

Added tests of vector_hash for inputs with nulls. Added ability to specify encoded nulls when encoding a dictionary

570de2c

Prevent using dictionary columns as partition columns. It wouldn't work.

807c600

Addressing PR comments

353ea9d

Taking out an extraneous using that I missed in the last commit

68cf487

WIP

ae0b859

Adding the null fallback logic to the python half

c941bae

WIP

d502e05

WIP

5b18c96

WIP

613b286

westonpace and others added 17 commits February 23, 2021 08:49

Missed a test case

cd00e59

Re-lint, it appears my IDE is using the wrong style file

4506853

Final lint pass. Turns out I was relying on black which was messing up everything

3ca5f34

Added more tests, rounded out a few behaviors

982f68c

Added tests for SetDefaultValues to ensure it does the correct thing on null

07eee3a

Cleaned up logic for valid but not known case

8f1792d

Fixing compiler warning

6f7ced5

Python lint

212c9bc

Addressing PR comments

9ef4a71

Update cpp/src/arrow/compute/kernels/vector_hash.cc

c54c55d

Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>

Update cpp/src/arrow/compute/kernels/vector_hash.cc

9b0f8ee

Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>

Update cpp/src/arrow/compute/kernels/vector_hash_test.cc

ce53d4e

Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>

Update cpp/src/arrow/dataset/partition_test.cc

c2aa3ad

Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>

Update python/pyarrow/_dataset.pyx

7d5de82

Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>

Added test case to probe what happens when inferring a partition column that is only null. Changed it to an error to match directory partitioning

f1a6759

Use null scalars for known-null fields

dadbe8b

constexpr not supported in this context in all gcc versions due to gcc bugs. Pulling out for a second

d3bfe09

westonpace force-pushed the feature/arrow-10438 branch from c794653 to d3bfe09 Compare February 23, 2021 18:50

westonpace added 2 commits February 23, 2021 09:22

Missed one of the merge conflicts

f18c701

Putting in suggestion from Ben. It got lost on rebase / force-push

591021e

bkietz approved these changes Feb 24, 2021

bkietz closed this in 8e8a000 Feb 24, 2021

westonpace deleted the feature/arrow-10438 branch April 14, 2021 20:17

westonpace restored the feature/arrow-10438 branch April 14, 2021 20:17

westonpace deleted the feature/arrow-10438 branch April 14, 2021 20:17

westonpace restored the feature/arrow-10438 branch April 14, 2021 20:18

westonpace deleted the feature/arrow-10438 branch April 14, 2021 20:18

asfimport mentioned this pull request Feb 24, 2021

[C++][Dataset] Partitioning::Format on nulls #26416

Closed
