Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Dataset] Partitioning::Format on nulls #26416

Closed
asfimport opened this issue Oct 30, 2020 · 7 comments
Closed

[C++][Dataset] Partitioning::Format on nulls #26416

asfimport opened this issue Oct 30, 2020 · 7 comments

Comments

@asfimport
Copy link
Collaborator

Writing a dataset with null partition keys is currently untested. Ensure the behavior is documented and correct

Reporter: Ben Kietzman / @bkietz
Assignee: Weston Pace / @westonpace

PRs and other links:

Note: This issue was originally created as ARROW-10438. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Weston Pace / @westonpace:
I've created a test for this.  Currently the code bails (Grouping on a field with nulls).  It appears the default behavior in Hive (https://github.com/apache/hive/blob/1d5e6bdff99cd5aa7b885f001635d7231c3b9d44/common/src/java/org/apache/hadoop/hive/common/FileUtils.java#L271) is to use the string "HIVE_DEFAULT_PARTITION" as the value.  Googling around for this value confirms that seems to be how it is used in practice.

 

Furthermore, in Hive, empty strings also map to this value.  So empty strings and null will map to the same partition.  I'm assuming we want compatibility with Hive in this way?  Impala did things slightly differently to avoid the ambiguity (https://issues.apache.org/jira/browse/IMPALA-252) by choosing to reject with an error data that had empty strings.  However, this sort of strictness doesn't seem quite in keeping with Arrow.

@asfimport
Copy link
Collaborator Author

Weston Pace / @westonpace:
Ah, even worse, it appears to be configurable...

hive.exec.default.partition.name
  • Default Value: __HIVE_DEFAULT_PARTITION__
  • Added In: Hive 0.6.0

The default partition name in case the dynamic partition column value is null/empty string or any other values that cannot be escaped. This value must not contain any special character used in HDFS URI (e.g., ':', '%', '/' etc). The user has to be aware that the dynamic partition value should not contain this value to avoid confusions.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
I am not sure we should exactly follow the (potentially non-ideal) behaviour of Hive, here. Or at least have the option (or default, and have hive-behaviour as option) for different behaviour that can preserve the actual values would be nice? (there will also be many people that use arrow datasets to write hive-like datastores without ever actually interacting with hive)

Another source about the topic: https://kb.databricks.com/data/null-empty-strings.html, which concludes with "This is the expected behavior. It is inherited from Apache Hive." and "Solution: In general, you shouldn’t use both null and empty strings as values in a partitioned column."

Some random other first thoughts:

  • A default could also be to error? (so users will at least be aware of the problem, and of that it will loose empty strings)

  • We also need to think about how to do this for directory partitioning, not only for hive partitioning (and using a hive-specific name for a partitioning schema that is not compatible with Hive might make less sense?)

  • We currently already read empty string partition values from /key=/ directory names just fine, although this is probably not tested and might only work accidentally (and might also not work for other readers like spark?)

  • This might also interact with the discussion whether to include the partition fields in the actual data files or not (because when not left out, the actual file could still hold the real value to distinguish empty vs null)

    As another observation: dask simply drops rows with missing values in the partition column (silently), but I think that is just inherited by the fact that pandas' groupby implementation by default drops missing values, and not necessarily intentional design.

@asfimport
Copy link
Collaborator Author

Weston Pace / @westonpace:
Perhaps we can get away with two string options for each partitioning scheme, empty_fallback_value and null_fallback_value.  The default for both would be the empty string but the behavior would be slightly different.

Default behavior for hive partitioning:

"key=HIVE_DEFAULT_PARTITION" would map to "null" on read and on write
"key=" would map to an empty string on read, empty strings would result in error on write

Default behavior for directory partitioning:

Nothing would map to "null" on read, null strings would result in error on write
Nothing would map to empty on read, empty strings would result in error on write.

This way hive datasets can be read by default.  Datasets with null partitions will write in hive format by default.  Datasets with empty strings will throw an error but this can be overridden if the customer desires the hive behavior (by setting "empty_fallback_value" to "HIVE_DEFAULT_PARTITION")  By default no data will be lost (since empty strings will error).

 

For directory partitioning we won't do anything surprising and will just error on missing data.  Customers can choose to map values how they want.

@asfimport
Copy link
Collaborator Author

Weston Pace / @westonpace:
Although, on further thought, that would prevent the ability to create key= style partitions.  That would seem ok but in the unlucky event some other system expects the existence of key= style partitions it would be pretty frustrating.  Also, one small change, I'm preferring "empty_Fallback" and "null_fallback" (without the _value) since these are labels and not values.

Another approach could be to introduce a third option "hive_compatibility" which defaults to True.

 

|empty_fallback|null_fallback|hive_compatibility|Read null|Write null|Read empty|Write empty|Allows Data Loss|
|
|-|-|-|-|-|-|-|-|-|
|"" (default)|"_HIVE_DEFAULT_PARTITION_" (default)|True (default)|_HIVE_DEFAULT_PARITION_|_HIVE_DEFAULT_PARTITION_|Can't happen|Error|False|
|
|_HIVE_DEFAULT_PARTITION_|"_HIVE_DEFAULT_PARTITION_" (default)|True (default)|_HIVE_DEFAULT_PARITION_|_HIVE_DEFAULT_PARTITION_|Can't happen|_HIVE_DEFAULT_PARTITION_|True|
|
|"" (default)|"_HIVE_DEFAULT_PARTITION_" (default)|False|_HIVE_DEFAULT_PARITION_|_HIVE_DEFAULT_PARITION_|""|""|False|
|
|"XYZ"|"XYZ"|True|XYZ|XYZ|Can't happen|XYZ|True|
|
|"XYZ"|"XYZ"|False|Raise error on partition create| | | | |
|
|"XYZ"|"ABC"|True|Raise error on partition create| | | | |
|
|"XYZ"|"ABC"|False|XYZ|XYZ|ABC|ABC|False|
|
|"XYZ"|""|False|""|""|XYZ|XYZ|False|
|
|"" (default)|"XYZ"|True|XYZ|XYZ|Can't happen|Error|False|



Docstrings for the three options could look something like...



 



empty_fallback - Arrow will use this label when the value is empty.  If hive_compatibility is True then the default behavior will raise an exception to prevent data loss.  If you would like to maintain hive interoperability with empty strings set this to the same value as null_fallback.



null_fallback - Arrow will use this label when the value is null.  By default, for legacy reasons, this is "_HIVE_DEFAULT_PARTITION_"



hive_compatibility - When this is True Arrow will not allow a separate fallback value for empty strings.  Writing empty strings will produce an error.  If you wish to silently map empty strings to null (normal hive behavior) then you should also set empty_fallback to match null_fallback.  If False, then Arrow will require the empty fallback and null fallback to be separate values.



 



This all sounds complicated but it might "just work".  The customer probably won't even be aware of the options until they attempt to write data with empty strings and then they will get an error.  At that point they can agree to the data loss by changing "empty_fallback" or they can agree to breaking with Hive by disabling "hive_compatibility".
|

@asfimport
Copy link
Collaborator Author

Weston Pace / @westonpace:
@jorisvandenbossche , I spoke with @bkietz a bit on this and, at the risk of putting words in his mouth, he also agreed with I am not sure we should exactly follow the (potentially non-ideal) behaviour of Hive, here.

Ben's assumption was that we would just omit the directory on null and, if \_HIVE\_DEFAULT\_PARTITION\_ is present then just read in that string and allow the user to convert it to null at a later projection stage if that is what they desire.

That does make inference a little difficult in this case (right now HivePartitioning will attempt to infer int32 if possible).

It also puts the responsibility back on the user if they want to create a dataset compatible with other Hive tools.

We agreed it would be good to revisit the topic with you and see if you had any strong opinions.

@asfimport
Copy link
Collaborator Author

Ben Kietzman / @bkietz:
Issue resolved by pull request 9323
#9323

@asfimport asfimport added this to the 4.0.0 milestone Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants