[C++][Dataset] Partitioning::Format on nulls #26416
Comments
Weston Pace / @westonpace:
Furthermore, in Hive, empty strings also map to this value, so empty strings and nulls end up in the same partition. I'm assuming we want to be compatible with Hive in this way? Impala did things slightly differently to avoid the ambiguity (https://issues.apache.org/jira/browse/IMPALA-252): it rejects data containing empty strings with an error. However, that sort of strictness doesn't seem quite in keeping with Arrow.
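To make the ambiguity concrete, here is a minimal sketch of the two behaviors described above. The function names are hypothetical and are not part of Arrow, Hive, or Impala; the only literal taken from Hive is its reserved directory value `__HIVE_DEFAULT_PARTITION__`.

```python
from typing import Optional

# Hive's reserved directory value for a missing partition key.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def hive_segment(key: str, value: Optional[str]) -> str:
    """Hive-like formatting: null and empty string share one directory."""
    if value is None or value == "":
        return f"{key}={HIVE_DEFAULT_PARTITION}"
    return f"{key}={value}"


def strict_segment(key: str, value: Optional[str]) -> str:
    """Impala-like formatting: empty strings are rejected, not merged."""
    if value == "":
        raise ValueError(f"empty string not allowed for partition key {key!r}")
    if value is None:
        return f"{key}={HIVE_DEFAULT_PARTITION}"
    return f"{key}={value}"
```

With the Hive-like variant, `hive_segment("k", None)` and `hive_segment("k", "")` produce the same directory, so a reader cannot tell which of the two was written. The strict variant avoids that at the cost of failing on data that contains empty strings.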
Weston Pace / @westonpace:
Joris Van den Bossche / @jorisvandenbossche: Another source about the topic: https://kb.databricks.com/data/null-empty-strings.html, which concludes with "This is the expected behavior. It is inherited from Apache Hive." and "Solution: In general, you shouldn’t use both null and empty strings as values in a partitioned column."
Some random other first thoughts:
Weston Pace / @westonpace:
Default behavior for hive partitioning: "key=__HIVE_DEFAULT_PARTITION__" would map to null on read and on write.
Default behavior for directory partitioning: nothing would map to null on read, and null strings would result in an error on write.
This way hive datasets can be read by default, and datasets with null partitions will be written in hive format by default. Datasets with empty strings will throw an error, but this can be overridden if the customer wants the hive behavior (by setting "empty_fallback_value" to "__HIVE_DEFAULT_PARTITION__"). By default no data will be lost, since empty strings will error.
For directory partitioning we won't do anything surprising and will just error on missing data. Customers can choose to map values how they want.
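The proposed defaults can be sketched roughly as follows. This is only an illustration of the proposal, not Arrow's implementation: `FallbackOptions`, `format_value`, and `parse_value` are hypothetical names, the option names `null_fallback` and `empty_fallback` are borrowed from this discussion, and the sketch assumes that directory partitioning also errors on empty strings. It uses Hive's full literal `__HIVE_DEFAULT_PARTITION__`.

```python
from dataclasses import dataclass
from typing import Optional

HIVE_DEFAULT = "__HIVE_DEFAULT_PARTITION__"


@dataclass(frozen=True)
class FallbackOptions:
    # Hypothetical option names taken from the discussion, not an Arrow API.
    null_fallback: Optional[str] = None   # None: writing a null is an error
    empty_fallback: Optional[str] = None  # None: writing "" is an error


# Proposed defaults: hive maps null to the Hive literal, directory maps nothing.
HIVE_DEFAULTS = FallbackOptions(null_fallback=HIVE_DEFAULT)
DIRECTORY_DEFAULTS = FallbackOptions()


def format_value(value: Optional[str], opts: FallbackOptions) -> str:
    """Turn a partition value into the string written into the path."""
    if value is None:
        if opts.null_fallback is None:
            raise ValueError("null partition value and no null_fallback set")
        return opts.null_fallback
    if value == "":
        if opts.empty_fallback is None:
            raise ValueError("empty partition value and no empty_fallback set")
        return opts.empty_fallback
    return value


def parse_value(text: str, opts: FallbackOptions) -> Optional[str]:
    """Inverse of format_value: the null fallback reads back as null."""
    if opts.null_fallback is not None and text == opts.null_fallback:
        return None
    return text
```

Opting in to the Hive behavior, for example `FallbackOptions(null_fallback=HIVE_DEFAULT, empty_fallback=HIVE_DEFAULT)`, makes an empty string read back as null. That is the data loss the defaults are meant to prevent.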
Weston Pace / @westonpace: Another approach could be to introduce a third option "hive_compatibility" which defaults to True.
| empty_fallback | null_fallback | hive_compatibility | Read null | Write null | Read empty | Write empty | Allows Data Loss |
Weston Pace / @westonpace: Ben's assumption was that we would just omit the directory on null and, if That does make inference a little difficult in this case (right now HivePartitioning will attempt to infer int32 if possible). It also puts the responsibility back on the user if they want to create a dataset compatible with other Hive tools. We agreed it would be good to revisit the topic with you and see if you had any strong opinions.
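For comparison, the "omit the directory on null" alternative might look like the sketch below. The helpers `format_path` and `parse_path` are hypothetical, values are kept as plain strings, and the reader is assumed to know the partition field names up front.

```python
from typing import Dict, List, Optional


def format_path(keys: Dict[str, Optional[str]]) -> str:
    """Write key=value segments, skipping any key whose value is null."""
    return "/".join(f"{k}={v}" for k, v in keys.items() if v is not None)


def parse_path(path: str, field_names: List[str]) -> Dict[str, Optional[str]]:
    """Read segments back; a field missing from the path is treated as null."""
    found = dict(seg.split("=", 1) for seg in path.split("/") if seg)
    return {name: found.get(name) for name in field_names}
```

In this sketch a null round-trips only because the field names are passed in; the path `year=2020` on its own carries no trace of a `month` field. The output also contains no `__HIVE_DEFAULT_PARTITION__` directory, so compatibility with other Hive tools is left to the user.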
Ben Kietzman / @bkietz:
Writing a dataset with null partition keys is currently untested. Ensure the behavior is documented and correct.
Reporter: Ben Kietzman / @bkietz
Assignee: Weston Pace / @westonpace
Note: This issue was originally created as ARROW-10438. Please see the migration documentation for further details.