-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-757: Add NULL type to Bring Parquet logical types to par with Arrow #45
Conversation
* Annotates a FLOAT type when the original type was half precision on 16 bits | ||
* IEEE 754-2008 the 16-bit base 2 format | ||
*/ | ||
FLOAT_HALF_PRECISION = 26; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned yesterday in the sync: This seems to be neither implemented in Arrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove from now
@xhochy updated |
@@ -366,3 +366,7 @@ optional group my_map (MAP_KEY_VALUE) { | |||
} | |||
``` | |||
|
|||
## Null | |||
Sometimes when discovering the schema of existing data values are always null and there's no type information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used when an inferred type is null because a schema inference tool has only observed null values, right? Should this be a primitive type? My concern is that we may have a situation like this:
- Job 1 writes only nulls for column A as
optional binary A (NULL);
- Job 2 writes mixed nulls and ints for column A as
optional int32 A;
- Job 3 reads both files and chokes because it can't reconcile the schemas.
This may seem contrived at first because there is an easy merge rule, but in distributed use cases like MR it is common not to read the file schema on the client/driver. If there is no merge or requested schema, then it is possible for nodes to use the file schema, lose the NULL annotation, and die because string and int can't be merged during a shuffle.
Am I overthinking this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct. This is why arrow has a null type in the first place.
I'm wondering what is the impact of adding a primitive type.
The fact that the value is always null simplifies a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the merge logic stays the same in distributed mode as long as the full type is carried along and we know it's null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll have to make sure this works. I think our Schema union logic isn't currently based on logical types. For example, DECIMAL backed by binary merged with one backed by int64 probably fails. Probably not a big deal.
* Sometimes when discovering the schema of existing data | ||
* values are always null | ||
*/ | ||
NULL = 25; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this specify what underlying type to use for NULL?
@julienledem, sorry about the late review. I just noticed you merged this and when I checked the PR, it turns out my comments were in a pending review. |
@rdblue no worries, I can do a follow up. |
Note that release builds have many compiler warnings. We also should address the other supressed compiler warnings in a separate patch (see PARQUET-519). Author: Wes McKinney <wes@cloudera.com> Closes apache#45 from wesm/PARQUET-447 and squashes the following commits: 6ef295d [Wes McKinney] Explicit -O0 in DEBUG build type, remove redundant -O0 from coverage build flags 50b02a1 [Wes McKinney] Add release and fastdebug build types
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec: apache/parquet-format#45 but then quickly removed in favour of the Null logical type: apache/parquet-format#51 Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type. Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null.
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec: apache/parquet-format#45 but then quickly removed in favour of the Null logical type: apache/parquet-format#51 Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type. Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null.
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec: apache/parquet-format#45 but then quickly removed in favour of the Null logical type: apache/parquet-format#51 Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type. Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null.
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec: apache/parquet-format#45 but then quickly removed in favour of the Null logical type: apache/parquet-format#51 Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type. Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null. Closes #9863 from pitrou/PARQUET-1990-invalid-converted-type Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec: apache/parquet-format#45 but then quickly removed in favour of the Null logical type: apache/parquet-format#51 Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type. Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null. Closes apache#9863 from pitrou/PARQUET-1990-invalid-converted-type Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec: apache/parquet-format#45 but then quickly removed in favour of the Null logical type: apache/parquet-format#51 Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type. Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null. Closes apache#9863 from pitrou/PARQUET-1990-invalid-converted-type Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec: apache/parquet-format#45 but then quickly removed in favour of the Null logical type: apache/parquet-format#51 Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type. Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null. Closes apache#9863 from pitrou/PARQUET-1990-invalid-converted-type Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
No description provided.