Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-757: Add NULL type to Bring Parquet logical types to par with Arrow #45

Closed
wants to merge 2 commits into from

Conversation

julienledem
Copy link
Member

No description provided.

* Annotates a FLOAT type when the original type was half precision on 16 bits
* IEEE 754-2008 the 16-bit base 2 format
*/
FLOAT_HALF_PRECISION = 26;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned yesterday in the sync: This seems to be neither implemented in Arrow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove from now

@julienledem julienledem changed the title PARQUET-757: Bring Parquet logical types to par with Arrow PARQUET-757: Add NULL type to Bring Parquet logical types to par with Arrow Oct 28, 2016
@julienledem
Copy link
Member Author

@xhochy updated

@julienledem
Copy link
Member Author

@rdblue @xhochy please take another look.

@asfgit asfgit closed this in 4bddbad Nov 4, 2016
@@ -366,3 +366,7 @@ optional group my_map (MAP_KEY_VALUE) {
}
```

## Null
Sometimes when discovering the schema of existing data values are always null and there's no type information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used when an inferred type is null because a schema inference tool has only observed null values, right? Should this be a primitive type? My concern is that we may have a situation like this:

  1. Job 1 writes only nulls for column A as optional binary A (NULL);
  2. Job 2 writes mixed nulls and ints for column A as optional int32 A;
  3. Job 3 reads both files and chokes because it can't reconcile the schemas.

This may seem contrived at first because there is an easy merge rule, but in distributed use cases like MR it is common not to read the file schema on the client/driver. If there is no merge or requested schema, then it is possible for nodes to use the file schema, lose the NULL annotation, and die because string and int can't be merged during a shuffle.

Am I overthinking this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct. This is why arrow has a null type in the first place.
I'm wondering what is the impact of adding a primitive type.
The fact that the value is always null simplifies a bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the merge logic stays the same in distributed mode as long as the full type is carried along and we know it's null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll have to make sure this works. I think our Schema union logic isn't currently based on logical types. For example, DECIMAL backed by binary merged with one backed by int64 probably fails. Probably not a big deal.

* Sometimes when discovering the schema of existing data
* values are always null
*/
NULL = 25;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this specify what underlying type to use for NULL?

@rdblue
Copy link
Contributor

rdblue commented Nov 4, 2016

@julienledem, sorry about the late review. I just noticed you merged this and when I checked the PR, it turns out my comments were in a pending review.

@julienledem
Copy link
Member Author

@rdblue no worries, I can do a follow up.

lekv pushed a commit to lekv/parquet-format that referenced this pull request Jul 31, 2017
Note that release builds have many compiler warnings. We also should address the other supressed compiler warnings in a separate patch (see PARQUET-519).

Author: Wes McKinney <wes@cloudera.com>

Closes apache#45 from wesm/PARQUET-447 and squashes the following commits:

6ef295d [Wes McKinney] Explicit -O0 in DEBUG build type, remove redundant -O0 from coverage build flags
50b02a1 [Wes McKinney] Add release and fastdebug build types
pitrou added a commit to pitrou/arrow that referenced this pull request Mar 31, 2021
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec:
apache/parquet-format#45
but then quickly removed in favour of the Null logical type:
apache/parquet-format#51

Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type.

Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null.
pitrou added a commit to pitrou/arrow that referenced this pull request Mar 31, 2021
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec:
apache/parquet-format#45
but then quickly removed in favour of the Null logical type:
apache/parquet-format#51

Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type.

Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null.
pitrou added a commit to pitrou/arrow that referenced this pull request Mar 31, 2021
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec:
apache/parquet-format#45
but then quickly removed in favour of the Null logical type:
apache/parquet-format#51

Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type.

Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null.
emkornfield pushed a commit to apache/arrow that referenced this pull request Apr 1, 2021
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec:
apache/parquet-format#45
but then quickly removed in favour of the Null logical type:
apache/parquet-format#51

Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type.

Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null.

Closes #9863 from pitrou/PARQUET-1990-invalid-converted-type

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec:
apache/parquet-format#45
but then quickly removed in favour of the Null logical type:
apache/parquet-format#51

Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type.

Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null.

Closes apache#9863 from pitrou/PARQUET-1990-invalid-converted-type

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 10, 2021
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec:
apache/parquet-format#45
but then quickly removed in favour of the Null logical type:
apache/parquet-format#51

Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type.

Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null.

Closes apache#9863 from pitrou/PARQUET-1990-invalid-converted-type

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
ConvertedType::NA corresponds to an invalid converted type that was once added to the Parquet spec:
apache/parquet-format#45
but then quickly removed in favour of the Null logical type:
apache/parquet-format#51

Unfortunately, Parquet C++ could still in some cases emit the unofficial converted type.

Also remove the confusingly-named LogicalType::Unknown, while "UNKNOWN" in the Thrift specification points to LogicalType::Null.

Closes apache#9863 from pitrou/PARQUET-1990-invalid-converted-type

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants