
PARQUET-212: Implement LIST read compatibility rules in Thrift #300

Closed

Conversation

@rdblue
Contributor

rdblue commented Dec 11, 2015

This implements the read-side compatibility rules for 2-level and 3-level lists in Thrift.

Thrift doesn't allow null elements inside lists, but 3-level lists may have optional elements. This PR adds a property, parquet.thrift.ignore-null-elements, that allows thrift to read lists with optional elements by ignoring nulls. This is off by default, but is provided as an opt-in for compatibility with data written by Hive.

Thrift's schema conversion does not change because a Thrift class (or Scrooge etc.) must be set in a file's metadata or provided when constructing a reader.

This replaces and closes #144.
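
A minimal configuration sketch of the opt-in described above (the property name comes from this description; the surrounding setup is illustrative and assumes the standard Hadoop Configuration API):

```java
import org.apache.hadoop.conf.Configuration;

public class IgnoreNullElementsExample {
    public static void main(String[] args) {
        // Opt in to dropping null list elements when reading 3-level
        // lists written by Hive. Off by default: Thrift lists cannot
        // hold nulls, so enabling this silently ignores them.
        Configuration conf = new Configuration();
        conf.setBoolean("parquet.thrift.ignore-null-elements", true);
    }
}
```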

@rdblue

Contributor Author

rdblue commented Dec 11, 2015

@isnotinvain you may want to have a look at this since you guys do the most Thrift work. This adds support in Thrift to read data written by the other object models correctly.

@isnotinvain

Contributor

isnotinvain commented Dec 11, 2015

I'll try to take a look, thanks

@rdblue

Contributor Author

rdblue commented Dec 11, 2015

Thanks, Alex!

@liancheng may have some comments as well since he is familiar with the compatibility rules.

// The repeated type must be the element type because it matches the
// structure of the Avro element's schema.
return true;
return fieldNames.contains(repeatedType.asGroupType().getFieldName(0));

@liancheng

liancheng Dec 13, 2015

Contributor

I don't quite get the purpose of this change. Is this mainly for more robust support for schema evolution? Take the following Avro schema and Parquet schema as an example:

// Avro
record Struct {
  int f0;
  int f1;
}

// Parquet
message StructArray {
  required group struct_array (LIST) {
    repeated group array {
      required int32 f0;
    }
  }
}

According to the condition of this if branch, we should interpret struct_array as a LIST of Avro Struct, with the f1 field of every element set to null. However, f1 as defined in Struct isn't optional and can't be null.

Did I miss something here?

@liancheng

liancheng Dec 13, 2015

Contributor

My last comment is a little bit ambiguous. In the example I made, the Avro record Struct is the elementSchema, while the Parquet repeated group array is the repeatedType in the code.

@liancheng

liancheng Dec 13, 2015

Contributor

I guess what is missing here is that we should check that all the missing fields are indeed optional, unless this constraint is already guaranteed somewhere else (in which case, let's document it in the comment).

@rdblue

rdblue Dec 14, 2015

Author Contributor

This is for the case where only one field of a nested record is projected. Avro has a projection schema and an expected schema so they can differ. Previously, the logic to match the requested schema (elementSchema) with the file's structure checked whether the name of the single field matched the name of the single field of the element schema. But that left out the case where the two do match (as in your example) but the requested schema has an optional field that should be defaulted.

Whether the other field is optional doesn't matter at this point in the code. Avro fills in default values later and throws an exception if something is missing and has no default at that point.
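
The check being discussed can be sketched roughly like this (a hypothetical simplification, not the actual AvroRecordConverter code): the repeated group is accepted as the element type when its single field's name appears among the requested element schema's field names, so a one-field projection of a multi-field element still matches.

```java
import java.util.Set;

public class ListElementCheck {
    // Hypothetical simplification of the isElementType logic discussed
    // above: if the repeated group's single field name is one of the
    // requested element schema's field names, treat the repeated group
    // itself as the element type. Missing fields are not validated
    // here; Avro fills in defaults (or fails) later.
    static boolean isElementType(Set<String> elementFieldNames,
                                 String repeatedGroupFieldName) {
        return elementFieldNames.contains(repeatedGroupFieldName);
    }

    public static void main(String[] args) {
        // Projection kept only f0 of a two-field element record: match.
        System.out.println(isElementType(Set.of("f0", "f1"), "f0"));
        // A synthetic wrapper field name like "array" is not a match.
        System.out.println(isElementType(Set.of("f0", "f1"), "array"));
    }
}
```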

@liancheng

liancheng Dec 15, 2015

Contributor

Oh I see, this totally makes sense. Thanks for the explanation!

}
// If the repeated type is a subset of the structure of the ThriftField,
// then it must be the element type.
return fieldNames.contains(repeatedType.asGroupType().getFieldName(0));

@liancheng

liancheng Dec 13, 2015

Contributor

Same comments as the ones I made in AvroRecordConverter.java.

@rdblue

rdblue Dec 14, 2015

Author Contributor

Same as above. The field projection (set by a filter expression) is independent of the Thrift type used to carry data. Nulls are added later and handled separately.

@liancheng

Contributor

liancheng commented Dec 13, 2015

Got some questions about isListElementType updates.

Also, it would be nice to add parquet-thrift test cases that cover PARQUET-364, e.g., something like this.

@rdblue

Contributor Author

rdblue commented Dec 14, 2015

@liancheng, you're right that I still need to add more test cases, thanks for pointing that out.

@rdblue

Contributor Author

rdblue commented Dec 18, 2015

@liancheng, I added all of the test cases that were added in PARQUET-364 to this commit. I think that addresses all of your concerns.

@isnotinvain, do you still want to have a look at this?

@liancheng

Contributor

liancheng commented Dec 18, 2015

+1

@rdblue

Contributor Author

rdblue commented Dec 18, 2015

Thanks @liancheng!

rdblue added 8 commits Mar 10, 2015

This adds convenience methods for writing to files using the
RecordConsumer API directly. This is useful for mimicking files from
other writers for compatibility tests.

Parquet-thrift can now read files not written by parquet-thrift if an
appropriate Thrift class is supplied. This adds a check to derive the
necessary StructType from a class. Previously, attempting to read a file
without Thrift metadata in its properties would return a null
ThriftMetaData and throw NPE.

This also updates the logic in ThriftMetaData.fromExtraMetaData to avoid
NPE when the class is present but the descriptor property is not.

This mirrors the support in parquet-avro and uses an isElementType check
in the schema converter. Thrift classes are compiled and must always be
present, so there is no ambiguous case because the expected structure is
always known.

This initial version suppresses nulls when list elements are optional.
There is no way to pass a null list element to thrift when it constructs
records.

This does not change how parquet-thrift writes or converts schemas, so
there are no projection changes needed.
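
For reference, the two list layouts the read-side rules must reconcile look roughly like this (schemas in Parquet's message syntax; the field names here are illustrative, not taken from this PR):

```
// 2-level list, as written by older parquet-thrift: the repeated field
// is the element itself, so elements can never be null.
message TwoLevel {
  required group my_list (LIST) {
    repeated int32 element;
  }
}

// 3-level list, e.g. as written by Hive or parquet-avro: the repeated
// group wraps an element field, which may be optional (nullable).
message ThreeLevel {
  required group my_list (LIST) {
    repeated group list {
      optional int32 element;
    }
  }
}
```
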
This adds parquet.thrift.ignore-null-elements to suppress an exception
thrown when reading a list with optional elements. This makes reading
Hive lists an opt-in so that the caller must be aware that null values
are ignored.

Previously, this added Configurable to ThriftRecordConverter, but that
caused problems with semver and test failures because a separate
initialize method had to be called. This is brittle because callers
might not know to call initialize and the class is part of the API
because subclassing is allowed. To avoid the issue, this replaces the
Configurable interface and initialize method with a new constructor that
takes a Configuration.

This fixes parquet-avro and parquet-thrift list handling for the case
where the Parquet schema columns are projected and only one remains.
This appears to not match the element schema because the element schema
has more than one field. The solution is to match the element schema if
the Parquet schema is a subset.
@rdblue rdblue force-pushed the rdblue:PARQUET-212-fix-thrift-3-level-lists branch from 881df14 to ac7c405 Jan 12, 2016
@asfgit asfgit closed this in 37f72dc Jan 12, 2016
piyushnarang pushed a commit to piyushnarang/parquet-mr that referenced this pull request Jun 15, 2016

Author: Ryan Blue <blue@apache.org>

Closes apache#300 from rdblue/PARQUET-212-fix-thrift-3-level-lists and squashes the following commits:

ac7c405 [Ryan Blue] PARQUET-212: Add tests for list of list cases from PARQUET-364.
356fdb7 [Ryan Blue] PARQUET-212: Rename isElementType => isListElementType.
5d3b094 [Ryan Blue] PARQUET-212: Fix list handling with projection.
b5f207f [Ryan Blue] PARQUET-212: Add Configuration to the ThriftRecordConverter ctor.
b87eb65 [Ryan Blue] PARQUET-212: Add property to ignore nulls in lists.
3d1e92f [Ryan Blue] PARQUET-212: Update thrift reads for LIST compatibility rules.
0bf2b45 [Ryan Blue] PARQUET-212: Read non-thrift files if a Thrift class is supplied.
4e148dc [Ryan Blue] PARQUET-212: Add DirectWriterTest base class.