PARQUET-968 Add Hive/Presto support in ProtoParquet #411

costimuraru · 2017-04-29T20:58:56Z

This PR adds Hive (https://github.com/apache/hive) and Presto (https://github.com/prestodb/presto) support for parquet messages written with ProtoParquetWriter. Hive and other tools, such as Presto (used by AWS Athena), rely on specific LIST/MAP wrappers (as defined in the parquet spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). These wrappers are currently missing from the ProtoParquet schema. AvroParquet works just fine, because it adds these wrappers when it deals with arrays and maps. This PR brings these wrappers in parquet-proto, providing the same functionality that already exists in parquet-avro.

This is backward compatible. Messages written without the extra LIST/MAP wrappers are still being read successfully using the updated ProtoParquetReader.

Regarding the change.
Given the following protobuf schema:

message ListOfPrimitives {
    repeated int64 my_repeated_id = 1;
}

Old parquet schema was:

message ListOfPrimitives {
  repeated int64 my_repeated_id = 1;
}

New parquet schema is:

message ListOfPrimitives {
  required group my_repeated_id (LIST) = 1 {
    repeated group list {
      required int64 element;
    }
  }
}

For list of messages, the changes look like this:

Protobuf schema:

message ListOfMessages {
    string top_field = 1;
    repeated MyInnerMessage first_array = 2;
}

message MyInnerMessage {
    int32 inner_field = 1;
}

Old parquet schema was:

message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  repeated group first_array = 2 {
    optional int32 inner_field = 1;
  }
}

The expected parquet schema, compatible with Hive (and similar to parquet-avro) is the following (notice the LIST wrapper):

message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  required group first_array (LIST) = 2 {
    repeated group list {
      optional group element {
        optional int32 inner_field = 1;
      }
    }
  }
}

Similar for maps. Protobuf schema:

message TopMessage {
    map<int64, MyInnerMessage> myMap = 1;
}

message MyInnerMessage {
    int32 inner_field = 1;
}

Old parquet schema:

message TestProto3.TopMessage {
  repeated group myMap = 1 {
    optional int64 key = 1;
    optional group value = 2 {
      optional int32 inner_field = 1;
    }
  }
}

New parquet schema (notice the MAP wrapper):

message TestProto3.TopMessage {
  required group myMap (MAP) = 1 {
    repeated group key_value {
      required int64 key;
      optional group value {
        optional int32 inner_field = 1;
      }
    }
  }
}

Jira: https://issues.apache.org/jira/browse/PARQUET-968

kgalieva · 2017-05-05T13:00:31Z

Hello @costimuraru
Could you please clarify why you decided to replace

repeated int32 repeatedPrimitive = 3;

with

required group repeatedPrimitive (LIST) = 3 {
    repeated int32 array;
 }

not with

optional group repeatedPrimitive (LIST) {
 repeated group list {
   optional int32 element;
 }
}

as described in documentation https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

costimuraru · 2017-05-05T18:37:38Z

Hi @kgalieva,

You raise a good point. What I had in mind was to make it similar to what parquet-avro is doing, like you can see here: https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java#L82
But indeed, it seems the right approach to always add the list wrapper.

Cheers,
Costi

kgalieva · 2017-05-05T18:54:17Z

Here are examples of how Spark and Hive handle repeated fields.
Spark:

 optional group repeatedPrimitive (LIST) {
    repeated group list {
      optional int32 element;
    }
  }

Hive

 optional group repeatedPrimitive (LIST) {
    repeated group bag {
      optional int32 array_element;
    }
  }

Both are compliant with the specification.
Would you consider implementing it Spark/Hive way?

costimuraru · 2017-05-05T23:52:42Z

Hi @kgalieva,

I've spent the last couple of hours trying to add the inner layer for primitive values, but the changes needed to support this are quite involved.

However, looking again over the spec, it says this:

Backward-compatibility rules
[...] Some existing data does not include the inner element layer. [...] 
Examples that can be interpreted using these rules:

// List<Integer> (nullable list, non-null elements)
optional group my_list (LIST) {
  repeated int32 element;
}

This is exactly the same as what avro and this PR is producing. (By the way, this format is working perfect with Hive and Presto - tested on our own data set, with a massive protobuf schema (40+ fields)).

Also, the spec does not mention a best practice for this use case: List<Tuple<String, Integer>>.
Specifically in protobuf:

message ListOfMessages {
    repeated MyInnerMessage my_array = 1;
}

message MyInnerMessage {
    string field1 = 1;
    int32 field2 = 2;
}

Clearly we can't just use element here, since we have two of them (field1/field2).

julienledem · 2017-05-12T21:53:05Z

Hi @costimuraru and @kgalieva: great to see design discussions happening :)
If you want to have systematically a 3 level parquet list here are some hints:

example 1:
Proto: (note that this list can not be null and does not contain null)
repeated int32 repeatedPrimitive = 3;
should map to (almost what @kgalieva was saying, just added required where it can not be null)

*required* group repeatedPrimitive (LIST) {
 repeated group list {
   *required* int32 element;
 }
}

example 2
repeated MyInnerMessage my_array = 1;
In this case element is just of type MyInnerMessage

*required* group my_array (LIST) {
 repeated group list {
   *required* MyInnerMessage element;
 }
}

CC: @rdblue

costimuraru · 2017-05-18T22:48:29Z

@julienledem, @kgalieva, I've made the changes so that the resulting parquet schema now follows the spec.

@julienledem, I've also made the LIST required and the element is also required now.

See the changes in ProtoSchemaConverterTest.java

Updated the pull request description to reflect the schema changes.

qinghui-xu · 2017-05-30T16:16:09Z

@julienledem @costimuraru
This PR would be quite interesting if the wrapper could be defined as optional in parquet schema, from the point of view of our use cases in which we need to distinguish whether a list is null or empty.
If using optional on the first level, the list will be nullable (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists). I'll take the first example:
optional group repeatedPrimitive (LIST) {
repeated group list {
required int32 element;
}
}
Such, we have an optional list containing not null integers.

julienledem · 2017-06-09T18:32:52Z

@matt-martin @lukasnalezenec @costimuraru @qinghui-xu @kgalieva this looks great. What is left to resolve before we can merge this? I'm not using parquet-proto myself at the moment but I'm happy to organize a google hangout if that helps getting to a resolution.

costimuraru · 2017-06-12T21:58:35Z

@julienledem sounds good. I think this is ready for merge.

lumost · 2017-08-30T20:59:36Z

I've also encountered this issue with ProtoParquet formatted parquet files. Is it possible for this to be merged in the near future? Also happy to pitch in on any outstanding items @costimuraru @julienledem

andredasilvapinto · 2017-09-01T12:35:19Z

I had problems with this when using Proto3. I've done a few changes to get it to work. I also noticed the field names and structures for lists and maps don't conform with the official representation here ( https://github.com/apache/parquet-format/blob/master/LogicalTypes.md ), but with one of the backward compatible ones. This was not enough to get us to read the Parquet files in all our tech stack (Spark, Hive, Athena/Presto), so I also changed that.

I might be able to push my changes sometime soon.

costimuraru · 2017-09-01T12:51:54Z

Hey Andre. There are multiple commits on this PR, including ones that make the schema compatible with the spec defined at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md including for lists and maps.
For instance the last commit 28837b3 makes it compatible with Spark (which was tested and validated by AWS).
Are you using the latest version here?

We've been using this patch to produce proto3-parquet files that we're feeding to Athena for quite some time, with a schema containing 40 fields including maps, lists and inner groups and with 10s of TB of data.

andredasilvapinto · 2017-09-01T13:08:10Z

Ah no. I did this ~ 3 weeks ago. Nice that you fixed it ;)

andredasilvapinto · 2017-09-01T13:14:58Z

You didn't fix the lists representation though. The documentation says:

The middle level, named list, must be a repeated group with a single field named element.

Your single field is not named element.

Also why did you pick Repetition.REQUIRED here?
28837b3#diff-e5fd77f88cc2bb6c2ff8fa3b53f4d56bR140

In Protocol Buffers 3 everything is Optional.

costimuraru · 2017-09-04T21:06:09Z

Thanks for the input, @andredasilvapinto.

Your single field is not named element.

Again, with one of the latest commits, the proto-parquet list should look something like this:

required group my_repeated_id (LIST) = 1 {
  repeated group list {
    **required int64 element;**
  }
}

Are you not seeing this with the latest version on this patch?

Also why did you pick Repetition.REQUIRED here?
28837b3#diff-e5fd77f88cc2bb6c2ff8fa3b53f4d56bR140

It's a good question! In the spec it says:

The outer-most level must be a group annotated with MAP that contains a single field named key_value. The repetition of this level must be either optional or required and determines whether the list is nullable.

AFAIK, protobuf does not have lists/maps that are null. In fact "it makes no distinction between an empty list and a null list." - so I think it doesn't matter what the repetion is here. I tried it with Repetion.REQUIRED and it worked fine even without adding any values to the protobuf map. If you know otherwise, feedback is appreciated.

andredasilvapinto · 2017-09-04T22:18:33Z

You are only doing it for primitive types: https://github.com/apache/parquet-mr/pull/411/files#diff-3b093ba1a3c729ad39bd47b0c148a586R298

even your tests show it:
https://github.com/apache/parquet-mr/pull/411/files#diff-ae1342df26f3212198daf98364cde51dR161

I went with OPTIONAL because in Proto3 there is no REQUIRED, so I thought that an optional parquet field was a more adequate type to represent the equivalent Protobuf 3 optional field.

andredasilvapinto · 2017-09-05T12:23:35Z

If you find it useful, these were the changes I did on top of your last "Implement review" commit: d694f20:

andredasilvapinto@dfa9701

Some of the changes are already present on your latest commit.

We have been running this for dozens of different data sets (Protobuf 3) for a few weeks already without any known problems.

costimuraru · 2017-09-05T20:39:10Z

You are only doing it for primitive types: https://github.com/apache/parquet-mr/pull/411/files#diff-3b093ba1a3c729ad39bd47b0c148a586R298
even your tests show it:
https://github.com/apache/parquet-mr/pull/411/files#diff-ae1342df26f3212198daf98364cde51dR161

You raise an interesting point, @andredasilvapinto!

Case 1.
Suppose we have the following protobuf schema, containing a list of messages, where the inner message has two fields:

message MyTopMessage {
     repeated MyInnerMessage repeatedMessage = 1;
}

message MyInnerMessage {
    int32 someId = 1;
    int32 otherId = 2;   
}

Ideally, I would like to be able to query each individual sub-field (someId/otherId) individually, in Athena or Hive. Something like this:

SELECT repeatedMessage[1].someId FROM athenalist limit 10;
SELECT repeatedMessage[1].otherId FROM athenalist limit 10;

Where the table (Presto) would look something like this (notice the array of struct):

CREATE EXTERNAL TABLE IF NOT EXISTS athenalist (
`repeatedComplexMessage` array<struct<`someId`:int,`otherId`:int>>)
STORED AS PARQUET

This works fine with the current version. The current parquet schema looks like this:

message TestProto3.MyTopMessage {
  required group repeatedComplexMessage (LIST) {
    repeated group list {
      optional int32 someId;
      optional int32 otherId;
    }
  }
}

My question here would be: where should the element be? The spec does not specify what to do when we're dealing with a list of messages with multiple fields. Thoughts?

Case 2.
The second case is the one present in the unit test, where the inner message has just one field:

message MyTopMessage {
     repeated MyInnerMessage repeatedMessage = 1;
}

message MyInnerMessage {
    int32 someId = 1;
}

Again, ideally I would like to be able to select that field specifically:

SELECT repeatedMessage[1].someId FROM athenalist2 limit 10;

And have a CREATE table with a struct containing one field:

CREATE EXTERNAL TABLE IF NOT EXISTS athenalist (
`repeatedComplexMessage` array<struct<`someId`:int>>)
STORED AS PARQUET

However... this does not work! I'm getting a parquet parsing error in Presto (HIVE_CURSOR_ERROR: Can not read value at 0 in block 0).

It works however when I change the CREATE table to remove the struct:

CREATE EXTERNAL TABLE IF NOT EXISTS athenalist (
`repeatedComplexMessage` array<int>)
STORED AS PARQUET

Though I'm left with no way of querying the someId field directly. I can only do:

SELECT repeatedMessage[1] FROM athenalist2 limit 10;

Which will return an int.

If this is the desired behavior, then indeed @andredasilvapinto, we can have element as the inner field name instead of someId. Something like:

message TestProto3.MyTopMessage {
  required group repeatedMessage (LIST) {
    repeated group list {
      optional int32 ~~someId~~ element;
    }
  }
}
But this will work only when the inner message has just one field. And again, this seems to prevent the ability to SELECT that specific field (someId).

What do you think?

Later Edit: Ah, I see in your commit (andredasilvapinto@dfa9701) what should be done here. It should actually be:

message TestProto3.MyTopMessage {
  required group repeatedMessage (LIST) {
    repeated group list {
      optional group element {
        optional int32 someId;
      }
    }
  }
}

The same goes for above. Nice catch! I'll give this a try and will reply.

costimuraru · 2017-09-05T21:19:05Z

@andredasilvapinto, you were right! After adding the extra element wrapper (like you suggested, even for non-primitive types) it started working also for Case 2. Great job, man!
I picked your commit, which contains also the fixes for the MAP fields.
If you wish to preserve the "copyright", I'd be more than happy to do a cherry pick from your fork after you rebase on top of the PARQUET-968 Implement feedback 28837b3

andredasilvapinto · 2017-09-05T21:23:20Z

Nice @costimuraru. No problem with the "copyright". Just as long as this gets merged I'm happy (one less reason to keep our internal parquet-mr fork!). cheers!

andredasilvapinto · 2017-09-29T09:21:08Z

Are there any efforts currently being made in order to merge this to master?

andredasilvapinto · 2017-11-07T00:49:56Z

Just noticed that this doesn't write the values of Protobuf fields that are equal to their default values. This happens because in Protobuf3, setting a field to its default value is equivalent to clearing the field. Therefore the conversion to Parquet needs to take that in consideration.

abelke · 2017-11-13T09:45:28Z

Hi when I use this patch, this need protoc3(install Protobuf 3.4.0 ), if the same case on protoc2(Protobuf 2.5.0) , have other solution?

qinghui-xu · 2018-02-13T15:57:24Z

@costimuraru @julienledem
Hey, it seems this patch stays here for a while, and it is indeed important for us to have this fix. Could somebody merge it?

lumost · 2018-02-13T17:32:15Z

seconding @qinghui-xu's comment, we've had a version of this Patch in production for nearly 6 months now.

BenoitHanotte · 2018-04-17T11:24:15Z

@lukasnalezenec this is the schema written for a map with the flag set to true:

optional group nonEmptyMap (MAP) = 5 {
  repeated group key_value {
    required int32 key;
    optional int32 value;
  }
}

There is a test suite at https://github.com/BenoitHanotte/parquet-968/blob/master/src/test/scala/org/bhnte/parquet968/SparkTest.scala to validate the support of empty maps and lists as well as to check that Spark is now able to correctly interpret maps as such with the flag set to true

BenoitHanotte · 2018-04-24T11:53:21Z

Hello @lukasnalezenec, have you had time to have a look? Thanks

lukasnalezenec · 2018-04-24T15:12:45Z

Hi, I already did.
There is one typo in comment and it is little bit harder to read - i wanted to check flow once more. I think we can commit it as it is.

julienledem · 2018-04-24T16:31:55Z

This looks good.
Thank you for this collaborative effort!

chawlakunal · 2018-04-24T23:03:29Z

When can this be expected to be merged to master and released?

chawlakunal · 2018-04-27T19:16:07Z

@BenoitHanotte @costimuraru @julienledem There is no way to instantiate ProtoParquetWriter with parquet.proto.writeSpecsCompliant flag enabled. Am I missing something or is this intentional? It would have been great if a constructor to enable the flag was provided.

public ProtoParquetWriter(Path file, Class<? extends Message> protoMessage,
          CompressionCodecName compressionCodecName, int blockSize, int pageSize, boolean enableDictionary,
          boolean validating, boolean writeSpecsCompliant) throws IOException {
      super(file, new ProtoWriteSupport(protoMessage), compressionCodecName, blockSize, pageSize, pageSize,
              enableDictionary, validating, DEFAULT_WRITER_VERSION,
              getConfigWithWriteSpecsCompliant(writeSpecsCompliant));
  }
  
  private static Configuration getConfigWithWriteSpecsCompliant(boolean writeSpecsCompliant) {
      Configuration config = new Configuration();
      ProtoWriteSupport.setWriteSpecsCompliant(config, writeSpecsCompliant);
      return config;
  }

BenoitHanotte · 2018-04-29T18:20:04Z

@chawlakunal you can manually create your ParquetWriter by providing the ProtoWriteSupport as following:

Configuration conf = new Configuration();
ProtoWriteSupport.setWriteSpecsCompliant(true, conf); // set the flag in the configuration
new ParquetWriter(file, conf, new ProtoWriteSupport(protoClass));

(or any variation of this as ParquetWriter has multiple constructors that accept a configuration in which we can set the flag)

chawlakunal · 2018-04-30T03:35:33Z

@BenoitHanotte That's exactly how I am using it right now but it kind of defeats the purpose of having ProtoParquetWriter class.

chawlakunal · 2018-04-30T18:55:04Z

@BenoitHanotte Is there a timeline of when this will be released?

BenoitHanotte · 2018-05-01T16:38:22Z

I believe the 1.10.0 has just been released, so this will liekly land in the next "major" release, unfortunately I am not aware of any plan to have a new release in the near future.

For the ProtoParquetWriter class, we have discussed it with @costimuraru and we will add a constructor with the flag in a future PR but I can't commit on a timeframe.

chawlakunal · 2018-05-01T20:35:14Z

@BenoitHanotte Here's the PR for constructor with flag #473

Also, if a minor release can be done for this fix it will be greatly appreciated.

BenoitHanotte · 2018-05-03T09:27:22Z

@chawlakunal I had a look at your PR (#473) , it looks good, there is just a comment that I believe needs to be changed (regarding the block size).
For the release, that's a decision that will need to be taken by the maintainers, I will ask them about their plans for releases the next time I have the chance to talk with them.

@andredasilvapinto

This PR adds Hive (https://github.com/apache/hive) and Presto (https://github.com/prestodb/presto) support for parquet messages written with ProtoParquetWriter. Hive and other tools, such as Presto (used by AWS Athena), rely on specific LIST/MAP wrappers (as defined in the parquet spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). These wrappers are currently missing from the ProtoParquet schema. AvroParquet works just fine, because it adds these wrappers when it deals with arrays and maps. This PR brings these wrappers in parquet-proto, providing the same functionality that already exists in parquet-avro. This is backward compatible. Messages written without the extra LIST/MAP wrappers are still being read successfully using the updated ProtoParquetReader. Regarding the change. Given the following protobuf schema: ``` message ListOfPrimitives { repeated int64 my_repeated_id = 1; } ``` Old parquet schema was: ``` message ListOfPrimitives { repeated int64 my_repeated_id = 1; } ``` New parquet schema is: ``` message ListOfPrimitives { required group my_repeated_id (LIST) = 1 { repeated group list { required int64 element; } } } ``` --- For list of messages, the changes look like this: Protobuf schema: ``` message ListOfMessages { string top_field = 1; repeated MyInnerMessage first_array = 2; } message MyInnerMessage { int32 inner_field = 1; } ``` Old parquet schema was: ``` message TestProto3.ListOfMessages { optional binary top_field (UTF8) = 1; repeated group first_array = 2 { optional int32 inner_field = 1; } } ``` The expected parquet schema, compatible with Hive (and similar to parquet-avro) is the following (notice the LIST wrapper): ``` message TestProto3.ListOfMessages { optional binary top_field (UTF8) = 1; required group first_array (LIST) = 2 { repeated group list { optional group element { optional int32 inner_field = 1; } } } } ``` --- Similar for maps. Protobuf schema: ``` message TopMessage { map<int64, MyInnerMessage> myMap = 1; } message MyInnerMessage { int32 inner_field = 1; } ``` Old parquet schema: ``` message TestProto3.TopMessage { repeated group myMap = 1 { optional int64 key = 1; optional group value = 2 { optional int32 inner_field = 1; } } } ``` New parquet schema (notice the `MAP` wrapper): ``` message TestProto3.TopMessage { required group myMap (MAP) = 1 { repeated group key_value { required int64 key; optional group value { optional int32 inner_field = 1; } } } } ``` Jira: https://issues.apache.org/jira/browse/PARQUET-968 Author: Constantin Muraru <cmuraru@adobe.com> Author: Benoît Hanotte <BenoitHanotte@users.noreply.github.com> Closes apache#411 from costimuraru/PARQUET-968 and squashes the following commits: 16eafcb [Benoît Hanotte] PARQUET-968 add proto flag to enable writing using specs-compliant schemas (#2) a8bd704 [Constantin Muraru] Pick up commit from @andredasilvapinto 5cf9248 [Constantin Muraru] PARQUET-968 Add Hive support in ProtoParquet

CCv5 · 2018-10-30T08:21:20Z

Yes. The way I solved it was to add a flag to ProtoWriteSupport to define whether to include default values or not. If set to true I always set empty fields to their default protobuf values (except one of fields).

I can share the commit if people are interested.

Hi @andredasilvapinto , Do you have any progress regarding the default value is not persisted in parquet? It is quite an annoying bug when read enum value as ['null', 'Type1', 'Type2'], not ['Type0', 'Type1', 'Type2']

andredasilvapinto · 2018-10-30T08:28:49Z

Yes, if you look at the commit I linked to several months ago costimuraru@9a4c016 it contains that flag to decide whether to print the default values or not.

…

On Tue, Oct 30, 2018, 08:21 CHuAn ***@***.***> wrote: Yes. The way I solved it was to add a flag to ProtoWriteSupport to define whether to include default values or not. If set to true I always set empty fields to their default protobuf values (except one of fields). I can share the commit if people are interested. Hi @andredasilvapinto <https://github.com/andredasilvapinto> , Do you have any progress regarding the default value is not persisted in parquet? It is quite an annoying bug when read enum value as ['null', 'Type1', 'Type2'], not ['Type0', 'Type1', 'Type2'] — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#411 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ACjD6U3beSyBsjmTkF6EGtxHV8jdpEPCks5uqAwHgaJpZM4NMYWi> .

CCv5 · 2018-11-05T02:01:46Z

Yes, if you look at the commit I linked to several months ago costimuraru@9a4c016 it contains that flag to decide whether to print the default values or not.
…
On Tue, Oct 30, 2018, 08:21 CHuAn @.***> wrote: Yes. The way I solved it was to add a flag to ProtoWriteSupport to define whether to include default values or not. If set to true I always set empty fields to their default protobuf values (except one of fields). I can share the commit if people are interested. Hi @andredasilvapinto https://github.com/andredasilvapinto , Do you have any progress regarding the default value is not persisted in parquet? It is quite an annoying bug when read enum value as ['null', 'Type1', 'Type2'], not ['Type0', 'Type1', 'Type2'] — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#411 (comment)>, or mute the thread https://github.com/notifications/unsubscribe-auth/ACjD6U3beSyBsjmTkF6EGtxHV8jdpEPCks5uqAwHgaJpZM4NMYWi .

ok, it is on your personal branch, great patch! when will it merge to master? any plan?

andredasilvapinto · 2018-11-05T09:41:36Z

I have no idea. I think a few changes were already made to the base version during this approval process.

ccpstephanie · 2021-07-23T16:33:54Z

Although it's closed but I'm a bit confused... why I always get the old schema version? parquet.proto.writeSpecsCompliant=false, and directly using ParquetWriter. I'm using the latest version currently 1.12.0.

I'd highly appreciate if someone could point out something stupid in my code! Or it's the same issue you are experiencing?

My goal is to be able query data via Athena/Presto, or Hive Metastore, so need the the new parquet schema version.

Method 1:

  // Doesn't work!
    Configuration conf = new Configuration();
    ProtoWriteSupport.setWriteSpecsCompliant(conf, false); // If set to true, the old schema style will be used (without wrappers).

    ParquetWriter<MessageOrBuilder> writer =
    ProtoParquetWriter.<MessageOrBuilder>builder(file).withMessage(cls).withConf(conf).build();

    for (MessageOrBuilder record : records) {
        writer.write(record);
    }

    writer.close();
    System.err.println(writer.getFooter());

Method 2:

  // Doesn't work!
    Configuration conf = new Configuration();
    ProtoWriteSupport.setWriteSpecsCompliant(conf, false); // If set to true, the old schema style will be used (without wrappers).

    try (ParquetWriter writer = new ParquetWriter(
                                            file,
                                            new ProtoWriteSupport<AddressBook>(AddressBook.class),
                                            CompressionCodecName.GZIP,
                                            128 * 1024 * 1024,//PARQUET_BLOCK_SIZE,
                                            ParquetProperties.DEFAULT_PAGE_SIZE,
                                            ParquetProperties.DEFAULT_PAGE_SIZE, 
                                            true,
                                            false,
                                            ParquetProperties.DEFAULT_WRITER_VERSION,
                                            conf)) {
        for (Object record : messages) {
            writer.write(record);
        }
        writer.close();
        System.err.println(writer.getFooter());

Parquet output Metadata:
_ParquetMetaData{FileMetaData{schema: message AddressBookProtos.AddressBook { repeated group people = 1 { optional binary name (STRING) = 1; optional int32 id = 2; optional binary email (STRING) = 3; repeated group phones = 4 { optional binary number (STRING) = 1; optional binary type (ENUM) = 2; } }} , metadata: {parquet.proto.descriptor=name: "AddressBook" field { name: "people" number: 1 label: LABEL_REPEATED type: TYPE_MESSAGE type_name: ".AddressBookProtos.Person"} , parquet.proto.writeSpecsCompliant=false, ...}

Protobuf Messasge:

`
syntax = "proto3";

package AddressBookProtos;

option java_multiple_files = true;
option java_package = "com.mycompany.app";
option java_outer_classname = "AddressBookProtos";

message Person {
string name = 1;
int32 id = 2;
string email = 3;

enum PhoneType {
MOBILE = 0;
HOME = 1;
WORK = 2;
}

message PhoneNumber {
string number = 1;
PhoneType type = 2;
}

repeated PhoneNumber phones = 4;
}

message AddressBook {
repeated Person people = 1;
}
`

Srb1996 · 2024-05-08T05:58:19Z

Was it merged ? facing same issue with 1.13.0(latest)

wgtmac · 2024-05-08T06:08:23Z

I believe it was merged: f849384

Srb1996 · 2024-05-08T06:54:16Z

@wgtmac thanks for reply. It means version 1.13.0 with proto3 should work, right ?

wgtmac · 2024-05-08T06:57:24Z

It was merged long ago and I don't have any context about it. If this is the fix to the issue that you have seen, then yes it should not appear in 1.13.0.

Srb1996 · 2024-05-08T06:58:32Z

i am using proto2, could it be a reason ?

qinghui-xu · 2024-05-08T12:08:52Z

i am using proto2, could it be a reason ?

Probably that's the reason. parquet-proto requires proto 3 as dependency.

Srb1996 · 2024-05-08T13:23:49Z

@qinghui-xu i am using for parquet-proto with proto2 for few years, but question is to take the changes of this pr in consideration do i need to upgrade to proto3 or this solution should work with proto2 as well ?
Currently, i am using proto2 with parquet-protobuf(1.13.0) and querying using presto is breaking.

Srb1996 · 2024-05-08T13:25:45Z

@ccpstephanie was you able to resolve ?

julienledem mentioned this pull request May 1, 2017

[PARQUET-951] Pull request for handling protobuf field id #410

Closed

costimuraru force-pushed the PARQUET-968 branch from 8e6f8d1 to 28837b3 Compare August 27, 2017 22:14

lumost mentioned this pull request Aug 30, 2017

Incorrect parsing for 'repeated group' in parquet format prestodb/presto#5316

Closed

costimuraru changed the title ~~PARQUET-968 Add Hive support in ProtoParquet~~ PARQUET-968 Add Hive/Presto support in ProtoParquet Sep 5, 2017

dg3feiko mentioned this pull request Dec 12, 2017

List/Map is not compatible with AWS Athena/Hive/PrestoDB ironSource/parquetjs#30

Open

julienledem closed this in f849384 Apr 26, 2018

ggershinsky mentioned this pull request Apr 30, 2018

PARQUET-968 Add Hive/Presto support in ProtoParquet ggershinsky/parquet-mr#1

Merged

chawlakunal mentioned this pull request May 1, 2018

PARQUET-1292 Adding constructors to ProtoParquetWriter with writeSpecsCompliant flag #473

Open

nathan-pr mentioned this pull request Jun 3, 2022

Attempt at forcing write compliance on parquet schemas fullcontact/secor#2

Merged

PARQUET-968 Add Hive/Presto support in ProtoParquet #411

PARQUET-968 Add Hive/Presto support in ProtoParquet #411

Conversation

costimuraru commented Apr 29, 2017 • edited

kgalieva commented May 5, 2017

costimuraru commented May 5, 2017

kgalieva commented May 5, 2017

costimuraru commented May 5, 2017 • edited

julienledem commented May 12, 2017 • edited

costimuraru commented May 18, 2017 • edited

qinghui-xu commented May 30, 2017 • edited

julienledem commented Jun 9, 2017

costimuraru commented Jun 12, 2017

lumost commented Aug 30, 2017 • edited

andredasilvapinto commented Sep 1, 2017 • edited

costimuraru commented Sep 1, 2017 • edited

andredasilvapinto commented Sep 1, 2017

andredasilvapinto commented Sep 1, 2017

costimuraru commented Sep 4, 2017 • edited

andredasilvapinto commented Sep 4, 2017

andredasilvapinto commented Sep 5, 2017 • edited

costimuraru commented Sep 5, 2017 • edited

costimuraru commented Sep 5, 2017

andredasilvapinto commented Sep 5, 2017

andredasilvapinto commented Sep 29, 2017

andredasilvapinto commented Nov 7, 2017

abelke commented Nov 13, 2017

qinghui-xu commented Feb 13, 2018

lumost commented Feb 13, 2018

BenoitHanotte commented Apr 17, 2018 • edited

BenoitHanotte commented Apr 24, 2018

lukasnalezenec commented Apr 24, 2018

julienledem commented Apr 24, 2018

chawlakunal commented Apr 24, 2018

chawlakunal commented Apr 27, 2018 • edited

BenoitHanotte commented Apr 29, 2018

chawlakunal commented Apr 30, 2018

chawlakunal commented Apr 30, 2018

BenoitHanotte commented May 1, 2018

chawlakunal commented May 1, 2018

BenoitHanotte commented May 3, 2018

CCv5 commented Oct 30, 2018

andredasilvapinto commented Oct 30, 2018 via email

CCv5 commented Nov 5, 2018

andredasilvapinto commented Nov 5, 2018

ccpstephanie commented Jul 23, 2021

Srb1996 commented May 8, 2024

wgtmac commented May 8, 2024

Srb1996 commented May 8, 2024 • edited

wgtmac commented May 8, 2024

Srb1996 commented May 8, 2024

qinghui-xu commented May 8, 2024

Srb1996 commented May 8, 2024

Srb1996 commented May 8, 2024

costimuraru commented Apr 29, 2017 •

edited

costimuraru commented May 5, 2017 •

edited

julienledem commented May 12, 2017 •

edited

costimuraru commented May 18, 2017 •

edited

qinghui-xu commented May 30, 2017 •

edited

lumost commented Aug 30, 2017 •

edited

andredasilvapinto commented Sep 1, 2017 •

edited

costimuraru commented Sep 1, 2017 •

edited

costimuraru commented Sep 4, 2017 •

edited

andredasilvapinto commented Sep 5, 2017 •

edited

costimuraru commented Sep 5, 2017 •

edited

BenoitHanotte commented Apr 17, 2018 •

edited

chawlakunal commented Apr 27, 2018 •

edited

Srb1996 commented May 8, 2024 •

edited