
Druid-parquet-extensions fails on timestamps (stored as INT96) in parquet files #5150

Closed
amalakar opened this issue Dec 11, 2017 · 7 comments


@amalakar
Contributor

I am trying to use druid-parquet-extensions to ingest parquet data into Druid. As per my understanding, parquet uses INT96 as the datatype for timestamps:

optional int96 logged_at

Trying to read these parquet files in a hadoop index job throws an error that INT96 is not supported. It looks like the INT96 field may need to be read as a byte array. Has anyone seen a similar error, or have any suggestions?
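For reference, a minimal sketch of how such a raw INT96 value could be decoded, assuming the Impala/Hive layout (8 little-endian bytes of nanoseconds-of-day followed by 4 bytes of Julian day); the class and method names here are hypothetical:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.util.concurrent.TimeUnit;

    public class Int96Timestamps
    {
      // Julian day number of the Unix epoch (1970-01-01).
      private static final long JULIAN_EPOCH_DAY = 2440588L;

      // Decodes a 12-byte INT96 value into epoch milliseconds.
      public static long toEpochMillis(byte[] int96)
      {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();                       // first 8 bytes
        long julianDay = Integer.toUnsignedLong(buf.getInt()); // last 4 bytes
        return TimeUnit.DAYS.toMillis(julianDay - JULIAN_EPOCH_DAY)
               + TimeUnit.NANOSECONDS.toMillis(nanosOfDay);
      }
    }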

I am using Druid 0.11.0; here is the detailed stack trace:

Caused by: java.lang.IllegalArgumentException: INT96 not yet implemented.
    at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:223) ~[parquet-column-1.8.2.jar:1.8.2]
    at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:263) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:241) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:231) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.avro.DruidParquetReadSupport.prepareForRead(DruidParquetReadSupport.java:98) ~[druid-parquet-extensions-0.11.0.1.jar:0.11.0.1]
    at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:175) ~[parquet-hadoop-1.8.2.jar:1.8.2]
    at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:190) ~[parquet-hadoop-1.8.2.jar:1.8.2]
    at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:147) ~[parquet-hadoop-1.8.2.jar:1.8.2]
    at org.apache.hadoop.mapreduce.lib.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:84) ~[hadoop-mapreduce-client-core-2.7.3.2.5.3.0-37.jar:?]
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548) ~[hadoop-mapreduce-client-core-2.7.3.2.5.3.0-37.jar:?]
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786) ~[hadoop-mapreduce-client-core-2.7.3.2.5.3.0-37.jar:?]
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) ~[hadoop-mapreduce-client-core-2.7.3.2.5.3.0-37.jar:?]
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243) ~[hadoop-mapreduce-client-common-2.7.3.2.5.3.0-37.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_141]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_141]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_141] 
@gianm
Contributor

gianm commented Dec 12, 2017

Hi @amalakar, it seems like this is due to Parquet's parquet-to-avro converter (which Druid uses for reading Parquet files) not supporting int96. I'm not sure what the best fix is. Maybe upgrading parquet would help if it's been fixed in a newer version. Or maybe using a Parquet reading strategy that doesn't involve conversion to Avro. If you're familiar enough with Parquet to help contribute the latter, you could give that a shot.
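For what it's worth, a minimal sketch of that non-Avro strategy using parquet-hadoop's example Group API, which reads records without going through AvroSchemaConverter. This is only an illustration under that assumption, not how the extension is currently wired up, and the class name is hypothetical:

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    public class GroupDump
    {
      public static void main(String[] args) throws Exception
      {
        // Read records as generic Groups, bypassing the parquet-avro schema conversion.
        try (ParquetReader<Group> reader =
                 ParquetReader.builder(new GroupReadSupport(), new Path(args[0])).build()) {
          Group record;
          while ((record = reader.read()) != null) {
            // INT96 columns come through as 12-byte binary values that can be decoded manually.
            System.out.println(record);
          }
        }
      }
    }

A reader along these lines would still need Druid-side mapping from Groups to input rows, which is the non-trivial part.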

@amalakar
Contributor Author

Looks like it is not fixed in newer parquet - https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L279

The parquet-avro module also seems to have had no commits in the last year. Getting rid of the Avro converter in the Druid parquet extension looks like the right path to me. I will probably take a stab at it, but I am very new to Druid and not familiar with the code base and interfaces, so I will probably only get to it early next year.

@bohemia420

Is this resolved? @amalakar how did you remove the Avro converter? @gianm how does one change the parquet reading strategy?

@amalakar
Contributor Author

amalakar commented Aug 3, 2018

I haven't made any progress on this issue.

@shaharck

Is there an existing workaround if you do not need that column? It doesn't seem like excluding the dimension helps.

@amalakar
Contributor Author

@shaharck in our case we ended up converting to CSV and then importing into Druid, which is less than ideal but unblocked us for now.

@shaharck

Thanks @amalakar. Actually, if I exclude the field it does seem to work; I just had to find a different field for the timestamp.
