
Druid-parquet-extensions fails on timestamps (stored as INT96) in parquet files #5150

Closed
amalakar opened this issue Dec 11, 2017 · 7 comments


@amalakar
Contributor

I am trying to use druid-parquet-extensions to ingest parquet data into Druid. As per my understanding, parquet uses INT96 as the datatype for timestamps:

optional int96 logged_at

Trying to read these parquet files in a hadoop index job throws an error that INT96 is not supported. It looks like the INT96 field may need to be read as a byte array. Has anyone seen a similar error, or have any suggestions?
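For reference, a minimal sketch of how such a raw INT96 value could be decoded, assuming the Impala/Hive layout (8 little-endian bytes of nanoseconds-of-day followed by 4 bytes of Julian day); the class and method names here are hypothetical:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.util.concurrent.TimeUnit;

    public class Int96Timestamps
    {
      // Julian day number of the Unix epoch (1970-01-01).
      private static final long JULIAN_EPOCH_DAY = 2440588L;

      // Decodes a 12-byte INT96 value into epoch milliseconds.
      public static long toEpochMillis(byte[] int96)
      {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();                       // first 8 bytes
        long julianDay = Integer.toUnsignedLong(buf.getInt()); // last 4 bytes
        return TimeUnit.DAYS.toMillis(julianDay - JULIAN_EPOCH_DAY)
               + TimeUnit.NANOSECONDS.toMillis(nanosOfDay);
      }
    }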

I am using Druid 0.11.0; here is the detailed stack trace:

Caused by: java.lang.IllegalArgumentException: INT96 not yet implemented.
    at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:223) ~[parquet-column-1.8.2.jar:1.8.2]
    at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:263) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:241) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:231) ~[parquet-avro-1.8.2.jar:0.11.0.1]
    at org.apache.parquet.avro.DruidParquetReadSupport.prepareForRead(DruidParquetReadSupport.java:98) ~[druid-parquet-extensions-0.11.0.1.jar:0.11.0.1]
    at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:175) ~[parquet-hadoop-1.8.2.jar:1.8.2]
    at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:190) ~[parquet-hadoop-1.8.2.jar:1.8.2]
    at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:147) ~[parquet-hadoop-1.8.2.jar:1.8.2]
    at org.apache.hadoop.mapreduce.lib.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:84) ~[hadoop-mapreduce-client-core-2.7.3.2.5.3.0-37.jar:?]
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548) ~[hadoop-mapreduce-client-core-2.7.3.2.5.3.0-37.jar:?]
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786) ~[hadoop-mapreduce-client-core-2.7.3.2.5.3.0-37.jar:?]
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) ~[hadoop-mapreduce-client-core-2.7.3.2.5.3.0-37.jar:?]
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243) ~[hadoop-mapreduce-client-common-2.7.3.2.5.3.0-37.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_141]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_141]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_141] 
@gianm
Contributor

gianm commented Dec 12, 2017

Hi @amalakar, it seems like this is due to Parquet's parquet-to-avro converter (which Druid uses for reading Parquet files) not supporting int96. I'm not sure what the best fix is. Maybe upgrading parquet would help if it's been fixed in a newer version. Or maybe using a Parquet reading strategy that doesn't involve conversion to Avro. If you're familiar enough with Parquet to help contribute the latter, you could give that a shot.
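For what it's worth, a minimal sketch of that non-Avro strategy using parquet-hadoop's example Group API, which reads records without going through AvroSchemaConverter. This is only an illustration under that assumption, not how the extension is currently wired up, and the class name is hypothetical:

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    public class GroupDump
    {
      public static void main(String[] args) throws Exception
      {
        // Read records as generic Groups, bypassing the parquet-avro schema conversion.
        try (ParquetReader<Group> reader =
                 ParquetReader.builder(new GroupReadSupport(), new Path(args[0])).build()) {
          Group record;
          while ((record = reader.read()) != null) {
            // INT96 columns come through as 12-byte binary values that can be decoded manually.
            System.out.println(record);
          }
        }
      }
    }

A reader along these lines would still need Druid-side mapping from Groups to input rows, which is the non-trivial part.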

@amalakar
Contributor Author

Looks like it is not fixed in newer parquet - https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L279

The parquet-avro module also seems to have had no commits in the last year. Getting rid of the Avro converter in the Druid parquet extension looks like the right path to me. I will probably take a stab at it, but I am very new to Druid and not familiar with the code base and interfaces, so I will probably only get to it early next year.

@bohemia420

Is this resolved? @amalakar how did you remove the Avro converter? @gianm how does one change the parquet reading strategy?

@amalakar
Contributor Author

amalakar commented Aug 3, 2018

I haven't made any progress on this issue.

@shaharck

Is there an existing workaround if you do not need that column? It doesn't seem like excluding the dimension helps.

@amalakar
Contributor Author

@shaharck in our case we ended up converting to CSV and then importing into Druid, which is less than ideal but unblocked us for now.

@shaharck

Thanks @amalakar. Actually, if I exclude the field it does seem to work; I just had to find a different field for the timestamp.
