Druid-parquet-extensions fails on timestamps (stored as INT96) in parquet files #5150
Hi @amalakar, it seems like this is due to Parquet's parquet-to-avro converter (which Druid uses for reading Parquet files) not supporting int96. I'm not sure what the best fix is. Maybe upgrading parquet would help if it's been fixed in a newer version. Or maybe using a Parquet reading strategy that doesn't involve conversion to Avro. If you're familiar enough with Parquet to help contribute the latter, you could give that a shot.
Looks like it is not fixed in newer parquet - https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L279 The parquet-avro repo also seems to have had no commits in the last year. Getting rid of the avro converter in the druid parquet extension looks like the right path to me. I will probably take a stab at it, but I am very new to druid, so I'm not familiar with the code base and interfaces. Also, I will probably only get to it early next year.
I haven't made any progress on this issue.
Is there an existing workaround if you do not need that column? It doesn't seem like excluding the dimension helps.
@shaharck in our case we ended up converting to CSV and then importing into Druid, which is less than ideal but unblocked us for now.
Thanks @amalakar. Actually, if I exclude the field it does seem to work; I just had to find a different field for the timestamp.
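For anyone hitting the same wall: a sketch of what that workaround might look like in a Hadoop ingestion spec, assuming the 0.11-era parquet parser. The field names here (`event_time_millis`, `country`, `device`) are hypothetical placeholders; the key idea is that the INT96 column is simply not listed anywhere, and `timestampSpec` points at a different, supported field.

```json
"parser": {
  "type": "parquet",
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": {
      "column": "event_time_millis",
      "format": "millis"
    },
    "dimensionsSpec": {
      "dimensions": ["country", "device"]
    }
  }
}
```

Because the INT96 column is never referenced, the Avro conversion path should never be asked to handle it.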
I am trying to use the druid-parquet-extensions to ingest parquet data into druid. As per my understanding, parquet uses INT96 as the datatype for timestamps:
optional int96 logged_at
Trying to read these parquet files in a hadoop index job throws an error that INT96 is not supported. It looks like the INT96 field may need to be read as a byte array. Has anyone seen a similar error, or have a suggestion?
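The byte-array angle is plausible: the 12-byte INT96 timestamp layout used by Impala and Hive is little-endian, with 8 bytes of nanoseconds-within-the-day followed by a 4-byte Julian day number. As a minimal illustration (not Druid code, just the decoding arithmetic), here is a Python sketch of how such a field could be interpreted once it has been read as raw bytes; the function name is hypothetical:

```python
import struct
import datetime

# Julian day number of the Unix epoch (1970-01-01).
JULIAN_DAY_OF_UNIX_EPOCH = 2440588

def int96_to_datetime(raw: bytes) -> datetime.datetime:
    """Decode a 12-byte Parquet INT96 timestamp.

    Layout (little-endian): 8 bytes of nanoseconds within the day,
    followed by 4 bytes holding the Julian day number.
    """
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    epoch_seconds = (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * 86400
    return datetime.datetime.fromtimestamp(
        epoch_seconds + nanos_of_day / 1_000_000_000,
        tz=datetime.timezone.utc,
    )

# Example: Julian day 2440588 (1970-01-01), one second past midnight.
raw = struct.pack("<qi", 1_000_000_000, 2440588)
print(int96_to_datetime(raw))  # 1970-01-01 00:00:01+00:00
```

A proper fix in the extension would do this conversion on the Parquet read path instead of handing the INT96 value to the Avro schema converter.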
I am using druid 0.11.0; here is the detailed stack trace: