
Data changes during ingestion of Array<String> using druid-parquet-extensions #5433

Closed
code-ditya opened this issue Feb 27, 2018 · 3 comments

Comments

@code-ditya

I am using druid-0.11.0 with druid-avro-extension:0.11.0, druid-parquet-extension:0.10.0, and druid-hdfs-storage:0.11.0.

My data contains a column of type Array&lt;String&gt;. If the data for this column contained the value ["str1", "str2", "str3"], then after ingestion it became ["{\"element\": \"str1\"}", "{\"element\": \"str2\"}", "{\"element\": \"str3\"}"].

The source data is stored in HDFS as Parquet with Snappy compression, and the Parquet parser is used during ingestion. The issue persisted even after I changed the compression to Gzip.

When compression was removed in HDFS, the column's values after ingestion became ["{\"array_element\": \"str1\"}", "{\"array_element\": \"str2\"}", "{\"array_element\": \"str3\"}"].

The same data was previously ingested into Druid correctly when it was in Avro format with Gzip compression, using Druid's Avro parser.

I am attaching the ingestionSpec alongside.
druid_ingestion_schema.txt
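
For context on where the wrapper records likely come from: Parquet's standard three-level list encoding stores each array value inside a single-field repeated group, and the parquet-avro converter can surface that group as an Avro record instead of unwrapping it to the element type. A sketch of the Avro schema such a column may be read back as (the field and record names here are illustrative, chosen to match the element/array_element wrappers reported above):

{
  "name": "tags",
  "type": {
    "type": "array",
    "items": {
      "type": "record",
      "name": "list",
      "fields": [
        { "name": "element", "type": "string" }
      ]
    }
  }
}

JSON-serializing each wrapper record then yields strings like {"element": "str1"}, which matches the corrupted values.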

@saurabh3091

I am also facing the same issue, and it's a blocker for us. I would really appreciate it if somebody could provide an explanation or a fix.
Regards

gauravkumar37 added a commit to gauravkumar37/druid that referenced this issue Mar 3, 2018
Fixes apache#5433
This change makes the Parquet input row reader correctly handle the List data type.
@quiet-listener

quiet-listener commented May 12, 2018

"tuningConfig": {
"jobProperties":{
"parquet.avro.add-list-element-records":"false"
}
}

Try adding "parquet.avro.add-list-element-records": "false" under jobProperties in your ingestion spec file. It worked for me.
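
For reference, a minimal sketch of where this property sits in a Hadoop-based ingestion spec (the surrounding fields are illustrative placeholders, not taken from the attached spec):

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": { ... },
    "ioConfig": { ... },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "parquet.avro.add-list-element-records": "false"
      }
    }
  }
}

Entries under jobProperties are passed through to the Hadoop job configuration, where the parquet-avro reader picks this setting up.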

@code-ditya
Author

Thanks @quiet-listener. Setting this property resolved the issue.
