Sub-partitioning of Parquet file for ADAM #1003

Closed
jpdna opened this Issue Apr 18, 2016 · 3 comments

4 participants

jpdna commented Apr 18, 2016

The Spark SQL programming guide describes an optimization for Parquet usage that involves splitting a Parquet file into directories corresponding to different column values (linked in the original post).

This issue is meant as a place to discuss this topic and to determine whether we should prototype such a Parquet directory layout, for example by dividing the Parquet file into individual files per chromosome.
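To make the proposed layout concrete, here is a minimal sketch of the directory-per-column-value idea using only the Python standard library. It is not ADAM or Spark code: JSON-lines files stand in for Parquet part files, and the column name `contigName`, the sample records, and both helper functions are assumptions made up for illustration. In Spark itself the equivalent write would presumably go through `DataFrameWriter.partitionBy(...)` followed by `.parquet(...)`.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(records, root, column):
    """Write records into one sub-directory per distinct value of `column`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[column]].append(rec)
    for value, rows in groups.items():
        # Hive-style layout: <root>/<column>=<value>/part-00000.jsonl
        part_dir = Path(root) / f"{column}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-00000.jsonl", "w") as fh:
            for row in rows:
                # The partition value is encoded in the directory name,
                # so it is dropped from the rows stored in the file.
                stored = {k: v for k, v in row.items() if k != column}
                fh.write(json.dumps(stored) + "\n")


def read_partition(root, column, value):
    """Read a single partition; directories for other values are never opened."""
    part_dir = Path(root) / f"{column}={value}"
    rows = []
    for path in sorted(part_dir.glob("part-*.jsonl")):
        with open(path) as fh:
            for line in fh:
                row = json.loads(line)
                row[column] = value  # restore the column from the directory name
                rows.append(row)
    return rows


# Hypothetical alignment-like records (not real ADAM data).
reads = [
    {"contigName": "chr1", "start": 100, "readName": "r1"},
    {"contigName": "chr2", "start": 50, "readName": "r2"},
    {"contigName": "chr1", "start": 250, "readName": "r3"},
]

root = tempfile.mkdtemp()
write_partitioned(reads, root, "contigName")
layout = sorted(p.name for p in Path(root).iterdir())
chr1_reads = read_partition(root, "contigName", "chr1")
```

The point of the layout is that a query restricted to one chromosome only has to touch that chromosome's directory, which is the pruning benefit the guide describes.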

I look forward to any comments and/or links to earlier discussions of this topic.

fnothaft commented Jul 6, 2016

Closing as dupe of #651.

@fnothaft fnothaft closed this Jul 6, 2016

@fnothaft fnothaft added the duplicate label Jul 6, 2016

heuermh commented (comment minimized; content not shown)

tomwhite commented Jul 22, 2016

@jpdna @heuermh that's pretty old now - I think using Spark to do the partitioning is the way forward, and Impala supports nested types so flattening is not necessary. See #651 (comment)
