Evaluate Apache Carbondata INDEXED column store file format for genomics #1527

jpdna opened this Issue May 13, 2017 · 1 comment



jpdna commented May 13, 2017

Is anyone familiar with CarbonData?

This quote is interesting:

"Older file formats like Parquet and ORC in the Hadoop eco-system fail to cater equal efficiency to all domains of query like OLAP, Sequential and Random. While working with these two, we found them to be working suitably for big scans. Also they support HDFS to allow levering of existing Hadoop Cluster, but they have been found to be unsuitable in providing the sub second responses for primary key lookups, Olap style queries over big data involving filters and fetching of all columns of record. For Primary Key based fetching we will need to have indexes and that is one of the key features of CarbonData."

My thoughts: The HBase option we've looked at, while good for some use cases, is a heavyweight solution in terms of database administration and optimization. I'd like a file format more like Parquet, but one with indexing to allow the low latencies we need for visualization and interactive notebooks, as in Mango. Perhaps CarbonData could fill this need...

@fnothaft fnothaft added the wontfix label Jan 9, 2018




fnothaft commented Jan 9, 2018

No plans to support this; closing as won't fix.

@fnothaft fnothaft closed this Jan 9, 2018

@heuermh heuermh added this to the 0.24.0 milestone Jan 9, 2018

@heuermh heuermh added this to Completed in Release 0.24.0 Feb 10, 2018
