Evaluate Apache Carbondata INDEXED column store file format for genomics #1527
This quote is interesting:
"Older file formats like Parquet and ORC in the Hadoop eco-system fail to cater equal efficiency to all domains of query like OLAP, Sequential and Random. While working with these two, we found them to be working suitably for big scans. Also they support HDFS to allow levering of existing Hadoop Cluster, but they have been found to be unsuitable in providing the sub second responses for primary key lookups, Olap style queries over big data involving filters and fetching of all columns of record. For Primary Key based fetching we will need to have indexes and that is one of the key features of CarbonData."
My thoughts: The HBase option we've looked at, while good for some use cases, is a heavyweight solution in terms of database administration and optimization. I'd like a file format more like Parquet, but one with indexing to give the low latencies we need for visualization and interactive notebooks, as in Mango. Perhaps CarbonData could fill this need.
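To make the indexing argument concrete, here is a minimal sketch of why a per-block min/max index helps with region lookups on sorted data. It is purely illustrative and is not CarbonData's actual index layout or API. The function names (`build_block_index`, `query_range`), the block size, and the toy positions are all assumptions made up for this example.

```python
from bisect import bisect_left, bisect_right


def build_block_index(sorted_positions, block_size):
    """Split sorted positions into blocks and record each block's min and max."""
    blocks = [sorted_positions[i:i + block_size]
              for i in range(0, len(sorted_positions), block_size)]
    mins = [b[0] for b in blocks]
    maxs = [b[-1] for b in blocks]
    return blocks, mins, maxs


def query_range(blocks, mins, maxs, start, end):
    """Return positions in [start, end], reading only blocks that can overlap."""
    first = bisect_left(maxs, start)  # first block whose max >= start
    last = bisect_right(mins, end)    # one past the last block whose min <= end
    hits = [p for b in blocks[first:last] for p in b if start <= p <= end]
    return hits, max(0, last - first)  # matches, number of blocks read


# Hypothetical toy data: 100 sorted genomic positions in blocks of 10.
positions = list(range(0, 1000, 10))
blocks, mins, maxs = build_block_index(positions, block_size=10)
hits, blocks_read = query_range(blocks, mins, maxs, 250, 300)
print(hits, "read", blocks_read, "of", len(blocks), "blocks")
```

Without the index, the same query would have to scan every block. With it, only the blocks whose range can overlap the requested region are touched, which is the property we'd want for interactive region queries.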