Release date: Jan 15, 2014
Shark 0.8.1 introduces set of performance, maintenance and usability features, with emphasis on improved Hive compatibility, Tachyon support, Spark integration, and table generating functions. This release requires
- Scala 2.9.3
- Spark 0.8.1
- AMPLab's Hive 0.9 distribution. Binaries are provided in the
hive-0.9.0-bin.tgzshipped with this release.
To simplify caching and table recovery semantics, we've implemented a write-through cache as the default for in-memory tables (i.e., tables created with
_cached or with the
shark.cache table property set to
Any table data written to the in-memory, columnar cache is synchronized with the backing, fault-tolerant store specified by the Hive warehouse directory (e.g., HDFS). Since table metadata and in-memory data are both persistent, such tables can now be automatically recovered across Shark session restarts.
Additional notes on table caching semantics:
- You can now create a cached,
MEMORYtable by simply caching the underlying table:
- Append operations (i.e., using
MEMORYtables may be slower due to the additional write to persistent store.
- Tables targeted with the
CACHEcommand and created with the
_cachedname suffix are always pinned at the
MEMORYlevel. To revert to the ephemeral scheme offered in v0.8.0 and prior, create a table with
shark.cachetable property set to
MEMORY_ONLYand a name that does not include the
Users are now able to create and cache partitioned tables. Different from RDD partitions that correspond to Hadoop splits, Hive "partitions" are analogous to indexes. Each partition is represented by an RDD and identifiable by the set of runtimes values for virtual partitioning columns that specified at table creation.
In-memory partitioned tables also adhere to partition-level cache policies, which can be toggled through the
shark.cache.policy table property and customized by implementing the CachePolicy interface (an LRU implementation is provided).
During query execution, Shark uses partitioning keys to automatically filter input partitions. This feature can is be combined with RDD-partition level pruning on non-partitioned columns to further decrease the amount of data that needs to be fetched and scanned.
The complete set of commands supported for in-memory Shark tables stored in the Spark-managed heap are now supported for Tachyon-backed tables as well. This includes Hive-partitioned tables and table recovery features added in this 0.8.1 release.
Stability and usability improvements have been added to reduce friction in converting between native Spark RDDs and Shark tables. A key pair of features are
sqlRdd() functions and
rddToTable implicit conversions, both of which can automatically deduce data types and update necessary metadata for transitions between RDDs and Shark tables. Both can be tested by launching a Shark shell (
Table Generating Functions (TGFs)
Shark can now call into libraries that generate tables through TGFs. This enables Shark to easily access external libraries, such as Spark’s machine learning library (MLLIB).
Calls can be made into TGFs by executing
GENERATE tgf_name(params) or
GENERATE tgf_name(params) SAVE table_name. TGFs are flexible and can take arbitrary tables and parameters as inputs and produce a new table with an accompanying schema.
- To reduce the overhead for Hive-partitioned table scans, job configurations are only broadcasted once and shared throughout the entire read operation over a partitioned table. Previously, these configuration variables were broadcasted once per partition.
- Commands that use
COUNT DISTINCToperations, but don’t include grouping keys, are automatically rewritten to generate query plans that can take advantage of multiple reducers (set through the
mapred.reduce.tasksproperty) and increased parallelism. This eliminates the previous single-reducer bottleneck.
Michael Armbrust - test util improvements
Harvey Feng - Tachyon support, caching semantics, partitioned table, release manager
Ali Ghodsi - table generating functions
Mark Hamstra - build fix
Cheng Hao - work on removing Hive operator dependencies
Nandu Jayakumar - code and style cleanup
Andy Konwinski - build script fix
Haoyuan Li - Tachyon integration
Xi Liu - byte buffer overflow bug fix
Sundeep Narravula - support for database namespaces for cached tables, code cleanup
Patrick Wendell - bug fix
Reynold Xin - caching semantics, Spark integration, miscellaneous bug fixes
Thanks to everyone who contributed!