Shark 0.8.1

@harveyfeng harveyfeng released this Feb 3, 2014 · 7 commits to branch-0.8 since this release

Release date: Jan 15, 2014

Shark 0.8.1 introduces set of performance, maintenance and usability features, with emphasis on improved Hive compatibility, Tachyon support, Spark integration, and table generating functions. This release requires

  • Scala 2.9.3
  • Spark 0.8.1
  • AMPLab's Hive 0.9 distribution. Binaries are provided in the hive-0.9.0-bin.tgz shipped with this release.

Caching Semantics

To simplify caching and table recovery semantics, we've implemented a write-through cache as the default for in-memory tables (i.e., tables created with _cached or with the shark.cache table property set to MEMORY).

Any table data written to the in-memory, columnar cache is synchronized with the backing, fault-tolerant store specified by the Hive warehouse directory (e.g., HDFS). Since table metadata and in-memory data are both persistent, such tables can now be automatically recovered across Shark session restarts.

Additional notes on table caching semantics:

  • You can now create a cached, MEMORY table by simply caching the underlying table:
    CACHE <table_name>
  • Append operations (i.e., using INSERT, LOAD) on MEMORY tables may be slower due to the additional write to persistent store.
  • Tables targeted with the CACHE command and created with the _cached name suffix are always pinned at the MEMORY level. To revert to the ephemeral scheme offered in v0.8.0 and prior, create a table with shark.cache table property set to MEMORY_ONLY and a name that does not include the _cached suffix.

Partitioned Tables

Users are now able to create and cache partitioned tables. Different from RDD partitions that correspond to Hadoop splits, Hive "partitions" are analogous to indexes. Each partition is represented by an RDD and identifiable by the set of runtimes values for virtual partitioning columns that specified at table creation.

In-memory partitioned tables also adhere to partition-level cache policies, which can be toggled through the shark.cache.policy table property and customized by implementing the CachePolicy interface (an LRU implementation is provided).

During query execution, Shark uses partitioning keys to automatically filter input partitions. This feature can is be combined with RDD-partition level pruning on non-partitioned columns to further decrease the amount of data that needs to be fetched and scanned.

Tachyon Support

The complete set of commands supported for in-memory Shark tables stored in the Spark-managed heap are now supported for Tachyon-backed tables as well. This includes Hive-partitioned tables and table recovery features added in this 0.8.1 release.

Spark Integration

Stability and usability improvements have been added to reduce friction in converting between native Spark RDDs and Shark tables. A key pair of features are SharkContext’s sqlRdd() functions and rddToTable implicit conversions, both of which can automatically deduce data types and update necessary metadata for transitions between RDDs and Shark tables. Both can be tested by launching a Shark shell (SHARK_HOME/bin/shark-shell).

Table Generating Functions (TGFs)

Shark can now call into libraries that generate tables through TGFs. This enables Shark to easily access external libraries, such as Spark’s machine learning library (MLLIB).

Calls can be made into TGFs by executing GENERATE tgf_name(params) or GENERATE tgf_name(params) SAVE table_name. TGFs are flexible and can take arbitrary tables and parameters as inputs and produce a new table with an accompanying schema.

Other improvements

  • To reduce the overhead for Hive-partitioned table scans, job configurations are only broadcasted once and shared throughout the entire read operation over a partitioned table. Previously, these configuration variables were broadcasted once per partition.
  • Commands that use COUNT DISTINCT operations, but don’t include grouping keys, are automatically rewritten to generate query plans that can take advantage of multiple reducers (set through the mapred.reduce.tasks property) and increased parallelism. This eliminates the previous single-reducer bottleneck.

Credits

Michael Armbrust - test util improvements
Harvey Feng - Tachyon support, caching semantics, partitioned table, release manager
Ali Ghodsi - table generating functions
Mark Hamstra - build fix
Cheng Hao - work on removing Hive operator dependencies
Nandu Jayakumar - code and style cleanup
Andy Konwinski - build script fix
Haoyuan Li - Tachyon integration
Xi Liu - byte buffer overflow bug fix
Sundeep Narravula - support for database namespaces for cached tables, code cleanup
Patrick Wendell - bug fix
Reynold Xin - caching semantics, Spark integration, miscellaneous bug fixes

Thanks to everyone who contributed!