Shark 0.9.1

Released by @harveyfeng on 10 Apr 22:43 · 30 commits to master since this release

Release date: April 10, 2014

Shark 0.9.1 is a maintenance release that stabilizes 0.9.0, which upgraded Scala compatibility to 2.10.3 and Hive compatibility to 0.11. The core dependencies for this version are:

  • Scala 2.10.3
  • Spark 0.9.1
  • AMPLab’s Hive 0.9.0
  • (Optional) Tachyon 0.4.1

Hive Compatibility

We’ve extensively upgraded the Shark codebase to be Hive 0.11 compliant. Existing users can now run Shark as a drop-in replacement against existing Hive 0.11 metastores.
Two major components added during this upgrade are support for new windowing and analytics functions, and SharkServer2. More detail is available in the respective sections below.

Analytics Functions

Windowing functions

Shark now supports the windowing functions added by HIVE-896. All supported window functions follow the semantics defined by the SQL standard.
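To illustrate what a window function computes, here is a minimal Python sketch (not Shark code) that emulates SUM(c) OVER (PARTITION BY a ORDER BY b ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). The function name running_sum, the column names, and the sample rows are assumptions made for this example only.

```python
from itertools import groupby
from operator import itemgetter


def running_sum(rows, partition_key, order_key, value_key):
    """Emulate a running SUM over each partition, ordered by order_key.

    Unlike GROUP BY, every input row is kept; each row simply gains
    an extra column holding the sum of all rows up to and including it.
    """
    out = []
    # Sort by partition first, then by the ordering column within it.
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    for _, partition in groupby(ordered, key=itemgetter(partition_key)):
        total = 0
        for row in partition:
            total += row[value_key]
            out.append({**row, "running_sum": total})
    return out


# Hypothetical sample data: two partitions ("x" and "y").
rows = [
    {"a": "x", "b": 2, "c": 10},
    {"a": "y", "b": 1, "c": 7},
    {"a": "x", "b": 1, "c": 5},
]
result = running_sum(rows, "a", "b", "c")
for row in result:
    print(row)
```

The key point is that the output has as many rows as the input, which is what distinguishes a window function from a plain aggregation.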

Rollups

Shark also supports enhanced aggregation in the form of rollups. This feature allows users to compute aggregations over multiple groups easily and efficiently. For example, the following query uses the new GROUPING SETS clause:

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ((a, b), a)

The above query is equivalent to running multiple aggregations as follows:

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b
UNION ALL
SELECT a, NULL, SUM(c) FROM tab1 GROUP BY a
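The equivalence above can be sketched in plain Python (again, not Shark code). The helper grouping_sets and the sample contents of tab1 are hypothetical; the sketch evaluates each grouping set separately and concatenates the results, filling columns that are absent from a set with None in place of NULL.

```python
from collections import defaultdict


def grouping_sets(rows, sets, group_cols=("a", "b"), value_col="c"):
    """Sum value_col once per grouping set and concatenate the results.

    Columns that are not part of the current grouping set are reported
    as None, mirroring the NULL in the UNION ALL form of the query.
    """
    out = []
    for cols in sets:
        totals = defaultdict(int)
        for row in rows:
            key = tuple(row[col] if col in cols else None for col in group_cols)
            totals[key] += row[value_col]
        out.extend(key + (total,) for key, total in totals.items())
    return out


# Hypothetical contents of tab1.
tab1 = [
    {"a": 1, "b": "p", "c": 10},
    {"a": 1, "b": "q", "c": 20},
    {"a": 2, "b": "p", "c": 5},
]
result = grouping_sets(tab1, [("a", "b"), ("a",)])
for line in result:
    print(line)
```

This sketch makes one pass over the data per grouping set, which is the naive UNION ALL strategy; it says nothing about how Shark actually executes the GROUPING SETS clause.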

SharkServer2

SharkServer2 is an improved Thrift server that’s compatible with the HiveServer2 introduced in Hive 0.11. SharkServer2 supports concurrent client connections and query executions, with the same semantics as HiveServer2.
To start a SharkServer2:

$ bin/shark -service sharkserver2

To connect to the server from remote clients, you can use JDBC with the network address and port that the server is listening on. For example, to use the Beeline CLI:

$ bin/beeline
beeline> !connect jdbc:hive2://localhost:10000/default

Usability

  • Tables named <table name>_cached are now cached in the MEMORY_ONLY ephemeral layer (the Spark block manager), which is consistent with pre-0.8.0 behavior. Previously, Shark used MEMORY, which added latency to DDL commands because of writes to both persistent and ephemeral storage.
  • CACHE <table name> IN <cache type> can be used to specify the cache layer for a table. This is equivalent to ALTER TABLE <table name> SET TBLPROPERTIES('shark.cache'='<cache type>'). <cache type> can be MEMORY, MEMORY_ONLY, or TACHYON.

Maven Central and Easier Deployment

To simplify deployment and installation, we’ve uploaded all AMPLab Hive and Shark binaries to Maven Central under the edu.berkeley.cs.shark organization. HIVE_HOME is now obsolete, and Hive binary downloads are no longer required to begin running Shark. Instead, simply download the Shark binaries, and execute SHARK_HOME/bin/shark.

To include Shark as a dependency in your application:
For an sbt build file:

libraryDependencies ++= Seq("edu.berkeley.cs.shark" %% "shark" % "0.9.1")

For Maven, in the dependencies section in pom.xml:

<dependency>
  <groupId>edu.berkeley.cs.shark</groupId>
  <artifactId>shark</artifactId>
  <version>0.9.1</version>
</dependency>

Query Execution and Performance Improvements

  • Delta encoding for int and long primitives stored in columnar format. To save memory, we store only the differences between consecutive values in each int or long column.
  • Table scans over Hive-partitioned tables (i.e., tables created using a PARTITIONED BY clause) now broadcast a single configuration for each table scan, as opposed to a number of broadcasts linear in the number of partitions for that table.
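
The idea behind the delta encoding mentioned above can be sketched in a few lines of Python. This is an illustration of the general technique only, assuming the first value is stored as-is and every later value as a difference from its predecessor; it does not reflect Shark's actual in-memory byte layout, and the function names are hypothetical.

```python
def delta_encode(values):
    """Keep the first value, then store each value as a difference
    from the one before it."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original column with a running sum over the deltas."""
    values = []
    total = 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


# Hypothetical long column with slowly changing values, e.g. timestamps.
column = [1000, 1002, 1003, 1003, 1010]
encoded = delta_encode(column)
decoded = delta_decode(encoded)
print(encoded)
print(decoded == column)
```

The saving comes from the deltas usually being much smaller in magnitude than the raw values, so they can be held in fewer bytes when neighbouring values are close together.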

Download Links

Shark with Hadoop 1
Shark with Hadoop 2 (cdh5)

Credits

Michael Armbrust - SharkServer bugfix, Scala 2.10 upgrade
Oleg Danilov - Hive 0.11 upgrade, bug fixes
Aaron Davidson - Tachyon API revamp, improved caching semantics
Harvey Feng - Hive 0.11, Spark 0.9 upgrade, release manager
Cheng Hao - Windowing functions, join refactor
Nandu Jayakumar - Delta encoding
Andy Konwinski - Build script fix
Steven Leung - Bug fix for partitioned table stats
ChengXiang Li - Yarn compatibility
Antonio Lupher - Hive 0.11 upgrade, lateral view improvements
Sundeep Narravula - Job cancellation using JDBC
Brian O’Neill - Build fix
Kay Ousterhout - Improved logging messages
Ahir Reddy - Python support
Sun Rui - Testing, analytic function support
Sergey Soldatov - Hive 0.11 upgrade, serialization bug fix
Henry Wang - SharkServer2 addition
Reynold Xin - SparkConf integration
Tian Yi - Combiner bug fix
Yury Yudin - Hive 0.11 support

Thanks to everyone who contributed!