Compatibility with Apache Hive

pwendell edited this page Oct 13, 2012 · 5 revisions

Deploying in Existing Hive Warehouses

Shark is designed to be "out of the box" compatible with existing Hive installations. When you deploy Shark in a cluster, you do not need to modify your existing Hive Metastore or change the data placement or partitioning of your tables.

Supported Hive Features

Shark supports the vast majority of Hive features, such as:

  • Hive query statements, including:
      • SELECT
      • GROUP BY
      • ORDER BY
      • CLUSTER BY
      • SORT BY
  • All Hive operators, including:
      • Relational operators (=, <=>, ==, <>, <, >, >=, <=, etc.)
      • Arithmetic operators (+, -, *, /, %, etc.)
      • Logical operators (AND, &&, OR, ||, etc.)
      • Complex type constructors
      • Mathematical functions (sign, ln, cos, etc.)
      • String functions (instr, length, printf, etc.)
  • User defined functions (UDFs)
  • User defined aggregation functions (UDAFs)
  • User defined serialization formats (SerDes)
  • Joins, including:
      • JOIN
      • {LEFT|RIGHT|FULL} OUTER JOIN
      • LEFT SEMI JOIN
      • CROSS JOIN
  • Unions
  • Sub-queries, e.g. SELECT col FROM (SELECT a + b AS col FROM t1) t2
  • Sampling
  • Explain
  • Partitioned tables
  • All Hive DDL statements, including:
      • CREATE TABLE
      • CREATE TABLE AS SELECT
      • ALTER TABLE
  • All Hive data types, including:
      • TINYINT
      • SMALLINT
      • INT
      • BIGINT
      • BOOLEAN
      • FLOAT
      • DOUBLE
      • STRING
      • BINARY
      • TIMESTAMP
      • ARRAY<>
      • MAP<>
      • STRUCT<>
      • UNIONTYPE<>
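
For example, a query combining several of the supported features above (a join, GROUP BY, SORT BY, and CREATE TABLE AS SELECT) runs on Shark unmodified. The table and column names below are hypothetical, purely for illustration:

```sql
-- Hypothetical tables: page_views(user_id, url) and users(user_id, country).
-- Standard HiveQL with a join, aggregation, and SORT BY:
SELECT u.country, COUNT(*) AS views
FROM page_views pv
JOIN users u ON (pv.user_id = u.user_id)
GROUP BY u.country
SORT BY views DESC;

-- CREATE TABLE AS SELECT works the same way:
CREATE TABLE country_views AS
SELECT u.country, COUNT(*) AS views
FROM page_views pv
JOIN users u ON (pv.user_id = u.user_id)
GROUP BY u.country;
```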

Unsupported Hive Functionality

Below is a list of Hive features that we don't support yet. The vast majority of these features are rarely used in Hive deployments.

Major Hive Features

  • Tables with buckets: bucketing is hash partitioning within a Hive table partition. Shark does not support buckets yet.
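
For reference, a bucketed table is one declared with a CLUSTERED BY ... INTO n BUCKETS clause in its DDL. The schema below is a hypothetical example of such a declaration:

```sql
-- Hive DDL for a bucketed table (hypothetical schema);
-- Shark does not support the bucketing on tables like this:
CREATE TABLE page_views (user_id INT, url STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
```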

Esoteric Hive Features

  • Tables with partitions using different input formats: in Shark, all partitions of a table must use the same input format.
  • Non-equi outer join: for the uncommon use case of an outer join with a non-equi join condition (e.g. the condition "key < 10"), Shark outputs incorrect results for the NULL tuples.
  • Unique join
  • Single query multi insert
  • Column statistics collection: Shark does not currently piggyback scans to collect column statistics.
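
To illustrate the non-equi outer join caveat above, a query of the following shape (with hypothetical tables a and b) is the kind for which Shark can return incorrect NULL-padded rows:

```sql
-- Outer join whose ON clause carries a non-equi condition (b.key < 10);
-- Shark may produce wrong results for the NULL tuples of this join:
SELECT a.key, b.value
FROM a LEFT OUTER JOIN b
ON (a.key = b.key AND b.key < 10);
```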

Hive Input/Output Formats

  • File format for CLI: for results displayed back at the CLI, Shark only supports TextOutputFormat.
  • Hadoop archive

Hive Optimizations

A handful of Hive optimizations are not yet included in Shark. Some of these (such as indexes) are not necessary due to Shark's in-memory computational model. Others are slotted for future releases of Shark.

  • Block level bitmap indexes and virtual columns (used to build indexes)
  • Automatically convert a join to map join: For joining a large table with multiple small tables, Hive automatically converts the join into a map join. We are adding this auto conversion in the next release.
  • Automatically determine the number of reducers for joins and group-bys: currently in Shark, you need to control the degree of post-shuffle parallelism using "set mapred.reduce.tasks=[num_tasks];". We are going to add auto-setting of parallelism in the next release.
  • Meta-data only query: For queries that can be answered by using only meta data, Shark still launches tasks to compute the result.
  • Skew data flag: Shark does not follow the skew data flags in Hive.
  • STREAMTABLE hint in join: Shark does not follow the STREAMTABLE hint.
  • Merge multiple small files for query results: If the result output contains multiple small files, Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS metadata. Shark does not support that.
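
As noted above, post-shuffle parallelism must currently be set by hand before running a join or group-by. In the Shark CLI that looks like the following (the table name and task count are hypothetical):

```sql
-- Manually set the number of reduce tasks for the next shuffle;
-- 20 is just an example value, tune it to your cluster:
set mapred.reduce.tasks=20;

SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```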