
[SPARK-13080] [SQL] Implement new Catalog API using Hive #11189

Closed
wants to merge 34 commits

Conversation

andrewor14
Contributor

This is a step towards merging SQLContext and HiveContext. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using HiveClient, an existing interface to Hive. It also extends HiveClient with additional calls to Hive that are needed to complete the catalog implementation.

Where should I start reviewing? The new catalog introduced is HiveCatalog. This class is relatively simple because it just calls HiveClientImpl, where most of the new logic is. I would not start with HiveClient, HiveQl, or HiveMetastoreCatalog, which are modified mainly because of a refactor.

Why is this patch so big? I had to refactor HiveClient to remove an intermediate representation of databases, tables, partitions, etc. After this refactor, CatalogTable converts directly to and from HiveTable (and likewise for the other entities). Otherwise we would have to first convert CatalogTable to the intermediate representation and then convert that to HiveTable, which is messy.
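To illustrate the shape of that refactor, here is a minimal sketch of the direct two-way conversion. The field lists are placeholders for illustration only; the real CatalogTable and HiveTable carry storage, schema, partitioning, properties, and more.

  object TableConversions {
    // Placeholder field lists, for illustration only.
    case class CatalogTable(name: String, tableType: String, properties: Map[String, String])
    case class HiveTable(name: String, tableType: String, properties: Map[String, String])

    // Direct conversions, with no intermediate representation in between.
    def toHiveTable(t: CatalogTable): HiveTable =
      HiveTable(t.name, t.tableType, t.properties)

    def fromHiveTable(t: HiveTable): CatalogTable =
      CatalogTable(t.name, t.tableType, t.properties)
  }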

The new class hierarchy is as follows:

org.apache.spark.sql.catalyst.catalog.Catalog
  - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
  - org.apache.spark.sql.hive.HiveCatalog
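Sketched in Scala, the relationship looks roughly like this. The single method and the HiveClientLike stand-in are illustrative only, not the actual interfaces in the patch:

  // Stand-in for the HiveClient interface mentioned in the description.
  trait HiveClientLike {
    def tableExists(db: String, table: String): Boolean
  }

  // Illustrative: one representative method rather than the real Catalog API.
  abstract class Catalog {
    def tableExists(db: String, table: String): Boolean
  }

  class InMemoryCatalog extends Catalog {
    private val tables = scala.collection.mutable.Set.empty[(String, String)]
    override def tableExists(db: String, table: String): Boolean =
      tables.contains((db, table))
  }

  class HiveCatalog(client: HiveClientLike) extends Catalog {
    // Stays thin: all real work is delegated to the Hive client (HiveClientImpl in the patch).
    override def tableExists(db: String, table: String): Boolean =
      client.tableExists(db, table)
  }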

Note that, as of this patch, none of these classes is used anywhere yet. That will come before the Spark 2.0 release.

WIP pending tests and potential cleanups.

Andrew Or added 17 commits February 10, 2016 13:16
This required converting o.a.s.sql.catalyst.catalog.Table to its
counterpart in o.a.s.sql.hive.client.HiveTable, which in turn required
making o.a.s.sql.hive.client.TableType an enum because we need
to create one of these from a name.
Currently there's the catalog table, the Spark table used in the
hive module, and the Hive table. To avoid converting back and forth
between these table representations, we kill the intermediate one,
which is the one currently used throughout HiveClient and friends.
Instead, this commit introduces CatalogTableType, which serves
the same purpose. This adds some type safety and keeps the code
clean.
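A rough sketch of an enum-like type with name-based lookup in Scala; the member names below are guesses for illustration and not necessarily the ones used in the patch:

  sealed abstract class CatalogTableType(val name: String)

  object CatalogTableType {
    case object ExternalTable extends CatalogTableType("EXTERNAL_TABLE")
    case object ManagedTable extends CatalogTableType("MANAGED_TABLE")
    case object VirtualView extends CatalogTableType("VIRTUAL_VIEW")

    private val all = Seq(ExternalTable, ManagedTable, VirtualView)

    // Create a table type from its name, as the commit message requires.
    def fromName(name: String): CatalogTableType =
      all.find(_.name == name).getOrElse(
        throw new IllegalArgumentException(s"Unknown table type: $name"))
  }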
The operation doesn't support renaming anyway, so it doesn't
make sense to pass in a name AND a CatalogDatabase that always
has the same name.
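In other words, the signature is reduced from something like the first form below to the second. Both the method name alterDatabase and the CatalogDatabase field list are assumptions made for illustration:

  // Placeholder field list for illustration only.
  case class CatalogDatabase(name: String, description: String, locationUri: String)

  trait DatabaseOps {
    // Before: def alterDatabase(name: String, db: CatalogDatabase): Unit
    // After: the name parameter is dropped, since db.name must always match it anyway.
    def alterDatabase(db: CatalogDatabase): Unit
  }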
@andrewor14
Contributor Author

@yhuai @rxin

@SparkQA

SparkQA commented Feb 12, 2016

Test build #51208 has finished for PR 11189 at commit 07332ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 force-pushed the hive-catalog branch 2 times, most recently from 141356a to 8e82f1d on February 12, 2016 23:19
@SparkQA

SparkQA commented Feb 13, 2016

Test build #51216 has finished for PR 11189 at commit 8e82f1d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 13, 2016

Test build #51217 has finished for PR 11189 at commit 2b72025.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

existingParts.remove(spec)
specs: Seq[TablePartitionSpec],
newSpecs: Seq[TablePartitionSpec]): Unit = synchronized {
assert(specs.size == newSpecs.size, "number of old and new partition specs differ")
Contributor

assert -> require?

Contributor

and maybe assertDatabaseExists should be called requireDatabaseExists too
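For context, a small sketch of the distinction being suggested: require flags a bad call from the caller with IllegalArgumentException, while assert flags an internal invariant with AssertionError. The requireDbExists helper below is hypothetical, mirroring the proposed naming:

  object PreconditionSketch {
    def renamePartitions(specs: Seq[Map[String, String]], newSpecs: Seq[Map[String, String]]): Unit = {
      // Caller-facing precondition: require, not assert.
      require(specs.size == newSpecs.size, "number of old and new partition specs differ")
    }

    // Hypothetical helper following the suggested requireDatabaseExists naming.
    def requireDbExists(db: String, existingDbs: Set[String]): Unit = {
      require(existingDbs.contains(db), s"Database '$db' does not exist")
    }
  }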

@andrewor14 changed the title from "[SPARK-13080] [SQL] [WIP] Implement new Catalog API using Hive" to "[SPARK-13080] [SQL] Implement new Catalog API using Hive" on Feb 18, 2016
@andrewor14
Contributor Author

Alright, as of the latest commit this patch is no longer WIP. I added a test suite for the new HiveCatalog and all but one test in that suite are now passing. The one that I had to ignore was alter partitions, where Hive fails with a very unhelpful error message. I will continue to investigate ways to enable that test, but that should not block the merging of this patch.

@andrewor14
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 18, 2016

Test build #51469 has finished for PR 11189 at commit 428c3c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Andrew Or added 3 commits February 18, 2016 13:06
It turns out that you need to run "USE my_database" before
"ALTER TABLE my_table PARTITION ..." (HIVE-2742). Geez.
@SparkQA

SparkQA commented Feb 19, 2016

Test build #51507 has finished for PR 11189 at commit 2ba1990.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This was caused by cb288da, an attempt to clean up some duplicate
code. It turns out that HiveClient and HiveClientImpl cannot both
refer to Hive classes due to some classloader issues. Surprise...

This commit reverts part of the changes introduced in cb288da.
@SparkQA

SparkQA commented Feb 19, 2016

Test build #51509 has finished for PR 11189 at commit d9a7723.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Feb 19, 2016

@hvanhovell would you have time to review some of this?

@rxin
Contributor

rxin commented Feb 19, 2016

I took a quick look and this looks reasonable. The changes were straightforward.

@andrewor14 are there still minor things you need to do here?

@andrewor14
Contributor Author

No, just waiting for review at this point.

@hvanhovell
Contributor

I'll have a look this weekend.

* Thrown by a catalog when an item cannot be found. The analyzer will rethrow the exception
* as an [[org.apache.spark.sql.AnalysisException]] with the correct position information.
*/
abstract class NoSuchItemException extends Exception { override def getMessage: String }
Contributor

We usually don't inline a method like this
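That is, the non-inlined form would be:

  abstract class NoSuchItemException extends Exception {
    override def getMessage: String
  }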

@hvanhovell
Contributor

@andrewor14 This is pretty solid, couldn't find anything except for some trivial stuff.

LGTM pending an update to the latest master and a successful test run.

rxin added a commit to rxin/spark that referenced this pull request Feb 21, 2016
[SPARK-13080] [SQL] Implement new Catalog API using Hive
@rxin
Contributor

rxin commented Feb 21, 2016

FYI, I took most of Herman's and Davies' comments and created a rebased PR here: #11293

@andrewor14
Contributor Author

Closing in favor of #11293.

@andrewor14 andrewor14 closed this Feb 21, 2016
* Run some code involving `client` in a [[synchronized]] block and wrap certain
* exceptions thrown in the process in [[AnalysisException]].
*/
private def withClient[T](body: => T): T = synchronized {
Member

@andrewor14 What is the reason we need a lock here?
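For reference, a minimal sketch of what such a wrapper might look like once filled in; the try/catch body is inferred from the scaladoc above and is not taken from the actual implementation:

  private def withClient[T](body: => T): T = synchronized {
    try {
      body
    } catch {
      case e: NoSuchItemException =>
        // Rethrow catalog "not found" errors as AnalysisException, per the scaladoc.
        throw new AnalysisException(e.getMessage)
    }
  }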
