
[SPARK-13080] [SQL] Implement new Catalog API using Hive #11189

Closed
wants to merge 34 commits

Conversation

andrewor14
Contributor

This is a step towards merging SQLContext and HiveContext. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using HiveClient, an existing interface to Hive. It also extends HiveClient with additional calls to Hive that are needed to complete the catalog implementation.

Where should I start reviewing? The new catalog introduced is HiveCatalog. This class is relatively simple because it just calls HiveClientImpl, where most of the new logic is. I would not start with HiveClient, HiveQl, or HiveMetastoreCatalog, which are modified mainly because of a refactor.

Why is this patch so big? I had to refactor HiveClient to remove an intermediate representation of databases, tables, partitions, etc. After this refactor, CatalogTable converts directly to and from HiveTable (and likewise for the other entities). Otherwise we would have to first convert CatalogTable to the intermediate representation and then convert that to HiveTable, which is messy.
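To illustrate the shape of that refactor, here is a minimal sketch of the direct two-way conversion. The field lists are placeholders for illustration only; the real CatalogTable and HiveTable carry storage, schema, partitioning, properties, and more.

  object TableConversions {
    // Placeholder field lists, for illustration only.
    case class CatalogTable(name: String, tableType: String, properties: Map[String, String])
    case class HiveTable(name: String, tableType: String, properties: Map[String, String])

    // Direct conversions, with no intermediate representation in between.
    def toHiveTable(t: CatalogTable): HiveTable =
      HiveTable(t.name, t.tableType, t.properties)

    def fromHiveTable(t: HiveTable): CatalogTable =
      CatalogTable(t.name, t.tableType, t.properties)
  }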

The new class hierarchy is as follows:

org.apache.spark.sql.catalyst.catalog.Catalog
  - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
  - org.apache.spark.sql.hive.HiveCatalog
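Sketched in Scala, the relationship looks roughly like this. The single method and the HiveClientLike stand-in are illustrative only, not the actual interfaces in the patch:

  // Stand-in for the HiveClient interface mentioned in the description.
  trait HiveClientLike {
    def tableExists(db: String, table: String): Boolean
  }

  // Illustrative: one representative method rather than the real Catalog API.
  abstract class Catalog {
    def tableExists(db: String, table: String): Boolean
  }

  class InMemoryCatalog extends Catalog {
    private val tables = scala.collection.mutable.Set.empty[(String, String)]
    override def tableExists(db: String, table: String): Boolean =
      tables.contains((db, table))
  }

  class HiveCatalog(client: HiveClientLike) extends Catalog {
    // Stays thin: all real work is delegated to the Hive client (HiveClientImpl in the patch).
    override def tableExists(db: String, table: String): Boolean =
      client.tableExists(db, table)
  }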

Note that, as of this patch, none of these classes is used anywhere yet. That will come before the Spark 2.0 release.

WIP pending tests and potential cleanups.

Andrew Or added 17 commits February 10, 2016 13:16
This required converting o.a.s.sql.catalyst.catalog.Table to its
counterpart in o.a.s.sql.hive.client.HiveTable, which in turn required
making o.a.s.sql.hive.client.TableType an enum because we need
to create one of these from a name.
Currently there's the catalog table, the Spark table used in the
hive module, and the Hive table. To avoid converting back and forth
between these table representations, we kill the intermediate one,
which is the one currently used throughout HiveClient and friends.
Instead, this commit introduces CatalogTableType, which serves
the same purpose. This adds some type safety and keeps the code
clean.
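A rough sketch of an enum-like type with name-based lookup in Scala; the member names below are guesses for illustration and not necessarily the ones used in the patch:

  sealed abstract class CatalogTableType(val name: String)

  object CatalogTableType {
    case object ExternalTable extends CatalogTableType("EXTERNAL_TABLE")
    case object ManagedTable extends CatalogTableType("MANAGED_TABLE")
    case object VirtualView extends CatalogTableType("VIRTUAL_VIEW")

    private val all = Seq(ExternalTable, ManagedTable, VirtualView)

    // Create a table type from its name, as the commit message requires.
    def fromName(name: String): CatalogTableType =
      all.find(_.name == name).getOrElse(
        throw new IllegalArgumentException(s"Unknown table type: $name"))
  }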
The operation doesn't support renaming anyway, so it doesn't
make sense to pass in a name AND a CatalogDatabase that always
has the same name.
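In other words, the signature is reduced from something like the first form below to the second. Both the method name alterDatabase and the CatalogDatabase field list are assumptions made for illustration:

  // Placeholder field list for illustration only.
  case class CatalogDatabase(name: String, description: String, locationUri: String)

  trait DatabaseOps {
    // Before: def alterDatabase(name: String, db: CatalogDatabase): Unit
    // After: the name parameter is dropped, since db.name must always match it anyway.
    def alterDatabase(db: CatalogDatabase): Unit
  }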
@andrewor14
Contributor Author

@yhuai @rxin

@SparkQA

SparkQA commented Feb 12, 2016

Test build #51208 has finished for PR 11189 at commit 07332ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 force-pushed the hive-catalog branch 2 times, most recently from 141356a to 8e82f1d on February 12, 2016 23:19
@SparkQA

SparkQA commented Feb 13, 2016

Test build #51216 has finished for PR 11189 at commit 8e82f1d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 13, 2016

Test build #51217 has finished for PR 11189 at commit 2b72025.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

existingParts.remove(spec)
specs: Seq[TablePartitionSpec],
newSpecs: Seq[TablePartitionSpec]): Unit = synchronized {
assert(specs.size == newSpecs.size, "number of old and new partition specs differ")
Contributor

assert -> require?

Contributor

and maybe assertDatabaseExists should be called requireDatabaseExists too
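For context, a small sketch of the distinction being suggested: require flags a bad call from the caller with IllegalArgumentException, while assert flags an internal invariant with AssertionError. The requireDbExists helper below is hypothetical, mirroring the proposed naming:

  object PreconditionSketch {
    def renamePartitions(specs: Seq[Map[String, String]], newSpecs: Seq[Map[String, String]]): Unit = {
      // Caller-facing precondition: require, not assert.
      require(specs.size == newSpecs.size, "number of old and new partition specs differ")
    }

    // Hypothetical helper following the suggested requireDatabaseExists naming.
    def requireDbExists(db: String, existingDbs: Set[String]): Unit = {
      require(existingDbs.contains(db), s"Database '$db' does not exist")
    }
  }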

@andrewor14 changed the title from "[SPARK-13080] [SQL] [WIP] Implement new Catalog API using Hive" to "[SPARK-13080] [SQL] Implement new Catalog API using Hive" on Feb 18, 2016
@andrewor14
Contributor Author

Alright, as of the latest commit this patch is no longer WIP. I added a test suite for the new HiveCatalog and all but one test in that suite are now passing. The one that I had to ignore was alter partitions, where Hive fails with a very unhelpful error message. I will continue to investigate ways to enable that test, but that should not block the merging of this patch.

@andrewor14
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 18, 2016

Test build #51469 has finished for PR 11189 at commit 428c3c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Andrew Or added 3 commits February 18, 2016 13:06
It turns out that you need to run "USE my_database" before
"ALTER TABLE my_table PARTITION ..." (HIVE-2742). Geez.
@SparkQA

SparkQA commented Feb 19, 2016

Test build #51507 has finished for PR 11189 at commit 2ba1990.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This was caused by cb288da, an attempt to clean up some duplicate
code. It turns out that HiveClient and HiveClientImpl cannot both
refer to Hive classes due to some classloader issues. Surprise...

This commit reverts part of the changes introduced in cb288da.
@SparkQA

SparkQA commented Feb 19, 2016

Test build #51509 has finished for PR 11189 at commit d9a7723.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Feb 19, 2016

@hvanhovell would you have time to review some of this?

@rxin
Contributor

rxin commented Feb 19, 2016

I took a quick look and this looks reasonable. The changes were straightforward.

@andrewor14 are there still minor things you need to do here?

@andrewor14
Contributor Author

No, just waiting for review at this point.

@hvanhovell
Contributor

I'll have a look this weekend.

* Thrown by a catalog when an item cannot be found. The analyzer will rethrow the exception
* as an [[org.apache.spark.sql.AnalysisException]] with the correct position information.
*/
abstract class NoSuchItemException extends Exception { override def getMessage: String }
Contributor

We usually don't inline a method like this
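That is, the non-inlined form would be:

  abstract class NoSuchItemException extends Exception {
    override def getMessage: String
  }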

@hvanhovell
Contributor

@andrewor14 This is pretty solid, couldn't find anything except for some trivial stuff.

LGTM pending an update to the latest master and a successful test run.

rxin added a commit to rxin/spark that referenced this pull request Feb 21, 2016
[SPARK-13080] [SQL] Implement new Catalog API using Hive
@rxin
Contributor

rxin commented Feb 21, 2016

FYI, I took most of Herman's and Davies' comments and created a rebased PR here: #11293

@andrewor14
Contributor Author

Closing in favor of #11293.

@andrewor14 andrewor14 closed this Feb 21, 2016
* Run some code involving `client` in a [[synchronized]] block and wrap certain
* exceptions thrown in the process in [[AnalysisException]].
*/
private def withClient[T](body: => T): T = synchronized {
Member

@andrewor14 What is the reason we need a lock here?
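For reference, a minimal sketch of what such a wrapper might look like once filled in; the try/catch body is inferred from the scaladoc above and is not taken from the actual implementation:

  private def withClient[T](body: => T): T = synchronized {
    try {
      body
    } catch {
      case e: NoSuchItemException =>
        // Rethrow catalog "not found" errors as AnalysisException, per the scaladoc.
        throw new AnalysisException(e.getMessage)
    }
  }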
