[SPARK-13080] [SQL] Implement new Catalog API using Hive #11189
Conversation
This required converting o.a.s.sql.catalyst.catalog.Table to its counterpart, o.a.s.sql.hive.client.HiveTable. That in turn required making o.a.s.sql.hive.client.TableType an enum, because we need to be able to construct one from its name.
Currently there are three table representations: the catalog table, the Spark table used in the hive module, and the Hive table. To avoid converting back and forth between them, we kill the intermediate one, which is the one currently used throughout HiveClient and friends.
Instead, this commit introduces CatalogTableType, which serves the same purpose. This adds some type safety and keeps the code clean.
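As a rough illustration, such an enum-like type with a by-name constructor might look like the sketch below (a sketch only; the constant names are assumptions based on Hive's table types, not the actual definition):

```scala
// An enum-like type that can also be constructed from a name, e.g. when
// reading table metadata back from the Hive metastore.
sealed abstract class CatalogTableType(val name: String)

object CatalogTableType {
  case object ExternalTable extends CatalogTableType("EXTERNAL_TABLE")
  case object ManagedTable extends CatalogTableType("MANAGED_TABLE")
  case object VirtualView extends CatalogTableType("VIRTUAL_VIEW")

  private val all: Seq[CatalogTableType] = Seq(ExternalTable, ManagedTable, VirtualView)

  // Look up a table type by its metastore name, failing fast on unknown names.
  def fromName(name: String): CatalogTableType =
    all.find(_.name == name).getOrElse(
      throw new IllegalArgumentException(s"Unknown table type: $name"))
}
```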
The operation doesn't support renaming anyway, so it doesn't make sense to pass in a name AND a CatalogDatabase that always has the same name.
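In other words (hypothetical signatures, for illustration only):

```scala
trait ExampleCatalog {
  // Redundant: `name` must always equal `db.name`, since the operation
  // does not support renaming.
  def alterDatabase(name: String, db: CatalogDatabase): Unit

  // Cleaner: CatalogDatabase already carries the only name we need.
  def alterDatabase(db: CatalogDatabase): Unit
}
```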
Test build #51208 has finished for PR 11189 at commit
Force-pushed from 141356a to 8e82f1d.
Force-pushed from 8e82f1d to 2b72025.
Test build #51216 has finished for PR 11189 at commit
Test build #51217 has finished for PR 11189 at commit
existingParts.remove(spec)
specs: Seq[TablePartitionSpec],
newSpecs: Seq[TablePartitionSpec]): Unit = synchronized {
assert(specs.size == newSpecs.size, "number of old and new partition specs differ")
assert -> require?
and maybe assertDatabaseExists should be called requireDatabaseExists too
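The distinction the reviewer is drawing, sketched against the snippet above (signature simplified; the db/table parameters are omitted):

```scala
def renamePartitions(
    specs: Seq[TablePartitionSpec],
    newSpecs: Seq[TablePartitionSpec]): Unit = synchronized {
  // `require` throws IllegalArgumentException, the conventional signal for a
  // violated precondition, and is always enabled; `assert` throws
  // AssertionError and can be elided with -Xdisable-assertions.
  require(specs.size == newSpecs.size, "number of old and new partition specs differ")
  // ... rename each old spec to its corresponding new spec ...
}
```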
Alright, as of the latest commit this patch is no longer WIP. I added a test suite for the new catalog.
retest this please
Test build #51469 has finished for PR 11189 at commit
It turns out that you need to run "USE my_database" before "ALTER TABLE my_table PARTITION ..." (HIVE-2742). Geez.
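Concretely, the workaround looks something like this (a sketch; `runHiveSql` is a hypothetical stand-in for whatever executes HiveQL on the client's session):

```scala
// ALTER TABLE ... PARTITION resolves the table name against the *current*
// database (HIVE-2742), so we must switch databases first, on the same session.
def renamePartitionWorkaround(
    runHiveSql: String => Unit,
    db: String,
    table: String,
    oldSpec: String,
    newSpec: String): Unit = {
  runHiveSql(s"USE $db")
  runHiveSql(s"ALTER TABLE $table PARTITION ($oldSpec) RENAME TO PARTITION ($newSpec)")
}
```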
Test build #51507 has finished for PR 11189 at commit
Force-pushed from e3c15c6 to d9a7723.
Test build #51509 has finished for PR 11189 at commit
@hvanhovell would you have time to review some of this?
I took a quick look and this looks reasonable. The changes were straightforward. @andrewor14 are there still minor things you need to do here?
No, just waiting for review at this point.
I'll have a look this weekend.
/**
 * Thrown by a catalog when an item cannot be found. The analyzer will rethrow the exception
 * as an [[org.apache.spark.sql.AnalysisException]] with the correct position information.
 */
abstract class NoSuchItemException extends Exception { override def getMessage: String }
We usually do not inline a method.
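i.e., presumably the suggestion is to break the member onto its own line:

```scala
abstract class NoSuchItemException extends Exception {
  override def getMessage: String
}
```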
@andrewor14 This is pretty solid, I couldn't find anything except for some trivial stuff. LGTM pending an update to the latest master and a successful test run.
FYI I took most of Herman and Davies' comments and created a rebased PR here: #11293
Closing in favor of #11293.
/**
 * Run some code involving `client` in a [[synchronized]] block and wrap certain
 * exceptions thrown in the process in [[AnalysisException]].
 */
private def withClient[T](body: => T): T = synchronized {
@andrewor14 What is the reason we need a lock here?
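For reference, a plausible body for this helper, as a hedged sketch (assumptions: the lock serializes access to a Hive client that is not thread-safe, and the wrapped exception types are illustrative):

```scala
private def withClient[T](body: => T): T = synchronized {
  try {
    body
  } catch {
    // Convert catalog-level "not found" errors into the analyzer-facing
    // AnalysisException mentioned in the doc comment above.
    case e: NoSuchItemException =>
      throw new AnalysisException(e.getMessage)
  }
}
```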
This is a step towards merging `SQLContext` and `HiveContext`. A new internal `Catalog` API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation.

Where should I start reviewing? The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor.

Why is this patch so big? I had to refactor `HiveClient` to remove an intermediate representation of databases, tables, partitions, etc. After this refactor, `CatalogTable` converts directly to and from `HiveTable` (etc.). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to `HiveTable`, which is messy.

The new class hierarchy is as follows:
Note that, as of this patch, none of these classes are currently used anywhere yet. This will come in the future before the Spark 2.0 release.
WIP pending tests and potential cleanups.
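For illustration, a hedged sketch of the delegation pattern described above (the class shape and method signatures are assumptions, not the actual API):

```scala
// Illustrative only: HiveCatalog stays thin by forwarding each catalog
// operation to the Hive client, where the real logic lives. Locking and
// exception translation would live in a wrapper like `withClient` above.
class HiveCatalog(client: HiveClient) {

  def createDatabase(db: CatalogDatabase, ignoreIfExists: Boolean): Unit =
    client.createDatabase(db, ignoreIfExists)

  def getTable(db: String, table: String): CatalogTable =
    client.getTable(db, table)
}
```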