[SPARK-13078][SQL] API and test cases for internal catalog #10982
Conversation
// TODO: need more functions for partitioning.
def alterPartition(db: String, table: String, part: TablePartition): Unit
partition handling is the main one that is incomplete ....
cc @hvanhovell This is what I discussed with you the other day. Once this is in, we can create the two implementations.
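For illustration, here is a hedged sketch of the extra partition operations that TODO might grow into; the method names, signatures, and the simplified StorageFormat/TablePartition stand-ins are assumptions drawn from this discussion, not the API that was eventually merged.

```scala
// Hypothetical sketch only: possible partition operations alongside alterPartition.
// StorageFormat and TablePartition are simplified stand-ins for the PR's types.
case class StorageFormat(locationUri: Option[String])
case class TablePartition(values: Seq[String], storage: StorageFormat)

trait PartitionOps {
  // A partition is identified here by its literal column values, e.g. Seq("2016", "01").
  def createPartitions(db: String, table: String, parts: Seq[TablePartition]): Unit
  def dropPartitions(db: String, table: String, partValues: Seq[Seq[String]]): Unit
  def alterPartition(db: String, table: String, part: TablePartition): Unit
  def getPartition(db: String, table: String, partValues: Seq[String]): TablePartition
  def listPartitions(db: String, table: String): Seq[TablePartition]
}
```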
Test build #50374 has finished for PR 10982 at commit
val schema: Seq[Column],
val partitionColumns: Seq[Column],
val storage: StorageFormat,
val numBuckets: Int,
bucketColumns and sortColumns?
ah ok - although hive doesn't actually support that.
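If bucketing metadata were added along those lines, the table definition might look roughly like the sketch below. The bucketColumns and sortColumns fields are assumptions taken from the review comment, and Column/StorageFormat are simplified stand-ins, not the PR's actual classes.

```scala
// Sketch only: a possible shape for table metadata with bucketing fields added.
case class Column(name: String, dataType: String)
case class StorageFormat(locationUri: Option[String])

class Table(
    val name: String,
    val schema: Seq[Column],
    val partitionColumns: Seq[Column],
    val storage: StorageFormat,
    val numBuckets: Int,
    val bucketColumns: Seq[Column] = Nil,   // suggested addition
    val sortColumns: Seq[Column] = Nil)     // suggested addition
```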
OK I pushed a new version -- this one should have the basic pieces ready and we can parallelize the work after this.
// Databases
// --------------------------------------------------------------------------
def createDatabase(dbDefinition: Database, ifNotExists: Boolean): Unit
need to define when we should throw exceptions in api contract
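One possible reading of that contract, sketched against a plain in-memory map: throw only when the database already exists and ifNotExists is false. The exception type and the Database fields here are placeholders, not what the PR settled on.

```scala
import scala.collection.mutable

// Placeholder metadata class for the sketch.
case class Database(name: String, description: String, locationUri: String)

class InMemoryDatabases {
  private val databases = mutable.HashMap.empty[String, Database]

  def createDatabase(dbDefinition: Database, ifNotExists: Boolean): Unit = {
    if (databases.contains(dbDefinition.name)) {
      // The contract question from the review: only throw when ifNotExists is false;
      // otherwise silently do nothing.
      if (!ifNotExists) {
        throw new IllegalArgumentException(s"Database '${dbDefinition.name}' already exists")
      }
    } else {
      databases.put(dbDefinition.name, dbDefinition)
    }
  }
}
```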
Test build #50442 has finished for PR 10982 at commit
Test build #50443 has finished for PR 10982 at commit
case class TablePartition(
  values: Seq[String],
  storage: StorageFormat
)
Hive allows us to store the partition in a different location, e.g.:
ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';
Do you want to support this as well?
That's already supported. StorageFormat defines locationUri
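For illustration, a sketch of how the ALTER TABLE ... ADD PARTITION ... LOCATION example above could map onto TablePartition through locationUri; the field shapes are simplified assumptions, not the PR's exact definitions.

```scala
// Simplified stand-ins for the PR's types.
case class StorageFormat(locationUri: Option[String])
case class TablePartition(values: Seq[String], storage: StorageFormat)

object PartitionLocationExample {
  // ALTER TABLE table_name ADD PARTITION (partCol = 'value1') LOCATION 'loc1'
  val withCustomLocation = TablePartition(
    values = Seq("value1"),
    storage = StorageFormat(locationUri = Some("loc1")))
}
```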
Talked to @hvanhovell offline. I'm going to merge this first and parallelize the work.
(Actually I will only merge it when my latest commit passes tests.)
Test build #50482 has finished for PR 10982 at commit
Test build #2485 has finished for PR 10982 at commit
Thanks - merging this in master.
 * @param name name of the function
 * @param className fully qualified class name, e.g. "org.apache.spark.util.MyFunc"
 */
case class Function(
Should these be case classes?
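For what it's worth, case classes give structural equality and copy() for free, which is convenient for catalog metadata. A small sketch, mirroring the Function fields shown above:

```scala
case class Function(name: String, className: String)

object CaseClassExample {
  val original = Function("myFunc", "org.apache.spark.util.MyFunc")

  // copy() gives an immutable "alter": same class name, new function name.
  val renamed = original.copy(name = "myFuncV2")

  // Structural equality compares field values rather than object identity.
  assert(original == Function("myFunc", "org.apache.spark.util.MyFunc"))
  assert(renamed.name == "myFuncV2")
}
```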
This is a step towards consolidating `SQLContext` and `HiveContext`. This patch extends the existing Catalog API added in #10982 to include methods for handling table partitions. In particular, a partition is identified by `PartitionSpec`, which is just a `Map[String, String]`. The Catalog is still not used by anything yet, but its API is now more or less complete and an implementation is fully tested. About 200 lines are test code.

Author: Andrew Or <andrew@databricks.com>

Closes #11069 from andrewor14/catalog.
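A tiny illustration of the `PartitionSpec` idea described above; the partition column names are made up for the example.

```scala
object PartitionSpecExample {
  // A partition is identified by a map from partition column name to value.
  type PartitionSpec = Map[String, String]

  // For a table partitioned by (year, month), this names the 2016-02 partition.
  val spec: PartitionSpec = Map("year" -> "2016", "month" -> "02")
}
```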
## What changes were proposed in this pull request?

This is a step towards merging `SQLContext` and `HiveContext`. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation.

*Where should I start reviewing?* The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor.

*Why is this patch so big?* I had to refactor `HiveClient` to remove an intermediate representation of databases, tables, partitions, etc. After this refactor, `CatalogTable` converts directly to and from `HiveTable` (etc.). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to `HiveTable`, which is messy.

The new class hierarchy is as follows:

```
org.apache.spark.sql.catalyst.catalog.Catalog
  - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
  - org.apache.spark.sql.hive.HiveCatalog
```

Note that, as of this patch, none of these classes are used anywhere yet. This will come in the future before the Spark 2.0 release.

## How was this patch tested?

All existing unit tests, and `HiveCatalogSuite`, which extends `CatalogTestCases`.

Author: Andrew Or <andrew@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #11293 from rxin/hive-catalog.
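A minimal sketch of that hierarchy: a shared interface, an in-memory implementation, and a Hive-backed one that delegates to a client. The SimpleHiveClient trait and every signature below are illustrative stand-ins, not the actual Spark source.

```scala
import scala.collection.mutable

// Shared interface (only one method group shown for brevity).
trait Catalog {
  def createDatabase(name: String, ignoreIfExists: Boolean): Unit
  def listDatabases(): Seq[String]
}

// In-memory implementation, useful for tests and the non-Hive code path.
class InMemoryCatalog extends Catalog {
  private val databases = mutable.LinkedHashSet.empty[String]

  override def createDatabase(name: String, ignoreIfExists: Boolean): Unit = {
    if (databases.contains(name) && !ignoreIfExists) {
      throw new IllegalArgumentException(s"Database '$name' already exists")
    }
    databases += name
  }

  override def listDatabases(): Seq[String] = databases.toSeq
}

// Stand-in for the narrow slice of a Hive client this sketch needs.
trait SimpleHiveClient {
  def createDatabase(name: String, ignoreIfExists: Boolean): Unit
  def listDatabases(pattern: String): Seq[String]
}

// Hive-backed implementation: thin, simply delegating each call to the client.
class HiveCatalog(client: SimpleHiveClient) extends Catalog {
  override def createDatabase(name: String, ignoreIfExists: Boolean): Unit =
    client.createDatabase(name, ignoreIfExists)

  override def listDatabases(): Seq[String] = client.listDatabases("*")
}
```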
This pull request creates an internal catalog API. The creation of this API is the first step towards consolidating SQLContext and HiveContext. I envision we will have two different implementations in Spark 2.0: (1) a simple in-memory implementation, and (2) an implementation based on the current HiveClient (ClientWrapper).
I took a look at Hive's internal metastore interface and implementation, and then created this API based on it. I believe this is the minimal set needed in order to achieve all the needed functionality.