
[SPARK-13078][SQL] API and test cases for internal catalog #10982

Closed
wants to merge 7 commits

Conversation

@rxin (Contributor) commented Jan 29, 2016

This pull request creates an internal catalog API. The creation of this API is the first step towards consolidating SQLContext and HiveContext. I envision we will have two different implementations in Spark 2.0: (1) a simple in-memory implementation, and (2) an implementation based on the current HiveClient (ClientWrapper).

I took a look at Hive's internal metastore interface and implementation, and created this API based on it. I believe this is the minimal set needed to achieve all the required functionality.
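
For orientation, here is a rough sketch of the overall shape such an API could take, pieced together from the snippets quoted in this review; only the signatures that appear in the diff hunks below are verbatim, the rest are illustrative:

```
// Rough sketch only: createDatabase and alterPartition appear in the review
// snippets below; every other method name here is an illustrative assumption.
abstract class Catalog {

  // Databases
  def createDatabase(dbDefinition: Database, ifNotExists: Boolean): Unit
  def dropDatabase(db: String, ignoreIfNotExists: Boolean, cascade: Boolean): Unit
  def listDatabases(): Seq[String]

  // Tables
  def createTable(db: String, tableDefinition: Table, ignoreIfExists: Boolean): Unit
  def dropTable(db: String, table: String, ignoreIfNotExists: Boolean): Unit

  // Partitions (still incomplete in this PR)
  def alterPartition(db: String, table: String, part: TablePartition): Unit

  // Functions
  def createFunction(db: String, funcDefinition: Function, ignoreIfExists: Boolean): Unit
}
```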


// TODO: need more functions for partitioning.

def alterPartition(db: String, table: String, part: TablePartition): Unit
@rxin (Contributor, Author) commented on the diff:

partition handling is the main one that is incomplete ....

@rxin (Contributor, Author) commented Jan 29, 2016

cc @hvanhovell

This is what I discussed with you the other day. Once this is in, we can create the two implementations.

@SparkQA commented Jan 29, 2016

Test build #50374 has finished for PR 10982 at commit 9e96106.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    val schema: Seq[Column],
    val partitionColumns: Seq[Column],
    val storage: StorageFormat,
    val numBuckets: Int,
A Contributor commented on the diff:

bucketColumns and sortColumns?

@rxin (Contributor, Author) replied:

Ah ok - although Hive doesn't actually support that.
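
For context, a sketch of how the table metadata class could look with the suggested bucketing fields folded in; only the four fields quoted in the diff are verbatim, the rest are illustrative:

```
// Illustrative only: schema, partitionColumns, storage and numBuckets come from
// the quoted diff; bucketColumns and sortColumns are the reviewer's suggestion,
// and the remaining field names are assumptions.
case class Table(
    name: String,
    schema: Seq[Column],
    partitionColumns: Seq[Column],
    bucketColumns: Seq[Column],
    sortColumns: Seq[Column],
    storage: StorageFormat,
    numBuckets: Int)
```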

@rxin rxin changed the title [SPARK-13078][SQL] Internal catalog API - WIP [SPARK-13078][SQL] Infrastructure for the internal catalog API Jan 30, 2016
@rxin (Contributor, Author) commented Jan 30, 2016

OK I pushed a new version -- this one should have the basic pieces ready and we can parallelize the work after this.

// Databases
// --------------------------------------------------------------------------

def createDatabase(dbDefinition: Database, ifNotExists: Boolean): Unit
@rxin (Contributor, Author) commented on the diff:

Need to define when we should throw exceptions in the API contract.
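
One way to pin that down is to document the behaviour per method; a minimal sketch, assuming a hypothetical databaseExists/doCreateDatabase pair and an unspecified exception type (none of which is necessarily what the merged patch uses):

```
// Sketch of one possible exception contract; databaseExists and doCreateDatabase
// are hypothetical helpers, and the exception type is an assumption.
def createDatabase(dbDefinition: Database, ifNotExists: Boolean): Unit = {
  if (databaseExists(dbDefinition.name)) {
    // Fail loudly unless the caller explicitly tolerates an existing database.
    if (!ifNotExists) {
      throw new IllegalStateException(s"Database ${dbDefinition.name} already exists")
    }
  } else {
    doCreateDatabase(dbDefinition)
  }
}
```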

@SparkQA commented Jan 30, 2016

Test build #50442 has finished for PR 10982 at commit d1bb199.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 30, 2016

Test build #50443 has finished for PR 10982 at commit 964193d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class CatalogTestCases extends SparkFunSuite

case class TablePartition(
    values: Seq[String],
    storage: StorageFormat
)
A Contributor commented on the diff:

Hive allows us to store the partition in a different location, e.g.:

ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';

Do you want to support this as well?

@rxin (Contributor, Author) replied:

That's already supported; StorageFormat defines locationUri.
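
To make that concrete, a plausible shape for StorageFormat; only locationUri is confirmed by this thread, the remaining fields are assumptions modelled on a Hive-style storage descriptor:

```
// Only locationUri is confirmed in this thread; the other fields are assumptions.
case class StorageFormat(
    locationUri: String,  // covers ALTER TABLE ... ADD PARTITION ... LOCATION 'loc1'
    inputFormat: String,
    outputFormat: String,
    serde: String,
    serdeProperties: Map[String, String])
```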

@rxin rxin changed the title [SPARK-13078][SQL] Infrastructure for the internal catalog API [SPARK-13078][SQL] API and test cases for internal catalog Feb 1, 2016
@rxin (Contributor, Author) commented Feb 1, 2016

Talked to @hvanhovell offline. I'm going to merge this first and parallelize the work.

@rxin (Contributor, Author) commented Feb 1, 2016

(Actually I will only merge it when my latest commit passes tests.)

@SparkQA commented Feb 1, 2016

Test build #50482 has finished for PR 10982 at commit 01c5922.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 1, 2016

Test build #2485 has finished for PR 10982 at commit 01c5922.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor, Author) commented Feb 1, 2016

Thanks - merging this in master.

@asfgit closed this in be7a2fc on Feb 1, 2016
 * @param name name of the function
 * @param className fully qualified class name, e.g. "org.apache.spark.util.MyFunc"
 */
case class Function(
A Contributor commented on the diff:

Should these be case classes?
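
For reference, the quoted hunk cuts off before the parameter list; based on its scaladoc, the definition presumably reads roughly as follows (the exact parameter list is an assumption):

```
/**
 * A function defined in the catalog.
 *
 * @param name name of the function
 * @param className fully qualified class name, e.g. "org.apache.spark.util.MyFunc"
 */
case class Function(
    name: String,
    className: String)
```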

asfgit pushed a commit that referenced this pull request on Feb 4, 2016:
This is a step towards consolidating `SQLContext` and `HiveContext`.

This patch extends the existing Catalog API added in #10982 to include methods for handling table partitions. In particular, a partition is identified by `PartitionSpec`, which is just a `Map[String, String]`. The Catalog is still not used by anything yet, but its API is now more or less complete and an implementation is fully tested.

About 200 lines are test code.

Author: Andrew Or <andrew@databricks.com>

Closes #11069 from andrewor14/catalog.
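
As a rough illustration of the partition handling described in that commit message; the method names and signatures below are indicative only, not a verbatim copy of what #11069 merged:

```
// Indicative sketch of partition methods keyed by PartitionSpec; not a verbatim
// copy of the API merged in #11069.
trait PartitionSupport {
  // A partition is identified by its column-value assignments,
  // e.g. Map("dt" -> "2016-02-04", "country" -> "US").
  type PartitionSpec = Map[String, String]

  def createPartitions(db: String, table: String, parts: Seq[TablePartition],
      ignoreIfExists: Boolean): Unit
  def dropPartitions(db: String, table: String, specs: Seq[PartitionSpec],
      ignoreIfNotExists: Boolean): Unit
  def getPartition(db: String, table: String, spec: PartitionSpec): TablePartition
  def alterPartition(db: String, table: String, spec: PartitionSpec,
      newPart: TablePartition): Unit
}
```
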
asfgit pushed a commit that referenced this pull request on Feb 21, 2016:
## What changes were proposed in this pull request?

This is a step towards merging `SQLContext` and `HiveContext`. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation.

*Where should I start reviewing?* The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor.

*Why is this patch so big?* I had to refactor `HiveClient` to remove an intermediate representation of databases, tables, partitions, etc. After this refactor, `CatalogTable` converts directly to and from `HiveTable` (etc.). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to `HiveTable`, which is messy.

The new class hierarchy is as follows:
```
org.apache.spark.sql.catalyst.catalog.Catalog
  - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
  - org.apache.spark.sql.hive.HiveCatalog
```

Note that, as of this patch, none of these classes is used anywhere yet. This will come before the Spark 2.0 release.

## How was this patch tested?
All existing unit tests, plus a new `HiveCatalogSuite` that extends `CatalogTestCases`.

Author: Andrew Or <andrew@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #11293 from rxin/hive-catalog.
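
The shared test suite mentioned here follows the usual abstract-suite pattern; a minimal sketch, assuming a factory method named newEmptyCatalog and a HiveClient-backed HiveCatalog constructor (both assumptions):

```
import org.apache.spark.SparkFunSuite

// Minimal sketch of the shared-suite pattern; the factory method name and the
// HiveCatalog constructor argument are assumptions, not the merged code.
abstract class CatalogTestCases extends SparkFunSuite {
  protected def newEmptyCatalog(): Catalog

  test("a freshly created catalog starts out empty") {
    val catalog = newEmptyCatalog()
    // ... assertions shared by every Catalog implementation ...
  }
}

class InMemoryCatalogSuite extends CatalogTestCases {
  override protected def newEmptyCatalog(): Catalog = new InMemoryCatalog
}

class HiveCatalogSuite extends CatalogTestCases {
  override protected def newEmptyCatalog(): Catalog =
    new HiveCatalog(/* a HiveClient backed by a local metastore */ ???)
}
```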