
[SPARK-13078][SQL] API and test cases for internal catalog #10982

Closed
wants to merge 7 commits

Conversation

@rxin (Contributor) commented Jan 29, 2016

This pull request creates an internal catalog API. The creation of this API is the first step towards consolidating SQLContext and HiveContext. I envision we will have two different implementations in Spark 2.0: (1) a simple in-memory implementation, and (2) an implementation based on the current HiveClient (ClientWrapper).

I took a look at Hive's internal metastore interface and implementation, and created this API based on it. I believe this is the minimal set needed to achieve all the required functionality.
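
For orientation, here is a rough sketch of the overall shape such an API could take, pieced together from the snippets quoted in this review; only the signatures that appear in the diff hunks below are verbatim, the rest are illustrative:

```
// Rough sketch only: createDatabase and alterPartition appear in the review
// snippets below; every other method name here is an illustrative assumption.
abstract class Catalog {

  // Databases
  def createDatabase(dbDefinition: Database, ifNotExists: Boolean): Unit
  def dropDatabase(db: String, ignoreIfNotExists: Boolean, cascade: Boolean): Unit
  def listDatabases(): Seq[String]

  // Tables
  def createTable(db: String, tableDefinition: Table, ignoreIfExists: Boolean): Unit
  def dropTable(db: String, table: String, ignoreIfNotExists: Boolean): Unit

  // Partitions (still incomplete in this PR)
  def alterPartition(db: String, table: String, part: TablePartition): Unit

  // Functions
  def createFunction(db: String, funcDefinition: Function, ignoreIfExists: Boolean): Unit
}
```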


// TODO: need more functions for partitioning.

def alterPartition(db: String, table: String, part: TablePartition): Unit
@rxin (Contributor, Author) commented on the diff:

partition handling is the main one that is incomplete ....

@rxin (Contributor, Author) commented Jan 29, 2016

cc @hvanhovell

This is what I discussed with you the other day. Once this is in, we can create the two implementations.

@SparkQA commented Jan 29, 2016

Test build #50374 has finished for PR 10982 at commit 9e96106.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    val schema: Seq[Column],
    val partitionColumns: Seq[Column],
    val storage: StorageFormat,
    val numBuckets: Int,
A Contributor commented on the diff:

bucketColumns and sortColumns?

@rxin (Contributor, Author) replied:

Ah ok - although Hive doesn't actually support that.
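
For context, a sketch of how the table metadata class could look with the suggested bucketing fields folded in; only the four fields quoted in the diff are verbatim, the rest are illustrative:

```
// Illustrative only: schema, partitionColumns, storage and numBuckets come from
// the quoted diff; bucketColumns and sortColumns are the reviewer's suggestion,
// and the remaining field names are assumptions.
case class Table(
    name: String,
    schema: Seq[Column],
    partitionColumns: Seq[Column],
    bucketColumns: Seq[Column],
    sortColumns: Seq[Column],
    storage: StorageFormat,
    numBuckets: Int)
```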

@rxin rxin changed the title [SPARK-13078][SQL] Internal catalog API - WIP [SPARK-13078][SQL] Infrastructure for the internal catalog API Jan 30, 2016
@rxin (Contributor, Author) commented Jan 30, 2016

OK I pushed a new version -- this one should have the basic pieces ready and we can parallelize the work after this.

// Databases
// --------------------------------------------------------------------------

def createDatabase(dbDefinition: Database, ifNotExists: Boolean): Unit
@rxin (Contributor, Author) commented on the diff:

Need to define when we should throw exceptions in the API contract.
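
One way to pin that down is to document the behaviour per method; a minimal sketch, assuming a hypothetical databaseExists/doCreateDatabase pair and an unspecified exception type (none of which is necessarily what the merged patch uses):

```
// Sketch of one possible exception contract; databaseExists and doCreateDatabase
// are hypothetical helpers, and the exception type is an assumption.
def createDatabase(dbDefinition: Database, ifNotExists: Boolean): Unit = {
  if (databaseExists(dbDefinition.name)) {
    // Fail loudly unless the caller explicitly tolerates an existing database.
    if (!ifNotExists) {
      throw new IllegalStateException(s"Database ${dbDefinition.name} already exists")
    }
  } else {
    doCreateDatabase(dbDefinition)
  }
}
```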

@SparkQA commented Jan 30, 2016

Test build #50442 has finished for PR 10982 at commit d1bb199.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 30, 2016

Test build #50443 has finished for PR 10982 at commit 964193d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class CatalogTestCases extends SparkFunSuite

case class TablePartition(
    values: Seq[String],
    storage: StorageFormat
)
A Contributor commented on the diff:

Hive allows us to store the partition in a different location, e.g.:

ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';

Do you want to support this as well?

@rxin (Contributor, Author) replied:

That's already supported; StorageFormat defines locationUri.
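
To make that concrete, a plausible shape for StorageFormat; only locationUri is confirmed by this thread, the remaining fields are assumptions modelled on a Hive-style storage descriptor:

```
// Only locationUri is confirmed in this thread; the other fields are assumptions.
case class StorageFormat(
    locationUri: String,  // covers ALTER TABLE ... ADD PARTITION ... LOCATION 'loc1'
    inputFormat: String,
    outputFormat: String,
    serde: String,
    serdeProperties: Map[String, String])
```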

@rxin rxin changed the title [SPARK-13078][SQL] Infrastructure for the internal catalog API [SPARK-13078][SQL] API and test cases for internal catalog Feb 1, 2016
@rxin (Contributor, Author) commented Feb 1, 2016

Talked to @hvanhovell offline. I'm going to merge this first and parallelize the work.

@rxin (Contributor, Author) commented Feb 1, 2016

(Actually I will only merge it when my latest commit passes tests.)

@SparkQA commented Feb 1, 2016

Test build #50482 has finished for PR 10982 at commit 01c5922.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 1, 2016

Test build #2485 has finished for PR 10982 at commit 01c5922.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor, Author) commented Feb 1, 2016

Thanks - merging this in master.

@asfgit closed this in be7a2fc on Feb 1, 2016
 * @param name name of the function
 * @param className fully qualified class name, e.g. "org.apache.spark.util.MyFunc"
 */
case class Function(
A Contributor commented on the diff:

Should these be case classes?
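
For reference, the quoted hunk cuts off before the parameter list; based on its scaladoc, the definition presumably reads roughly as follows (the exact parameter list is an assumption):

```
/**
 * A function defined in the catalog.
 *
 * @param name name of the function
 * @param className fully qualified class name, e.g. "org.apache.spark.util.MyFunc"
 */
case class Function(
    name: String,
    className: String)
```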

asfgit pushed a commit that referenced this pull request on Feb 4, 2016:
This is a step towards consolidating `SQLContext` and `HiveContext`.

This patch extends the existing Catalog API added in #10982 to include methods for handling table partitions. In particular, a partition is identified by `PartitionSpec`, which is just a `Map[String, String]`. The Catalog is still not used by anything yet, but its API is now more or less complete and an implementation is fully tested.

About 200 lines are test code.

Author: Andrew Or <andrew@databricks.com>

Closes #11069 from andrewor14/catalog.
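
As a rough illustration of the partition handling described in that commit message; the method names and signatures below are indicative only, not a verbatim copy of what #11069 merged:

```
// Indicative sketch of partition methods keyed by PartitionSpec; not a verbatim
// copy of the API merged in #11069.
trait PartitionSupport {
  // A partition is identified by its column-value assignments,
  // e.g. Map("dt" -> "2016-02-04", "country" -> "US").
  type PartitionSpec = Map[String, String]

  def createPartitions(db: String, table: String, parts: Seq[TablePartition],
      ignoreIfExists: Boolean): Unit
  def dropPartitions(db: String, table: String, specs: Seq[PartitionSpec],
      ignoreIfNotExists: Boolean): Unit
  def getPartition(db: String, table: String, spec: PartitionSpec): TablePartition
  def alterPartition(db: String, table: String, spec: PartitionSpec,
      newPart: TablePartition): Unit
}
```
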
asfgit pushed a commit that referenced this pull request on Feb 21, 2016:
## What changes were proposed in this pull request?

This is a step towards merging `SQLContext` and `HiveContext`. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation.

*Where should I start reviewing?* The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor.

*Why is this patch so big?* I had to refactor `HiveClient` to remove an intermediate representation of databases, tables, partitions, etc. After this refactor, `CatalogTable` converts directly to and from `HiveTable` (etc.). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to `HiveTable`, which is messy.

The new class hierarchy is as follows:
```
org.apache.spark.sql.catalyst.catalog.Catalog
  - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
  - org.apache.spark.sql.hive.HiveCatalog
```

Note that, as of this patch, none of these classes is used anywhere yet. This will come before the Spark 2.0 release.

## How was this patch tested?
All existing unit tests, plus a new `HiveCatalogSuite` that extends `CatalogTestCases`.

Author: Andrew Or <andrew@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #11293 from rxin/hive-catalog.
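
The shared test suite mentioned here follows the usual abstract-suite pattern; a minimal sketch, assuming a factory method named newEmptyCatalog and a HiveClient-backed HiveCatalog constructor (both assumptions):

```
import org.apache.spark.SparkFunSuite

// Minimal sketch of the shared-suite pattern; the factory method name and the
// HiveCatalog constructor argument are assumptions, not the merged code.
abstract class CatalogTestCases extends SparkFunSuite {
  protected def newEmptyCatalog(): Catalog

  test("a freshly created catalog starts out empty") {
    val catalog = newEmptyCatalog()
    // ... assertions shared by every Catalog implementation ...
  }
}

class InMemoryCatalogSuite extends CatalogTestCases {
  override protected def newEmptyCatalog(): Catalog = new InMemoryCatalog
}

class HiveCatalogSuite extends CatalogTestCases {
  override protected def newEmptyCatalog(): Catalog =
    new HiveCatalog(/* a HiveClient backed by a local metastore */ ???)
}
```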