
[SPARK-24252][SQL] Add TableCatalog API #24246

Closed
rdblue wants to merge 14 commits into apache:master from rdblue:SPARK-24252-add-table-catalog-api

Conversation

@rdblue (Contributor) commented Mar 30, 2019

What changes were proposed in this pull request?

This adds the TableCatalog API proposed in the Table Metadata API SPIP.

For TableCatalog to use Table, it needed to be moved into the catalyst module where the v2 catalog API is located. This also required moving TableCapability. Most of the files touched by this PR are import changes needed by this move.
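
For orientation, here is a condensed sketch of the interface shape discussed in the review threads below. The signatures are reconstructed from the diff excerpts quoted in this conversation; imports and javadoc are elided, the alterTable signature is an assumption (it is not quoted here), and the merged code is authoritative.

public interface TableCatalog {
  Table loadTable(Identifier ident) throws NoSuchTableException;

  Table createTable(
      Identifier ident,
      StructType schema,
      Transform[] partitions,
      Map<String, String> properties) throws TableAlreadyExistsException;

  // assumed varargs shape; the exact signature is not quoted in this thread
  Table alterTable(Identifier ident, TableChange... changes) throws NoSuchTableException;

  boolean dropTable(Identifier ident);

  default void invalidateTable(Identifier ident) {
    // no-op by default; catalogs that cache metadata can override this
  }
}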

How was this patch tested?

This adds a test implementation and contract tests.

@rdblue (Contributor Author) commented Mar 30, 2019

@mccheah, @cloud-fan, here's the TableCatalog API.

@SparkQA commented Mar 30, 2019

Test build #104101 has finished for PR 24246 at commit 1773230.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@rdblue force-pushed the SPARK-24252-add-table-catalog-api branch from 1773230 to 1c2b2cd on April 1, 2019 23:56
@SparkQA commented Apr 2, 2019

Test build #104175 has finished for PR 24246 at commit 1c2b2cd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor Author) commented Apr 2, 2019

@cloud-fan, please also start reviewing the last commit on this PR so that we can get it in fairly quickly after #24117 is in.

/**
* Refresh table metadata for {@link Identifier identifier}.
* <p>
* If the table is already loaded or cached, drop cached data and reload it. If the table is not
Member

Not sure if this is in the scope of this PR, but I was wondering if we could make the metadata/schema for a table uncacheable by default. For example, in the current V1 code, if I reference a relation in the Hive catalog, the session will cache the metadata associated with the table. This means that I can't update a user's view of the in-catalog table without a manual user refresh.

I was hoping we could signal via the catalog API that a table's metadata is dynamic and should not be cached.

Contributor Author

Right now, we've implemented caching in our catalogs. Spark always calls load and uses the version that is returned instead of caching. Our catalog then uses a weak reference in the cache so that the version is periodically updated when no more queries are running against that version.

I'm not sure whether Spark should have some caching above the catalog or if the catalog should be responsible for this. Let's plan on talking about this at the next sync.
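
To illustrate the strategy described above, here is a generic sketch of a weak-reference metadata cache (this is not the catalog implementation being referenced; the class and method names are made up for illustration):

import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Generic sketch: a loaded table version stays stable while queries still hold a
// reference to it, and is reloaded after it is no longer referenced and collected.
class WeakTableCache<I, T> {
  private final Map<I, WeakReference<T>> cache = new ConcurrentHashMap<>();

  T getOrLoad(I ident, Function<I, T> load) {
    WeakReference<T> ref = cache.get(ident);
    T cached = (ref == null) ? null : ref.get();
    if (cached != null) {
      return cached; // current version, still referenced by running queries
    }
    T fresh = load.apply(ident); // e.g. a call to the underlying metastore
    cache.put(ident, new WeakReference<>(fresh));
    return fresh;
  }

  void invalidate(I ident) {
    cache.remove(ident); // drop cached metadata for a single table
  }
}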

Member

sgtm

@mccheah (Contributor) commented Apr 11, 2019

#24117 merged, this should be rebased

@rdblue force-pushed the SPARK-24252-add-table-catalog-api branch 2 times, most recently from e589ebb to 9aebe0f on April 11, 2019 18:53
@rdblue changed the title from [SPARK-24252][SQL] Add TableCatalog API (WIP) to [SPARK-24252][SQL] Add TableCatalog API on Apr 11, 2019
@@ -47,23 +47,23 @@
* <p>
* Truncating a table removes all existing rows.
* <p>
* See {@link org.apache.spark.sql.sources.v2.writer.SupportsTruncate}.
Contributor Author

We should consider moving the v2 source API into the catalyst module as well. That would fix these links, but it is a much larger issue. I'll plan on bringing it up at the next v2 sync.

Contributor

+1 to move everything to catalyst

@@ -96,7 +96,7 @@ public Transform bucket(int numBuckets, String... columns) {
* @param column an input column
* @return a logical identity transform with name "identity"
*/
public Transform identity(String column) {
public static Transform identity(String column) {
Contributor Author

Fixing these methods that should have been static is required here: https://github.com/apache/spark/pull/24246/files#diff-d921fee79b4e63ab684dec9d10afeb44R63.

@rdblue (Contributor Author) commented Apr 11, 2019

@mccheah and @cloud-fan, this is now rebased on master and is ready for a review. Thanks!

@SparkQA commented Apr 11, 2019

Test build #104526 has finished for PR 24246 at commit 9aebe0f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 12, 2019

Test build #104532 has finished for PR 24246 at commit cd19a62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mccheah (Contributor) left a comment

More stuff incoming

Table createTable(Identifier ident,
StructType schema,
Transform[] partitions,
Map<String, String> properties) throws TableAlreadyExistsException;
Contributor

If we move properties before partitions, we can make partitions a Transform... vararg. Might be something to consider, but I don't feel too strongly about it.

Contributor Author

This is called from Spark, so I don't think it is valuable to change it to a Transform... vararg. Either way, the implementation gets a Transform[].

* @param newDataType the new data type
* @return a TableChange for the update
*/
static TableChange updateColumn(String[] fieldNames, DataType newDataType, boolean isNullable) {
Contributor

For all of these methods, do we want to provide variants where a single field name can be updated without having to pass in an array? E.g. static TableChange updateColumn(String fieldName, DataType newDataType, boolean isNullable) {...

Contributor

I think it's better for UX, but we will have too many overloads. How about we create a class ColumnBuilder?

case class Column(nameParts: Array[String], dt: DataType, nullable: Boolean, comment: Option[String])

class ColumnBuilder {
  def withName(name: String)
  def withNameParts(nameParts: Array[String])
  def withNullable(nullable: Boolean)
  ...
}

And in TableChange we just need to declare Column in the methods.

Contributor Author

@mccheah, I don't see much value in adding overloads here. This is called from Spark to create changes to pass to an implementation, so I think it would be better to have more consistency than to have a slightly friendlier call in a few places.

@cloud-fan, these are factory methods for specific changes that are called by Spark. The classes themselves encapsulate the data for the change, so it would be redundant to add a Column class: UpdateColumn(Column(fieldNames, newDataType, isNullable)) instead of UpdateColumn(fieldNames, newDataType, isNullable).

Also, it isn't correct to pass extra information that wasn't in the ALTER TABLE statement (or other APIs) into the source. See https://github.com/apache/spark/pull/24246/files#r276865657

@@ -183,17 +149,10 @@ private[sql] final case class LiteralValue[T](value: T, dataType: DataType) exte
}

private[sql] final case class FieldReference(parts: Seq[String]) extends NamedReference {
import org.apache.spark.sql.catalog.v2.CatalogV2Implicits._
Contributor

Do we only need CatalogV2Implicits for quoted?

Contributor Author

Yes.

Contributor

I don't think we should import everything from CatalogV2Implicits then. Importing implicits can invoke unexpected calls to said implicit methods.

Contributor Author

Can you be more specific about what the risk is?

Contributor

My understanding of Scala implicits is that the implicit conversions or methods are exactly that - implicit - so they can be called without specifically invoking the method names. So a common philosophy around implicits is to only import them if such conversions are definitely required. Otherwise we can create side effects of calling the implicit methods when we don't mean to.

Contributor Author

This makes all of the implicit conversions available, but does not apply them unless it is necessary. For these implicit classes, the only time it is necessary is when one of the methods, like quoted, is called. (The compiler would also implicitly convert when an object is coerced to the implicit type, but there are no methods that require an IdentifierHelper.)

It is fine to import unused implicits. The cases where the compiler will use them are predictable, and they can be used to provide a richer Scala API for classes that are limited because they are part of the public API.

I'm also wary of using Scala's implicits, but the ability to add methods to existing classes is one of the cases where I find them reliable and useful.

Contributor

I still think that here we should only be importing the specific method that's used - maybe move quoted to a helper class. If we really need the implicits functionality we can import it later on. Can we do that?

Contributor Author

I can import just the required implicit class, but I don't think there is any risk to using implicits like this.

Contributor Author

I've updated this.

@cloud-fan (Contributor)

shall we open a PR to move DS v2 classes from sql/core to sql/catalyst first? This PR is easier to review without those import changes.


* @return the table's metadata
* @throws NoSuchTableException If the table doesn't exist.
*/
Table loadTable(Identifier ident) throws NoSuchTableException;
Contributor

In ExternalCatalog, we have getTable and loadTable. It's a little weird to see loadTable here which is actually the same as ExternalCatalog.getTable.

Contributor

Also, the word "load" implies that the table will be loaded somewhere. Will we cache the table metadata somewhere inside Spark?

Contributor Author

The table is loaded so Spark can use it. This usage is not unusual for the word "load".

The name getTable is not appropriate here because in the Java world, "get" signals that the operation is merely returning something that already exists, like an instance field. This is expected to make a remote call to fetch information from another store, so it is not an appropriate use of "get".

I think that the fact that this takes an identifier and returns a Table makes the operation clear. If you have a better verb for this (that isn't "get"), I'd consider it. I can't think of anything that works as well as loadTable.

Contributor

will we support the Hive LOAD TABLE in the future?

Contributor Author

@cloud-fan, LOAD TABLE is a SQL operation, not a catalog operation.

LOAD TABLE can be implemented as a write to some Table. If we wanted to support a path to load compatible data files directly into a table as an optimization, then we would need to build an API for that. Such an API would be part of a source API and not a catalog API.


* @param newComment the new comment
* @return a TableChange for the update
*/
static TableChange updateComment(String[] fieldNames, String newComment) {
Contributor

Why do we need a new entity for updating a column comment? Can we do it via UpdateColumn?

Contributor Author

We don't want to pass data from the source back into the source. For example, if a user runs ALTER TABLE t ALTER COLUMN c COMMENT 'some doc', why would we pass the type of column c back to the source? If Spark passed the type, then the source would need to detect what changed (the comment) and what didn't (the type) even though Spark already had that information.

It is also not correct to pass the type in when the user only requested a change to the comment. If the table has changed since the source loaded it, passing the type could tell the source to revert another change. Regardless of whether this is likely to happen, it would be a correctness bug.

Contributor

Sounds reasonable. Shall we do it for all the column fields, like changing the column data type only, or changing column nullability only? Then we don't need the general UpdateColumn.

Contributor Author

@cloud-fan, no. The type of a field is both the DataType and nullability in SQL, which is why types are expressed as data_type [NOT NULL].

For example:

ALTER TABLE t ALTER COLUMN c COMMENT 'doc'
ALTER TABLE t ALTER COLUMN i TYPE bigint
ALTER TABLE t ALTER COLUMN f TYPE double NOT NULL
ALTER TABLE t ALTER COLUMN d TYPE decimal(18,2) NOT NULL COMMENT 'doc'

Contributor

To confirm I understand your statement, are you saying

  1. Data type and nullability together form the field type; although users can alter one of them via a SQL command, the catalog should always update them together.
  2. Comment is a separate property and can be updated separately.

If this is true, I think we should have 3 methods: updateType, updateComment and updateColumn.

@rdblue (Contributor Author) commented Apr 22, 2019

We want to keep it simple and avoid creating a class to pass every combination of changes. If a user makes multiple changes at the same time, they are all passed to alterTable. That avoids the need to create combinations of these changes, like updateColumn that combines updating the comment and updating the type.
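
As an illustration, assuming a varargs alterTable(Identifier, TableChange...) method (its exact signature is not quoted in this thread), the decimal example above would reach the source as a single call carrying two independent changes:

// Hypothetical illustration of:
//   ALTER TABLE t ALTER COLUMN d TYPE decimal(18,2) NOT NULL COMMENT 'doc'
// catalog and identifier are assumed to be in scope.
Table updated = catalog.alterTable(
    identifier,
    TableChange.updateColumn(new String[] {"d"}, DataTypes.createDecimalType(18, 2), false),
    TableChange.updateComment(new String[] {"d"}, "doc"));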

* )
* </pre>
*/
public interface TableChange {
Contributor

do we need APIs to manage namespaces?

Contributor Author

I'll open a PR for this, but as we discussed in the sync, it is a separate feature and a separate PR.

Contributor Author

I added a PR for this: #24560

* @param comment the new field's comment string
* @return a TableChange for the addition
*/
static TableChange addColumn(
Contributor

Since we decided to drop the overloads in the first version, shall we also remove the overloads for addColumn? Then it's consistent with the API design of createTable, and we can change them together in the future if we agree to.

Contributor Author

These are different cases. addColumn overloads are to ensure that consistent defaults are passed to sources and these are implemented in Spark, not in sources. The createTable overloads allowed implementations to accept those calls directly without using defaults, but added to the number of methods that implementations could possibly implement. That second part was the motivation for removing the createTable methods, which doesn't apply here.
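
A sketch of the delegation described above (the defaults shown are assumptions for illustration; see the actual TableChange factory methods for the real behavior):

// Inside TableChange (sketch): shorter overloads delegate to the full form so that
// sources always receive one canonical AddColumn shape with consistent defaults.
static TableChange addColumn(String[] fieldNames, DataType dataType) {
  return addColumn(fieldNames, dataType, true, null); // assumed defaults: nullable, no comment
}

static TableChange addColumn(String[] fieldNames, DataType dataType, boolean isNullable) {
  return addColumn(fieldNames, dataType, isNullable, null);
}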

Contributor

i see, makes sense.

* @param newDataType the new data type
* @return a TableChange for the update
*/
static TableChange updateColumn(String[] fieldNames, DataType newDataType) {
Contributor

It creates an UpdateColumnType now; shall we rename this method to updateType?

Contributor Author

If others agree, I can do this. I don't see a compelling reason to do it so I'll wait for other people to add their opinions.

Member

I think it's fine as is (since there are no other ways to "update" a column); if you were going to change it, I would go with updateColumnType anyway.

Contributor

since there are no other ways to "update" a column

There is an updateComment method below, which updates the column comment. Shall we name these two methods coherently?

@cloud-fan (Contributor)

There is one thing worth discussing: should TableCatalog be case sensitive or not during table lookup?

By default, Spark is case insensitive and case preserving. There is one config to change the behavior, but it's internal and we don't expect users to set it. I think it's simpler to ask TableCatalog implementations to be case insensitive and case preserving, instead of taking a case-sensitivity flag.

Different ideas are welcome!

*
* @param ident a table identifier
*/
default void invalidateTable(Identifier ident) {
Contributor

What's the use case of this method? I can only think of one: when end users run REFRESH TABLE. Are there any more use cases?

Contributor Author

This will be called by REFRESH TABLE to invalidate cached data. This was previously named refreshTable, but you preferred a void return type. In that case, invalidateTable describes what the method does better than refresh.

Member

How about invalidateTableCache? I think that would be more straightforward.

Contributor Author

That is a misleading name. This invalidates a single table's cached data, not the entire cache.

Member

The method accepts one parameter of Identifier type, which should not be misleading.
Just a suggestion; overall I don't have a strong opinion on this.

Member

One quick question. This sounds like it only invalidates cached table metadata and does not touch the query cache. But Catalog.refreshTable, which is used by the v1 REFRESH TABLE command, also drops the InMemoryRelation if the table is cached in Spark's query cache.

Is the difference intentional?

Contributor

The table cache is managed by Spark, so the API here cannot affect the table cache inside Spark.

@sunchao (Member) commented Oct 28, 2020

@cloud-fan I think even though the table cache is managed by Spark, currently the behavior is different between v1 and v2 (which doesn't invalidate the table cache). When a cached temp view is created, such as with CACHE TABLE ... AS SELECT from a v2 table, and REFRESH TABLE is later called on the v2 table, the cache won't get invalidated. Is this something we should resolve?

Contributor

Yes, we should add these kinds of missing actions to the v2 commands as well. I think DROP TABLE has the same issue.

@rdblue (Contributor Author) commented May 6, 2019

@cloud-fan, I don't think sources need to be case insensitive. Spark should assume that sources are case sensitive and apply its case sensitivity setting.

There are two main cases, table identifiers and column identifiers:

  • When identifying tables, Spark can only pass on the identifier provided by the user. It cannot control whether the underlying catalog is case sensitive or not. For example, if a catalog contains both a.b and A.B, then it would be ambiguous for Spark to require case insensitive matching, and a case sensitive catalog would necessarily break that assumption by returning the one matching case. So there's no point enforcing case sensitivity rules when identifying tables.
  • When identifying columns, Spark knows the table schema that is reported and should apply its case sensitivity rules before calling alterTable. If Spark relied on the catalog (or the table, for scan projection) to enforce case sensitivity settings, then implementations would inevitably get it wrong. Instead, Spark can resolve the name the user provided (e.g., aBc) against the table definition (e.g., abc) and pass an identifier that matches the table definition, so that case sensitivity in the implementation doesn't matter (see the sketch below).
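
A rough sketch of the column-name resolution described in the second point (a hypothetical helper, not Spark's actual resolution code):

import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper: resolve a user-provided column name against the table's
// reported schema using Spark's case sensitivity setting, and return the exact
// casing from the table definition to pass on to the catalog.
class ColumnResolver {
  static String resolve(StructType schema, String userName, boolean caseSensitive) {
    for (StructField field : schema.fields()) {
      boolean matches = caseSensitive
          ? field.name().equals(userName)
          : field.name().equalsIgnoreCase(userName);
      if (matches) {
        return field.name(); // e.g. the user wrote aBc, the table defines abc -> pass abc
      }
    }
    throw new IllegalArgumentException("No such column: " + userName);
  }
}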

@gatorsmile, any additional comments on case sensitivity?

@cloud-fan (Contributor)

Do you mean end users can no longer look up tables from the catalog in a case insensitive style? E.g. SELECT * FROM mY_tAblE fails if there is a my_table in the current catalog? That would be a breaking change, since Spark is case insensitive by default.

@rdblue (Contributor Author) commented May 7, 2019

@cloud-fan, it depends on the catalog. The default catalog will be case insensitive and will not break compatibility. But you can't force a catalog to be case insensitive because they are external systems.

How would you make a case sensitive path-based table case insensitive?

@cloud-fan (Contributor)

you can't force a catalog to be case insensitive because they are external systems.

That's a good point. I took another look at ExternalCatalog. Spark always lower-cases the identifiers before querying catalogs in case-insensitive mode, so ExternalCatalog can be case sensitive.

Shall we add some doc to TableCatalog and say it should be case sensitive?

@rdblue (Contributor Author) commented May 7, 2019

What is the value of requiring a catalog to be case sensitive?

If a user requests mY_TAblE and the catalog returns my_table that's fine. The catalog considers those the same object.

@cloud-fan (Contributor) commented May 7, 2019

Do you mean the case sensitivity behavior should be decided entirely by the catalog implementation itself? If that's the case, I think at least the catalog implementation should inform Spark whether it's case sensitive or not.

@rdblue (Contributor Author) commented May 7, 2019

@cloud-fan, why? When identifying tables, what would Spark do differently depending on whether the catalog is case sensitive or not?

Like I said, Spark can only pass the identifier from the query and rely on the catalog to either find a table or not.

@rdblue (Contributor Author) commented May 7, 2019

I followed up with @cloud-fan and we came to an agreement on case sensitivity:

  • Spark should pass table identifiers in the original case
  • Spark should normalize column identifiers to the case used by the table schema
  • Will add javadoc to explain this behavior (done)
  • We will discuss a method to inspect catalog case sensitivity in the next sync (not a blocker for this PR)

@rdblue (Contributor Author) commented May 7, 2019

@gatorsmile, any other comments on case sensitivity?

@cloud-fan (Contributor) left a comment

LGTM if tests pass

/**
* Catalog methods for working with Tables.
* <p>
* TableCatalog implementations may be case sensitive or case insensitive. Spark will pass
Contributor

It may be worth adding a method to report this information. We can think about it later.

@gatorsmile (Member)

LGTM from my side, but I would suggest inviting the other committers @rxin @marmbrus @brkyvz @ueshin @zsxwing to review this PR too, even after it is merged. We can still revisit the decisions we made here to ensure all the committers are fine with the APIs we added.

@rdblue (Contributor Author) commented May 7, 2019

Thanks, Xiao! I agree, we can still revisit the decisions in this PR after further review.

@gengliangwang (Member) left a comment

LGTM except for one minor comment

@SparkQA commented May 7, 2019

Test build #105224 has finished for PR 24246 at commit 12796fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan closed this in 303ee3f on May 8, 2019
@rdblue (Contributor Author) commented May 8, 2019

Thanks @cloud-fan!

mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019
## What changes were proposed in this pull request?

This adds the TableCatalog API proposed in the [Table Metadata API SPIP](https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d).

For `TableCatalog` to use `Table`, it needed to be moved into the catalyst module where the v2 catalog API is located. This also required moving `TableCapability`. Most of the files touched by this PR are import changes needed by this move.

## How was this patch tested?

This adds a test implementation and contract tests.

Closes apache#24246 from rdblue/SPARK-24252-add-table-catalog-api.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
mccheah pushed a commit to palantir/spark that referenced this pull request Jun 6, 2019
Currently we are in a strange state: some data source v2 interfaces (catalog related) are in sql/catalyst, while others (Table, ScanBuilder, DataReader, etc.) are in sql/core.

I don't see a reason to keep data source v2 API in 2 modules. If we should pick one module, I think sql/catalyst is the one to go.

Catalyst module already has some user-facing stuff like DataType, Row, etc. And we have to update `Analyzer` and `SessionCatalog` to support the new catalog plugin, which needs to be in the catalyst module.

This PR can solve the problem we have in apache#24246

existing tests

Closes apache#24416 from cloud-fan/move.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019
* @param ident a table identifier
* @return true if a table was deleted, false if no table exists for the identifier
*/
boolean dropTable(Identifier ident);

Hi @rdblue, dropTable() here does not support specifying purge or not. Would you please share your thinking about why it is designed like that?

Contributor Author

Purge is an option specific to some sources, like Hive. If you drop a table in most JDBC databases, you don't have an option to keep the data around somewhere. If an option can't be satisfied by sources, then it isn't a good candidate to be in the API.

Even in Hive, purge is optional because managed tables will drop data automatically, but external tables will not. Using table configuration for sources that support "external" data is a better option, I think.
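
For example, a source could express retention behavior through table properties at create time, rather than through a purge flag on dropTable (the property names below are hypothetical, not part of the API):

// Hypothetical property names; a Hive-like source could treat "external" tables as
// keep-data-on-drop without a purge flag in TableCatalog itself.
// catalog, identifier, and schema are assumed to be in scope.
Map<String, String> properties = new HashMap<>();
properties.put("external", "true");
properties.put("location", "s3://bucket/path/to/table");
Table created = catalog.createTable(identifier, schema, new Transform[0], properties);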

Contributor

For now, maybe we can fail the analysis if PURGE is specified but we are dropping a v2 table. @waterlx, would you like to send a PR to do it?
