
[SPARK-19148][SQL] do not expose the external table concept in Catalog #16528

Closed
wants to merge 3 commits

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

In #16296, we reached a consensus that we should hide the external/managed table concept from users and only expose a custom table path.

This PR renames Catalog.createExternalTable to createTable (the old versions are kept for backward compatibility), and only sets the table type to EXTERNAL if a path is specified in the options.

How was this patch tested?

new tests in CatalogSuite
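The new rule can be sketched as follows. This is a minimal Python illustration of the decision this PR introduces, not the actual Scala implementation: with no public API to set the table type, a user-supplied "path" option is what makes a table external.

```python
MANAGED, EXTERNAL = "MANAGED", "EXTERNAL"

def table_type(options):
    # No explicit table-type flag is exposed; a custom "path" option
    # is the only thing that makes the created table external.
    return EXTERNAL if "path" in options else MANAGED

print(table_type({"path": "/data/events"}))  # EXTERNAL
print(table_type({}))                        # MANAGED
```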

"""
}.mkString
})
foldFunctions = _.map { funCall =>
Contributor Author

minor code cleanup, not related to this PR.

Contributor

How about we revert this change?

case fs: HadoopFsRelation =>
  if (table.tableType == CatalogTableType.EXTERNAL && fs.location.rootPaths.isEmpty) {
    throw new AnalysisException(
      "Cannot create a file-based external data source table without path")
Contributor Author

We will never hit this branch after this PR: there is no public API to set the table type, and users can only set a custom table path, so we will never create an external table without a path.
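The unreachability argument can be checked mechanically. This is a hedged Python sketch (illustrative names, not Spark internals): since EXTERNAL is derived from the presence of a path, every EXTERNAL table carries a non-empty path, so the guarded error above cannot fire.

```python
def resolve(options):
    # Mirror of the new behavior: table type is derived from "path".
    table_type = "EXTERNAL" if "path" in options else "MANAGED"
    root_paths = [options["path"]] if "path" in options else []
    return table_type, root_paths

# The error condition (EXTERNAL with empty rootPaths) never occurs:
for opts in ({}, {"path": "/warehouse/t"}):
    t, paths = resolve(opts)
    assert not (t == "EXTERNAL" and not paths)
```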

@cloud-fan
Contributor Author

cc @yhuai @rxin @gatorsmile

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71123 has finished for PR 16528 at commit f06ed16.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71125 has finished for PR 16528 at commit b0c252a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71128 has finished for PR 16528 at commit b0c252a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please


/**
 * :: Experimental ::
 * Creates a table from the given path and returns the corresponding DataFrame.
Member

Maybe we can explain here that the data files will not be dropped when the table is dropped?

Contributor Author

We already documented this in the programming guide. If we want to explain the custom path here, there are a lot of similar places where we would need to add comments too, so I'd like to leave it as it was.

@gatorsmile
Member

createTableHeader ('(' colTypeList ')')? tableProvider
  (OPTIONS options=tablePropertyList)?
  (PARTITIONED BY partitionColumnNames=identifierList)?
  bucketSpec? locationSpec?
  (COMMENT comment=STRING)?
  (AS? query)?                                                   #createTable

  def createTable(
      tableName: String,
      source: String,
      schema: StructType,
      options: Map[String, String]): DataFrame

If we compare the above two interfaces, the Catalog's createTable API can only create a non-partitioned table. How about adding a new createTable API that lets users specify the partitioning and bucketing info?
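One possible shape for such an extended API is sketched below. This is purely hypothetical (parameter names and the returned spec are illustrative, not the API this PR or any follow-up actually adds); it only shows how partitioning and bucketing info could ride alongside the existing four arguments.

```python
def create_table(table_name, source, schema, options,
                 partition_columns=None, bucket_columns=None, num_buckets=None):
    # Collect the full table definition, including the partitioning and
    # bucketing info that the current createTable API cannot express.
    return {
        "name": table_name,
        "source": source,
        "schema": schema,
        "options": dict(options),
        "partitionBy": list(partition_columns or []),
        "bucketBy": (list(bucket_columns or []), num_buckets),
    }

spec = create_table("events", "parquet", "a INT, b STRING",
                    {"path": "/tmp/events"}, partition_columns=["a"])
print(spec["partitionBy"])  # ['a']
```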

@gatorsmile
Member

We also need to update the Python interface. See the code

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71143 has finished for PR 16528 at commit b0c252a.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

adding a new createTable API for users to specify the partitioning info and bucketing info

I'd like to do it in follow-ups.

I'll update the Python API too.

@SparkQA

SparkQA commented Jan 11, 2017

Test build #71175 has finished for PR 16528 at commit f1f75ed.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 11, 2017

Test build #71191 has started for PR 16528 at commit 0d1baf1.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Jan 11, 2017

Test build #71198 has finished for PR 16528 at commit 0d1baf1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 11, 2017

Test build #71214 has finished for PR 16528 at commit 8f4e86e.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 12, 2017

Test build #71238 has finished for PR 16528 at commit 3b59221.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71292 has finished for PR 16528 at commit 2e1d378.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71316 has finished for PR 16528 at commit d5b3b4f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71325 has finished for PR 16528 at commit 55cf0c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

warnings.warn(
    "createExternalTable is deprecated since Spark 2.2, please use createTable instead.",
    DeprecationWarning)
return self.createTable(tableName, path, source, schema, **options)
Member

**options -> options?

Contributor Author

It's Python syntax for unpacking keyword arguments, like what we do in Scala: func(options: _*)

Member

Yeah. Got it. I also manually tried it in pyspark. It works fine.
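The exchange above can be demonstrated with a toy example (illustrative function, not the PySpark one): `**options` unpacks a dict into keyword arguments, much like Scala's `func(options: _*)` spreads a sequence into varargs.

```python
def create_table(name, **options):
    # options arrives inside the function as a plain dict
    # of whatever keyword arguments the caller passed.
    return name, options

opts = {"path": "/tmp/t", "header": "true"}
# Unpacking the dict is equivalent to spelling the keywords out:
assert create_table("t", **opts) == create_table("t", path="/tmp/t", header="true")
```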

@gatorsmile
Member

LGTM

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented Jan 16, 2017

Test build #71408 has finished for PR 16528 at commit 55cf0c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

cc @yhuai for final sign-off

@yhuai
Contributor

yhuai commented Jan 16, 2017

looks good to me. If possible, I'd like to get the code mentioned by https://github.com/apache/spark/pull/16528/files#r96314156 reverted.

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71469 has finished for PR 16528 at commit 318dc04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks for the review, merging to master!

@asfgit asfgit closed this in 18ee55d Jan 17, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16528 from cloud-fan/create-table.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Apr 1, 2017
…t in Catalog

### What changes were proposed in this pull request?
After renaming `Catalog.createExternalTable` to `createTable` in PR apache#16528, we also need to deprecate the corresponding functions in `SQLContext`.

### How was this patch tested?
N/A

Author: Xiao Li <gatorsmile@gmail.com>

Closes apache#17502 from gatorsmile/deprecateCreateExternalTable.