
[SPARK-18949] [SQL] Add recoverPartitions API to Catalog #16356

Closed
wants to merge 6 commits

Conversation

gatorsmile
Member

@gatorsmile gatorsmile commented Dec 20, 2016

What changes were proposed in this pull request?

Currently, we only have a SQL interface for recovering all the partitions in a table's directory and updating the catalog: `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, `MSCK` is very hard for me to remember, and I have no clue what it means.)

After the new "Scalable Partition Handling" work, table repair becomes much more important: it is what makes existing data visible in a newly created partitioned data source table.

Thus, this PR adds it to the `Catalog` interface. After this PR, users can repair a table with:

```Scala
spark.catalog.recoverPartitions("testTable")
```

How was this patch tested?

Modified the existing test cases.

@gatorsmile
Member Author

cc @rxin @cloud-fan @ericl

@rxin
Contributor

rxin commented Dec 20, 2016

What is the SQL equivalent command? MSCK? Should we match that?

@gatorsmile
Member Author

We have two SQL equivalent commands:

  • ALTER TABLE table RECOVER PARTITIONS;
  • MSCK REPAIR TABLE table;

I am not good at naming. How about recoverPartitions?

@SparkQA

SparkQA commented Dec 20, 2016

Test build #70416 has finished for PR 16356 at commit 1f71236.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Dec 20, 2016

Yea, recoverPartitions sounds a lot better.

@rxin
Contributor

rxin commented Dec 20, 2016

We should also add the Python API.

@gatorsmile
Member Author

Sure, will do. Thanks!
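For reference, the Python API discussed here would presumably be a thin wrapper that delegates to the JVM-side catalog through Py4J, along these lines (a sketch, not the merged Spark code; the `since` helper below is a simplified stand-in for PySpark's decorator):

```python
# Sketch of a Python Catalog.recoverPartitions wrapper (assumed shape, not
# the merged Spark code). `since` is a simplified stand-in for pyspark's
# decorator, which records the version a method was added in its docstring.

def since(version):
    """Append a ".. versionadded::" note to the decorated function's docstring."""
    def deco(f):
        f.__doc__ = (f.__doc__ or "") + "\n.. versionadded:: %s\n" % version
        return f
    return deco

class Catalog(object):
    def __init__(self, jcatalog):
        # In real PySpark this would be the Py4J handle to the JVM Catalog.
        self._jcatalog = jcatalog

    @since("2.1.1")
    def recoverPartitions(self, tableName):
        """Recover all the partitions of the given table and update the catalog."""
        self._jcatalog.recoverPartitions(tableName)
```

The interesting part is only the one-line delegation; partition discovery itself happens on the JVM side, just as `refreshTable` delegates in the diff excerpts later in this thread.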

@gatorsmile gatorsmile changed the title [SPARK-18949] [SQL] Add repairTable API to Catalog [SPARK-18949] [SQL] Add recoverPartitions API to Catalog Dec 20, 2016
@rxin
Contributor

rxin commented Dec 20, 2016

We can also merge this into branch-2.1, so let's use 2.1.1 as the since version.

-ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.util.sketch.CountMinSketch.toByteArray")
+ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.util.sketch.CountMinSketch.toByteArray"),
+// [SPARK-18949] [SQL] Add repairTable API to Catalog
+ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.sql.catalog.Catalog.recoverPartitions")
Member Author

When backporting this PR to 2.1.1, we might need to move this to the next section 2.1.x.

@SparkQA

SparkQA commented Dec 20, 2016

Test build #70422 has finished for PR 16356 at commit 494da5f.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Dec 20, 2016

LGTM pending tests.

@@ -258,6 +258,11 @@ def refreshTable(self, tableName):
"""Invalidate and refresh all the cached metadata of the given table."""
self._jcatalog.refreshTable(tableName)

@since(2.1.1)
Member Author

Will change it to 2.1; using 2.1.1 breaks the doc build. If needed, I can investigate further how to make it work for 2.1.1.

@@ -258,6 +258,11 @@ def refreshTable(self, tableName):
"""Invalidate and refresh all the cached metadata of the given table."""
self._jcatalog.refreshTable(tableName)

@since(2.1)
Contributor

You can do "2.1.1" as a string.

Member Author

I see. Thanks!
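A likely reason the bare 2.1.1 broke the build (the thread doesn't spell it out, so this is an inference): `2.1.1` is not a valid Python numeric literal, so a module containing `@since(2.1.1)` fails to parse at all, while a float like `2.1` or the string `"2.1.1"` is fine. A quick check:

```python
# Check which @since(...) argument forms are even parseable Python.
# `2.1` is a float literal and `"2.1.1"` is a string, but `2.1.1` is not a
# valid literal and raises SyntaxError at compile time.
import ast

def parses(src):
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False

print(parses("since(2.1)"))      # True: float literal
print(parses("since(2.1.1)"))    # False: invalid numeric literal
print(parses('since("2.1.1")'))  # True: version passed as a string
```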

@SparkQA

SparkQA commented Dec 20, 2016

Test build #70418 has finished for PR 16356 at commit fb26533.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 21, 2016

Test build #70426 has finished for PR 16356 at commit 451ab05.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 21, 2016

Test build #70423 has finished for PR 16356 at commit 7cb5e3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 21, 2016

Test build #3511 has finished for PR 16356 at commit 451ab05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Dec 21, 2016

Merging in master/branch-2.1.

@rxin
Contributor

rxin commented Dec 21, 2016

Can you send a PR for branch-2.1?

@asfgit asfgit closed this in 24c0c94 Dec 21, 2016
@gatorsmile
Member Author

Sure, let me do it now.

asfgit pushed a commit that referenced this pull request Dec 21, 2016
### What changes were proposed in this pull request?

This PR is to backport #16356 to the Spark 2.1.1 branch.

----

Currently, we only have a SQL interface for recovering all the partitions in a table's directory and updating the catalog: `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, `MSCK` is very hard for me to remember, and I have no clue what it means.)

After the new "Scalable Partition Handling" work, table repair becomes much more important: it is what makes existing data visible in a newly created partitioned data source table.

Thus, this PR adds it to the `Catalog` interface. After this PR, users can repair a table with:
```Scala
spark.catalog.recoverPartitions("testTable")
```

### How was this patch tested?
Modified the existing test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16372 from gatorsmile/repairTable2.1.1.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
### What changes were proposed in this pull request?

Currently, we only have a SQL interface for recovering all the partitions in a table's directory and updating the catalog: `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, `MSCK` is very hard for me to remember, and I have no clue what it means.)

After the new "Scalable Partition Handling" work, table repair becomes much more important: it is what makes existing data visible in a newly created partitioned data source table.

Thus, this PR adds it to the `Catalog` interface. After this PR, users can repair a table with:
```Scala
spark.catalog.recoverPartitions("testTable")
```

### How was this patch tested?
Modified the existing test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#16356 from gatorsmile/repairTable.