[SPARK-16905] SQL DDL: MSCK REPAIR TABLE#14500
Conversation
@yhuai Could you help to generate the golden result for this suite?

Jenkins, retest this please.

We do not generate golden files anymore. Let's port those tests. Thanks.

@yhuai Just checked repair.q; it is kind of useless, already covered by our unit tests, so we can just ignore it.
case class RepairTableCommand(tableName: TableIdentifier) extends RunnableCommand {
  override def run(spark: SparkSession): Seq[Row] = {
    val catalog = spark.sessionState.catalog
    val table = catalog.getTableMetadata(tableName)
This is dead code: the previous line already checks whether the table exists or not.
val table = catalog.getTableMetadata(tableName)
Test build #63242 has finished for PR 14500 at commit

Test build #63243 has finished for PR 14500 at commit

Test build #63244 has finished for PR 14500 at commit

Test build #63246 has finished for PR 14500 at commit
 * Create an [[AlterTableDiscoverPartitionsCommand]] command
 *
 * For example:
 * {{{
Nit: Update the syntax and the comments here.
Test build #63279 has finished for PR 14500 at commit

Test build #63283 has finished for PR 14500 at commit
val threshold = spark.conf.get("spark.rdd.parallelListingThreshold", "10").toInt
val statusPar: GenSeq[FileStatus] =
  if (partitionNames.length > 1 && statuses.length > threshold || partitionNames.length > 2) {
    val parArray = statuses.par
I didn't look carefully, but if you are using the default execution context, please create a new one; otherwise it would block.
Cool. Can we make it explicit, e.g. statuses.par(evalTaskSupport)?
This is copied from UnionRDD.
I did not figure out how that would work; at least statuses.par(evalTaskSupport) does not compile.
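For context, a dedicated pool is attached to a Scala parallel collection by assigning to its `tasksupport` field, which is the pattern Spark's UnionRDD follows. Below is a minimal sketch in plain Scala with no Spark dependency; note that on Scala 2.11 the pool type is `scala.concurrent.forkjoin.ForkJoinPool` rather than `java.util.concurrent.ForkJoinPool`, and on 2.13+ parallel collections live in the separate scala-parallel-collections module.

```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

object ParListingSketch {
  // Map over the input in parallel on a dedicated 8-thread pool instead of
  // the default global pool, so blocking tasks cannot starve other users.
  def parallelDoubleSum(statuses: Array[Int]): Int = {
    val parArray = statuses.par
    // `tasksupport` is a var on the parallel collection; there is no
    // `statuses.par(evalTaskSupport)` overload, which is why that form
    // does not compile.
    parArray.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
    parArray.map(_ * 2).sum
  }

  def main(args: Array[String]): Unit =
    println(parallelDoubleSum((1 to 100).toArray))
}
```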
if (st.isDirectory && name.contains("=")) {
  val ps = name.split("=", 2)
  val columnName = PartitioningUtils.unescapePathName(ps(0)).toLowerCase
  val value = PartitioningUtils.unescapePathName(ps(1))
Do we need to check if the value is valid? E.g., for a partition column "a" of IntegerType, "a=abc" is invalid.
We could have a TODO here.
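Such a TODO could be implemented along these lines. This is purely an illustrative sketch: `isValidPartitionValue` and the string type names are hypothetical, not an existing Spark helper.

```scala
import scala.util.Try

object PartitionValueCheck {
  // Accept a raw partition value only if it can be cast to the column's
  // declared type, so a directory like "a=abc" is rejected for an int column.
  def isValidPartitionValue(value: String, dataType: String): Boolean =
    dataType match {
      case "int"    => Try(value.toInt).isSuccess
      case "bigint" => Try(value.toLong).isSuccess
      case "double" => Try(value.toDouble).isSuccess
      case _        => true // string-like columns accept any raw value
    }
}
```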
Test build #63375 has finished for PR 14500 at commit
val ps = name.split("=", 2)
val columnName = PartitioningUtils.unescapePathName(ps(0)).toLowerCase
// TODO: Validate the value
val value = PartitioningUtils.unescapePathName(ps(1))
Can this escaping cause problems in (say) S3 for objects of the form "foo%20bar"?
If the partitions are generated by Spark, the names can be unescaped back correctly. For others, there could be compatibility issues: for example, if Spark did not do the escaping on Linux, the unescaping of %20 could be wrong (we could show a warning?). I think these are out of the scope of this PR.
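To illustrate the concern, here is a sketch of Hive-style "%XX" unescaping (an assumption about the behavior of PartitioningUtils.unescapePathName, not its actual source), showing how a pre-existing object key such as "foo%20bar" that was never escaped by Spark would come back as "foo bar" during recovery:

```scala
object PathUnescape {
  // Decode Hive-style "%XX" hex escapes in a path segment; malformed
  // sequences are kept literally.
  def unescapePathName(path: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < path.length) {
      val c = path.charAt(i)
      if (c == '%' && i + 2 < path.length) {
        val code =
          try Integer.parseInt(path.substring(i + 1, i + 3), 16)
          catch { case _: NumberFormatException => -1 }
        if (code >= 0) { sb.append(code.toChar); i += 3 }
        else { sb.append(c); i += 1 }
      } else { sb.append(c); i += 1 }
    }
    sb.toString
  }
}
```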
LGTM

Merging into master and 2.0 branch, thanks!

@liancheng Can you do a post-hoc review?
MSCK REPAIR TABLE can be used to recover the partitions in the external catalog based on the partitions in the file system. Another syntax is: ALTER TABLE table RECOVER PARTITIONS

The implementation in this PR only lists partitions (not the files within a partition) in the driver (in parallel if needed).

Added unit tests for it and a Hive compatibility test suite.

Author: Davies Liu <davies@databricks.com>

Closes #14500 from davies/repair_table.
LGTM
  CatalogTablePartition(spec, table.storage.copy(locationUri = Some(location.toUri.toString)))
}
spark.sessionState.catalog.createPartitions(tableName,
  parts.toArray[CatalogTablePartition], ignoreIfExists = true)
What will happen if we get thousands or tens of thousands of new partitions?
Good question, see the implementation in HiveShim:
// Follows exactly the same logic of DDLTask.createPartitions in Hive 0.12
override def createPartitions(
    hive: Hive,
    database: String,
    tableName: String,
    parts: Seq[CatalogTablePartition],
    ignoreIfExists: Boolean): Unit = {
  val table = hive.getTable(database, tableName)
  parts.foreach { s =>
    val location = s.storage.locationUri.map(new Path(table.getPath, _)).orNull
    val spec = s.spec.asJava
    if (hive.getPartition(table, spec, false) != null && ignoreIfExists) {
      // Ignore this partition since it already exists and ignoreIfExists == true
    } else {
      if (location == null && table.isView()) {
        throw new HiveException("LOCATION clause illegal for view partition")
      }
      createPartitionMethod.invoke(
        hive,
        table,
        spec,
        location,
        null, // partParams
        null, // inputFormat
        null, // outputFormat
        -1: JInteger, // numBuckets
        null, // cols
        null, // serializationLib
        null, // serdeParams
        null, // bucketCols
        null) // sortCols
    }
  }
}
All these partitions will be inserted into Hive sequentially, so grouping them into batches will not help here.
No, this is only true for Hive <= 0.12; for Hive 0.13+, they are sent in a single RPC, so we should verify what the limit is for a single RPC.
It seems that the Hive Metastore can't handle an RPC with millions of partitions; I will send a patch to do it in batches.
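Such a follow-up patch could batch the RPCs along these lines. This is purely a sketch: `createPartitionsInBatches` and `send` are hypothetical names standing in for the real createPartitions call.

```scala
object BatchedCreate {
  // Split the partition list into fixed-size groups and issue one metastore
  // RPC per group, instead of one RPC carrying every partition.
  // Returns the number of RPCs issued.
  def createPartitionsInBatches[T](parts: Seq[T], batchSize: Int)(send: Seq[T] => Unit): Int = {
    var rpcs = 0
    parts.grouped(batchSize).foreach { batch =>
      send(batch) // e.g. catalog.createPartitions(tableName, batch, ignoreIfExists = true)
      rpcs += 1
    }
    rpcs
  }
}
```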
What changes were proposed in this pull request?
MSCK REPAIR TABLE can be used to recover the partitions in the external catalog based on the partitions in the file system.
Another syntax is: ALTER TABLE table RECOVER PARTITIONS
The implementation in this PR only lists partitions (not the files within a partition) in the driver (in parallel if needed).
How was this patch tested?
Added unit tests for it and a Hive compatibility test suite.