[SPARK-16905] SQL DDL: MSCK REPAIR TABLE#14500
Conversation
@yhuai Could you help to generate the golden result for this suite?

Jenkins, retest this please.

We do not generate golden files anymore. Let's port those tests. Thanks.

@yhuai Just checked repair.q; it is kind of useless, already covered by our unit tests, so we can just ignore it.
case class RepairTableCommand(tableName: TableIdentifier) extends RunnableCommand {
  override def run(spark: SparkSession): Seq[Row] = {
    val catalog = spark.sessionState.catalog
    val table = catalog.getTableMetadata(tableName)
This is dead code: the previous line already checks whether the table exists or not.
val table = catalog.getTableMetadata(tableName)
Test build #63242 has finished for PR 14500 at commit

Test build #63243 has finished for PR 14500 at commit

Test build #63244 has finished for PR 14500 at commit

Test build #63246 has finished for PR 14500 at commit
 * Create an [[AlterTableDiscoverPartitionsCommand]] command
 *
 * For example:
 * {{{
Nit: Update the syntax and the comments here.
Test build #63279 has finished for PR 14500 at commit

Test build #63283 has finished for PR 14500 at commit
val threshold = spark.conf.get("spark.rdd.parallelListingThreshold", "10").toInt
val statusPar: GenSeq[FileStatus] =
  if (partitionNames.length > 1 && statuses.length > threshold || partitionNames.length > 2) {
    val parArray = statuses.par
I didn't look carefully, but if you are using the default execution context, please create a new one; otherwise it would block.
Cool. Can we make it explicit, e.g. statuses.par(evalTaskSupport)?
This is copied from UnionRDD.
I did not figure out how that would work; at least statuses.par(evalTaskSupport) does not compile.
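For context, a dedicated pool is attached to a Scala parallel collection by assigning to its `tasksupport` field, which is the pattern Spark's UnionRDD follows. Below is a minimal sketch in plain Scala with no Spark dependency; note that on Scala 2.11 the pool type is `scala.concurrent.forkjoin.ForkJoinPool` rather than `java.util.concurrent.ForkJoinPool`, and on 2.13+ parallel collections live in the separate scala-parallel-collections module.

```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

object ParListingSketch {
  // Map over the input in parallel on a dedicated 8-thread pool instead of
  // the default global pool, so blocking tasks cannot starve other users.
  def parallelDoubleSum(statuses: Array[Int]): Int = {
    val parArray = statuses.par
    // `tasksupport` is a var on the parallel collection; there is no
    // `statuses.par(evalTaskSupport)` overload, which is why that form
    // does not compile.
    parArray.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
    parArray.map(_ * 2).sum
  }

  def main(args: Array[String]): Unit =
    println(parallelDoubleSum((1 to 100).toArray))
}
```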
if (st.isDirectory && name.contains("=")) {
  val ps = name.split("=", 2)
  val columnName = PartitioningUtils.unescapePathName(ps(0)).toLowerCase
  val value = PartitioningUtils.unescapePathName(ps(1))
Do we need to check if the value is valid? E.g., for a partition column "a" of IntegerType, "a=abc" is invalid.
We could have a TODO here.
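Such a TODO could be implemented along these lines. This is purely an illustrative sketch: `isValidPartitionValue` and the string type names are hypothetical, not an existing Spark helper.

```scala
import scala.util.Try

object PartitionValueCheck {
  // Accept a raw partition value only if it can be cast to the column's
  // declared type, so a directory like "a=abc" is rejected for an int column.
  def isValidPartitionValue(value: String, dataType: String): Boolean =
    dataType match {
      case "int"    => Try(value.toInt).isSuccess
      case "bigint" => Try(value.toLong).isSuccess
      case "double" => Try(value.toDouble).isSuccess
      case _        => true // string-like columns accept any raw value
    }
}
```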
Test build #63375 has finished for PR 14500 at commit
val ps = name.split("=", 2)
val columnName = PartitioningUtils.unescapePathName(ps(0)).toLowerCase
// TODO: Validate the value
val value = PartitioningUtils.unescapePathName(ps(1))
Can this escaping cause problems in (say) S3 for objects of the form "foo%20bar"?
If the partitions are generated by Spark, the names can be unescaped back correctly. For others, there could be compatibility issues: for example, if Spark did not do the escaping on Linux, the unescaping of %20 could be wrong (we could show a warning?). I think these are out of the scope of this PR.
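To illustrate the concern, here is a sketch of Hive-style "%XX" unescaping (an assumption about the behavior of PartitioningUtils.unescapePathName, not its actual source), showing how a pre-existing object key such as "foo%20bar" that was never escaped by Spark would come back as "foo bar" during recovery:

```scala
object PathUnescape {
  // Decode Hive-style "%XX" hex escapes in a path segment; malformed
  // sequences are kept literally.
  def unescapePathName(path: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < path.length) {
      val c = path.charAt(i)
      if (c == '%' && i + 2 < path.length) {
        val code =
          try Integer.parseInt(path.substring(i + 1, i + 3), 16)
          catch { case _: NumberFormatException => -1 }
        if (code >= 0) { sb.append(code.toChar); i += 3 }
        else { sb.append(c); i += 1 }
      } else { sb.append(c); i += 1 }
    }
    sb.toString
  }
}
```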
LGTM

Merging into master and 2.0 branch, thanks!

@liancheng Can you do a post-hoc review?
MSCK REPAIR TABLE can be used to recover the partitions in the external catalog based on the partitions in the file system. Another syntax is: ALTER TABLE table RECOVER PARTITIONS

The implementation in this PR only lists partitions (not the files within a partition) in the driver (in parallel if needed).

Added unit tests for it and a Hive compatibility test suite.

Author: Davies Liu <davies@databricks.com>

Closes #14500 from davies/repair_table.
LGTM
  CatalogTablePartition(spec, table.storage.copy(locationUri = Some(location.toUri.toString)))
}
spark.sessionState.catalog.createPartitions(tableName,
  parts.toArray[CatalogTablePartition], ignoreIfExists = true)
What will happen if we get thousands or tens of thousands of new partitions?
Good question, see the implementation in HiveShim:
// Follows exactly the same logic of DDLTask.createPartitions in Hive 0.12
override def createPartitions(
    hive: Hive,
    database: String,
    tableName: String,
    parts: Seq[CatalogTablePartition],
    ignoreIfExists: Boolean): Unit = {
  val table = hive.getTable(database, tableName)
  parts.foreach { s =>
    val location = s.storage.locationUri.map(new Path(table.getPath, _)).orNull
    val spec = s.spec.asJava
    if (hive.getPartition(table, spec, false) != null && ignoreIfExists) {
      // Ignore this partition since it already exists and ignoreIfExists == true
    } else {
      if (location == null && table.isView()) {
        throw new HiveException("LOCATION clause illegal for view partition")
      }
      createPartitionMethod.invoke(
        hive,
        table,
        spec,
        location,
        null, // partParams
        null, // inputFormat
        null, // outputFormat
        -1: JInteger, // numBuckets
        null, // cols
        null, // serializationLib
        null, // serdeParams
        null, // bucketCols
        null) // sortCols
    }
  }
}
All these partitions will be inserted into Hive sequentially, so grouping them into batches will not help here.
No, this is only true for Hive <= 0.12; for Hive 0.13+, they are sent in a single RPC, so we should verify what the limit is for a single RPC.
It seems that the Hive Metastore can't handle an RPC with millions of partitions; I will send a patch to do it in batches.
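Such a follow-up patch could batch the RPCs along these lines. This is purely a sketch: `createPartitionsInBatches` and `send` are hypothetical names standing in for the real createPartitions call.

```scala
object BatchedCreate {
  // Split the partition list into fixed-size groups and issue one metastore
  // RPC per group, instead of one RPC carrying every partition.
  // Returns the number of RPCs issued.
  def createPartitionsInBatches[T](parts: Seq[T], batchSize: Int)(send: Seq[T] => Unit): Int = {
    var rpcs = 0
    parts.grouped(batchSize).foreach { batch =>
      send(batch) // e.g. catalog.createPartitions(tableName, batch, ignoreIfExists = true)
      rpcs += 1
    }
    rpcs
  }
}
```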
What changes were proposed in this pull request?
MSCK REPAIR TABLE can be used to recover the partitions in the external catalog based on the partitions in the file system.
Another syntax is: ALTER TABLE table RECOVER PARTITIONS
The implementation in this PR only lists partitions (not the files within a partition) in the driver (in parallel if needed).
How was this patch tested?
Added unit tests for it and a Hive compatibility test suite.