Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32838][SQL] Static partition overwrite could use staging dir insert #35608

Closed
wants to merge 5 commits into from

Conversation

AngersZhuuuu
Copy link
Contributor

What changes were proposed in this pull request?

Currently, we verify path in DataSourceAnalysis

// For dynamic partition overwrite, we do not delete partition directories ahead.
// We write to staging directories and move to final partition directories after writing
// job is done. So it is ok to have outputPath try to overwrite inputpath.
 if (overwrite && !insertCommand.dynamicPartitionOverwrite) {
      DDLUtils.verifyNotReadPath(actualQuery, outputPath)
}

/**
   * Throws exception if outputPath tries to overwrite inputpath.
   */
  def verifyNotReadPath(query: LogicalPlan, outputPath: Path) : Unit = {
    val inputPaths = query.collect {
      case LogicalRelation(r: HadoopFsRelation, _, _, _) =>
        r.location.rootPaths
    }.flatten

    if (inputPaths.contains(outputPath)) {
      throw new AnalysisException(
        "Cannot overwrite a path that is also being read from.")
    }
  }

For static partition insert and read data form same table, it's really a normal case. This bug troubles user a lot.
In this pr, for static partition insert, we can use same logical like dynamic partition overwrite to avoid this issue.

Why are the changes needed?

Support more ETL case

Does this PR introduce any user-facing change?

After this patch, user can:

  1. Insert overwrite static partition from data read from same table's partition
  2. Insert overwrite static partition from data read from same table's same partition

How was this patch tested?

Added UT

@AngersZhuuuu
Copy link
Contributor Author

cc @CHENXCHEN

@AngersZhuuuu
Copy link
Contributor Author

gentle ping @cloud-fan @HyukjinKwon @viirya @dongjoon-hyun This is a long term issue. and current code is an easy and reasonable way to resolve this problem. Hope for your reviews. Many spark user encounter this issue. ccc @TongWei1105

@AngersZhuuuu
Copy link
Contributor Author

ping @cloud-fan

@SparksFyz
Copy link

We also encounter this issue for partitioned table(maybe converted from HiveTableRelation). Here change InsertIntoHadoopFsRelationCommand.dynamicPartitionOverwrite to true is another way to solve this problem? DynamicPartitionWrite will delete data in commitJob before rename.

@AngersZhuuuu
Copy link
Contributor Author

We also encounter this issue for partitioned table(maybe converted from HiveTableRelation). Here change InsertIntoHadoopFsRelationCommand.dynamicPartitionOverwrite to true is another way to solve this problem?

Why In hive we won't meet such issue is because we use staging dir.

DynamicPartitionWrite will delete data in commitJob before rename.

What are you trying to express here?

@SparksFyz
Copy link

SparksFyz commented Mar 9, 2022

if (overwrite && !insertCommand.dynamicPartitionOverwrite) {
      DDLUtils.verifyNotReadPath(actualQuery, outputPath)
}

DynamicPartitionOverwrite does not delete data before job begin, so it will not encountered verify problem.

Since we use directly outputCommitter to write files. For the case Insert overwrite static partition from data read from same table's partition, we just tune insertCommand.dynamicPartitionOverwrite to true to avoid delete partition data before read.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 27, 2022
@github-actions github-actions bot closed this Sep 28, 2022
@shrprasa
Copy link
Contributor

shrprasa commented Apr 6, 2023

@AngersZhuuuu @cloud-fan We are facing this issue with Spark 3.2. Can we get this fix merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants