
[SPARK-31675][CORE] Fix rename and delete files with different filesystem #36070

Closed
wants to merge 2 commits

Conversation


@CHENXCHEN CHENXCHEN commented Apr 5, 2022

What changes were proposed in this pull request?

When we use a partitioned table, if the filesystem of a partition location is different from the filesystem of the table location, we get an exception such as java.lang.IllegalArgumentException: Wrong FS: s3a://path/to/spark3_snap/dt=2020-09-10, expected: hdfs://cluster, because HadoopMapReduceCommitProtocol uses the filesystem of the table location to operate on the files.
For example, the following SQL causes the above exception:

CREATE TABLE tmp.`spark3_snap` (`id` string) PARTITIONED BY (`dt` string)
STORED AS ORC LOCATION 'hdfs://path/to/spark3_snap';

-- The filesystem of the partition location differs from the filesystem of
-- the table location: one is S3A, the other is HDFS
ALTER TABLE tmp.spark3_snap ADD PARTITION (dt='2020-09-10')
LOCATION 's3a://path/to/spark3_snap/dt=2020-09-10';

-- This throws: "java.lang.IllegalArgumentException: Wrong FS: s3a://path/to/spark3_snap/dt=2020-09-10, expected: hdfs://cluster"
INSERT OVERWRITE TABLE tmp.spark3_snap PARTITION (dt)
SELECT '10' id, '2020-09-09' dt
UNION
SELECT '20' id, '2020-09-10' dt;

See SPARK-31675 for details.
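
A minimal sketch of the idea behind the fix, reusing the needCopy name from the diff quoted in the review below (the URI comparison is an illustrative assumption, not necessarily the exact check in the patch):

import org.apache.hadoop.fs.{FileSystem, Path}

// A rename only works within a single file system; when the source and
// destination resolve to different file systems (e.g. hdfs:// vs s3a://),
// the files have to be copied instead.
def needCopy(srcPath: Path, dstPath: Path,
    srcFs: FileSystem, dstFs: FileSystem): Boolean = {
  srcFs.getUri != dstFs.getUri
}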

Why are the changes needed?

Without this change, we cannot operate on partitions whose location is on a different filesystem from the table location.

Does this PR introduce any user-facing change?

Yes. Before this PR, an exception was thrown when the user operated on a partition location whose filesystem differed from the filesystem of the table location. After this PR, such partitions are handled as needed.

How was this patch tested?

Manual testing; I am not sure how to write a unit test in Spark to verify this patch.

@github-actions github-actions bot added the CORE label Apr 5, 2022
@CHENXCHEN CHENXCHEN changed the title [SPARK-32838][CORE] Fix rename and delete files with different filesystem [SPARK-31675][CORE] Fix rename and delete files with different filesystem Apr 5, 2022
@CHENXCHEN
Author

cc @cloud-fan could you help take a look when you have time? Thanks.

@AmplabJenkins

Can one of the admins verify this patch?

@cloud-fan
Contributor

Cross-file-system table writing sounds like a big feature to me. Currently Spark fails to write, so this feature is not supported yet, but I'm wondering what the best way to do it would be, e.g. shall we put the staging dir on the same file system as the target path? (A sketch of that option follows below.)

cc @AngersZhuuuu @yaooqinn
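
A minimal sketch of that option, assuming a hypothetical helper (the helper name and the .spark-staging- prefix are illustrative, not the committed design):

import org.apache.hadoop.fs.Path

// Hypothetical helper: derive the staging dir from the target path itself,
// so the final rename from staging to target never crosses file systems.
def stagingDir(targetPath: Path, jobId: String): Path =
  new Path(targetPath, s".spark-staging-$jobId")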

@CHENXCHEN
Author

CHENXCHEN commented Apr 7, 2022

The staging dir is generated based on the table location.
If we require that the generated partition locations be placed under the table location, we need to:

  1. Find the partition locations that differ from the table location, change their locations, and then delete the old locations in InsertIntoHadoopFsRelationCommand.scala
  2. Pass the partition locations that differ from the table location into the committer as new task files in FileFormatDataWriter.scala

Hive's approach is to keep the path of the partition location and move files across file systems (see Hive.java).
Would it be better if our behavior was consistent with Hive's? A sketch of that approach follows.
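
A minimal sketch of Hive-style move semantics, simplified from the behavior described above (moveFile and the URI check are illustrative, not Hive's exact code):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}

def moveFile(src: Path, dst: Path, conf: Configuration): Boolean = {
  val srcFs = src.getFileSystem(conf)
  val dstFs = dst.getFileSystem(conf)
  if (srcFs.getUri == dstFs.getUri) {
    // Same file system: a cheap metadata-only rename is enough.
    srcFs.rename(src, dst)
  } else {
    // Different file systems: copy the bytes, then delete the source.
    FileUtil.copy(srcFs, src, dstFs, dst,
      true /* deleteSource */, true /* overwrite */, conf)
  }
}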

@AngersZhuuuu
Contributor

The current DS insert only supports writing to a staging dir for dynamic partition overwrite; this PR's case seems to use a Hive serde (since Hive serdes support configuring a staging dir on a different file system), and Spark's commit protocol does not support a staging dir on a different file system.

Adding support for different file systems requires considering a lot; you can check the points mentioned in #33828.

I am also writing a new built-in commit protocol in #36056; it behaves like Hive and makes all overwrites use a staging dir on the same file system as the target path.

@CHENXCHEN
Author

Do we need a wrapper file system in Spark to handle all file operations, including cross-file-system ones?
That sounds like a big change...

@CHENXCHEN
Author

The current DS insert only supports writing to a staging dir for dynamic partition overwrite; this PR's case seems to use a Hive serde (since Hive serdes support configuring a staging dir on a different file system), and Spark's commit protocol does not support a staging dir on a different file system.

Adding support for different file systems requires considering a lot; you can check the points mentioned in #33828.

I am also writing a new built-in commit protocol in #36056; it behaves like Hive and makes all overwrites use a staging dir on the same file system as the target path.

Yes, this PR's case uses a Hive serde (Hive serdes support configuring a staging dir on a different file system).

With #33828 and #36056, if we have multiple filesystems that differ from the staging dir's filesystem, we still get exceptions.

Contributor

@steveloughran steveloughran left a comment


You really should not be using the classic FileOutputCommitter against S3; as well as the performance being awful, it lacks correctness and resilience against failure during task commit. Problems with file rename here are essentially second order.

Which committer are you using, and what filesystems?

The code does look good for cross-EZ copies in HDFS.

val dstFs = dstPath.getFileSystem(hadoopConf)
// Copy files across different file systems (diff excerpt; call completed for readability)
if (needCopy(srcPath, dstPath, srcFs, dstFs)) {
  if (!FileUtil.copy(srcFs, srcFs.listStatus(srcPath).map(_.getPath), dstFs, dstPath,
      true /* deleteSource */, true /* overwrite */, hadoopConf)) {
    throw new IOException(s"Failed to copy $srcPath to $dstPath")
  }
}
Contributor


You may want to think about parallelizing the copy, as each file now takes time proportional to data.length / (download_bandwidth + upload_bandwidth); see the sketch after this comment.

It's a shame copy() returns false sometimes; it looks like that only happens if mkdirs() on the dest or delete(src) fails.
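
A minimal sketch of such a parallel copy over a fixed thread pool (the pool size, and the srcFs/dstFs/srcPath/dstPath/hadoopConf names carried over from the diff above, are assumptions for illustration):

import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.hadoop.fs.{FileUtil, Path}

implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

// Launch one copy per source file instead of copying serially.
val copies = srcFs.listStatus(srcPath).map(_.getPath).toSeq.map { file =>
  Future {
    FileUtil.copy(srcFs, file, dstFs, new Path(dstPath, file.getName),
      false /* deleteSource */, true /* overwrite */, hadoopConf)
  }
}
// Fail the commit if any individual copy reported failure.
val allCopied = Await.result(Future.sequence(copies), Duration.Inf).forall(identity)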

Contributor


Can I highlight something I've noticed here: the copy() command stops when src read() returns -1, without doing any checks to validate the file length. Not great.
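
A minimal sketch of the missing validation, assuming the copy loop has both file statuses at hand (the surrounding names are illustrative):

import java.io.IOException

// After copying, compare lengths so a stream that ended early (read() == -1
// before the whole file arrived) is caught rather than silently accepted.
val srcLen = srcFs.getFileStatus(srcPath).getLen
val dstLen = dstFs.getFileStatus(dstPath).getLen
if (srcLen != dstLen) {
  throw new IOException(
    s"Incomplete copy of $srcPath: expected $srcLen bytes, found $dstLen")
}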

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Nov 20, 2022
@github-actions github-actions bot closed this Nov 21, 2022