
[SPARK-42744] Delete uploaded files when job finishes for K8s #40363

Closed
thousandhu wants to merge 2 commits into apache:master from thousandhu:SPARK-42744

Conversation

@thousandhu

What changes were proposed in this pull request?

Let the driver delete the uploaded files when the job finishes.

Why are the changes needed?

Currently there is no deletion of files uploaded by the client, which leaks files on the remote file system.

Does this PR introduce any user-facing change?

Yes. This PR adds a new configuration, spark.kubernetes.uploaded.file.delete.on.termination. By default it is false and the behavior is unchanged from the current version. When it is set to true, the driver will try to delete the uploaded files when the job finishes.
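For illustration, the new flag would be set at submit time like any other Spark conf. This is a hypothetical invocation: the property name comes from this PR, spark.kubernetes.file.upload.path is the existing conf that triggers client-side uploads, and the master URL and jar path are placeholders:

```shell
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.file.upload.path=hdfs:///spark-upload \
  --conf spark.kubernetes.uploaded.file.delete.on.termination=true \
  local:///opt/spark/examples/jars/spark-examples.jar
```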

How was this patch tested?

ConfigBuilder("spark.kubernetes.uploaded.files")
.internal()
.doc("Remember all uploaded uri by spark client, used to delete uris when app finished.")
.version("3.1.2-internal")
Member

Thank you for making a PR, but Apache Spark codebase needs a valid Spark version info, @thousandhu .

Author

I've changed the version to 3.5.0.
BTW, this is an internal config; we don't want users to set it, so it is marked .internal().
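For context, a complete definition in the style of Spark's existing Kubernetes Config.scala would look roughly like this. This is a sketch, not the PR's exact diff; the `.stringConf.toSequence.createWithDefault` chain follows the existing ConfigBuilder API for sequence-valued confs:

```scala
val KUBERNETES_UPLOADED_FILES =
  ConfigBuilder("spark.kubernetes.uploaded.files")
    .internal()
    .doc("Remember all URIs uploaded by the Spark client, used to delete " +
      "them when the app finishes.")
    .version("3.5.0")
    .stringConf
    .toSequence
    .createWithDefault(Nil)
```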

val KUBERNETES_UPLOAD_FILE_DELETE_ON_TERMINATION =
ConfigBuilder("spark.kubernetes.uploaded.file.delete.on.termination")
.doc("Deleting uploaded file when app finished")
.version("3.1.2")
Member

And, for the new feature or improvement, this should be 3.5.0.

Author

Changed the version to 3.5.0.

Member

@dongjoon-hyun left a comment

BTW, you can prevent the leak very easily by using TTL like S3/MinIO lifecycle rules.

> Now there is no deletion for files uploaded by client, which causes file leaks on remote file system.
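As a concrete instance of the lifecycle-rule suggestion, an S3 bucket rule that expires staged uploads after a week might look like the following (the prefix and retention period are hypothetical examples, not values from this PR):

```json
{
  "Rules": [
    {
      "ID": "expire-spark-uploads",
      "Filter": { "Prefix": "spark-upload/" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}
```

applied with `aws s3api put-bucket-lifecycle-configuration`.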

@thousandhu
Author

thousandhu commented Mar 14, 2023

> BTW, you can prevent the leak very easily by using TTL like S3/MinIO lifecycle rules.

We are using HDFS as the storage.
And a TTL is not sufficient for all apps, such as streaming and AI training jobs that run for several days. If the TTL is too short, it interferes with failover for Spark apps; if it is very long, it wastes storage.

@shrprasa
Contributor

@thousandhu @dongjoon-hyun @holdenk
The approach in this PR only handles cleanup on the driver side. It won't clean up the files if they were uploaded during job submission but the submission then failed for some reason; in that case the driver is never launched and the cleanup never happens.

This Jira, SPARK-42744, is a duplicate of SPARK-42466. I had already created a PR for that issue which handles cleanup on both the driver side and the client side in case of app submission failure.
It also optimises the file upload by creating just one upload subdirectory rather than a separate subdirectory for each uploaded file.

#40128

@github-actions

github-actions bot commented Jul 9, 2023

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 9, 2023
@github-actions github-actions bot closed this Jul 10, 2023
