[SPARK_42744] delete uploaded file when job finish for k8s #40363
thousandhu wants to merge 2 commits into apache:master
ConfigBuilder("spark.kubernetes.uploaded.files")
  .internal()
  .doc("Remember all uploaded uri by spark client, used to delete uris when app finished.")
  .version("3.1.2-internal")
Thank you for making a PR, but the Apache Spark codebase needs valid Spark version info, @thousandhu.
I've changed the version to 3.5.0.
BTW, this is an internal config that we don't want users to set, so it is marked as .internal().
val KUBERNETES_UPLOAD_FILE_DELETE_ON_TERMINATION =
  ConfigBuilder("spark.kubernetes.uploaded.file.delete.on.termination")
    .doc("Deleting uploaded file when app finished")
    .version("3.1.2")
Also, since this is a new feature or improvement, the version should be 3.5.0.
dongjoon-hyun left a comment
BTW, you can prevent the leak very easily by using a TTL, such as S3/MinIO lifecycle rules.
Now there is no deletion for files uploaded by client, which causes file leaks on remote file system.
We are using HDFS as the storage.
@thousandhu @dongjoon-hyun @holdenk This Jira, SPARK-42744, is a duplicate of SPARK-42466. I had already created a PR for that issue, which handles cleanup on both the driver and the client side in case of app submission failure.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Let the driver delete the uploaded files when the job finishes.
Why are the changes needed?
Currently, files uploaded by the client are never deleted, which leaks files on the remote file system.
Does this PR introduce any user-facing change?
Yes. This PR adds a new configuration, spark.kubernetes.uploaded.file.delete.on.termination. It defaults to false, which keeps the current behavior. When it is set to true, the driver tries to delete the uploaded files when the job finishes.
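To illustrate the behavior described above, here is a minimal sketch in plain Scala. It is not the code from this PR. It assumes that the internal spark.kubernetes.uploaded.files entry holds a comma-separated list of URIs (the diff does not show the format), it models the configuration as a simple Map instead of SparkConf, and it deletes through java.nio.file, so it only handles file: URIs. A real driver would presumably go through the Hadoop FileSystem API to reach HDFS or S3. The object and method names are hypothetical.

```scala
import java.net.URI
import java.nio.file.{Files, Paths}
import scala.util.Try

// Hypothetical helper, not part of Spark.
object UploadedFileCleanup {
  // Config keys as they appear in the diff of this PR.
  val DeleteOnTermination = "spark.kubernetes.uploaded.file.delete.on.termination"
  val UploadedFiles = "spark.kubernetes.uploaded.files"

  /** Deletes the recorded uploads if the flag is true; returns the URIs actually deleted. */
  def cleanup(conf: Map[String, String]): Seq[String] = {
    // The flag defaults to false, which keeps the current behavior (no deletion).
    val enabled = conf.get(DeleteOnTermination).exists(_.trim.equalsIgnoreCase("true"))
    if (!enabled) {
      Seq.empty
    } else {
      // Assumption: the internal entry is a comma-separated list of URIs.
      conf.getOrElse(UploadedFiles, "")
        .split(",").toSeq
        .map(_.trim)
        .filter(_.nonEmpty)
        .filter { uri =>
          // Best effort: a failed or unsupported delete is skipped, not rethrown.
          Try(Files.deleteIfExists(Paths.get(new URI(uri)))).getOrElse(false)
        }
    }
  }
}
```

Swallowing failures matches the "will try to delete" wording: a cleanup error should not change the outcome of a job that has already finished.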
How was this patch tested?