@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Nov 15, 2025

What changes were proposed in this pull request?

This PR adds a free_disk_space step to the K8s integration test GitHub Action job.

Why are the changes needed?

The K8s integration test CI is flaky because executors fail with a No space left on device error.

[info]   25/11/14 21:27:02 ERROR TaskSchedulerImpl: Lost executor 4 on 10.244.0.67: Unable to create executor due to /var/data/spark-163899dd-08da-4b76-b71d-c428207a3bdf/spark-1e9d976f-69b3-4274-af03-300cfc4d6fd5/-14621403551763155568738_cache
-> ./software.amazon.awssdk_bundle-2.29.52.jar:
No space left on device

As in the other GitHub Action jobs that already run it, the free_disk_space step will mitigate this situation in this job.

BEFORE

$ git grep 'free_disk_space$'
.github/workflows/build_and_test.yml:          ./dev/free_disk_space
.github/workflows/release.yml:            ./dev/free_disk_space

AFTER

$ git grep 'free_disk_space$'
.github/workflows/build_and_test.yml:          ./dev/free_disk_space
.github/workflows/build_and_test.yml:            ./dev/free_disk_space
.github/workflows/release.yml:            ./dev/free_disk_space

Does this PR introduce any user-facing change?

No, this is a test infra change.

How was this patch tested?

Pass the CIs and check the log. The following is the log result of this PR.

BEFORE CLEANUP

+ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        72G   54G   18G  76% /
tmpfs           7.9G   84K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda16      881M   62M  758M   8% /boot
/dev/sda15      105M  6.2M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001

AFTER CLEANUP

+ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        72G   21G   52G  29% /
tmpfs           7.9G   84K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda16      881M   62M  758M   8% /boot
/dev/sda15      105M  6.2M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001
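
On the root filesystem, Used dropped from 54G to 21G and Avail rose from 18G to 52G, so the cleanup reclaimed roughly 33G to 34G (the two figures differ only by `df -h` rounding). The following is a minimal sketch, not part of this PR, of how the same delta could be logged around a cleanup step. It assumes a POSIX `df` and `awk`; `avail_kb` and `report_freed` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: print the available KiB on the filesystem holding a path.
avail_kb() {
  # -P forces single-line POSIX output; -k reports sizes in 1024-byte blocks.
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: report the difference between two readings.
report_freed() {
  before=$1
  after=$2
  echo "freed_kb=$((after - before))"
}

before=$(avail_kb /)
# The real cleanup (for example ./dev/free_disk_space) would run here.
after=$(avail_kb /)
report_freed "$before" "$after"
```

Reading the Avail column rather than Used keeps the number aligned with what the executors can actually write.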

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the INFRA label Nov 15, 2025
@dongjoon-hyun dongjoon-hyun marked this pull request as draft November 15, 2025 06:35
@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-54366][INFRA] Add free_disk_space_container step to K8s integration test GitHub Action job" to "[SPARK-54366][INFRA] Add free_disk_space step to K8s integration test GitHub Action job" Nov 15, 2025
@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review November 15, 2025 21:23
@dongjoon-hyun
Member Author

Could you review this INFRA PR too, please, @sarutak ?

Member

@sarutak sarutak left a comment


LGTM.
I confirmed that the way disk space is freed up is the same as:

- name: Free up disk space
  run: |
    if [ -f ./dev/free_disk_space ]; then
      ./dev/free_disk_space
    fi
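
The `-f` guard in the step above runs the script only when it exists as a regular file, so a checkout without it skips the cleanup instead of failing. A minimal sketch of the same pattern as a reusable shell function follows; `run_if_present` is a hypothetical name, and the skip message is an addition for illustration that the quoted step does not print.

```shell
#!/bin/sh
# Hypothetical helper: execute a script only if it exists as a regular file.
run_if_present() {
  if [ -f "$1" ]; then
    "$1"
  else
    # Assumption for illustration: the original step stays silent here.
    echo "skip: $1 not found"
  fi
}

run_if_present ./dev/free_disk_space
```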

@dongjoon-hyun
Member Author

Thank you again!

dongjoon-hyun added a commit that referenced this pull request Nov 15, 2025
…st GitHub Action job

### What changes were proposed in this pull request?

This PR adds a `free_disk_space` step to the K8s integration test GitHub Action job.

### Why are the changes needed?

The K8s integration test CI is flaky because executors fail with a `No space left on device` error.
- https://github.com/apache/spark/actions/runs/19354883389/job/55448531341

```
[info]   25/11/14 21:27:02 ERROR TaskSchedulerImpl: Lost executor 4 on 10.244.0.67: Unable to create executor due to /var/data/spark-163899dd-08da-4b76-b71d-c428207a3bdf/spark-1e9d976f-69b3-4274-af03-300cfc4d6fd5/-14621403551763155568738_cache
-> ./software.amazon.awssdk_bundle-2.29.52.jar:
No space left on device
```

As in the other GitHub Action jobs that already run it, the `free_disk_space` step will mitigate this situation in this job.

**BEFORE**
```
$ git grep 'free_disk_space$'
.github/workflows/build_and_test.yml:          ./dev/free_disk_space
.github/workflows/release.yml:            ./dev/free_disk_space
```

**AFTER**
```
$ git grep 'free_disk_space$'
.github/workflows/build_and_test.yml:          ./dev/free_disk_space
.github/workflows/build_and_test.yml:            ./dev/free_disk_space
.github/workflows/release.yml:            ./dev/free_disk_space
```

### Does this PR introduce _any_ user-facing change?

No, this is a test infra change.

### How was this patch tested?

Pass the CIs and check the log. The following is the log result of this PR.
- https://github.com/dongjoon-hyun/spark/actions/runs/19395758483/job/55495933312

**BEFORE CLEANUP**
```
+ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        72G   54G   18G  76% /
tmpfs           7.9G   84K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda16      881M   62M  758M   8% /boot
/dev/sda15      105M  6.2M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001
```

**AFTER CLEANUP**
```
+ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        72G   21G   52G  29% /
tmpfs           7.9G   84K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda16      881M   62M  758M   8% /boot
/dev/sda15      105M  6.2M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53077 from dongjoon-hyun/SPARK-54366.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 0311f44)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Member Author

Merged to master/4.1/4.0/3.5.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-54366 branch November 15, 2025 23:44