[SPARK-45651][BUILD][FOLLOWUP] Reduce mvn -Xmx option to 2g in publish_snapshot workflow #43538

Closed
EnricoMi wants to merge 1 commit into apache:master from G-Research:publish-snapshot-mvn-xmx

@EnricoMi
Contributor

What changes were proposed in this pull request?

Limit the maximum heap size (-Xmx) for mvn clean deploy to 2g when it runs in the publish_snapshot GitHub workflow.

Why are the changes needed?

The host that runs the workflow has only 7G of memory, while the release-build.sh script sets the limit to 12g, causing the process to be killed (for branch master).

Does this PR introduce any user-facing change?

No

How was this patch tested?

Not tested

Was this patch authored or co-authored using generative AI tooling?

No

@EnricoMi EnricoMi changed the title from "[SPARK-45651][Build][Follow-up] Reduce mvn -Xmx option to 2g in publish_snapshot workflow" to "[SPARK-45651][BUILD][FOLLOWUP] Reduce mvn -Xmx option to 2g in publish_snapshot workflow" Oct 26, 2023
@EnricoMi
Contributor Author

CC @LuciferYang @HyukjinKwon

 cd ..

-export MAVEN_OPTS="-Xss128m -Xmx12g -XX:ReservedCodeCacheSize=1g"
+export MAVEN_OPTS="-Xss128m -Xmx${MAVEN_MXM_OPT:-12g} -XX:ReservedCodeCacheSize=1g"
Contributor

Ok, I previously changed the -Xmx option to 3g in build/mvn script, not here, so the previous fix probably didn't take effect...

Contributor

So perhaps the -Xmx here could be larger.

Contributor Author

@EnricoMi EnricoMi Oct 26, 2023

> so the previous fix probably didn't take effect...

I suspect so. The memory logging that is currently in place will tell us for sure whether an attempt has any effect, or just too little.

Contributor Author

@EnricoMi EnricoMi Oct 26, 2023

> So perhaps the -Xmx here could be larger.

Why so? Are you referring to the 12g in this line or the 2g in the workflow? Since the build workflow already uses 2g, we should either stick to 2g for consistency or bump the 2g in build.yml to 3g as well.

Contributor

What I mean is that MAVEN_MXM_OPT could perhaps be 4g: I've tested this before, and java-other-versions also runs successfully on GitHub Actions with 4g.

But 2g is also fine with me.

Contributor Author

Understood, any of 2g, 3g, or 4g is fine with me, too.
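For readers unfamiliar with the `${VAR:-default}` expansion used in the changed line, here is a minimal sketch of how the override behaves. `build_maven_opts` is a hypothetical helper used only for illustration; it is not part of release-build.sh, and how the workflow actually exports MAVEN_MXM_OPT is an assumption, since that part of the change is not shown in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in release-build.sh): it echoes the value that the
# changed line would assign to MAVEN_OPTS.
build_maven_opts() {
  # ${MAVEN_MXM_OPT:-12g} expands to 12g when the variable is unset OR empty,
  # so callers that never set it keep the previous 12g behaviour.
  echo "-Xss128m -Xmx${MAVEN_MXM_OPT:-12g} -XX:ReservedCodeCacheSize=1g"
}

# No override: falls back to the 12g default.
unset MAVEN_MXM_OPT
build_maven_opts   # prints: -Xss128m -Xmx12g -XX:ReservedCodeCacheSize=1g

# Assumed workflow usage: set the variable before the script builds MAVEN_OPTS.
# The subshell keeps the override from leaking into the rest of this sketch.
(MAVEN_MXM_OPT=2g; build_maven_opts)   # prints: -Xss128m -Xmx2g -XX:ReservedCodeCacheSize=1g
```

Because the default lives in the script and the override lives in the caller, any of the 2g, 3g, or 4g values discussed above could be chosen per workflow without touching release-build.sh again.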

Contributor

@LuciferYang LuciferYang left a comment

LGTM if tests pass

@HyukjinKwon
Member

Merged to master.

@HyukjinKwon
Member

Seems not working (https://github.com/apache/spark/actions/runs/6661131541/job/18103495351). I reverted this for now.

@HyukjinKwon
Member

Currently, the actual tests and the snapshot build use different Docker images. We cache the test image (see image_url at .github/workflows/build_and_test.yml, and https://github.com/apache/spark/blob/master/dev/infra/Dockerfile). Can we use the same image for both?

@EnricoMi
Contributor Author

> Seems not working (https://github.com/apache/spark/actions/runs/6661131541/job/18103495351). I reverted this for now.

The problem was unrelated:

408 Request Timeout

@EnricoMi
Contributor Author

That workflow run did not pick up the changes of this PR:

https://github.com/apache/spark/actions/runs/6661131541/workflow

@EnricoMi
Contributor Author

EnricoMi commented Oct 27, 2023

In fact, you manually triggered the publish, and it worked perfectly fine for master:
https://github.com/apache/spark/actions/runs/6661389655/job/18104176828
https://github.com/apache/spark/actions/runs/6661389655/workflow

The run for branch-3.5 failed due to an unrelated 408 Request Timeout, so there is no need to revert.

@EnricoMi EnricoMi deleted the publish-snapshot-mvn-xmx branch October 27, 2023 09:09
@LuciferYang
Contributor

Shall we revive this PR and give it another try?

@EnricoMi
Contributor Author

Yes, please!

@EnricoMi
Contributor Author

The memory statistics with 2g are:

MiB Mem :   6922.0 total,    579.9 free,   5634.8 used,    707.2 buff/cache
MiB Swap:   4096.0 total,   2069.7 free,   2026.2 used.    976.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2494 runner    20   0   16.4g   5.1g  23040 S 118.8  75.5 103:16.37 java

So even with 2g, memory usage is still quite high, but it is significantly better than with 12g: there is headroom of about 2g of free swap and about 500m of free memory.
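The headroom figures can be read off the two summary lines of the `top` output above. As a rough sketch, assuming the summary format shown in this comment, a small filter could extract them; `headroom` is a hypothetical helper and not part of the workflow's existing memory logging.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads top's summary lines on stdin and prints the free
# RAM and free swap values (in MiB) exactly as top reported them.
headroom() {
  awk '
    # On each summary line, the number right before the "free," token is the free amount.
    /MiB Mem/  { for (i = 2; i <= NF; i++) if ($i ~ /^free/) mem  = $(i - 1) }
    /MiB Swap/ { for (i = 2; i <= NF; i++) if ($i ~ /^free/) swap = $(i - 1) }
    END { printf "free_mem_mib=%s free_swap_mib=%s\n", mem, swap }
  '
}

# The two summary lines quoted in the comment above.
headroom <<'EOF'
MiB Mem :   6922.0 total,    579.9 free,   5634.8 used,    707.2 buff/cache
MiB Swap:   4096.0 total,   2069.7 free,   2026.2 used.    976.4 avail Mem
EOF
# prints: free_mem_mib=579.9 free_swap_mib=2069.7
```

That is roughly 0.5 GiB of free memory and 2 GiB of free swap, matching the headroom quoted above.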

@HyukjinKwon
Member

I manually triggered once more at https://github.com/apache/spark/actions/runs/6661389655 and it failed too. Is that also unrelated?

@EnricoMi
Contributor Author

Revived in #43555.

@HyukjinKwon
Member

If that's the case we can get this in again and see if it works.

@HyukjinKwon
Member

(Sorry, I'm on my phone right now, so I can't properly check the logs myself.)

@EnricoMi
Contributor Author

Same HTTP timeout:

2023-10-27T00:46:23.1689370Z mem: top - 00:46:23 up 7 min,  0 users,  load average: 0.86, 0.96, 0.50
2023-10-27T00:46:23.1690374Z mem: Tasks: 129 total,   1 running, 128 sleeping,   0 stopped,   0 zombie
2023-10-27T00:46:23.1691430Z mem: %Cpu(s):  0.0 us,  3.2 sy,  0.0 ni, 96.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
2023-10-27T00:46:23.1692524Z mem: MiB Mem :   6922.0 total,    596.6 free,   2637.2 used,   3688.3 buff/cache
2023-10-27T00:46:31.1391267Z mem: MiB Swap:   4096.0 total,   4095.2 free,      0.8 used.   39
2023-10-27T00:46:31.1396047Z org.apache.maven.wagon.TransferFailedException: transfer failed for https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-network-shuffle_2.12/3.5.1-SNAPSHOT/spark-network-shuffle_2.12-3.5.1-20231027.004212-88-sources.jar.md5, status: 408 Request Timeout

@EnricoMi
Contributor Author

This looks like a known, unrelated issue; there is a workaround here: https://github.com/kiegroup/kie-wb-common/pull/3416/files

LuciferYang pushed a commit that referenced this pull request Oct 27, 2023
…h_snapshot workflow

### What changes were proposed in this pull request?
This re-does #43538, which has [falsely been reverted](#43538 (comment)).

Limit max memory for `mvn clean deploy` to `2g` when run in `publish_snapshot` GitHub workflow.

### Why are the changes needed?
The host that runs the workflow has only 7G of memory, while the `release-build.sh` script sets the limit to 12g, causing the process to be killed (for branch `master`).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not tested

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43555 from EnricoMi/publish-snapshot-mvn-xmx-2.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: yangjie01 <yangjie01@baidu.com>

@HyukjinKwon
Member

Let's retry a few times first and see if it actually works.
