🐳 automated docker image build fails for e-mission-server
#926
For (1), I know that the job fails if docker commands fail in general (e.g. https://github.com/e-mission/e-mission-server/actions/runs/5370984366/jobs/9743442408), but it is not failing when this particular command fails. In the case where it correctly failed, we see the failing command, which is probably called from one of the setup scripts. So the fix for this is probably to use `set -e`.
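The failure-propagation problem can be sketched without docker at all. Below is a minimal, self-contained reproduction (script names are made up for illustration): without `set -e`, a failed command in the middle of a script is masked by the commands after it, so the shell running the script (and hence `docker build`) sees exit code 0.

```shell
# Minimal reproduction: without `set -e`, a script keeps going after a
# failed command, so the caller sees exit code 0 and reports success.
cat > /tmp/no_set_e.sh <<'EOF'
false             # simulated failing setup step
echo "kept going"
EOF

cat > /tmp/with_set_e.sh <<'EOF'
set -e
false             # the same failing step now aborts the script
echo "kept going"
EOF

bash /tmp/no_set_e.sh;   rc_without=$?   # 0: the final echo masks the failure
bash /tmp/with_set_e.sh; rc_with=$?      # 1: the failure propagates
echo "without set -e: $rc_without, with set -e: $rc_with"
```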
For (2), preventing the process from being killed in the first place: conda/conda#8051 seems to indicate that the long-term fix is to switch to mamba.
For testing (1), you already have a PR; it is not building the image yet. In terms of answering "If one of the steps of the image build did not work, then why did the image build not fail?", you can also run the docker command locally and build an image manually. You can then introduce a failure in the scripts run from docker and see whether the local docker build fails. This can happen in parallel with the testing using GitHub Actions.
Assuming that it is a memory issue (which seems likely), you can also try to force the error by reducing the resources for docker on your laptop. Get it down to 4GB or so and it will probably fail the same way. And if it doesn't fail the same way, it is probably not a memory issue, which is also interesting, although harder to debug.
The image build is successful on my local machine. I was able to recreate the error in my fork of e-mission-server using GitHub Actions. The error occurs while running setup_config.sh, specifically during the conda install command.
Adding `set -e` before the command did not cause the setup to fail. Instead, the program completely skipped the section that we saw erroring before, but continued with the next steps and said that the image was successfully built. Since I was able to find the exact command that is erroring, I added error handling directly around it.
Tried to test how scripts error out:
Test 1
Test 2:
Now, add this bash script to
So what if we try
Let's verify whether
so it seems like this should just work.
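A minimal sketch of why "this should just work": with `set -e` in both the outer script (standing in for the Dockerfile RUN step) and the inner one (standing in for setup_config.sh), a failure in the inner script propagates up to the outer script's exit code. Script names are hypothetical:

```shell
# Sketch of the nesting being tested: a parent script stands in for the
# Dockerfile RUN step, a child script for setup_config.sh. With `set -e`
# in both, the child's failure propagates to the parent's exit code.
cat > /tmp/child.sh <<'EOF'
set -e
false               # simulated failing install step
EOF

cat > /tmp/parent.sh <<'EOF'
set -e
bash /tmp/child.sh  # exits 1, so the parent stops here
echo "finished setup"
EOF

bash /tmp/parent.sh; rc_parent=$?
echo "parent exit code: $rc_parent"
```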
We have shown these in the last 4 comments. One caveat may be that conda itself is a wrapper that may not propagate the underlying exit code.
Looking forward to seeing logs soon.
After researching and running tests, it is apparent that the issue is due to conda not having enough memory to complete the install. The error occurs during setup_config.sh, but does not necessarily occur while running one specific command. See this test run, where the error occurs despite commenting out the (modified version of the) conda install command:

This was unexpected, and meant that editing the script alone would not fix the problem. Additionally, I was able to build the docker image successfully by setting my local docker resources to 12GB, rather than the default 8GB. This essentially verified that it was a memory issue, and that migrating to mamba might be the best long-term solution.

Let's compare the outputs of running the docker build with different modifications. For this first local run, I have my docker resources set to 3.8GB, so there is no way it will succeed. This run reflects the current state of the Dockerfile and setup_config.sh:
It does not fail despite the Killed error, and continues later on to say that it has finished setup_config.sh:
The build does not recognize the error:
First try
Resulted in a run that appears successful, with no error messages.
It appears that I then tried
The error occurred just as if I hadn't changed anything:
A result similar to my first attempt occurred from a push 2 days ago, when un-commenting `set -e` at the top of setup_config.sh and running a modified version of the conda install command.
When run in GH Actions, the run succeeded and the Killed error never appeared:
From these tests, it would appear that
Which did error out:
However, it was erroring out regardless of memory allocation. After meeting with @shankari, I found out that I had just tried to change too many things at once.
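For context on the "Killed" message seen in these runs: a process killed by SIGKILL (which is what the OOM killer sends) exits with status 128 + 9 = 137, and `set -e` does abort on that status when the shell actually observes it. A minimal simulation (a self-inflicted SIGKILL, not a real OOM) shows the status propagating; the fact that the real build kept going suggests the status was swallowed by an intermediate process instead:

```shell
# A process killed by SIGKILL exits with status 128 + 9 = 137, and
# `set -e` aborts on any nonzero status it observes.
cat > /tmp/killed.sh <<'EOF'
set -e
bash -c 'kill -9 $$'  # child process kills itself, simulating an OOM kill
echo "unreachable"
EOF

bash /tmp/killed.sh; rc=$?
echo "exit code: $rc"  # 137 = 128 + SIGKILL
```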
Yay! Yes, one step at a time is the way to go. So I should be seeing a PR soon? Don't forget to put it into "To Review" in the project when it is ready.
Options are:
Quick back-of-the-envelope for how much we would need to replace in the server:
Decision: Try libmamba.
I have tested libmamba both on GH Actions and locally, by adding it to the setup_conda.sh script:
Making libmamba the solver allowed the script to run successfully on GH Actions, and was about twice as fast as previous runs. The speedup was also observed in my local tests. However, when running locally, I still needed to increase docker resources to 12GB. Based on my tests, I think that switching to libmamba is worthwhile.
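The exact lines added to setup_conda.sh are not shown in the thread; a plausible sketch, based on the documented way to enable the libmamba solver in conda (treat the placement in the script as an assumption):

```shell
# Sketch: install the libmamba solver plugin into the base environment
# and make it the default solver. The `solver` config key is for
# conda >= 22.11; older versions used `--set experimental_solver libmamba`.
conda install -n base --yes conda-libmamba-solver
conda config --set solver libmamba
```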
@nataliejschultz we do not install conda on our server. The conda install happens as part of the image build, so as part of the GH Actions. Then we just run the image on our server. For additional context, you can see the PR related to the docker image build action (image_build_push.yml), e-mission/e-mission-server#875, fixing #752. You can find this history yourself in the future by using the file history. If this works, please go ahead and submit a PR. I can merge it before I rebuild the images for the release, and you can have two changes included in the release 😄
Confirmed that after this merge, the docker build fails as expected.
Test with docker also failed
Example run:
https://github.com/e-mission/e-mission-server/actions/runs/5371438586/jobs/9744656517
The build looks like it succeeded, and the image was pushed successfully to the server. But in reality, the image is corrupted because it does not have the conda environment installed.
See error around line 382
After that, only a few packages are targeted for download
If run manually, the image builds successfully.
Issues:
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Would `-m` work to increase memory and allow the build to complete?
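For reference on the `-m` question: the classic docker builder accepts a per-container memory cap during builds (the flags below are real docker flags; the image tag is made up). The same mechanism can be used in reverse to reproduce the OOM kill locally without touching Docker Desktop's global resource slider:

```shell
# Cap build memory to reproduce the OOM kill (classic builder only;
# BuildKit largely ignores these flags, so DOCKER_BUILDKIT=0 forces
# the old code path).
DOCKER_BUILDKIT=0 docker build --memory=4g --memory-swap=4g \
    -t e-mission-server:oom-test .
```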