
Lost Builds #3

Closed · cmoesel opened this issue Mar 17, 2018 · 15 comments

cmoesel (Member) commented Mar 17, 2018

I've set up the auto-build as described and even got some failed builds -- which proved to me that the auto-build webhook was working. I fixed the issue that caused the initial failures, but now I never get a build notification for my IG (whether triggering via a push or manually) -- and the build.log hasn't changed (it still shows that last failed build).

I think my build is getting hung or timing out. Is there any way to get more insight into the build process as it is happening? As it is now, it's very difficult to determine what's going on -- as the build succeeds locally (albeit with reported errors in qa.html), but seems to get lost in the auto-build.

This is for the IG at https://github.com/HL7/us-breastcancer.

cmoesel (Member, Author) commented Mar 20, 2018

BLUF

I think the container that does the build is running out of memory and being killed before it finishes.

I think this can be fixed by increasing the memory available to the container and/or using the --oom-kill-disable option when running the container (though that is only recommended in conjunction with a -m flag to limit total memory).

Details

I've been trying to determine what's going on by attempting to reproduce the problem using Docker containers (since that is how the auto-build works). I tried to build a variation of the ig-build image in this repo, but I could not get it to build. In short, the container insisted on installing Ruby 2.1, but jekyll-asciidoc required a later version of Ruby. I tried downgrading jekyll-asciidoc, but that didn't work either.

In the end, I made an even-more-modified container for testing (see Dockerfile). When I ran the container, I mapped my IG source to /usr/src/input:

docker run -it --rm -v /path/to/my/guide:/usr/src/input --name running-ig-publisher ig-publisher

This ran for a while but ultimately was killed. Here's the last few lines of the output:

Generating Outputs in /usr/src/ig/output/guide/                                  (00:02:39.0195sec)
Generating Summary Outputs                                                       (00:05:48.0216sec)
Jekyll: Source: /usr/src/ig/temp                                                 (00:05:49.0611sec)
Jekyll: Generating...                                                            (00:05:49.0612sec)
Jekyll: done in 8.509 seconds.                                                   (00:05:58.0123sec)
Checking Output HTML                                                             (00:06:02.0670sec)
found 2946 files
Killed
$
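
As an aside, one generic way to confirm that it really was the kernel OOM killer that ended the run (rather than the publisher exiting on its own) is to inspect the stopped container's state. This is just standard Docker, nothing specific to the auto-builder, and it only works if the container was started without --rm so that it is still around afterwards:

# Prints whether Docker recorded an OOM kill, plus the exit code (137 = 128 + SIGKILL)
docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' running-ig-publisher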

After some research, I learned that:

"By default, kernel kills processes in a container if an out-of-memory (OOM) error occurs."

Based on that documentation, I changed my command to this:

docker run -it -m 3G --oom-kill-disable --rm -v /path/to/my/guide:/usr/src/input --name running-ig-publisher ig-publisher

This time, the build succeeded (although it took 21 minutes to complete).

@jmandel or @grahamegrieve -- do you think that it's likely that this is the problem? And if so, might it be possible to update the auto-builder to allocate more memory and/or set the --oom-kill-disable flag (as long as you are also limiting memory via a -m flag)?

jmandel (Collaborator) commented Mar 20, 2018

Thanks for the detailed report! Just to get started, do you know how much memory the job is expected to take? And when it gets killed, is it during the Java build process or during a Jekyll invocation (I was assuming the former)? Also, is the 21-minute build time consistent with what you see when you run locally using a normal process?

Things we'll want to investigate:

  1. Is our default container currently not buildable from scratch?
  2. Can we increase the memory limits in our Kubernetes deployment to fix the errors you're seeing right now?
  3. Can we ensure that proper errors are generated when the container runs out of memory?

cmoesel (Member, Author) commented Mar 20, 2018

I'm not sure how much memory it needs. As noted above, I got it working with -m 3G and --oom-kill-disable -- but since it's not great to disable the OOM killer, I'll experiment to see if there is a value for -m that allows it to complete.

A few odd things to note:

  • The full IG build takes about 5 minutes on my Mac (OS X) if I don't use a Docker container (vs. 21 minutes when I got it to work in the container). This is with -Xms2g -Xmx2g allocating 2 GB to Java.
  • Today I got the Docker build to work in 15 minutes without specifying a memory limit (-m) when running the container (which is odd, since yesterday that failed).

According to the logs, it looks like it's dying in the Java process, but since I think I've established that 2GB is enough for Java (on my own Mac), I wonder if some other process is eating up memory and/or not releasing it.

cmoesel (Member, Author) commented Mar 20, 2018

Based on experimentation, I can keep the Java memory allocation at 2GB, but I need to increase the container allocation to 8GB. In other words, this seems to work:

docker run -it -m 8GB --rm -v /path/to/my/guide:/usr/src/input --name running-ig-publisher ig-publisher

Is it possible to increase the allocation on the container by that much in the hosted services?

jmandel (Collaborator) commented Mar 20, 2018

We can certainly make this change. I would love to understand more about what's happening with memory at the same time. Do you think your Java process is using only two gigabytes? If so, is Jekyll bursting higher?

cmoesel (Member, Author) commented Mar 20, 2018

I guess I don't know for sure. I'm assuming that since I'm passing in -Xms2g -Xmx2g, Java is staying within those bounds -- otherwise I think we'd see Java throw an OOM exception. That said, it seems suspicious that some other process would be using almost 6GB on top of Java.

I'll try running it again and using docker stats to see what that reports for memory usage and by what processes. (This is probably something I should have done in the first place).
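
For reference, the commands I'm planning to use (nothing exotic, just standard Docker tooling):

# One-shot snapshot of the container's overall memory usage and limit
docker stats --no-stream running-ig-publisher

# Per-process view from inside the running container (assumes top is available in the image)
docker exec -it running-ig-publisher top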

cmoesel (Member, Author) commented Mar 20, 2018

Here is a very weird wrinkle in this story. As I was trying to profile my container to see memory usage, I noticed that no matter what I passed in via -m, my container never got more than 2GB of memory. I also couldn't reproduce the success I saw this morning with -m 8G.

It turns out that Docker for Mac defaults to an overall 2GB memory max unless you change the configuration in the Docker preferences menu. So all along I never got more than 2GB of memory in a container. This means that my success this morning was due to some stroke of luck in which I didn't exceed 2GB! It also means that my actual memory usage is probably much lower than 8GB (phew).

I've just bumped up my Docker for Mac memory preferences, so I'll try this all again and report back.
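
For anyone else hitting this, you can check what the Docker engine (i.e. the Docker for Mac VM) actually has available, which is the ceiling any -m value is subject to:

# Total memory available to the Docker engine, in bytes
docker info --format '{{.MemTotal}}'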

cmoesel (Member, Author) commented Mar 20, 2018

Wow. Actually increasing my container memory made a big difference. Here's how I ran it:

  • The Java command (inside the container) was still given the -Xms2g -Xmx2g arguments
  • The docker run command was given a 4GB allocation (via -m 4G)

The result is:

  • Published successfully in ~5 minutes (same as on my host without Docker)
  • Jekyll went by quickly (in top), but it looked like it used about 166 MB of memory at its peak
  • The Java process went up to 2.90 GB of memory when checking output HTML

I was reminded that -Xmx only affects the max heap size; the Java process will use additional memory on top of that (see here).

To see how close we are getting to the actual Java heap size max (2 GB), I tried running it with -Xmx1500m. It ran for a very long time and then finally crashed with java.lang.OutOfMemoryError: Java heap space. This means we may not have much headroom with a 2GB heap.

I did one more run, setting -Xmx2500m and it completed in 4m33s.

@jmandel -- here are my recommendations based on these findings:

  • Within the container, we should run publisher.jar with -Xmx2500m. This gives us a little headroom and allows for faster processing times.
  • When launching the container, we should set the memory limit to 4 GB (-m 4G). Since we're using 2.9 GB in the java process, I think 3 GB would be cutting it too close.
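
Putting those two together, it would look roughly like this (just a sketch -- the exact jar name, entrypoint, and paths inside the image are assumptions on my part):

# Host side: give the container 4 GB
docker run -it -m 4G --rm -v /path/to/my/guide:/usr/src/input --name running-ig-publisher ig-publisher

# Inside the container (or in the image's entrypoint): run the publisher with a 2500 MB heap
java -Xmx2500m -jar publisher.jar -ig /usr/src/input/ig.json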

Let me know what you think -- and when you think we might be able to have these changes deployed.

jmandel (Collaborator) commented Mar 21, 2018

Thanks! Really appreciate the level of detail and suggestions here -- it's a huge help.

Currently we're assigning 2g to the Java process and 3200Mi as a memory limit for the Kubernetes pod.

I'll bump these up today per your suggestions and note here when I've done so.
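
For the record, the change on our side amounts to something like this (sketch only -- "ig-build" is a placeholder, not necessarily the real deployment name):

# Raise the memory limit on the deployment that runs the builds
kubectl set resources deployment/ig-build --limits=memory=4Gi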

jmandel (Collaborator) commented Mar 21, 2018

With 4Gi on the pod and -Xms3g -Xmx3g on the Java process, I'm still seeing the Java process exit with status 137 on HL7/us-breastcancer (shortly after Jekyll runs). Now bumping Java memory up to -Xms3750m -Xmx3750m...
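
For what it's worth, exit status 137 is 128 + SIGKILL, which is consistent with an OOM kill. A generic way to confirm from the Kubernetes side, assuming we know the pod name:

# Prints "OOMKilled" if the kernel killed the previous container instance
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'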

jmandel (Collaborator) commented Mar 21, 2018

This should be working now!

An underlying issue was actually brittleness in looking for the qa.html file -- we expected to find it at the root of the output from the build tool, but here there's a guide directory. Is this an intentional choice? I think we'll need some way to determine how to find the right qa file after a build, if the convention isn't "grab the root-level file".

cmoesel (Member, Author) commented Mar 21, 2018

Hi @jmandel. Great! I can't wait to try it out! Regarding the guide directory, I'll take a look to see if I can remember how/why we did it that way. If it's not standard practice, I'd like to conform to whatever is.

cmoesel (Member, Author) commented Mar 21, 2018

Would you be able to find the qa.html by looking at the ig.json file's paths.output? In our ig.json, we have it set to output/guide/, so that seems to indicate where the qa.html file would be.

That said, I just noticed that the IG Publisher documentation describes this field with a special disclaimer for auto-build:

"output" : "[output]", // optional; default = "output" (and must be output for auto-build)

Oops.

I can probably change ours to "output". IIRC, I put that second-level folder in there because it's a real pain in the neck to find the index.html file in an output folder containing 2000+ files -- so I put the thousands of files in a sub-folder and kept a simple redirecting index.html alongside it. Perhaps I can find another way to make it easy for people running the build locally to quickly open the index.
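
For example, just as a sketch of what I mean (paths and the jar name are illustrative, and `open` is the macOS opener):

# Build locally, then jump straight to the generated index without hunting through thousands of files
java -jar publisher.jar -ig ig.json && open output/index.html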

cmoesel (Member, Author) commented Mar 21, 2018

@jmandel -- I just pushed to my repo and got my first successfully triggered auto-build! Thank you so very much for trudging through this with me and spending the time to get it working. It is very much appreciated!

cmoesel closed this as completed Mar 21, 2018

jmandel (Collaborator) commented Mar 21, 2018

Sure thing! Really appreciate your help investigating along the way. Looking ahead, there's no reason the handling of the configuration JSON file needs to be so brittle, and it would be good to tackle that.
