Failing to unpack GHC (sometimes) #4888

Closed
magthe opened this issue Jun 19, 2019 · 13 comments

7 participants
@magthe
commented Jun 19, 2019

General info

For the last couple of days I have been running into an issue where untarring GHC fails:

Preparing to download ghc-8.6.5 ...
ghc-8.6.5: download has begun
ghc-8.6.5:   17.49 MiB / 175.83 MiB (  9.95%) downloaded...
ghc-8.6.5:   47.24 MiB / 175.83 MiB ( 26.87%) downloaded...
ghc-8.6.5:   76.67 MiB / 175.83 MiB ( 43.61%) downloaded...
ghc-8.6.5:  107.02 MiB / 175.83 MiB ( 60.86%) downloaded...
ghc-8.6.5:  136.92 MiB / 175.83 MiB ( 77.87%) downloaded...
ghc-8.6.5:  167.14 MiB / 175.83 MiB ( 95.06%) downloaded...
ghc-8.6.5:  175.83 MiB / 175.83 MiB (100.00%) downloaded...
Downloaded ghc-8.6.5.
Unpacking GHC into /home/vsts_azpcontainer/.stack/programs/x86_64-linux/ghc-8.6.5.temp/ ...
Received ExitFailure (-15) when running
Raw command: /bin/tar Jxf /home/vsts_azpcontainer/.stack/programs/x86_64-linux/ghc-8.6.5.tar.xz
Run from: /home/vsts_azpcontainer/.stack/programs/x86_64-linux/ghc-8.6.5.temp/


Error: Error encountered while unpacking GHC with
         tar Jxf /home/vsts_azpcontainer/.stack/programs/x86_64-linux/ghc-8.6.5.tar.xz
         run in /home/vsts_azpcontainer/.stack/programs/x86_64-linux/ghc-8.6.5.temp/

       The following directories may now contain files, but won't be used by stack:
         - /home/vsts_azpcontainer/.stack/programs/x86_64-linux/ghc-8.6.5.temp/
         - /home/vsts_azpcontainer/.stack/programs/x86_64-linux/ghc-8.6.5/

       For more information consider rerunning with --verbose flag

Steps to reproduce

I don't have exact steps, but the code and CI builds are all open and available.

The code is available at: https://github.com/magthe/ci-test-hs/ (the branch Add Azure Pipelines)

Examples of CI builds at:

Building the image locally (docker build -t foo:0 .) failed at first; after I followed the suggestion and added --verbose, it succeeded. However, the CI builds keep failing sporadically.

Expected

I'm used to stack setup working like a charm.

Actual

Well, see above.

Stack version

The version I'm using on VMs is the pre-built 2.1.1 downloaded from GitHub, e.g. https://github.com/magthe/ci-test-hs/blob/153ca80eaca23eae6444abdbf32e0e3b91240d76/.travis.yml#L15

The version used in containers, including when building images, is the one found in fpco/stack-build:lts-13 (I believe that has been fpco/stack-build:lts-13.25, and thus stack 2.1.1).

Method of installation

See above.

@dmp1ce

commented Jun 20, 2019

I'm also seeing this issue with Snapcraft, which uses Multipass to build snap packages. It happens every time on a fresh build. https://forum.snapcraft.io/t/haskell-stack-snaps-help/11909

@snoyberg

Contributor

commented Jun 23, 2019

I'm really confused by this one, and would love to hear some thoughts from others. I can't see any reason why a SIGTERM would be sent to Stack, what process would be sending it, or what changes in the Stack.Setup codepath could generate this difference.

I am able to reproduce.
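
For context on the error code: on Unix, a child process killed by a signal is reported as ExitFailure with the negated signal number, so -15 corresponds to signal 15, SIGTERM. Python's subprocess module follows the same convention, so the decoding can be sketched there. This is only an illustration; the describe_exit helper is hypothetical and is not part of Stack.

```python
import signal


def describe_exit(code: int) -> str:
    """Explain an exit code that uses the 'negative means signal' convention.

    Hypothetical helper for illustration only; it is not part of Stack.
    """
    if code < 0:
        # A negative code is the negated number of the signal that killed the child.
        return f"killed by {signal.Signals(-code).name}"
    return f"exited normally with status {code}"


if __name__ == "__main__":
    print(describe_exit(-15))  # killed by SIGTERM
    print(describe_exit(0))    # exited normally with status 0
```

Read this way, the log line above says that tar did not fail on its own: something sent it SIGTERM while it was still unpacking.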

@snoyberg snoyberg added this to the P0: Blocking release milestone Jun 23, 2019

@snoyberg snoyberg added the type: bug label Jun 23, 2019

@snoyberg

Contributor

commented Jun 23, 2019

@jamesdbrock provided a Dockerfile for reproing this in #4889, but it's not a reliable repro. I'm not familiar with Snapcraft @dmp1ce. Do you think you'd be able to put together a reliable Docker-based repro for easier testing?

@dmp1ce

commented Jun 23, 2019

It isn't easy to get snapd installed in a docker container. snapd is needed to install snapcraft which in turn uses multipass to create a virtual machine for building snap images. Probably the easiest thing to do is run snapcraft on a spare computer running Ubuntu. Multipass might work in LXD but I'm not sure because multipass requires a KVM device.

On Ubuntu the steps would be:

  • Ensure snapd is running (sudo apt install snapd)
  • Install snapcraft (sudo snap install snapcraft --classic)
  • Get basic snap of a stack project (git clone https://github.com/dmp1ce/snapcraft-stack-example.git)
  • Try to build project with snapcraft (snapcraft)

@jamesdbrock

commented Jun 24, 2019

I tried to look into stack setup with strace and I got this glimpse.

strace -e %signal stack setup --verbose
2019-06-24 00:20:51.863074: [debug] Unpacking /root/.stack/programs/x86_64-linux/ghc-8.6.5.tar.xz
2019-06-24 00:20:51.864394: [debug] Run process within /root/.stack/programs/x86_64-linux/ghc-8.6.5.temp/: /bin/tar Jxf /root/.stack/programs/x86_64-linux/ghc-8.6.5.tar.xz
rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
tgkill(14, 16, SIGPIPE)                 = 0
kill(21, SIGTERM)                       = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=21, si_uid=0, si_status=SIGTERM, si_utime=21, si_stime=179} ---
2019-06-24 00:21:05.061188: [error] Received ExitFailure (-15) when running
Raw command: /bin/tar Jxf /root/.stack/programs/x86_64-linux/ghc-8.6.5.tar.xz
Run from: /root/.stack/programs/x86_64-linux/ghc-8.6.5.temp/

I'd like to capture a log of

strace -f -e %signal,%process stack setup

but I cannot reproduce the error again.

@snoyberg

Contributor

commented Jun 24, 2019

That strace output is just what we needed! @psibi and @nh2 provided the missing piece of insight: the SIGTERM is coming from Stack itself. This reminded me of a test suite bug I fixed recently:

snoyberg/conduit@20fd6e2

Which ultimately led to this PR: #4902

I'd appreciate it if those affected by this bug could test this out and confirm that it fixes the problem for them.
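
To illustrate the kind of bug described here, the following is a rough Python analogy, not Stack's actual Haskell code. It assumes a POSIX system, and both function names are hypothetical. The first function mimics a cleanup path that sends SIGTERM to a child that is still running; the second waits for the child to exit on its own.

```python
import signal
import subprocess
import sys

# A stand-in for a slow child such as tar; the 0.3 s sleep is arbitrary.
SLOW_CHILD = [sys.executable, "-c", "import time; time.sleep(0.3)"]


def run_and_terminate_early(cmd):
    """Problematic pattern: terminate the child without letting it finish."""
    proc = subprocess.Popen(cmd)
    proc.terminate()    # sends SIGTERM while the child is still working
    return proc.wait()  # on POSIX this returns -15, the negated signal number


def run_and_wait(cmd):
    """Safer pattern: wait for the child to exit by itself."""
    proc = subprocess.Popen(cmd)
    return proc.wait()  # returns the child's own exit status


if __name__ == "__main__":
    print(run_and_terminate_early(SLOW_CHILD) == -signal.SIGTERM)
    print(run_and_wait(SLOW_CHILD))
```

In the real bug the termination only happened sometimes, which would be consistent with the sporadic failures reported above; the sketch makes it deterministic so the difference between the two patterns is easy to see.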

snoyberg added a commit that referenced this issue Jun 25, 2019

Merge pull request #4902 from commercialhaskell/4888-fix-unpack-ghc
Wait for children to exit correctly (fixes #4888)

@magthe

Author

commented Jun 30, 2019

I'm not that familiar with the whole stack/stackage setup, so I'll have to ask. Are there builds with this change included available somewhere, e.g. using some specific tag on DockerHub or in an artefact store on the CI system you use?

@snoyberg

Contributor

commented Jun 30, 2019

We have nothing automated (though I wish we did). I generated a Linux executable and uploaded it to S3, and started using it for typed-process. If you'd like to use it too, it's available at

https://s3.amazonaws.com/www.snoyman.com/stack-1ed71cae36a64365ead72da1427e1685ccec8246.bz2

Relevant commit: fpco/typed-process@af31b7b#diff-354f30a63fb0907d4ad57269548329e3

@magthe

Author

commented Jun 30, 2019

Yes, that'll make it a little easier to try it out on CI services, since that's where I've observed the issue most frequently.

@SkyWriter

Member

commented Jul 1, 2019

Running stack build with LTS 12.26 in Docker fpco/stack-build-small (3523caf4fba2) always fails with the mysterious -15 error message, even though execution of the corresponding tar command succeeds when done manually.

The Stack build provided by @snoyberg heals my woes, and my app builds without issues.

@magthe

Author

commented Jul 1, 2019

For the last few days I've completely failed to reproduce the issue without the fix... don't ask me what's different on the various CI services I'm experimenting with.

I can't even reproduce using @SkyWriter 's recipe above 😕

@neongreen

Collaborator

commented Jul 11, 2019

I generated a Linux executable and uploaded it to S3, and started using it for typed-process.

A new official release would be great – due to this issue, our CI builds fail more often than not nowadays.

@snoyberg is there anything I can do (regarding maintenance tasks) to help make the new release happen faster? My email is in my GitHub profile.

@borsboom

Contributor

commented Jul 11, 2019

@neongreen The only holdup right now is that 4938-non-ascii-module-names is failing for macOS on CI (see #4939 (comment)). If there's anything you can do to help with that, it would push things along.
