[MXNET-553] Restructure dockcross dockerfiles to fix caching #11302

Merged: 2 commits into apache:master on Jun 19, 2018

Conversation

KellenSunderland
Contributor

@KellenSunderland commented Jun 15, 2018

Description

This PR restructures the dockcross Dockerfiles to remove their multi-head components. These multi-head components made it difficult to properly provide a remote cache for the Dockerfiles and were slowing down builds. A simple workaround was to generalize the ubuntu_ccache.sh file so that it works on Ubuntu, Debian, and in cross-compilation containers. We can now use this script the same way in a variety of containers, which eliminates the need for a special multi-head docker stage.

Should address #11257.
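
For readers unfamiliar with the approach, here is a minimal sketch of what such a generalized ccache setup script could look like. This is an illustrative assumption, not the literal contents of ubuntu_ccache.sh:

#!/usr/bin/env bash
# Illustrative sketch only -- not the actual ubuntu_ccache.sh.
set -ex

# Install ccache from the distro repositories (works on Ubuntu and Debian,
# which the dockcross cross-compilation images are also based on).
apt-get update && apt-get install -y ccache

# Masquerade as the compilers so plain gcc/g++ invocations are routed
# through ccache, whichever toolchain is active.
mkdir -p /usr/local/bin
for compiler in cc c++ gcc g++; do
    ln -sf "$(command -v ccache)" "/usr/local/bin/${compiler}"
done

# In a dockcross container CC/CXX point at the cross toolchain; wrapping
# those names as well keeps cross-compilation builds cached too.
for compiler in ${CC:-} ${CXX:-}; do
    ln -sf "$(command -v ccache)" "/usr/local/bin/$(basename "${compiler}")"
done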

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Ran test_docker_cache.py, ensured both existing tests pass.
  • Code is well-documented:
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

@KellenSunderland
Contributor Author

Would appreciate a review from @marcoabreu and @lebeg.

@lebeg
Contributor

lebeg commented Jun 15, 2018

Excellent work @KellenSunderland! Do you think there is any need to make the CentOS installations robust to compiler changes as well? Maybe mentioning #11257 would be good, too.

Contributor

@larroy left a comment

Nice. Is this a bug in docker? Shall we report it?

@KellenSunderland
Contributor Author

@lebeg I think calling the compiler with $CC should be sufficient for CentOS, at least for the time being. We can always update in the future if needed.

@larroy I think there's a feature request that will make this easier in the future. Hopefully v19 will include the feature request. It is certainly a breaking change from versions prior to 17, but it was made in response to some valid security concerns. There's also a workaround, but I think implementing it would add quite a lot of complexity.
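
For reference, the remote-cache flow under discussion boils down to reusing layers from a previously published image. The image and Dockerfile names below match the ci/build.py invocation quoted later in this thread; the explicit pull step is an assumption about how the cache image would be primed on a fresh host:

docker pull mxnetci/build.android_arm64 || true
docker build -f docker/Dockerfile.build.android_arm64 \
    --cache-from mxnetci/build.android_arm64 \
    -t mxnetci/build.android_arm64 docker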

@marcoabreu
Contributor

Great job, Kellen! Thanks a lot!

I think ccache is now not being applied to all builds. Examples:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11302/2/pipeline/59
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11302/2/pipeline/54

On some others, it works:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11302/2/pipeline/52
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11302/2/pipeline/51

Have a look at the execution durations to see whether ccache was actually used or not.

Things that still don't work and haven't worked before are GPU builds. Those are a separate issue.

@larroy
Contributor

larroy commented Jun 16, 2018

LGTM. I can try what I asked about myself, as I'm not sure my question was well understood.

@KellenSunderland
Contributor Author

@larroy Sounds good; we can also chat offline about it in more detail if I've misunderstood.

@marcoabreu I'm not sure I fully understand the examples you've given. Do they relate to the issue that this PR addresses (docker cache misses)?

./configure
# Manually specify x86 gcc versions so that this script remains compatible with dockcross (which uses an ARM based gcc
# by default).
CC=/usr/bin/gcc CXX=/usr/bin/g++ ./configure
Contributor

Can't we set this environment variable in the Dockerfile and then unset it, since it's specific to dockcross?

Contributor Author

We want to ensure we're maintaining the CC and CXX env vars from the 'dockcross/linux-armv7' image, so to do that we'd have to store them and then restore them. That seems more complicated to me than just overriding them for a single command (see the sketch below).

What would be the advantage, readability?
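
For comparison, the save-and-restore alternative mentioned above might look like this (illustrative sketch only, not code from the PR):

# Stash the cross-compiler settings provided by the dockcross image.
SAVED_CC="${CC}"
SAVED_CXX="${CXX}"
# Point at the x86 toolchain just for ./configure, then restore.
export CC=/usr/bin/gcc CXX=/usr/bin/g++
./configure
export CC="${SAVED_CC}" CXX="${SAVED_CXX}"
unset SAVED_CC SAVED_CXX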

@marcoabreu
Contributor

Yes, I got the feeling that your changes are causing some builds to not use ccache anymore. It'd be good if you could verify that.

@KellenSunderland
Contributor Author

@marcoabreu Gotcha. I think it would be useful to include some more information about what ccache is doing in the logs. I'll see if I can create a (separate) PR to do that, and then rebase.

@marcoabreu
Contributor

That's an excellent idea! Maybe just print the cache hit/miss statistics after a build (not globally using ccache -s, as that would include everything).
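
A minimal way to get per-build numbers, assuming the build step is wrapped in a shell script (sketch only; the eventual CI change may differ):

ccache --zero-stats           # reset the counters before the build
make -j"$(nproc)"             # the actual build step
ccache --show-stats           # hit/miss numbers now reflect only this build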

Contributor

@larroy left a comment

Doesn't really work for me:

pllarroy@186590d670bd:0:~/devel/mxnet/mxnet$ ci/build.py -p android_arm64
build.py: 2018-06-16 23:12:21,380 Building container tagged 'mxnetci/build.android_arm64' with docker
build.py: 2018-06-16 23:12:21,380 Running command: 'docker build -f docker/Dockerfile.build.android_arm64 --build-arg USER_ID=1177751787 --cache-from mxnetci/build.android_arm64 -t mxnetci/build.android_arm64 docker'
Sending build context to Docker daemon  103.4kB
Step 1/27 : FROM dockcross/base:latest
 ---> a4a05890b715
Step 2/27 : MAINTAINER Pedro Larroy "pllarroy@amazon.com"
 ---> Using cache
 ---> 7592fc7b9c5c
Step 3/27 : COPY --from=ccachebuilder /usr/local/bin/ccache /usr/local/bin/ccache
invalid from flag value ccachebuilder: pull access denied for ccachebuilder, repository does not exist or may require 'docker login'
Traceback (most recent call last):
  File "ci/build.py", line 357, in <module>
    sys.exit(main())
  File "ci/build.py", line 283, in main
    build_docker(platform, docker_binary, registry=args.docker_registry)
  File "ci/build.py", line 84, in build_docker
    check_call(cmd)
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['docker', 'build', '-f', 'docker/Dockerfile.build.android_arm64', '--build-arg', 'USER_ID=1177751787', '--cache-from', 'mxnetci/build.android_arm64', '-t', 'mxnetci/build.android_arm64', 'docker']' returned non-zero exit status 1.

@marcoabreu
Contributor

Could you elaborate, Pedro? Kellen removes the multi-head stage in his PR, while your log seems to reference a Dockerfile that still has it. Could it be that you checked out the wrong commit?

@KellenSunderland force-pushed the dockcross_cache branch 2 times, most recently from 2df969f to 3e8ab97 on June 17, 2018 12:00
@KellenSunderland
Contributor Author

KellenSunderland commented Jun 17, 2018

@marcoabreu @larroy: No, actually the error Pedro pointed out was a valid one. I've fixed it and am investigating why the build succeeded earlier.

Edit: looks like the earlier run didn't test the Android ARM64 target, so that would explain it. Thanks for the testing, Pedro.

@marcoabreu
Contributor

Ah, your PR did not change Android ARM64 because we just merged it, and that caused the race condition in the git merge. Thanks for elaborating.

@KellenSunderland
Contributor Author

Any other data you want to see here, @marcoabreu? No rush on the merge, but let me know if you need anything else.

@marcoabreu
Contributor

I compared the two following runs:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11302/6/pipeline/58
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1012/pipeline/56/

I noticed the following slowdowns, some of them quite significant:
CPU clang 3.9, CPU clang 3.9 MKLDNN, CPU clang 5.0, CPU clang 5.0 MKLDNN, CPU MKLDNN, CPU: Openblas, GPU CMake, GPU CMake MKLDNN, GPU CUDA 9.1, GPU MKLDNN, NVidia Jetson ARMv8

The other ones have not been slowed down or improved.

I think we should investigate this before we merge. What I noticed is that the CentOS jobs are still exactly the same, while almost all other jobs got slower, by factors of up to 10x.

@KellenSunderland
Contributor Author

KellenSunderland commented Jun 19, 2018 via email

@marcoabreu
Contributor

Oops, I forgot to take that time into account. Thank you and sorry for the inconvenience.

@marcoabreu merged commit 47e2b89 into apache:master on Jun 19, 2018
@KellenSunderland
Contributor Author

Thanks man, appreciate it. I'll make sure to monitor master for a couple of days and verify that the cache is being utilized.

zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
…11302)

* Add ccache reporting to CI

* Restructure dockcross dockerfiles to fix caching
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
…11302)

* Add ccache reporting to CI

* Restructure dockcross dockerfiles to fix caching