feature: Add TensorFlow 2.1 dockerfiles #24
Merged
saimidu merged 3 commits into aws:master from saimidu:tf_2.1_training on Feb 27, 2020
Conversation
arjkesh approved these changes on Feb 25, 2020

If you push an empty commit, the sanity tests will go away.
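The reviewer's tip works because an empty commit moves the branch head to a new SHA without touching any files, which re-triggers the CI checks on push. A minimal, self-contained sketch (run in a throwaway repository here; on the actual PR branch it would be followed by a `git push`):

```shell
# Demonstrate an empty commit in a scratch repository: a new commit SHA is
# recorded (which would re-trigger CI on push) while no files change.
tmpdir=$(mktemp -d)
cd "$tmpdir"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -m "Re-trigger CI checks"
# The commit exists, but its diff is empty:
git show --stat --format=%s HEAD
```

On the real branch this reduces to `git commit --allow-empty -m "Re-trigger CI"` followed by `git push origin tf_2.1_training`.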
tejaschumbalkar referenced this pull request in tejaschumbalkar/deep-learning-containers on Aug 3, 2021
Yadan-Wei pushed a commit that referenced this pull request on Mar 24, 2026
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 50
X-AI-Prompt: (pasted vLLM Docker build log)

Using MAX_JOBS=32 as the number of jobs.
Using NVCC_THREADS=16 as the number of nvcc threads.
-- The CXX compiler identification is GNU 11.5.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - failed
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ - broken
CMake Error at /opt/venv/lib/python3.12/site-packages/cmake/data/share/cmake-4.3/Modules/CMakeTestCXXCompiler.cmake:73 (message):
  The C++ compiler "/usr/bin/c++" is not able to compile a simple test program.
  It fails with the following output:

    Change Dir: '/workspace/vllm/build/temp.linux-x86_64-cpython-312/CMakeFiles/CMakeScratch/TryCompile-Y7AumQ'

    Run Build Command(s): /opt/venv/bin/ninja -v cmTC_ff516
    [1/2] sccache /usr/bin/c++ -o CMakeFiles/cmTC_ff516.dir/testCXXCompiler.cxx.o -c /workspace/vllm/build/temp.linux-x86_64-cpython-312/CMakeFiles/CMakeScratch/TryCompile-Y7AumQ/testCXXCompiler.cxx
    FAILED: [code=2] CMakeFiles/cmTC_ff516.dir/testCXXCompiler.cxx.o
    sccache /usr/bin/c++ -o CMakeFiles/cmTC_ff516.dir/testCXXCompiler.cxx.o -c /workspace/vllm/build/temp.linux-x86_64-cpython-312/CMakeFiles/CMakeScratch/TryCompile-Y7AumQ/testCXXCompiler.cxx
    sccache: error: Server startup failed: cache storage failed to read: Unexpected (permanent) at read => S3Error { code: "AuthorizationHeaderMalformed", message: "The authorization header is malformed; a non-empty Access Key (AKID) must be provided in the credential.", resource: "", request_id: "9JNZ99SMVCR9235F" }

    Context:
      uri: https://s3.us-west-2.amazonaws.com/dlc-cicd-models/sccache/vllm/.sccache_check
      response: Parts { status: 400, version: HTTP/1.1, headers: {"x-amz-request-id": "9JNZ99SMVCR9235F", "x-amz-id-2": "xP77wFtCDnopxg4jLe8wBmqfAYAk3v+fP16A7xtV1fsZueOgmrd/cCc7CZRjMMLMk+FfKnUhh5c=", "content-type": "application/xml", "transfer-encoding": "chunked", "date": "Tue, 24 Mar 2026 05:57:07 GMT", "connection": "close", "server": "AmazonS3"} }
      service: s3
      path: .sccache_check
      range: 0-

    Backtrace (frames 0-12): <unknown>

    Run with SCCACHE_LOG=debug SCCACHE_NO_DAEMON=1 to get more information
    ninja: build stopped: subcommand failed.

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:14 (project)

-- Configuring incomplete, errors occurred!
Traceback (most recent call last):
  File "/workspace/vllm/setup.py", line 1044, in <module>
    setup(
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/__init__.py", line 117, in setup
    return distutils.core.setup(**attrs)  # type: ignore[return-value]
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/core.py", line 186, in setup
    return run_commands(dist)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/core.py", line 202, in run_commands
    dist.run_commands()
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/dist.py", line 1002, in run_commands
    self.run_command(cmd)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/dist.py", line 1107, in run_command
    super().run_command(command)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
    cmd_obj.run()
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/command/bdist_wheel.py", line 370, in run
    self.run_command("build")
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
    self.distribution.run_command(command)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/dist.py", line 1107, in run_command
    super().run_command(command)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
    cmd_obj.run()
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/command/build.py", line 135, in run
    self.run_command(cmd_name)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
    self.distribution.run_command(command)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/dist.py", line 1107, in run_command
    super().run_command(command)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
    cmd_obj.run()
  File "/workspace/vllm/setup.py", line 360, in run
    super().run()
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/command/build_ext.py", line 97, in run
    _build_ext.run(self)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 368, in run
    self.build_extensions()
  File "/workspace/vllm/setup.py", line 317, in build_extensions
    self.configure(ext)
  File "/workspace/vllm/setup.py", line 294, in configure
    subprocess.check_call(
  File "/usr/lib64/python3.12/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '/workspace/vllm', '-G', 'Ninja', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DVLLM_TARGET_DEVICE=cuda', '-DCMAKE_C_COMPILER_LAUNCHER=sccache', '-DCMAKE_CXX_COMPILER_LAUNCHER=sccache', '-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache', '-DCMAKE_HIP_COMPILER_LAUNCHER=sccache', '-DVLLM_PYTHON_EXECUTABLE=/opt/venv/bin/python3', '-DVLLM_PYTHON_PATH=/workspace/vllm:/usr/lib64/python312.zip:/usr/lib64/python3.12:/usr/lib64/python3.12/lib-dynload:/opt/venv/lib64/python3.12/site-packages:/opt/venv/lib64/python3.12/site-packages/nvidia_cutlass_dsl/python_packages:/opt/venv/lib/python3.12/site-packages:/opt/venv/lib/python3.12/site-packages/nvidia_cutlass_dsl/python_packages:/opt/venv/lib64/python3.12/site-packages/setuptools/_vendor:/opt/venv/lib/python3.12/site-packages/grpc_tools/_proto', '-DFETCHCONTENT_BASE_DIR=/workspace/vllm/.deps', '-DNVCC_THREADS=16', '-DCMAKE_JOB_POOL_COMPILE:STRING=compile', '-DCMAKE_JOB_POOLS:STRING=compile=2', '-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc']' returned non-zero exit status 1.

ERROR: process "/bin/sh -c python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 && if [ -n \"${SCCACHE_BUCKET}\" ]; then sccache --show-stats; fi" did not complete successfully: exit code: 1

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
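The S3 error above ("a non-empty Access Key (AKID) must be provided") is the typical signature of an access key that is set but empty: an empty AWS_ACCESS_KEY_ID (for example, a Docker build-arg forwarded without a value) takes precedence over the default credential chain and gets sent to S3 as a blank AKID. A hedged sketch of a guard a build script could run before starting sccache (the variable names are standard AWS SDK ones; the guard itself is an illustration, not part of this PR):

```shell
# Simulate the failure mode: a credential variable exported as an empty string.
export AWS_ACCESS_KEY_ID=""

# Guard: if the key is set but empty, unset it (and its companions) so that
# sccache / the AWS SDK falls back to the default credential chain
# (instance profile, container credentials, ...) instead of sending "".
if [ -n "${AWS_ACCESS_KEY_ID+set}" ] && [ -z "${AWS_ACCESS_KEY_ID}" ]; then
    unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
fi

echo "AWS_ACCESS_KEY_ID is now: ${AWS_ACCESS_KEY_ID-<unset>}"
```

The distinction matters because most AWS SDKs treat an empty-but-set key as an explicit (malformed) credential rather than as "no credential configured".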
Yadan-Wei added a commit that referenced this pull request on Mar 27, 2026
* Human changes made during kiro-cli session after prompt completion.
  (X-AI-Tool: Human; X-AI-Prompt: "can you summerize this PR #5763 so I can add discription in the pr")
* AI changes made during Kiro-cli session
  (X-AI-Tool: Kiro-cli; 74 s; X-AI-Prompt: "can you look at this dockerfile sample https://github.com/aws/deep-learning-containers/pull/5808/changes#diff-aff16f8c535417fcf020bc2184ab09935e6c66cf46842f6ccee6d2022f4077ff to modify my dockerfile for oss setup /Volumes/workplace/kiro-workplace/AsimovBuilderCoreContext/src/AsimovBuilderCoreContext/workspace/2week/deep-learning-containers/docker/vllm/Dockerfile.amzn2023")
* Human changes made during kiro-cli session after prompt completion (same X-AI-Prompt as the previous entry).
* AI changes made during Kiro-cli session
  (Kiro-cli; 183 s; Prompt: "for my build vllm container, how can I add benchmark test with popular models")
* AI changes made during Kiro-cli session
  (Kiro-cli; 142 s; Prompt: "okay could you implement for me and could you find which s3 bucket sample pr is using, we can use the same one")
* AI changes made during Kiro-cli session
  (Kiro-cli; 28 s; Prompt: "how the cache will be saved bucket/hash/**.o?")
* AI changes made during Kiro-cli session
  (Kiro-cli; 75 s; Prompt: "how my sample PR access s3 bucket, I think we do not need to do above things")
* AI changes made during Kiro-cli session
  (Kiro-cli; 50 s; Prompt: the same sccache/CMake build-failure log quoted in the Mar 24, 2026 commit above)
* AI changes made during Kiro-cli session
  (Kiro-cli; 55 s; Prompt: "but only ec2 workflow pass sccache bucket")
* AI changes made during Kiro-cli session
  (Kiro-cli; 38 s; Prompt: "if need to install this to docker where should i place it", followed by this Dockerfile fragment:)

  # install kv_connectors if requested
  ARG INSTALL_KV_CONNECTORS=false
  ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'
  ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
  RUN --mount=type=cache,target=/root/.cache/uv \
      --mount=type=bind,source=requirements/kv_connectors.txt,target=/tmp/kv_connectors.txt,ro \
      CUDA_MAJOR="${CUDA_VERSION%%.*}"; \
      CUDA_VERSION_DASH=$(echo $CUDA_VERSION | cut -d. -f1,2 | tr '.' '-'); \
      # lmcache requires explicitly specifying CUDA_HOME
      CUDA_HOME=/usr/local/cuda; \
      BUILD_PKGS="libcusparse-dev-${CUDA_VERSION_DASH} libcublas-dev-${CUDA_VERSION_DASH} libcusolver-dev-${CUDA_VERSION_DASH}"; \
      if [ "$INSTALL_KV_CONNECTORS" = "true" ]; then \
          if [ "$CUDA_MAJOR" -ge 13 ]; then \
              uv pip install --system nixl-cu13; \
          fi; \
          uv pip install --system -r /tmp/kv_connectors.txt --no-build || ( \
              # if the above fails, install from source
              apt-get update -y && \
              apt-get install -y --no-install-recommends ${BUILD_PKGS} && \
              uv pip install --system -r /tmp/kv_connectors.txt --no-build-isolation && \
              apt-get purge -y ${BUILD_PKGS} && \
              # clean up -dev packages, keep runtime libraries
              rm -rf /var/lib/apt/lists/* \
          ); \
      fi

* Human changes made during kiro-cli session after prompt completion (same kv_connectors prompt as the previous entry).
* AI changes made during Kiro-cli session
  (Kiro-cli; 23 s; Prompt: "i use m6a.8xlarge EC2, additional 5,000 GB EBS volume size to build")
* AI changes made during Kiro-cli session
  (Kiro-cli; 81 s; Prompt: "you can reas this for information /Volumes/workplace/kiro-workplace/AsimovBuilderCoreContext/src/AsimovBuilderCoreContext/workspace/2week/deep-learning-containers/docker/vllm/Dockerfile.amzn2023")
* feat: upgrade vLLM to 0.18.0, enable kv_connectors, simplify sccache credentials
  - Bump vLLM from 0.17.1 to 0.18.0
  - Enable INSTALL_KV_CONNECTORS by default (lmcache, nixl)
  - Copy kv_connectors.txt from source stage instead of build stage
  - Remove static credential ARGs for sccache, use container credential endpoint only
* fix: remove dead sccache credential fallbacks from build script
  Match simplified Dockerfile: only use AWS_CONTAINER_CREDENTIALS_RELATIVE_URI for sccache S3 access via CodeBuild IAM role.
* fix: add sccache smoke test, exclude model/benchmark test paths from SageMaker workflow
  - Add sccache S3 connectivity smoke test before EC2 build
  - Exclude vllm_model_smoke_test.sh, vllm_benchmark_test.sh, benchmark_report.py from triggering the SageMaker workflow (not used by SageMaker)
* chore: update vllm config for 0.18.0
* remove some files
* fix: remove explicit credential ENV from Dockerfile, let sccache use default credential chain
  Setting AWS_CONTAINER_CREDENTIALS_RELATIVE_URI to an empty string breaks the AWS SDK default credential chain, preventing instance-profile auth from working. With --network=host, sccache can reach the instance metadata service directly without any credential ENV vars.
* fix: forward AWS_CONTAINER_CREDENTIALS_FULL_URI for CodeBuild sccache auth
  CodeBuild uses FULL_URI (http://127.0.0.1:port/...), not RELATIVE_URI. Forward it as a build-arg so sccache inside Docker can authenticate to S3. --network=host makes the local endpoint reachable.
* fix: smoke test use --no-cache and pass FULL_URI credentials
* fix: remove credential ARGs, use IMDSv2 via --network=host, add SCCACHE_IDLE_TIMEOUT=0
  - Remove all AWS credential ARGs/ENVs from Dockerfile; sccache uses IMDSv2 on EC2 fleet runners via --network=host
  - Add SCCACHE_IDLE_TIMEOUT=0 to prevent daemon shutdown during long builds (likely cause of only partial cache being written)
* fix: smoke test use IMDSv2 only, no credential args
* fix: use static credentials for sccache S3 access
  IMDSv2 not reachable from inside Docker on CodeBuild fleet runners. Snapshot credentials via aws configure export-credentials instead.
* revert sccache change and change nvcc_thread
* remove unused filed
* update flashinfer version
* fix test directory
* fix directory
* Delete .gem-config/config

---------
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <weiyadan@amazon.com>
Co-authored-by: Yadan Wei <yadanwei@amazon.com>
Co-authored-by: Yadan Wei <weiyadan@amazon.com>
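The credential-snapshot approach mentioned in the commit trail can be sketched as follows. Assumptions flagged loudly: AWS CLI v2 (which provides `aws configure export-credentials --format env`), the `dlc-cicd-models` bucket name taken from the build log earlier on this page, and a Dockerfile that declares matching `ARG AWS_ACCESS_KEY_ID` etc.; the sketch is guarded so it is a no-op where the tools are absent:

```shell
# Sketch: snapshot the caller's AWS credentials once, then forward them to the
# Docker build so sccache inside the build can read/write its S3 cache.
status=skipped
if command -v aws >/dev/null 2>&1 && command -v docker >/dev/null 2>&1; then
    # export-credentials emits `export AWS_ACCESS_KEY_ID=...` lines for the
    # current identity; eval-ing them puts a point-in-time snapshot in the env.
    eval "$(aws configure export-credentials --format env)" &&
    docker build \
        --build-arg AWS_ACCESS_KEY_ID \
        --build-arg AWS_SECRET_ACCESS_KEY \
        --build-arg AWS_SESSION_TOKEN \
        --build-arg SCCACHE_BUCKET=dlc-cicd-models \
        -f docker/vllm/Dockerfile.amzn2023 . &&
    status=built
else
    echo "aws CLI or docker not available; skipping credential snapshot"
fi
echo "result: ${status}"
```

Note the trade-off the commit trail itself records: snapshotted credentials expire mid-build on long compiles, and baking them into build args was later reverted ("revert sccache change").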
Jyothirmaikottu pushed a commit that referenced this pull request on Mar 30, 2026
'-'); \ CUDA_HOME=/usr/local/cuda; \ # lmcache requires explicit specifying CUDA_HOME BUILD_PKGS="libcusparse-dev-${CUDA_VERSION_DASH} \ libcublas-dev-${CUDA_VERSION_DASH} \ libcusolver-dev-${CUDA_VERSION_DASH}"; \ if [ "$INSTALL_KV_CONNECTORS" = "true" ]; then \ if [ "$CUDA_MAJOR" -ge 13 ]; then \ uv pip install --system nixl-cu13; \ fi; \ uv pip install --system -r /tmp/kv_connectors.txt --no-build || ( \ # if the above fails, install from source apt-get update -y && \ apt-get install -y --no-install-recommends ${BUILD_PKGS} && \ uv pip install --system -r /tmp/kv_connectors.txt --no-build-isolation && \ apt-get purge -y ${BUILD_PKGS} && \ # clean up -dev packages, keep runtime libraries rm -rf /var/lib/apt/lists/* \ ); \ fi Signed-off-by: Yadan Wei <yadanwei@amazon.com> * AI changes made during Kiro-cli session --- X-AI-Tool: Kiro-cli X-AI-Handle-Time-Seconds: 23 X-AI-Prompt: i use m6a.8xlarge EC2, additional 5,000 GB EBS volume size to build Signed-off-by: Yadan Wei <yadanwei@amazon.com> * AI changes made during Kiro-cli session --- X-AI-Tool: Kiro-cli X-AI-Handle-Time-Seconds: 81 X-AI-Prompt: you can reas this for information /Volumes/workplace/kiro-workplace/AsimovBuilderCoreContext/src/AsimovBuilderCoreContext/workspace/2week/deep-learning-containers/docker/vllm/Dockerfile.amzn2023 Signed-off-by: Yadan Wei <yadanwei@amazon.com> * feat: upgrade vLLM to 0.18.0, enable kv_connectors, simplify sccache credentials - Bump vLLM from 0.17.1 to 0.18.0 - Enable INSTALL_KV_CONNECTORS by default (lmcache, nixl) - Copy kv_connectors.txt from source stage instead of build stage - Remove static credential ARGs for sccache, use container credential endpoint only Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: remove dead sccache credential fallbacks from build script Match simplified Dockerfile — only use AWS_CONTAINER_CREDENTIALS_RELATIVE_URI for sccache S3 access via CodeBuild IAM role. 
Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: add sccache smoke test, exclude model/benchmark test paths from SageMaker workflow - Add sccache S3 connectivity smoke test before EC2 build - Exclude vllm_model_smoke_test.sh, vllm_benchmark_test.sh, benchmark_report.py from triggering the SageMaker workflow (not used by SageMaker) Signed-off-by: Yadan Wei <yadanwei@amazon.com> * chore: update vllm config for 0.18.0 Signed-off-by: Yadan Wei <yadanwei@amazon.com> * remove some files Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: remove explicit credential ENV from Dockerfile, let sccache use default credential chain Setting AWS_CONTAINER_CREDENTIALS_RELATIVE_URI to empty string breaks the AWS SDK default credential chain, preventing instance profile auth from working. With --network=host, sccache can reach the instance metadata service directly without any credential ENV vars. Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: forward AWS_CONTAINER_CREDENTIALS_FULL_URI for CodeBuild sccache auth CodeBuild uses FULL_URI (http://127.0.0.1:port/...) not RELATIVE_URI. Forward it as build-arg so sccache inside Docker can authenticate to S3. --network=host makes the local endpoint reachable. Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: smoke test use --no-cache and pass FULL_URI credentials Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: remove credential ARGs, use IMDSv2 via --network=host, add SCCACHE_IDLE_TIMEOUT=0 - Remove all AWS credential ARGs/ENVs from Dockerfile — sccache uses IMDSv2 on EC2 fleet runners via --network=host - Add SCCACHE_IDLE_TIMEOUT=0 to prevent daemon shutdown during long builds (likely cause of only partial cache being written) Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: smoke test use IMDSv2 only, no credential args Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: use static credentials for sccache S3 access IMDSv2 not reachable from inside Docker on CodeBuild fleet runners. 
Snapshot credentials via aws configure export-credentials instead. Signed-off-by: Yadan Wei <yadanwei@amazon.com> * revert sccache change and change nvcc_thread * remove unused filed Signed-off-by: Yadan Wei <weiyadan@amazon.com> * update flashinfer version Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix test directory Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix directory Signed-off-by: Yadan Wei <yadanwei@amazon.com> * Delete .gem-config/config --------- Signed-off-by: Yadan Wei <yadanwei@amazon.com> Signed-off-by: Yadan Wei <weiyadan@amazon.com> Co-authored-by: Yadan Wei <yadanwei@amazon.com> Co-authored-by: Yadan Wei <weiyadan@amazon.com>
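The version-string handling in the kv_connectors snippet above can be checked in isolation. A minimal sketch, assuming POSIX `sh` (the function names are mine, not the Dockerfile's):

```shell
#!/bin/sh
# Derive the apt package suffix used for CUDA -dev packages
# (e.g. "12-8" from "12.8.1"), as the kv_connectors RUN step does.
cuda_version_dash() {
    echo "$1" | cut -d. -f1,2 | tr '.' '-'
}

# The major version alone gates CUDA-13-only wheels such as nixl-cu13.
cuda_major() {
    echo "${1%%.*}"
}

cuda_version_dash "12.8.1"   # prints: 12-8
cuda_major "13.0.1"          # prints: 13
```

With that suffix, `libcusparse-dev-${CUDA_VERSION_DASH}` resolves to a versioned package name such as `libcusparse-dev-12-8`.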
bhanutejagk pushed a commit that referenced this pull request on Mar 31, 2026
The commit carries the same squash message; its earlier Kiro-cli session entries, not included in the copy above, were:

* Human changes made during kiro-cli session after prompt completion — prompt: "can you summerize this PR #5763 so I can add discription in the pr"
* AI changes made during Kiro-cli session (74 s), followed by a matching human edit with the same prompt: "can you look at this dockerfile sample https://github.com/aws/deep-learning-containers/pull/5808/changes#diff-aff16f8c535417fcf020bc2184ab09935e6c66cf46842f6ccee6d2022f4077ff to modify my dockerfile for oss setup /Volumes/workplace/kiro-workplace/AsimovBuilderCoreContext/src/AsimovBuilderCoreContext/workspace/2week/deep-learning-containers/docker/vllm/Dockerfile.amzn2023"
* AI changes made during Kiro-cli session (183 s) — prompt: "for my build vllm container, how can I add benchmark test with popular models"
* AI changes made during Kiro-cli session (142 s) — prompt: "okay could you implement for me and could you find which s3 bucket sample pr is using, we can use the same one"
* AI changes made during Kiro-cli session (28 s) — prompt: "how the cache will be saved bucket/hash/**.o?"
* AI changes made during Kiro-cli session (75 s) — prompt: "how my sample PR access s3 bucket, I think we do not need to do above things"
* AI changes made during Kiro-cli session (50 s) — prompt: the sccache S3 `AuthorizationHeaderMalformed` build-failure log quoted above

The rest of the message (the build-failure log, the kv_connectors Dockerfile snippet, the remaining sub-commit notes, and the sign-off trailers) is identical to the earlier copy.
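The `uv pip install … || ( … )` construct in the kv_connectors step is a general try-prebuilt-then-build-from-source fallback. A runnable sketch of just the control flow, with `false`/`echo` stand-ins replacing the real uv and apt-get commands (hypothetical, for illustration):

```shell
#!/bin/sh
# Shape of the kv_connectors fallback: attempt the cheap prebuilt
# install first; only if it fails, install -dev packages, build
# from source, and purge the build deps again.
try_prebuilt() { false; }                             # stand-in for: uv pip install ... --no-build
build_from_source() { echo "installed from source"; } # stand-in for: apt-get + --no-build-isolation

try_prebuilt || ( \
    build_from_source \
)   # prints: installed from source
```

If the last command in the subshell fails, the whole `RUN` step fails, so a broken source build still breaks the image build instead of passing silently.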
Description of changes:
Migrate the dockerfiles for TF 2.1 from the sagemaker-tensorflow-container repository.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.