feature: Add TensorFlow 2.1 dockerfiles #24
Merged
saimidu merged 3 commits into aws:master from saimidu:tf_2.1_training on Feb 27, 2020
Conversation
arjkesh approved these changes on Feb 25, 2020

If you push an empty commit, the sanity tests will go away.
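The reviewer's tip works because an empty commit moves the branch head to a new SHA without touching any files, which re-triggers the CI checks on push. A minimal, self-contained sketch (run in a throwaway repository here; on the actual PR branch it would be followed by a `git push`):

```shell
# Demonstrate an empty commit in a scratch repository: a new commit SHA is
# recorded (which would re-trigger CI on push) while no files change.
tmpdir=$(mktemp -d)
cd "$tmpdir"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -m "Re-trigger CI checks"
# The commit exists, but its diff is empty:
git show --stat --format=%s HEAD
```

On the real branch this reduces to `git commit --allow-empty -m "Re-trigger CI"` followed by `git push origin tf_2.1_training`.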
tejaschumbalkar referenced this pull request in tejaschumbalkar/deep-learning-containers on Aug 3, 2021
Yadan-Wei pushed a commit that referenced this pull request on Mar 24, 2026
---
X-AI-Tool: Kiro-cli
X-AI-Handle-Time-Seconds: 50
X-AI-Prompt: (pasted vLLM Docker build log)

Using MAX_JOBS=32 as the number of jobs.
Using NVCC_THREADS=16 as the number of nvcc threads.
-- The CXX compiler identification is GNU 11.5.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - failed
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ - broken
CMake Error at /opt/venv/lib/python3.12/site-packages/cmake/data/share/cmake-4.3/Modules/CMakeTestCXXCompiler.cmake:73 (message):
  The C++ compiler "/usr/bin/c++" is not able to compile a simple test program.
  It fails with the following output:

    Change Dir: '/workspace/vllm/build/temp.linux-x86_64-cpython-312/CMakeFiles/CMakeScratch/TryCompile-Y7AumQ'

    Run Build Command(s): /opt/venv/bin/ninja -v cmTC_ff516
    [1/2] sccache /usr/bin/c++ -o CMakeFiles/cmTC_ff516.dir/testCXXCompiler.cxx.o -c /workspace/vllm/build/temp.linux-x86_64-cpython-312/CMakeFiles/CMakeScratch/TryCompile-Y7AumQ/testCXXCompiler.cxx
    FAILED: [code=2] CMakeFiles/cmTC_ff516.dir/testCXXCompiler.cxx.o
    sccache /usr/bin/c++ -o CMakeFiles/cmTC_ff516.dir/testCXXCompiler.cxx.o -c /workspace/vllm/build/temp.linux-x86_64-cpython-312/CMakeFiles/CMakeScratch/TryCompile-Y7AumQ/testCXXCompiler.cxx
    sccache: error: Server startup failed: cache storage failed to read: Unexpected (permanent) at read => S3Error { code: "AuthorizationHeaderMalformed", message: "The authorization header is malformed; a non-empty Access Key (AKID) must be provided in the credential.", resource: "", request_id: "9JNZ99SMVCR9235F" }

    Context:
      uri: https://s3.us-west-2.amazonaws.com/dlc-cicd-models/sccache/vllm/.sccache_check
      response: Parts { status: 400, version: HTTP/1.1, headers: {"x-amz-request-id": "9JNZ99SMVCR9235F", "x-amz-id-2": "xP77wFtCDnopxg4jLe8wBmqfAYAk3v+fP16A7xtV1fsZueOgmrd/cCc7CZRjMMLMk+FfKnUhh5c=", "content-type": "application/xml", "transfer-encoding": "chunked", "date": "Tue, 24 Mar 2026 05:57:07 GMT", "connection": "close", "server": "AmazonS3"} }
      service: s3
      path: .sccache_check
      range: 0-

    Backtrace (frames 0-12): <unknown>

    Run with SCCACHE_LOG=debug SCCACHE_NO_DAEMON=1 to get more information
    ninja: build stopped: subcommand failed.

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:14 (project)

-- Configuring incomplete, errors occurred!
Traceback (most recent call last):
  File "/workspace/vllm/setup.py", line 1044, in <module>
    setup(
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/__init__.py", line 117, in setup
    return distutils.core.setup(**attrs)  # type: ignore[return-value]
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/core.py", line 186, in setup
    return run_commands(dist)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/core.py", line 202, in run_commands
    dist.run_commands()
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/dist.py", line 1002, in run_commands
    self.run_command(cmd)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/dist.py", line 1107, in run_command
    super().run_command(command)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
    cmd_obj.run()
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/command/bdist_wheel.py", line 370, in run
    self.run_command("build")
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
    self.distribution.run_command(command)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/dist.py", line 1107, in run_command
    super().run_command(command)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
    cmd_obj.run()
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/command/build.py", line 135, in run
    self.run_command(cmd_name)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
    self.distribution.run_command(command)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/dist.py", line 1107, in run_command
    super().run_command(command)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
    cmd_obj.run()
  File "/workspace/vllm/setup.py", line 360, in run
    super().run()
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/command/build_ext.py", line 97, in run
    _build_ext.run(self)
  File "/opt/venv/lib64/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 368, in run
    self.build_extensions()
  File "/workspace/vllm/setup.py", line 317, in build_extensions
    self.configure(ext)
  File "/workspace/vllm/setup.py", line 294, in configure
    subprocess.check_call(
  File "/usr/lib64/python3.12/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '/workspace/vllm', '-G', 'Ninja', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DVLLM_TARGET_DEVICE=cuda', '-DCMAKE_C_COMPILER_LAUNCHER=sccache', '-DCMAKE_CXX_COMPILER_LAUNCHER=sccache', '-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache', '-DCMAKE_HIP_COMPILER_LAUNCHER=sccache', '-DVLLM_PYTHON_EXECUTABLE=/opt/venv/bin/python3', '-DVLLM_PYTHON_PATH=/workspace/vllm:/usr/lib64/python312.zip:/usr/lib64/python3.12:/usr/lib64/python3.12/lib-dynload:/opt/venv/lib64/python3.12/site-packages:/opt/venv/lib64/python3.12/site-packages/nvidia_cutlass_dsl/python_packages:/opt/venv/lib/python3.12/site-packages:/opt/venv/lib/python3.12/site-packages/nvidia_cutlass_dsl/python_packages:/opt/venv/lib64/python3.12/site-packages/setuptools/_vendor:/opt/venv/lib/python3.12/site-packages/grpc_tools/_proto', '-DFETCHCONTENT_BASE_DIR=/workspace/vllm/.deps', '-DNVCC_THREADS=16', '-DCMAKE_JOB_POOL_COMPILE:STRING=compile', '-DCMAKE_JOB_POOLS:STRING=compile=2', '-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc']' returned non-zero exit status 1.

ERROR: process "/bin/sh -c python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 && if [ -n \"${SCCACHE_BUCKET}\" ]; then sccache --show-stats; fi" did not complete successfully: exit code: 1

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
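The S3 error above ("a non-empty Access Key (AKID) must be provided") is the typical signature of an access key that is set but empty: an empty AWS_ACCESS_KEY_ID (for example, a Docker build-arg forwarded without a value) takes precedence over the default credential chain and gets sent to S3 as a blank AKID. A hedged sketch of a guard a build script could run before starting sccache (the variable names are standard AWS SDK ones; the guard itself is an illustration, not part of this PR):

```shell
# Simulate the failure mode: a credential variable exported as an empty string.
export AWS_ACCESS_KEY_ID=""

# Guard: if the key is set but empty, unset it (and its companions) so that
# sccache / the AWS SDK falls back to the default credential chain
# (instance profile, container credentials, ...) instead of sending "".
if [ -n "${AWS_ACCESS_KEY_ID+set}" ] && [ -z "${AWS_ACCESS_KEY_ID}" ]; then
    unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
fi

echo "AWS_ACCESS_KEY_ID is now: ${AWS_ACCESS_KEY_ID-<unset>}"
```

The distinction matters because most AWS SDKs treat an empty-but-set key as an explicit (malformed) credential rather than as "no credential configured".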
Yadan-Wei added a commit that referenced this pull request on Mar 27, 2026
* Human changes made during kiro-cli session after prompt completion.
  (X-AI-Tool: Human; X-AI-Prompt: "can you summerize this PR #5763 so I can add discription in the pr")
* AI changes made during Kiro-cli session
  (X-AI-Tool: Kiro-cli; 74 s; X-AI-Prompt: "can you look at this dockerfile sample https://github.com/aws/deep-learning-containers/pull/5808/changes#diff-aff16f8c535417fcf020bc2184ab09935e6c66cf46842f6ccee6d2022f4077ff to modify my dockerfile for oss setup /Volumes/workplace/kiro-workplace/AsimovBuilderCoreContext/src/AsimovBuilderCoreContext/workspace/2week/deep-learning-containers/docker/vllm/Dockerfile.amzn2023")
* Human changes made during kiro-cli session after prompt completion (same X-AI-Prompt as the previous entry).
* AI changes made during Kiro-cli session
  (Kiro-cli; 183 s; Prompt: "for my build vllm container, how can I add benchmark test with popular models")
* AI changes made during Kiro-cli session
  (Kiro-cli; 142 s; Prompt: "okay could you implement for me and could you find which s3 bucket sample pr is using, we can use the same one")
* AI changes made during Kiro-cli session
  (Kiro-cli; 28 s; Prompt: "how the cache will be saved bucket/hash/**.o?")
* AI changes made during Kiro-cli session
  (Kiro-cli; 75 s; Prompt: "how my sample PR access s3 bucket, I think we do not need to do above things")
* AI changes made during Kiro-cli session
  (Kiro-cli; 50 s; Prompt: the same sccache/CMake build-failure log quoted in the Mar 24, 2026 commit above)
* AI changes made during Kiro-cli session
  (Kiro-cli; 55 s; Prompt: "but only ec2 workflow pass sccache bucket")
* AI changes made during Kiro-cli session
  (Kiro-cli; 38 s; Prompt: "if need to install this to docker where should i place it", followed by this Dockerfile fragment:)

  # install kv_connectors if requested
  ARG INSTALL_KV_CONNECTORS=false
  ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'
  ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
  RUN --mount=type=cache,target=/root/.cache/uv \
      --mount=type=bind,source=requirements/kv_connectors.txt,target=/tmp/kv_connectors.txt,ro \
      CUDA_MAJOR="${CUDA_VERSION%%.*}"; \
      CUDA_VERSION_DASH=$(echo $CUDA_VERSION | cut -d. -f1,2 | tr '.' '-'); \
      # lmcache requires explicitly specifying CUDA_HOME
      CUDA_HOME=/usr/local/cuda; \
      BUILD_PKGS="libcusparse-dev-${CUDA_VERSION_DASH} libcublas-dev-${CUDA_VERSION_DASH} libcusolver-dev-${CUDA_VERSION_DASH}"; \
      if [ "$INSTALL_KV_CONNECTORS" = "true" ]; then \
          if [ "$CUDA_MAJOR" -ge 13 ]; then \
              uv pip install --system nixl-cu13; \
          fi; \
          uv pip install --system -r /tmp/kv_connectors.txt --no-build || ( \
              # if the above fails, install from source
              apt-get update -y && \
              apt-get install -y --no-install-recommends ${BUILD_PKGS} && \
              uv pip install --system -r /tmp/kv_connectors.txt --no-build-isolation && \
              apt-get purge -y ${BUILD_PKGS} && \
              # clean up -dev packages, keep runtime libraries
              rm -rf /var/lib/apt/lists/* \
          ); \
      fi

* Human changes made during kiro-cli session after prompt completion (same kv_connectors prompt as the previous entry).
* AI changes made during Kiro-cli session
  (Kiro-cli; 23 s; Prompt: "i use m6a.8xlarge EC2, additional 5,000 GB EBS volume size to build")
* AI changes made during Kiro-cli session
  (Kiro-cli; 81 s; Prompt: "you can reas this for information /Volumes/workplace/kiro-workplace/AsimovBuilderCoreContext/src/AsimovBuilderCoreContext/workspace/2week/deep-learning-containers/docker/vllm/Dockerfile.amzn2023")
* feat: upgrade vLLM to 0.18.0, enable kv_connectors, simplify sccache credentials
  - Bump vLLM from 0.17.1 to 0.18.0
  - Enable INSTALL_KV_CONNECTORS by default (lmcache, nixl)
  - Copy kv_connectors.txt from source stage instead of build stage
  - Remove static credential ARGs for sccache, use container credential endpoint only
* fix: remove dead sccache credential fallbacks from build script
  Match simplified Dockerfile: only use AWS_CONTAINER_CREDENTIALS_RELATIVE_URI for sccache S3 access via CodeBuild IAM role.
* fix: add sccache smoke test, exclude model/benchmark test paths from SageMaker workflow
  - Add sccache S3 connectivity smoke test before EC2 build
  - Exclude vllm_model_smoke_test.sh, vllm_benchmark_test.sh, benchmark_report.py from triggering the SageMaker workflow (not used by SageMaker)
* chore: update vllm config for 0.18.0
* remove some files
* fix: remove explicit credential ENV from Dockerfile, let sccache use default credential chain
  Setting AWS_CONTAINER_CREDENTIALS_RELATIVE_URI to an empty string breaks the AWS SDK default credential chain, preventing instance-profile auth from working. With --network=host, sccache can reach the instance metadata service directly without any credential ENV vars.
* fix: forward AWS_CONTAINER_CREDENTIALS_FULL_URI for CodeBuild sccache auth
  CodeBuild uses FULL_URI (http://127.0.0.1:port/...), not RELATIVE_URI. Forward it as a build-arg so sccache inside Docker can authenticate to S3. --network=host makes the local endpoint reachable.
* fix: smoke test use --no-cache and pass FULL_URI credentials
* fix: remove credential ARGs, use IMDSv2 via --network=host, add SCCACHE_IDLE_TIMEOUT=0
  - Remove all AWS credential ARGs/ENVs from Dockerfile; sccache uses IMDSv2 on EC2 fleet runners via --network=host
  - Add SCCACHE_IDLE_TIMEOUT=0 to prevent daemon shutdown during long builds (likely cause of only partial cache being written)
* fix: smoke test use IMDSv2 only, no credential args
* fix: use static credentials for sccache S3 access
  IMDSv2 not reachable from inside Docker on CodeBuild fleet runners. Snapshot credentials via aws configure export-credentials instead.
* revert sccache change and change nvcc_thread
* remove unused filed
* update flashinfer version
* fix test directory
* fix directory
* Delete .gem-config/config

---------
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <weiyadan@amazon.com>
Co-authored-by: Yadan Wei <yadanwei@amazon.com>
Co-authored-by: Yadan Wei <weiyadan@amazon.com>
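The credential-snapshot approach mentioned in the commit trail can be sketched as follows. Assumptions flagged loudly: AWS CLI v2 (which provides `aws configure export-credentials --format env`), the `dlc-cicd-models` bucket name taken from the build log earlier on this page, and a Dockerfile that declares matching `ARG AWS_ACCESS_KEY_ID` etc.; the sketch is guarded so it is a no-op where the tools are absent:

```shell
# Sketch: snapshot the caller's AWS credentials once, then forward them to the
# Docker build so sccache inside the build can read/write its S3 cache.
status=skipped
if command -v aws >/dev/null 2>&1 && command -v docker >/dev/null 2>&1; then
    # export-credentials emits `export AWS_ACCESS_KEY_ID=...` lines for the
    # current identity; eval-ing them puts a point-in-time snapshot in the env.
    eval "$(aws configure export-credentials --format env)" &&
    docker build \
        --build-arg AWS_ACCESS_KEY_ID \
        --build-arg AWS_SECRET_ACCESS_KEY \
        --build-arg AWS_SESSION_TOKEN \
        --build-arg SCCACHE_BUCKET=dlc-cicd-models \
        -f docker/vllm/Dockerfile.amzn2023 . &&
    status=built
else
    echo "aws CLI or docker not available; skipping credential snapshot"
fi
echo "result: ${status}"
```

Note the trade-off the commit trail itself records: snapshotted credentials expire mid-build on long compiles, and baking them into build args was later reverted ("revert sccache change").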
Jyothirmaikottu pushed a commit that referenced this pull request on Mar 30, 2026
'-'); \ CUDA_HOME=/usr/local/cuda; \ # lmcache requires explicit specifying CUDA_HOME BUILD_PKGS="libcusparse-dev-${CUDA_VERSION_DASH} \ libcublas-dev-${CUDA_VERSION_DASH} \ libcusolver-dev-${CUDA_VERSION_DASH}"; \ if [ "$INSTALL_KV_CONNECTORS" = "true" ]; then \ if [ "$CUDA_MAJOR" -ge 13 ]; then \ uv pip install --system nixl-cu13; \ fi; \ uv pip install --system -r /tmp/kv_connectors.txt --no-build || ( \ # if the above fails, install from source apt-get update -y && \ apt-get install -y --no-install-recommends ${BUILD_PKGS} && \ uv pip install --system -r /tmp/kv_connectors.txt --no-build-isolation && \ apt-get purge -y ${BUILD_PKGS} && \ # clean up -dev packages, keep runtime libraries rm -rf /var/lib/apt/lists/* \ ); \ fi Signed-off-by: Yadan Wei <yadanwei@amazon.com> * AI changes made during Kiro-cli session --- X-AI-Tool: Kiro-cli X-AI-Handle-Time-Seconds: 23 X-AI-Prompt: i use m6a.8xlarge EC2, additional 5,000 GB EBS volume size to build Signed-off-by: Yadan Wei <yadanwei@amazon.com> * AI changes made during Kiro-cli session --- X-AI-Tool: Kiro-cli X-AI-Handle-Time-Seconds: 81 X-AI-Prompt: you can reas this for information /Volumes/workplace/kiro-workplace/AsimovBuilderCoreContext/src/AsimovBuilderCoreContext/workspace/2week/deep-learning-containers/docker/vllm/Dockerfile.amzn2023 Signed-off-by: Yadan Wei <yadanwei@amazon.com> * feat: upgrade vLLM to 0.18.0, enable kv_connectors, simplify sccache credentials - Bump vLLM from 0.17.1 to 0.18.0 - Enable INSTALL_KV_CONNECTORS by default (lmcache, nixl) - Copy kv_connectors.txt from source stage instead of build stage - Remove static credential ARGs for sccache, use container credential endpoint only Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: remove dead sccache credential fallbacks from build script Match simplified Dockerfile — only use AWS_CONTAINER_CREDENTIALS_RELATIVE_URI for sccache S3 access via CodeBuild IAM role. 
Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: add sccache smoke test, exclude model/benchmark test paths from SageMaker workflow - Add sccache S3 connectivity smoke test before EC2 build - Exclude vllm_model_smoke_test.sh, vllm_benchmark_test.sh, benchmark_report.py from triggering the SageMaker workflow (not used by SageMaker) Signed-off-by: Yadan Wei <yadanwei@amazon.com> * chore: update vllm config for 0.18.0 Signed-off-by: Yadan Wei <yadanwei@amazon.com> * remove some files Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: remove explicit credential ENV from Dockerfile, let sccache use default credential chain Setting AWS_CONTAINER_CREDENTIALS_RELATIVE_URI to empty string breaks the AWS SDK default credential chain, preventing instance profile auth from working. With --network=host, sccache can reach the instance metadata service directly without any credential ENV vars. Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: forward AWS_CONTAINER_CREDENTIALS_FULL_URI for CodeBuild sccache auth CodeBuild uses FULL_URI (http://127.0.0.1:port/...) not RELATIVE_URI. Forward it as build-arg so sccache inside Docker can authenticate to S3. --network=host makes the local endpoint reachable. Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: smoke test use --no-cache and pass FULL_URI credentials Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: remove credential ARGs, use IMDSv2 via --network=host, add SCCACHE_IDLE_TIMEOUT=0 - Remove all AWS credential ARGs/ENVs from Dockerfile — sccache uses IMDSv2 on EC2 fleet runners via --network=host - Add SCCACHE_IDLE_TIMEOUT=0 to prevent daemon shutdown during long builds (likely cause of only partial cache being written) Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: smoke test use IMDSv2 only, no credential args Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix: use static credentials for sccache S3 access IMDSv2 not reachable from inside Docker on CodeBuild fleet runners. 
Snapshot credentials via aws configure export-credentials instead. Signed-off-by: Yadan Wei <yadanwei@amazon.com> * revert sccache change and change nvcc_thread * remove unused filed Signed-off-by: Yadan Wei <weiyadan@amazon.com> * update flashinfer version Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix test directory Signed-off-by: Yadan Wei <yadanwei@amazon.com> * fix directory Signed-off-by: Yadan Wei <yadanwei@amazon.com> * Delete .gem-config/config --------- Signed-off-by: Yadan Wei <yadanwei@amazon.com> Signed-off-by: Yadan Wei <weiyadan@amazon.com> Co-authored-by: Yadan Wei <yadanwei@amazon.com> Co-authored-by: Yadan Wei <weiyadan@amazon.com>
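The version-string handling in the kv_connectors snippet above can be checked in isolation. A minimal sketch, assuming POSIX `sh` (the function names are mine, not the Dockerfile's):

```shell
#!/bin/sh
# Derive the apt package suffix used for CUDA -dev packages
# (e.g. "12-8" from "12.8.1"), as the kv_connectors RUN step does.
cuda_version_dash() {
    echo "$1" | cut -d. -f1,2 | tr '.' '-'
}

# The major version alone gates CUDA-13-only wheels such as nixl-cu13.
cuda_major() {
    echo "${1%%.*}"
}

cuda_version_dash "12.8.1"   # prints: 12-8
cuda_major "13.0.1"          # prints: 13
```

With that suffix, `libcusparse-dev-${CUDA_VERSION_DASH}` resolves to a versioned package name such as `libcusparse-dev-12-8`.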
bhanutejagk pushed a commit that referenced this pull request on Mar 31, 2026
The commit carries the same squash message; its earlier Kiro-cli session entries, not included in the copy above, were:

* Human changes made during kiro-cli session after prompt completion — prompt: "can you summerize this PR #5763 so I can add discription in the pr"
* AI changes made during Kiro-cli session (74 s), followed by a matching human edit with the same prompt: "can you look at this dockerfile sample https://github.com/aws/deep-learning-containers/pull/5808/changes#diff-aff16f8c535417fcf020bc2184ab09935e6c66cf46842f6ccee6d2022f4077ff to modify my dockerfile for oss setup /Volumes/workplace/kiro-workplace/AsimovBuilderCoreContext/src/AsimovBuilderCoreContext/workspace/2week/deep-learning-containers/docker/vllm/Dockerfile.amzn2023"
* AI changes made during Kiro-cli session (183 s) — prompt: "for my build vllm container, how can I add benchmark test with popular models"
* AI changes made during Kiro-cli session (142 s) — prompt: "okay could you implement for me and could you find which s3 bucket sample pr is using, we can use the same one"
* AI changes made during Kiro-cli session (28 s) — prompt: "how the cache will be saved bucket/hash/**.o?"
* AI changes made during Kiro-cli session (75 s) — prompt: "how my sample PR access s3 bucket, I think we do not need to do above things"
* AI changes made during Kiro-cli session (50 s) — prompt: the sccache S3 `AuthorizationHeaderMalformed` build-failure log quoted above

The rest of the message (the build-failure log, the kv_connectors Dockerfile snippet, the remaining sub-commit notes, and the sign-off trailers) is identical to the earlier copy.
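The `uv pip install … || ( … )` construct in the kv_connectors step is a general try-prebuilt-then-build-from-source fallback. A runnable sketch of just the control flow, with `false`/`echo` stand-ins replacing the real uv and apt-get commands (hypothetical, for illustration):

```shell
#!/bin/sh
# Shape of the kv_connectors fallback: attempt the cheap prebuilt
# install first; only if it fails, install -dev packages, build
# from source, and purge the build deps again.
try_prebuilt() { false; }                             # stand-in for: uv pip install ... --no-build
build_from_source() { echo "installed from source"; } # stand-in for: apt-get + --no-build-isolation

try_prebuilt || ( \
    build_from_source \
)   # prints: installed from source
```

If the last command in the subshell fails, the whole `RUN` step fails, so a broken source build still breaks the image build instead of passing silently.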
Description of changes:
Migrate the dockerfiles for TF 2.1 from the sagemaker-tensorflow-container repository.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.