Building PyTorch-1.8.1-fosscuda-2020b.eb stops with a Failed to find LLVM FileCheck error #13245
Comments
The Can you run (the
This should list a logfile for the testing phase ( |
With the -T flag the progress is printed and shows where the test phase stops: == preparing...
That was 3½ days ago! No progress has been made since. The log file /tmp/eb-c69Vg2/easybuild-DEuFFC.log ends with:
== 2021-06-24 16:56:43,318 run.py:623 INFO parse_log_for_error msg: Command used: export PYTHONPATH=/tmp/eb-c69Vg2/tmpjZ2XI7/lib/python3.8/site-packages:$PYTHONPATH && USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py install --prefix=/tmp/eb-c69Vg2/tmpjZ2XI7
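To avoid waiting days on a stalled process, a command can be wrapped in a wall-clock timeout. The following is a minimal sketch, not part of EasyBuild or PyTorch: the helper name `run_with_timeout` and the demo commands are hypothetical, and it only kills the direct child process, so test runners that spawn their own subprocesses may leave those behind.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, returncode or None, captured output)."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
        return True, proc.returncode, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # The child has been killed; keep whatever output was captured so far.
        out = exc.output or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return False, None, out


# Hypothetical stand-ins for a hanging and a well-behaved test command.
hung = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=1)
fine = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=30)
print("hanging command finished:", hung[0])
print("quick command finished:", fine[0], "exit code:", fine[1])
```

In practice the timeout would be chosen well above the expected runtime of the command being wrapped.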
I have now rebooted the machine and repeated the build. The testing output again stops after 2 hours 20 minutes, as shown below:
== building...
The test output file /tmp/eb-jIAp6q/easybuild-run_cmd-QCZeVN.log ends with:
Running test_tensorexpr ... [2021-06-28 11:04:48.747289]
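When the test log is long, the last test that was started can be pulled out programmatically. This is a small sketch that assumes every test announces itself with a line shaped like the one quoted above (`Running <name> ... [<timestamp>]`). The helper name `last_started_test` and the first sample line are hypothetical.

```python
import re

# Line format assumed from the single log line quoted in this thread.
RUNNING_RE = re.compile(r"^Running (\S+) \.\.\. \[(.+?)\]")


def last_started_test(log_text):
    """Return (test_name, timestamp) of the last 'Running ...' line, or None."""
    last = None
    for line in log_text.splitlines():
        match = RUNNING_RE.match(line)
        if match:
            last = (match.group(1), match.group(2))
    return last


sample_log = "\n".join([
    "Running test_example ... [2021-06-28 10:00:00.000000]",  # hypothetical earlier line
    "some unrelated test output",
    "Running test_tensorexpr ... [2021-06-28 11:04:48.747289]",
])
result = last_started_test(sample_log)
print(result)
```

Comparing the timestamp of the last started test with the current time gives a rough idea of how long that test has been stalled.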
Could the system default limits be stopping the test processes prematurely? Do the PyTorch tests need any system limits raised above the defaults in order to run successfully? Our GPU server has these limits:
$ ulimit -a
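The same limits can also be read from inside Python, which shows what a test process would actually inherit. This is only a sketch: the selection of limits in `LIMIT_NAMES` is an assumption for illustration, and nothing in this thread establishes which limits (if any) matter to the PyTorch tests. The `resource` module is available on Unix-like systems only.

```python
import resource

# Illustrative selection; not a list of limits known to affect PyTorch.
LIMIT_NAMES = [
    "RLIMIT_NOFILE",   # open file descriptors
    "RLIMIT_NPROC",    # processes/threads per user
    "RLIMIT_STACK",    # stack size
    "RLIMIT_AS",       # address space
    "RLIMIT_MEMLOCK",  # locked memory
]


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits(names=LIMIT_NAMES):
    """Return {name: (soft, hard)} for the limits defined on this platform."""
    limits = {}
    for name in names:
        const = getattr(resource, name, None)
        if const is None:  # not every platform defines every limit
            continue
        soft, hard = resource.getrlimit(const)
        limits[name] = (fmt(soft), fmt(hard))
    return limits


limits = current_limits()
for name, (soft, hard) in limits.items():
    print(f"{name}: soft={soft} hard={hard}")
```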
I find that the testing phase of
$ eb --from-pr=12814 PyTorch-1.8.1-fosscuda-2020b.eb -r
stops after about 1 hour with no further progress. The end of the EB log file shows some LLVM error:
== 2021-06-21 12:40:23,429 run.py:623 INFO parse_log_for_error msg: Command used: export PYTHONPATH=/tmp/eb-DKuIuR/tmp2EWzsa/lib/python3.8/site-packages:$PYTHONPATH && USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py install --prefix=/tmp/eb-DKuIuR/tmp2EWzsa
== 2021-06-21 12:40:23,429 run.py:625 INFO parse_log_for_error (some may be harmless) regExp (?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|\.?\w) found:
-- Failed to find LLVM FileCheck
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
== 2021-06-21 12:40:23,429 run.py:582 WARNING Found 2 errors in command output (output: -- Failed to find LLVM FileCheck
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile)
== 2021-06-21 12:40:23,429 run.py:226 INFO running cmd: export PYTHONPATH=/tmp/eb-DKuIuR/tmp2EWzsa/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --continue-through-error --verbose -x distributed/rpc/test_process_group_agent test_quantization
Any ideas how to fix this?
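For context on the log excerpt: the two "errors" are lines of the build output that matched the regular expression printed in the log, which the log itself marks as "some may be harmless". The test command was started right afterwards, which suggests the build step did not abort on them. The sketch below reproduces the matching under two assumptions: the search is case-insensitive, and the dot in the final lookahead is escaped (`\.?\w`), since with an unescaped `.?\w` neither line would match. The helper name `find_flagged_lines` and the third sample line are hypothetical.

```python
import re

# Pattern as printed in the log, with the lookahead dot escaped (assumption).
ERROR_PATTERN = r"(?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|\.?\w)"


def find_flagged_lines(output, pattern=ERROR_PATTERN):
    """Return the lines of output that match pattern, ignoring case (assumption)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [line for line in output.splitlines() if regex.search(line)]


sample = "\n".join([
    "-- Failed to find LLVM FileCheck",
    "-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile",
    "-- Found 0 errors",  # hypothetical line; 'errors' is excluded by the lookahead
])
flagged = find_flagged_lines(sample)
for line in flagged:
    print(line)
```

Both flagged lines are messages from the configure stage, so on their own they do not show that the hang in the test phase is caused by the missing FileCheck tool.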