Building PyTorch-1.8.1-fosscuda-2020b.eb stops with a Failed to find LLVM FileCheck error #13245

OleHolmNielsen · 2021-06-23T13:24:33Z

I find that the testing phase of
$ eb --from-pr=12814 PyTorch-1.8.1-fosscuda-2020b.eb -r
stops after about 1 hour with no further progress. The end of the EB log file shows some LLVM error:

== 2021-06-21 12:40:23,429 run.py:623 INFO parse_log_for_error msg: Command used: export PYTHONPATH=/tmp/eb-DKuIuR/tmp2EWzsa/lib/python3.8/site-packages:$PYTHONPATH && USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py install --prefix=/tmp/eb-DKuIuR/tmp2EWzsa
== 2021-06-21 12:40:23,429 run.py:625 INFO parse_log_for_error (some may be harmless) regExp (?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|\.?\w) found:
-- Failed to find LLVM FileCheck
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
== 2021-06-21 12:40:23,429 run.py:582 WARNING Found 2 errors in command output (output: -- Failed to find LLVM FileCheck
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile)
== 2021-06-21 12:40:23,429 run.py:226 INFO running cmd: export PYTHONPATH=/tmp/eb-DKuIuR/tmp2EWzsa/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --continue-through-error --verbose -x distributed/rpc/test_process_group_agent test_quantization

Any ideas how to fix this?

branfosj · 2021-06-24T12:58:56Z

The "Failed to find LLVM FileCheck" message is CMake output from the build phase, and it is unlikely to be the problem: I see the same line in my successful builds. Also, the output shows that the build stage has completed and you have moved on to the tests, which can take several hours.
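To illustrate why those two lines show up in the log even though they are harmless, here is a minimal sketch that applies the pattern printed after `regExp` in the log to a few lines. Two things are assumptions on my part: that the dot in the pattern is escaped (`\.?`) and that the match is case-insensitive. The helper name `flagged` is made up for this example and is not an EasyBuild function.

```python
import re

# Pattern as printed in the EasyBuild log. Assumptions: the dot is escaped
# (\.?) and the pattern is applied case-insensitively.
ERROR_RE = re.compile(
    r"(?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|\.?\w)",
    re.IGNORECASE,
)


def flagged(line):
    """Return True if the line would be reported as a possible error."""
    return ERROR_RE.search(line) is not None


lines = [
    "-- Failed to find LLVM FileCheck",
    "-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile",
    "test_guard_fails (__main__.TestTensorExprFuser) ... ok",
]
for line in lines:
    # Print whether each line matches, followed by the line itself.
    print(flagged(line), line)
```

The pattern is a plain keyword match, so any line containing a standalone "failed" or "error" is reported, whether or not the command actually failed. That fits the "(some may be harmless)" note in the log.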

Can you run (the -T adds trace):

eb --from-pr=12814 PyTorch-1.8.1-fosscuda-2020b.eb -Tr

This should list a log file for the testing phase (for example, [output logged in /dev/shm/tmp-branfosj-admin-up/eb-t7tehcxc/easybuild-run_cmd-wyagv7wf.log]), which you can look at to follow progress while the tests are running and to see any failed tests. There should also be a log file for the whole build, which you can look at once the build has completed or failed.
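If you want a quick check of whether that test log is still being written to, something like the following could help. It is only a sketch: the function names and the one-hour threshold are arbitrary choices for illustration, and a quiet log does not prove a hang, since some tests may simply run for a long time without printing anything.

```python
import os
import tempfile
import time


def seconds_since_update(path, now=None):
    """Seconds elapsed since the file at `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def looks_stalled(path, threshold_s=3600.0, now=None):
    """True if the log has not been modified for more than `threshold_s` seconds."""
    return seconds_since_update(path, now) > threshold_s


# Demo on a throwaway file whose modification time is set two hours back.
with tempfile.NamedTemporaryFile(suffix=".log", delete=False) as fh:
    demo_log = fh.name
two_hours_ago = time.time() - 2 * 3600
os.utime(demo_log, (two_hours_ago, two_hours_ago))
print(looks_stalled(demo_log))
os.remove(demo_log)
```

In practice you would pass the path printed in the `[output logged in ...]` line for the test command instead of the demo file.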

OleHolmNielsen · 2021-06-28T05:57:41Z

With the -T flag the progress is printed and shows where the test phase stops:

== preparing...
loading toolchain module: fosscuda/2020b
loading modules for build dependencies:
  CMake/3.18.4-GCCcore-10.2.0
  hypothesis/5.41.5-GCCcore-10.2.0
loading modules for (runtime) dependencies:
  Ninja/1.10.1-GCCcore-10.2.0
  Python/3.8.6-GCCcore-10.2.0
  protobuf/3.14.0-GCCcore-10.2.0
  protobuf-python/3.14.0-GCCcore-10.2.0
  pybind11/2.6.0-GCCcore-10.2.0
  SciPy-bundle/2020.11-fosscuda-2020b
  typing-extensions/3.7.4.3-GCCcore-10.2.0
  PyYAML/5.3.1-GCCcore-10.2.0
  MPFR/4.1.0-GCCcore-10.2.0
  GMP/6.2.0-GCCcore-10.2.0
  numactl/2.0.13-GCCcore-10.2.0
  FFmpeg/4.3.1-GCCcore-10.2.0
  Pillow/8.0.1-GCCcore-10.2.0
  cuDNN/8.0.4.30-CUDA-11.1.1
  magma/2.5.4-fosscuda-2020b
  NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1
defining build environment for fosscuda/2020b toolchain
== configuring...
== building...
running command:
[started at: 2021-06-24 16:13:23]
[working dir: /dev/shm/PyTorch/1.8.1/fosscuda-2020b/pytorch]
[output logged in /tmp/eb-c69Vg2/easybuild-run_cmd-lTcwYV.log]
USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py build
command completed: exit 0, ran in 00h42m37s
== testing...
running command:
[started at: 2021-06-24 16:56:02]
[working dir: /dev/shm/PyTorch/1.8.1/fosscuda-2020b/pytorch]
[output logged in /tmp/eb-c69Vg2/easybuild-run_cmd-nR1YWD.log]
export PYTHONPATH=/tmp/eb-c69Vg2/tmpjZ2XI7/lib/python3.8/site-packages:$PYTHONPATH && USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py install --prefix=/tmp/eb-c69Vg2/tmpjZ2XI7
command completed: exit 0, ran in 00h00m40s
running command:
[started at: 2021-06-24 16:56:43]
[working dir: /dev/shm/PyTorch/1.8.1/fosscuda-2020b/pytorch]
[output logged in /tmp/eb-c69Vg2/easybuild-run_cmd-_3VBsc.log]
export PYTHONPATH=/tmp/eb-c69Vg2/tmpjZ2XI7/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --continue-through-error --verbose -x distributed/rpc/test_process_group_agent test_quantization

That was 3½ days ago! No progress has been made since.

The log file /tmp/eb-c69Vg2/easybuild-DEuFFC.log ends with:

== 2021-06-24 16:56:43,318 run.py:623 INFO parse_log_for_error msg: Command used: export PYTHONPATH=/tmp/eb-c69Vg2/tmpjZ2XI7/lib/python3.8/site-packages:$PYTHONPATH && USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py install --prefix=/tmp/eb-c69Vg2/tmpjZ2XI7
== 2021-06-24 16:56:43,318 run.py:625 INFO parse_log_for_error (some may be harmless) regExp (?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|\.?\w) found:
-- Failed to find LLVM FileCheck
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
== 2021-06-24 16:56:43,318 run.py:582 WARNING Found 2 errors in command output (output: -- Failed to find LLVM FileCheck
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile)
== 2021-06-24 16:56:43,319 run.py:226 INFO running cmd: export PYTHONPATH=/tmp/eb-c69Vg2/tmpjZ2XI7/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --continue-through-error --verbose -x distributed/rpc/test_process_group_agent test_quantization

OleHolmNielsen · 2021-06-28T09:20:44Z

I rebooted the machine and repeated the build. The testing output stops again after 2 hours 20 minutes, as shown below:

== building...

running command:
[started at: 2021-06-28 08:03:35]
[working dir: /dev/shm/PyTorch/1.8.1/fosscuda-2020b/pytorch]
[output logged in /tmp/eb-jIAp6q/easybuild-run_cmd-LU1yqw.log]
USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py build
command completed: exit 0, ran in 00h43m00s
== testing...
running command:
[started at: 2021-06-28 08:46:37]
[working dir: /dev/shm/PyTorch/1.8.1/fosscuda-2020b/pytorch]
[output logged in /tmp/eb-jIAp6q/easybuild-run_cmd-JU6lOR.log]
export PYTHONPATH=/tmp/eb-jIAp6q/tmph8qQ3D/lib/python3.8/site-packages:$PYTHONPATH && USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py install --prefix=/tmp/eb-jIAp6q/tmph8qQ3D
command completed: exit 0, ran in 00h00m42s
running command:
[started at: 2021-06-28 08:47:19]
[working dir: /dev/shm/PyTorch/1.8.1/fosscuda-2020b/pytorch]
[output logged in /tmp/eb-jIAp6q/easybuild-run_cmd-QCZeVN.log]
export PYTHONPATH=/tmp/eb-jIAp6q/tmph8qQ3D/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --continue-through-error --verbose -x distributed/rpc/test_process_group_agent test_quantization

The test output file /tmp/eb-jIAp6q/easybuild-run_cmd-QCZeVN.log ends with:

Running test_tensorexpr ... [2021-06-28 11:04:48.747289]
Executing ['/home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python', 'test_tensorexpr.py', '-v'] ... [2021-06-28 11:04:48.747446]
test_add_const_rhs (__main__.TestTensorExprFuser) ... ok
test_add_sub (__main__.TestTensorExprFuser) ... ok
test_alias_analysis_input_and_module (__main__.TestTensorExprFuser) ... ok
test_alias_analysis_inputs (__main__.TestTensorExprFuser) ... ok
test_alias_analysis_module (__main__.TestTensorExprFuser) ... ok
test_all_combos (__main__.TestTensorExprFuser) ... ok
test_alpha (__main__.TestTensorExprFuser) ... ok
test_binary_ops (__main__.TestTensorExprFuser) ... ok
test_bitwise_ops (__main__.TestTensorExprFuser) ... ok
test_broadcast (__main__.TestTensorExprFuser) ... ok
test_broadcast_2 (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_broadcast_big2 (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_broadcast_cuda (__main__.TestTensorExprFuser) ... ok
test_cat_cpu (__main__.TestTensorExprFuser) ... ok
test_cat_cuda (__main__.TestTensorExprFuser) ... ok
test_cat_empty_tensors (__main__.TestTensorExprFuser) ... ok
test_cat_negative_dim_cpu (__main__.TestTensorExprFuser) ... ok
test_cat_negative_dim_cuda (__main__.TestTensorExprFuser) ... ok
test_cat_only_cpu (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_cat_only_cuda (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_cat_promote_inputs (__main__.TestTensorExprFuser) ... ok
test_char (__main__.TestTensorExprFuser) ... ok
test_chunk (__main__.TestTensorExprFuser) ... ok
test_clamp (__main__.TestTensorExprFuser) ... ok
test_constant (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_double (__main__.TestTensorExprFuser) ... ok
test_double_intrinsics (__main__.TestTensorExprFuser) ... ok
test_dynamic_shape (__main__.TestTensorExprFuser) ... skipped 'dynamic shapes are not quite there yet'
test_easy (__main__.TestTensorExprFuser) ... ok
test_eq (__main__.TestTensorExprFuser) ... ok
test_exp_pow (__main__.TestTensorExprFuser) ... ok
test_four_arg (__main__.TestTensorExprFuser) ... ok
test_ge (__main__.TestTensorExprFuser) ... ok
test_gt (__main__.TestTensorExprFuser) ... ok
test_guard_fails (__main__.TestTensorExprFuser) ... ok
test_half_gelu (__main__.TestTensorExprFuser) ... ok
test_int64_promotion (__main__.TestTensorExprFuser) ... ok
test_int_output (__main__.TestTensorExprFuser) ... ok
test_le (__main__.TestTensorExprFuser) ... ok
test_loop (__main__.TestTensorExprFuser) ... ok
test_lt (__main__.TestTensorExprFuser) ... ok
test_mask (__main__.TestTensorExprFuser) ... ok
test_min_max (__main__.TestTensorExprFuser) ... ok
test_min_max_reduction (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_min_max_reduction2 (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_min_max_reduction_dim1 (__main__.TestTensorExprFuser) ... ok
test_min_max_reduction_dim1_2 (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_multi_rand (__main__.TestTensorExprFuser) ... ok
test_multioutput (__main__.TestTens
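
For a long log like this one, a small script can pick out the test file that was running and the last test line that has no verdict yet. This is a rough sketch: the regular expressions are my own guess at the line formats seen in the excerpt, and `summarize` is a made-up helper name. A cut-off last line may also just reflect how the output was flushed, so it is a hint about where things stopped rather than proof of which test hangs.

```python
import re

# "Running <test file> ... [<timestamp>]" lines printed by run_test.py.
RUNNING_RE = re.compile(r"^Running (\S+) \.\.\. \[(.+)\]$")
# Verdicts that unittest appends to a test line in verbose mode.
VERDICT_RE = re.compile(
    r" \.\.\. (ok|FAIL|ERROR|skipped .*|expected failure|unexpected success)$"
)


def summarize(lines):
    """Return ((test_file, timestamp), unfinished_line) for a test log.

    `unfinished_line` is the last test_* line if it lacks a verdict, else None.
    """
    current = None
    unfinished = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = RUNNING_RE.match(line)
        if match:
            current = (match.group(1), match.group(2))
            unfinished = None
            continue
        if line.startswith("test_"):
            unfinished = None if VERDICT_RE.search(line) else line
    return current, unfinished


sample = [
    "Running test_tensorexpr ... [2021-06-28 11:04:48.747289]",
    "test_multi_rand (__main__.TestTensorExprFuser) ... ok",
    "test_multioutput (__main__.TestTens",
]
print(summarize(sample))
```

To use it on the real file, pass an open file object, for example `summarize(open("/tmp/eb-jIAp6q/easybuild-run_cmd-QCZeVN.log"))`.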

OleHolmNielsen · 2021-07-09T08:54:57Z

Could the system default limits be stopping the test processes prematurely? Do the PyTorch tests need any limits raised above the usual defaults in order to run successfully? Our GPU server has these limits:

$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 766950
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
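
The limits that the test processes actually inherit can also be read from inside Python, which may differ from the interactive shell if the build is started from a batch job or service. The sketch below uses the standard `resource` module and only looks at three limits that I picked for illustration; nothing in this thread establishes that any of them is related to the hang.

```python
import resource

# Limits chosen for illustration only. Note that getrlimit reports locked
# memory in bytes, whereas `ulimit -l` prints kbytes.
LIMITS = {
    "open files (-n)": resource.RLIMIT_NOFILE,
    "max user processes (-u)": resource.RLIMIT_NPROC,
    "max locked memory (-l)": resource.RLIMIT_MEMLOCK,
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the limits listed in LIMITS."""
    return {name: resource.getrlimit(res) for name, res in LIMITS.items()}


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
```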
