Building PyTorch-1.8.1-fosscuda-2020b.eb stops with a Failed to find LLVM FileCheck error #13245

Open
OleHolmNielsen opened this issue Jun 23, 2021 · 4 comments

Comments

@OleHolmNielsen
Contributor

I find that the testing phase of
$ eb --from-pr=12814 PyTorch-1.8.1-fosscuda-2020b.eb -r
stops after about 1 hour with no further progress. The end of the EB log file shows some LLVM error:

== 2021-06-21 12:40:23,429 run.py:623 INFO parse_log_for_error msg: Command used: export PYTHONPATH=/tmp/eb-DKuIuR/tmp2EWzsa/lib/python3.8/site-packages:$PYTHONPATH && USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py install --prefix=/tmp/eb-DKuIuR/tmp2EWzsa
== 2021-06-21 12:40:23,429 run.py:625 INFO parse_log_for_error (some may be harmless) regExp (?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|\.?\w) found:
-- Failed to find LLVM FileCheck
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
== 2021-06-21 12:40:23,429 run.py:582 WARNING Found 2 errors in command output (output: -- Failed to find LLVM FileCheck
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile)
== 2021-06-21 12:40:23,429 run.py:226 INFO running cmd: export PYTHONPATH=/tmp/eb-DKuIuR/tmp2EWzsa/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --continue-through-error --verbose -x distributed/rpc/test_process_group_agent test_quantization

Any ideas how to fix this?

@branfosj
Member

The "Failed to find LLVM FileCheck" line is CMake output from the build phase, and it is unlikely to be the problem: I see the same line in my successful builds. The output also shows that the build stage has completed and that you have moved on to the tests, which can take several hours.
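
To see why these two lines are reported even though they are harmless, the check can be approximated in a few lines of Python. This is a minimal sketch, not EasyBuild's actual code: the pattern follows the one printed after "regExp" in the log above, with the dot in the lookahead escaped (`\.?\w`). That escaping, the case-insensitive flag, and the helper name `flag_suspicious_lines` are assumptions.

```python
import re

# Pattern as printed by parse_log_for_error in the log above.
# Assumptions: the dot in the lookahead is escaped, and matching is
# case-insensitive (otherwise "Failed" would not match "failed").
ERROR_PATTERN = re.compile(
    r"(?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|\.?\w)",
    re.IGNORECASE,
)


def flag_suspicious_lines(output):
    """Return the lines of `output` that contain a match for ERROR_PATTERN."""
    return [line for line in output.splitlines() if ERROR_PATTERN.search(line)]


# Three lines copied from the logs in this issue.
sample = "\n".join([
    "-- Failed to find LLVM FileCheck",
    "-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile",
    "command completed: exit 0, ran in 00h42m37s",
])

for line in flag_suspicious_lines(sample):
    print("flagged:", line)
```

The two CMake lines are flagged purely because they contain the word "failed"; the check knows nothing about whether the build itself succeeded, which is consistent with the "(some may be harmless)" wording in the log.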

Can you run (the -T adds trace):

eb --from-pr=12814 PyTorch-1.8.1-fosscuda-2020b.eb -Tr

This should list a log file for the testing phase (for example, [output logged in /dev/shm/tmp-branfosj-admin-up/eb-t7tehcxc/easybuild-run_cmd-wyagv7wf.log]), which you can use to follow progress while the tests are running and to spot any failed tests. There should also be a log file for the whole build, which you can look at once the build has completed or failed.
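
For a quick overview of such a test log, the per-test result lines can be tallied with a short script. This is only a sketch under assumptions: it expects the verbose unittest format "test_name (module.Class) ... status", and the names `RESULT_LINE` and `summarize_test_log` are made up for illustration.

```python
import re
from collections import Counter

# Matches verbose unittest result lines such as
#   test_add_sub (some_module.SomeClass) ... ok
RESULT_LINE = re.compile(
    r"^(?P<test>\S+) \((?P<cls>[^)]*)\) \.\.\. (?P<status>ok|FAIL|ERROR|skipped)\b"
)


def summarize_test_log(text):
    """Return (Counter of statuses, list of test names that failed or errored)."""
    counts = Counter()
    problems = []
    for line in text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match is None:
            continue  # not a per-test result line
        status = match.group("status")
        counts[status] += 1
        if status in ("FAIL", "ERROR"):
            problems.append(match.group("test"))
    return counts, problems
```

Reading the file named in the "[output logged in ...]" line and passing its contents to `summarize_test_log` gives the number of passed and skipped tests so far, plus the names of any that failed.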

@OleHolmNielsen
Contributor Author

With the -T flag the progress is printed and shows where the test phase stops:

== preparing...
loading toolchain module: fosscuda/2020b
loading modules for build dependencies:
  • CMake/3.18.4-GCCcore-10.2.0
  • hypothesis/5.41.5-GCCcore-10.2.0
loading modules for (runtime) dependencies:
  • Ninja/1.10.1-GCCcore-10.2.0
  • Python/3.8.6-GCCcore-10.2.0
  • protobuf/3.14.0-GCCcore-10.2.0
  • protobuf-python/3.14.0-GCCcore-10.2.0
  • pybind11/2.6.0-GCCcore-10.2.0
  • SciPy-bundle/2020.11-fosscuda-2020b
  • typing-extensions/3.7.4.3-GCCcore-10.2.0
  • PyYAML/5.3.1-GCCcore-10.2.0
  • MPFR/4.1.0-GCCcore-10.2.0
  • GMP/6.2.0-GCCcore-10.2.0
  • numactl/2.0.13-GCCcore-10.2.0
  • FFmpeg/4.3.1-GCCcore-10.2.0
  • Pillow/8.0.1-GCCcore-10.2.0
  • cuDNN/8.0.4.30-CUDA-11.1.1
  • magma/2.5.4-fosscuda-2020b
  • NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1
defining build environment for fosscuda/2020b toolchain
== configuring...
== building...
running command:
[started at: 2021-06-24 16:13:23]
[working dir: /dev/shm/PyTorch/1.8.1/fosscuda-2020b/pytorch]
[output logged in /tmp/eb-c69Vg2/easybuild-run_cmd-lTcwYV.log]
USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py build
command completed: exit 0, ran in 00h42m37s
== testing...
running command:
[started at: 2021-06-24 16:56:02]
[working dir: /dev/shm/PyTorch/1.8.1/fosscuda-2020b/pytorch]
[output logged in /tmp/eb-c69Vg2/easybuild-run_cmd-nR1YWD.log]
export PYTHONPATH=/tmp/eb-c69Vg2/tmpjZ2XI7/lib/python3.8/site-packages:$PYTHONPATH && USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py install --prefix=/tmp/eb-c69Vg2/tmpjZ2XI7
command completed: exit 0, ran in 00h00m40s
running command:
[started at: 2021-06-24 16:56:43]
[working dir: /dev/shm/PyTorch/1.8.1/fosscuda-2020b/pytorch]
[output logged in /tmp/eb-c69Vg2/easybuild-run_cmd-_3VBsc.log]
export PYTHONPATH=/tmp/eb-c69Vg2/tmpjZ2XI7/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --continue-through-error --verbose -x distributed/rpc/test_process_group_agent test_quantization

That was 3½ days ago! No progress has been made since.
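
A stall like this can be noticed earlier by checking how long ago the test log was last written to. The following is a rough sketch under assumptions: the helper names and the one-hour threshold are arbitrary choices for illustration, and a single long-running test also produces no output, so a long idle time is a hint rather than proof of a hang.

```python
import os
import time


def seconds_since_last_write(path, now=None):
    """Seconds elapsed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def looks_stalled(path, max_idle_seconds=3600, now=None):
    """True if `path` has not been modified for more than `max_idle_seconds`."""
    return seconds_since_last_write(path, now) > max_idle_seconds
```

Called with the path from the last "[output logged in ...]" line, `looks_stalled` could be run periodically (for example from cron) while the test phase is in progress. Since the test command sets PYTHONUNBUFFERED=1, the log should normally be updated as soon as each test prints its result.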

The log file /tmp/eb-c69Vg2/easybuild-DEuFFC.log ends with:

== 2021-06-24 16:56:43,318 run.py:623 INFO parse_log_for_error msg: Command used: export PYTHONPATH=/tmp/eb-c69Vg2/tmpjZ2XI7/lib/python3.8/site-packages:$PYTHONPATH && USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py install --prefix=/tmp/eb-c69Vg2/tmpjZ2XI7
== 2021-06-24 16:56:43,318 run.py:625 INFO parse_log_for_error (some may be harmless) regExp (?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|\.?\w) found:
-- Failed to find LLVM FileCheck
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
== 2021-06-24 16:56:43,318 run.py:582 WARNING Found 2 errors in command output (output: -- Failed to find LLVM FileCheck
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile)
== 2021-06-24 16:56:43,319 run.py:226 INFO running cmd: export PYTHONPATH=/tmp/eb-c69Vg2/tmpjZ2XI7/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --continue-through-error --verbose -x distributed/rpc/test_process_group_agent test_quantization

@OleHolmNielsen
Contributor Author

I have now rebooted the machine and repeated the build. The testing output again stops after 2 hours 20 minutes, as shown below:

== building...

running command:
[started at: 2021-06-28 08:03:35]
[working dir: /dev/shm/PyTorch/1.8.1/fosscuda-2020b/pytorch]
[output logged in /tmp/eb-jIAp6q/easybuild-run_cmd-LU1yqw.log]
USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py build
command completed: exit 0, ran in 00h43m00s
== testing...
running command:
[started at: 2021-06-28 08:46:37]
[working dir: /dev/shm/PyTorch/1.8.1/fosscuda-2020b/pytorch]
[output logged in /tmp/eb-jIAp6q/easybuild-run_cmd-JU6lOR.log]
export PYTHONPATH=/tmp/eb-jIAp6q/tmph8qQ3D/lib/python3.8/site-packages:$PYTHONPATH && USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=80 BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/lib64 CUDNN_INCLUDE_DIR=/home/modules/software/cuDNN/8.0.4.30-CUDA-11.1.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/home/modules/software/NCCL/2.8.3-GCCcore-10.2.0-CUDA-11.1.1/include USE_METAL=0 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python setup.py install --prefix=/tmp/eb-jIAp6q/tmph8qQ3D
command completed: exit 0, ran in 00h00m42s
running command:
[started at: 2021-06-28 08:47:19]
[working dir: /dev/shm/PyTorch/1.8.1/fosscuda-2020b/pytorch]
[output logged in /tmp/eb-jIAp6q/easybuild-run_cmd-QCZeVN.log]
export PYTHONPATH=/tmp/eb-jIAp6q/tmph8qQ3D/lib/python3.8/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python run_test.py --continue-through-error --verbose -x distributed/rpc/test_process_group_agent test_quantization

The test output file /tmp/eb-jIAp6q/easybuild-run_cmd-QCZeVN.log ends with:

Running test_tensorexpr ... [2021-06-28 11:04:48.747289]
Executing ['/home/modules/software/Python/3.8.6-GCCcore-10.2.0/bin/python', 'test_tensorexpr.py', '-v'] ... [2021-06-28 11:04:48.747446]
test_add_const_rhs (__main__.TestTensorExprFuser) ... ok
test_add_sub (__main__.TestTensorExprFuser) ... ok
test_alias_analysis_input_and_module (__main__.TestTensorExprFuser) ... ok
test_alias_analysis_inputs (__main__.TestTensorExprFuser) ... ok
test_alias_analysis_module (__main__.TestTensorExprFuser) ... ok
test_all_combos (__main__.TestTensorExprFuser) ... ok
test_alpha (__main__.TestTensorExprFuser) ... ok
test_binary_ops (__main__.TestTensorExprFuser) ... ok
test_bitwise_ops (__main__.TestTensorExprFuser) ... ok
test_broadcast (__main__.TestTensorExprFuser) ... ok
test_broadcast_2 (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_broadcast_big2 (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_broadcast_cuda (__main__.TestTensorExprFuser) ... ok
test_cat_cpu (__main__.TestTensorExprFuser) ... ok
test_cat_cuda (__main__.TestTensorExprFuser) ... ok
test_cat_empty_tensors (__main__.TestTensorExprFuser) ... ok
test_cat_negative_dim_cpu (__main__.TestTensorExprFuser) ... ok
test_cat_negative_dim_cuda (__main__.TestTensorExprFuser) ... ok
test_cat_only_cpu (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_cat_only_cuda (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_cat_promote_inputs (__main__.TestTensorExprFuser) ... ok
test_char (__main__.TestTensorExprFuser) ... ok
test_chunk (__main__.TestTensorExprFuser) ... ok
test_clamp (__main__.TestTensorExprFuser) ... ok
test_constant (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_double (__main__.TestTensorExprFuser) ... ok
test_double_intrinsics (__main__.TestTensorExprFuser) ... ok
test_dynamic_shape (__main__.TestTensorExprFuser) ... skipped 'dynamic shapes are not quite there yet'
test_easy (__main__.TestTensorExprFuser) ... ok
test_eq (__main__.TestTensorExprFuser) ... ok
test_exp_pow (__main__.TestTensorExprFuser) ... ok
test_four_arg (__main__.TestTensorExprFuser) ... ok
test_ge (__main__.TestTensorExprFuser) ... ok
test_gt (__main__.TestTensorExprFuser) ... ok
test_guard_fails (__main__.TestTensorExprFuser) ... ok
test_half_gelu (__main__.TestTensorExprFuser) ... ok
test_int64_promotion (__main__.TestTensorExprFuser) ... ok
test_int_output (__main__.TestTensorExprFuser) ... ok
test_le (__main__.TestTensorExprFuser) ... ok
test_loop (__main__.TestTensorExprFuser) ... ok
test_lt (__main__.TestTensorExprFuser) ... ok
test_mask (__main__.TestTensorExprFuser) ... ok
test_min_max (__main__.TestTensorExprFuser) ... ok
test_min_max_reduction (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_min_max_reduction2 (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_min_max_reduction_dim1 (__main__.TestTensorExprFuser) ... ok
test_min_max_reduction_dim1_2 (__main__.TestTensorExprFuser) ... skipped 'temporarily disable'
test_multi_rand (__main__.TestTensorExprFuser) ... ok
test_multioutput (__main__.TestTens
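
One way to narrow this down might be to run only the test file where the output stops, with a timeout, so that a hang ends the run instead of blocking it indefinitely. The sketch below uses only the standard library; the helper name `run_with_timeout` and any timeout value are assumptions, not part of EasyBuild or the PyTorch test suite.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds, cwd=None):
    """Run `cmd`; return its exit code, or None if it exceeded the timeout."""
    try:
        completed = subprocess.run(cmd, cwd=cwd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return None
    return completed.returncode


# Harmless demonstration: a command that finishes immediately.
exit_code = run_with_timeout([sys.executable, "-c", "pass"], timeout_seconds=30)
print("exit code:", exit_code)
```

Mirroring the "Executing [...]" line in the log, the command to try would be `[python, "test_tensorexpr.py", "-v"]` with `cwd` set to the `test` directory of the PyTorch build, using the same Python and PYTHONPATH as in the EasyBuild command. A return value of None would indicate that the test file did not finish within the chosen time.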

@OleHolmNielsen
Contributor Author

Could the system default limits be stopping the test processes prematurely? Do the PyTorch tests need particularly high system limits in order to run successfully? Our GPU server has these limits:

$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 766950
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
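
The limits that the test processes actually inherit can also be read from inside Python, which is useful if the build runs in a different environment (a batch job, for instance) than the interactive shell. This is a small sketch using the standard `resource` module (Unix only); the choice of these three limits is an assumption for illustration, and nothing in this thread establishes that any of them causes the hang.

```python
import resource

# Labels follow the `ulimit -a` output above. Note that getrlimit reports
# RLIMIT_MEMLOCK in bytes, whereas `ulimit -l` prints kbytes.
LIMITS = {
    "open files (-n)": resource.RLIMIT_NOFILE,
    "max user processes (-u)": resource.RLIMIT_NPROC,
    "max locked memory (-l)": resource.RLIMIT_MEMLOCK,
}


def describe(value):
    """Render a limit value the way `ulimit` does for unlimited resources."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the limits listed in LIMITS."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


for label, (soft, hard) in current_limits().items():
    print(f"{label}: soft={describe(soft)} hard={describe(hard)}")
```

If the soft limit is lower than the hard limit, it can be raised for the current shell (for example with `ulimit -n`) before starting `eb`, without needing root access.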
