{ai}[foss/2022a] PyTorch v2.0.1 #18269

Closed

Conversation


@branfosj branfosj commented Jul 5, 2023

(created using eb --new-pr)

I started working on this but have not had, and will not have, the time to complete it.

I also tried a CUDA version of 2.0.0, but I stopped working on that when I had many test failures of the form:

torch.distributed.DistBackendError: NCCL error in: /dev/shm/branfosj/build-up-EL8/PyTorch/2.0.0/foss-2022a-CUDA-11.7.0/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, invalid usage, NCCL version 2.12.12
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
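For anyone picking up the CUDA build: `ncclInvalidUsage` failures are usually easier to diagnose with NCCL's documented debug environment variables. A minimal sketch (the test invocation at the end is illustrative, not the exact command used here):

```shell
# NCCL_DEBUG and NCCL_DEBUG_SUBSYS are documented NCCL knobs; INFO-level
# logging shows initialization and the exact call that returned
# ncclInvalidUsage.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
# Then re-run the failing distributed test, e.g. (illustrative):
#   python run_test.py -i distributed/test_c10d_nccl
```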

@branfosj branfosj added the update label Jul 5, 2023
@branfosj branfosj marked this pull request as draft July 5, 2023 20:05

branfosj commented Jul 5, 2023

PR closed, as I'll not be working on it.

@branfosj branfosj closed this Jul 5, 2023
@branfosj branfosj reopened this Jul 5, 2023
@branfosj branfosj closed this Jul 5, 2023

branfosj commented Jul 6, 2023

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0104u05b.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/4709f77fc3f581f128f6edd3457e5196 for a full test report.


branfosj commented Jul 6, 2023

test_torchinductor_opinfo


====================================================================================== FAILURES =======================================================================================
___________________________________________________________ TestInductorOpInfoCPU.test_comprehensive_index_add_cpu_float16 ____________________________________________________________
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/inductor/test_torchinductor_opinfo.py", line 606, in test_comprehensive
    raise RuntimeError(
RuntimeError: unexpected success index_add, torch.float16, cpu
_______________________________________________________ TestInductorOpInfoCPU.test_comprehensive_scatter_reduce_sum_cpu_float16 _______________________________________________________
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/inductor/test_torchinductor_opinfo.py", line 606, in test_comprehensive
    raise RuntimeError(
RuntimeError: unexpected success scatter_reduce.sum, torch.float16, cpu

test_dataloader

Fails in test_batch_sampler, test_bulk_loading_nobatch, and test_chain_iterable_style_dataset:

RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code
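This looks environmental rather than a real regression: each DataLoader worker holds file descriptors for shared tensors, so the build host's `ulimit -n` matters. A stdlib sketch for checking the limit in effect (Unix-only; the torch workaround from the error message is shown as a comment, since it needs torch at runtime):

```python
import resource

# The soft RLIMIT_NOFILE is the "ulimit -n" value the error message asks
# to raise; DataLoader workers exhaust it when sharing many tensors via
# file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# The alternative workaround from the error message (torch-only, comment):
#   import torch.multiprocessing
#   torch.multiprocessing.set_sharing_strategy('file_system')
# This shares tensors through files on disk instead of holding one open
# descriptor per shared tensor.
```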

test_ops_gradients

====================================================================================== FAILURES =======================================================================================
__________________________________________________________ TestBwdGradientsCPU.test_fn_grad_linalg_det_singular_cpu_float64 ___________________________________________________________
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/test_ops_gradients.py", line 26, in test_fn_grad
    self._grad_test_helper(device, dtype, op, op.get_op())
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4303, in _grad_test_helper
    return self._check_helper(device, dtype, op, variant, 'gradcheck', check_forward_ad=check_forward_ad,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4272, in _check_helper
    self.assertTrue(gradcheck(fn, gradcheck_args,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3827, in gradcheck
    return torch.autograd.gradcheck(fn, inputs, **kwargs)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1476, in gradcheck
    return _gradcheck_helper(**args)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1490, in _gradcheck_helper
    _gradcheck_real_imag(gradcheck_fn, func, func_out, tupled_inputs, outputs, eps,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1113, in _gradcheck_real_imag
    gradcheck_fn(func, func_out, tupled_inputs, outputs, eps,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1363, in _fast_gradcheck
    _check_analytical_numerical_equal(analytical_vJu, numerical_vJu, complex_indices,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1335, in _check_analytical_numerical_equal
    raise GradcheckError(_get_notallclose_msg(a, n, j, i, complex_indices, test_imag, is_forward_ad) + jacobians_str)
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(0.1083, dtype=torch.float64)
analytical:tensor(0.0094, dtype=torch.float64)

The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

[snip large tensor]
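For context on what this failure is measuring: `torch.autograd.gradcheck` compares an analytical Jacobian against a central-difference numerical one and raises `GradcheckError` when they disagree, as above. A stdlib-only sketch of that core idea (gradcheck itself works on full Jacobians in float64, in the fast or slow mode the message describes; `linalg.det` on near-singular inputs is a known numerically delicate case):

```python
# Central-difference numerical derivative, the reference gradcheck uses.
def numerical_derivative(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

f = lambda x: x ** 3          # function under test
analytic = lambda x: 3 * x ** 2  # its hand-written derivative

x = 1.5
num, ana = numerical_derivative(f, x), analytic(x)
# A large gap between num and ana is exactly what GradcheckError reports.
print(abs(num - ana) < 1e-4)  # → True
```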

test_jit

======================================================================
ERROR: test_always_alive_values (jit.test_freezing.TestMKLDNNReinplacing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_freezing.py", line 3053, in test_always_alive_values
    self.checkResults(mod_eager, mod)
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_freezing.py", line 3010, in checkResults
    self.assertEqual(mod1(inp), mod2(inp))
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 483, in prof_meth_call
    return prof_callable(meth_call, *args, **kwargs)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 477, in prof_callable
    return callable(*args, **kwargs)
RuntimeError: Couldn't find method: '_conv_forward' on class: '__torch__.torch.nn.modules.conv.___torch_mangle_2194.Conv2d (of Python compilation unit at: 0x7329e20)'

======================================================================
ERROR: test_merge_liveness (jit.test_freezing.TestMKLDNNReinplacing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_freezing.py", line 3035, in test_merge_liveness
    FileCheck().check("aten::mul_").check_not("aten::add_").run(mod.graph)
RuntimeError: Expected to find "aten::mul_" but did not find it
Searched string:
graph(%self : __torch__.torch.nn.modules.container.___torch_mangle_2196.Sequential,
~~~~~~~~~~ <--- HERE
      %input.1 : Tensor):
  %9 : int = prim::Constant[value=1]()
From CHECK: aten::mul_


======================================================================
ERROR: test_successful (jit.test_freezing.TestMKLDNNReinplacing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_freezing.py", line 3017, in test_successful
    FileCheck().check("mkldnn_convolution").check_next("prim::MKLDNNHardSwish_").check_next("aten::relu_").run(mod.graph)
RuntimeError: Expected to find "mkldnn_convolution" but did not find it
Searched string:
graph(%self : __torch__.torch.nn.modules.container.___torch_mangle_2199.Sequential,
~~~~~~~~~~~~~~~~~~ <--- HERE
      %input.1 : Tensor):
  %14 : Function = prim::Constant[name="relu"]()
From CHECK: mkldnn_convolution

======================================================================
ERROR: test_switch_inputs_to_inplace (jit.test_freezing.TestMKLDNNReinplacing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_freezing.py", line 3086, in test_switch_inputs_to_inplace
    FileCheck().check("aten::add_").run(mod.graph)
RuntimeError: Expected to find "aten::add_" but did not find it
Searched string:
graph(%self : __torch__.torch.nn.modules.container.___torch_mangle_2203.Sequential,
~~~~~~~~~~ <--- HERE
      %input.1 : Tensor):
  %9 : int = prim::Constant[value=1]()
From CHECK: aten::add_


======================================================================
ERROR: test_conv_dim_folding (jit.test_peephole.TestPeephole)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_peephole.py", line 207, in test_conv_dim_folding
    FileCheck().check_not("conv").check_not("dim").run(conv_dim.graph)
RuntimeError: Expected to not find "conv" but found it
graph(%self : __torch__.jit.test_peephole.ConvDim,
      %x.1 : Tensor):
  %conv : __torch__.torch.nn.modules.conv.Conv1d = prim::GetAttr[name="conv"](%self)
   ~~~~ <--- HERE
  %weight : Tensor = prim::GetAttr[name="weight"](%conv)
  %bias : Tensor? = prim::GetAttr[name="bias"](%conv)
From CHECK-NOT: conv


======================================================================
ERROR: test_partial_eval_stitching (jit.test_symbolic_shape_analysis.TestSymbolicShapeAnalysis)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_symbolic_shape_analysis.py", line 440, in test_partial_eval_stitching
    self.checkSymShapeCompute(shape_compute_graph, nodes, output_shapes, ([1, 3, 224, 224],))
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_symbolic_shape_analysis.py", line 405, in checkSymShapeCompute
    g = shape_compute_graph.partial_eval_shape_graph()
AttributeError: 'NoneType' object has no attribute 'partial_eval_shape_graph'

======================================================================
ERROR: test_refinement_through_graph_stitching (jit.test_symbolic_shape_analysis.TestSymbolicShapeAnalysis)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_symbolic_shape_analysis.py", line 461, in test_refinement_through_graph_stitching
    self.assertTrue(out1[2] != out2[2])
TypeError: 'NoneType' object is not subscriptable

======================================================================
FAIL: test_profiler (__main__.TestJit)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/test_jit.py", line 2958, in test_profiler
    self.assertTrue(e.thread not in mul_events)
AssertionError: False is not true

======================================================================
FAIL: test_canonicalize_tensor_iterator (jit.test_tracer.TestTracer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_tracer.py", line 243, in test_canonicalize_tensor_iterator
    self.assertTrue(str(traced.graph_for(x)).count(': int = prim::Constant') == 5)
AssertionError: False is not true

======================================================================
FAIL: test_inplace_check (jit.test_tracer.TestTracer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_tracer.py", line 340, in test_inplace_check
    with self.assertRaisesRegex(RuntimeError, 'inplace MyInplaceFn'):
AssertionError: RuntimeError not raised

----------------------------------------------------------------------
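Most of the TestMKLDNNReinplacing failures above are FileCheck pattern misses: `torch.testing.FileCheck` scans a printed IR graph for patterns in order, so these errors mean the expected ops were never emitted into the graph (likely an MKL-DNN/oneDNN configuration difference in this toolchain) rather than wrong numerics. A stdlib approximation of the `check()` behavior, to show what the error text corresponds to (sketch only, not torch's implementation):

```python
# Scan `text` for each pattern in order, mimicking FileCheck's check():
# every pattern must appear after the previous match.
def check_in_order(text, patterns):
    pos = 0
    for p in patterns:
        idx = text.find(p, pos)
        if idx == -1:
            raise RuntimeError(f'Expected to find "{p}" but did not find it')
        pos = idx + len(p)

graph = "mkldnn_convolution\nprim::MKLDNNHardSwish_\naten::relu_"
check_in_order(graph, ["mkldnn_convolution", "aten::relu_"])  # passes

try:
    check_in_order("aten::add", ["aten::mul_"])
except RuntimeError as e:
    print(e)  # → Expected to find "aten::mul_" but did not find it
```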

@boegel boegel added this to the 4.7.3 milestone Jul 6, 2023
@branfosj branfosj deleted the 20230705210442_new_pr_PyTorch201 branch October 7, 2023 14:51
@boegel boegel modified the milestones: 4.8.0, 4.x Oct 11, 2023