{ai}[foss/2022a] PyTorch v2.0.1 #18269

Closed

Conversation


@branfosj branfosj commented Jul 5, 2023

(created using eb --new-pr)

I started working on this but have not had, and will not have, the time to complete it.

I also tried a CUDA version of 2.0.0, but I stopped working on that when I had many test failures of the form:

torch.distributed.DistBackendError: NCCL error in: /dev/shm/branfosj/build-up-EL8/PyTorch/2.0.0/foss-2022a-CUDA-11.7.0/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, invalid usage, NCCL version 2.12.12
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
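For anyone picking up the CUDA build: `ncclInvalidUsage` failures are usually easier to diagnose with NCCL's documented debug environment variables. A minimal sketch (the test invocation at the end is illustrative, not the exact command used here):

```shell
# NCCL_DEBUG and NCCL_DEBUG_SUBSYS are documented NCCL knobs; INFO-level
# logging shows initialization and the exact call that returned
# ncclInvalidUsage.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
# Then re-run the failing distributed test, e.g. (illustrative):
#   python run_test.py -i distributed/test_c10d_nccl
```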

@branfosj branfosj added the update label Jul 5, 2023
@branfosj branfosj marked this pull request as draft July 5, 2023 20:05

branfosj commented Jul 5, 2023

PR closed, as I'll not be working on it.

@branfosj branfosj closed this Jul 5, 2023
@branfosj branfosj reopened this Jul 5, 2023
@branfosj branfosj closed this Jul 5, 2023

branfosj commented Jul 6, 2023

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0104u05b.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/4709f77fc3f581f128f6edd3457e5196 for a full test report.


branfosj commented Jul 6, 2023

test_torchinductor_opinfo


====================================================================================== FAILURES =======================================================================================
___________________________________________________________ TestInductorOpInfoCPU.test_comprehensive_index_add_cpu_float16 ____________________________________________________________
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/inductor/test_torchinductor_opinfo.py", line 606, in test_comprehensive
    raise RuntimeError(
RuntimeError: unexpected success index_add, torch.float16, cpu
_______________________________________________________ TestInductorOpInfoCPU.test_comprehensive_scatter_reduce_sum_cpu_float16 _______________________________________________________
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/inductor/test_torchinductor_opinfo.py", line 606, in test_comprehensive
    raise RuntimeError(
RuntimeError: unexpected success scatter_reduce.sum, torch.float16, cpu

test_dataloader

Fails in test_batch_sampler, test_bulk_loading_nobatch, and test_chain_iterable_style_dataset:

RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code
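This looks environmental rather than a real regression: each DataLoader worker holds file descriptors for shared tensors, so the build host's `ulimit -n` matters. A stdlib sketch for checking the limit in effect (Unix-only; the torch workaround from the error message is shown as a comment, since it needs torch at runtime):

```python
import resource

# The soft RLIMIT_NOFILE is the "ulimit -n" value the error message asks
# to raise; DataLoader workers exhaust it when sharing many tensors via
# file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# The alternative workaround from the error message (torch-only, comment):
#   import torch.multiprocessing
#   torch.multiprocessing.set_sharing_strategy('file_system')
# This shares tensors through files on disk instead of holding one open
# descriptor per shared tensor.
```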

test_ops_gradients

====================================================================================== FAILURES =======================================================================================
__________________________________________________________ TestBwdGradientsCPU.test_fn_grad_linalg_det_singular_cpu_float64 ___________________________________________________________
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/test_ops_gradients.py", line 26, in test_fn_grad
    self._grad_test_helper(device, dtype, op, op.get_op())
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4303, in _grad_test_helper
    return self._check_helper(device, dtype, op, variant, 'gradcheck', check_forward_ad=check_forward_ad,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4272, in _check_helper
    self.assertTrue(gradcheck(fn, gradcheck_args,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3827, in gradcheck
    return torch.autograd.gradcheck(fn, inputs, **kwargs)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1476, in gradcheck
    return _gradcheck_helper(**args)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1490, in _gradcheck_helper
    _gradcheck_real_imag(gradcheck_fn, func, func_out, tupled_inputs, outputs, eps,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1113, in _gradcheck_real_imag
    gradcheck_fn(func, func_out, tupled_inputs, outputs, eps,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1363, in _fast_gradcheck
    _check_analytical_numerical_equal(analytical_vJu, numerical_vJu, complex_indices,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1335, in _check_analytical_numerical_equal
    raise GradcheckError(_get_notallclose_msg(a, n, j, i, complex_indices, test_imag, is_forward_ad) + jacobians_str)
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(0.1083, dtype=torch.float64)
analytical:tensor(0.0094, dtype=torch.float64)

The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

[snip large tensor]
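For context on what this failure is measuring: `torch.autograd.gradcheck` compares an analytical Jacobian against a central-difference numerical one and raises `GradcheckError` when they disagree, as above. A stdlib-only sketch of that core idea (gradcheck itself works on full Jacobians in float64, in the fast or slow mode the message describes; `linalg.det` on near-singular inputs is a known numerically delicate case):

```python
# Central-difference numerical derivative, the reference gradcheck uses.
def numerical_derivative(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

f = lambda x: x ** 3          # function under test
analytic = lambda x: 3 * x ** 2  # its hand-written derivative

x = 1.5
num, ana = numerical_derivative(f, x), analytic(x)
# A large gap between num and ana is exactly what GradcheckError reports.
print(abs(num - ana) < 1e-4)  # → True
```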

test_jit

======================================================================
ERROR: test_always_alive_values (jit.test_freezing.TestMKLDNNReinplacing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_freezing.py", line 3053, in test_always_alive_values
    self.checkResults(mod_eager, mod)
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_freezing.py", line 3010, in checkResults
    self.assertEqual(mod1(inp), mod2(inp))
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 483, in prof_meth_call
    return prof_callable(meth_call, *args, **kwargs)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-jr1zlioq/tmpwifu7v85/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 477, in prof_callable
    return callable(*args, **kwargs)
RuntimeError: Couldn't find method: '_conv_forward' on class: '__torch__.torch.nn.modules.conv.___torch_mangle_2194.Conv2d (of Python compilation unit at: 0x7329e20)'

======================================================================
ERROR: test_merge_liveness (jit.test_freezing.TestMKLDNNReinplacing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_freezing.py", line 3035, in test_merge_liveness
    FileCheck().check("aten::mul_").check_not("aten::add_").run(mod.graph)
RuntimeError: Expected to find "aten::mul_" but did not find it
Searched string:
graph(%self : __torch__.torch.nn.modules.container.___torch_mangle_2196.Sequential,
~~~~~~~~~~ <--- HERE
      %input.1 : Tensor):
  %9 : int = prim::Constant[value=1]()
From CHECK: aten::mul_


======================================================================
ERROR: test_successful (jit.test_freezing.TestMKLDNNReinplacing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_freezing.py", line 3017, in test_successful
    FileCheck().check("mkldnn_convolution").check_next("prim::MKLDNNHardSwish_").check_next("aten::relu_").run(mod.graph)
RuntimeError: Expected to find "mkldnn_convolution" but did not find it
Searched string:
graph(%self : __torch__.torch.nn.modules.container.___torch_mangle_2199.Sequential,
~~~~~~~~~~~~~~~~~~ <--- HERE
      %input.1 : Tensor):
  %14 : Function = prim::Constant[name="relu"]()
From CHECK: mkldnn_convolution

======================================================================
ERROR: test_switch_inputs_to_inplace (jit.test_freezing.TestMKLDNNReinplacing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_freezing.py", line 3086, in test_switch_inputs_to_inplace
    FileCheck().check("aten::add_").run(mod.graph)
RuntimeError: Expected to find "aten::add_" but did not find it
Searched string:
graph(%self : __torch__.torch.nn.modules.container.___torch_mangle_2203.Sequential,
~~~~~~~~~~ <--- HERE
      %input.1 : Tensor):
  %9 : int = prim::Constant[value=1]()
From CHECK: aten::add_


======================================================================
ERROR: test_conv_dim_folding (jit.test_peephole.TestPeephole)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_peephole.py", line 207, in test_conv_dim_folding
    FileCheck().check_not("conv").check_not("dim").run(conv_dim.graph)
RuntimeError: Expected to not find "conv" but found it
graph(%self : __torch__.jit.test_peephole.ConvDim,
      %x.1 : Tensor):
  %conv : __torch__.torch.nn.modules.conv.Conv1d = prim::GetAttr[name="conv"](%self)
   ~~~~ <--- HERE
  %weight : Tensor = prim::GetAttr[name="weight"](%conv)
  %bias : Tensor? = prim::GetAttr[name="bias"](%conv)
From CHECK-NOT: conv


======================================================================
ERROR: test_partial_eval_stitching (jit.test_symbolic_shape_analysis.TestSymbolicShapeAnalysis)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_symbolic_shape_analysis.py", line 440, in test_partial_eval_stitching
    self.checkSymShapeCompute(shape_compute_graph, nodes, output_shapes, ([1, 3, 224, 224],))
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_symbolic_shape_analysis.py", line 405, in checkSymShapeCompute
    g = shape_compute_graph.partial_eval_shape_graph()
AttributeError: 'NoneType' object has no attribute 'partial_eval_shape_graph'

======================================================================
ERROR: test_refinement_through_graph_stitching (jit.test_symbolic_shape_analysis.TestSymbolicShapeAnalysis)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_symbolic_shape_analysis.py", line 461, in test_refinement_through_graph_stitching
    self.assertTrue(out1[2] != out2[2])
TypeError: 'NoneType' object is not subscriptable

======================================================================
FAIL: test_profiler (__main__.TestJit)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/test_jit.py", line 2958, in test_profiler
    self.assertTrue(e.thread not in mul_events)
AssertionError: False is not true

======================================================================
FAIL: test_canonicalize_tensor_iterator (jit.test_tracer.TestTracer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_tracer.py", line 243, in test_canonicalize_tensor_iterator
    self.assertTrue(str(traced.graph_for(x)).count(': int = prim::Constant') == 5)
AssertionError: False is not true

======================================================================
FAIL: test_inplace_check (jit.test_tracer.TestTracer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/jit/test_tracer.py", line 340, in test_inplace_check
    with self.assertRaisesRegex(RuntimeError, 'inplace MyInplaceFn'):
AssertionError: RuntimeError not raised

----------------------------------------------------------------------
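Most of the TestMKLDNNReinplacing failures above are FileCheck pattern misses: `torch.testing.FileCheck` scans a printed IR graph for patterns in order, so these errors mean the expected ops were never emitted into the graph (likely an MKL-DNN/oneDNN configuration difference in this toolchain) rather than wrong numerics. A stdlib approximation of the `check()` behavior, to show what the error text corresponds to (sketch only, not torch's implementation):

```python
# Scan `text` for each pattern in order, mimicking FileCheck's check():
# every pattern must appear after the previous match.
def check_in_order(text, patterns):
    pos = 0
    for p in patterns:
        idx = text.find(p, pos)
        if idx == -1:
            raise RuntimeError(f'Expected to find "{p}" but did not find it')
        pos = idx + len(p)

graph = "mkldnn_convolution\nprim::MKLDNNHardSwish_\naten::relu_"
check_in_order(graph, ["mkldnn_convolution", "aten::relu_"])  # passes

try:
    check_in_order("aten::add", ["aten::mul_"])
except RuntimeError as e:
    print(e)  # → Expected to find "aten::mul_" but did not find it
```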

@boegel boegel added this to the 4.7.3 milestone Jul 6, 2023
@branfosj branfosj deleted the 20230705210442_new_pr_PyTorch201 branch October 7, 2023 14:51
@boegel boegel modified the milestones: 4.8.0, 4.x Oct 11, 2023