This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

test_subgraph_exe1 fails on windows #19915

leezu opened this issue Feb 18, 2021 · 9 comments
@leezu
Contributor

leezu commented Feb 18, 2021

https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-cpu/detail/PR-19908/2/pipeline

leezu added the Bug label Feb 18, 2021
@leezu
Contributor Author

leezu commented Feb 18, 2021

The first time I see a related error on the master branch windows-cpu pipeline is https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-cpu/detail/master/2455/pipeline (commit e164cee):

[2021-02-16T21:06:05.273Z] _______________ test_subgraph_exe4[sym14-op_names14-default_v2] _______________
[2021-02-16T21:06:05.273Z] [gw0] win32 -- Python 3.7.0 C:\Python37\python.exe
[2021-02-16T21:06:05.273Z] 
[2021-02-16T21:06:05.273Z] sym = <Symbol convolution38>, subgraph_backend = 'default_v2'
[2021-02-16T21:06:05.273Z] op_names = ['sin', 'Convolution']
[2021-02-16T21:06:05.273Z] 
[2021-02-16T21:06:05.273Z]     @pytest.mark.parametrize('subgraph_backend', ['default', 'default_v2'])
[2021-02-16T21:06:05.273Z]     @pytest.mark.parametrize('sym,op_names', get_graphs())
[2021-02-16T21:06:05.273Z]     def test_subgraph_exe4(sym, subgraph_backend, op_names):
[2021-02-16T21:06:05.273Z]         """Use env var MXNET_SUBGRAPH_BACKEND=default to trigger graph partitioning in bind
[2021-02-16T21:06:05.273Z]         and compare results of the partitioned sym and the original sym."""
[2021-02-16T21:06:05.273Z]         def get_executor(sym, subgraph_backend=None, op_names=None, original_exec=None):
[2021-02-16T21:06:05.273Z]             arg_shapes, _, aux_shapes = sym.infer_shape()
[2021-02-16T21:06:05.273Z]             if subgraph_backend is None:
[2021-02-16T21:06:05.273Z]                 arg_array = [mx.nd.random.uniform(shape=shape) for shape in arg_shapes]
[2021-02-16T21:06:05.273Z]                 aux_array = [mx.nd.random.uniform(shape=shape) for shape in aux_shapes]
[2021-02-16T21:06:05.273Z]             else:
[2021-02-16T21:06:05.273Z]                 arg_array = None
[2021-02-16T21:06:05.273Z]                 aux_array = None
[2021-02-16T21:06:05.273Z]             exe = sym._bind(ctx=mx.current_context(),
[2021-02-16T21:06:05.273Z]                            args=arg_array if subgraph_backend is None else original_exec.arg_arrays,
[2021-02-16T21:06:05.273Z]                            aux_states=aux_array if subgraph_backend is None else original_exec.aux_arrays,
[2021-02-16T21:06:05.273Z]                            grad_req='null')
[2021-02-16T21:06:05.273Z]             exe.forward()
[2021-02-16T21:06:05.273Z]             return exe
[2021-02-16T21:06:05.273Z]     
[2021-02-16T21:06:05.273Z]         sym, _, _ = sym
[2021-02-16T21:06:05.273Z] >       original_exec = get_executor(sym)
[2021-02-16T21:06:05.273Z] 
[2021-02-16T21:06:05.273Z] tests\python\unittest\test_subgraph_op.py:237: 
[2021-02-16T21:06:05.273Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2021-02-16T21:06:05.273Z] tests\python\unittest\test_subgraph_op.py:222: in get_executor
[2021-02-16T21:06:05.273Z]     arg_shapes, _, aux_shapes = sym.infer_shape()
[2021-02-16T21:06:05.273Z] windows_package\python\mxnet\symbol\symbol.py:1132: in infer_shape
[2021-02-16T21:06:05.273Z]     res = self._infer_shape_impl(False, *args, **kwargs)
[2021-02-16T21:06:05.273Z] windows_package\python\mxnet\symbol\symbol.py:1267: in _infer_shape_impl
[2021-02-16T21:06:05.273Z]     ctypes.byref(complete)))
[2021-02-16T21:06:05.273Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2021-02-16T21:06:05.273Z] 
[2021-02-16T21:06:05.273Z] ret = -1
[2021-02-16T21:06:05.273Z] 
[2021-02-16T21:06:05.273Z]     def check_call(ret):
[2021-02-16T21:06:05.273Z]         """Check the return value of C API call.
[2021-02-16T21:06:05.273Z]     
[2021-02-16T21:06:05.273Z]         This function will raise an exception when an error occurs.
[2021-02-16T21:06:05.273Z]         Wrap every API call with this function.
[2021-02-16T21:06:05.273Z]     
[2021-02-16T21:06:05.273Z]         Parameters
[2021-02-16T21:06:05.273Z]         ----------
[2021-02-16T21:06:05.273Z]         ret : int
[2021-02-16T21:06:05.273Z]             return value from API calls.
[2021-02-16T21:06:05.273Z]         """
[2021-02-16T21:06:05.273Z]         if ret != 0:
[2021-02-16T21:06:05.273Z] >           raise get_last_ffi_error()
[2021-02-16T21:06:05.273Z] E           mxnet.base.MXNetError: MXNetError: Error in operator convolution38: Shape inconsistent, Provided = [1,0,2,2], inferred shape=(1,3,2,2)

@mseth10
Contributor

mseth10 commented Feb 21, 2021

The error occurs for the following network:

    data1 = mx.sym.Variable('data1', shape=(3, 3, 10, 10), dtype=np.float32)
    data2 = mx.sym.Variable('data2', shape=(1, 0, 2, 2))
    data3 = mx.sym.sin(data2)
    conv = mx.sym.Convolution(data=data1, weight=data3, kernel=(2, 2), num_filter=1)
    return (conv, ['data1'], [(3, 3, 10, 10)])

It happens with simple_bind during infer_shape and is flaky.

@samskalicky Do you think we can change the shape of data2 from (1,0,2,2) to (1,3,2,2)? Or is it intended to be inferred during shape inference?
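
For reference, a minimal sketch of that change (hypothetical, not a tested fix): give data2 its full shape up front instead of leaving the channel dimension as 0 to be inferred.

    data1 = mx.sym.Variable('data1', shape=(3, 3, 10, 10), dtype=np.float32)
    data2 = mx.sym.Variable('data2', shape=(1, 3, 2, 2))  # C-dim fixed to 3 instead of 0
    data3 = mx.sym.sin(data2)
    conv = mx.sym.Convolution(data=data1, weight=data3, kernel=(2, 2), num_filter=1)
    return (conv, ['data1'], [(3, 3, 10, 10)])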

@samskalicky
Contributor

samskalicky commented Feb 22, 2021

No idea. If it's flaky, then it's working (sometimes), and we should figure out why it fails. Just changing the inputs is not a good way to "fix" this, but it might be a good place to start debugging if it makes the problem go away consistently. That shouldn't be the final resolution, though; it just hides the problem.

@leezu
Contributor Author

leezu commented Feb 25, 2021

This essentially blocks the master CI. I marked more subgraph tests to be disabled on Windows in #19908.

@samskalicky
Contributor

So these tests pass on Linux but are flaky on Windows? Is that the current state of things?

@leezu
Contributor Author

leezu commented Feb 25, 2021

Yes. Maybe there was a change to the Windows CI infrastructure that triggered this. I'm not sure.

@mseth10
Contributor

mseth10 commented Apr 27, 2021

Are we still seeing this error? @leezu

@leezu
Contributor Author

leezu commented Apr 27, 2021

The test is currently disabled on Windows:

https://github.com/apache/incubator-mxnet/blob/5722f8b38af58c5a296e46ca695bfaf7cff85040/tests/python/unittest/test_subgraph_op.py#L126-L127

If you think it has been fixed, let's re-enable it :)
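
For context, the skip at that permalink is roughly of the following form (a sketch from memory; the exact marker and reason string are in the linked lines):

    import sys
    import pytest

    # Hypothetical sketch of the Windows skip; see the permalink above for the real marker.
    @pytest.mark.skipif(sys.platform == 'win32', reason='Flaky on Windows, see #19915')
    def test_subgraph_exe1(sym, subgraph_backend, op_names):
        ...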

@DickJC123
Contributor

I recently set up master with an internal build/CI system and see the reported failure on Linux, but so far only on the CI machines when running the full test suite. The test_subgraph_exe* tests pass when run individually on a non-CI machine. The failure I'm seeing matches the reported one:

Shape inconsistent, Provided = [1,0,2,2], inferred shape=(1,3,2,2)

This error text comes from the macro SHAPE_ASSIGN_CHECK, which calls shape_assign():
https://github.com/apache/incubator-mxnet/blob/master/src/operator/operator_common.h#L157-L181

My confusion is in the interpretation of the shape [1,0,2,2]. It seems the test author wanted the C-dimension of this input weight tensor's shape to be inferred. However, shape_assign() seems to be applying the 'np_shape' view of the shape, where a 0 represents a known zero-size dimension, generally reserved for a scalar (and so incompatible with [1,3,2,2]). I wonder if a 'use_np_shape' mode is being non-deterministically applied to this test somehow. Thoughts, anyone?
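
A rough way to probe that hypothesis (a sketch, assuming mx.util.set_np_shape is what toggles the shape semantics seen by infer_shape; not verified against this test):

    import mxnet as mx
    import numpy as np

    def infer_conv_shapes():
        # Same graph as in the failing test: the weight's C-dim is declared as 0.
        data1 = mx.sym.Variable('data1', shape=(3, 3, 10, 10), dtype=np.float32)
        data2 = mx.sym.Variable('data2', shape=(1, 0, 2, 2))
        conv = mx.sym.Convolution(data=data1, weight=mx.sym.sin(data2),
                                  kernel=(2, 2), num_filter=1)
        return conv.infer_shape()

    mx.util.set_np_shape(False)   # legacy semantics: 0 means "unknown, infer me"
    print(infer_conv_shapes())    # should infer the weight as (1, 3, 2, 2)

    mx.util.set_np_shape(True)    # np semantics: 0 is a known zero-size dim, -1 is unknown
    print(infer_conv_shapes())    # expected to hit the "Shape inconsistent" error
    mx.util.set_np_shape(False)   # restore the default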
