
[Dist] dtype mismatch when copying graph into shared memory and getting it back #4222

Closed
Rhett-Ying opened this issue Jul 7, 2022 · 1 comment
Rhett-Ying commented Jul 7, 2022

🐛 Bug

In DistDGL, the graph is mapped to shared memory when the server starts. Clients then re-create the graph from that shared memory. The problem is that the dtypes in the original graph can differ from the dtypes we specify when re-creating the graph from shared memory (FIELD_DICT below):

FIELD_DICT = {'inner_node': F.int32,  # A flag indicates whether the node is inside a partition.
              'inner_edge': F.int32,  # A flag indicates whether the edge is inside a partition.
              NID: F.int64,
              EID: F.int64,
              NTYPE: F.int32,
              ETYPE: F.int32}

Although the re-creation itself does not fail, accessing that data from the re-created graph results in a core dump or unexpected values.
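As a rough illustration of why such a mismatch is dangerous (a minimal NumPy sketch, not DGL code): the writer lays out N int8 flags (N bytes), while a reader that assumes int32 expects 4*N bytes over the same region.

import numpy as np

# Minimal sketch, not DGL code: what is actually written is 10 int8 flags,
# i.e. 10 bytes of shared data.
flags = np.ones(10, dtype=np.int8)
raw = flags.tobytes()
print(len(raw))  # 10
# np.frombuffer(raw, dtype=np.int32)  # ValueError: buffer size must be a multiple of element size
# Raw shared memory has no such guard, so a reader that assumes int32 either
# sees garbage values or reads past the mapped region -> bus error / core dump.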

To Reproduce

import os
import tempfile

import dgl
from dgl.distributed import partition_graph

# create_random_hetero() is the helper from DGL's distributed tests that builds
# a small random heterogeneous graph; F is the backend wrapper used in the test
# suite (it provides array_equal()).
hg = create_random_hetero()
num_parts = 1
num_hops = 1
graph_name = 'test_shared_mem_graph'
with tempfile.TemporaryDirectory() as tmpdirname:
    partition_graph(hg, graph_name, num_parts, tmpdirname,
                    num_hops=num_hops, part_method='metis', reshuffle=True)

    g = dgl.load_graphs(os.path.join(tmpdirname, 'part0/graph.dgl'))[0][0]
    graph_format = ['csc', 'coo', 'csr']
    g = g.formats(graph_format)
    g.create_formats_()
    sg = dgl.distributed.dist_graph._copy_graph_to_shared_mem(g, graph_name, graph_format)
    for k in sg.ndata:
        assert F.array_equal(g.ndata[k], sg.ndata[k])
    for k in sg.edata:
        assert F.array_equal(g.edata[k], sg.edata[k])
    print(sg)

    cg = dgl.distributed.dist_graph._get_graph_from_shared_mem(graph_name)
    print(cg)
    # print(cg.ndata['inner_node'])   # ------- bus error

Output of the two print() calls (note inner_node / inner_edge: int8 in the original graph vs. int32 after re-creation):
Graph(num_nodes=30030, num_edges=300600,
      ndata_schemes={'inner_node': Scheme(shape=(), dtype=torch.int8), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'inner_edge': Scheme(shape=(), dtype=torch.int8), '_ID': Scheme(shape=(), dtype=torch.int64), '_TYPE': Scheme(shape=(), dtype=torch.int32)})
Graph(num_nodes=30030, num_edges=300600,
      ndata_schemes={'inner_node': Scheme(shape=(), dtype=torch.int32), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'inner_edge': Scheme(shape=(), dtype=torch.int32), '_ID': Scheme(shape=(), dtype=torch.int64), '_TYPE': Scheme(shape=(), dtype=torch.int32)})
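One possible user-side workaround (an assumption on my part, not the fix that actually landed) is to cast the flag fields to the dtypes FIELD_DICT expects before the graph is copied into shared memory, so the writer and the reader agree on the layout:

import torch

def align_flag_dtypes(g):
    """Hypothetical workaround (not the actual fix): cast the partition flag
    fields to the dtypes FIELD_DICT specifies before calling
    _copy_graph_to_shared_mem()."""
    if 'inner_node' in g.ndata:
        g.ndata['inner_node'] = g.ndata['inner_node'].to(torch.int32)
    if 'inner_edge' in g.edata:
        g.edata['inner_edge'] = g.edata['inner_edge'].to(torch.int32)
    return g

# g = align_flag_dtypes(g)  # call before _copy_graph_to_shared_mem(...)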

Expected behavior

The ndata/edata dtypes of the graph re-created from shared memory should match those of the original graph.

Environment

  • DGL Version (e.g., 1.0): master
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
  • OS (e.g., Linux):
  • How you installed DGL (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context

@Rhett-Ying (Collaborator, Author) commented:
fixed.
