
[Dist] dtype mismatch when copying graph into shared memory and getting it back #4222

Closed
Rhett-Ying opened this issue Jul 7, 2022 · 1 comment
Rhett-Ying commented Jul 7, 2022

🐛 Bug

In DistDGL, the graph is mapped to shared memory when the server starts. Clients then re-create the graph from that shared memory. The problem is that the dtypes in the original graph can differ from the dtypes we specify when re-creating the graph from shared memory (FIELD_DICT below):

FIELD_DICT = {'inner_node': F.int32,  # A flag indicates whether the node is inside a partition.
              'inner_edge': F.int32,  # A flag indicates whether the edge is inside a partition.
              NID: F.int64,
              EID: F.int64,
              NTYPE: F.int32,
              ETYPE: F.int32}

Although the re-creation itself does not fail, accessing that data from the re-created graph results in a core dump or unexpected values.
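As a rough illustration of why such a mismatch is dangerous (a minimal NumPy sketch, not DGL code): the writer lays out N int8 flags (N bytes), while a reader that assumes int32 expects 4*N bytes over the same region.

import numpy as np

# Minimal sketch, not DGL code: what is actually written is 10 int8 flags,
# i.e. 10 bytes of shared data.
flags = np.ones(10, dtype=np.int8)
raw = flags.tobytes()
print(len(raw))  # 10
# np.frombuffer(raw, dtype=np.int32)  # ValueError: buffer size must be a multiple of element size
# Raw shared memory has no such guard, so a reader that assumes int32 either
# sees garbage values or reads past the mapped region -> bus error / core dump.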

To Reproduce

import os
import tempfile

import dgl
from dgl.distributed import partition_graph

# create_random_hetero() is the helper from DGL's distributed tests that builds
# a small random heterogeneous graph; F is the backend wrapper used in the test
# suite (it provides array_equal()).
hg = create_random_hetero()
num_parts = 1
num_hops = 1
graph_name = 'test_shared_mem_graph'
with tempfile.TemporaryDirectory() as tmpdirname:
    partition_graph(hg, graph_name, num_parts, tmpdirname,
                    num_hops=num_hops, part_method='metis', reshuffle=True)

    g = dgl.load_graphs(os.path.join(tmpdirname, 'part0/graph.dgl'))[0][0]
    graph_format = ['csc', 'coo', 'csr']
    g = g.formats(graph_format)
    g.create_formats_()
    sg = dgl.distributed.dist_graph._copy_graph_to_shared_mem(g, graph_name, graph_format)
    for k in sg.ndata:
        assert F.array_equal(g.ndata[k], sg.ndata[k])
    for k in sg.edata:
        assert F.array_equal(g.edata[k], sg.edata[k])
    print(sg)

    cg = dgl.distributed.dist_graph._get_graph_from_shared_mem(graph_name)
    print(cg)
    # print(cg.ndata['inner_node'])   # ------- bus error

Output of the two print() calls (note inner_node / inner_edge: int8 in the original graph vs. int32 after re-creation):
Graph(num_nodes=30030, num_edges=300600,
      ndata_schemes={'inner_node': Scheme(shape=(), dtype=torch.int8), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'inner_edge': Scheme(shape=(), dtype=torch.int8), '_ID': Scheme(shape=(), dtype=torch.int64), '_TYPE': Scheme(shape=(), dtype=torch.int32)})
Graph(num_nodes=30030, num_edges=300600,
      ndata_schemes={'inner_node': Scheme(shape=(), dtype=torch.int32), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'inner_edge': Scheme(shape=(), dtype=torch.int32), '_ID': Scheme(shape=(), dtype=torch.int64), '_TYPE': Scheme(shape=(), dtype=torch.int32)})
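One possible user-side workaround (an assumption on my part, not the fix that actually landed) is to cast the flag fields to the dtypes FIELD_DICT expects before the graph is copied into shared memory, so the writer and the reader agree on the layout:

import torch

def align_flag_dtypes(g):
    """Hypothetical workaround (not the actual fix): cast the partition flag
    fields to the dtypes FIELD_DICT specifies before calling
    _copy_graph_to_shared_mem()."""
    if 'inner_node' in g.ndata:
        g.ndata['inner_node'] = g.ndata['inner_node'].to(torch.int32)
    if 'inner_edge' in g.edata:
        g.edata['inner_edge'] = g.edata['inner_edge'].to(torch.int32)
    return g

# g = align_flag_dtypes(g)  # call before _copy_graph_to_shared_mem(...)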

Expected behavior

The ndata/edata dtypes of the graph re-created from shared memory should match those of the original graph.

Environment

  • DGL Version (e.g., 1.0): master
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
  • OS (e.g., Linux):
  • How you installed DGL (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context

@Rhett-Ying (Collaborator, Author) commented:
fixed.
