
[Dist] sample_etype_neighbors samples unexpected num_edges #4150

Closed
Rhett-Ying opened this issue Jun 21, 2022 · 3 comments
Assignees
Labels
bug:confirmed Something isn't working topic: Distributed DGL Issues related to distributed training/inference

Comments

@Rhett-Ying
Collaborator

🐛 Bug

When sampling from a DistGraph whose neighbor lists are not sorted by dgl.ETYPE for each sorted dst node, far more edges than expected are sampled. This is caused by the hardcoded flag etype_sorted=True below. False should be the default value, and the argument should be exposed to the user for better control.

sampled_graph = local_sample_etype_neighbors(
    local_g, local_ids, etype_field, fan_out, edge_dir, prob, replace,
    etype_sorted=True, _dist_training=True)

To Reproduce

Expected behavior

Environment

  • DGL Version (e.g., 1.0): master
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
  • OS (e.g., Linux):
  • How you installed DGL (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context

@Rhett-Ying Rhett-Ying added the topic: Distributed DGL Issues related to distributed training/inference label Jun 21, 2022
@Rhett-Ying Rhett-Ying self-assigned this Jun 21, 2022
@Rhett-Ying
Collaborator Author

@jermainewang @zheng-da please comment on this.

@jermainewang jermainewang added the bug:confirmed Something isn't working label Jun 21, 2022
@jermainewang
Member

jermainewang commented Jun 21, 2022

We found this bug when investigating a performance issue of distributed training. The details are as follows:

  1. The etype_sorted flag requires the neighbor edges of each node to be sorted by edge type. This is, IMO, non-trivial for end users preparing the graph, because (1) it is more complex than simply sorting all edges by their types, and (2) DGL internally converts the graph to CSC format for neighbor sampling, which may break the user-prepared edge order.
  2. For whatever reason, the data we are using is not sorted by edge type. This may be because the convert_partition.py script does not guarantee this property:
    # It's not guaranteed that the edges are sorted based on edge type.
    # Let's sort edges and all attributes on the edges.
    sort_idx = np.argsort(etype_ids)
    src_id, dst_id, orig_src_id, orig_dst_id, orig_edge_id, etype_ids = \
        src_id[sort_idx], dst_id[sort_idx], orig_src_id[sort_idx], \
        orig_dst_id[sort_idx], orig_edge_id[sort_idx], etype_ids[sort_idx]
    assert np.all(np.diff(etype_ids) >= 0)
  3. The sample_etype_neighbors function hardcodes etype_sorted to True:
    # DistGraph's edges are sorted by default according to
    # graph partition mechanism.
    sampled_graph = local_sample_etype_neighbors(
        local_g, local_ids, etype_field, fan_out, edge_dir, prob, replace,
        etype_sorted=True, _dist_training=True)
  4. Calling sample_etype_neighbors(etype_sorted=True) on unsorted data does not trigger any validity check. It will still scan the neighbor list for consecutive etype segments and sample from each segment, which causes many extra neighbors to be sampled. For example, consider a node with neighbor edges of type [0, 0, 1, 1, 0, 0, 1, 1]. With fanout=3, it should return 6 neighbors (3 samples per type), but the algorithm treats each consecutive segment as a new type, so 8 neighbors are sampled.
    for (int64_t j = 0; j < len; ++j) {
      if ((j+1 == len) || cur_et != et[et_idx[j+1]]) {
        // 1 end of the current etype
        // 2 end of the row
        // random pick for current etype
        if (et_len <= num_picks[cur_et] && !replace) {
          // fast path, select all
          for (int64_t k = 0; k < et_len; ++k) {
            rows.push_back(rid);
            cols.push_back(indices[off+et_idx[et_offset+k]]);
            if (data)
              idx.push_back(data[off+et_idx[et_offset+k]]);
            else
              idx.push_back(off+et_idx[et_offset+k]);
          }
        } else {
My suggestions:

  1. The sample_etype_neighbors API was introduced in [Distributed] Edge-type-specific fanouts for heterogeneous graphs #3558. I checked the doc page; it is currently not exposed to end users. We should review the API, expose it, and write more thorough tests.
  2. In distributed training, etype_sorted should not be hardcoded to True. We should expose the option to upper-level applications.
  3. In the operator implementation, we should check whether the neighbor lists are sorted by etype during sampling. This won't add much overhead: just sweep each neighbor list and check that the etype IDs are in ascending order.
  4. In the future, we should provide an API to adjust the neighbor list order of the internal CSC/CSR.

2 and 3 are immediate patches to take, while 1 and 4 may need more time.
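The sortedness check in suggestion 3 could look like the following sketch, assuming a CSC/CSR layout with an indptr array delimiting each node's neighbor segment (function and variable names are hypothetical, not DGL's actual API):

```python
import numpy as np

def etypes_sorted_per_node(indptr, etypes):
    """Return True iff every node's neighbor segment has
    non-decreasing etype IDs. One linear sweep over the edges."""
    for u in range(len(indptr) - 1):
        seg = etypes[indptr[u]:indptr[u + 1]]
        if np.any(np.diff(seg) < 0):
            return False
    return True

# Two nodes: node 0's neighbors are sorted by etype, node 1's are not.
indptr = np.array([0, 3, 7])
etypes = np.array([0, 0, 1,  0, 1, 0, 1])
print(etypes_sorted_per_node(indptr, etypes))  # False
```

Running this once per sampling call (or once at graph-construction time) would have turned the silent over-sampling into an explicit error.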

@Rhett-Ying
Collaborator Author

All required changes have been merged.
