Friendster dataset in Benchmark #3967

Closed
initzhang opened this issue May 1, 2022 · 5 comments

@initzhang
Contributor

🐛 Bug

There is a potential problem with the Friendster dataset used in the benchmark scripts:

def get_friendster():
    # Same as https://snap.stanford.edu/data/bigdata/communities/com-friendster.ungraph.txt.gz
    _download('https://dgl-asv-data.s3-us-west-2.amazonaws.com/dataset/friendster/com-friendster.ungraph.txt.gz',
              '/tmp/dataset/friendster', 'com-friendster.ungraph.txt.gz')
    df = pandas.read_csv('/tmp/dataset/friendster/com-friendster.ungraph.txt.gz', sep='\t', skiprows=4, header=None,
                         names=['src', 'dst'], compression='gzip')
    src = df['src'].values
    dst = df['dst'].values
    print('construct the graph')
    return dgl.graph((src, dst))

The node IDs in com-friendster.ungraph.txt do not start at 0 and are not contiguous, so the returned graph is also wrong (in terms of its number of nodes and edges). Could this affect the benchmark results?
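
To make the concern concrete, here is a minimal sketch with a made-up edge list (the IDs are illustrative, not taken from the dataset). It assumes the node count is inferred as the largest node ID plus 1, which is how `dgl.graph` is documented to behave when `num_nodes` is not passed; the sketch uses plain Python and does not call DGL.

```python
# Toy edge list with sparse node IDs, similar in spirit to the raw SNAP file.
# The values are invented for illustration.
src = [101, 101, 205, 999]
dst = [205, 999, 999, 101]

# Node count when inferred as "largest ID + 1" (the ID space is 0..999).
inferred_num_nodes = max(max(src), max(dst)) + 1

# Node count when only IDs that appear in at least one edge are counted.
actual_num_nodes = len(set(src) | set(dst))

print(inferred_num_nodes, actual_num_nodes)
```

Every ID in the inferred range that never appears in an edge becomes an isolated node in the constructed graph.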

@jermainewang
Member

Hi, thanks for reporting. I'd like to understand how severe the problem is. Do you know how many nodes in the constructed graph are isolated (no in- or out-edges)?

@initzhang
Contributor Author

initzhang commented May 8, 2022

According to the description on the Friendster website (http://snap.stanford.edu/data/com-Friendster.html), the dataset contains 65,608,366 connected nodes and 1,806,067,135 edges.

The current benchmark script, however, generates a graph with 124,836,180 nodes and 1,806,067,135 edges, in which g.in_degrees() + g.out_degrees() has only 65,608,366 nonzero elements.

So approximately 50% of the nodes are isolated.
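
The estimate can be checked with a little arithmetic on the figures quoted above. The sketch below is plain Python; `count_isolated` is a hypothetical helper (not a DGL function) that mirrors the degree check on a toy input and assumes every node ID is smaller than `num_nodes`.

```python
def count_isolated(src, dst, num_nodes):
    """Count IDs in range(num_nodes) that appear in no edge (hypothetical helper)."""
    touched = set(src) | set(dst)
    return num_nodes - len(touched)

# Toy check: ID space 0..5, but only nodes 1, 3 and 4 have edges.
toy_isolated = count_isolated([1, 3], [3, 4], num_nodes=6)

# Figures reported in this thread.
total_nodes = 124_836_180
connected_nodes = 65_608_366
isolated_nodes = total_nodes - connected_nodes
isolated_share = isolated_nodes / total_nodes

print(toy_isolated, isolated_nodes, f"{isolated_share:.1%}")
```

On the reported numbers this gives 59,227,814 isolated nodes, a little under half of the node ID space.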

@jermainewang
Member

I see. Inflating the node ID space may hurt locality when reading node features, because the features will be more scattered. It won't change the amount of computation, which depends on the number of edges. I think it is worth improving. Would you like to help? The compact_graphs API should be useful.
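
As a rough illustration of what compacting the ID space means, the sketch below relabels the node IDs used by a toy edge list onto a contiguous range. It is plain Python and is not how `compact_graphs` is implemented; `relabel_contiguous` is a hypothetical helper, and a dictionary-based approach like this would likely be too slow for the full 1.8-billion-edge file.

```python
def relabel_contiguous(src, dst):
    """Map the node IDs used by the edges onto 0..n-1, preserving ID order."""
    used_ids = sorted(set(src) | set(dst))
    new_id = {old: new for new, old in enumerate(used_ids)}
    new_src = [new_id[u] for u in src]
    new_dst = [new_id[v] for v in dst]
    # used_ids[i] is the original ID of the node now labelled i.
    return new_src, new_dst, used_ids

# Invented sparse IDs for illustration.
src = [101, 101, 205, 999]
dst = [205, 999, 999, 101]
new_src, new_dst, used_ids = relabel_contiguous(src, dst)
print(new_src, new_dst, used_ids)
```

After relabeling, the graph has exactly as many nodes as there are distinct IDs in the edge list, so per-node feature storage has no gaps for unused IDs.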

@initzhang
Contributor Author

Sure! I will follow up and modify related scripts soon.

@initzhang
Contributor Author

Hi, I have opened a PR at #4009.
