Friendster dataset in Benchmark #3967

initzhang · 2022-05-01T12:29:09Z

🐛 Bug

There is a potential problem with the Friendster dataset used in the benchmark scripts:

Lines 131 to 140 in ae7e3db

    
    def get_friendster():
        # Same as https://snap.stanford.edu/data/bigdata/communities/com-friendster.ungraph.txt.gz
        _download('https://dgl-asv-data.s3-us-west-2.amazonaws.com/dataset/friendster/com-friendster.ungraph.txt.gz',
                  '/tmp/dataset/friendster', 'com-friendster.ungraph.txt.gz')
        df = pandas.read_csv('/tmp/dataset/friendster/com-friendster.ungraph.txt.gz', sep='\t', skiprows=4, header=None,
                             names=['src', 'dst'], compression='gzip')
        src = df['src'].values
        dst = df['dst'].values
        print('construct the graph')
        return dgl.graph((src, dst))

The node IDs in com-friendster.ungraph.txt do not start at 0 and are not contiguous, so the returned graph is also wrong (in terms of the number of nodes and edges). Could this affect the benchmark results?
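
A toy illustration of the effect, in plain Python rather than DGL. The edge list and IDs below are made up, and the sketch assumes the node count is inferred as the largest node ID plus one, which is what `dgl.graph` does when `num_nodes` is not given:

```python
# Toy edge list with sparse, non-zero-based node IDs (hypothetical values).
src = [101, 101, 205, 999]
dst = [205, 999, 999, 101]

# A constructor that infers the node count from the largest ID
# allocates IDs 0..max_id, whether or not they appear in any edge.
inferred_num_nodes = max(max(src), max(dst)) + 1

# Nodes that actually appear in at least one edge.
connected_nodes = set(src) | set(dst)

print(inferred_num_nodes)                         # 1000
print(len(connected_nodes))                       # 3
print(inferred_num_nodes - len(connected_nodes))  # 997 placeholder nodes with no edges
```

Only three nodes carry edges here, yet the inferred graph would hold 1000 nodes.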

jermainewang · 2022-05-08T09:39:38Z

Hi, thanks for reporting. I'd like to understand how severe the problem is. Do you know how many nodes in the constructed graph are isolated (no in- or out-edges)?

initzhang · 2022-05-08T15:29:22Z

According to the description on the Friendster dataset page (http://snap.stanford.edu/data/com-Friendster.html), the dataset contains 65,608,366 connected nodes with 1,806,067,135 edges.

The current benchmark script, however, generates a graph with 124,836,180 nodes and 1,806,067,135 edges, in which the number of nonzero elements of g.in_degrees() + g.out_degrees() is 65,608,366.

So 59,227,814 nodes, roughly 47% (close to half), are isolated.
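
The check described above can be sketched in plain Python on a toy graph. The helper name `count_isolated` and the toy edges are hypothetical; on the real graph the same count comes from the nonzero entries of `g.in_degrees() + g.out_degrees()`. The last lines redo the arithmetic on the figures quoted in this comment:

```python
def count_isolated(src, dst, num_nodes):
    """Count nodes with neither incoming nor outgoing edges."""
    degree = [0] * num_nodes
    for u, v in zip(src, dst):
        degree[u] += 1  # out-degree contribution
        degree[v] += 1  # in-degree contribution
    return sum(1 for d in degree if d == 0)

# Toy graph: 6 node slots, only nodes 0, 2 and 5 touch an edge.
toy_isolated = count_isolated([0, 2], [2, 5], num_nodes=6)
print(toy_isolated)  # 3

# The same arithmetic on the figures reported in this thread.
total_nodes = 124_836_180
connected = 65_608_366
isolated = total_nodes - connected
print(isolated)                          # 59227814
print(round(isolated / total_nodes, 3))  # 0.474
```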

jermainewang · 2022-05-09T09:11:55Z

I see. The inflation of the node ID space may harm locality when reading node features, because the features will be more scattered. It won't affect the amount of computation, which depends on the number of edges. I think it is worth improving. Would you like to help? The compact_graphs API should be useful.
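
To make the compaction idea concrete, here is a minimal plain-Python sketch. It is not DGL's implementation and not the change that was eventually submitted; the helper `compact_edges` and the toy IDs are hypothetical. It relabels only the nodes that appear in some edge to a contiguous range, which is the effect one would expect from dropping isolated nodes:

```python
def compact_edges(src, dst):
    """Relabel node IDs to 0..n-1, keeping only nodes that appear in an edge.

    Returns the relabeled edge lists and the list of original IDs,
    indexed by new ID, so node features can be gathered in the new order.
    """
    original_ids = sorted(set(src) | set(dst))
    new_id = {old: new for new, old in enumerate(original_ids)}
    return [new_id[u] for u in src], [new_id[v] for v in dst], original_ids

# Toy edge list with sparse node IDs (hypothetical values).
src = [101, 101, 205, 999]
dst = [205, 999, 999, 101]
new_src, new_dst, original_ids = compact_edges(src, dst)
print(new_src)       # [0, 0, 1, 2]
print(new_dst)       # [1, 2, 2, 0]
print(original_ids)  # [101, 205, 999]
```

After relabeling, a feature table needs one row per connected node, so rows for neighboring IDs sit next to each other instead of being spread across a table nearly twice as large.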

initzhang · 2022-05-09T10:45:25Z

Sure! I will follow up and modify related scripts soon.

initzhang · 2022-05-14T03:05:33Z

Hi, I have opened a PR: #4009

jermainewang added the Enhancement label May 9, 2022

initzhang closed this as completed May 17, 2022

initzhang mentioned this issue Aug 23, 2024

Mismatch in Number of Nodes for Friendster Dataset initzhang/DUCATI_SIGMOD#9

Open
