
Improving dataset loader and preprocess script #102

Open
JoelMathewC opened this issue Jan 20, 2024 · 4 comments

@JoelMathewC
Contributor

There are some severe speed issues with the preprocess and data loader scripts, and this often makes benchmarking a rather tedious process. I'll work on clearing up the technical debt here (mostly mine 😅).

JoelMathewC self-assigned this Jan 20, 2024
@JoelMathewC
Contributor Author

Starting with the preprocess script, the current time taken is as follows (note that I've already done some optimization here for loading the file). This is from the sx-stackoverflow dataset, cut off at 20M edges.

Namespace(dataset='sx-stackoverflow', base=10000000, percent_change=2.0, cutoff_time=20000000)
[CHECKPOINT]::FILE_PARSING_COMPLETED in 31.958895921707153s
[CHECKPOINT]::PREPROCESS_GRAPH in 661.6636250019073s
[CHECKPOINT]::JSON_DUMP in 95.33747601509094s

I tried tweaking this a bit, but I think these numbers are acceptable, so I'm moving on.
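
For context, the `[CHECKPOINT]` lines above look like simple wall-clock timing between pipeline stages. Here's a minimal sketch of such a timer, assuming a hypothetical `Checkpoint` helper (this is not the script's actual code):

```python
import time

class Checkpoint:
    """Hypothetical helper mirroring the [CHECKPOINT] log lines above;
    the real script may time its stages differently."""

    def __init__(self):
        self._last = time.time()

    def mark(self, label: str) -> None:
        now = time.time()
        print(f"[CHECKPOINT]::{label} in {now - self._last}s")
        self._last = now

# Usage sketch (step names are hypothetical):
# cp = Checkpoint()
# edges = parse_file(path)
# cp.mark("FILE_PARSING_COMPLETED")
# graph = preprocess_graph(edges)
# cp.mark("PREPROCESS_GRAPH")
```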

@JoelMathewC
Contributor Author

Okay, so I have identified a few repeated operations in the data loading pipeline. Right now the flow is:

Dataset (edge list) ----(Preprocessor)--> JsonData (add/delete list) ----(Data loader)--> GraphObj (edge list) ---(GraphObjConstructor)--> GPMA format (add/delete list)

I think we can drop the last two conversions and just pipe in the original edge list and add/delete list. @nithinmanoj10 any opinions against this approach?

I'll try making the changes to see if there is some dependency I missed.
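
To make the proposal concrete, here is a rough sketch of the shortened flow, in which the preprocessor's JSON output is consumed directly instead of being rebuilt into a GraphObj edge list first. The function name `load_preprocessed_json` and the JSON keys (`base_edge_list`, `updates`) are assumptions for illustration, not the repo's actual API:

```python
import json

def load_preprocessed_json(path: str):
    """Hypothetical loader: read the preprocessor's JSON output directly.

    Assumes the JSON holds the base edge list plus per-timestep
    add/delete lists, so no GraphObj round-trip is needed.
    """
    with open(path) as f:
        data = json.load(f)
    base_edges = data["base_edge_list"]  # assumed key
    updates = data["updates"]            # assumed: list of {"add": [...], "delete": [...]}
    return base_edges, updates

# The add/delete lists can then be handed straight to the GPMA-format
# builder instead of reconstructing an edge list in between.
```

The point is that the add/delete lists already exist in the JSON, so converting back to an edge list only to re-derive add/delete lists is redundant work.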

@nithinmanoj10
Contributor

> Starting with the preprocess script, the current time taken is as follows (note that I've already done some optimization here for loading the file). This is from the sx-stackoverflow dataset, cut off at 20M edges.
>
> Namespace(dataset='sx-stackoverflow', base=10000000, percent_change=2.0, cutoff_time=20000000)
> [CHECKPOINT]::FILE_PARSING_COMPLETED in 31.958895921707153s
> [CHECKPOINT]::PREPROCESS_GRAPH in 661.6636250019073s
> [CHECKPOINT]::JSON_DUMP in 95.33747601509094s
>
> I tried tweaking this a bit, but I think these numbers are acceptable, so I'm moving on.

In which file did you benchmark the preprocessing steps? @JoelMathewC

@JoelMathewC
Contributor Author

I'm running the benchmarks in benchmarking/dataset/preprocessing/preprocess_temporal_data.py. I'll push the changes I made soon; I'm still running a few tests on my end.
