
Improving dataset loader and preprocess script #102

Open
JoelMathewC opened this issue Jan 20, 2024 · 4 comments

@JoelMathewC
Contributor

There are some severe speed issues with the preprocess and data loader scripts, and this often makes benchmarking a rather tedious process. I'll work on clearing up the technical debt here (mostly mine 😅).

JoelMathewC self-assigned this Jan 20, 2024
@JoelMathewC
Contributor Author

Starting with the preprocess script, the current time taken is as follows (note that I've already done some optimization here for loading the file). This is from the sx-stackoverflow dataset, cut off at 20M edges.

Namespace(dataset='sx-stackoverflow', base=10000000, percent_change=2.0, cutoff_time=20000000)
[CHECKPOINT]::FILE_PARSING_COMPLETED in 31.958895921707153s
[CHECKPOINT]::PREPROCESS_GRAPH in 661.6636250019073s
[CHECKPOINT]::JSON_DUMP in 95.33747601509094s

I tried tweaking this a bit, but I think these numbers are acceptable, so I'm moving on.
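
For context, the `[CHECKPOINT]` lines above look like simple wall-clock timing between pipeline stages. Here's a minimal sketch of such a timer, assuming a hypothetical `Checkpoint` helper (this is not the script's actual code):

```python
import time

class Checkpoint:
    """Hypothetical helper mirroring the [CHECKPOINT] log lines above;
    the real script may time its stages differently."""

    def __init__(self):
        self._last = time.time()

    def mark(self, label: str) -> None:
        now = time.time()
        print(f"[CHECKPOINT]::{label} in {now - self._last}s")
        self._last = now

# Usage sketch (step names are hypothetical):
# cp = Checkpoint()
# edges = parse_file(path)
# cp.mark("FILE_PARSING_COMPLETED")
# graph = preprocess_graph(edges)
# cp.mark("PREPROCESS_GRAPH")
```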

@JoelMathewC
Contributor Author

Okay, so I have identified a few repeated operations in the data loading pipeline. Right now the flow is:

Dataset (edge list) ----(Preprocessor)--> JsonData (add/delete list) ----(Data loader)--> GraphObj (edge list) ---(GraphObjConstructor)--> GPMA format (add/delete list)

I think we can drop the last two conversions and just pipe in the original edge list and add/delete list. @nithinmanoj10 any opinions against this approach?

I'll try making the changes to see if there is some dependency I missed.
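
To make the proposal concrete, here is a rough sketch of the shortened flow, in which the preprocessor's JSON output is consumed directly instead of being rebuilt into a GraphObj edge list first. The function name `load_preprocessed_json` and the JSON keys (`base_edge_list`, `updates`) are assumptions for illustration, not the repo's actual API:

```python
import json

def load_preprocessed_json(path: str):
    """Hypothetical loader: read the preprocessor's JSON output directly.

    Assumes the JSON holds the base edge list plus per-timestep
    add/delete lists, so no GraphObj round-trip is needed.
    """
    with open(path) as f:
        data = json.load(f)
    base_edges = data["base_edge_list"]  # assumed key
    updates = data["updates"]            # assumed: list of {"add": [...], "delete": [...]}
    return base_edges, updates

# The add/delete lists can then be handed straight to the GPMA-format
# builder instead of reconstructing an edge list in between.
```

The point is that the add/delete lists already exist in the JSON, so converting back to an edge list only to re-derive add/delete lists is redundant work.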

@nithinmanoj10
Contributor

> Starting with the preprocess script, the current time taken is as follows (note that I've already done some optimization here for loading the file). This is from the sx-stackoverflow dataset, cut off at 20M edges.
>
> Namespace(dataset='sx-stackoverflow', base=10000000, percent_change=2.0, cutoff_time=20000000)
> [CHECKPOINT]::FILE_PARSING_COMPLETED in 31.958895921707153s
> [CHECKPOINT]::PREPROCESS_GRAPH in 661.6636250019073s
> [CHECKPOINT]::JSON_DUMP in 95.33747601509094s
>
> I tried tweaking this a bit, but I think these numbers are acceptable, so I'm moving on.

In which file did you benchmark the preprocessing steps? @JoelMathewC

@JoelMathewC
Contributor Author

I'm running the benchmarks in benchmarking/dataset/preprocessing/preprocess_temporal_data.py. I'll push the changes I made soon; I'm still running a few tests on my end.
