Use jwalk instead of walkdir: parallel walking for performance boost #40
Hey thanks for this, I have been considering using Rayon to parallelize this too. Unfortunately it seems there is a bug merging the data on large directories.
compared to du (original dust agrees with du):
I am going to guess that the cumulative sum of file sizes is not being summed up correctly.
I'll investigate further, thanks for your time.
The problem was that jwalk does not take hidden directories into account by default. On a side note, do you want me to add a CLI flag to ignore hidden directories? That could be a nice feature.
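To illustrate why skipping hidden entries changes the total, here is a minimal sketch that sums file sizes with and without dot-prefixed entries. It uses only the standard library rather than jwalk, it is sequential, and it adds up apparent file lengths rather than disk blocks, so its numbers will not match `du` exactly. The names `is_hidden` and `total_size` are hypothetical and not taken from dust or jwalk.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: treat any entry whose name starts with '.' as hidden.
fn is_hidden(path: &Path) -> bool {
    path.file_name()
        .and_then(|name| name.to_str())
        .map(|name| name.starts_with('.'))
        .unwrap_or(false)
}

/// Hypothetical helper: recursively sum apparent file sizes under `dir`.
/// When `include_hidden` is false, hidden files and directories are skipped,
/// which is the behaviour that made the totals disagree with du.
fn total_size(dir: &Path, include_hidden: bool) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        if !include_hidden && is_hidden(&path) {
            continue;
        }
        // DirEntry::metadata does not follow symlinks, so links are not descended into.
        let meta = entry.metadata()?;
        if meta.is_dir() {
            total += total_size(&path, include_hidden)?;
        } else {
            total += meta.len();
        }
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    // Walk the directory given as the first argument, or the current one.
    let root = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    let root = Path::new(&root);
    println!("with hidden:    {} bytes", total_size(root, true)?);
    println!("without hidden: {} bytes", total_size(root, false)?);
    Ok(())
}
```

On a tree with a large `.git` or `.cache` directory, the two totals can differ considerably, which would be consistent with the mismatch reported above.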
You were correct about the hidden directories. Nice.

I have been running your branch vs the existing branch locally and I see very little difference in execution speed. I certainly can't get anywhere near the speed of du, so I'd like to know how you benchmarked that.

Update: I have been able to get slightly better performance than the existing dust by limiting the number of threads created to approximately match my CPU cores. I think the thing to do might be to incorporate something like https://crates.io/crates/num_cpus to ensure we don't spawn too many threads.

I don't think we need a flag to ignore hidden directories (I personally can't see the use case, but if several people ask I'll add it). However, I would like an optional flag to control the aforementioned number of CPU cores we use (personally I don't like it when a tool always grabs all your CPU cores).

I do like the commit: I'll wait a week to see if you want to add anything, and if not I'll merge and extend your work myself.

Cheers,
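A rough sketch of the thread-limiting idea: default to the detected core count, and let an optional user-supplied limit lower it. This uses the standard library's `std::thread::available_parallelism` in place of the `num_cpus` crate mentioned above, and `worker_threads` and its `requested` parameter are hypothetical stand-ins for whatever CLI flag ends up being added.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hypothetical helper: decide how many worker threads to use.
/// `requested` stands in for an optional CLI flag value (assumed, not an existing flag).
/// `available` is the number of cores detected on the machine.
fn worker_threads(requested: Option<usize>, available: usize) -> usize {
    let available = available.max(1);
    match requested {
        // Never go below one thread or above the detected core count.
        Some(n) => n.clamp(1, available),
        // No flag given: match the core count instead of oversubscribing.
        None => available,
    }
}

/// Best-effort core count; falls back to 1 if the platform cannot report it.
fn detected_cores() -> usize {
    thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1)
}

fn main() {
    let cores = detected_cores();
    println!("detected cores: {}", cores);
    println!("default threads: {}", worker_threads(None, cores));
    println!("threads with a limit of 2: {}", worker_threads(Some(2), cores));
}
```

The resulting number could then be handed to whatever thread pool does the walking, for example rayon's `ThreadPoolBuilder::num_threads`. Clamping to the core count is a design choice in this sketch; a real flag might reasonably allow more threads than cores for I/O-bound work.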
The performance surely depends on your hardware: I have an SSD with an 8-core Intel. Maybe an HDD and/or a lower number of cores is the reason you see the difference in speed. For the benchmark used, I did
jwalk uses rayon, which states: I added the
Thanks for your work.
Use a parallel walkdir implementation that is able to fetch more resources at the same time. This should greatly improve performance on directory trees with lots of branching, at the cost of threading. Also make sure to allocate as much space as possible ahead of time to avoid reallocation. Lastly, `remove` on a `Vec` has to shift every later element each time, so prefer the `retain` method for repeated removal.

Some benchmarks on my computer (Galago Pro 3):
Folder 1 (medium, shallow):
dust: 0.055011367s
du -sh: 0.245766881s
Folder 2 (small, shallow):
du -sh: 0.007830246s
dust: 0.025789946s
Folder 3 (large, deep):
dust: 3.421158887s
du -sh: 21.322613990s
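
For the two `Vec` points in the commit message (preallocating, and `retain` instead of repeated `remove`), a small standalone sketch follows. The function names and the `u64` element type are hypothetical and not taken from the dust source. `Vec::remove` shifts every element after the removed index, so calling it in a loop costs one shift per removal, while `retain` filters in a single pass.

```rust
/// Hypothetical helper: build a vector of `n` sizes with the capacity
/// reserved up front, so pushing never triggers a reallocation.
fn collect_sizes(n: usize) -> Vec<u64> {
    let mut sizes = Vec::with_capacity(n);
    for i in 0..n {
        sizes.push(i as u64);
    }
    sizes
}

/// Repeated removal: each `remove` shifts the tail of the vector left by one.
/// Only the first match of each unwanted value is removed.
fn drop_with_remove(mut items: Vec<u64>, unwanted: &[u64]) -> Vec<u64> {
    for value in unwanted {
        if let Some(pos) = items.iter().position(|x| x == value) {
            items.remove(pos);
        }
    }
    items
}

/// Single pass: `retain` keeps only the elements for which the closure is true.
/// Unlike the loop above, this drops every occurrence of an unwanted value.
fn drop_with_retain(mut items: Vec<u64>, unwanted: &[u64]) -> Vec<u64> {
    items.retain(|x| !unwanted.contains(x));
    items
}

fn main() {
    let sizes = collect_sizes(10);
    let unwanted = [2, 5, 7];
    // With unique values, both approaches produce the same result.
    let a = drop_with_remove(sizes.clone(), &unwanted);
    let b = drop_with_retain(sizes, &unwanted);
    println!("remove: {:?}", a);
    println!("retain: {:?}", b);
}
```

With many unwanted entries, a `HashSet` lookup inside the `retain` closure would avoid the linear `contains` scan; the slice is kept here for brevity.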