
Use jwalk instead of walkdir: parallel walking for performance boost #40

Merged
merged 6 commits into bootandy:master from AdminXVII:parallel-walk on Dec 4, 2019

Conversation

AdminXVII
Contributor

Use a parallel walkdir implementation that can fetch more resources at the same time. This should greatly improve performance on directory trees with lots of branching, at the cost of threading. Also, allocate enough space ahead of time wherever possible to avoid reallocation. Lastly, remove on a Vec shifts the remaining elements on every call, so prefer the retain method for repeated removal.
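As a minimal sketch of that last point (illustrative Rust, not the exact diff from this PR):

    fn main() {
        // Filtering by repeatedly calling remove() shifts the tail of the
        // Vec on every removal, which is O(n^2) in the worst case.
        let mut sizes: Vec<u64> = vec![4096, 0, 512, 0, 8192];
        let mut i = 0;
        while i < sizes.len() {
            if sizes[i] == 0 {
                sizes.remove(i); // shifts every later element one slot left
            } else {
                i += 1;
            }
        }
        assert_eq!(sizes, vec![4096, 512, 8192]);

        // retain() does the same filtering in a single O(n) pass.
        let mut sizes: Vec<u64> = vec![4096, 0, 512, 0, 8192];
        sizes.retain(|&s| s != 0);
        assert_eq!(sizes, vec![4096, 512, 8192]);
    }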

Some benchmark on my computer (galago pro 3):

Folder 1 (medium, shallow):
dust: 0.055011367s
du -sh: 0.245766881s

Folder 2 (small, shallow):
du -sh: 0.007830246s
dust: 0.025789946s

Folder 3 (large, deep):
dust: 3.421158887s
du -sh: 21.322613990s

@bootandy
Owner

bootandy commented Nov 25, 2019

Hey thanks for this,

I have been considering using Rayon to parallelize this too.

Unfortunately, it seems there is a bug merging the data on large directories.
Here it is running on one of my folders:

$ time target/release/dust /home/andy/dev/python/problems                     [23:54:01]
 137M ─┬ problems
  85M  ├─┬ hackerrank

compared to du (original dust agrees with du):

$ du -d 1 -h  /home/andy/dev/python/problems                                  [23:53:55]
15G	/home/andy/dev/python/problems

I am going to guess that the cumulative file sizes are not being summed correctly.

@AdminXVII
Contributor Author

I'll investigate further; thanks for your time.

@AdminXVII
Contributor Author

The problem was that jwalk does not take hidden directories into account by default.

On a side note, do you want me to add a CLI flag to ignore hidden directories? That could be a nice feature.
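For reference, a minimal sketch of the fix (assuming jwalk's builder API; illustrative, not the exact change in this PR):

    use jwalk::WalkDir;

    fn main() {
        // jwalk skips hidden entries by default; opt out so dot-directories
        // are walked and counted like everything else.
        for entry in WalkDir::new(".").skip_hidden(false) {
            if let Ok(entry) = entry {
                println!("{}", entry.path().display());
            }
        }
    }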

@bootandy
Owner

bootandy commented Dec 3, 2019

You were correct about the hidden directories. Nice.

I have been running your branch vs the existing branch locally, and I see very little difference in execution speed. I certainly can't get anywhere near the speed of du, so I'd like to know how you benchmarked that.

Update: I have been able to get slightly better performance than the existing dust by limiting the number of threads created to approximately match my CPU cores. I think the thing to do might be to incorporate something like https://crates.io/crates/num_cpus to ensure we don't spawn too many threads, as sketched below.
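As a sketch of that idea (illustrative, not code from this PR), num_cpus reports both logical and physical core counts:

    fn main() {
        // A walker could cap its worker threads at one of these counts.
        println!("logical cores:  {}", num_cpus::get());
        println!("physical cores: {}", num_cpus::get_physical());
    }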

I don't think we need a flag to ignore hidden directories (I personally can't see the use case, but if several people ask, I'll add it). However, I would like an optional flag to control the aforementioned number of CPU cores we use (personally, I don't like it when a tool always grabs all your CPU cores).

I do like the commit "Use more rusty patterns and preallocate enough space"; I think that improves things.

I'll wait a week to see if you want to add anything, and if not, I'll merge and extend your work myself.

Cheers,

@AdminXVII
Contributor Author

AdminXVII commented Dec 3, 2019

The performance surely depends on your hardware; I have an SSD and an 8-core Intel CPU. Maybe an HDD and/or a lower core count is the reason you see a difference in speed.

For the benchmark, I did:

[parallel-walk][~/dev/dust]$ cargo run --release # don't benchmark cargo
[parallel-walk][~/dev/dust]$ sync; echo 3 > /proc/sys/vm/drop_caches
[parallel-walk][~/dev/dust]$ time ./target/release/dust ..
  57G ─┬ ..
  15G  ├─┬ redox
 5.7G  │ ├─┬ cookbook
 4.9G  │ │ └── recipes
 3.0G  │ └─┬ .git
 3.0G  │   └─┬ modules
 2.9G  │     └── rust
 6.1G  ├─┬ shellac-server
 5.4G  │ └─┬ target
 3.0G  │   └── debug
 6.1G  ├─┬ robotique
 6.1G  │ └─┬ target
 4.2G  │   └─┬ debug
 3.2G  │     └── deps
 5.8G  ├─┬ iced
 5.4G  │ └─┬ target
 4.7G  │   └─┬ debug
 2.7G  │     └── deps
 3.3G  ├─┬ planet-gen
 3.3G  │ └── target
 2.6G  └── ion
real    5.535719165s
[parallel-walk][~/dev/dust]$ sync; echo 3 > /proc/sys/vm/drop_caches
[parallel-walk][~/dev/dust]$ time du -sh ..
58G	..
real    21.578478589s

jwalk uses rayon, which states:
"By default, Rayon uses the same number of threads as the number of CPUs available. Note that on systems with hyperthreading enabled this equals the number of logical cores and not the physical ones."
So theoretically, the number of threads should already match the number of logical cores.

I added a threads option to the CLI to specify the number of threads to use. It defaults to the number of logical cores.
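A sketch of how such a flag could be wired up (the threads value here is hypothetical CLI input; this assumes jwalk draws from rayon's global pool when not given an explicit count):

    fn init_threads(threads: usize) {
        // Cap rayon's global thread pool before any walking starts.
        // threads == 0 falls back to rayon's default of one thread
        // per logical core.
        if threads > 0 {
            rayon::ThreadPoolBuilder::new()
                .num_threads(threads)
                .build_global()
                .expect("rayon global pool was already initialized");
        }
    }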

@bootandy bootandy merged commit 79416fd into bootandy:master Dec 4, 2019
@bootandy
Owner

bootandy commented Dec 4, 2019

Thanks for your work.
