
Avoid opening all files for reading on windows #74

Merged
bootandy merged 3 commits into bootandy:master from rasmushalland:win-perf
Mar 1, 2020

@rasmushalland
Contributor

Opening every file for reading can be very expensive, especially when it causes Windows Defender to read and scan the files.

Savings of orders of magnitude in time, I/O and CPU have been observed on an HDD under Windows 10, with hundreds of thousands of files taking up some hundreds of GBs:
Consistently opening the file: 30 minutes.
With this optimization: 8 seconds.

Hard links: Unresolved. Without opening the file we don't get the inode/file index, so a hard-linked file is counted once for each link. Hopefully hard links are not too common on Windows.
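
As a rough illustration of the fast path described above, here is a minimal sketch that sums file sizes from directory-entry metadata without opening any file. It uses only the Rust standard library; the function name `dir_size_from_entries` is made up for this example and is not taken from the PR.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sums the sizes of the regular files directly inside `dir`, using only the
/// metadata attached to each directory entry. No file is opened.
/// (Hypothetical helper for illustration, not dust's actual code.)
fn dir_size_from_entries(dir: &Path) -> io::Result<u64> {
    let mut total = 0u64;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // Per the std docs, on Windows `DirEntry::metadata` is cheap because
        // the data comes from the directory listing itself. It does not
        // carry a file index, so hard links cannot be detected this way.
        let meta = entry.metadata()?;
        if meta.is_file() {
            total += meta.len();
        }
    }
    Ok(total)
}
```

The sketch is not recursive and ignores subdirectories; it is only meant to show where the size comes from.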

@rasmushalland rasmushalland force-pushed the win-perf branch 3 times, most recently from af94f43 to b6a5cc7 Compare February 24, 2020 03:45
@bootandy
Owner

Thanks

That looks interesting, I'll play with this branch and get back to you.

Comment thread src/utils/platform.rs Outdated
Comment thread src/utils/platform.rs Outdated
@bootandy bootandy mentioned this pull request Feb 27, 2020
@bootandy
Owner

@rasmushalland
I added some commits to yours, let me know what you think of this:
#76

@rasmushalland
Contributor Author

I've been looking into opening files again, coming up with a few observations:

First, it is possible to open the files to get their info in a much cheaper way, seemingly without waking up Windows Defender, simply by not requesting read access when opening them: just ask for FILE_READ_ATTRIBUTES access instead of the default GENERIC_READ access. Opening files that way makes dust check the same HDD in 25 seconds instead of 8 seconds. So it is slower than not opening the files at all, but nowhere near as slow as the previous way of opening them. The updated PR does that for the uncommon case.
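
A minimal sketch of such an attributes-only open, using the standard library's `OpenOptionsExt::access_mode` on Windows. The helper names `open_for_attributes` and `size_via_handle` are hypothetical, and the non-Windows branch exists only so the sketch compiles everywhere; the code in the PR may well be structured differently.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

/// Value of the Win32 `FILE_READ_ATTRIBUTES` access right.
#[cfg(windows)]
const FILE_READ_ATTRIBUTES: u32 = 0x0080;

/// Opens `path` for metadata queries only (hypothetical helper name).
#[cfg(windows)]
fn open_for_attributes(path: &Path) -> io::Result<File> {
    use std::os::windows::fs::OpenOptionsExt;
    // Ask only for attribute access instead of GENERIC_READ, so the
    // contents of the file are never requested.
    OpenOptions::new()
        .access_mode(FILE_READ_ATTRIBUTES)
        .open(path)
}

/// Non-Windows stand-in so that the sketch builds on other platforms.
#[cfg(not(windows))]
fn open_for_attributes(path: &Path) -> io::Result<File> {
    OpenOptions::new().read(true).open(path)
}

/// Reads the file size through the handle opened above.
fn size_via_handle(path: &Path) -> io::Result<u64> {
    Ok(open_for_attributes(path)?.metadata()?.len())
}
```

Opening directories this way would additionally need `FILE_FLAG_BACKUP_SEMANTICS`, which this sketch leaves out.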

By the way, the 8 seconds that I mentioned apply when dust has run recently and Windows has presumably cached data from the disk. Otherwise it takes something like a minute. That difference would probably be smaller for an SSD, since it would be able to fetch the NTFS data more quickly.

It is also worth noting that getting the file info by opening the file can fail in cases where the fast approach of getting the size works. At least that is what happens for the swap file, pagefile.sys: it disappeared from the output when I went back to opening the files with FILE_READ_ATTRIBUTES or GENERIC_READ, probably due to lack of permission to open the file. Each technique therefore needs to fall back on the other.
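
The fallback idea can be sketched as follows. `first_ok`, `size_by_opening`, `size_from_entry` and `size_with_fallback` are names invented for this example; the plain `File::open` call merely stands in for whichever open-based technique is chosen.

```rust
use std::fs::{self, DirEntry};
use std::io;

/// Runs `primary`; if it fails (for example with "access denied"),
/// runs `secondary` instead and returns its result.
fn first_ok<T>(
    primary: impl FnOnce() -> io::Result<T>,
    secondary: impl FnOnce() -> io::Result<T>,
) -> io::Result<T> {
    match primary() {
        Ok(value) => Ok(value),
        Err(_) => secondary(),
    }
}

/// Size obtained by opening the file (stand-in for the open-based path).
fn size_by_opening(entry: &DirEntry) -> io::Result<u64> {
    Ok(fs::File::open(entry.path())?.metadata()?.len())
}

/// Size obtained from the directory entry alone (the fast path).
fn size_from_entry(entry: &DirEntry) -> io::Result<u64> {
    Ok(entry.metadata()?.len())
}

/// Open-based lookup first, directory-entry lookup as the fallback.
fn size_with_fallback(entry: &DirEntry) -> io::Result<u64> {
    first_ok(|| size_by_opening(entry), || size_from_entry(entry))
}
```

Swapping the two closures gives the opposite order, where the fast path is tried first and the open-based path is the fallback.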

In choosing the main technique it seems that we are looking at a trade-off between correctness in the face of hard links vs some seconds of saved time.

I can think of a few compromises if we decide that the seconds that can be saved are worth some amount of correctness in some (rare?) cases:

  1. Ignore hard links, assuming that they are so rare that it won't really bother anyone.
  2. Introduce a switch, maybe called "exact" or "fast", that changes the behavior between approximate-but-fast and exact-but-not-so-fast.
  3. Make some assumption about the use of hard links: To the extent that they are used on windows, maybe it only really makes a difference in terms of space consumption for large-ish files. With that assumption, we could define some file size threshold below which we use the fast technique, and above which we attempt opening the file. For it to make a difference, the threshold would have to be high enough that a low percentage of files are larger than it. Some moderate number of megabytes, maybe.
  4. Some combination of the above, such as a switch that allows one to specify the threshold size.
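
To make option 3 more concrete, here is a small sketch of a size threshold combined with hard-link de-duplication. Everything in it is hypothetical: the `Technique` enum, the `FileId` alias and the function names are illustrations only, and the threshold value is left to the caller.

```rust
use std::collections::HashSet;

/// (volume id, file index) of a file, or `None` when the fast technique
/// was used and no identity is available.
type FileId = Option<(u64, u64)>;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Technique {
    /// Size from the directory entry; hard links are not detected.
    Fast,
    /// Open the file to learn its identity; slower but exact.
    Exact,
}

/// Picks the technique from the size reported by the directory listing.
fn choose_technique(size_hint: u64, threshold: u64) -> Technique {
    if size_hint >= threshold {
        Technique::Exact
    } else {
        Technique::Fast
    }
}

/// Adds up sizes, counting each known file identity only once.
/// Entries without an identity are always counted.
fn total_size(files: &[(u64, FileId)]) -> u64 {
    let mut seen = HashSet::new();
    let mut total = 0;
    for &(size, id) in files {
        match id {
            // `insert` returns false when the identity was seen before,
            // i.e. this entry is another hard link to a counted file.
            Some(id) if !seen.insert(id) => {}
            _ => total += size,
        }
    }
    total
}
```

With this shape, small files are never opened, and only files at or above the threshold pay for the exact lookup.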

@bootandy
Owner

bootandy commented Feb 29, 2020

Firstly thanks,

I want to point out that I don't have windows so I am kind of trusting you that these optimizations work in windows.

One thing we could consider doing is borrowing the '-s' flag from linux and repurposing it for windows. By default on linux we show file size based on blocks (chunks of disk), but if the -s flag is added we show the actual file size. I wonder if we should change the code so that the '-s' flag on windows selects the 'exact' path.
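
For reference, the block-based versus actual-size distinction can be sketched like this on Unix, using `MetadataExt::blocks` from the standard library. The function names and the `apparent` parameter are assumptions made for this example and do not describe how dust wires up its flag.

```rust
use std::fs::Metadata;

/// Space allocated on disk, derived from the block count.
#[cfg(unix)]
fn size_on_disk(meta: &Metadata) -> u64 {
    use std::os::unix::fs::MetadataExt;
    // `blocks()` reports the allocation in 512-byte units.
    meta.blocks() * 512
}

/// Stand-in for platforms without a block count: use the file length.
#[cfg(not(unix))]
fn size_on_disk(meta: &Metadata) -> u64 {
    meta.len()
}

/// `apparent == true` plays the role of a '-s'-style switch in this sketch.
fn reported_size(meta: &Metadata, apparent: bool) -> u64 {
    if apparent {
        meta.len()
    } else {
        size_on_disk(meta)
    }
}
```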

> getting the file info by opening the file can fail in cases where the fast approach of getting the size works. At least that is what happens for the swap file, pagefile.sys

Are you sure that is what happened, and not that the swap file was deleted in the meantime? I'm not saying you are wrong, I'd just like to check.

Anyway as soon as I fix the build I'd be happy to merge this in and we'll get it shipped in version 0.5.1

@bootandy
Owner

can you rebase - that should fix the build - thanks

rasmushalland and others added 3 commits March 1, 2020 14:41
It can be very expensive to do that, especially when it causes windows defender to read the files and scan them.
Instead of generating random values for the drive and inode counter on
windows we return None instead
We avoid passing FILE_READ_DATA to CreateFile.
@bootandy bootandy merged commit 6bc44de into bootandy:master Mar 1, 2020
@rasmushalland rasmushalland deleted the win-perf branch July 4, 2020 21:12
