Avoid opening all files for reading on windows #74
af94f43 to b6a5cc7
|
Thanks! That looks interesting. I'll play with this branch and get back to you. |
|
@rasmushalland |
b6a5cc7 to 9ea98a6
|
I've been looking into opening files again and have a few observations.

First, it is possible to open the files to get their info in a much cheaper way, seemingly without waking up Windows Defender: don't request read access when opening them, and just ask for FILE_READ_ATTRIBUTES access instead of the default GENERIC_READ access. Opening files that way makes dust check the same HDD in 25 seconds instead of 8 seconds. So it is slower than the 8-second fast path, but nowhere near as slow as the previous way of opening files. The updated PR does that for the non-common case.

By the way, the 8 seconds that I mentioned apply when dust has run recently and Windows has presumably cached stuff from the disk. Otherwise it takes something like a minute. That difference would probably be smaller for an SSD, since it would be able to fetch the NTFS data more quickly.

It is also worth noting that getting the file info by opening the file can fail in cases where the fast approach of getting the size works. At least that is what happens for the swap file, pagefile.sys: it disappeared from the output when I went back to opening the files with FILE_READ_ATTRIBUTES or GENERIC_READ, probably due to lack of permission to open the file. Both techniques therefore need to fall back on the other.

In choosing the main technique, it seems that we are looking at a trade-off between correctness in the face of hard links and some seconds of saved time. I can think of a few compromises if we decide that the seconds that can be saved are worth some amount of correctness in some (rare?) cases:
|
|
Firstly, thanks. I want to point out that I don't have Windows, so I am kind of trusting you that these optimizations work on Windows. One thing we could consider is borrowing the '-s' flag from Linux and repurposing it for Windows. On Linux the default is to show file size based on blocks (chunks of disk), but if the '-s' flag is added the actual file size is shown. I wonder if we should change the code so that the '-s' flag on Windows makes it take the 'exact' path.
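To make the two size modes concrete, here is a minimal sketch, not code from this PR. The function names `file_size` and `allocated_size` and the `apparent` parameter are hypothetical. On Unix the block-based size comes from `MetadataExt::blocks`, which counts 512-byte units; on other platforms the sketch simply assumes no cheap block count is available and returns the file length.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: size of `path` in bytes.
/// `apparent == true` corresponds to the '-s' behaviour (actual file length);
/// otherwise the size is based on allocated blocks where the platform reports them.
pub fn file_size(path: &Path, apparent: bool) -> io::Result<u64> {
    let meta = fs::metadata(path)?;
    if apparent {
        return Ok(meta.len());
    }
    allocated_size(&meta)
}

#[cfg(unix)]
fn allocated_size(meta: &fs::Metadata) -> io::Result<u64> {
    use std::os::unix::fs::MetadataExt;
    // `blocks()` is counted in 512-byte units, independent of the filesystem block size.
    Ok(meta.blocks() * 512)
}

#[cfg(not(unix))]
fn allocated_size(meta: &fs::Metadata) -> io::Result<u64> {
    // Assumption for this sketch: no block count at hand, so use the file length.
    Ok(meta.len())
}
```

Whether '-s' on Windows should select a different code path, as suggested above, is a separate decision; the sketch only shows how the two numbers differ.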
Are you sure that happened, and not that the swap file was deleted in the meantime? I'm not saying you are wrong, I'd just like to check. Anyway, as soon as I fix the build I'd be happy to merge this in and we'll get it shipped in version 0.5.1. |
|
Can you rebase? That should fix the build. Thanks. |
Instead of generating random values for the drive and inode counter on windows, we return None.
We avoid passing FILE_READ_DATA to CreateFile.
It can be very expensive to do that, especially when it causes windows defender to read the files and scan them.
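A minimal sketch of the idea, not the code in this PR: on Windows, `std::os::windows::fs::OpenOptionsExt::access_mode` lets the access mask passed to CreateFile be overridden, so a file can be opened with only FILE_READ_ATTRIBUTES. The helper names `open_for_attributes` and `entry_size` are hypothetical, and the constant is written out by hand rather than taken from a bindings crate. If the open fails (as reported for pagefile.sys), the sketch falls back to the metadata that came with the directory entry; the opposite fallback direction is not shown.

```rust
use std::fs::{self, File};
use std::io;
use std::path::Path;

#[cfg(windows)]
fn open_for_attributes(path: &Path) -> io::Result<File> {
    use std::fs::OpenOptions;
    use std::os::windows::fs::OpenOptionsExt;
    // Access right from the Windows SDK; no FILE_READ_DATA is requested.
    const FILE_READ_ATTRIBUTES: u32 = 0x0080;
    OpenOptions::new()
        .access_mode(FILE_READ_ATTRIBUTES)
        .open(path)
}

#[cfg(not(windows))]
fn open_for_attributes(path: &Path) -> io::Result<File> {
    // Other platforms have no equivalent access mask; a plain open stands in here.
    File::open(path)
}

/// Hypothetical helper: size of a directory entry, preferring handle-based
/// metadata and falling back to the directory entry's own metadata.
pub fn entry_size(entry: &fs::DirEntry) -> io::Result<u64> {
    match open_for_attributes(&entry.path()).and_then(|f| f.metadata()) {
        Ok(meta) => Ok(meta.len()),
        // Opening can fail for lack of permission; the directory listing may still know the size.
        Err(_) => entry.metadata().map(|m| m.len()),
    }
}
```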
Savings of orders of magnitude in time, IO and CPU have been observed on an HDD under Windows 10, with several hundred thousand files taking up some hundreds of GBs:
Consistently opening the file: 30 minutes.
With this optimization: 8 sec.
Hard links: Unresolved. We don't get the inode/file index, so hard links count once for each link. Hopefully they are not too commonly used on windows.
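The effect of a missing file index on the totals can be illustrated with a small sketch. The `Entry` type, its fields and `total_size` are hypothetical and only assume that each file carries an optional (drive, inode) identity: entries with a known identity are counted once, while entries with `None` cannot be recognised as links to the same file and are therefore all counted.

```rust
use std::collections::HashSet;

/// Hypothetical record: a file size plus an optional (drive, inode) identity.
pub struct Entry {
    pub size: u64,
    pub id: Option<(u64, u64)>,
}

/// Sum sizes, skipping files whose identity has already been seen.
pub fn total_size(entries: &[Entry]) -> u64 {
    let mut seen = HashSet::new();
    let mut total = 0;
    for e in entries {
        match e.id {
            // Identity already recorded: this is another link to a counted file.
            Some(id) if !seen.insert(id) => {}
            // First sighting of an identity, or no identity at all: count it.
            _ => total += e.size,
        }
    }
    total
}
```

With two links sharing an identity and two entries without one, only the first pair is deduplicated, which matches the over-counting described above.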