Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Directory and file exists methods could be more efficient #22219

Closed
JeremyKuhne opened this issue Jun 9, 2017 · 25 comments
Closed

Directory and file exists methods could be more efficient #22219

JeremyKuhne opened this issue Jun 9, 2017 · 25 comments
Labels
area-System.IO enhancement Product code improvement that does NOT require public API changes/additions Hackathon Issues picked for Hackathon help wanted [up-for-grabs] Good issue for external contributors tenet-performance Performance related issue
Milestone

Comments

@JeremyKuhne
Copy link
Member

Related to #21678

Exists methods have a try-catch for error cases and they normalize, which is not necessary (at least on Windows). As all error cases return false, we can remove most of this overhead.

https://github.com/dotnet/corefx/blob/master/src/System.IO.FileSystem/src/System/IO/File.cs#L181-L210

The only pre-emptive check we might want to keep from GetFullPath is to return false if the passed in path contains an embedded null.

Maybe have a FileSystem.FileExistsFast? This is the current Windows one:

https://github.com/dotnet/corefx/blob/master/src/System.IO.FileSystem/src/System/IO/File.cs#L181-L210

Unix is already in much better state. I assume that crappy and partial (relative) paths will work.

Tests for bad and relative paths are critical, need to validate existing coverage and expand if necessary.

@kellypleahy
Copy link
Contributor

OK... Now I'm massively confused. The reason I got started on this was #21678 which started from another issue dotnet/corefx#19716 (comment) which started because checking that file exists || directory exists is inefficient. However, looking at this code, it seems to return true for File.Exists when that "file" is a directory.

FileSystem.FileExists does not appear to work that way in the current codebase, nor does FileSystem.DirectoryExists. So, what gives?

@kellypleahy
Copy link
Contributor

@JeremyKuhne
Copy link
Member Author

OK... Now I'm massively confused

Sorry :)

it seems to return true for File.Exists when that "file" is a directory

Shouldn't.

Hmm... maybe that comment is just wrong here:

Looks like it. :)

@kellypleahy
Copy link
Contributor

OK... walking back from the ledge now. Let's see... so.

Are you saying you recommend File.Exists and Directory.Exists directly use FileSystem.Current.FileExists and FileSystem.Current.DirectoryExists as those generally do the right thing and seem like they don't throw exceptions? I forgot, they do seem to require a full path though, right, or is that not generally true? I was wondering about the removing of the trailing \ in DirectoryExists and whether that should really be done or not in FileSystem. Have you seen that bit? I'd be afraid to change that at this point, obviously, as it likely affects a LOT of stuff upstream.

@JeremyKuhne
Copy link
Member Author

This is related to the other (#19717) in that checking for existence period isn't efficient. You have to check twice (#19717) when you don't care if the existing path is a file or directory, which is a pretty common scenario.

Underlying that scenario is this one- if you do care if it is a file or directory, the existing methods do way more work than they need to, notably in Windows. We pre-normalize the path, but the Windows API does that anyway when you get the attributes (which I presume is also the case for Unix). The current FileSystem impl throws, then we catch and turn it to false- which is also inefficient- allocating strings and exceptions then throwing them away. :(

The Unix impl may actually benefit from pre-normalization as it calls multiple APIs. We'd have to validate.

We don't want to change the existing FileSystem.Current.FileExists until we've vetted that nobody cares about the exception state. Simpler to start with a new method and transition a bit at a time I think.

Trimming the trailing separator is still something we'd want to keep.

If you don't want to do this, that's fine, but I'm more than happy to help you through it if you do. I've got enough experience here that I can pretty confidently review your changes. :)

@kellypleahy
Copy link
Contributor

kellypleahy commented Jun 9, 2017

I'd be happy to tackle this one. questions...

It seems like the "common" function I want is something like a (not yet existing) FileSystemEntryType FileSystemEntry.Lookup(string path) with

enum FileSystemEntryType { DoesNotExist, File, Directory, ...? }

that doesn't ever throw on anything but rather just returns DoesNotExist if you don't have privileges to see it, for instance. It also seems like we'd want to use this in CoreCLR for the (not yet existing) Path.Exists. Should I just start with building something like this internal on FileSystem and then figure we'll move it later to CoreCLR as the time permits and api review is completed for Path.Exists and #21678?

Also, it seems like we need to follow symlinks in this one, right? Just want to confirm.

@JeremyKuhne
Copy link
Member Author

DoesNotExist if you don't have privileges to see it, for instance.

I've always been conflicted about how to represent the concept in a way that is easy to understand. CanNotDetermine or perhaps both NotFound and AccessDenied? In my old team I actually needed to know whether it was a rights issue so we could request the information from a higher-privilege process- it might be nice to have that available for a future API (perhaps the Path.Exists, even!).

Should I just start with building something like this internal on FileSystem

That is what I would do- much easier to iterate.

Also, it seems like we need to follow symlinks in this one, right?

Yeah, as that is the default historical (Windows) behavior.

kellypleahy referenced this issue in kellypleahy/corefx Jun 10, 2017
@kellypleahy
Copy link
Contributor

kellypleahy commented Jun 10, 2017

@JeremyKuhne I've taken the first steps in addressing this one. Can you have a look over the commit referenced above (kellypleahy/corefx@1a3842c) and see if this is along the lines you were thinking? My general philosophy was to pass through all errors that were easily captured from the underlying system (for free) to the caller as enum values. The enum values that are consistent between the two OSes are reused - some can only come from one or the other OS (for example, Socket can only happen on *nix while InvalidDrive can only happen on Win32).

I haven't done anything to use this new function yet. I assume we'd use it in File.Exists and Directory.Exists as well as the places where the other ticket (#19717) recommends eliminating the double-check for File.Exists(...) || Directory.Exists(...).

Next steps - I don't know how best to test some of these cases. Some of the tests should be straightforward, but for some of them I need to create a path that has a folder in it that is not accessible to the user running the LookupEntry function. I'm not sure if that's possible with the current test structure we have in corefx. Do you know if that's possible to change user during the test and verify things using that? Other cases require a valid and invalid symlink, as well as a cycle of symlinks.

cc @karelz re: testing above ^

@kellypleahy
Copy link
Contributor

also,... might be good to assign this one to me so nobody else grabs it ;)

@karelz
Copy link
Member

karelz commented Jun 10, 2017

Your wish was granted 😆, it's now assigned to you. (BTW: Thanks for your help!)

Re: Testing - I don't think we have abilities to switch users in test runs. It is very corner-case scenario and our infra was focused on mainline scenarios so far.
I let @JeremyKuhne advice how best to go from here. Maybe having manually-run tests with special runner / prep step (which would create dirs on another account)? Maybe just one-off manual testing?

@JeremyKuhne
Copy link
Member Author

is along the lines you were thinking?

It is a good start. Some more outstanding stuff (to make sure they're in your todo list):

  1. FillAttributeInfo: Remove tryagain. We should only ever make a single attempt at Interop.Kernel32.GetFileAttributesEx, no reason to use FindFirstFile.
  2. Remove the try/catch in File/Directory exists as there should never be an exception after (1)
  3. Remove the normalization (Path.GetFullPath()) and replace with an embedded null check like Path.GetFullPath does.

My general philosophy was to pass through all errors that were easily captured from the underlying system

I think that is fine, as long as we don't surface them in our public API, as they don't align nicely cross-plat. Passing them back internally like this may make it easier to modify/add new API that depends on this helper. I would also make sure that .Directory and .File are set to match what Exists returns- that means either merging the other FileTypes or turning this into a Flags enum. Don't want people accidentally changing the semantics of what a "file" is. I'm currently thinking merging is better than Flags.

Do you know if that's possible to change user during the test and verify things using that?

Afaik we haven't written anything to do this yet. A simpler answer might be to remove file rights (but not ownership), then restore them? We might also be able to find OS files that we can't hit? (Don't know any off the top of my head.)

From this point I'd move towards an official PR for discussion after you make more changes- you can mark it as [WIP] if we're still hashing things out.

@kellypleahy
Copy link
Contributor

From this point I'd move towards an official PR for discussion after you make more changes- you can mark it as [WIP] if we're still hashing things out.

OK cool. How do I "mark it as WIP"? Do I just put [WIP] in the PR title?

@JeremyKuhne
Copy link
Member Author

Do I just put [WIP] in the PR title?

yep

@JeremyKuhne
Copy link
Member Author

@kellypleahy Just a heads up- I've got to make some changes around this area for UAP work. dotnet/corefx#21149 changes the core FillAttributes- I'll drop mentions of any other changes that cross paths with this issue.

@kellypleahy
Copy link
Contributor

kellypleahy commented Jun 16, 2017 via email

@kellypleahy
Copy link
Contributor

I have been very busy lately and haven't been able to do anything additional on this. I'm hoping to get back into it in a week or so. If you deem it important to get started earlier, please feel free to reassign - I don't want to delay anyone who is waiting on this.

@karelz
Copy link
Member

karelz commented Jun 25, 2017

No worries. If someone wants to grab it, they can speak up. It is in Future milestone, which means we won't look at it ourselves anytime soon.

@danmoseley
Copy link
Member

Happy to reassign if you free up...

@danmoseley
Copy link
Member

Costed at 3-4 days if @JeremyKuhne does it. Depends on dotnet/corefx#25539 to add Span overloads first.

@niemyjski
Copy link

I'd really love to know the perf numbers of the File.Create, File.Exists and File.Delete operations cross platform. In our app (Exceptionless), we've noticed better performance on on windows and we are trying to track it down. Currently we are looking to replace all unneeded operations in our storage provider and just catch exceptions. https://github.com/FoundatioFx/Foundatio/blob/master/src/Foundatio/Storage/FolderFileStorage.cs

@JeremyKuhne
Copy link
Member Author

@niemyjski Exists isn't horrible- you'd have to call it an awful lot to get a noticeable hit.

Directory enumeration is the most expensive API we have- a good place to focus your investigation. I recently did a big update for Windows with enumeration perf and I'm looking to provide high-performance extension points cross plat. dotnet/corefx#25873

If you get some measurements of where your app is spending time in System.IO that you think can be improved feel free to open issues and loop me in.

@danmoseley
Copy link
Member

@Anipik this is one you could look at when you have finished other tasks and have free time later. @JeremyKuhne can give more details.

@niemyjski
Copy link

I forgot to post this on here when I was doing profiling work some time ago.

screen shot 2017-12-06 at 8 14 33 am

@JeremyKuhne
Copy link
Member Author

@niemyjski I made significant improvements to FillAttributeInfo. The marshaling involved now uses blittable structs and it only falls back to FindFirstFile in the needed error case. Some of the marshaling changes are in 4.7.2.

https://github.com/dotnet/corefx/blob/master/src/System.IO.FileSystem/src/System/IO/FileSystem.Windows.cs#L191

The key place the existing code is still bad is that it try/catches for non-existence. That is the first thing we should try to remove here. Additionally we can look at removing the normalization (e.g. Path.GetFullPath). The OS doesn't need us to do that work and we've removed almost all of the pre-checks from that code. If we do remove the GetFullPath() we would need to allow for the fact that the p/invokes assume normalization has occurred. That entails trimming trailing periods, spaces, and separators if the path isn't a device path. If it is a device path just the trailing separator has to be removed.

@Anipik Anipik removed their assignment Mar 15, 2018
@JeremyKuhne
Copy link
Member Author

Not as big of an issue as it used to be. Closing.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 22, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.IO enhancement Product code improvement that does NOT require public API changes/additions Hackathon Issues picked for Hackathon help wanted [up-for-grabs] Good issue for external contributors tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

7 participants