Cataloging root dir takes a very long time #119

Closed
wagoodman opened this issue Aug 4, 2020 · 3 comments
Labels
bug Something isn't working

Comments

@wagoodman
Contributor

It seems that go run main.go dir:/// hangs (even as root). If this operation is not possible we should have better messaging here (though I don't see why this shouldn't work).

@wagoodman wagoodman added the bug Something isn't working label Aug 4, 2020
@wagoodman wagoodman modified the milestone: v0.1.0 Aug 4, 2020
@alfredodeza
Contributor

I debugged this, and it isn't that it hangs... it does work. The problem is that it is trying to catalog a whole device, going file by file in the system. On OSX I ran it with dtrace and saw it was going through all files (as expected):

90589/0x130b338:  lstat64("/Applications/Pages.app/Contents/Resources/FunctionHelp.bundle/Contents/Resources/nl.lproj/ffa5c004fb.html\0", 0xC01A5A5218, 0x0)		 = 0 0

It doesn't seem to me that this is a problem.

@luhring
Contributor

luhring commented Oct 19, 2020

I've been looking into this more closely. @alfredodeza is right: the issue is that we're performing operations that take a very long time, and the cost grows with every additional nested directory, which makes scanning / the worst-case scenario.

The problem

Right now, our DirectoryResolver uses the doublestar.Glob function (from https://github.com/bmatcuk/doublestar). This function is inefficient, particularly for how we're using it here. Starting with the basedir, it recursively searches all files in all directories, one directory at a time, until it has found the complete set of matches. It also keeps each directory's file handle open until it has finished scanning everything within that directory. (I've opened a PR in doublestar for this latter issue: bmatcuk/doublestar#47.)

To make matters worse, we call doublestar.Glob on a per-cataloger basis. So if the average run time of Glob is 1 second, and we have 8 catalogers, the total run time of cataloging is (at least) 8 seconds.
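A rough sketch of why the per-cataloger calls add up, assuming each call walks the whole tree. globWalk is a hypothetical stand-in for a glob call (it only matches base names with path.Match, which is a simplification), and the patterns and in-memory file system are made up for illustration.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// globWalk is a hypothetical stand-in for one per-cataloger glob call.
// It walks the whole tree, matches each file's base name against pattern,
// and reports how many entries (files and directories) it visited.
func globWalk(fsys fs.FS, pattern string) (matches []string, visited int, err error) {
	err = fs.WalkDir(fsys, ".", func(p string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		visited++
		if d.IsDir() {
			return nil
		}
		ok, matchErr := path.Match(pattern, path.Base(p))
		if matchErr != nil {
			return matchErr
		}
		if ok {
			matches = append(matches, p)
		}
		return nil
	})
	return matches, visited, err
}

// totalVisits runs one full walk per pattern, the way one glob call per
// cataloger would, and sums the entries visited across all walks.
func totalVisits(fsys fs.FS, patterns []string) (int, error) {
	total := 0
	for _, pat := range patterns {
		_, visited, err := globWalk(fsys, pat)
		if err != nil {
			return 0, err
		}
		total += visited
	}
	return total, nil
}

// sampleFS is an illustrative in-memory file system, not real syft input.
func sampleFS() fstest.MapFS {
	return fstest.MapFS{
		"app/package.json":     {Data: []byte("{}")},
		"app/lib/Gemfile.lock": {Data: []byte("")},
		"usr/lib/a.jar":        {Data: []byte("")},
	}
}

func main() {
	patterns := []string{"package.json", "Gemfile.lock", "*.jar"}
	total, err := totalVisits(sampleFS(), patterns)
	if err != nil {
		panic(err)
	}
	fmt.Println("entries visited across all patterns:", total)
}
```

The total grows linearly with the number of patterns, because every pattern repeats the same traversal.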

A possible solution

Because we need to search for files multiple times within the cataloging process, I'd suggest we mimic how we catalog container images: start by building an in-memory file tree, and then use the tree for searches, rather than going back to the disk for every search.

For this approach, we could use stereoscope's tree implementation, as well as its FilesByGlob implementation, and we'd end up not using doublestar any more. We'd pay the performance cost of the initial tree buildout, which is at worst the same as the cost of a single doublestar.Glob call, and then we'd reap the benefit of super cheap memory accesses as we perform our searches.

@luhring luhring changed the title from "Cannot catalog root dir" to "Cataloging root dir takes a very long time" Oct 19, 2020
@wagoodman
Contributor Author

This has been fixed in #442
