-
Notifications
You must be signed in to change notification settings - Fork 107
Make contents API scale
#3609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make contents API scale
#3609
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
webbnh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good; however, I have the usual collection of nits, pointed questions, and suggestions.
lib/pbench/server/cache_manager.py
Outdated
| if target.is_dir(): | ||
| uri = f"{origin}/contents/{link}" | ||
| type = CacheType.DIRECTORY | ||
| append_to = dir_list | ||
| elif target.is_file(): | ||
| uri = f"{origin}/inventory/{link}" | ||
| type = CacheType.FILE | ||
| else: | ||
| uri = f"{origin}/inventory/{relative}" | ||
| type = CacheType.OTHER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused: why is the uri sometimes built from link and sometimes built from relative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For OTHER and BROKEN the URL points to the symlink since we can't (or won't) resolve it. For others, we return the resolved target so they get the real directory or file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I concur on the BROKEN case...but why is that appropriate for the OTHER case? (In the OTHER case, we successfully resolved the symlink to a directory entry accessible within the tarball via a relative path, so won't f"{origin}/inventory/{link}" and f"{origin}/inventory/{relative}" address the same file, with the latter just taking a more circuitous route, if the route is different at all?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's actually a comment for inventory rather than contents. We don't want to expose the real target of an OTHER file (e.g., a fifo or whatever), which is why we use OTHER. Yeah, it's possible that inventory will transparently resolve the link and return the fifo, which I suspect we really don't want. I'm not going to mess with that here.
dbutenhof
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made some of the changes, including some new test links.
I also got an idea this afternoon (which is why I'm working way late today!) about improving the report generator performance by extending my find_dataset changes to an alternate "full discovery" by pulling all the dataset resource IDs and tarball-path values from SQL instead of iterdir-ing through the ARCHIVEs. Turns out that cuts cache discovery from nearly 2 hours to about 25 minutes.
lib/pbench/server/cache_manager.py
Outdated
| if target.is_dir(): | ||
| uri = f"{origin}/contents/{link}" | ||
| type = CacheType.DIRECTORY | ||
| append_to = dir_list | ||
| elif target.is_file(): | ||
| uri = f"{origin}/inventory/{link}" | ||
| type = CacheType.FILE | ||
| else: | ||
| uri = f"{origin}/inventory/{relative}" | ||
| type = CacheType.OTHER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For OTHER and BROKEN the URL points to the symlink since we can't (or won't) resolve it. For others, we return the resolved target so they get the real directory or file.
webbnh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updates look generally good. I've got follow-ups on two of my comments from the first pass (the rest are resolved -- thanks!), and I've got a few thoughts on your revisions. The foremost is that making CacheManager.full_discovery() into a wrapper seems like it causes more problems than it really solves. (The result is good, but the approach has some flaws.) The others are smaller bits, nits, and suggestions.
lib/pbench/server/cache_manager.py
Outdated
| if target.is_dir(): | ||
| uri = f"{origin}/contents/{link}" | ||
| type = CacheType.DIRECTORY | ||
| append_to = dir_list | ||
| elif target.is_file(): | ||
| uri = f"{origin}/inventory/{link}" | ||
| type = CacheType.FILE | ||
| else: | ||
| uri = f"{origin}/inventory/{relative}" | ||
| type = CacheType.OTHER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I concur on the BROKEN case...but why is that appropriate for the OTHER case? (In the OTHER case, we successfully resolved the symlink to a directory entry accessible within the tarball via a relative path, so won't f"{origin}/inventory/{link}" and f"{origin}/inventory/{relative}" address the same file, with the latter just taking a more circuitous route, if the route is different at all?)
PBENCH-1321
The `/datasets/{id}/contents` API includes into several unexpectedly expensive
steps:
1. Finding the tarball (by MD5 value) within the `ARCHIVE` tree using a `glob`
2. Fully discovering all tarballs within the controller directory
3. Unpacking the tarball into a cache directory using `tar`
4. Building a "map" of the contents of the unpacked tarball subtree
This PR includes mitigations for all but the `tar` unpack step:
1. Use the `server.tarball-path` metadata instead of searching the disk
2. Only discover the target tarball rather than the entire controller
3. Skip the "map" and evaluate the actual target path within the cache
Finding a tarball within our 30Tb `ARCHIVE` tree can take many minutes, while
identifying the controller directory from the tarball path takes a fraction of
a second.
Depending on the number of tarballs within a controller (some have many), full
controller discovery has been observed to take half a minute; while populating
only the target tarball takes a fraction of a second.
Building the map for a large tarball tree can take minutes, whereas discovery
of the actual relative file path within the cache runs at native (Python) file
system speeds.
webbnh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dbutenhof
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently I have several responses written earlier this afternoon that didn't get posted. Huh. You can enjoy them at your leisure.
lib/pbench/server/cache_manager.py
Outdated
| if target.is_dir(): | ||
| uri = f"{origin}/contents/{link}" | ||
| type = CacheType.DIRECTORY | ||
| append_to = dir_list | ||
| elif target.is_file(): | ||
| uri = f"{origin}/inventory/{link}" | ||
| type = CacheType.FILE | ||
| else: | ||
| uri = f"{origin}/inventory/{relative}" | ||
| type = CacheType.OTHER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's actually a comment for inventory rather than contents. We don't want to expose the real target of an OTHER file (e.g., a fifo or whatever), which is why we use OTHER. Yeah, it's possible that inventory will transparently resolve the link and return the fifo, which I suspect we really don't want. I'm not going to mess with that here.
PBENCH-1321
The
/datasets/{id}/contentsAPI includes into several unexpectedly expensive steps:ARCHIVEtree using aglobtarThis PR includes mitigations for all but the
tarunpack step:server.tarball-pathmetadata instead of searching the diskFinding a tarball within our 30Tb
ARCHIVEtree can take many minutes, while identifying the controller directory from the tarball path takes a fraction of a second.Depending on the number of tarballs within a controller (some have many), full controller discovery has been observed to take half a minute; while populating only the target tarball takes a fraction of a second.
Building the map for a large tarball tree can take minutes, whereas discovery of the actual relative file path within the cache runs at native (Python) file system speeds.