parallel fetching can cause cache conflicts #43

Closed · oconnor663 opened this issue Aug 20, 2014 · 4 comments

@oconnor663 (Member)

Our parallelism uses module object locking to avoid fetching the same module twice. But there's nothing preventing two different modules from using the same URL. Those two modules could get fetched in parallel, and then you have two instances of the git plugin (or whatever) trying to write to the same directory.

We definitely don't want to shove any locking responsibilities down to the plugins. What we should do is create more granular plugin cache directories (instead of one big global one) and use a lock in peru itself to prevent two fetches from touching one cache at the same time.

I'm tempted to use the full hash of a module's fields to name this directory, but we don't want to invalidate a git clone when the user changes rev, for example. We could use the name+type of a module (because a module should definitely get a clean plugin cache if it changes type), but that could still get confused if one module swaps names with another. Maybe the solution is to name/lock the cache dir with a hash of all plugin fields, but also allow plugin.yaml to restrict the list of fields that get hashed. So the git plugin, for example, could say, "Only use my url field for the purposes of plugin caching." Is that too complicated? It might even make sense to make this configuration semi-mandatory, so that plugins that don't specify their cacheable fields get /dev/null as their PERU_PLUGIN_CACHE.

Random upside to all this: we can get rid of the urlencoding that the plugins are doing now.
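
For concreteness, here's a rough sketch of what that keying could look like. Everything in it (the function, the "cacheable fields" list supposedly coming from plugin.yaml, the directory layout) is hypothetical, not something peru does today:

```python
import hashlib
import json
import os

def plugin_cache_dir(cache_root, plugin_type, module_fields, cacheable_fields):
    # `cacheable_fields` would be the field names a plugin declares in
    # plugin.yaml as relevant for caching (e.g. just "url" for git).
    # If the plugin declares nothing, it gets no real cache dir at all.
    if not cacheable_fields:
        return os.devnull
    relevant = {name: module_fields[name]
                for name in cacheable_fields if name in module_fields}
    digest = hashlib.sha1(
        json.dumps([plugin_type, relevant], sort_keys=True).encode('utf8')
    ).hexdigest()
    return os.path.join(cache_root, 'plugins', digest)
```

With something like that, two git modules pointing at the same url would share (and serialize on) one clone, changing rev wouldn't throw the clone away, and there'd be no need to urlencode anything to build a directory name.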

Related: You could have two modules with exactly the same fields. Ideally the second one should be a cache hit. But if they're fetched in parallel, they might both be cache misses, and then they would duplicate work. The solution to this would be to take module locks by cache key, rather than just by module object instance. (This should've been obvious from the beginning, since the read-write that we're protecting is done on that key.) Unlike the plugin issue above, this distinction is just a duplicated-work issue in a weird corner case, rather than a serious correctness issue. But since we already have to do module-level locking (to cover the case where both A and B depend on C), we might as well do it right.

All together, here's what that locking is going to look like. All of this lives in RemoteModule.get_tree, though RemoteModule.reup will probably want to do it too, so hopefully we can share it cleanly.

  1. Take a module lock keyed off of the module's cache key (the hash of all module fields). Think of this as the "don't fetch the same module twice" lock, though it will also handle identical modules with different names.
  2. Check the module cache and exit early if it's a hit.
  3. Take a plugin lock keyed off the relevant-to-plugin-caching fields specified in plugin.yaml. Think of this as the "only one job at a time using a given plugin cache directory" lock. If the plugin hasn't configured these fields, there's no lock here, and we don't provide a cache dir at all.
  4. Take the max-parallel-fetches semaphore. Think of this as the "even though we could run infinity jobs in parallel, let's be sensible and only run 10" semaphore.
  5. Actually shell out to the plugin.
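
In asyncio terms, a minimal sketch of that sequence might look like the following (names like run_plugin_fetch and the cache_key/plugin_cache_key attributes are made up for illustration, not peru's real code):

```python
import asyncio
from collections import defaultdict

module_locks = defaultdict(asyncio.Lock)   # step 1: one lock per module cache key
plugin_locks = defaultdict(asyncio.Lock)   # step 3: one lock per plugin cache key
fetch_semaphore = asyncio.Semaphore(10)    # step 4: cap on concurrent fetches

async def run_plugin_fetch(module):
    ...  # step 5: build the plugin command line and shell out

async def get_tree(module, tree_cache):
    async with module_locks[module.cache_key]:              # 1. module lock
        tree = tree_cache.get(module.cache_key)
        if tree is not None:                                 # 2. cache hit, exit early
            return tree
        async with plugin_locks[module.plugin_cache_key]:   # 3. plugin cache lock
            async with fetch_semaphore:                      # 4. parallelism cap
                tree = await run_plugin_fetch(module)        # 5. shell out
        tree_cache[module.cache_key] = tree
        return tree
```

Identical modules with different names naturally collapse onto the same module lock here, so the second one just waits and then takes the cache hit.
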
oconnor663 added the bug label Aug 20, 2014
@oconnor663 (Member Author)

The plugin cache lock and the parallelism semaphore could probably move down into plugins.py. Then RemoteModule.reup wouldn't need any additional locks, and RemoteModule.get_tree could just concern itself with the module lock.
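
As a sketch of what a hypothetical plugins.py entry point could look like (again, none of these names are real):

```python
# plugins.py owns the plugin cache lock and the parallelism semaphore, so
# both get_tree and reup just call plugin_fetch() and stay lock-free
# themselves, apart from get_tree's module lock.
import asyncio
from collections import defaultdict

_plugin_locks = defaultdict(asyncio.Lock)
_fetch_semaphore = asyncio.Semaphore(10)

async def plugin_fetch(plugin_type, module_fields, dest, plugin_cache_key):
    async with _plugin_locks[plugin_cache_key]:
        async with _fetch_semaphore:
            await _shell_out(plugin_type, module_fields, dest)

async def _shell_out(plugin_type, module_fields, dest):
    ...  # run the plugin executable with the fields and a destination dir
```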

@oconnor663 (Member Author)

One issue with this scheme: the git plugin needs to be able to fetch subrepos. If everything lives in a keyed dir created by peru, where is the plugin supposed to put the subrepo?

@oconnor663 (Member Author)

Note that the previous way of doing this (where the git plugin could predict where caches for other git modules were supposed to live, and write to them) had the benefit of sharing caches on disk, but it still didn't allow fetching submodules in parallel, which is the really important thing. And of course it was unsafe once parallelism was introduced.

@oconnor663 (Member Author)
