Our parallelism uses module object locking to avoid fetching the same module twice. But there's nothing preventing two different modules from using the same URL. Those two modules could get fetched in parallel, and then you have two instances of the git plugin (or whatever) trying to write to the same directory.
We definitely don't want to shove any locking responsibilities down to the plugins. What we should do is create more granular plugin cache directories (instead of one big global one) and use a lock in peru itself to prevent two fetches from touching one cache at the same time. I'm tempted to use the full hash of a module's fields to name this directory, but we don't want to invalidate a git clone when the user changes `rev`, for example. We could use the name+type of a module (because a module should definitely get a clean plugin cache if it changes type), but that could still get confused if one module swaps names with another.

Maybe the solution is to name/lock the cache dir with a hash of all plugin fields, but also allow `plugin.yaml` to restrict the list of fields that get hashed. So the git plugin, for example, could say, "Only use my `url` field for the purposes of plugin caching." Is that too complicated? It might even make sense to make this configuration semi-mandatory, so that plugins that don't specify their cacheable fields get `/dev/null` as their `PERU_PLUGIN_CACHE`. Random upside to all this: we can get rid of the urlencoding that the plugins are doing now.
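A minimal sketch of the field-restricted cache naming described above. The function name, argument shapes, and the idea of passing the cacheable field list in explicitly are all assumptions for illustration, not peru's actual API:

```python
import hashlib
import json


def plugin_cache_key(plugin_type, fields, cacheable_fields):
    """Name a plugin cache dir from only the fields the plugin opted into.

    Because git would declare only "url" as cacheable, changing "rev"
    doesn't invalidate the clone, but changing the url (or the plugin
    type) yields a fresh cache dir.
    """
    relevant = {k: fields[k] for k in cacheable_fields if k in fields}
    # json.dumps with sort_keys gives a deterministic serialization.
    blob = json.dumps([plugin_type, relevant], sort_keys=True)
    return hashlib.sha1(blob.encode()).hexdigest()


key_a = plugin_cache_key(
    "git", {"url": "https://a/b.git", "rev": "master"}, ["url"])
key_b = plugin_cache_key(
    "git", {"url": "https://a/b.git", "rev": "v1.0"}, ["url"])
assert key_a == key_b  # rev changes don't bust the plugin cache
```

Since the directory name is a hex digest rather than the raw fields, the urlencoding the plugins do today becomes unnecessary.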
Related: You could have two modules with exactly the same fields. Ideally the second one should be a cache hit. But if they're fetched in parallel, they might both be cache misses, and then they would duplicate work. The solution to this would be to take module locks by cache key, rather than just by module object instance. (This should've been obvious from the beginning, since the read-write that we're protecting is done on that key.) Unlike the plugin issue above, this distinction is just a duplicated-work issue in a weird corner case, rather than a serious correctness issue. But since we already have to do module-level locking (to cover the case where both A and B depend on C), we might as well do it right.
All together, here's what that locking is going to look like. All of this lives in `RemoteModule.get_tree`, though `RemoteModule.reup` will probably want to do it too, so hopefully we can share it cleanly.

1. Take a module lock keyed off the module's cache key (the hash of all module fields). Think of this as the "don't fetch the same module twice" lock, though it will also handle identical modules with different names.
2. Check the module cache and exit early if it's a hit.
3. Take a plugin lock keyed off the relevant-to-plugin-caching fields specified in `plugin.yaml`. Think of this as the "only one job at a time using a given plugin cache directory" lock. If the plugin hasn't configured these fields, there's no lock here, and we don't provide a cache dir at all.
4. Take the max-parallel-fetches semaphore. Think of this as the "even though we could run infinity jobs in parallel, let's be sensible and only run 10" semaphore.
5. Actually shell out to the plugin.
The plugin cache lock and the parallelism semaphore could probably move down into `plugins.py`. Then `RemoteModule.reup` wouldn't need any additional locks, and `RemoteModule.get_tree` could just concern itself with the module lock.
One issue with this scheme: the git plugin needs to be able to fetch subrepos. If everything lives in a keyed dir created by peru, where is the plugin supposed to put the subrepo?
Note that the previous way of doing this (where the git plugin could predict where caches for other git modules were supposed to live, and write to them) had the benefit of sharing caches on disk, but it still didn't allow fetching submodules in parallel, which is the really important thing. And of course it was unsafe once parallelism was introduced.