Fix #153: buildout should cache working set environments #402
Conversation
Hi @rafaelbco, thank you for this contribution!
Besides the other comments, I would appreciate it if you could provide timing comparisons (after making the changes to provide copies of working sets from the cache) with a simple and a more complex buildout.

Having a sense of how drastic a difference in timing it makes would help justify changing a piece of code that is working just fine as it is, keeping in mind that features are assets, but each extra line of code is a liability.
```python
def install(self):
    reqs, ws = self.working_set()
    return ()

update = install
```
Were these whitespace changes made because of a pylint suggestion? That's fine, but it's better if they're in their own commit, separate from the commits with functional changes.

Could you please place all whitespace changes like this in a single commit before all the other ones?
And since you'll be rewriting history for this, would you mind squashing the two current commits together?
I agree with the suggestions. But how can I do that? My git-fu is weak. Perhaps close this PR and do a new one?
> I agree with the suggestions. But how can I do that? My git-fu is weak. Perhaps close this PR and do a new one?
It's not necessary to close this PR. As long as this PR is open, whatever is on rafaelbco:master will be what this PR represents, but you'll need to do `git rebase -i upstream/master` to reorder your history and squash commits, then `git push -f` to force-push the rewritten branch on top of the one in your GitHub repository.

This assumes you have a git "remote" called `upstream` pointing to the original buildout repo. Check some tutorials on "git interactive rebase" or "git rewrite history".
In any case it's better to do the history rewriting as the last step, after replying to all change requests with pure additional commits.
But in the end, if after making all changes you're still uncomfortable doing git history rewriting, just give me permission to your repo and I'll rewrite the history for you. The easiest way to do that is by allowing changes to a pull request.
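The workflow above can be rehearsed safely in a throwaway repository first. Here is a sketch (the repository, file names, and commit messages are made up for illustration; `GIT_SEQUENCE_EDITOR` is used to script non-interactively what `git rebase -i` would normally ask for in an editor):

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

# Two commits that we want to squash into one:
echo one > file.txt && git add file.txt && git commit -qm "functional change"
echo two >> file.txt && git add file.txt && git commit -qm "whitespace fixes"

# Non-interactive stand-in for `git rebase -i`: mark the second commit
# as "squash" in the todo list, and accept the combined commit message.
GIT_SEQUENCE_EDITOR='sed -i -e "2s/^pick/squash/"' GIT_EDITOR=true \
    git rebase -i --root

git log --oneline   # history is now a single commit
# In the real PR branch you would then run: git push -f
```

In the real workflow you would rebase onto `upstream/master` instead of `--root`, and the editor would open for you to reorder and mark commits by hand.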
```diff
@@ -107,7 +101,6 @@ def __init__(self, buildout, name, options):
         if self.extra_paths:
             options['extra-paths'] = '\n'.join(self.extra_paths)
```
Same comment as above about separate commits for whitespace changes.
```python
                allow_hosts=allow_hosts)
            _working_set_cache[cache_key] = ws

        return orig_distributions, _working_set_cache[cache_key]
```
This function looks fine as far as it goes. I only see one issue with it, but it's a big one: instances of `pkg_resources.WorkingSet` are mutable (see for example the methods `.add()` and `.add_entry()`).

So, conceivably, recipes inheriting from `zc.recipe.egg:Scripts` (and there are a lot of them) could take one of these working sets from the cache and further call `.add()` or `.add_entry()` on them, polluting the cache and creating the kind of hair-pulling bug that manifests very randomly, depending on:

- the order of execution of parts within a recipe
- the sequence of calls to buildout (e.g. it fails the first time but works the second, or the other way around, without any changes to buildout.cfg in between).

Or it might not cause a failure in the buildout run at all, but might generate scripts with slightly mismatched eggs in `sys.path`, again depending on the order of the parts or the sequence of repeated buildout runs.

The issue of the mutability of `WorkingSet` was mentioned by @djay on the original issue.

So, instead of returning `WorkingSet` instances straight from the cache, please return a `copy.deepcopy()` of them (`WorkingSet` instances define a `.__getstate__()`/`.__setstate__()` pair of methods, so `deepcopy()` should be enough). Something like:

```python
return orig_distributions, copy.deepcopy(_working_set_cache[cache_key])
```
In any case, you should at least add a test for this situation, namely:

1. Call `working_set()` with a given set of parameters.
2. Modify the resulting working set (the second element of the returned tuple) by calling `.add()` or `.add_entry()` on it.
3. Call `working_set()` with the same set of parameters as above.
4. Make sure the resulting `WorkingSet` does not contain the modifications caused by `.add()` or `.add_entry()` above.
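The steps above can be sketched as a self-contained test. This is a hypothetical toy model, not the real `zc.recipe.egg` code: the `WorkingSet` stand-in and `working_set()` function below only mimic the caching-plus-deepcopy scheme under discussion.

```python
import copy

# Toy model of the caching scheme: working_set() resolves requirements,
# caches the result, and returns a deep copy so callers cannot pollute
# the cache by mutating what they get back.
_working_set_cache = {}

class WorkingSet:
    """Minimal stand-in for pkg_resources.WorkingSet (mutable)."""
    def __init__(self, entries=()):
        self.entries = list(entries)

    def add_entry(self, entry):
        self.entries.append(entry)

def working_set(requirements):
    cache_key = tuple(requirements)
    if cache_key not in _working_set_cache:
        _working_set_cache[cache_key] = WorkingSet(requirements)
    return copy.deepcopy(_working_set_cache[cache_key])

# The test scenario from the list above:
ws1 = working_set(["zc.buildout"])
ws1.add_entry("/tmp/polluting.egg")    # mutate the returned copy
ws2 = working_set(["zc.buildout"])     # same parameters, cache hit
assert "/tmp/polluting.egg" not in ws2.entries
```

Without the `copy.deepcopy()` on the return line, the final assertion fails: the mutation would land in the cached object itself.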
Understood, I'll do that. And thanks for the feedback!
""" | ||
orig_distributions = list(distributions) | ||
cache_key = ( | ||
tuple(distributions), |
I wonder if you should sort this tuple of distributions to increase the possibility of a cache hit.
Can the order of distributions affect the generated working set? If not, then the list should definitely be sorted.
> Can the order of distributions affect the generated working set?
That is a good question. In theory there should be no difference, as long as packages don't try to place modules in the same namespace as other packages, and a smart-enough dependency resolver should find a set of dependencies for the requested packages no matter the order in which they are requested.

In practice I think the order matters, because the resolver in `setuptools` is not smart enough, and the specific ordering can mean the difference between a dependency resolution that works and one that doesn't. So leave the `distributions` unsorted for now.
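The trade-off being discussed can be made concrete with a small sketch (the key functions and distribution names are hypothetical, not the PR's actual code):

```python
# A tuple-based cache key is order-sensitive: the same distributions
# requested in a different order miss the cache. Sorting the key would
# turn that into a hit, but only if order never affects resolution.
def cache_key(distributions):
    return tuple(distributions)            # unsorted, as the PR keeps it

def sorted_cache_key(distributions):
    return tuple(sorted(distributions))    # the rejected alternative

a = ["zope.interface", "setuptools"]
b = ["setuptools", "zope.interface"]

assert cache_key(a) != cache_key(b)                # miss: order differs
assert sorted_cache_key(a) == sorted_cache_key(b)  # hit with sorted keys
```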
@leorochael I think I addressed all the issues in the code. Please review again. The timing comparisons are still pending though.

"Allow edits from maintainers" is enabled so feel free to change whatever you want.

Hi Rafael, I've just added a commit (to be squashed later, in case you agree), fixing some formatting and wording. But I just had another idea: The So, what if instead of taking The documentation you added would need to be adjusted not to say it does the same thing as the What do you think?

I think you're right. This way the module-level function would make an even better addition to the API (besides being used for making the caching easier). I'll work on it.

@leorochael I made the changes requested to the

Looks good to me! @jimfulton, any comments?
I wonder if, instead of the module-global `_working_set_cache`, the cache could be on the buildout instance object itself? That way we have isolation of the cache to a particular buildout run.

If I understand what is being cached, then in practice (running buildout on the command line) I wouldn't expect that to make any difference, but for third-party tests (or anything that may invoke buildout APIs programmatically more than once in a single process) there's no (or less) concern about cache pollution/corruption.

(E.g., for this reason, if a zopefoundation project has to use a module-global cache (because there is no scoped argument to hang it on) it tends to include cleanup code like so:

```python
_cache = {}
try:
    from zope.testing import cleanup
except ImportError:
    pass
else:
    # Register the bound method itself; calling `_cache.clear()` here
    # would empty the cache immediately and register None instead.
    cleanup.addCleanUp(_cache.clear)
```
)
> I wonder if instead of the module-global `_working_set_cache`, the cache could be on the buildout instance object itself? That way we have isolation of the cache to a particular buildout run.
@jamadden, good catch! It's always better to avoid global state, but I hadn't figured out a good way to propose a working set cache that would span different parts AND recipe instances.

Hanging the cache off either the buildout instance itself (something like `self.buildout._working_set_cache`) or off the buildout options (e.g. `self.buildout['buildout']['_working_set_cache']`) would work. I think the second one is less "messing with internal attributes of other instances" than the first, but I'm open to alternatives.

This would imply that either the function that caches working sets goes back to being a method of the Egg recipe class, or that it remains module-global but accepts an extra argument: the cache dictionary itself.
@rafaelbco, I hope the repeated request for changes doesn't discourage you.
I think your proposed improvement is nice, and each additional opinion has made it better.
```diff
-    Results are cached in the `_working_set_cache` dict at module level.
+    Results are cached in the mapping provided in the `cache_storage` argument.
+    This defaults to a global dict at this module. If `None` is passed then
+    caching is disabled.
```
Is there a reason to even have the option of a module global dict cache? That seems like a foot-gun to me.
The reason is that I think the `working_set()` function at module level is useful on its own. And if people choose to use it, then it's nice to have caching enabled by default, since building working sets is slow. I think it's better to do this than to put the burden of setting up the cache storage on the callers.

For example, check out this code: https://github.com/plone/plone.recipe.zope2instance/blob/master/src/plone/recipe/zope2instance/__init__.py#L620

I think it would be nice to use the `working_set()` function here, instead of instantiating the `Egg` class.

Remember, caching can be disabled if needed by passing `cache_storage=None`.
> The reason is I think the working_set() function at module level is useful on its own.
> if people choose to use it then it's nice to have caching enabled by default, since building working sets is slow.
> I think it's better to do this than putting the burden of setting up the cache storage in the callers.

Personally, I happen to disagree; I think module-level caches are dangerous by default. Moreover:

> I think it would be nice to use the working_set() function here, instead of instantiating the Egg class.

That seems to disprove the point. That recipe already has an Egg instance, which has the buildout, etc., and so access to the scoped cache. That's been the only way to do it so far, and I don't see a reason to introduce another way to do it ("There should be one-- and preferably only one --obvious way to do it."), especially if that way is potentially dangerous.
So, I see two different and independent issues brought by you here.
First is:
> module level caches are dangerous by default
It's fixable by not providing a default value for the `cache_storage` argument. I don't agree with this change because I think the benefits are greater than the danger.
Second is:
> I don't see a reason to introduce another way to do it.
It's fixable by removing the `working_set()` module-level function from the API, probably making it a non-public method named `_working_set()` or `_build_working_set()`.

I don't like this change because I'd rather use `working_set()` as a function than instantiate the `Egg` class just to call the method. I think the module-level function is a better API.
So, if I'm right, and the new proposed API is better, then we have to solve the problem you mentioned: now we have two ways of doing the same thing. The obvious fix is to remove the `Egg.working_set()` method from the API. However, it seems unwise to break backwards compatibility in a PR about performance. So I propose we violate "There should be one...".
Since:
- I want the PR approved because of the performance gains. The API improvement is a small bonus for me.
- I don't agree with the two changes above. I'd like the code to stay unchanged.
- The changes proposed don't impact the performance gains.
Then I'll kindly ask @leorochael and @jamadden to decide together what changes need to be made, and I'll make them (provided they are within the scope I delimited above). It would be even better if you made the changes yourselves, after agreeing on what needs to be changed.
I made the proposed changes. I chose to store the cache as an attribute of the buildout object. Storing it as an option seemed kind of strange to me. I used
I like having
Thanks!
@rafaelbco, I've done a lot of thinking this last weekend on the problem we're trying to solve and the different options and opinions, and how to solve them not through a compromise that'll leave some or all of us only partially dissatisfied, but through a solution that is better overall.

The functionality we're trying to provide comprises an entry point (currently a module-level function) and some state that exists outside the scope of a single run of this entry point (the cache). The needed lifetime of this state is greater than the lifetime of the run of a single buildout part, so multiple parts can make use of the cache, but it doesn't need to be longer than the lifetime of the buildout run itself, so it doesn't require module-level state.

At this point the remaining reason for a module-level state is simplification of the API, but as @jamadden pointed out, by having (Incidentally,

Now the usually Pythonic way of having "some state that exists outside of the scope of a single function run" and "an entry point for a functionality" (preferably one that is simple to use) is to create a class (say,

And since the needed lifetime of the cache is contained in the lifetime of the buildout run, it seems to me the better place to put the cache is as an instance of the buildout object, instantiated by the buildout instance itself, so that it's simply used by recipes, instead of being instantiated by recipes on demand (it could perhaps even be used by buildout itself to cache its own working sets, but this is massive scope creep).

I'm still weighing whether the added complexity of the above paragraph is worth the effort (it might be if we could make the classes in

But if the additional complexity is not worth it, then I believe the simpler solution here is to get rid of the module-level function and turn it into an internal

I will get around to one of these solutions sometime this week, but if anyone is feeling impatient, I'd be willing to merge the

Regards,
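The class-based direction described in that comment could look roughly like the following. All of these names (`WorkingSetCache`, `working_set_cache`, the toy `Buildout`) are hypothetical illustrations, not the real zc.buildout API:

```python
import copy

class WorkingSetCache:
    """Hypothetical cache object owned by the buildout instance, so its
    lifetime matches a single buildout run rather than the module."""

    def __init__(self):
        self._storage = {}

    def working_set(self, requirements, resolve):
        # `resolve` stands in for the expensive real resolution step.
        key = tuple(requirements)
        if key not in self._storage:
            self._storage[key] = resolve(requirements)
        # Deep copy keeps callers from polluting the cached value.
        return copy.deepcopy(self._storage[key])

class Buildout:
    """Toy stand-in for zc.buildout's Buildout object."""

    def __init__(self):
        self.working_set_cache = WorkingSetCache()

# A recipe would then simply use the cache its buildout already owns:
buildout = Buildout()
calls = []
resolve = lambda reqs: calls.append(reqs) or list(reqs)
ws1 = buildout.working_set_cache.working_set(["a"], resolve)
ws2 = buildout.working_set_cache.working_set(["a"], resolve)
assert len(calls) == 1                 # second call was a cache hit
assert ws1 == ws2 and ws1 is not ws2   # equal contents, distinct objects
```

This keeps the no-global-state property @jamadden asked for while still letting multiple parts and recipe instances share one cache.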
@leorochael and @jamadden It's my first contribution to That said, I changed my mind about what I think needs to be done. Here's my new suggestion:

This won't change the API at all and achieves the goal of the PR, which is performance improvement. So you merge the PR and do a new

What do you think?
@rafaelbco, sounds good to me.

@leorochael Ready to be reviewed again!

@rafaelbco, looks good to me! @jamadden, I expect your concerns have been met, right? @jimfulton, are you ok with merging this?

Absolutely! Thanks @rafaelbco!

I don't have time to look at this now. If @leorochael and @jamadden are good, I'm good. :)

Thanks @rafaelbco!

Thank you guys, it was a pleasure to work with you.

Any plans on releasing to PyPI?

I tried today, but failed to use I'm afraid to try without

A new version of @jimfulton, I don't think I have permission for adding Could you please either create a new release or give me the permission to do so?

@jimfulton, thanks for the PyPI credentials. @rafaelbco, after much wrangling with It seems we're keeping the pace of releasing one every two years ;-)

Thanks @leorochael!
This PR is meant to address the problem discussed here: https://community.plone.org/t/multiple-zope-instances-slow-buildout
Review is much appreciated because I'm new to buildout development (I'm a buildout heavy user, though).