Fix #153: buildout should cache working set environments #402

Merged
merged 2 commits into buildout:master from rafaelbco:master on Aug 8, 2017

Conversation

rafaelbco
Contributor

This PR is meant to address the problem discussed here: https://community.plone.org/t/multiple-zope-instances-slow-buildout

Review is much appreciated because I'm new to buildout development (I'm a buildout heavy user, though).

Contributor

@leorochael leorochael left a comment

Hi @rafaelbco, thank you for this contribution!

Besides the other comments, I would appreciate if you could provide timing comparisons (after doing the changes for providing copies of working sets in the cache) with a simple and a more complex buildout.

Having a sense of how drastic a difference in timing it makes would help justify changing a piece of code that is working just fine as it is, keeping in mind that features are assets, but each extra line of code is a liability.


    def install(self):
        reqs, ws = self.working_set()
        return ()

    update = install


Contributor

These whitespace changes, were they done because of a pylint suggestion? That's fine, but it's better if they're in their own commit separate from the commits with functional changes.

Could you please place all whitespace changes like this in a single commit before all the other ones?

Contributor

And since you'll be rewriting history for this, would you mind squashing the two current commits together?

Contributor Author

I agree with the suggestions. But how can I do that? My git-fu is weak. Perhaps close this PR and do a new one?

Contributor

I agree with the suggestions. But how can I do that? My git-fu is weak. Perhaps close this PR and do a new one?

It's not necessary to close this PR. As long as this PR is open, whatever is on rafaelbco:master will be what this PR represents. But you'll need to do git rebase -i upstream/master to reorder your history and squash commits, then git push -f to force-push the rewritten branch on top of the one in your GitHub repository.

This assumes you have a git "remote" called upstream pointing to the original buildout repo.

Check some tutorials on git interactive rebase or git rewrite history.

In any case it's better to do the history rewriting as the last step, after replying to all change requests with pure additional commits.

But in the end, if after making all changes you're still uncomfortable doing git history rewriting, just give me permission to your repo and I'll rewrite the history for you. The easiest way to do that is by allowing changes to a pull request.

@@ -107,7 +101,6 @@ def __init__(self, buildout, name, options):
        if self.extra_paths:
            options['extra-paths'] = '\n'.join(self.extra_paths)


Contributor

Same comment as above about putting whitespace changes in their own commit.

            allow_hosts=allow_hosts)
        _working_set_cache[cache_key] = ws

    return orig_distributions, _working_set_cache[cache_key]
Contributor

This function looks fine as far as it goes. I only see one issue with it but it's a big one:

Instances of pkg_resources.WorkingSet() are mutable (see for ex. the methods .add() and .add_entry()).

So, conceivably, recipes inheriting from zc.recipe.egg:Scripts (and there are a lot of them) could take one of these working sets from the cache and further call .add() or .add_entry() on them, polluting the cache and creating the kind of hair-pulling bug that manifests very randomly, depending on:

  • the order of execution of parts within a recipe
  • the sequence of calls to buildout (e.g. it fails the first time but works the second, or the other way around, without any changes to buildout.cfg in between).

Or it might not cause a failure in the buildout run at all, but might generate scripts with slightly mismatching eggs in the sys.path, again depending on the order of the parts or the sequence of execution of repeated buildout runs.

The issue of the mutability of WorkingSet was mentioned by @djay on the original issue

So, instead of straight up returning WorkingSet instances from the cache, please return a copy.deepcopy() of them (WorkingSet instances define a .__getstate__() / .__setstate__() pair of methods, so deepcopy() should be enough). Something like:

return orig_distributions, copy.deepcopy(_working_set_cache[cache_key])

In any case, you should at least add a test for this situation (see the sketch after this list), namely:

  • Call working_set() with a given set of parameters
  • Modify the resulting working set (the second element of the returned tuple) by calling .add() or .add_entry() on it.
  • Call working_set() with the same set of parameters as above.
  • Make sure the resulting WorkingSet does not contain the modifications caused by .add() or .add_entry() above.
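
For example, a minimal sketch of such a test (the import location and the exact parameters of working_set() are assumptions here, shown only as placeholders):

def test_cache_not_polluted_by_callers(params):
    # ``params`` stands in for whatever arguments working_set() takes
    # (distributions, dest, links, index, ...).
    from zc.recipe.egg.egg import working_set  # module location assumed

    reqs, ws = working_set(**params)

    # Mutate the returned working set, as a subclassing recipe might.
    ws.add_entry('/path/to/some/unrelated/egg')

    # A second call with the same parameters should hit the cache...
    reqs2, ws2 = working_set(**params)

    # ...but must not expose the mutation made above.
    assert ws2 is not ws
    assert '/path/to/some/unrelated/egg' not in ws2.entries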

Contributor Author

Understood, I'll do that. And thanks for the feedback!

"""
orig_distributions = list(distributions)
cache_key = (
tuple(distributions),
Contributor

I wonder if you should sort this tuple of distributions to increase the possibility of a cache hit.
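
For illustration, such a key could look something like this (purely hypothetical; as the discussion below concludes, the original order is kept for now):

# Hypothetical: normalize the order of the requested requirement strings so
# that logically identical requests share a cache entry.
cache_key = (
    tuple(sorted(distributions)),
    # ... the remaining key components (paths, index, links, ...) unchanged
)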

Contributor Author

Can the order of distributions affect the generated working set? If not, then the list should definitely be sorted.

Contributor

Can the order of distributions affect the generated working set?

That is a good question. In theory, there should be no difference, as long as packages don't try to place modules in the same namespace as other packages, and a smart-enough dependency resolver should find a set of dependencies for the requested packages no matter the order in which they are requested.

In practice I think the order matters because the resolver in setuptools is not smart enough and the specific ordering can mean the difference between a dependency resolution that works and one that doesn't, so leave the distributions unsorted for now.

@rafaelbco
Contributor Author

@leorochael I think I addressed all the issues in the code. Please review again.

The timing comparisons are still pending though.

@rafaelbco
Contributor Author

"Allow edits from maintainers" is enabled so feel free to change whatever you want.

@leorochael
Contributor

leorochael commented Jul 28, 2017

Hi Rafael,

I've just added a commit (to be squashed later, in case you agree), fixing some formatting and wording.

But I just had another idea: the working_set() top-level function doesn't actually do anything with orig_distributions and extra other than jamming them together and then returning orig_distributions as is.

So, what if instead of taking extras and returning a pair of orig_distributions and a working set, the top level function accepted the list of distributions already concatenated, and returned just the working set? The method version would do the concatenation of orig_distributions and extra and would still return the orig_distributions, ws pair.

The documentation you added would need to be adjusted not to say it does the same thing as the working_set() method, but the top level function itself would be simpler, with a simpler signature and simpler return value and its name would match better what it does.
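
A rough sketch of the proposed split, with simplified signatures (the resolution step and self.eggs are placeholders for the real build code and the recipe's configured requirements):

import pkg_resources


def working_set(distributions):
    """Module level: already-concatenated requirements in, WorkingSet out."""
    # Placeholder for the real resolution via zc.buildout.easy_install
    # (and for the caching added by this PR).
    return pkg_resources.WorkingSet([])


class Eggs(object):  # stand-in for the recipe class

    def working_set(self, extra=()):
        """Keeps the old signature and the (requirements, ws) return value."""
        orig_distributions = list(self.eggs)  # self.eggs: placeholder
        ws = working_set(orig_distributions + list(extra))
        return orig_distributions, ws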

What do you think?

@rafaelbco
Contributor Author

I think you're right. This way the module level function would make an even better addition to the API (besides being used for making the caching easier).

I'll work on it.

@rafaelbco
Contributor Author

@leorochael I made the changes requested to the working_set function and I'm working on the timing comparisons.

@leorochael
Contributor

Looks good to me!

@jimfulton, any comments?

@jamadden
Contributor

jamadden commented Aug 4, 2017

I wonder if instead of the module-global _working_set_cache, the cache could be on the buildout instance object itself? That way we have isolation of the cache to a particular buildout run.

If I understand what is being cached, then in practice (running buildout on the command line) I wouldn't expect that to make any difference, but for third-party tests (or anything that may invoke buildout APIs programmatically more than once in a single process) there's no (or less) concern about cache pollution/corruption.

(E.g., for this reason, if a zopefoundation project has to use a module-global cache (because there is no scoped argument to hang it on) it tends to include cleanup code like so:

_cache = {}
try:
    from zope.testing import cleanup
except ImportError:
    pass
else:
    # Register a callable that clears the cache between test runs.
    cleanup.addCleanUp(_cache.clear)

)

Contributor

@leorochael leorochael left a comment

I wonder if instead of the module-global _working_set_cache, the cache could be on the buildout instance object itself? That way we have isolation of the cache to a particular buildout run.

@jamadden, good catch! It's always better to avoid global state, but I hadn't figured out a good way of proposing a working set cache that would span different parts AND recipe instances.

Hanging the cache off of either the buildout instance itself (something like self.buildout._working_set_cache) or off the buildout options (e.g. self.buildout['buildout']['_working_set_cache']) would work.

I think the second one is less "messing with internal attributes of other instances" than the first, but I'm open to alternatives.

This would imply that either the function that caches working sets goes back to being a method of the Egg recipe class or that it remains module global, but accepts an extra argument: the cache dictionary itself.
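
For concreteness, the first option could be as small as this (the attribute name and helper are only illustrative, not the final code):

# Hypothetical helper, as it might appear on the recipe class:
def _get_working_set_cache(self):
    # Lazily create a per-run cache dict on the buildout object the first
    # time any recipe asks for it.
    cache = getattr(self.buildout, '_working_set_cache', None)
    if cache is None:
        cache = {}
        self.buildout._working_set_cache = cache
    return cache

# The cache would then be handed to the module-level function explicitly,
# e.g. working_set(..., cache_storage=self._get_working_set_cache()).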

@rafaelbco, I hope the repeated request for changes doesn't discourage you.

I think your proposed improvement is nice, and each additional opinion has made it better.

@rafaelbco
Contributor Author

-    Results are cached in the `_working_set_cache` dict at module level.
+    Results are cached in the mapping provided in the `cache_storage` argument.
+    This defaults to a global dict at this module. If `None` is passed then
+    caching is disabled.
Contributor

Is there a reason to even have the option of a module global dict cache? That seems like a foot-gun to me.

Contributor Author

The reason is I think the working_set() function at module level is useful on its own. And if people choose to use it then it's nice to have caching enabled by default, since building working sets is slow.

I think it's better to do this than putting the burden of setting up the cache storage in the callers.

For example, check out this code: https://github.com/plone/plone.recipe.zope2instance/blob/master/src/plone/recipe/zope2instance/__init__.py#L620

I think it would be nice to use the working_set() function here, instead of instantiating the Egg class.

Remember, caching can be disabled if needed, by passing cache_storage=None.

Contributor

The reason is I think the working_set() function at module level is useful on its own.
if people choose to use it then it's nice to have caching enabled by default, since building working sets is slow.

I think it's better to do this than putting the burden of setting up the cache storage in the callers.

Personally, I happen to disagree; I think module-level caches are dangerous by default. Moreover:

I think it would be nice to use the working_set() function here, instead of instantiating the Egg class.

That seems to disprove the point. That recipe already has an Egg instance, which has the buildout, etc, and so access to the scoped cache. That's been the only way to do it so far, and I don't see a reason to introduce another way to do it ("There should be one-- and preferably only one --obvious way to do it."), especially if that way is potentially dangerous.

Contributor Author

@rafaelbco rafaelbco Aug 4, 2017

So, I see two different and independent issues raised by you here.

First is:

module level caches are dangerous by default

It's fixable by not providing a default value for the cache_storage argument. I don't agree with this change because I think the benefits are greater than the danger.

Second is:

I don't see a reason to introduce another way to do it.

It's fixable by removing the working_set() module level function from the API. Probably making it a non-public method named _working_set() or _build_working_set().

I don't like this change because I'd rather use working_set() as a function than instantiate the Egg class just to call the method. I think the module level function is a better API.

So, if I'm right, and the new proposed API is better, then we have to solve the problem you mentioned: now we have two ways of doing the same thing.

The obvious fix is to remove the Egg.working_set() method from the API. However it seems unwise to break backwards compatibility in a PR about performance. So I propose we violate "There should be one...".

Since:

  1. I want the PR approved because of the performance gains. The API improvement is a small bonus for me.
  2. I don't agree with the two changes above. I'd like the code to stay unchanged.
  3. The changes proposed don't impact the performance gains.

Then I'll kindly ask @leorochael and @jamadden to decide together what changes need to be done, and I'll make them (provided they are within the scope I delimited above).

It would be even better if you made the changes yourselves, after agreeing on what needs to be changed.

@rafaelbco
Contributor Author

I made the proposed changes.

I chose to store the cache as an attribute of the buildout object. Storing it as an option seemed kind of strange to me. I used __zc_recipe_egg_working_set_cache__ as the attribute name to avoid name clashes and make debugging easier. The __XXX__ syntax makes it clear it is a special thing.

This would imply that either the function that caches working sets goes back to being a method of the Egg recipe class or that it remains module global, but accepts an extra argument: the cache dictionary itself.

I like having working_set() as a module function with a simpler signature and return value. I think it's useful as a separate function, so I kept the function at module level and added an optional cache_storage argument. If not provided then it uses the global module-level dict as before. You can even pass None to disable caching.
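
In outline, the behaviour described above amounts to something like this sketch (the builder and the bundling of install parameters are placeholders, not the actual diff):

import copy
import pkg_resources

_global_cache = {}  # the module-level default mentioned above


def _build(distributions, install_args):
    # Placeholder for the real egg resolution (zc.buildout.easy_install).
    return pkg_resources.WorkingSet([])


def working_set(distributions, install_args=(), cache_storage=_global_cache):
    # ``install_args`` stands in for the real build parameters.
    if cache_storage is None:
        # Caching explicitly disabled.
        return _build(distributions, install_args)

    key = (tuple(distributions), tuple(install_args))
    if key not in cache_storage:
        cache_storage[key] = _build(distributions, install_args)
    # Hand out a copy so callers cannot pollute the cached instance.
    return copy.deepcopy(cache_storage[key])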

@rafaelbco, I hope the repeated request for changes doesn't discourage you.
I think your proposed improvement is nice, and each additional opinion has made it better.

Thanks!

@leorochael
Contributor

@rafaelbco, I've done a lot of thinking this last weekend on the problem we're trying to solve and the different options and opinions, and how to solve them not through a compromise that'll leave some or all of us only partially dissatisfied, but through a solution that is better overall.

The functionality we're trying to provide consists of an entry point (currently a module-level function) and some state that exists outside of the scope of a single run of this entry point (the cache). The needed lifetime of this state is greater than the lifetime of a single buildout part's run, so multiple parts can make use of the cache, but it doesn't need to be longer than the lifetime of the buildout run itself, so it doesn't require module-level state.

At this point the remaining reason for module-level state is simplification of the API, but as @jamadden pointed out, by having zc.recipe.egg.egg:Eggs make use of it in its .working_set() method, no further changes are needed in any other recipe, so the benefits of a simplified API are restricted to its use inside zc.recipe.egg itself.

(Incidentally, plone.recipe.zope2instance:Recipe is a weird example in that it both inherits from zc.recipe.egg:Eggs (thru zc.recipe.egg.Scripts) and also makes use of an instance of zc.recipe.egg:Eggs (thru zc.recipe.egg.Egg) as an attribute. Only one of these would have been necessary, and again, no changes to it are necessary to make use of the cache, as long as it keeps using the .working_set() method).

Now the usually Pythonic way of having "some state that exists outside of the scope of a single function run" and "an entry point for a functionality" (preferably one that is simple to use) is to create a class (say, WorkingSetCache) that encapsulates both the state and the entry point.

And since the needed lifetime of the cache is contained in the lifetime of the buildout run, it seems to me the better place to put the cache is as an attribute of the buildout object, instantiated by the buildout instance itself, so that it's simply used by recipes instead of being instantiated by recipes on demand (it could perhaps even be used by buildout itself to cache its own working sets, but this is a massive scope creep).
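
As a rough illustration of that shape (the class name and its API here are purely hypothetical):

import copy
import pkg_resources


class WorkingSetCache(object):
    """Hypothetical: created once per buildout run, by buildout itself."""

    def __init__(self):
        self._cache = {}

    def working_set(self, distributions, install_args=()):
        key = (tuple(distributions), tuple(install_args))
        if key not in self._cache:
            # Placeholder for the real resolution of eggs into a working set.
            self._cache[key] = pkg_resources.WorkingSet([])
        return copy.deepcopy(self._cache[key])

# Recipes would then simply use the instance provided by the buildout object,
# e.g. self.buildout.working_set_cache.working_set(eggs) (names illustrative).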

I'm still weighing if the added complexity of the above paragraph is worth the effort (it might be if we could make the classes in zc.recipe.egg much simpler by simply passing all their options into the constructor or methods of the cache instance), but I didn't want @rafaelbco thinking that we just dropped the matter.

But if the additional complexity is not worth it, then I believe the simpler solution here is to get rid of the module level function and turn it into an internal ._working_set() method in the recipe instance.

I will get around to one of these solutions sometime this week, but if anyone is feeling impatient, I'd be willing to merge the ._working_set() internal method solution if we have one other approval (@jimfulton?).

Regards,

@rafaelbco
Contributor Author

@leorochael and @jamadden

It's my first contribution to zc.buildout and I tried to touch as little as possible in order to make the caching work and not break anything. So far I have been able to keep up with the requested changes. The last comment from @leorochael, however, mentions bigger changes that I feel I cannot tackle at the moment.

That said, I changed my mind about what I think needs to be done. Here's my new suggestion:

  • I'll change the code to get rid of the module-level function: it will be turned into a _working_set() method.
  • The cache will be stored in the buildout object, as a dict instantiated by the _working_set() method, named __zc_recipe_egg_working_set_cache__.

This won't change the API at all and achieves the goal of the PR, which is performance improvement.
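
A condensed sketch of that plan (the cache attribute name is the one proposed above; the recipe class and the resolution step are simplified stand-ins):

import copy
import pkg_resources


class Eggs(object):  # stand-in for the real zc.recipe.egg recipe class

    def _working_set(self, distributions, install_args=()):
        # Lazily create the per-run cache on the buildout object itself.
        cache = getattr(self.buildout, '__zc_recipe_egg_working_set_cache__', None)
        if cache is None:
            cache = {}
            setattr(self.buildout, '__zc_recipe_egg_working_set_cache__', cache)

        key = (tuple(distributions), tuple(install_args))
        if key not in cache:
            # Placeholder for the real egg resolution/installation.
            cache[key] = pkg_resources.WorkingSet([])

        # Return a copy so callers cannot mutate the cached instance.
        return copy.deepcopy(cache[key])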

So you merge the PR and do a new zc.recipe.egg release. After that you can improve the code, making the more complex proposed changes, outside the scope of this PR.

What do you think?

@leorochael
Contributor

  • I'll change the code to get rid of the module-level function: it will be turned into a _working_set() method.
  • The cache will be stored in the buildout object, as a dict instantiated by the _working_set() method, named __zc_recipe_egg_working_set_cache__.

@rafaelbco, sounds good to me.

@rafaelbco
Contributor Author

@leorochael Ready to be reviewed again!

@leorochael
Contributor

@leorochael Ready to be reviewed again!

@rafaelbco, looks good to me!

@jamadden, I expect your concerns have been met, right?

@jimfulton, are you OK with merging this?

@jamadden
Contributor

jamadden commented Aug 8, 2017

@jamadden, I expect your concerns have been met, right?

Absolutely! Thanks @rafaelbco!

@jimfulton
Member

I don't have time to look at this now. If @leorochael and @jamadden are good I'm good. :)

@leorochael leorochael merged commit 8821903 into buildout:master Aug 8, 2017
@jimfulton
Member

Thanks @rafaelbco!

@rafaelbco
Contributor Author

Thank you guys, it was a pleasure working with you.

@rafaelbco
Contributor Author

Any plans on releasing on pypi?

@leorochael
Contributor

Any plans on releasing on pypi?

I tried today, but failed to use zest.releaser while trying to keep different sets of tags for zc.buildout and zc.recipe.egg due to a ConfigParser issue.

I'm afraid to try without zest.releaser and mess up the release for now. If anyone else wants to have a go, feel free.

@leorochael
Contributor

A new version of zest.releaser now gives better support for specifying the tag format.

@jimfulton, I don't think I have permission for adding zc.buildout or zc.recipe.egg releases to PyPI.

Could you please either create a new release or give me the permission to do so?

@leorochael
Contributor

@jimfulton, thanks for the PyPI credentials.

@rafaelbco, after much wrangling with zest.releaser (including conjuring a new bugfix for it, and being forced to run it from python3), I managed to upload a new release of zc.recipe.egg.

It seems we're keeping the pace of releasing one every two years ;-)

@jimfulton
Member

Thanks @leorochael !

@rafaelbco
Contributor Author

rafaelbco commented Aug 17, 2017 via email
