OEP-0018: Python Dependency Management #56

Merged (1 commit into master, May 2, 2018)

Conversation

jmbowman (Contributor):

Recommendations for managing Python package dependencies which should help minimize problems over time. The key technology choice involved is pip-tools, which we already use in many of our repositories. The current draft specifies that we should use a service to auto-generate PRs with requirement updates, but stops short of recommending which one to use (largely because we're still in the process of finalizing that decision, and the choice may change over time anyway).

* An attempt to pin dependencies in setup.py (or parse its dependencies
automatically from a requirements file) forced us to change that package
before we could upgrade one of those dependencies in a downstream
repository.


Regarding this bullet, perhaps you can say "downstream repository, and in another case an upstream repository" given that openedx/edx-drf-extensions#26 was done for edx-platform.

contexts. For example:

* Just the standard set of core dependencies for execution on a production
server to perform its primary purpose (``base.in``)
Contributor:


You have not explained what "base.in" is, or why you are mentioning it here. Maybe a paragraph at the top giving the overview ("We'll use pip-tools to manage multiple contexts."), then you can tie these filenames into that.

Contributor Author:


Yeah, I was hoping to avoid front-loading too much description of the overall process, but there probably does need to be some advance mention of what these names are for.
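To illustrate what those names refer to (a hypothetical layout; the exact set of files varies by repository):

    requirements/
        base.in        # core dependencies needed in all contexts
        test.in        # additions needed to run the test suite
        doc.in         # additions needed to build the documentation

pip-tools then compiles each ``.in`` file into a fully pinned ``.txt`` file of the same name.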

`cookiecutter-django-app`_ repository.
* Each listed dependency should have a brief end-of-line comment explaining
its primary purpose(s) in this context. These comments typically start at
the 27th character, but this is just a convention for consistency with files
Contributor:


27th!? We live in a 4-space-indent world, and pip-compile goes to the 27th character? :)

Contributor:


About the purpose: wouldn't this often be obvious or trivial? It sounds kind of like requirement comments on every line of a program in an intro programming course: they don't end up being useful.

Contributor Author:


Most of those leading 26 characters are often the package name and optional version constraint. pip-tools lines up the start of the comments for ease of reading, I figured it just makes sense to match that for consistency.

Unlike imports in a Python file, where you can see later in the file what they're used for, seeing something like "singledispatch" in a requirements file tells you very little about why it's there. The main reason for adding the comments is to make it easier to identify requirements that are no longer needed, which is tough if you're not even sure what half of them do. I'm thinking of things like "celery task broker" for redis and "Used in shoppingcart for extracting/parsing pdf text" for pdfminer; the reason for including it as a dependency, not the overview from the package metadata. I've typically found it very tough to prune out obsolete dependencies without this kind of comment in the file, but I'm open to arguments against it.
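For example, along the lines of (version numbers here are hypothetical):

    redis==2.10.6             # celery task broker
    pdfminer==20140328        # Used in shoppingcart for extracting/parsing pdf text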

Contributor:


It seems like something that would easily get out of date. What would the comment for "singledispatch" be? "# So that I can do single dispatch"? If I add it, and put "# used in the foobar manager" in the requirements file, how will that comment get updated if it's also used elsewhere, or if it's no longer used in its original spot? Wouldn't grepping the source for imports be a better way to understand whether and how packages are being used?

Contributor Author:


The way I've done this before is that I put the main thing the package is currently used for in there. If I later notice when scanning the file that the listed use is no longer relevant, I look to see if it's obsolete or still used somewhere else. If the former, I delete it; if the latter, I update the comment with the current usage info. If I have to hunt down all uses of every single package I don't recognize to see if any of them are obsolete, I'm basically never going to bother...it just takes too long.

Contributor:


About the indentation: looking at the .in files in the edx-platform pull request, on a few lines, 26 isn't enough, and the alignment is bumped out. I'd choose a larger number that is a multiple of four, and let pip-tools do what it wants. In some ways, it's better to have a clear distinction between the tool-generated and human-generated comments anyway.

Contributor:


About the comments themselves:

lxml==3.8.0               # XML parser

I don't find this useful. I don't know how to get the useful comments without requiring people to add useless ones like this.

Contributor Author:


I struggled a bit with the edx-sandbox/shared.in file comments because it seems like a grab bag of "make this stuff available to instructors in case they want to use it". So yeah, those ended up being pretty generic, but I think that's more the exception than the rule.

Contributor:


I guess I am skeptical that these comments will be accurate and specific enough to be useful in the long run. But we can give it a try. At least requiring a comment will slow people down from adding too many dependencies! :)

``-r base.txt``) and explain in an end-of-line comment why that set of
dependencies is also needed in this context. Note that the ``pip-compile``
output file should be included rather than the looser requirements file it
was generated from, as this ensures that the same versions of packages are
Contributor:


I'm confused by the "include the output file rather than the requirements file." An example would help.

Contributor Author:


Yeah, this is getting kind of long and hard to follow for a bullet point. I'll see if I can clarify it a bit.
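A sketch of what I mean, using a hypothetical ``test.in``:

    # requirements/test.in
    -r base.txt               # the compiled base pins, not base.in
    pytest                    # test runner

Including ``base.txt`` rather than ``base.in`` is what guarantees the test run sees the exact versions pinned for production.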

was generated from, as this ensures that the same versions of packages are
installed in the different contexts. We don't want to run tests with
different versions of dependencies than we use in production, for example.
* Avoid direct links to packages local directories, GitHub, or other version
Contributor:


s/packages/packages'/ ?

Contributor Author:


I meant to write "...packages in local directories, ..."; fixing.

``requirements/base.in`` with a simple Python function declared in
``setup.py`` itself, but for packages with few base dependencies it may be
better to just list them in both places. Just add comment reminders in both
places to also update the other dependency listing when making any changes.
Contributor:


As big a fan as I am of not overengineering, it seems like better advice to always use the requirements files if they exist. "Few base dependencies" doesn't seem like a good reason to duplicate the information, especially since adding the reminder comments will lose the terseness benefits of the few dependencies.

Contributor Author:


Yeah, just not thrilled about copying another function into all setup.py files. Better than the alternatives, I suppose.

installing a package from a URL is appropriate, see the notes below on
`Installing Dependencies from URLs`_

If the repository contains a Python package, base dependencies also need to
Contributor:


What else could a repository contain? IDAs don't have setup.py I guess? Maybe better is: "if the repository contains a setup.py"

Contributor Author:


Sure, how about: "If the repository contains a setup.py file defining a Python package, the
base dependencies also need to be specified there."

the packages listed in the given requirements files, installing, upgrading,
and uninstalling packages as needed.

Open edX packages use an ``upgrade`` make target to use ``pip-compile`` to
Contributor:


Is this, "do use," or "should use"?

Contributor Author:


To be technical, "many already do use, and the rest probably should use".
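For reference, a minimal sketch of such a target (file names are illustrative; the exact flags vary by repository, and recipe lines must be tab-indented in a real Makefile):

    upgrade:  ## re-compile the requirements/*.txt pins from requirements/*.in
    	pip-compile --upgrade -o requirements/base.txt requirements/base.in
    	pip-compile --upgrade -o requirements/test.txt requirements/test.in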

`AllMyChanges.com`_ can make this easier.

.. _requires.io: https://requires.io/
.. _AllMyChanges.com: https://allmychanges.com/
Contributor:


Not sure why, but AllMyChanges doesn't know about the last five releases of coverage.py: https://allmychanges.com/p/python/coverage/

Contributor Author:


It looks like a number of packages haven't been getting updated for a year or so; the issue was reported 6 days ago: https://github.com/AllMyChanges/allmychanges.com/issues/65 . Do you think it's still worth mentioning here, or should we pull it until that gets resolved? It's missing the last several Django releases as well. requires.io is current on these, but missing changelog links for several packages that are included on AllMyChanges.com.

Contributor:


Maybe just a sentence like, "At the time of writing, these services were still in development. Carefully evaluate which is currently reliable."


Interesting. Does requires.io automatically pull in GitHub's release history? From what I can tell, it looks like it needs a CHANGES.txt file or similar.

Contributor:


These services parse CHANGES.txt (or some other file). GitHub histories are not good sources of changes, they are too noisy.


Releases are not noisy at all in general: https://github.com/jgm/pandoc/releases, https://github.com/tqdm/tqdm/releases, etc.

Contributor:


Oh, sorry, I misread your comment as "commit history." I don't know if they use GitHub releases. Frankly, I hope they don't. A text file in the repo is a low-tech way to make the information available to everyone, regardless of how the source is stored or transmitted. Unless I've misunderstood something, GitHub releases are a proprietary feature that separates information from the source tree.


I agree changes.txt should take precedence, but where none is available it would be nice to fall back to things like GitHub release notes. It's convenient to use GitHub's release system to link to issues, upload binaries, etc. Anyway, this is probably a discussion not meant for here. I was only asking if anyone knows of good crawlers of https://api.github.com/repos/{REPO}/releases, not whether this is a good idea :P

Contributor Author:


The requires.io changelog parsing, at least as of when it was first introduced, is summarized in this blog post. At least at the time, it only looked in PyPI and for appropriately named files in the repo, not at the GitHub releases list or commit history.

There's also the changelog parser for Gemnasium at https://github.com/tech-angels/vandamme , it seems somewhat more comprehensive but I haven't looked specifically to see what it scrapes from GitHub.


``scripts/post-pip-compile.sh``.

If you do need to include a package at a URL, it should be editable (start with
"-e ") and have both the package name and version specified (end with
Contributor:


The "should be editable" advice is contrary to our current instructions. You might want to include here, "(because pip-tools doesn't otherwise support URL installations)" or something.

Contributor Author:


Sure, I'll reference the bugs noted above.

nedbat commented Apr 3, 2018

Something I still don't understand after reading this (maybe because I was too focused on paragraphs, and not enough on the big picture): who is supposed to run "make upgrade" to update the version numbers in the .txt files, and when are they supposed to do it?

installed.
2. For each of these contexts, declare the direct dependencies that will be
needed. Use the least restrictive constraints which should allow pip to
install a working set of dependencies.
Contributor:


I think I missed a key point here: our philosophy now is to not explicitly pin versions, and instead to let pip-tools pin them for us. This is a big shift from how we used to do things, and needs to be highlighted much more strongly.

Contributor Author:


Sure, rewording to try to make things like that more clear.
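For instance, the contrast in practice looks something like this (entries hypothetical): the ``.in`` file stays loose, and the compiled ``.txt`` file carries the exact pins.

    # base.in -- human-edited, mostly unpinned
    Django>=1.11,<2.0         # constrain only where we know we must
    djangorestframework       # let pip-compile choose and pin the version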

jmbowman commented Apr 3, 2018

@nedbat The section on automating updates recommends using a service to auto-generate PRs which run make upgrade when appropriate, and having a designated maintainer for each repo to keep tabs on those and merge, fix, or reject as appropriate. This could probably be the "owner" from OEP-2's openedx.yaml by default. But more generally, anytime somebody needs to add, upgrade, or remove a dependency, they should edit an input requirements file (if necessary) and run make upgrade. It's possible to upgrade just a specific target package in a particular file, but that opens up enough ways for the files to fall out of sync with each other that I hesitate to recommend doing that; I'd rather upgrade a few other things at the same time and only add version constraints if tests fail than never upgrade anything by default.

I'll add some notes to this effect to the doc.

nedbat commented Apr 3, 2018

IIUC, this means that I as a developer might need to add a dependency, so I edit a .in file. Then I run make upgrade. This can update the pins of an arbitrary number of other packages, some of which might break existing code. What do I do then?

jmbowman commented Apr 3, 2018

  1. If you aren't in an incredible hurry, scan the changelogs of the changed packages to look for potential problems. If there's a change which seems like it will require nontrivial work to adapt to, put an upper version limit on that package in the .in file, rerun "make upgrade", and file a ticket to deal with it later.
  2. Push the changes and wait for tests to finish. If one of the upgrades caused a problem, add a version cap, rerun "make upgrade", and ticket as above.

The automation is important here to make sure that we always get package upgrades in small batches. We should really never go more than a week or so without updating all the dependencies which we're basically current on (unless miraculously no new versions of packages we use were released in that time; more likely to happen for small repos). And I'd argue that someone who doesn't feel qualified to determine if it's probably safe to upgrade a few dependencies probably shouldn't be adding new dependencies without assistance or review either.
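The cap itself is just one line in the relevant .in file, e.g. (package and ticket ID are made up):

    celery<4.0                # TODO: remove cap once ABC-1234 is resolved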

nedbat commented Apr 3, 2018

That needs to go into the document. :)

@jmbowman jmbowman force-pushed the jmbowman/oep-0018 branch 2 times, most recently from 300a58e to 7bc91ce Compare April 12, 2018 20:09
@nasthagiri nasthagiri changed the title OEP-18: Python Dependency Management OEP-0018: Python Dependency Management Apr 14, 2018
dependencies).
* An automated test suite with reasonably good code coverage, configured to
be run on new GitHub pull requests.
* A service configured to periodically auto-generate a GitHub pull request
Contributor:


How often would such a service be expected to run? Daily, weekly, monthly?

Contributor:


Am I understanding this correctly: Until the next time this service runs, the automatically generated requirements .txt file may not match the output of the .in file, right?

Contributor Author:


Anytime a developer updates one of the .in files, they should run make upgrade to update all of the .txt files. The periodic auto-generated pull requests (probably weekly or so) are just to notify us of updates to packages that we aren't deliberately holding back, so we don't fall too far behind. Those auto-generated PRs would only update the .txt files, since the input constraints won't have changed (and the suggested changes still need to be reviewed and deliberately merged).

* The need for an old dependency was removed.
* A version constraint needs to be added to prevent upgrading to a
backwards-incompatible release of a required package until appropriate code
changes can be made.
Contributor:


What is the best practice when a development team needs to make a backward incompatible change to an API that is implemented in 1 repo but used in another? If the team didn't pin down the version number prior to this, what are the ramifications (open edX community, failure on master branch, etc.) if they pin down the version only afterward?

Contributor Author:


Nothing ever gets automatically upgraded; changes to the generated .txt files should always be reviewed before merging to master. I think in this scenario, best practice is to leave packages unpinned by default, but clearly document backwards-incompatible changes in the changelog and release notes; when an update of the .txt files indicates that the package would be upgraded, that should be noticed so the developer can either constrain the version or make the changes needed to use the new version, as appropriate.


Many of the Open edX repositories have already begun to comply with the
recommendations outlined here. In particular, repositories generated using
`cookiecutter-django-app`_ should be configured correctly from the outset.


very minor nit. Might be clearer to say "should already be configured correctly..."

johnbaldwin:

What level of pip knowledge do you expect from the readers of this OEP? I ask because it covers a lot of ground and reads like two documents in one.

A) This is how we plan to structure our python package dependencies for our projects

B) Here are some tips and guidelines for working with the more advanced features of pip and the pip ecosystem which we use for A, above.

I don't want to suggest unneeded work, but perhaps the OEP could be narrowed to A, with a bonus of linking to additional reading resources for engineers to get up to speed on the tools and techniques to better work with Python package dependencies?

johnbaldwin:

BTW, I've learned a bit of new stuff about pip just by reading it :)

jmbowman (Contributor Author):

@johnbaldwin The intended level of detail is to be sufficient to help a developer create a new code repository that follows the guidelines, and keep it that way as changes are made over time. A lot of the seemingly esoteric points covered here come up depressingly often in routine package maintenance, but I'm open to suggestions on specific content that could be harmlessly extracted.

I could move about half of the "Installing Dependencies from URLs" section into the Rationale section (which seems more appropriate for background info), and localize the references to pip-compile and pip-sync to the "Generate Exact Dependency Specifications" section (where that information is important in explaining how to create the make targets), and just reference the make targets elsewhere. Do you think that would help? Is there anything else that seems distinctly secondary to the main content?

johnbaldwin:

@jmbowman Thanks for your reply and additional context. It sounds like you're intentionally using this OEP as both a spec and a guide, and I can see why now. Given your comments, and that you need to keep within the structure of the OEP template, I'm not sure restructuring would be worth the effort. I think the thing to do is for the community to start using this as a reference; then perhaps folks will create derivative guides, focusing on aspects of package maintenance and dependency management for different audience experience levels.

* Just the standard set of core dependencies for execution on a production
server to perform its primary purpose (``base.in``)
* Additional dependencies which are only needed when optional extra features
of a package are desired
Contributor:


This is the only bullet point that doesn't have a filename attached. Is that an error?

Contributor Author:


The filename would typically be based on the name of the extra in setup.py; I'll add a note to that effect.
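E.g., an extra declared as ``extras_require={'sandbox': [...]}`` in setup.py would get a ``requirements/sandbox.in`` file (the name here is purely illustrative).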

base dependencies also need to be specified there. These can be derived from
``requirements/base.in`` with a simple Python function declared in
``setup.py`` itself. An example can be found in the
`setup.py file for edx-completion`_.
Contributor:


Would it be better to inline the function definition in this document? That would give a more canonical best-version, it seems like.

Contributor Author:


Fair enough. The first draft of the OEP didn't particularly recommend this approach, but earlier review discussions concluded that it was best not to explicitly list the core dependencies in two places, so having a canonical implementation of loading them makes sense. The existing code is about 30 lines; a little long, but probably short enough to include.
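For reference, the helper is roughly along these lines (a condensed sketch of the linked implementation, not a verbatim copy):

    def load_requirements(*requirements_paths):
        """Load all requirements from the specified requirements files."""
        requirements = set()
        for path in requirements_paths:
            with open(path) as reqs:
                requirements.update(
                    line.split('#')[0].strip() for line in reqs
                    if is_requirement(line.strip())
                )
        return list(requirements)

    def is_requirement(line):
        """True for actual requirements; False for blanks, comments, URLs, and included files."""
        return line and not line.startswith(('-r', '#', '-e', 'git+'))

setup.py would then use ``install_requires=load_requirements('requirements/base.in')``.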

In most other circumstances, the package should be added to PyPI instead. If
you do need to include a package at a URL, it should be editable (start with
"-e ") and have both the package name and version specified (end with
"#egg=NAME==VERSION").
Contributor:


This seems like a worthwhile place to have an example of a fully-baked url include line.

Contributor Author:


Sure, I'll add an example.
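For example (repository, package name, and version are made up):

    -e git+https://github.com/edx/example-package.git@1.2.3#egg=example-package==1.2.3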

cpennington (Contributor) left a comment:


The discussion on this seems to have reached a conclusion with no hanging issues, so as the arbiter, I'm happy to call this Accepted.

@jmbowman jmbowman merged commit 0b55225 into master May 2, 2018
@jmbowman jmbowman deleted the jmbowman/oep-0018 branch May 2, 2018 17:39