Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation Plan: Switch Python package management away from Pipenv #286

Closed
1 task
sarayourfriend opened this issue Aug 4, 2022 · 10 comments · Fixed by #4128
Closed
1 task

Implementation Plan: Switch Python package management away from Pipenv #286

sarayourfriend opened this issue Aug 4, 2022 · 10 comments · Fixed by #4128
Assignees
Labels
🤖 aspect: dx Concerns developers' experience with the codebase 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server 🧱 stack: mgmt Related to repo management and automations 🐍 tech: python Involves Python
Projects

Comments

@sarayourfriend
Copy link
Contributor

Problem

Pipenv was a good start, but it has some issues that are difficult to work around:

  1. It is slow to lock and resolve versions
  2. It is difficult (maybe impossible) to atomically update dependencies on an individual basis

Description

It would be great to have a concise RFC to investigate what it would take to switch from Pipenv to Poetry or PDM. The RFC should make a recommendation for one or the other and describe the trade-offs between them.

Please evaluate the effort required for such in the API and openverse "meta" repositories.

Please also evaluate the effect on any Dockerfiles and other deployment artifacts.

Additional context

WordPress Make Slack conversation here: https://wordpress.slack.com/archives/C02012JB00N/p1659363874533649

You can register for a Make Slack account here: https://chat.wordpress.org

Implementation

  • 🙋 I would be interested in implementing this feature.
@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon 🤖 aspect: dx Concerns developers' experience with the codebase 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🐍 tech: python Involves Python labels Aug 4, 2022
@openverse-bot openverse-bot added this to Backlog in Openverse Aug 4, 2022
@sarayourfriend
Copy link
Contributor Author

I've just learned about pip-tools recently and it seems like it could be a very good option: https://github.com/jazzband/pip-tools. pypi itself uses it, for example.

It follows existing Python conventions closely, so may be more usable for the catalogue (which relies on requirements.txt files in a way that it can't use Pipenv for now). However, it's just some tools around pip, so doesn't manage virtual environments. That's an advantage Poetry and PDM have (and Pipenv as well), the venv management is transparent and you don't have to think about it.

I'd be nice to document the trade offs here.

dhruvkb pushed a commit that referenced this issue Feb 20, 2023
Add initial query values

Handle media reset on query change only in `search` page

Update src/utils/prepare-search-query-params.js

Co-authored-by: sarayourfriend <24264157+sarayourfriend@users.noreply.github.com>
dhruvkb pushed a commit that referenced this issue Apr 14, 2023
* Remove tsv_to_postgres_loader_overwrite

* Remove overwrite tests & extra steps
@sarayourfriend sarayourfriend changed the title RFC request: Switch Python package management away from Pipenv Implementation Plan: Switch Python package management away from Pipenv Apr 18, 2023
@sarayourfriend sarayourfriend changed the title RFC request: Switch Python package management away from Pipenv Implementation Plan: Switch Python package management away from Pipenv Apr 18, 2023
@sarayourfriend sarayourfriend changed the title RFC request: Switch Python package management away from Pipenv Implementation Plan: Switch Python package management away from Pipenv Apr 18, 2023
@AetherUnbound AetherUnbound added 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server 🧱 stack: mgmt Related to repo management and automations and removed 🧱 stack: backend labels May 15, 2023
@dhruvkb
Copy link
Member

dhruvkb commented Apr 3, 2024

We have not prioritised this so far, which makes sense because of the large scale disruption to at least two parts of the stack (the API and ingestion server) and many other smaller sub-projects. However, two developments suggest we should be considering it with increased priority.

  • Pipenv has been a major component of the cause that led to service disruption because of its tendency to update unrelated packages during an update. It has been noted that other package managers (notably Poetry) are not subject to this problem.

  • The ingestion server is soon on its way out Removal of the ingestion server #3925, which will leave the API as the only Pipenv based stack in the repository. Other smaller Python sub-projects (which are not critical to Openverse services) can switch to a different project manager to match the API or stick to Pipenv which serves them well.

@sarayourfriend
Copy link
Contributor Author

That makes a lot of sense to me, Dhruv. We could even go ahead and do this scoped just to the API, and assume the ingestion server is a non-issue long-term due to the removal project you linked.

I'll update the title of this to represent that.

There are some really simple solutions we could use, given we're just an application and not a package, and don't necessarily need pyproject.toml for the API. pip-tools, for example, could be a much simpler solution overall compared to Poetry or others.

However, Poetry and other tools might have better support for cross-referencing packages in a monorepo context, which would be nice if we pulled certain shared utilities out. There are ways around that, though.

@dhruvkb
Copy link
Member

dhruvkb commented Apr 4, 2024

I'm quite pleased with how Poetry works in my personal projects, and I believe @AetherUnbound is too. I would prefer to not have to learn another toolchain with pip-tools but I wouldn't mind if the team decides on something else.

The implementation plan should list down some concrete requirements and good-to-haves that we expect from the final solution (including but not limited to targeted package updates, good monorepo/workspace support, integration with Renovate/Dependabot etc.) and those will help us decide the best tool for the job.

@sarayourfriend
Copy link
Contributor Author

Might be good to iron those out before the actual plan, so that the plan can focus on determining what meets the requirements, rather than detemrining the requirements on top of that. They're separate discussions.

Let's just go with Poetry if y'all are happy with it. It's enormously popular and works well in my experience as well.

@stacimc
Copy link
Contributor

stacimc commented Apr 5, 2024

Noting that while the ingestion-server is on its way out, the indexer-worker (which will live in the catalog) will still be Pipenv based (assuming #4026 is not changed significantly before approval).

@pathunstrom
Copy link

Was asked to come share my experience now with multiple workflows.

Before going too far, my current standard for my development is just use pip and requirements.txt mostly because I'm waiting for the standards to finally shake out.

That said, I keep playing with these tools so can maybe speak to potential failure modes:

pip-tools I generally like. It's specifically a suite of tools built on top of a pip based workflow.
pip-compile expects your requirements to live in pyproject.toml following the least specific principle of a library dependency list. Then you provide a requirements.in and then generate your requirements.txt off of the metadata. The naive approach to its use has some of the same issues as the problem mentioned for pipenv of being eager in version updates. The second observation is that if you aren't already comfortable with the new python packaging ecosystem, it's very easy for developers to make changes to requirements.txt and not the .in or pyproject.toml and that can break environments for everyone if it's not caught.

Poetry I've only used a little. My current big complaint is that it strongly wants the package name to match the project name, but that might not be a problem for you. The good of poetry is its about as close to NPM for Python as I've encountered. I think if you have folks who are weaker on python, the instructions are much clearer than the pip-tools pile. It also allows dependency grouping.

I don't think either properly solve the monorepo problem. I have to maintain a mono-repo at work and my actual process for things I'm not needing editable, is cding from directory to directory and doing a build of each (or a full install because it's "easier") and moving back to what I'm working on to pip install the thing I'm working on.

@sarayourfriend
Copy link
Contributor Author

sarayourfriend commented Apr 9, 2024

Thanks for sharing that, @pathunstrom. I wasn't aware Poetry didn't follow PEP 621! It makes sense given its context, but as we're only moving to it after it is settled, I think it would be good to choose a tool that does follow it.

I use hatch extensively and like it, but the lack of lock files make it inappropriate for applications. I've used PDM before, and it has, I think, all the features we want:

I think that makes it an appropriate choice. It sees active development and is supported by Renovate through it's generic PEP 621 support. It also has an opt-in centralised installation cache to speed up re-installations and reduce internet usage like PNPM.

@dhruvkb what do you think? @WordPress/openverse-maintainers?

@dhruvkb
Copy link
Member

dhruvkb commented Apr 13, 2024

I didn't want to form any opinions before trying out all the possible tools. So I tried them all, from toy projects for new tools like PDM https://github.com/dhruvkb/pdm-mono and Rye https://github.com/dhruvkb/rye-mono to a full scale migration attempt for Poetry #4107, which is familiar.

I really liked Rye mainly for its speed. It is incredibly fast! It's still not a good fit because it's very unstable (pre-v1), misses important features that we need (like cross-platform lockfiles) and is still largely maintained by one person. It's also very ideologically different and tries to do a lot more that isn't applicable to us (like managing Python installations).

PDM was also very good, but I could not wrap my head around the monorepo setup. For example, try as I might I could not figure out how to lock dev dependencies for the sub-packages. I expected pdm install to do that but it doesn't. Also PDM creates a whole bunch of files everywhere like .egg-info, .dist-info, .pdm-python etc. which I didn't like to see especially in the API which is an application and not a library.

This got me wondering why we even need monorepo support in our Python projects. Such a setup puts all our packages in one virtualenv and makes it harder to get a project specific lockfile that we can use inside Docker. So I gave Poetry a try without a monorepo setup. It this case, the API defines a path dependency on a package in packages/ and we copy it into the container when building. It works fine, CI passes, changes are minimal and the end result involves little to no magic.

Poetry has good venv management, targeted dependency upgrades, support from Renovate and Dependabot, cross-platform lockfiles, support for build backends (I don't know why we would need that) and good CLI experience.

It's not PEP-621 compliant but I personally would take that tradeoff for a mature, polished experience that's suited for applications, works quite well for libraries and produces understandable dependency chains.

These are just my observations, I would like to hear your thoughts after you take a look at #4107.

@sarayourfriend
Copy link
Contributor Author

sarayourfriend commented Apr 16, 2024

I was interested in monorepo support so that we could easily cross-reference packages locally and have dependencies locked between them. For example, if we abstracted code out to be shared by the catalog, API, or indexer workers (e.g., Elasticsearch mappings, other utils, etc), I thought a monorepo would help with that.

In fact, it isn't necessary. We would use an editable local dependency for that anyway.

Regarding these two from Rye:

misses important features that we need (like cross-platform lockfiles)

We don't need this, though. Our lock files only matter for Linux, because that's where we run the application, and we can generate them in the docker images themselves (and probably should!). We do not run our apps on our local machines ever and we shouldn't try to support that (there are bigger more complex issues we'd need to solve for it). If this is a significant issue for us using Rye, I would highly encourage documenting why cross-platform lock files are relevant to our project.

If lock files were relevant for isolated package developments, it'd be a different question, because that generally does happen locally. However, Python packages do not use lock files (and cannot because of how package dependencies are incorporated into a single resolved dependency list) so it is irrelevant.

tries to do a lot more that isn't applicable to us (like managing Python installations)

I disagree this doesn't apply to us. I find it a pretty big pain to need to manage Python installations locally for Openverse, when every other Python project I work on uses a more recent Python version and uses package managers that download Python distributions for you. This is a really nice feature, and especially when we get around to writing Python packages, it makes testing with multiple Python versions (including pypy) trivially easy.

This is also pretty common now because https://pypi.org/project/pbs-installer/ makes it basically a non-issue. Rye, PDM, and Hatch all use that or the backing library and in my experience both PDM and Hatch work perfectly for this.

For package development, this is a must have QoL feature. I would go back to tox and manually manging Python installations with pyenv for package development if I can help it.

support for build backends (I don't know why we would need that)

We need this for building Python packages, which we have expressed intentions to do.


Poetry sounds fine to me. I place a high value on PEP-621 compliance, especially if we are choosing a new tool to use in 2024, but if that is too high a bar, and if the tools available for achieving that are otherwise unsuitable, then fine. It's probably less relevant for applications, but that least me to the next point...

We need to keep in mind that we do want to develop Python packages, not just apps. Whatever solution we choose should either work for both or not prevent us from using another for packages if appropriate.

I say go for it with Poetry, so long as we can 100% use rye/hatch/whatever for package development, for packages that will get referenced locally by Poetry as local editable dependencies. I don't see why not PEP-621 for packages, especially when lock files are irrelevant there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖 aspect: dx Concerns developers' experience with the codebase 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server 🧱 stack: mgmt Related to repo management and automations 🐍 tech: python Involves Python
Projects
Archived in project
Status: Accepted
Openverse
  
Backlog
Development

Successfully merging a pull request may close this issue.

6 participants