Set up reproducible Python environments with conda-lock #2968

Merged
merged 8 commits into from Nov 14, 2023

Conversation

Member

@zaneselvans zaneselvans commented Oct 21, 2023

PR Overview

This PR switches our dependency management over to platform-specific lockfiles produced by conda-lock, giving us reproducible Python environments. This should increase the stability of our nightly builds, and ensure that we can reproduce a given set of data products at a later date if need be, given the commit hash of our own repository and the exact versions of all our software dependencies.

Conda Lockfile

  • The new setup uses pyproject.toml as the single source specifying all of our software dependencies.
  • An explicit multi-platform lockfile is output at environments/conda-lock.yml which specifies exact versions and hashes for the linux-64, linux-aarch64, osx-64, osx-arm64, and win-64 platforms.
  • Platform-specific environment files are rendered and placed in the environments/ directory.
  • In contexts where we have access to conda-lock or micromamba (which can parse the full conda-lock.yml), we refer to the full, multi-platform conda-lock.yml.
  • In contexts where only conda or mamba are available, we have to use the platform-specific environment files (see the command sketch after this list).
  • The PR introduces a new GitHub Action update-lockfile that will eventually run on a schedule, against the dev branch, and will re-lock the dependencies, and re-render the environment files under environments/ to update them with the new exact versions (assuming all tests pass).
  • The PR also changes the base image used in the Dockerfile defining the container we use for our nightly builds, to mambaorg/micromamba which comes with the tiny micromamba package manager installed, and not much else. micromamba can directly read the conda-lock.yml file and install the packages very rapidly. It's designed for this kind of locked environment reproduction.
  • Minor naming change: the optional dependencies required to build the documentation used to be referred to as doc, but this didn't match the directory name and other references, which were labeled docs. This PR switches everything to refer to docs (which is also what we use in the cheshire template repo).
  • All of our dependencies (including recordlinkage and libsnappy) are now available via conda, so we don't need to have a pip install section of any of our environment files.
  • Software dependencies that are not Python packages are specified at the bottom of pyproject.toml in the section dedicated to conda-lock. This includes e.g. python itself, sqlite, nodejs, and pandoc.
  • Note that conda-lock appears unable to parse the "extras" format sometimes used with pip dependencies, so rather than being able to use pandas[performance,gcp]>=2,<2.2 you actually have to specify the packages that would have been installed as a result of including those extras.
  • Unfortunately grpcio still has to be specified directly in pyproject.toml because it has binary incompatibilities on MacOS that even conda isn't solving.
  • The changes to our testing environment in this PR implicitly assume that from now on, we will not be distributing a python package via PyPI and conda-forge because we're not testing a built package anymore. From now on PUDL is an application that gets deployed from a repository in a locked conda environment.
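
For concreteness, the locking workflow described above looks roughly like this. It's a sketch: the exact flags used by the Makefile and the update-lockfile action may differ.

# 1. Solve all platforms from pyproject.toml into a single multi-platform lockfile:
conda-lock lock --micromamba --file pyproject.toml \
    --lockfile environments/conda-lock.yml \
    -p linux-64 -p linux-aarch64 -p osx-64 -p osx-arm64 -p win-64

# 2. Render per-platform environment files for contexts that only have conda or mamba:
conda-lock render --kind env -p linux-64 environments/conda-lock.yml

# 3. Reproduce the environment directly from the multi-platform lockfile
#    (micromamba and conda-lock can both consume conda-lock.yml directly):
micromamba create --name pudl-dev --file environments/conda-lock.yml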

Switching from Tox to a Makefile

  • Since the CI is now running inside a conda environment based on the lockfile, the environment produced by Tox would be both redundant and likely to install other versions of dependencies that aren't part of the locked environment.
  • If we aren't using Tox for Python environment management, and we are going to install catalystcoop.pudl from the repository rather than from a built package, then all Tox is doing is storing logic about how to run the tests and various other tasks.
  • This logic can be stored in a lot of other, simpler ways. Here I've migrated it into a Makefile and brought as much of the logic from the GHA workflows into the Makefile as possible, so that it can be easily run either locally or in CI.
  • A lot of the Makefile targets are "phony" in that they do not correspond directly to a filesystem output whose creation date indicates which commands need to be re-run, but in general the setup replicates the "collection of code snippets" that we had stored in tox.ini (see the sketch after this list). We can accumulate additional reusable snippets in there if we want to standardize other tasks across all of our development environments.
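
As a minimal sketch of that pattern (the target names and commands mirror ones that appear elsewhere in this PR, but the real recipes may differ):

.PHONY: pytest-unit docs-build

# Run the unit tests, including doctests in the source tree.
pytest-unit:
	pytest --doctest-modules src/pudl test/unit

# Lint the RST sources and build the Sphinx docs.
docs-build:
	doc8 docs/ README.rst
	sphinx-build -W -b html docs docs/_build/html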

Repository vs. Package Installation & optional dependencies aka "extras"

  • If we aren't using Tox for Python environment management, and we are going to install catalystcoop.pudl from the repository, then the way we've historically split out our dev, test, docs, etc dependencies no longer makes sense. Almost everywhere we're installing all dependencies directly from the explicit multi-platform conda-lock.yml file.
  • There are a few dependencies that only need to be installed for local development, where we need interactive visualization: pygraphviz, dagster-webserver, jupyterlab, etc. These have been split out into a very small dev list of optional dependencies that gets installed locally when you run make install-pudl, but doesn't get installed in the Docker container (sketched below).
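
Roughly, the local install flow amounts to the following; the actual make install-pudl recipe isn't reproduced here, so treat the details as assumptions. The locked environment gets materialized first, then PUDL itself is installed without letting pip re-resolve any dependencies, and the small interactive dev extras are layered on top locally.

# Materialize the exact, locked dependency set into a named environment:
conda-lock install --name pudl-dev environments/conda-lock.yml

# Install PUDL itself without touching the locked dependencies:
mamba run -n pudl-dev pip install --no-deps --editable .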

Tasks

  1. Migrate the logic encoded in tox.ini into another system, like a Makefile
  2. Get the simpleexec conda package switched to noarch so that linux-aarch64 (Docker on Apple ARM chips) works.
  3. Increase minimum version of the timezonefinder dependency once "No osx-arm64 builds since v6.1 removed noarch" (conda-forge/timezonefinder-feedstock#61) has been addressed
  4. Try running a nightly build with the new micromamba docker container + lockfiles.
  5. Remove all possible upper bounds on our dependencies in pyproject.toml
  6. Automatically relock dependencies any time pyproject.toml is changed in a PR and merge those changes into the PR branch, rather than creating a whole separate PR into the PR branch.
  7. Consolidate optional dependencies into required dependencies.
  8. Create a new optional dev dependency that just has interactive packages like jupyterlab
  9. Test running make nuke locally
  10. Try running a nightly build with the new environment and pytest-xdist
  11. Test running make pytest-jupyter locally (it tries and fails! because we haven't been running it.)
  12. Test running make unmapped-ids locally (seems to run but takes forever due to in-process dagster)
  13. Schedule update-conda-lockfile to run once a week, instead of dependabot.
  14. Rename the tox-pytest action to pytest since Tox is no more.
  15. Rename the update-lockfile action etc. to be more descriptive: update-conda-lockfile
  16. Set the scheduled update-conda-lockfile to run against dev and make PRs into dev
  17. Update dev_setup documentation to reflect use of conda lockfiles.
  18. Update documentation to reflect removal of Tox and use of Makefile
  19. Update documentation to reflect installation of the application repo rather than a python package
  20. Update release notes
  21. Merge update-conda-lockfile.yml and pytest.yml workflows into main
  22. Disable dependabot on the PUDL repo.
  23. Increase minimum version of the sphinx-issues dependency once "sphinx-issues v3.0.1" (conda-forge/sphinx-issues-feedstock#12) has been merged

Outstanding Questions

Logging

  • micromamba doesn't have a direct analog to conda --no-capture-output, and I'm not sure how its stream handling / attachment args correspond (if at all), so for now I've removed --no-capture-output from the command that's passed into the container. But I'm not sure what the implications of that are.
  • Is the logging output that's being generated by the nightly builds with this new setup acceptable? Are we missing anything? Is the parallelization making it illegible?

Docker stuff

  • For some reason, when I run update-conda-lockfile using workflow_dispatch the docker-build-test action is not kicked off. See Update Lockfile #3022
  • Installing the AWS CLI and Google Cloud SDK is hacky. Is AWS really only available via curl? gcloud can be installed similarly with curl, or via apt-get, or from conda-forge (for OS X + Intel based Linux). What's the right approach? (Both install paths are sketched after this list.) Also, how were we previously installing the Google Cloud SDK? I don't know what I changed to cause it to no longer be available.
  • There are many environment variables which are set both in the docker/.env file and inside the Dockerfile. Is that intentional? Or should one of those be the unique source of that information?
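
For reference, the two install paths being compared look roughly like this; the AWS URL matches the one already used in the Dockerfile, and the rest is a sketch rather than the exact commands in the image.

# AWS CLI v2: Amazon only ships a zip installer, fetched with curl.
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o awscliv2.zip
unzip awscliv2.zip
./aws/install   # may need root or an --install-dir, depending on the image

# Google Cloud SDK: available from conda-forge (linux-64 / osx-64), so it can be
# pinned in the lockfile alongside everything else.
mamba install -c conda-forge google-cloud-sdk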

Remaining dependency issues

  • Why does pip think that the recordlinkage version is 0.0.0 when it's installed with mamba?

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans zaneselvans linked an issue Oct 21, 2023 that may be closed by this pull request
@zaneselvans zaneselvans added the dependencies label Oct 21, 2023
@zaneselvans zaneselvans force-pushed the conda-lockfile branch 4 times, most recently from 78d735d to 8ee276e Compare October 26, 2023 21:43
@bendnorman
Member

bendnorman commented Oct 26, 2023

Here are some thoughts on the remaining questions:

micromamba doesn't have a direct analog to conda --no-capture-output, and I'm not sure how its stream handling / attachment args correspond (if at all), so for now I've removed --no-capture-output from the command that's passed into the container. But not sure what the implications of that are.

I think this might prevent us from piping the nightly build STDOUT into a log file. If there is no way to stop micromamba from intercepting STDOUT we’ll have to find another way to capture the logs. We might be able to do this using Cloud Logging.

Installing the AWS CLI and Google Cloud SDK is hacky. Is AWS really only available via curl? gcloud can be installed similarly with curl, or via apt-get or from conda-forge (for OS X + Intel based Linux). What's the right approach? Also, how were we previously installing the Google Cloud SDK? I don't know what I changed to cause it not to be available any more.

Looks like AWS is not interested in publishing awscli2 on PyPI. I'm assuming the same goes for conda.

It looks like our google python packages were installed via other dependencies (gcsfs, pandas-gcp). I think we installed the CLI tool separately: Sometimes it is helpful to download the logs and data outputs of nightly builds when debugging failures. To do this you’ll need to set up the Google Cloud Software Development Kit (SDK).

Install the gcloud utilities on your computer. There are several ways to do this. We recommend using conda or its faster sibling mamba. If you’re not using conda environments, there are other ways to install the Google Cloud SDK explained in the link above.

When do we want to trigger the creation of a new lockfile, and how do we make sure it incorporates newer versions of our dependencies that are outside of the current range?

  • Should we let the dependabot keep bumping upper bounds and automatically create a new lockfile and merge it into those dependabot PRs?
  • Should we automatically update conda-lock.yml any time pyproject.toml is changed?
  • Should we remove all of the upper bounds on our dependencies and let the lockfile take the newest possible versions of everything, and assume it all works if the tests pass? Would this in effect replace the need to run dependabot on our python dependencies?
  • Option 3: Does this mean we would have an action that updates the lock files on a schedule? Would the logic be: run make conda-lock.yml; if the lock files changed, run the tests; if the tests pass, merge into main. If there are no new packages and the lock files do not change, don’t run the tests and don’t open a PR.
  • Options 1 and 3 seem similar to me. I like option 1 because it lets dependabot figure out when it’s appropriate to update packages instead of running make conda-lock.yml on a schedule and figuring out if packages changed.
  • Don’t we need to retain the upper bounds on our packages for when we know there are newer versions of dependencies that aren’t compatible with our code or other dependencies? For example, dagster requires an older version of coloredlogs and pydantic.
  • I wonder if it’s possible to create an action that runs make conda-lock.yml any time the dependencies in pyproject.toml changes and commits the new files directly to the PR that changed pyproject.toml. This way the lock files are always updated when dependabot or a contributor bumps a version in pyproject.toml.

Do we want to build and install a catalystcoop.pudl package in our CI tests and Docker image, rather than installing from the repository directly?

  • If so, do we want to continue distributing that package in PyPI and conda-forge?
  • If so, what allowable dependency versions do we want to communicate to users of the package?

What are the benefits of installing from the catalystcoop.pudl package instead of from the repository directly? Distributing the package on PyPI and conda-forge seems like it will create unnecessary overhead.

Why does pip think that the recordlinkage version is 0.0.0 when it's installed with mamba?

I'm not sure! Happy to help debug.

Should we separate our interactive development dependencies (like jupyterlab) from the headless dev requirements? If we're going to install all of the optional dependencies in all of the conda environments, does it even make sense to break them out as optional dependencies?

Great point! It doesn't make much sense anymore to have optional dependencies if we treat PUDL like an application with pinned dependencies that produces data outputs. Does it make sense to keep the optional dependencies around until our main users no longer depend on installing PUDL as a package?

@@ -10,19 +10,15 @@ on:
- ready_for_review
Member

We should probably rename this workflow now that tox is no more.

Member Author

Yeah, I was just holding off since I think I have to change the main branch to do that right? And the name is insinuated into a bunch of different places.

Member Author

I want to rename the update-lockfile action too since that's a bit generic.

Makefile Outdated
test/integration/glue_test.py

########################################################################################
# The github- prefixed targets are meant to be run by GitHub Actions
Member

Could you elaborate on why there are different commands for local and github in this comment?

Member Author

For the GitHub ones I'm having them install PUDL and deal with coverage before & after the run. But maybe we could do those things every time locally or on GitHub and it would be fine?

Member

Got it. I say we keep it as is for now. Just run coverage on GitHub.

Member

I think it would be better if the make targets are exactly what we do on CI - once we start having local-, github-, and targets with no prefix we lose the standardization benefit.

This would also let us have less code in the Makefile which is a plus.

The tricky thing is handling PUDL installation & coverage.

PUDL installation: this should be part of a separate devenv target, I think - something like

devenv: conda-env
        mamba run -n pudl-dev pip install -e .

conda-env: environments/conda-lock.yml
        conda-lock install -n pudl-dev $^

Coverage: on CI, we're doing coverage for unit/integration/docs separately, and combining them at the end. Locally, we might want to do coverage for any subset of those, which is the impetus behind the various local-pytest-ci targets. I think we should:

  • stop running coverage erase within make - on CI runners it is unnecessary, and locally this forces a new target for every subset of tests we want to run coverage for.
  • include the coverage args for all unit/integration/doc-build targets, for local/CI consistency. If you really want to skip coverage, you can always run the naked command.

Here's what I have locally:

diff --git a/Makefile b/Makefile
index a86b30874..246ee37f5 100644
--- a/Makefile
+++ b/Makefile
@@ -14,6 +14,12 @@ pip_install_pudl := pip install --no-deps --editable ./
 dagster:
 	dagster dev -m pudl.etl -m pudl.ferc_to_sqlite
 
+devenv: conda-env
+	mamba run -n pudl-dev pip install -e .
+
+conda-env: environments/conda-lock.yml
+	conda-lock install -n pudl-dev $^
+
 ########################################################################################
 # Conda lockfile generation
 ########################################################################################
@@ -47,16 +53,19 @@ docs-clean:
 
 docs-build: docs-clean
 	doc8 docs/ README.rst
-	sphinx-build -W -b html docs docs/_build/html
+	coverage run ${covargs} -- ${CONDA_PREFIX}/bin/sphinx-build -W -b html docs docs/_build/html
 
 ########################################################################################
 # Generic pytest commands for local use, without test coverage
 ########################################################################################
 pytest-unit:
-	pytest --doctest-modules src/pudl test/unit
+	pytest ${pytest_args} --doctest-modules src/pudl test/unit
 
 pytest-integration:
-	pytest test/integration
+	pytest ${pytest_args} --etl-settings ${etl_fast_yml} test/integration
+
+pytest-integration-full:
+	pytest ${pytest_args} --etl-settings ${etl_full_yml} test/integration
 
 pytest-validate:
 	pytest --live-dbs test/validate
@@ -65,41 +74,17 @@ pytest-validate:
 ########################################################################################
 # More complex pytest commands for local use that collect test coverage
 ########################################################################################
-# Run unit & integration tests on 1-2 years of data and collect test coverage data.
-local-pytest-ci: docs-clean
-	${coverage_erase}
-	doc8 docs/ README.rst
-	coverage run ${covargs} -- ${CONDA_PREFIX}/bin/sphinx-build -W -b html docs docs/_build/html
-	pytest ${pytest_args} --doctest-modules src/pudl test/unit
-	pytest ${pytest_args} --etl-settings ${etl_fast_yml} test/integration
-	${coverage_report}
-
-# Run unit & integration tests on ALL years of data and collect test coverage data.
-# NOTE: This will take 1+ hours to run and the PUDL DB will not be retained.
-local-pytest-ci-all-years: docs-clean
-	${coverage_erase}
-	doc8 docs/ README.rst
-	coverage run ${covargs} -- ${CONDA_PREFIX}/bin/sphinx-build -W -b html docs docs/_build/html
-	pytest ${pytest_args} --doctest-modules src/pudl test/unit
-	pytest ${pytest_args} --etl-settings ${etl_full_yml} test/integration
-	${coverage_report}
-
 # Run the full ETL, generating new FERC & PUDL SQLite DBs and EPA CEMS Parquet files.
 # Then run the full integration tests and data validations on all years of data.
 # NOTE: This will clobber your existing databases and takes hours to run!!!
 # Backgrounding the data validation and integration tests and using wait allows them to
 # run in parallel.
-nuke: docs-clean
-	${coverage_erase}
-	doc8 docs/ README.rst
-	coverage run ${covargs} -- ${CONDA_PREFIX}/bin/sphinx-build -W -b html docs docs/_build/html
-	pytest ${pytest_args} --doctest-modules src/pudl test/unit
-	pytest ${pytest_args} \
-		--etl-settings ${etl_fast_yml} \
-		test/integration
+# 
+
+nuke: docs-build pytest-unit
 	rm -f tox-nuke.log
 	coverage run ${covargs} -- \
-		src/pudl/convert/ferc_to_sqlite.py \
+		src/pudl/ferc_to_sqlite/cli.py \
 		--logfile tox-nuke.log \
 		--clobber \
 		${gcs_cache_path} \
@@ -126,11 +111,7 @@ pytest-jupyter:
 
 # Compare actual and expected number of rows in many tables:
 pytest-minmax-rows:
-	pytest --live-dbs \
-		test/validate/epacamd_eia_test.py::test_minmax_rows \
-		test/validate/ferc1_test.py::test_minmax_rows \
-		test/validate/eia_test.py::test_minmax_rows \
-		test/validate/mcoe_test.py::test_minmax_rows_mcoe
+	pytest --live-dbs test/validate -k minmax_rows
 
 # Build the FERC 1 and PUDL DBs, ignoring foreign key constraints.
 # Identify any plant or utility IDs in the DBs that haven't yet been mapped

Couple extra notes:

nuke:

  • I think clobber has some unpleasant behavior with ferc_to_sqlite where it will delete the XBRL databases when reading the 2022 files, but...
    • src/pudl/convert/ferc_to_sqlite.py doesn't even exist, so I think nuke was totally broken already
  • should we be managing logfiles like this in dagster-land? seems a little clunky, but I also know logs in dagster are funky

pytest-minmax-rows: this is a shortcut to adding @pytest.mark.minmax to the minmax_rows tests. I sort of think this particular make target should be replaced by the built-in pytest test filtering tooling. Maybe the jupyter one too.

RUN useradd -Ums /bin/bash catalyst

ENV CONTAINER_HOME=/home/catalyst
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
Member

I guess this is what Amazon wants us to do?!?

Member Author

And who are we to question Lord Bezos?

@zaneselvans
Member Author

zaneselvans commented Oct 27, 2023

I think this might prevent us from piping the nightly build STDOUT into a log file. If there is no way to stop micromamba from intercepting STDOUT we’ll have to find another way to capture the logs. We might be able to do this using Cloud Logging.

There's the micromamba --attach flag, but the docs are a bit sparse and I wasn't sure how to use it. I think we might be able to do micromamba --attach "" and have it not capture any outputs?

$ micromamba run --help                                                                                                                                                                        
Run an executable in an environment
Usage: micromamba run [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -a,--attach TEXT            Attach to stdin, stdout and/or stderr. -a "" for disabling stream redirection
  --cwd TEXT                  Current working directory for command to run in. Defaults to cwd
  -d,--detach                 Detach process from terminal
  --clean-env                 Start with a clean environment
  -e,--env TEXT ...           Add env vars with -e ENVVAR or -e ENVVAR=VALUE
  --label TEXT                Specifies the name of the process. If not set, a unique name will be generated derived from the executable name if possible.

Looks like AWS is not interested in publishing awscliv2 on PyPI. I'm assuming the same goes for conda.

There's a stale user-contributed package on conda-forge but no, it seems like curl is the only way to programmatically install the AWS CLI, which seems dumb. But oh well. The conda-forge version of google-cloud-sdk seems to be well maintained though, and installing it via conda will allow us to have a specific, known version that's stored in the lockfile, so that's what I went with for now. It also means that it'll be installed for all of us in pudl-dev, so we don't all have to figure out how to install it independently on various platforms.

Options 1 and 3 seem similar to me. I like option 1 because it lets dependabot figure out when it’s appropriate to update packages instead of running make conda-lock.yml on a schedule and figuring out if packages changed.

One thing with allowing dependabot to keep doing the updates is it'll make one PR for every individual dependency that changes, which will be good for isolating changes, but it'll be a lot of PRs, and every one of them will need the lockfile to be regenerated. I was thinking that we'd maybe run update-lockfile on a weekly schedule, in which case it would update all of the versions together once a week. And then we'd also need to either run it manually or trigger it whenever pyproject.toml is changed to catch newly added or removed or changed dependencies.

With our ~500 direct+indirect dependencies, I don't think there will ever be a week in which none of them release a new version.

In removing the upper bounds I was thinking just of removing the ones where there's no known problem and the upper bound is just a safety measure that keeps the dependency from changing too much without some intervention & testing on our part. For pydantic>=2.0 and other known incompatibilities we'd leave them in so conda-lock only chooses from the versions that we expect could work.

One thing I think we probably want to try and avoid is merge conflicts in conda-lock.yml... it's meant to reflect a single run of conda-lock, and it's tens of thousands of lines long and not particularly human-friendly. Not sure what this means for the workflow we want to choose though.

I wonder if it’s possible to create an action that runs make conda-lock.yml any time the dependencies in pyproject.toml changes and commits the new files directly to the PR that changed pyproject.toml. This way the lock files are always updated when dependabot or a contributor bumps a version in pyproject.toml.

I think this should definitely be doable, in the same way that pre-commit.ci currently fixes any issues that make it to GitHub.

What are the benefits of installing from the catalystcoop.pudl package instead of from the repository directly? Distributing the package on PyPI and conda-forge seems like it will create unnecessary overhead.

I think we've really gone to an "application" model at this point, where catalystcoop.pudl isn't meant to be installed outside of its locked environment and is intended for deployment to run the ETL, so distributing these packages and doing the package isolation in testing is probably no longer particularly useful. I think for reproducibility of a particular run we'll use tags in git, or if something catastrophic happens, there are zipfiles archived on Zenodo automatically every time we tag. We should note this somehow so folks aren't confused by how stale the package is on PyPI and conda-forge.

Great point! It doesn't make much sense anymore to have optional dependencies if we treat PUDL like an application with pinned dependencies that produces data outputs. Does it make sense to keep the optional dependencies around until our main users no longer depend on install PUDL as a package?

I think maybe we want to consolidate all the currently "optional" dependencies and make them required, but what do we want to do about dependencies that are for interactive local development? Like the Docker container doesn't need to have JupyterLab installed in it, but we all probably want it. Should we just leave that up to individuals? Or include them in the lockfile? Or I guess, don't include them in the lockfile but include them in the rendered environment files that we'll use to install the environment locally with the new make target?

@zaneselvans
Member Author

Oh, the downside of using conda to install google-cloud-sdk is that it's not available for win-64 or linux-aarch64. I don't know if this will end up mattering, but I got a warning at some point about not being able to build the docker container on my Mac because there was another package that wasn't available for aarch64...

@zaneselvans zaneselvans force-pushed the conda-lockfile branch 2 times, most recently from 4e5589f to 32e8353 Compare October 27, 2023 19:35
@bendnorman
Member

bendnorman commented Oct 31, 2023

With our ~500 direct+indirect dependencies, I don't think there will ever be a week in which none of them release a new version.

I thought dependabot only updates direct dependencies? If so, our dependabot PR frequency wouldn't change, right? If it opens a PR for every indirect dependency update, I think we should update all the dependencies on a schedule.

I think we've really gone to an "application" model at this point, where catalystcoop.pudl isn't meant to be installed outside of its locked environment, and is intended for deployment to run the ETL so probably distributing these packages and doing the package isolation in testing is no longer particularly useful. I think for reproducibility of a particular run we'll use tags in git, or if something catastrophic happens, there are zipfiles archived on Zenodo automatically every time we tag. We should note this somehow so folks aren't confused by how stale the package is on PyPI and conda-forge.

I'm in favor of no more package releases. Can't think of a reason why someone would want to install PUDL given we're only distributing data now.

I think maybe we want to consolidate all the currently "optional" dependencies and make them required, but what do we want to do about dependencies that are for interactive local development? Like the Docker container doesn't need to have JupyterLab installed in it, but we all probably want it. Should we just leave that up to individuals? Or include them in the lockfile? Or I guess, don't include them in the lockfile but include them in the rendered environment files that we'll use to install the environment locally with the new make target?

This sounds like a good plan to me. So have two targets: make local_env which includes interactive development packages and make ci_env which has the minimum required packages?

@zaneselvans
Member Author

I thought dependabot only updates direct dependencies? If so, our dependabot PR frequency wouldn't change, right? If it opens a PR for every indirect dependency update, I think we should update all the dependencies on a schedule.

The dependabot only looks at our direct dependencies (in pyproject.toml), but the lockfile resolves all of our indirect dependencies too. I was just saying that if we recreate the lockfile once a week, I think it will always have changes, probably like... 20 of them every week, and all in a single PR. While with dependabot we'll get several PRs -- one for every single direct dependency that changed, and since they'll all change pyproject.toml, we'll need to regenerate the lockfile every time dependabot makes a change, but there's really no point to that -- except that it might be easier to isolate which dependency is responsible for breakage when it happens (if it's a direct dependency anyway). But I think our problem has more often been that indirect dependencies change, and we have no idea what it was that changed or which of our dependencies relied on the thing that changed.

Would an update-conda-lockfile action that regenerates the lockfile and re-renders the environment files any time pyproject.toml is changed (including by dependabot) and auto-merges the new lockfile and environment files into whatever PR includes the changes to pyproject.toml end up having a lot of unpleasant conflicts in the big lockfile? We don't ever really want to merge the lockfile or environment files. We want them to be a unified reflection of a solved environment that flows from a particular pyproject.toml.

@zaneselvans zaneselvans force-pushed the conda-lockfile branch 2 times, most recently from bdc50fd to 2ae5658 Compare October 31, 2023 16:59
@bendnorman
Member

Updating all of the dependencies weekly sounds simpler than opening multiple dependabot PRs 👍

When would the lock and environment files have merge conflicts? Contributor updates pyproject.toml on a feature branch, update-conda-lockfile merges new conda lock and environment files into dev, contributor merges dev into feature branch and has to resolve merge conflicts in the lock and env files?

@zaneselvans
Member Author

zaneselvans commented Oct 31, 2023

Hmm, @jdangerx it looks like we got another stochastic failure on the hypothesis test:

FAILED test/unit/io_managers_test.py::test_filter_for_freshest_data - hypothesis.errors.Flaky: Hypothesis test_filter_for_freshest_data(df=
      entity_id                          date utility_type              publication_time  int_factoid  float_factoid str_factoid
    0         5 1969-12-31 23:59:59.999999999        other 1969-12-31 23:59:59.999999869          -65  -1.415102e+15            
    1         7 1970-01-01 00:00:00.000000025          gas 1969-12-31 23:59:59.999999743           -6   5.130000e+02            
    2         M 1972-04-14 18:37:52.853935361        total 1969-12-31 23:59:59.999999994         -108   2.814750e+14            
) produces unreliable results: Falsified on the first call but did not on a subsequent one
Falsifying example: test_filter_for_freshest_data(
    df=
          entity_id                          date utility_type              publication_time  int_factoid  float_factoid str_factoid
        0         5 1969-12-31 23:59:59.999999999        other 1969-12-31 23:59:59.999999869          -65  -1.415102e+15            
        1         7 1970-01-01 00:00:00.000000025          gas 1969-12-31 23:59:59.999999743           -6   5.130000e+02            
        2         M 1972-04-14 18:37:52.853935361        total 1969-12-31 23:59:59.999999994         -108   2.814750e+14            
    ,
)
Found these contexts (['entity_id', 'date', 'utility_type']) in input data:
  entity_id                          date utility_type
0         5 1969-12-31 23:59:59.999999999        other
1         7 1970-01-01 00:00:00.000000025          gas
2         M 1972-04-14 18:37:52.853935361        total
The freshest data:
  entity_id                          date utility_type              publication_time  int_factoid  float_factoid str_factoid
1         7 1970-01-01 00:00:00.000000025          gas 1969-12-31 23:59:59.999999743           -6   5.130000e+02            
0         5 1969-12-31 23:59:59.999999999        other 1969-12-31 23:59:59.999999869          -65  -1.415102e+15            
2         M 1972-04-14 18:37:52.853935361        total 1969-12-31 23:59:59.999999994         -108   2.814750e+14            
Paired by context:
                                                               publication_time_in  int_factoid_in  float_factoid_in str_factoid_in          publication_time_out  int_factoid_out  float_factoid_out str_factoid_out _merge
entity_id date                          utility_type                                                                                                                                                                        
5         1969-12-31 23:59:59.999999999 other        1969-12-31 23:59:59.999999869             -65     -1.415102e+15                1969-12-31 23:59:59.999999869              -65      -1.415102e+15                   both
7         1970-01-01 00:00:00.000000025 gas          1969-12-31 23:59:59.999999743              -6      5.130000e+02                1969-12-31 23:59:59.999999743               -6       5.130000e+02                   both
M         1972-04-14 18:37:52.853935361 total        1969-12-31 23:59:59.999999994            -108      2.814750e+14                1969-12-31 23:59:59.999999994             -108       2.814750e+14                   both
Unreliable test timings! On an initial run, this test took 305.40ms, which exceeded the deadline of 200.00ms, but on a subsequent run it took 53.25 ms, which did not. If you expect this sort of variability in your test timings, consider turning deadlines off for this test by setting deadline=None.

You can reproduce this example by temporarily adding
@reproduce_failure('6.88.1', b'AXicNYzbDYAwDMTslLYbIDEMH+zDFuzFJAxDWoSly10eClJBUwUXuLDKEWm19+B+kM682UwYxSKNn8gh60iyZ1ON+atpl/NbfLz5gAOv')
as a decorator on your test case

It also seems like a lot of the dates that show up in these tests are suspiciously close to (within a few nanoseconds of) Unix time zero.

@zaneselvans
Member Author

When would the lock and environment files have merge conflicts? Contributor updates pyproject.toml on a feature branch, update-conda-lockfile merges new conda lock and environment files into dev, contributor merges dev into feature branch and has to resolve merge conflicts in the lock and env files?

I think any time there are changes to conda-lock.yml from more than one branch that run into each other, we want to regenerate the locked outputs from scratch, based on whatever is found in pyproject.toml, which really does need to deal with merge conflicts. This would potentially happen frequently if we were using dependabot, since it typically makes several PRs simultaneously, updating different dependencies individually, which then all get merged into dev in quick succession.

In the conda-lock repo itself they just re-lock the dependencies on a schedule.

I think we're trying to avoid a scenario where, say, we run a nightly build in an environment that's determined by a lockfile that doesn't actually correspond to the pyproject.toml file associated with the commit. Though I guess if we're good about only ever building the environment from the lockfile, and installing pudl using pip --no-deps, this isn't a catastrophe -- the lockfile that was checked into git would define the environment used for the build, and it would be there in git for future reproducibility, even if the lockfile itself wasn't reproducible because it was some weird Frankenstein merge.

Member

@jdangerx jdangerx left a comment

Woohoo! Seems like everything works, the lockfile-updating schedule makes sense, and the only remaining question is whether to stick with make, right?

My thoughts:

  • it's nice that make exists outside of the Python environment, so we can use it to standardize some of our env management stuff
  • it's a little awkward to use make as a general task-runner because we'll have to make various .PHONY targets... but that seems OK. As long as we keep the task definitions short and split anything complicated out into a real Python script I think we won't run into maintainability hell.

So I think I've talked myself into "just use make, it's good enough and extremely standard." As such I've weighed in with my opinions on how to make in a slightly nicer way.

Other stuff:

  • when renaming a workflow, the old one will be there forever unless you delete all the existing workflow runs... that's kind of annoying but I guess it's better than losing history.


@zaneselvans zaneselvans force-pushed the conda-lockfile branch 8 times, most recently from e852b6a to aaa8082 Compare November 6, 2023 15:58
@zaneselvans
Copy link
Member Author

zaneselvans commented Nov 6, 2023

  • Started make nuke again locally at 10:10am
  • Finished at 12:13pm
  • need to rm pudl.sqlite && alembic upgrade head when running nuke.
  • Started make nuke at 13:35
  • Finished at: 15:50

@zaneselvans zaneselvans force-pushed the conda-lockfile branch 8 times, most recently from 927c9a3 to 071766a Compare November 13, 2023 21:57
@zaneselvans zaneselvans marked this pull request as ready for review November 13, 2023 21:57

codecov bot commented Nov 14, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6596401) 88.7% compared to head (071766a) 88.7%.
Report is 5 commits behind head on dev.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2968   +/-   ##
=====================================
  Coverage   88.7%   88.7%           
=====================================
  Files         90      90           
  Lines      10994   10994           
=====================================
  Hits        9758    9758           
  Misses      1236    1236           


Member

@bendnorman bendnorman left a comment

Nothing blocking. I'm excited about make and conda-lock!

Responses to your Outstanding Questions

Logging
I'm not sure what the logs will look like with the parallel tests and the change to --attach "". I say we run a full build on this branch before we merge it into dev.

Docker stuff

For some reason, when I run update-conda-lockfile using workflow_dispatch the docker-build-test action is not kicked off. See #3022

Yeah this one is stumping me. I can take a look again tomorrow.

Installing the AWS CLI and Google Cloud SDK is hacky. Is AWS really only available via curl? gcloud can be installed similarly with curl, or via apt-get or from conda-forge (for OS X + Intel based Linux). What's the right approach? Also, how were we previously installing the Google Cloud SDK? I don't know what I changed to cause it not to be available any more.

I thought we decided to use conda to install the Google Cloud SDK and curl to install the AWS CLI?

There are many environment variables which are set both in the docker/.env file and inside the Dockerfile. Is that intentional? Or should one of those be the unique source of that information?

I think the env vars are set in both the .env and Dockerfile files so folks could override the env var values in the .env file when using docker-compose locally. We haven't really kept docker-compose.yml and local_pudl_etl.sh up to date because we mostly test the image build and container execution using github actions and the GCP VM. I think we could probably remove these files.

Remaining dependency issues

Why does pip think that the recordlinkage version is 0.0.0 when it's installed with mamba?

I'm not sure. Is this causing problems?

--container-arg="/home/catalyst/env" \
--container-arg="--prefix" \
--container-arg="/home/mambauser/env" \
--container-arg="--attach" \
Member

Conda swallows all of the logs if you don't include --no-capture-output but it sounds like --attach "" should work:

Attach to stdin, stdout and/or stderr. -a "" for disabling stream redirection

Maybe we run part of a full build to make sure the logs are coming through before we merge into dev?

Member Author

I did run one full build, I think it should be in the build outputs if you want to take a look at the logs and see if they look like you'd expect them to. I think using pytest-xdist does obscure some of the test logging, but also shaves a couple of hours off the nightly builds.

Member Author

This is the output from the parallelized tests if you want to take a look:

gcloud storage cp gs://nightly-build-outputs.catalyst.coop/ece863d762c9a8790c7baff5397a0c6ee30609bb-conda-lockfile/pudl-etl.log .

Member

These logs look good to me!

Comment on lines +36 to +38
if: ${{ (github.event_name == 'push') }}
run: |
echo "GITHUB_REF="${{ github.ref_name }} >> $GITHUB_ENV
Member

It looks like you're setting GITHUB_REF = github.ref_name when github.event_name == "push" or "workflow_dispatch". Why not handle it in one action step?

Member Author

They were different at some point while I was developing the setup, and it felt more legible to enumerate the 3 cases we're trying to cover. But I could go either way.

Member

Ok! I'm down to keep it as is if it feels more legible.

Comment on lines +59 to +66
# If running on push due to dependency changes, commit directly to the base
# branch of the existing PR. Don't trigger the workflow again if we're already
# running it as pudlbot (to avoid infinite recursion).
if: ${{ (github.event_name == 'push' && github.actor != 'pudlbot') }}
uses: stefanzweifel/git-auto-commit-action@v5
with:
file_pattern: "environments/*"
commit_message: "Update conda-lock.yml and rendered conda environment files."
Member

What's stopping the action from running again on a push?

Member Author

The second time, the action is being run by pudlbot. This felt a little hacky, but it worked. Is there a more elegant way to do it?

Member

I looked at the docs for the push trigger again and it doesn't look like there is a way to filter based on who triggered the push.

Would it make sense to test this condition and abort earlier in the workflow?

@zaneselvans
Member Author

pip not picking up on the version of recordlinkage isn't breaking anything... but it means I can't specify a version for recordlinkage in pyproject.toml or it will try to install it, even though it's already satisfied by the conda environment. I guess this wouldn't happen any more since I'm doing pip install --no-deps now, but it seemed like an annoying mixed-environment gotcha.

zaneselvans and others added 7 commits November 14, 2023 10:01
* Add a conda-lock setup for discussion.
* Move python-snappy into project.dependencies in pyproject.toml
* Remove sphinx-autoapi from pypi deps, and no longer required snappy-python
* Switch to using conda-forge version of recordlinkage v0.16
* Update conda-lock.yml now that all dependencies are available on conda-forge
* Consolidate conda env files under environments/ dir
* Add a GitHub action to relock dependencies
* Quote the pip install command
* Remove pip install of pudl from environment.yml
* Rename workflow
* Only build lockfile from pyproject.toml, don't install extras.
* Just install conda-lock, not pudl, before running conda-lock.
* install conda-lock with pip
* Move all remaining dev-environment.yml deps to pyproject.toml
* Add other platforms; make draft PR against dev.
* Comment out dev base branch for now.
* Remove pandas extras and recordlinkage deps from pyproject.toml
* Use conda-lock --micromamba rather than --mamba
* Don't specify grpcio, or specific recordlinkage version
* Render platform-specific environment files in github action
* Fix paths relative to environments directory
* Add some comment notes to workflow
* Render environment for Read The Docs.
* Use environment not explicit rendered lockfile
* Add readthedocs specific sphinx extension
* Don't render explicit conda env for RTD since it can't read it.
* Build linux-aarch64 lockfile. Use conda-lock.yml in workflows.
* Comment out non-working linux-aarch64 platform for now.
* Switch to using rendered lockfiles.
* Remove deprecated environment files
* Switch to using a micromamba docker image
* Install git into the docker image.
* Use micromamba and unrendered multi-platform lockfile.
* Add main category to micromamba environment creation.
* Use conda-lock not base as env name
* Add a conda-lock setup for discussion.
* Move python-snappy into project.dependencies in pyproject.toml
* Remove sphinx-autoapi from pypi deps, and no longer required snappy-python
* Add linux-aarch64 platform back into conda-lock settings.
- Remove deprecated API_KEY_EIA envvar
- Add .gitignore in new environments directory
- Create Makefile and migrate tox.ini logic into it
- Replace spaces in Makefile with tabs
- Install pudl via pip install --no-deps to avoid contaminating
  the locked environment
- Move pip install and coverage XML logic from GHA workflow into
  the Makefile
- Increase the minimum versions of some dependencies.
- Move update-lockfile GHA workflow logic into Makefile
- Attempt to run slow tests in parallel using "wait" to prevent
  premature exit of the shell
* Use micromamba not conda in Dockerfile CMD, also use pip install --no-deps
* Use micromamba not conda in command passed to build container
* Use default mambauser rather than catalyst in docker container
* Remove --no-capture-output which isn't supported by micromamba. Is this a problem?
* Remove uninterpolated vars in .env and more --no-capture-output
* Separate ETL and pytest commands.
* Stop trying to run tests in parallel. Sigh.
* Add google cloud sdk to conda environment.
* Install Google Cloud SDK from conda-forge.
* Add back in the making of required directories. Oops.
* Attempt to have micromamba run pass through output
* Use prettier to standardize formatting of lockfiles.
* Add dagster (server startup) target to Makefile
* Update conda lockfile and rerender environment files
* Attempt to trigger update-lockfile when pyproject.toml is changed.
* Remove non-required upper bounds on dependency versions.
* Use correct branch name for update-lockfile trigger.
* Fix incorrect nbformat minimum version
* Update Makefile in response to PR feedback
* Remove dagster-postgres version to avoid PyPI conda-forge conflict.
* Break up and distribute the nuke target.
* Resolve issues with pandera dependency. Reorganize makefile.
* Add triggers and commit/PR for workflow_dispatch, pull_request, schedule
  and set appropriate GITHUB_REF values for each case.
* Use push instead of pull_request to trigger on path. This avoids re-locking
  the dependencies every single time you push to a PR that had a change
  to pyproject.toml *somewhere* in it.
* Also trigger based on path if .github/workflows/update-lockfile.yml changes.
* Update conda-lock.yml and rendered conda environment files.
* Move previous dev, test, and datasette optional dependencies into
  the required dependencies to simplify application installation.
* Test make nuke; parallelize --live-dbs tests
* Move prettier into conda-only dependencies
* Update conda-lock.yml and rendered conda environment files.
* Remove action test file hashlog
* Remove merge markers.
* Remove transitive astroid dependency that's now correctly included in solve.
* Use the real immature library version of dagster-postgres (0.21.6)
  rather than the accidentally packaged 1.5.6 version found in conda.
  We'll need to keep an eye out for when dagster-postgres graduates
  to the stable versioning and update it. This is a bit of a mess
  because of some broken automation in the conda packaging for dagster
  which has now been fixed.
* Update "make pudl" to remove the old PUDL DB and reinitialize with
  alembic, rather than writing to the DB that already exists.
* Fixed some groupby.agg() deprecation warnings.
* Fix dagster-postgres version (again).
* Update username in path to settings file
* Avoid bugs in ferc_to_sqlite --clobber; don't use cache_dir for pip install.
* Make FERC extraction output removal more specific.
* Bump numpy and numba minimum versions.
* Bump black version in pre-commit
* Bump ruff pre-commit hook version
* Rename tox-pytest and update-lockfile workflows.
* Make scheduled PRs against dev rather than conda-lockfile branch.
* Update to pyarrow 14 and grpcio 1.59 b/c security
* Update release notes
@zaneselvans merged commit dacfcb6 into dev on Nov 14, 2023
9 checks passed
@zaneselvans deleted the conda-lockfile branch on November 14, 2023 at 18:41
zaneselvans added a commit that referenced this pull request Nov 15, 2023
(The commit message repeats the full commit list above, followed by:)
* Add CI targets in Makefile. Docs cleanup.
* Update conda-lock.yml and rendered conda environment files.

---------

Co-authored-by: zaneselvans <zaneselvans@users.noreply.github.com>
zaneselvans added a commit that referenced this pull request Nov 15, 2023
* Set up reproducible Python environments with conda-lock (#2968)

(The full message of the squashed #2968 commit above is repeated here, followed by:)

* Revert temporary change to .pre-commit.yml

* Update conda-lock.yml and rendered conda environment files.

* Remove no longer required create args.

---------

Co-authored-by: zaneselvans <zaneselvans@users.noreply.github.com>
Labels: dependencies (Pull requests that update a dependency file)
Projects: Archived in project
Development: Successfully merging this pull request may close these issues:
  - Use lockfile to specify a reproducible python environment
3 participants