Cinco de Mayo 🇲🇽 #1616
Conversation
For more information, see https://pre-commit.ci
Bumps [actions/cache](https://github.com/actions/cache) from 2.1.7 to 3.
- [Release notes](https://github.com/actions/cache/releases)
- [Commits](actions/cache@v2.1.7...v3)

---
updated-dependencies:
- dependency-name: actions/cache
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
…actions/dev/actions/cache-3 Bump actions/cache from 2.1.7 to 3
* Rather than using Hive partitioning, output individual state-year files into a single directory. This is easier to read, just as fast, avoids accidentally writing two copies of the data into the same directory hierarchy, and will be good for per-file caching, which intake-parquet doesn't support in the way we need right now (#1495).
* Get rid of the nested generators that were being used to coordinate the EPA CEMS ETL process. Now it just loops over the requested years & states, and goes all the way through the ETL process for each year-state combo before moving on to the next step.
* There was basically nothing left for the `pudl.load.df_to_parquet()` function to do, so I removed it.
* Fix warnings about mixed dtypes (#1126) and timezone conversions (#1423).

Closes #1126
Closes #1423
Closes #1495
Closes #1493
Changed the `pudl.metadata.classes.Resource.to_pyarrow()` method to include a simple metadata dictionary mapping column names to their human readable text descriptions. Closes #1378
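A minimal illustration of the idea, using made-up column names and the Arrow convention that schema-level metadata is a mapping of bytes to bytes (this is not the actual `Resource.to_pyarrow()` code):

```python
import json

# Hypothetical field-name -> description mapping, standing in for the
# Resource's field definitions.
fields = {
    "plant_id_eia": "EIA-assigned plant identification number.",
    "net_generation_mwh": "Net electricity generation in megawatt-hours.",
}

# Arrow/Parquet schema metadata values must be bytes, so one simple
# convention is to store the whole dict as a single JSON-encoded entry.
schema_metadata = {
    b"column_descriptions": json.dumps(fields).encode("utf-8"),
}

# A downstream reader can recover the descriptions from the file's schema:
decoded = json.loads(schema_metadata[b"column_descriptions"])
```

In PyArrow the resulting dict would be attached when constructing the schema, so the descriptions travel with the Parquet files themselves.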
Update to pip 22, setuptools 61. Add nbconvert to pudl-dev
Updates the requirements on [twine](https://github.com/pypa/twine) to permit the latest version.
- [Release notes](https://github.com/pypa/twine/releases)
- [Changelog](https://github.com/pypa/twine/blob/main/docs/changelog.rst)
- [Commits](pypa/twine@3.3.0...4.0.0)

---
updated-dependencies:
- dependency-name: twine
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
…/twine-gte-3.3-and-lt-5.0 Update twine requirement from ~=3.3 to >=3.3,<5.0
I tried adding the PM code into the backfilling. This resulted in *sliiightly* fewer tech-types (a grand total of 1609 records), all from generators that have no PM code. After testing the staging, it felt better to do the backfilling with the completely consistent map between PM/ESC:Tech first, because that feels pretty conservative, and then come back in with the bfill without the PM code.
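The two-pass approach can be sketched in plain Python as follows; the maps and record fields here are illustrative (the real work happens on pandas tables):

```python
def backfill_tech_description(records, pm_esc_map, esc_map):
    """Two-pass backfill of technology_description (hypothetical sketch).

    Pass 1 (conservative): fill only where the (prime_mover_code,
    energy_source_code) pair has a completely consistent mapping to a
    single technology.
    Pass 2: for records still missing a value -- e.g. generators with no
    PM code at all -- fall back to a map keyed on energy source alone.
    """
    for rec in records:
        if rec.get("technology_description"):
            continue  # already populated; leave it alone
        key = (rec.get("prime_mover_code"), rec.get("energy_source_code"))
        if key in pm_esc_map:
            # Pass 1: consistent PM/ESC -> Tech map.
            rec["technology_description"] = pm_esc_map[key]
        elif rec.get("energy_source_code") in esc_map:
            # Pass 2: bfill without the PM code.
            rec["technology_description"] = esc_map[rec["energy_source_code"]]
    return records
```

Ordering the passes this way means the less specific energy-source-only fill can never overwrite a value the stricter PM/ESC map would have assigned.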
Refactor labeling of true granularities with plant part to generator match function
Re-jigger backfilling `technology_description` & make `prime_mover_code` an annually harvested column
Remove some seldom-used dependencies from the pudl-dev environment.yml
Codecov Report
```diff
@@          Coverage Diff          @@
##           main    #1616   +/-  ##
====================================
  Coverage   84.0%    84.0%
====================================
  Files         65       65
  Lines       7204     7181    -23
====================================
- Hits        6055     6039    -16
+ Misses      1149     1142     -7
```
Continue to review full report at Codecov.
Dask introduced breaking changes to `dd.read_parquet()` in v2022.4.2: https://docs.dask.org/en/stable/changelog.html#v2022-04-2 However, we didn't catch this when it happened because it's only a problem when there's more than one row-group. So now I'm processing 2019-2020 data for both ID and ME (two of the smallest states). I was already processing 4 state-years when testing the parallelized CEMS ETL, so this isn't any more data than we were pulling before.

Also fixed a bug that the parallelized ETL test introduced, where it was writing the partitioned output to the same directory as the other CEMS output, which might have been creating weird behavior.

I put more stringent bounds around the Dask versions in `setup.py` so we'll at least be notified by @dependabot if a version changes now.

Moved the CI environment file under `tests/`, since it's more related to them than "devtools", and in preparation for configuring `tox-conda` to use conda environments to (partly) manage the tox builds: https://github.com/tox-dev/tox-conda
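The kind of pinning described looks something like this; the exact bounds below are illustrative, not the actual values in `setup.py`:

```python
# setup.py (fragment) -- hypothetical version bounds chosen to exclude the
# breaking dask release while still letting dependabot flag new versions
# (which would then fail CI loudly instead of silently misbehaving):
install_requires = [
    "dask[dataframe]>=2021.8,<2022.4.2",
]
```

The point of an explicit upper bound is that a future major or breaking release shows up as a dependabot PR against the constraint, rather than silently flowing into a fresh environment build.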
Fix breakage resulting from dask v2022.4.2
Just ran through all of the data validations and re-deployed Datasette, so it seemed like we ought to merge into `main`.