
Cinco de Mayo 🇲🇽 #1616

Merged: 130 commits merged into main, May 6, 2022
Conversation

@zaneselvans (Member) commented May 5, 2022

Just ran through all of the data validations and re-deployed Datasette, so it seemed like we ought to merge into main.

katie-lamb and others added 30 commits February 8, 2022 10:48
Bumps [actions/cache](https://github.com/actions/cache) from 2.1.7 to 3.
- [Release notes](https://github.com/actions/cache/releases)
- [Commits](actions/cache@v2.1.7...v3)

---
updated-dependencies:
- dependency-name: actions/cache
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
…actions/dev/actions/cache-3

Bump actions/cache from 2.1.7 to 3
* Rather than using Hive partitioning, output individual state-year files into a single
  directory. This is easier to read, just as fast, avoids accidentally writing two
  copies of the data into the same directory hierarchy, and will be good for per-file
  caching, which intake-parquet doesn't support in the way we need right now (#1495).
* Get rid of the nested generators that were being used to coordinate the EPA CEMS ETL
  process. Now it just loops over the requested years & states, and goes all the way
  through the ETL process for each year-state combo before moving on to the next step.
* There was basically nothing left for the `pudl.load.df_to_parquet()` function to do,
  so I removed it.
* Fix warnings about mixed dtypes (#1126) and timezone conversions (#1423).

Closes #1126
Closes #1423
Closes #1495
Closes #1493
Changed the `pudl.metadata.classes.Resource.to_pyarrow()` method to include a simple
metadata dictionary mapping column names to their human readable text descriptions.

Closes #1378
Update to pip 22, setuptools 61. Add nbconvert to pudl-dev
Updates the requirements on [twine](https://github.com/pypa/twine) to permit the latest version.
- [Release notes](https://github.com/pypa/twine/releases)
- [Changelog](https://github.com/pypa/twine/blob/main/docs/changelog.rst)
- [Commits](pypa/twine@3.3.0...4.0.0)

---
updated-dependencies:
- dependency-name: twine
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
…/twine-gte-3.3-and-lt-5.0

Update twine requirement from ~=3.3 to >=3.3,<5.0
katie-lamb and others added 19 commits April 29, 2022 17:27
I tried adding the PM code into the backfilling. This resulted in
*sliiightly* fewer tech-types (a grand total of 1609 records), all from
generators that have no PM code. After testing the staging, it felt
better to do the backfilling w/ the completely consistent map between
PM/ESC:Tech first because it feels pretty conservative, and then come back
in w/ the bfill w/o the PM code.
Refactor labeling of true granularities with plant part to generator match function
Re-jigger backfilling of `technology_description` & make `prime_mover_code` an annually harvested column
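The two-pass backfill described above could be sketched in pandas as follows. The grouping keys are an assumption on my part (the commit doesn't spell them out), and the column names simply follow PUDL conventions.

```python
import pandas as pd


def backfill_tech_description(df: pd.DataFrame) -> pd.DataFrame:
    """Two-pass fill of technology_description (grouping keys are assumed)."""
    df = df.sort_values(["plant_id_eia", "generator_id", "report_date"])
    # Pass 1: conservative fill, keyed on the prime mover code too, so values
    # only propagate within a fully consistent PM/tech-type combination.
    df["technology_description"] = df.groupby(
        ["plant_id_eia", "generator_id", "prime_mover_code"], dropna=False
    )["technology_description"].transform(lambda s: s.bfill().ffill())
    # Pass 2: fill the remaining gaps (e.g. generators with no PM code)
    # without the prime mover code in the group key.
    df["technology_description"] = df.groupby(
        ["plant_id_eia", "generator_id"]
    )["technology_description"].transform(lambda s: s.bfill().ffill())
    return df
```

Running the conservative pass first means the looser generator-only fill only ever touches records the strict mapping couldn't resolve.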
Remove some seldom used dependencies from pudl-dev environment.yml

@zaneselvans zaneselvans marked this pull request as draft May 5, 2022 07:49
codecov bot commented May 5, 2022

Codecov Report

Merging #1616 (afebf08) into main (d3f69a8) will increase coverage by 0.0%.
The diff coverage is 98.7%.

@@          Coverage Diff          @@
##            main   #1616   +/-   ##
=====================================
  Coverage   84.0%   84.0%           
=====================================
  Files         65      65           
  Lines       7204    7181   -23     
=====================================
- Hits        6055    6039   -16     
+ Misses      1149    1142    -7     
Impacted Files Coverage Δ
src/pudl/analysis/epa_crosswalk.py 96.4% <ø> (ø)
src/pudl/analysis/mcoe.py 92.4% <ø> (ø)
src/pudl/analysis/service_territory.py 29.0% <ø> (ø)
src/pudl/analysis/spatial.py 93.6% <ø> (ø)
src/pudl/analysis/timeseries_cleaning.py 88.6% <ø> (ø)
src/pudl/cli.py 67.5% <ø> (ø)
src/pudl/convert/censusdp1tract_to_sqlite.py 91.6% <ø> (ø)
src/pudl/convert/ferc1_to_sqlite.py 62.1% <ø> (ø)
src/pudl/convert/metadata_to_rst.py 69.5% <ø> (ø)
src/pudl/convert/metadata_to_yml.py 100.0% <ø> (ø)
... and 48 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d3f69a8...afebf08.

Dask introduced breaking changes to `dd.read_parquet()` in v 2022.4.2:
https://docs.dask.org/en/stable/changelog.html#v2022-04-2

However, we didn't catch this when it happened because it's only a
problem when there's more than one row-group. So now I'm processing
2019-2020 data for both ID and ME (two of the smallest states).  I was
already processing 4 state-years when testing the parallelized CEMS ETL
so this isn't any more data than we were pulling before.

Also fixed a bug that the parallelized ETL test introduced, where it was
writing the partitioned output to the same directory as the other CEMS
output, which might have been creating weird behavior.

I put more stringent bounds around the Dask versions in `setup.py` so
we'll at least be notified by @dependabot if a version changes now.

Moved the CI environment file under tests, since it's more related to
them than "devtools" and in preparation for configuring `tox-conda` to
use conda environments to (partly) manage the tox builds:

https://github.com/tox-dev/tox-conda
Fix breakage resulting from dask v2022.4.2
@zaneselvans zaneselvans changed the title Draft v0.7.0 Release Cinco de Mayo 🇲🇽 May 5, 2022
@zaneselvans zaneselvans marked this pull request as ready for review May 5, 2022 23:54
@zaneselvans zaneselvans merged commit 69f66c5 into main May 6, 2022
4 participants