Cinco de Mayo 🇲🇽 #1616
Conversation
For more information, see https://pre-commit.ci
Bumps [actions/cache](https://github.com/actions/cache) from 2.1.7 to 3.
- [Release notes](https://github.com/actions/cache/releases)
- [Commits](actions/cache@v2.1.7...v3)

---
updated-dependencies:
- dependency-name: actions/cache
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
…actions/dev/actions/cache-3 Bump actions/cache from 2.1.7 to 3
* Rather than using Hive partitioning, output individual state-year files into a single directory. This is easier to read, just as fast, avoids accidentally writing two copies of the data into the same directory hierarchy, and will be good for per-file caching, which intake-parquet doesn't support in the way we need right now (#1495).
* Get rid of the nested generators that were being used to coordinate the EPA CEMS ETL process. Now it just loops over the requested years & states, and goes all the way through the ETL process for each year-state combo before moving on to the next step.
* There was basically nothing left for the `pudl.load.df_to_parquet()` function to do, so I removed it.
* Fix warnings about mixed dtypes (#1126) and timezone conversions (#1423).

Closes #1126
Closes #1423
Closes #1495
Closes #1493
Changed the `pudl.metadata.classes.Resource.to_pyarrow()` method to include a simple metadata dictionary mapping column names to their human readable text descriptions. Closes #1378
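A minimal illustration of the idea, using made-up column names and the Arrow convention that schema-level metadata is a mapping of bytes to bytes (this is not the actual `Resource.to_pyarrow()` code):

```python
import json

# Hypothetical field-name -> description mapping, standing in for the
# Resource's field definitions.
fields = {
    "plant_id_eia": "EIA-assigned plant identification number.",
    "net_generation_mwh": "Net electricity generation in megawatt-hours.",
}

# Arrow/Parquet schema metadata values must be bytes, so one simple
# convention is to store the whole dict as a single JSON-encoded entry.
schema_metadata = {
    b"column_descriptions": json.dumps(fields).encode("utf-8"),
}

# A downstream reader can recover the descriptions from the file's schema:
decoded = json.loads(schema_metadata[b"column_descriptions"])
```

In PyArrow the resulting dict would be attached when constructing the schema, so the descriptions travel with the Parquet files themselves.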
Update to pip 22, setuptools 61. Add nbconvert to pudl-dev
Updates the requirements on [twine](https://github.com/pypa/twine) to permit the latest version.
- [Release notes](https://github.com/pypa/twine/releases)
- [Changelog](https://github.com/pypa/twine/blob/main/docs/changelog.rst)
- [Commits](pypa/twine@3.3.0...4.0.0)

---
updated-dependencies:
- dependency-name: twine
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
…/twine-gte-3.3-and-lt-5.0 Update twine requirement from ~=3.3 to >=3.3,<5.0
I tried adding the PM code into the backfilling. This resulted in *sliiightly* fewer tech-types (a grand total of 1609 records), all from generators that have no PM code. After testing the staging, it felt better to do the backfilling with the completely consistent map between PM/ESC:Tech first, because that feels pretty conservative, and then come back in with the bfill without the PM code.
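The two-pass approach can be sketched in plain Python as follows; the maps and record fields here are illustrative (the real work happens on pandas tables):

```python
def backfill_tech_description(records, pm_esc_map, esc_map):
    """Two-pass backfill of technology_description (hypothetical sketch).

    Pass 1 (conservative): fill only where the (prime_mover_code,
    energy_source_code) pair has a completely consistent mapping to a
    single technology.
    Pass 2: for records still missing a value -- e.g. generators with no
    PM code at all -- fall back to a map keyed on energy source alone.
    """
    for rec in records:
        if rec.get("technology_description"):
            continue  # already populated; leave it alone
        key = (rec.get("prime_mover_code"), rec.get("energy_source_code"))
        if key in pm_esc_map:
            # Pass 1: consistent PM/ESC -> Tech map.
            rec["technology_description"] = pm_esc_map[key]
        elif rec.get("energy_source_code") in esc_map:
            # Pass 2: bfill without the PM code.
            rec["technology_description"] = esc_map[rec["energy_source_code"]]
    return records
```

Ordering the passes this way means the less specific energy-source-only fill can never overwrite a value the stricter PM/ESC map would have assigned.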
Refactor labeling of true granularities with plant part to generator match function
Re-jigger backfilling `technology_description` & make `prime_mover_code` an annually harvested column
Remove some seldom-used dependencies from the pudl-dev environment.yml
Codecov Report
```diff
@@          Coverage Diff          @@
##           main    #1616   +/-  ##
====================================
  Coverage   84.0%    84.0%
====================================
  Files         65       65
  Lines       7204     7181    -23
====================================
- Hits        6055     6039    -16
+ Misses      1149     1142     -7
```
Continue to review full report at Codecov.
Dask introduced breaking changes to `dd.read_parquet()` in v2022.4.2: https://docs.dask.org/en/stable/changelog.html#v2022-04-2 However, we didn't catch this when it happened because it's only a problem when there's more than one row-group. So now I'm processing 2019-2020 data for both ID and ME (two of the smallest states). I was already processing 4 state-years when testing the parallelized CEMS ETL, so this isn't any more data than we were pulling before.

Also fixed a bug that the parallelized ETL test introduced, where it was writing the partitioned output to the same directory as the other CEMS output, which might have been creating weird behavior.

I put more stringent bounds around the Dask versions in `setup.py` so we'll at least be notified by @dependabot if a version changes now.

Moved the CI environment file under `tests/`, since it's more related to them than "devtools", and in preparation for configuring `tox-conda` to use conda environments to (partly) manage the tox builds: https://github.com/tox-dev/tox-conda
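The kind of pinning described looks something like this; the exact bounds below are illustrative, not the actual values in `setup.py`:

```python
# setup.py (fragment) -- hypothetical version bounds chosen to exclude the
# breaking dask release while still letting dependabot flag new versions
# (which would then fail CI loudly instead of silently misbehaving):
install_requires = [
    "dask[dataframe]>=2021.8,<2022.4.2",
]
```

The point of an explicit upper bound is that a future major or breaking release shows up as a dependabot PR against the constraint, rather than silently flowing into a fresh environment build.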
Fix breakage resulting from dask v2022.4.2
Just ran through all of the data validations and re-deployed Datasette, so it seemed like we ought to merge into `main`.