Update CEMS partitions to handle year-quarter files #3096

e-belfer · 2023-11-29T19:43:26Z

Make CEMS quarterly and add 2023 data! See #2973 for detailed task list.

This PR:

- Updates `pudl.extract.epacems` to read in quarterly data
- Updates `pudl.transform.epacems` to handle quarterly data
- Updates `pudl.etl.epacems_assets` to write year/state row groups from quarterly parquet files to the monolithic output
- Reduces dagster concurrency of `hourly_emissions_epacems.process_single_year()` to prevent OOM issues (currently to 2 threads)
- Updates integration tests to handle new EPA CEMS format
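
For readers unfamiliar with the partitioning change, here is a minimal sketch of enumerating year-quarter partitions and splitting a label back into its parts. The `"2023q1"` label format and both helper names are assumptions made for illustration; they are not taken from `pudl.extract.epacems`.

```python
import re

# Assumed label format: four-digit year, "q", quarter 1-4 (e.g. "2023q1").
YEAR_QUARTER_RE = re.compile(r"(\d{4})q([1-4])")


def year_quarters(first_year: int, last_year: int, last_quarter: int = 4) -> list[str]:
    """List labels from first_year q1 through last_year's last_quarter."""
    labels = []
    for year in range(first_year, last_year + 1):
        # Only the final year may be incomplete (e.g. data published through q2).
        final = last_quarter if year == last_year else 4
        labels.extend(f"{year}q{quarter}" for quarter in range(1, final + 1))
    return labels


def split_year_quarter(label: str) -> tuple[int, int]:
    """Split a label such as "2023q3" into (2023, 3)."""
    match = YEAR_QUARTER_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"Not a year-quarter label: {label!r}")
    return int(match.group(1)), int(match.group(2))
```

Keeping the year and quarter in a single partition key means one string identifies each input file, while the year can still be recovered when grouping rows for the yearly output.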

PR Checklist

- Merge the most recent version of the branch you are merging into (probably dev).
- All CI checks are passing. Run tests locally to debug failures.
- Make sure you've included good docstrings.
- For major data coverage & analysis changes, run data validation tests.
- Update the release notes and reference the PR and related issues.
- Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

codecov · 2023-12-06T21:26:01Z

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (cf09bda) 92.6% compared to head (bbef1e2) 92.6%.
Report is 1 commit behind head on dev.

| Files | Patch % | Lines |
|---|---|---|
| src/pudl/extract/epacems.py | 84.2% | 3 Missing ⚠️ |
| src/pudl/settings.py | 88.9% | 1 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             dev   #3096     +/-   ##
=======================================
- Coverage   92.6%   92.6%   -0.0%     
=======================================
  Files        134     134             
  Lines      12577   12566     -11     
=======================================
- Hits       11648   11634     -14     
- Misses       929     932      +3

zaneselvans · 2023-12-12T03:47:13Z

@cmgosnell

I materialized the EPA CEMS assets locally so I could debug the test failures and it took nearly 2 hours, even with nothing else running.
It's not immediately clear why it took so long though. It wasn't fully saturating either my CPUs or using all the memory.
This is a very large performance regression. I think we need to put some effort into speeding this back up, or at least understanding why it's so slow. You said it "only" took 40 minutes for you, right?
The integration / output tests were easy to fix and mostly had to do with changes to the EpaCemsSettings class, and the new expected behavior of the output function, which won't complain if you request state/year combos that don't exist. It just won't give you back the data it doesn't have.

Edit: I pre-populated the datastore before starting the CEMS materialization last night, so it should not have been downloading anything. I checked the timestamps on the files this morning and they were all from the same time, before I ran the ETL. Also I've re-run the CEMS asset materializations this morning and it's looking like it'll take 2 hours again. Not sure what the difference is between my system and yours though.
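
The new output behavior mentioned above (silently skipping state/year combinations that don't exist) can be sketched roughly as follows. The function name, its arguments, and the example inventory are all hypothetical; this is not the actual PUDL output function.

```python
from itertools import product


def select_available(
    requested_states: list[str],
    requested_years: list[int],
    available: set[tuple[int, str]],
) -> list[tuple[int, str]]:
    """Return the requested (year, state) pairs that exist; skip the rest quietly."""
    wanted = product(sorted(requested_years), sorted(requested_states))
    return [pair for pair in wanted if pair in available]


# Hypothetical inventory of (year, state) row groups present in the output.
available = {(2021, "CO"), (2022, "CO"), (2022, "ID")}
print(select_available(["CO", "ID"], [2021, 2022, 2023], available))
# prints [(2021, 'CO'), (2022, 'CO'), (2022, 'ID')]
```

The trade-off is that a typo in a state code or an out-of-range year returns less data instead of raising, so callers who need strict coverage would have to compare the result against what they requested.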

cmgosnell · 2023-12-12T14:21:46Z

Thanks for fixing that CI failure!

It's curious that it took you 2 hours. Eeee. It has consistently taken my computer ~40 minutes. When it needed to download a new archive, it took 58. That is still ~2x the time of the previous setup. I 100% agree that we should take some time to make it faster, but I don't think we should delay integrating this until we do that.

zaneselvans · 2023-12-12T15:28:02Z

Another weird thing I'm seeing locally when comparing my two sets of outputs is that the new ETL produces significantly smaller outputs, despite including a little bit more data.

4.8G	epacems_quarterly/
5.7G	hourly_emissions_epacems/

5.1G	hourly_emissions_epacems.parquet
5.6G	hourly_emissions_epacems.parquet (from nightly build)
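
To put rough numbers on "significantly smaller", here is a quick calculation from the sizes listed above. It assumes the smaller figure in each pair is the new ETL's output (the listing only labels the nightly build explicitly) and treats the rounded `G` values as exact.

```python
def percent_smaller(new_gb: float, old_gb: float) -> float:
    """Relative size reduction of new versus old, in percent."""
    return 100 * (old_gb - new_gb) / old_gb


# Partitioned directories: 4.8G versus 5.7G.
print(f"directories: {percent_smaller(4.8, 5.7):.1f}% smaller")
# Monolithic Parquet file: 5.1G locally versus 5.6G from the nightly build.
print(f"monolithic:  {percent_smaller(5.1, 5.6):.1f}% smaller")
```

Under those assumptions the reduction is about 16% for the directories and about 9% for the single file.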

zaneselvans · 2023-12-12T16:25:59Z

Hmm. Using the macOS Activity Monitor (rather than btop) I see 10 python3.11 processes, each of which claims to have 20 threads and appears to be using multiple GB of memory. That means a lot of it is spilling onto swap on disk, which would slow things down a lot.

In btop the same processes show up, but most of them only have a few hundred MB of memory attributed to them and overall memory usage is supposedly just 16GB. However, my computer is super laggy, as if memory usage is maxed out.

zaneselvans · 2023-12-12T16:49:00Z

It seems that if you select the epacems assets using the "Asset groups" sidebar in the Dagster UI, the concurrency configuration that tells it only to transform 2 years at a time is not used.

By contrast, if you select the etl_full job and then search for or graphically select the EPA CEMS assets, the concurrency limit configured in pudl/etl/__init__.py is used.

I feel like the configs are flaky in general, and have run into issues with them not getting updated when the settings files change in the context of the ferc_to_sqlite jobs too. How can we make them more robust?
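
For context, a tag-based concurrency limit for Dagster's multiprocess executor is usually expressed as run config shaped roughly like the dictionary below. This is a sketch, not the contents of `pudl/etl/__init__.py`: the `datasource`/`epacems` tag, the helper name, and the `max_concurrent` value are assumptions, and the nesting under `multiprocess` applies to the default executor as I understand it (a job that sets `multiprocess_executor` explicitly puts these keys directly under `config`).

```python
def execution_config(max_concurrent: int, tag_limits: list[dict]) -> dict:
    """Build executor run config as a plain dict (no Dagster import needed)."""
    return {
        "execution": {
            "config": {
                "multiprocess": {
                    "max_concurrent": max_concurrent,
                    "tag_concurrency_limits": tag_limits,
                }
            }
        }
    }


# Hypothetical tag: steps tagged datasource=epacems run at most 2 at a time.
epacems_limit = {"key": "datasource", "value": "epacems", "limit": 2}
config = execution_config(max_concurrent=4, tag_limits=[epacems_limit])
```

A dictionary like this is attached to a specific job definition (for example through the `config` argument of `define_asset_job`). If the "Asset groups" view launches its runs outside that job, the run would not inherit the limit, which would be consistent with the behavior described above.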

e-belfer added 2 commits November 24, 2023 13:44

WIP move to year-quarter partitions

3946f71

WIP change extraction partitions and etl settings, fix unit tests

e5d3ab4

e-belfer self-assigned this Nov 29, 2023

e-belfer linked an issue Nov 29, 2023 that may be closed by this pull request

CEMS: Repartition extraction process and parquet files. #2973

Closed

14 tasks

e-belfer changed the title ~~Refactor CEMS partitions to handle year-quarter files~~ Rejigger CEMS partitions to handle year-quarter files Nov 29, 2023

zaneselvans reviewed Nov 29, 2023

src/pudl/workspace/datastore.py (outdated, resolved)

e-belfer and others added 4 commits November 30, 2023 08:06

Resolve merge conflict and update settings_test

f0d3d03

Merge branch 'dev' into cems-quarterly

78f1992

Update conda-lock.yml and rendered conda environment files.

ec00457

Update integration tests to use quarter

77db673

e-belfer commented Dec 4, 2023

test/integration/epacems_test.py (outdated, resolved)

e-belfer and others added 9 commits December 4, 2023 15:47

Update DOI to production

a0f729c

Merge branch 'dev' into cems-quarterly

9320d8a

Merge branch 'cems-quarterly' of https://github.com/catalyst-cooperat…

748c09b

…ive/pudl into cems-quarterly

Fix EPACEMS integration test

8e01373

Update conda-lock.yml and rendered conda environment files.

58bffe3

Repartition row groups in monolith parquet, update integration test, …

610ef4c

…fix typos

Merge branch 'cems-quarterly' of https://github.com/catalyst-cooperat…

70c720a

…ive/pudl into cems-quarterly

Drop year from fast ETL and add concurrency limiting

64560bb

Merge branch 'dev' into cems-quarterly

e8c5542

e-belfer changed the title ~~Rejigger CEMS partitions to handle year-quarter files~~ Update CEMS partitions to handle year-quarter files Dec 6, 2023

e-belfer and others added 3 commits December 6, 2023 19:08

Update conda-lock.yml and rendered conda environment files.

2f567cf

Drop concurrency further and update integration test to use 2022 data

9607248

Update conda-lock.yml and rendered conda environment files.

441a9b5

e-belfer commented Dec 6, 2023

src/pudl/etl/epacems_assets.py (resolved)

e-belfer commented Dec 6, 2023

src/pudl/etl/epacems_assets.py (resolved)

e-belfer commented Dec 6, 2023

src/pudl/metadata/sources.py (outdated, resolved)

cmgosnell reviewed Dec 6, 2023

src/pudl/transform/epacems.py (outdated, resolved)

cmgosnell reviewed Dec 6, 2023

src/pudl/workspace/resource_cache.py (outdated, resolved)

cmgosnell and others added 12 commits December 11, 2023 11:31

Merge branch 'cems-quarterly' into cems-year_quarters

7197e64

address pr concerns

1815560

Merge branch 'cems-year_quarters' of github.com:catalyst-cooperative/…

5c7eaba

…pudl into cems-year_quarters

Merge branch 'dev' into cems-quarterly

1ac9e48

Merge branch 'cems-quarterly' of github.com:catalyst-cooperative/pudl…

51d9a03

… into cems-quarterly

Update conda-lock.yml and rendered conda environment files.

27c415d

Merge branch 'cems-quarterly' into cems-year_quarters

f33db7b

Update conda-lock.yml and rendered conda environment files.

55828de

Merge branch 'cems-quarterly' into cems-year_quarters

0db0d4a

Merge branch 'cems-year_quarters' of github.com:catalyst-cooperative/…

15c6069

…pudl into cems-year_quarters

Merge pull request #3139 from catalyst-cooperative/cems-year_quarters

98861eb

WIP Transition CEMS paritions to `year_quarter` from `year` and `quarter`

add release notes for quarterly cems

fbd4689

cmgosnell requested a review from zaneselvans December 11, 2023 21:26

Fix some comments/docstrings; clarify Zenodo RECID regex

84c5330

zaneselvans approved these changes Dec 11, 2023

src/pudl/workspace/datastore.py (resolved)

zaneselvans added the epacems (Integration and analysis of the EPA CEMS dataset.) and new-data (Requests for integration of new data.) labels Dec 11, 2023

zaneselvans added 2 commits December 11, 2023 16:35

Remove comment about epacems DOI being draft archive.

1c3e47a

Adjust epacems output tests to reflect quarterly partitions.

003f1d7

Merge branch 'dev' into cems-quarterly

bbef1e2

add tests to cover a few uncovered lines

bfe6203

cmgosnell enabled auto-merge December 12, 2023 15:04

cmgosnell merged commit f020a07 into dev Dec 12, 2023
15 checks passed

cmgosnell deleted the cems-quarterly branch December 12, 2023 16:37

bendnorman mentioned this pull request Dec 14, 2023

Add EPA CEMS concurrency limit to pudl_etl #3160

Merged
