Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update CEMS partitions to handle year-quarter files #3096

Merged
merged 44 commits into from Dec 12, 2023
Merged

Conversation

e-belfer
Copy link
Member

@e-belfer e-belfer commented Nov 29, 2023

Make CEMS quarterly and add 2023 data! See #2973 for detailed task list.

This PR:

  • Updates pudl.extract.epacems to read in quarterly data
  • Updates pudl.transform.epacems to handle quarterly data
  • Updates pudl.etl.epacems_assets to write year/state row groups from quarterly parquet files to the monolithic output
  • Reduces dagster concurrency of hourly_emissions_epacems.process_single_year() to prevent OOM issues (currently to 2 threads)
  • Updates integration tests to handle new EPA CEMS format

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@e-belfer e-belfer self-assigned this Nov 29, 2023
@e-belfer e-belfer linked an issue Nov 29, 2023 that may be closed by this pull request
14 tasks
@e-belfer e-belfer changed the title Refactor CEMS partitions to handle year-quarter files Rejigger CEMS partitions to handle year-quarter files Nov 29, 2023
@e-belfer e-belfer changed the title Rejigger CEMS partitions to handle year-quarter files Update CEMS partitions to handle year-quarter files Dec 6, 2023
src/pudl/metadata/sources.py Outdated Show resolved Hide resolved
Copy link

codecov bot commented Dec 6, 2023

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (cf09bda) 92.6% compared to head (bbef1e2) 92.6%.
Report is 1 commits behind head on dev.

Files Patch % Lines
src/pudl/extract/epacems.py 84.2% 3 Missing ⚠️
src/pudl/settings.py 88.9% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #3096     +/-   ##
=======================================
- Coverage   92.6%   92.6%   -0.0%     
=======================================
  Files        134     134             
  Lines      12577   12566     -11     
=======================================
- Hits       11648   11634     -14     
- Misses       929     932      +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@zaneselvans zaneselvans added epacems Integration and analysis of the EPA CEMS dataset. new-data Requests for integration of new data. labels Dec 11, 2023
@zaneselvans
Copy link
Member

zaneselvans commented Dec 12, 2023

@cmgosnell

  • I materialized the EPA CEMS assets locally so I could debug the test failures and it took nearly 2 hours, even with nothing else running.
  • It's not immediately clear why it took so long though. It wasn't fully saturating either my CPUs or using all the memory.
  • This is a very large performance regression. I think we need to put some effort into speeding this back up, or at least understanding why it's so slow. You said it "only" took 40 minutes for you, right?
  • The integration / output tests were easy to fix and mostly had to do with changes to the EpaCemsSettings class, and the new expected behavior of the output function, which won't complain if you request state/year combos that don't exist. It just won't give you back the data it doesn't have.

Edit: I pre-populated the datastore before starting the CEMS materialization last night, so it should not have been downloading anything. I checked the timestamps on the files this morning and they were all from the same time, before I ran the ETL. Also I've re-run the CEMS asset materializations this morning and it's looking like it'll take 2 hours again. Not sure what the difference is between my system and yours though.

@cmgosnell
Copy link
Member

thanks for fixing that ci failure!

Its curious that it took you 2 hours. eeee. it has consistently taken my computer ~40 minutes. When it needed to download a new archive, it took 58. Which is still ~2x time from the previous setup. I 100% agree that we should take some time to make it faster, but I don't think we should delay integrating this before we do that.

@zaneselvans
Copy link
Member

Another weird thing that I'm seeing locally comparing my two sets of outputs is the new ETL produces significantly smaller outputs, despite including a little bit more data.

4.8G	epacems_quarterly/
5.7G	hourly_emissions_epacems/
5.1G	hourly_emissions_epacems.parquet
5.6G	hourly_emissions_epacems.parquet (from nightly build)

@zaneselvans
Copy link
Member

zaneselvans commented Dec 12, 2023

Hmm. Using the MacOS Activity Monitor (rather than btop) I see 10 python3.11 processes each of which claims to have 20 threads, and appears to be using multiple GB of memory, which means a lot of it is spilling onto swap on disk which would slow things down a lot.

In btop the same processes show up, but most of them only have a few hundred MB of memory attributed to them and overall memory usage is supposedly just 16GB. However, my computer is super laggy, as if memory usage is maxed out.

@cmgosnell cmgosnell merged commit f020a07 into dev Dec 12, 2023
15 checks passed
@cmgosnell cmgosnell deleted the cems-quarterly branch December 12, 2023 16:37
@zaneselvans
Copy link
Member

It seems that if you select the epacems assets using the "Asset groups" sidebar in the Dagster UI, the concurrency configuration that tells it only to transform 2 years at a time is not used.

While if you select the etl_full job and then search for or graphically select the EPA CEMS assets, the concurrency limitation configuration found in pudl/etl/__init__.py is used.

I feel like the configs are flaky in general, and have run into issues with them not getting updated when the settings files change in the context of the ferc_to_sqlite jobs too. How can we make them more robust?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epacems Integration and analysis of the EPA CEMS dataset. new-data Requests for integration of new data.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

CEMS: Repartition extraction process and parquet files.
3 participants