
Parallelize Dagster processing of EPA CEMS #2376

Closed · 3 tasks · Tracked by #2386 · Fixed by #2472

zaneselvans opened this issue Mar 9, 2023 · 1 comment
Labels: dagster, epacems, inframundo, parquet, performance


zaneselvans commented Mar 9, 2023

Right now, EPA CEMS is responsible for about 80% of the overall ETL runtime in Dagster (more than 60 min out of 78 min). It's also highly amenable to parallelization, and we need to figure out how to do this kind of asset partitioning well for other data too:

[Screenshot: Dagster run timeline showing EPA CEMS dominating the ETL runtime]

Currently we generate two versions of the data: one partitioned into separate files by state and year, and one with all the data in a single file (though it is internally partitioned into many row groups, each representing a unique state-year combo). See #2354. We only process the data once, then write it out to both destinations (since writing it out is fast).
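For reference, the monolithic file's internal row groups can be produced with something like this pyarrow sketch (names are illustrative, not the actual PUDL code); each state-year chunk lands as its own row group in the single output file:

```python
import pyarrow as pa
import pyarrow.parquet as pq


def write_monolithic_parquet(state_year_dfs, path="epacems.parquet"):
    """Write one Parquet file with a row group per state-year chunk.

    state_year_dfs: an iterable of pandas DataFrames sharing a schema,
    one per (state, year) combination (hypothetical shape of the data).
    """
    writer = None
    for df in state_year_dfs:
        table = pa.Table.from_pandas(df)
        if writer is None:
            writer = pq.ParquetWriter(path, table.schema)
        # Each write_table() call produces its own row group(s) in the file.
        writer.write_table(table)
    if writer is not None:
        writer.close()
```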

While having state-year file partitions / row groups is useful for making efficient queries, the state-year outputs are probably too small to be efficient work units for parallelizing the data processing, and they vary wildly in size (TX has way more data than VT). However, each year of data is about the same size, small enough to fit in memory, and there are around 30 years of data in total. The nightly build machine has 16 cores, and our laptops often have 8-12 cores, so 30 work units would fully utilize them without crazy per-chunk overhead.

@bendnorman spent some time playing with Dagster's asset partitioning a while ago, but for some reason it wasn't totally straightforward. I think maybe it was too focused on backfills / time series? But maybe we could structure this as a time series with annual granularity?
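For the annual approach, a minimal sketch of what a yearly-partitioned asset could look like (asset name, year range, and the placeholder body are assumptions for illustration, not PUDL's actual code):

```python
import pandas as pd
from dagster import OpExecutionContext, StaticPartitionsDefinition, asset

# Hypothetical partition set: one static partition per year of CEMS data.
EPACEMS_YEARS = StaticPartitionsDefinition([str(y) for y in range(1995, 2023)])


@asset(partitions_def=EPACEMS_YEARS)
def hourly_emissions_epacems(context: OpExecutionContext) -> pd.DataFrame:
    """Extract and transform a single year of EPA CEMS data."""
    year = int(context.partition_key)
    # Placeholder for the real per-year extract/transform step.
    return pd.DataFrame({"year": [year]})
```

A single partition can then be materialized with `materialize([hourly_emissions_epacems], partition_key="2020")`, and a backfill launches one run per year, which is where the cross-partition parallelism would come from.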

To avoid having lots of processes trying to append to the same monolithic Parquet file, the parallelization will probably need to write out all of the individual partitions first, and only compile the big file once they are all complete. So maybe the big file is a separate asset that depends on all of the individual partition assets?
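A sketch of that consolidation step, assuming Dagster's default filesystem IO manager (which hands a non-partitioned downstream asset a dict mapping each upstream partition key to its stored value, so it can only run once all partitions exist):

```python
import pandas as pd
from dagster import asset


@asset
def epacems_monolithic(hourly_emissions_epacems: dict[str, pd.DataFrame]) -> None:
    """Compile the single big Parquet file after every yearly partition exists.

    The input is a dict of {partition_key: dataframe} for all partitions of
    the upstream asset sketched above (default IO manager behavior).
    """
    combined = pd.concat(hourly_emissions_epacems.values())
    combined.to_parquet("epacems.parquet")
```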

Should we be able to get it down to 4-15 min on a nightly build machine with 16 CPUs? Or 10-20 min on a laptop?
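One caveat (an assumption about execution mechanics, not something settled in this issue): if each partition materializes as its own run, as backfills do, then cross-partition parallelism is capped by the instance's run coordinator rather than by the in-run executor. Something like this in `dagster.yaml` would let 16 runs execute at once (the exact module path may vary by Dagster version):

```yaml
# dagster.yaml (instance config): cap concurrent backfill runs at 16,
# matching the nightly build machine's core count.
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 16
```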

Scope

Next steps

@zaneselvans zaneselvans added epacems Integration and analysis of the EPA CEMS dataset. parquet Issues related to the Apache Parquet file format which we use for long tables. dagster Issues related to our use of the Dagster orchestrator inframundo labels Mar 9, 2023
@zaneselvans zaneselvans modified the milestones: Port ETL to Dagster, Refine Dagster ETL, PUDL 2023Q1 Release Mar 9, 2023
@zaneselvans zaneselvans added the performance Make PUDL run faster! label Mar 11, 2023
@zaneselvans zaneselvans modified the milestones: 2023Q1, 2023Q2 Mar 14, 2023
zaneselvans (Member, Author) commented:

Someone at Dagster just made this blog post about partitioning pipelines.

It looks like it's easy to have a downstream asset depend on all of the partitions of an upstream partitioned asset.
