Parallelize Dagster processing of EPA CEMS #2376
Labels: dagster, epacems, inframundo, parquet, performance
Right now, EPA CEMS is responsible for about 80% of the overall ETL runtime in Dagster (more than 60 min out of 78 min). It's also highly parallelizable, and we need to figure out how to do this kind of asset partitioning well for other data too:
Currently we are generating 2 versions of the data: one that's partitioned by state and year, and one that has all the data in a single file (though it is internally partitioned into many row groups, each of which represents a unique state-year combo). See #2354. We are only processing the data once, and then writing it out to these two destinations (since writing it out is fast).
While state-year file partitions / row groups are useful for making efficient queries, the state-year outputs are probably too small to be efficient work units for parallelizing the data processing, and they vary wildly in size (TX has far more data than VT). However, each year of data is about the same size, is small enough to fit in memory, and there are around 30 years of data in total. The nightly build machine has 16 cores, and our laptops often have 8-12 cores, so 30 work units could fully utilize those cores without excessive per-chunk overhead.
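As a rough sketch of that work-unit shape (not PUDL code; `process_year` is a hypothetical stand-in for the per-year CEMS transform), the ~30 annual partitions map cleanly onto a worker pool:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real per-year EPA CEMS transform.
def process_year(year: int) -> tuple[int, int]:
    # Pretend each year yields some number of processed rows.
    return year, (year - 1994) * 1000

years = range(1995, 2025)  # ~30 annual work units, all roughly the same size

# A thread pool stands in here for whatever executor we end up using;
# the real transform is CPU-bound, so in practice it would be processes
# or Dagster-managed partitioned runs rather than threads.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = dict(pool.map(process_year, years))
```

Each work unit is independent, so nothing here needs shared state or locking.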
@bendnorman spent some time playing with Dagster's asset partitioning a while ago, but for some reason it wasn't totally straightforward. I think maybe it was too focused on backfills / time series? But maybe we could structure this as a time series with annual granularity?
To avoid having lots of processes trying to append data to the same monolithic parquet file, the parallelization will probably need to output all of the individual partitions, and only attempt to compile the big file once they are all complete. So maybe that is a separate asset that depends on an asset group composed of all of the individual partitions?
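That fan-in could look something like the sketch below. The file format and names are placeholders (the real assets would write parquet); the point is the dependency shape: each partition writes its own file, and the combine step depends on all of them, so it only runs once every partition exists.

```python
import json
import tempfile
from pathlib import Path

out_dir = Path(tempfile.mkdtemp())

# Hypothetical per-year partition asset: each writes its own small file,
# so no two workers ever touch the same output.
def write_year_partition(year: int) -> Path:
    path = out_dir / f"epacems-{year}.json"  # stand-in for a parquet file
    path.write_text(json.dumps({"year": year, "rows": 100}))
    return path

# Downstream "monolithic file" asset: depends on ALL partitions, runs once.
def combine_partitions(paths: list[Path]) -> list[dict]:
    # In Dagster terms this would take the whole partition group as input,
    # so it can only start after every per-year asset has materialized.
    return [json.loads(p.read_text()) for p in sorted(paths)]

partition_paths = [write_year_partition(y) for y in (1995, 1996, 1997)]
combined = combine_partitions(partition_paths)
```

With parquet specifically, the combine step could copy each partition in as its own row group, preserving the state-year row-group layout described above.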
Should we be able to get it down to 4-15 min on a nightly build machine with 16 CPUs? Or 10-20 min on a laptop?
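Back-of-the-envelope math behind those targets (assuming ~60 min of current CEMS runtime split into 30 roughly equal annual chunks with perfect scaling):

```python
import math

total_minutes = 60  # approximate current EPA CEMS runtime
n_chunks = 30       # one work unit per year of data
per_chunk = total_minutes / n_chunks  # ~2 min per year

def ideal_runtime(cores: int) -> float:
    # Chunks run in waves of `cores`; runtime is waves * per-chunk time.
    return math.ceil(n_chunks / cores) * per_chunk

build_machine = ideal_runtime(16)  # 30 chunks in 2 waves of 16
laptop = ideal_runtime(8)          # 30 chunks in 4 waves of 8
```

So 4 min on 16 cores and 8 min on a laptop are the ideal-case floors; per-chunk overhead and the final single-file compile step account for the rest of those ranges.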
Scope
Next steps