Parallelize Dagster processing of EPA CEMS #2376
Labels: dagster, epacems, inframundo, parquet, performance
Right now, EPA CEMS is responsible for about 80% of the overall ETL runtime in Dagster (more than 60 min out of 78 min). It's also highly parallelizable, and we need to figure out how to do this kind of asset partitioning well for other data too:
Currently we are generating 2 versions of the data: one that's partitioned by state and year, and one that has all the data in a single file (though it is internally partitioned into many row groups, each of which represents a unique state-year combo). See #2354. We are only processing the data once, and then writing it out to these two destinations (since writing it out is fast).
While state-year file partitions / row groups are useful for making efficient queries, the state-year outputs are probably too small to be efficient work units for parallelizing the data processing, and they vary wildly in size (TX has far more data than VT). However, each year of data is about the same size, is small enough to fit in memory, and there are around 30 years of data in total. The nightly build machine has 16 cores, and our laptops often have 8-12 cores, so 30 work units could fully utilize those cores without excessive per-chunk overhead.
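As a rough sketch of that work-unit shape (not PUDL code; `process_year` is a hypothetical stand-in for the per-year CEMS transform), the ~30 annual partitions map cleanly onto a worker pool:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real per-year EPA CEMS transform.
def process_year(year: int) -> tuple[int, int]:
    # Pretend each year yields some number of processed rows.
    return year, (year - 1994) * 1000

years = range(1995, 2025)  # ~30 annual work units, all roughly the same size

# A thread pool stands in here for whatever executor we end up using;
# the real transform is CPU-bound, so in practice it would be processes
# or Dagster-managed partitioned runs rather than threads.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = dict(pool.map(process_year, years))
```

Each work unit is independent, so nothing here needs shared state or locking.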
@bendnorman spent some time playing with Dagster's asset partitioning a while ago, but for some reason it wasn't totally straightforward. I think maybe it was too focused on backfills / time series? But maybe we could structure this as a time series with annual granularity?
To avoid having lots of processes trying to append data to the same monolithic parquet file, the parallelization will probably need to output all of the individual partitions, and only attempt to compile the big file once they are all complete. So maybe that is a separate asset that depends on an asset group composed of all of the individual partitions?
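That fan-in could look something like the sketch below. The file format and names are placeholders (the real assets would write parquet); the point is the dependency shape: each partition writes its own file, and the combine step depends on all of them, so it only runs once every partition exists.

```python
import json
import tempfile
from pathlib import Path

out_dir = Path(tempfile.mkdtemp())

# Hypothetical per-year partition asset: each writes its own small file,
# so no two workers ever touch the same output.
def write_year_partition(year: int) -> Path:
    path = out_dir / f"epacems-{year}.json"  # stand-in for a parquet file
    path.write_text(json.dumps({"year": year, "rows": 100}))
    return path

# Downstream "monolithic file" asset: depends on ALL partitions, runs once.
def combine_partitions(paths: list[Path]) -> list[dict]:
    # In Dagster terms this would take the whole partition group as input,
    # so it can only start after every per-year asset has materialized.
    return [json.loads(p.read_text()) for p in sorted(paths)]

partition_paths = [write_year_partition(y) for y in (1995, 1996, 1997)]
combined = combine_partitions(partition_paths)
```

With parquet specifically, the combine step could copy each partition in as its own row group, preserving the state-year row-group layout described above.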
Should we be able to get it down to 4-15 min on a nightly build machine with 16 CPUs? Or 10-20 min on a laptop?
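Back-of-the-envelope math behind those targets (assuming ~60 min of current CEMS runtime split into 30 roughly equal annual chunks with perfect scaling):

```python
import math

total_minutes = 60  # approximate current EPA CEMS runtime
n_chunks = 30       # one work unit per year of data
per_chunk = total_minutes / n_chunks  # ~2 min per year

def ideal_runtime(cores: int) -> float:
    # Chunks run in waves of `cores`; runtime is waves * per-chunk time.
    return math.ceil(n_chunks / cores) * per_chunk

build_machine = ideal_runtime(16)  # 30 chunks in 2 waves of 16
laptop = ideal_runtime(8)          # 30 chunks in 4 waves of 8
```

So 4 min on 16 cores and 8 min on a laptop are the ideal-case floors; per-chunk overhead and the final single-file compile step account for the rest of those ranges.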
Scope
Next steps