
Parallelize Dagster processing of EPA CEMS #2472

Merged
zschira merged 18 commits into dev from epacems_partitions on Apr 4, 2023
Conversation

@zschira zschira (Member) commented Mar 30, 2023

Background

epacems is currently a major bottleneck in the ETL and could be easily parallelized. Here I've tried using Dagster's partitioned assets to achieve concurrency, but they aren't behaving quite as expected/hoped. For every partition, Dagster launches a run of the etl_fast/etl_full job, which runs the entire ETL instead of materializing all assets once and then running each CEMS partition individually. This is unsurprisingly blowing up the resource usage on my computer and causing it to crash, even when limiting the max concurrency to 5 runs.
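
For context, the partitioned approach tried here looks roughly like this (a minimal sketch, not the exact code; the static yearly partitions, year range, and asset name are illustrative):

```python
from dagster import StaticPartitionsDefinition, asset

# illustrative year range for EPA CEMS
epacems_partitions = StaticPartitionsDefinition(
    [str(year) for year in range(1995, 2022)]
)

@asset(partitions_def=epacems_partitions)
def hourly_emissions_epacems(context):
    year = int(context.partition_key)
    # extract and transform one year of CEMS data, then write it out
    ...
```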

Possible solutions

Option 1: Try to get partitions to work as desired

It would be ideal if we could just get partitions working the way we want, but I'm not entirely sure it's possible. I'm going to write up a question in Dagster's Slack and see if we can get any help, so we'll see what comes of it.

edit: Just got a response from the Dagster Slack, and they recommended creating a new asset job for the partitioned asset and running it separately. This would allow all of the epacems partitions to be generated concurrently, but they wouldn't run alongside the main ETL, so we'd lose a little concurrency there. For that reason, option 2 might still be preferable, but I'd like to hear others' thoughts.
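
A sketch of what that separate job might look like, reusing the names from the sketch above (define_asset_job and AssetSelection are Dagster APIs; the job name is hypothetical):

```python
from dagster import AssetSelection, define_asset_job

# launched separately from the main ETL job, once per partition
epacems_job = define_asset_job(
    name="epacems_job",
    selection=AssetSelection.assets(hourly_emissions_epacems),
    partitions_def=epacems_partitions,
)
```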

Option 2: Create an asset factory to generate assets for each partition

This seems like a decent solution if we can't get partitions working. We could have the factory generate an asset for each year and let Dagster handle materializing these assets concurrently. That behavior is exactly what we'd want; the downside is that it adds a bunch of assets that create some clutter, but I don't think it would be too bad.
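
A minimal sketch of such a factory, assuming one asset per year (function and asset names are hypothetical):

```python
from dagster import asset

def epacems_asset_factory(year):
    """Create an asset that processes a single year of CEMS data."""

    @asset(name=f"hourly_emissions_epacems_{year}")
    def _cems_year():
        # extract, transform, and write one year of CEMS data
        ...

    return _cems_year

# Dagster can then materialize these assets concurrently
cems_assets = [epacems_asset_factory(year) for year in range(1995, 2022)]
```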

Option 3: Handle concurrency ourselves

We could just use normal Python concurrency tools to handle this ourselves and keep a single asset. I don't think doing our own concurrency is ideal, since performance will likely be best optimized if we let Dagster handle it. The one advantage I see over the asset factory option is that we don't have to create a bunch of assets.
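
Inside a single asset, that might look like this sketch using only the standard library (the helper name and year range are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

from dagster import asset

def process_cems_year(year):
    # extract, transform, and write one year of CEMS data
    ...

@asset
def hourly_emissions_epacems():
    years = range(1995, 2022)  # illustrative year range
    with ProcessPoolExecutor() as executor:
        # fan the years out across worker processes;
        # consuming the iterator surfaces any worker exceptions
        list(executor.map(process_cems_year, years))
```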

@zschira zschira requested a review from bendnorman March 30, 2023 16:14
@bendnorman bendnorman (Member) commented Mar 30, 2023

Bummer the partitions aren't working out as planned :/

  • I like the asset factory idea, but I don't think we'd be able to parameterize which years to process using the datasets_settings resource. To process specific years we'd have to select specific assets, which would differ from how we parameterize all of the other datasources.
  • Another option I played around with at one point was wrapping a dynamic graph in an asset (see the sketch after this list). That way we would have one epacems asset and could select which years to process using the datasets_settings resource. This is the commit where I experimented with this option. We would need to collect() all of the years, which would blow up memory unless we incorporated dask.
  • I'm concerned about moving CEMS into its own job. We'd end up with an awkward circular dependency, where the CEMS job depends on normalized PUDL tables and output tables depend on the outputs of the CEMS job.
  • I also don't think handling parallel processing in a single epacems asset would be the worst idea.
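
For reference, the dynamic-graph-wrapped-in-an-asset pattern looks roughly like this (a sketch with assumed op names and resource wiring, not the exact commit):

```python
from dagster import DynamicOut, DynamicOutput, graph_asset, op

@op(out=DynamicOut(), required_resource_keys={"dataset_settings"})
def get_cems_years(context):
    # fan out: one dynamic output per year to be processed
    # (the resource attribute path here is hypothetical)
    for year in context.resources.dataset_settings.epacems_years:
        yield DynamicOutput(year, mapping_key=str(year))

@op
def process_cems_year(year):
    # extract/transform a single year and write its Parquet partition
    ...

@op
def consolidate_partitions(partitions):
    # fan in: runs only after every mapped op has finished
    ...

@graph_asset
def hourly_emissions_epacems():
    partitions = get_cems_years().map(process_cems_year)
    return consolidate_partitions(partitions.collect())
```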

@zaneselvans zaneselvans linked an issue Mar 30, 2023 that may be closed by this pull request
@zaneselvans zaneselvans (Member) commented

That seems like really weird behavior from Dagster. I wonder why their idea about how it should work is so different from ours.

@zaneselvans zaneselvans added the epacems, performance, and dagster labels Mar 30, 2023
@zaneselvans zaneselvans added this to the 2023Q1 milestone Mar 30, 2023
@zaneselvans zaneselvans added the parquet label Mar 31, 2023
@zschira zschira (Member, Author) commented Mar 31, 2023

I think Dagster expects partitioned assets to depend only on assets with the same partitions. It sounds like they're working on adding functionality more like what we expect, but for the time being we have to go with another option.

codecov bot commented Mar 31, 2023

Codecov Report

Patch coverage: 100.0% and no project coverage change.

Comparison is base (62cb44a) 86.7% compared to head (25138da) 86.7%.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2472   +/-   ##
=====================================
  Coverage   86.7%   86.7%           
=====================================
  Files         81      81           
  Lines       9490    9504   +14     
=====================================
+ Hits        8233    8247   +14     
  Misses      1257    1257           
Impacted Files Coverage Δ
src/pudl/cli.py 60.0% <ø> (-1.0%) ⬇️
src/pudl/convert/epacems_to_parquet.py 64.7% <ø> (-1.1%) ⬇️
src/pudl/etl/epacems_assets.py 100.0% <100.0%> (ø)



@zaneselvans zaneselvans modified the milestones: 2023Q1, 2023Q2 Apr 3, 2023
@zschira zschira requested a review from zaneselvans April 3, 2023 19:58
@zschira zschira marked this pull request as ready for review April 3, 2023 19:58
@zaneselvans zaneselvans (Member) left a comment

I think the performance on this is totally good enough -- it went from 70 minutes to 14 minutes on my machine, and once it's interleaved with all of the other jobs in the DAG I don't think additional granularity would speed things up much. It kept my 10 CPUs pegged at 100% for 10+ minutes.

I needed to merge in dev to run it locally and I tried to fix another lingering partitioning config that was in the integration tests, so I went ahead and pushed that merge back up to the PR. I hope that's okay!

I took a look at the Dagster docs for the other abstractions you're using here, which are different from how we're managing the other assets, and I wasn't totally clear on how it all works. Would you be willing to talk me through it? Maybe a little more explanation in the epacems_assets.py module-level docstring would be helpful, since it's so different from the straight-up assets we're using everywhere else?

But functionally it seems to work great!

@@ -30,6 +31,23 @@ def ferc_to_sqlite_settings(init_context) -> FercToSqliteSettings:
return FercToSqliteSettings(**init_context.resource_config)


class ParquetWriter(pq.ParquetWriter):
Member

Is it worth preserving the ability to write out a single monolithic file directly, rather than compiling it after the fact from partitioned outputs? Once the data is in Parquet, it's extremely fast to read and write and it seems simpler to avoid the issue of concurrency altogether. We're hacking around it with SQLite because we don't have a choice unless we want to switch to another format for the DB entirely, but Parquet is designed to be sliced and diced.

Member Author

Right now I don't think there's any way to directly write to the monolithic output. I was using this class for testing, but it's removed now.
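
(For reference, compiling a monolithic file from the partitioned outputs after the fact could be a short post-processing step; a sketch assuming pyarrow and illustrative paths:)

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# read every per-year partition file and rewrite them as one file
table = ds.dataset("parquet/epacems/", format="parquet").to_table()
pq.write_table(table, "parquet/hourly_emissions_epacems.parquet")
```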

Member

Ack! This was a "pending" comment that never got submitted from 4 days ago when I first saw you push to the branch.

plants_entity_eia,
)
)
return consolidate_partitions(partitions.collect())
Member

Can you say more about the meaning of the return value here? consolidate_partitions() doesn't return anything itself, and I'm not immediately understanding from the Dagster docs what's going on.

Member Author

Basically, a graph_asset like this has to return the output of an op for Dagster to be able to figure out how to create the asset. In most cases the op would actually return something that would then get passed to an io_manager. This case is kind of confusing, though, because we're writing directly to the Parquet outputs inside the op, so there's nothing meaningful to return.

@zaneselvans zaneselvans changed the title Convert cems to use dagster partitions Parallelize Dagster processing of EPA CEMS Apr 4, 2023
@zaneselvans zaneselvans (Member) left a comment

Looks like there's a docstring formatting issue that's breaking the docs build, but other than that this looks good to me.

@zschira zschira merged commit f5cc15f into dev Apr 4, 2023
@zschira zschira deleted the epacems_partitions branch April 4, 2023 19:23
@dstansby dstansby mentioned this pull request Jun 8, 2023