
Split monolithic ferc_to_sqlite ops into per-dataset pieces #3098

Merged: 30 commits merged into main from split-ferc2sqlite-ops on Feb 7, 2024

Conversation

@rousik (Collaborator) commented Nov 30, 2023

The monolithic dbf2sqlite and xbrl2sqlite ops processed all datasets in a serial loop, which seemed like a missed opportunity. This change rips out those monoliths and replaces them with dataset-specific ops that are invoked directly from within the relevant graphs.

One thing that prevented an efficient breakup into smaller ops was the fact that they relied on configuration fields like clobber and batch_size, and per-op config could fan out quite drastically. To prevent this, I have replaced the op-specific config with a RuntimeSettings resource that is defined once and used by all the ops that need access to these parameters.
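A minimal sketch of what such a shared resource could look like, assuming Dagster's Pythonic resources (the field names mirror the config mentioned above; the op name is illustrative, not the exact PUDL code):

```python
from dagster import ConfigurableResource, op


class RuntimeSettings(ConfigurableResource):
    """Runtime parameters shared by all ferc_to_sqlite ops."""

    clobber: bool = False
    batch_size: int = 50


@op
def extract_ferc1_dbf(runtime_settings: RuntimeSettings) -> None:
    # Each dataset-specific op reads the same shared settings instead of
    # declaring (and fanning out) its own op config.
    if runtime_settings.clobber:
        ...  # remove any existing output DB before re-extracting
```

Defined once on the job's resource dict, the same instance is injected into every op that declares it as a parameter.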

This change:

  • decomposes the ferc_to_sqlite graph into dataset-specific operations that are generated by the relevant extractor classes
  • introduces a --dataset-only flag that can further restrict the graph to a single dataset's operation (this is going to be used in CI for sharding and speeding up ferc_to_sqlite)
  • drops support for the xbrl_enabled/dbf_enabled flags in the job factory in favor of op_selector logic
  • adds tags to ops indicating which dataset and data_format each op handles, so that they can be selected appropriately (see the sketch below)
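A rough illustration of the per-dataset op factory and tag-based selection; the factory shape and form list are assumptions for the sketch, not the exact PUDL code:

```python
from dagster import graph, op


def xbrl_op_factory(form: str):
    """Generate one tagged extraction op per FERC form."""

    @op(
        name=f"{form}_xbrl",
        tags={"dataset": form, "data_format": "xbrl"},
    )
    def extract() -> None:
        ...  # run the XBRL extractor for this single form

    return extract


@graph
def ferc_to_sqlite():
    for form in ("ferc1", "ferc2", "ferc6", "ferc60", "ferc714"):
        xbrl_op_factory(form)()
```

A job built from this graph can then be subset by op name or tag, instead of relying on xbrl_enabled/dbf_enabled flags.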

Instead of using a single monolithic op that loops over all forms, we can use an ops factory and an XbrlRuntimeSettings resource to simplify how parameters are passed in.

This way, a single runtime settings resource exists for all xbrl ops, and one op is generated for each form. This should allow for better parallelism, even though it might interfere with num_workers being set to the number of CPUs by default (i.e. this way we will oversubscribe the available cores twice: once for dagster workers, and once for xbrl num_workers). This, however, should be an easy fix.
Refactor the monolithic dbf2sqlite and xbrl2sqlite methods into smaller per-dataset ops that are invoked within the graphs. This should allow us to make better use of dagster parallelism and speed up ferc_to_sqlite processing.

It seems that the current unit/integration tests only use FERC1 raw data, so I've modified the fixtures to only run the relevant pieces of processing.
@rousik rousik marked this pull request as ready for review November 30, 2023 07:37
@jdangerx jdangerx self-requested a review November 30, 2023 22:46
@rousik (Collaborator, Author) commented Dec 1, 2023

I ran this with the pytest refactor that runs ferc_to_sqlite as a standalone step on a larger runner and got the following results:

https://github.com/catalyst-cooperative/pudl/actions/runs/7064167417

|                           | 1 xbrl worker | 2 xbrl workers | 3 xbrl workers | 4 xbrl workers | 8 xbrl workers |
|---------------------------|---------------|----------------|----------------|----------------|----------------|
| 1 dagster worker          | ?             | 49 mins        | 43 mins        | 41 mins        | 40 mins        |
| 2 dagster workers         | 46 mins       | 39 mins        | 38 mins        | 38 mins        |                |
| 4 dagster workers         | 46 mins       | 39 mins        |                |                |                |
| unlimited dagster workers | 48 mins       | 38 mins        |                |                |                |

TL;DR: both the dagster and xbrl worker parameters provide similar speed-ups, with xbrl workers providing slightly better value. Note that we can only run with 2 and 4 of each before we run the larger runner out of memory. We could, therefore, run with 2 xbrl workers in ci-integration tests, or explore running on even larger runners. The config that runs with a single thread of each is still running, and its ferc_to_sqlite step has already taken over 75 minutes.

@jdangerx (Member) left a comment

Thanks for switching up the settings data structure so we can better split out the assets and run CI faster!

I'm also somewhat inclined to rely more on dagster parallelization than on xbrl-extractor parallelization, because one system is easier to think about than two, but that definitely falls outside the scope of this PR.

Finally, I have a few small thoughts about the changes you made to the test config that we should talk about before merging this in.

src/pudl/resources.py (outdated; resolved)
```diff
         },
     },
 )


 @pytest.fixture(scope="session")
-def ferc_to_sqlite_xbrl_only(
+def ferc1_xbrl_extract(
```
Member:

Ooh, nice to not have to extract all the other forms in test 🤩

Collaborator (Author):

Though this is probably irrelevant once we're running ferc_to_sqlite normally as part of ci-integration (see #2825) rather than in-process as part of the integration tests. At that point, we will likely deprecate these fixtures and assume we're always relying on --live-dbs?

Member:

Yes, that sounds correct to me - though we should consider making sure we can't actually mutate anything in --live-dbs mode....

Member:

Are we going to rework the testing setup in the Makefile to run ferc_to_sqlite and pudl_etl before running the tests too? That would be nice for the sake of consistency (doing the same thing in CI and locally), but would we need a different way of isolating the test outputs from live outputs? If the tests are always run with --live-dbs, then should that option just be removed entirely?

Collaborator (Author):

I'd say yes to removing --live-dbs functionality altogether, once we've isolated ferc_to_sqlite/etl from the unit/integration tests.

The obvious question is what the semantics of running the integration tests locally should be: do we expect that the outputs are already in place, or do we want to re-run the data generation (which could be time consuming)?

In the latter case, even if we no longer have --live-dbs, we could set PUDL_OUTPUT in the Makefile itself to ensure that outputs that are generated for the sake of testing are isolated from the local outputs.

Happy to have some more conversation about this, but it seems to be out of scope for this specific PR, so I would like to not block on it.

test/conftest.py (outdated; resolved)

```python
# Construct xbrl2sqlite op context
context = build_op_context(
    # always use a tmp path here so that we don't clobber the live DB when --live-dbs is passed
```
Member:

We're no longer mocking PudlPaths() here - is this comment still relevant? It seems like we're checking for --live-dbs at test runtime, which is fine I guess, but also error-prone: we need to remember to bail out of any db-touching test when live_dbs == True.
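One way to make that bail-out hard to forget is a guard fixture that mutating tests must request; a minimal sketch, with hypothetical fixture and test names (--live-dbs is the project's real pytest option):

```python
import pytest


@pytest.fixture
def mutable_db_guard(request):
    """Require this fixture in any test that writes to the databases."""
    if request.config.getoption("--live-dbs"):
        pytest.skip("db-mutating test skipped because --live-dbs is set")


def test_rebuild_ferc1_db(mutable_db_guard):
    ...  # safe: automatically skipped whenever --live-dbs is passed
```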

Collaborator (Author):

Yeah, the fact that we have the tests operating in these two distinct modes (read-only vs. mutating) is a bit dangerous IMO, and there's always a possibility that we don't do isolation well and things get mangled. I suppose that, ideally, we would rely on most of the db tests being validations that will not modify whatever is in place.

test/unit/extract/xbrl_test.py (resolved)
@zaneselvans (Member) commented Dec 6, 2023

Another option for splitting the sizeable FERC extraction work out from all of the other stuff without blowing up the resource use on an individual machine would be to split each of the FERC extractions into its own CI job. I think we only need to upload/download the FERC 1 DBs for the PUDL ETL to run. PUDL wouldn't need to be dependent on any of the other FERC jobs, and we could just upload and combine coverage reports from the non-FERC1 forms.

I think this could just be done by splitting the ferc2sqlite params out into separate settings files. The FERC1 takes the most time by a significant margin. It could even be worthwhile to split out the XBRL and DBF extractions into 2 separate jobs (since they're totally independent conceptually, and take roughly the same amount of time to run). I think splitting the work out into separate CI jobs on separate runners would probably cut our ~45min FERC run down to more like 15min and would also cut the time we burn on larger runners almost in half.

@rousik (Collaborator, Author) commented Dec 6, 2023

> Another option for splitting the sizeable FERC extraction work out from all of the other stuff without blowing up the resource use on an individual machine would be to split each of the FERC extractions into its own CI job. I think we only need to upload/download the FERC 1 DBs for the PUDL ETL to run. PUDL wouldn't need to be dependent on any of the other FERC jobs, and we could just upload and combine coverage reports from the non-FERC1 forms.
>
> I think this could just be done by splitting the ferc2sqlite params out into separate settings files. The FERC1 takes the most time by a significant margin. It could even be worthwhile to split out the XBRL and DBF extractions into 2 separate jobs (since they're totally independent conceptually, and take roughly the same amount of time to run). I think splitting the work out into separate CI jobs on separate runners would probably cut our ~45min FERC run down to more like 15min and would also cut the time we burn on larger runners almost in half.

This is basically putting yet another layer of doing the same thing in a different way on top of dagster parallelization; I can see the practical reason for it here (breaking out of the bounds of the relatively small GitHub runners), but it's a bit of a dirty trick IMO. If it can save some money, then we could go for it. The other option would be to add --only-dataset or --num-shards flags that could subselect dagster assets/jobs or dataset settings programmatically, which would remove the need to maintain a collection of configs for this.

@zaneselvans (Member) commented Dec 6, 2023

Yeah, I don't love the additional complexity, but it seems like it could be done with no change to the codebase, result in a significantly bigger speedup, and also save some paid CI minutes, so it seemed worth looking at. And I guess these approaches are not necessarily mutually exclusive.

@jdangerx Do you think there are any clear opportunities for speeding up the XBRL extraction jobs at a low level? Right now it takes about the same amount of time to extract 20-30 years of DBF data as 2 years of XBRL data, so it seems like in production this issue is going to get worse over time.

@jdangerx (Member) commented Dec 7, 2023

We can always profile and see if there's anything funky going on - not sure how to prioritize that work over e.g. getting archiver stuff working, again (again).
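For reference, a quick way to start that profiling with the stdlib; the run_xbrl_extraction entry point here is hypothetical:

```python
import cProfile
import pstats

# Profile one extraction run and dump the stats to a file.
cProfile.run("run_xbrl_extraction(form='ferc1', years=[2021, 2022])", "xbrl.prof")

# Show the 20 most expensive call sites by cumulative time.
pstats.Stats("xbrl.prof").sort_stats("cumulative").print_stats(20)
```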

@rousik (Collaborator, Author) commented Dec 13, 2023

> Another option for splitting the sizeable FERC extraction work out from all of the other stuff without blowing up the resource use on an individual machine would be to split each of the FERC extractions into its own CI job. I think we only need to upload/download the FERC 1 DBs for the PUDL ETL to run. PUDL wouldn't need to be dependent on any of the other FERC jobs, and we could just upload and combine coverage reports from the non-FERC1 forms.
>
> I think this could just be done by splitting the ferc2sqlite params out into separate settings files. The FERC1 takes the most time by a significant margin. It could even be worthwhile to split out the XBRL and DBF extractions into 2 separate jobs (since they're totally independent conceptually, and take roughly the same amount of time to run). I think splitting the work out into separate CI jobs on separate runners would probably cut our ~45min FERC run down to more like 15min and would also cut the time we burn on larger runners almost in half.

I think this is a reasonable idea and I would like to pursue it as well. I'm thinking I could add --dataset-only and maybe --format-only flags to the ferc_to_sqlite CLI that would filter down which ops are executed. I do think that having the work split into one op per dataset/format, as is done here, would pair very nicely with that functionality, so I think we should still pursue this change.

For resource usage safety, I would also suggest setting the default xbrl workers to 2 and relying primarily on the dagster workers for parallelism.
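A rough sketch of how such a --dataset-only flag could map onto the op tags introduced in this PR; the CLI wiring and job name are illustrative, not the actual ferc_to_sqlite script:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--dataset-only",
    help="e.g. ferc1_xbrl; if set, run only the ops for that dataset/format",
)
args = parser.parse_args()

# Translate the flag into a dagster op-selection pattern matching the op names.
op_selection = [f"{args.dataset_only}*"] if args.dataset_only else None

result = ferc_to_sqlite_job.execute_in_process(op_selection=op_selection)
```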

rousik and others added 4 commits December 12, 2023 21:52
Rely on dagster parallelism here.
Restrict processing to, say, the ferc1_dbf or ferc2_xbrl dataset. This is intended for ci-integration parallelism.
@rousik (Collaborator, Author) commented Dec 13, 2023

Implemented the --dataset-only=ferc1_xbrl filter as a command-line flag that can be used in conjunction with the new pytest refactor.

@rousik rousik requested a review from jdangerx December 13, 2023 17:28
@zaneselvans (Member) commented:
@rousik there was an error because pudl wasn't being imported in conftest.py so I added the import.

Now there's a real error related to splitting the FERC jobs up, because the Datasette metadata-to-RST script that generates the metadata.yaml file annotating our Datasette deployment expects all of the FERC outputs to be available in $PUDL_OUTPUT when it runs.

@rousik (Collaborator, Author) commented Dec 29, 2023

> @rousik there was an error because pudl wasn't being imported in conftest.py so I added the import.
>
> Now there's a real error related to splitting the FERC jobs up, because the Datasette metadata-to-RST script that generates the metadata.yaml file annotating our Datasette deployment expects all of the FERC outputs to be available in $PUDL_OUTPUT when it runs.

@zaneselvans can you provide some additional context? Do we expect to run this datasette metadata job as part of ci-integration, or is that something that should run as part of daily/nightly deployments? I suppose the question here is whether early coverage/error detection is worth the added cost (time/resources), or whether catching this on a nightly cadence (when we try to ship the datasette) is what we want.

@rousik (Collaborator, Author) commented Dec 29, 2023

> @rousik there was an error because pudl wasn't being imported in conftest.py so I added the import.
> Now there's a real error related to splitting the FERC jobs up, because the Datasette metadata-to-RST script that generates the metadata.yaml file annotating our Datasette deployment expects all of the FERC outputs to be available in $PUDL_OUTPUT when it runs.
>
> @zaneselvans can you provide some additional context? Do we expect to run this datasette metadata job as part of ci-integration, or is that something that should run as part of daily/nightly deployments? I suppose the question here is whether early coverage/error detection is worth the added cost (time/resources), or whether catching this on a nightly cadence (when we try to ship the datasette) is what we want.

It seems to me that we could adapt the datasette tests so that they're configuration-aware and only test for the presence of files that should have been produced, i.e. skipping over datasets that are marked as disabled.
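A minimal sketch of such a configuration-aware check; the settings helper, fixtures, and dataset list are hypothetical:

```python
import pytest

DATASETS = ["ferc1_dbf", "ferc1_xbrl", "ferc2_xbrl", "ferc6_xbrl"]


@pytest.mark.parametrize("dataset", DATASETS)
def test_output_db_exists(dataset, etl_settings, pudl_output_dir):
    if not etl_settings.is_enabled(dataset):  # hypothetical helper
        pytest.skip(f"{dataset} is disabled in this test configuration")
    assert (pudl_output_dir / f"{dataset}.sqlite").exists()
```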

Base automatically changed from dev to main January 5, 2024 04:14
@zaneselvans added the testing, performance, github-actions, xbrl, and dbf labels on Jan 18, 2024
@jdangerx (Member) left a comment

Overall looks pretty good! Thanks for sticking with this long-standing PR for so long 😅

I have a few clarifying questions to make sure the behavior is what we need before merging in.

docker/gcp_pudl_etl.sh (outdated; resolved)
src/pudl/ferc_to_sqlite/__init__.py (outdated; resolved)
src/pudl/resources.py (resolved)
test/conftest.py (outdated; resolved)
test/unit/extract/xbrl_test.py (resolved)
@rousik rousik requested a review from jdangerx January 30, 2024 23:40
@rousik (Collaborator, Author) commented Jan 30, 2024

Reworked the datasette tests after the recent changes there, to allow for the presence of ferc1_xbrl only when --live-dbs is not enabled. This is not pretty, but I'm expecting we'll ditch that functionality/duality, and then we can fix these tests properly.

@jdangerx (Member) left a comment

Thanks for pushing this through, @rousik! Though there is still one merge conflict & a small typo re: a comment you had meant to remove.

There's definitely room for figuring out how we want to handle --live-dbs, and that is also way out of scope for this PR. If you have thoughts, @rousik, I'd love it if you could write them up in a Discussion!

test/unit/extract/xbrl_test.py (resolved)
src/pudl/resources.py (outdated; resolved)
@rousik (Collaborator, Author) commented Feb 3, 2024

> Thanks for pushing this through, @rousik! Though there is still one merge conflict & a small typo re: a comment you had meant to remove.
>
> There's definitely room for figuring out how we want to handle --live-dbs, and that is also way out of scope for this PR. If you have thoughts, @rousik, I'd love it if you could write them up in a Discussion!

Merged main and resolved the conflicts; letting the tests run overnight and, fingers crossed, things should be good now.

@zaneselvans (Member) left a comment

Hey @rousik, note that we set up the merge queue last week, and CI now only runs the integration tests in the merge queue.

Comment on lines 88 to 91:

```python
default=2,
help=(
    "Number of worker processes to use when parsing XBRL filings. "
    "Defaults to using the number of CPUs."
```
Member:

Is the default here still supposed to be 2? If so the help message needs to be updated.

Member:

Let's keep the default as 0, which is what it was, and then force concurrency of 2 in conftest.py. Also, we set --workers 8 in gcp_pudl_etl.sh, so that is unaffected by this default.

…lt to 1.

It didn't default to the number of CPUs anyway: we passed in 0 as the default before, which would eventually make the batch size equal to the total # of filings, i.e. 1 concurrent worker.
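A toy illustration of that degenerate batching; the formula is an assumption about the extractor's batching behavior, not its exact code:

```python
import math


def num_batches(total_filings: int, workers: int) -> int:
    # With workers=0 the batch size balloons to cover every filing at once,
    # so only one batch (and effectively one worker) ever exists.
    batch_size = math.ceil(total_filings / (workers or 1))
    return math.ceil(total_filings / batch_size)


print(num_batches(total_filings=120, workers=0))  # -> 1 batch, i.e. serial
print(num_batches(total_filings=120, workers=4))  # -> 4 batches in parallel
```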
@jdangerx (Member) commented Feb 7, 2024

Fixed the merge conflict & changed the default back to 0. Though I think that will actually throw a ValueError per the docs, so I'm running the tests with a default of None and will push that if it works.
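That matches the stdlib semantics if the worker count is handed straight to multiprocessing.Pool (whether the extractor does exactly that is an assumption):

```python
from multiprocessing import Pool

Pool(processes=None)  # OK: falls back to os.cpu_count() workers
Pool(processes=0)     # raises ValueError: number of processes must be at least 1
```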

@jdangerx jdangerx added this pull request to the merge queue Feb 7, 2024
Merged via the queue into main with commit 5e71660 Feb 7, 2024
13 checks passed
@jdangerx jdangerx deleted the split-ferc2sqlite-ops branch February 7, 2024 19:30
Labels: community, dbf, ferc1, github-actions, performance, testing, xbrl