
Split monolithic ferc_to_sqlite ops into per-dataset pieces #3098

Merged: 30 commits merged into main from split-ferc2sqlite-ops on Feb 7, 2024

Conversation

@rousik (Collaborator) commented Nov 30, 2023

The monolithic dbf2sqlite and xbrl2sqlite ops processed all datasets in a serial loop, which seemed like a missed opportunity. This change rips out those monoliths and replaces them with dataset-specific ops that are invoked directly from within the relevant graphs.

One thing that prevented an efficient breakup into smaller ops was the fact that they relied on configuration fields like clobber and batch_size, and per-op config could fan out quite drastically. To prevent this, I have replaced the op-specific config with a RuntimeSettings resource that is defined once and used by all the ops that need access to these parameters.
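A minimal sketch of what such a shared resource could look like, assuming Dagster's Pythonic resources (the field names mirror the config mentioned above; the op name is illustrative, not the exact PUDL code):

```python
from dagster import ConfigurableResource, op


class RuntimeSettings(ConfigurableResource):
    """Runtime parameters shared by all ferc_to_sqlite ops."""

    clobber: bool = False
    batch_size: int = 50


@op
def extract_ferc1_dbf(runtime_settings: RuntimeSettings) -> None:
    # Each dataset-specific op reads the same shared settings instead of
    # declaring (and fanning out) its own op config.
    if runtime_settings.clobber:
        ...  # remove any existing output DB before re-extracting
```

Defined once on the job's resource dict, the same instance is injected into every op that declares it as a parameter.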

This change:

  • decomposes the ferc_to_sqlite graph into dataset-specific operations that are generated by the relevant extractor classes
  • introduces a --dataset-only flag that can further restrict the graph to a single dataset's operation (this is going to be used in CI for sharding and speeding up ferc_to_sqlite)
  • drops support for the xbrl_enabled/dbf_enabled flags in the job factory in favor of op_selector logic
  • adds tags to ops indicating which dataset and data_format each op handles, so that they can be selected appropriately (see the sketch below)
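A rough illustration of the per-dataset op factory and tag-based selection; the factory shape and form list are assumptions for the sketch, not the exact PUDL code:

```python
from dagster import graph, op


def xbrl_op_factory(form: str):
    """Generate one tagged extraction op per FERC form."""

    @op(
        name=f"{form}_xbrl",
        tags={"dataset": form, "data_format": "xbrl"},
    )
    def extract() -> None:
        ...  # run the XBRL extractor for this single form

    return extract


@graph
def ferc_to_sqlite():
    for form in ("ferc1", "ferc2", "ferc6", "ferc60", "ferc714"):
        xbrl_op_factory(form)()
```

A job built from this graph can then be subset by op name or tag, instead of relying on xbrl_enabled/dbf_enabled flags.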

Instead of using a single monolithic op that loops over all forms, we can use an ops factory and an XbrlRuntimeSettings resource to simplify how parameters are passed in.

This way, a single runtime settings resource exists for all xbrl ops, and one op is generated for each form. This should allow for better parallelism, even though it might interfere with num_workers being set to the number of CPUs by default (i.e. this way we will oversubscribe the available cores twice: once for dagster workers, and once for xbrl num_workers). This, however, should be an easy fix.
Refactor the monolithic dbf2sqlite and xbrl2sqlite methods into smaller per-dataset ops that are invoked within the graphs. This should allow us to make better use of dagster parallelism and speed up ferc_to_sqlite processing.

It seems that the current unit/integration tests only use FERC1 raw data, so I've modified the fixtures to only run the relevant pieces of processing.
@rousik rousik marked this pull request as ready for review November 30, 2023 07:37
@jdangerx jdangerx self-requested a review November 30, 2023 22:46
@rousik (Collaborator, Author) commented Dec 1, 2023

I ran this with the pytest refactor that runs ferc_to_sqlite as a standalone step on a larger runner and got the following results:

https://github.com/catalyst-cooperative/pudl/actions/runs/7064167417

|                           | 1 xbrl worker | 2 xbrl workers | 3 xbrl workers | 4 xbrl workers | 8 xbrl workers |
|---------------------------|---------------|----------------|----------------|----------------|----------------|
| 1 dagster worker          | ?             | 49 mins        | 43 mins        | 41 mins        | 40 mins        |
| 2 dagster workers         | 46 mins       | 39 mins        | 38 mins        | 38 mins        |                |
| 4 dagster workers         | 46 mins       | 39 mins        |                |                |                |
| unlimited dagster workers | 48 mins       | 38 mins        |                |                |                |

TL;DR: both the dagster and xbrl worker parameters provide similar speed-ups, with xbrl workers providing slightly better value. Note that we can only run with 2 and 4 of each before we run the larger runner out of memory. We could, therefore, run with 2 xbrl workers in ci-integration tests, or explore running on even larger runners. The config that runs with a single thread of each is still running, and its ferc_to_sqlite step has already taken over 75 minutes.

@jdangerx (Member) left a comment

Thanks for switching up the settings data structure so we can better split out the assets and run CI faster!

I'm also somewhat inclined to rely more on dagster parallelization than on xbrl-extractor parallelization, because one system is easier to think about than two, but that definitely falls outside the scope of this PR.

Finally, I have a few small thoughts about the changes you made to the test config that we should talk about before merging this in.

src/pudl/resources.py (outdated; resolved)
```diff
         },
     },
 )


 @pytest.fixture(scope="session")
-def ferc_to_sqlite_xbrl_only(
+def ferc1_xbrl_extract(
```
Member:

Ooh, nice to not have to extract all the other forms in test 🤩

Collaborator (Author):

Though this is probably irrelevant once we're running ferc_to_sqlite normally as part of ci-integration (see #2825) rather than in-process as part of the integration tests. At that point, we will likely deprecate these fixtures and assume we're always relying on --live-dbs?

Member:

Yes, that sounds correct to me - though we should consider making sure we can't actually mutate anything in --live-dbs mode....

Member:

Are we going to rework the testing setup in the Makefile to run ferc_to_sqlite and pudl_etl before running the tests too? That would be nice for the sake of consistency (doing the same thing in CI and locally), but would we need a different way of isolating the test outputs from live outputs? If the tests are always run with --live-dbs, then should that option just be removed entirely?

Collaborator (Author):

I'd say yes to removing --live-dbs functionality altogether, once we've isolated ferc_to_sqlite/etl from the unit/integration tests.

The obvious question is what the semantics of running the integration tests locally should be: do we expect that the outputs are already in place, or do we want to re-run the data generation (which could be time consuming)?

In the latter case, even if we no longer have --live-dbs, we could set PUDL_OUTPUT in the Makefile itself to ensure that outputs that are generated for the sake of testing are isolated from the local outputs.

Happy to have some more conversation about this, but it seems to be out of scope for this specific PR, so I would like to not block on it.

test/conftest.py (outdated; resolved)

```python
# Construct xbrl2sqlite op context
context = build_op_context(
    # always use a tmp path here so that we don't clobber the live DB when --live-dbs is passed
```
Member:

We're no longer mocking PudlPaths() here - is this comment still relevant? It seems like we're checking for --live-dbs at test runtime, which is fine I guess, but also error-prone: we need to remember to bail out of any db-touching test when live_dbs == True.
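One way to make that bail-out hard to forget is a guard fixture that mutating tests must request; a minimal sketch, with hypothetical fixture and test names (--live-dbs is the project's real pytest option):

```python
import pytest


@pytest.fixture
def mutable_db_guard(request):
    """Require this fixture in any test that writes to the databases."""
    if request.config.getoption("--live-dbs"):
        pytest.skip("db-mutating test skipped because --live-dbs is set")


def test_rebuild_ferc1_db(mutable_db_guard):
    ...  # safe: automatically skipped whenever --live-dbs is passed
```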

Collaborator (Author):

Yeah, the fact that we have the tests operating in these two distinct modes (read-only vs. mutating) is a bit dangerous IMO, and there's always a possibility that we don't do isolation well and things get mangled. I suppose that, ideally, we would rely on most of the db tests being validations that will not modify whatever is in place.

test/unit/extract/xbrl_test.py (resolved)
@zaneselvans (Member) commented Dec 6, 2023

Another option for splitting the sizeable FERC extraction work out from all of the other stuff without blowing up the resource use on an individual machine would be to split each of the FERC extractions into its own CI job. I think we only need to upload/download the FERC 1 DBs for the PUDL ETL to run. PUDL wouldn't need to be dependent on any of the other FERC jobs, and we could just upload and combine coverage reports from the non-FERC1 forms.

I think this could just be done by splitting the ferc2sqlite params out into separate settings files. The FERC1 takes the most time by a significant margin. It could even be worthwhile to split out the XBRL and DBF extractions into 2 separate jobs (since they're totally independent conceptually, and take roughly the same amount of time to run). I think splitting the work out into separate CI jobs on separate runners would probably cut our ~45min FERC run down to more like 15min and would also cut the time we burn on larger runners almost in half.

@rousik (Collaborator, Author) commented Dec 6, 2023

> Another option for splitting the sizeable FERC extraction work out from all of the other stuff without blowing up the resource use on an individual machine would be to split each of the FERC extractions into its own CI job. I think we only need to upload/download the FERC 1 DBs for the PUDL ETL to run. PUDL wouldn't need to be dependent on any of the other FERC jobs, and we could just upload and combine coverage reports from the non-FERC1 forms.
>
> I think this could just be done by splitting the ferc2sqlite params out into separate settings files. The FERC1 takes the most time by a significant margin. It could even be worthwhile to split out the XBRL and DBF extractions into 2 separate jobs (since they're totally independent conceptually, and take roughly the same amount of time to run). I think splitting the work out into separate CI jobs on separate runners would probably cut our ~45min FERC run down to more like 15min and would also cut the time we burn on larger runners almost in half.

This is basically putting yet another layer of doing the same thing in a different way on top of dagster parallelization; I can see the practical reason for it here (breaking out of the bounds of the relatively small GitHub runners), but it's a bit of a dirty trick IMO. If it can save some money, then we could go for it. The other option would be to add --only-dataset or --num-shards flags that could subselect dagster assets/jobs or dataset settings programmatically, which would remove the need to maintain a collection of configs for this.

@zaneselvans (Member) commented Dec 6, 2023

Yeah, I don't love the additional complexity, but it seems like it could be done with no change to the codebase, result in a significantly bigger speedup, and also save some paid CI minutes, so it seemed worth looking at. And I guess these approaches are not necessarily mutually exclusive.

@jdangerx Do you think there are any clear opportunities for speeding up the XBRL extraction jobs at a low level? Right now it takes about the same amount of time to extract 20-30 years of DBF data as 2 years of XBRL data, so it seems like in production this issue is going to get worse over time.

@jdangerx (Member) commented Dec 7, 2023

We can always profile and see if there's anything funky going on - not sure how to prioritize that work over e.g. getting archiver stuff working, again (again).
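For reference, a quick way to start that profiling with the stdlib; the run_xbrl_extraction entry point here is hypothetical:

```python
import cProfile
import pstats

# Profile one extraction run and dump the stats to a file.
cProfile.run("run_xbrl_extraction(form='ferc1', years=[2021, 2022])", "xbrl.prof")

# Show the 20 most expensive call sites by cumulative time.
pstats.Stats("xbrl.prof").sort_stats("cumulative").print_stats(20)
```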

@rousik (Collaborator, Author) commented Dec 13, 2023

> Another option for splitting the sizeable FERC extraction work out from all of the other stuff without blowing up the resource use on an individual machine would be to split each of the FERC extractions into its own CI job. I think we only need to upload/download the FERC 1 DBs for the PUDL ETL to run. PUDL wouldn't need to be dependent on any of the other FERC jobs, and we could just upload and combine coverage reports from the non-FERC1 forms.
>
> I think this could just be done by splitting the ferc2sqlite params out into separate settings files. The FERC1 takes the most time by a significant margin. It could even be worthwhile to split out the XBRL and DBF extractions into 2 separate jobs (since they're totally independent conceptually, and take roughly the same amount of time to run). I think splitting the work out into separate CI jobs on separate runners would probably cut our ~45min FERC run down to more like 15min and would also cut the time we burn on larger runners almost in half.

I think this is a reasonable idea and I would like to pursue it as well. I'm thinking I could add --dataset-only and maybe --format-only flags to the ferc_to_sqlite CLI that would filter down which ops are executed. I do think that having the work split into one op per dataset/format, as is done here, would pair very nicely with that functionality, so I think we should still pursue this change.

For resource usage safety, I would also suggest setting the default xbrl workers to 2 and relying primarily on the dagster workers for parallelism.
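A rough sketch of how such a --dataset-only flag could map onto the op tags introduced in this PR; the CLI wiring and job name are illustrative, not the actual ferc_to_sqlite script:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--dataset-only",
    help="e.g. ferc1_xbrl; if set, run only the ops for that dataset/format",
)
args = parser.parse_args()

# Translate the flag into a dagster op-selection pattern matching the op names.
op_selection = [f"{args.dataset_only}*"] if args.dataset_only else None

result = ferc_to_sqlite_job.execute_in_process(op_selection=op_selection)
```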

rousik and others added 4 commits December 12, 2023 21:52
Rely on dagster parallelism here.
Restrict processing to, say, the ferc1_dbf or ferc2_xbrl dataset. This is intended for ci-integration parallelism.
@rousik (Collaborator, Author) commented Dec 13, 2023

Implemented the --dataset-only=ferc1_xbrl filter as a command-line flag that can be used in conjunction with the new pytest refactor.

@rousik rousik requested a review from jdangerx December 13, 2023 17:28
@zaneselvans (Member) commented:
@rousik there was an error because pudl wasn't being imported in conftest.py so I added the import.

Now there's a real error related to splitting the FERC jobs up, because the Datasette metadata-to-RST script that generates the metadata.yaml file annotating our Datasette deployment expects all of the FERC outputs to be available in $PUDL_OUTPUT when it runs.

@rousik (Collaborator, Author) commented Dec 29, 2023

> @rousik there was an error because pudl wasn't being imported in conftest.py so I added the import.
>
> Now there's a real error related to splitting the FERC jobs up, because the Datasette metadata-to-RST script that generates the metadata.yaml file annotating our Datasette deployment expects all of the FERC outputs to be available in $PUDL_OUTPUT when it runs.

@zaneselvans can you provide some additional context? Do we expect to run this datasette metadata job as part of ci-integration, or is that something that should run as part of daily/nightly deployments? I suppose the question here is whether early coverage/error detection is worth the added cost (time/resources), or whether catching this on a nightly cadence (when we try to ship the datasette) is what we want.

@rousik (Collaborator, Author) commented Dec 29, 2023

> @rousik there was an error because pudl wasn't being imported in conftest.py so I added the import.
> Now there's a real error related to splitting the FERC jobs up, because the Datasette metadata-to-RST script that generates the metadata.yaml file annotating our Datasette deployment expects all of the FERC outputs to be available in $PUDL_OUTPUT when it runs.
>
> @zaneselvans can you provide some additional context? Do we expect to run this datasette metadata job as part of ci-integration, or is that something that should run as part of daily/nightly deployments? I suppose the question here is whether early coverage/error detection is worth the added cost (time/resources), or whether catching this on a nightly cadence (when we try to ship the datasette) is what we want.

It seems to me that we could adapt the datasette tests so that they're configuration-aware and only test for the presence of files that should have been produced, i.e. skipping over datasets that are marked as disabled.
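A minimal sketch of such a configuration-aware check; the settings helper, fixtures, and dataset list are hypothetical:

```python
import pytest

DATASETS = ["ferc1_dbf", "ferc1_xbrl", "ferc2_xbrl", "ferc6_xbrl"]


@pytest.mark.parametrize("dataset", DATASETS)
def test_output_db_exists(dataset, etl_settings, pudl_output_dir):
    if not etl_settings.is_enabled(dataset):  # hypothetical helper
        pytest.skip(f"{dataset} is disabled in this test configuration")
    assert (pudl_output_dir / f"{dataset}.sqlite").exists()
```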

Base automatically changed from dev to main January 5, 2024 04:14
@zaneselvans added the testing, performance, github-actions, xbrl, and dbf labels on Jan 18, 2024
@jdangerx (Member) left a comment

Overall looks pretty good! Thanks for sticking with this long-standing PR for so long 😅

I have a few clarifying questions to make sure the behavior is what we need before merging in.

docker/gcp_pudl_etl.sh (outdated; resolved)
src/pudl/ferc_to_sqlite/__init__.py (outdated; resolved)
src/pudl/resources.py (resolved)
test/conftest.py (outdated; resolved)
test/unit/extract/xbrl_test.py (resolved)
@rousik rousik requested a review from jdangerx January 30, 2024 23:40
@rousik (Collaborator, Author) commented Jan 30, 2024

Reworked the datasette tests after the recent changes there, to allow for the presence of ferc1_xbrl only when --live-dbs is not enabled. This is not pretty, but I'm expecting we'll ditch that functionality/duality, and then we can fix these tests properly.

@jdangerx (Member) left a comment

Thanks for pushing this through, @rousik! Though there is still one merge conflict & a small typo re: a comment you had meant to remove.

There's definitely room for figuring out how we want to handle --live-dbs, and that is also way out of scope for this PR. If you have thoughts, @rousik, I'd love it if you could write them up in a Discussion!

test/unit/extract/xbrl_test.py (resolved)
src/pudl/resources.py (outdated; resolved)
@rousik (Collaborator, Author) commented Feb 3, 2024

> Thanks for pushing this through, @rousik! Though there is still one merge conflict & a small typo re: a comment you had meant to remove.
>
> There's definitely room for figuring out how we want to handle --live-dbs, and that is also way out of scope for this PR. If you have thoughts, @rousik, I'd love it if you could write them up in a Discussion!

Merged main and resolved the conflicts; letting the tests run overnight and, fingers crossed, things should be good now.

@zaneselvans (Member) left a comment

Hey @rousik, note that we set up the merge queue last week, and CI now only runs the integration tests in the merge queue.

Comment on lines 88 to 91:

```python
default=2,
help=(
    "Number of worker processes to use when parsing XBRL filings. "
    "Defaults to using the number of CPUs."
```
Member:

Is the default here still supposed to be 2? If so the help message needs to be updated.

Member:

Let's keep the default as 0, which is what it was, and then force concurrency of 2 in conftest.py. Also, we set --workers 8 in gcp_pudl_etl.sh, so that is unaffected by this default.

…lt to 1.

It didn't default to the number of CPUs anyway: we passed in 0 as the default before, which would eventually make the batch size equal to the total # of filings, i.e. 1 concurrent worker.
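A toy illustration of that degenerate batching; the formula is an assumption about the extractor's batching behavior, not its exact code:

```python
import math


def num_batches(total_filings: int, workers: int) -> int:
    # With workers=0 the batch size balloons to cover every filing at once,
    # so only one batch (and effectively one worker) ever exists.
    batch_size = math.ceil(total_filings / (workers or 1))
    return math.ceil(total_filings / batch_size)


print(num_batches(total_filings=120, workers=0))  # -> 1 batch, i.e. serial
print(num_batches(total_filings=120, workers=4))  # -> 4 batches in parallel
```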
@jdangerx (Member) commented Feb 7, 2024

Fixed the merge conflict & changed the default back to 0. Though I think that will actually throw a ValueError per the docs, so I'm running the tests with a default of None and will push that if it works.
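That matches the stdlib semantics if the worker count is handed straight to multiprocessing.Pool (whether the extractor does exactly that is an assumption):

```python
from multiprocessing import Pool

Pool(processes=None)  # OK: falls back to os.cpu_count() workers
Pool(processes=0)     # raises ValueError: number of processes must be at least 1
```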

@jdangerx jdangerx added this pull request to the merge queue Feb 7, 2024
Merged via the queue into main with commit 5e71660 Feb 7, 2024
13 checks passed
@jdangerx jdangerx deleted the split-ferc2sqlite-ops branch February 7, 2024 19:30
Labels: community, dbf, ferc1, github-actions, performance, testing, xbrl