
Add support for choosing between multiprocess and inprocess executors via cli flag #2895

Merged
16 commits merged on Dec 1, 2023

Conversation

@rousik (Collaborator) commented Sep 26, 2023:

This change picks the appropriate Dagster executor based on how many workers are used: with 1 worker, use the in-process executor; with N workers, use the multiprocess executor with its concurrency limit set to N.

The original flag --max-concurrent is renamed to --dagster-workers to make it more explicit. There are many kinds of parallelism/concurrency at play, so I thought the new name would be more appropriate.

Added this flag to ferc_to_sqlite as well to make both tools uniform.
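The worker-count-to-executor mapping described above can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code: the function name and the convention that 0 means "no limit" are illustrative assumptions; the nested run-config shape matches the `{"execution": {"config": {"in_process": {}}}}` excerpt quoted later in this thread.

```python
def executor_config(dagster_workers: int) -> dict:
    """Build a Dagster run_config selecting an executor.

    With a single worker, run in-process; otherwise use the
    multiprocess executor, capping concurrency when a limit is given
    (0 is treated here as "no limit").
    """
    if dagster_workers == 1:
        return {"execution": {"config": {"in_process": {}}}}
    multiprocess: dict = {}
    if dagster_workers > 1:
        multiprocess["max_concurrent"] = dagster_workers
    return {"execution": {"config": {"multiprocess": multiprocess}}}
```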

@zaneselvans (Member) left a comment:


It might be simpler to just always use the multi-process executor to avoid having to handle the different asset caching behavior between in-process and multi-process execution that came up in #2470, and ensure that the tests always use an isolated $DAGSTER_HOME that keeps the user's files safe?

Other than that just minor help message stuff.
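One way to get the isolation suggested here is to point $DAGSTER_HOME at a throwaway directory for the duration of a test. This is a hedged standard-library sketch, not the project's actual fixture; in a pytest suite it would more naturally be built on the tmp_path and monkeypatch fixtures.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def isolated_dagster_home():
    """Point $DAGSTER_HOME at a temporary directory, restoring the
    previous value (or unsetting the variable) on exit, so tests never
    read from or clobber the user's real Dagster storage."""
    original = os.environ.get("DAGSTER_HOME")
    with tempfile.TemporaryDirectory(prefix="dagster_home_") as tmp:
        os.environ["DAGSTER_HOME"] = tmp
        try:
            yield tmp
        finally:
            if original is None:
                os.environ.pop("DAGSTER_HOME", None)
            else:
                os.environ["DAGSTER_HOME"] = original
```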

src/pudl/ferc_to_sqlite/cli.py (outdated inline comment, resolved)
```python
return {
    "execution": {
        "config": {
            "in_process": {},
```
Member commented:

The switch away from the in-process executor will also affect the behavior in #2470, I think.

Do we need to make sure that the tests have their own isolated $DAGSTER_HOME so that they don't clobber the user's environment when they run? Also if we're running with a real $DAGSTER_HOME and writing the pickled dataframes out to disk rather than holding everything in memory, can we re-enable the Jupyter Notebook tests?

@rousik (Collaborator, Author) replied:

Note that this behavior controls execution when the CLI tools are run directly; it should not really change how tests are invoked. I'm not sure I fully understand the problem here, but my guess is that the tests are currently not fully isolated and leak some materializations into the shared $DAGSTER_HOME. In that case, I agree that we should probably spin up a temporary directory when running tests whenever we are not running with --live-dbs. With that flag, no ETL should really be run anyway, right?

@rousik (Collaborator, Author) replied:

It seems to me that the two concerns (how we run the CLIs and how we run tests) should be orthogonal?

Member replied:

It seemed to me like they overlap a bit because in this PR we're using the CLI to run the ETL for the tests rather than using the test fixture.

With the in-process executor, the default IO manager doesn't write anything to disk in $DAGSTER_HOME; it keeps the assets in memory. (Assets that are written to the database still get written to the database, wherever the PUDL DB path says that DB is.) Currently the tests inherit the user's $DAGSTER_HOME, so if anything in the tests attempts to read from that directory, it's reading the user's preexisting outputs rather than outputs generated by the tests. A test that relies on an intermediate output written to disk can therefore pass when the user runs the tests locally (since they've got a $DAGSTER_HOME full of pickled dataframes) and then fail in the GitHub CI, where no intermediate dataframes have ever been written to disk, since the ETL has only ever been run in-process there.

So my guess is that when the tests that use the CLI to build the DB are run locally, they will now overwrite the user's pickled dataframes, probably replacing etl_full outputs with etl_fast outputs.
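The clobbering risk described above can be illustrated with a toy filesystem IO manager: if the on-disk path is derived from the asset key alone, two different runs (e.g. a fast and a full ETL) write to the same file, and the later run silently replaces the earlier output. The path scheme, function names, and asset key below are illustrative, not Dagster's actual layout.

```python
import pickle
from pathlib import Path


def asset_path(dagster_home: Path, asset_key: str) -> Path:
    # Path depends only on the asset key, not on which run produced it.
    return dagster_home / "storage" / asset_key


def store_asset(dagster_home: Path, asset_key: str, value) -> Path:
    """Pickle an asset value to its key-derived path, overwriting any
    previous run's output for the same key."""
    path = asset_path(dagster_home, asset_key)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(value))
    return path
```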

@rousik (Collaborator, Author) replied:

I think the current plan was to use the CLI to run the ETL for integration tests only in the CI infrastructure, and there it should use the multiprocess executor with no limits. This change should not affect the default behavior, but people who want to run single-threaded can (e.g. for perf analysis).

Do we also want to use the separate CLI-driven ETL for locally run integration tests? If so, we probably need to do some additional work.

codecov bot commented Sep 26, 2023:

Codecov Report

Attention: 14 lines in your changes are missing coverage. Please review.

Comparing base (f456126, 88.7%) with head (cc81e2b, 88.6%). Report is 1 commit behind head on dev.

File                             Patch %   Missing lines
src/pudl/ferc_to_sqlite/cli.py   33.3%     6 ⚠️
src/pudl/helpers.py              16.6%     5 ⚠️
src/pudl/cli/etl.py              25.0%     3 ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2895     +/-   ##
=======================================
- Coverage   88.7%   88.6%   -0.1%     
=======================================
  Files         89      89             
  Lines      11023   11040     +17     
=======================================
+ Hits        9785    9790      +5     
- Misses      1238    1250     +12     


@rousik rousik merged commit 7f1fb1c into dev Dec 1, 2023
9 of 11 checks passed
@zaneselvans zaneselvans deleted the inprocess branch January 2, 2024 22:42