WIP: Initial conversion of the ETL to prefect flow #901
Conversation
1. Added task tagging with dataset/$blah and bundle/$bundle_name.
2. Used the above annotation to create a synthetic dependency between _transform_eia -> _transform_epacems. The latter reads the emitted csv files, which is pretty bad.
3. Inlined the _should_run_$dataset function and fixed the return types of the _etl_$x methods (return None if no tasks are generated).
Use prefect.List once instead of nested prefect.Set tasks. Flatten the list of lists into a unique list of table names inside MetadataBundleMaker.
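The flattening described above can be sketched as a small helper (the function name is hypothetical; MetadataBundleMaker in the PR does something equivalent):

```python
def merge_table_names(list_of_lists):
    """Flatten lists of table names into a sorted, de-duplicated list."""
    return sorted({name for names in list_of_lists for name in names})

# Duplicates across sub-lists collapse into a single entry.
tables = merge_table_names([["plants_eia", "fuel_ferc1"], ["fuel_ferc1"]])
```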
Use one task per (year, state) combo, which allows massive parallelization. The current code may still have bugs w.r.t. extracting table names for datapkg metadata generation. Refactored the eia->epacems dependency into a single task that gets injected into the mapped epacems tasks.
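The (year, state) fan-out can be illustrated with a hypothetical partition generator; in Prefect the resulting list would be fed to `task.map()` so each partition becomes its own task (years and states below are placeholders):

```python
from itertools import product

def epacems_partitions(years, states):
    """One work unit per (year, state) so extraction can run in parallel."""
    return list(product(years, states))

# Each tuple would map to a single prefect task via .map()
parts = epacems_partitions([2019, 2020], ["CA", "TX"])
```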
Passing --use_dask_cluster ${scheduler_addr} enables this behavior.
WIP: Naming of these artifacts still feels somewhat clunky and there doesn't seem to be a good way to control the output path for this. It is also unclear how to swap LocalResult for cloud-based storage, but perhaps this is also not entirely necessary until after we have kubernetes based deployment (if this is needed/useful).
1. The csv dumping method is wrapped in a prefect @task (write_datapackages).
2. Created synthetic dependencies between write_datapackage and _extract_table_names to ensure synchronization (unclear if this is needed).
3. Make sure that we use {df_name: df} dicts as the structure exchanged between ETL tasks.
4. Cleaned up task tagging (use bundle/{bundle_name}/{datapkg_name}).
1. Created the DatasetPipeline abstraction and refactored the existing code to use it for individual task construction.
2. Explicitly linked eia into epacems via the get_table() method and refactored load_plant_utc_offset to read the dataframe directly.
3. Created the merge_dataframe_maps task that ensures we always operate on {df_name: df} dictionaries. This is still WIP and the ETL currently fails.
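A minimal sketch of what a merge_dataframe_maps task might do; rejecting duplicate table names is an assumption here, not something the PR confirms:

```python
def merge_dataframe_maps(*maps):
    """Combine several {df_name: df} dicts into one, refusing collisions."""
    merged = {}
    for m in maps:
        for name, df in m.items():
            if name in merged:
                raise ValueError(f"Duplicate table name: {name!r}")
            merged[name] = df
    return merged

combined = merge_dataframe_maps({"plants": 1}, {"fuel": 2})
```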
This controls whether the ETL will run on the local dask cluster with task parallelization, or use the local executor that runs tasks serially.
Originally, this dtyping was embedded in the etl.py but it makes more sense here.
int64 dtype can't be cast from object dtype directly; it needs to be cast via float: object -> float -> int64. This modification finds all object-dtyped columns that should be converted to int64 and converts them to float first.
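A minimal pandas illustration of the two-step cast (column name and values are made up):

```python
import pandas as pd

# Strings like "111.0" cannot be cast to int64 directly:
# df["utility_id"].astype("int64") raises ValueError.
df = pd.DataFrame({"utility_id": ["111.0", "222.0", "333.0"]})

# Going through float first succeeds.
df["utility_id"] = df["utility_id"].astype(float).astype("int64")
```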
1. Added dtype conversion capability to merge_dataframe_maps (currently unused but may come in handy to simplify things).
2. Used the pudl_task_target_name callback to construct result paths that incorporate datapkg_name (for uniqueness).
3. Configured cache validation for a variety of tasks.
4. Marked eia tasks with datapkg/${datapkg_name} for the purpose of pudl_task_target_name (WIP: needs to be applied everywhere).
Fixed the previously broken logic for object->int64 conversion and improved the debugging information. If dtype conversion fails, dump the offending dataframe to disk and rerun the conversion column by column to identify which column holds the offending data.
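The column-by-column retry strategy could look roughly like this (function name is hypothetical; the PR's version also dumps the dataframe to disk):

```python
import pandas as pd

def convert_or_pinpoint(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Try the bulk conversion; on failure, retry column by column so the
    error names the offending column."""
    try:
        return df.astype(dtypes)
    except (ValueError, TypeError):
        for col, dtype in dtypes.items():
            try:
                df[col].astype(dtype)
            except (ValueError, TypeError) as err:
                raise ValueError(
                    f"Column {col!r} cannot be cast to {dtype}: {err}")
        raise  # bulk conversion failed but no single column did
```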
Pretty-print the json-style report from goodtables and trim the offending values to 5 max per error. This should make it easier to quickly identify what has gone wrong.
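A sketch of the trimming, assuming a goodtables-style report dict with a tables -> errors -> values shape (the exact report schema here is an assumption):

```python
import json

def trim_goodtables_report(report: dict, max_values: int = 5) -> str:
    """Pretty-print the report, keeping at most max_values offending
    values per error."""
    for table in report.get("tables", []):
        for error in table.get("errors", []):
            if isinstance(error.get("values"), list):
                error["values"] = error["values"][:max_values]
    return json.dumps(report, indent=2)
```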
Added lots of docstrings and applied some automatic style fixes.
Moved settings validation to extract/ferc1.py, created simple prefect.task wrapper that initializes ferc1_to_sqlite database and wired this to Ferc1Pipeline such that only one ferc1_to_sqlite task is created in a flow. Plumbed some necessary arguments (etl_settings and clobber) through the intermediary layers.
This can be set to ALWAYS|ONCE|NEVER and controls how the ferc1 sqlite database should be initialized.
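The three modes could be modeled like this (enum and helper names are hypothetical; only the ALWAYS|ONCE|NEVER semantics come from the PR):

```python
from enum import Enum

class Ferc1DbInit(Enum):
    ALWAYS = "always"  # rebuild the sqlite DB even if it already exists
    ONCE = "once"      # build it only if it does not exist yet
    NEVER = "never"    # never build; assume it is already there

def should_build(mode: Ferc1DbInit, db_exists: bool) -> bool:
    """Decide whether the ferc1_to_sqlite step needs to run."""
    if mode is Ferc1DbInit.ALWAYS:
        return True
    if mode is Ferc1DbInit.ONCE:
        return not db_exists
    return False
```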
The --upload-to-gcs-bucket cmdline flag kicks off archival and upload of the datapackage files to the bucket. The prefix ${uuid}/${doi} is used for the ETL artifacts.
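A sketch of how the ${uuid}/${doi} prefix might be constructed (bucket name and helper are made up; only the prefix scheme comes from the PR):

```python
import uuid

def artifact_prefix(bucket: str, doi: str) -> str:
    """Datapackage artifacts land under gs://{bucket}/{uuid}/{doi}."""
    return f"gs://{bucket}/{uuid.uuid4()}/{doi}"

prefix = artifact_prefix("my-bucket", "10.5072/zenodo.123456")
```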
Added a simple tool for constructing requirements.txt from setup.py. Restructured docker build instructions to pre-install pudl dependencies before installing the package. This, along with experimental cache mounts, should result in build time speedups. Injecting requirements.txt and installing dependencies *before* the code is added to the container image should also help better utilize the native docker layer caching mechanism.
This triggers on push to master, dev and sprint* branches.
This controls whether the dependency graph for the prefect flow should be shown.
Codecov Report
```
@@           Coverage Diff            @@
##              dev     #901      +/-  ##
==========================================
- Coverage   83.32%   79.31%    -4.00%
==========================================
  Files          62       68        +6
  Lines        6737     7067      +330
==========================================
- Hits         5613     5605        -8
- Misses       1124     1462      +338
```
Continue to review full report at Codecov.
…ctions from etl.py
…c tests to pytest
Ok! I’ve got dev merged into this branch with all the tests passing. I have also cleaned up the branch and tackled some TODOs Jan left. This branch is not complete (see remaining issues) but I think the largest changes have occurred and I would love some feedback. I left some questions in my review.
"""Create an argument parser for the etl script.""" | ||
parser = argparse.ArgumentParser( | ||
description=__doc__, | ||
parents=[pudl.etl.command_line_flags()]) |
Jan split command args into etl.py and cli.py. Should all of them live in cli.py?
Was cli.py initially just a place to set up everything for the ETL, or did it also run the ETL? If it is the latter, then it makes sense to me to just have everything in cli.py.
Hm. Is there any practical reason for having the chunk of args in etl.py? On the surface it seems much cleaner to pull all of the command args over here.
I suspect all the args should probably be read and parsed in the CLI module and handed off. Unless maybe the same parsing logic is getting used by multiple CLIs which are all passing stuff to the ETL module? But in that case we probably need to consider a standalone CLI parsing module that can be re-used, or consolidation of the CLIs / entrypoint scripts.
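For reference, the stdlib pattern being discussed here — a shared flag set reused by several CLIs via argparse's `parents=` mechanism — looks like this (flag names are illustrative):

```python
import argparse

def command_line_flags() -> argparse.ArgumentParser:
    """Shared flags that multiple entrypoint scripts can inherit.
    add_help=False is required for a parser used as a parent."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--clobber", action="store_true",
                        help="Overwrite existing outputs.")
    return parser

# Each CLI builds its own parser on top of the shared flags.
cli = argparse.ArgumentParser(description="ETL entry point",
                              parents=[command_line_flags()])
args = cli.parse_args(["--clobber"])
```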
```python
pudl_db_path = Path(pudl_settings["sqlite_dir"]) / "pudl.sqlite"
if pudl_db_path.exists() and not commandline_args.clobber:
    raise SystemExit(
        "The PUDL DB already exists, and we don't want to clobber it.\n"
        f"Move {pudl_db_path} aside or set clobber=True and try again."
    )
```
```python
# Configure how we want to obtain raw input data:
ds_kwargs = dict(
    gcs_cache_path=gcs_cache_path,
    sandbox=pudl_settings.get("sandbox", False)
)
if use_local_cache:
    ds_kwargs["local_cache_path"] = Path(pudl_settings["pudl_in"]) / "data"

validated_etl_settings = etl_settings.datasets
# Setup pipeline cache
configure_prefect_context(etl_settings, pudl_settings, commandline_args)
```
Prefect setup is kind of split between cli and etl. I think I’ve been staring at this for too long and would love any ideas on how to make the setup cleaner.
I think the CLI module should probably just be about gathering and parsing the arguments from the user, and then handing them off to other modules that do the real work. Having argument parsing inside modules that aren't exported as scripts seems unusual to me.
```python
shutil.rmtree(cache_root)


def configure_prefect_context(etl_settings, pudl_settings, commandline_args):
```
Prefect.context is a global var store that can be accessed during a flow run. Things like etl_settings, pudl_settings, overwrite_ferc1_db and pipeline_cache_path are all stored in prefect.context. Do y’all have any opinions on what should and should not live in this?
Should this be anything that's invariant across an ETL run, and will need to be accessed by different Tasks (which could be running on different nodes, in the fullness of time)?
```python
plants = pudl.transform.epacems.load_plant_utc_offset()

# Wait to build CEMS until EIA is done if EIA is in the settings file.
# If EIA is not in the settings file, go ahead and build CEMS on its own.
if "eia" in prefect.context.get("datasets", {}).keys():
    dfs_to_sqlite_task = flow.get_tasks(name="dfs_to_sqlite")
    logger.info("Setting EIA as a dependency of CEMS.")
    plants.set_dependencies(upstream_tasks=[dfs_to_sqlite_task])
```
Setting the eia and epacems dependency is a little awkward right now. EpaCemsPipeline has to look into the flow to see if the EIA pipeline is being run.
And if it is being run, it has to wait for it to complete so there's an SQLite DB to read from? And if it's not being run, it just reads from the DB directly?
Would it make sense to integrate SQLite with FSSpecResult class somehow, so we can have an input that's available based on the database existing -- either because the EIA ETL has completed, or because we're not running the EIA ETL?
```python
# TODO(bendnorman): Should upload-to be supported right now?
parser.add_argument(
    "--upload-to",
    type=str,
    default=os.environ.get('PUDL_UPLOAD_TO'),
    help="""A location (local or remote) where the results of the ETL run
    should be uploaded to. This path will be interpreted by fsspec so
    anything supported by that module is a valid destination.
    This should work with GCS and S3 remote destinations.
    Default value for this will be loaded from PUDL_UPLOAD_TO environment
    variable.
    Files will be stored under {upload_to}/{run_id} to avoid conflicts.
    """)
```
Do we want to support upload_to right now?
We'll need this as soon as we start running this remotely right? Otherwise we'll just end up with some files on the local disk that won't be accessible from anywhere else?
That is true. If we aren't running this remotely soon should it be included in this PR? I don't want to add functionality that isn't fully supported.
"""Create an argument parser for the etl script.""" | ||
parser = argparse.ArgumentParser( | ||
description=__doc__, | ||
parents=[pudl.etl.command_line_flags()]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was cli.py
initially just a place to setup everything for the etl? or did it also run the elt? If it is the latter than it makes sense to me to just have everything in cli.py
"--timestamped-logfile", | ||
default="/tmp/pudl_etl.%F-%H%M%S.log", # nosec | ||
help="""If specified, also log to the timestamped logfile. The value of | ||
this flag is passed to strftime method of datetime.now().""") |
does this default mean that it will always log to the timestamped file?
I think so, yes, but it'll go to a temporary directory and the OS will clean it up automatically. Is it okay to hard-code the /tmp/... path here? Or do we want to ask the OS for a tmpdir? Is this hooked up to the Python logging system later, hopefully? I guess we'll need a different kind of logging output for the cloud use case.
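Asking the OS for its tmpdir instead of hard-coding /tmp could look like this (helper name and timestamp pattern are illustrative; `%Y-%m-%d` is used instead of the platform-dependent `%F`):

```python
import datetime
import tempfile
from pathlib import Path

def timestamped_logfile(pattern: str = "pudl_etl.%Y-%m-%d.%H%M%S.log") -> Path:
    """Build a timestamped logfile path inside the OS-provided tmpdir."""
    name = datetime.datetime.now().strftime(pattern)
    return Path(tempfile.gettempdir()) / name

log_path = timestamped_logfile()
```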
```python
run_id = generate_run_id(args)
prefect.context.pudl_run_id = run_id

logger.warning(
```
Why is this a `warning` instead of `info`?
I think it's just so it's visible even when the LOGLEVEL is lower than INFO.
"""Create an argument parser for the etl script.""" | ||
parser = argparse.ArgumentParser( | ||
description=__doc__, | ||
parents=[pudl.etl.command_line_flags()]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Is there any practical reason for having the chunk of args in etl.py
? On the surface it seems much cleaner to pull all of the command args over here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
omg I got through the whole thing. Lots of comments and questions inline and I may bring some comments over from the Prefect Docs and put them in the comments on the PR.
```yaml
command: /pudl/src/release/data-release.sh
# TODO(rousik): create validate service that validates stuff under build/
#   - ${GCP_KEY_PATH}:/tmp/keys/gcp-keyfile.json:ro
  - datapkg:/pudl/outputs/datapkg
```
Is there still datapackage stuff going on over here?
command: ["dask-worker", "tcp://dask-scheduler:8786"] | ||
volumes: | ||
# - ${GCP_KEY_PATH}:/tmp/keys/gcp-keyfile.json:ro | ||
- datapkg:/pudl/outputs/datapkg |
Ditto datapackage question above.
```yaml
volumes:
  pudldata: {}
  datapkg:
```
datapackages!
```yaml
datapkg_bundle_name: pudl-fast
datapkg_bundle_doi: 10.5072/zenodo.123456  # Sandbox DOI... not real.
datapkg_bundle_settings:
```
All the datapkg settings attributes have been obsoleted now haven't they?
```yaml
ferc1_to_sqlite_tables:
  - f1_respondent_id
  - f1_gnrt_plant
  - f1_steam
  - f1_fuel
  - f1_plant_in_srvce
  - f1_hydro
  - f1_pumped_storage
  - f1_purchased_pwr
```
Is this list of FERC 1 tables truncated because it's just for testing purposes?
```python
assert dfc1.get_table_names() == ["first"]
assert dfc2.get_table_names() == ["second"]
assert final_dfc.get_table_names() == ["first", "second"]
```
Is this particular ordering of names guaranteed? Would `["second", "first"]` indicate failure? Could convert both sides to sets.
Yup, `get_table_names` uses `sort`, so it guarantees order (with the intention of making the results stable for easy unit testing).
```python
def test_to_dict(self):
    with pytest.raises(dfc.TableExistsError):
        dfc1.union(dfc1)
```
Is this supposed to be unioning `dfc1` with itself? Or with `dfc2`?
This is a bug. The original code unioned `dfc1` and `dfc2`.
```diff
@@ -193,7 +201,7 @@ def named_empty_df(df_name):
     names = ["a", "b", "c", "d"]
     res = dfc.merge_list(named_empty_df.map(names))
     final = f.run().result[res].result
-    self.assertEqual(["a", "b", "c", "d"], final.get_table_names())
+    assert final.get_table_names() == ["a", "b", "c", "d"]

 def test_fanout_task(self):
```
I'm bewildered by this test. I guess I need to go look at the DFC class. Do you understand what fanout is doing, and the work that Prefect is doing in here?
The fanout task simply splits a DFC into a list of DFCs with a fixed number of tables each.
This could be used in situations like epacems, where a large number of tables come together in a single DFC and you want to split it into smaller chunks that can be processed in parallel.
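A pure-Python sketch of that fanout chunking; the real task operates on a DFC, so a plain dict of named tables stands in here:

```python
def fanout(tables: dict, chunk_size: int) -> list:
    """Split a {name: table} dict into fixed-size chunks for parallel work."""
    names = sorted(tables)  # stable order, mirroring get_table_names()
    return [
        {n: tables[n] for n in names[i:i + chunk_size]}
        for i in range(0, len(names), chunk_size)
    ]

chunks = fanout({"a": 1, "b": 2, "c": 3}, chunk_size=2)
```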
```python
class TestEtlSettings:
    """Test EtlSettings."""

    # TODO(bendnorman): This method works and passes as an individual
    # unit test but fails when run with all unit tests.
```
Weird. Is it some kind of issue with the order in which tests get run?
I suppose you're referring to the dataframe serialization to disk? I do not think there's any canonical/universal way of doing this, and the implementation choice depends on your specific usecase. This format essentially adds a disk serialization feature to the preexisting data exchange format between tasks (collections of named dataframes), which allows you to: 1. distribute the workload across machines; 2. reduce the memory pressure by offloading the intermediate states to disk. That said, this is totally something that can be switched/replaced with a different serialization mechanism if needed/wanted. I would view this as an internal/development format, so we should be using whatever is useful to us w.r.t. debuggability/maintenance and integration with the pipeline internals. In particular, we could use a database but I think it would make sense only if:
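The disk-spilling idea can be sketched with plain pickle files; the serialization format and helper names here are illustrative, not the PR's actual implementation:

```python
import pickle
import tempfile
from pathlib import Path

def spill_to_disk(dfs: dict, cache_dir: Path) -> dict:
    """Write each named dataframe to its own file; return {name: path} so
    downstream tasks can load only the tables they need."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, df in dfs.items():
        path = cache_dir / f"{name}.pkl"
        path.write_bytes(pickle.dumps(df))
        paths[name] = path
    return paths

def load_from_disk(paths: dict) -> dict:
    """Rehydrate the {name: df} map from the spilled files."""
    return {name: pickle.loads(p.read_bytes()) for name, p in paths.items()}
```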
Can you expand on what you mean by off-label use of prefect? Overall, I do think that it might be valuable to offer a higher-level API where end-users/developers describe how individual tables are constructed or transformed and then let some translation layer turn this into prefect tasks. This might have the benefit of insulating end-users from the lower-level abstractions (prefect tasks) and enforcing some common structure. I do think that the prototype is somewhat leaky, as we are still dealing with a bit of both (DatasetPipeline combined with prefect task-annotated functions).
Can you elaborate on this thought a bit more?
I think it should still be possible to run this locally, so using cloud resources should not be mandatory. However, I do think that this would allow us to run some automated tests/validation on the cloud, e.g. as part of PR validation: a PR could kick off a github action that runs the pipeline and emits the results somewhere so they can be checked for correctness, or compared against the main branch to highlight whether the results have changed (noop validation).
Yeah. I can even imagine a situation where the ETL would rely solely on the cloud storage mirror and another component/pipeline (datastore) would take care of synchronization.
Automatically determining what needs to be rerun is a hard problem (perhaps practically intractable). Trivial changes to libraries or some support code can have unexpected impact on all sorts of things. That said, we have options here:
Closing this as we chose Dagster in #1487 and I want to clean up our open issues a bit. |
This is still very much WIP and is for informative purposes only, so that you can familiarize yourself with the changes that this introduces.
A couple of important features:
Overall, the flow is constructed in `etl.py` using `${Dataset}Pipeline` objects that are responsible for adding relevant tasks to the flow. Many of the extract/transform methods are annotated with `@task` decorators, which turns them into prefect tasks; sometimes simple wrappers are used to aggregate some steps (e.g. epacems). `DataFrameCollection` is used as the default data structure for passing dataframes between prefect tasks. This ensures that data is written to disk and not held in memory.