
Bring the Census DP1 to SQLite ETL into dagster #2621

Merged
merged 50 commits into from Jun 21, 2023

Conversation

e-belfer
Member

@e-belfer e-belfer commented Jun 2, 2023

Addresses issue #2412.

This PR brings pudl.convert.censusdp1tract_to_sqlite and pudl.output.censusdp1 into dagster, producing 3 pickled outputs (1 for each layer of the Census DP1 database that is of interest). As of now, these outputs are not written into the PUDL database. They currently have 100+ columns, and are mostly used for their geometries. We pickle these outputs using the current default IO manager and feed them into downstream outputs that are written into PUDL (e.g. the state demand outputs in #2550). This is what is currently in this PR.
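The "pickle with the default IO manager" flow described above can be illustrated in plain Python. This is a stdlib-only sketch of what Dagster's default filesystem IO manager does under the hood (pickle each asset's return value to a per-asset path); the asset name, payload, and paths below are hypothetical, not the actual PUDL outputs:

```python
import pickle
import tempfile
from pathlib import Path

# Mimic Dagster's default fs_io_manager: pickle each asset's output to a
# file keyed by the asset name. Names and paths are illustrative only.
storage_root = Path(tempfile.mkdtemp())

def handle_output(asset_name: str, obj) -> Path:
    """Pickle an asset's output, as the default IO manager would."""
    path = storage_root / asset_name
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return path

def load_input(asset_name: str):
    """Load a pickled upstream asset for a downstream computation."""
    with (storage_root / asset_name).open("rb") as f:
        return pickle.load(f)

# A downstream asset (e.g. state demand) would read the pickled layer back:
handle_output("tract_censusdp1", {"layer": "tract", "n_rows": 74001})
tract = load_input("tract_censusdp1")
```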

Remaining tasks:

  • Convert pudl.output.censusdp1 into dagster assets

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures.
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@e-belfer e-belfer changed the base branch from main to dev June 2, 2023 13:19
@e-belfer e-belfer linked an issue Jun 2, 2023 that may be closed by this pull request
@e-belfer e-belfer requested a review from jdangerx June 6, 2023 17:44
@e-belfer e-belfer self-assigned this Jun 6, 2023
@e-belfer e-belfer added this to the 2023 Spring milestone Jun 6, 2023
@e-belfer e-belfer marked this pull request as ready for review June 6, 2023 17:44
@zaneselvans
Member

I think we want to treat the Census DP1 like we do the ferc1.sqlite database -- where the whole censusdp1tract.sqlite DB is created as a "source asset" rather than as individual tables that are written into PUDL or dataframes that are pickled. This would preserve the current set of distributed outputs, which includes several SQLite DBs of which the Census DP1 is one.

@jdangerx
Member

jdangerx commented Jun 6, 2023

I think we want to treat the Census DP1 like we do the ferc1.sqlite database -- where the whole censusdp1tract.sqlite DB is created as a "source asset" rather than as individual tables that are written into PUDL or dataframes that are pickled. This would preserve the current set of distributed outputs, which includes several SQLite DBs of which the Census DP1 is one.

After actually looking at the code, my response is:

It looks like we do still generate the censusdp1tract.sqlite file in the output directory - the pickled stuff appears to be the tables that have then been read out of that SQLite file. Those pickled tables seem like interim data products that can stay pickled until we see a reason to distribute them.


edit: old note from before I actually looked at the code is below:

Still persisting the data in a SQLite file that we distribute makes sense to me!

I would add that we can do this without making a whole new DAG - we could add a new subclass of SQLiteIOManager (a la PudlSQLiteIOManager or FercSQLiteIOManager) that knows to read/write to censusdp1tract.sqlite, then use that IOManager to write out the three tables instead of just pickling them. Then we can keep these assets as normal assets in the pudl_etl module, and have Dagster manage the dependencies for us instead of having to manually run DAGs in a specific order.

Now that we have a bunch of different instances of "thing that reads/writes from SQLite" though, it might make sense to refactor the code to better fit the use-cases/patterns we've discovered 🤷 .
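The IO manager idea above can be sketched without Dagster at all. Below is a hypothetical minimal stand-in (the class and method names are invented here, not the actual PudlSQLiteIOManager API) showing the read/write contract such a censusdp1tract.sqlite IO manager would implement, using sqlite3 and pandas:

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

class CensusSQLiteIOManagerSketch:
    """Hypothetical stand-in: persist each asset as a table in
    censusdp1tract.sqlite instead of pickling it. The real version would
    subclass Dagster's IOManager / the existing SQLiteIOManager."""

    def __init__(self, db_path: str):
        self.db_path = db_path

    def handle_output(self, table_name: str, df: pd.DataFrame) -> None:
        # Write the asset's dataframe out as a SQLite table.
        with sqlite3.connect(self.db_path) as con:
            df.to_sql(table_name, con, if_exists="replace", index=False)

    def load_input(self, table_name: str) -> pd.DataFrame:
        # Read the table back for downstream assets.
        with sqlite3.connect(self.db_path) as con:
            return pd.read_sql(f"SELECT * FROM {table_name}", con)

# Illustrative round trip with a throwaway database file:
db_path = str(Path(tempfile.mkdtemp()) / "censusdp1tract.sqlite")
mgr = CensusSQLiteIOManagerSketch(db_path)
mgr.handle_output("state_censusdp1", pd.DataFrame({"geoid10": ["01", "02"]}))
states = mgr.load_input("state_censusdp1")
```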

Member

@jdangerx jdangerx left a comment


I'm glad this diff turned out so small, nice work!

A couple questions that I'd like you to respond to, though your answer to both could just be "no" 😅

@@ -152,7 +175,9 @@ def main():

pudl_settings["sandbox"] = args.sandbox

censusdp1tract_to_sqlite(pudl_settings=pudl_settings, ds=ds, clobber=args.clobber)
_ = censusdp1tract_to_sqlite(
Member


Did you want to use this assignment somewhere else?

Member Author


This just returns the path to the saved DB, which we don't need to call anywhere else. The loggers for this function already print the output path, which would be the only possible application I could think of for this variable.

Member


Hmm, at this point could you just avoid the assignment altogether?

Member Author


Indeed, I deleted it (and the whole CLI interface)! :)

layer: Literal["state", "county", "tract"], pudl_settings=None, ds=None
) -> gpd.GeoDataFrame:
"""Select one layer from the Census DP1 database.
@multi_asset(
Member


My understanding is that multi-assets are hard to parallelize - that's probably not a huge problem here, since there's only 3 assets anyways, but you might want to try making a bunch of separate assets.

Looks like you already have a bit of an asset factory idea going on, based on the name of the function, but this function is returning a list of Outputs as opposed to an Asset. Did you want to try something like the finished_eia_asset_factory in transform/eia.py?

def finished_eia_asset_factory(
    table_name: str, io_manager_key: str | None = None
) -> AssetsDefinition:
    """An asset factory for finished EIA tables."""
    clean_table_name = "clean_" + table_name

    @asset(
        ins={clean_table_name: AssetIn()},
        name=table_name,
        io_manager_key=io_manager_key,
    )
    def finished_eia_asset(**kwargs) -> pd.DataFrame:
        """Enforce PUDL DB schema on a cleaned EIA dataframe."""
        df = convert_cols_dtypes(kwargs[clean_table_name], data_source="eia")
        res = Package.from_resource_ids().get_resource(table_name)
        return res.enforce_schema(df)

    return finished_eia_asset


finished_eia_assets = [
    finished_eia_asset_factory(table_name, io_manager_key="pudl_sqlite_io_manager")
    for table_name in [
        "boiler_fuel_eia923",
        "coalmine_eia923",
        "fuel_receipts_costs_eia923",
        "generation_eia923",
        "generation_fuel_eia923",
        "generation_fuel_nuclear_eia923",
        "ownership_eia860",
        "emissions_control_equipment_eia860",
        "boiler_emissions_control_equipment_assn_eia860",
        "boiler_cooling_assn_eia860",
        "boiler_stack_flue_assn_eia860",
    ]
]

If you follow that pattern it would look like

def census_asset_factory(
    layer: str
) -> AssetsDefinition:
    """An asset factory for Census DP1 layers."""
    @asset(
        ins={"censusdp1tract_to_sqlite": AssetIn()},
        name=f"{layer}_censusdp1",
    )
    def census_layer(**kwargs) -> gpd.GeoDataFrame:
        """yada yada"""
        dp1_engine = ...
        def get_layer(...):
            ...
        # could be nice to pass in the sql engine as opposed to closing over it while defining the get_layer function
        layer_gdf = get_layer(layer, dp1_engine) # or, just in-line the body of get_layer since you're not calling it in a loop anymore?
        return layer_gdf
    return census_layer


census_dp1_layers = [
    census_asset_factory(layer)
    for layer in ["state", "county", "tract"]
]

@zaneselvans
Member

zaneselvans commented Jun 12, 2023

It looks like it might be having trouble with some of the Python geospatial stack / Proj in the test environment. This is one of the things we typically install locally using mamba via environment.yml so maybe there's something that's not working the same between the local and GitHub runner conda env?

2023-06-10 21:11:54 [    INFO] catalystcoop.pudl.workspace.datastore:383 Retrieved Resource(censusdp1tract/10.5281/zenodo.4127049/censusdp1tract-2010.zip) from cache.
2023-06-10 21:11:54 [    INFO] catalystcoop.pudl.convert.censusdp1tract_to_sqlite:96 Extracting the Census DP1 GeoDB to /tmp/pytest-of-runner/pytest-0/pudl0/output/censusdp1tract.sqlite
2023-06-10 21:11:58 [    INFO] catalystcoop.pudl.convert.censusdp1tract_to_sqlite:98 extract_root = /tmp/tmpsn80qypp/Profile-County_Tract.gdb
2023-06-10 21:11:58 [    INFO] catalystcoop.pudl.convert.censusdp1tract_to_sqlite:99 out_path = /tmp/pytest-of-runner/pytest-0/pudl0/output/censusdp1tract.sqlite
ERROR 1: PROJ: proj_identify: Open of /home/runner/micromamba-root/envs/pudl-test/share/proj failed
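For what it's worth, a common workaround for this class of PROJ error in conda/mamba environments is to point PROJ at the environment's data directory explicitly. This is only a hedged sketch - the path is illustrative, and which variable applies depends on the PROJ version (older PROJ reads PROJ_LIB; PROJ >= 9.1 prefers PROJ_DATA):

```shell
# Point PROJ at the active conda env's proj data directory (illustrative path).
export PROJ_LIB="${CONDA_PREFIX:-/opt/conda}/share/proj"
export PROJ_DATA="$PROJ_LIB"   # PROJ >= 9.1 reads PROJ_DATA
echo "PROJ data dir: $PROJ_DATA"
```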

@e-belfer
Member Author

@zaneselvans In the long-term we'll want to figure out how to access interim dagster assets for tests, I'm sure. But I don't think this is one of those cases, since this issue gets effectively resolved in #2550 and the issue here is only to do with managing an interim workaround. I propose we merge that PR into this one and then merge both into dev if they pass the tests.

@zaneselvans
Member

Go for it, if that fixes the issue, great!

@e-belfer
Member Author

Great! Will do after #2550 gets a proper review.

@e-belfer e-belfer requested a review from zschira June 21, 2023 18:28
@codecov

codecov bot commented Jun 21, 2023

Codecov Report

Patch coverage: 91.5% and project coverage change: +1.1 🎉

Comparison is base (8daa787) 87.2% compared to head (9987ca4) 88.4%.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2621     +/-   ##
=======================================
+ Coverage   87.2%   88.4%   +1.1%     
=======================================
  Files         87      87             
  Lines      10155   10135     -20     
=======================================
+ Hits        8864    8965    +101     
+ Misses      1291    1170    -121     
Impacted Files Coverage Δ
src/pudl/metadata/classes.py 86.3% <ø> (ø)
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia860.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia861.py 100.0% <ø> (ø)
src/pudl/metadata/resources/ferc714.py 100.0% <ø> (ø)
src/pudl/metadata/resources/pudl.py 100.0% <ø> (ø)
src/pudl/output/pudltabl.py 95.0% <ø> (ø)
src/pudl/analysis/service_territory.py 51.2% <63.4%> (+23.7%) ⬆️
src/pudl/convert/censusdp1tract_to_sqlite.py 86.6% <88.8%> (+14.3%) ⬆️
... and 3 more

... and 3 files with indirect coverage changes


@e-belfer e-belfer removed the request for review from zschira June 21, 2023 18:31
@e-belfer e-belfer merged commit 0bbde04 into dev Jun 21, 2023
10 checks passed
@e-belfer e-belfer deleted the census_dagster branch June 21, 2023 20:27

Successfully merging this pull request may close these issues.

Bring Census DP1 database into Dagster