
Bring the Census DP1 to SQLite ETL into dagster #2621

Merged
merged 50 commits into from Jun 21, 2023

Conversation

e-belfer
Member

@e-belfer e-belfer commented Jun 2, 2023

Addresses issue #2412.

This PR brings pudl.convert.censusdp1tract_to_sqlite and pudl.output.censusdp1 into dagster, producing 3 pickled outputs (1 for each layer of the Census DP1 database that is of interest). As of now, these outputs are not written into the PUDL database. They currently have 100+ columns, and are mostly used for their geometries. We pickle these outputs using the current default IO manager and feed them into downstream outputs that are written into PUDL (e.g. the state demand outputs in #2550). This is what is currently in this PR.
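The "pickle with the default IO manager" flow described above can be illustrated in plain Python. This is a stdlib-only sketch of what Dagster's default filesystem IO manager does under the hood (pickle each asset's return value to a per-asset path); the asset name, payload, and paths below are hypothetical, not the actual PUDL outputs:

```python
import pickle
import tempfile
from pathlib import Path

# Mimic Dagster's default fs_io_manager: pickle each asset's output to a
# file keyed by the asset name. Names and paths are illustrative only.
storage_root = Path(tempfile.mkdtemp())

def handle_output(asset_name: str, obj) -> Path:
    """Pickle an asset's output, as the default IO manager would."""
    path = storage_root / asset_name
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return path

def load_input(asset_name: str):
    """Load a pickled upstream asset for a downstream computation."""
    with (storage_root / asset_name).open("rb") as f:
        return pickle.load(f)

# A downstream asset (e.g. state demand) would read the pickled layer back:
handle_output("tract_censusdp1", {"layer": "tract", "n_rows": 74001})
tract = load_input("tract_censusdp1")
```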

Remaining tasks:

  • Convert pudl.output.censusdp1 into dagster assets

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures.
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@e-belfer e-belfer changed the base branch from main to dev June 2, 2023 13:19
@e-belfer e-belfer linked an issue Jun 2, 2023 that may be closed by this pull request
@e-belfer e-belfer requested a review from jdangerx June 6, 2023 17:44
@e-belfer e-belfer self-assigned this Jun 6, 2023
@e-belfer e-belfer added this to the 2023 Spring milestone Jun 6, 2023
@e-belfer e-belfer marked this pull request as ready for review June 6, 2023 17:44
@zaneselvans
Member

I think we want to treat the Census DP1 like we do the ferc1.sqlite database -- where the whole censusdp1tract.sqlite DB is created as a "source asset" rather than as individual tables that are written into PUDL or dataframes that are pickled. This would preserve the current set of distributed outputs, which includes several SQLite DBs of which the Census DP1 is one.

@jdangerx
Member

jdangerx commented Jun 6, 2023

I think we want to treat the Census DP1 like we do the ferc1.sqlite database -- where the whole censusdp1tract.sqlite DB is created as a "source asset" rather than as individual tables that are written into PUDL or dataframes that are pickled. This would preserve the current set of distributed outputs, which includes several SQLite DBs of which the Census DP1 is one.

After actually looking at the code, my response is:

It looks like we do still generate the censusdp1tract.sqlite file in the output directory - the pickled stuff appears to be the tables that have then been read out of that SQLite file. Those pickled tables seem like interim data products that can stay pickled until we see a reason to distribute them.


edit: old note from before I actually looked at the code is below:

Still persisting the data in a SQLite file that we distribute makes sense to me!

I would add that we can do this without making a whole new DAG - we could add a new subclass of SQLiteIOManager (a la PudlSQLiteIOManager or FercSQLiteIOManager) that knows to read/write to censusdp1tract.sqlite, then use that IOManager to write out the three tables instead of just pickling them. Then we can keep these assets as normal assets in the pudl_etl module, and have Dagster manage the dependencies for us instead of having to manually run DAGs in a specific order.

Now that we have a bunch of different instances of "thing that reads/writes from SQLite" though, it might make sense to refactor the code to better fit the use-cases/patterns we've discovered 🤷 .
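The IO manager idea above can be sketched without Dagster at all. Below is a hypothetical minimal stand-in (the class and method names are invented here, not the actual PudlSQLiteIOManager API) showing the read/write contract such a censusdp1tract.sqlite IO manager would implement, using sqlite3 and pandas:

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

class CensusSQLiteIOManagerSketch:
    """Hypothetical stand-in: persist each asset as a table in
    censusdp1tract.sqlite instead of pickling it. The real version would
    subclass Dagster's IOManager / the existing SQLiteIOManager."""

    def __init__(self, db_path: str):
        self.db_path = db_path

    def handle_output(self, table_name: str, df: pd.DataFrame) -> None:
        # Write the asset's dataframe out as a SQLite table.
        with sqlite3.connect(self.db_path) as con:
            df.to_sql(table_name, con, if_exists="replace", index=False)

    def load_input(self, table_name: str) -> pd.DataFrame:
        # Read the table back for downstream assets.
        with sqlite3.connect(self.db_path) as con:
            return pd.read_sql(f"SELECT * FROM {table_name}", con)

# Illustrative round trip with a throwaway database file:
db_path = str(Path(tempfile.mkdtemp()) / "censusdp1tract.sqlite")
mgr = CensusSQLiteIOManagerSketch(db_path)
mgr.handle_output("state_censusdp1", pd.DataFrame({"geoid10": ["01", "02"]}))
states = mgr.load_input("state_censusdp1")
```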

Member

@jdangerx jdangerx left a comment


I'm glad this diff turned out so small, nice work!

A couple questions that I'd like you to respond to, though your answer to both could just be "no" 😅

@@ -152,7 +175,9 @@ def main():

pudl_settings["sandbox"] = args.sandbox

censusdp1tract_to_sqlite(pudl_settings=pudl_settings, ds=ds, clobber=args.clobber)
_ = censusdp1tract_to_sqlite(
Member


Did you want to use this assignment somewhere else?

Member Author


This just returns the path to the saved DB, which we don't need to call anywhere else. The loggers for this function already print the output path, which would be the only possible application I could think of for this variable.

Member


Hmm, at this point could you just avoid the assignment altogether?

Member Author


Indeed, I deleted it (and the whole CLI interface)! :)

layer: Literal["state", "county", "tract"], pudl_settings=None, ds=None
) -> gpd.GeoDataFrame:
"""Select one layer from the Census DP1 database.
@multi_asset(
Member


My understanding is that multi-assets are hard to parallelize - that's probably not a huge problem here, since there's only 3 assets anyways, but you might want to try making a bunch of separate assets.

Looks like you already have a bit of an asset factory idea going on, based on the name of the function, but this function is returning a list of Outputs as opposed to an Asset. Did you want to try something like the finished_eia_asset_factory in transform/eia.py?

def finished_eia_asset_factory(
    table_name: str, io_manager_key: str | None = None
) -> AssetsDefinition:
    """An asset factory for finished EIA tables."""
    clean_table_name = "clean_" + table_name

    @asset(
        ins={clean_table_name: AssetIn()},
        name=table_name,
        io_manager_key=io_manager_key,
    )
    def finished_eia_asset(**kwargs) -> pd.DataFrame:
        """Enforce PUDL DB schema on a cleaned EIA dataframe."""
        df = convert_cols_dtypes(kwargs[clean_table_name], data_source="eia")
        res = Package.from_resource_ids().get_resource(table_name)
        return res.enforce_schema(df)

    return finished_eia_asset


finished_eia_assets = [
    finished_eia_asset_factory(table_name, io_manager_key="pudl_sqlite_io_manager")
    for table_name in [
        "boiler_fuel_eia923",
        "coalmine_eia923",
        "fuel_receipts_costs_eia923",
        "generation_eia923",
        "generation_fuel_eia923",
        "generation_fuel_nuclear_eia923",
        "ownership_eia860",
        "emissions_control_equipment_eia860",
        "boiler_emissions_control_equipment_assn_eia860",
        "boiler_cooling_assn_eia860",
        "boiler_stack_flue_assn_eia860",
    ]
]

If you follow that pattern it would look like

def census_asset_factory(
    layer: str
) -> AssetsDefinition:
    """An asset factory for Census DP1 layers."""
    @asset(
        ins={"censusdp1tract_to_sqlite": AssetIn()},
        name=f"{layer}_censusdp1",
    )
    def census_layer(**kwargs) -> gpd.GeoDataFrame:
        """yada yada"""
        dp1_engine = ...
        def get_layer(...):
            ...
        # could be nice to pass in the sql engine as opposed to closing over it while defining the get_layer function
        layer_gdf = get_layer(layer, dp1_engine) # or, just in-line the body of get_layer since you're not calling it in a loop anymore?
        return layer_gdf
    return census_layer


census_dp1_layers = [
    census_asset_factory(layer)
    for layer in ["state", "county", "tract"]
]

@zaneselvans
Member

zaneselvans commented Jun 12, 2023

It looks like it might be having trouble with some of the Python geospatial stack / Proj in the test environment. This is one of the things we typically install locally using mamba via environment.yml so maybe there's something that's not working the same between the local and GitHub runner conda env?

2023-06-10 21:11:54 [    INFO] catalystcoop.pudl.workspace.datastore:383 Retrieved Resource(censusdp1tract/10.5281/zenodo.4127049/censusdp1tract-2010.zip) from cache.
2023-06-10 21:11:54 [    INFO] catalystcoop.pudl.convert.censusdp1tract_to_sqlite:96 Extracting the Census DP1 GeoDB to /tmp/pytest-of-runner/pytest-0/pudl0/output/censusdp1tract.sqlite
2023-06-10 21:11:58 [    INFO] catalystcoop.pudl.convert.censusdp1tract_to_sqlite:98 extract_root = /tmp/tmpsn80qypp/Profile-County_Tract.gdb
2023-06-10 21:11:58 [    INFO] catalystcoop.pudl.convert.censusdp1tract_to_sqlite:99 out_path = /tmp/pytest-of-runner/pytest-0/pudl0/output/censusdp1tract.sqlite
ERROR 1: PROJ: proj_identify: Open of /home/runner/micromamba-root/envs/pudl-test/share/proj failed
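For what it's worth, a common workaround for this class of PROJ error in conda/mamba environments is to point PROJ at the environment's data directory explicitly. This is only a hedged sketch - the path is illustrative, and which variable applies depends on the PROJ version (older PROJ reads PROJ_LIB; PROJ >= 9.1 prefers PROJ_DATA):

```shell
# Point PROJ at the active conda env's proj data directory (illustrative path).
export PROJ_LIB="${CONDA_PREFIX:-/opt/conda}/share/proj"
export PROJ_DATA="$PROJ_LIB"   # PROJ >= 9.1 reads PROJ_DATA
echo "PROJ data dir: $PROJ_DATA"
```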

@e-belfer
Member Author

@zaneselvans In the long-term we'll want to figure out how to access interim dagster assets for tests, I'm sure. But I don't think this is one of those cases, since this issue gets effectively resolved in #2550 and the issue here is only to do with managing an interim workaround. I propose we merge that PR into this one and then merge both into dev if they pass the tests.

@zaneselvans
Member

Go for it, if that fixes the issue, great!

@e-belfer
Member Author

Great! Will do after #2550 gets a proper review.

@e-belfer e-belfer requested a review from zschira June 21, 2023 18:28
@codecov

codecov bot commented Jun 21, 2023

Codecov Report

Patch coverage: 91.5% and project coverage change: +1.1 🎉

Comparison is base (8daa787) 87.2% compared to head (9987ca4) 88.4%.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2621     +/-   ##
=======================================
+ Coverage   87.2%   88.4%   +1.1%     
=======================================
  Files         87      87             
  Lines      10155   10135     -20     
=======================================
+ Hits        8864    8965    +101     
+ Misses      1291    1170    -121     
Impacted Files Coverage Δ
src/pudl/metadata/classes.py 86.3% <ø> (ø)
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia860.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia861.py 100.0% <ø> (ø)
src/pudl/metadata/resources/ferc714.py 100.0% <ø> (ø)
src/pudl/metadata/resources/pudl.py 100.0% <ø> (ø)
src/pudl/output/pudltabl.py 95.0% <ø> (ø)
src/pudl/analysis/service_territory.py 51.2% <63.4%> (+23.7%) ⬆️
src/pudl/convert/censusdp1tract_to_sqlite.py 86.6% <88.8%> (+14.3%) ⬆️
... and 3 more

... and 3 files with indirect coverage changes


@e-belfer e-belfer removed the request for review from zschira June 21, 2023 18:31
@e-belfer e-belfer merged commit 0bbde04 into dev Jun 21, 2023
10 checks passed
@e-belfer e-belfer deleted the census_dagster branch June 21, 2023 20:27

Successfully merging this pull request may close these issues.

Bring Census DP1 database into Dagster