Bring Census DP1 database into Dagster #2412

Closed
3 tasks
Tracked by #1973
zaneselvans opened this issue Mar 17, 2023 · 5 comments · Fixed by #2621
Labels
censusdp1tract: Issues related to the Census DP1 dataset which we distribute as an SQLite DB
dagster: Issues related to our use of the Dagster orchestrator
good-first-issue: Good issues for first-time contributors. Self-contained, low context, no credentials required.

Comments

@zaneselvans
Member

zaneselvans commented Mar 17, 2023

We use the Census DP1 geodatabase for spatial analyses (e.g. turning FIPS codes into county geometries) and as a source for demographic information (e.g. population by census tract).

It seems like it may have slipped through the cracks in the Dagster migration. The ad-hoc "ETL" is almost a one-liner that converts the published geodatabase into SQLite; it's in pudl.convert.censusdp1tract_to_sqlite.py

The database only has 3-4 tables in it (state, county, and tract-level demographic data). Should it be an asset? A multi-asset? A standalone resource like the FERC DBs?

I'm a little surprised that this didn't break anything in the tests or nightly builds. We currently distribute this converted DB along with the other SQLite DBs if the nightly builds succeed.

Scope

Next steps

@zaneselvans zaneselvans added censusdp1tract Issues related to the Census DP1 dataset which we distribute as an SQLite DB dagster Issues related to our use of the Dagster orchestrator labels Mar 17, 2023
@zaneselvans zaneselvans added this to the 2023Q1 milestone Mar 17, 2023
@bendnorman
Member

bendnorman commented Mar 17, 2023

The most recent dagster-asset-etl nightly build included the censusdp1tract.sqlite DB. The gcp_pudl_etl.sh script on dev and dagster-asset-etl doesn't create the DB using the censusdp1tract_to_sqlite command. Instead, the database is created by the ferc714_out fixture.

We should probably dagsterify the census ETL soon, but given that the DB is still being created by the nightly builds on the dagster-asset-etl branch, I don't think this is blocking #2104.

@bendnorman
Member

As for how we would dagsterify the ETL, I think we could wrap pudl.convert.censusdp1tract_to_sqlite.censusdp1tract_to_sqlite in a @multi_asset and create a new IO manager that points to the census DB:

from dagster import AssetOut, multi_asset

from pudl.metadata.classes import Package


@multi_asset(
    outs={
        table_name: AssetOut(io_manager_key="census_sqlite_io_manager")
        for table_name in Package.get_etl_group_tables("census")
    },
    required_resource_keys={"dataset_settings", "datastore"},
)
def censusdp1tract_to_sqlite(context):
    # The existing extraction code goes here.
    # This multi-asset won't return any dataframes because ogr handles the actual extraction.
    # We might have to make all of the `outs` optional.
    ...

census_sqlite_io_manager will probably need to subclass SQLiteIOManager, like the FERC IO Managers.
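A minimal sketch of the read side such an IO manager might have, using only the standard library so that it runs without Dagster or PUDL installed. The class name `CensusSQLiteReader` and its methods are hypothetical, not part of PUDL or Dagster. In a real IO manager this logic would presumably live in `load_input`, with `handle_output` doing nothing because ogr writes the database itself.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


class CensusSQLiteReader:
    """Hypothetical read-only helper that points at the converted census DB."""

    def __init__(self, db_path):
        self.db_path = Path(db_path)

    def table_names(self):
        """Return the sorted names of user tables in the SQLite file."""
        with closing(sqlite3.connect(self.db_path)) as conn:
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            ).fetchall()
        # Skip SQLite's internal bookkeeping tables.
        return sorted(name for (name,) in rows if not name.startswith("sqlite_"))

    def load_table(self, table_name):
        """Return every row of one table as a list of dicts keyed by column."""
        # Check against the real table list so the name is safe to interpolate.
        if table_name not in self.table_names():
            raise KeyError(f"No table named {table_name!r} in {self.db_path}")
        with closing(sqlite3.connect(self.db_path)) as conn:
            cursor = conn.execute(f'SELECT * FROM "{table_name}"')
            columns = [description[0] for description in cursor.description]
            return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

Listing the tables from the file itself, rather than hard-coding them, is one way to cope with not knowing up front whether the DB holds three or four tables.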

If we add the multi_asset to the main pudl.etl DAG, we could use a similar pattern to ferc_to_sqlite and load the census tables as SourceAssets in the main portion of the DAG.

@zaneselvans
Member Author

Okay, sounds like we can do this as part of the next Dagster phase!

@zaneselvans zaneselvans modified the milestones: 2023Q1, 2023Q2 Mar 17, 2023
@zaneselvans zaneselvans added the good-first-issue Good issues for first-time contributors. Self-contained, low context, no credentials required. label Apr 7, 2023
@e-belfer e-belfer assigned e-belfer and unassigned e-belfer Apr 25, 2023
@e-belfer
Member

In #2437 I'm removing the ferc714_out fixture that currently generates the census DB, and reading in the relevant 714 tables directly from PUDL where they are getting saved. I think this likely makes this issue a blocker for integrating #2437.

@zaneselvans
Member Author

I'm hopeful that this is actually a simple issue -- the code that does the transformation from GeoDB to SQLite now is a one-liner, though it does call out to an external CLI tool that's part of the open geospatial stack, IIRC. So I think it'll look more like the process we're using to generate the "source assets" for the FERC Form 1 SQLite DBs.
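For illustration, here is a rough sketch of what calling out to the external tool could look like, assuming the tool is GDAL's `ogr2ogr`. The function names are hypothetical and this is not the actual PUDL implementation, which may pass additional options.

```python
import shutil
import subprocess
from pathlib import Path


def build_ogr2ogr_command(gdb_path, sqlite_path):
    """Build the argument list; ogr2ogr takes the destination before the source."""
    return ["ogr2ogr", "-f", "SQLite", str(sqlite_path), str(gdb_path)]


def convert_gdb_to_sqlite(gdb_path, sqlite_path):
    """Run the conversion and return the path of the SQLite file it writes."""
    if shutil.which("ogr2ogr") is None:
        raise RuntimeError("ogr2ogr was not found on PATH; install GDAL first.")
    # check=True raises CalledProcessError if ogr2ogr exits with an error.
    subprocess.run(build_ogr2ogr_command(gdb_path, sqlite_path), check=True)
    return Path(sqlite_path)
```

Keeping the command construction separate from the subprocess call means the argument order can be checked in a unit test without GDAL installed.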

@e-belfer e-belfer linked a pull request Jun 2, 2023 that will close this issue
9 tasks
@e-belfer e-belfer self-assigned this Jun 14, 2023

3 participants