New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bring Census DP1 database into Dagster #2412
Comments
The most recent We should probably dagsterify the census ETL soon but given the db is still being created by the nightly builds on the dagster-asset-etl branch I don't think this is blocking #2104. |
As for how we would dagsterify the ETL, I think we could wrap @multi_asset(
outs={
table_name: AssetOut(io_manager_key="census_sqlite_io_manager")
for table_name in Package.get_etl_group_tables("census")
},
required_resource_keys={"dataset_settings", "datastore"},
)
def censusdp1tract_to_sqlite(context):
# existing extraction code
...
# This multi asset won't return any dataframes because ogr handles the actual extraction.
# We might have to make all of the `outs` optional. census_sqlite_io_manager will probably need to subclass If adding the multi_asset to the main pudl.etl DAG we could use a similar pattern to ferc_to_sqlite and load the census tables as SourceAssets in the main portion of the DAG. |
Okay, sounds like we can do this as part of the next Dagster phase! |
I'm hopeful that this is actually a simple issue -- the code that does the transformation from GeoDB to SQLite now is a one-liner, though it does call out to an external CLI tool that's part of the open geospatial stack, IIRC. So I think it'll look more like the process we're using to generate the "source assets" for the FERC Form 1 SQLite DBs. |
We use the Census DP1 geodatabase for spatial analyses (e.g. turning FIPS codes into county geometries) and as a source for demographic information (e.g. population by census tract).
It seems like it may have slipped through the cracks in the Dagster migration? The ad-hoc "ETL" is almost a one-liner, converting the published geodatabase into SQLite it's in pudl.convert.censusdp1tract_to_sqlite.py
The database only has 3-4 tables in it (state, county, and tract-level demographic data). Should it be an asset? A multi-asset? A standalone resource like the FERC DBs?
I'm a little surprised that this didn't break anything in the tests or nightly builds. We currently distributed this converted DB along with the other SQLite DBs if the nightly builds succeed.
Scope
Next steps
The text was updated successfully, but these errors were encountered: