Publish PUDL Intake Catalog #1179

zaneselvans · 2021-09-02T17:51:39Z

Description

Once we have automated our data builds (#1177) and are able to output and test new data products with minimal human intervention, we want to enable the automatic distribution of those data products to users via Intake catalogs.

Motivation

We want to make it easy to distribute our most up-to-date data products quickly and frequently. This will make our data more attractive to users and discourage them from integrating the most recent data themselves in ways that are fragile or irreproducible.
We want to ensure that the software and data being used together are compatible with each other, both for end users and in development.
We want to provide users with a uniform API for accessing our data that is also useful for accessing other data, rather than forcing them to learn a bespoke, non-standard system.
We want to make the PUDL software + data environment setup process as easy as possible, to lower the barrier to entry for new users.
We want to provide bulk, machine-readable, versioned access to our published data so that it can be consumed by other data processing pipelines.
We want to ensure that our data is accompanied by rich human- and machine-readable metadata that helps users understand what the data means, what its limitations are, and where it came from.

Benefits of Using Intake:

We could just upload our SQLite DBs and Parquet files to some cloud buckets, and point people at those resources. Using Intake potentially provides some additional benefits:

We can publish structured table- and column-level metadata in the catalog so that users can explore what data is available and choose what to query without needing to download it all.
For larger datasets, Intake can be used to do transparent local filesystem caching, minimizing download times and data egress fees.
Our data catalog can be versioned, installed using conda, and made explicitly dependent on, or listed as a dependency of, other software.
We can decouple the user-facing API from the way we organize the data on the back end, so that changing the type of database or the partitioning of a Parquet dataset doesn't have to disrupt user applications that have been built on the data.
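
To make the metadata and decoupling points concrete, here is a minimal sketch in plain Python. It is not Intake's API: `ToyCatalog`, `TableEntry`, `build_demo_catalog`, and the `plants` table with its columns are all hypothetical stand-ins invented for illustration. The idea it shows is that users browse table and column descriptions without reading any data, and call `read()` by name without knowing which storage backend sits behind it.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Rows = List[Tuple]


@dataclass
class TableEntry:
    """Hypothetical catalog entry: metadata plus a loader the user never sees."""
    description: str
    columns: Dict[str, str]  # column name -> human-readable description
    loader: Callable[[], Rows]  # backend-specific; could wrap SQLite, Parquet, etc.


class ToyCatalog:
    """Toy stand-in for a data catalog (NOT the Intake API)."""

    def __init__(self) -> None:
        self._entries: Dict[str, TableEntry] = {}

    def register(self, name: str, entry: TableEntry) -> None:
        self._entries[name] = entry

    def list_tables(self) -> List[str]:
        # Only metadata is touched here; no data is loaded.
        return sorted(self._entries)

    def describe(self, name: str) -> dict:
        entry = self._entries[name]
        return {"description": entry.description, "columns": dict(entry.columns)}

    def read(self, name: str) -> Rows:
        # The caller asks for a table by name; the backend is an implementation detail.
        return self._entries[name].loader()


def build_demo_catalog() -> ToyCatalog:
    # Assumption: an in-memory SQLite DB with made-up rows stands in for real data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE plants (plant_id INTEGER, plant_name TEXT)")
    conn.executemany(
        "INSERT INTO plants VALUES (?, ?)", [(1, "Alpha"), (2, "Beta")]
    )
    conn.commit()

    def load_plants() -> Rows:
        query = "SELECT plant_id, plant_name FROM plants ORDER BY plant_id"
        return conn.execute(query).fetchall()

    catalog = ToyCatalog()
    catalog.register(
        "plants",
        TableEntry(
            description="Hypothetical table of power plants.",
            columns={
                "plant_id": "Made-up integer plant identifier.",
                "plant_name": "Made-up plant name.",
            },
            loader=load_plants,
        ),
    )
    return catalog


if __name__ == "__main__":
    catalog = build_demo_catalog()
    print(catalog.list_tables())
    print(catalog.describe("plants")["columns"])
    print(catalog.read("plants"))
```

If the back end later moved from SQLite to Parquet, only `load_plants` would change in this sketch; code calling `catalog.read("plants")` would stay the same. A real Intake catalog would presumably provide this plus caching and versioning, which the toy omits.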

Tasks / Issues

Out of Scope

Billing

Hours should be billed to Data Distribution under the Sloan grant.

jdangerx · 2023-01-09T22:28:12Z

@zaneselvans is this part of the catalogs line item that we deprioritized for this quarter?

zaneselvans changed the title ~~Automate Intake Catalog Publication~~ Publish Intake Catalogs Sep 2, 2021

zaneselvans changed the title ~~Publish Intake Catalogs~~ Publish PUDL Intake Catalog Sep 2, 2021

zaneselvans mentioned this issue Mar 30, 2022: EPA CEMS Intake Catalog #1564 (Open, 15 tasks)

zaneselvans self-assigned this May 31, 2022
