Publish PUDL Intake Catalog #1179

Open
1 of 11 tasks
zaneselvans opened this issue Sep 2, 2021 · 1 comment
Assignees
Labels
cloud Stuff that has to do with adapting PUDL to work in cloud computing context. epic Any issue whose primary purpose is to organize other issues into a group. intake Issues related to intake data catalogs. metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. output Exporting data from PUDL into other platforms or interchange formats. packaging Software packaging and distribution of PUDL via pypi, etc.

Comments

@zaneselvans
Member

zaneselvans commented Sep 2, 2021

Description

Once we have automated our data builds (#1177) and are able to output and test new data products with minimal human intervention, we want to enable the automatic distribution of those data products to users via Intake catalogs.

Motivation

  • We want to make it easy to distribute our most up-to-date data products quickly and frequently. This will make our data more attractive to users, and also avoid the pattern of users integrating the most recent data themselves in a way that is fragile or unreproducible.
  • We want to ensure that the software and data being used together are compatible with each other, both in the case of end users, and in development.
  • We want to provide users with a uniform API for accessing our data that is also useful for accessing other data, rather than forcing them to learn a bespoke, non-standard system.
  • We want to make the PUDL software + data environment setup process as easy as possible for users, to reduce the barrier to entry for new users.
  • We want to provide bulk, machine-readable, versioned access to our published data so that it can be consumed by other data processing pipelines.
  • We want to ensure that our data is accompanied by rich human- and machine-readable metadata that helps users understand what the data means, what its limitations are, and where it came from.

Benefits of Using Intake:

We could just upload our SQLite DBs and Parquet files to some cloud buckets, and point people at those resources. Using Intake potentially provides some additional benefits:

  • We can publish structured table and column level metadata in the catalog so that users can explore what data is available and choose what to query without needing to download it all.
  • For larger datasets, Intake can be used to do transparent local filesystem caching, minimizing download times and data egress fees.
  • Our data catalog can be versioned, installed using conda, and made explicitly dependent on, or listed as a dependency of, other software.
  • We can decouple the user-facing API from the way we organize the data on the back end. If we change the type of database or the partitioning of a Parquet dataset, it doesn't need to disrupt user applications that have been built on the data.
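To illustrate the metadata and decoupling points above, here is a minimal standard-library sketch. It is a toy stand-in for a catalog, not Intake's API: the `CatalogEntry` class, the `describe` and `resolve` helpers, the column names, and the bucket URL are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One dataset in the toy catalog: metadata plus a pointer to storage."""

    description: str
    columns: dict[str, str]  # column name -> human-readable description
    urlpath: str  # back-end location; user code never hard-codes this


# Hypothetical entry: the columns and URL are placeholders, not real PUDL paths.
CATALOG = {
    "hourly_emissions_epacems": CatalogEntry(
        description="Hourly emissions data (placeholder description).",
        columns={
            "plant_id_eia": "Plant identifier.",
            "co2_mass_tons": "CO2 emitted, in tons.",
        },
        urlpath="gs://example-bucket/v0.2.0/epacems.parquet",
    ),
}


def describe(name: str) -> str:
    """Summarize a dataset from catalog metadata alone, without reading data."""
    entry = CATALOG[name]
    lines = [f"{name}: {entry.description}"]
    lines += [f"  {col}: {desc}" for col, desc in entry.columns.items()]
    return "\n".join(lines)


def resolve(name: str) -> str:
    """Map a stable dataset name to wherever the data currently lives."""
    return CATALOG[name].urlpath


print(describe("hourly_emissions_epacems"))
```

User code only ever refers to the stable name `"hourly_emissions_epacems"`. If the back end moved to a different bucket or file layout, only the `urlpath` in the catalog would change, which is the kind of indirection a real Intake catalog is meant to provide.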

Tasks / Issues

  • Add SQLite entries to the catalog pudl-catalog#20
  • Do a barebones v0.2.0 release including epacems, ferc1, pudl and censusdp1tract datasets.
  • Define and automate export of epacems catalog metadata
  • Define and automate export of ferc1 catalog metadata
  • Define and automate export of pudl (SQLite) catalog metadata
  • Define and automate export of censusdp1tract catalog metadata
  • Do a v0.3.0 release with the new, complete metadata.
  • Define a nightly or automatically triggered GitHub Actions workflow that keeps pudl-catalog branches and tags synchronized with the nightly build outputs from the main pudl repository.
    • Whenever a new output is generated, or an old output is updated, this repository should regenerate its catalog metadata, using the new or updated output branch / tag from PUDL.
    • Branch-based catalogs will only exist in the repository. Every tagged release should produce a new pudl-catalog package installable from PyPI and conda-forge.
  • Switch pudl-catalog (and pudl?) to calendar versioning once nightly builds and Intake catalogs are working. For example, a release on Bastille Day 2022 might be tagged v2022.07.14.
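As a sketch of the calendar versioning idea in the last task, a tag like the one in the example could be derived from the build date. The `calver_tag` helper is a hypothetical name for illustration, not something that exists in pudl or pudl-catalog.

```python
from datetime import date


def calver_tag(build_date: date) -> str:
    """Format a build date as a zero-padded calendar-version tag."""
    return f"v{build_date:%Y.%m.%d}"


print(calver_tag(date(2022, 7, 14)))  # v2022.07.14
```

One design point to keep in mind: Python packaging tools normalize version numbers per PEP 440, which strips leading zeros from release segments, so a git tag of `v2022.07.14` would correspond to a package version of `2022.7.14`.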

Out of Scope

Billing

Hours should be billed to Data Distribution under the Sloan grant.

zaneselvans added the output, packaging, metadata, cloud, intake, and epic labels on Sep 2, 2021
zaneselvans changed the title from "Automate Intake Catalog Publication" to "Publish Intake Catalogs" on Sep 2, 2021
zaneselvans changed the title from "Publish Intake Catalogs" to "Publish PUDL Intake Catalog" on Sep 2, 2021
zaneselvans self-assigned this on May 31, 2022
@jdangerx
Member

jdangerx commented Jan 9, 2023

@zaneselvans is this part of the catalogs line item that we deprioritized for this quarter?

2 participants