Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intake SQLite driver #1156

Closed
5 tasks done
zaneselvans opened this issue Aug 26, 2021 · 0 comments
Closed
5 tasks done

Intake SQLite driver #1156

zaneselvans opened this issue Aug 26, 2021 · 0 comments
Labels
epic Any issue whose primary purpose is to organize other issues into a group. intake Issues related to intake data catalogs. sqlite Issues related to interacting with sqlite databases

Comments

@zaneselvans
Copy link
Member

zaneselvans commented Aug 26, 2021

Motivation

  • Want to enable frequent, automated distribution of versioned data releases
  • Data releases should be dependent on / linked to software versions
  • Want to use a platform-independent data format that supports rich relational data
  • Want to minimize the setup / overhead for users of the published data.
  • Want to minimize overhead / costs associated with providing persistent access to historical versions of the data
  • Would like to get away from managing the bespoke filesystem hierarchy we currently have (sqlite, parquet, settings, etc.) and make working with PUDL in a cloud environment and locally easier and more similar.

Why use SQLite?

  • SQLite doesn't do "remote access" (it's not server-client based) so it is a bit weird in this context. However...
  • SQLite is a widely supported, archive-friendly format for relational data.
  • It's easy to work with locally, and totally sufficient for data on the scale of a few GB in the read-only context in which we're publishing data.
  • The DB is functionally read-only for users, so concurrent writes aren't an issue.
  • Cloud object storage is very cost effective, and makes it easy for us to save and organize many historical versions. The cost associated with them drops off to almost nothing as they are used less and less over time.
  • The SQLite DB files we're generating are each only about 1GB in size, so it's pretty quick and easy to download the whole thing once (per version) and store it locally.
  • Storing a static file that's automatically downloaded and then accessed locally is easier / cheaper than managing many different instances of a database service with lots of different versions of the data in it.

Tasks

Scope

  • We should have an Intake catalog that allows access to the existing FERC 1 and PUDL SQLite databases using the Intake API and the intake-sql driver, returning pandas dataframes to the user.
  • We should compile a basic set of metadata for these catalogs, allowing the user to see what tables and columns are available, based on metadata that we've already compiled.
  • We should broadly understand what work would be necessary to allow transparent access via Intake to SQLite Databases deposited in cloud storage buckets. E.g. how hard would it be to marry the intake-sql driver with fsspec to create an intake-sqlite driver that does transparent caching of a given versioned SQLite DB output from a remote location, while using the existing SQL driver to access the DB locally once it's been downloaded?
@zaneselvans zaneselvans added sqlite Issues related to interacting with sqlite databases intake Issues related to intake data catalogs. epic Any issue whose primary purpose is to organize other issues into a group. labels Aug 26, 2021
@zaneselvans zaneselvans changed the title Draft SQLite Intake Catalogs Prototype SQLite Intake Catalogs Aug 26, 2021
@zaneselvans zaneselvans changed the title Prototype SQLite Intake Catalogs Intake SQLite driver May 10, 2022
@zaneselvans zaneselvans reopened this May 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic Any issue whose primary purpose is to organize other issues into a group. intake Issues related to intake data catalogs. sqlite Issues related to interacting with sqlite databases
Projects
None yet
Development

No branches or pull requests

1 participant