Prototype SQLite Intake Catalogs

zaneselvans opened this issue Aug 26, 2021 · 0 comments

Labels: epic (Any issue whose primary purpose is to organize other issues into a group.), intake (Issues related to intake data catalogs.), sqlite (Issues related to interacting with SQLite databases.)
Motivation

- We want to enable frequent, automated distribution of versioned data releases.
- Data releases should be dependent on / linked to software versions.
- We want to use a platform-independent data format that supports rich relational data.
- We want to minimize setup and overhead for users of the published data.
- We want to minimize the overhead and cost of providing persistent access to historical versions of the data.
- We would like to move away from the bespoke filesystem hierarchy we currently manage (SQLite, Parquet, settings, etc.) and make working with PUDL in the cloud and locally both easier and more similar.
Why use SQLite?
SQLite doesn't do "remote access" (it's not client-server based), so it's a bit of an odd fit in this context. However:
- SQLite is a widely supported, archive-friendly format for relational data.
- It's easy to work with locally, and entirely sufficient for data on the scale of a few GB in the read-only context in which we're publishing data.
- The DB is functionally read-only for users, so concurrent writes aren't an issue.
- Cloud object storage is very cost-effective and makes it easy to save and organize many historical versions. The cost of keeping old versions drops to almost nothing as they are accessed less and less over time.
- The SQLite DB files we're generating are each only about 1 GB, so it's quick and easy to download the whole thing once per version and store it locally.
- Storing a static file that's automatically downloaded and then accessed locally is easier and cheaper than managing many instances of a database service, each holding a different version of the data.
Tasks

- Build an Intake catalog that provides access to the existing FERC 1 and PUDL SQLite databases via the Intake API and the intake-sql driver, returning pandas dataframes to the user.
- Compile a basic set of metadata for these catalogs, based on the metadata we've already compiled, so users can see which tables and columns are available.
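A first pass at such a catalog might look something like the following sketch. This assumes the intake-sql driver's uri / sql_expr arguments; the table name and file path are illustrative only, not the actual PUDL layout:

```yaml
# Hypothetical pudl.yaml Intake catalog -- table and path names are
# placeholders, not the real PUDL schema.
sources:
  plants_entity_eia:
    description: Static entity attributes for EIA plants.
    driver: sql
    args:
      uri: "sqlite:///pudl.sqlite"
      sql_expr: "SELECT * FROM plants_entity_eia"
```

A user would then do something like `intake.open_catalog("pudl.yaml")` and call `.read()` on a source to get a pandas dataframe.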
Scope

- Broadly understand what work would be necessary to allow transparent access via Intake to SQLite databases deposited in cloud storage buckets. E.g., how hard would it be to marry the intake-sql driver with fsspec to create an intake-sqlite driver that transparently caches a given versioned SQLite DB from a remote location, then uses the existing SQL driver to access the DB locally once it's been downloaded?
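One way to prototype the caching half of that idea, without fsspec, is a small standard-library sketch: download the DB file once, then open the local copy read-only. The `cached_sqlite` helper, cache location, and URL handling below are all illustrative assumptions, not a real intake-sqlite API:

```python
"""Sketch: transparently cache a remote SQLite DB, then query it locally.

Hypothetical helper -- an eventual intake-sqlite driver would wrap this
idea (likely via fsspec); every name here is illustrative.
"""
import sqlite3
import urllib.request
from pathlib import Path


def cached_sqlite(url: str, cache_dir: str = "~/.cache/pudl") -> sqlite3.Connection:
    """Download the DB once into a local cache, then open it read-only."""
    cache = Path(cache_dir).expanduser()
    cache.mkdir(parents=True, exist_ok=True)
    local = cache / Path(url).name
    if not local.exists():  # transparent one-time download
        urllib.request.urlretrieve(url, str(local))
    # uri=True lets us pass mode=ro, matching the read-only publishing model
    return sqlite3.connect(f"file:{local}?mode=ro", uri=True)
```

Subsequent opens of the same versioned URL hit the local cache, so users pay the download cost once per release.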
zaneselvans added the sqlite, intake, and epic labels on Aug 26, 2021
zaneselvans changed the title from "Draft SQLite Intake Catalogs" to "Prototype SQLite Intake Catalogs" on Aug 26, 2021