Prototype SQLite Intake Catalogs

zaneselvans opened this issue Aug 26, 2021 · 0 comments

Labels: epic (Any issue whose primary purpose is to organize other issues into a group.), intake (Issues related to intake data catalogs.), sqlite (Issues related to interacting with SQLite databases.)
Motivation

- We want to enable frequent, automated distribution of versioned data releases.
- Data releases should be dependent on / linked to software versions.
- We want to use a platform-independent data format that supports rich relational data.
- We want to minimize setup and overhead for users of the published data.
- We want to minimize the overhead and cost of providing persistent access to historical versions of the data.
- We would like to move away from the bespoke filesystem hierarchy we currently manage (SQLite, Parquet, settings, etc.) and make working with PUDL in the cloud and locally both easier and more similar.
Why use SQLite?
SQLite doesn't do "remote access" (it's not client-server based), so it's a bit of an odd fit in this context. However:
- SQLite is a widely supported, archive-friendly format for relational data.
- It's easy to work with locally, and entirely sufficient for data on the scale of a few GB in the read-only context in which we're publishing data.
- The DB is functionally read-only for users, so concurrent writes aren't an issue.
- Cloud object storage is very cost-effective and makes it easy to save and organize many historical versions. The cost of keeping old versions drops to almost nothing as they are accessed less and less over time.
- The SQLite DB files we're generating are each only about 1 GB, so it's quick and easy to download the whole thing once per version and store it locally.
- Storing a static file that's automatically downloaded and then accessed locally is easier and cheaper than managing many instances of a database service, each holding a different version of the data.
Tasks

- Build an Intake catalog that provides access to the existing FERC 1 and PUDL SQLite databases via the Intake API and the intake-sql driver, returning pandas dataframes to the user.
- Compile a basic set of metadata for these catalogs, based on the metadata we've already compiled, so users can see which tables and columns are available.
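A first pass at such a catalog might look something like the following sketch. This assumes the intake-sql driver's uri / sql_expr arguments; the table name and file path are illustrative only, not the actual PUDL layout:

```yaml
# Hypothetical pudl.yaml Intake catalog -- table and path names are
# placeholders, not the real PUDL schema.
sources:
  plants_entity_eia:
    description: Static entity attributes for EIA plants.
    driver: sql
    args:
      uri: "sqlite:///pudl.sqlite"
      sql_expr: "SELECT * FROM plants_entity_eia"
```

A user would then do something like `intake.open_catalog("pudl.yaml")` and call `.read()` on a source to get a pandas dataframe.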
Scope

- Broadly understand what work would be necessary to allow transparent access via Intake to SQLite databases deposited in cloud storage buckets. E.g., how hard would it be to marry the intake-sql driver with fsspec to create an intake-sqlite driver that transparently caches a given versioned SQLite DB from a remote location, then uses the existing SQL driver to access the DB locally once it's been downloaded?
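One way to prototype the caching half of that idea, without fsspec, is a small standard-library sketch: download the DB file once, then open the local copy read-only. The `cached_sqlite` helper, cache location, and URL handling below are all illustrative assumptions, not a real intake-sqlite API:

```python
"""Sketch: transparently cache a remote SQLite DB, then query it locally.

Hypothetical helper -- an eventual intake-sqlite driver would wrap this
idea (likely via fsspec); every name here is illustrative.
"""
import sqlite3
import urllib.request
from pathlib import Path


def cached_sqlite(url: str, cache_dir: str = "~/.cache/pudl") -> sqlite3.Connection:
    """Download the DB once into a local cache, then open it read-only."""
    cache = Path(cache_dir).expanduser()
    cache.mkdir(parents=True, exist_ok=True)
    local = cache / Path(url).name
    if not local.exists():  # transparent one-time download
        urllib.request.urlretrieve(url, str(local))
    # uri=True lets us pass mode=ro, matching the read-only publishing model
    return sqlite3.connect(f"file:{local}?mode=ro", uri=True)
```

Subsequent opens of the same versioned URL hit the local cache, so users pay the download cost once per release.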
zaneselvans added the sqlite, intake, and epic labels on Aug 26, 2021
zaneselvans changed the title from "Draft SQLite Intake Catalogs" to "Prototype SQLite Intake Catalogs" on Aug 26, 2021