Create `PudlSQLiteIOManager` to accept a `Package` object #2466

bendnorman · 2023-03-28T21:08:36Z

PR Overview

This PR solves two problems we were running into in #2459 and #2445:

We need to keep track of SQL view datatypes in pudl.metadata.resources so we can read the views into dataframes and convert the columns to correct datatypes but we also don't want to create table schemas in the database for the views. My initial solution was to add a janky feeling exclude_tables: list init parameter to SQLiteIOManager which excludes a list of tables from being created in the database.
Use enforce_schema() and read-chunking in PudlSQLiteIOManager. #2459 uses Resource.enforce_schema() in the SQLiteIOManager instead of the generic pudl.metadata.fields.apply_pudl_dtypes() function. The initial solution was to create a Package object in the io manager that loads all of the default PUDL resources. It felt strange to create a sa.Metadata object by filtering out certain etl groups in pudl_sqlite_io_manager to then create another Package object with the default Resources in the io manager.

This PR does two things to solves these problems:

Adds the include_in_database attribute to the Resource class. The Package.to_sql() method now only creates sa.Table objects for resources when include_in_database == True. This prevents us from creating table schemas for views and tables that are broken and we don't want to be included on datasette.
Creates a subclass of SQLiteIOManager called PudlSQLiteIOManger that accepts a Package object instead of a sa.Metadata object. This way we don't have to initialize a Package object with all of the default resources inside the io manager. Also, the SQLiteIOManager can stay de-coupled from pudl specific classes and metadata.

PR Checklist

Merge the most recent version of the branch you are merging into (probably dev).
All CI checks are passing. Run tests locally to debug failures
Make sure you've included good docstrings.
For major data coverage & analysis changes, run data validation tests
Include unit tests for new functions and classes.
Defensive data quality/sanity checks in analyses & data processing functions.
Update the release notes and reference reference the PR and related issues.
Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

…rce attribute

bendnorman · 2023-03-28T21:14:58Z

src/pudl/io_managers.py

+        self.package = package
+        md = self.package.to_sql()
+        super().__init__(base_dir, db_name, md, timeout)
+
    def load_input(self, context: InputContext) -> pd.DataFrame:


In #2459 @zaneselvans you'll probably rewrite the PudlSQLiteIOManager.load_input() method to:

Make sure the table_name exists as a resource in self.package.

Apply enforce_schema() to the df instead of apply_pudl_dtypes().

You'll also need to overwrite the _handle_pandas_output() method to check the table exists in self.md and enforce schema.

In PudlSQLiteIOManager._handle_pandas_output() does it make sense to use enforce_schema()?

If it calls the parent _handle_pandas_output() then apply_pudl_dtypes() will be called on the dataframe, and I think that will remove the categorical type information, and also any data source specific typing information (since the generic call to apply_pudl_dtypes() without group="eia" or something similar just uses the generic field definitions. I guess maybe that means we can't call super()._handle_pandas_output() and we have to override the whole method, and should use enforce_schema() to get the data source specific types, even if the categorical information is lost when we write to SQLite.

bendnorman · 2023-03-28T21:16:15Z

src/pudl/metadata/classes.py

+            if resource.include_in_database:
+                _ = resource.to_sql(
+                    metadata,
+                    check_types=check_types,
+                    check_values=check_values,
+                )


@zaneselvans Does it make sense to filter out tables we don't want in the db in Package.to_sql()? We could also filter them out when creating pudl_sqlite_io_manager.

Hmm, I'm not sure. My gut says we should filter in to_sql() -- that these tables are broken, not done yet, or otherwise shouldn't be floating around anywhere outside our metadata. Is there a scenario you have in mind for where we would want a table to show up in the metadata object, but not in the actual database we create with it? That discrepancy seems like a recipe for confusion and potentially broken FK relationships.

codecov · 2023-03-28T22:25:14Z

Codecov Report

Patch coverage: 91.3% and project coverage change: -0.1 ⚠️

Comparison is base (5576f10) 86.7% compared to head (c1af353) 86.7%.

Additional details and impacted files

@@           Coverage Diff           @@
##             dev   #2466     +/-   ##
=======================================
- Coverage   86.7%   86.7%   -0.1%     
=======================================
  Files         81      81             
  Lines       9453    9472     +19     
=======================================
+ Hits        8203    8219     +16     
- Misses      1250    1253      +3

Impacted Files	Coverage Δ
src/pudl/metadata/resources/eia.py	`100.0% <ø> (ø)`
src/pudl/metadata/resources/epacems.py	`100.0% <ø> (ø)`
src/pudl/metadata/resources/ferc1.py	`100.0% <ø> (ø)`
src/pudl/io_managers.py	`89.0% <90.0%> (-0.1%)`	⬇️
src/pudl/metadata/classes.py	`81.8% <100.0%> (-0.1%)`	⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

zaneselvans

This seems like a much cleaner solution overall!

Very minor docs / readability stuff. No substantive changes required so I'll approve (but if you want to make those little clarifications that would be great)

src/pudl/io_managers.py

bendnorman added 2 commits March 28, 2023 12:40

Create PudlSQLiteIOManager subclass and add include_in_database Resou…

43cd486

…rce attribute

Remove pudl specific logic from SQLiteIOManager

520bf56

bendnorman requested a review from zaneselvans March 28, 2023 21:09

bendnorman commented Mar 28, 2023

View reviewed changes

zaneselvans added sqlite Issues related to interacting with sqlite databases dagster Issues related to our use of the Dagster orchestrator labels Mar 29, 2023

zaneselvans added this to the 2023Q1 milestone Mar 29, 2023

zaneselvans approved these changes Mar 29, 2023

View reviewed changes

src/pudl/io_managers.py Outdated Show resolved Hide resolved

src/pudl/io_managers.py Show resolved Hide resolved

src/pudl/io_managers.py Outdated Show resolved Hide resolved

src/pudl/io_managers.py Show resolved Hide resolved

bendnorman added 2 commits March 29, 2023 18:58

Merge branch 'dev' into use-package-class-in-iomanager

3b5697f

Clean up doc strings of PudlSQLiteIOManager

c1af353

bendnorman merged commit e32cd36 into dev Mar 30, 2023

bendnorman deleted the use-package-class-in-iomanager branch March 30, 2023 16:27

zaneselvans mentioned this pull request Mar 30, 2023

dev -> main 2023-03-30 #2471

Closed

8 tasks

bendnorman mentioned this pull request Apr 3, 2023

Create simple SQL view assets #2445

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create `PudlSQLiteIOManager` to accept a `Package` object #2466

Create `PudlSQLiteIOManager` to accept a `Package` object #2466

bendnorman commented Mar 28, 2023 •

edited

bendnorman Mar 28, 2023

zaneselvans Mar 30, 2023

bendnorman Mar 28, 2023

zaneselvans Mar 29, 2023

codecov bot commented Mar 28, 2023 •

edited

zaneselvans left a comment

Create PudlSQLiteIOManager to accept a Package object #2466

Create PudlSQLiteIOManager to accept a Package object #2466

Conversation

bendnorman commented Mar 28, 2023 • edited

PR Overview

PR Checklist

bendnorman Mar 28, 2023

Choose a reason for hiding this comment

zaneselvans Mar 30, 2023

Choose a reason for hiding this comment

bendnorman Mar 28, 2023

Choose a reason for hiding this comment

zaneselvans Mar 29, 2023

Choose a reason for hiding this comment

codecov bot commented Mar 28, 2023 • edited

Codecov Report

zaneselvans left a comment

Choose a reason for hiding this comment

Create `PudlSQLiteIOManager` to accept a `Package` object #2466

Create `PudlSQLiteIOManager` to accept a `Package` object #2466

bendnorman commented Mar 28, 2023 •

edited

codecov bot commented Mar 28, 2023 •

edited