Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create PudlSQLiteIOManager to accept a Package object #2466

Merged
merged 4 commits into from
Mar 30, 2023

Conversation

bendnorman
Copy link
Member

@bendnorman bendnorman commented Mar 28, 2023

PR Overview

This PR solves two problems we were running into in #2459 and #2445:

  1. We need to keep track of SQL view datatypes in pudl.metadata.resources so we can read the views into dataframes and convert the columns to correct datatypes but we also don't want to create table schemas in the database for the views. My initial solution was to add a janky feeling exclude_tables: list init parameter to SQLiteIOManager which excludes a list of tables from being created in the database.
  2. Use enforce_schema() and read-chunking in PudlSQLiteIOManager. #2459 uses Resource.enforce_schema() in the SQLiteIOManager instead of the generic pudl.metadata.fields.apply_pudl_dtypes() function. The initial solution was to create a Package object in the io manager that loads all of the default PUDL resources. It felt strange to create a sa.Metadata object by filtering out certain etl groups in pudl_sqlite_io_manager to then create another Package object with the default Resources in the io manager.

This PR does two things to solves these problems:

  1. Adds the include_in_database attribute to the Resource class. The Package.to_sql() method now only creates sa.Table objects for resources when include_in_database == True. This prevents us from creating table schemas for views and tables that are broken and we don't want to be included on datasette.
  2. Creates a subclass of SQLiteIOManager called PudlSQLiteIOManger that accepts a Package object instead of a sa.Metadata object. This way we don't have to initialize a Package object with all of the default resources inside the io manager. Also, the SQLiteIOManager can stay de-coupled from pudl specific classes and metadata.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

self.package = package
md = self.package.to_sql()
super().__init__(base_dir, db_name, md, timeout)

def load_input(self, context: InputContext) -> pd.DataFrame:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #2459 @zaneselvans you'll probably rewrite the PudlSQLiteIOManager.load_input() method to:

  1. Make sure the table_name exists as a resource in self.package.
  2. Apply enforce_schema() to the df instead of apply_pudl_dtypes().

You'll also need to overwrite the _handle_pandas_output() method to check the table exists in self.md and enforce schema.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PudlSQLiteIOManager._handle_pandas_output() does it make sense to use enforce_schema()?

If it calls the parent _handle_pandas_output() then apply_pudl_dtypes() will be called on the dataframe, and I think that will remove the categorical type information, and also any data source specific typing information (since the generic call to apply_pudl_dtypes() without group="eia" or something similar just uses the generic field definitions. I guess maybe that means we can't call super()._handle_pandas_output() and we have to override the whole method, and should use enforce_schema() to get the data source specific types, even if the categorical information is lost when we write to SQLite.

Comment on lines +1875 to +1880
if resource.include_in_database:
_ = resource.to_sql(
metadata,
check_types=check_types,
check_values=check_values,
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zaneselvans Does it make sense to filter out tables we don't want in the db in Package.to_sql()? We could also filter them out when creating pudl_sqlite_io_manager.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not sure. My gut says we should filter in to_sql() -- that these tables are broken, not done yet, or otherwise shouldn't be floating around anywhere outside our metadata. Is there a scenario you have in mind for where we would want a table to show up in the metadata object, but not in the actual database we create with it? That discrepancy seems like a recipe for confusion and potentially broken FK relationships.

@codecov
Copy link

codecov bot commented Mar 28, 2023

Codecov Report

Patch coverage: 91.3% and project coverage change: -0.1 ⚠️

Comparison is base (5576f10) 86.7% compared to head (c1af353) 86.7%.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2466     +/-   ##
=======================================
- Coverage   86.7%   86.7%   -0.1%     
=======================================
  Files         81      81             
  Lines       9453    9472     +19     
=======================================
+ Hits        8203    8219     +16     
- Misses      1250    1253      +3     
Impacted Files Coverage Δ
src/pudl/metadata/resources/eia.py 100.0% <ø> (ø)
src/pudl/metadata/resources/epacems.py 100.0% <ø> (ø)
src/pudl/metadata/resources/ferc1.py 100.0% <ø> (ø)
src/pudl/io_managers.py 89.0% <90.0%> (-0.1%) ⬇️
src/pudl/metadata/classes.py 81.8% <100.0%> (-0.1%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@zaneselvans zaneselvans added sqlite Issues related to interacting with sqlite databases dagster Issues related to our use of the Dagster orchestrator labels Mar 29, 2023
@zaneselvans zaneselvans added this to the 2023Q1 milestone Mar 29, 2023
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a much cleaner solution overall!

Very minor docs / readability stuff. No substantive changes required so I'll approve (but if you want to make those little clarifications that would be great)

src/pudl/io_managers.py Outdated Show resolved Hide resolved
src/pudl/io_managers.py Show resolved Hide resolved
src/pudl/io_managers.py Outdated Show resolved Hide resolved
src/pudl/io_managers.py Show resolved Hide resolved
@bendnorman bendnorman merged commit e32cd36 into dev Mar 30, 2023
@bendnorman bendnorman deleted the use-package-class-in-iomanager branch March 30, 2023 16:27
@zaneselvans zaneselvans mentioned this pull request Mar 30, 2023
8 tasks
@bendnorman bendnorman mentioned this pull request Apr 3, 2023
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dagster Issues related to our use of the Dagster orchestrator sqlite Issues related to interacting with sqlite databases
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

2 participants