Create `PudlSQLiteIOManager` to accept a `Package` object #2466
self.package = package
md = self.package.to_sql()
super().__init__(base_dir, db_name, md, timeout)

def load_input(self, context: InputContext) -> pd.DataFrame:
In #2459 @zaneselvans you'll probably rewrite the `PudlSQLiteIOManager.load_input()` method to:
- Make sure the `table_name` exists as a resource in `self.package`.
- Apply `enforce_schema()` to the df instead of `apply_pudl_dtypes()`.

You'll also need to override the `_handle_pandas_output()` method to check that the table exists in `self.md` and enforce the schema.
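A minimal sketch of what that `load_input()` rewrite might look like, under heavy assumptions: `Resource`, `Package`, and `SketchIOManager` below are hypothetical stand-ins (not the real PUDL or Dagster classes), lists of dicts stand in for dataframes, and a plain dict stands in for the SQLite database so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical stand-in for a PUDL Resource: a table name plus column types."""

    name: str
    schema: dict[str, type]

    def enforce_schema(self, rows: list[dict]) -> list[dict]:
        # Keep only the declared columns and cast each value to its declared type.
        return [
            {col: typ(row[col]) for col, typ in self.schema.items()} for row in rows
        ]


@dataclass
class Package:
    """Hypothetical stand-in for a PUDL Package: a collection of resources."""

    resources: list[Resource]

    def get_resource(self, name: str) -> Resource:
        for resource in self.resources:
            if resource.name == name:
                return resource
        raise KeyError(f"{name} is not a resource in this package")


class SketchIOManager:
    """Hypothetical IO manager; `storage` is a dict standing in for the database."""

    def __init__(self, package: Package, storage: dict[str, list[dict]]):
        self.package = package
        self._storage = storage

    def load_input(self, table_name: str) -> list[dict]:
        # 1. Fail early if the table is not a resource in the package.
        resource = self.package.get_resource(table_name)
        # 2. Read the raw rows, then enforce the resource-specific schema
        #    rather than applying generic dtypes.
        return resource.enforce_schema(self._storage[table_name])
```

The point of the design is that the lookup and the type enforcement both go through the same `Resource`, so an unknown table fails before anything is read.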
In `PudlSQLiteIOManager._handle_pandas_output()`, does it make sense to use `enforce_schema()`?

If it calls the parent `_handle_pandas_output()`, then `apply_pudl_dtypes()` will be called on the dataframe, and I think that will remove the categorical type information, and also any data source specific typing information (since the generic call to `apply_pudl_dtypes()` without `group="eia"` or something similar just uses the generic field definitions). I guess maybe that means we can't call `super()._handle_pandas_output()` and we have to override the whole method, and should use `enforce_schema()` to get the data source specific types, even if the categorical information is lost when we write to SQLite.
if resource.include_in_database:
    _ = resource.to_sql(
        metadata,
        check_types=check_types,
        check_values=check_values,
    )
@zaneselvans Does it make sense to filter out tables we don't want in the db in `Package.to_sql()`? We could also filter them out when creating `pudl_sqlite_io_manager`.
Hmm, I'm not sure. My gut says we should filter in `to_sql()` -- that these tables are broken, not done yet, or otherwise shouldn't be floating around anywhere outside our metadata. Is there a scenario you have in mind for where we would want a table to show up in the metadata object, but not in the actual database we create with it? That discrepancy seems like a recipe for confusion and potentially broken FK relationships.
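For reference, a rough sketch of the "filter in `to_sql()`" option. The `Resource` and `Package` classes here are hypothetical stand-ins, and this `to_sql()` returns plain `CREATE TABLE` strings for the standard-library `sqlite3` module rather than a SQLAlchemy metadata object, purely so the example runs without extra dependencies.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical stand-in for a PUDL Resource."""

    name: str
    columns: dict[str, str]  # column name -> SQLite type
    include_in_database: bool = True

    def to_ddl(self) -> str:
        cols = ", ".join(f"{col} {typ}" for col, typ in self.columns.items())
        return f"CREATE TABLE {self.name} ({cols})"


@dataclass
class Package:
    """Hypothetical stand-in for a PUDL Package."""

    resources: list[Resource]

    def to_sql(self) -> list[str]:
        # Only resources flagged for inclusion get a table definition.
        return [r.to_ddl() for r in self.resources if r.include_in_database]


def create_database(package: Package) -> sqlite3.Connection:
    """Create an in-memory database holding only the included tables."""
    conn = sqlite3.connect(":memory:")
    for ddl in package.to_sql():
        conn.execute(ddl)
    return conn


package = Package(
    [
        Resource("plants", {"plant_id": "INTEGER", "name": "TEXT"}),
        # A view-like resource: metadata exists, but no table should be created.
        Resource("plants_view", {"plant_id": "INTEGER"}, include_in_database=False),
    ]
)
conn = create_database(package)
tables = [row[0] for row in conn.execute("SELECT name FROM sqlite_master")]
```

With the filter inside `to_sql()`, every consumer of the generated schema sees the same set of tables, while the excluded resource still exists in the package for type enforcement.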
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## dev #2466 +/- ##
=======================================
- Coverage 86.7% 86.7% -0.1%
=======================================
Files 81 81
Lines 9453 9472 +19
=======================================
+ Hits 8203 8219 +16
- Misses 1250 1253 +3
This seems like a much cleaner solution overall!
Very minor docs / readability stuff. No substantive changes required so I'll approve (but if you want to make those little clarifications that would be great)
PR Overview
This PR solves two problems we were running into in #2459 and #2445:
- We want metadata for views in `pudl.metadata.resources` so we can read the views into dataframes and convert the columns to the correct datatypes, but we don't want to create table schemas in the database for the views. My initial solution was to add a janky-feeling `exclude_tables: list` init parameter to `SQLiteIOManager`, which excludes a list of tables from being created in the database.
- We want to use `Resource.enforce_schema()` in the `SQLiteIOManager` instead of the generic `pudl.metadata.fields.apply_pudl_dtypes()` function. The initial solution was to create a `Package` object in the io manager that loads all of the default PUDL resources. It felt strange to create a `sa.MetaData` object by filtering out certain etl groups in `pudl_sqlite_io_manager` and then create another `Package` object with the default Resources in the io manager.

This PR does two things to solve these problems:
- Adds an `include_in_database` attribute to the `Resource` class. The `Package.to_sql()` method now only creates `sa.Table` objects for resources when `include_in_database == True`. This prevents us from creating table schemas for views, and for tables that are broken and that we don't want included on Datasette.
- Adds a subclass of `SQLiteIOManager` called `PudlSQLiteIOManager` that accepts a `Package` object instead of a `sa.MetaData` object. This way we don't have to initialize a `Package` object with all of the default resources inside the io manager. Also, the `SQLiteIOManager` can stay decoupled from PUDL-specific classes and metadata.
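A simplified sketch of the resulting class relationship, mirroring the constructor lines from the diff. The class bodies are stand-ins rather than the real implementations: the `Package` here is a hypothetical stub whose `to_sql()` returns a set of table names instead of a SQLAlchemy metadata object, and the `timeout` default is an assumption.

```python
class Package:
    """Hypothetical stub: the real Package.to_sql() builds SQLAlchemy metadata."""

    def __init__(self, table_names: list[str]):
        self.table_names = table_names

    def to_sql(self) -> set[str]:
        return set(self.table_names)


class SQLiteIOManager:
    """Generic manager: only sees a metadata object, nothing PUDL-specific."""

    def __init__(self, base_dir: str, db_name: str, md, timeout: float = 1_000.0):
        self.base_dir = base_dir
        self.db_name = db_name
        self.md = md
        self.timeout = timeout


class PudlSQLiteIOManager(SQLiteIOManager):
    """PUDL-specific manager: accepts a Package and derives the metadata from it."""

    def __init__(
        self, base_dir: str, db_name: str, package: Package, timeout: float = 1_000.0
    ):
        # Keep the package around so resource-level methods stay available.
        self.package = package
        md = self.package.to_sql()
        super().__init__(base_dir, db_name, md, timeout)


package = Package(["plants", "generators"])
manager = PudlSQLiteIOManager("/tmp/pudl", "pudl", package)
```

The caller decides which resources go into the `Package`, so the manager never has to build its own copy of the default resources.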