Update PUDL to pandas 2.0 #2320
Conversation
Codecov Report: Patch and project coverage have no change.

```
@@           Coverage Diff           @@
##             dev    #2320   +/-   ##
======================================
  Coverage   86.7%    86.7%
======================================
  Files         81       81
  Lines       9490     9490
======================================
  Hits        8233     8233
  Misses      1257     1257
```

See 6 files with indirect coverage changes. View full report in Codecov.
@nelsonauner I fixed the integration test failure in the Census DP1 tables and merged

After bumping SQLAlchemy to v2 I got some additional unit test failures in the

@nelsonauner I fixed another couple of small issues on this branch, and ended up reporting what seems to be a pandas bug: pandas-dev/pandas#54399. Now the integration tests are failing on a divide-by-zero error in the FERC Form 1 calculation reconciliations.
I've made comments throughout the PR about why things were changed (if there's no comment it's probably a vanilla data type issue).
There are several instances of `df.convert_dtypes()` that Nelson introduced, which I think should probably be `apply_pudl_dtypes()`, which is more selective / specific.
There's one unit test with mixed date formats that I still need to update. Mixed dates aren't automatically parsed by Pandas any more, and it turns out this didn't come up anywhere else in PUDL, so I just need to make all the date formats identical and then I can un-xfail that test.
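A minimal sketch of the behavior change (illustrative values, not PUDL's actual test data):

```python
import pandas as pd

# Identically formatted dates parse fine in pandas 2.x:
uniform = pd.Series(["2023-01-01", "2023-02-01"])
parsed = pd.to_datetime(uniform)
print(parsed.dtype)  # datetime64[ns]

# Mixed formats are no longer inferred value-by-value; pandas 2.x infers
# the format from the first value and raises ValueError on a mismatch
# (unless you opt in with format="mixed"):
mixed = pd.Series(["2023-01-01", "01/02/2023"])
try:
    pd.to_datetime(mixed)
except ValueError:
    print("mixed formats need format='mixed' now")
```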
There's one SQLAlchemy error coming up with respect to the SQLite Views, which we aren't using right now. I've left it xfail for the moment, and thought we might try and fix it when we update to SQLAlchemy 2.0 (which we should do as soon as pandas 2.0 is merged in).
The full ETL and the full data integration tests all pass locally, so it seems like we haven't had any significant impacts on the data contents. 🤞🏼
```diff
@@ -423,7 +423,7 @@ def get_utility_most_recent_capacity(pudl_engine) -> pd.DataFrame:
         == gen_caps["report_date"]
     )
     most_recent_gens = gen_caps.loc[most_recent_gens_idx]
-    utility_caps = most_recent_gens.groupby("utility_id_eia").sum()
+    utility_caps = most_recent_gens.groupby("utility_id_eia")["capacity_mw"].sum()
```
We only want to sum one column, not the whole dataframe.
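A small illustration of the pandas 2.0 behavior, with hypothetical values standing in for the real generator table:

```python
import pandas as pd

gens = pd.DataFrame({
    "utility_id_eia": [1, 1, 2],
    "plant_name": ["a", "b", "c"],     # non-numeric column
    "capacity_mw": [10.0, 20.0, 5.0],
})

# In pandas 2.0, groupby aggregations default to numeric_only=False, so
# non-numeric columns are no longer silently dropped by .sum(); they
# either raise or produce garbage (e.g. concatenated strings). Selecting
# just the column we want sidesteps the whole question:
utility_caps = gens.groupby("utility_id_eia")["capacity_mw"].sum()
print(utility_caps.to_dict())  # {1: 30.0, 2: 5.0}
```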
```diff
@@ -245,7 +245,7 @@ def drop_invalid_rows(self, df):
                 "`drop_invalid_rows`. Adding empty columns for: "
                 f"{missing_required_cols}"
             )
-            df.loc[:, missing_required_cols] = pd.NA
+            df.loc[:, list(missing_required_cols)] = pd.NA
```
Sets can't be used as indexers in pandas 2.0; they have to be converted to lists first.
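For example (a toy frame standing in for the real one):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
missing_required_cols = {"b", "c"}  # a set, e.g. from set arithmetic on column names

# pandas 2.0 raises TypeError for set indexers (sets are unordered, so
# their meaning as an indexer was ambiguous); a list works:
df.loc[:, list(missing_required_cols)] = pd.NA
print(sorted(df.columns))  # ['a', 'b', 'c']
```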
```diff
@@ -4483,7 +4480,7 @@ def transform_main(self: Self, df: pd.DataFrame) -> pd.DataFrame:
                 & (df.income_type == "net_utility_operating_income")
             )
         ]
-        return df
+        return apply_pudl_dtypes(df, group="ferc1")
```
Most of the random `apply_pudl_dtypes()` calls sprinkled throughout are dealing with:

- `datetime64` columns that have the wrong time resolution or have been turned into `object` columns
- Other kinds of columns that have become `object` columns due to a bad NA value like `None` sneaking in somehow.
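A sketch of the `object`-column problem; `apply_pudl_dtypes()` is PUDL's own helper, so a plain `astype()` stands in for it here:

```python
import pandas as pd

# A bad NA value like None can silently flip a numeric column to object:
s = pd.Series([1.0, 2.0])
s = pd.concat([s, pd.Series([None], dtype=object)], ignore_index=True)
print(s.dtype)  # object

# Casting back restores a usable dtype; apply_pudl_dtypes() does roughly
# this, driven by PUDL's curated column -> dtype mapping:
fixed = s.astype("float64")
print(fixed.dtype)  # float64
```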
```diff
@@ -372,7 +372,7 @@ def filter_ferc714_hourly_demand_matrix(
         .groupby("id")["year"]
         .apply(lambda x: np.sort(x))
     )
-    with pd.option_context("display.max_colwidth", -1):
+    with pd.option_context("display.max_colwidth", None):
```
Displays the full column contents without truncation. Not sure why -1 ever worked.
```diff
@@ -663,7 +663,7 @@ def to_pandas_dtype(self, compact: bool = False) -> str | pd.CategoricalDtype:
             return "float32"
         return FIELD_DTYPES_PANDAS[self.type]

-    def to_sql_dtype(self) -> sa.sql.visitors.VisitableType:
+    def to_sql_dtype(self) -> type:
```
To work with SQLAlchemy 2.0
```diff
@@ -28,7 +28,7 @@
     "year": pa.int32(),
 }

-FIELD_DTYPES_SQL: dict[str, sa.sql.visitors.VisitableType] = {
+FIELD_DTYPES_SQL: dict[str, type] = {
```
For compatibility with SQLAlchemy 2.0
```diff
         dp1_engine,
-        params=[table_name],
+        params=(table_name,),
```
Something changed about how pandas passes query params to SQLAlchemy. A tuple is okay, a list is not.
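A sketch of the tuple form, using sqlite3 so it runs standalone; the list-vs-tuple restriction bites when pandas hands params to a SQLAlchemy connectable, while plain DBAPI connections are more forgiving:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census_tables (name TEXT)")  # hypothetical table
conn.execute("INSERT INTO census_tables VALUES ('dp1')")
conn.commit()

# Positional parameters passed as a tuple, as the diff above does:
df = pd.read_sql("SELECT name FROM census_tables WHERE name = ?", conn, params=("dp1",))
print(df["name"].tolist())  # ['dp1']
```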
```diff
-            bece_df = bece_df.append(table)
+            dfs.append(table)
+        bece_df = pd.concat(dfs)
```
`df.append()` has been deprecated in favor of the more featureful `pd.concat()` (and removed outright in pandas 2.0). I also switched to doing a single big concatenation rather than many incremental ones.
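The accumulate-then-concat pattern looks like this (toy frames in place of the real tables):

```python
import pandas as pd

tables = [pd.DataFrame({"x": [i]}) for i in range(3)]

# Instead of growing a frame with repeated .append() -- which copied all
# prior rows each time, making N appends O(N^2) -- collect the pieces
# and concatenate once at the end:
dfs = []
for table in tables:
    dfs.append(table)
result = pd.concat(dfs, ignore_index=True)
print(result["x"].tolist())  # [0, 1, 2]
```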
Hey FYI @nelsonauner I went ahead and chased down the remaining pandas 2.0 issues in the full ETL pipeline and integration tests, so I think we're about ready to merge this in!
```diff
@@ -65,7 +65,7 @@ def test__parse_data_column(elec_txt_dataframe):
                 4371683.38189,
             ],
         },
-    )
+    ).convert_dtypes()
```
Tried changing this `convert_dtypes()` to `apply_pudl_dtypes()` and it failed, so leaving it like this.
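For reference, `convert_dtypes()` swaps NumPy dtypes for pandas' nullable extension dtypes, which is presumably why it's needed to match the parsed output here:

```python
import pandas as pd

df = pd.DataFrame({"value": [4371683.38189]})

# The NumPy float becomes pandas' nullable Float64 extension dtype:
print(df["value"].dtype)                   # float64
print(df.convert_dtypes()["value"].dtype)  # Float64
```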
This looks good, thanks for doing this! I don't have anything blocking, just a few questions.