Update PUDL to pandas 2.0 #2320

Merged
merged 54 commits into from Aug 30, 2023

zaneselvans
Member

@zaneselvans zaneselvans commented Feb 21, 2023

PR Overview

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures.
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans zaneselvans added the dependencies Pull requests that update a dependency file label Feb 21, 2023

codecov bot commented Feb 21, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (62cb44a) 86.7% compared to head (65979f1) 86.7%.

❗ Current head 65979f1 differs from pull request most recent head 1214290. Consider uploading reports for the commit 1214290 to get more accurate results

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2320   +/-   ##
=====================================
  Coverage   86.7%   86.7%           
=====================================
  Files         81      81           
  Lines       9490    9490           
=====================================
  Hits        8233    8233           
  Misses      1257    1257           

see 6 files with indirect coverage changes

@zaneselvans zaneselvans mentioned this pull request Feb 27, 2023
8 tasks
@zaneselvans zaneselvans modified the milestone: PUDL 2023Q2 Release Mar 11, 2023
@zaneselvans zaneselvans changed the title Update for compatibility with pandas 2.0 Update PUDL to pandas 2.0 Mar 11, 2023
@zaneselvans zaneselvans linked an issue Mar 13, 2023 that may be closed by this pull request
@zaneselvans zaneselvans changed the base branch from dev to python-3.11 March 23, 2023 02:41
Base automatically changed from python-3.11 to dev March 30, 2023 20:31
@zaneselvans
Member Author

@nelsonauner I fixed the integration test failure in the Census DP1 tables and merged dev into the branch, if you want to merge it into the branch on your fork.

@zaneselvans
Member Author

After bumping SQLAlchemy to v2 I got some additional unit test failures in the pudl_sqlite_io_manager tests dealing with views and the foreign key checks.

@zaneselvans
Member Author

@nelsonauner I fixed another couple of small issues on this branch, and ended up reporting what seems to be a pandas bug: pandas-dev/pandas#54399

Now the integration tests are failing on a divide-by-zero error in the FERC Form 1 calculation reconciliations.

Member Author

@zaneselvans zaneselvans left a comment

I've made comments throughout the PR about why things were changed (if there's no comment it's probably a vanilla data type issue).

There are several instances of df.convert_dtypes() that Nelson introduced, which I think should probably be apply_pudl_dtypes(), since that is more selective and specific.

There's one unit test with mixed date formats that I still need to update. Mixed dates aren't automatically parsed by Pandas any more, and it turns out this didn't come up anywhere else in PUDL, so I just need to make all the date formats identical and then I can un-xfail that test.
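
A minimal sketch of the fix described above, assuming hypothetical date strings rather than the actual test fixture: once every value uses the same format, the format can be passed explicitly and parsing no longer depends on inference.

```python
import pandas as pd

# Hypothetical test data: every date string uses the same ISO format.
dates = pd.Series(["2020-01-01", "2020-02-15", "2021-12-31"])

# An explicit format makes the parse independent of pandas' format inference.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")
print(parsed.dt.year.tolist())
```

If the mixed formats had to be kept, pandas 2.0 also accepts format="mixed", at the cost of parsing each element individually.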

There's one SQLAlchemy error coming up with respect to the SQLite Views, which we aren't using right now. I've left it xfail for the moment, and thought we might try and fix it when we update to SQLAlchemy 2.0 (which we should do as soon as pandas 2.0 is merged in).

The full ETL and the full data integration tests all pass locally, so it seems like we haven't had any significant impacts on the data contents. 🤞🏼

@@ -423,7 +423,7 @@ def get_utility_most_recent_capacity(pudl_engine) -> pd.DataFrame:
== gen_caps["report_date"]
)
most_recent_gens = gen_caps.loc[most_recent_gens_idx]
-utility_caps = most_recent_gens.groupby("utility_id_eia").sum()
+utility_caps = most_recent_gens.groupby("utility_id_eia")["capacity_mw"].sum()
Member Author

We only want to sum one column, not the whole dataframe.
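
For context, pandas 2.0 changed the default of numeric_only to False in groupby aggregations, so summing the whole frame also tries to aggregate non-numeric columns. A small sketch with made-up generator data (the column values are hypothetical):

```python
import pandas as pd

# Hypothetical generator records; generator_id is a non-numeric column.
gens = pd.DataFrame(
    {
        "utility_id_eia": [1, 1, 2],
        "generator_id": ["a", "b", "c"],
        "capacity_mw": [10.0, 5.0, 20.0],
    }
)

# Select the one column to aggregate before summing.
utility_caps = gens.groupby("utility_id_eia")["capacity_mw"].sum()
print(utility_caps)
```

Selecting the column first returns a Series indexed by utility_id_eia and avoids touching the string column at all.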

@@ -245,7 +245,7 @@ def drop_invalid_rows(self, df):
"`drop_invalid_rows`. Adding empty columns for: "
f"{missing_required_cols}"
)
-df.loc[:, missing_required_cols] = pd.NA
+df.loc[:, list(missing_required_cols)] = pd.NA
Member Author

Can't use sets as indexers in pandas 2.0.
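
A short sketch of the same pattern on a toy dataframe. The column names and the required_cols set are hypothetical; only the set-to-list conversion mirrors the change above.

```python
import pandas as pd

df = pd.DataFrame({"plant_id_eia": [1, 2], "capacity_mw": [10.0, 20.0]})

# Hypothetical set of required columns; set difference yields another set.
required_cols = {"plant_id_eia", "capacity_mw", "report_date"}
missing_required_cols = required_cols - set(df.columns)

# Convert the set to a list before using it as a column indexer.
df.loc[:, list(missing_required_cols)] = pd.NA
print(df.columns.tolist())
```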

src/pudl/transform/ferc1.py (resolved)
@@ -4483,7 +4480,7 @@ def transform_main(self: Self, df: pd.DataFrame) -> pd.DataFrame:
& (df.income_type == "net_utility_operating_income")
)
]
-return df
+return apply_pudl_dtypes(df, group="ferc1")
Member Author

Most of the apply_pudl_dtypes() calls sprinkled throughout are dealing with:

  • datetime64 columns that have the wrong time resolution or have been turned into object columns
  • Other kinds of columns that have become object columns due to a bad NA value like None sneaking in somehow.
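
To illustrate the two failure modes in the list above, here is a sketch that uses plain DataFrame.astype() as a stand-in. apply_pudl_dtypes() is PUDL's own helper and its internals are not shown in this thread, so the column names and target dtypes below are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        # Second-resolution datetimes, which pandas 2.0 no longer coerces to ns.
        "report_date": np.array(["2020-01-01", "2021-01-01"], dtype="datetime64[s]"),
        # A None sneaking in leaves an integer column with object dtype.
        "plant_id": pd.Series([1, None], dtype="object"),
    }
)

# Stand-in for apply_pudl_dtypes(): cast each column to its intended dtype.
fixed = df.astype({"report_date": "datetime64[ns]", "plant_id": "Int64"})
print(fixed.dtypes)
```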

@@ -372,7 +372,7 @@ def filter_ferc714_hourly_demand_matrix(
.groupby("id")["year"]
.apply(lambda x: np.sort(x))
)
-with pd.option_context("display.max_colwidth", -1):
+with pd.option_context("display.max_colwidth", None):
Member Author

Displays the full contents of each column without truncation. Not sure why -1 ever worked.
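
A small sketch of what the option does, using a made-up dataframe: with display.max_colwidth set to None, long cell values are rendered in full instead of being cut off with an ellipsis.

```python
import pandas as pd

# A cell value well beyond the default column width limit.
long_text = ", ".join(str(year) for year in range(2006, 2026))
df = pd.DataFrame({"years": [long_text]})

# Inside the context manager the repr is not truncated; the option is
# restored to its previous value on exit.
with pd.option_context("display.max_colwidth", None):
    full_repr = repr(df)

print(full_repr)
```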

@@ -663,7 +663,7 @@ def to_pandas_dtype(self, compact: bool = False) -> str | pd.CategoricalDtype:
return "float32"
return FIELD_DTYPES_PANDAS[self.type]

-def to_sql_dtype(self) -> sa.sql.visitors.VisitableType:
+def to_sql_dtype(self) -> type:
Member Author

To work with SQLAlchemy 2.0

@@ -28,7 +28,7 @@
"year": pa.int32(),
}

-FIELD_DTYPES_SQL: dict[str, sa.sql.visitors.VisitableType] = {
+FIELD_DTYPES_SQL: dict[str, type] = {
Member Author

For compatibility with SQLAlchemy 2.0

dp1_engine,
-params=[table_name],
+params=(table_name,),
Member Author

Something changed about how pandas passes query params to SQLAlchemy. A tuple is okay; a list is not.
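
A self-contained sketch of passing query parameters as a tuple. It uses the standard library's sqlite3 connection instead of the SQLAlchemy engine from the PR, and the table and column names are hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory database with a hypothetical lookup table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE layers (table_name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO layers VALUES (?, ?)", [("state", 1), ("county", 2)]
)

table_name = "county"
# Positional parameters are passed as a tuple, note the trailing comma.
result = pd.read_sql(
    "SELECT value FROM layers WHERE table_name = ?",
    conn,
    params=(table_name,),
)
print(result)
```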

-bece_df = bece_df.append(table)
+dfs.append(table)
+bece_df = pd.concat(dfs)
Member Author

df.append() was deprecated in pandas 1.4 and removed in pandas 2.0 in favor of the more featureful pd.concat().

I also switched to doing a single big concatenation rather than many incremental ones.
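
The pattern can be sketched as follows, with made-up tables standing in for the real extracted data. Collecting frames in a list and concatenating once avoids copying the accumulated dataframe on every iteration.

```python
import pandas as pd

# Hypothetical per-table dataframes standing in for the extracted tables.
tables = [pd.DataFrame({"series_id": [i], "value": [i * 1.5]}) for i in range(3)]

dfs = []
for table in tables:
    dfs.append(table)  # list.append, not the removed DataFrame.append

# One concatenation at the end. ignore_index is an addition in this sketch
# to get a clean 0..n-1 index; the PR calls pd.concat(dfs) without it.
bece_df = pd.concat(dfs, ignore_index=True)
print(bece_df)
```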

test/unit/io_managers_test.py (resolved)
@zaneselvans zaneselvans marked this pull request as ready for review August 29, 2023 15:10
@zaneselvans
Member Author

Hey FYI @nelsonauner I went ahead and chased down the remaining pandas 2.0 issues in the full ETL pipeline and integration tests, so I think we're about ready to merge this in!

@@ -65,7 +65,7 @@ def test__parse_data_column(elec_txt_dataframe):
4371683.38189,
],
},
-)
+).convert_dtypes()
Member Author

Tried changing this convert_dtypes() to apply_pudl_dtypes() and it failed, so leaving it like this.
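
For reference, a sketch of what convert_dtypes() does on a toy frame (the columns are hypothetical): it infers the best nullable extension dtype for every column, whereas apply_pudl_dtypes() is described earlier in this thread as more selective.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "series_id": ["a", "b"],
        "count": [1, 2],
        "value": [1.5, None],
    }
)

# Every column is converted to a nullable extension dtype where possible.
converted = df.convert_dtypes()
print(converted.dtypes)
```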

Member

@zschira zschira left a comment

This looks good, thanks for doing this! I don't have anything blocking, just a few questions.

src/pudl/transform/ferc1.py (resolved)
src/pudl/extract/eia_bulk_elec.py (resolved)
test/unit/io_managers_test.py (resolved)
@zaneselvans zaneselvans merged commit 8f33cb3 into dev Aug 30, 2023
7 of 8 checks passed
@zaneselvans zaneselvans deleted the pandas-2.0 branch August 30, 2023 18:05
@zaneselvans zaneselvans mentioned this pull request Aug 30, 2023
9 tasks
Labels
dependencies (Pull requests that update a dependency file), inframundo
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Update PUDL to be compatible with pandas 2.0
4 participants