Update PUDL to pandas 2.0 #2320

zaneselvans · 2023-02-21T05:33:15Z

PR Overview

PR Checklist

Merge the most recent version of the branch you are merging into (probably dev).
All CI checks are passing. Run tests locally to debug failures.
Make sure you've included good docstrings.
For major data coverage & analysis changes, run data validation tests
Include unit tests for new functions and classes.
Defensive data quality/sanity checks in analyses & data processing functions.
Update the release notes and reference the PR and related issues.
Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

codecov · 2023-02-21T06:32:17Z

Codecov Report

Patch and project coverage have no change.

Comparison is base (62cb44a) 86.7% compared to head (65979f1) 86.7%.

❗ Current head 65979f1 differs from pull request most recent head 1214290. Consider uploading reports for the commit 1214290 to get more accurate results

Additional details and impacted files

@@          Coverage Diff          @@
##             dev   #2320   +/-   ##
=====================================
  Coverage   86.7%   86.7%           
=====================================
  Files         81      81           
  Lines       9490    9490           
=====================================
  Hits        8233    8233           
  Misses      1257    1257

see 6 files with indirect coverage changes

zaneselvans · 2023-08-04T03:00:30Z

@nelsonauner I fixed the integration test failure in the Census DP1 tables and merged dev into the branch, if you want to merge it into the branch on your fork.

zaneselvans · 2023-08-04T03:05:02Z

After bumping SQLAlchemy to v2 I got some additional unit test failures in the pudl_sqlite_io_manager tests dealing with views and the foreign key checks.

zaneselvans · 2023-08-04T07:00:23Z

@nelsonauner I fixed another couple of small issues on this branch, and ended up reporting what seems to be a pandas bug: pandas-dev/pandas#54399

Now the integration tests are failing on a divide-by-zero error in the FERC Form 1 calculation reconciliations.

zaneselvans

I've made comments throughout the PR about why things were changed (if there's no comment it's probably a vanilla data type issue).

There are several instances of df.convert_dtypes() that Nelson introduced, which I think should probably be apply_pudl_dtypes() which is more selective / specific.

There's one unit test with mixed date formats that I still need to update. pandas no longer parses mixed date formats automatically, and it turns out this didn't come up anywhere else in PUDL, so I just need to make all the date formats identical and then I can un-xfail that test.

There's one SQLAlchemy error coming up with respect to the SQLite Views, which we aren't using right now. I've left it xfail for the moment, and thought we might try and fix it when we update to SQLAlchemy 2.0 (which we should do as soon as pandas 2.0 is merged in).

The full ETL and the full data integration tests all pass locally, so it seems like we haven't had any significant impacts on the data contents. 🤞🏼
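
For the mixed date format test mentioned above, a minimal sketch of what "make all the date formats identical" could look like. The date strings here are made up for illustration and are not the real test fixture. As far as I understand it, pandas 2.0 settles on one format for the whole input instead of guessing per element, so giving it uniform strings plus an explicit `format` keeps the parsing unambiguous.

```python
import pandas as pd

# Hypothetical fixture values: every string uses the same ISO layout.
raw_dates = ["2020-01-31", "2020-02-29", "2020-12-01"]

# An explicit format avoids relying on any format inference at all.
parsed = pd.to_datetime(pd.Series(raw_dates), format="%Y-%m-%d")

print(parsed.dt.month.tolist())
```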

zaneselvans · 2023-08-29T14:38:16Z

src/pudl/glue/ferc1_eia.py

@@ -423,7 +423,7 @@ def get_utility_most_recent_capacity(pudl_engine) -> pd.DataFrame:
        == gen_caps["report_date"]
    )
    most_recent_gens = gen_caps.loc[most_recent_gens_idx]
-    utility_caps = most_recent_gens.groupby("utility_id_eia").sum()
+    utility_caps = most_recent_gens.groupby("utility_id_eia")["capacity_mw"].sum()


We only want to sum one column, not the whole dataframe.
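
A small sketch of the same pattern on made-up data (the column values are illustrative, only the column names come from the diff). Selecting the column before aggregating means non-numeric columns never reach `sum()`, which matters because pandas 2.0 no longer drops them silently.

```python
import pandas as pd

# Toy generator table; plant_name is a non-numeric column we do not want summed.
gens = pd.DataFrame(
    {
        "utility_id_eia": [1, 1, 2],
        "plant_name": ["a", "b", "c"],
        "capacity_mw": [10.0, 5.0, 7.5],
    }
)

# Select the one column of interest, then aggregate. The result is a Series
# indexed by utility_id_eia.
utility_caps = gens.groupby("utility_id_eia")["capacity_mw"].sum()
print(utility_caps)
```

Note that this returns a Series. Using `[["capacity_mw"]]` (a list) instead would keep the result as a one-column DataFrame.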

zaneselvans · 2023-08-29T14:38:34Z

src/pudl/glue/ferc1_eia.py

@@ -245,7 +245,7 @@ def drop_invalid_rows(self, df):
                    "`drop_invalid_rows`. Adding empty columns for: "
                    f"{missing_required_cols}"
                )
-                df.loc[:, missing_required_cols] = pd.NA
+                df.loc[:, list(missing_required_cols)] = pd.NA


Can't use sets as indexers in pandas 2.0
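
A rough illustration with invented column names: compute the missing columns as a set, but turn the set into a list before it is used as an indexer. `sorted()` is my choice here, not something from the diff. It gives a list with a deterministic column order, which a bare `list(some_set)` does not guarantee.

```python
import pandas as pd

df = pd.DataFrame({"plant_id": [1, 2], "capacity_mw": [10.0, 20.0]})
required_cols = {"plant_id", "capacity_mw", "fuel_type"}  # hypothetical names

# Set arithmetic is still the natural way to find what is missing.
missing_required_cols = required_cols - set(df.columns)

# Add each missing column explicitly, filled with NA.
for col in sorted(missing_required_cols):
    df[col] = pd.NA

# Convert the set to a list before using it as an indexer.
subset = df.loc[:, sorted(required_cols)]
print(list(subset.columns))
```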

zaneselvans · 2023-08-29T14:42:45Z

src/pudl/transform/ferc1.py

@@ -4483,7 +4480,7 @@ def transform_main(self: Self, df: pd.DataFrame) -> pd.DataFrame:
                & (df.income_type == "net_utility_operating_income")
            )
        ]
-        return df
+        return apply_pudl_dtypes(df, group="ferc1")


Most of the random apply_pudl_dtypes sprinkled throughout are dealing with:

- datetime64 columns that have the wrong time resolution or have been turned into object columns
- Other kinds of columns that have become object columns due to a bad NA value like None sneaking in somehow.
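
To make the two cases above concrete, here is a sketch in plain pandas. It does not call `apply_pudl_dtypes()`. The `target_dtypes` mapping is a hypothetical stand-in for whatever dtypes the PUDL metadata would specify, and the data is invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        # Second-resolution datetimes; pandas 2.x keeps this resolution
        # instead of converting to nanoseconds.
        "report_date": np.array(["2020-01-01", "2021-01-01"], dtype="datetime64[s]"),
        # A stray None leaves the column with the generic object dtype.
        "plant_id": pd.Series([1, None], dtype="object"),
    }
)

# Hypothetical target dtypes, standing in for the project's metadata.
target_dtypes = {"report_date": "datetime64[ns]", "plant_id": "Int64"}
fixed = df.astype(target_dtypes)
print(fixed.dtypes)
```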

zaneselvans · 2023-08-29T14:43:58Z

src/pudl/analysis/state_demand.py

@@ -372,7 +372,7 @@ def filter_ferc714_hourly_demand_matrix(
            .groupby("id")["year"]
            .apply(lambda x: np.sort(x))
        )
-        with pd.option_context("display.max_colwidth", -1):
+        with pd.option_context("display.max_colwidth", None):


None means no limit on column width, so long cell values are shown in full. -1 was the old way to ask for that and is no longer accepted.
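
A quick sketch of what the option does, using a made-up one-column frame. The explicit 50 in the first block is only there so the comparison does not depend on whatever the session default happens to be.

```python
import pandas as pd

long_text = "x" * 60
df = pd.DataFrame({"years": [long_text]})

# With a finite limit, long cell values are cut off with an ellipsis.
with pd.option_context("display.max_colwidth", 50):
    truncated = repr(df)

# With None there is no limit, so the whole value is printed.
with pd.option_context("display.max_colwidth", None):
    full = repr(df)

print(truncated)
print(full)
```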

zaneselvans · 2023-08-29T14:51:46Z

src/pudl/metadata/classes.py

@@ -663,7 +663,7 @@ def to_pandas_dtype(self, compact: bool = False) -> str | pd.CategoricalDtype:
                return "float32"
        return FIELD_DTYPES_PANDAS[self.type]

-    def to_sql_dtype(self) -> sa.sql.visitors.VisitableType:
+    def to_sql_dtype(self) -> type:


To work with SQLAlchemy 2.0

zaneselvans · 2023-08-29T14:52:24Z

src/pudl/metadata/constants.py

@@ -28,7 +28,7 @@
    "year": pa.int32(),
 }

-FIELD_DTYPES_SQL: dict[str, sa.sql.visitors.VisitableType] = {
+FIELD_DTYPES_SQL: dict[str, type] = {


For compatibility with SQLAlchemy 2.0
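
For context on the `type` annotation in this change and in `to_sql_dtype()`, a sketch assuming the mapping values are SQLAlchemy column type classes rather than instances. The mapping contents and the helper name below are hypothetical, not the real `FIELD_DTYPES_SQL`.

```python
import sqlalchemy as sa

# Hypothetical subset of a field-type-to-column-type mapping. The values are
# classes, so plain `type` is an accurate (if broad) annotation.
EXAMPLE_DTYPES_SQL: dict[str, type] = {
    "boolean": sa.Boolean,
    "integer": sa.Integer,
    "number": sa.Float,
    "string": sa.Text,
}


def example_to_sql_dtype(field_type: str) -> type:
    """Look up the SQLAlchemy column type class for a field type name."""
    return EXAMPLE_DTYPES_SQL[field_type]


# sa.Column accepts either a type class or a type instance.
column = sa.Column("capacity_mw", example_to_sql_dtype("number"))
print(column.type)
```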

zaneselvans · 2023-08-29T14:53:29Z

src/pudl/output/censusdp1tract.py

                dp1_engine,
-                params=[table_name],
+                params=(table_name,),


Something changed about how pandas passes query params to SQLAlchemy. Tuple is okay, list is not.
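
A self-contained sketch of the tuple form. To keep it runnable without a database file it uses an in-memory `sqlite3` connection instead of a SQLAlchemy engine like `dp1_engine`, so it only shows the calling convention and does not reproduce the list-vs-tuple failure. The table name is invented.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_tract (geoid TEXT)")  # hypothetical table

table_name = "example_tract"
result = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
    conn,
    # A one-element tuple needs the trailing comma.
    params=(table_name,),
)
conn.close()
print(result)
```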

zaneselvans · 2023-08-29T14:56:08Z

src/pudl/transform/eia860.py

-        bece_df = bece_df.append(table)
+        dfs.append(table)
+    bece_df = pd.concat(dfs)


df.append() was deprecated and has been removed in pandas 2.0 in favor of the more featureful pd.concat().

I also switched to doing a single big concatenation rather than many incremental ones.
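
The collect-then-concatenate pattern, sketched on toy tables (the column names and values are invented). `ignore_index=True` is an extra I added for a clean index. The diff itself just calls `pd.concat(dfs)`.

```python
import pandas as pd

# Stand-ins for the per-sheet tables produced inside the loop.
tables = [
    pd.DataFrame({"plant_id_eia": [1], "report_year": [2019]}),
    pd.DataFrame({"plant_id_eia": [2], "report_year": [2020]}),
    pd.DataFrame({"plant_id_eia": [3], "report_year": [2021]}),
]

dfs = []
for table in tables:
    # Appending to a Python list is cheap; no DataFrame is copied here.
    dfs.append(table)

# One concatenation at the end instead of one per iteration.
combined = pd.concat(dfs, ignore_index=True)
print(combined)
```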

test/unit/io_managers_test.py

zaneselvans · 2023-08-29T19:19:56Z

Hey FYI @nelsonauner I went ahead and chased down the remaining pandas 2.0 issues in the full ETL pipeline and integration tests, so I think we're about ready to merge this in!

src/pudl/extract/eia_bulk_elec.py

zaneselvans · 2023-08-29T22:18:08Z

test/unit/extract/eia_bulk_elec_test.py

@@ -65,7 +65,7 @@ def test__parse_data_column(elec_txt_dataframe):
                4371683.38189,
            ],
        },
-    )
+    ).convert_dtypes()


Tried changing this convert_dtypes() to apply_pudl_dtypes() and it failed, so leaving it like this.
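
For reference, a sketch of what `convert_dtypes()` does to a frame like the expected value in this test. The `count` column is invented to show that the inference is generic: it picks nullable extension dtypes from the data alone, without consulting any project metadata.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "value": [4371683.38189, 1.5],
        "count": [1.0, 2.0],  # floats that all happen to be whole numbers
    }
)

converted = df.convert_dtypes()

# "value" becomes nullable Float64; "count" becomes nullable Int64 because
# every value is integer-like.
print(converted.dtypes)
```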

zschira

This looks good, thanks for doing this! I don't have anything blocking, just a few questions.

src/pudl/transform/ferc1.py

src/pudl/extract/eia_bulk_elec.py

test/unit/io_managers_test.py

zaneselvans added 4 commits February 3, 2023 09:27

Replace no longer available SQLAlchemy VisitableType with plain type

290074f

Merge branch 'dev' into sqlalchemy-2.0

9842f65

Require SQLAlchemy>2 to force compatibility testing.

9e33529

Update for compatibility with pandas 2.0

d184201

zaneselvans added the dependencies label (Pull requests that update a dependency file) Feb 21, 2023

zaneselvans added 4 commits February 21, 2023 09:32

Temporarily depend on ferc_xbrl_extractor pandas-2.0 branch

ad2c3f4

Merge branch 'dev' into sqlalchemy-2.0

5658b66

Merge branch 'sqlalchemy-2.0' into pandas-2.0

89b35c5

Merge non-dependency changes from sqlalchemy-2.0 branch

3e4323c

jdangerx added the inframundo label Feb 22, 2023

Merge branch 'dev' into pandas-2.0

c4bf0d6

zaneselvans mentioned this pull request Feb 27, 2023

Update PUDL to SQLAlchemy 2.0 #2267

Merged

8 tasks

Merge branch 'dev' into pandas-2.0

2115739

zaneselvans modified the milestone: PUDL 2023Q2 Release Mar 11, 2023

zaneselvans mentioned this pull request Mar 11, 2023

Update PUDL to new major versions of key dependencies #2384

Closed

zaneselvans changed the title ~~Update for compatibility with pandas 2.0~~ Update PUDL to pandas 2.0 Mar 11, 2023

Merge branch 'dev' into pandas-2.0

bc1ce1c

zaneselvans linked an issue Mar 13, 2023 that may be closed by this pull request

Update PUDL to be compatible with pandas 2.0 #2394

Closed

zaneselvans added 3 commits March 18, 2023 20:30

Merge branch 'dev' into pandas-2.0

193c89e

Require pandas 2.0 RC for testing.

bc77904

Merge branch 'dev' into pandas-2.0

e6cb446

zaneselvans changed the base branch from dev to python-3.11 March 23, 2023 02:41

zaneselvans added 2 commits March 22, 2023 20:44

Merge branch 'python-3.11' into pandas-2.0

eb2c47a

Specify resolution of datetime64 types.

ab5fde8

Base automatically changed from python-3.11 to dev March 30, 2023 20:31

zaneselvans added 2 commits March 31, 2023 09:54

Merge branch 'dev' into pandas-2.0

15424f1

Add missed changes from merge to setup.py

9f624c2

Set SQLAlchemy view w/ metadata error to XFAIL.

c2c776b

zaneselvans added 3 commits August 4, 2023 00:55

Ignore DeprecationWarning coming up from imported packages.

9c0883e

Update fix_eia_na() regex to avoid pandas df.replace() vectorization bug.

91d9595

Remove deprecated infer_date_format flag.

e1b437b

zaneselvans added 11 commits August 9, 2023 16:58

Merge branch 'dev' into pandas-2.0

ca602b0

Merge branch 'dev' into pandas-2.0

0aab435

Fix some pandas 2 incompatibilities and highlight others.

f5324f9

Fix more minor pandas 2 compatibility issues.

2e3d228

Fix another datetime dtype merge mismatch.

2aa2556

More permissive type check in convert_col_to_datetime

b9c00fb

Apply types before setting non-unique index.

4f455fc

Use numeric_only=True in rolling average mean()

2b2ad81

Replace complex convert_cols_dtypes with simple apply_pudl_dtypes

378d6b0

Fix ambiguous index/column name overlap in plants_small_ferc1.

09fc965

Fix small pandas-2.0 incompatibilities in glue tests.

f5a56f2

zaneselvans commented Aug 29, 2023

zaneselvans marked this pull request as ready for review August 29, 2023 15:10

Merge branch 'dev' into pandas-2.0

10155d5

zaneselvans requested a review from zschira August 29, 2023 18:50

Merge branch 'dev' into pandas-2.0

cb5a56e

zaneselvans commented Aug 29, 2023

src/pudl/extract/eia_bulk_elec.py

zaneselvans commented Aug 29, 2023

zschira approved these changes Aug 30, 2023

src/pudl/transform/ferc1.py

src/pudl/extract/eia_bulk_elec.py

test/unit/io_managers_test.py

zaneselvans merged commit 8f33cb3 into dev Aug 30, 2023
7 of 8 checks passed

zaneselvans deleted the pandas-2.0 branch August 30, 2023 18:05

zaneselvans mentioned this pull request Aug 30, 2023

Upgrade PUDL to Pandas 2.0 #2769

Closed

9 tasks
