Update PUDL to pandas 2.0 #2320

Merged
Merged 54 commits into dev from pandas-2.0 on Aug 30, 2023
54 commits
290074f
Replace no longer available SQLAlchemy VisitableType with plain type
zaneselvans Feb 3, 2023
9842f65
Merge branch 'dev' into sqlalchemy-2.0
zaneselvans Feb 12, 2023
9e33529
Require SQLAlchemy>2 to force compatibility testing.
zaneselvans Feb 13, 2023
d184201
Update for compatibility with pandas 2.0
zaneselvans Feb 21, 2023
ad2c3f4
Temporarily depend on ferc_xbrl_extractor pandas-2.0 branch
zaneselvans Feb 21, 2023
5658b66
Merge branch 'dev' into sqlalchemy-2.0
zaneselvans Feb 21, 2023
89b35c5
Merge branch 'sqlalchemy-2.0' into pandas-2.0
zaneselvans Feb 21, 2023
3e4323c
Merge non-dependency changes from sqlalchemy-2.0 branch
zaneselvans Feb 21, 2023
c4bf0d6
Merge branch 'dev' into pandas-2.0
zaneselvans Feb 25, 2023
2115739
Merge branch 'dev' into pandas-2.0
zaneselvans Mar 3, 2023
bc1ce1c
Merge branch 'dev' into pandas-2.0
zaneselvans Mar 11, 2023
193c89e
Merge branch 'dev' into pandas-2.0
zaneselvans Mar 19, 2023
bc77904
Require pandas 2.0 RC for testing.
zaneselvans Mar 21, 2023
e6cb446
Merge branch 'dev' into pandas-2.0
zaneselvans Mar 21, 2023
eb2c47a
Merge branch 'python-3.11' into pandas-2.0
zaneselvans Mar 23, 2023
ab5fde8
Specify resolution of datetime64 types.
zaneselvans Mar 23, 2023
15424f1
Merge branch 'dev' into pandas-2.0
zaneselvans Mar 31, 2023
9f624c2
Add missed changes from merge to setup.py
zaneselvans Mar 31, 2023
9635ff5
Use actual pandas 2.0.0 now that it's out.
zaneselvans Apr 3, 2023
fcb3fa2
Merge branch 'dev' into pandas-2.0
zaneselvans Apr 3, 2023
17ea796
Merge branch 'dev' into pandas-2.0
zaneselvans Apr 4, 2023
1214290
Merge branch 'pyproject-toml' into pandas-2.0
zaneselvans Apr 4, 2023
a15a244
Merge branch 'dev' into pandas-2.0
zaneselvans Apr 4, 2023
f1164d7
Merge branch 'dev' into pandas-2.0
zaneselvans Apr 4, 2023
97fae98
Merge branch 'dev' into pandas-2.0
zaneselvans Apr 4, 2023
6c8fe03
Merge branch 'dev' into pandas-2.0
zaneselvans Apr 4, 2023
4cc7e4d
Update to ferc-xbrl-extractor 0.8.2
zaneselvans Apr 4, 2023
6e795ec
Merge branch 'dev' into pandas-2.0
zaneselvans Apr 5, 2023
4af9d7e
Merge branch 'dev' into pandas-2.0
zaneselvans Apr 5, 2023
a67248e
Use pandas extras to declare some dependencies
zaneselvans Apr 7, 2023
1f04e24
Merge branch 'dev' into pandas-2.0
zaneselvans Apr 24, 2023
dbb58f0
Merge branch 'dev' into pandas-2.0
zaneselvans Jul 26, 2023
09fb939
Update recordlinkage to version compatible with pandas 2.0
zaneselvans Jul 26, 2023
3347efb
the road to hell is paved with dtypes
nelsonauner Aug 3, 2023
c358d7f
Merge branch 'dev' into pandas-2.0
zaneselvans Aug 4, 2023
f3d9763
Merge remote-tracking branch 'nelsonauner/pandas-2.0' into pandas-2.0
zaneselvans Aug 4, 2023
6b19483
Use tuple instead of list of SQL params.
zaneselvans Aug 4, 2023
c2c776b
Set SQLAlchemy view w/ metadata error to XFAIL.
zaneselvans Aug 4, 2023
9c0883e
Ignore DeprecationWarning coming up from imported packages.
zaneselvans Aug 4, 2023
91d9595
Update fix_eia_na() regex to avoid pandas df.replace() vectorization …
zaneselvans Aug 4, 2023
e1b437b
Remove deprecated infer_date_format flag.
zaneselvans Aug 4, 2023
ca602b0
Merge branch 'dev' into pandas-2.0
zaneselvans Aug 9, 2023
0aab435
Merge branch 'dev' into pandas-2.0
zaneselvans Aug 27, 2023
f5324f9
Fix some pandas 2 incompatibilities and highlight others.
zaneselvans Aug 27, 2023
2e3d228
Fix more minor pandas 2 compatibility issues.
zaneselvans Aug 27, 2023
2aa2556
Fix another datetime dtype merge mismatch.
zaneselvans Aug 27, 2023
b9c00fb
More permissive type check in convert_col_to_datetime
zaneselvans Aug 28, 2023
4f455fc
Apply types before setting non-unique index.
zaneselvans Aug 28, 2023
2b2ad81
Use numeric_only=True in rolling average mean()
zaneselvans Aug 28, 2023
378d6b0
Replace complex convert_cols_dtypes with simple apply_pudl_dtypes
zaneselvans Aug 28, 2023
09fc965
Fix ambiguous index/column name overlap in plants_small_ferc1.
zaneselvans Aug 29, 2023
f5a56f2
Fix small pandas-2.0 incompatibilities in glue tests.
zaneselvans Aug 29, 2023
10155d5
Merge branch 'dev' into pandas-2.0
zaneselvans Aug 29, 2023
cb5a56e
Merge branch 'dev' into pandas-2.0
zaneselvans Aug 29, 2023
5 changes: 2 additions & 3 deletions pyproject.toml
@@ -33,14 +33,13 @@ dependencies = [
"jinja2>=2,<3.2",
"matplotlib>=3.3,<3.8", # Should make this optional with a "viz" extras
"networkx>=2.2,<3.2",
"pandas[parquet,excel,fss,gcp,compression]>=2.0,<2.1",
"numpy>=1.18.5,!=1.23.0,<1.26",
"pandas>=1.4,<1.5.4",
"pyarrow>=5,<12.1",
"pydantic[email]>=1.7,<2",
"python-dotenv>=0.21,<1.1",
"python-snappy>=0.6,<0.7",
"pyyaml>=5,<6.1",
"recordlinkage>=0.14,<0.17",
"recordlinkage>=0.16,<0.17",
"scikit-learn>=1.0,<1.4",
"scipy>=1.6,<1.12",
"Shapely>=2.0,<2.1",
23 changes: 16 additions & 7 deletions src/pudl/analysis/allocate_gen_fuel.py
@@ -629,7 +629,7 @@ def stack_generators(
pd.DataFrame(gens.set_index(IDX_GENS)[esc].stack(level=0))
.reset_index()
.rename(columns={"level_3": cat_col, 0: stacked_col})
.pipe(apply_pudl_dtypes, "eia")
.pipe(apply_pudl_dtypes, group="eia")
)
# arrange energy source codes by number and type (start with energy_source_code, then planned_, then startup_)
gens_stack_prep = gens_stack_prep.sort_values(
@@ -712,13 +712,21 @@ def associate_generator_tables(
"""
stack_gens = stack_generators(
gens, cat_col="energy_source_code_num", stacked_col="energy_source_code"
)
).pipe(apply_pudl_dtypes, group="eia")
# allocate the boiler fuel data to generators
bf_by_gens = allocate_bf_data_to_gens(bf, gens, bga)
bf_by_gens = (
bf_by_gens.set_index(IDX_GENS_PM_ESC).add_suffix("_bf_tbl").reset_index()
allocate_bf_data_to_gens(bf, gens, bga)
.set_index(IDX_GENS_PM_ESC)
.add_suffix("_bf_tbl")
.reset_index()
.pipe(apply_pudl_dtypes, group="eia")
)
gf = (
gf.set_index(IDX_PM_ESC)[DATA_COLUMNS]
.add_suffix("_gf_tbl")
.reset_index()
.pipe(apply_pudl_dtypes, group="eia")
)
gf = gf.set_index(IDX_PM_ESC)[DATA_COLUMNS].add_suffix("_gf_tbl").reset_index()

gen_assoc = (
pd.merge(
@@ -767,7 +775,7 @@ def associate_generator_tables(
.reset_index(),
on=IDX_ESC,
how="outer",
).pipe(apply_pudl_dtypes, "eia")
).pipe(apply_pudl_dtypes, group="eia")
return gen_assoc
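
As a hypothetical illustration of why the added apply_pudl_dtypes(group="eia") calls matter here: pandas 2.0 supports non-nanosecond datetime resolutions, so merge keys built in different places can end up with mismatched dtypes unless both sides are normalized first. A minimal sketch with made-up data (not the real EIA tables):

import pandas as pd

left = pd.DataFrame({"x": [1], "report_date": ["2020-01-01"]})
right = pd.DataFrame({"y": [2], "report_date": ["2020-01-01"]})
left["report_date"] = pd.to_datetime(left["report_date"]).astype("datetime64[s]")
right["report_date"] = pd.to_datetime(right["report_date"])  # datetime64[ns]

# Coercing both frames to one shared dtype (what apply_pudl_dtypes does for the
# real tables) keeps the merge keys directly comparable.
shared = {"report_date": "datetime64[ns]"}
merged = left.astype(shared).merge(right.astype(shared), on="report_date", how="outer")
print(merged.dtypes)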


@@ -1566,7 +1574,7 @@ def assign_plant_year(df):
"year": x.report_date.dt.year,
"month": 1,
"day": 1,
}
},
)
)
.pipe(
@@ -1577,6 +1585,7 @@
)
.assign(**{data_column_name: lambda x: x[data_column_name] / 12})
.pipe(assign_plant_year)
.pipe(apply_pudl_dtypes, group="eia")
.set_index(["plant_year"])
)
# sometimes a plant oscillates btwn annual and monthly reporting. when it does
1 change: 1 addition & 0 deletions src/pudl/analysis/plant_parts_eia.py
@@ -480,6 +480,7 @@ def execute(
validate_own_merge,
)
)
gens_mega = gens_mega.convert_dtypes()
return gens_mega

def get_gens_mega_table(self, mcoe):
2 changes: 1 addition & 1 deletion src/pudl/analysis/state_demand.py
@@ -372,7 +372,7 @@ def filter_ferc714_hourly_demand_matrix(
.groupby("id")["year"]
.apply(lambda x: np.sort(x))
)
with pd.option_context("display.max_colwidth", -1):
with pd.option_context("display.max_colwidth", None):
Review comment from zaneselvans (Member, Author): Displays all columns. Not sure why -1 ever worked.
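
A small, hypothetical sketch of the option change (the frame below just stands in for the respondent/year report being logged): None is the supported way to disable cell truncation, while -1 is no longer accepted.

import pandas as pd

report = pd.DataFrame({"id": [101], "years": [list(range(2006, 2021))]})

# display.max_colwidth=None disables truncation of wide cells in the printed report.
with pd.option_context("display.max_colwidth", None):
    print(report)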

logger.info(f"{msg}:\n{report}")
# Drop respondents with no data
blank = df.columns[df.isnull().all()].tolist()
3 changes: 2 additions & 1 deletion src/pudl/extract/eia_bulk_elec.py
@@ -75,11 +75,12 @@ def _parse_data_column(elec_df: pd.DataFrame) -> pd.DataFrame:
)
else:
data_df.loc[:, "date"] = pd.to_datetime(
data_df.loc[:, "date"], infer_datetime_format=True, errors="raise"
data_df.loc[:, "date"], errors="raise"
)
data_df["series_id"] = elec_df.at[idx, "series_id"]
out.append(data_df)
out = pd.concat(out, ignore_index=True, axis=0)
out = out.convert_dtypes()
out.loc[:, "series_id"] = out.loc[:, "series_id"].astype("category", copy=False)
return out.loc[:, ["series_id", "date", "value"]] # reorder cols
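
For reference, a minimal sketch of the to_datetime() call after dropping the flag (values are made up): pandas 2.0 infers the parsing format from the first non-NA value by default, so infer_datetime_format is deprecated and no longer needed.

import pandas as pd

dates = pd.Series(["2020-01-01", "2020-02-01"])  # stand-in for the bulk EIA "date" column

parsed = pd.to_datetime(dates, errors="raise")
print(parsed)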

4 changes: 2 additions & 2 deletions src/pudl/glue/ferc1_eia.py
@@ -245,7 +245,7 @@ def drop_invalid_rows(self, df):
"`drop_invalid_rows`. Adding empty columns for: "
f"{missing_required_cols}"
)
df.loc[:, missing_required_cols] = pd.NA
df.loc[:, list(missing_required_cols)] = pd.NA
Review comment from zaneselvans (Member, Author): Can't use sets as indexers in pandas 2.0.
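
A tiny, hypothetical illustration of the indexer change (column names are made up, mirroring the fix above):

import pandas as pd

df = pd.DataFrame({"record_id": [1, 2]})
missing_required_cols = {"report_year", "utility_id_ferc1"}

# df.loc[:, missing_required_cols] = pd.NA   # set indexers are rejected by pandas 2.0
df.loc[:, list(missing_required_cols)] = pd.NA  # converting the set to a list works
print(df)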

return super().drop_invalid_rows(df)


@@ -423,7 +423,7 @@ def get_utility_most_recent_capacity(pudl_engine) -> pd.DataFrame:
== gen_caps["report_date"]
)
most_recent_gens = gen_caps.loc[most_recent_gens_idx]
utility_caps = most_recent_gens.groupby("utility_id_eia").sum()
utility_caps = most_recent_gens.groupby("utility_id_eia")["capacity_mw"].sum()
Review comment from zaneselvans (Member, Author): We only want to sum one column, not the whole dataframe.
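
A toy version of the aggregation with made-up values, showing why the column is selected before summing:

import pandas as pd

gen_caps = pd.DataFrame({
    "utility_id_eia": [1, 1, 2],
    "plant_name_eia": ["alpha", "beta", "gamma"],  # non-numeric column
    "capacity_mw": [100.0, 50.0, 75.0],
})

# Summing the whole grouped frame would also try to aggregate the string column;
# selecting capacity_mw first returns only the per-utility capacity totals.
utility_caps = gen_caps.groupby("utility_id_eia")["capacity_mw"].sum()
print(utility_caps)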

return utility_caps


68 changes: 35 additions & 33 deletions src/pudl/helpers.py
@@ -26,7 +26,7 @@
from pandas._libs.missing import NAType

import pudl.logging_helpers
from pudl.metadata.fields import get_pudl_dtypes
from pudl.metadata.fields import apply_pudl_dtypes, get_pudl_dtypes

sum_na = partial(pd.Series.sum, skipna=False)
"""A sum function that returns NA if the Series includes any NA values.
@@ -364,22 +364,23 @@ def is_doi(doi):
return bool(re.match(doi_regex, doi))


def convert_col_to_datetime(df, date_col_name):
"""Convert a column in a dataframe to a datetime.
def convert_col_to_datetime(df: pd.DataFrame, date_col_name: str) -> pd.DataFrame:
"""Convert a non-datetime column in a dataframe to a datetime64[s].

If the column isn't a datetime, it needs to be converted to a string type
first so that integer years are formatted correctly.

Args:
df (pandas.DataFrame): Dataframe with column to convert.
date_col_name (string): name of the column to convert.
df: Dataframe with column to convert.
date_col_name: name of the datetime column to convert.

Returns:
Dataframe with the converted datetime column.
"""
if pd.api.types.is_datetime64_ns_dtype(df[date_col_name]) is False:
if not pd.api.types.is_datetime64_dtype(df[date_col_name]):
Review comment from zaneselvans (Member, Author): Use a more liberal type check here, since what we're really trying to do is detect non-datetime columns (e.g. int, string) to convert.
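
A quick, illustrative check of the two predicates under pandas 2.0's non-nanosecond datetimes:

import pandas as pd
from pandas.api import types

col_ns = pd.Series(pd.to_datetime(["2020-01-01"]))  # datetime64[ns]
col_s = col_ns.astype("datetime64[s]")              # second resolution, new in pandas 2.0

# The strict check only matches nanosecond resolution, so a datetime64[s] column
# would be re-converted; the broader check accepts any datetime64 resolution and
# only flags genuinely non-datetime (int, string) columns.
print(types.is_datetime64_ns_dtype(col_s))  # False
print(types.is_datetime64_dtype(col_s))     # True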

logger.warning(
f"{date_col_name} is {df[date_col_name].dtype} column. Converting to datetime."
f"{date_col_name} is {df[date_col_name].dtype} column. "
"Converting to datetime64[ns]."
)
df[date_col_name] = pd.to_datetime(df[date_col_name].astype("string"))
return df
@@ -618,17 +619,21 @@ def expand_timeseries(
f"{fill_through_freq} is not a valid frequency to fill through."
)
end_dates["drop_row"] = True
df = pd.concat([df, end_dates.reset_index()])
df = (
df.set_index(date_col)
pd.concat([df, end_dates.reset_index()])
.set_index(date_col)
.groupby(key_cols)
.resample(freq)
.ffill()
.drop(key_cols, axis=1)
.reset_index()
)
df = df[df.drop_row.isnull()].drop("drop_row", axis=1).reset_index(drop=True)
return df
return (
df[df.drop_row.isnull()]
.drop(columns="drop_row")
.reset_index(drop=True)
.pipe(apply_pudl_dtypes)
)


def organize_cols(df, cols):
Expand Down Expand Up @@ -984,26 +989,19 @@ def convert_to_date(
return df


def fix_eia_na(df):
def fix_eia_na(df: pd.DataFrame) -> pd.DataFrame:
"""Replace common ill-posed EIA NA spreadsheet values with np.nan.

Currently replaces empty string, single decimal points with no numbers,
and any single whitespace character with np.nan.

Args:
df (pandas.DataFrame): The DataFrame to clean.
df: The DataFrame to clean.

Returns:
pandas.DataFrame: The cleaned DataFrame.
DataFrame with regularized NA values.
"""
return df.replace(
to_replace=[
r"^\.$", # Nothing but a decimal point
r"^\s*$", # The empty string and entirely whitespace strings
],
value=np.nan,
regex=True,
)
return df.replace(regex=r"(^\.$|^\s*$)", value=np.nan)
Review comment from zaneselvans (Member, Author): This was a regression in pandas. See pandas-dev/pandas#54399.
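
A small before/after sketch with hypothetical values; the separate patterns are folded into a single alternation to sidestep the df.replace() list-of-regexes issue referenced above:

import numpy as np
import pandas as pd

df = pd.DataFrame({"fuel_cost": ["1.23", ".", "", "   ", "4.56"]})

# One combined regex instead of a list of patterns passed to df.replace().
cleaned = df.replace(regex=r"(^\.$|^\s*$)", value=np.nan)
print(cleaned)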



def simplify_columns(df):
Expand All @@ -1025,14 +1023,18 @@ def simplify_columns(df):
Todo:
Update docstring.
"""
df.columns = (
df.columns.str.replace(r"[^0-9a-zA-Z]+", " ", regex=True)
.str.strip()
.str.lower()
.str.replace(r"\s+", " ", regex=True)
.str.replace(" ", "_")
)
return df
# Do nothing, if empty dataframe (e.g. mocked for tests)
if df.shape[0] == 0:
return df
else:
df.columns = (
df.columns.str.replace(r"[^0-9a-zA-Z]+", " ", regex=True)
.str.strip()
.str.lower()
.str.replace(r"\s+", " ", regex=True)
.str.replace(" ", "_")
)
return df


def drop_tables(engine: sa.engine.Engine, clobber: bool = False):
@@ -1220,11 +1222,11 @@ def generate_rolling_avg(
# to get the backbone/complete date range/groups
bones = (
date_range.merge(groups)
.drop("tmp", axis=1) # drop the temp column
.drop(columns="tmp") # drop the temp column
.merge(df, on=group_cols + ["report_date"])
.set_index(group_cols + ["report_date"])
.groupby(by=group_cols + ["report_date"])
.mean()
.mean(numeric_only=True)
Review comment from zaneselvans (Member, Author): Change in pandas default behavior. numeric_only=True preserves old behavior.
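
A minimal illustration of the default change, using made-up data:

import pandas as pd

df = pd.DataFrame({
    "plant_id_eia": [1, 1, 2],
    "fuel_type": ["gas", "gas", "coal"],  # non-numeric column
    "capacity_mw": [10.0, 12.0, 7.0],
})

# pandas 2.0 no longer silently drops non-numeric columns in groupby aggregations,
# so mean() over a frame that still contains string columns needs numeric_only=True.
print(df.groupby("plant_id_eia").mean(numeric_only=True))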

)
# with the aggregated data, get a rolling average
roll = bones.rolling(window=window, center=True, **kwargs).agg({data_col: "mean"})
@@ -1600,7 +1602,7 @@ def convert_df_to_excel_file(df: pd.DataFrame, **kwargs) -> pd.ExcelFile:
writer = pd.ExcelWriter(bio, engine="xlsxwriter")
df.to_excel(writer, **kwargs)

writer.save()
writer.close()
Review comment from zaneselvans (Member, Author): Change in API?
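
For context: ExcelWriter.save() was removed in pandas 2.0, and close() (or a context manager) finalizes the workbook instead. A sketch of the idiom, assuming the xlsxwriter package is installed:

import io

import pandas as pd

bio = io.BytesIO()
df = pd.DataFrame({"x": [1, 2]})

# The context manager calls close() on exit, which writes the workbook to the buffer.
with pd.ExcelWriter(bio, engine="xlsxwriter") as writer:
    df.to_excel(writer, index=False)

bio.seek(0)
workbook = bio.read()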


bio.seek(0)
workbook = bio.read()
2 changes: 1 addition & 1 deletion src/pudl/metadata/classes.py
@@ -663,7 +663,7 @@ def to_pandas_dtype(self, compact: bool = False) -> str | pd.CategoricalDtype:
return "float32"
return FIELD_DTYPES_PANDAS[self.type]

def to_sql_dtype(self) -> sa.sql.visitors.VisitableType:
def to_sql_dtype(self) -> type:
Review comment from zaneselvans (Member, Author): To work with SQLAlchemy 2.0.

"""Return SQLAlchemy data type."""
if self.constraints.enum and self.type == "string":
return sa.Enum(*self.constraints.enum)
2 changes: 1 addition & 1 deletion src/pudl/metadata/constants.py
@@ -28,7 +28,7 @@
"year": pa.int32(),
}

FIELD_DTYPES_SQL: dict[str, sa.sql.visitors.VisitableType] = {
FIELD_DTYPES_SQL: dict[str, type] = {
Review comment from zaneselvans (Member, Author): For compatibility with SQLAlchemy 2.0.

"boolean": sa.Boolean,
"date": sa.Date,
# Ensure SQLite's string representation of datetime uses only whole seconds:
2 changes: 1 addition & 1 deletion src/pudl/metadata/fields.py
@@ -2907,7 +2907,7 @@
) -> pd.DataFrame:
"""Apply dtypes to those columns in a dataframe that have PUDL types defined.

Note at ad-hoc column dtypes can be defined and merged with default PUDL field
Note that ad-hoc column dtypes can be defined and merged with default PUDL field
metadata before it's passed in as ``field_meta`` if you have module specific column
types you need to apply alongside the standard PUDL field types.

20 changes: 10 additions & 10 deletions src/pudl/output/censusdp1tract.py
@@ -42,17 +42,17 @@ def get_layer(layer, dp1_engine):
table_name = f"{layer}_2010census_dp1"
df = pd.read_sql(
"""
SELECT geom_cols.f_table_name as table_name,
geom_cols.f_geometry_column as geom_col,
crs.auth_name as auth_name,
crs.auth_srid as auth_srid
FROM geometry_columns geom_cols
INNER JOIN spatial_ref_sys crs
ON geom_cols.srid = crs.srid
WHERE table_name = ?
""",
SELECT geom_cols.f_table_name as table_name,
geom_cols.f_geometry_column as geom_col,
crs.auth_name as auth_name,
crs.auth_srid as auth_srid
FROM geometry_columns geom_cols
INNER JOIN spatial_ref_sys crs
ON geom_cols.srid = crs.srid
WHERE table_name = ?
""",
dp1_engine,
params=[table_name],
params=(table_name,),
Review comment from zaneselvans (Member, Author): Something changed about how pandas passes query params to SQLAlchemy. Tuple is okay, list is not.
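
A self-contained sketch of the parameter-passing difference, with an in-memory SQLite database standing in for the Census DP1 database:

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE geometry_columns (f_table_name TEXT)"))
    conn.execute(sa.text("INSERT INTO geometry_columns VALUES ('tract_2010census_dp1')"))

# Positional "?" parameters are passed as a tuple; a list in the same spot fails
# with pandas 2.0 + SQLAlchemy 2.0.
df = pd.read_sql(
    "SELECT f_table_name FROM geometry_columns WHERE f_table_name = ?",
    engine,
    params=("tract_2010census_dp1",),
)
print(df)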

)
if len(df) != 1:
raise AssertionError(
15 changes: 10 additions & 5 deletions src/pudl/output/ferc714.py
@@ -236,7 +236,7 @@ def filled_balancing_authority_eia861(
df = pd.concat([df, pd.DataFrame(rows)])
# Remove balancing authorities treated as utilities
mask = df["balancing_authority_id_eia"].isin([util["id"] for util in UTILITIES])
return df[~mask]
return apply_pudl_dtypes(df[~mask], group="eia")


def filled_balancing_authority_assn_eia861(
@@ -312,7 +312,11 @@ def filled_balancing_authority_assn_eia861(
tables.append(table)
if "replace" in util and util["replace"]:
mask |= is_child
return pd.concat([df[~mask], pd.concat(tables)]).drop_duplicates()
return (
pd.concat([df[~mask]] + tables)
.drop_duplicates()
.pipe(apply_pudl_dtypes, group="eia")
)


def filled_service_territory_eia861(
@@ -340,8 +344,7 @@ def filled_service_territory_eia861(
# Reformat as unique utility-state-year
assn = assn[selected][index].drop_duplicates()
# Select relevant service territories
df = service_territory_eia861
mdf = assn.merge(df, how="left")
mdf = assn.merge(service_territory_eia861, how="left")
# Drop utility-state with no counties for all years
grouped = mdf.groupby(["utility_id_eia", "state"])["county_id_fips"]
mdf = mdf[grouped.transform("count").gt(0)]
@@ -361,7 +364,9 @@
idx = (years - row["report_date"]).abs().idxmin()
mask &= mdf["report_date"].eq(years[idx])
tables.append(mdf[mask].assign(report_date=row["report_date"]))
return pd.concat([df] + tables)
return pd.concat([service_territory_eia861] + tables).pipe(
apply_pudl_dtypes, group="eia"
)


@asset(compute_kind="Python")
6 changes: 3 additions & 3 deletions src/pudl/transform/eia860.py
@@ -922,8 +922,7 @@ def _core_eia860__boiler_emissions_control_equipment_assn(
raw_eia860__boiler_particulate,
]

bece_df = pd.DataFrame({})

dfs = []
for table in raw_tables:
# There are some utilities that report the same emissions control equipment.
# Drop duplicate rows where the only difference is utility.
@@ -948,7 +947,8 @@
var_name="emission_control_id_type",
value_name="emission_control_id_eia",
)
bece_df = bece_df.append(table)
dfs.append(table)
bece_df = pd.concat(dfs)
Review comment from zaneselvans (Member, Author): df.append() has been deprecated in favor of the more featureful pd.concat(). I also switched to doing a single big concatenation rather than many incremental ones.
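
A compact sketch of the pattern with toy frames: collect the pieces in a list and concatenate once at the end.

import pandas as pd

dfs = []
for i in range(3):
    table = pd.DataFrame({"boiler_id": [f"b{i}"], "emission_control_id_eia": [i]})
    dfs.append(table)

# DataFrame.append() is gone in pandas 2.0; a single pd.concat() over the accumulated
# list also avoids re-copying the growing frame on every iteration.
bece_df = pd.concat(dfs, ignore_index=True)
print(bece_df)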


# The report_year column must be report_date in order for the harvcesting process
# to work on this table. It later gets converted back to report_year.