PUDL Unit ID Assignments #1037

zaneselvans · 2021-06-19T00:57:15Z

Integrate PUDL Unit ID assignment into generator output

Ad-hoc integration of more expansive PUDL Unit ID assignments into the output functions for the EIA 860 generators. Eventually this functionality should go into the boiler generator association process, but it's going to require additional debugging and development, and we don't want to have to run the ETL over and over again to test it out.

Some outstanding questions

Is the way I'm doing lumping vs. splitting in the steam turbine fuel-based unit assignments, and the "simple" per-generator unit ID assignments sensible? In the simple generators every generator gets its own Unit ID. In the steam turbines, generators with the same (consistent across years) fuel types are lumped together right now. In "reality" it looks like these generators are probably independent. In the context of the generation_fuel_eia923 table, they'll be impossible to tell apart.
There's some code duplication between the different kinds of unit ID assignments. Is it worth trying to simplify / consolidate?
Should this process be optional in the gens_eia860() output function? It takes a minute or two to run. Or can it be made much faster?
What kinds of tests or diagnostics can we use to understand whether these particular unit ID assignments are a good idea or not?
Should we fill in pseudo-boiler IDs for the units that we've created, like the latter years of CCNG plants do?
Should we back/forward fill the technology descriptions and prime mover codes? Esp. in older years?
Is every boiler that we know of (in the boiler entity table) mapped to generators in the BGA table? Or are there some orphaned, unassociated boilers?
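
To make the lumping vs. splitting question concrete, here is a minimal sketch on toy data. It is not the code in this PR: the column names match those discussed here, but the helper `assign_toy_unit_ids()` and its `by` parameter are hypothetical and exist only for illustration.

```python
import pandas as pd

# Toy data: one plant with three steam turbine generators over two report years.
gens = pd.DataFrame({
    "plant_id_eia": [1, 1, 1, 1, 1, 1],
    "generator_id": ["A", "B", "C", "A", "B", "C"],
    "report_date": pd.to_datetime(["2019-01-01"] * 3 + ["2020-01-01"] * 3),
    "fuel_type_code_pudl": ["coal", "coal", "gas", "coal", "coal", "gas"],
})


def assign_toy_unit_ids(df, by):
    """Hypothetical helper: number the distinct values of `by` within each plant."""
    keys = ["plant_id_eia", by]
    ids = df[keys].drop_duplicates()
    ids = ids.assign(unit_id=ids.groupby("plant_id_eia").cumcount() + 1)
    # A many-to-one left merge keeps the row order and row count of `df`.
    merged = df.merge(ids, on=keys, how="left", validate="many_to_one")
    return merged["unit_id"].to_numpy()


# "Splitting": every generator is its own unit.
split_ids = assign_toy_unit_ids(gens, by="generator_id")
# "Lumping": generators sharing a fuel type are treated as one unit.
lumped_ids = assign_toy_unit_ids(gens, by="fuel_type_code_pudl")
```

In this toy case the two coal generators share a unit under lumping but get separate units under splitting, which is the trade-off the first question is asking about.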

Where should these Unit IDs end up eventually?

Should it be several different well normalized tables defining different kinds of Unit IDs? As it is right now they can't go into the boiler-generator association table, since it requires non-null boiler_id values as a primary key field.
Should it be a single denormalized table that's only available in the outputs?

There's a handful of plants whose unit_id_pudl values do not start at 1. Not sure why this is the case. It could be that the reported Unit IDs that we start with in the BGA don't start at 1. Or it could be that there's a bug... Once you've got the gens_eia860 dataframe, this code will show them:

(
    gens_eia860
    .groupby("plant_id_eia")["unit_id_pudl"].min()
    .reset_index()
    .rename(columns={"unit_id_pudl": "min_unit_id_pudl"})
    .fillna(0).astype(int)
    .query("min_unit_id_pudl > 1")
)

Followup Tasks

Some things to do, based on feedback from @cmgosnell:

Highlight the dynamic nature of Unit IDs in data dictionary so folks don't hard-code them in manual processes.
More clearly document the reasoning behind fuel-specific Unit ID assignment in the steam turbine ID assignments
Raise an exception if both fuel_type_code and label_prefix are set in the fuel-based sub-prime-mover unit ID assignment process.
Make the Unit ID assignment process optional via a flag in the pudl_out.gens_eia860() function.
Functionalize the split-merge-append process that takes place at the end of most of the ID assignment functions
Explicitly flag ST generators that do not have consistent fuel types across all years so they aren't left dangling (similar to labeling of the orphan CT/CA generators)
Combine all the Combined Cycle Unit ID assignments into a single CC specific function. The abstraction as it exists right now is artificial.

Further investigation:

Look at the preexisting units to get a better sense of how they are structured, and whether these unit ID assignments result in similar units. E.g. are GT and IC prime movers typically seen in association with other types of generators within a larger aggregated unit? Are there many pseudo-combined-cycle plants that are made up of GT and ST generators that (maybe) really should be labeled CT and CA?

WIP Notebook

Ad-hoc integration of more expansive PUDL Unit ID assignments into the output functions for the EIA 860 generators. Eventually this functionality should go into the boiler generator association process, but it's going to require additional debugging and development, and we don't want to have to run the ETL over and over again to test it out.

This is still a work in progress. Outstanding questions / tasks:

* Move explanatory text from the prototype notebook into the docstrings.
* Consider lumping vs. splitting in the steam turbine fuel-based unit assignments, as well as the "simple" per-generator unit ID assignments.
* There's some code duplication between the different kinds of unit ID assignments. Is it worth trying to simplify?
* Should this process be optional in the gens_eia860() output function? It takes a minute or two to run. Alternatively, can it be made much faster?
* What kinds of tests or diagnostics can we use to understand whether these particular unit ID assignments are a good idea or not?

codecov · 2021-06-19T01:08:03Z

Codecov Report

Merging #1037 (d681046) into dev (25b1905) will increase coverage by 0.39%.
The diff coverage is 94.44%.

@@            Coverage Diff             @@
##              dev    #1037      +/-   ##
==========================================
+ Coverage   81.03%   81.43%   +0.39%     
==========================================
  Files          49       49              
  Lines        6005     6089      +84     
==========================================
+ Hits         4866     4958      +92     
+ Misses       1139     1131       -8

Impacted Files	Coverage Δ
src/pudl/analysis/mcoe.py	`95.00% <ø> (ø)`
src/pudl/constants.py	`100.00% <ø> (ø)`
src/pudl/output/eia860.py	`97.45% <93.98%> (-2.55%)`	⬇️
src/pudl/helpers.py	`92.26% <100.00%> (ø)`
src/pudl/output/eia923.py	`98.24% <100.00%> (+0.02%)`	⬆️
src/pudl/output/pudltabl.py	`68.67% <100.00%> (ø)`
src/pudl/transform/eia860.py	`96.15% <100.00%> (ø)`
src/pudl/analysis/timeseries_cleaning.py	`88.62% <0.00%> (+2.84%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

TrentonBush · 2021-07-01T17:56:41Z

src/pudl/transform/eia860.py

@@ -582,8 +582,8 @@ def transform(eia860_raw_dfs, eia860_tables=pc.pudl_tables["eia860"]):
    # for each of the tables, run the respective transform funtction
    for table in eia860_transform_functions:
        if table in eia860_tables:
-            logger.info(f"Transforming raw EIA 860 DataFrames for {table} "
-                        f"concatenated across all years.")
+            logger.info("Transforming raw EIA 860 DataFrames for %s "


Out of curiosity, is there something wrong with f-strings here?

My understanding is that f-strings always get interpolated, whereas giving the variables as arguments to the logger method prevents them from getting interpolated if you're not at that log level. And this can be important if you're only assigning to those variables based on the logging level (e.g. you don't bother calculating a value unless you're at DEBUG). If you search for Python lazy logging I think you'll find more info: https://stackoverflow.com/questions/13500813/how-to-use-modern-string-formatting-options-with-pythons-logging-module

Huh, I didn't know that. Following your link, somebody made the argument that because f-strings are significantly faster to interpolate, the savings from avoided interpolation of suppressed log levels are partly offset by slower interpolation at higher levels, plus the cost of reduced readability. And that the absolute time difference is very small either way unless you're churning out millions of logs.

I take your point that it is a neat trick to keep in mind if table was an expensive calculation only made in the logging call.

I guess my takeaway is that it isn't a hill I'm willing to die on so either way is fine with me 😆

Truth be told, I was just following the orders given to me by the linters! Personally I prefer using f-strings as I find them much more readable. So if there's no clear "right way" to do this then I'd love to use f-strings across the board and turn off that linter error. The one place where I could imagine an actual error coming up is when some variable is only defined contingent upon a certain loglevel being set (e.g. don't generate a JSON report to print out unless you're at DEBUG) in which case attempting to interpolate the unset variable would cause a problem. But I don't think we're actually doing this anywhere, are we?
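
The lazy-interpolation behavior discussed in this thread can be checked with a small standard-library sketch. The `Expensive` class and the logger name are hypothetical, made up only to count how often a value is converted to a string.

```python
import logging


class Expensive:
    """Hypothetical object that counts how often it is converted to a string."""

    def __init__(self):
        self.calls = 0

    def __str__(self):
        self.calls += 1
        return "expensive value"


logger = logging.getLogger("lazy_logging_demo")
logger.setLevel(logging.INFO)  # DEBUG messages are suppressed
logger.addHandler(logging.NullHandler())
logger.propagate = False

lazy = Expensive()
eager = Expensive()

# %-style arguments are only formatted if the record is actually emitted.
logger.debug("value: %s", lazy)
# The f-string is built before logger.debug() is even called.
logger.debug(f"value: {eager}")
```

After running this, `lazy.calls` is 0 and `eager.calls` is 1. Note that the lazy form only defers the string formatting: any argument expression passed to the logger is still evaluated, so skipping an expensive calculation entirely still needs an explicit check such as `logger.isEnabledFor(logging.DEBUG)`.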

TrentonBush · 2021-07-07T01:22:46Z

src/pudl/constants.py

@@ -789,6 +789,7 @@
    'BA': 'Energy Storage, Battery',
    'BT': 'Turbines Used in a Binary Cycle. Including those used for geothermal applications',
    'CA': 'Combined-Cycle -- Steam Part',
+    'CC': 'Combined-Cycle, Total Unit',


I saw this category when integrating EIA923 from I think 2004. I actually cleaned it out of the data as an error, instead of adding it here, because

it only occurred on a single plant for a single year

in subsequent years, that plant broke out into CA and CC parts

in prior years, that plant left it blank

The CC category does not appear in EIA docs

Does this category appear more often in some other source, like 860? I'm fine adding it, I just want to know if I should go back and change the 923 stuff to be consistent.

I thought that I saw it more than just one time, and that I had seen EIA documentation of it (from which I got the description) but I could be wrong. If what you're recalling is the way it really is, then you definitely did the right thing and the generators will all be given the CA/CT prime mover codes in the entity resolution process. I need to verify this.

I get 715 records when I look for all the CC generators, and they appear fairly consistently (~45 of them) in every year of data from 2004-2020:

(
    pudl_out.gens_eia860()
    .query("prime_mover_code=='CC'")
    .loc[:, [
        "report_date",
        "plant_id_eia",
        "generator_id",
        "prime_mover_code",
    ]]
)

OK, this seems to be a difference between 923 and 860. I just double checked and 'CC' is not in the prime mover code table of the 923 docs but it is in the 860 table. And 'CC' only appears on a single plant-year in 923 data (generation fuel).

I guess I'll leave everything as is? I think it was an error in the 923, but is fine in the 860.
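
The kind of per-year comparison described in this thread can be sketched as follows. The two dataframes are tiny made-up stand-ins, not real EIA data, and `code_counts_by_year()` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-ins for an 860-like and a 923-like table (made-up records).
gens_860 = pd.DataFrame({
    "report_date": pd.to_datetime(["2004-01-01", "2005-01-01", "2005-01-01"]),
    "prime_mover_code": ["CC", "CC", "CT"],
})
gf_923 = pd.DataFrame({
    "report_date": pd.to_datetime(["2004-01-01", "2005-01-01", "2005-01-01"]),
    "prime_mover_code": ["CC", "CA", "CT"],
})


def code_counts_by_year(df, code):
    """Hypothetical helper: count records carrying `code` in each report year."""
    matches = df.loc[df["prime_mover_code"] == code]
    return matches.groupby(matches["report_date"].dt.year).size()


cc_860 = code_counts_by_year(gens_860, "CC")
cc_923 = code_counts_by_year(gf_923, "CC")
```

A code that shows up in every year of one table but in only a single year of the other, as `CC` does in this toy data, is the pattern that suggests a reporting difference between the two forms rather than a one-off error.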

cmgosnell

Thanks for this @zaneselvans! It looks good logically... I asked some questions and made some minor suggestions but to me this looks generally good enough to start working with and testing heat rates with.

cmgosnell · 2021-07-09T16:42:53Z

src/pudl/output/eia860.py

+    )
+    # Split original dataframe based on row_mask, and merge in the new IDs and
+    # labels only on the subset of the dataframe matching our row_mask:
+    out_df = gens_df.loc[~row_mask].append(


it looks like you employ this same chunk of code to basically all of these unit-assigner functions. I'd make this a bb function and apply uniformly

Which chunk of code? This whole assignment of the new unit IDs? You can also select a range of lines to highlight in association with a comment.
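
For reference, one way the repeated split-merge-append step could be factored out is sketched below. This is an assumption about the shape of that step, not the code from the PR: `merge_new_ids()` and its parameters are hypothetical names, and it uses `pd.concat()` because `DataFrame.append()` is no longer available in recent pandas versions.

```python
import pandas as pd


def merge_new_ids(gens_df, new_ids, row_mask, on):
    """Hypothetical helper: merge `new_ids` into only the rows where `row_mask` is True."""
    new_cols = [col for col in new_ids.columns if col not in on]
    # Split: rows left alone vs. rows that receive new values.
    untouched = gens_df.loc[~row_mask]
    to_update = gens_df.loc[row_mask].drop(columns=new_cols, errors="ignore")
    # Merge: a many-to-one left merge keeps the row order and count of the
    # left side, so the original index can be restored afterwards.
    merged = to_update.merge(new_ids, on=on, how="left", validate="many_to_one")
    merged.index = to_update.index
    # Append: recombine and restore the original row order (needs a unique index).
    out_df = pd.concat([untouched, merged]).loc[gens_df.index]
    assert out_df.index.equals(gens_df.index)
    return out_df


gens_df = pd.DataFrame({
    "plant_id_eia": [1, 1, 2],
    "generator_id": ["A", "B", "A"],
    "unit_id_pudl": [1.0, None, None],
})
new_ids = pd.DataFrame({
    "plant_id_eia": [1, 2],
    "generator_id": ["B", "A"],
    "unit_id_pudl": [2.0, 1.0],
})
row_mask = gens_df["unit_id_pudl"].isna()
out_df = merge_new_ids(gens_df, new_ids, row_mask, on=["plant_id_eia", "generator_id"])
```

Each unit-assigner function would then only need to build its own `row_mask` and `new_ids` and hand them to the shared helper.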

cmgosnell · 2021-07-09T17:14:47Z

src/pudl/output/eia860.py

+    might be getting lumped together.
+
+    """
+    pm_cols = ["plant_id_eia", "generator_id", "prime_mover_code"]


bc moooost of the stuff (and the ideas) is shared between this function and assign_unique_combined_cycle_unit_ids(), i miiiight think about squishing them together.

IIRC, the only reason they're separate is so that we can differentiate between the two cases after the fact with different labels (since the unique ones ought to be higher confidence / granularity). But yeah I think having a single function that does all of the Combined Cycle unit ID assignments would probably be fine and less duplicative.

cmgosnell · 2021-07-09T17:17:07Z

src/pudl/output/eia860.py

+
+    This only assigns a PUDL Unit ID to generators that don't already have one,
+    and only to generators that have a consistent `fuel_type_code_pudl` across
+    all of the years of data in `gens_df`. This is a simplified fuel code that


can you add the why in here for this one? I get the idea of what is happening, but why does the consistency of the fuel type matter?

This really only applies to the Steam Turbines. That type of prime mover is just too generic -- basically everything ends up spinning a steam turbine -- and the fuel type is the only way we have to differentiate meaningfully between distinct "units" when there are several steam turbines that are part of a single plant. Does that make sense?
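
As an illustration of what "consistent across years" means here, the following sketch flags generators that report a single `fuel_type_code_pudl` in every year of a toy dataframe. The data and the `consistent_fuel` column name are made up for the example; the PR's own implementation may differ.

```python
import pandas as pd

# Toy data: ST1 burns coal in both years, ST2 switches from coal to gas.
gens = pd.DataFrame({
    "plant_id_eia": [1, 1, 1, 1],
    "generator_id": ["ST1", "ST1", "ST2", "ST2"],
    "report_date": pd.to_datetime(["2019-01-01", "2020-01-01"] * 2),
    "fuel_type_code_pudl": ["coal", "coal", "coal", "gas"],
})

# Number of distinct fuel codes each generator reports across all years.
# Note that nunique() ignores missing values by default.
n_fuels = (
    gens.groupby(["plant_id_eia", "generator_id"])["fuel_type_code_pudl"]
    .transform("nunique")
)
gens["consistent_fuel"] = n_fuels == 1
```

Only generators flagged as consistent would be candidates for fuel-based lumping; the others would need to be labeled separately so they are not left dangling.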

Integrated feedback from @cmgosnell on PR #1037, including:

* Highlight dynamic nature of PUDL Unit IDs in the data dictionary. This was really just an edit to the datapackage.json file. However, my editor decided to reformat the whole dang thing, so this is a giant diff.
* Make Unit ID assignment optional via a flag in gens_eia860() for now.
* Don't clobber label_prefix with fuel_type_code in the fuel-based steam turbine unit assignment process.
* Clarify fuel-based distinction within Steam Turbines docstring.
* Functionalize split-merge-append for adding new Unit IDs to gens_df.
* Flag inconsistent fuel ST generators with an informative label.
* Combine all Combined Cycle ID assignments into a single CC-specific function. This ended up being a pretty dramatic simplification, using df.loc[] based assignments on aligned dataframes rather than split-apply-combine with a bunch of merging. Hopefully it's also easier to read and follow along.

Another change here was that I decided to go ahead and assign orphaned CT and CA generators Unit IDs, since they're typically part of a unit that exists in many other years, and they can be excluded after the fact if need be based on their label.

I added a gens_eia860() test to the fast output tests, and have it pull the data with unit_ids=True to test the Unit ID assignment code. I also merged in the dev branch, which included a change in the generators_eia860() function, to use the column dtype standardization function from pudl.helpers. But... this caused errors in the ID assignment process -- a few hundred records get lost along the way if all those dtypes are set. So... I reverted to my simpler dtype assignment code. But this is confusing.

The PUDL Unit ID assignment process has unexpected behavior if it's operating on a dataframe with the canonical datatypes (as set by convert_cols_dtypes()). The only columns that are involved in the process are:

* plant_id_eia
* generator_id
* report_date
* unit_id_pudl
* bga_source
* fuel_type_code_pudl
* prime_mover_code

The unexpected behavior is that a few hundred rows are just... lost in the process. This is caught by the check which ensures the index of the input and output dataframes are identical in the unit ID assignment function. To work around this, I moved the convert_cols_dtypes() call to the very end of the generators_eia860() function, after the unit ID assignments have taken place. But it would be good to understand exactly why this is happening. Seems weird.
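
The input/output index check mentioned above is not shown in this conversation. A guard of that kind might look roughly like the sketch below, where `check_index_unchanged()` is a hypothetical name and the dataframe is toy data.

```python
import pandas as pd


def check_index_unchanged(in_df, out_df):
    """Hypothetical guard: raise if rows were lost, added, or reordered."""
    if not out_df.index.equals(in_df.index):
        missing = in_df.index.difference(out_df.index)
        raise ValueError(
            f"Index changed: {len(missing)} input rows are missing from the output."
        )
    return out_df


in_df = pd.DataFrame({"unit_id_pudl": [1, 2, 3]})

# An unchanged index passes through untouched.
ok = check_index_unchanged(in_df, in_df.copy())

# Dropping a row is reported instead of passing silently.
try:
    check_index_unchanged(in_df, in_df.iloc[:2])
    lost_rows_detected = False
except ValueError:
    lost_rows_detected = True
```

A check like this does not explain why rows disappear under the canonical dtypes, but it does turn silent row loss into an immediate failure at the step where it happens.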

zaneselvans added 13 commits April 8, 2021 20:53

WIP improving PUDL Unit IDs / Heat Rate calculations

cd165df

WIP Notebook characterizing Unit ID / Heat Rate issues.

3cb0ff4

Merge branch 'dev' into better-heatrates

ae0cc58

WIP: improving unit IDs and heat rate calculations

dd204a8

Remove stale and conflicting pdbpp dependency

d07e951

Merge branch 'dev' into better-heatrates

ffb1958

Assign Unit IDs to thermal generators not in the BGA

4368cb7

WIP Notebook

Add net generation allocation to WIP notebook

c717bfd

Merge branch 'dev' into better-heatrates

c59e360

Confirmed notebook still runs after merging in dev

3d583c7

WIP heat rate notebook

26e5b96

Work on steam turbine units

697733d

zaneselvans requested review from cmgosnell and TrentonBush June 19, 2021 00:57

zaneselvans changed the title ~~[WIP] Better Heatrates~~ PUDL Unit ID Assignments Jun 30, 2021

zaneselvans marked this pull request as ready for review June 30, 2021 23:21

Improve docstrings and comments in Unit ID assignment code

b42af7a

TrentonBush reviewed Jul 1, 2021

zaneselvans mentioned this pull request Jul 5, 2021

Better Heat Rates #962

Open

6 tasks

TrentonBush self-assigned this Jul 5, 2021

TrentonBush reviewed Jul 7, 2021

cmgosnell approved these changes Jul 9, 2021

Merge branch 'dev' into better-heatrates

64f6a4f

cmgosnell reviewed Jul 9, 2021

zaneselvans assigned zaneselvans and cmgosnell Jul 10, 2021

zaneselvans added 4 commits July 13, 2021 21:38

WIP scratchpad notebook for heat rates

54f3ed2

Merge in changes from dev

21beb8d

zaneselvans merged commit 5a39e6d into dev Jul 14, 2021

cmgosnell mentioned this pull request Aug 25, 2021

manage retired generators in unit ID creatation #1149

Open

cmgosnell mentioned this pull request Apr 19, 2023

Fix expansion of unit_id_pudl to be compatible with annually varying prime_mover_code #2535

Open
