Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PUDL Unit ID Assignments #1037

Merged
merged 20 commits into from
Jul 14, 2021
Merged

PUDL Unit ID Assignments #1037

merged 20 commits into from
Jul 14, 2021

Conversation

zaneselvans
Copy link
Member

@zaneselvans zaneselvans commented Jun 19, 2021

Integrate PUDL Unit ID assignment into generator output

Ad-hoc integration of more expansive PUDL Unit ID assignments into the output functions for the EIA 860 generators. Eventually this functionality should go into the boiler generator association process, but it's going to require additional debugging and development, and we don't want to have to run the ETL over and over again to test it out.

Some outstanding questions

  • Is the way I'm doing lumping vs. splitting in the steam turbine fuel-based unit assignments, and the "simple" per-generator unit ID assignments sensible? In the simple generators every generator gets its own Unit ID. In the steam turbines, generators with the same (consistent across years) fuel types are lumped together right now. In "reality" it looks like these generators are probably independent. In the context of the generation_fuel_eia923 table, they'll be impossible to tell apart.
  • There's some code duplication between the different kinds of unit ID assignments. Is it worth trying to simplify / consolidate?
  • Should this process be optional in the gens_eia860() output function? It takes a minute or two to run. Or can it be made much faster?
  • What kinds of tests or diagnostics can we use to understand whether these particular unit ID assignments are a good idea or not?
  • Should we fill in pseudo-boiler IDs for the units that we've created, like the latter years of CCNG plants do?
  • Should we back/forward fill the technology descriptions and prime mover codes? Esp. in older yeras?
  • Does every boiler that we know of (in the boiler entity table) mapped to generators in the BGA table? Or are there some orphaned, unassociated boilers?

Where should these Unit IDs end up eventually?

  • Should it be several different well normalized tables defining different kinds of Unit IDs? As it is right now they can't go into the boiler-generator association table, since it requires non-null boiler_id values as a primary key field.
  • Should it be a single denormalized table that's only available in the outputs?

There's a handful of plants whose unit_id_pudl values do not start at 1. Not sure why this is the case. It could be that the reported Unit IDs that we start with in the BGA don't start at 1. Or it could be that there's a bug... Once you've got the gens_eia860 dataframe, this code will show them:

(
    gens_eia860
    .groupby("plant_id_eia")["unit_id_pudl"].min()
    .reset_index()
    .rename(columns={"unit_id_pudl": "min_unit_id_pudl"})
    .fillna(0).astype(int)
    .query("min_unit_id_pudl > 1")
)

Followup Tasks

Some things to do, based on feedback from @cmgosnell:

  • Highlight the dynamic nature of Unit IDs in data dictionary so folks don't hard-code them in manual processes.
  • More clearly document the reasoning behind fuel-specific Unit ID assignment in the steam turbine ID assignments
  • Raise an exception if both fuel_type_code and label_prefix are set in the fuel-based sub-prime-mover unit ID assignment process.
  • Make the Unit ID assignment process optional via a flag in the pudl_out.gens_eia860() function.
  • Functionalize the split-merge-append process that takes place at the end of most of the ID assignment functions
  • Explicitly flag ST generators that do not have consistent fuel types across all years so they aren't left dangling (similar to labeling of the orphan CT/CA generators)
  • Combine all the Combined Cycle Unit ID assignments into a single CC specific function. The abstraction as it exists right now is artificial.

Further investigation:

Look at the preexisting units to get a better sense of how they are structured, and whether these unit ID assignments result in similar units. E.g. are GT and IC prime movers typically seen in association with other types of generators within a larger aggregated unit? Are there many pseudo-combined-cycle plants that are made up of GT and ST generators that (maybe) really should be labeled CT and CA?

Ad-hoc integration of more expansive PUDL Unit ID assignments into the
output functions for the EIA 860 generators. Eventually this
functionality should go into the boiler generator association process,
but it's going to require additional debugging and development, and we
don't want to have to run the ETL over and over again to test it out.

This is still a work in progress. Outstanding questions / tasks...
* Move explanatory text from the prototype notebook into the docstrings.
* Consider lumping vs. splitting in the steam turbine fuel-based unit
  assignments, as well as the "simple" per-generator unit ID
  assignments.
* There's some code duplication between the different kinds of unit ID
  assignments. Is it worth trying to simplify?
* Should this process be optional in the gens_eia860() output function?
  It takes a minute or two to run. Alternatively, can it be made much
  faster?
* What kinds of tests or diagnostics can we use to understand whether
  these particular unit ID assignments are a good idea or not?
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@codecov
Copy link

codecov bot commented Jun 19, 2021

Codecov Report

Merging #1037 (d681046) into dev (25b1905) will increase coverage by 0.39%.
The diff coverage is 94.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev    #1037      +/-   ##
==========================================
+ Coverage   81.03%   81.43%   +0.39%     
==========================================
  Files          49       49              
  Lines        6005     6089      +84     
==========================================
+ Hits         4866     4958      +92     
+ Misses       1139     1131       -8     
Impacted Files Coverage Δ
src/pudl/analysis/mcoe.py 95.00% <ø> (ø)
src/pudl/constants.py 100.00% <ø> (ø)
src/pudl/output/eia860.py 97.45% <93.98%> (-2.55%) ⬇️
src/pudl/helpers.py 92.26% <100.00%> (ø)
src/pudl/output/eia923.py 98.24% <100.00%> (+0.02%) ⬆️
src/pudl/output/pudltabl.py 68.67% <100.00%> (ø)
src/pudl/transform/eia860.py 96.15% <100.00%> (ø)
src/pudl/analysis/timeseries_cleaning.py 88.62% <0.00%> (+2.84%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 25b1905...d681046. Read the comment docs.

@zaneselvans zaneselvans changed the title [WIP] Better Heatrates PUDL Unit ID Assignments Jun 30, 2021
@zaneselvans zaneselvans marked this pull request as ready for review June 30, 2021 23:21
@@ -582,8 +582,8 @@ def transform(eia860_raw_dfs, eia860_tables=pc.pudl_tables["eia860"]):
# for each of the tables, run the respective transform funtction
for table in eia860_transform_functions:
if table in eia860_tables:
logger.info(f"Transforming raw EIA 860 DataFrames for {table} "
f"concatenated across all years.")
logger.info("Transforming raw EIA 860 DataFrames for %s "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, is there something wrong with f-strings here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that f-strings always get interpolated, whereas giving the variables as arguments to the logger method prevents them from getting interpolated if you're not at that log level. And this can be important if you're only assigning to those variables based on the logging level (e.g. you don't bother calculating value unless you're at DEBUG) If you search for Python lazy logging I think you'll find more info https://stackoverflow.com/questions/13500813/how-to-use-modern-string-formatting-options-with-pythons-logging-module

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, I didn't know that. Following your link, somebody made the argument that because f-strings are significantly faster to interpolate, the savings from avoided interpolation of suppressed log levels are partly offset by slower interpolation at higher levels, plus the cost of reduced readability. And that the absolute time is very small difference either way unless you're churning out millions of logs.

I take your point that it is a neat trick to keep in mind if table was an expensive calculation only made in the logging call.

I guess my takeaway is that it isn't a hill I'm willing to die on so either way is fine with me 😆

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truth be told, I was just following the orders given to me by the linters! Personally I prefer using f-strings as I find them much more readable. So if there's no clear "right way" to do this then I'd love to use f-strings across the board and turn off that linter error. The one place where I could imagine an actual error coming up is when some variable is only defined contingent upon a certain loglevel being set (e.g. don't generate a JSON report to print out unless you're at DEBUG) in which case attempting to interpolate the unset variable would cause a problem. But I don't think we're actually doing this anywhere, are we?

@zaneselvans zaneselvans mentioned this pull request Jul 5, 2021
6 tasks
@TrentonBush TrentonBush self-assigned this Jul 5, 2021
@@ -789,6 +789,7 @@
'BA': 'Energy Storage, Battery',
'BT': 'Turbines Used in a Binary Cycle. Including those used for geothermal applications',
'CA': 'Combined-Cycle -- Steam Part',
'CC': 'Combined-Cycle, Total Unit',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw this category when integrating EIA923 from I think 2004. I actually cleaned it out of the data as an error, instead of adding it here, because

  • it only occurred on a single plant for a single year
  • in subsequent years, that plant broke out into CA and CC parts
  • in prior years, that plant left it blank
  • The CC category does not appear in EIA docs

Does this category appear more often in some other source, like 860? I'm fine adding it, I just want to know if I should go back and change the 923 stuff to be consistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that I saw it more than just one time, and that I had seen EIA documentation of it (from which I got the description) but I could be wrong. If what you're recalling is the way it really is, then you definitely did the right thing and the generators will all be given the CA/CT prime mover codes in the entity resolution process. I need to verify this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get 715 records when I look for all the CC generators, and they appear fairly consistently (~45 of them) in every year of data from 2004-2020:

(
    pudl_out.gens_eia860()
    .query("prime_mover_code=='CC'")
    .loc[:, [
        "report_date",
        "plant_id_eia",
        "generator_id",
        "prime_mover_code",
    ]]
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this seems to be a difference between 923 and 860. I just double checked and 'CC' is not in the prime mover code table of the 923 docs but it is in the 860 table. And 'CC' only appears on a single plant-year in 923 data (generation fuel).

I guess I'll leave everything as is? I think it was an error in the 923, but is fine in the 860.

Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @zaneselvans! It looks good logically... I asked some questions and made some minor suggestions but to me this looks generally good enough to start working with and testing heat rates with.

src/pudl/helpers.py Show resolved Hide resolved
src/pudl/output/eia860.py Show resolved Hide resolved
src/pudl/output/eia860.py Outdated Show resolved Hide resolved
src/pudl/output/eia860.py Show resolved Hide resolved
src/pudl/output/eia860.py Show resolved Hide resolved
)
# Split original dataframe based on row_mask, and merge in the new IDs and
# labels only on the subset of the dataframe matching our row_mask:
out_df = gens_df.loc[~row_mask].append(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like you employ this same chunk of code to basically all of these unit-assigner functions. I'd make this a bb function and apply uniformly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which chunk of code? This whole assignment of the new unit IDs? You can also select a range of lines to highlight in association with a comment.

src/pudl/output/eia923.py Show resolved Hide resolved
might be getting lumped together.

"""
pm_cols = ["plant_id_eia", "generator_id", "prime_mover_code"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bc moooost of the stuff (and the ideas between this function and assign_unique_combined_cycle_unit_ids() i miiiight think about squishing them together.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, the only reason they're separate is so that we can differentiate between the two cases after the fact with different labels (since the unique ones ought to be higher confidence / granularity). But yeah I think having a single function that does all of the Combined Cycle unit ID assignments would probably be fine and less duplicative.


This only assigns a PUDL Unit ID to generators that don't already have one,
and only to generators that have a consistent `fuel_type_code_pudl` across
all of the years of data in `gens_df`. This is a simplified fuel code that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the why in here for this one? I get the idea of what is happening, but why does the consistency of the fuel type matter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really only applies to the Steam Turbines. That type of prime mover is just too generic -- basically everything ends up spinning a steam turbine -- and the fuel type is the only way we have to differentiate meaningfully between distinct "units" when there are several steam turbines that are part of a single plant. Does that make sense?

test/validate/eia860_test.py Show resolved Hide resolved
src/pudl/output/eia860.py Outdated Show resolved Hide resolved
Integrated feedback from @cmgosnell on PR #1037, including:

* Highlight dynamic nature of PUDL Unit IDs in the data dictionary. This
  was really just an edit to the datapackage.json file. However, my
  editor decided to reformat the whole dang thing, so this is a giant
  diff.
* Make Unit ID assignment optional via a flag in gens_eia860() for now.
* Don't clobber label_prefix with fuel_type_code in the fuel-based steam
  turbine unit assignment process.
* Clarify fuel-based distinction within Steam Turbines docstring
* Functionalize split-merge-append for adding new Unit IDs to gens_df
* Flag inconsistent fuel ST generators with an informative label
* Combine all Combined Cycle ID assignments into a single CC-specific
  function. This ended up being a pretty dramatic simplification, using
  df.loc[] based assignments on aligned dataframes rather than
  split-apply-combine with a bunch of merging. Hopefully it's also
  easier to read and follow along. Another change here was that I
  decided to go ahead and assign orphaned CT and CA generators Unit IDs,
  since they're typically part of a unit that exists in many other
  years, and they can be excluded after the fact if need by based on
  their label.
I added a gens_eia860() test to the fast output tests, and have it pull
the data with unit_ids=True to test the Unit ID assignment code.

I also merged in the dev branch, which included a change in the
generators_eia860() function, to use the column dtype standardization
function from pudl.helpers. But... this caused errors in the ID
assignment process -- a few hundred records get lost along the way if
all those dtypes are set. So... I reverted to my simpler dtype
assignment code. But this is confusing.
The PUDL Unit ID assignment proces has unexpected behavior if it's operating on a
dataframe with the canonical datatypes (as set by convert_cols_dtypes()). The only
columns that are involved in the process are:

* plant_id_eia
* generator_id
* report_date
* unit_id_pudl
* bga_source
* fuel_type_code_pudl
* prime_mover_code

The unexpected behavior is that a few hundred rows are just... lost in the process. This
is caught by the check which ensures the index of the input and outputdataframes are
identical in the unit ID assignment function.

To work around this, I moved the convert_cols_dtypes() call to the very end of the
generators_eia860() function, after the unit ID assignments have taken place. But it
would be good to understand exactly why this is happening. Seems weird.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants