-
-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PUDL Unit ID Assignments #1037
PUDL Unit ID Assignments #1037
Conversation
Ad-hoc integration of more expansive PUDL Unit ID assignments into the output functions for the EIA 860 generators. Eventually this functionality should go into the boiler generator association process, but it's going to require additional debugging and development, and we don't want to have to run the ETL over and over again to test it out. This is still a work in progress. Outstanding questions / tasks... * Move explanatory text from the prototype notebook into the docstrings. * Consider lumping vs. splitting in the steam turbine fuel-based unit assignments, as well as the "simple" per-generator unit ID assignments. * There's some code duplication between the different kinds of unit ID assignments. Is it worth trying to simplify? * Should this process be optional in the gens_eia860() output function? It takes a minute or two to run. Alternatively, can it be made much faster? * What kinds of tests or diagnostics can we use to understand whether these particular unit ID assignments are a good idea or not?
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Codecov Report
@@ Coverage Diff @@
## dev #1037 +/- ##
==========================================
+ Coverage 81.03% 81.43% +0.39%
==========================================
Files 49 49
Lines 6005 6089 +84
==========================================
+ Hits 4866 4958 +92
+ Misses 1139 1131 -8
Continue to review full report at Codecov.
|
@@ -582,8 +582,8 @@ def transform(eia860_raw_dfs, eia860_tables=pc.pudl_tables["eia860"]): | |||
# for each of the tables, run the respective transform funtction | |||
for table in eia860_transform_functions: | |||
if table in eia860_tables: | |||
logger.info(f"Transforming raw EIA 860 DataFrames for {table} " | |||
f"concatenated across all years.") | |||
logger.info("Transforming raw EIA 860 DataFrames for %s " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, is there something wrong with f-strings here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that f-strings always get interpolated, whereas giving the variables as arguments to the logger method prevents them from getting interpolated if you're not at that log level. And this can be important if you're only assigning to those variables based on the logging level (e.g. you don't bother calculating value unless you're at DEBUG) If you search for Python lazy logging I think you'll find more info https://stackoverflow.com/questions/13500813/how-to-use-modern-string-formatting-options-with-pythons-logging-module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, I didn't know that. Following your link, somebody made the argument that because f-strings are significantly faster to interpolate, the savings from avoided interpolation of suppressed log levels are partly offset by slower interpolation at higher levels, plus the cost of reduced readability. And that the absolute time is very small difference either way unless you're churning out millions of logs.
I take your point that it is a neat trick to keep in mind if table
was an expensive calculation only made in the logging call.
I guess my takeaway is that it isn't a hill I'm willing to die on so either way is fine with me 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Truth be told, I was just following the orders given to me by the linters! Personally I prefer using f-strings as I find them much more readable. So if there's no clear "right way" to do this then I'd love to use f-strings across the board and turn off that linter error. The one place where I could imagine an actual error coming up is when some variable is only defined contingent upon a certain loglevel being set (e.g. don't generate a JSON report to print out unless you're at DEBUG) in which case attempting to interpolate the unset variable would cause a problem. But I don't think we're actually doing this anywhere, are we?
@@ -789,6 +789,7 @@ | |||
'BA': 'Energy Storage, Battery', | |||
'BT': 'Turbines Used in a Binary Cycle. Including those used for geothermal applications', | |||
'CA': 'Combined-Cycle -- Steam Part', | |||
'CC': 'Combined-Cycle, Total Unit', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw this category when integrating EIA923 from I think 2004. I actually cleaned it out of the data as an error, instead of adding it here, because
- it only occurred on a single plant for a single year
- in subsequent years, that plant broke out into CA and CC parts
- in prior years, that plant left it blank
- The CC category does not appear in EIA docs
Does this category appear more often in some other source, like 860? I'm fine adding it, I just want to know if I should go back and change the 923 stuff to be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that I saw it more than just one time, and that I had seen EIA documentation of it (from which I got the description) but I could be wrong. If what you're recalling is the way it really is, then you definitely did the right thing and the generators will all be given the CA/CT prime mover codes in the entity resolution process. I need to verify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get 715 records when I look for all the CC generators, and they appear fairly consistently (~45 of them) in every year of data from 2004-2020:
(
pudl_out.gens_eia860()
.query("prime_mover_code=='CC'")
.loc[:, [
"report_date",
"plant_id_eia",
"generator_id",
"prime_mover_code",
]]
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this seems to be a difference between 923 and 860. I just double checked and 'CC' is not in the prime mover code table of the 923 docs but it is in the 860 table. And 'CC' only appears on a single plant-year in 923 data (generation fuel).
I guess I'll leave everything as is? I think it was an error in the 923, but is fine in the 860.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @zaneselvans! It looks good logically... I asked some questions and made some minor suggestions but to me this looks generally good enough to start working with and testing heat rates with.
src/pudl/output/eia860.py
Outdated
) | ||
# Split original dataframe based on row_mask, and merge in the new IDs and | ||
# labels only on the subset of the dataframe matching our row_mask: | ||
out_df = gens_df.loc[~row_mask].append( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like you employ this same chunk of code to basically all of these unit-assigner functions. I'd make this a bb function and apply uniformly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which chunk of code? This whole assignment of the new unit IDs? You can also select a range of lines to highlight in association with a comment.
src/pudl/output/eia860.py
Outdated
might be getting lumped together. | ||
|
||
""" | ||
pm_cols = ["plant_id_eia", "generator_id", "prime_mover_code"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bc moooost of the stuff (and the ideas between this function and assign_unique_combined_cycle_unit_ids()
i miiiight think about squishing them together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, the only reason they're separate is so that we can differentiate between the two cases after the fact with different labels (since the unique ones ought to be higher confidence / granularity). But yeah I think having a single function that does all of the Combined Cycle unit ID assignments would probably be fine and less duplicative.
src/pudl/output/eia860.py
Outdated
|
||
This only assigns a PUDL Unit ID to generators that don't already have one, | ||
and only to generators that have a consistent `fuel_type_code_pudl` across | ||
all of the years of data in `gens_df`. This is a simplified fuel code that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the why in here for this one? I get the idea of what is happening, but why does the consistency of the fuel type matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really only applies to the Steam Turbines. That type of prime mover is just too generic -- basically everything ends up spinning a steam turbine -- and the fuel type is the only way we have to differentiate meaningfully between distinct "units" when there are several steam turbines that are part of a single plant. Does that make sense?
Integrated feedback from @cmgosnell on PR #1037, including: * Highlight dynamic nature of PUDL Unit IDs in the data dictionary. This was really just an edit to the datapackage.json file. However, my editor decided to reformat the whole dang thing, so this is a giant diff. * Make Unit ID assignment optional via a flag in gens_eia860() for now. * Don't clobber label_prefix with fuel_type_code in the fuel-based steam turbine unit assignment process. * Clarify fuel-based distinction within Steam Turbines docstring * Functionalize split-merge-append for adding new Unit IDs to gens_df * Flag inconsistent fuel ST generators with an informative label * Combine all Combined Cycle ID assignments into a single CC-specific function. This ended up being a pretty dramatic simplification, using df.loc[] based assignments on aligned dataframes rather than split-apply-combine with a bunch of merging. Hopefully it's also easier to read and follow along. Another change here was that I decided to go ahead and assign orphaned CT and CA generators Unit IDs, since they're typically part of a unit that exists in many other years, and they can be excluded after the fact if need by based on their label.
I added a gens_eia860() test to the fast output tests, and have it pull the data with unit_ids=True to test the Unit ID assignment code. I also merged in the dev branch, which included a change in the generators_eia860() function, to use the column dtype standardization function from pudl.helpers. But... this caused errors in the ID assignment process -- a few hundred records get lost along the way if all those dtypes are set. So... I reverted to my simpler dtype assignment code. But this is confusing.
The PUDL Unit ID assignment proces has unexpected behavior if it's operating on a dataframe with the canonical datatypes (as set by convert_cols_dtypes()). The only columns that are involved in the process are: * plant_id_eia * generator_id * report_date * unit_id_pudl * bga_source * fuel_type_code_pudl * prime_mover_code The unexpected behavior is that a few hundred rows are just... lost in the process. This is caught by the check which ensures the index of the input and outputdataframes are identical in the unit ID assignment function. To work around this, I moved the convert_cols_dtypes() call to the very end of the generators_eia860() function, after the unit ID assignments have taken place. But it would be good to understand exactly why this is happening. Seems weird.
Integrate PUDL Unit ID assignment into generator output
Ad-hoc integration of more expansive PUDL Unit ID assignments into the output functions for the EIA 860 generators. Eventually this functionality should go into the boiler generator association process, but it's going to require additional debugging and development, and we don't want to have to run the ETL over and over again to test it out.
Some outstanding questions
generation_fuel_eia923
table, they'll be impossible to tell apart.Where should these Unit IDs end up eventually?
boiler_id
values as a primary key field.There's a handful of plants whose
unit_id_pudl
values do not start at 1. Not sure why this is the case. It could be that the reported Unit IDs that we start with in the BGA don't start at 1. Or it could be that there's a bug... Once you've got thegens_eia860
dataframe, this code will show them:Followup Tasks
Some things to do, based on feedback from @cmgosnell:
fuel_type_code
andlabel_prefix
are set in the fuel-based sub-prime-mover unit ID assignment process.pudl_out.gens_eia860()
function.Further investigation:
Look at the preexisting units to get a better sense of how they are structured, and whether these unit ID assignments result in similar units. E.g. are
GT
andIC
prime movers typically seen in association with other types of generators within a larger aggregated unit? Are there many pseudo-combined-cycle plants that are made up ofGT
andST
generators that (maybe) really should be labeledCT
andCA
?