EIA Bulk Electricity Aggregates #1937
Conversation
On your 2nd question, for now I think that we just want the specific tidy table that we can deploy to fill in & estimate missing fuel prices in the Fuel Receipts & Costs table. If we decide to do a more generalized transformation of all the bulk Electricity data, then I think we'll want to figure out a generalized process to deduplicate, normalize and store all of the metadata appropriately, and tie it to the data tables. But I think that's Out Of Scope™ here for sure.
Codecov Report: Base: 82.4% // Head: 83.0% // Increases project coverage by +0.5%.
Additional details and impacted files:
@@ Coverage Diff @@
## dev #1937 +/- ##
=======================================
+ Coverage 82.4% 83.0% +0.5%
=======================================
Files 64 67 +3
Lines 7092 7141 +49
=======================================
+ Hits 5849 5931 +82
+ Misses 1243 1210 -33
How do you envision integrating the time series data with the plant / FRC data for analysis purposes? Having different frequencies, sectors, regions, and fuels -- none of which correspond to data that exists in other tables directly -- seems like it's going to be messy. Having the long, case-sensitive, punctuation-laden strings as the values also seems messy. How do all of these columns correspond with information we already have about the plants & fuel deliveries? Are there any cases here where we can use the existing codes and categories to access this information, or do we have to have multiple different messy crosswalks?
The structure of the region column seems pretty clear: state => census region => US, and it seems like it can be handled by building a state table (and we already have other information that should be grouped into that table). Frequency also seems pretty clear. But the fuel and sector information doesn't make sense to me. What are the meanings of the different sector aggregations? Can they be integrated with the sectors_consolidated_eia table directly? It seems like these categories correspond directly to the ones defined there, or to aggregations of them:
electric_utility
ipp_non_cogen
ipp_cogen
commercial_non_cogen
commercial_cogen
industrial_non_cogen
industrial_cogen
Why is the fuel categorization sometimes coarse and sometimes detailed and non-unique?
I guess the fuel column should be a mix of energy_source_code and codes that correspond to aggregations of different energy_source_code values (e.g. all types of coal), and the region column would be made up of state abbreviations and codes for other geographies that are aggregations of different states, and so on with the sector and frequency columns. But this means it'll be impossible to relate these columns directly to other tables (e.g. with FK constraints). But maybe more conceptually each column can refer to several columns in another table that define various aggregations through the associations that show up between columns in that table.
It seems like these conceptual aggregated foreign tables are:
energy_sources_eia (for the fuels)
sector_consolidated_eia (though I don't really understand the structure of this column)
states_territories (to be compiled)
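For illustration, a minimal sketch of how the state-keyed lookup table proposed above could resolve the mixed region codes. The column names and census region codes here are assumptions, not the actual PUDL schema:

```python
import pandas as pd

# Hypothetical states_territories table keyed by two-letter abbreviation;
# the census_region values and codes are illustrative, not exhaustive.
states = pd.DataFrame(
    {
        "state": ["AL", "CA", "NY"],
        "census_region": ["East South Central", "Pacific", "Middle Atlantic"],
        "census_region_code": ["ESC", "PCC", "MAT"],
    }
)


def states_in_region(region_code: str) -> list[str]:
    """Resolve a bulk-data region code to its member states."""
    if region_code == "US":
        return states["state"].tolist()
    if region_code in set(states["state"]):
        return [region_code]  # a plain state abbreviation maps to itself
    return states.loc[states["census_region_code"] == region_code, "state"].tolist()
```

The same lookup then covers all three levels of the state => census region => US hierarchy without a separate crosswalk per level.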
Can you provide context about the data structure in the module and helper function docstrings? As it is I had to play around with it for hours to try and understand what was going on.
I merged dev into the branch, so it can now access the archived bulk electricity data from the datastore. I tried to integrate accessing the data that way but it broke the tests as they're currently designed, so I reverted it.
src/pudl/transform/eia_bulk_elec.py
Outdated
# PEL is defined as "Summation of all petroleum liquids (distillate fuel oil, jet fuel, residual fuel oil, kerosene, waste oil and other petroleum liquids)"
# Missing from this list are all the "other" categories of gases: OG, BFG, SGP, SC, PG
# Those gases combine for about the same total MMBTU as DFO, about 0.2% of all reported fuel receipts
EIA_FUEL_AGGREGATE_ASSOCIATION = pd.read_csv(
This looks like another couple of additional columns that should be added to the energy_sources_eia static table, which is keyed by energy_source_code, as is this one? Except that these associations appear to be non-unique. Why are there multiple different mappings for BIT, SUB, and LIG?
Was there a change in how they code the fuel types at some point? Or do the different categorizations of energy_source_code show up in different kinds of series / aggregations? It seems odd that the aggregate_fuel_code is in most cases a simplification of the energy_source_code (lumping together all petroleum liquids or grades of coal) but then in the first three rows there's no simplification. What's the context in the data?
The EIA bulk data contains aggregates of the more granular EIA data present in PUDL. EIA identifies these aggregates using codes unique to each aggregate (eg "all types of coal" = "COW"). PUDL has never seen some of these codes before because they only apply to aggregates found in the bulk data. But sometimes the aggregate is over a single fuel, so the aggregate code happens to be the same as the existing energy_source_code present in PUDL (eg the aggregate over all bituminous coal is identified with "BIT", which is also the energy_source_code for bituminous coal). The bulk data contains multiple levels of aggregation, so the same fuel may be present in multiple aggregates (eg the BIT fuel is in both the BIT aggregate and the COW aggregate).
The table EIA_FUEL_AGGREGATE_ASSOCIATION defines all these aggregates in terms of the energy_source_code values contained in that aggregate. It is a many-to-many mapping, AKA an "association table". The boiler-generator association table is another example of such a mapping. This mapping enables us to join aggregate codes onto the fuel_receipts_costs_eia923 table so we can identify which aggregates apply to each entry.
I don't think we should combine this table with the energy_sources_eia table because this table describes many-to-many relationships and thus must contain duplicates of some keys. That would make it harder to look up simple information about fuel types, which is the purpose of the energy_sources_eia table.
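A minimal sketch of that join, with illustrative rows (not the full association table) and a hypothetical slice of delivery records:

```python
import pandas as pd

# Many-to-many fuel association table: one row per (aggregate, member)
# pair, so member codes like BIT repeat across aggregation levels.
fuel_assn = pd.DataFrame(
    {
        "aggregate_fuel_code": ["BIT", "COW", "COW", "COW", "PEL", "PEL"],
        "energy_source_code": ["BIT", "BIT", "SUB", "LIG", "DFO", "RFO"],
    }
)

# Hypothetical slice of fuel-delivery records keyed by energy_source_code.
frc = pd.DataFrame({"plant_id": [1, 2], "energy_source_code": ["BIT", "DFO"]})

# Joining tags each delivery with every aggregate that contains its fuel,
# so the BIT delivery matches both the BIT and COW aggregates.
tagged = frc.merge(fuel_assn, on="energy_source_code", how="left")
```

The duplicated keys that make this join work are exactly what would break energy_sources_eia as a simple one-row-per-fuel lookup.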
src/pudl/transform/eia_bulk_elec.py
Outdated
""" | ||
), | ||
) | ||
"""Association table describing the many-to-one relationships between states and Census Regions used in EIA's aggregates.""" |
We probably need to create a static table for information associated with states, keyed by two-letter state-and-territory abbreviation. There's already a couple of one-off dictionaries floating around that link state and state FIPS code, and state abbreviation with state name. We can add census_region and census_region_code to that table and look them up by state abbreviation (which shows up in plants_eia860 and elsewhere).
That sounds like a worthwhile issue in a general cleanup of our global constants!
src/pudl/transform/eia_bulk_elec.py
Outdated
# modified from:
# meta.groupby(["sector", "sector_code"])['description'].first().str.strip('; ').str.split(';', expand=True)
EIA_SECTOR_AGGREGATE_ASSOCIATION = pd.read_csv(
Some aspect of the sector information will be (potentially) useful in the fuel price estimation process, right? We have the sectoral codes for the plants, and it looks like we'll get sectorally aggregated reported prices, so in theory there should be information there to work with. Do we need the information in this table to be able to categorize the plants appropriately in the model? Or is just having the eia_sector_code (which is already associated with the plant-year) enough?
What's the difference between the pairs of records where is_aggregate is True vs. False?
How do these different categories / aggregations relate to information we already have about the plants and generators? I.e. how can we select subsets of the plants to apply these categories to?
I think I need to change this table -- I misunderstood how sectors were identified in plants_entity_eia. This is supposed to be a mapping between the various bulk aggregations and the sector_name_eia from plants_entity_eia. I thought sector and is_cogen were separate dimensions, but they are actually rolled into one value eg "Commercial Non-CHP". I'll adjust this to reflect that.
We need this table to understand what the EIA bulk aggregates mean (i.e., what is included in each one).
Maybe we should add another column to the sectors_consolidated_eia table that just refers to the broader categories of electric utility, commercial, industrial, and IPP, which can be used to identify the aggregated (cogen + non-cogen) categories for each of them?
return out


def _parse_data_column(elec_df: pd.DataFrame) -> pd.DataFrame:
Can you explain more in the docstring about what the structure of the data column is that you're parsing? Iterating over the index in the dataframe and picking out particular values to do things with seems pretty weird.
Ya, I'll add some documentation. The raw data is nested JSON with two levels. So the result of the initial pd.read_json() is a dataframe of metadata plus one column called data that contains JSON strings with the actual timeseries for that aggregate. I could probably make a function to pass to df.apply() if the explicit looping is bothersome, but it would do the same thing: parse each timeseries and concatenate them together.
src/pudl/transform/eia_bulk_elec.py
Outdated
), "Conflicting information between 'frequency_code' and 'f'." | ||
# keep frequency_code for reference | ||
return { | ||
"f", |
Is this really supposed to always return the same thing? Why does it just have a single letter as a name? In most of the transform functions the first thing we do is rename the columns to be legible, and to take on their final name if possible (insofar as the name is consistent with the context)
Yes, it's always the same. Each series is defined by 5 dimensions (metric, fuel, region, sector, frequency) and for some reason the EIA only gives frequency its own column (called "f"). The rest are all encoded in string fields (series_id and name). This function checks that the frequency I parse out from those string fields matches the EIA's "f" column, then removes that column because it is no longer needed. The parsed columns have more legible names ("frequency" and "frequency_code").
src/pudl/extract/eia_bulk_elec.py
Outdated
return out.loc[:, ["series_id", "date", "value"]]  # reorder cols


def extract(raw_zipfile) -> dict[str, pd.DataFrame]:
The extract() functions generally take a Datastore or the means to create one as their input, plus a settings object that says what to extract. With a Datastore named ds you can do something like
zipfile_path = zipfile.Path(ds.get_zipfile_resource("eia_bulk_elec"), at="ELEC.txt")
with zipfile_path.open() as fp:
filtered = _filter_and_read_to_dataframe(fp)
def test__parse_data_column(elec_txt_dataframe):
    """Convert a pd.Series of python lists to a dataframe."""
    input_ = elec_txt_dataframe.iloc[[2, 4], :]  # only annual series for easier testing
Mysterious numerical indices are fragile. There's no indication in the test dataframe that ordering matters.
Here I'm subsetting the test fixture for easier testing. I'm picking out the two annual series because they are low resolution and thus less work to manually parse and write out the expected values. Maybe I should change my fixture setup so that this is easier to express?
If there were some way to explicitly select the annual records based on their contents that would be clean. Or if there were a couple of variables / constants containing the different series (e.g. a monthly one and an annual one) that could be addressed explicitly based on their frequency.
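One way to express that content-based selection; the fixture columns here are assumptions, with "A" marking annual series following EIA's "f" codes:

```python
import pandas as pd

# Stand-in for the test fixture: one series per frequency.
fixture = pd.DataFrame(
    {"series_id": ["s_monthly", "s_quarterly", "s_annual"], "f": ["M", "Q", "A"]}
)

# Select the annual series by content, not by positional index, so the
# test no longer depends on row ordering in the fixture.
annual = fixture[fixture["f"] == "A"]
```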
src/pudl/transform/eia_bulk_elec.py
Outdated
@@ -0,0 +1,298 @@
"""Clean and normalize EIA bulk electricity data."""
Can you describe the structure of the data we're starting with here in the module docstring? It's different from any other source and clearly takes some wrangling to get into a reasonable tabular form.
Mostly questions and naming convention comments.
src/pudl/extract/eia_bulk_elec.py
Outdated
def _filter_for_fuel_receipts_costs_series(df: pd.DataFrame) -> pd.DataFrame:
    """Pick out the desired data series."""
Can you add a paragraph to this docstring explaining / giving an example of the data structure that you are working with and how the function selects a subset of it?
src/pudl/extract/eia_bulk_elec.py
Outdated
def _extract(raw_zipfile) -> dict[str, pd.DataFrame]:
    """Extract metadata and timeseries from raw EIA bulk electricity data.

    Args:
        raw_zipfile (file-like): Path or other file-like object that can be read by pd.read_json()

    Returns:
        dict[str, pd.DataFrame]: dictionary of dataframes with keys 'metadata' and 'timeseries'
    """
    filtered = _filter_and_read_to_dataframe(raw_zipfile)
    timeseries = _parse_data_column(filtered)
    metadata = filtered.drop(columns="data")
    return {"metadata": metadata, "timeseries": timeseries}
Why have one extract function wrapping another extract function when they're each only a couple of lines long?
_extract works directly on the zipfile, but I needed to handle the datastore IO abstraction. I thought a wrapper would give better separation of concerns than merging them together, even though they are small.
def _parse_data_column(elec_df: pd.DataFrame) -> pd.DataFrame:
    out = []
    for idx in elec_df.index:
Is this converting every individual row into its own dataframe, and then parsing just that row? Is that not annoyingly slow? Could it be broken into 3 large dataframes instead, one for each type of date, that are each processed on their own and then concatenated together?
It's a nested data structure; a dataframe where one column contains JSON strings that each need conversion to dataframes. I think that has to be traversed row-wise? I guess I could use df.iteritems() or df.apply() instead, or totally rewrite the extract around pd.json_normalize(), but I don't think it would make much performance difference. By far the biggest time suck of this ETL is simply the initial decompression and deserialization of the big zip file (90 seconds on my old desktop).
Okay that makes sense. I think some explanation of the data structures in the docstrings of these functions (even when they are _private) would help make it clearer why it's being done this way.
src/pudl/extract/eia_bulk_elec.py
Outdated
"""Extract metadata and timeseries from raw EIA bulk electricity data. | ||
|
||
Args: | ||
raw_zipfile (file-like): Path or other file-like object that can be read by pd.read_json() |
Does this really take anything that's file-like? The hard-coded compression="zip" in _filter_and_read_to_dataframe() seems like it would be specific to a particular kind of input.
file-like in the sense that it can accept string paths, Pathlib objects, BytesIO, etc. I copied that language from the pd.read_json() docs.
It failed on me when I gave it a zipfile.Path pointing at the ELEC.txt file inside the zipfile. Maybe just clarify that it has to be pointing at a ZIP compressed file.
A couple more docstring comments. I made #1958 describing the new state/province static table, which I'm happy to make if you like.
"1": "electric_utility", | ||
"2": "ipp_non_cogen", | ||
"3": "ipp_cogen", | ||
"4": "commercial_non_cogen", | ||
"5": "commercial_cogen", | ||
"6": "industrial_non_cogen", | ||
"7": "industrial_cogen", | ||
"94": "all_ipp", | ||
"96": "all_commercial", | ||
"97": "all_industrial", | ||
"98": "all_electric_power", # all_IPP + regulated utilities | ||
"99": "all_sectors", |
Why is it appropriate for these integers to be treated as strings?
Those integers are parsed from a longer ID string. Immediately following that parsing, they are mapped to the string names in this dictionary. That's why I didn't convert dtypes in between.
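A hypothetical sketch of that flow: the sector number is sliced out of a longer ID string and immediately mapped to its name, so it never needs an integer dtype. The series_id layout (ELEC.&lt;metric&gt;.&lt;fuel&gt;-&lt;region&gt;-&lt;sector&gt;.&lt;freq&gt;) is an assumption for illustration, and only a few sector codes are shown:

```python
import re

# Subset of the sector-code-to-name mapping above.
SECTOR_NAMES = {"1": "electric_utility", "94": "all_ipp", "99": "all_sectors"}


def sector_from_series_id(series_id: str) -> str:
    """Slice the sector code out of the ID string and map it to a name."""
    match = re.match(r"ELEC\.\w+\.\w+-\w+-(\d+)\.\w", series_id)
    return SECTOR_NAMES[match.group(1)]
```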
Okay so they don't show up anywhere later programmatically.
It looks like the docs build is failing due to some formatting error in eia_bulk_elec.py
Shouldn't we be loading the new static DFs defining the association tables into the database as we are with all the others?
Shouldn't the state info dataframe you'd previously created be supplanted by the data in the political_subdivisions table?
@@ -234,6 +234,135 @@
utility plant (Account 108).
"""

EIA_SECTOR_AGGREGATE_ASSN = pd.read_csv(
Should we not load these static dfs / association tables into the database?
EIA API Replacement
This draft PR has the E and T of ETL. Metadata has not been input and some tables are awkwardly placed. I have two main questions:
The current state of this PR is a somewhat confusing mix of those two scopes. There are currently two outputs of the transform: a metadata table and a timeseries table. The metadata table provides translation between EIA's internal IDs, codes, abbreviations, and descriptions of each aggregate timeseries. The timeseries table is currently in a tidy-style format that, while convenient for analysis, obscures the connection between EIA's series IDs and the values in the dataframe. (The EIA present the cost and receipt data as separate series, each with their own 5 dimensions and 1 value. I reorganized the data as two value columns (costs and receipts) that share their 5 common indices of fuel, region, sector, frequency, and timestamp.)
If we are focused on removing the API dependency I think we can drop the metadata table entirely -- the keys are already joined into the timeseries data and we don't need to care about documenting nomenclature, abbreviations, and IDs that we won't use. If we prefer the larger scope of faithfully integrating the bulk aggregates, then we might restructure the timeseries to be more directly identifiable using EIA's "key-value pair" format.
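The tidy reshaping described above can be sketched as a pivot on the five shared dimensions; the column and metric names here are assumptions, with made-up values:

```python
import pandas as pd

# Long form: cost and receipts arrive as separate series sharing the same
# five dimensions (fuel, region, sector, frequency, timestamp).
long = pd.DataFrame(
    {
        "fuel": ["COW", "COW"],
        "region": ["US", "US"],
        "sector": ["99", "99"],
        "frequency": ["A", "A"],
        "date": ["2020", "2020"],
        "metric": ["cost_per_mmbtu", "receipts_mmbtu"],
        "value": [1.9, 500.0],
    }
)

# Pivoting yields one row per aggregate with two value columns.
tidy = long.pivot(
    index=["fuel", "region", "sector", "frequency", "date"],
    columns="metric",
    values="value",
).reset_index()
```

Reversing this pivot would recover something closer to EIA's "key-value pair" format if the larger scope is preferred.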