
EIA Bulk Electricity Aggregates #1937

Merged: 42 commits into dev from eia_bulk_elec, Oct 18, 2022

Conversation

@TrentonBush (Member) commented on Sep 16, 2022:

This draft PR has the E and T of ETL. Metadata has not been input and some tables are awkwardly placed. I have two main questions:

  1. Where to put the association tables that define the EIA aggregates (the three global variables in the transform module)?
  2. Is the intention of this ETL to faithfully represent the EIA bulk electricity aggregates as an end unto itself, or is the intention to focus more narrowly on supporting the removal/replacement of the EIA API dependency? The latter is smaller in scope and would lead to cleaner code/integration.

The current state of this PR is a somewhat confusing mix of those two scopes. There are currently two outputs of the transform: a metadata table and a timeseries table. The metadata table provides translation between EIA's internal IDs, codes, abbreviations, and descriptions of each aggregate timeseries. The timeseries table is currently in a tidy-style format that, while convenient for analysis, obscures the connection between EIA's series IDs and the values in the dataframe. (EIA presents the cost and receipt data as separate series, each with its own 5 dimensions and 1 value. I reorganized the data as two value columns (costs and receipts) that share their 5 common indices of fuel, region, sector, frequency, and timestamp.)

If we are focused on removing the API dependency I think we can drop the metadata table entirely -- the keys are already joined into the timeseries data and we don't need to care about documenting nomenclature, abbreviations, and IDs that we won't use. If we prefer the larger scope of faithfully integrating the bulk aggregates, then we might restructure the timeseries to be more directly identifiable using EIA's "key-value pair" format.
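To make the reshaping described above concrete, here is a minimal sketch of the tidy-style restructuring using toy data; the column names and values are illustrative stand-ins, not the exact ones in this PR:

```python
import pandas as pd

# Toy long-format input: one row per (series, period) observation, with a
# "metric" column distinguishing the cost series from the receipts series.
long = pd.DataFrame({
    "fuel": ["BIT"] * 4,
    "region": ["USA"] * 4,
    "sector": ["all_sectors"] * 4,
    "frequency": ["A"] * 4,
    "date": pd.to_datetime(["2020-01-01", "2021-01-01"] * 2),
    "metric": ["cost", "cost", "receipts", "receipts"],
    "value": [1.95, 1.89, 400.0, 380.0],
})

# Pivot the metric dimension into columns so cost and receipts become two
# value columns sharing the five common indices.
tidy = long.pivot_table(
    index=["fuel", "region", "sector", "frequency", "date"],
    columns="metric",
    values="value",
).reset_index()
```

This is the format that is convenient for analysis but, as noted, loses the one-row-per-series_id correspondence with EIA's key-value layout.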


@zaneselvans (Member) commented:

On your 2nd question, for now I think that we just want the specific tidy table that we can deploy to fill in & estimate missing fuel prices in the Fuel Receipts & Costs table.

If we decide to do a more generalized transformation of all the bulk Electricity data, then I think we'll want to figure out a generalized process to deduplicate, normalize and store all of the metadata appropriately, and tie it to the data tables. But I think that's Out Of Scope™ here for sure.

@codecov

codecov bot commented Sep 17, 2022

Codecov Report

Base: 82.4% // Head: 83.0% // Increases project coverage by +0.5% 🎉

Coverage data is based on head (cc24ac5) compared to base (64f0edd).
Patch coverage: 100.0% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #1937     +/-   ##
=======================================
+ Coverage   82.4%   83.0%   +0.5%     
=======================================
  Files         64      67      +3     
  Lines       7092    7141     +49     
=======================================
+ Hits        5849    5931     +82     
+ Misses      1243    1210     -33     
Impacted Files Coverage Δ
src/pudl/metadata/classes.py 82.0% <ø> (ø)
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/ferc714.py 100.0% <ø> (ø)
src/pudl/metadata/sources.py 100.0% <ø> (ø)
src/pudl/output/pudltabl.py 88.2% <ø> (ø)
src/pudl/etl.py 89.7% <100.0%> (+0.5%) ⬆️
src/pudl/extract/eia_bulk_elec.py 100.0% <100.0%> (ø)
src/pudl/metadata/dfs.py 100.0% <100.0%> (ø)
src/pudl/metadata/resources/eia_bulk_elec.py 100.0% <100.0%> (ø)
src/pudl/output/eia923.py 98.1% <100.0%> (+17.0%) ⬆️
... and 2 more


@zaneselvans (Member) left a comment:

How do you envision integrating the time series data with the plant / FRC data for analysis purposes? Having different frequencies, sectors, regions, and fuels -- none of which correspond to data that exists in other tables directly -- seems like it's going to be messy. Having the long Case Sensitive (punctuation (laden)) strings as the values also seems messy. How do all of these columns correspond with information we already have about the plants & fuel deliveries? Are there any cases here where we can use the existing codes and categories to access this information or do we have to have multiple different messy crosswalks?

The structure of the region column seems pretty clear: state => census region => US, and it seems like it can be handled by building a state table (and we already have other information that should be grouped into that table). Frequency also seems pretty clear. But the fuel and sector information doesn't make sense to me. What are the meanings of the different sector aggregations? Can they be integrated with the sectors_consolidated_eia table directly? It seems like these categories correspond directly to the ones defined there, or to aggregations of them:

  • electric_utility
  • ipp_non_cogen
  • ipp_cogen
  • commercial_non_cogen
  • commercial_cogen
  • industrial_non_cogen
  • industrial_cogen

Why is the fuel categorization sometimes coarse and sometimes detailed and non-unique?

I guess the fuel column should be a mix of energy_source_code values and codes that correspond to aggregations of different energy_source_code values (e.g. all types of coal); likewise, the region column would be made up of state abbreviations plus codes for other geographies that aggregate different states, and so on for the sector and frequency columns. But this means it'll be impossible to relate these columns directly to other tables (e.g. with FK constraints). Maybe, conceptually, each column can refer to several columns in another table, with the various aggregations defined by the associations between columns in that table.

It seems like these conceptual aggregated foreign tables are:

  • energy_sources_eia (for the fuels)
  • sector_consolidated_eia (though I don't really understand the structure of this column), and
  • states_territories (to be compiled)

Can you provide context about the data structure in the module and helper function docstrings? As it is I had to play around with it for hours to try and understand what was going on.

I merged dev into the branch, so it can now access the archived bulk electricity data from the datastore. I tried to integrate accessing the data that way, but it broke the tests as they're designed now, so I reverted it.

# PEL is defined as "Summation of all petroleum liquids (distillate fuel oil, jet fuel, residual fuel oil, kerosene, waste oil and other petroleum liquids)"
# Missing from this list are all the "other" categories of gases: OG, BFG, SGP, SC, PG
# Those gases combine for about the same total MMBTU as DFO, about 0.2% of all reported fuel receipts
EIA_FUEL_AGGREGATE_ASSOCIATION = pd.read_csv(
zaneselvans (Member):

This looks like a couple of additional columns that should be added to the energy_sources_eia static table, which is keyed by energy_source_code, as is this one? Except that these associations appear to be non-unique. Why are there multiple different mappings for BIT, SUB, and LIG?

Was there a change in how they code the fuel types at some point? Or do the different categorizations of energy_source_code show up in different kinds of series / aggregations? It seems odd that the aggregate_fuel_code is in most cases a simplification of the energy_source_code (lumping together all petroleum liquids or grades of coal) but then in the first three rows there's no simplification. What's the context in the data?

TrentonBush (Member, Author):

The EIA bulk data contains aggregates of the more granular EIA data present in PUDL. EIA identifies these aggregates using codes unique to each aggregate (e.g. "all types of coal" = "COW"). PUDL has never seen some of these codes before because they only apply to aggregates found in the bulk data. But sometimes the aggregate is over a single fuel, so the aggregate code happens to be the same as the existing energy_source_code present in PUDL (e.g. the aggregate over all bituminous coal is identified with "BIT", which is also the energy_source_code for bituminous coal). The bulk data contains multiple levels of aggregation, so the same fuel may be present in multiple aggregates (e.g. the BIT fuel is in both the BIT aggregate and the COW aggregate).

The table EIA_FUEL_AGGREGATE_ASSOCIATION defines all these aggregates in terms of the energy_source_code values contained in that aggregate. It is a many-to-many mapping, AKA an "association table". The boiler-generator association table is another example of such a mapping. This mapping enables us to join aggregate codes onto the fuel_receipts_costs_eia923 table so we can identify which aggregates apply to each entry.

I don't think we should combine this table with the energy_sources_eia table because this table describes many-to-many relationships and thus must contain duplicates of some keys. That would make it harder to look up simple information about fuel types, which is the purpose of the energy_sources_eia table.
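A toy example of how such a many-to-many association table joins onto receipts data; the column names (fuel_agg in particular) and values here are illustrative, not the PR's exact schema:

```python
import pandas as pd

# Illustrative association table: each aggregate code maps to one or more
# energy_source_code values, so keys repeat (many-to-many).
assn = pd.DataFrame({
    "fuel_agg": ["COW", "COW", "BIT", "PEL"],
    "energy_source_code": ["BIT", "SUB", "BIT", "DFO"],
})

# Toy stand-in for fuel_receipts_costs_eia923 rows.
frc = pd.DataFrame({
    "plant_id_eia": [1, 2],
    "energy_source_code": ["BIT", "DFO"],
})

# The join fans out: the BIT delivery matches both the BIT and COW
# aggregates, because BIT belongs to multiple levels of aggregation.
merged = frc.merge(assn, on="energy_source_code", how="left")
```

The fan-out on join is exactly why merging these rows into energy_sources_eia (which needs unique keys) would be awkward.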

"""
),
)
"""Association table describing the many-to-one relationships between states and Census Regions used in EIA's aggregates."""
zaneselvans (Member):

We probably need to create a static table for information associated with states, keyed by two-letter state-and-territory abbreviation. There are already a couple of one-off dictionaries floating around that link state and state FIPS code, and state abbreviation with state name. We can add census_region and census_region_code to that table and look them up by state abbreviation (which shows up in plants_eia860 and elsewhere).

TrentonBush (Member, Author):

That sounds like a worthwhile issue in a general cleanup of our global constants!


# modified from:
# meta.groupby(["sector", "sector_code"])['description'].first().str.strip('; ').str.split(';', expand=True)
EIA_SECTOR_AGGREGATE_ASSOCIATION = pd.read_csv(
zaneselvans (Member):

Some aspect of the sector information will be (potentially) useful in the fuel price estimation process, right? We have the sectoral codes for the plants, and it looks like we'll get sectorally aggregated reported prices, so in theory there should be information there to work with. Do we need the information in this table to be able to categorize the plants appropriately in the model? Or is just having the eia_sector_code (which is already associated with the plant-year) enough?

zaneselvans (Member):

What's the difference between the pairs of records where is_aggregate is True vs. False?

How do these different categories / aggregations relate to information we already have about the plants and generators? I.e. how can we select subsets of the plants to apply these categories to?

TrentonBush (Member, Author):

I think I need to change this table -- I misunderstood how sectors were identified in plants_entity_eia. This is supposed to be a mapping between the various bulk aggregations and the sector_name_eia from plants_entity_eia. I thought sector and is_cogen were separate dimensions, but they are actually rolled into one value, e.g. "Commercial Non-CHP". I'll adjust this to reflect that.

TrentonBush (Member, Author):

We need this table to understand what the EIA bulk aggregates mean (i.e. what is included in each one).

zaneselvans (Member):

Maybe we should add another column to the sectors_consolidated_eia table that just refers to the broader categories of electric utility, commercial, industrial, and IPP, which can be used to identify the aggregated (cogen + non-cogen) categories for each of them?
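A hedged sketch of that suggestion, deriving a broad category column from hypothetical consolidated sector labels; the actual contents of sectors_consolidated_eia may differ:

```python
import pandas as pd

# Hypothetical slice of a consolidated sector table.
sectors = pd.DataFrame({
    "code": [1, 2, 3, 4],
    "label": [
        "electric_utility",
        "ipp_non_cogen",
        "ipp_cogen",
        "commercial_non_cogen",
    ],
})

# Strip the cogen/non-cogen suffix to recover the broader category that
# the (cogen + non-cogen) aggregates refer to.
sectors["broad_category"] = sectors["label"].str.replace(
    r"_(non_)?cogen$", "", regex=True
)
```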

return out


def _parse_data_column(elec_df: pd.DataFrame) -> pd.DataFrame:
zaneselvans (Member):

Can you explain more in the docstring about what the structure of the data column is that you're parsing? Iterating over the index in the dataframe and picking out particular values to do things with seems pretty weird.

TrentonBush (Member, Author):

Ya, I'll add some documentation. The raw data is nested JSON with two levels. So the result of the initial pd.read_json() is a dataframe of metadata plus one column called data that contains JSON strings with the actual timeseries for that aggregate. I could probably make a function to pass to df.apply() if the explicit looping is bothersome, but it would do the same thing: parse each timeseries and concatenate them together.
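A self-contained sketch of that row-wise parse, with a toy two-series frame standing in for the real pd.read_json() output; the series IDs and the [period, value] observation format here are assumptions for illustration:

```python
import json
import pandas as pd

# Toy intermediate frame: metadata columns plus a "data" column holding
# each series' observations as a JSON string of [period, value] pairs.
elec_df = pd.DataFrame({
    "series_id": ["ELEC.EXAMPLE.A", "ELEC.EXAMPLE.B"],
    "data": [
        json.dumps([["2020", 1.5], ["2021", 1.6]]),
        json.dumps([["2020", 3.0]]),
    ],
})

def parse_data_column(df: pd.DataFrame) -> pd.DataFrame:
    """Expand each row's nested timeseries into (series_id, date, value) rows."""
    out = []
    for idx in df.index:
        # Parse this row's JSON string into its own small dataframe...
        obs = pd.DataFrame(
            json.loads(df.at[idx, "data"]), columns=["date", "value"]
        )
        # ...tag it with its parent series, and collect for concatenation.
        obs["series_id"] = df.at[idx, "series_id"]
        out.append(obs)
    return pd.concat(out, ignore_index=True)[["series_id", "date", "value"]]

timeseries = parse_data_column(elec_df)
```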

src/pudl/extract/eia_bulk_elec.py (resolved)
), "Conflicting information between 'frequency_code' and 'f'."
# keep frequency_code for reference
return {
"f",
zaneselvans (Member):

Is this really supposed to always return the same thing? Why does it just have a single letter as a name? In most of the transform functions the first thing we do is rename the columns to be legible, and to take on their final name if possible (insofar as the name is consistent with the context).

TrentonBush (Member, Author):

Yes, it's always the same. Each series is defined by 5 dimensions (metric, fuel, region, sector, frequency) and for some reason EIA only gives frequency its own column (called "f"). The rest are all encoded in string fields (series_id and name). This function checks that the frequency I parse out from those string fields matches EIA's "f" column, then removes that column because it is no longer needed. The parsed columns have more legible names ("frequency" and "frequency_code").
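A minimal sketch of that consistency check, using toy column contents:

```python
import pandas as pd

# Toy frame with the frequency parsed from the string fields
# ("frequency_code") alongside EIA's raw single-letter "f" column.
df = pd.DataFrame({
    "series_id": ["s1", "s2"],
    "frequency_code": ["A", "M"],
    "f": ["A", "M"],
})

# Verify the parsed frequency agrees with EIA's own column, then drop
# the now-redundant "f" column.
assert df["frequency_code"].eq(df["f"]).all(), (
    "Conflicting information between 'frequency_code' and 'f'."
)
df = df.drop(columns="f")
```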

return out.loc[:, ["series_id", "date", "value"]] # reorder cols


def extract(raw_zipfile) -> dict[str, pd.DataFrame]:
zaneselvans (Member):

The extract() functions generally take a Datastore or the means to create one as their input, plus a settings object that says what to extract. With a Datastore named ds you can do something like

zipfile_path = zipfile.Path(ds.get_zipfile_resource("eia_bulk_elec"), at="ELEC.txt")
with zipfile_path.open() as fp:
    filtered = _filter_and_read_to_dataframe(fp)

test/unit/extract/eia_bulk_elec_test.py (outdated, resolved)

def test__parse_data_column(elec_txt_dataframe):
"""Convert a pd.Series of python lists to a dataframe."""
input_ = elec_txt_dataframe.iloc[[2, 4], :] # only annual series for easier testing
zaneselvans (Member):

Mysterious numerical indices are fragile. There's no indication in the test dataframe that ordering matters.

TrentonBush (Member, Author):

Here I'm subsetting the test fixture for easier testing. I'm picking out the two annual series because they are low resolution and thus less work to manually parse and write out the expected values. Maybe I should change my fixture setup so that this is easier to express?

zaneselvans (Member):

If there were some way to explicitly select the annual records based on their contents, that would be clean. Or if there were a couple of variables / constants containing the different series (e.g. a monthly one and an annual one) that could be addressed explicitly based on their frequency.
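For example, selecting the annual series by content rather than by positional index; the fixture columns and IDs here are toy stand-ins:

```python
import pandas as pd

# Toy fixture: one monthly and one annual series.
fixture = pd.DataFrame({
    "series_id": ["ELEC.EXAMPLE.M", "ELEC.EXAMPLE.A"],
    "f": ["M", "A"],
})

# Content-based selection survives reordering of the fixture rows,
# unlike positional indexing such as .iloc[[2, 4], :].
annual = fixture[fixture["f"] == "A"]
```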

@@ -0,0 +1,298 @@
"""Clean and normalize EIA bulk electricity data."""
zaneselvans (Member):

Can you describe the structure of the data we're starting with here in the module docstring? It's different from any other source and clearly takes some wrangling to get into a reasonable tabular form.

src/pudl/etl.py (outdated, resolved)
@zaneselvans (Member) left a comment:

Mostly questions and naming convention comments.

src/pudl/etl.py (two outdated, resolved threads)
src/pudl/extract/eia_bulk_elec.py (resolved)


def _filter_for_fuel_receipts_costs_series(df: pd.DataFrame) -> pd.DataFrame:
"""Pick out the desired data series."""
zaneselvans (Member):

Can you add a paragraph to this docstring explaining / giving an example of the data structure that you are working with and how the function selects a subset of it?

Comment on lines 78 to 90
def _extract(raw_zipfile) -> dict[str, pd.DataFrame]:
"""Extract metadata and timeseries from raw EIA bulk electricity data.

Args:
raw_zipfile (file-like): Path or other file-like object that can be read by pd.read_json()

Returns:
dict[str, pd.DataFrame]: dictionary of dataframes with keys 'metadata' and 'timeseries'
"""
filtered = _filter_and_read_to_dataframe(raw_zipfile)
timeseries = _parse_data_column(filtered)
metadata = filtered.drop(columns="data")
return {"metadata": metadata, "timeseries": timeseries}
zaneselvans (Member):

Why have one extract function wrapping another extract function when they're each only a couple of lines long?

TrentonBush (Member, Author):

_extract works directly on the zipfile, but I needed to handle the datastore IO abstraction. I thought a wrapper would give better separation of concerns than merging them together, even though they are small.

src/pudl/transform/eia_bulk_elec.py (outdated, resolved)
test/unit/transform/eia_bulk_elec_test.py (resolved)

def _parse_data_column(elec_df: pd.DataFrame) -> pd.DataFrame:
out = []
for idx in elec_df.index:
zaneselvans (Member):

Is this converting every individual row into its own dataframe, and then parsing just that row? Is that not annoyingly slow? Could it be broken into 3 large dataframes instead, one for each type of date, that are each processed on their own and then concatenated together?

TrentonBush (Member, Author):

It's a nested data structure; a dataframe where one column contains JSON strings that each need conversion to dataframes. I think that has to be traversed row-wise? I guess I could use df.iteritems() or df.apply() instead, or totally rewrite the extract around pd.json_normalize() but I don't think it would make much performance difference. By far the biggest time suck of this ETL is simply the initial decompression and deserialization of the big zip file (90 seconds on my old desktop).

zaneselvans (Member):

Okay that makes sense. I think some explanation of the data structures in the docstrings of these functions (even when they are _private) would help make it clearer why it's being done this way.

"""Extract metadata and timeseries from raw EIA bulk electricity data.

Args:
raw_zipfile (file-like): Path or other file-like object that can be read by pd.read_json()
zaneselvans (Member):

Does this really take anything that's file-like? The hard-coded compression="zip" in _filter_and_read_to_dataframe() seems like it would be specific to a particular kind of input.

TrentonBush (Member, Author):

file-like in the sense that it can accept string paths, Pathlib objects, BytesIO, etc. I copied that language from the pd.read_json() docs.

zaneselvans (Member):

It failed on me when I gave it a zipfile.Path pointing at the ELEC.txt file inside the zipfile. Maybe just clarify that it has to be pointing at a ZIP compressed file.

src/pudl/extract/eia_bulk_elec.py (resolved)
@zaneselvans (Member) left a comment:

A couple more docstring comments. I opened #1958 describing the new state/province static table, which I'm happy to build if you like.

src/pudl/extract/eia_bulk_elec.py (outdated, resolved)
Comment on lines +61 to +72
"1": "electric_utility",
"2": "ipp_non_cogen",
"3": "ipp_cogen",
"4": "commercial_non_cogen",
"5": "commercial_cogen",
"6": "industrial_non_cogen",
"7": "industrial_cogen",
"94": "all_ipp",
"96": "all_commercial",
"97": "all_industrial",
"98": "all_electric_power", # all_IPP + regulated utilities
"99": "all_sectors",
zaneselvans (Member):

Why is it appropriate for these integers to be treated as strings?

TrentonBush (Member, Author):

Those integers are parsed from a longer ID string. Immediately following that parsing, they are mapped to the string names in this dictionary. That's why I didn't convert dtypes in between.
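A sketch of that flow, with a hypothetical series_id layout (the real EIA ID format may differ in detail):

```python
# Hypothetical ID layout: ELEC.<metric>.<fuel>-<region>-<sector>.<freq>
series_id = "ELEC.COST.BIT-US-98.M"

# Subset of the sector-code-to-name mapping from the PR.
SECTOR_NAMES = {
    "1": "electric_utility",
    "98": "all_electric_power",
    "99": "all_sectors",
}

# The sector code is sliced out of the ID as a string and immediately
# mapped to a name, so there is never a reason to convert its dtype.
sector_code = series_id.split(".")[2].split("-")[2]
sector_name = SECTOR_NAMES[sector_code]
```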

zaneselvans (Member):

Okay so they don't show up anywhere later programmatically.

src/pudl/transform/eia_bulk_elec.py (two outdated, resolved threads)
@zaneselvans (Member) commented:

It looks like the docs build is failing due to some formatting error in eia_bulk_elec.py

@zaneselvans (Member) left a comment:

Shouldn't we be loading the new static DFs defining the association tables into the database as we are with all the others?

Shouldn't the state info dataframe you'd previously created be supplanted by the data in the political_subdivisions table?

@@ -234,6 +234,135 @@
utility plant (Account 108).
"""

EIA_SECTOR_AGGREGATE_ASSN = pd.read_csv(
zaneselvans (Member):

Should we not load these static dfs / association tables into the database?

src/pudl/metadata/dfs.py (outdated, resolved)
@zaneselvans added labels on Oct 18, 2022: eia923 (anything having to do with EIA Form 923) and analysis (data analysis tasks that involve actually using PUDL to figure things out, like calculating MCOE).
@zaneselvans linked an issue on Oct 18, 2022 that may be closed by this pull request.
@zaneselvans zaneselvans merged commit 1710421 into dev Oct 18, 2022
@zaneselvans zaneselvans deleted the eia_bulk_elec branch October 18, 2022 21:44
Successfully merging this pull request may close these issues.

Create ETL functions for stored EIA API fuel price data
2 participants