Skip to content

Latest commit

 

History

History
1031 lines (918 loc) · 57.7 KB

release_notes.rst

File metadata and controls

1031 lines (918 loc) · 57.7 KB

PUDL Release Notes

v2023.XX.XX

Dagster Adoption

  • After comparing comparing python orchestration tools 1487, we decided to adopt Dagster. Dagster will allow us to parallize the ETL, persist datafarmes at any step in the data cleaning process, visualize data depedencies and run subsets of the ETL from upstream caches.
  • We are converting PUDL code to use dagster concepts in two phases. The first phase converts the ETL portion of the code base to use software defined assets 1570. The second phase converts the output and analysis tables in the pudl.output.pudltabl.PudlTabl class to use software defined assets, replacing the existing pudl_out output functions.
  • General changes:
    • pudl.etl is now a subpackage that collects all pudl assets into a dagster Definition.
    • The pudl_settings, Datastore and DatasetSettings are now dagster resources. See pudl.resources.
    • The pudl_etl and ferc_to_sqlite commands no longer support loading specific tables. The commands run all of the tables. Use dagster assets to run subsets of the tables.
    • The --clobber argument has been removed from the pudl_etl command.
    • New static method pudl.metadata.classes.Package.get_etl_group_tables returns the resources ids for a given etl group.
    • pudl.settings.FercToSqliteSettings class now loads all FERC datasources if no datasets are specified.
  • EIA ETL changes:
    • EIA extract methods are now @multi_asset that return an asset for each raw table. 860 and 923 are separate @multi_asset which allows this data to be extracted in parallel.
    • The EIA table level cleaning functions are now dagster assets. The table level cleaning assets now have a "clean_" prefix and a "_{datasource}" suffix to distinguish them from the final harvested tables.
    • pudl.transform.eia.transform() is now a @multi_asset that depends on all of the EIA table level cleaning functions / assets.
  • EPA CEMS ETL changes:
    • pudl.transform.epacems.transform() now loads the epacamd_eia and plants_entity_eia tables as dataframes using the pudl.io_manager.pudl_sqlite_io_manager instead of reading the tables using a pudl_engine.
    • Adds a Ohio plant that is in 2021 CEMS but missing from EIA since 2018 to the additional_epacems_plants.csv sheet.
  • FERC ETL changes:
    • pudl.extract.ferc1.dbf2sqlite() and pudl.extract.xbrl.xbrl2sqlite() are now configurable dagster ops. These ops make up the ferc_to_sqlite dagster graph in pudl.ferc_to_sqlite.defs.
    • FERC 714 extraction methods are now subsettable by year, with 2019 and 2020 data included in the etl_fast.yml by default. See 2628 and PR 2649.

Data Coverage

  • Updated data_sources/eia860 to include data as of 2022-09.
  • New epacamd_eia crosswalk version v0.3, see issue 2317 and PR 2316. EPA's updates add manual matches and exclusions focusing on operating units with a generator ID as of 2018.
  • New PUDL tables from data_sources/ferc1, integrating older DBF and newer XBRL data. See 1574 for an overview of our progress integrating FERC's XBRL data. To see which DBF and XBRL tables the following PUDL tables are derived from, refer to :pypudl.extract.ferc1.TABLE_NAME_MAP
    • electric_energy_sources_ferc1, see issue 1819 & PR 2094.
    • electric_energy_dispositions_ferc1, see issue 1819 & PR 2100.
    • transmission_statistics_ferc1, see issue 1822 & PR 2103
    • utility_plant_summary_ferc1, see issue 1806 & PR 2105.
    • balance_sheet_assets_ferc1, see issue 1805 & PRs 2112,2127.
    • balance_sheet_liabilities_ferc1, see issue 1810 & PR 2134.
    • depreciation_amortization_summary_ferc1, see issue 1816 & PR 2143.
    • income_statement_ferc1, see issue 1813 & PR 2147.
    • electric_plant_depreciation_changes_ferc1 see issue 1808 & 2119.
    • electric_plant_depreciation_functional_ferc1 see issue 1808 & PR 2183
    • electric_operating_expenses_ferc1, see issue 1817 & PR 2162.
    • retained_earnings_ferc1, see issue 1811 & PR 2155.
    • cash_flow_ferc1, see issue 1821 & PR 2184
    • electricity_sales_by_rate_schedule_ferc1, see issue 1823 & PR 2205
  • New PUDL tables from data_sources/eia860:
    • emissions_control_equipment_eia860, see issue 2338 & PR 2561.
    • denorm_emissions_control_equipment_eia860, see issue 2338 & PR 2561.
    • boiler_emissions_control_equipment_assn_eia860, see 2338 & PR 2561.
    • boiler_cooling_assn_eia860, see 2586 & PR 2587
    • boiler_stack_flue_assn_eia860, see 2586 & PR 2587
  • The boilers_eia860 table now includes annual boiler attributes from data_sources/eia860 Schedule 6.2 Environmental Equipment data, and the new boilers_entity_eia table now includes static boiler attributes. See issue 1162 & PR 2319.
  • All data_sources/eia861 tables are now being loaded into the PUDL DB, rather than only being available via an ad-hoc ETL process that was only accessible through the pudl.output.pudltabl.PudlTabl class. Note that most of these tables have not been normalized, and the utility_id_eia and balancing_authority_id_eia values in them haven't been harvested, so these tables have very few valid foreign key relationships with the rest of the database right now -- but at least the data is available in the database! Existing methods for accessing these tables have been preserved. The PudlTabl methods just read directly from the DB and apply uniform data types, rather than actually doing the ETL. See 2265 & 2403. The newly accessible tables contain data from 2001-2021 and include:
    • advanced_metering_infrastructure_eia861
    • balancing_authority_eia861
    • balancing_authority_assn_eia861
    • demand_response_eia861
    • demand_response_water_heater_eia861
    • demand_side_management_sales_eia861
    • demand_side_management_ee_dr_eia861
    • demand_side_management_misc_eia861
    • distributed_generation_tech_eia861
    • distributed_generation_fuel_eia861
    • distributed_generation_misc_eia861
    • distribution_systems_eia861
    • dynamic_pricing_eia861
    • energy_efficiency_eia861
    • green_pricing_eia861
    • mergers_eia861
    • net_metering_customer_fuel_class_eia861
    • net_metering_misc_eia861
    • non_net_metering_customer_fuel_class_eia861
    • non_net_metering_misc_eia861
    • operational_data_revenue_eia861
    • operational_data_misc_eia861
    • reliability_eia861
    • sales_eia861
    • service_territory_eia861
    • utility_assn_eia861
    • utility_data_nerc_eia861
    • utility_data_rto_eia861
    • utility_data_misc_eia861
  • A couple of tables from data_sources/ferc714 have been added to the PUDL DB. These tables contain data from 2006-2020 (2021 is distributed by FERC in XBRL format and we have not yet integrated it). See 2266 & 2421. The newly accessible tables include:
    • respondent_id_ferc714 (linking FERC-714 respondents to EIA utilities)
    • demand_hourly_pa_ferc714 (hourly electricity demand by planning area)
  • Added new table epacamd_eia_subplant_ids, which aguments the epacamd_eia glue table. This table incorporates all generators_entity_eia and all hourly_emissions_epacems ID's and uses these complete IDs to develop a full-coverage subplant_id column which granularly connects EPA CAMD with EIA. Thanks to grgmiller for his contribution to this process. See 2456 & 2491.
  • Thanks to contributions from rousik we've generalized the code we use to convert FERC's old annual Visual FoxPro databases into multi-year SQLite databases.
    • We have started extracting the FERC Form 2 (natual gas utility financial reports). See issues 1984,2642 and PRs 2536,2564,2652. We haven't yet done any integration of the Form 2 into the cleaned and normalized PUDL DB, but the converted FERC Form 2 is available on Datasette covering 1996-2020. Earlier years (1991-1995) were distributed using a different binary format and we don't currently have plans to extract them. From 2021 onward we are extracting the FERC 2 from XBRL.
    • Similarly 2595 converts the earlier years of FERC Form 6 (2000-2020) from DBF to SQLite, describing the finances of oil pipeline companies. When the nightly builds succeed, FERC Form 6 will be available on Datasette as well.

Data Cleaning

  • Removed inconsistently reported leading zeroes from numeric boiler_id values. This affected a small number of records in any table referring to boilers, including boilers_entity_eia, boilers_eia860, boiler_fuel_eia923, boiler_generator_assn_eia860 and the epacamd_eia crosswalk. It also had some minor downstream effects on the MCOE outputs. See 2366 and 2367.
  • The boiler_fuel_eia923 table now includes the prime_mover_code column. This column was previously incorrectly being associated with boilers in the boilers_entity_eia table. See issue 2349 & PR 2362.
  • Fixed column naming issues in the electric_operating_revenues_ferc1 table.
  • Made minor calculation fixes in the metadata for income_statement_ferc1, utility_plant_summary_ferc1, electric_operating_revenues_ferc1, balance_sheet_assets_ferc1, balance_sheet_liabilities_ferc1, and electric_operating_expenses_ferc1. See 2016 and 2563.
  • Changed the retained_earnings_ferc1 table transform to restore factoids for previous year balances, and added calculation metadata. See 1811, 2016, and 2645.
  • Added "correction" records to many FERC Form 1 tables where the reported totals do not match the outcomes of calculations specified in XBRL metadata (even after cleaning up the often incorrect calculation specifications!). See 2957 and 2620.
  • Flip the sign of some erroneous negative values in the plant_in_service_ferc1 and utility_plant_summary_ferc1 tables. See 2599, and 2647.

Analysis

  • Added a method for attributing fuel consumption reported on the basis of boiler ID and fuel to individual generators, analogous to the existing method for attributing net generation reported on the basis of prime mover & fuel. This should allow much more complete estimates of generator heat rates and thus fuel costs and emissions. Thanks to grgmiller for his contribution, which was integrated by cmgosnell! See PRs 1096,1608 and issues 1468,1478.
  • Integrated pudl.analysis.ferc1_eia from our RMI collaboration repo, which uses logistic regression to match FERC1 plants data to EIA 860 records. While far from perfect, this baseline model utilizes the manually created training data and plant IDs to perform record linkage on the FERC1 data and EIA plant parts list created in pudl.analysis.plant_parts_eia. See issue 1064 & PR 2224. To account for 1:m matches in the manual data, we added plant_match_ferc1 as a plant part in pudl.analysis.plant_parts_eia.
  • Refined how we are associating generation and fuel data in pudl.analysis.allocate_net_gen. Energy source codes that show up in the generation_fuel_eia923 or the boiler_fuel_eia923 are now added into the generators_eia860 table so associating those gf and bf records are more cleanly associated with generators. Thanks to grgmiller for his contribution, which was integrated by cmgosnell! See PRs 2235,2446.

Deprecations

  • Replace references to deprecated pudl-scrapers and pudl-zenodo-datastore repositories with references to pudl-archiver repository in intro, dev/datastore, and dev/annual_updates. See 2190.
  • pudl.etl is now a subpackage that collects all pudl assets into a dagster Definition. All pudl.etl._etl_{datasource} functions have been deprecated. The coordination of ETL steps is being handled by dagster.
  • The pudl.load module has been removed in favor of using the pudl.io_managers.pudl_sqlite_io_manager.
  • The pudl_etl and ferc_to_sqlite commands no longer support loading specific tables. The commands run all of the tables. Use dagster assets to run subsets of the tables.
  • The --clobber argument has been removed from the pudl_etl command.
  • pudl.transform.eia860.transform() and pudl.transform.eia923.transform() functions have been deprecated. The table level EIA cleaning funtions are now coordinated using dagster.
  • The pudl.convert.epacems_to_parquet command now executes the hourly_emissions_epacems asset as a dagster job. The —partition option is no longer supported. Now only creates a directory of parquet files for each year/state partition.
  • pudl.transform.ferc1.transform() has been removed. The ferc1 table

    transformations are now being orchestrated with Dagster.

  • pudl.transform.ferc1.transform can no longer be executed as a script. Use dagit to execute just the FERC Form 1 pipeline.
  • pudl.extract.ferc1.extract_dbf, pudl.extract.ferc1.extract_xbrl pudl.extract.ferc1.extract_xbrl_single, pudl.extract.ferc1.extract_dbf_single, pudl.extract.ferc1.extract_xbrl_generic, pudl.extract.ferc1.extract_dbf_generic have all been deprecated. The extraction logic is now covered by the pudl.io_managers.ferc1_xbrl_sqlite_io_manager and pudl.io_managers.ferc1_dbf_sqlite_io_manager IO Managers.
  • pudl.extract.ferc1.extract_xbrl_metadata has been replaced by the pudl.extract.ferc1.xbrl_metadata_json asset.
  • All sub classes of pudl.settings.GenericDatasetSettings in pudl.settings no longer have table attributes because the ETL no longer supports loading specific tables via settings. Use dagster to select subsets of tables to process.

Miscellaneous

  • Updated PUDL to use Python 3.11. See 2408 & 2383
  • Apply start and end dates to ferc1 data in pudl.output.pudltabl.PudlTabl. See 2238 & 274.
  • Add generic spot fix method to transform process, to manually rescue FERC1 records. See 2254 & 1980.
  • Reverted a fix made in 1909, which mapped all plants located in NY state that reported a balancing authority code of "ISONE" to "NYISO". These plants now retain their original EIA codes. Plants with manual re-mapping of BA codes have also been fixed to have correctly updated BA names. See 2312 and 2255.

v2022.11.30

Data Coverage

  • Added archives of the bulk EIA electricity API data to our datastore, since the API itself is too unreliable for production use. This is part of 1763. The code for this new data is eia_bulk_elec and the data comes as a single 200MB zipped JSON file. 1922 updates the datastore to include this archive on Zenodo but most of the work happened in the pudl-scrapers and pudl-zenodo-storage repositories. See issue catalyst-cooperative/pudl-zenodo-storage#29.
  • Incorporated 2021 data from the data_sources/epacems dataset. See 1778
  • Incorporated Final Release 2021 data from the data_sources/eia860, data_sources/eia861, and data_sources/eia923. We also integrated a data_maturity column and related data_maturities table into most of the EIA data tables in order to alter users to the level of finality of the data. See 1834,1855,1915,1921.
  • Incorporated 2022 data from the data_sources/eia860 monthly update from September 2022. See 2079. A June 2022 eia860m update included adding new energy_storage_capacity_mwh (for batteries) and net_capacity_mwdc (for behind-the-meter solar PV) attributes to the generators_eia860 table, as they appear in the data_sources/eia860 monthly updates for 2022. See 1834.
  • Added new datasources table, which includes partitions used to generate the database. See 2079.
  • Integrated several new columns into the EIA 860 and EIA 923 including several codes with coding tables (See data_dictionaries/codes_and_labels). 1836
  • Added the EPACAMD-EIA Crosswalk to the database. Previously, the crosswalk was a csv stored in package_data/glue, but now it has its own scraper https://github.com/catalyst-cooperative/pudl-scrapers/pull/20, archiver, https://github.com/catalyst-cooperative/pudl-zenodo-storage/pull/20 and place in the PUDL db. For now there's a epacamd_eia output table you can use to merge CEMS and EIA data yourself 1692. Eventually we'll work these crosswalk values into an output table combining CEMS and EIA.
  • Integrated 2021 from the data_sources/ferc1 data. FERC updated its reporting format for 2021 from a DBF file to a XBRL files. This required a major overhaul of the extract and transform step. The updates were accumulated in 1665. The raw XBRL data is being extracted through a FERC XBRL Extractor. This work is ongoing with additional tasks being tracked in 1574. Specific updates in this release include:
    • Convert XBRL into raw sqlite database 1831
    • Build transformer infrastructure & Add fuel_ferc1 table 1721
    • Map utility XBRL and DBF utility IDs 1931
    • Add plants_steam_ferc1 table 1881
    • Add plants_hydro_ferc1 1992
    • Add plants_pumped_storage_ferc1 2005
    • Add purchased_power_ferc1 2011
    • Add plants_small_ferc1 table 2035
    • Add plant_in_service_ferc1 table 2025 & 2058
  • Added all of the SQLite databases which we build from FERC's raw XBRL filings to our Datasette deployment. See 2095 & 2080. Browse the published data here:

Data Analysis

  • Instead of relying on the EIA API to fill in redacted fuel prices with aggregate values for individual states and plants, use the archived eia_bulk_elec data. This means we no longer have any reliance on the API, which should make the fuel price filling faster and more reliable. Coverage is still only about 90%. See 1764 and 1998. Additional filling with aggregate and/or imputed values is still on the workplan. You can follow the progress in 1708.

Nightly Data Builds

  • We added infrastructure to run the entire ETL and all tests nightly so we can catch data errors when they are merged into dev. This allows us to automatically update the PUDL Intake data catalogs when there are new code releases. See 1177 for more details.
  • Created a docker image that installs PUDL and it's depedencies. The build-deploy-pudl.yaml GitHub Action builds and pushes the image to Docker Hub and deploys the image on a Google Compute Engine instance. The ETL outputs are then loaded to Google Cloud buckets for the data catalogs to access.
  • Added GoogleCloudStorageCache support to ferc1_to_sqlite and censusdp1tract_to_sqlite commands and pytest.
  • Allow users to create monolithic and partitioned EPA CEMS outputs without having to clobber or move any existing CEMS outputs.
  • GoogleCloudStorageCache now supports accessing requester pays buckets.
  • Added a --loglevel arg to the package entrypoint commands.

Database Schema Changes

  • After learning that generators' prime movers do very occasionally change over time, we recategorized the prime_mover_code column in our entity resolution process to enable the rare but real variability over time. We moved the prime_mover_code column from the statically harvested/normalized data column to an annually harvested data column (i.e. from generators_entity_eia to generators_eia860) 1600. See 1585 for more details.
  • Created operational_status_eia into our static metadata tables (See data_dictionaries/codes_and_labels). Used these standard codes and code fixes to clean operational_status_code in the generators_entity_eia table. 1624
  • Moved a number of slowly changing plant attributes from the plants_entity_eia table to the annual plants_eia860 table. See 1748 and 1749. This was initially inspired by the desire to more accurately reproduce the aggregated fuel prices which are available in the EIA's API. Along with state, census region, month, year, and fuel type, those prices are broken down by industrial sector. Previously sector_id_eia (an aggregation of several primary_purpose_naics_id values) had been assumed to be static over a plant's lifetime, when in fact it can change if e.g. a plant is sold to an IPP by a regulated utility. Other plant attributes which are now allowed to vary annually include:
    • balancing_authority_code_eia
    • balancing_authority_name_eia
    • ferc_cogen_status
    • ferc_exempt_wholesale_generator
    • ferc_small_power_producer
    • grid_voltage_1_kv
    • grid_voltage_2_kv
    • grid_voltage_3_kv
    • iso_rto_code
    • primary_purpose_id_naics
  • Renamed grid_voltage_kv to grid_voltage_1_kv in the plants_eia860 table, to follow the pattern of many other multiply reported values.
  • Added a balancing_authorities_eia coding table mapping BA codes found in the data_sources/eia860 and data_sources/eia923 to their names, cleaning up non-standard codes, and fixing some reporting errors for PACW vs. PACE (PacifiCorp West vs. East) based on the state associated with the plant reporting the code. Also added backfilling for codes in years before 2013 when BA Codes first started being reported, but only in the output tables. See: 1906,1911
  • Renamed and removed some columns in the data_sources/epacems dataset. unitid was changed to emissions_unit_id_epa to clarify the type of unit it represents. unit_id_epa was removed because it is a unique identifyer for emissions_unit_id_epa and not otherwise useful or transferable to other datasets. facility_id was removed because it is specific to EPA's internal database and does not aid in connection with other data. 1692
  • Added a new table political_subdivisions which consolidated various bits of information about states, territories, provinces etc. that had previously been scattered across constants stored in the codebase. The ownership_eia860 table had a mix of state and country information stored in the same column, and to retain all of it we added a new owner_country_code column. 1966

Data Accuracy

  • Retain NA values for data_sources/epacems fields gross_load_mw and heat_content_mmbtu. Previously, these fields converted NA to 0, but this is not accurate, so we removed this step.
  • Update the plant_id_eia field from data_sources/epacems with values from the newly integrated epacamd_eia crosswalk as not all EPA's ORISPL codes are correct.

Helper Function Updates

  • Replaced the PUDL helper function clean_merge_asof that merged two dataframes reported on different temporal granularities, for example monthly vs yearly data. The reworked function, pudl.helpers.date_merge, is more encapsulating and faster and replaces clean_merge_asof in the MCOE table and EIA 923 tables. See 1103,1550
  • The helper function pudl.helpers.expand_timeseries was also added, which expands a dataframe to include a full timeseries of data at a certain frequency. The coordinating function pudl.helpers.full_timeseries_date_merge first calls pudl.helpers.date_merge to merge two dataframes of different temporal granularities, and then calls pudl.helpers.expand_timeseries to expand the merged dataframe to a full timeseries. The added timeseries_fillin argument, makes this function optionally used to generate the MCOE table that includes a full monthly timeseries even in years when annually reported generators don't have matching monthly data. See 1550
  • Updated the fix_leading_zero_gen_ids fuction by changing the name to remove_leading_zeros_from_numeric_strings because it's used to fix more than just the generator_id column. Included a new argument to specify which column you'd like to fix.

Plant Parts List Module Changes

  • We refactored a couple components of the Plant Parts List module in preparation for the next round of entity matching of EIA and FERC Form 1 records with the Panda model developed by the Chu Data Lab at Georgia Tech, through work funded by a CCAI Innovation Grant. The labeling of different aggregations of EIA generators as the true granularity was sped up, resulting in faster generation of the final plant parts list. In addition, the generation of the installation_year column in the plant parts list was fixed and a construction_year column was also added. Finally, operating_year was added as a level that the EIA generators are now aggregated to.
  • The mega generators table and in turn the plant parts list requires the MCOE table to generate. The MCOE table is now created with the new pudl.helpers.date_merge helper function (described above). As a result, now by default only columns from the EIA 860 generators table that are necessary for the creation of the plant parts list will be included in the MCOE table. This list of columns is defined by the global pudl.analysis.mcoe.DEFAULT_GENS_COLS. If additional columns that are not part of the default list are needed from the EIA 860 generators table, these columns can be passed in with the gens_cols argument. See 1550
  • For memory efficiency, appropriate columns are now cast to string and categorical types when the full plant parts list is created. The resource and field metadata is now included in the PUDL metadata. See 1865
  • For clarity and specificity, the plant_name_new column was renamed plant_name_ppe and the ownership column was renamed ownership_record_type. See 1865
  • The PLANT_PARTS_ORDERED list was removed and PLANT_PARTS is now an OrderedDict that establishes the plant parts hierarchy in its keys. All references to PLANT_PARTS_ORDERED were replaced with the PLANT_PARTS keys. See 1865

Metadata

  • Used the data source metadata class added in release 0.6.0 to dynamically generate the data source documentation (See data_sources/index). 1532
  • The EIA plant parts list was added to the resource and field metadata. This is the first output table to be included in the metadata. See 1865

Documentation

  • Fixed broken links in the documentation since the Air Markets Program Data (AMPD) changed to Clean Air Markets Data (CAMD).
  • Added graphics and clearer descriptions of EPA data and reporting requirements to the data_sources/epacems page. Also included information about the epacamd_eia crosswalk.

Bug Fixes

  • Dask v2022.4.2 introduced breaking changes into dask.dataframe.read_parquet. However, we didn't catch this when it happened because it's only a problem when there's more than one row-group. Now we're processing 2019-2020 data for both ID and ME (two of the smallest states) in the tests. Also restricted the allowed Dask versions in our setup.py so that we get notified by the dependabot any time even a minor update. happens to any of the packages we depend on that use calendar versioning. See 1618.
  • Fixed a testing bug where the partitioned EPA CEMS outputs generated using parallel processing were getting output in the same output directory as the real ETL, which should never happen. See 1618.
  • Changed the way fixes to the EIA-861 balancing authority names and IDs are applied, so that they still work when only some years of data are being processed. See 1671 and 828.

Dependencies / Environment

  • In conjunction with getting the dependabot set up to merge its own PRs if CI passes, we tightened the version constraints on a lot of our dependencies. This should reduce the frequency with which we get surprised by changes breaking things after release. See 1655
  • We've switched to using mambaforge to manage our environments internally, and are recommending that users use it as well.
  • We're moving toward treating PUDL like an application rather than a library, and part of that is no longer trying to be compatible with a wide range of versions of our dependencies, instead focusing on a single reproducible environment that is associated with each release, using lockfiles, etc. See 1669
  • As an "application" PUDL is now only supporting the most recent major version of Python (curently 3.10). We used pyupgrade and pep585-upgrade to update the syntax of to use Python 3.10 norms, and are now using those packages as pre-commit hooks as well. See 1685

0.6.0 (2022-03-11)

Data Coverage

  • data_sources/eia860 monthly updates (eia860m) up to the end of 2021. 1510

New Analyses

  • For the purposes of linking EIA and FERC Form 1 records, we (mostly cmgosnell) have created a new output called the Plant Parts List in pudl.analysis.plant_parts_eia which combines many different sub-parts of the EIA generators based on their fuel type, prime movers, ownership, etc. This allows a huge range of hypothiecally possible FERC Form 1 plant records to be synthesized, so that we can identify exactly what data in EIA should be associated with what data in FERC using a variety of record linkage & entity matching techniques. This is still a work in progress, both with our partners at RMI, and in collaboration with the Chu Data Lab at Georgia Tech, through work funded by a CCAI Innovation Grant. 1157

Metadata

  • Column data types for our database and Apache Parquet outputs, as well as pandas dataframes are all based on the same underlying schemas, and should be much more consistent. 1370,1377,1408
  • Defined a data source metadata class pudl.metadata.classes.DataSource using Pydantic to store information and procedures specific to each data source (e.g. data_sources/ferc1, data_sources/eia923). 1446
  • Use the data source metadata classes to automatically export rich metadata for use with our Datasette deployement. 1479
  • Use the data source metadata classes to store rich metadata for use with our Zenodo raw data archives so that information is no longer duplicated and liable to get out of sync. 1475
  • Added static tables and metadata structures that store definitions and additional information related to the many coded categorical columns in the database. These tables are exported directly into the documentation (See data_dictionaries/codes_and_labels). The metadata structures also document all of the non-standard values that we've identified in the raw data, and the standard codes that they are mapped to. 1388
  • As a result of all these metadata improvements we were finally able to close 52 and delete the pudl.constants junk-drawer module... after 5 years.

Data Cleaning

  • Fixed a few inaccurately hand-mapped PUDL Plant & Utility IDs. 1458,1480
  • We are now using the coding table metadata mentioned above and the foreign key relationships that are part of the database schema to automatically recode any column that refers to the codes defined in the coding table. This results in much more uniformity across the whole database, especially in the EIA energy_source_code columns. 1416
  • In the raw input data, often NULL values will be represented by the empty string or other not really NULL values. We went through and cleaned these up in all of the categorical / coded columns so that their values can be validated based on either an ENUM constraint in the database, or a foreign key constraint linking them to the static coding tables. Now they should primarily use the pandas NA value, or numpy.nan in the case of floats. 1376
  • Many FIPS and ZIP codes that appear in the raw data are stored as integers rather than strings, meaning that they lose their leading zeros, rendering them invalid in many contexts. We use the same method to clean them all up now, and enforce a uniform field width with leading zero padding. This also allows us to enforce a regex pattern constraint on these fields in the database outputs. 1405,1476
  • We're now able to fill in missing values in the very useful generators_eia860 technology_description field. Currently this is optionally available in the output layer, but we want to put more of this kind of data repair into the core database gong forward. 1075

Miscellaneous

  • Created a simple script that allows our SQLite DB to be loaded into Google's CloudSQL hosted PostgreSQL service pgloader and pg_dump. 1361
  • Made better use of our Pydantic settings classes to validate and manage the ETL settings that are read in from YAML files and passed around throughout the functions that orchestrate the ETL process. 1506
  • PUDL now works with pandas 1.4 (1421) and Python 3.10 (1373).
  • Addressed a bunch of deprecation warnings being raised by geopandas. 1444
  • Integrated the pre-commit.ci service into our GitHub CI in order to automatically apply a variety of code formatting & checks to all commits. 1482
  • Fixed random seeds to avoid stochastic test coverage changes in the pudl.analysis.timeseries_cleaning module. 1483
  • Silenced a bunch of 3rd party module warnings in the tests. See 1476

Bug Fixes

  • In addressing 851,1296,1325 the generation_fuel_eia923 table was split to create a generation_fuel_nuclear_eia923 table since they have different primary keys. This meant that the pudl.output.pudltabl.PudlTabl.gf_eia923 method no longer included nuclear generation. This impacted the net generation allocation process and MCOE calculations downstream, which were expecting to have all the reported nuclear generation. This has now been fixed, and the generation fuel output includes both the nuclear and non-nuclear generation, with nuclear generation aggregated across nuclear unit IDs so that it has the same primary key as the rest of the generation fuel table. 1518
  • EIA changed the URL of their API to only accept connections over HTTPS, but we had a hard-coded HTTP URL, meaning the historical fuel price filling that uses the API broke. This has been fixed.

Known Issues

  • Everything is fiiiiiine.

0.5.0 (2021-11-11)

Data Coverage Changes

  • Integration of 2020 data for all our core datasets (See 1255):
    • data_sources/eia860 for 2020 as well as 2001-2003 (see 1122).
    • EIA Form 860m through 2021-08.
    • data_sources/eia923 for 2020.
    • data_sources/ferc1 for 2020.
    • data_sources/eia861 data for 2020.
    • data_sources/ferc714 for 2020.
    • Note: the 2020 data_sources/epacems data was already available in v0.4.0.
  • EPA IPM / NEEDS data has been removed from PUDL as we didn't have the internal resources to maintain it, and it was no longer working. Apologies to gschivley!

SQLite and Parquet Outputs

  • The ETL pipeline now outputs SQLite databases and Apache Parquet datasets directly, rather than generating tabular data packages. This is much faster and simpler, and also takes up less space on disk. Running the full ETL including all EPA CEMS data should now take around 2 hours if you have all the data downloaded.
  • The new pudl.load.sqlite and pudl.load.parquet modules contain this logic. The pudl.load.csv and pudl.load.metadata modules have been removed along with other remaining datapackage infrastructure. See 1211
  • Many more tables now have natural primary keys explicitly specified within the database schema.
  • The datapkg_to_sqlite script has been removed and the epacems_to_parquet script can now be used to process the original EPA CEMS CSV data directly to Parquet using an existing PUDL database to source plant timezones. See 1176,806.
  • Data types, specified value constraints, and the uniqueness / non-null constraints on primary keys are validated during insertion into the SQLite DB.
  • The PUDL ETL CLI pudl.cli now has flags to toggle various constraint checks including --ignore-foreign-key-constraints --ignore-type-constraints and --ignore-value-constraints.

New Metadata System

With the deprecation of tabular data package outputs, we've adopted a more modular metadata management system that uses Pydantic. This setup will allow us to easily validate the metadata schema and export to a variety of formats to support data distribution via Datasette and Intake catalogs, and automatic generation of data dictionaries and documentation. See 806,1271,1272 and the pudl.metadata subpackage. Many thanks to ezwelty for most of this work.

ETL Settings File Format Changed

We are also using Pydantic to parse and validate the YAML settings files that tell PUDL what data to include in an ETL run. If you have any old settings files of your own lying around they'll need to be updated. Examples of the new format will be deployed to your system if you re-run the pudl_setup script. Or you can make a copy of the etl_full.yml or etl_fast.yml files that are stored under src/pudl/package_data/settings and edit them to reflect your needs.

Database Schema Changes

With the direct database output and the new metadata system, it's much eaiser for us to create foreign key relationships automatically. Updates that are in progress to the database normalization and entity resolution process also benefit from using natural primary keys when possible. As a result we've made some changes to the PUDL database schema, which will probably affect some users.

  • We have split out a new generation_fuel_nuclear_eia923 table from the existing generation_fuel_eia923 table, as nuclear generation and fuel consumption are reported at the generation unit level, rather than the plant level, requiring a different natural primary key. See 851,1296,1325.
  • Implementing a natural primary key for the boiler_fuel_eia923 table required the aggregation of a small number of records that didn't have well-defined prime_mover_code values. See 852,1306,1311.
  • We repaired, aggregated, or dropped a small number of records in the generation_eia923 (See 1208,1248) and ownership_eia860 (See 1207,1258) tables due to null values in their primary key columns.
  • Many new foreign key constraints are being enforced between the EIA data tables, entity tables, and coding tables. See 1196.
  • Fuel types and energy sources reported to EIA are now defined in / constrained by the static energy_sources_eia table.
  • The columns that indicate the mode of transport for various fuels now contain short codes rather than longer labels, and are defined in / constrained by the static fuel_transportation_modes_eia table.
  • In the simplified FERC 1 fuel type categories, we're now using other instead of unknown.
  • Several columns have been renamed to harmonize meanings between different tables and datasets, including:
    • In generation_fuel_eia923 and boiler_fuel_eia923 the fuel_type and fuel_type_code columns have been replaced with energy_source_code, which appears in various forms in generators_eia860 and fuel_receipts_costs_eia923.
    • fuel_qty_burned is now fuel_consumed_units
    • fuel_qty_units is now fuel_received_units
    • heat_content_mmbtu_per_unit is now fuel_mmbtu_per_unit
    • sector_name and sector_id are now sector_name_eia and sector_id_eia
    • primary_purpose_naics_id is now primary_purpose_id_naics
    • mine_type_code is now mine_type (a human readable label, not a code).

New Analyses

  • Added a deployed console script for running the state-level hourly electricity demand allocation, using FERC 714 and EIA 861 data, simply called state_demand and implemented in pudl.analysis.state_demand. This script existed in the v0.4.0 release, but was not deployed on the user's system.

Known Issues

  • The pudl_territories script has been disabled temporarily due to a memory issue. See 1174
  • Utility and Balancing Authority service territories for 2020 have not been vetted, and may contain errors or omissions. In particular there seems to be some missing demand in ND, SD, NE, KS, and OK. See 1310

Updated Dependencies

  • SQLAlchemy 1.4.x: Addressed all deprecation warnings associated with API changes coming in SQLAlchemy 2.0, and bumped current requirement to 1.4.x
  • Pandas 1.3.x: Addressed many data type issues resulting from changes in how Pandas preserves and propagates ExtensionArray / nullable data types.
  • PyArrow v5.0.0 Updated to the most recent version
  • PyGEOS v0.10.x Updated to the most recent version
  • contextily has been removed, since we only used it optionally for making a single visualization and it has substantial dependencies itself.
  • goodtables-pandas-py has been removed since we're no longer producing or validating datapackages.
  • SQLite 3.32.0 The type checks that we've implemented currently only work with SQLite version 3.32.0 or later, as we discovered in debugging build failures on PR 1228. Unfortunately Ubuntu 20.04 LTS shipped with SQLite 3.31.1. Using conda to manage your Python environment avoids this issue.

0.4.0 (2021-08-16)

This is a ridiculously large update including more than a year and a half's worth of work.

New Data Coverage

  • data_sources/eia860 for 2004-2008 + 2019, plus eia860m through 2020.
  • data_sources/eia923 for 2001-2008 + 2019
  • data_sources/epacems for 2019-2020
  • data_sources/ferc1 for 2019
  • US Census Demographic Profile (DP1) <data-censusdp1tract> for 2010
  • data_sources/ferc714 for 2006-2019 (experimental)
  • data_sources/eia861 for 2001-2019 (experimental)

Documentation & Data Accessibility

We've updated and (hopefully) clarified the documentation, and no longer expect most users to perform the data processing on their own. Instead, we are offering several methods of directly accessing already processed data:

Users who still want to run the ETL themselves will need to set up the set up the PUDL development environment <dev/dev_setup>

Data Cleaning & Integration

  • We now inject placeholder utilities in the cloned FERC Form 1 database when respondent IDs appear in the data tables, but not in the respondent table. This addresses a bunch of unsatisfied foreign key constraints in the original databases published by FERC.
  • We're doing much more software testing and data validation, and so hopefully we're catching more issues early on.

Hourly Electricity Demand and Historical Utility Territories

With support from GridLab and in collaboration with researchers at Berkeley's Center for Environmental Public Policy, we did a bunch of work on spatially attributing hourly historical electricity demand. This work was largely done by ezwelty and yashkumar1803 and included:

  • Semi-programmatic compilation of historical utility and balancing authority service territory geometries based on the counties associated with utilities, and the utilities associated with balancing authorities in the EIA 861 (2001-2019). See e.g. 670 but also many others.
  • A method for spatially allocating hourly electricity demand from FERC 714 to US states based on the overlapping historical utility service territories described above. See 741
  • A fast timeseries outlier detection routine for cleaning up the FERC 714 hourly data using correlations between the time series reported by all of the different entities. See 871

Net Generation and Fuel Consumption for All Generators

We have developed an experimental methodology to produce net generation and fuel consumption for all generators. The process has known issues and is being actively developed. See 989

Net electricity generation and fuel consumption are reported in multiple ways in the EIA 923. The generation_fuel_eia923 table reports both generation and fuel consumption, and breaks them down by plant, prime mover, and fuel. In parallel, the generation_eia923 table reports generation by generator, and the boiler_fuel_eia923 table reports fuel consumption by boiler.

The generation_fuel_eia923 table is more complete, but the generation_eia923 + boiler_fuel_eia923 tables are more granular. The generation_eia923 table includes only ~55% of the total MWhs reported in the generation_fuel_eia923 table.

The pudl.analysis.allocate_net_gen module estimates the net electricity generation and fuel consumption attributable to individual generators based on the more expansive reporting of the data in the generation_fuel_eia923 table.

Data Management and Archiving

  • We now use a series of web scrapers to collect snapshots of the raw input data that is processed by PUDL. These original data are archived as Frictionless Data Packages on Zenodo, so that they can be accessed reproducibly and programmatically via a REST API. This addresses the problems we were having with the v0.3.x releases, in which the original data on the agency websites was liable to be modified long after its "final" release, rendering it incompatible with our software. These scrapers and the Zenodo archiving scripts can be found in our pudl-scrapers and pudl-zenodo-storage repositories. The archives themselves can be found within the Catalyst Cooperative community on Zenodo
  • There's an experimental caching system that allows these Zenodo archives to work as long-term "cold storage" for citation and reproducibility, with cloud object storage acting as a much faster way to access the same data for day to day non-local use, implemented by rousik
  • We've decided to shift to producing a combination of relational databases (SQLite files) and columnar data stores (Apache Parquet files) as the primary outputs of PUDL. Tabular Data Packages didn't end up serving either database or spreadsheet users very well. The CSV file were often too large to access via spreadsheets, and users missed out on the relationships between data tables. Needing to separately load the data packages into SQLite and Parquet was a hassle and generated a lot of overly complicated and fragile code.

Known Issues

  • The EIA 861 and FERC 714 data are not yet integrated into the SQLite database outputs, because we need to overhaul our entity resolution process to accommodate them in the database structure. That work is ongoing, see 639
  • The EIA 860 and EIA 923 data don't cover exactly the same rage of years. EIA 860 only goes back to 2004, while EIA 923 goes back to 2001. This is because the pre-2004 EIA 860 data is stored in the DBF file format, and we need to update our extraction code to deal with the different format. This means some analyses that require both EIA 860 and EIA 923 data (like the calculation of heat rates) can only be performed as far back as 2004 at the moment. See 848
  • There are 387 EIA utilities and 228 EIA palnts which appear in the EIA 923, but which haven't yet been assigned PUDL IDs and associated with the corresponding utilities and plants reported in the FERC Form 1. These entities show up in the 2001-2008 EIA 923 data that was just integrated. These older plants and utilities can't yet be used in conjuction with FERC data. When the EIA 860 data for 2001-2003 has been integrated, we will finish this manual ID assignment process. See 848,1069
  • 52 of the algorithmically assigned plant_id_ferc1 values found in the plants_steam_ferc1 table are currently associated with more than one plant_id_pudl value (99 PUDL plant IDs are involved), indicating either that the algorithm is making poor assignments, or that the manually assigned plant_id_pudl values are incorrect. This is out of several thousand distinct plant_id_ferc1 values. See 954
  • The county FIPS codes associated with coal mines reported in the Fuel Receipts and Costs table are being treated inconsistently in terms of their data types, especially in the output functions, so they are currently being output as floating point numbers that have been cast to strings, rather than zero-padded integers that are strings. See 1119

0.3.2 (2020-02-17)

The primary changes in this release:

  • The 2009-2010 data for EIA 860 have been integrated, including updates to the data validation test cases.
  • Output tables are more uniform and less restrictive in what they include, no longer requiring PUDL Plant & Utility IDs in some tables. This release was used to compile v1.1.0 of the PUDL Data Release, which is archived at Zenodo under this DOI: https://doi.org/10.5281/zenodo.3672068

    With this release, the EIA 860 & 923 data now (finally!) cover the same span of time. We do not anticipate integrating any older EIA 860 or 923 data at this time.

0.3.1 (2020-02-05)

A couple of minor bugs were found in the preparation of the first PUDL data release:

  • No maximum version of Python was being specified in setup.py. PUDL currently only works on Python 3.7, not 3.8.
  • epacems_to_parquet conversion script was erroneously attempting to verify the availability of raw input data files, despite the fact that it now relies on the packaged post-ETL epacems data. Didn't catch this before since it was always being run in a context where the original data was lying around... but that's not the case when someone just downloads the released data packages and tries to load them.

0.3.0 (2020-01-30)

This release is mostly about getting the infrastructure in place to do regular data releases via Zenodo, and updating ETL with 2018 data.

Added lots of data validation / quality assurance test cases in anticipation of archiving data. See the pudl.validate module for more details.

New data since v0.2.0 of PUDL:

  • EIA Form 860 for 2018
  • EIA Form 923 for 2018
  • FERC Form 1 for 1994-2003 and 2018 (select tables)

We removed the FERC Form 1 accumulated depreciation table from PUDL because it requires detailed row-mapping in order to be accurate across all the years. It and many other FERC tables will be integrated soon, using new row-mapping methods.

Lots of new plants and utilities integrated into the PUDL ID mapping process, for the earlier years (1994-2003). All years of FERC 1 data should be integrated for all future ferc1 tables.

Command line interfaces of some of the ETL scripts have changed, see their help messages for details.

0.2.0 (2019-09-17)

This is the first release of PUDL to generate data packages as the canonical output, rather than loading data into a local PostgreSQL database. The data packages can then be used to generate a local SQLite database, without relying on any software being installed outside of the Python requirements specified for the catalyst.coop package.

This change will enable easier installation of PUDL, as well as archiving and bulk distribution of the data products in a platform independent format.

0.1.0 (2019-09-12)

This is the only release of PUDL that will be made that makes use of PostgreSQL as the primary data product. It is provided for reference, in case there are users relying on this setup who need access to a well defined release.