Feature branch: Rename core + output assets to match new naming protocols (#2818)

* Rename static tables

* Rename Census DP1 assets

* Test doc fix

* Update core table names for EIA 860, 923, harvested tables, FERC1, code

* Fix integration tests

* Fix alembic

* Rename 714, 861, epacems

* update tests and rest of assets

* Fix validation tests

* Rename ferc output assets

* Rename denorm_cash_flow_ferc1 and remove leading underscore from cross refs in pudl_db docs

* Rename a missing ferc output table and add migration

* Rename EIA denorm assets

* Recreate ferc rename migration

* Add docs cross ref fix for intermediate assets

* Resolve small denorm EIA rename issues

* Clean up notebooks

* Apply naming convention to allocate generation fuel assets

* Fix a missing gen fuel asset name in PudlTabl

* Update migrations post ferc1 output rename merge

* Update contributor facing documentation with new asset naming conventions

* Add new naming convention to user facing documentation

* Correct allocate-get-fuel down revision

* Apply new naming convention to ferc714 respondents, hourly demand and eia861 service territories

* Fix refs to renamed tables in release notes

* Rename ferc714 and eia861 output tables in integration tests

* Add missing balance authority fk migration

* Rename out_ferc714__fipsified_respondents to out_ferc714__respondents_with_fips

* Respond to first round of Austen's comments

* Update rename-core-assets and clarify raw asset sentence

* Restrict astroid version to avoid random autoapi error

* Reset migrations and fix old table refs in docs

* Fix names of inputs to exploded tables and xbrl calculation fixes

* Rename mcoe and ppl assets

* Fix small ppl migration issue

* Format and sort intermediate resource name cross refs in data dictionary

* Add upstream mcoe assets back to metadata

* Update straggler PudlTabl method name

* Add frequency to ppl asset name and some clean up

* rename six of the non-controversial FERC1 tables (core + out)

* initial rename of the FERC1 core and out tables

* add db migration

* rename the ferc1 transformer classes in line with new table names

* Incorporate some docs changes from #2912

* FINAL FINAL rename of ferc assets

* ooooops remove the eia860m extraction edit bc that was not supposed to be in here ooop

* Remove README.rst from index.rst and move intro content to index

* Add deprecation warnings to PudlTabl and add minor naming docs updates

* Rename heat_rate_mmbtu_mwh -> heat_rate_mmbtu_mwh_by_unit

* Rename heat rate mmbtu mwh to follow existing naming convention

* Remove PudlTabl removal data and make assn table name sources alphabetical

* Explain why CEMS is stored as parquet

* Rename heat_rate_mmbtu_mwh_eia/ferc1 columns to unit_heat_rate_mmbtu_per_mwh_eia/ferc1

* Remove unused ppe_cols_to_grab variable

* Make association asset names more consistent

* Add association asset naming convention to docs

* Resolve migration issues with unit heat rate column

* Update conda-lock.yml and rendered conda environment files.

* Recreate heat rate migration revision

* Use pudl_sqlite_io_manager for fuel_cost_by_generator assets

* Update conda-lock.yml and rendered conda environment files.

* Checkout lock files from dev

* Update conda-lock.yml and rendered conda environment files.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

* Remove intro.rst and update ferc s3 urls again

* Update conda-lock.yml and rendered conda environment files.

* Remove some old table names from metadata

* Update conda-lock.yml and rendered conda environment files.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

* Remove ref to nonexistent doc page, remove files no longer in dev

---------

Co-authored-by: bendnorman <bdn29@cornell.edu>
Co-authored-by: Bennett Norman <bennett.norman@catalyst.coop>
Co-authored-by: Christina Gosnell <cgosnell@catalyst.coop>
Co-authored-by: bendnorman <bendnorman@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
6 people committed Dec 16, 2023
1 parent 0e36ef9 commit b8fa2b5
Showing 112 changed files with 6,351 additions and 6,888 deletions.
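
The renamed assets in this commit (for example ``core_eia923__monthly_fuel_receipts_costs`` or ``out_ferc714__respondents_with_fips``) share a visible shape: a layer prefix, a data source, a double underscore, and then a description. The sketch below splits a name along those lines. It is only an illustration inferred from the names that appear in this diff; ``AssetNameParts`` and ``split_asset_name`` are hypothetical helpers, not part of PUDL, and the authoritative rules are in the project's naming convention docs.

```python
from typing import NamedTuple


class AssetNameParts(NamedTuple):
    """Hypothetical container for the pieces of a new-style asset name."""

    layer: str        # e.g. "core" or "out"
    source: str       # e.g. "eia923", "ferc1", "pudl"
    description: str  # everything after the double underscore


def split_asset_name(name: str) -> AssetNameParts:
    """Split a new-style asset name at its double underscore.

    Assumes the pattern "<layer>_<source>__<description>" seen in this diff.
    """
    prefix, sep, description = name.partition("__")
    if not sep:
        # Old-style names such as "fuel_ferc1" have no double underscore.
        raise ValueError(f"{name!r} does not look like a new-style asset name")
    layer, _, source = prefix.partition("_")
    return AssetNameParts(layer, source, description)


for name in [
    "core_eia923__monthly_fuel_receipts_costs",
    "core_ferc1__yearly_steam_plants_sched402",
    "out_ferc714__respondents_with_fips",
]:
    print(split_asset_name(name))
```

A helper like this can be handy when scanning notebooks or scripts for names that still follow the old pattern, since those raise ``ValueError`` here.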
81 changes: 63 additions & 18 deletions README.rst
@@ -47,36 +47,81 @@ it's often difficult to work with. PUDL takes the original spreadsheets, CSV fil
and databases and turns them into a unified resource. This allows users to spend more
time on novel analysis and less time on data preparation.

Who is PUDL for?
----------------

The project is focused on serving researchers, activists, journalists, policy makers,
and small businesses that might not otherwise be able to afford access to this data from
commercial sources and who may not have the time or expertise to do all the data
processing themselves from scratch.
and small businesses that might not otherwise be able to afford access to this data
from commercial sources and who may not have the time or expertise to do all the
data processing themselves from scratch.

We want to make this data accessible and easy to work with for as wide an audience as
possible: anyone from grassroots youth climate organizers working with Google sheets
to university researchers with access to scalable cloud computing resources and everyone
in between!
possible: anyone from grassroots youth climate organizers working with Google
sheets to university researchers with access to scalable cloud computing
resources and everyone in between!

PUDL is comprised of three core components:

- **Raw Data Archives**

- PUDL `archives <https://github.com/catalyst-cooperative/pudl-archiver>`__
all the raw data inputs on `Zenodo <https://zenodo.org/communities/catalyst-cooperative/?page=1&size=20>`__
to ensure permanent, versioned access to the data. In the event that an agency
changes how they publish data or deletes old files, the ETL will still have access
to the original inputs. Each of the data inputs may have several different versions
archived, and all are assigned a unique DOI and made available through the REST API.
You can read more about the Raw Data Archives in the
`docs <https://catalystcoop-pudl.readthedocs.io/en/dev/intro.html#raw-data-archives>`__.
- **ETL Pipeline**

- The ETL pipeline (this repo) ingests the raw archives, cleans them,
integrates them, and outputs them to a series of tables stored in SQLite Databases,
Parquet files, and pickle files (the Data Warehouse). Each release of the PUDL
Python package is embedded with a set of DOIs to indicate which version of the
raw inputs it is meant to process. This process helps ensure that the ETL and its
outputs are replicable. You can read more about the ETL in the
`docs <https://catalystcoop-pudl.readthedocs.io/en/dev/intro.html#the-etl-process>`__.
- **Data Warehouse**

- The outputs from the ETL, sometimes called "PUDL outputs",
are stored in a data warehouse as a collection of SQLite and Parquet files so that
users can access the data without having to run any code. Learn more about how to
access the data `here <https://catalystcoop-pudl.readthedocs.io/en/dev/data_access.html>`__.

What data is available?
-----------------------

PUDL currently integrates data from:

* `EIA Form 860 <https://www.eia.gov/electricity/data/eia860/>`__: 2001 - 2022
* `EIA Form 860m <https://www.eia.gov/electricity/data/eia860m/>`__: 2023-06
* `EIA Form 861 <https://www.eia.gov/electricity/data/eia861/>`__: 2001 - 2022
* `EIA Form 923 <https://www.eia.gov/electricity/data/eia923/>`__: 2001 - 2023-08
* `EPA Continuous Emissions Monitoring System (CEMS) <https://campd.epa.gov/>`__: 1995 - 2022
* `FERC Form 1 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__: 1994-2021
* `FERC Form 714 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__: 2006-2020
* `US Census Demographic Profile 1 Geodatabase <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__: 2010
* **EIA Form 860**: 2001-2022
- `Source Docs <https://www.eia.gov/electricity/data/eia860/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/eia860.html>`__
* **EIA Form 860m**: 2023-06
- `Source Docs <https://www.eia.gov/electricity/data/eia860m/>`__
* **EIA Form 861**: 2001-2022
- `Source Docs <https://www.eia.gov/electricity/data/eia861/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/eia861.html>`__
* **EIA Form 923**: 2001-2022
- `Source Docs <https://www.eia.gov/electricity/data/eia923/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/eia923.html>`__
* **EPA Continuous Emissions Monitoring System (CEMS)**: 1995-2022
- `Source Docs <https://campd.epa.gov/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/epacems.html>`__
* **FERC Form 1**: 1994-2021
- `Source Docs <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/ferc1.html>`__
* **FERC Form 714**: 2006-2020
- `Source Docs <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/dev/data_sources/ferc714.html>`__
* **FERC Form 2**: 2021 (raw only)
- `Source Docs <https://www.ferc.gov/industries-data/natural-gas/industry-forms/form-2-2a-3-q-gas-historical-vfp-data>`__
* **FERC Form 6**: 2021 (raw only)
- `Source Docs <https://www.ferc.gov/general-information-1/oil-industry-forms/form-6-6q-historical-vfp-data>`__
* **FERC Form 60**: 2021 (raw only)
- `Source Docs <https://www.ferc.gov/form-60-annual-report-centralized-service-companies>`__
* **US Census Demographic Profile 1 Geodatabase**: 2010
- `Source Docs <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__

Thanks to support from the `Alfred P. Sloan Foundation Energy & Environment
Program <https://sloan.org/programs/research/energy-and-environment>`__, from
2021 to 2024 we will be integrating the following data as well:
2021 to 2024 we will be cleaning and integrating the following data as well:

* `EIA Form 176 <https://www.eia.gov/dnav/ng/TblDefs/NG_DataSources.html#s176>`__
(The Annual Report of Natural Gas Supply and Disposition)
2 changes: 1 addition & 1 deletion devtools/debug-eia-etl.ipynb
@@ -263,7 +263,7 @@
"outputs": [],
"source": [
"%%time\n",
"asset_key = \"fuel_receipts_costs_eia923\"\n",
"asset_key = \"core_eia923__monthly_fuel_receipts_costs\"\n",
"df = defs.load_asset_value(AssetKey(asset_key))\n",
"\n",
"df.head()"
16 changes: 8 additions & 8 deletions devtools/debug-ferc1-etl.ipynb
@@ -99,7 +99,7 @@
},
"outputs": [],
"source": [
"ferc1_xbrl_raw_dfs[\"fuel_ferc1\"][\"duration\"].report_year"
"ferc1_xbrl_raw_dfs[\"core_ferc1__yearly_steam_plants_fuel_sched402\"][\"duration\"].report_year"
]
},
{
@@ -206,7 +206,7 @@
"metadata": {},
"outputs": [],
"source": [
"table_name = \"other_regulatory_liabilities_ferc1\"\n",
"table_name = \"core_ferc1__yearly_other_regulatory_liabilities_sched278\"\n",
"TRANSFORMER = transformers[table_name] # add a table here"
]
},
@@ -326,8 +326,8 @@
"source": [
"transformed_tables = {}\n",
"for table_name, transformer in transformers.items():\n",
" if table_name == \"plants_steam_ferc1\":\n",
" # plants_steam_ferc1 is a special case. It depends on the transformed fuel_ferc1 table.\n",
" if table_name == \"core_ferc1__yearly_steam_plants_sched402\":\n",
" # core_ferc1__yearly_steam_plants_sched402 is a special case. It depends on the transformed core_ferc1__yearly_steam_plants_fuel_sched402 table.\n",
" continue\n",
" transformed_tables[transformer.table_id.value] = transformer.transform(\n",
" raw_dbf=ferc1_dbf_raw_dfs[transformer.table_id.value],\n",
@@ -345,13 +345,13 @@
},
"outputs": [],
"source": [
"# Handle special case for \"plants_steam_ferc1\"\n",
"transformer = transformers[\"plants_steam_ferc1\"]\n",
"# Handle special case for \"core_ferc1__yearly_steam_plants_sched402\"\n",
"transformer = transformers[\"core_ferc1__yearly_steam_plants_sched402\"]\n",
"transformed_tables[transformer.table_id.value] = transformer.transform(\n",
" raw_dbf=ferc1_dbf_raw_dfs[transformer.table_id.value],\n",
" raw_xbrl_instant=ferc1_xbrl_raw_dfs[transformer.table_id.value][\"instant\"],\n",
" raw_xbrl_duration=ferc1_xbrl_raw_dfs[transformer.table_id.value][\"duration\"],\n",
" transformed_fuel=transformed_tables[\"fuel_ferc1\"],\n",
" transformed_fuel=transformed_tables[\"core_ferc1__yearly_steam_plants_fuel_sched402\"],\n",
")"
]
}
@@ -372,7 +372,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
"version": "3.11.5"
}
},
"nbformat": 4,
Original file line number Diff line number Diff line change
@@ -43,7 +43,7 @@
"# Local libraries\n",
"import pudl\n",
"from pudl.workspace.setup import PudlPaths\n",
"from pudl.analysis.ferc1_eia_train import *"
"from pudl.analysis.eia_ferc1_train import *"
]
},
{
Original file line number Diff line number Diff line change
@@ -44,7 +44,7 @@
"# Local libraries\n",
"import pudl\n",
"from pudl.workspace.setup import PudlPaths\n",
"from pudl.analysis.ferc1_eia_train import *"
"from pudl.analysis.eia_ferc1_train import *"
]
},
{
Expand Down Expand Up @@ -188,7 +188,7 @@
"outputs": [],
"source": [
"current_training_df = pd.read_csv(\n",
" importlib.resources.files(\"pudl.package_data.glue\").joinpath(\"ferc1_eia_train.csv\")\n",
" importlib.resources.files(\"pudl.package_data.glue\").joinpath(\"eia_ferc1_train.csv\")\n",
")\n",
"path_to_overrides = \"./add_to_training/\"\n",
"override_files = [\n",
Expand Down Expand Up @@ -326,10 +326,10 @@
"# Get paths to CSVs.\n",
"from importlib import resources\n",
"one_to_many = path_to_one_to_many=resources.files(\"pudl.package_data.glue\").joinpath(\n",
" \"ferc1_eia_one_to_many.csv\",\n",
" \"eia_ferc1_one_to_many.csv\",\n",
" )\n",
"nulls = path_to_one_to_many=resources.files(\"pudl.package_data.glue\").joinpath(\n",
" \"ferc1_eia_null.csv\",\n",
" \"eia_ferc1_null.csv\",\n",
" )"
]
},
75 changes: 70 additions & 5 deletions devtools/inspect-assets.ipynb
@@ -50,10 +50,61 @@
"\n",
"from pudl.etl import defs\n",
"\n",
"asset_key = \"raw_generator_existing_eia860\"\n",
"asset_key = \"exploded_balance_sheet_assets_ferc1\"\n",
"df = defs.load_asset_value(AssetKey(asset_key))\n",
"\n",
"df.head()"
"#df[df.row_type_xbrl == \"correction\"].xbrl_factoid.value_counts()\n",
"#df[(df.xbrl_factoid.isin([\"operation_expense\", \"maintenance_expense\"]))&(df.rel_diff.notnull())&(df.rel_diff!=0)].sort_values(['utility_id_ferc1', 'report_year', 'xbrl_factoid', 'rel_diff']).head(50)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2d99594",
"metadata": {},
"outputs": [],
"source": [
"df[(df.xbrl_factoid==\"accumulated_depreciation\")&(df.plant_status==\"in_service\")&(df.plant_function==\"total\")]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "467111b1",
"metadata": {},
"outputs": [],
"source": [
"df[df.xbrl_factoid.isin(factoids)&(df.utility_id_ferc1==9)&(df.report_year==1998)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6f7427a",
"metadata": {},
"outputs": [],
"source": [
"factoids = ['distribution_maintenance_expense_electric',\n",
" 'hydraulic_power_generation_maintenance_expense',\n",
" 'maintenance_of_general_plant',\n",
" 'nuclear_power_generation_maintenance_expense',\n",
" 'other_power_generation_maintenance_expense',\n",
" 'regional_market_maintenance_expense',\n",
" 'steam_power_generation_maintenance_expense',\n",
" 'transmission_maintenance_expense_electric']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "951b718d",
"metadata": {},
"outputs": [],
"source": [
"asset_key = \"calculation_components_xbrl_ferc1\"\n",
"calcs = defs.load_asset_value(AssetKey(asset_key))\n",
"\n",
"calcs[(calcs.xbrl_factoid_parent == \"accumulated_depreciation\")].head(50)"
]
},
{
@@ -77,10 +128,24 @@
"\n",
"from pudl.etl import defs\n",
"\n",
"asset_key = \"fuel_receipts_costs_eia923\"\n",
"asset_key = \"emissions_unit_ids_epacems\"\n",
"df = defs.load_asset_value(AssetKey(asset_key))\n",
"\n",
"df.head()"
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9f0d118b",
"metadata": {},
"outputs": [],
"source": [
"from pudl.output.epacems import epacems\n",
"\n",
"test_epacems = epacems(states = [\"ID\"], years = [2022])\n",
"\n",
"test_epacems[test_epacems.operating_datetime_utc>=\"2022-01-04\"].head(40)"
]
}
],
Expand All @@ -100,7 +165,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.11.5"
}
},
"nbformat": 4,
14 changes: 7 additions & 7 deletions devtools/python-output-table-conversion-debug.ipynb
@@ -48,14 +48,14 @@
"\n",
"You can create an asset by creating a new function and adding the `@asset` decorator. For now, the only attribute you should add to the decorator is `compute_kind=\"Python\"`. All this does is add a cute tag to the asset in the dag to let people know how the asset is being processed.\n",
"\n",
"Next you'll want to figure out what tables the output table depends on. Read through the old output function to see which normalized tables or output functions are being used as inputs to the joins and imputations. Once you have the input table names, add them to the asset function parameters. For example, the `utilities_eia860()` function merges `utilities_entity_eia`, `utilities_eia860`, and `utilities_eia` tables together so the asset would look like this:\n",
"Next you'll want to figure out what tables the output table depends on. Read through the old output function to see which normalized tables or output functions are being used as inputs to the joins and imputations. Once you have the input table names, add them to the asset function parameters. For example, the `utilities_eia860()` function merges `core_eia__entity_utilities`, `core_eia860__scd_utilities`, and `core_pudl__assn_eia_pudl_utilities` tables together so the asset would look like this:\n",
"\n",
"```python\n",
"@asset(compute_kind=\"Python\")\n",
"def denorm_utilities_eia860(\n",
" utilities_entity_eia: pd.DataFrame,\n",
" utilities_eia860: pd.DataFrame,\n",
" utilities_eia: pd.DataFrame,\n",
" core_eia__entity_utilities: pd.DataFrame,\n",
" core_eia860__scd_utilities: pd.DataFrame,\n",
" core_pudl__assn_eia_pudl_utilities: pd.DataFrame,\n",
"):\n",
" ... # joining logic\n",
" return joined_df\n",
@@ -108,9 +108,9 @@
"```python\n",
"@asset(io_manager_key=\"pudl_sqlite_io_manager\", compute_kind=\"Python\")\n",
"def denorm_utilities_eia860(\n",
" utilities_entity_eia: pd.DataFrame,\n",
" utilities_eia860: pd.DataFrame,\n",
" utilities_eia: pd.DataFrame,\n",
" core_eia__entity_utilities: pd.DataFrame,\n",
" core_eia860__scd_utilities: pd.DataFrame,\n",
" core_pudl__assn_eia_pudl_utilities: pd.DataFrame,\n",
"):\n",
" ... # joining logic\n",
" return joined_df\n",
7 changes: 6 additions & 1 deletion docs/data_access.rst
@@ -2,12 +2,17 @@
Data Access
=======================================================================================

We publish the :doc:`PUDL pipeline <intro>` outputs in several ways to serve
We publish the PUDL pipeline outputs in several ways to serve
different users and use cases. We're always trying to increase accessibility of the
PUDL data, so if you have a suggestion please `open a GitHub issue
<https://github.com/catalyst-cooperative/pudl/issues>`__. If you have a question you
can `create a GitHub discussion <https://github.com/orgs/catalyst-cooperative/discussions/new?category=help-me>`__.

PUDL's primary data output is the ``pudl.sqlite`` database. We recommend working with
tables with the ``out_`` prefix, as these tables contain the most complete and easiest
to work with data. For more information about the different types
of tables, read through :ref:`PUDL's naming conventions <asset-naming>`.
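
As a rough illustration of the recommendation above, the following sketch lists the tables in a SQLite database whose names start with ``out_``. It uses only Python's standard ``sqlite3`` module. ``list_out_tables`` is a hypothetical helper, and the in-memory database with two stand-in tables exists only so the example runs without a real ``pudl.sqlite`` file; with real data you would connect to your local copy of the database instead.

```python
import sqlite3


def list_out_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all tables that start with the ``out_`` prefix."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    # Filter in Python: in a SQL LIKE pattern "_" is a single-character
    # wildcard, so "out_%" would also match names such as "outX...".
    return [name for (name,) in rows if name.startswith("out_")]


# Stand-in database (an assumption for this sketch, not real PUDL data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE core_eia__entity_utilities (utility_id_eia INTEGER)")
conn.execute(
    "CREATE TABLE out_ferc714__respondents_with_fips (respondent_id_ferc714 INTEGER)"
)

print(list_out_tables(conn))  # ['out_ferc714__respondents_with_fips']
```

To try this against real data, replace ``":memory:"`` with the path to a downloaded ``pudl.sqlite`` and drop the two ``CREATE TABLE`` statements.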

.. _access-modes:

---------------------------------------------------------------------------------------