## Configure PUDL
The `.pudl.yml` configuration file tells PUDL where to look for data. Uncomment the next cell and run it if you're on our 2i2c JupyterHub.

In [1]:
#!cp ~/shared/shared-pudl.yml ~/.pudl.yml

In [2]:
%load_ext autoreload
%autoreload 3

# import the necessary packages
import os

# You can ignore this. It suppresses an unimportant warning.
os.environ["USE_PYGEOS"] = '0'
import pandas as pd
import sqlalchemy as sa
import pudl

# Connecting to the PUDL Databases

This notebook will walk you through several ways of pulling data out of the Public Utility Data Liberation (PUDL)
project databases and into [Pandas](https://pandas.pydata.org/) DataFrames for analysis and visualization.

This notebook assumes you have a development version of the [PUDL Python package](https://github.com/catalyst-cooperative/pudl) installed, and a complete PUDL database available locally, in the location expected by the Python package.

If you have any questions or feedback you can:
* [Create an issue](https://github.com/catalyst-cooperative/pudl-tutorials/issues) in the GitHub repo for our tutorials, or
* Contact the team at: pudl@catalyst.coop

## Direct SQLite Access
Much of the PUDL data is published as [SQLite database files](https://www.sqlite.org/index.html). These are relational databases generally intended for use by a single user at a time. If you're already familiar with databases and SQL in Python, you can access them just like you would any other database. [Support for SQLite](https://docs.python.org/3/library/sqlite3.html) is built into the Python standard library, and the popular [SQLAlchemy](https://www.sqlalchemy.org) Python package also has extensive support for SQLite. Here's one in-depth resource on using Python, SQLite and SQLAlchemy together: [Data Management with Python, SQLite, and SQLAlchemy](https://realpython.com/python-sqlite-sqlalchemy/).
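
As a minimal sketch of that kind of direct access, the following uses only the standard library `sqlite3` module. The in-memory database, the `generation` table, and its values are made up for illustration; to open a real file you would pass its path to `sqlite3.connect()` instead of `":memory:"`.

```python
import sqlite3

# Hypothetical stand-in for a PUDL SQLite file: an in-memory database
# holding a tiny table with made-up values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE generation "
    "(plant_id_eia INTEGER, generator_id TEXT, net_generation_mwh REAL)"
)
conn.executemany(
    "INSERT INTO generation VALUES (?, ?, ?)",
    [(1, "A", 100.0), (1, "B", 50.0), (2, "A", 75.0)],
)
conn.commit()

# Query it like any other database, using ? placeholders for values.
rows = conn.execute(
    "SELECT plant_id_eia, SUM(net_generation_mwh) FROM generation "
    "WHERE plant_id_eia = ? GROUP BY plant_id_eia",
    (1,),
).fetchall()
print(rows)
conn.close()
```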

For the rest of these tutorials, we're going to assume you want to get the data into Pandas as quickly as possible for interactive work.


## Database Normalization
The data in the PUDL database has been extensively deduplicated, [normalized](https://en.wikipedia.org/wiki/Database_normalization) and generally organized according to best practices of [tidy data](https://tidyr.tidyverse.org/articles/tidy-data.html) in order to ensure that it is internally self-consistent and free of errors. As a result, you'll often need to combine information from more than one table to make it readable or to get all the information you need for your analysis in one place. We've built some tools to do this automatically, which we'll get to below.
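
To make the idea of combining normalized tables concrete, here is a small sketch in plain Pandas. The two dataframes and their values are invented; only the column names echo the PUDL tables shown later in this notebook.

```python
import pandas as pd

# Made-up rows in the shape of two normalized tables: names live in one
# table, reported values in another, linked by plant_id_eia.
plants = pd.DataFrame(
    {"plant_id_eia": [1, 2], "plant_name_eia": ["Alpha", "Beta"]}
)
generation = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "generator_id": ["A", "B", "A"],
        "net_generation_mwh": [100.0, 50.0, 75.0],
    }
)

# A many-to-one merge copies each plant name onto every matching row.
readable = generation.merge(
    plants, on="plant_id_eia", how="left", validate="many_to_one"
)
print(readable)
```

The `validate` argument makes Pandas raise an error if a plant ID unexpectedly appears more than once in the lookup table, which is a cheap way to check that the join behaves as intended.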

## Locate the PUDL DB file
Each SQLite database is stored within a single file. To access the data, you need to know where that file is. With the location of the file, you can create an [SQLAlchemy connection engine](https://docs.sqlalchemy.org/en/13/core/engines.html), which Pandas will use to read data out of the database.

PUDL stores its data in a directory structure generally organized by file format. We store the paths to those directories and the SQLAlchemy database URLs in a Python dictionary that's usually called `pudl_settings`. Note that a URL is just a path to a file that could be either local (on your computer) or remote (on someone else's computer).

The following command will construct that `pudl_settings` dictionary based on some directory paths stored in the `.pudl.yml` file in your home directory. Printing out the dictionary contents, you can see where PUDL will look for various resources.

In [3]:
pudl_settings = pudl.workspace.setup.get_defaults()
pudl_settings

{'pudl_in': '/Users/zane/code/catalyst/pudl-work',
 'data_dir': '/Users/zane/code/catalyst/pudl-work/data',
 'settings_dir': '/Users/zane/code/catalyst/pudl-work/settings',
 'pudl_out': '/Users/zane/code/catalyst/pudl-work',
 'sqlite_dir': '/Users/zane/code/catalyst/pudl-work/sqlite',
 'parquet_dir': '/Users/zane/code/catalyst/pudl-work/parquet',
 'ferc1_db': 'sqlite:////Users/zane/code/catalyst/pudl-work/sqlite/ferc1.sqlite',
 'ferc1_xbrl_db': 'sqlite:////Users/zane/code/catalyst/pudl-work/sqlite/ferc1_xbrl.sqlite',
 'ferc1_xbrl_datapackage': PosixPath('/Users/zane/code/catalyst/pudl-work/sqlite/ferc1_xbrl_datapackage.json'),
 'ferc1_xbrl_taxonomy_metadata': PosixPath('/Users/zane/code/catalyst/pudl-work/sqlite/ferc1_xbrl_taxonomy_metadata.json'),
 'ferc2_xbrl_db': 'sqlite:////Users/zane/code/catalyst/pudl-work/sqlite/ferc2_xbrl.sqlite',
 'ferc2_xbrl_datapackage': PosixPath('/Users/zane/code/catalyst/pudl-work/sqlite/ferc2_xbrl_datapackage.json'),
 'ferc2_xbrl_taxonomy_metadata': Posi
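
The database URLs in that dictionary follow SQLAlchemy's SQLite convention: the prefix `sqlite:///` followed by the path to the file. The sketch below builds such a URL by hand, assuming a POSIX-style absolute path; the temporary directory and the `example.sqlite` file name are hypothetical and have nothing to do with a real PUDL workspace.

```python
import tempfile
from pathlib import Path

import sqlalchemy as sa

# Hypothetical file location, used only for this illustration.
db_path = Path(tempfile.mkdtemp()).resolve() / "example.sqlite"

# "sqlite:///" followed by an absolute POSIX path (which itself begins
# with "/") gives the four slashes seen in the URLs printed above.
db_url = f"sqlite:///{db_path}"
engine = sa.create_engine(db_url)

# Run a trivial query to confirm that the engine can connect.
with engine.connect() as conn:
    answer = conn.execute(sa.text("SELECT 1")).scalar()
print(db_url, answer)
engine.dispose()
```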

## The SQLAlchemy Connection Engine
* The `sqlalchemy.create_engine()` function takes a database URL and creates an Engine that knows how to interact with the database. It can do things like list out the names of all the tables in the database.
* For metadata describing the contents of these tables in detail, see the [PUDL Data Dictionaries](https://catalystcoop-pudl.readthedocs.io/en/latest/data_dictionaries/pudl_db.html) in our documentation.
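
Besides table names, the inspector can also report the columns of a table. The sketch below shows this on an in-memory database with one invented table, `plants_demo`, rather than on the real PUDL database.

```python
import sqlalchemy as sa

# In-memory SQLite database with one made-up table.
engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(
        sa.text(
            "CREATE TABLE plants_demo "
            "(plant_id_eia INTEGER PRIMARY KEY, plant_name_eia TEXT)"
        )
    )

inspector = sa.inspect(engine)
table_names = inspector.get_table_names()
# get_columns() returns one dict per column, including its name and type.
column_names = [col["name"] for col in inspector.get_columns("plants_demo")]
print(table_names, column_names)
```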

In [4]:
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
# see all the tables inside of the database
sa.inspect(pudl_engine).get_table_names()

['balancing_authorities_eia',
 'boiler_fuel_eia923',
 'boiler_generator_assn_eia860',
 'boiler_generator_assn_types_eia',
 'boilers_entity_eia',
 'coalmine_eia923',
 'coalmine_msha',
 'coalmine_types_eia',
 'contract_types_eia',
 'data_maturities',
 'energy_sources_eia',
 'epacamd_eia',
 'ferc_accounts',
 'ferc_depreciation_lines',
 'fuel_ferc1',
 'fuel_receipts_costs_aggs_eia',
 'fuel_receipts_costs_eia923',
 'fuel_transportation_modes_eia',
 'fuel_types_aer_eia',
 'generation_eia923',
 'generation_fuel_eia923',
 'generation_fuel_nuclear_eia923',
 'generators_eia860',
 'generators_entity_eia',
 'mine_id_pudl_msha',
 'momentary_interruptions_eia',
 'operational_status_eia',
 'ownership_eia860',
 'plant_in_service_ferc1',
 'plants_eia',
 'plants_eia860',
 'plants_entity_eia',
 'plants_ferc1',
 'plants_hydro_ferc1',
 'plants_pudl',
 'plants_pumped_storage_ferc1',
 'plants_small_ferc1',
 'plants_steam_ferc1',
 'political_subdivisions',
 'power_purchase_types_ferc1',
 'prime_movers_eia',
 

# Reading data with `pandas.read_sql()`
The [pandas.read_sql()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html) method is the simplest way to pull data from an SQL database into a dataframe. You can give it an SQL statement to execute, or just the name of a table to read in its entirety.
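
Both calling styles are sketched below against a throwaway in-memory database. The `gen_demo` table and its values are made up. Wrapping the query in `sqlalchemy.text()` and passing an open connection is one pattern that works across recent Pandas and SQLAlchemy versions; it is not the only option.

```python
import pandas as pd
import sqlalchemy as sa

# Throwaway in-memory database with a small made-up table.
engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(
        sa.text("CREATE TABLE gen_demo (plant_id_eia INTEGER, net_generation_mwh REAL)")
    )
    conn.execute(
        sa.text("INSERT INTO gen_demo VALUES (:plant, :mwh)"),
        [
            {"plant": 1, "mwh": 100.0},
            {"plant": 2, "mwh": 75.0},
            {"plant": 2, "mwh": 25.0},
        ],
    )

with engine.connect() as conn:
    # A bare table name reads every row and column.
    whole_table = pd.read_sql("gen_demo", conn)
    # An SQL statement returns only what the query selects.
    plant_2 = pd.read_sql(
        sa.text("SELECT * FROM gen_demo WHERE plant_id_eia = 2"), conn
    )
print(len(whole_table), len(plant_2))
```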

## Read a whole table
Reading an entire table all at once is easy. It isn't very memory efficient but there's less than 1 GB of data in the PUDL database, so in most cases this is a fine option. Once you've had a chance to poke around at the whole table a bit, you can select the data that's actually of interest out of it for your analysis or visualization.

You can also explore the contents of the database interactively online at https://data.catalyst.coop if you want to familiarize yourself with its contents in a more graphical way first.

In [5]:
generation_df = pd.read_sql("generation_eia923", pudl_engine)
generation_df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 604906 entries, 0 to 604905
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   plant_id_eia        604906 non-null  int64         
 1   generator_id        604906 non-null  object        
 2   report_date         604906 non-null  datetime64[ns]
 3   net_generation_mwh  576382 non-null  float64       
 4   data_maturity       604906 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 84.0 MB


In [6]:
generation_df.sample(10)

Unnamed: 0,plant_id_eia,generator_id,report_date,net_generation_mwh,data_maturity
195414,6073,4ST,2013-09-01,128440.0,final
117657,2866,4,2011-12-01,45640.0,final
126053,8068,ST6S,2011-08-01,40599.0,final
379375,2104,2,2017-10-01,0.0,final
439827,10350,GEN2,2018-06-01,32.0,final
429018,2840,6,2018-09-01,63043.0,final
67494,55176,GEN3,2009-04-01,22727.1,final
484173,7805,ST4,2019-12-01,96456.0,final
210984,54423,ST1,2013-03-01,16291.0,final
523451,3008,3,2020-02-01,0.0,final


## Select specific data using SQL
If you're familiar with SQL, and you already know what subset of the data you want to pull out of the database, you can give Pandas an SQL statement directly, along with the `pudl_engine`, and it will put the results of the SQL statement into a dataframe for you.

For example, the following statement sums the nameplate capacities of generators by power plant, for every generator that reported a capacity in the EIA 860 in 2019, excluding those in Alaska and Hawaii. It sorts the results by capacity with the biggest plants first, and only returns the biggest 1000 plants.

[Compare with the results from our online database](https://data.catalyst.coop/pudl?sql=select%0D%0A++plants.plant_id_eia%2C%0D%0A++plants.plant_name_eia%2C%0D%0A++SUM%28gens.capacity_mw%29+as+plant_capacity_mw%2C%0D%0A++latitude%2C%0D%0A++longitude%0D%0Afrom%0D%0A++generators_eia860+as+gens%0D%0Ajoin%0D%0A++plants_entity_eia+as+plants%0D%0Awhere%0D%0A++plants.plant_id_eia+%3D+gens.plant_id_eia%0D%0A++and+gens.report_date+%3D+%222019-01-01%22%0D%0A++and+plants.state+not+in+%28%22HI%22%2C+%22AK%22%29%0D%0Agroup+by%0D%0A++plants.plant_id_eia%0D%0Aorder+by%0D%0A++plant_capacity_mw+desc).

In [7]:
example_sql = """
SELECT
  plants.plant_id_eia,
  plants.plant_name_eia,
  SUM(gens.capacity_mw) AS plant_capacity_mw,
  latitude,
  longitude
FROM
  generators_eia860 AS gens
JOIN
  plants_entity_eia AS plants
  ON plants.plant_id_eia = gens.plant_id_eia
WHERE
  gens.report_date = '2019-01-01'
  AND plants.state NOT IN ('HI', 'AK')
GROUP BY
  plants.plant_id_eia
ORDER BY
  plant_capacity_mw DESC
LIMIT 1000;
"""
big_plants_df = pd.read_sql(example_sql, pudl_engine)
big_plants_df

Unnamed: 0,plant_id_eia,plant_name_eia,plant_capacity_mw,latitude,longitude
0,6163,Grand Coulee,6809.0,47.957511,-118.977323
1,6043,Martin,6071.5,27.053600,-80.562800
2,628,Crystal River,5303.7,28.965600,-82.697700
3,649,Vogtle,4630.0,33.142700,-81.762500
4,56407,West County Energy Center,4263.0,26.698600,-80.374700
...,...,...,...,...,...
995,389,El Centro Hybrid,438.3,32.802222,-115.540000
996,118,Saguaro,435.5,32.551700,-111.300000
997,63113,Southern Bighorn Solar Hybrid,435.0,36.304793,-114.472800
998,56163,KUCC,434.5,40.711900,-112.122500


# Read tables using the PUDL output layer
Early on in the development of the PUDL database, we found that we were frequently joining the same tables together, and calculating the same derived values in Pandas during our interactive analyses. So we wrote some code to do that work automatically and uniformly. We call this the PUDL Output Layer. It brings in fields like plant and utility names from their home tables, so you have more than just the numeric ID to go by, caches dataframes internally for re-use, and can do some time series aggregation.

These outputs are "denormalized" -- meaning that data will be duplicated in different output tables, and they will contain derived values that don't represent unique information. This structure isn't good inside a database, but it's great for interactive use.

The second notebook in this tutorial is all about the `PudlTabl` objects, which we usually name `pudl_out`. They provide access methods for most of the denormalized tables in PUDL, and for analyses built on top of PUDL tables. See that notebook if you want more detail on the output tables; here is a quick preview.

## Create a PudlTabl output object
The tabular output object needs to know what PUDL database it's connecting to (via the `pudl_engine` argument), and optionally, what time frequency it should aggregate tables on.

In [8]:
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine)
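
To illustrate what aggregating a table by time frequency means, here is a plain Pandas sketch that rolls made-up monthly records up to annual totals. It does not use `PudlTabl` and makes no claim about how PUDL implements its aggregation.

```python
import pandas as pd

# Made-up monthly generation records.
monthly = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2],
        "report_date": pd.to_datetime(
            ["2019-01-01", "2019-02-01", "2020-01-01", "2019-01-01"]
        ),
        "net_generation_mwh": [100.0, 50.0, 30.0, 75.0],
    }
)

# Sum the monthly values within each plant and calendar year.
annual = (
    monthly.assign(report_year=monthly["report_date"].dt.year)
    .groupby(["plant_id_eia", "report_year"], as_index=False)["net_generation_mwh"]
    .sum()
)
print(annual)
```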

## Construct denormalized dataframes
The `PudlTabl` object, called `pudl_out` here, has a bunch of methods corresponding to individual tables within the database. They typically use abbreviated names. Hitting `Tab` will show you a preview of the available methods.

The `gen_eia923()` method corresponds to the `generation_eia923` table in the database, which details the monthly net generation from each generator reporting on the EIA Form 923.

Note: if you re-run the cell, it will complete almost instantly, because the dataframe has been cached inside the `pudl_out` object for later use.
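
The general caching pattern can be sketched with a toy class. `CachedOutputs` and `gen_demo()` are hypothetical names invented for this example; this is not PUDL's actual implementation, only an illustration of why a second call can return immediately.

```python
import pandas as pd


class CachedOutputs:
    """Hypothetical toy class; not PUDL's actual implementation."""

    def __init__(self):
        self._dfs = {}
        self.build_count = 0

    def gen_demo(self):
        # Build the dataframe only on the first call, then reuse it.
        if "gen_demo" not in self._dfs:
            self.build_count += 1
            self._dfs["gen_demo"] = pd.DataFrame(
                {"net_generation_mwh": [100.0, 50.0]}
            )
        return self._dfs["gen_demo"]


outputs = CachedOutputs()
first = outputs.gen_demo()
second = outputs.gen_demo()
print(outputs.build_count, first is second)
```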

In [9]:
%%time
gen_eia923 = pudl_out.gen_eia923()
gen_eia923.info()

2022-12-22 01:21:43 [    INFO] catalystcoop.pudl.transform.eia861:456 Started with 81265 missing BA Codes out of 185357 records (43.84%)
2022-12-22 01:21:45 [    INFO] catalystcoop.pudl.transform.eia861:480 Ended with 14923 missing BA Codes out of 185357 records (8.05%)
2022-12-22 01:21:46 [    INFO] catalystcoop.pudl.output.eia860:177 91.6% of plant records have consistently reported BA Codes
2022-12-22 01:21:46 [    INFO] catalystcoop.pudl.output.eia860:227 Before any filling treatment has been applied. 43.8% of records have no BA codes
2022-12-22 01:21:46 [    INFO] catalystcoop.pudl.output.eia860:227 Backfilling and consistent value is the same. Filled w/ most consistent BA code. 10.9% of records have no BA codes
2022-12-22 01:21:46 [    INFO] catalystcoop.pudl.output.eia860:227 SWPP is most consistent value. Filled w/ oldest BA code. 8.3% of records have no BA codes
2022-12-22 01:21:46 [    INFO] catalystcoop.pudl.output.eia860:227 NWMT is most consistent value. Filled w/ oldest B

<class 'pandas.core.frame.DataFrame'>
Int64Index: 604594 entries, 0 to 604593
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   report_date         604594 non-null  datetime64[ns]
 1   plant_id_eia        604594 non-null  Int64         
 2   plant_id_pudl       604594 non-null  Int64         
 3   plant_name_eia      604594 non-null  string        
 4   utility_id_eia      604594 non-null  Int64         
 5   utility_id_pudl     604594 non-null  Int64         
 6   utility_name_eia    604594 non-null  string        
 7   generator_id        604594 non-null  string        
 8   data_maturity       604594 non-null  string        
 9   net_generation_mwh  576227 non-null  float64       
 10  unit_id_pudl        563082 non-null  Int64         
dtypes: Int64(5), datetime64[ns](1), float64(1), string(4)
memory usage: 58.2 MB
CPU times: user 11.7 s, sys: 705 ms, total: 12.4 s
Wall time: 12.8 

In [10]:
gen_eia923.sample(10)

Unnamed: 0,report_date,plant_id_eia,plant_id_pudl,plant_name_eia,utility_id_eia,utility_id_pudl,utility_name_eia,generator_id,data_maturity,net_generation_mwh,unit_id_pudl
144623,2012-02-01,708,246,Hammond,7140,123,Georgia Power Co,3,final,35934.0,3
53996,2009-07-01,6147,2516,Monticello,19323,4091,Luminant Generation Company LLC,2,final,394001.0,2
85325,2010-09-01,3497,2329,Big Brown Power Company LLC,19323,4091,Luminant Generation Company LLC,1,final,397727.0,1
225599,2014-02-01,1082,628,Council Bluffs,12341,185,MidAmerican Energy Co,2,final,38152.0,2
101660,2010-10-01,54752,4194,Pine Hill Op,56291,2125,International Paper Pine Hill,NO2,final,5380.0,2
380538,2017-09-01,2500,2060,Ravenswood,61130,5532,"Helix Ravenswood, LLC",4,final,111917.0,4
541385,2020-08-01,52089,3896,Celco,57151,830,Celanese Acetate LLC,GEN4,final,2927.0,1
594537,2021-12-01,55327,4474,Baytown Energy Center LP,2838,591,Calpine Central LP,CTG3,final,80200.0,1
270128,2014-11-01,57908,6056,Central Utility Plant Cincinnati,57281,3582,University of Cincinnati,CTG2,final,522.83,1
189958,2013-01-01,2866,2171,W H Sammis,6526,1729,FirstEnergy,4,final,87986.0,4


## Compare with the normalized DB table
The denormalized version of the table above includes fields like `utility_name_eia`, `plant_name_eia`, and `plant_id_pudl`, which are all useful but aren't fundamentally part of this table -- they can all be looked up in other tables based on the value of `plant_id_eia` found in the original `generation_eia923` table, so storing them in this table would mean duplicating data. You can see what the original table looks like below.

Note also that since we're going back to the database directly rather than accessing the cached dataframe within the `pudl_out` object, this query has to read the table from disk again, so it won't complete almost instantly the way a repeated call to `pudl_out.gen_eia923()` does.

In [11]:
%%time
gen_eia923_normalized = pd.read_sql("generation_eia923", pudl_engine)
gen_eia923_normalized.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 604906 entries, 0 to 604905
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   plant_id_eia        604906 non-null  int64         
 1   generator_id        604906 non-null  object        
 2   report_date         604906 non-null  datetime64[ns]
 3   net_generation_mwh  576382 non-null  float64       
 4   data_maturity       604906 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 23.1+ MB
CPU times: user 1.5 s, sys: 54.5 ms, total: 1.55 s
Wall time: 1.55 s


In [12]:
gen_eia923_normalized.sample(10)

Unnamed: 0,plant_id_eia,generator_id,report_date,net_generation_mwh,data_maturity
478641,3470,2,2019-12-01,-189.0,final
45048,2161,1,2009-01-01,12442.0,final
147336,1295,ST1,2012-03-01,48886.0,final
520354,1912,9,2020-01-01,140338.0,final
17384,4050,3,2008-08-01,30524.0,final
307651,54562,7,2015-10-01,15703.0,final
179526,377,5,2013-09-01,3839.0,final
2844,628,2,2008-11-01,235677.0,final
130070,50006,STG1,2011-05-01,26179.0,final
489181,10746,GEN1,2019-04-01,55234.0,final


# Raw FERC Data
* You may have noticed that there are other SQLite databases showing up in the `pudl_settings` above.
* In addition to the cleaned and normalized PUDL data, there are also several databases built from FERC data.
* Historically FERC has published their data using the (now obsolete) Visual FoxPro (DBF) database format.
* In 2021 they switched to using a dialect of XML known as XBRL (the eXtensible Business Reporting Language).
* We've extracted the raw DBF and XBRL data for FERC 1 and converted it into SQLite databases which are easier to access.
* For the other FERC Forms (2, 6, 60, and 714) we have yet to fully integrate all of the historical data.

In [13]:
pudl_settings

{'pudl_in': '/Users/zane/code/catalyst/pudl-work',
 'data_dir': '/Users/zane/code/catalyst/pudl-work/data',
 'settings_dir': '/Users/zane/code/catalyst/pudl-work/settings',
 'pudl_out': '/Users/zane/code/catalyst/pudl-work',
 'sqlite_dir': '/Users/zane/code/catalyst/pudl-work/sqlite',
 'parquet_dir': '/Users/zane/code/catalyst/pudl-work/parquet',
 'ferc1_db': 'sqlite:////Users/zane/code/catalyst/pudl-work/sqlite/ferc1.sqlite',
 'ferc1_xbrl_db': 'sqlite:////Users/zane/code/catalyst/pudl-work/sqlite/ferc1_xbrl.sqlite',
 'ferc1_xbrl_datapackage': PosixPath('/Users/zane/code/catalyst/pudl-work/sqlite/ferc1_xbrl_datapackage.json'),
 'ferc1_xbrl_taxonomy_metadata': PosixPath('/Users/zane/code/catalyst/pudl-work/sqlite/ferc1_xbrl_taxonomy_metadata.json'),
 'ferc2_xbrl_db': 'sqlite:////Users/zane/code/catalyst/pudl-work/sqlite/ferc2_xbrl.sqlite',
 'ferc2_xbrl_datapackage': PosixPath('/Users/zane/code/catalyst/pudl-work/sqlite/ferc2_xbrl_datapackage.json'),
 'ferc2_xbrl_taxonomy_metadata': Posi

## FERC Form 1: Direct vs. PUDL
* The PUDL database contains a tiny fraction of the data available in the original FERC Form 1.
* We have only taken the time to clean a handful of the FERC tables. The original FERC Form 1 data is often very messy and poorly organized.
* However, if you need to access one of the original 113 tables that we haven't integrated yet, they're all available, going back to 1994.
* The original tables are only accessible via direct queries (either using SQL or pulling whole tables) from the original FERC Form 1 databases, so you'll have to use the `pandas.read_sql()` methods outlined above.
* If there are particular tables within the FERC Form 1 that you think are important to get cleaned up, let us know so we can prioritize them going forward!

## Explore the raw FERC Form 1 DBF data (1994-2020)

In [14]:
ferc1_dbf_engine = sa.create_engine(pudl_settings["ferc1_db"])
# see all the tables inside of the database
sa.inspect(ferc1_dbf_engine).get_table_names()

['f1_106_2009',
 'f1_106a_2009',
 'f1_106b_2009',
 'f1_208_elc_dep',
 'f1_231_trn_stdycst',
 'f1_324_elc_expns',
 'f1_325_elc_cust',
 'f1_331_transiso',
 'f1_338_dep_depl',
 'f1_397_isorto_stl',
 'f1_398_ancl_ps',
 'f1_399_mth_peak',
 'f1_400_sys_peak',
 'f1_400a_iso_peak',
 'f1_429_trans_aff',
 'f1_acb_epda',
 'f1_accumdepr_prvsn',
 'f1_accumdfrrdtaxcr',
 'f1_adit_190_detail',
 'f1_adit_190_notes',
 'f1_adit_amrt_prop',
 'f1_adit_other',
 'f1_adit_other_prop',
 'f1_allowances',
 'f1_allowances_nox',
 'f1_audit_log',
 'f1_bal_sheet_cr',
 'f1_capital_stock',
 'f1_cash_flow',
 'f1_cmmn_utlty_p_e',
 'f1_cmpinc_hedge',
 'f1_cmpinc_hedge_a',
 'f1_co_directors',
 'f1_codes_val',
 'f1_col_lit_tbl',
 'f1_comp_balance_db',
 'f1_construction',
 'f1_control_respdnt',
 'f1_cptl_stk_expns',
 'f1_csscslc_pcsircs',
 'f1_dacs_epda',
 'f1_dscnt_cptl_stk',
 'f1_edcfu_epda',
 'f1_elc_op_mnt_expn',
 'f1_elc_oper_rev_nb',
 'f1_elctrc_erg_acct',
 'f1_elctrc_oper_rev',
 'f1_electric',
 'f1_email',
 'f1_envrn

## FERC 1 Large Steam Plants (DBF)

In [15]:
pd.read_sql("f1_steam", ferc1_dbf_engine)

Unnamed: 0,respondent_id,report_year,spplmnt_num,row_number,row_seq,row_prvlg,plant_name,plant_kind,type_const,yr_const,yr_installed,tot_capacity,peak_demand,plant_hours,plnt_capability,when_not_limited,when_limited,avg_num_of_emp,net_generation,cost_land,cost_structure,cost_equipment,cost_of_plant_to,cost_per_kw,expns_operations,expns_fuel,expns_coolants,expns_steam,expns_steam_othr,expns_transfer,expns_electric,expns_misc_power,expns_rents,expns_allowances,expns_engnr,expns_structures,expns_boiler,expns_plants,expns_misc_steam,tot_prdctn_expns,expns_kwh,plant_name_f,plant_kind_f,type_const_f,yr_const_f,yr_installed_f,tot_capacity_f,peak_demand_f,plant_hours_f,plnt_capability_f,when_not_limited_f,when_limited_f,avg_num_of_emp_f,net_generation_f,cost_land_f,cost_structure_f,cost_equipment_f,cost_of_plant_to_f,cost_per_kw_f,expns_operations_f,expns_fuel_f,expns_coolants_f,expns_steam_f,expns_steam_othr_f,expns_transfer_f,expns_electric_f,expns_misc_power_f,expns_rents_f,expns_allowances_f,expns_engnr_f,expns_structures_f,expns_boiler_f,expns_plants_f,expns_misc_steam_f,tot_prdctn_expns_f,expns_kwh_f,asset_retire_cost,asset_retire_cost_f,report_prd
0,1,1994,0,1,1,N,Rockport Unit 1,Steam,Conventional,1984,1984,650.0,650.0,,,650.0,,,4.668184e+09,6395551.0,84467746.0,4.906841e+08,5.815474e+08,894.6883,1032559.0,51694529.0,,442763.0,,,353599.0,1040610.0,7559.0,,427906.0,396788.0,3185935.0,631598.0,781181.0,59995027.0,0.0129,29,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,12
1,1,1994,0,2,2,N,Rockport Unit 2,Steam,Conventional,1989,1989,650.0,650.0,,,650.0,,,4.451312e+09,74411.0,4249136.0,3.933937e+07,4.366292e+07,67.1737,1026248.0,48990225.0,,446454.0,,,384283.0,1028788.0,67311927.0,,427747.0,230300.0,3374827.0,518870.0,255391.0,123995060.0,0.0279,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,12
2,1,1994,0,3,3,N,Rockport,Steam,Conventional,1984,1989,1300.0,1300.0,,,1300.0,,,9.119496e+09,6469962.0,88716882.0,5.300235e+08,6.252103e+08,480.9310,2058807.0,100684754.0,,889217.0,,,737882.0,2069398.0,67319486.0,,855653.0,627088.0,6560762.0,1150468.0,1036572.0,183990087.0,0.0202,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,12
3,1,1994,0,4,4,N,Rockport Total Plant,Steam,Conventional,1984,1989,2600.0,2600.0,,,2600.0,,462.0,1.779316e+10,12969249.0,175466216.0,1.049180e+09,1.237616e+09,476.0061,4117640.0,196297854.0,,1778431.0,,,1475766.0,4138807.0,134884608.0,,1711307.0,1254169.0,13121517.0,2300937.0,2073142.0,363154178.0,0.0204,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,12
4,2,1994,0,1,1,N,GORGAS,Steam,Conventional,1929,1972,1417.0,1294.0,8760.0,,1302.0,,438.0,7.412375e+09,312098.0,63796151.0,3.273578e+08,3.914661e+08,276.2640,3065839.0,118304925.0,,2692720.0,,,1391099.0,7506206.0,,,2276025.0,1451092.0,17760784.0,5957567.0,645822.0,161052079.0,0.0217,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51765,193,2020,2,1,1,,VALLEY-TOTAL,Steam,Conventional,1968,1969,272.0,261.0,8779.0,263.0,280.0,263.0,52.0,4.431440e+08,5078035.0,16262496.0,2.089400e+08,2.287576e+08,841.0206,841282.0,23294256.0,,3361865.0,,-6359952.0,1159732.0,1461929.0,,,829890.0,534891.0,1236439.0,2799110.0,1695047.0,30854489.0,0.0696,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1522935.0,0,12
51766,193,2020,2,2,2,,,,,,,0.0,,,,,,,,,,,,0.0000,,,,,,,,,,,,,,,,,0.0000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,12
51767,193,2020,2,3,3,,,,,,,0.0,,,,,,,,,,,,0.0000,,,,,,,,,,,,,,,,,0.0000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,12
51768,193,2020,2,4,4,,,,,,,0.0,,,,,,,,,,,,0.0000,,,,,,,,,,,,,,,,,0.0000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,12


## Explore the raw FERC Form 1 XBRL data (2021-onward)

In [16]:
ferc1_xbrl_engine = sa.create_engine(pudl_settings["ferc1_xbrl_db"])
# see all the tables inside of the database
sa.inspect(ferc1_xbrl_engine).get_table_names()

['accumulated_deferred_income_taxes_accelerated_amortization_property_account_281_classified_by_tax_types_272_duration',
 'accumulated_deferred_income_taxes_accelerated_amortization_property_account_281_classified_by_tax_types_272_instant',
 'accumulated_deferred_income_taxes_accelerated_amortization_property_account_281_classified_by_utility_types_272_duration',
 'accumulated_deferred_income_taxes_accelerated_amortization_property_account_281_classified_by_utility_types_272_instant',
 'accumulated_deferred_income_taxes_accelerated_amortization_property_account_281_other_272_duration',
 'accumulated_deferred_income_taxes_accelerated_amortization_property_account_281_other_272_instant',
 'accumulated_deferred_income_taxes_accelerated_amortization_property_account_281_other_electric_272_duration',
 'accumulated_deferred_income_taxes_accelerated_amortization_property_account_281_other_electric_272_instant',
 'accumulated_deferred_income_taxes_accelerated_amortization_property_account_281_

## FERC 1 Large Steam Plants (XBRL)

In [17]:
pd.read_sql("steam_electric_generating_plant_statistics_large_plants_402_duration", ferc1_xbrl_engine)

Unnamed: 0,index,entity_id,filing_name,start_date,end_date,plant_name_axis,allowances,year_plant_originally_constructed,maintenance_of_boiler_plant_steam_power_generation,maintenance_supervision_and_engineering_steam_power_generation,power_production_expenses_steam_power,net_generation_excluding_plant_use,maintenance_of_structures_steam_power_generation,cost_per_kilowatt_of_installed_capacity,maintenance_of_electric_plant_steam_power_generation,plant_hours_connected_to_load,rents_steam_power_generation,net_peak_demand_on_plant,expenses_per_net_kilowatt_hour,steam_transferred_credit,net_continuous_plant_capability_limited_by_condenser_water,maintenance_of_miscellaneous_steam_plant,net_continuous_plant_capability_not_limited_by_condenser_water,plant_name,steam_from_other_sources,steam_expenses_steam_power_generation,order_number,electric_expenses_steam_power_generation,net_continuous_plant_capability,fuel_steam_power_generation,operation_supervision_and_engineering_expense,year_last_unit_of_plant_installed,miscellaneous_steam_power_expenses,plant_kind,plant_construction_type,plant_average_number_of_employees,coolants_and_water
0,0,C008999,66ce7efc-3e5d-4897-84f6-3cfe4cecf4b9,2021-01-01,2021-12-31,Attala,953.0,2001,6516.0,64662.0,76267610.0,2.492537e+09,146331.0,249.0000,3023821.0,5831.0,81048.0,525.0,0.0306,0.0,459.0,269815.0,501.0,Attala,0.0,3620.0,1.0,602457.0,0.0,66170198.0,1027157.0,2001,4871032.0,Gas Turbine,,30.0,0.0
1,1,C008999,66ce7efc-3e5d-4897-84f6-3cfe4cecf4b9,2021-01-01,2021-12-31,Baxter Wilson,12780.0,1966,1297075.0,33005.0,38316816.0,7.511598e+08,153941.0,183.0000,353605.0,2365.0,4984.0,513.0,0.0510,0.0,494.0,348093.0,559.0,Baxter Wilson,0.0,825224.0,2.0,375003.0,0.0,31857250.0,751249.0,1971,2304607.0,Steam,Outdoor Boiler,28.0,0.0
2,2,C008999,66ce7efc-3e5d-4897-84f6-3cfe4cecf4b9,2021-01-01,2021-12-31,Choctaw,1432.0,2019,11030.0,49296.0,126550900.0,3.885800e+09,609.0,560.0000,111.0,5429.0,-4577.0,934.0,0.0326,,787.0,15899786.0,848.0,Choctaw,,6126.0,3.0,6458.0,0.0,109062385.0,469544.0,2019,1048700.0,Combined Cycle,Outdoor,30.0,0.0
3,3,C008999,66ce7efc-3e5d-4897-84f6-3cfe4cecf4b9,2021-01-01,2021-12-31,Gerald Andrus,1552.0,1975,3816465.0,312387.0,23478206.0,1.751010e+08,380544.0,260.0000,1696877.0,580.0,27136.0,727.0,0.1341,,728.0,493532.0,750.0,Gerald Andrus,,1200616.0,4.0,22294.0,0.0,11006127.0,2108940.0,1975,2411736.0,Steam,Outdoor Boiler,32.0,
4,4,C008999,66ce7efc-3e5d-4897-84f6-3cfe4cecf4b9,2021-01-01,2021-12-31,Hinds,995.0,2001,6842.0,32167.0,104517867.0,3.488764e+09,196403.0,555.0000,2766469.0,7920.0,5540.0,553.0,0.0300,,482.0,575657.0,526.0,Hinds,,3800.0,5.0,215237.0,0.0,95086745.0,534657.0,2001,5093355.0,Gas Turbine,,35.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1050,6,C000500,8c54a597-27e8-4877-8801-70e1705b7623,2021-01-01,2021-12-31,PULLIAM 31,0.0,2003,0.0,10998.0,6630488.0,9.016683e+07,4524.0,397.5096,326289.0,1617.0,0.0,95.0,0.0735,0.0,79.0,0.0,106.0,PULLIAM 31,0.0,0.0,,37506.0,82.0,6028340.0,11172.0,2003,211659.0,Combustion Turbine,Conventional,3.0,0.0
1051,7,C000500,8c54a597-27e8-4877-8801-70e1705b7623,2021-01-01,2021-12-31,TWO CREEKS,0.0,2020,0.0,0.0,1675365.0,1.913110e+08,0.0,1465.3710,1252982.0,4158.0,331266.0,100.0,0.0088,0.0,,0.0,,TWO CREEKS,0.0,0.0,,0.0,100.0,0.0,0.0,2020,91117.0,Solar,Conventional,1.0,0.0
1052,8,C000500,8c54a597-27e8-4877-8801-70e1705b7623,2021-01-01,2021-12-31,WEST MARINETTE,0.0,1971,0.0,17673.0,6002419.0,6.638229e+07,10062.0,201.7359,301806.0,1234.0,0.0,151.0,0.0904,0.0,154.0,0.0,204.0,WEST MARINETTE,0.0,0.0,,143226.0,154.0,5399094.0,80666.0,1993,49892.0,Combustion Turbine,Conventional,3.0,0.0
1053,9,C000500,8c54a597-27e8-4877-8801-70e1705b7623,2021-01-01,2021-12-31,"WESTON W31, W32",0.0,1969,0.0,4554.0,1454328.0,1.287767e+07,11394.0,111.5330,258877.0,422.0,0.0,57.0,0.1129,0.0,63.0,0.0,88.0,"WESTON W31, W32",0.0,0.0,,0.0,69.0,1160318.0,4632.0,1973,14553.0,Combustion Turbine,Conventional,0.0,0.0


In [18]:
pd.read_sql("steam_electric_generating_plant_statistics_large_plants_402_instant", ferc1_xbrl_engine)

Unnamed: 0,index,entity_id,filing_name,date,plant_name_axis,cost_of_land_and_land_rights_steam_production,cost_of_equipment_steam_production,cost_of_plant,asset_retirement_costs_steam_production,cost_of_structures_and_improvements_steam_production,installed_capacity_of_plant
0,0,C008999,66ce7efc-3e5d-4897-84f6-3cfe4cecf4b9,2021-12-31,Attala,128931.0,1.202509e+08,1.371233e+08,0.0,16743407.0,551.00
1,1,C008999,66ce7efc-3e5d-4897-84f6-3cfe4cecf4b9,2021-12-31,Baxter Wilson,359115.0,8.642645e+07,9.979924e+07,136135.0,12877536.0,545.00
2,2,C008999,66ce7efc-3e5d-4897-84f6-3cfe4cecf4b9,2021-12-31,Choctaw,2761498.0,2.918589e+08,5.031417e+08,0.0,208521282.0,899.00
3,3,C008999,66ce7efc-3e5d-4897-84f6-3cfe4cecf4b9,2021-12-31,Gerald Andrus,331301.0,1.719204e+08,2.032104e+08,,30958749.0,781.00
4,4,C008999,66ce7efc-3e5d-4897-84f6-3cfe4cecf4b9,2021-12-31,Hinds,1362657.0,2.309114e+08,3.332237e+08,0.0,100949699.0,600.00
...,...,...,...,...,...,...,...,...,...,...,...
961,6,C000500,8c54a597-27e8-4877-8801-70e1705b7623,2021-12-31,PULLIAM 31,0.0,3.454014e+07,3.615350e+07,0.0,1613363.0,90.95
962,7,C000500,8c54a597-27e8-4877-8801-70e1705b7623,2021-12-31,TWO CREEKS,437227.0,1.315771e+08,1.465371e+08,9813386.0,4709377.0,100.00
963,8,C000500,8c54a597-27e8-4877-8801-70e1705b7623,2021-12-31,WEST MARINETTE,267961.0,2.919127e+07,3.776496e+07,0.0,8305732.0,187.20
964,9,C000500,8c54a597-27e8-4877-8801-70e1705b7623,2021-12-31,WESTON,4190703.0,1.085674e+09,1.341926e+09,2157039.0,249904375.0,804.30
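
The `_duration` and `_instant` tables describe the same plants, so it can be handy to view them side by side. The sketch below merges two tiny hand-built dataframes whose column names and values are copied from the outputs above, except for `filing_name`, which is shortened to a placeholder. The choice of join keys (`entity_id`, `filing_name`, `plant_name_axis`, and `end_date` matched to `date`) is an assumption based on the visible columns, not a documented PUDL procedure.

```python
import pandas as pd

# Values taken from the first two rows shown above; "filing-1" is a
# shortened placeholder for the long filing identifier.
duration = pd.DataFrame(
    {
        "entity_id": ["C008999", "C008999"],
        "filing_name": ["filing-1", "filing-1"],
        "start_date": ["2021-01-01", "2021-01-01"],
        "end_date": ["2021-12-31", "2021-12-31"],
        "plant_name_axis": ["Attala", "Baxter Wilson"],
        "net_generation_excluding_plant_use": [2.492537e09, 7.511598e08],
    }
)
instant = pd.DataFrame(
    {
        "entity_id": ["C008999", "C008999"],
        "filing_name": ["filing-1", "filing-1"],
        "date": ["2021-12-31", "2021-12-31"],
        "plant_name_axis": ["Attala", "Baxter Wilson"],
        "installed_capacity_of_plant": [551.0, 545.0],
    }
)

# Assumed keys: same respondent, filing and plant, with the instant date
# falling on the last day of the duration period.
combined = duration.merge(
    instant,
    left_on=["entity_id", "filing_name", "plant_name_axis", "end_date"],
    right_on=["entity_id", "filing_name", "plant_name_axis", "date"],
    how="outer",
    validate="one_to_one",
)
print(combined[["plant_name_axis", "net_generation_excluding_plant_use",
                "installed_capacity_of_plant"]])
```

An outer merge is used because plant names in the two tables do not always line up (compare `WESTON W31, W32` with `WESTON` in the last rows above), and unmatched rows are worth inspecting rather than silently dropping.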
