<a href="https://www.kaggle.com/code/catalystcooperative/01-pudl-data-access?scriptVersionId=145889607" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import sys

print(f"Python version: {sys.version}")
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import sqlalchemy as sa

print(f"{np.__version__=}")
print(f"{pd.__version__=}")
print(f"{sa.__version__=}")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in sorted(filenames):
        print(os.path.join(dirname, filename))


# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import pathlib
pudl_path = pathlib.Path("/kaggle/input/pudl-project")

Python version: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
np.__version__='1.23.5'
pd.__version__='2.0.3'
sa.__version__='2.0.17'
/kaggle/input/pudl-project/censusdp1tract.sqlite
/kaggle/input/pudl-project/ferc1.sqlite
/kaggle/input/pudl-project/ferc1_xbrl.sqlite
/kaggle/input/pudl-project/ferc1_xbrl_datapackage.json
/kaggle/input/pudl-project/ferc1_xbrl_taxonomy_metadata.json
/kaggle/input/pudl-project/ferc2.sqlite
/kaggle/input/pudl-project/ferc2_xbrl.sqlite
/kaggle/input/pudl-project/ferc2_xbrl_datapackage.json
/kaggle/input/pudl-project/ferc2_xbrl_taxonomy_metadata.json
/kaggle/input/pudl-project/ferc6.sqlite
/kaggle/input/pudl-project/ferc60.sqlite
/kaggle/input/pudl-project/ferc60_xbrl.sqlite
/kaggle/input/pudl-project/ferc60_xbrl_datapackage.json
/kaggle/input/pudl-project/ferc60_xbrl_taxonomy_metadata.json
/kaggle/input/pudl-project/ferc6_xbrl.sqlite
/kaggle/input/pudl-project/ferc6_xbrl_datapackage.json
/kaggle/input/pudl-project/ferc6_xbrl_

### Visualization settings

In [2]:
import matplotlib

In [3]:
%matplotlib inline

In [4]:
matplotlib.rcParams["figure.figsize"] = (16, 10)
matplotlib.rcParams["figure.dpi"] = 150
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_colwidth", 1000)

# Accessing the PUDL SQLite Database
- Most of the PUDL Project data is distributed using SQLite databases.
- Python, pandas, and many other libraries have built-in support for reading data from SQLite.
- We can use the SQLAlchemy library to create a connection to the database that pandas can use.
- Documentation of the available tables and columns can be found in the [PUDL Data Dictionary](https://catalystcoop-pudl.readthedocs.io/en/latest/data_dictionaries/pudl_db.html).

In [5]:
pudl_engine = sa.create_engine(f"sqlite:///{pudl_path}/pudl.sqlite")
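
Before reading a table, it can help to see which tables a SQLite file contains. The following is a minimal sketch using only the standard library `sqlite3` module. It runs against a small in-memory stand-in database, and the helper name `list_tables` and the two empty tables are created here purely for illustration. With the real data you would connect to the `pudl.sqlite` file instead.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in a SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database (an assumption for illustration, not the real PUDL DB).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE denorm_plants_eia (plant_id_eia INTEGER, plant_name_eia TEXT)")
conn.execute("CREATE TABLE denorm_plants_steam_ferc1 (plant_id_ferc1 INTEGER)")

tables = list_tables(conn)
print(tables)
```

Any of the names returned this way can be passed as the first argument to `pd.read_sql`, as in the cells that follow.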

## Read EIA plant data from the PUDL SQLite Database
* There are lots of different kinds of data in the PUDL DB.
* Some tables describe attributes of utilities, plants, generators, and balancing authorities.
* Other tables contain hourly, monthly, or yearly time series of fuel consumed, operating costs, or electricity generated.
* We'll read the EIA plants table and call `.convert_dtypes()` so that columns get pandas' nullable types (such as `Int64`, `Float64`, and `string`) rather than generic `object` columns, since SQLite's data types aren't as rich as those available in pandas.
* Power plants are industrial facilities operated by a single utility, but they can have multiple owners, and host multiple generation units.
* The `plants_eia` table contains only information that pertains to all of the equipment at the plant.
* Most of these attributes are relatively stable, but they can change slowly over time, so each plant has one record for each reporting year (the `report_date` column).
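
To make the effect of `.convert_dtypes()` concrete, here is a small sketch on a hypothetical two-column frame that imitates what a SQLite read can produce when values are missing. The data is invented for illustration, and the dtypes described in the comments are those of pandas 2.x.

```python
import pandas as pd

# Hypothetical stand-in for a table read from SQLite. The missing value
# forces the ID column to float64 and leaves the name column as a generic
# object column (behavior as of pandas 2.x).
raw = pd.DataFrame(
    {
        "plant_id_eia": [470, 6145, None],
        "plant_name_eia": ["Comanche", "Comanche Peak", None],
    }
)

# convert_dtypes() picks the best nullable dtype for each column, so the
# IDs become integers again while the missing value is preserved as <NA>.
converted = raw.convert_dtypes()

print(raw.dtypes)
print(converted.dtypes)
```

The nullable integer type matters for ID columns in particular: without it, an ID such as `470` would display as `470.0` whenever the column contains a missing value.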

In [6]:
%%time
plants_eia = pd.read_sql("denorm_plants_eia", pudl_engine).convert_dtypes()
plants_eia.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200177 entries, 0 to 200176
Data columns (total 54 columns):
 #   Column                                        Non-Null Count   Dtype         
---  ------                                        --------------   -----         
 0   plant_id_eia                                  200177 non-null  Int64         
 1   plant_name_eia                                199398 non-null  string        
 2   city                                          190994 non-null  string        
 3   county                                        190479 non-null  string        
 4   latitude                                      192211 non-null  Float64       
 5   longitude                                     194840 non-null  Float64       
 6   state                                         199317 non-null  string        
 7   street_address                                184478 non-null  string        
 8   zip_code                                      193575 n

In [7]:
plants_eia.sample(10)

Unnamed: 0,plant_id_eia,plant_name_eia,city,county,latitude,longitude,state,street_address,zip_code,timezone,report_date,ash_impoundment,ash_impoundment_lined,ash_impoundment_status,balancing_authority_code_eia,balancing_authority_name_eia,datum,energy_storage,ferc_cogen_docket_no,ferc_cogen_status,ferc_exempt_wholesale_generator_docket_no,ferc_exempt_wholesale_generator,ferc_small_power_producer_docket_no,ferc_small_power_producer,ferc_qualifying_facility_docket_no,grid_voltage_1_kv,grid_voltage_2_kv,grid_voltage_3_kv,iso_rto_code,liquefied_natural_gas_storage,natural_gas_local_distribution_company,natural_gas_storage,natural_gas_pipeline_name_1,natural_gas_pipeline_name_2,natural_gas_pipeline_name_3,nerc_region,net_metering,pipeline_notes,primary_purpose_id_naics,regulatory_status_code,reporting_frequency_code,sector_id_eia,sector_name_eia,service_area,transmission_distribution_owner_id,transmission_distribution_owner_name,transmission_distribution_owner_state,utility_id_eia,water_source,data_maturity,plant_id_pudl,utility_name_eia,utility_id_pudl,balancing_authority_code_eia_consistent_rate
5999,393,Control Gorge,Bishop,Inyo,37.437921,-118.557,CA,1501 Gorge Rd.,93514,America/Los_Angeles,2007-01-01,,,,LDWP,,,,,False,,False,,False,,,,,,,,,,,,WECC,,,22.0,RE,,1.0,Electric Utility,,11208.0,Los Angeles City of,CA,11208,Middle Gorge Tailbay,final,1396,Los Angeles Department of Water & Power,2303,1.0
187015,62830,GSPP Imholte CSG,St. Cloud,Stearns,45.557945,-94.16324,MN,8158 County Rd 138 St,56301,America/Chicago,2023-01-01,,,,MISO,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,62719,,monthly_update,12959,Green Street Power Partners,6627,1.0
192847,64077,Hecate Energy Albany County 1,Coeymans,Albany,42.515459,-73.83345,NY,TBD,12143,America/New_York,2021-01-01,,,,NYIS,New York Independent System Operator,,False,,False,,False,,False,,115.0,,,,,,,,,,NPCC,,,22.0,NR,,2.0,IPP Non-CHP,,13573.0,Niagara Mohawk Power Corp.,NY,63715,,final,14697,Hecate Energy Albany 1 LLC,8136,1.0
160701,59228,Leonardo Wind 1 LLC,Alden,Hardin,42.498056,-93.36778,IA,140th St & G Ave,50441,America/Chicago,2016-01-01,False,False,,MISO,"Midcontinent Independent Transmission System Operator, Inc..",,False,,False,,False,QF13-642-001,True,,12.47,,,,,,,,,,MRO,,,22.0,NR,,2.0,,,9417.0,Interstate Power and Light Co,IA,59022,,final,9091,Leonardo Wind 1 LLC,5140,1.0
108030,55210,Afton Generating Station,La Mesa,Dona Ana,32.114185,-106.8465,NM,10100 West Afton Road,88044,America/Denver,2022-01-01,False,False,,PNM,Public Service Company of New Mexico,,False,,False,,False,,False,,345.0,,,,False,,False,EL PASO NATURAL GAS COMPANY LLC,,,WECC,,,22.0,RE,M,1.0,Electric Utility,,5701.0,El Paso Electric Co,NM,15473,Wells,provisional,11,Public Service Co of NM,274,1.0
15494,1109,Keokuk,Keokuk,Lee,40.3965,-91.3719,IA,523 North Water Street,52632,America/Chicago,2018-01-01,,False,,MISO,"Midcontinent Independent Transmission System Operator, Inc..",,False,,False,,False,,False,,69.0,,,,,,,,,,SERC,,,22.0,RE,A,1.0,Electric Utility,,19436.0,Union Electric Co - (MO),MO,19436,Mississippi River,final,1041,Union Electric Co - (MO),334,1.0
122626,56181,Genoa Diesel Generating Station,Genoa,Ottawa,41.5092,-83.3606,OH,Industrial Park Road,43430,America/New_York,2014-01-01,,False,,PJM,"PJM Interconnection, LLC",,,,False,,False,,False,,12.47,,,,,,,,,,RFC,,,22.0,RE,,1.0,Electric Utility,,7100.0,Village of Genoa - (OH),OH,7100,,final,4806,Village of Genoa - (OH),3675,1.0
172348,60549,"ZV Solar 3, LLC",Red Springs,Robeson,34.816182,-79.20022,NC,72 Milk Dairy Road,28377,America/New_York,2018-01-01,False,False,,DUK,Duke Energy Carolinas,,False,,False,,False,15-694-002,True,,23.0,,,,False,,False,,,,SERC,,,22.0,NR,A,2.0,NAICS-22 Non-Cogen,,3046.0,Duke Energy Progress - (NC),NC,61119,,final,9453,"SRE Utility Solar 1, LLC",5830,1.0
106732,55133,West Group Generator Building,Eagan,,44.8196,-93.1156,MN,610 Opperman Dr,55123,America/Chicago,2012-01-01,,,,MISO,,,,,False,,False,,False,,13.8,,,MISO,,,,,,,MRO,,,323.0,NR,,6.0,Industrial NAICS Non-Cogen,,25177.0,Dakota Electric Association,MN,18981,Municipality,final,4372,Thomson Corp,3465,1.0
34945,2783,Buxton,Buxton,Dare,35.267222,-75.53472,NC,Lightplant Road,27920,America/New_York,2003-01-01,,,,PJM,,,,,False,,False,,False,,,,,,,,,,,,SERC,,,22.0,,,1.0,Electric Utility,,,,,13683,,final,2159,North Carolina El Member Corp,2639,1.0
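
Because each plant appears once per reporting year, an analysis that needs a single row per plant has to pick one of them. One option is to keep the most recent record, which can be expressed directly in SQL. The sketch below uses the standard library `sqlite3` module and a tiny in-memory stand-in table. The rows are invented, and it assumes `report_date` is stored as ISO-formatted text so that `MAX` picks the latest date.

```python
import sqlite3

LATEST_PLANT_RECORDS = """
    SELECT p.plant_id_eia, p.plant_name_eia, p.report_date
    FROM denorm_plants_eia AS p
    JOIN (
        SELECT plant_id_eia, MAX(report_date) AS report_date
        FROM denorm_plants_eia
        GROUP BY plant_id_eia
    ) AS latest
      ON p.plant_id_eia = latest.plant_id_eia
     AND p.report_date = latest.report_date
    ORDER BY p.plant_id_eia
"""

# Stand-in table with invented rows (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE denorm_plants_eia "
    "(plant_id_eia INTEGER, plant_name_eia TEXT, report_date TEXT)"
)
conn.executemany(
    "INSERT INTO denorm_plants_eia VALUES (?, ?, ?)",
    [
        (470, "Comanche", "2020-01-01"),
        (470, "Comanche", "2021-01-01"),
        (393, "Control Gorge", "2007-01-01"),
    ],
)

latest = conn.execute(LATEST_PLANT_RECORDS).fetchall()
print(latest)
```

`pd.read_sql` accepts a SQL query as well as a table name, so a query like this should also be usable with `pudl_engine` to avoid loading every annual record into memory.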


## Reading FERC plant data from the PUDL SQLite DB
* The Federal Energy Regulatory Commission (FERC) also reports data about power plants in its [Form 1 - Annual Report of Major Electric Utilities](https://catalystcoop-pudl.readthedocs.io/en/latest/data_sources/ferc1.html).
* FERC Form 1 focuses primarily on electric utility finances, rather than operations.
* The large steam plant records include particularly detailed capital expenses and operations and maintenance (O&M) costs.

In [8]:
%%time
plants_ferc1 = pd.read_sql("denorm_plants_steam_ferc1", pudl_engine)
plants_ferc1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30709 entries, 0 to 30708
Data columns (total 55 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   report_year                             30709 non-null  int64  
 1   utility_id_ferc1                        30709 non-null  int64  
 2   utility_id_pudl                         30709 non-null  int64  
 3   utility_name_ferc1                      30709 non-null  object 
 4   plant_id_pudl                           30709 non-null  int64  
 5   plant_id_ferc1                          30709 non-null  int64  
 6   plant_name_ferc1                        30709 non-null  object 
 7   asset_retirement_cost                   8844 non-null   float64
 8   avg_num_employees                       17793 non-null  float64
 9   capacity_factor                         28571 non-null  float64
 10  capacity_mw                             30709 non-null  fl

In [9]:
plants_ferc1.sample(10)

Unnamed: 0,report_year,utility_id_ferc1,utility_id_pudl,utility_name_ferc1,plant_id_pudl,plant_id_ferc1,plant_name_ferc1,asset_retirement_cost,avg_num_employees,capacity_factor,capacity_mw,capex_annual_addition,capex_annual_addition_rolling,capex_annual_per_kw,capex_annual_per_mw,capex_annual_per_mw_rolling,capex_annual_per_mwh,capex_annual_per_mwh_rolling,capex_equipment,capex_land,capex_per_mw,capex_structures,capex_total,capex_wo_retirement_total,construction_type,construction_year,installation_year,net_generation_mwh,not_water_limited_capacity_mw,opex_allowances,opex_boiler,opex_coolants,opex_electric,opex_engineering,opex_fuel,opex_fuel_per_mwh,opex_misc_power,opex_misc_steam,opex_nonfuel_per_mwh,opex_operations,opex_per_mwh,opex_plants,opex_production_total,opex_rents,opex_steam,opex_steam_other,opex_structures,opex_total_nonfuel,opex_transfer,peak_demand_mw,plant_capability_mw,plant_hours_connected_while_generating,plant_type,record_id,water_limited_capacity_mw
13241,2002,219,364,Wisconsin Power and Light Company,552,4042,s fond du lac u2&3,,2.0,0.02665,172.0,,,,,,,,49707660.0,324546.0,299269.6,1442167.0,51474370.0,51474370.0,conventional,1994.0,1994.0,40153.8,,,,,192717.0,4750.0,2655734.0,66.139045,52979.0,117.0,8.714792,1767.0,74.9,89853.0,3005666.0,,,,7749.0,349932.0,,193.0,168.0,,combustion_turbine,f1_steam_2002_12_194_5_4,
19616,1996,251,185,MidAmerican Energy Company,497,2719,riverside,,66.0,0.410335,141.0,,,,,,,,47830840.0,195852.0,395369.9,7720470.0,55747160.0,55747160.0,conventional,1925.0,1961.0,506829.6,,26567.0,1406224.0,,273958.0,539657.0,7973502.0,15.732116,345115.0,107497.0,9.184049,256608.0,24.9,474296.0,12628250.0,,1057068.0,,167758.0,4654748.0,,139.0,135.0,8085.0,steam,f1_steam_1996_12_210_0_3,
16418,2001,230,302,Southwestern Public Service Company,1081,602,celanese i,,,0.850757,13.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1189589.0,,89442.8,,1189589.0,1189589.0,,1965.0,1965.0,99120.0,13.0,,,,,140.0,208846.0,2.107002,111321.0,4573.0,1.839286,20.0,3.9,58619.0,391156.0,,,,7637.0,182310.0,,16.0,,8349.0,combustion_turbine,f1_steam_2001_12_166_1_1,13.0
9901,2000,198,14,AEP Texas Central Company,7962,103,la palma #7,,,0.044088,64.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6344050.0,391998.0,114421.5,667023.0,7403071.0,7403071.0,outdoor,1975.0,1975.0,24988.0,,,152745.0,,135826.0,23539.0,1173506.0,46.962782,66396.0,52223.0,33.269609,40903.0,80.2,254341.0,2004847.0,,88849.0,,16519.0,831341.0,,52.0,,510.0,combustion_turbine,f1_steam_2000_12_24_1_2,52.0
15409,2021,227,272,Public Service Company of Colorado,126,518,comanche,-3206652.0,128.0,0.45141,1635.3,31689298.0,,19.378278,19378.277992,,4.900496,,1425204000.0,6139775.0,988957.9,189105880.0,1617243000.0,1620450000.0,outdoor,1973.0,2010.0,6466549.0,1171.0,,9897493.0,,2187797.0,1217134.0,127988415.0,19.792383,5890849.0,1685699.0,6.483587,1303996.0,26.3,2577470.0,169914847.0,908787.0,12983508.0,,3273699.0,41926432.0,,1761.0,1171.0,8750.0,steam,steam_electric_generating_plant_statistics_large_plants_402_2021_c000822_comanche,1160.0
4159,1998,164,349,VIRGINIA ELECTRIC AND POWER COMPANY,45,1352,bellmeade,,,0.136803,297.0,,,,,,,,18887810.0,,68487.4,1452938.0,20340750.0,20340750.0,,1991.0,1991.0,355922.0,250.0,,,,,143779.0,9989695.0,28.067091,267350.0,33868.0,7.323228,539510.0,35.4,872134.0,12596193.0,495705.0,254427.0,,-275.0,2606498.0,,,,2157.0,combined_cycle,f1_steam_1998_12_186_3_3,230.0
25199,1994,295,123,Georgia Power Company,653,1030,yates,,357.0,0.180045,1488.0,,,,,,,,267712500.0,1157244.0,202781.2,32868646.0,301738400.0,301738400.0,conventional,1950.0,1974.0,2346866.0,1229.0,,9844432.0,,1052135.0,1919024.0,50342695.0,21.451031,4471838.0,1530748.0,11.769003,1697003.0,33.2,3237650.0,77962968.0,,2709606.0,,1157837.0,27620273.0,,1303.0,,8692.0,steam,f1_steam_1994_12_57_2_3,
4524,2013,164,349,VIRGINIA ELECTRIC AND POWER COMPANY,457,1649,polyester,-4786.0,30.0,0.230504,58.0,47720534.0,17584660.0,822.767828,822767.827586,303183.850575,407.469328,150.149429,47843960.0,1015458.0,952529.2,6392058.0,55246690.0,55251480.0,conventional,1992.0,1992.0,117114.42,58.0,,672968.0,,15313.0,,4971584.0,42.450656,768478.0,1153555.0,45.878936,,88.3,165950.0,10344669.0,-10788.0,2228499.0,,379110.0,5373085.0,,63.0,58.0,3183.0,steam,f1_steam_2013_12_186_5_2,58.0
23298,1999,286,31,Arizona Public Service Company,163,28,douglas,,,0.002574,22.0,0.0,0.6666667,0.0,0.0,0.030303,0.0,0.001344,2185163.0,9557.0,99967.4,4562.0,2199282.0,2199282.0,,1972.0,1972.0,496.074,16.0,,,,,273.0,89379.0,180.172716,3201.0,21731.0,127.446712,,307.6,9538.0,152602.0,,,,28480.0,63223.0,,16.0,,102.0,combustion_turbine,f1_steam_1999_12_7_2_2,
22491,1995,280,363,Wisconsin Electric Power Company,217,2451,germantown-unit 3,,,3e-06,53.0,,,,,,,,8378407.0,16648.0,169966.4,613168.0,9008223.0,9008223.0,conventional,1978.0,,1.374,62.0,,,,43877.0,3966.0,119748.0,87152.838428,20605.0,2086.0,-86988.355167,6777.0,0.0,112592.0,226.0,,,,458.0,-119522.0,,50.0,,384.0,combustion_turbine,f1_steam_1995_12_193_6_3,53.0


## Reading data tables from the SQLite database
* In addition to tables that describe the various entities reporting data, we also have the data itself.
* For instance, the EIA-923 reports fuel receipts and costs by power plant:

In [10]:
%%time
frc_eia923 = pd.read_sql("denorm_fuel_receipts_costs_eia923", pudl_engine).convert_dtypes()
frc_eia923.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 631975 entries, 0 to 631974
Data columns (total 36 columns):
 #   Column                                   Non-Null Count   Dtype         
---  ------                                   --------------   -----         
 0   report_date                              631975 non-null  datetime64[ns]
 1   plant_id_eia                             631975 non-null  Int64         
 2   plant_id_pudl                            631975 non-null  Int64         
 3   plant_name_eia                           631975 non-null  string        
 4   utility_id_eia                           631975 non-null  Int64         
 5   utility_id_pudl                          631975 non-null  Int64         
 6   utility_name_eia                         631975 non-null  string        
 7   state                                    631975 non-null  string        
 8   contract_type_code                       631737 non-null  string        
 9   contract_expiration_date  

In [11]:
frc_eia923.sample(10)

Unnamed: 0,report_date,plant_id_eia,plant_id_pudl,plant_name_eia,utility_id_eia,utility_id_pudl,utility_name_eia,state,contract_type_code,contract_expiration_date,energy_source_code,fuel_type_code_pudl,fuel_group_code,supplier_name,fuel_received_units,fuel_mmbtu_per_unit,fuel_cost_per_mmbtu,bulk_agg_fuel_cost_per_mmbtu,fuel_consumed_mmbtu,total_fuel_cost,fuel_cost_from_eiaapi,sulfur_content_pct,ash_content_pct,mercury_content_ppm,primary_transportation_mode_code,secondary_transportation_mode_code,natural_gas_transport_code,natural_gas_delivery_contract_type_code,moisture_content_pct,chlorine_content_ppm,data_maturity,mine_id_msha,mine_name,mine_state,coalmine_county_id_fips,mine_type_code
415709,2017-02-01,2965,586,Tulsa,15474,275,Public Service Co of Oklahoma,OK,S,NaT,NG,gas,natural_gas,macquarie cook,1713,1.04,2.842,,1781.52,5063.07984,False,0.0,0.0,0.0,PL,,firm,firm,,,final,,,,,
200663,2011-05-01,6250,356,Mayo,3046,97,Carolina Power & Light Co,NC,C,2018-12-01,BIT,coal,coal,consol,22735,23.85,4.045,3.5871,542229.75,2193319.33875,False,0.68,11.5,,RR,,,,,,final,4605890.0,miller creek preparation plant,WV,,P
223750,2011-10-01,1592,1816,Sithe Medway LLC,6035,1691,"Constellation Power, Inc",MA,S,NaT,DFO,oil,petroleum,gulf,11,5.85,20.3533,20.3533,64.35,1309.734855,True,0.5,0.0,,TR,,,,,,final,,,,,
354010,2015-03-01,983,114,Clifty Creek,9269,145,Indiana-Kentucky Electric Corp,IN,C,2021-12-01,BIT,coal,coal,resource fuels,119731,22.93,2.507,,2745431.83,6882797.59781,False,2.99,8.2,0.0,RV,,,,12.21,0.0,final,1519374.0,river view mine,KY,21225.0,U
254260,2012-06-01,2850,288,J M Stuart,4922,317,Dayton Power & Light Co,OH,C,2018-12-01,BIT,coal,coal,williamson energy,145460,23.27,2.675,,3384854.2,9054484.985,False,2.64,9.8,,RV,,,,,,final,1103141.0,mach 1 mine,IL,17199.0,U
481440,2018-10-01,6041,2483,H L Spurlock,5580,1571,East Kentucky Power Coop Inc,KY,C,2018-12-01,BIT,coal,coal,foresight coal sales,22492,24.0,2.08,1.9485,539808.0,1122800.64,False,2.88,7.5,0.0,RV,,,,10.4,0.0,final,1103182.0,deer run mine,IL,17135.0,U
263034,2012-08-01,55818,4655,Frederickson Power LP,56613,1810,Frederickson Power LP,WA,S,NaT,NG,gas,natural_gas,bp canada energy company,10584,1.035,4.0849,4.0849,10954.44,44747.791956,True,0.0,0.0,0.0,PL,,firm,,,,final,,,,,
176787,2010-12-01,6017,1186,Newton,520,143,Ameren Energy Generating Co,IL,C,2011-03-01,DFO,oil,petroleum,knapp oil,356,5.754,19.8042,19.8042,2048.424,40567.398581,True,0.01,0.0,,TR,,,,,,final,,,,,
560277,2020-12-01,56328,255,Tontitown,17698,301,Southwestern Electric Power Co,AR,S,NaT,NG,gas,natural_gas,southwest energy,112,1.03,17.284,,115.36,1993.88224,False,0.0,0.0,0.0,PL,,firm,firm,,,final,,,,,
203024,2011-06-01,136,1310,Seminole,21554,3118,"Seminole Electric Coop, Inc",FL,C,2014-04-01,BIT,coal,coal,alliance energy,24121,24.64,3.347,,594341.44,1989260.79968,False,2.99,9.0,,RR,,,,,,final,1502132.0,dotiki mine,KY,,U


# Read Hourly Generation & Emissions Data from Apache Parquet
* The full hourly emissions time series for thousands of US power plants covering 1995-2022 contains almost a billion records.
* The data is stored in a single [Apache Parquet file](https://parquet.apache.org/) with row-groups defined by year and state.
* This compressed columnar format enables very efficient queries with appropriate tooling, including [Dask](https://www.dask.org/) and [PyArrow](https://arrow.apache.org/docs/python/index.html).
* Reading the entire dataset into memory at once will probably exceed the available RAM.
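* The `filters` argument used in the examples below is written in [Disjunctive Normal Form](https://blog.datasyndrome.com/python-and-parquet-performance-e71da65269ce) (DNF): an outer list of alternatives combined with OR, each of which is an inner list of conditions combined with AND.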
* Using Dask's lazy evaluation and the filter criteria, we can minimize the data read off of disk and limit memory usage.
* The Dask project has lots of [tutorials and documentation](https://www.dask.org/get-started) if you want to learn more.
* Other tools like [DuckDB](https://duckdb.org/docs/data/parquet/overview.html) ([Python API](https://duckdb.org/docs/api/python/overview)) also provide good Parquet support. 
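
The meaning of a DNF filter can be illustrated in plain Python. The sketch below evaluates a filter against a few hand-written records. It only demonstrates the AND/OR semantics. The function name `matches_dnf` is hypothetical, and this is not how PyArrow or Dask apply filters internally.

```python
import operator

# Comparison operators that may appear in a (column, op, value) condition.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches_dnf(record, filters):
    """Return True if the record satisfies at least one inner AND-group."""
    return any(
        all(OPS[op](record[column], value) for column, op, value in group)
        for group in filters
    )


filters = [
    [("year", "=", 2019), ("state", "=", "CO")],  # 2019 AND CO
    [("year", "=", 2020), ("state", "=", "WY")],  # OR (2020 AND WY)
]
records = [
    {"year": 2019, "state": "CO"},
    {"year": 2019, "state": "WY"},
    {"year": 2020, "state": "WY"},
    {"year": 2021, "state": "CO"},
]

selected = [r for r in records if matches_dnf(r, filters)]
print(selected)
```

Only the first and third records are selected: each satisfies every condition in one of the inner lists, while the other two fail at least one condition in both.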

## Using Dask to selectively read Parquet data

In [12]:
%%time 
from dask import dataframe as dd
# In a DNF filter, the inner lists of conditions are combined with AND
# while the outer list of conditions are combined with OR
# So this filter will get all 2019 and 2020 records for CO and WY:
state_year_filters = [
    [('year', '=', 2019), ('state', '=', 'CO')],
    [('year', '=', 2019), ('state', '=', 'WY')],
    [('year', '=', 2020), ('state', '=', 'CO')],
    [('year', '=', 2020), ('state', '=', 'WY')],
]
co_wy_cems = dd.read_parquet(
    f"{pudl_path}/hourly_emissions_epacems.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
    filters=state_year_filters,
).compute()
co_wy_cems.info()

Use the `index` argument to set a sorted column as your index to create a DataFrame collection with known `divisions`.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1662192 entries, 0 to 1662191
Data columns (total 16 columns):
 #   Column                     Non-Null Count    Dtype                                                       
---  ------                     --------------    -----                                                       
 0   plant_id_eia               1662192 non-null  int32[pyarrow]                                              
 1   plant_id_epa               1662192 non-null  int32[pyarrow]                                              
 2   emissions_unit_id_epa      1662192 non-null  string                                                      
 3   operating_datetime_utc     1662192 non-null  timestamp[ms, tz=UTC][pyarrow]                              
 4   year                       1662192 non-null  int32[pyarrow]                                              
 5   state                      1662192 non-null  dictionary<values=string, indices=int32, ordered=0>[pyarrow]

In [13]:
co_wy_cems.sample(10)

Unnamed: 0,plant_id_eia,plant_id_epa,emissions_unit_id_epa,operating_datetime_utc,year,state,operating_time_hours,gross_load_mw,heat_content_mmbtu,steam_load_1000_lbs,so2_mass_lbs,so2_mass_measurement_code,nox_mass_lbs,nox_mass_measurement_code,co2_mass_tons,co2_mass_measurement_code
733556,8066,8066,BW73,2019-01-23 03:00:00+00:00,2019,WY,1.0,495.0,4967.200195,,834.400024,Measured,372.540009,Calculated,521.0,Measured
389577,55283,55283,1,2020-11-10 16:00:00+00:00,2020,CO,1.0,187.0,1489.5,,0.894,Measured,43.195999,Calculated,88.5,Measured
1541861,7504,7504,1,2020-04-20 12:00:00+00:00,2020,WY,1.0,90.0,1080.0,,106.599998,Measured,154.440002,Calculated,113.300003,Measured
1367161,56445,56445,CT-01,2019-05-23 08:00:00+00:00,2019,CO,0.0,,,,,,,,,
1286120,55453,55453,6,2019-02-20 15:00:00+00:00,2019,CO,0.0,,,,,,,,,
1558611,8066,8066,BW71,2020-03-17 10:00:00+00:00,2020,WY,1.0,294.0,2846.0,,444.299988,Measured,395.593994,Calculated,298.5,Measured
255932,8219,8219,3,2020-08-24 03:00:00+00:00,2020,CO,0.0,,,,,,,,,
1440046,56998,56998,CT08,2019-09-17 05:00:00+00:00,2019,CO,0.0,,,,,,,,,
827567,469,469,4,2019-10-17 06:00:00+00:00,2019,CO,1.0,198.0,1984.599976,,1.2,Measured,140.906998,Calculated,117.599998,Measured
357530,55200,55200,CT6,2020-03-18 09:00:00+00:00,2020,CO,1.0,34.0,282.0,,0.169,Measured,4.512,Calculated,16.799999,Measured
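
Writing out every year and state combination by hand gets tedious as the selection grows. A filter list like the one above can be generated with `itertools.product`. This is a minimal sketch, and the helper name `state_year_dnf` is hypothetical.

```python
from itertools import product


def state_year_dnf(years, states):
    """Build a DNF filter with one AND-group per (year, state) combination."""
    return [
        [("year", "=", year), ("state", "=", state)]
        for year, state in product(years, states)
    ]


generated_filters = state_year_dnf([2019, 2020], ["CO", "WY"])
for group in generated_filters:
    print(group)
```

The result has the same four groups, in the same order, as the hand-written `state_year_filters`, so it can be passed to `dd.read_parquet(..., filters=...)` in the same way.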


## Read all Colorado Emissions Data

In [14]:
%%time
colorado_cems = dd.read_parquet(
    f"{pudl_path}/hourly_emissions_epacems.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
    filters=[("state", "=", "CO")],
).compute()
colorado_cems.info()

Use the `index` argument to set a sorted column as your index to create a DataFrame collection with known `divisions`.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13631472 entries, 0 to 13631471
Data columns (total 16 columns):
 #   Column                     Dtype                                                       
---  ------                     -----                                                       
 0   plant_id_eia               int32[pyarrow]                                              
 1   plant_id_epa               int32[pyarrow]                                              
 2   emissions_unit_id_epa      string                                                      
 3   operating_datetime_utc     timestamp[ms, tz=UTC][pyarrow]                              
 4   year                       int32[pyarrow]                                              
 5   state                      dictionary<values=string, indices=int32, ordered=0>[pyarrow]
 6   operating_time_hours       float[pyarrow]                                              
 7   gross_load_mw              float[pyarrow]  

In [15]:
colorado_cems.sample(10)

Unnamed: 0,plant_id_eia,plant_id_epa,emissions_unit_id_epa,operating_datetime_utc,year,state,operating_time_hours,gross_load_mw,heat_content_mmbtu,steam_load_1000_lbs,so2_mass_lbs,so2_mass_measurement_code,nox_mass_lbs,nox_mass_measurement_code,co2_mass_tons,co2_mass_measurement_code
2074981,478,478,1,2015-01-06 20:00:00+00:00,2015,CO,1.0,,113.099998,150.0,0.068,Measured,22.846001,Calculated,6.7,Measured
12487512,525,525,H2,2012-01-27 07:00:00+00:00,2012,CO,1.0,207.0,2009.599976,,196.699997,Substitute,635.033997,Calculated,206.199997,Measured and Substitute
1872683,55283,55283,2,2010-12-06 18:00:00+00:00,2010,CO,0.0,,,,,,,,,
393861,55453,55453,2,2022-06-20 04:00:00+00:00,2022,CO,0.0,,,,,,,,,
11934318,492,492,5,2007-12-21 13:00:00+00:00,2007,CO,1.0,45.0,450.5,,288.700012,Measured,160.828003,Calculated,46.200001,Measured
2715761,478,478,3,2020-02-25 00:00:00+00:00,2020,CO,0.0,,,,,,,,,
1546069,508,508,8,2010-08-24 20:00:00+00:00,2010,CO,0.0,,,,,,,,,
8100824,6761,6761,101,2021-01-25 15:00:00+00:00,2021,CO,1.0,300.0,2605.699951,,203.699997,Measured,315.290009,Calculated,273.299988,Measured
12355359,465,465,4,2012-07-10 22:00:00+00:00,2012,CO,1.0,112.0,1356.199951,,527.299988,Measured,377.023987,Calculated,142.199997,Measured
5785541,56998,56998,CT07,2018-09-26 12:00:00+00:00,2018,CO,1.0,38.0,384.899994,,0.231,Measured,1.155,Calculated,22.9,Measured


## Visualize Hourly Power Plant Operations
* Let's find a particular power plant and look at its long-term operations.
* Say we want to investigate [Xcel Energy's troubled Comanche coal plant](https://coloradosun.com/?s=comanche%20pueblo) in Pueblo, CO.
* The EPA CEMS data contains only the EIA Plant ID, not the plant's name or any ownership information.
* The PUDL database links these IDs to much more extensive EIA data.
* We can look for the Comanche plant in the PUDL DB and use that information to select the appropriate EPA CEMS data to plot.

In [16]:
%%time
plants_eia = pd.read_sql("denorm_plants_eia", pudl_engine).convert_dtypes()
plants_eia.sample(10)

CPU times: user 7.01 s, sys: 366 ms, total: 7.37 s
Wall time: 7.76 s


Unnamed: 0,plant_id_eia,plant_name_eia,city,county,latitude,longitude,state,street_address,zip_code,timezone,report_date,ash_impoundment,ash_impoundment_lined,ash_impoundment_status,balancing_authority_code_eia,balancing_authority_name_eia,datum,energy_storage,ferc_cogen_docket_no,ferc_cogen_status,ferc_exempt_wholesale_generator_docket_no,ferc_exempt_wholesale_generator,ferc_small_power_producer_docket_no,ferc_small_power_producer,ferc_qualifying_facility_docket_no,grid_voltage_1_kv,grid_voltage_2_kv,grid_voltage_3_kv,iso_rto_code,liquefied_natural_gas_storage,natural_gas_local_distribution_company,natural_gas_storage,natural_gas_pipeline_name_1,natural_gas_pipeline_name_2,natural_gas_pipeline_name_3,nerc_region,net_metering,pipeline_notes,primary_purpose_id_naics,regulatory_status_code,reporting_frequency_code,sector_id_eia,sector_name_eia,service_area,transmission_distribution_owner_id,transmission_distribution_owner_name,transmission_distribution_owner_state,utility_id_eia,water_source,data_maturity,plant_id_pudl,utility_name_eia,utility_id_pudl,balancing_authority_code_eia_consistent_rate
139946,57352,Trenton Solar Farm,Trenton,Mercer,40.2314,-74.7539,NJ,500-520 Brunswick Avenue,8638,America/New_York,2019-01-01,False,False,,PJM,"PJM Interconnection, LLC",,False,,False,,False,,False,,230.0,,,,False,,False,,,,RFC,,,22,RE,A,1.0,Electric Utility,,15477.0,Public Service Elec & Gas Co,NJ,15477,,final,5633,Public Service Elec & Gas Co,276,1.0
124531,56292,Richmond County Generating Facility,Hamlet,Richmond,34.842311,-79.73609,NC,162 Cooperative Way,28345,America/New_York,2010-01-01,,,,CPLE,,,,,False,,False,,False,,,,,,,,,,,,SERC,,,22,RE,M,1.0,Electric Utility,,3046.0,Progress Energy Carolinas Inc,NC,13683,County Water,final,4879,North Carolina El Member Corp,2639,1.0
163852,59573,Tisbury Landfill Solar,Tisbury,Dukes,41.444722,-70.61306,MA,59 Highpoint Lane,2568,America/New_York,2015-01-01,False,False,,ISNE,ISO New England Inc.,,,,False,,False,15-30-000,True,,23.0,,,,,,,,,,NPCC,,,22,NR,,2.0,NAICS-22 Non-Cogen,,54913.0,NSTAR Electric Company,MA,58894,,final,7272,CF CVEC Owner One LLC,848,1.0
178138,61362,NRG Elbow Creek Energy Storage Project,Big Spring,Howard,32.167575,-101.4792,TX,1003 FM 818 W.,79720,America/Chicago,2016-01-01,False,,,ERCO,"Electric Reliability Council of Texas, Inc.",,True,,False,,False,,False,,12.47,,,,,,,,,,TRE,,,22,NR,,2.0,IPP Non-CHP,,17008.0,Sharyland Utilities LP,TX,54888,,final,10940,NRG Texas Power LLC,2701,1.0
81090,50148,Linde Wilmington,Wilmington,Los Angeles,33.78856,-118.2294,CA,Pacific Coast Highway,90744,America/Los_Angeles,2020-01-01,False,False,,LDWP,Los Angeles Department of Water and Power,,False,86-1079-000,True,,False,,False,,34.5,,,,False,SOUTHERN CALIFORNIA GAS COMPANY,False,,,,WECC,,,325,NR,A,7.0,Industrial NAICS Cogen,,11208.0,Los Angeles Department of Water & Power,CA,15320,Municipality,final,3474,Praxair Inc,2904,1.0
115846,55691,Wellton-Mohawk Generating,Wellton,Yuma,,,AZ,10800 South Ave 22 E,85356,America/Phoenix,2004-01-01,,,,,,,,,,,,,,,,,,,,,,,,,WECC,,,22,,,,,Western Area Power Administration,,,,5273,Wellton-Mohawk Irr &Drain Dist,final,13239,"Dome Valley Energy Partners, L",7066,
196498,65092,115 G Fisher,Middleborough,Plymouth,41.877476,-70.83984,MA,0 Gd Circle,2346,America/New_York,2022-01-01,,,,ISNE,ISO New England Inc.,,False,,False,,False,,False,,15.0,,,,,,,,,,NPCC,,,22,NR,A,2.0,NAICS-22 Non-Cogen,,12473.0,Town of Middleborough - (MA),MA,60025,,provisional,16475,Greenbacker Renewable Energy Corporation,1932,1.0
183291,62158,Sacket Lake Rd #1 Community Solar Farm,Thompson,Sullivan,41.38251,-74.43526,NY,585 S. Maplewood Rd,12701,America/New_York,2020-01-01,,,,NYIS,New York Independent System Operator,,False,,False,,False,18-1340-000,True,,4.8,,,,,,,,,,NPCC,,,22,NR,A,2.0,NAICS-22 Non-Cogen,,13511.0,New York State Elec & Gas Corp,NY,61194,,final,12321,Generate Capital,6027,1.0
80997,50140,Resurrection Hospital,,Cook,,,IL,,60631,America/Chicago,2010-01-01,,,,,,,,,False,,False,,False,,,,,,,,,,,,UNK,,,622,NR,,5.0,Commercial,,,,,17259,,final,13487,Resurrection Medical Center,9057,
183991,62282,Pee Dee Solar I,Dillion,Dillon,34.399694,-79.361369,SC,Off of Lockemy Hwy,29536,America/New_York,2018-01-01,False,,,CPLE,Duke Energy Progress East,,False,,False,,False,QF16-105-000,True,,22.86,,,,False,,False,,,,SERC,,,22,NR,,2.0,IPP Non-CHP,,3046.0,Duke Energy Progress - (NC),SC,61793,,final,12443,Pee Dee Solar I,8892,1.0


By selecting a few informative columns and the records with "Comanche" in the plant name, we find that the coal plant we're looking for has `plant_id_eia == 470`.

In [17]:
plants_eia.loc[
    plants_eia.plant_name_eia.str.contains("comanche", case=False),
    [
        "plant_id_eia",
        "plant_name_eia",
        "utility_name_eia",
        "city",
        "state",
        "latitude",
        "longitude",
    ]
].drop_duplicates()

Unnamed: 0,plant_id_eia,plant_name_eia,utility_name_eia,city,state,latitude,longitude
7460,470,Comanche,Public Service Co of Colorado,Pueblo,CO,38.2081,-104.5747
50775,6145,Comanche Peak,Luminant Generation Company LLC,Glen Rose,TX,32.298365,-97.78552
50788,6145,Comanche Peak,TXU Generation Co LP,Glen Rose,TX,32.298365,-97.78552
67387,8059,Comanche,Public Service Co of Oklahoma,Lawton,OK,34.5431,-98.3244
164570,59656,Comanche Solar,Novatus Energy,Pueblo,CO,38.205278,-104.5667
164575,59656,Comanche Solar,Comanche LLC,Pueblo,CO,38.205278,-104.5667
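
The same lookup can be pushed into the database rather than done in pandas, which avoids loading the full plants table. The sketch below uses the standard library `sqlite3` module against an in-memory stand-in table with invented rows, and the helper name `find_plants` is hypothetical. It relies on SQLite's `LIKE` operator, which is case-insensitive for ASCII characters by default.

```python
import sqlite3


def find_plants(conn, name_fragment):
    """Return distinct (plant_id_eia, plant_name_eia, state) rows whose
    plant name contains the given fragment, ignoring ASCII case."""
    query = """
        SELECT DISTINCT plant_id_eia, plant_name_eia, state
        FROM denorm_plants_eia
        WHERE plant_name_eia LIKE ?
        ORDER BY plant_id_eia
    """
    return conn.execute(query, (f"%{name_fragment}%",)).fetchall()


# Stand-in table with invented rows (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE denorm_plants_eia "
    "(plant_id_eia INTEGER, plant_name_eia TEXT, state TEXT, report_date TEXT)"
)
conn.executemany(
    "INSERT INTO denorm_plants_eia VALUES (?, ?, ?, ?)",
    [
        (470, "Comanche", "CO", "2020-01-01"),
        (470, "Comanche", "CO", "2021-01-01"),
        (6145, "Comanche Peak", "TX", "2021-01-01"),
        (393, "Control Gorge", "CA", "2007-01-01"),
    ],
)

matches = find_plants(conn, "comanche")
print(matches)
```

`SELECT DISTINCT` plays the role of `.drop_duplicates()` here, collapsing the repeated annual records for each plant into one row.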


In [18]:
comanche_cems = colorado_cems[colorado_cems.plant_id_eia==470]
comanche_cems.info()

<class 'pandas.core.frame.DataFrame'>
Index: 569760 entries, 26280 to 13044551
Data columns (total 16 columns):
 #   Column                     Non-Null Count   Dtype                                                       
---  ------                     --------------   -----                                                       
 0   plant_id_eia               569760 non-null  int32[pyarrow]                                              
 1   plant_id_epa               569760 non-null  int32[pyarrow]                                              
 2   emissions_unit_id_epa      569760 non-null  string                                                      
 3   operating_datetime_utc     569760 non-null  timestamp[ms, tz=UTC][pyarrow]                              
 4   year                       569760 non-null  int32[pyarrow]                                              
 5   state                      569760 non-null  dictionary<values=string, indices=int32, ordered=0>[pyarrow]
 6   ope