<a href="https://www.kaggle.com/code/catalystcooperative/01-pudl-data-access?scriptVersionId=150065767" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Notebook Setup

In [1]:
import sys

print(f"Python version: {sys.version}")
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import sqlalchemy as sa

print(f"{np.__version__=}")
print(f"{pd.__version__=}")
print(f"{sa.__version__=}")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in sorted(filenames):
        print(os.path.join(dirname, filename))


# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Python version: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
np.__version__='1.24.3'
pd.__version__='2.0.3'
sa.__version__='2.0.20'
/kaggle/input/pudl-project/censusdp1tract.sqlite
/kaggle/input/pudl-project/ferc1.sqlite
/kaggle/input/pudl-project/ferc1_xbrl.sqlite
/kaggle/input/pudl-project/ferc1_xbrl_datapackage.json
/kaggle/input/pudl-project/ferc1_xbrl_taxonomy_metadata.json
/kaggle/input/pudl-project/ferc2.sqlite
/kaggle/input/pudl-project/ferc2_xbrl.sqlite
/kaggle/input/pudl-project/ferc2_xbrl_datapackage.json
/kaggle/input/pudl-project/ferc2_xbrl_taxonomy_metadata.json
/kaggle/input/pudl-project/ferc6.sqlite
/kaggle/input/pudl-project/ferc60.sqlite
/kaggle/input/pudl-project/ferc60_xbrl.sqlite
/kaggle/input/pudl-project/ferc60_xbrl_datapackage.json
/kaggle/input/pudl-project/ferc60_xbrl_taxonomy_metadata.json
/kaggle/input/pudl-project/ferc6_xbrl.sqlite
/kaggle/input/pudl-project/ferc6_xbrl_datapackage.json
/kaggle/input/pudl-project/ferc6_xbrl_

### Visualization settings

In [2]:
import matplotlib

In [3]:
%matplotlib inline

In [4]:
matplotlib.rcParams["figure.figsize"] = (16, 10)
matplotlib.rcParams["figure.dpi"] = 150
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_colwidth", 1000)

# Additional PUDL Project Resources
Links to more information on Catalyst Cooperative's Public Utility Data Liberation (PUDL) Project.

## PUDL Data Dictionaries:
* [Table and column level metadata for the PUDL database](https://catalystcoop-pudl.readthedocs.io/en/latest/data_dictionaries/pudl_db.html)
* [Table level metadata for 2020 and earlier raw FERC Form 1 DBF data](https://catalystcoop-pudl.readthedocs.io/en/latest/data_dictionaries/ferc1_db.html) (**Note:** the raw FERC Form 1 data is very difficult to work with. Check whether the table you want to work with has been cleaned up and imported into PUDL)
* All XBRL-derived FERC forms (2021 and later) have extensive metadata published alongside their databases in the nightly builds (see below). These take the form of a JSON version of their XBRL taxonomy, and a [datapackage descriptor](https://specs.frictionlessdata.io/data-package/) that annotates the XBRL-derived SQLite DB.

## Nightly Build Outputs:
We attempt to rebuild all of our data products each night, based on the code in [the development branch](https://github.com/catalyst-cooperative/pudl/tree/dev) of the [main PUDL repository on GitHub](https://github.com/catalyst-cooperative/pudl). 

The most recent successful build outputs can be downloaded directly from:
* [The PUDL Project in the AWS Open Data Registry](https://registry.opendata.aws/catalyst-cooperative-pudl/)
* [Direct AWS S3 download links](https://catalystcoop-pudl.readthedocs.io/en/latest/data_access.html#access-nightly-builds) on our data access page.
* The [PUDL Project Dataset](https://www.kaggle.com/datasets/catalystcooperative/pudl-project) on Kaggle updates automatically whenever the nightly builds succeed.

## Datasette: https://data.catalyst.coop
Successful nightly build outputs are also deployed using [Simon Willison](https://simonwillison.net/)'s excellent [Datasette](https://datasette.io/) tool. It provides a simple web interface for browsing and querying all of the SQLite databases we publish.

## GitHub Discussions
We use [GitHub Discussions](https://github.com/orgs/catalyst-cooperative/discussions) to answer questions about PUDL and provide user support. Let us know if you have issues or find bugs!


# Purpose of this Notebook
- Provide an introduction to working with the data from various public sources that is integrated into Catalyst Cooperative's Public Utility Data Liberation (PUDL) database.
- Show how to access data from both SQLite databases and Apache Parquet files.
- Show how data from different tables and data sources can be combined to do richer analyses.

# Comanche 3: A Snapshot of US Energy Transition

## Background
- In 2009 Xcel Energy Colorado (also known as Public Service Company of Colorado or PSCo) spent about 1.3 billion dollars building one of the last US coal plants in Pueblo, Colorado.
- The plant was bitterly contested from the beginning, with clean energy activists decrying the 60 years of future GHG emissions that the plant would lock in, and pointing to the declining economically accessible coal reserves in the Powder River Basin of Wyoming.
- After a decade of [high operating costs, ongoing maintenance issues](https://coloradosun.com/2021/03/03/comanche-3-cost-overruns-shutdown-electricity/), political pressure, and rapid renewable price declines, Xcel finally [agreed to shut the plant down 40 years early](https://coloradosun.com/2022/04/26/comanche-plant-xcel-coal/).
- This notebook uses Catalyst's PUDL Database to explore Comanche 3's brief and checkered existence, in terms of its carbon emissions, electricity generation, costs, and reliability.

## Datasets we will use
- [FERC Form 1](https://catalystcoop-pudl.readthedocs.io/en/latest/data_sources/ferc1.html) (the Annual Report of Major Electric Utilities) will provide non-fuel operating costs as well as ongoing capital expenses.
- [EIA Form 860](https://catalystcoop-pudl.readthedocs.io/en/latest/data_sources/eia860.html) (the Annual Electric Generator Report) will provide detailed physical attributes of individual generators, as well as their ownership shares.
- [EIA Form 923](https://catalystcoop-pudl.readthedocs.io/en/latest/data_sources/eia923.html) (the Power Plant Operations Report) will provide information about fuel consumption and costs, net electricity generation, and generator thermal efficiency.
- [EPA's Continuous Emissions Monitoring System](https://catalystcoop-pudl.readthedocs.io/en/latest/data_sources/epacems.html) will provide hourly power plant emissions, fuel consumption, and power output.


# Reading data from the PUDL SQLite Database
- Most of the PUDL Project data is distributed using SQLite databases.
- Python, pandas, and many other libraries have built-in support for reading data from SQLite. It is a file-based database that doesn't require running a database server, which makes local analysis and data distribution much simpler.
- Only one of these databases is meant for general public consumption: `pudl.sqlite`.
- The other SQLite databases pertain to various FERC forms and are unprocessed conversions of FERC's difficult to use original data formats (Visual FoxPro up to 2020, and XBRL starting in 2021). We will not look at these relatively raw inputs in this notebook.
- [SQLAlchemy](https://docs.sqlalchemy.org/en/20/) is Python's general purpose database access library, and is integrated directly with the [pandas](https://pandas.pydata.org/) data analysis library that you may already be familiar with.
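Because a SQLite database is just a file, the Python standard library alone is enough to peek inside one. The sketch below is illustrative only: it builds a throwaway in-memory database containing a made-up `toy_plants` table (not a PUDL table) and lists the tables it holds by querying SQLite's built-in `sqlite_master` catalog. Pointing `sqlite3.connect()` at a real file such as `pudl.sqlite` should work the same way.

```python
import sqlite3

# An in-memory database stands in for a real SQLite file here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_plants (plant_id INTEGER, state TEXT)")
conn.execute("INSERT INTO toy_plants VALUES (1, 'CO'), (2, 'WY')")

# SQLite records every table in its built-in sqlite_master catalog.
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(table_names)
conn.close()
```

In this notebook we use SQLAlchemy and pandas instead, since they return DataFrames directly.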

## Create a connection engine for the PUDL Database

In [5]:
import pathlib
import sqlalchemy as sa

# Path to the directory that contains all the PUDL data on Kaggle:
pudl_path = pathlib.Path("/kaggle/input/pudl-project")

# Create a connection engine using SQLAlchemy. We'll pass this to pandas below to read data.
pudl_engine = sa.create_engine(f"sqlite:///{pudl_path}/pudl.sqlite")

## Read EIA plant data from the PUDL SQLite Database
- There are lots of different kinds of data in the PUDL DB.
- Some tables describe attributes of utilities, plants, generators, and balancing authorities.
- Other tables contain hourly, monthly, or yearly time series of fuel consumed, operating costs, or electricity generated.
- We'll read the entire [denormalized EIA plants table](https://catalystcoop-pudl.readthedocs.io/en/latest/data_dictionaries/pudl_db.html#denorm-plants-eia) -- "denormalized" just means that it has additional useful information merged in that might be duplicative, but is more convenient for interactive use.
- We'll use `.convert_dtypes()` to tell Pandas to infer data types of the columns to the best of its ability, so that we don't get any generic Python `object` columns. This is necessary because SQLite's data types aren't as rich as those available from Pandas.
- Power plants are industrial facilities operated by a single utility, but they can have multiple owners, and host multiple generation units.
- The `denorm_plants_eia` table contains only information that pertains to all of the equipment at the plant, like its location or its connections to the natural gas and electricity transmission systems.
- Most of these attributes are relatively stable, but they can change slowly over time, so each plant has one record for each `report_date`.
- There are also several ID columns in this table that will be useful for joining it with other data later.
- The table has more than 50 columns. You can look up short descriptions of what all these columns mean in the [PUDL Data Dictionary](https://catalystcoop-pudl.readthedocs.io/en/latest/data_dictionaries/pudl_db.html).
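To see why `.convert_dtypes()` matters, here is a minimal sketch using a made-up `toy_plants` table in an in-memory SQLite database (the table and its values are invented for illustration and are not part of PUDL). An integer column that contains a NULL typically comes back from SQLite as a float or generic object column, and `.convert_dtypes()` turns it into a nullable integer column.

```python
import sqlite3

import pandas as pd

# Build a tiny throwaway table with missing values in both columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_plants (plant_name TEXT, zip_code INTEGER)")
conn.executemany(
    "INSERT INTO toy_plants VALUES (?, ?)",
    [("Plant A", 80631), ("Plant B", None), (None, 93627)],
)

# Read the table as-is, then let pandas infer richer nullable dtypes.
raw = pd.read_sql_query("SELECT * FROM toy_plants", conn)
converted = raw.convert_dtypes()
conn.close()

print(raw.dtypes)        # dtypes as read from SQLite
print(converted.dtypes)  # nullable dtypes inferred by pandas
```

After conversion the missing ZIP code is shown as `<NA>` and the remaining values stay whole numbers.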

In [6]:
%%time
plants_eia = pd.read_sql("denorm_plants_eia", pudl_engine).convert_dtypes()
plants_eia.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200177 entries, 0 to 200176
Data columns (total 54 columns):
 #   Column                                        Non-Null Count   Dtype         
---  ------                                        --------------   -----         
 0   plant_id_eia                                  200177 non-null  Int64         
 1   plant_name_eia                                199398 non-null  string        
 2   city                                          190994 non-null  string        
 3   county                                        190479 non-null  string        
 4   latitude                                      192211 non-null  Float64       
 5   longitude                                     194840 non-null  Float64       
 6   state                                         199317 non-null  string        
 7   street_address                                184478 non-null  string        
 8   zip_code                                      193575 n

## A Sample of EIA Plant Data
- Note that there's data covering multiple decades, from 2001 to 2022.
- Many fields contain `<NA>` values because not all fields have been reported consistently across all years.
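One way to check how consistently a field has been reported is to compute the share of non-null values per year. This is only a sketch on invented toy data; the column names mirror `plants_eia`, but the values are made up.

```python
import pandas as pd

# Toy stand-in for plants_eia: two records per year, with gaps in 2001.
toy = pd.DataFrame(
    {
        "report_date": pd.to_datetime(
            ["2001-01-01", "2001-01-01", "2022-01-01", "2022-01-01"]
        ),
        "latitude": [None, None, 40.44, 38.97],
        "city": ["Pueblo", None, "Greeley", "Denver"],
    }
)

# Fraction of non-null values in each column, grouped by report year.
coverage = (
    toy.drop(columns="report_date")
    .notna()
    .groupby(toy["report_date"].dt.year)
    .mean()
)
print(coverage)
```

The same pattern should apply to the real table by swapping `toy` for `plants_eia`.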

In [7]:
plants_eia.sample(10)

Unnamed: 0,plant_id_eia,plant_name_eia,city,county,latitude,longitude,state,street_address,zip_code,timezone,report_date,ash_impoundment,ash_impoundment_lined,ash_impoundment_status,balancing_authority_code_eia,balancing_authority_name_eia,datum,energy_storage,ferc_cogen_docket_no,ferc_cogen_status,ferc_exempt_wholesale_generator_docket_no,ferc_exempt_wholesale_generator,ferc_small_power_producer_docket_no,ferc_small_power_producer,ferc_qualifying_facility_docket_no,grid_voltage_1_kv,grid_voltage_2_kv,grid_voltage_3_kv,iso_rto_code,liquefied_natural_gas_storage,natural_gas_local_distribution_company,natural_gas_storage,natural_gas_pipeline_name_1,natural_gas_pipeline_name_2,natural_gas_pipeline_name_3,nerc_region,net_metering,pipeline_notes,primary_purpose_id_naics,regulatory_status_code,reporting_frequency_code,sector_id_eia,sector_name_eia,service_area,transmission_distribution_owner_id,transmission_distribution_owner_name,transmission_distribution_owner_state,utility_id_eia,water_source,data_maturity,plant_id_pudl,utility_name_eia,utility_id_pudl,balancing_authority_code_eia_consistent_rate
183666,62224,Oak Leaf Solar XXVII LLC,Greely,Weld,40.44,-104.64,CO,2451 E. 8th St.,80631.0,America/Denver,2022-01-01,,,,PSCO,Public Service Company of Colorado,,False,,False,,False,1902-0075,True,,12.47,,,,,,,,,,WECC,,,22,NR,A,2,NAICS-22 Non-Cogen,,15466.0,Public Service Co of Colorado,CO,60025,,provisional,12383,Greenbacker Renewable Energy Corporation,1932,1.0
134380,56955,Delano Energy Center LLC,Helm,Fresno,36.549167,-120.111111,CA,12688 S. Colorado Avenue,93627.0,America/Los_Angeles,2019-01-01,,False,,CISO,California Independent System Operator,,False,,False,,False,,False,,60.0,,,,False,,False,,,,WECC,,,22,NR,,2,IPP Non-CHP,,14328.0,Pacific Gas & Electric Co,CA,20323,,final,6233,Wellhead Services Inc,3726,1.0
58088,7233,Tesla,Colorado Springs,El Paso,38.973655,-104.9011,CO,690 W Monument Creek Rd,80840.0,America/Denver,2019-01-01,False,False,,WACM,Western Area Power Administration - Rocky Mountain Region,,False,,False,,False,,False,,34.5,,,,False,,False,,,,WECC,,,22,RE,A,1,Electric Utility,,3989.0,City of Colorado Springs - (CO),CO,3989,Municipality,final,2722,City of Colorado Springs - (CO),956,1.0
172257,60537,Limerick Road Solar Farm,Shelburne,Chittenden,44.368889,-73.24708,VT,197 Limerick Road,5482.0,America/New_York,2021-01-01,False,False,,ISNE,ISO New England Inc.,,False,,False,,False,8027,True,,12.47,,,,,,,,,,NPCC,,,22,NR,A,2,NAICS-22 Non-Cogen,,7601.0,Green Mountain Power Corp,VT,60297,,final,9332,"Limerick Road Solar, LLC",5649,1.0
103097,54890,Peoples,,,43.27975,-83.86507,MI,4516 Rathbun Road,48415.0,America/Detroit,2010-01-01,,,,MISO,,,,,False,,False,95-43-000,True,,,,,MISO,,,,,,,RFC,,,22,NR,,2,NAICS-22 Non-Cogen,,56163.0,Michigan Electric Transmission Company,MI,13559,,final,4257,North American Natural Res,2637,1.0
177231,61238,NY - Presbyt. Hospital - 525 E 68TH St,New York,New York,40.764396,-73.95393,NY,525 East 68th St,10021.0,America/New_York,2016-01-01,False,,,NYIS,New York Independent System Operator,,False,,False,,False,,False,,4.16,,,,,CONSOLIDATED EDISON NEW YORK INC,False,,,,NPCC,,,622,NR,,5,Commercial NAICS Cogen,,13511.0,New York State Elec & Gas Corp,NY,60883,,final,10818,NY - PRESBYTERIAN HOSPITAL-525 E 68TH ST,5778,1.0
49864,6087,Wallace Dam,Eatonton,Hancock,,-83.1574,GA,Highway 16 East,31024.0,America/New_York,2019-01-01,False,False,,SOCO,"Southern Company Services, Inc. - Trans",,True,,False,,False,,False,,230.0,,,,False,,False,,,,SERC,,,22,RE,M,1,Electric Utility,,7140.0,Georgia Power Co,GA,7140,Oconee River,final,627,Georgia Power Co,123,1.0
126741,56424,Mower County Wind Energy Center,Grand Meadow,Mower,43.61407,-92.6667,MN,72506 180th Street,55936.0,America/Chicago,2014-01-01,,False,,MISO,"Midcontinent Independent Transmission System Operator, Inc..",,,,False,,False,,False,,230.0,,,,,,,,,,MRO,,,22,NR,,2,NAICS-22 Non-Cogen,,4716.0,Dairyland Power Coop,WI,54819,,final,4966,FPL Energy Mower County LLC,1784,1.0
64346,7881,MEAG3,,,,,GA,unsited,,America/New_York,2020-01-01,,False,,,,,,,False,,False,,False,,,,,,,,,,,,SERC,,,22,RE,,1,Electric Utility,,,,,13100,,final,9615,Municipal Electric Authority,2530,
107724,55193,Ontelaunee Energy Center,Reading,Berks,40.4219,-75.9356,PA,5115 Pottsville Pike,19605.0,America/New_York,2005-01-01,,,,PJM,,,,,,01-24-000,True,,,,,,,,,,,,,,MAAC,,,22,,,2,NAICS-22 Non-Cogen,Metropolitan Edison Co,,,,50157,Reading Water Authority,final,4404,South Point Energy Center LLC,3243,1.0


## Reading FERC plant data from the PUDL SQLite DB
* The Federal Energy Regulatory Commission (FERC) also reports data about power plants in their [Form 1 - Annual Report of Major Electric Utilities](https://catalystcoop-pudl.readthedocs.io/en/latest/data_sources/ferc1.html).
* FERC Form 1 focuses primarily on electric utility finances, rather than operations.
* The [Large Steam Plants](https://catalystcoop-pudl.readthedocs.io/en/latest/data_dictionaries/pudl_db.html) table provides particularly detailed capital expenses and non-fuel O&M costs.

In [8]:
%%time
plants_ferc1 = pd.read_sql("denorm_plants_steam_ferc1", pudl_engine).convert_dtypes()
plants_ferc1.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31879 entries, 0 to 31878
Data columns (total 55 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   report_year                             31879 non-null  Int64  
 1   utility_id_ferc1                        31879 non-null  Int64  
 2   utility_id_pudl                         31879 non-null  Int64  
 3   utility_name_ferc1                      31879 non-null  string 
 4   plant_id_pudl                           31879 non-null  Int64  
 5   plant_id_ferc1                          31879 non-null  Int64  
 6   plant_name_ferc1                        31879 non-null  string 
 7   asset_retirement_cost                   9453 non-null   Int64  
 8   avg_num_employees                       18624 non-null  Int64  
 9   capacity_factor                         29486 non-null  Float64
 10  capacity_mw                             31879 non-null  Fl

In [9]:
plants_ferc1.sample(10)

Unnamed: 0,report_year,utility_id_ferc1,utility_id_pudl,utility_name_ferc1,plant_id_pudl,plant_id_ferc1,plant_name_ferc1,asset_retirement_cost,avg_num_employees,capacity_factor,capacity_mw,capex_annual_addition,capex_annual_addition_rolling,capex_annual_per_kw,capex_annual_per_mw,capex_annual_per_mw_rolling,capex_annual_per_mwh,capex_annual_per_mwh_rolling,capex_equipment,capex_land,capex_per_mw,capex_structures,capex_total,capex_wo_retirement_total,construction_type,construction_year,installation_year,net_generation_mwh,not_water_limited_capacity_mw,opex_allowances,opex_boiler,opex_coolants,opex_electric,opex_engineering,opex_fuel,opex_fuel_per_mwh,opex_misc_power,opex_misc_steam,opex_nonfuel_per_mwh,opex_operations,opex_per_mwh,opex_plants,opex_production_total,opex_rents,opex_steam,opex_steam_other,opex_structures,opex_total_nonfuel,opex_transfer,peak_demand_mw,plant_capability_mw,plant_hours_connected_while_generating,plant_type,record_id,water_limited_capacity_mw
17323,2004,231,107,"Entergy Gulf States Louisiana, L.L.C.",639,4676,willow glen,,77.0,0.021242,2194.0,,,,,,,,364288687,2223473.0,188190.2,46377246,412889406,412889406,outdoor,1960,1976,408266.0,2045.0,,3401692.0,,68628.0,582185.0,35183948.0,86.178981,2610916.0,377336.0,39.365443,3220722.0,125.5,3561977.0,51255520.0,228249.0,1424956.0,,594911.0,16071572.0,,664.0,,6817.0,steam,f1_steam_2004_12_63_0_4,2045.0
5930,2015,168,292,South Carolina Electric & Gas Company,122,6522,coit #1 peaking,-24957.0,,0.001587,19.64,,,,,,,,3437967,36498.0,180652.0,98497,3548005,3572962,,1969,1969,273.0,18.0,,,,,,,,,,,,0.0,,,,,,,,,19.0,,33.0,combustion_turbine,f1_steam_2015_12_159_4_1,14.0
15245,1998,225,359,"Westar Energy, Inc.",576,109,tecumseh,,,0.002283,57.6,0.0,28649.666667,0.0,0.0,497.390046,0.0,24.869502,5571718,,97457.9,41856,5613574,5613574,outdoor,1972,1972,1152.0,41.0,,3701.0,,2199.0,90.0,204248.0,177.298611,2913.0,1498.0,29.602431,51.0,206.9,18529.0,238350.0,,1934.0,,3187.0,34102.0,,33.0,,298.0,combustion_turbine,f1_steam_1998_12_191_0_2,
12913,1997,215,213,New York State Electric & Gas Corporation,11296,553,hickling,,36.0,0.337121,86.5,397480.0,,4.595145,4595.144509,,1.555999,,29632458,35919.0,410335.1,5825610,35493987,35493987,conventional,1948,1952,255450.0,46.0,,367065.0,,373706.0,145769.0,4826659.0,18.894731,555750.0,151710.0,9.979926,492687.0,28.8,39136.0,7376031.0,,371837.0,,51712.0,2549372.0,,87.0,,,steam,f1_steam_1997_12_115_0_4,46.0
27463,2003,301,204,"Nevada Power Company, d/b/a NV Energy",117,756,"clark 5,6,7,8,9,10",,,0.546044,548.2,5685349.0,3792382.666667,10.370939,10370.939438,6917.881552,2.168135,1.446243,218734630,318872.0,443280.2,23952711,243006213,243006213,semioutdoor,1979,1994,2622230.0,500.0,,125585.0,,2590830.0,77997.0,137715062.0,52.5183,792762.0,3174092.0,3.843494,790864.0,56.4,1531640.0,147793588.0,469.0,975466.0,,18821.0,10078526.0,,500.0,500.0,16664.0,combined_cycle,f1_steam_2003_12_108_1_5,500.0
4458,2003,164,349,VIRGINIA ELECTRIC AND POWER COMPANY,15,1451,altavista,6201726.0,,0.647895,71.0,-431620.0,434627.333333,-6.079155,-6079.15493,6121.511737,-1.07111,1.078573,1203106,166667.0,107989.1,95724,7667223,1465497,conventional,2001,2001,402965.0,63.0,74539.0,487867.0,,54361.0,341688.0,9316838.0,23.120713,243974.0,316313.0,5.396201,296218.0,28.5,178116.0,11491318.0,16485.0,151421.0,,13498.0,2174480.0,,,,7961.0,steam,f1_steam_2003_12_186_4_2,63.0
3399,2019,162,277,"Puget Sound Energy, Inc.",201,1649,frederickson 1,443797.0,,0.558071,137.0,29418.0,245176.333333,0.21473,214.729927,1789.608273,0.043924,0.36607,60565889,699814.0,495529.4,6178023,67887523,67443726,outdoor,2002,2002,669752.0,136.0,,307975.0,,965962.0,293455.0,15954091.0,23.820893,11580.0,19083.0,6.47842,1876076.0,30.3,826570.0,20293026.0,,27729.0,,10505.0,4338935.0,,135.0,,5284.0,combined_cycle,f1_steam_2019_12_150_0_4,
18477,2000,245,360,Western Massachusetts Electric Company,772,3694,millstone no. 3,,904.0,0.920849,153.36,,,,,,,,267205102,,2465583.8,110916836,378121938,378121938,conventional,1986,1986,1237099.876,1157.0,,1541740.0,270498.0,9363.0,1668283.0,5837829.0,4.718963,4278863.0,28341.0,11.888125,2818176.0,16.6,1342083.0,20544627.0,505320.0,1487692.0,,756439.0,14706798.0,,143.0,,8784.0,nuclear,f1_steam_2000_12_190_1_2,1146.0
546,1996,29,194,Montaup Electric Company,649,1279,wyman #4 (5) (8),,,0.057025,16.62,,,,,,,,3388066,,240969.1,616842,4004908,4004908,conventional,1978,1978,8302.299,16.0,,26117.0,,11049.0,7721.0,289433.0,34.861789,60978.0,2264.0,18.391773,12100.0,53.2,12489.0,442127.0,,17239.0,,2737.0,152694.0,,16.0,,1323.0,steam,f1_steam_1996_12_104_0_5,
6837,2018,170,90,"Duke Energy Carolinas, LLC",324,7084,lee,,,0.084046,108.0,,,,,,,,61138160,,570878.5,516719,61654879,61654879,conventional,2006,2007,79514.0,96.0,,,,476930.0,-295405.0,4662395.0,58.636152,,,19.356466,452957.0,78.0,782757.0,6201505.0,,,,121871.0,1539110.0,,98.0,,1095.0,combustion_turbine,f1_steam_2018_12_45_0_5,84.0
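Both the EIA and FERC plant tables carry a `plant_id_pudl` column, so one plausible way to combine them is a merge on that ID plus the reporting year. The sketch below uses invented toy frames, and it assumes at most one record per plant and year on each side. The real FERC table may hold several records per `plant_id_pudl` and year, in which case `validate="one_to_one"` would raise an error and the FERC data would need to be aggregated first.

```python
import pandas as pd

# Toy stand-ins for plants_eia and plants_ferc1; all values are invented.
toy_eia = pd.DataFrame(
    {
        "plant_id_pudl": [1, 1, 2],
        "report_date": pd.to_datetime(["2019-01-01", "2020-01-01", "2020-01-01"]),
        "state": ["CO", "CO", "WY"],
    }
)
toy_ferc1 = pd.DataFrame(
    {
        "plant_id_pudl": [1, 1, 3],
        "report_year": [2019, 2020, 2020],
        "opex_total_nonfuel": [100.0, 120.0, 50.0],
    }
)

# The EIA side stores a date, the FERC side a year, so derive a shared key.
toy_eia["report_year"] = toy_eia["report_date"].dt.year

# Keep only plant-years that appear in both tables.
merged = toy_eia.merge(
    toy_ferc1,
    on=["plant_id_pudl", "report_year"],
    how="inner",
    validate="one_to_one",
)
print(merged)
```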


## Selecting Comanche Plant Data
- Rather than looking at *all* power plants, let's narrow things down to generators in Colorado in 2010.

In [10]:
%%time
gens_eia = pd.read_sql("denorm_generators_eia", pudl_engine).convert_dtypes(convert_floating=False)

CPU times: user 48.8 s, sys: 3.86 s, total: 52.6 s
Wall time: 54.5 s
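Reading a whole table takes a while. If only a slice is needed, the filter can be pushed into the SQL query so that less data is loaded. The sketch below shows the idea on a made-up `toy_generators` table in an in-memory SQLite database. It assumes dates are stored as ISO-formatted text, which may not match how `pudl.sqlite` stores `report_date`.

```python
import sqlite3

import pandas as pd

# Throwaway table with invented rows; column names mirror the generators table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_generators "
    "(plant_name_eia TEXT, state TEXT, report_date TEXT, capacity_mw REAL)"
)
conn.executemany(
    "INSERT INTO toy_generators VALUES (?, ?, ?, ?)",
    [
        ("Plant A", "CO", "2010-01-01", 50.0),
        ("Plant A", "CO", "2011-01-01", 50.0),
        ("Plant B", "WY", "2010-01-01", 9.0),
    ],
)

# Select only the columns and rows we need; "?" placeholders keep values
# out of the SQL string itself.
query = """
    SELECT plant_name_eia, capacity_mw
    FROM toy_generators
    WHERE state = ? AND report_date = ?
"""
co_2010 = pd.read_sql_query(query, conn, params=("CO", "2010-01-01"))
conn.close()
print(co_2010)
```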


## EIA Generators
- The EIA Generators table has more than 100 columns. We'll want to pare it down for easier use.
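Paring a wide table down usually means picking a short list of columns and a boolean mask for the rows. Here is a minimal sketch on an invented toy frame; the column names come from the generators table, but the rows are made up.

```python
import pandas as pd

# Toy stand-in for gens_eia with one column we don't care about.
toy_gens = pd.DataFrame(
    {
        "plant_name_eia": ["Plant A", "Plant B", "Plant C"],
        "state": ["CO", "CO", "WY"],
        "fuel_type_code_pudl": ["coal", "gas", "coal"],
        "capacity_mw": [50.0, 154.0, 9.0],
        "street_address": ["1 Main St", "2 Main St", "3 Main St"],
    }
)

# Columns to keep, and a row mask combining two conditions with AND.
keep_cols = ["plant_name_eia", "state", "fuel_type_code_pudl", "capacity_mw"]
is_co_coal = (toy_gens["state"] == "CO") & (toy_gens["fuel_type_code_pudl"] == "coal")

co_coal = toy_gens.loc[is_co_coal, keep_cols]
print(co_coal)
```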

In [11]:
gens_eia.info(max_cols=150)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 556578 entries, 0 to 556577
Data columns (total 102 columns):
 #    Column                                     Non-Null Count   Dtype         
---   ------                                     --------------   -----         
 0    report_date                                556578 non-null  datetime64[ns]
 1    plant_id_eia                               556578 non-null  Int64         
 2    plant_id_pudl                              556538 non-null  Int64         
 3    plant_name_eia                             556575 non-null  string        
 4    utility_id_eia                             556486 non-null  Int64         
 5    utility_id_pudl                            556538 non-null  Int64         
 6    utility_name_eia                           556538 non-null  string        
 7    generator_id                               556578 non-null  string        
 8    associated_combined_heat_power             550206 non-null  boolean     

In [12]:
gens_co2010 = (
    gens_eia.loc[(gens_eia.state=="CO") & (gens_eia.report_date == "2010-01-01")]
)
gens_co2010.sample(10)

Unnamed: 0,report_date,plant_id_eia,plant_id_pudl,plant_name_eia,utility_id_eia,utility_id_pudl,utility_name_eia,generator_id,associated_combined_heat_power,bga_source,bypass_heat_recovery,capacity_mw,carbon_capture,city,cofire_fuels,county,current_planned_generator_operating_date,data_maturity,deliver_power_transgrid,distributed_generation,duct_burners,energy_source_1_transport_1,energy_source_1_transport_2,energy_source_1_transport_3,energy_source_2_transport_1,energy_source_2_transport_2,energy_source_2_transport_3,energy_source_code_1,energy_source_code_2,energy_source_code_3,energy_source_code_4,energy_source_code_5,energy_source_code_6,energy_storage_capacity_mwh,ferc_qualifying_facility,fluidized_bed_tech,fuel_type_code_pudl,fuel_type_count,generator_operating_date,generator_retirement_date,latitude,longitude,minimum_load_mw,multiple_fuels,nameplate_power_factor,net_capacity_mwdc,operating_switch,operational_status,operational_status_code,original_planned_generator_operating_date,other_combustion_tech,other_modifications_date,other_planned_modifications,owned_by_non_utility,ownership_code,planned_derate_date,planned_energy_source_code_1,planned_generator_retirement_date,planned_modifications,planned_net_summer_capacity_derate_mw,planned_net_summer_capacity_uprate_mw,planned_net_winter_capacity_derate_mw,planned_net_winter_capacity_uprate_mw,planned_new_capacity_mw,planned_new_prime_mover_code,planned_repower_date,planned_uprate_date,previously_canceled,prime_mover_code,pulverized_coal_tech,reactive_power_output_mvar,rto_iso_lmp_node_id,rto_iso_location_wholesale_reporting_id,solid_fuel_gasification,startup_source_code_1,startup_source_code_2,startup_source_code_3,startup_source_code_4,state,stoker_tech,street_address,subcritical_tech,summer_capacity_estimate,summer_capacity_mw,summer_estimated_capability_mw,supercritical_tech,switch_oil_gas,syncronized_transmission_grid,technology_description,time_cold_shutdown_full_load_code,timezone,topping_bottoming_code,turbines_inverters_hydrokinetics,turbines_num,ultrasupercritical_tech,unit_id_pudl,uprate_derate_completed_date,uprate_derate_during_year,winter_capacity_estimate,winter_capacity_mw,winter_estimated_capability_mw,zip_code
195780,2010-01-01,54372,4041,University of Colorado,22208,3583,University of Colorado,GT2,True,,True,16.0,,Boulder,,Boulder,NaT,final,True,,False,,,,,,,NG,DFO,,,,,,,,gas,1,1992-08-01,NaT,40.00759,-105.2692,,True,,,,existing,OP,NaT,,NaT,False,,S,NaT,,NaT,False,,,,,,,NaT,NaT,,CT,,,,,False,,,,,CO,,18th St and Colorado,,,15.0,,,,,Natural Gas Fired Combined Cycle,,America/Denver,T,,,,,NaT,,,16.0,,80309.0
197632,2010-01-01,55283,4453,Front Range Power Project,3989,956,Colorado Springs City of,1,False,,False,154.0,,Fountain,,El Paso,NaT,final,True,,False,,,,,,,NG,,,,,,,,,gas,1,2003-04-01,NaT,38.6281,-104.7069,,False,,,,existing,OP,2003-12-01,,NaT,,,S,NaT,,NaT,False,,,,,,,NaT,NaT,,CT,,,,,False,,,,,CO,,6615 Generation Drive,,,132.4,,,,,Natural Gas Fired Combined Cycle,,America/Denver,X,,,,,NaT,,,144.7,,80817.0
201616,2010-01-01,57377,5655,Greater Sandhill I,56720,1927,Greater Sandhill I LLC,GS-P2,False,,False,9.0,,,,Alamosa,NaT,final,True,,False,,,,,,,SUN,,,,,,,,,solar,1,2010-12-01,NaT,37.685467,-105.8909,,False,,,,existing,OP,NaT,,NaT,,,S,NaT,,NaT,False,,,,,,,NaT,NaT,,PV,,,,,False,,,,,CO,,County Road 108 & Eightmile Ln,,,9.0,,,,,Solar Photovoltaic,,America/Denver,X,,,,,NaT,,,9.0,,81146.0
181948,2010-01-01,496,154,Delta (CO),5036,3904,Delta City of,1,False,,False,0.8,,Delta,,Delta,NaT,final,True,,False,,,,,,,NG,DFO,,,,,,,,gas,2,1945-10-01,NaT,38.7314,-108.0708,,False,,,,existing,OP,NaT,,NaT,,,S,NaT,,NaT,False,,,,,,,NaT,NaT,,IC,,,,,False,,,,,CO,,1133 Main St.,,,0.8,,,,,Natural Gas Internal Combustion Engine,,America/Denver,X,,,,,NaT,,,0.8,,81416.0
181938,2010-01-01,492,1464,South Plant,3989,956,Colorado Springs City of,5,False,eia860_org,False,50.0,,Colorado Springs,,El Paso,NaT,final,True,,False,,,,,,,BIT,SUB,NG,,,,,,,coal,1,1962-11-01,NaT,38.824444,-104.8333,,True,,,,existing,OP,NaT,,NaT,,,S,NaT,,NaT,False,,,,,,,NaT,NaT,,ST,True,,,,False,NG,,,,CO,,700 S Conejos St,True,,46.0,,,,,Conventional Steam Coal,,America/Denver,X,,,,1.0,NaT,,,46.0,,80903.0
193424,2010-01-01,10755,3350,American Atlas 1 Cogen,30151,3514,Tri-State G & T Assn Inc,GT2,,,False,15.3,,Rifle,,Garfield,NaT,final,True,,False,,,,,,,NG,,,,,,,,,gas,1,1987-08-01,NaT,39.5173,-107.7299,,False,,,,existing,OP,NaT,,NaT,,,S,NaT,,NaT,False,,,,,,,NaT,NaT,,CT,,,,,False,,,,,CO,,56 B County Road 352,,,13.0,,,,,Natural Gas Fired Combined Cycle,,America/Denver,,,0.0,,,NaT,,,15.0,,81650.0
192466,2010-01-01,10003,3032,Trigen Colorado,19173,1298,Trigen-Nations Energy Co,GEN2,True,eia860_org,False,7.5,,Golden,,Jefferson,NaT,final,True,,False,,,,,,,BIT,NG,WO,DFO,,,,,,coal,1,1977-05-01,NaT,39.7606,-105.215,,True,,,,existing,OP,NaT,,NaT,,,S,NaT,,NaT,False,,,,,,,NaT,NaT,,ST,True,,,,False,NG,,,,CO,,,True,,10.0,,,,,Conventional Steam Coal,,America/Denver,T,,,,1.0,NaT,,,10.0,,80401.0
197634,2010-01-01,55283,4453,Front Range Power Project,3989,956,Colorado Springs City of,3,False,eia860_org,False,233.0,,Fountain,,El Paso,NaT,final,True,,True,,,,,,,NG,,,,,,,,,gas,1,2003-04-01,NaT,38.6281,-104.7069,,False,,,,existing,OP,2003-12-01,,NaT,,,S,NaT,,NaT,False,,,,,,,NaT,NaT,,CA,,,,,False,,,,,CO,,6615 Generation Drive,,,196.4,,,,,Natural Gas Fired Combined Cycle,,America/Denver,X,,,,1.0,NaT,,,207.0,,80817.0
198359,2010-01-01,55650,4596,Plains End Generating Station,15142,2863,Plains End LLC,GEN9,False,,False,5.7,,Arvada,,Jefferson,NaT,final,True,,False,,,,,,,NG,,,,,,,,,gas,1,2002-05-01,NaT,39.857499,-105.225967,,False,,,,existing,OP,2002-01-01,,NaT,,,W,NaT,,NaT,False,,,,,,,NaT,NaT,,IC,,,,,False,,,,,CO,,8950 Highway 93,,,5.7,,,,,Natural Gas Internal Combustion Engine,,America/Denver,X,,,,,NaT,,,5.7,,
197336,2010-01-01,55200,4409,Arapahoe Combustion Turbine,1415,3385,Black Hills Colorado LLC,UN7,False,eia860_org,False,51.8,,Denver,,Denver,NaT,final,True,,True,,,,,,,NG,,,,,,,,,gas,1,2002-10-01,NaT,39.6692,-105.0018,,False,,,,existing,OP,2002-07-01,,NaT,,,S,NaT,,NaT,False,,,,,,,NaT,NaT,,CA,,,,,False,,,,,CO,,2601 South Platte River Road,,,44.5,,,,,Natural Gas Fired Combined Cycle,,America/Denver,X,,,,1.0,NaT,,,48.6,,80223.0


In [13]:
import geopandas as gpd
map_cols = [
    "plant_id_eia",
    "plant_id_pudl",
    "plant_name_eia",
    "generator_id",
    "utility_id_eia",
    "utility_id_pudl",
    "utility_name_eia",
    "latitude",
    "longitude",
    "capacity_mw",
    "report_date",
    "state",
]

df = (
    gens_co2010.loc[:, map_cols]
    .dropna(subset=["longitude", "latitude"])
    .astype({"report_date": "string"})
)

gdf = (
    gpd.GeoDataFrame(
        df,
        geometry=gpd.points_from_xy(df.longitude, df.latitude),
        crs="EPSG:4326",
    )
)

gdf.explore(
    marker_type="circle",
    style_kwds={
        "style_function": lambda x: {"radius": 10*x["properties"]["capacity_mw"]}
    },
)


# Read Hourly Generation & Emissions Data from Apache Parquet
* The full hourly emissions time series for thousands of US power plants covering 1995-2022 contains almost a billion records.
* The data is stored in a single [Apache Parquet file](https://parquet.apache.org/) with row-groups defined by year and state.
* This compressed columnar format enables very efficient queries with appropriate tooling, including [Dask](https://www.dask.org/) and [PyArrow](https://arrow.apache.org/docs/python/index.html).
* Reading the entire dataset into memory at once will probably exceed the available RAM.
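* The `filters` argument that selects which row groups to read is written in [Disjunctive Normal Form](https://blog.datasyndrome.com/python-and-parquet-performance-e71da65269ce) (DNF): a list of AND-groups that are combined with OR.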
* Using Dask's lazy evaluation and the filter criteria, we can minimize the data read off of disk and limit memory usage.
* The Dask project has lots of [tutorials and documentation](https://www.dask.org/get-started) if you want to learn more.
* Other tools like [DuckDB](https://duckdb.org/docs/data/parquet/overview.html) ([Python API](https://duckdb.org/docs/api/python/overview)) also provide good Parquet support. 
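
To make the DNF semantics concrete, here is a minimal pure-Python sketch of how a list of AND-groups combined with OR selects records. The `matches_dnf` helper and the toy `records` list are hypothetical names introduced only for illustration. Real Parquet readers apply the same logic against row-group statistics and column values rather than Python dictionaries.

```python
import operator

# Comparison operators handled by this sketch (an illustrative subset).
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches_dnf(record, dnf_filters):
    """Return True if `record` satisfies every condition in at least one inner list."""
    return any(
        all(OPS[op](record[col], value) for col, op, value in group)
        for group in dnf_filters
    )


# Each inner list is an AND-group; the outer list ORs the groups together.
state_year_filters = [
    [("year", "=", 2019), ("state", "=", "CO")],
    [("year", "=", 2019), ("state", "=", "WY")],
    [("year", "=", 2020), ("state", "=", "CO")],
    [("year", "=", 2020), ("state", "=", "WY")],
]

# Toy records standing in for rows of the hourly emissions table.
records = [
    {"year": 2019, "state": "CO"},
    {"year": 2020, "state": "WY"},
    {"year": 2018, "state": "CO"},
    {"year": 2020, "state": "TX"},
]

kept = [r for r in records if matches_dnf(r, state_year_filters)]
print(kept)
```

Only the first two toy records survive: the 2018 record fails every `year` condition, and the TX record fails every `state` condition.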

## Using Dask to selectively read Parquet data

In [14]:
%%time
from dask import dataframe as dd
# In a DNF filter, the conditions within each inner list are combined with AND,
# while the inner lists themselves are combined with OR.
# So this filter selects all 2019 and 2020 records for CO and WY:
state_year_filters = [
    [('year', '=', 2019), ('state', '=', 'CO')],
    [('year', '=', 2019), ('state', '=', 'WY')],
    [('year', '=', 2020), ('state', '=', 'CO')],
    [('year', '=', 2020), ('state', '=', 'WY')],
]
co_wy_cems = dd.read_parquet(
    f"{pudl_path}/hourly_emissions_epacems.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
    filters=state_year_filters,
).compute()
co_wy_cems.info()

Use the `index` argument to set a sorted column as your index to create a DataFrame collection with known `divisions`.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1662192 entries, 0 to 1662191
Data columns (total 16 columns):
 #   Column                     Non-Null Count    Dtype                                                       
---  ------                     --------------    -----                                                       
 0   plant_id_eia               1662192 non-null  int32[pyarrow]                                              
 1   plant_id_epa               1662192 non-null  int32[pyarrow]                                              
 2   emissions_unit_id_epa      1662192 non-null  string                                                      
 3   operating_datetime_utc     1662192 non-null  timestamp[ms, tz=UTC][pyarrow]                              
 4   year                       1662192 non-null  int32[pyarrow]                                              
 5   state                      1662192 non-null  dictionary<values=string, indices=int32, ordered=0>[pyarrow]

In [15]:
co_wy_cems.sample(10)

Unnamed: 0,plant_id_eia,plant_id_epa,emissions_unit_id_epa,operating_datetime_utc,year,state,operating_time_hours,gross_load_mw,heat_content_mmbtu,steam_load_1000_lbs,so2_mass_lbs,so2_mass_measurement_code,nox_mass_lbs,nox_mass_measurement_code,co2_mass_tons,co2_mass_measurement_code
175344,6248,6248,1,2020-06-21 07:00:00+00:00,2020,CO,1.0,364.0,3937.100098,,309.0,Measured,228.352005,Calculated,412.899994,Measured
1006989,55283,55283,1,2019-04-11 04:00:00+00:00,2019,CO,1.0,142.0,1205.800049,,0.724,Measured,32.556999,Calculated,71.699997,Measured
750909,6112,6112,2,2019-01-16 04:00:00+00:00,2019,CO,1.0,184.0,1335.5,,0.801,Measured,36.058998,Calculated,79.400002,Measured
360711,55200,55200,CT6,2020-07-28 22:00:00+00:00,2020,CO,1.0,34.0,293.600006,,0.176,Measured,3.523,Calculated,17.4,Measured
141209,6112,6112,3,2020-08-02 00:00:00+00:00,2020,CO,1.0,212.0,1597.199951,,0.958,Measured,54.305,Calculated,94.900002,Measured
1119705,55645,55645,CT-02,2019-02-21 16:00:00+00:00,2019,CO,1.0,102.0,1245.699951,,0.747,Measured,36.125,Calculated,74.0,Measured
524350,56445,56445,CT-01,2020-03-15 05:00:00+00:00,2020,CO,1.0,156.0,1584.099976,,0.951,Measured,58.612,Calculated,94.099998,Measured
1222770,56998,56998,CT08,2019-11-28 01:00:00+00:00,2019,CO,0.0,,,,,,,,,
487193,55505,55505,BR2,2020-12-22 00:00:00+00:00,2020,CO,0.0,,,,,,,,,
90114,525,525,H1,2020-10-08 01:00:00+00:00,2020,CO,1.0,125.0,1371.900024,,165.100006,Measured,61.736,Calculated,140.800003,Measured


## Read all Colorado Emissions Data

In [16]:
%%time
colorado_cems = dd.read_parquet(
    f"{pudl_path}/hourly_emissions_epacems.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
    filters=[("state", "=", "CO")],
).compute()
colorado_cems.info()

Use the `index` argument to set a sorted column as your index to create a DataFrame collection with known `divisions`.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13631472 entries, 0 to 13631471
Data columns (total 16 columns):
 #   Column                     Dtype                                                       
---  ------                     -----                                                       
 0   plant_id_eia               int32[pyarrow]                                              
 1   plant_id_epa               int32[pyarrow]                                              
 2   emissions_unit_id_epa      string                                                      
 3   operating_datetime_utc     timestamp[ms, tz=UTC][pyarrow]                              
 4   year                       int32[pyarrow]                                              
 5   state                      dictionary<values=string, indices=int32, ordered=0>[pyarrow]
 6   operating_time_hours       float[pyarrow]                                              
 7   gross_load_mw              float[pyarrow]  

In [17]:
colorado_cems.sample(10)

Unnamed: 0,plant_id_eia,plant_id_epa,emissions_unit_id_epa,operating_datetime_utc,year,state,operating_time_hours,gross_load_mw,heat_content_mmbtu,steam_load_1000_lbs,so2_mass_lbs,so2_mass_measurement_code,nox_mass_lbs,nox_mass_measurement_code,co2_mass_tons,co2_mass_measurement_code
1751760,525,525,H2,2006-07-15 07:00:00+00:00,2006,CO,1.0,284.0,2874.399902,,768.900024,Measured,980.169983,Calculated,294.899994,Measured
11864908,55127,55127,CT2,2022-05-04 11:00:00+00:00,2022,CO,0.0,,,,,,,,,
2304596,6021,6021,C1,2000-08-02 03:00:00+00:00,2000,CO,1.0,452.0,4365.299805,,1337.5,Measured,1619.526001,Calculated,447.899994,Measured
10293964,56445,56445,CT-01,2018-06-15 11:00:00+00:00,2018,CO,0.0,,,,,,,,,
2715902,10682,10682,GT2,2008-05-25 21:00:00+00:00,2008,CO,0.0,,,,,,,,,
12267784,492,492,6,1997-04-30 23:00:00+00:00,1997,CO,1.0,65.0,780.700012,,542.299988,Measured,804.901978,Calculated,80.0,Measured
13005567,469,469,3,2013-07-20 22:00:00+00:00,2013,CO,1.0,142.0,1544.699951,,148.800003,Measured,549.913025,Calculated,158.5,Measured
13328053,50707,50707,S001,2013-05-13 20:00:00+00:00,2013,CO,0.0,,,,,,,,,
11030771,525,525,H1,2020-10-09 18:00:00+00:00,2020,CO,1.0,127.0,1384.699951,,144.399994,Measured,60.926998,Calculated,142.100006,Measured
3666481,478,478,3,2019-11-03 08:00:00+00:00,2019,CO,0.0,,,,,,,,,


## Visualize Hourly Power Plant Operations
* Let's find a particular power plant and look at its long-term operations.
* Say we want to investigate [Xcel Energy's troubled Comanche coal plant](https://coloradosun.com/?s=comanche%20pueblo) in Pueblo, CO.
* The EPA CEMS data contains only the EIA Plant ID, not the plant's name or any ownership information.
* The PUDL database links these IDs to much more extensive EIA data.
* We can look for the Comanche plant in the PUDL DB and use that information to select the appropriate EPA CEMS data to plot.
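
The ID-based link described above can be sketched on toy tables. This is an illustration under stated assumptions, not the notebook's own workflow: the `plants` and `cems` frames below are small stand-ins, the plant IDs, names, and states are copied from the notebook's outputs, and the `gross_load_mw` values are invented placeholders.

```python
import pandas as pd

# Toy stand-in for the PUDL plants table (IDs, names, and states from the notebook output).
plants = pd.DataFrame(
    {
        "plant_id_eia": [470, 6145, 59656],
        "plant_name_eia": ["Comanche", "Comanche Peak", "Comanche Solar"],
        "state": ["CO", "TX", "CO"],
    }
)

# Toy stand-in for hourly CEMS records; the load values are invented placeholders.
cems = pd.DataFrame(
    {
        "plant_id_eia": [470, 470, 525],
        "emissions_unit_id_epa": ["1", "2", "H1"],
        "gross_load_mw": [300.0, 310.0, 125.0],
    }
)

# Step 1: look up the plant ID by name and state in the plants table.
is_match = (plants["plant_name_eia"].str.lower() == "comanche") & (plants["state"] == "CO")
plant_ids = plants.loc[is_match, "plant_id_eia"].tolist()

# Step 2: keep only the CEMS rows for those IDs, then attach the plant name.
selected = cems[cems["plant_id_eia"].isin(plant_ids)].merge(
    plants[["plant_id_eia", "plant_name_eia"]],
    on="plant_id_eia",
    how="left",
    validate="many_to_one",  # fail loudly if a plant ID appears twice in `plants`
)
print(selected)
```

The real plants table holds one row per plant per report year, so it would need to be reduced to one row per `plant_id_eia` (for example with `drop_duplicates`) before a `many_to_one` merge like this would pass validation.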

In [18]:
%%time
plants_eia = pd.read_sql("denorm_plants_eia", pudl_engine).convert_dtypes()
plants_eia.sample(10)

CPU times: user 11.1 s, sys: 302 ms, total: 11.4 s
Wall time: 12 s


Unnamed: 0,plant_id_eia,plant_name_eia,city,county,latitude,longitude,state,street_address,zip_code,timezone,report_date,ash_impoundment,ash_impoundment_lined,ash_impoundment_status,balancing_authority_code_eia,balancing_authority_name_eia,datum,energy_storage,ferc_cogen_docket_no,ferc_cogen_status,ferc_exempt_wholesale_generator_docket_no,ferc_exempt_wholesale_generator,ferc_small_power_producer_docket_no,ferc_small_power_producer,ferc_qualifying_facility_docket_no,grid_voltage_1_kv,grid_voltage_2_kv,grid_voltage_3_kv,iso_rto_code,liquefied_natural_gas_storage,natural_gas_local_distribution_company,natural_gas_storage,natural_gas_pipeline_name_1,natural_gas_pipeline_name_2,natural_gas_pipeline_name_3,nerc_region,net_metering,pipeline_notes,primary_purpose_id_naics,regulatory_status_code,reporting_frequency_code,sector_id_eia,sector_name_eia,service_area,transmission_distribution_owner_id,transmission_distribution_owner_name,transmission_distribution_owner_state,utility_id_eia,water_source,data_maturity,plant_id_pudl,utility_name_eia,utility_id_pudl,balancing_authority_code_eia_consistent_rate
33027,2606,Schaghticoke,Schaghticoke,Rensselaer,42.8992,-73.5989,NY,Chesnut Street,12154,America/New_York,2006-01-01,,,,NYIS,,,,,False,,False,,True,,,,,,,,,,,,NPCC,,,22,NR,,2.0,NAICS-22 Non-Cogen,Niagara Mohawk Power Corp,,,,5914,Hoosic,final,2115,Erie Boulevard Hydropower LP,1650,1.0
156849,58839,Sudbury Landfill,Sudbury,Middlesex,42.363889,-71.385,MA,20 Boston Post Road,1776,America/New_York,2017-01-01,False,False,,ISNE,ISO New England Inc.,,False,,False,,False,,False,,13.8,,,,False,,False,,,,NPCC,,,22,NR,,2.0,NAICS-22 Non-Cogen,,54913.0,NSTAR Electric Company,MA,58721,,final,6803,Solar Sudbury One LF LLC,3216,1.0
82812,50278,Worumbo Hydro,Lisbon Falls,Androscoggin,43.994729,-70.06192,ME,Canal Street,4252,America/New_York,2016-01-01,False,False,,ISNE,ISO New England Inc.,,False,,False,,False,86-548-000,True,,34.5,,,,False,,False,,,,NPCC,,,22,NR,,2.0,NAICS-22 Non-Cogen,,3266.0,Central Maine Power Co,ME,57280,Androscoggin River,final,3539,Eagle Creek RE LLC,1564,1.0
40490,3406,Johnsonville,New Johnsonville,Humphreys,36.0278,-87.9861,TN,Hwy 70 PO Box 259,37134,America/Chicago,2009-01-01,,,,TVA,,,,,False,,False,,False,,,,,,,,,,,,SERC,,,22,RE,M,1.0,Electric Utility,,18642.0,Tennessee Valley Authority,TN,18642,Tennessee River,final,2296,Tennessee Valley Authority,3430,1.0
69779,10110,Recot Inc Cogen,Bakersfield,Kern,35.395063,-119.3216,CA,28801 Highway 58,93314,America/Los_Angeles,2008-01-01,,,,CISO,,,,85-264-000,True,,False,,False,,,,,,,,,,,,WECC,,,311,NR,,7.0,Industrial NAICS Cogen,,14328.0,Pacific Gas & Electric Co,CA,15743,Wells,final,3074,Frito-Lay Inc,1824,1.0
147053,57940,Wausau Paper Middletown,Middletown,Butler,39.519989,-84.40615,OH,700 Columbia Avenue,45042,America/New_York,2019-01-01,True,True,OP,PJM,"PJM Interconnection, LLC",,False,,False,,False,,False,,69.0,,,,False,,False,,,,RFC,,,322,NR,M,7.0,Industrial NAICS Cogen,,3542.0,Duke Energy Ohio Inc,OH,57317,Miami River,final,6080,Wausau Paper Middletown,3716,1.0
174,11,H Neely Henry Dam,Ohatchee,Calhoun,33.7845,-86.0524,AL,1021 Ala Highway 144,36271,America/Chicago,2016-01-01,False,False,,SOCO,"Southern Company Services, Inc. - Trans",,False,,False,,False,,False,,115.0,,,,,,,,,,SERC,,,22,RE,,1.0,Electric Utility,,195.0,Alabama Power Co,AL,195,Coosa River,final,839,Alabama Power Co,18,1.0
70198,10143,Colver,Colver,Cambria,40.55,-78.79794,PA,141 Interpower Drive,15927,America/New_York,2011-01-01,,,,PJM,,,,,False,,False,87-632-002; 87-632-004,True,,115.0,,,PJM,,,,,,,RFC,,,22,NR,A,2.0,NAICS-22 Non-Cogen,,14711.0,Pennsylvania Electric Co,OH,9379,Vetera Reservoir,final,3091,"Inter-Power/AhlCon Partners, L.P.",2110,1.0
111876,55437,Putnam Energy Center,,Putman,,,IN,,46120,America/New_York,2001-01-01,,,,,,,,,,,,,,,,,,,,,,,,,ECAR,,,22,,,,,PSI Energy Inc,,,,15493,Cloverdale Municipality,final,13283,Putnam Energy Center LLC,7145,
108705,55243,Astoria Gas Turbines,Astoria,Queens,40.787,-73.9048,NY,31-01 20th Avenue,11105,America/New_York,2017-01-01,False,False,,NYIS,New York Independent System Operator,,False,,False,99-167-000,True,,False,,138.0,,,,False,Other - See pipeline notes.,False,,,,NPCC,,Consolidated Edison Co-NY Inc,22,NR,A,2.0,,,4226.0,Consolidated Edison Co-NY Inc,NY,13582,,final,4434,NRG Astoria Gas Turbine Operations Inc,2666,1.0


By selecting a few informative columns and the records with "Comanche" in the plant name, we find that the coal plant we're looking for has `plant_id_eia==470`.

In [19]:
plants_eia.loc[
    plants_eia.plant_name_eia.str.contains("comanche", case=False),
    [
        "plant_id_eia",
        "plant_name_eia",
        "utility_name_eia",
        "city",
        "state",
        "latitude",
        "longitude",
    ]
].drop_duplicates()

Unnamed: 0,plant_id_eia,plant_name_eia,utility_name_eia,city,state,latitude,longitude
7460,470,Comanche,Public Service Co of Colorado,Pueblo,CO,38.2081,-104.5747
50775,6145,Comanche Peak,Luminant Generation Company LLC,Glen Rose,TX,32.298365,-97.78552
50788,6145,Comanche Peak,TXU Generation Co LP,Glen Rose,TX,32.298365,-97.78552
67387,8059,Comanche,Public Service Co of Oklahoma,Lawton,OK,34.5431,-98.3244
164570,59656,Comanche Solar,Novatus Energy,Pueblo,CO,38.205278,-104.5667
164575,59656,Comanche Solar,Comanche LLC,Pueblo,CO,38.205278,-104.5667


In [20]:
comanche_cems = colorado_cems[colorado_cems.plant_id_eia==470]
comanche_cems.info()

<class 'pandas.core.frame.DataFrame'>
Index: 569760 entries, 78840 to 13044551
Data columns (total 16 columns):
 #   Column                     Non-Null Count   Dtype                                                       
---  ------                     --------------   -----                                                       
 0   plant_id_eia               569760 non-null  int32[pyarrow]                                              
 1   plant_id_epa               569760 non-null  int32[pyarrow]                                              
 2   emissions_unit_id_epa      569760 non-null  string                                                      
 3   operating_datetime_utc     569760 non-null  timestamp[ms, tz=UTC][pyarrow]                              
 4   year                       569760 non-null  int32[pyarrow]                                              
 5   state                      569760 non-null  dictionary<values=string, indices=int32, ordered=0>[pyarrow]
 6   ope