# Working with EPA CEMS data

CEMS, or **Continuous Emissions Monitoring Systems** data, is a product of the EPA's Air Emission Measurement Center / Clean Air Market Programs.

Website: https://www.epa.gov/emc/emc-continuous-emission-monitoring-systems

### Setup

The following cells set up access to the CEMS dataset through pudl.

In [31]:
# Standard libraries
import logging
import os
import pathlib
import sys

# 3rd party libraries
import dask.dataframe as dd
from dask.distributed import Client
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import seaborn as sns
import sqlalchemy as sa

# Local libraries
import pudl

In [32]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]

In [33]:
pudl_settings = pudl.workspace.setup.get_defaults()
#display(pudl_settings)

ferc1_engine = sa.create_engine(pudl_settings['ferc1_db'])
#display(ferc1_engine)

pudl_engine = sa.create_engine(pudl_settings['pudl_db'])
#display(pudl_engine)

In [5]:
#pudl_engine.table_names()
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine)

### Accessing CEMS data

The CEMS dataset is enormous! It contains hourly emissions data on an XXXX basis between YEAR and 2019, meaning that the full dataset is close to a billion rows and 100 GB. That's a lot to store on your computer when you may only need a fraction of it for analysis. This section will help you make sure the underlying data you need is saved to your computer and show you how to access it programmatically.

#### 1. Make sure you've downloaded the appropriate raw data (and only that).

Information about when / how to get the CEMS files and whether they were included in your initial download of pudl.

#### 2. Select a subset of the raw data using Dask

Dask is a Python package that parallelizes pandas dataframes so that you can work with larger-than-memory data. With Dask, you can select the subset of CEMS data that you'd like to analyse *before* loading it into a pandas dataframe. Until then, you can interact with the Dask dataframe much as you would with a pandas dataframe.

In [34]:
# Select a year or years to observe
year = 2018

# Locate the data for the given year/s on your hard drive.
epacems_path = (pudl_settings['parquet_dir'] + f'/epacems/year={year}')

# Create a Dask object for preliminary data interaction
cems_dd = dd.read_parquet(epacems_path)
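
The comment above mentions selecting several years, while the path points at a single `year=` directory. One way to extend it, sketched here with a hypothetical helper `epacems_year_paths` and a placeholder `parquet_dir`, is to build one path per year with `pathlib` rather than string concatenation. It assumes the same `epacems/year=YYYY` layout used in the cell above.

```python
from pathlib import PurePosixPath


def epacems_year_paths(parquet_dir, years):
    """Return one 'epacems/year=YYYY' path string per requested year.

    Hypothetical helper: it only builds path strings and does not check
    that the directories exist on disk.
    """
    base = PurePosixPath(parquet_dir) / "epacems"
    return [str(base / f"year={yr}") for yr in years]


# '/data/parquet' is a placeholder, not a real PUDL setting.
paths = epacems_year_paths("/data/parquet", range(2017, 2019))
print(paths)
```

Each path can then be read one year at a time.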

Now you can learn things about the data such as column names and datatypes. If you take a look at the length of the Dask dataframe, you'll understand why we're not in pandas yet.

In [28]:
len(cems_dd)  # Number of rows in the selected year

36768792

In [25]:
cems_dd

Dask DataFrame Structure (npartitions=49):

| Column | dtype |
|---|---|
| plant_id_eia | int32 |
| unitid | object |
| operating_datetime_utc | datetime64[ns, UTC] |
| operating_time_hours | float32 |
| gross_load_mw | float32 |
| steam_load_1000_lbs | float32 |
| so2_mass_lbs | float32 |
| so2_mass_measurement_code | category[unknown] |
| nox_rate_lbs_mmbtu | float32 |
| nox_rate_measurement_code | category[unknown] |
| nox_mass_lbs | float32 |
| nox_mass_measurement_code | category[unknown] |
| co2_mass_tons | float32 |
| co2_mass_measurement_code | category[unknown] |
| heat_content_mmbtu | float32 |
| facility_id | int32 |
| unit_id_epa | int32 |
| state | category[known] |
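
To put that row count in perspective, here is a rough back-of-the-envelope estimate of the in-memory size of one year of data. It counts only the fixed-width columns from the dtypes listed above (eight `float32`, three `int32`, and one `datetime64` column) and ignores the `object` and `category` columns, so under those assumptions it is a lower bound rather than a measurement.

```python
# Row count reported by len(cems_dd) for 2018.
n_rows = 36_768_792

# Bytes per value for the fixed-width dtypes in the structure above.
bytes_per_row = (
    8 * 4    # eight float32 columns
    + 3 * 4  # three int32 columns
    + 1 * 8  # one datetime64[ns, UTC] column
)

# Lower bound: object and category columns are not counted.
total_bytes = n_rows * bytes_per_row
print(f"{bytes_per_row} bytes/row, about {total_bytes / 1e9:.2f} GB")
```

That is close to 2 GB for a single year before the string and categorical columns are counted, which is why it pays to trim columns and aggregate before moving to pandas.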


In [37]:
cems_dd.columns.tolist()
# For further information about the contents of each column, see:
#

['plant_id_eia',
 'unitid',
 'operating_datetime_utc',
 'operating_time_hours',
 'gross_load_mw',
 'steam_load_1000_lbs',
 'so2_mass_lbs',
 'so2_mass_measurement_code',
 'nox_rate_lbs_mmbtu',
 'nox_rate_measurement_code',
 'nox_mass_lbs',
 'nox_mass_measurement_code',
 'co2_mass_tons',
 'co2_mass_measurement_code',
 'heat_content_mmbtu',
 'facility_id',
 'unit_id_epa',
 'state']

Now that you know what's available, you'll want to pick which columns you'd like to work with and aggregate rows if necessary.

In [35]:
# A list of the columns you'd like to include in your analysis
my_cols = [
    'state',
    'plant_id_eia', 
    'unitid',
    'so2_mass_lbs', 
    'nox_mass_lbs', 
    'co2_mass_tons',
]

# Sum the selected emissions columns by plant_id_eia, unitid, and state.
# Adding 'state' to the groupby caused problems, so it is cast from
# category to string here before grouping.
my_cems_dd = (
    dd.read_parquet(epacems_path, columns=my_cols)
    .assign(state=lambda x: x['state'].astype('string'))
    .groupby(['plant_id_eia', 'unitid', 'state'])[
         ['so2_mass_lbs', 'nox_mass_lbs', 'co2_mass_tons']]
    .sum())
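
The cast to string in the cell above is worth a note. One plausible reason grouping on a categorical column misbehaves (an assumption here, not something confirmed for this dataset) is that pandas-style groupbys can emit a row for every category, including combinations that never occur in the data. The toy pandas example below, with made-up values, illustrates the difference.

```python
import pandas as pd

# Made-up data: 'state' is categorical with one category ('WY') never used.
toy = pd.DataFrame({
    "plant_id_eia": [1, 1, 2],
    "state": pd.Categorical(["CO", "CO", "TX"], categories=["CO", "TX", "WY"]),
    "co2_mass_tons": [10.0, 5.0, 7.0],
})

# observed=False keeps unobserved category combinations in the result.
all_combos = (
    toy.groupby(["plant_id_eia", "state"], observed=False)["co2_mass_tons"].sum()
)

# observed=True keeps only the combinations present in the data.
observed_only = (
    toy.groupby(["plant_id_eia", "state"], observed=True)["co2_mass_tons"].sum()
)

# Casting to string first, as in the cell above, also avoids the extra rows.
as_string = (
    toy.assign(state=lambda x: x["state"].astype("string"))
    .groupby(["plant_id_eia", "state"])["co2_mass_tons"]
    .sum()
)

print(len(all_combos), len(observed_only), len(as_string))
```

With thousands of plants and dozens of states, the number of unobserved combinations can dwarf the number of real ones, so either casting to string or passing `observed=True` keeps the result compact.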

#### 3. Transfer your desired data to pandas

Now that you've selected the data you want to work with, we'll transfer it to pandas so that all rows are accessible. It'll take a moment to run because there are so many rows.

In [36]:
# Create a pandas dataframe out of your Dask dataframe and add a column
# to indicate the year the data are coming from.
# This may take a moment to run...
client = Client()
my_cems_df = (
    client.compute(my_cems_dd)
    .result()
    .assign(year=year)
)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 58018 instead
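
Because the aggregation groups on several keys, the computed dataframe comes back with those keys as a MultiIndex rather than as ordinary columns. If flat columns are more convenient, `reset_index()` is one option. The sketch below uses a small made-up stand-in for `my_cems_df` instead of real CEMS output.

```python
import pandas as pd

# Made-up stand-in for the aggregated result (values are illustrative only).
toy_agg = (
    pd.DataFrame({
        "plant_id_eia": [1, 1, 2],
        "unitid": ["A", "A", "B"],
        "co2_mass_tons": [10.0, 5.0, 7.0],
    })
    .groupby(["plant_id_eia", "unitid"])[["co2_mass_tons"]]
    .sum()
    .assign(year=2018)
)

# Move the groupby keys out of the index and into regular columns.
flat = toy_agg.reset_index()
print(flat.columns.tolist())
```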


In [10]:
# CEMS access via the Dask distributed client. Open question: why is that
# needed rather than just straight pandas.read_parquet()?

client = Client()
cols = ['plant_id_eia', 'unitid',
        'so2_mass_lbs', 'nox_mass_lbs', 'co2_mass_tons']

# Collect one aggregated dataframe per year, then concatenate once at the end.
# Note: range() excludes its end point, so this covers 2018 only.
yearly_dfs = []
for yr in range(2018, 2019):
    epacems_path = (pudl_settings['parquet_dir'] + f'/epacems/year={yr}')
    cems_dd = (
        dd.read_parquet(epacems_path, columns=cols)
        .groupby(['plant_id_eia', 'unitid'])[
            ['so2_mass_lbs', 'nox_mass_lbs', 'co2_mass_tons']]
        .sum())
    cems_df = (
        client.compute(cems_dd)
        .result()
        .assign(year=yr))
    yearly_dfs.append(cems_df)
out_df = pd.concat(yearly_dfs)

In [15]:
epacems_path = (pudl_settings['parquet_dir'] + '/epacems/year=2018')
#pd.read_parquet(epacems_path)
cems_dd = dd.read_parquet(epacems_path)
