| ![EEW logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/eew.jpg?raw=true) | ![EDGI logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/edgi.png?raw=true) |
|---|---|

#### This notebook is licensed under GPL 3.0. Please visit our Github repo for more information: https://github.com/edgi-govdata-archiving/ECHO-COVID19
#### The notebook was collaboratively authored by EDGI following our authorship protocol: https://docs.google.com/document/d/1CtDN5ZZ4Zv70fHiBTmWkDJ9mswEipX6eCYrwicP66Xw/
#### For more information about this project, visit https://www.environmentalenforcementwatch.org/

# Examining Data from Multiple EPA Programs

This notebook examines data from the EPA's Enforcement and Compliance History Online (ECHO) database (https://echo.epa.gov/). It includes information from EPA's programs covering air quality (the Clean Air Act, or CAA), water quality (the Clean Water Act, or CWA), drinking water (Safe Drinking Water Act, SDWA) and hazardous and other waste processing (the Resource Recovery and Conservation Act, or RCRA). 

ECHO data is available for facility violations as well as inspections and enforcement actions by EPA, state and other agencies. The data made accessible here runs from the present day (the database is refreshed weekly) back to 2001, which is when the EPA believes the data to be most reliable. It is available at the Congressional District level for a selected state, and for counties and zip codes of your choosing. 

## How to Run
* A "cell" in a Jupyter notebook is a block of code performing a set of actions making available or using specific data.  The notebook works by running one cell after another, as the notebook user selects offered options.
* If you click on a gray **code** cell, a little “play button” arrow appears on the left. If you click the play button, it will run the code in that cell (“**running** a cell”). The button will animate. When the animation stops, the cell has finished running.
![Where to click to run the cell](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/pressplay.JPG?raw=true)
* You may get a warning that the notebook was not authored by Google. We know, we authored them! It’s okay. Click “Run Anyway” to continue. 
![Error Message](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/warning-message.JPG?raw=true)
* **It is important to run cells in order because they depend on each other.**
* Run all of the cells in a Notebook to make a complete report. Please feel free to look at and **learn about each result as you create it**!

---

# **Let's begin!**

Hover over the "[ ]" on the top left corner of the cell below and you should see a "play" button appear. Click on it to run the cell then move to the next one.

These first two cells give us access to some external Python code we will need.

### 1.  Bring in some code that is stored in a Github project.

In [None]:
!git clone https://github.com/edgi-govdata-archiving/ECHO_modules.git
!git clone https://github.com/edgi-govdata-archiving/ECHO-Cross-Program.git -b allprograms-openhour
print("Done!")

### 2.  Run a few Python modules.
These will help us process and visualize the different program data sets later.

In [54]:
%run ECHO_modules/DataSet.py
%run ECHO-Cross-Program/utilities.py
%run ECHO_modules/make_data_sets.py
print("Done!")

Done!


### 3.  Run this next cell to create the widget to choose whether you want to view facilities by state, county, congressional district or zip code.
Choose the type of region and proceed to the next cell.

In [84]:
region_type = 'Congressional District'
state_cds = [ ('MA', 4), ('LA', 2), ('MA', 7)]
data_set_list = ['RCRA Violations', 'RCRA Inspections', 'RCRA Penalties', 'CAA Enforcements',
                 'CAA Violations', 'CAA Inspections', 'CAA Penalties', 'Greenhouse Gas Emissions', 
                 'CWA Violations', 'CWA Inspections', 'CWA Penalties', ]

### 4. This cell makes the data sets and stores the results for each of them from the database.

In [85]:
data_sets=make_data_sets( data_set_list )
for state, cd in state_cds:
    for ds_key, data_set in data_sets.items():
        print( ds_key )
        %time data_set.store_results( region_type=region_type, region_value=cd, state=state )

RCRA Violations
1939 program records were found
CPU times: user 43.4 ms, sys: 12.4 ms, total: 55.8 ms
Wall time: 1.83 s
RCRA Inspections
2534 program records were found
CPU times: user 45.5 ms, sys: 16.6 ms, total: 62 ms
Wall time: 1.88 s
RCRA Penalties
1065 program records were found
CPU times: user 28.6 ms, sys: 8.45 ms, total: 37 ms
Wall time: 1.29 s
CAA Inspections
270 program records were found
CPU times: user 26.3 ms, sys: 3.97 ms, total: 30.3 ms
Wall time: 790 ms
CAA Enforcements
7 program records were found
CPU times: user 19.2 ms, sys: 0 ns, total: 19.2 ms
Wall time: 256 ms
CAA Violations
85 program records were found
CPU times: user 19 ms, sys: 3.8 ms, total: 22.8 ms
Wall time: 452 ms
CAA Penalties
153 program records were found
CPU times: user 19.3 ms, sys: 3.81 ms, total: 23.2 ms
Wall time: 831 ms
Greenhouse Gas Emissions
320 program records were found
CPU times: user 23.6 ms, sys: 0 ns, total: 23.6 ms
Wall time: 665 ms
CWA Violations
2837 program records were found
CPU tim

In [None]:
# Development - save the data so we can read it again locally (quickly) without 
# going to the database

for state, cd in state_cds:
    for ds_key, data_set in data_sets.items():
        write_file( data_set.results[(region_type, cd, state)].dataframe, 
                   ds_key, region_type, state, cd )

### 5. This cell will show a chart for each data set

In [None]:
for ds_key, data_set in data_sets.items():
    print( ds_key )
    if ( ds_key != 'RCRA Penalties' ):
        data_set.show_charts()

### 6. Get the State data for comparisons

In [57]:
states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique
state_echo_data = {}
state_echo_active = {}
for state in states:
    state_echo_data[state] = read_file( 'ECHO_EXPORTER', 'State', state, None )
    if ( state_echo_data[state] is None ):
        sql = 'select * from "ECHO_EXPORTER" where "FAC_STATE" = \'{}\''.format( state )
        state_echo_data[state] = get_data( sql, 'REGISTRY_ID' )
        write_file( state_echo_data[state], 'ECHO_EXPORTER', 'State', state, None )
    state_echo_active[state] = state_echo_data[state].loc[state_echo_data[state]['FAC_ACTIVE_FLAG']=='Y']
    print( 'There are {} active facilities in {}.'.format( 
        str(state_echo_active[state].shape[0]), state))


  if (await self.run_code(code, result,  async_=asy)):


There are 38125 active facilities in LA.
There are 24427 active facilities in MA.


  if (await self.run_code(code, result,  async_=asy)):


### 7. Number of currently active facilities regulated in CAA, CWA, RCRRA, GHGRP

In [58]:
def program_count( program, flag, state, cd ):
    count = cd_echo_active[cd].loc[cd_echo_active[cd][flag]=='Y'].shape[0]
    print( 'There are {} active facilities in {} CD {} tracked under {}.'.format( 
        str( count ), state, cd, program))
    return count
    
cd_echo_data = {}
cd_echo_active = {}
for state, cd in state_cds:
    rowdata = []    
    cd_echo_data[cd] = state_echo_data[state].loc[state_echo_data[state]['FAC_DERIVED_CD113'] == cd]
    cd_echo_active[cd] = cd_echo_data[cd].loc[cd_echo_data[cd]['FAC_ACTIVE_FLAG']=='Y']
    rowdata.append( ['CAA', program_count( 'CAA', 'AIR_FLAG', state, cd)] )
    rowdata.append( ['CWA', program_count( 'CWA', 'NPDES_FLAG', state, cd)] )
    rowdata.append( ['RCRA', program_count( 'RCRA', 'AIR_FLAG', state, cd)] )
    rowdata.append( ['GHG', program_count( 'CAA', 'GHG_FLAG', state, cd)] )
    filename = make_filename( 'active-facilities_All_pg3', 'Congressional District', 
                             state, cd )
    with open( filename, 'w', newline='' ) as csvfile:
        header = ['Program', 'Count']
        writer = csv.writer( csvfile )
        writer.writerow( header )
        writer.writerows( rowdata ) 
        print( "Wrote {}".format( filename ))
        

There are 464 active facilities in MA CD 4 tracked under CAA.
There are 447 active facilities in MA CD 4 tracked under CWA.
There are 464 active facilities in MA CD 4 tracked under RCRA.
There are 14 active facilities in MA CD 4 tracked under CAA.
Wrote active-facilities_All_pg3_MA-4-080920.csv
There are 496 active facilities in LA CD 2 tracked under CAA.
There are 1202 active facilities in LA CD 2 tracked under CWA.
There are 496 active facilities in LA CD 2 tracked under RCRA.
There are 64 active facilities in LA CD 2 tracked under CAA.
Wrote active-facilities_All_pg3_LA-2-080920.csv
There are 389 active facilities in MA CD 7 tracked under CAA.
There are 311 active facilities in MA CD 7 tracked under CWA.
There are 389 active facilities in MA CD 7 tracked under RCRA.
There are 14 active facilities in MA CD 7 tracked under CAA.
Wrote active-facilities_All_pg3_MA-7-080920.csv


### 8. Map all currently active facilities in the state

In [None]:
for state, cd in state_cds:
    print( 'Map for {} CD {}'.format( state, cd ))
    m = mapper(cd_echo_active[cd])
    display( m )

### 9. Number of recurring violations - total facilities with 3+ quarters out of the last 12 in non-compliance, by each program

In [59]:
states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique

def get_rowdata( df, field, flag ):
    count_viol = df.loc[((df[field].str.count("S") + 
                df[field].str.count("V")) >= 3)].shape[0]
    fraction_viol = count_viol/df.loc[df[flag]=='Y'].shape[0]
    print( "    {} facility quarterly violations in the past 3 years".format( count_viol ))
    print( "    {:.2%} of active facilities with violations in the past 3 years".format( 
           fraction_viol ))
    return (count_viol, fraction_viol * 100.)

rowdata_state = {}
for state in states:
    print( "State: {}".format( state ))
    print( "  CAA")
    rowdata_state[state] = []
    rd = get_rowdata( state_echo_data[state], 'CAA_3YR_COMPL_QTRS_HISTORY', 'AIR_FLAG')
    rowdata_state[state].append([ 'CAA', state, '', rd[0], rd[1]])
    print( "  CWA")
    rd = get_rowdata( state_echo_data[state], 'CWA_13QTRS_COMPL_HISTORY', 'NPDES_FLAG')
    rowdata_state[state].append([ 'CWA', state, '', rd[0], rd[1]])
    print( "  RCRA")
    rd = get_rowdata( state_echo_data[state], 'RCRA_3YR_COMPL_QTRS_HISTORY', 'RCRA_FLAG')
    rowdata_state[state].append([ 'RCRA', state, '', rd[0], rd[1]])

for state, cd in state_cds:
    rowdata_cd = []
    print( "{} - CD {}".format( state, cd ))
    print( "  CAA")
    rd = get_rowdata( cd_echo_data[cd], 'CAA_3YR_COMPL_QTRS_HISTORY', 'AIR_FLAG')
    rowdata_cd.append([ 'CAA', state, cd, rd[0], rd[1]])
    print( "  CWA")
    rd = get_rowdata( cd_echo_data[cd], 'CWA_13QTRS_COMPL_HISTORY', 'NPDES_FLAG')
    rowdata_cd.append([ 'CWA', state, cd, rd[0], rd[1]])
    print( "  RCRA")
    rd = get_rowdata( cd_echo_data[cd], 'RCRA_3YR_COMPL_QTRS_HISTORY', 'RCRA_FLAG')
    rowdata_cd.append([ 'RCRA', state, cd, rd[0], rd[1]])
    filename = make_filename( 'recurring-violations_All_pg3', 'Congressional District', 
                             state, cd )
    with open( filename, 'w', newline='' ) as csvfile:
        header = ['Program', 'State', 'CD', 'Facilities', 'Percent']
        writer = csv.writer( csvfile )
        writer.writerow( header )
        writer.writerows( rowdata_state[state] ) 
        writer.writerows( rowdata_cd )
        print( "Wrote {}".format( filename ))


State: LA
  CAA
    46 facility quarterly violations in the past 3 years
    0.36% of active facilities with violations in the past 3 years
  CWA
    4049 facility quarterly violations in the past 3 years
    13.54% of active facilities with violations in the past 3 years
  RCRA
    260 facility quarterly violations in the past 3 years
    1.19% of active facilities with violations in the past 3 years
State: MA
  CAA
    78 facility quarterly violations in the past 3 years
    1.68% of active facilities with violations in the past 3 years
  CWA
    494 facility quarterly violations in the past 3 years
    8.10% of active facilities with violations in the past 3 years
  RCRA
    372 facility quarterly violations in the past 3 years
    1.37% of active facilities with violations in the past 3 years
MA - CD 4
  CAA
    3 facility quarterly violations in the past 3 years
    0.58% of active facilities with violations in the past 3 years
  CWA
    40 facility quarterly violations in the pas

### 10. % change in effluent violations (CWA)

In [60]:
for state, cd in state_cds:
    print( "CWA Violations - District: {}".format( cd ))
    df = data_sets["CWA Violations"].results[('Congressional District', cd, state)].dataframe.copy()

    year = df["YEARQTR"].astype("str").str[0:4:1]
    df["YEARQTR"] = year
    df.rename( columns={'YEARQTR':'YEAR'}, inplace=True )
    # Remove fields not relevant to this graph.
    df = df.drop(columns=['FAC_LAT', 'FAC_LONG', 'FAC_ZIP', 
        'FAC_EPA_REGION', 'FAC_DERIVED_WBD', 'FAC_DERIVED_CD113',
        'FAC_PERCENT_MINORITY', 'FAC_POP_DEN'])
    d = df.groupby(pd.to_datetime(df['YEAR'], format="%Y").dt.to_period("Y")).sum()
    d.index = d.index.strftime('%Y')
    d = d[ d.index > '2000' ]
    print( d )
    filename = make_filename( 'effluent-violations_CWA_pg3', 'Congressional District', 
                             state, cd )
    with open( filename, 'w', newline='' ) as csvfile:
        header = ['Year', 'Violations']
        writer = csv.writer( csvfile )
        writer.writerow( header )
        for row in d.itertuples():
            writer.writerow( [ row[0], row[1]] )
        print( "Wrote {}".format( filename ))


CWA Violations - District: 4
      NUME90Q  NUMCVDT  NUMSVCD  NUMPSCH
YEAR                                    
2001      206        0        0       72
2002      250        0        0      147
2003      410        0        0      122
2004      315        2        0       47
2005      271        2        1       94
2006      362        0        0      107
2007      348        4        0      136
2008      289        0        0      100
2009      300        8        0       53
2010      356        8        0       77
2011      245        0        0      100
2012      231       28        3      139
2013      334       44        5       97
2014      206        8        0       16
2015      201        4        0       13
2016      160        4        3       20
2017      178        4        1        4
2018      132        4        0        4
2019      219       14        0        8
2020       87        4        0        2
Wrote effluent-violations_CWA_pg3_MA-4-080920.csv
CWA Violations - Di

### 11. % change in inspections

In [81]:
def get_inspections( ds ):
    df_result = None
    df_pgm = ds.results[('Congressional District', cd, state)].dataframe.copy()
    if ( len( df_pgm ) > 0 ):
        df_pgm.rename( columns={ ds.date_field: 'Inspection_Date',
                            ds.agg_col: 'Count'}, inplace=True )
        df_pgm = df_pgm.groupby(pd.to_datetime(df_pgm['Inspection_Date'], 
                                        format=ds.date_format))[['Count']].agg('count')
        df_pgm = df_pgm.resample('Y').count()
        df_pgm.index = df_pgm.index.strftime('%Y')
        df_pgm = df_pgm[ df_pgm.index > '2000' ]
        print( df_pgm )
    else:
        print( "No records")
    return df_pgm
    
for state, cd in state_cds:
    print( "CAA Inspections - {} District: {}".format( state, cd ))
    df_caa = get_inspections( data_sets["CAA Inspections"] )
    print( "CAA Inspections - {} District: {}".format( state, cd ))
    df_cwa = get_inspections( data_sets["CWA Inspections"] )
    print( "CAA Inspections - {} District: {}".format( state, cd ))
    df_rcra = get_inspections( data_sets["RCRA Inspections"] )
    df_totals = pd.concat( [df_caa, df_cwa, df_rcra] )
    df_totals = df_totals.groupby( df_totals.index ).agg('sum')
    print( "Total inspections for {} district {}".format( state,cd ))
    print( df_totals )
    filename = make_filename( 'inspections_All_pg3', 'Congressional District', 
                             state, cd )
    df_totals.to_csv( filename )
    print( "Wrote {}".format( filename ))


CAA Inspections - MA District: 4
                 Count
Inspection_Date       
2002                 6
2003                10
2004                11
2005                10
2006                13
2007                21
2008                16
2009                17
2010                17
2011                12
2012                15
2013                19
2014                 8
2015                 5
2016                 4
2017                11
2018                 5
2019                 5
CAA Inspections - MA District: 4
                 Count
Inspection_Date       
2001                18
2002                16
2003                 8
2004                16
2005                21
2006                19
2007                16
2008                15
2009                15
2010                13
2011                11
2012                11
2013                 9
2014                10
2015                 6
2016                10
2017                10
2018                 8
2019          

### 12. % change in enforcement

In [117]:
def get_enforcements( ds ):
    df_pgm = ds.results[('Congressional District', cd, state)].dataframe.copy()
    if ( len( df_pgm ) > 0 ):
        df_pgm.rename( columns={ ds.date_field: 'Enforcement_Date',
                            ds.agg_col: 'Sum'}, inplace=True )
        df_pgm = df_pgm.groupby(pd.to_datetime(df_pgm['Enforcement_Date'], 
                                        format=ds.date_format))[['Sum']].agg('sum')
        df_pgm_count = df_pgm.copy()
        df_pgm_amount = df_pgm.resample('Y').sum()
        df_pgm_amount.index = df_pgm_amount.index.strftime('%Y')
        df_pgm_amount = df_pgm_amount[ df_pgm_amount.index > '2000' ]
        df_pgm_count = df_pgm_count.groupby(pd.to_datetime(df_pgm_count.index, 
                                        format=ds.date_format))[['Sum']].agg('count')
        df_pgm_count = df_pgm_count.resample('Y').count()
        df_pgm_count.index = df_pgm_count.index.strftime('%Y')
        df_pgm_count = df_pgm_count[ df_pgm_count.index > '2000' ]
        df_pgm = df_pgm_count.merge( df_pgm_amount, how='left', left_index=True, 
                                    right_index=True )
        df_pgm.rename( columns={ 'Sum_x': 'Count',
                            'Sum_y': 'Amount'}, inplace=True )
        print( df_pgm )
    else:
        print( "No records")
    return df_pgm
    
for state, cd in state_cds:
    print( "CAA Penalties - {} District: {}".format( state, cd ))
    df_caa = get_enforcements( data_sets["CAA Penalties"] )
    print( "CAA Penalties - {} District: {}".format( state, cd ))
    df_cwa = get_enforcements( data_sets["CWA Penalties"] )
    print( "CAA Penalties - {} District: {}".format( state, cd ))
    df_rcra = get_enforcements( data_sets["RCRA Penalties"] )
    df_totals = pd.concat( [df_caa, df_cwa, df_rcra] )
    df_totals = df_totals.groupby( df_totals.index ).agg('sum')
    print( "Total enforcements for {} district {}".format( state,cd ))
    print( df_totals )
    filename = make_filename( 'enforcements_All_pg3', 'Congressional District', 
                             state, cd )
    df_totals.to_csv( filename )
    print( "Wrote {}".format( filename ))


CAA Penalties - MA District: 4
                  Count      Amount
Enforcement_Date                   
2001                  6   761000.00
2002                  4   128080.00
2003                  8  1064400.00
2004                  2    13620.00
2005                  4    64110.00
2006                  8   563316.00
2007                  5   119664.00
2008                  7  2025810.00
2009                  4    38056.00
2010                  7   461132.67
2011                  6   261056.00
2012                  1        0.00
2013                  6  2931923.00
2014                  7   376190.00
2015                  3    13552.50
2016                  3    19825.00
2017                  3     3150.00
2018                  1     1150.00
2019                  0        0.00
2020                  1    90300.00
CAA Penalties - MA District: 4
                  Count    Amount
Enforcement_Date                 
2001                  0       0.0
2002                  1       0.0
2003      

### 13. % change in enforcement - enforcement actions