| ![EEW logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/eew.jpg?raw=true) | ![EDGI logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/edgi.png?raw=true) |
|---|---|

#### This notebook is licensed under GPL 3.0. Please visit our GitHub repo for more information: https://github.com/edgi-govdata-archiving/ECHO-COVID19
#### The notebook was collaboratively authored by EDGI following our authorship protocol: https://docs.google.com/document/d/1CtDN5ZZ4Zv70fHiBTmWkDJ9mswEipX6eCYrwicP66Xw/
#### For more information about this project, visit https://www.environmentalenforcementwatch.org/

# Examining Data from Multiple EPA Programs

This notebook examines data from the EPA's Enforcement and Compliance History Online (ECHO) database (https://echo.epa.gov/). It includes information from EPA programs covering air quality (the Clean Air Act, or CAA), water quality (the Clean Water Act, or CWA), drinking water (the Safe Drinking Water Act, or SDWA) and hazardous and other waste processing (the Resource Conservation and Recovery Act, or RCRA).

ECHO data is available for facility violations as well as inspections and enforcement actions by EPA, state and other agencies. The data made accessible here runs from the present day (the database is refreshed weekly) back to 2001, which is when the EPA believes the data to be most reliable. It is available at the Congressional District level for a selected state, and for counties and zip codes of your choosing. 

## How to Run
* A "cell" in a Jupyter notebook is a block of code that performs a set of actions, such as loading or using specific data. The notebook works by running one cell after another, with the user choosing among the options offered along the way.
* If you click on a gray **code** cell, a little “play button” arrow appears on the left. If you click the play button, it will run the code in that cell (“**running** a cell”). The button will animate. When the animation stops, the cell has finished running.
![Where to click to run the cell](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/pressplay.JPG?raw=true)
* You may get a warning that the notebook was not authored by Google. We know, we authored it! It’s okay. Click “Run Anyway” to continue.
![Error Message](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/warning-message.JPG?raw=true)
* **It is important to run cells in order because they depend on each other.**
* Run all of the cells in a Notebook to make a complete report. Please feel free to look at and **learn about each result as you create it**!

---

# **Let's begin!**

Hover over the "[ ]" on the top left corner of the cell below and you should see a "play" button appear. Click on it to run the cell then move to the next one.

These first two cells give us access to some external Python code we will need.

### 1.  Bring in some code that is stored in a GitHub project.

In [None]:
!git clone https://github.com/edgi-govdata-archiving/ECHO_modules.git
!git clone https://github.com/edgi-govdata-archiving/ECHO-Cross-Program.git -b allprograms-openhour
print("Done!")

### 2.  Run a few Python modules.
These will help us process and visualize the different program data sets later.

In [None]:
%run ECHO_modules/DataSet.py
%run ECHO-Cross-Program/utilities.py
%run ECHO_modules/make_data_sets.py
print("Done!")

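### 3.  Run this next cell to set the type of region (state, county, congressional district or zip code) and the regions to examine.
This version of the notebook sets the region type, the list of congressional districts and the data sets directly in the code rather than through a widget. Edit the values if you want different districts, then proceed to the next cell.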

In [None]:
region_type = 'Congressional District'
state_cds = [ ('MA', 4), ('LA', 2), ('MA', 7)]
data_set_list = ['RCRA Violations', 'RCRA Inspections', 'RCRA Penalties', 'CAA Enforcements',
                 'CAA Violations', 'CAA Inspections', 'Greenhouse Gas Emissions', 
                 'CWA Violations', 'CWA Inspections', 'CWA Penalties', ]

### 4. This cell makes the data sets and stores the results for each of them from the database.
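
The cells in this notebook look up each stored result with a key of the form `(region_type, region_value, state)`, as in `data_set.results[(region_type, cd, state)]`. The sketch below imitates that lookup with a plain dictionary. `ToyDataSet` and its contents are hypothetical stand-ins, not the `DataSet` class from ECHO_modules.

```python
# Hypothetical stand-in for a data set that stores one result per region.
class ToyDataSet:
    def __init__(self, name):
        self.name = name
        self.results = {}   # keyed by (region_type, region_value, state)

    def store_results(self, region_type, region_value, state, rows):
        # The real notebook queries a database here; this sketch just stores rows.
        self.results[(region_type, region_value, state)] = rows

toy = ToyDataSet("CWA Violations")
for state, cd in [("MA", 4), ("LA", 2)]:
    toy.store_results("Congressional District", cd, state, rows=[f"{state}-{cd}"])

key = ("Congressional District", 4, "MA")
print(toy.results[key])
```

Because the state is part of the key, district 4 in one state cannot be confused with district 4 in another.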

In [None]:
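# Note: data_sets must already exist when this cell runs; the next cell creates it
# with make_data_sets().
for state, cd in state_cds:
    data_sets[ 'CWA Violations' ].store_results( region_type=region_type, region_value=cd, state=state )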


In [None]:
data_sets=make_data_sets( data_set_list )
for state, cd in state_cds:
    for ds_key, data_set in data_sets.items():
        print( ds_key )
        %time data_set.store_results( region_type=region_type, region_value=cd, state=state )

In [None]:
# Development - save the data so we can read it again locally (quickly) without 
# going to the database

for state, cd in state_cds:
    for ds_key, data_set in data_sets.items():
        write_file( data_set.results[(region_type, cd, state)].dataframe, 
                   ds_key, region_type, state, cd )

### 5. This cell will show a chart for each data set

In [None]:
for ds_key, data_set in data_sets.items():
    print( ds_key )
    if ( ds_key != 'RCRA Penalties' ):
        data_set.show_charts()

### 6. Get the State data for comparisons
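
The next cell first tries to read a locally saved copy of each state's data and only queries the database when no copy exists. A minimal sketch of that read-or-fetch pattern follows. `load_or_fetch` and `fake_query` are hypothetical names used for illustration; the notebook's own `read_file`, `get_data` and `write_file` helpers may behave differently.

```python
import csv
import os
import tempfile

def load_or_fetch(path, fetch):
    """Return rows from a cached CSV at `path`, fetching and caching them if absent."""
    if os.path.exists(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))
    rows = fetch()
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows

calls = []
def fake_query():
    # Stand-in for a slow database query; records that it was called.
    calls.append(1)
    return [{"REGISTRY_ID": "1", "FAC_ACTIVE_FLAG": "Y"},
            {"REGISTRY_ID": "2", "FAC_ACTIVE_FLAG": "N"}]

cache_path = os.path.join(tempfile.mkdtemp(), "toy_state.csv")
first = load_or_fetch(cache_path, fake_query)    # runs the query and writes the file
second = load_or_fetch(cache_path, fake_query)   # reads the file; no second query
print(len(calls), first == second)
```

The second call returns the same rows without running the query again, which is why repeat runs of the notebook are faster once the files exist.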

In [None]:
states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique
state_echo_data = {}
state_echo_active = {}
for state in states:
    state_echo_data[state] = read_file( 'ECHO_EXPORTER', 'State', state, None )
    if ( state_echo_data[state] is None ):
        sql = 'select * from "ECHO_EXPORTER" where "FAC_STATE" = \'{}\''.format( state )
        state_echo_data[state] = get_data( sql, 'REGISTRY_ID' )
        write_file( state_echo_data[state], 'ECHO_EXPORTER', 'State', state, None )
    state_echo_active[state] = state_echo_data[state].loc[state_echo_data[state]['FAC_ACTIVE_FLAG']=='Y']
    print( 'There are {} active facilities in {}.'.format( 
        str(state_echo_active[state].shape[0]), state))


### 7. Number of currently active facilities regulated in CAA, CWA, RCRA, GHGRP
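
The cell below counts facilities per program by comparing flag columns such as `AIR_FLAG` and `NPDES_FLAG` to `'Y'`. The sketch shows the same idea on a made-up table; the rows are invented for illustration only.

```python
import pandas as pd

# Made-up facilities; column names follow those used in this notebook.
toy = pd.DataFrame({
    "FAC_ACTIVE_FLAG": ["Y", "Y", "N", "Y"],
    "AIR_FLAG":        ["Y", "N", "Y", "Y"],
    "NPDES_FLAG":      ["N", "Y", "Y", "N"],
})

active = toy.loc[toy["FAC_ACTIVE_FLAG"] == "Y"]
flags = {"CAA": "AIR_FLAG", "CWA": "NPDES_FLAG"}
# Comparing a column to "Y" gives True/False values; summing counts the True ones.
counts = {program: int((active[flag] == "Y").sum()) for program, flag in flags.items()}
print(counts)
```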

In [None]:
import csv  # used below to write the CSV files

def program_count( program, flag, state, cd ):
    count = cd_echo_active[cd].loc[cd_echo_active[cd][flag]=='Y'].shape[0]
    print( 'There are {} active facilities in {} CD {} tracked under {}.'.format( 
        str( count ), state, cd, program))
    return count
    
cd_echo_data = {}
cd_echo_active = {}
for state, cd in state_cds:
    rowdata = []    
    cd_echo_data[cd] = state_echo_data[state].loc[state_echo_data[state]['FAC_DERIVED_CD113'] == cd]
    cd_echo_active[cd] = cd_echo_data[cd].loc[cd_echo_data[cd]['FAC_ACTIVE_FLAG']=='Y']
    rowdata.append( ['CAA', program_count( 'CAA', 'AIR_FLAG', state, cd)] )
    rowdata.append( ['CWA', program_count( 'CWA', 'NPDES_FLAG', state, cd)] )
    rowdata.append( ['RCRA', program_count( 'RCRA', 'RCRA_FLAG', state, cd)] )
    rowdata.append( ['GHG', program_count( 'GHG', 'GHG_FLAG', state, cd)] )
    filename = make_filename( 'active-facilities_All_pg3', 'Congressional District', 
                             state, cd )
    with open( filename, 'w', newline='' ) as csvfile:
        header = ['Program', 'Count']
        writer = csv.writer( csvfile )
        writer.writerow( header )
        writer.writerows( rowdata ) 
        print( "Wrote {}".format( filename ))
        

### 8. Map all currently active facilities in each congressional district

In [None]:
for state, cd in state_cds:
    print( 'Map for {} CD {}'.format( state, cd ))
    m = mapper(cd_echo_active[cd])
    display( m )

### 9. Number of recurring violations - total facilities with 3+ quarters out of the last 12 in non-compliance, by each program
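
The cell below treats each compliance-history field as a string with one character per quarter and counts the characters `S` and `V` as quarters in violation. That reading comes from how the code uses the field, not from ECHO documentation. A small sketch on invented history strings:

```python
import pandas as pd

# Invented history strings, one character per quarter.
toy = pd.DataFrame({
    "REGISTRY_ID": ["a", "b", "c", "d"],
    "HISTORY": ["____________", "VV_S________", "S___________", None],
})

# Count quarters marked "S" or "V"; a missing history gives a missing count,
# which does not pass the >= 3 test.
quarters = toy["HISTORY"].str.count("S") + toy["HISTORY"].str.count("V")
recurring = toy.loc[quarters >= 3]
print(recurring["REGISTRY_ID"].tolist())
```

Only facility `b`, with three flagged quarters, is selected.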

In [None]:
states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique

def get_rowdata( df, field, flag ):
    # Facilities with 3 or more quarters marked "S" or "V" in the history field.
    count_viol = df.loc[((df[field].str.count("S") + 
                df[field].str.count("V")) >= 3)].shape[0]
    # Share of the facilities flagged for this program (flag == 'Y').
    fraction_viol = count_viol/df.loc[df[flag]=='Y'].shape[0]
    print( "    {} facilities with 3 or more quarters in violation in the past 3 years".format( count_viol ))
    print( "    {:.2%} of facilities in this program with 3 or more quarters in violation".format( 
           fraction_viol ))
    return (count_viol, fraction_viol * 100.)

rowdata_state = {}
for state in states:
    print( "State: {}".format( state ))
    print( "  CAA")
    rowdata_state[state] = []
    rd = get_rowdata( state_echo_data[state], 'CAA_3YR_COMPL_QTRS_HISTORY', 'AIR_FLAG')
    rowdata_state[state].append([ 'CAA', state, '', rd[0], rd[1]])
    print( "  CWA")
    rd = get_rowdata( state_echo_data[state], 'CWA_13QTRS_COMPL_HISTORY', 'NPDES_FLAG')
    rowdata_state[state].append([ 'CWA', state, '', rd[0], rd[1]])
    print( "  RCRA")
    rd = get_rowdata( state_echo_data[state], 'RCRA_3YR_COMPL_QTRS_HISTORY', 'RCRA_FLAG')
    rowdata_state[state].append([ 'RCRA', state, '', rd[0], rd[1]])

for state, cd in state_cds:
    rowdata_cd = []
    print( "{} - CD {}".format( state, cd ))
    print( "  CAA")
    rd = get_rowdata( cd_echo_data[cd], 'CAA_3YR_COMPL_QTRS_HISTORY', 'AIR_FLAG')
    rowdata_cd.append([ 'CAA', state, cd, rd[0], rd[1]])
    print( "  CWA")
    rd = get_rowdata( cd_echo_data[cd], 'CWA_13QTRS_COMPL_HISTORY', 'NPDES_FLAG')
    rowdata_cd.append([ 'CWA', state, cd, rd[0], rd[1]])
    print( "  RCRA")
    rd = get_rowdata( cd_echo_data[cd], 'RCRA_3YR_COMPL_QTRS_HISTORY', 'RCRA_FLAG')
    rowdata_cd.append([ 'RCRA', state, cd, rd[0], rd[1]])
    filename = make_filename( 'recurring-violations_All_pg3', 'Congressional District', 
                             state, cd )
    with open( filename, 'w', newline='' ) as csvfile:
        header = ['Program', 'State', 'CD', 'Facilities', 'Percent']
        writer = csv.writer( csvfile )
        writer.writerow( header )
        writer.writerows( rowdata_state[state] ) 
        writer.writerows( rowdata_cd )
        print( "Wrote {}".format( filename ))


### 10. % change in effluent violations (CWA)
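
The cell below prints yearly totals; it does not itself compute a percent change. One way to get a year-over-year percent change from such totals is pandas' `pct_change`, sketched here on invented numbers.

```python
import pandas as pd

# Invented yearly violation totals, indexed by year.
yearly = pd.Series([40, 50, 45], index=["2018", "2019", "2020"], name="violations")

# pct_change gives the fractional change from the previous row;
# the first row has no predecessor, so its value is missing.
change = (yearly.pct_change() * 100).round(1)
print(change)
```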

In [None]:
for state, cd in state_cds:
    print( "CWA Violations - District: {}".format( cd ))
    df = data_sets["CWA Violations"].results[('Congressional District', cd, state)].dataframe.copy()

    year = df["YEARQTR"].astype("str").str[0:4:1]
    df["YEARQTR"] = year
    df.rename( columns={'YEARQTR':'YEAR'}, inplace=True )
    # Remove fields not relevant to this graph.
    df = df.drop(columns=['FAC_LAT', 'FAC_LONG', 'FAC_ZIP', 
        'FAC_EPA_REGION', 'FAC_DERIVED_WBD', 'FAC_DERIVED_CD113',
        'FAC_PERCENT_MINORITY', 'FAC_POP_DEN'])
    d = df.groupby(pd.to_datetime(df['YEAR'], format="%Y").dt.to_period("Y")).sum()
    d.index = d.index.strftime('%Y')
    d = d[ d.index > '2000' ]
    print( d )

### 11. % change in inspections
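
To compare inspections across years, the individual inspection dates need to be grouped by calendar year. A minimal sketch follows, using invented records and the `Inspection_Date` and `Count` column names that the cell below assigns. The `%m/%d/%Y` format is an assumption for the toy data, not a statement about ECHO's date fields.

```python
import pandas as pd

# Invented inspection records, one row per inspection.
toy = pd.DataFrame({
    "Inspection_Date": ["03/15/2019", "07/01/2019", "01/20/2020"],
    "Count": ["id1", "id2", "id3"],
})

# Parse the dates, then count rows per calendar year.
dates = pd.to_datetime(toy["Inspection_Date"], format="%m/%d/%Y")
per_year = toy.groupby(dates.dt.year)["Count"].count()
print(per_year.to_dict())
```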

In [None]:
df_caa = {}
df_cwa = {}
df_rcra = {}
df_totals = {}
for state, cd in state_cds:
    print( "CAA Inspections - District: {}".format( cd ))
    df_caa[cd] = data_sets["CAA Inspections"].results[('Congressional District', cd, state)].dataframe.copy()
    if ( len( df_caa[cd] ) > 0 ):
        df_caa[cd].rename( columns={ data_sets["CAA Inspections"].date_field: 'Inspection_Date',
                            data_sets['CAA Inspections'].agg_col: 'Count'}, inplace=True )
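        # Count inspections per date (the date format is left for pandas to infer),
        # then add the daily counts up by year.
        df_caa[cd] = df_caa[cd].groupby(pd.to_datetime(df_caa[cd]['Inspection_Date']))[['Count']].agg('count')
        df_caa[cd] = df_caa[cd].resample('Y').sum()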
        df_caa[cd].index = df_caa[cd].index.strftime('%Y')
        df_caa[cd] = df_caa[cd][ df_caa[cd].index > '2000' ]
        print( df_caa[cd] )
    else:
        print( "No records")
    print( "CWA Inspections - District: {}".format( cd ))
    df_cwa[cd] = data_sets["CWA Inspections"].results[('Congressional District', cd, state)].dataframe.copy()
    if ( len( df_cwa[cd] ) > 0 ):
        df_cwa[cd].rename( columns={ data_sets["CWA Inspections"].date_field: 'Inspection_Date',
                            data_sets['CWA Inspections'].agg_col: 'Count'}, inplace=True )
        df_cwa[cd] = df_cwa[cd].groupby('Inspection_Date')[['Count']].agg('count')
        print( df_cwa[cd] )
    else:
        print( "No records")
    print( "RCRA Inspections - District: {}".format( cd ))
    df_rcra[cd] = data_sets["RCRA Inspections"].results[('Congressional District', cd, state)].dataframe.copy()
    if ( len( df_rcra[cd] ) > 0 ):
        df_rcra[cd].rename( columns={ data_sets["RCRA Inspections"].date_field: 'Inspection_Date',
                            data_sets['RCRA Inspections'].agg_col: 'Count'}, inplace=True )
        df_rcra[cd] = df_rcra[cd].groupby('Inspection_Date')[['Count']].agg('count')
        print( df_rcra[cd] )
    else:
        print( "No records")
    df_totals[cd] = pd.concat( [df_caa[cd], df_cwa[cd], df_rcra[cd]] )
    print( "Total inspections for district {}".format(cd))
    print( df_totals[cd] )

### 12. % change in enforcement - penalties
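
`pd.concat` stacks the per-program tables on top of one another, so the same date can appear more than once in the combined table. If a single total per period is wanted, one option is to group the stacked rows again, as sketched here on invented amounts indexed by year.

```python
import pandas as pd

# Invented penalty totals per program, indexed by year.
caa = pd.DataFrame({"Sum": [1000.0, 250.0]}, index=[2019, 2020])
cwa = pd.DataFrame({"Sum": [500.0]}, index=[2019])
rcra = pd.DataFrame({"Sum": [75.0]}, index=[2020])

stacked = pd.concat([caa, cwa, rcra])           # 4 rows; 2019 and 2020 each appear twice
totals = stacked.groupby(level=0)["Sum"].sum()  # one row per year
print(totals.to_dict())
```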

In [None]:
df_caa = {}
df_cwa = {}
df_rcra = {}
df_totals = {}
for state, cd in state_cds:
    print( "CAA Penalties - District: {}".format( cd ))
    df_caa[cd] = data_sets["CAA Penalties"].results[('Congressional District', cd, state)].dataframe.copy()
    if ( len( df_caa[cd] ) > 0 ):
        df_caa[cd].rename( columns={ data_sets["CAA Penalties"].date_field: 'Penalty_Date',
                            data_sets['CAA Penalties'].agg_col: 'Sum'}, inplace=True )
        df_caa[cd] = df_caa[cd].groupby('Penalty_Date')[['Sum']].agg('sum')
        print( df_caa[cd] )
    else:
        print( "No records")
    print( "CWA Penalties - District: {}".format( cd ))
    df_cwa[cd] = data_sets["CWA Penalties"].results[('Congressional District', cd, state)].dataframe.copy()
    if ( len( df_cwa[cd] ) > 0 ):
        df_cwa[cd].rename( columns={ data_sets["CWA Penalties"].date_field: 'Penalty_Date',
                            data_sets['CWA Penalties'].agg_col: 'Sum'}, inplace=True )
        df_cwa[cd] = df_cwa[cd].groupby('Penalty_Date')[['Sum']].agg('sum')
        print( df_cwa[cd] )
    else:
        print( "No records")
    print( "RCRA Penalties - District: {}".format( cd ))
    df_rcra[cd] = data_sets["RCRA Penalties"].results[('Congressional District', cd, state)].dataframe.copy()
    if ( len( df_rcra[cd] ) > 0 ):
        df_rcra[cd].rename( columns={ data_sets["RCRA Penalties"].date_field: 'Penalty_Date',
                            data_sets['RCRA Penalties'].agg_col: 'Sum'}, inplace=True )
        df_rcra[cd] = df_rcra[cd].groupby('Penalty_Date')[['Sum']].agg('sum')
        print( df_rcra[cd] )
    else:
        print( "No records")
    df_totals[cd] = pd.concat( [df_caa[cd], df_cwa[cd], df_rcra[cd]] )
    print( "Total penalties for district {}".format(cd))
    print( df_totals[cd] )

### 13. % change in enforcement - enforcement actions