| ![EEW logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/eew.jpg?raw=true) | ![EDGI logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/edgi.png?raw=true) |
|---|---|

#### This notebook is licensed under GPL 3.0. Please visit our Github repo for more information: https://github.com/edgi-govdata-archiving/ECHO-COVID19
#### The notebook was collaboratively authored by EDGI following our authorship protocol: https://docs.google.com/document/d/1CtDN5ZZ4Zv70fHiBTmWkDJ9mswEipX6eCYrwicP66Xw/
#### For more information about this project, visit https://www.environmentalenforcementwatch.org/

# Examining Data from Multiple EPA Programs

This notebook examines data from the EPA's Enforcement and Compliance History Online (ECHO) database (https://echo.epa.gov/). It includes information from EPA's programs covering air quality (the Clean Air Act, or CAA), water quality (the Clean Water Act, or CWA), and hazardous and other waste processing (the Resource Conservation and Recovery Act, or RCRA).

ECHO data is available for facility violations as well as inspections and enforcement actions by EPA, state and other agencies. The data made accessible here runs from the present day (the database is refreshed weekly) back to 2001, which is when the EPA believes the data to be most reliable. The notebook can be run to produce data for multiple Congressional Districts and states of your choosing. 

## How to Run
* A "cell" in a Jupyter notebook is a block of code that performs a set of actions, such as loading data or using data loaded earlier.  The notebook works by running one cell after another, with the user choosing among the options offered along the way.
* If you click on a gray **code** cell, a little “play button” arrow appears on the left. If you click the play button, it will run the code in that cell (“**running** a cell”). The button will animate. When the animation stops, the cell has finished running.
![Where to click to run the cell](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/pressplay.JPG?raw=true)
* You may get a warning that the notebook was not authored by Google. We know, we authored them! It’s okay. Click “Run Anyway” to continue. 
![Error Message](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/warning-message.JPG?raw=true)
* **It is important to run cells in order because they depend on each other.**
* Run all of the cells in a Notebook to make a complete report. Please feel free to look at and **learn about each result as you create it**!

---

# **Let's begin!**

Hover over the "[ ]" on the top left corner of the cell below and you should see a "play" button appear. Click on it to run the cell then move to the next one.

These first two cells give us access to some external Python code we will need.

### 1.  Bring in some code that is stored in a Github project.
These two GitHub repositories hold Python code that the notebook uses.
* ECHO_modules holds code that can be used in this and other notebooks--the DataSet class, the make_data_sets() function, etc.
* The ECHO-Cross-Program repository is the one this notebook is contained in.  We clone it to be able to use the utilities.py file contained in it.

In [None]:
!git clone https://github.com/edgi-govdata-archiving/ECHO_modules.git
!git clone https://github.com/edgi-govdata-archiving/ECHO-Cross-Program.git -b enhancements-09-11
!pip install geopandas
print("Done!")

### 2.  Run a few Python modules.
These will help us process and visualize the different program data sets later.
* The DataSet class knows how to read the database for an ECHO data set--e.g. CWA Violations.
* The utilities.py has Python code that helps with showing charts and maps, making filenames, etc.
* The make_data_sets.py has code that creates a DataSet object for each of the ECHO data sets, using the appropriate database tables.

In [None]:
%run ECHO_modules/DataSet.py
%run ECHO-Cross-Program/utilities.py
%run ECHO_modules/make_data_sets.py
print("Done!")

### 3.  Set the parameters of the notebook run.
You can change the (state, CD) pairs to run the notebook for multiple congressional districts in multiple states.  Use None as the CD to run for a whole state, as in ('DE', None).  After setting the (state, CD) pairs you want, you can instruct the notebook to Run All and it will step through all of the remaining cells.  You can then come back and examine the results.

In [None]:
region_type = 'Congressional District'
should_make_charts = True
state_cds = [('WA',2)]
# state_cds = [('AZ',1), ('CA',18), ('CA',29), ('CA',36), ('CA',44), ('CA',52),
#     ('CA',6), ('CA',9), ('CO',1), ('FL',12), ('FL',14),
#     ('FL',9), ('GA',1), ('IA',2), ('IL',1), ('IL',15), ('IL',16),
#     ('IL',2), ('IL',9), ('IN',5), ('IN',8), ('KY',2), ('LA',1),
#     ('MA',4), ('MD',3), ('MI',12), ('MI',6), ('MI',7), ('MO',7),
#     ('NC',1), ('NC',8), ('NH',2), ('NJ',6), ('NM',3), ('NY',16),
#     ('NY',20), ('NY',9), ('OH',5), ('OH',6), ('OK',2), ('OR',2),
#     ('OR',5), ('PA',18), ('PA',1), ('SC',3), ('TX',17), ('TX',22),
#     ('TX',26), ('TX',33), ('VA',4), ('VA',9), ('WA',5), ('WV',1),
#     ('AK',None), ('AL',None), ('AR',None), ('DE',None), ('IA',None),
#     ('IL',None), ('IN',None), ('MA',None), ('MD',None), ('MD',None),
#     ('MS',None), ('ND',None), ('NJ',None), ('NY',None), ('OK',None),
#     ('OR',None), ('RI',None), ('SD',None), ('VT',None), ('WV',None),
#     ('WY',None)]

# state_cds = [ ('WY', 1), ('DE', None), ('NJ', 6), ('OR', 2), ('NY', 9)]
# Change this^! For example, instead of running New Jersey's 6th Congressional ('NJ', 6) you could do Wisconsin's 2nd ('WI', 2)
# See here: https://www.govtrack.us/congress/members/map
# data_set_list = ['RCRA Violations', 'RCRA Penalties',
#                  'CAA Violations', 'CAA Penalties',
#                  'CWA Violations', 'CWA Penalties', ] 
                 #CAA Enforcements, CWA Enforcements, RCRA Enforcements
data_set_list = ['RCRA Violations', 'RCRA Inspections', 'RCRA Penalties',
                 'CAA Violations', 'CAA Inspections', 'CAA Penalties', 'Greenhouse Gas Emissions', 
                 'CWA Violations', 'CWA Inspections', 'CWA Penalties', ] 


### 4. This cell makes the data sets and stores the results for each of them from the database.  
This may take some time to run if you are looking at multiple congressional districts.
* The data_set_list from cell #3 is given to the make_data_sets() function which creates a DataSet object for each item in the list.
* Go through each of the (state, cd) pairs in the state_cds list specified in cell #3 and have each DataSet object store the results returned by the database for that specific state and CD.
* Also go through each unique state in the list and store data for the entire state.
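
The two passes described above can be sketched in plain Python. This is only an illustration: `plan_runs` is a hypothetical helper that is not part of ECHO_modules, and it builds the list of region keys instead of querying the database. The key layout (region type, region value, state) follows the way results are looked up later in the notebook.

```python
def plan_runs(state_cds):
    """Return (district_keys, state_keys) for a list of (state, cd) pairs.

    Hypothetical helper for illustration only; the notebook itself calls
    DataSet.store_results() inside its loops instead of building lists.
    """
    # Pairs with cd=None mean "whole state", so the district pass skips them.
    district_keys = [('Congressional District', cd, state)
                     for state, cd in state_cds if cd is not None]
    # A set removes duplicate states; sorting makes the order predictable.
    states = sorted({state for state, _ in state_cds})
    state_keys = [('State', None, state) for state in states]
    return district_keys, state_keys


district_keys, state_keys = plan_runs(
    [('WA', 2), ('NJ', 6), ('WA', 5), ('DE', None)])
print(district_keys)
print(state_keys)
```

Note that the notebook uses `list(set(...))` without sorting, so the order of states there can vary from run to run; the sort here is an addition for readability.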

In [None]:
data_sets=make_data_sets( data_set_list )
print( "Congressional District data sets:")
for state, cd in state_cds:
    if ( cd is None ):
        continue
    for ds_key, data_set in data_sets.items():
        print( state + '-' + str(cd) + ' - ' + ds_key )
        data_set.store_results( region_type=region_type, region_value=cd, state=state )

print( "State data sets:")
states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique
for state in states:
    for ds_key, data_set in data_sets.items():
        print( state + ' - ' + ds_key )
        data_set.store_results( region_type='State', region_value=None, state=state )

### 5. This cell will generate a chart for each data set and each (state, Congressional District) pair.
Call each of the DataSet objects' show_charts() methods to render a chart of the data.

In [None]:
if ( should_make_charts ):
    for ds_key, data_set in data_sets.items():
        print( ds_key )
        if ( ds_key != 'RCRA Penalties' ):
            data_set.show_charts()

### 6. Get the State data for comparisons
Ask the database for ECHO_EXPORTER records for facilities in the state.
* state_echo_data is a dictionary with the state name as key and the data as value, for all records.
* state_echo_active is a dictionary for all records in state_echo_data identified as active.
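
The cell below first looks for a saved copy of the state data and only queries the database when none is found. A minimal sketch of that pattern follows. Everything in it is a stand-in: `load_state_records`, the in-memory `cache`, and `fake_fetch` are hypothetical names, and the rows are made up. The notebook itself uses read_file(), get_data(), and write_dataset() and works with pandas DataFrames, not lists of dictionaries.

```python
def load_state_records(state, cache, fetch):
    """Return records for a state, querying only when nothing is cached.

    `cache` (a dict) and `fetch` (a callable) are hypothetical stand-ins
    for the notebook's saved files and its database query.
    """
    records = cache.get(state)
    if records is None:
        records = fetch(state)
        cache[state] = records  # save so the next run skips the query
    return records


def active_only(records):
    """Keep only the facilities whose FAC_ACTIVE_FLAG is 'Y'."""
    return [r for r in records if r.get('FAC_ACTIVE_FLAG') == 'Y']


calls = []  # records each time the fake "database" is actually queried


def fake_fetch(state):
    # Made-up rows for illustration, not real ECHO data.
    calls.append(state)
    return [{'REGISTRY_ID': 1, 'FAC_ACTIVE_FLAG': 'Y'},
            {'REGISTRY_ID': 2, 'FAC_ACTIVE_FLAG': 'N'}]


cache = {}
first = load_state_records('WA', cache, fake_fetch)
second = load_state_records('WA', cache, fake_fetch)  # served from the cache
print(len(active_only(first)), calls)
```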

In [None]:
states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique
state_echo_data = {}
state_echo_active = {}
for state in states:
    state_echo_data[state] = read_file( 'ECHO_EXPORTER', 'State', state, None )
    if ( state_echo_data[state] is None ):
        sql = 'select * from "ECHO_EXPORTER" where "FAC_STATE" = \'{}\''.format( state )
        state_echo_data[state] = get_data( sql, 'REGISTRY_ID' )
        write_dataset( state_echo_data[state], 'ECHO_EXPORTER', 'State', state, None )
    state_echo_active[state] = state_echo_data[state].loc[state_echo_data[state]['FAC_ACTIVE_FLAG']=='Y']
    print( 'There are {} active facilities in {}.'.format( 
        str(state_echo_active[state].shape[0]), state))

### 7. Number of currently active facilities regulated in CAA, CWA, RCRA, GHGRP
* The program_count() function looks at the ECHO_EXPORTER data that is passed in and counts the number of facilities that have the 'flag' parameter set to 'Y' (AIR_FLAG, NPDES_FLAG, RCRA_FLAG, GHG_FLAG).
* cd_echo_data is a dictionary with key (state, cd), where the state_echo_data is filtered for records of the current CD.
* cd_echo_active is a dictionary for active facilities in the CD.
* The number of records from these dictionaries is written into a file named like 'active-facilities_All_pg3', in a directory identified by the state and CD, e.g. "LA2".
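
The counting step can be illustrated without pandas. In this sketch `count_by_program` is a hypothetical function and the sample rows are made up; only the four flag names come from the notebook.

```python
# Program name -> ECHO_EXPORTER flag column, as listed above.
PROGRAM_FLAGS = {'CAA': 'AIR_FLAG', 'CWA': 'NPDES_FLAG',
                 'RCRA': 'RCRA_FLAG', 'GHG': 'GHG_FLAG'}


def count_by_program(facilities):
    """Count facilities per program; rows are assumed to be active already."""
    return {program: sum(1 for f in facilities if f.get(flag) == 'Y')
            for program, flag in PROGRAM_FLAGS.items()}


# Made-up rows for illustration, not real ECHO data.
sample = [
    {'AIR_FLAG': 'Y', 'NPDES_FLAG': 'N', 'RCRA_FLAG': 'Y', 'GHG_FLAG': 'N'},
    {'AIR_FLAG': 'N', 'NPDES_FLAG': 'Y', 'RCRA_FLAG': 'Y', 'GHG_FLAG': 'N'},
]
print(count_by_program(sample))
```

A facility can be regulated under more than one program, so the per-program counts do not have to add up to the number of facilities.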

In [None]:
def program_count( echo_data, program, flag, state, cd ):
    count = echo_data.loc[echo_data[flag]=='Y'].shape[0]
    print( 'There are {} active facilities in {} CD {} tracked under {}.'.format( 
        str( count ), state, cd, program))
    return count
    
cd_echo_data = {}
cd_echo_active = {}
for state, cd in state_cds:
    rowdata = []    
    if ( cd is None ):
        this_echo_data = state_echo_data[state]
        filename = make_filename( 'active-facilities_All_pg3', 'State', 
                             None, state )
    else:
        this_echo_data = state_echo_data[state].loc[state_echo_data[state]['FAC_DERIVED_CD113'] == cd]
        cd_echo_data[(state,cd)] = this_echo_data
        filename = make_filename( 'active-facilities_All_pg3', 'Congressional District', 
                             state, cd )
    this_echo_active = this_echo_data.loc[this_echo_data['FAC_ACTIVE_FLAG']=='Y']
    if ( cd is not None ):
        cd_echo_active[(state,cd)] = this_echo_active
    rowdata.append( ['CAA', program_count( this_echo_active, 'CAA', 'AIR_FLAG', state, cd)] )
    rowdata.append( ['CWA', program_count( this_echo_active, 'CWA', 'NPDES_FLAG', state, cd)] )
    rowdata.append( ['RCRA', program_count( this_echo_active, 'RCRA', 'RCRA_FLAG', state, cd)] )
    rowdata.append( ['GHG', program_count( this_echo_active, 'GHG', 'GHG_FLAG', state, cd)] )
    with open( filename, 'w', newline='' ) as csvfile:
        header = ['Program', 'Count']
        writer = csv.writer( csvfile )
        writer.writerow( header )
        writer.writerows( rowdata ) 
        print( "Wrote {}".format( filename ))     

### 8. Map all currently active facilities in each district

In [None]:
if ( should_make_charts ):
    import geopandas

    for state, cd in state_cds:
        print( 'Map for {} CD {}'.format( state, cd ))
        if ( cd is None ):
            this_data = state_echo_active[state]
        else:
            this_data = cd_echo_active[(state, cd)]
        # Only map CAA, CWA, RCRA, or GHG facilities active in this district:
        map_data = this_data.loc[(this_data['AIR_FLAG']=="Y") | (this_data['NPDES_FLAG']=="Y") |
                (this_data['RCRA_FLAG']=="Y")| (this_data['GHG_FLAG']=="Y")]
        m = mapper(map_data)
        if ( cd is not None ):
            url = "https://raw.githubusercontent.com/unitedstates/districts/gh-pages/cds/2016/{}-{}/shape.geojson".format( state, str(cd))
            map_data = geopandas.read_file(url)
            w = folium.GeoJson(
                map_data,
                name = "EPA Regions",
            ).add_to(m) #m is the map object created to hold the facility points. we want to add this shape object to that map object
            folium.GeoJsonTooltip(fields=["District"]).add_to(w)

        display( m )

### 9. Number of recurring violations - facilities with 3+ quarters out of the last 12 in non-compliance, by each program
For each unique state and then each CD, we look at active records and count facilities that have 'S' or 'V' violations in 3 or more quarters.  The fields looked at are:
* CAA - CAA_3YR_COMPL_QTRS_HISTORY
* CWA - CWA_13QTRS_COMPL_HISTORY (Actually 13 quarters instead of 3 years.)
* RCRA - RCRA_3YR_COMPL_QTRS_HISTORY

* The get_rowdata() function takes the dataframe passed to it and looks for records with 'S' or 'V' violations in 3 or more quarters. It divides the number of those facilities by the number of facilities that have the program's flag set to 'Y', and returns the raw count of facilities with 3 or more quarters in non-compliance along with the percentage of facilities.
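
The test for a recurring violator can be sketched in plain Python. This is an illustration under stated assumptions: `recurring_violations` is a hypothetical function, the history strings are made up (characters other than 'S' and 'V' are placeholders), and the list is assumed to already contain only the facilities in the program being examined.

```python
def recurring_violations(histories, min_quarters=3):
    """Return (count, percent) of facilities in violation often enough.

    `histories` holds one compliance-history string per facility, with one
    character per quarter. A missing history (None) is treated as having
    no violations, which is an assumption of this sketch.
    """
    if not histories:
        return 0, 0.0  # avoid dividing by zero when there are no facilities
    count = sum(1 for h in histories
                if h is not None
                and h.count('S') + h.count('V') >= min_quarters)
    return count, 100.0 * count / len(histories)


# Made-up histories: only the first has S or V in 3 or more quarters.
sample = ['____VVS_____', 'VV__________', '____________', None]
print(recurring_violations(sample))
```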

In [None]:
states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique

def get_rowdata( df, field, flag ):
    count_viol = df.loc[((df[field].str.count("S") + 
                df[field].str.count("V")) >= 3)].shape[0]
    fraction_viol = count_viol/df.loc[df[flag]=='Y'].shape[0]
    print( "    {} facilities with at least 3 quarters in non-compliance over the past 3 years".format( count_viol ))
    print( "    {:.2%} of active facilities with at least 3 quarters in non-compliance over the past 3 years".format( 
           fraction_viol ))
    return (count_viol, fraction_viol * 100.)

rowdata_state = {}
for state in states:
    print( "State: {}".format( state ))
    print( "  CAA")
    rowdata_state[state] = []
    rd = get_rowdata( state_echo_active[state], 'CAA_3YR_COMPL_QTRS_HISTORY', 'AIR_FLAG')
    rowdata_state[state].append([ 'CAA', state, '', rd[0], rd[1]])
    print( "  CWA")
    rd = get_rowdata( state_echo_active[state], 'CWA_13QTRS_COMPL_HISTORY', 'NPDES_FLAG')
    rowdata_state[state].append([ 'CWA', state, '', rd[0], rd[1]])
    print( "  RCRA")
    rd = get_rowdata( state_echo_active[state], 'RCRA_3YR_COMPL_QTRS_HISTORY', 'RCRA_FLAG')
    rowdata_state[state].append([ 'RCRA', state, '', rd[0], rd[1]])

for state, cd in state_cds:
    if ( cd is None ):
        filename = make_filename( 'recurring-violations_All_pg3', 'State',
                             None, state )
        with open( filename, 'w', newline='' ) as csvfile:
            header = ['Program', 'State', '', 'Facilities', 'Percent']
            writer = csv.writer( csvfile )
            writer.writerow( header )
            writer.writerows( rowdata_state[state] ) 
            print( "Wrote {}".format( filename ))
    else:
        filename = make_filename( 'recurring-violations_All_pg3', 'Congressional District',
                             state, cd )
        rowdata_cd = []
        print( "{} - CD {}".format( state, cd ))
        print( "  CAA")
        rd = get_rowdata( cd_echo_active[(state,cd)], 'CAA_3YR_COMPL_QTRS_HISTORY', 'AIR_FLAG')
        rowdata_cd.append([ 'CAA', state, cd, rd[0], rd[1]])
        print( "  CWA")
        rd = get_rowdata( cd_echo_active[(state,cd)], 'CWA_13QTRS_COMPL_HISTORY', 'NPDES_FLAG')
        rowdata_cd.append([ 'CWA', state, cd, rd[0], rd[1]])
        print( "  RCRA")
        rd = get_rowdata( cd_echo_active[(state,cd)], 'RCRA_3YR_COMPL_QTRS_HISTORY', 'RCRA_FLAG')
        rowdata_cd.append([ 'RCRA', state, cd, rd[0], rd[1]])
        with open( filename, 'w', newline='' ) as csvfile:
            header = ['Program', 'State', 'CD', 'Facilities', 'Percent']
            writer = csv.writer( csvfile )
            writer.writerow( header )
            writer.writerows( rowdata_state[state] ) 
            writer.writerows( rowdata_cd )
            print( "Wrote {}".format( filename ))

### 10. Percent change in violations (CWA)
For each CD and then each unique state:
* the quarter is identified by 5 digits: the first 4 are the year and the last is the quarter, as in 20013 for the 3rd quarter of 2001
* the quarter digit is stripped off, so that a facility's 4 quarterly records for 2001 all carry the year 2001
* the values for the 4 types of violations--NUME90Q, NUMCVDT, NUMSVCD, NUMPSCH--are added together, over all facilities, to get a single value for the year

The results for all years are stored in the dictionary effluent_violations_all, and the value for 2019 in its own effluent_violations_2019 dictionary.  The key for the dictionaries is (state,cd), with cd set to None for a whole state.  These will be used in a later cell.
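
The year-by-year roll-up can be sketched without pandas. `yearly_totals` is a hypothetical function and the rows are made up; the field names and the cut-off after 2000 follow the notebook.

```python
from collections import defaultdict

VIOLATION_FIELDS = ('NUME90Q', 'NUMCVDT', 'NUMSVCD', 'NUMPSCH')


def yearly_totals(rows):
    """Sum the four violation fields per year across all rows."""
    totals = defaultdict(int)
    for row in rows:
        year = str(row['YEARQTR'])[:4]  # 20013 -> '2001'
        if year > '2000':               # keep 2001 and later
            totals[year] += sum(row[f] for f in VIOLATION_FIELDS)
    return dict(totals)


# Made-up rows for illustration, not real ECHO data.
rows = [
    {'YEARQTR': 20013, 'NUME90Q': 1, 'NUMCVDT': 0, 'NUMSVCD': 2, 'NUMPSCH': 0},
    {'YEARQTR': 20014, 'NUME90Q': 0, 'NUMCVDT': 1, 'NUMSVCD': 0, 'NUMPSCH': 0},
    {'YEARQTR': 20191, 'NUME90Q': 2, 'NUMCVDT': 0, 'NUMSVCD': 0, 'NUMPSCH': 1},
    {'YEARQTR': 20004, 'NUME90Q': 5, 'NUMCVDT': 0, 'NUMSVCD': 0, 'NUMPSCH': 0},
]
print(yearly_totals(rows))
```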

In [None]:
effluent_violations_2019 = {}  #For use later
effluent_violations_all = {}  #For use later
states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique

def get_cwa_df( df ):
    year = df["YEARQTR"].astype("str").str[0:4:1]
    df["YEARQTR"] = year
    df.rename( columns={'YEARQTR':'YEAR'}, inplace=True )
    # Remove fields not relevant to this graph.
    df = df.drop(columns=['FAC_LAT', 'FAC_LONG', 'FAC_ZIP', 
        'FAC_EPA_REGION', 'FAC_DERIVED_WBD', 'FAC_DERIVED_CD113',
        'FAC_PERCENT_MINORITY', 'FAC_POP_DEN'])
    d = df.groupby(pd.to_datetime(df['YEAR'], format="%Y").dt.to_period("Y")).sum()
    d.index = d.index.strftime('%Y')
    d = d[ d.index > '2000' ]
    d['Total'] = d.sum(axis=1)
    return( d )

for state, cd in state_cds:
    if ( cd is None ):
        continue
    print( "CWA Violations - {} District: {}".format( state, cd ))
    df = data_sets["CWA Violations"].results[('Congressional District', cd, state)].dataframe.copy()
    effluent_violations_all[ (state, cd) ] = get_cwa_df( df )
    display(effluent_violations_all[ (state, cd) ])
    filename = make_filename( 'violations_CWA_pg3', 'Congressional District', 
                             state, cd )
    with open( filename, 'w', newline='' ) as csvfile:
        header = ['Year', 'Violations']
        writer = csv.writer( csvfile )
        writer.writerow( header )
        for row in effluent_violations_all[ (state, cd) ].itertuples():
            if ( row[0] == '2019' ):
                effluent_violations_2019[(state,cd)] = row[5]
            writer.writerow( [ row[0], row[5]] )
        print( "Wrote {}".format( filename ))

for state in states:
    filename = make_filename( 'violations_CWA_pg3', 'State', None, state )
    df = data_sets["CWA Violations"].results[('State', None, state)].dataframe.copy()
    cwa_all_df = get_cwa_df( df )
    effluent_violations_all[ (state, None) ] = cwa_all_df
    effluent_violations_2019[ (state, None) ] = cwa_all_df[cwa_all_df.index == '2019']['NUME90Q'][0]
    with open( filename, 'w', newline='' ) as csvfile:
        header = ['Year', 'Violations']
        writer = csv.writer( csvfile )
        writer.writerow( header )
        for row in cwa_all_df.itertuples():
            if ( row[0] == '2019' ):
                effluent_violations_2019[(state,None)] = row[5]
            writer.writerow( [ row[0], row[5]] )
        print( "Wrote {}".format( filename ))


### 11. Percent change in inspections
For each CD the date field for that program type is used to count up all inspections for the year.  (The date field for each data set is identified in make_data_sets() when the DataSet object is created.  It shows up here as ds.date_field.)
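
Counting events per year from a date field can be sketched with the standard library. `inspections_per_year` is a hypothetical function, the dates are made up, and the default date format is an assumption; in the notebook each DataSet supplies its own date_format.

```python
from collections import Counter
from datetime import datetime


def inspections_per_year(dates, date_format='%m/%d/%Y', first_year=2001):
    """Count parseable dates per year, keeping first_year and later."""
    counts = Counter()
    for text in dates:
        try:
            year = datetime.strptime(text, date_format).year
        except (TypeError, ValueError):
            continue  # skip missing or malformed dates
        if year >= first_year:
            counts[year] += 1
    return dict(sorted(counts.items()))


# Made-up dates for illustration.
sample = ['01/15/2019', '07/04/2019', 'not a date', None,
          '12/31/1999', '03/02/2020']
print(inspections_per_year(sample))
```

One difference from the notebook: its resample step yields a row with a count of 0 for any year that has no inspections between the first and last years, while this sketch simply omits such years.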

In [None]:
def get_inspections( ds, ds_type ):
    df_pgm = ds.results[ ds_type ].dataframe.copy()
    if ( len( df_pgm ) > 0 ):
        df_pgm.rename( columns={ ds.date_field: 'Date',
                            ds.agg_col: 'Count'}, inplace=True )
        df_pgm = df_pgm.groupby(pd.to_datetime(df_pgm['Date'], 
                            format=ds.date_format, errors='coerce'))[['Count']].agg('count')
        df_pgm = df_pgm.resample('Y').sum()
        df_pgm.index = df_pgm.index.strftime('%Y')
        df_pgm = df_pgm[ df_pgm.index > '2000' ]
        display( df_pgm )
    else:
        print( "No records")
    return df_pgm
    
for state, cd in state_cds:
    if ( cd is None ):
        ds_type = ('State', None, state)
    else:
        ds_type = ('Congressional District', cd, state)
    print( "CAA Inspections - {} District: {}".format( state, cd ))
    df_caa = get_inspections( data_sets["CAA Inspections"], ds_type )
    
    print( "CWA Inspections - {} District: {}".format( state, cd ))
    df_cwa = get_inspections( data_sets["CWA Inspections"], ds_type )
   
    print( "RCRA Inspections - {} District: {}".format( state, cd ))
    df_rcra = get_inspections( data_sets["RCRA Inspections"], ds_type )
    
    df_totals = pd.concat( [df_caa, df_cwa, df_rcra] )
    df_totals = df_totals.groupby( df_totals.index ).agg('sum')
    print( "Total inspections for {} district {}".format( state,cd ))
    display( df_totals )
    
    for file in [{"inspections_All_pg3":df_totals}, {"inspections_CAA_pg3":df_caa}, 
                 {"inspections_CWA_pg3":df_cwa}, {"inspections_RCRA_pg3":df_rcra}]:
        if ( cd is None ):
            file_type = ds_type
        else:
            file_type = ('Congressional District', state, cd)
        filename = make_filename( list(file.keys())[0], *file_type )
        list(file.values())[0].to_csv( filename )
        print( "Wrote {}".format( filename ))

### 12. Percent change in enforcement - penalties and number of enforcements
* For each CD (or state, when the CD is None) the penalty amount is read from the agg_col field (specified in make_data_sets() for each DataSet), and each record counts as one enforcement.  
* The number of enforcements and the penalty amounts are accumulated for each year.
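
The yearly accumulation can be sketched in plain Python. `penalties_per_year` and its `include_state_local` switch are hypothetical, and the rows are made up. The switch mirrors the way the cell below adds STATE_LOCAL_PENALTY_AMT to the amount for the CWA Penalties data set only.

```python
from datetime import datetime


def penalties_per_year(rows, include_state_local=False,
                       date_format='%m/%d/%Y'):
    """Return {year: {'Amount': total, 'Count': records}} for 2001 onward."""
    totals = {}
    for row in rows:
        try:
            year = datetime.strptime(row.get('Date'), date_format).year
        except (TypeError, ValueError):
            continue  # skip missing or malformed dates
        if year < 2001:
            continue
        amount = row.get('Amount') or 0  # a missing amount counts as 0
        if include_state_local:
            amount += row.get('STATE_LOCAL_PENALTY_AMT') or 0
        entry = totals.setdefault(year, {'Amount': 0, 'Count': 0})
        entry['Amount'] += amount
        entry['Count'] += 1
    return totals


# Made-up rows for illustration, not real ECHO data.
sample_rows = [
    {'Date': '03/01/2019', 'Amount': 1000, 'STATE_LOCAL_PENALTY_AMT': 500},
    {'Date': '09/15/2019', 'Amount': None, 'STATE_LOCAL_PENALTY_AMT': 250},
    {'Date': '02/02/2020', 'Amount': 300},
]
print(penalties_per_year(sample_rows, include_state_local=True))
```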

In [None]:
def get_enforcements( ds, ds_type ):
    df_pgm = ds.results[ds_type].dataframe.copy()
    if ( len( df_pgm ) > 0 ):
        df_pgm.rename( columns={ ds.date_field: 'Date',
                            ds.agg_col: 'Amount'}, inplace=True )
        if ds.name == "CWA Penalties":
            df_pgm['Amount'] = df_pgm['Amount'].fillna(0) + \
                    df_pgm['STATE_LOCAL_PENALTY_AMT'].fillna(0)                            
        df_pgm["Count"] = 1
        df_pgm = df_pgm.groupby(pd.to_datetime(df_pgm['Date'], 
                format="%m/%d/%Y", errors='coerce')).agg({'Amount':'sum','Count':'count'})

        df_pgm = df_pgm.resample('Y').sum()
        df_pgm.index = df_pgm.index.strftime('%Y')
        df_pgm = df_pgm[ df_pgm.index >= "2001" ]
        display(df_pgm )
    else:
        print( "No records")
    return df_pgm
    
for state, cd in state_cds:
    if ( cd is None ):
        ds_type = ('State', None, state)
        file_type = ds_type
    else:
        ds_type = ('Congressional District', cd, state)
        file_type = ('Congressional District', state, cd)
    print( "CAA Penalties - {} District: {}".format( state, cd ))
    df_caa = get_enforcements( data_sets["CAA Penalties"], ds_type )
    filename = make_filename( 'enforcements_CAA_pg5', *file_type )
    df_caa.to_csv( filename )
    print( "Wrote {}".format( filename ))
    
    print( "CWA Penalties - {} District: {}".format( state, cd ))
    df_cwa = get_enforcements( data_sets["CWA Penalties"], ds_type )
    filename = make_filename( 'enforcements_CWA_pg6', *file_type )
    df_cwa.to_csv( filename )
    print( "Wrote {}".format( filename ))
    
    print( "RCRA Penalties - {} District: {}".format( state, cd ))
    df_rcra = get_enforcements( data_sets["RCRA Penalties"], ds_type )
    filename = make_filename( 'enforcements_RCRA_pg7', *file_type )
    df_rcra.to_csv( filename )
    print( "Wrote {}".format( filename ))
    
    df_totals = pd.concat( [df_caa, df_cwa, df_rcra] )
    df_totals = df_totals.groupby( df_totals.index ).agg('sum')
    print( "Total enforcements for {} district {}".format( state,cd ))
    display( df_totals )
    filename = make_filename( 'enforcements_All_pg3', *file_type )
    df_totals.to_csv( filename )
    print( "Wrote {}".format( filename ))

### 13.a. 2019 - inspections per 1000 regulated facilities - by district
* For each CD the inspections data is again grouped into years.
* The get_num_events() function counts all events it gets from get_events() for the year that is requested, which is 2019.
* This number is divided by the number of facilities in the district, from the program_count() function of cell #7.
* The result is multiplied by 1000, equivalent to dividing the denominator (number of facilities) by 1000.
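
The rate calculation itself is short. In this sketch `per_1000` is a hypothetical function and the numbers are made up. The guard against a zero facility count is an addition here; the cell below does not include one.

```python
def per_1000(num_events, num_facilities):
    """Return events per 1000 facilities, or None when there are no facilities."""
    if num_facilities == 0:
        return None  # a rate is undefined without any regulated facilities
    return 1000.0 * num_events / num_facilities


# 30 inspections across 1200 regulated facilities
print(per_1000(30, 1200))
```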

In [None]:
def get_events( ds, ds_type ):
    df_pgm = ds.results[ ds_type ].dataframe.copy()
    df_pgm.rename( columns={ ds.date_field: 'Date',
                        ds.agg_col: 'Count'}, inplace=True )
    
    try:
        df_pgm = df_pgm.groupby(pd.to_datetime(df_pgm['Date'], 
                        format=ds.date_format, errors='coerce'))[['Count']].agg('count')
    except ValueError:
        print( "Error with date {}".format(df_pgm['Date']))
    df_pgm = df_pgm.resample('Y').sum()
    df_pgm.index = df_pgm.index.strftime('%Y')
    df_pgm = df_pgm[ df_pgm.index >= '2001']
    return( df_pgm )

def get_num_events( ds, ds_type, state, cd, year='2019' ):
    df_pgm = get_events( ds, ds_type )
    if ( len( df_pgm ) > 0 ):
        num_events = df_pgm[ df_pgm.index == year ]
        if ( not num_events.empty ):
            return num_events['Count'].iloc[0]
    # No records at all, or none for the requested year
    return 0
    
for state, cd in state_cds:
    if ( cd is None ):
        continue
    ds_type = ('Congressional District', cd, state)
    cd_echo_data[(state,cd)] = state_echo_data[state].loc[state_echo_data[state]['FAC_DERIVED_CD113'] == cd]
    cd_echo_active[(state,cd)] = cd_echo_data[(state,cd)].loc[cd_echo_data[(state,cd)]['FAC_ACTIVE_FLAG']=='Y']
    filename = make_filename( 'inspectionsper1000_All_pg4', 'Congressional District', 
                             state, cd )
    with open( filename, 'w', newline='' ) as csvfile:
        header = ['Program', 'Num / 1000']
        writer = csv.writer( csvfile )
        writer.writerow( header )
        try:
            num = 1000. * get_num_events( data_sets["CAA Inspections"], ds_type, state, cd ) / \
                program_count( cd_echo_active[(state,cd)], 'CAA', 'AIR_FLAG', state, cd)
            print("CAA inspections per 1000 regulated facilities: ", num)
            writer.writerow( ['CAA', num] )
        except pd.errors.OutOfBoundsDatetime:
            print( "Bad date in state CAA data")
        try:
            num = 1000. * get_num_events( data_sets["CWA Inspections"], ds_type, state, cd ) / \
                program_count( cd_echo_active[(state,cd)], 'CWA', 'NPDES_FLAG', state, cd)
            print("CWA inspections per 1000 regulated facilities: ", num)
            writer.writerow( ['CWA', num] )
        except pd.errors.OutOfBoundsDatetime:
            print( "Bad date in state CWA data")
        try:
            num = 1000. * get_num_events( data_sets["RCRA Inspections"], ds_type, state, cd ) / \
                program_count( cd_echo_active[(state,cd)], 'RCRA', 'RCRA_FLAG', state, cd)
            print("RCRA inspections per 1000 regulated facilities: ", num)
            writer.writerow( ['RCRA', num] )
        except pd.errors.OutOfBoundsDatetime:
            print( "Bad date in state RCRA data")
        print( "Wrote {}".format( filename ))

### 13.b. inspections since 2001
This cell will report no results, but will just save data to some CSVs.
For each CD, then for each unique state, the get_events() function of cell #13a will return a count of all inspections per year.

In [None]:
states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique

for state, cd in state_cds:
    if ( cd is None ):
        continue
    ds_type = ('Congressional District', cd, state)
    filename = make_filename( 'inspections_CAA_pg5', 'Congressional District',
                             state, cd )
    ds = data_sets["CAA Inspections"]
    df_pgm = get_events( ds, ds_type )
    df_pgm.to_csv( filename )
    print( "Wrote {}".format( filename ))
    filename = make_filename( 'inspections_CWA_pg6', 'Congressional District',
                             state, cd )
    ds = data_sets["CWA Inspections"]
    df_pgm = get_events( ds, ds_type )
    df_pgm.to_csv( filename )
    print( "Wrote {}".format( filename ))
    filename = make_filename( 'inspections_RCRA_pg7', 'Congressional District',
                             state, cd )
    ds = data_sets["RCRA Inspections"]
    df_pgm = get_events( ds, ds_type )
    df_pgm.to_csv( filename )
    print( "Wrote {}".format( filename ))

for state in states:
    ds_type = ('State', None, state)
    filename = make_filename( 'inspections_CAA_pg5', *ds_type )
    ds = data_sets["CAA Inspections"]
    df_pgm = get_events( ds, ds_type )
    df_pgm.to_csv( filename )
    print( "Wrote {}".format( filename ))
    filename = make_filename( 'inspections_CWA_pg6', *ds_type )
    ds = data_sets["CWA Inspections"]
    df_pgm = get_events( ds, ds_type )
    df_pgm.to_csv( filename )
    print( "Wrote {}".format( filename ))
    filename = make_filename( 'inspections_RCRA_pg7', *ds_type )
    ds = data_sets["RCRA Inspections"]
    df_pgm = get_events( ds, ds_type )
    df_pgm.to_csv( filename )
    print( "Wrote {}".format( filename ))


### 14. 2019 - inspections per 1000 regulated facilities - by state
This cell repeats the computation done in cell #13a for the full state.  The functions of cell #13a are re-used.

In [None]:
states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique

for state in states:
    ds_type = ('State', None, state)
    filename = make_filename( 'inspectionsper1000_All_pg4', *ds_type )
    with open( filename, 'w', newline='' ) as csvfile:
        header = ['Program', 'Num / 1000']
        writer = csv.writer( csvfile )
        writer.writerow( header )
        try:
            num = 1000. * get_num_events( data_sets["CAA Inspections"], ds_type, state, None ) / \
                program_count( state_echo_active[state], 'CAA', 'AIR_FLAG', state, None)
            print("CAA inspections per 1000 regulated facilities: ", num)
            writer.writerow( ['CAA', num] )
        except pd.errors.OutOfBoundsDatetime:
            print( "Bad date in state CAA data")
        try:
            num = 1000. * get_num_events( data_sets["CWA Inspections"], ds_type, state, None ) / \
                program_count( state_echo_active[state], 'CWA', 'NPDES_FLAG', state, None)
            print("CWA inspections per 1000 regulated facilities: ", num)
            writer.writerow( ['CWA', num] )
        except pd.errors.OutOfBoundsDatetime:
            print( "Bad date in state CWA data")
        try:
            num = 1000. * get_num_events( data_sets["RCRA Inspections"], ds_type, state, None ) / \
                program_count( state_echo_active[state], 'RCRA', 'RCRA_FLAG', state, None)
            print("RCRA inspections per 1000 regulated facilities: ", num)
            writer.writerow( ['RCRA', num] )
        except pd.errors.OutOfBoundsDatetime:
            print( "Bad date in state RCRA data")
        print( "Wrote {}".format( filename ))

### 15.a. 2019 - violations per 1000 regulated facilities - by district
For each CD the get_num_events() function from cell #13a and the program_count() function from cell #7 are re-used with violations data sets this time.  The calculation is the same as in cell #13a.

In [None]:
for state, cd in state_cds:
    if ( cd is None ):
        continue
    ds_type = ('Congressional District', cd, state)
    filename = make_filename( 'violationsper1000_All_pg4', 'Congressional District', 
                         state, cd )
    with open( filename, 'w', newline='' ) as csvfile:
        header = ['Program', 'Num / 1000']
        writer = csv.writer( csvfile )
        writer.writerow( header )
        try:
            num = 1000. * get_num_events( data_sets["CAA Violations"], ds_type, state, cd ) / \
                program_count( cd_echo_active[(state,cd)], 'CAA', 'AIR_FLAG', state, cd)
            print("CAA violations per 1000 regulated facilities: ", num)
            writer.writerow( ['CAA', num] )
        except pd.errors.OutOfBoundsDatetime:
            print( "Bad date in state CAA data")
        try:
            # Have to handle CWA Violations differently - use saved dictionary from cell 10
            num = effluent_violations_2019[(state,cd)]
            num = 1000. * num / \
                program_count( cd_echo_active[(state,cd)], 'CWA', 'NPDES_FLAG', state, cd)
            print("CWA violations per 1000 regulated facilities: ", num)
            writer.writerow( ['CWA', num] )
        except pd.errors.OutOfBoundsDatetime:
            print( "Bad date in state CWA data")
        try:
            num = 1000. * get_num_events( data_sets["RCRA Violations"], ds_type, state, cd ) / \
                program_count( cd_echo_active[(state,cd)], 'RCRA', 'RCRA_FLAG', state, cd)
            print("RCRA violations per 1000 regulated facilities: ", num)
            writer.writerow( ['RCRA', num] )
        except pd.errors.OutOfBoundsDatetime:
            print( "Bad date in state RCRA data")
        print( "Wrote {}".format( filename ))

### 15.b. violations since 2001
The calculation done in cell #13b is repeated for violations.

In [None]:
for state, cd in state_cds:
    if ( cd is None ):
        ds_type = ('State', None, state)
        file_type = ds_type
    else:
        ds_type = ('Congressional District', cd, state)
        file_type = ('Congressional District', state, cd)
    cd_echo_data[cd] = state_echo_data[state].loc[state_echo_data[state]['FAC_DERIVED_CD113'] == cd]
    filename = make_filename( 'violations_CAA_pg5', *file_type )
    ds = data_sets["CAA Violations"]
    df_pgm = get_events( ds, ds_type )
    df_pgm.to_csv( filename )
    print( "Wrote {}".format( filename ))
    filename = make_filename( 'violations_CWA_pg6', *file_type )
    # Have to handle CWA Violations differently - use saved dictionary from cell 10
    effluent_violations_all[ (state, cd) ]['Total'].to_csv( filename )
    print( "Wrote {}".format( filename ))
    filename = make_filename( 'violations_RCRA_pg7', *file_type )
    ds = data_sets["RCRA Violations"]
    df_pgm = get_events( ds, ds_type )
    df_pgm.to_csv( filename )
    print( "Wrote {}".format( filename ))

### 16. 2019 - violations per 1000 regulated facilities - by state
The calculations of cell #14 are repeated for violations.

In [None]:
states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique

for state in states:
    ds_type = ('State', None, state)
    filename = make_filename( 'violationsper1000_All_pg4', 'State', None, state )
    with open( filename, 'w', newline='' ) as csvfile:
        header = ['Program', 'Num / 1000']
        writer = csv.writer( csvfile )
        writer.writerow( header )
        try:
            num = 1000. * get_num_events( data_sets["CAA Violations"], ds_type, state, None ) / \
                program_count( state_echo_active[state], 'CAA', 'AIR_FLAG', state, None)
            print("CAA violations per 1000 regulated facilities: ", num)
            writer.writerow( ['CAA', num] )
        except pd.errors.OutOfBoundsDatetime:
            print( "Bad date in state CAA data")
        try:
            num = 1000. * effluent_violations_2019[ (state, None) ] / \
                program_count( state_echo_active[state], 'CWA', 'NPDES_FLAG', state, None)
            print("CWA violations per 1000 regulated facilities: ", num)
            writer.writerow( ['CWA', num] )
        except pd.errors.OutOfBoundsDatetime:
            print( "Bad date in state CWA data")
        try:
            num = 1000. * get_num_events( data_sets["RCRA Violations"], ds_type, state, None ) / \
                program_count( state_echo_active[state], 'RCRA', 'RCRA_FLAG', state, None)
            print("RCRA violations per 1000 regulated facilities: ", num)
            writer.writerow( ['RCRA', num] )
        except pd.errors.OutOfBoundsDatetime:
            print( "Bad date in state RCRA data")
        print( "Wrote {}".format( filename ))

### 17. 2019 - enforcement counts and amounts per violating facility - by district
* The get_num_facilities() function keeps the violations for the selected year (2019 by default), then counts the unique facilities with violations in that year.
* The get_enf_per_fac() function combines enforcements into years, then counts the 2019 enforcements and sums the 2019 penalty amounts, before dividing each by the result from get_num_facilities().
* These functions are called for each CD, and for CAA, CWA and RCRA. The three program results are then summed into one total per CD.
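
The per-facility calculation can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `enforcement_per_facility`, the (date, amount) input format and the sample records are hypothetical, and the notebook itself does this work with pandas dataframes.

```python
from datetime import datetime

def enforcement_per_facility(enforcements, num_fac, year=2019):
    """Count enforcements and sum penalties for one year, each divided by num_fac.

    enforcements: iterable of (date_string, amount) pairs, dates as MM/DD/YYYY.
    Returns (count_per_facility, amount_per_facility), or (None, None) when
    num_fac is 0 or there were no enforcements in that year.
    """
    count, total = 0, 0.0
    for date_string, amount in enforcements:
        try:
            when = datetime.strptime(date_string, "%m/%d/%Y")
        except ValueError:
            continue  # skip unparseable dates instead of failing
        if when.year == year:
            count += 1
            total += amount
    if num_fac == 0 or count == 0:
        return None, None
    return count / num_fac, total / num_fac

# Hypothetical records: two 2019 actions, one 2018 action, one bad date
sample = [("03/15/2019", 5000.0), ("11/02/2019", 1500.0),
          ("06/30/2018", 900.0), ("bad date", 100.0)]
print(enforcement_per_facility(sample, num_fac=4))  # (0.5, 1625.0)
```
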

In [None]:
def get_num_facilities( program, ds_type, year=2019 ):
    ds = data_sets[program]
    df_pgm = ds.results[ ds_type ].dataframe.copy()
    if ( len( df_pgm ) > 0 ):
        df_pgm.rename( columns={ ds.date_field: 'Date',
                            ds.agg_col: 'Count'}, inplace=True )
        if ( program == 'CWA Violations' ):
            yr = df_pgm['Date'].astype( 'str' ).str[0:4:1]
            df_pgm['Date'] = pd.to_datetime( yr, format="%Y" )
        else:
            df_pgm['Date'] = pd.to_datetime( df_pgm['Date'], format=ds.date_format, errors='coerce' )
        df_pgm_year = df_pgm[ df_pgm['Date'].dt.year == year].copy()
        df_pgm_year['Date'] = pd.DatetimeIndex( df_pgm_year['Date']).year
        num_fac = len(df_pgm_year.index.unique())
        return num_fac
    return 0  # no records for this program and place, so no violating facilities

def get_enf_per_fac( ds_enf, ds_type, num_fac, year='2019' ):
    df_pgm = ds_enf.results[ ds_type ].dataframe.copy()
    if ( len( df_pgm ) > 0 ):
        if ( ds_enf.name == 'CWA Penalties'):
            ## This has been done in Cell 12.
            # df_pgm['Amount'] = df_pgm['FED_PENALTY_ASSESSED_AMT'].fillna(0) + \
            #                        df_pgm['STATE_LOCAL_PENALTY_AMT'].fillna(0)
            df_pgm.rename( columns={ds_enf.date_field: 'Date'}, inplace=True )
        else:
            df_pgm.rename( columns={ ds_enf.date_field: 'Date',
                            ds_enf.agg_col: 'Amount'}, inplace=True )
        df_pgm["Count"] = 1
        df_pgm = df_pgm.groupby(pd.to_datetime(df_pgm['Date'], 
                            format="%m/%d/%Y", errors='coerce')).agg({'Amount':'sum','Count':'count'})

        df_pgm = df_pgm.resample('Y').sum()
        df_pgm.index = df_pgm.index.strftime('%Y')
        df_pgm = df_pgm[ df_pgm.index == str( year ) ]  # use the year argument rather than a hard-coded "2019"
      
        if df_pgm.empty:
          df_pgm['Num_enf_per_fac'] = None
          df_pgm['Amt_enf_per_fac'] = None
          print( "There were no enforcement actions taken in {}".format( year ))
        else:
          df_pgm['Num_enf_per_fac'] = df_pgm.apply( 
              lambda row: None if ( num_fac == 0 ) else row.Count / num_fac, axis=1 )
          df_pgm['Amt_enf_per_fac'] = df_pgm.apply( 
              lambda row: None if ( num_fac == 0 ) else row.Amount / num_fac, axis=1 )
          print(df_pgm)

    else:
        print( "No records")
    return df_pgm
    
for state, cd in state_cds:
    if ( cd is None ):
        continue
    ds_type = ('Congressional District', cd, state)
    
    num_fac = get_num_facilities( "CAA Violations", ds_type )
    print( "CAA Penalties - {} District: {} - {} facilities with violations in 2019".format( state, cd, num_fac ))
    df_caa = get_enf_per_fac( data_sets["CAA Penalties"], ds_type, num_fac )
    
    num_fac = get_num_facilities( "CWA Violations", ds_type )
    print( "CWA Penalties - {} District: {} - {} facilities with violations in 2019".format( state, cd, num_fac ))
    df_cwa = get_enf_per_fac( data_sets["CWA Penalties"], ds_type, num_fac )
    
    num_fac = get_num_facilities( "RCRA Violations", ds_type )
    print( "RCRA Penalties - {} District: {} - {} facilities with violations in 2019".format( state, cd, num_fac ))
    df_rcra = get_enf_per_fac( data_sets["RCRA Penalties"], ds_type, num_fac )
    
    df_totals = pd.concat( [df_caa, df_cwa, df_rcra] )
    df_totals = df_totals.groupby( df_totals.index ).agg('sum')
    print( "Total enforcements for {} district {} in 2019".format( state,cd ))
    print( df_totals )
    
    filename = make_filename( 'enforcementsperviolatingfacility_All_pg4', 'Congressional District', 
                             state, cd )
    df_totals.to_csv( filename )
    print( "Wrote {}".format( filename ))

### 18. 2019 - enforcement counts and amounts per violating facility - by state
This cell repeats the calculations of cell #17 for unique states.

In [None]:
states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique

for state in states:
    ds_type = ('State', None, state)

    num_fac = get_num_facilities( "CAA Violations", ds_type )
    print( "CAA Penalties - {} - {} facilities with violations".format( state,  num_fac ))
    df_caa = get_enf_per_fac( data_sets["CAA Penalties"], ds_type, num_fac )
    
    num_fac = get_num_facilities( "CWA Violations", ds_type )
    print( "CWA Penalties - {} - {} facilities with violations".format( state, num_fac ))
    df_cwa = get_enf_per_fac( data_sets["CWA Penalties"], ds_type, num_fac )
    
    num_fac = get_num_facilities( "RCRA Violations", ds_type )
    print( "RCRA Penalties - {} - {} facilities with violations".format( state, num_fac ))
    df_rcra = get_enf_per_fac( data_sets["RCRA Penalties"], ds_type, num_fac )
    
    df_totals = pd.concat( [df_caa, df_cwa, df_rcra] )
    df_totals = df_totals.groupby( df_totals.index ).agg('sum')
    print( "Total enforcements for {} ".format( state ))
    print( df_totals )
    
    filename = make_filename( 'enforcementsperviolatingfacility_All_pg4', 'State', 
                             None, state )
    df_totals.to_csv( filename )
    print( "Wrote {}".format( filename ))

### 19.  GHG emissions in these districts and states (2010-2018)
For each state and then each CD, the get_ghg_emissions() function is called.  It combines emissions records into years and sums the amounts.
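
The yearly roll-up can be sketched in plain Python. The function name `yearly_totals`, the (year, amount) input format and the sample figures are hypothetical. The notebook does the equivalent with a pandas groupby and resample on the data set's date field.

```python
from collections import defaultdict

def yearly_totals(records):
    """Sum emission amounts by reporting year.

    records: iterable of (year, amount) pairs.
    Returns a dict mapping year string -> total, in ascending year order.
    """
    totals = defaultdict(float)
    for year, amount in records:
        totals[str(year)] += amount
    return dict(sorted(totals.items()))

# Hypothetical emission records for two facilities over two years
sample = [(2017, 120.5), (2018, 80.0), (2017, 30.0), (2018, 20.0)]
print(yearly_totals(sample))  # {'2017': 150.5, '2018': 100.0}
```
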

In [None]:
states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique
state_emissions = {}

def get_ghg_emissions( ds, ds_type ):
    df_result = ds.results[ ds_type ].dataframe
    if ( df_result is None ):
        print( "No records" )
        return None
    else:
        df_pgm = df_result.copy()
    if ( df_pgm is not None and len( df_pgm ) > 0 ):
        df_pgm.rename( columns={ ds.date_field: 'Date',
                            ds.agg_col: 'Sum'}, inplace=True )
        df_pgm = df_pgm.groupby(pd.to_datetime(df_pgm['Date'], 
                                format=ds.date_format, errors='coerce'))[['Sum']].agg('sum')
        df_pgm = df_pgm.resample('Y').sum()
        df_pgm.index = df_pgm.index.strftime('%Y')
        #df_pgm = df_pgm[ df_pgm.index == '2018' ]
    else:
        print( "No records")
    return df_pgm

for state in states:
    ds_type = ('State', None, state)
    print( "Greenhouse Gas Emissions - State: {}".format( state ))
    df_ghg = get_ghg_emissions( data_sets["Greenhouse Gas Emissions"], ds_type )
    state_emissions[state] = df_ghg 
    display(state_emissions[state])
    filename = make_filename( 'emissions2018_GHGRP_pg4', 'State', None, state )
    if ( df_ghg is not None ):  # get_ghg_emissions() returns None when there are no records
        df_ghg.to_csv( filename )
        print( "Wrote GHG emissions data to: ", filename)

for state, cd in state_cds:
    if ( cd is None ):
        continue
    ds_type = ('Congressional District', cd, state)
    print( "Greenhouse Gas Emissions - {} District: {}".format( state, cd ))
    df_ghg = get_ghg_emissions( data_sets["Greenhouse Gas Emissions"], ds_type )
    display(df_ghg)
    
    filename = make_filename( 'emissions2018_GHGRP_pg4', 'Congressional District', state, cd )
    if (df_ghg is not None):
      join = df_ghg.join(state_emissions[state], lsuffix="_District", rsuffix="_State")
    else:
      join = state_emissions[state]
    if ( join is not None ):  # both district and state data can be missing
        join.to_csv(filename)
        print( "Wrote GHG emissions data to: ", filename)

### 20. Top 10 facilities with compliance problems over the past 3 years
* The get_top_violators() function counts non-compliance quarters ('S' and 'V' violations) for facilities and then sorts the facilities.
* The chart_top_violators() function draws the chart.
* The functions are called for each state and CD selected, and for CAA, CWA and RCRA.
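
The counting and sorting step can be sketched in plain Python. This is not the notebook's get_top_violators() function, which is defined elsewhere. The names `count_noncompliance_quarters` and `top_violators` are hypothetical, and the sketch assumes each compliance-history value is a string with one character per quarter.

```python
def count_noncompliance_quarters(history):
    """Count quarters marked 'S' or 'V' in a compliance-history string."""
    return sum(1 for quarter in history if quarter in ("S", "V"))

def top_violators(facilities, how_many=10):
    """Rank facilities by non-compliance quarters, worst first.

    facilities: dict mapping facility name -> compliance-history string.
    Returns a list of (name, quarters_in_noncompliance) pairs.
    """
    ranked = sorted(
        ((name, count_noncompliance_quarters(hist)) for name, hist in facilities.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:how_many]

# Hypothetical facilities with 12-quarter histories
sample = {
    "Plant A": "VVSV________",
    "Plant B": "____________",
    "Plant C": "SSSSSSVVVVVV",
}
print(top_violators(sample, how_many=2))  # [('Plant C', 12), ('Plant A', 4)]
```
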

In [None]:
import seaborn as sns

states = list(set([s_cd[0] for s_cd in state_cds]))  #Use conversion to set to make unique

df_violators = {}


for state, cd in state_cds:
    if ( cd is None ):
        df_active = state_echo_active[state]
        df_type = ('State', None, state)
    else:
        df_active = cd_echo_active[(state,cd)]
        df_type = ('Congressional District', state, cd)
    df = df_active.loc[ df_active['AIR_FLAG'] == 'Y'].copy()
    df_violators[(state,cd,'CAA')] = get_top_violators( df, 'AIR_FLAG', state, cd, 
            'CAA_3YR_COMPL_QTRS_HISTORY', 'CAA_FORMAL_ACTION_COUNT', 20 )
    filename = make_filename( 'noncomp_CAA_pg6', *df_type )
    df_violators[(state,cd,'CAA')].to_csv( filename )
    print( "Wrote {}".format( filename ))
    if ( should_make_charts ):
        display( chart_top_violators( df_violators[(state,cd,'CAA')], state, cd, 'CAA' ))
    
    df = df_active.loc[ df_active['NPDES_FLAG'] == 'Y'].copy()
    df_violators[(state,cd,'CWA')] = get_top_violators( df, 'NPDES_FLAG', state, cd, 
            'CWA_13QTRS_COMPL_HISTORY', 'CWA_FORMAL_ACTION_COUNT', 20 )
    filename = make_filename( 'noncomp_CWA_pg6', *df_type )
    df_violators[(state,cd,'CWA')].to_csv( filename )
    print( "Wrote {}".format( filename ))
    if ( should_make_charts ):
        display( chart_top_violators( df_violators[(state,cd,'CWA')], state, cd, 'CWA' ))
    
    df = df_active.loc[ df_active['RCRA_FLAG'] == 'Y'].copy()
    df_violators[(state,cd,'RCRA')] = get_top_violators( df, 'RCRA_FLAG', state, cd, 
            'RCRA_3YR_COMPL_QTRS_HISTORY', 'RCRA_FORMAL_ACTION_COUNT', 20 )
    filename = make_filename( 'noncomp_RCRA_pg7', *df_type )
    df_violators[(state,cd,'RCRA')].to_csv( filename )
    print( "Wrote {}".format( filename ))
    if ( should_make_charts ):
        display( chart_top_violators( df_violators[(state,cd,'RCRA')], state, cd, 'RCRA' ))
 

### 21. Map the top 10 facilities with compliance problems over the past 3 years
A map is drawn showing the facilities identified in cell #20.

In [None]:
should_make_charts = True

In [None]:
def violators_map( viol_dict, key ):
    state, cd, program = key  # keys are (state, cd, program), as built in cell #20
    print( 'Map for {} - {} CD {}'.format( program, state, str(cd) ))
    map_data = viol_dict[key]
    m = mapper(map_data)
    if ( cd is not None ):
        url = "https://raw.githubusercontent.com/unitedstates/districts/gh-pages/cds/2016/{}-{}/shape.geojson".format( state, str(cd))
        district_shape = geopandas.read_file(url)
        w = folium.GeoJson(
            district_shape,
            name = "Congressional District",
        ).add_to(m) # m is the map object holding the facility points; add the district outline to it
        folium.GeoJsonTooltip(fields=["District"]).add_to(w)
    display( m )    

if ( should_make_charts ):
    import geopandas

    for state, cd in state_cds:
        violators_map( df_violators, (state,cd,'CAA') )
        violators_map( df_violators, (state,cd,'CWA') )
        violators_map( df_violators, (state,cd,'RCRA') )
