<a id="top"></a>
# Goal: Download the EPA ECHO Program's 'Facility Detail' Data for ~1,500 Carceral Facilities.

### Background
So far, offline, we've been examining EPA ECHO (Enforcement and Compliance History Online) data for carceral facilities using a summary CSV download. The unit of analysis is the EPA facility: one row per facility, with summary data taken from ECHO.

To get more detail on individual facilities, this utility notebook will download the ECHO Program's 'Facility Detail' Data and save the results as JSON text files. The ['Facility Detail' API](https://echo.epa.gov/tools/web-services/detailed-facility-report) is one of several offered by [ECHO's web services](https://echo.epa.gov/tools/web-services/).

A future notebook will begin analysis of the ECHO data by importing these JSON text files. I may also end up writing a version of this notebook in R, as there are more R users in the group.

We're still working on identifying all the carceral EPA facilities, but we'll be downloading data for the ~1,500 coded as carceral by NAICS/SIC code, to begin work in parallel.

### Outstanding Issues
ECHO's API appears to return malformed JSON for certain records (16 of the 1,527 facilities in the run below), which raises a `JSONDecodeError` when we parse the response. Because we're collecting data for exploration, I've left this unresolved for now.

### Begin Notebook

In [1]:
import requests
#let's be good API citizens
import requests_cache #conda install -c conda-forge requests-cache
requests_cache.install_cache("ECHOcache") #creates the cache (SQLite by default) if it doesn't exist, otherwise reuses it
import time #for sleep

import pandas as pd
import numpy as np

from datetime import datetime, timezone
import os
from pathlib import Path

#widen our display just in case
pd.set_option('display.max_columns', None)

#display all output not just last
#https://stackoverflow.com/questions/36786722/how-to-display-full-output-in-jupyter-not-only-last-result
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Read in the roughly 4 million EPA facilities; some background is in [this notebook](https://github.com/benmillam/epa-facilities/blob/master/EPA%20Facilities%20Registry%20Data%20to%20ID%20Incarceration%20Facilities.ipynb).

In [2]:
fac = pd.read_csv('NATIONAL_SINGLE.CSV')


We'll get the Facility ID for 1,527 EPA facilities coded as CORRECTIONAL INSTITUTIONS with NAICS code (922140) or SIC code (9223). Some facilities may have multiple codes. 

In [7]:
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
#substring match, not exact: returns True wherever the code appears anywhere in the field
naics_correctional = fac['NAICS_CODES'].str.contains('922140', na=False, regex=False) #a logical Series
sum(naics_correctional)

sic_correctional = fac['SIC_CODES'].str.contains('9223', na=False, regex=False) #a logical Series
sum(sic_correctional)

naics_or_sic_correctional = naics_correctional | sic_correctional
sum(naics_or_sic_correctional)

prison_fac_ids = fac[naics_or_sic_correctional]['REGISTRY_ID']
sum(pd.isna(prison_fac_ids))

982

1274

1527

0
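Because the match above is a substring match, a code such as `9223` would also be found inside a longer run of digits. Below is a minimal sketch of a stricter whole-code match using digit lookarounds. The sample values are made up, and the helper name `has_exact_code` is hypothetical; how codes are delimited in `NATIONAL_SINGLE.CSV` has not been checked here, so treat this as an assumption-laden illustration rather than a drop-in replacement.

```python
import re

# Matches 9223 only when it is not embedded in a longer run of digits.
SIC_CORRECTIONAL = re.compile(r"(?<!\d)9223(?!\d)")

def has_exact_code(codes_field, pattern=SIC_CORRECTIONAL):
    """Return True if the field contains the code as a whole token."""
    if not isinstance(codes_field, str):  # NaN/None for facilities with no codes
        return False
    return pattern.search(codes_field) is not None

# Made-up sample values, NOT taken from NATIONAL_SINGLE.CSV
samples = ["9223", "8211, 9223", "19223", "92231", None]

# Substring test, equivalent in spirit to str.contains(..., regex=False)
substring_hits = [isinstance(s, str) and "9223" in s for s in samples]
# Whole-code test
exact_hits = [has_exact_code(s) for s in samples]

print(substring_hits)
print(exact_hits)
```

The same pattern string can be passed to `Series.str.contains` with `regex=True` if a vectorized version is preferred.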

Now our functions to query the API and save the results:

In [8]:
#helper to write JSON string to a file for storage
def save_json_as_text_file(json_string, facid, verbose = False):
    """
    Saves a JSON string as a UTF-8 encoded text file.
        
        Args:
            json_string (str): A single string.
            facid (str): The EPA facility ID.
            verbose (bool): Whether to print status messages.
            
        Returns:
            None
    """
    #build filename
    facid = str(facid) #convert to string just in case
    time_accessed = datetime.now(timezone.utc).strftime("%Y-%m-%d-%Z") #'2019-11-07-UTC' string for filename
    filename = facid + '-' + time_accessed + '-facility-detail-JSON.txt'
    
    #write file
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(json_string)
        if verbose:
            print('File saved for facility ID', facid)
    return None
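
The helper above writes into the current working directory, which is why the download loop later in this notebook changes directories with `os.chdir`. As a rough alternative, the sketch below passes the output directory explicitly and builds the path with `pathlib`. The names `build_output_path` and `save_text` and the `out_dir`/`when` parameters are hypothetical, introduced only for illustration; the filename pattern mirrors the one used above.

```python
from datetime import datetime, timezone
from pathlib import Path

def build_output_path(facid, out_dir, when=None):
    """Build <out_dir>/<facid>-<YYYY-MM-DD>-UTC-facility-detail-JSON.txt."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y-%m-%d-%Z")  # %Z renders as 'UTC' for timezone.utc
    return Path(out_dir) / f"{facid}-{stamp}-facility-detail-JSON.txt"

def save_text(json_string, facid, out_dir, when=None):
    """Write json_string as UTF-8 under out_dir and return the file path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    path = build_output_path(facid, out_dir, when)
    path.write_text(json_string, encoding="utf-8")
    return path
```

With this shape the caller never needs to change or restore the working directory, so an exception mid-loop cannot leave the notebook in the wrong folder.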
    

Note: API is public, no guidance on rate limits aside from "Note: To download a large volume of data, please use the [ECHO Data Downloads](https://echo.epa.gov/tools/data-downloads)."  Still, we'll be good citizens, use a cache, and sleep between requests.
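
One way to express the "sleep between requests" idea is a small throttle that waits only for whatever remains of a minimum interval, rather than a fixed pause after every call. This is a sketch, not what the notebook does; the `Throttle` class is hypothetical, and the 1.5 second interval is the same guess used in this notebook, not a documented ECHO limit.

```python
import time

class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first call

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

# Usage idea: call throttle.wait() immediately before each uncached request.
throttle = Throttle(1.5)
```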

In [20]:
#helper to return a single EPA facility query
def get_single_facility_detail(facid, verbose = False):
    """
    Gets ECHO facility detail results for a single facility from https://echo.epa.gov/tools/web-services/detailed-facility-report.
    Saves the resulting JSON in a text file and returns a dictionary from requests.json().
    
        Args:
            facid (str): The EPA facility registry ID.
            verbose (bool): Whether to print status messages.
            
        Returns:
            a dictionary (dict) from requests.json()
    """
    
    api_url = 'https://ofmpub.epa.gov/echo/dfr_rest_services.get_dfr'
    
    response = requests.get(api_url, params = {
        "p_id": facid,
        "output": 'JSON',
    })
    
    #if not cached, then sleep to be polite
    if not response.from_cache:
        time.sleep(1.5) #no public rate limit info, estimate
    
    #if the HTTP status is bad or the body isn't valid JSON, notify the user; note we aren't checking integrity of API results!
    try:
        response.raise_for_status()
        result = response.json() #returns a dictionary parsed from the API's JSON
    except Exception:
        print("Requests error in your get_single_facility_detail function.")
        raise #note we'll catch the error in the parent loop, but we'd like to know about it in real time
    
    #save text file
    try:
        save_json_as_text_file(response.text, facid, verbose)
    except Exception:
        print("File save error in your get_single_facility_detail function.")
        raise #note we'll catch the error in the parent loop
    
    return result
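
Note that the function above parses the response before saving it, so a response that fails to parse is never written to disk. If keeping those raw payloads for later inspection is useful, one possible ordering is to save first and parse second. The sketch below assumes that goal; `save_then_parse` is a hypothetical helper operating on plain text, not part of the notebook.

```python
import json
from pathlib import Path

def save_then_parse(raw_text, path):
    """Write the raw response text to disk, then attempt to parse it.

    Returns (parsed, error): parsed is the decoded object or None,
    error is the JSONDecodeError or None.
    """
    path = Path(path)
    path.write_text(raw_text, encoding="utf-8")  # raw payload is kept even if parsing fails
    try:
        return json.loads(raw_text), None
    except json.JSONDecodeError as err:
        return None, err
```

Returning the error instead of raising lets the caller record which facility IDs produced unparseable files while still keeping every download.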

In [11]:
#check for duplicate facilities, just an FYI/sanity check
sum(prison_fac_ids.duplicated())

0

#### Now we gather and save results for all our prisons via our functions above

In [12]:
#hrmmm, not sure where to store/encapsulate subdirectory to save results in
subdirectory = 'facility-detail-results-' + datetime.now(timezone.utc).strftime("%Y-%m-%d")
if not Path(subdirectory).exists():
    os.mkdir(subdirectory)

old_working_directory = os.getcwd()

os.chdir(Path(subdirectory))

prison_results = dict()

prison_errors = list()

#query API/store in dictionary keyed by facility ID, save JSON text file
# for i loop so we can occasionally report progress
for i in range(0, prison_fac_ids.size):
    
    fac_id = prison_fac_ids.iloc[i]
    try:
        prison_results[fac_id] = get_single_facility_detail(fac_id)
    except Exception: #Exception rather than a bare except, so Ctrl-C can still interrupt the loop
        prison_errors.append(fac_id)
        
    if i % 100 == 0:
        print('Working on results {0} of {1}'.format(i+1, prison_fac_ids.size))

os.chdir(old_working_directory)

Working on results 1 of 1527
Working on results 101 of 1527
Working on results 201 of 1527
Working on results 301 of 1527
Working on results 401 of 1527
Working on results 501 of 1527
Working on results 601 of 1527
Working on results 701 of 1527
Working on results 801 of 1527
Working on results 901 of 1527
Working on results 1001 of 1527
Requests error in your get_single_facility_detail function.
Working on results 1101 of 1527
Working on results 1201 of 1527
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Working on results 1301 of 1527
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your g

And view the Facility IDs that drew errors:

In [13]:
prison_errors

[110011634794,
 110006663039,
 110034245458,
 110034825027,
 110035309216,
 110015817339,
 110005079333,
 110011692034,
 110005022661,
 110006126441,
 110039172631,
 110021021124,
 110005358059,
 110005393467,
 110005364658,
 110015534698]

#### Here we copy the loop from above and rerun it on the facilities that drew errors
This duplicates code and is inefficient.

In [21]:
#hrmmm, not sure where to store/encapsulate subdirectory to save results in
subdirectory = 'facility-detail-results-' + datetime.now(timezone.utc).strftime("%Y-%m-%d")
if not Path(subdirectory).exists():
    os.mkdir(subdirectory)

old_working_directory = os.getcwd()

os.chdir(Path(subdirectory))

#danger don't overwrite!!!
#prison_results = dict()

prison_repeat_errors = list()

#query API/store in dictionary keyed by facility ID, save JSON text file
# for i loop so we can occasionally report progress
for i in range(0, len(prison_errors)):
    
    fac_id = prison_errors[i]
    try:
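        prison_results[fac_id] = get_single_facility_detail(fac_id, verbose = True) #verbose to see individual file saves
    except Exception: #Exception rather than a bare except, so Ctrl-C can still interrupt the loop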
        prison_repeat_errors.append(fac_id)
        
    if i % 100 == 0:
        #note: the denominator printed here is still the full facility count, not len(prison_errors)
        print('Working on results {0} of {1}'.format(i+1, prison_fac_ids.size))

os.chdir(old_working_directory)

Requests error in your get_single_facility_detail function.
Working on results 1 of 1527
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.
Requests error in your get_single_facility_detail function.


In [18]:
prison_repeat_errors

[110011634794,
 110006663039,
 110034245458,
 110034825027,
 110035309216,
 110015817339,
 110005079333,
 110011692034,
 110005022661,
 110006126441,
 110039172631,
 110021021124,
 110005358059,
 110005393467,
 110005364658,
 110015534698]
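
The retry cell above repeats the main download loop almost line for line. One way to avoid the copy is to wrap the loop in a function that takes the IDs and a fetch callable. The sketch below is only an illustration: `collect` and its `progress_every` parameter are hypothetical names, and the demo uses a fake fetch function rather than the real API.

```python
def collect(ids, fetch, progress_every=100):
    """Call fetch(fac_id) for each ID; return (results, failed_ids)."""
    results, failed = {}, []
    total = len(ids)
    for i, fac_id in enumerate(ids):
        if i % progress_every == 0:
            print(f"Working on results {i + 1} of {total}")
        try:
            results[fac_id] = fetch(fac_id)
        except Exception:  # not a bare except, so Ctrl-C still interrupts
            failed.append(fac_id)
    return results, failed

# Demo with a stand-in fetch that fails on odd IDs (made-up behavior).
def fake_fetch(fac_id):
    if fac_id % 2:
        raise ValueError("simulated failure")
    return {"id": fac_id}

demo_results, demo_failed = collect([1, 2, 3, 4], fake_fetch)
```

In this notebook the first pass might look like `collect(list(prison_fac_ids), get_single_facility_detail)`, with the returned list of failed IDs fed back into a second `collect` call for the retry.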

We've retained the error output below. Further testing and offline manual inspection confirm it: ECHO's API appears to return malformed JSON for certain records.

Because we're collecting this data for exploration, we'll leave this unsolved for now.

In [22]:
get_single_facility_detail(110005079333, verbose = True)

Requests error in your get_single_facility_detail function.


JSONDecodeError: Expecting property name enclosed in double quotes: line 1315 column 1 (char 28544)
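
To see what sits at the position a `JSONDecodeError` reports, the exception's `pos`, `lineno`, and `colno` attributes can be used to pull a snippet of the raw text around the failure. The sketch below runs on a made-up malformed string; it is an assumption that the ECHO responses fail in a similar way, and `describe_json_error` is a hypothetical helper.

```python
import json

def describe_json_error(raw_text, context=40):
    """Return None if raw_text parses; otherwise a dict locating the failure."""
    try:
        json.loads(raw_text)
        return None
    except json.JSONDecodeError as err:
        start = max(err.pos - context, 0)
        return {
            "message": err.msg,
            "line": err.lineno,
            "column": err.colno,
            # raw characters on either side of the reported position
            "snippet": raw_text[start:err.pos + context],
        }

# Made-up example: an unquoted property name at the start of line 2.
bad = '{"Name": "x",\nbad: 1}'
report = describe_json_error(bad)
print(report)
```

Applied to `response.text` for one of the failing facility IDs, the snippet would show the characters around line 1315, column 1 of that response.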