| ![EEW logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/eew.jpg?raw=true) | ![EDGI logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/edgi.png?raw=true) |
|---|---|

#### This notebook is licensed under GPL 3.0. Please visit our Github repo for more information: 
#### The notebook was collaboratively authored by the Environmental Data & Governance Initiative (EDGI) following our authorship protocol: https://docs.google.com/document/d/1CtDN5ZZ4Zv70fHiBTmWkDJ9mswEipX6eCYrwicP66Xw/
#### For more information about this project, visit https://www.environmentalenforcementwatch.org/

*Note: This notebook pulls data from a copy of EPA's ECHO database hosted by Stony Brook University. The data sets are updated on a weekly basis, meaning that some of the results from your run may not exactly match those in other EEW data products or from your previous run.* 

# EEW's Watershed Notebook

---



This is a Jupyter Notebook (a way to organize Python code) hosted on a platform known as Google Colab. Hosting on Google Colab allows you to access and visualize data without actually needing to do any coding! 

Here we use [EPA's ECHO database](https://echo.epa.gov/) to understand who is polluting what and where in watersheds across the United States. We achieve this by accessing data related to the Clean Water Act. More information on the Clean Water Act can be found [here](https://docs.google.com/presentation/d/1g6ZN3B5jvs3F1VAigiUtNNezjXdJnzuELfo9Deo9Y2w/edit?usp=sharing). 


This notebook asks you to pick a geographic area you probably already know, **a zip code**, and finds the USGS watersheds that intersect with it. Each watershed is identified by a Hydrologic Unit Code, or **HUC**. The rest of the notebook gathers information about facilities in those watersheds that report under the Clean Water Act. 

Be sure to read the instructions in "How to Run" (below) and the notes above each cell for important tips and context! 

## How to Run
* A "cell" in a Jupyter Notebook is a block of code that performs a set of actions, such as loading data or using it. The notebook works by running one cell after another as you select from the options offered.
* If you click on a gray **code** cell, a little “play button” arrow appears on the left. If you click the play button, it will run the code in that cell (“**running** a cell”). The button will animate. When the animation stops, the cell has finished running.
![Where to click to run the cell](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/pressplay.JPG?raw=true)
* When you run the first cell, you may get a warning that the notebook was not authored by Google. We know, we authored them! It’s okay. Click “Run Anyway” to continue. 
![Error Message](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/warning-message.JPG?raw=true)
* **It is important to run cells in order because they depend on each other.** However, some cells are optional. That will be indicated in the text below those cells. 

# 1. Begin! 
Here we load some helper code to get us going. If your environment already has these loaded this cell may be skipped. (If you're not sure, it's best to run this cell!) You'll know this is completed when a small "Done!" appears at the bottom of the gray cell (right before Step 2). 

In [1]:
# We have a folder of chunks of reusable code that we're using across different
#  Notebooks. This step goes and gets the relevant code from that folder so we
#  can use it here. (https://github.com/edgi-govdata-archiving/ECHO_modules/)
!git clone https://github.com/edgi-govdata-archiving/ECHO_modules.git -b reorganization &>/dev/null;

# Geopandas is an open source library for working with geographic data using the
#   data structures library "pandas" (common in Python for data processing).
#   (https://geopandas.org/)
!pip install geopandas  &>/dev/null;

# Topojson is an open source library that lets us keep file sizes small when
#   working with geographic data, so the Notebooks can run faster while still
#   working with detailed shapes. (https://github.com/mattijn/topojson)
!pip install topojson &>/dev/null;

# Install rtree to enable geopandas to clip data spatially
!pip install rtree &>/dev/null;

import warnings
warnings.filterwarnings('ignore')

# The install commands above send their output to /dev/null, so nothing is
#   printed while they run. When they finish, the line below prints "Done!"
print("Done!")

Done!


# 2. Run some stuff.
This cell must be run to bring in some utility functions. Like Step 1, we're just looking for a "Done!" at the bottom of this gray cell. 

In [None]:
# These code blocks come from our folder (https://github.com/edgi-govdata-archiving/ECHO_modules/)
# Each of the files contains a series of function definitions. By running
#   those files here, we make the functions available in this Notebook.
%run ECHO_modules/utilities.py
%run ECHO_modules/presets.py
%run ECHO_modules/class.py
print("Done!")

# 3. Where do you want to search?
a. Run the following cell to choose which zip code(s) you want to start with. You won't need to know the HUC for your watersheds of interest - we'll find the ones that at least partially cover your zip code(s). Separate each zip code with a comma. For example, if you wanted to list several zipcodes you would do that like this: 98225, 14303, 40218
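The comma-separated format described above can be sketched in plain Python. This is only an illustration, not the widget's actual code: the helper name `parse_zip_codes` is hypothetical, and the five-digit check is an assumption about what counts as a valid entry. Zip codes are kept as text so that a leading zero (as in 02150) is not lost.

```python
import re

def parse_zip_codes(text):
    """Split a comma-separated string into a list of 5-digit zip code strings."""
    codes = []
    for piece in text.split(","):
        code = piece.strip()  # drop spaces around each entry
        if not code:
            continue  # ignore empty entries, e.g. from a trailing comma
        if not re.fullmatch(r"\d{5}", code):
            raise ValueError(f"Not a 5-digit zip code: {code!r}")
        codes.append(code)
    return codes

print(parse_zip_codes("98225, 14303, 40218"))
```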

In [None]:
units = show_pick_region_widget( "Zip Code" )
units

b. Choose whether you want to look at the **HUC8 or HUC10** watershed level. HUC stands for Hydrologic Unit Code and is the method the USGS uses to identify watersheds across the United States. The number after HUC (2, 4, 6, 8, 10, 12) indicates the number of digits in the code, with more digits indicating a smaller watershed. HUC8 describes subbasins, of which there are approx. 2,200 in the U.S. HUC10 is a slightly smaller region, with approx. 20,000 across the United States. The image below, from USGS, helps to visualize this well. 
![USGS HUC visualization](https://prd-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/styles/atom_page_medium/public/thumbnails/image/WBD_Base_HUStructure_small.png)

Additional information on HUCs can be found [here](https://nas.er.usgs.gov/hucs.aspx). Run the cell below to choose to look at the HUC8 or HUC10 watershed level.
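To make the digit-count idea concrete, here is a small sketch. The function names `huc_level` and `parent_huc` are hypothetical (they are not part of ECHO_modules), and the ten-digit code in the example is made up for illustration. It relies on the fact that HUC codes are nested, so a larger watershed's code is the leading digits of the smaller watersheds inside it.

```python
VALID_LEVELS = (2, 4, 6, 8, 10, 12)

def huc_level(code):
    """Return the HUC level, which is simply the number of digits in the code."""
    if not code.isdigit() or len(code) not in VALID_LEVELS:
        raise ValueError(f"Not a valid HUC: {code!r}")
    return len(code)

def parent_huc(code, level):
    """Truncate a HUC to a coarser level, e.g. a HUC10 to its enclosing HUC8."""
    if level not in VALID_LEVELS or level > huc_level(code):
        raise ValueError(f"Cannot reduce {code!r} to HUC{level}")
    return code[:level]

print(huc_level("04270101"))        # an 8-digit code is a HUC8
print(parent_huc("0427010103", 8))  # made-up HUC10 and the HUC8 that contains it
```

HUC codes are handled as text rather than numbers here, because many begin with a zero.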




In [None]:
region_field = {k: v for k, v in region_field.items() if k in ["HUC8 Watersheds", "HUC10 Watersheds"]}
region = show_region_type_widget( region_field )
region

# 4. Get the data!

This step pulls the data we need for the zip code(s) and watershed level (HUC) you selected from the copy of the ECHO database. 

In [None]:
units_list = [u.strip() for u in str(units.value).split(",") if u.strip()] # split on commas and drop surrounding spaces
data = Echo(units_list, region.value, [], intersection=True, intersecting_geo="Zip Codes") # A configuration specific to this watershed notebook
print("Done!")

# 5. Show me the data! 
First, let's look at all the facilities regulated under the Clean Water Act (CWA) in this area. Most facilities regulated under the CWA are industrial facilities, like a chemical-producing factory, or municipal facilities, like a wastewater treatment plant. 

The map below displays the watershed(s) that the zip code(s) you selected are within. You can zoom in to see the names of individual facilities. 

In [None]:
data.show_facility_map()

This next chart shows you the top 20 violators in this watershed for the last 13 quarters (the last 3 years plus the most recent quarter). Facilities that have been in violation the same number of quarters are not listed in any particular order. 

*More than 20 facilities have often been in violation during the last 13 quarters, and facilities tied on the number of quarters are not listed in any particular order, so some facilities with many quarters in violation may not appear on this chart. You can show more than 20 facilities by changing the number 20 in the code to something else before running the cell.*  
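The ranking described above can be sketched with Python's standard library. This is not how `show_top_violators` is implemented. It is a hypothetical illustration with made-up facility names, assuming each record is one (facility, quarter) pair for a quarter in violation.

```python
from collections import Counter

def top_violators(records, n=20):
    """Rank facilities by the number of quarters in violation.

    records: list of (facility_name, quarter) pairs, one per quarter in violation.
    Returns up to n (facility_name, quarter_count) pairs, highest count first.
    """
    counts = Counter(name for name, _quarter in records)
    return counts.most_common(n)

# Made-up example data
records = [
    ("Plant A", "Q1"), ("Plant A", "Q2"),
    ("Plant B", "Q1"),
    ("Plant C", "Q1"), ("Plant C", "Q2"), ("Plant C", "Q3"),
]
print(top_violators(records, 2))
```

With a cutoff of 2, Plant B is left off the list even though it was in violation. A tie at the cutoff works the same way: one of the tied facilities is shown and the other is not.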


In [None]:
data.show_top_violators("CWA", 20)

# 6. Explore!
a. There are several components of the Clean Water Act - we are focused on its National Pollutant Discharge Elimination System (NPDES) here. A facility can be *inspected* for compliance with NPDES, it can be found in *violation* of the program, and EPA/state equivalents can levy *enforcement actions* against violating facilities. Additionally, facilities are required to submit reports summarizing their discharges into waterbodies.

Select a variable of interest and how you would like to see the data displayed: as a map, chart, or table. You can run this as many times as you want! Change your selection here (drop down and buttons below) then re-run the play button at 6b below.

Some more detail about available data:

- CWA Violations = The number of times a facility was non-compliant with CWA NPDES in each quarter. There are different kinds of non-compliance (see below).
- CWA Inspections = Facility inspections made by regulators.
- CWA Enforcements = Data on monetary penalties levied against polluting facilities as well as other enforcement actions such as administrative orders.
- Effluent Violations = CWA NPDES violations related to discharges of effluent into waterbodies. This can include violations caused by facilities discharging more than the permitted amount of a substance.
- 2020 Discharge Monitoring Reports = Reports on effluent discharges submitted quarterly by facilities. These are checked to determine Effluent Violations.

Here's where you can find more detailed definitions of terms used in the data: https://echo.epa.gov/tools/data-downloads/icis-npdes-download-summary

In [None]:
data_sets = {k: v for k, v in attribute_tables.items() if v['echo_type'] == "NPDES"}

explore = show_data_set_widget( data_sets ) 
visualization = widgets.ToggleButtons(
    options=['Map', 'Chart', 'Table'],
    description='Visualization:',
    disabled=False,
    button_style='success', # 'success', 'info', 'warning', 'danger' or ''
    tooltips=['Show a map of facilities', 'Chart data over time', 'Show a table of the data'],
)
display(explore)
display(visualization)

b. Map, chart, or show a table of the data of interest. 


**If you're displaying a map**, the orange circles indicate facilities that match whatever data variable you're looking at (e.g., if you've selected CWA Inspections, the orange circles represent facilities that have been inspected since 2001, or if you selected CWA Penalties, the orange circles represent facilities that have been penalized). The black dots indicate other facilities in the watershed that report under the CWA but do not match the specific information you've asked about above. They may have been inspected or penalized, for instance, but we cannot determine that from EPA's database. Finally, the size of the orange circle corresponds to the magnitude of the data variable for each facility. For example, if you were looking at penalty data, a facility that was issued a \$100,000 penalty would have a larger orange circle representing it than a facility that was issued a \$10,000 penalty. 

**If you're displaying the CWA Violations chart**, below is a detailed explanation of the legend that will appear:

*Teal bar = NUME90Q*: Number of E90 Violations in Quarter. A count of the number of effluent violations (E90) reported in the quarter, defined by YEARQTR.

*Orange bar = NUMCVDT*: Number of Compliance Schedule Violations in Quarter. A count of the number of compliance schedule violations reported in the quarter, defined by YEARQTR. A compliance schedule applies when there is a water-quality target to be reached for a certain pollutant in an area. 

*Pink bar = NUMSVCD*: Number of Single Event Violations in Quarter. A count of the number of single event violations reported in the quarter, defined by PRHQRTR.

*Dark blue bar = NUMPSCH*: Number of Permit Schedule Violations in Quarter. A count of the number of permit schedule violations reported in the quarter, defined by YEARQTR. This type of violation accounts for the majority of violations. They are typically caused by not submitting reports on time or the like.

**If you're displaying the Effluent Violations or DMR charts**, you will see the number of violations or reports per year broken down by chemical.

**For other charts**, the green bars will indicate the total number of violations, inspections, or penalties in each year since 2001.

More information on definitions can be found [here](https://echo.epa.gov/tools/data-downloads/icis-npdes-download-summary).
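As a rough sketch of what the CWA Violations chart adds up, the four legend fields can be totaled per quarter as shown here. The rows, the quarter labels, and the helper name `totals_by_quarter` are all made up for illustration, and grouping every field by a single YEARQTR value is a simplifying assumption. The notebook's own charting code may work differently.

```python
# The four count fields from the chart legend
VIOLATION_COLUMNS = {
    "NUME90Q": "Effluent (E90) violations",
    "NUMCVDT": "Compliance schedule violations",
    "NUMSVCD": "Single event violations",
    "NUMPSCH": "Permit schedule violations",
}

def totals_by_quarter(rows):
    """Sum each violation count field across facilities, grouped by quarter."""
    totals = {}
    for row in rows:
        quarter_totals = totals.setdefault(
            row["YEARQTR"], dict.fromkeys(VIOLATION_COLUMNS, 0)
        )
        for column in VIOLATION_COLUMNS:
            quarter_totals[column] += row.get(column, 0)
    return totals

# Made-up rows: one per facility per quarter
rows = [
    {"YEARQTR": "Q1", "NUME90Q": 2, "NUMCVDT": 0, "NUMSVCD": 1, "NUMPSCH": 3},
    {"YEARQTR": "Q1", "NUME90Q": 1, "NUMCVDT": 0, "NUMSVCD": 0, "NUMPSCH": 4},
    {"YEARQTR": "Q2", "NUME90Q": 0, "NUMCVDT": 1, "NUMSVCD": 0, "NUMPSCH": 2},
]
for quarter, counts in totals_by_quarter(rows).items():
    print(quarter, counts)
```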




In [None]:
program = explore.value

if visualization.value == "Map":
  data.show_program_map(program)
elif visualization.value == "Chart":
  data.show_chart(program)
elif visualization.value == "Table":
  data.show_data(program)

c. In this cell you may save the Step 6 data to your computer.
After running it, you can access your files by clicking on the 'Files' tab in the menu on the left-hand side of the notebook (it looks like a folder). You may have to hit 'Refresh' if you don't see your file. Then, you can click on the ... next to your file and choose "Download". The CSV spreadsheet will download to wherever your browser usually saves files (e.g., your Downloads folder).

In [None]:
if ( len( data.results[program] ) > 0 ):
  write_dataset( df=data.results[program], base=program, type=region.value, state="", region=region.value )
else:
  print( "There is no data for this program and region." )

# 7. Looking at Discharge Monitoring Reports
Here we can focus in on specific facilities and pollutants of interest.

a. Select specific facilities and/or pollutants from this/these watershed(s).


In [None]:
data.add("2020 Discharge Monitoring") # Adds the requisite data if it's not already added

facility_list = list(data.results["2020 Discharge Monitoring"][ 'FAC_NAME' ].unique())
facility_list.sort()
facility_widget = widgets.SelectMultiple(
    options = facility_list,
    description='Facility:',
    disabled=False,
)
display(facility_widget)
param_list = list(data.results["2020 Discharge Monitoring"][ 'PARAMETER_DESC' ].unique())
param_list.sort()
param_widget = widgets.SelectMultiple(
    options = param_list,
    description = 'Parameter:',
    disabled = False
)
display(param_widget)
display(visualization)

b. Show the chart type for the data you selected! 


**If you're displaying a map of a particular pollutant**, the orange circles indicate facilities that have reported on that pollutant. The black dots indicate other facilities in the watershed that report under the CWA but haven't reported on the pollutant. Finally, the size of the orange circle corresponds to the number of reports. For example, a facility that has submitted 100 reports on benzene discharges would be represented by a larger orange circle than a facility that submitted 10 reports.
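The selection rule used in this step is that leaving a list unselected means "include everything". A minimal sketch of that rule in plain Python is shown here, using made-up reports and a hypothetical helper, `filter_reports`. The notebook itself filters a pandas table, so this only illustrates the logic.

```python
def filter_reports(reports, facilities=(), parameters=()):
    """Keep reports matching the chosen facilities and parameters.

    An empty selection for either field means that field is not filtered.
    """
    facility_set = set(facilities)
    parameter_set = set(parameters)
    return [
        report for report in reports
        if (not facility_set or report["FAC_NAME"] in facility_set)
        and (not parameter_set or report["PARAMETER_DESC"] in parameter_set)
    ]

# Made-up example reports
reports = [
    {"FAC_NAME": "Plant A", "PARAMETER_DESC": "Benzene"},
    {"FAC_NAME": "Plant A", "PARAMETER_DESC": "Nitrogen"},
    {"FAC_NAME": "Plant B", "PARAMETER_DESC": "Benzene"},
]
print(len(filter_reports(reports)))                          # nothing selected: all reports
print(len(filter_reports(reports, parameters=["Benzene"])))  # only Benzene reports
```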

In [None]:
data.results["2020 DMR Filtered"] = data.results["2020 Discharge Monitoring"]

if (len(facility_widget.value) == 0): # if no facilities actually selected, select all
  facility_widget.value = facility_list
if (len(param_widget.value) == 0): # if no parameters actually selected, select all
  param_widget.value = param_list
data.results["2020 DMR Filtered"] = data.results["2020 DMR Filtered"].loc[(
    data.results["2020 DMR Filtered"][ 'FAC_NAME' ].isin(facility_widget.value) & 
    data.results["2020 DMR Filtered"][ 'PARAMETER_DESC' ].isin(param_widget.value))]

# Create a custom-named program table "2020 DMR Filtered" for the Echo class based on the 2020 Discharge Monitoring table
presets.attribute_tables["2020 DMR Filtered"] = presets.attribute_tables["2020 Discharge Monitoring"]

if visualization.value == "Map":
  data.show_program_map("2020 DMR Filtered")
elif visualization.value == "Chart":
  data.show_chart("2020 DMR Filtered")
elif visualization.value == "Table":
  data.show_data("2020 DMR Filtered")

c. In this cell you may save the Step 7 data to your computer.
After running it, you can access your files by clicking on the 'Files' tab in the menu on the left-hand side of the notebook (it looks like a folder). You may have to hit 'Refresh' if you don't see your file. Then, you can click on the ... next to your file and choose "Download". The CSV spreadsheet will download to wherever your browser usually saves files (e.g., your Downloads folder).

In [None]:
if ( len( data.results["2020 DMR Filtered"] ) > 0 ):
  write_dataset( df=data.results["2020 DMR Filtered"], base="2020 DMR Filtered", type=region.value, state="", region=region.value )
else:
  print( "There is no data available to save." )

# Appendix: Other Use Cases

Additional features which are in the works! 

In [None]:
#Map facilities in multiple watersheds
#this_watersheds = Echo(["07080208", "07080105"], "HUC8 Watersheds")

#this_watersheds.add("CWA Penalties")
#this_watersheds.show_program_map("CWA Penalties")
#this_watersheds.show_chart("CWA Penalties")

#this_watersheds.add("CWA Violations")
#this_watersheds.show_program_map("CWA Violations")
#this_watersheds.show_chart("CWA Violations")

# Multiple zip codes show map
#zips = Echo([14303,14207,14219], "Zip Codes", ["CWA Violations"])
#zips.show_facility_map()

# Huc(s) that intersect with zipcodes
#hucs = Echo([14303,14207,14219], "HUC10 Watersheds", ["CWA Violations"], intersection=True, intersecting_geo="Zip Codes")
#huc = Echo([14303], "HUC10 Watersheds", ["CWA Violations"], intersection=True, intersecting_geo="Zip Codes")
# Read this as, get the HUC10 watersheds and their CWA violations that intersect with this/these zip code/s
#huc.show_program_map("CWA Violations")
#huc.show_top_violators("CWA", 20)

"""
Zips of interest
14303 (Niagara Falls along the Niagara River – industrial corridor ) = 04270101
14207 (Black Rock/Scajaquada Creek) = 04270101
14219 (South towns, Woodlawn Beach) = '04120103', '04260000' (Lake Erie?)
"""

# Getting no program data at first and then adding Effluent Violations
# (named zip_data so that Python's built-in zip() function is not overwritten)
zip_data = Echo(['02150'], "Zip Codes")
zip_data.add("Effluent Violations")
zip_data.show_data("Effluent Violations")

In [None]:
zip_data.show_facility_map()