| ![EEW logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/main/Jupyter%20instructions/eew.jpg?raw=true) | ![EDGI logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/main/Jupyter%20instructions/edgi.png?raw=true) |
|---|---|


#### This notebook is licensed under GPL 3.0. Please visit our GitHub repo for more information.
#### The notebook was collaboratively authored by the Environmental Data & Governance Initiative (EDGI) following [our authorship protocol](https://docs.google.com/document/d/1CtDN5ZZ4Zv70fHiBTmWkDJ9mswEipX6eCYrwicP66Xw/)
#### For more information about this project, visit [our website](https://www.environmentalenforcementwatch.org/)

*Note: This notebook pulls data from a copy of EPA's ECHO database hosted by Stony Brook University. The data sets are updated sporadically, meaning that some of the results from your run may not exactly match those in other EEW data products or from your previous run.* 

# EEW's Watershed Notebook

---



This is a Jupyter Notebook, a way to organize and run Python code alongside explanatory text. Because the notebook is hosted on Google Colab, you can access and visualize data without writing any code yourself. The code is left visible for readers who know Python or who are curious how the notebook was put together.

In this notebook, we use [EPA's ECHO database](https://echo.epa.gov/) to understand who is polluting what and where in watersheds across the United States. We do this by accessing EPA data related to the Clean Water Act. More information on the Clean Water Act can be found [here](https://docs.google.com/presentation/d/1g6ZN3B5jvs3F1VAigiUtNNezjXdJnzuELfo9Deo9Y2w/edit?usp=sharing). 

The notebook asks you to select **a zip code** and finds the United States Geological Survey (USGS) watershed boundaries, known as Hydrologic Unit Codes or **HUCs**, that intersect with your zip code. The rest of the notebook then gathers information about facilities that report under the Clean Water Act in those watersheds.

Be sure to read the instructions in "How to Run" (below) and the notes above each cell for important tips and context! You may also wish to watch the tutorial video linked in the first code cell below.

## How to Run
![Instructions for running a Jupyter Notebook](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/main/overall_instructions.png?raw=true)

# 0. Getting Ready
Run the cell below to load a YouTube video where you can get a demo of how to use this notebook to learn about water pollution and polluters in your community.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('gR5jVqb43os')

For further reference you might be interested in our [guide available here](https://docs.google.com/document/d/1fOL1O30cAXS7iOZMItSekG8E8KIUfH-EaYG601W-ncM/edit?usp=sharing).

*Please note that we continue to refine the notebook, so there may be some differences between the tutorial video, the guide, and the steps that follow below in the notebook.*

# 1. Begin! 
First, we load some code to help us get going in our analysis of water pollution and polluters. You'll know this is completed when "Done!" appears at the bottom of the gray cell (right before Step 2). 

In [None]:
# This code block will fetch and install the libraries specified below.

# We have a folder of chunks of reusable code that we're using across different
#  Notebooks. This step goes and gets the relevant code from that folder so we
#  can use it here. (https://github.com/edgi-govdata-archiving/ECHO_modules/)
!git clone https://github.com/edgi-govdata-archiving/ECHO_modules.git -b reorganization &>/dev/null;

# Geopandas is an open source library for working with geographic data using the
#   data structures library "pandas" (common in Python for data processing).
#   (https://geopandas.org/)
!pip install geopandas  &>/dev/null;

# Topojson is an open source library that lets us keep file sizes small when
#   working with geographic data, so the Notebooks can run faster while still
#   working with detailed shapes. (https://github.com/mattijn/topojson)
!pip install topojson &>/dev/null;

# Install rtree to enable geopandas to clip data spatially
!pip install rtree &>/dev/null;

import warnings
warnings.filterwarnings('ignore')

%run ECHO_modules/utilities.py
%run ECHO_modules/presets.py
%run ECHO_modules/class.py
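from IPython.display import HTML, display  # public import path for the display helpers
import ipywidgets as widgets  # used by the selection widgets in later cells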

print("Done!")

# 2. Where do you want to search?
You don't need to know the name of your watershed - we'll find the “HUCs” that at least partially cover your zip code(s). HUC stands for Hydrologic Unit Code and is the method the USGS uses to identify watersheds across the United States. HUCs come in different sizes, identified by the number of digits in the code (2, 4, 6, 8, 10, or 12). The HUC level determines how large an area you search: the more digits, the smaller the area, so a HUC8 is larger than a HUC10. HUC8 describes subbasins, of which there are approximately 2,200 in the U.S., and there are approximately 20,000 HUC10 watersheds. The image below, from USGS, helps to visualize this. Additional information on HUCs can be found [here](https://nas.er.usgs.gov/hucs.aspx).
 
![USGS HUC illustration](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/main/huc.png?raw=true) Source: [South Dakota State University](https://extension.sdstate.edu/sites/default/files/inline-images/W-01296-03-Water-Hydrologic-Unit-Code-HUG.png).
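
Because HUCs nest inside one another, the code for a smaller unit begins with the code of the larger unit that contains it. The sketch below illustrates that relationship. `parent_huc` is a hypothetical helper written for this example (it is not part of ECHO_modules), and the HUC12 value is made up.

```python
# Valid HUC levels are even digit counts from 2 to 12.
HUC_LEVELS = (2, 4, 6, 8, 10, 12)


def parent_huc(code: str, level: int) -> str:
    """Return the enclosing HUC at a coarser `level` for a HUC `code`.

    Hypothetical helper for illustration; not part of ECHO_modules.
    """
    if level not in HUC_LEVELS:
        raise ValueError(f"level must be one of {HUC_LEVELS}")
    if len(code) not in HUC_LEVELS or not code.isdigit():
        raise ValueError(f"{code!r} is not a valid HUC code")
    if level > len(code):
        raise ValueError("cannot derive a finer HUC from a coarser one")
    # HUC codes nest: the first `level` digits identify the larger unit.
    return code[:level]


# An illustrative HUC12 code (made up for this example).
huc12 = "020401050101"
print(parent_huc(huc12, 10))  # 0204010501
print(parent_huc(huc12, 8))   # 02040105
```

This is why choosing HUC8 in the next step returns a wider area than choosing HUC10 or HUC12.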

![Instructions for selecting a zip code and huc](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/main/ziphuc_instructions.png?raw=true) 

In [None]:
# (a)
units = show_pick_region_widget( "Zip Code" )
units

In [None]:
# (b)
region_field = {k: v for k, v in region_field.items() if k in ["HUC8 Watersheds", "HUC10 Watersheds", "HUC12 Watersheds"]} 
region = show_region_type_widget( region_field )
region

# 3. Get the data!
![Instructions for getting data](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/main/data_instructions.png?raw=true) 

In [None]:
# (a)
units_list = [u.strip() for u in str(units.value).split(",")] # split on commas and trim stray spaces
data = Echo(units_list, region.value, [], intersection=True, intersecting_geo="Zip Codes")
print("Done!")
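
Cell (a) splits whatever you typed into a list of zip codes. A slightly more defensive version of that parsing step might look like the sketch below. `parse_zip_codes` is a hypothetical helper, and the five-digit check is an assumption about what counts as valid input, not something the notebook enforces.

```python
def parse_zip_codes(text: str) -> list[str]:
    """Split a comma-separated string into clean five-digit zip codes."""
    codes = []
    for piece in text.split(","):
        piece = piece.strip()  # drop spaces around each entry
        if not piece:
            continue  # skip empty entries such as a trailing comma
        if not (piece.isdigit() and len(piece) == 5):
            raise ValueError(f"{piece!r} is not a five-digit zip code")
        codes.append(piece)
    return codes


print(parse_zip_codes("11794, 11790,"))  # ['11794', '11790']
```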

In [None]:
# (b)
data.show_map()

In [None]:
# (c)
region_widget = None
# Named huc_options (not input) to avoid shadowing Python's built-in input().
huc_options = list(data.spatial_data[spatial_tables[region.value]['id_field'].lower()].unique())
region_widget = widgets.SelectMultiple(
    options=huc_options,
    description='Watersheds',
    disabled=False
)
region_widget

In [None]:
# (d)
units_list = list(region_widget.value) # the watersheds selected in (c)
units_list = [str(u) if len(str(u)) in [8,10,12] else "0" + str(u) for u in units_list] # restore a leading 0 lost if the code was read as a number
data = Echo(units_list, region.value)
print("Done!")
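
The second line of cell (d) guards against a common data problem: when a HUC such as `02040105` is stored as a number, its leading zero disappears and the code ends up one digit short. The sketch below isolates that repair. `restore_leading_zero` is a hypothetical name, and the sample values are illustrative only.

```python
def restore_leading_zero(code) -> str:
    """Return a HUC8/10/12 code as text, re-adding one dropped leading zero."""
    text = str(code)
    if len(text) in (8, 10, 12):
        return text  # already a full-length code
    # An odd length (7, 9, or 11) suggests a leading zero was lost.
    return "0" + text


print(restore_leading_zero(2040105))       # 02040105
print(restore_leading_zero("1710020501"))  # 1710020501
```

Like the original line, this assumes at most one zero is missing; it does not validate that the result is a real watershed.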

# 4. Show me the data! 
![Instructions to produce overview of watershed data](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/main/summary_instructions.png?raw=true)

In [None]:
# (a)
data.facilities = data.results["facilities"].loc[data.results["facilities"]["NPDES_FLAG"]=="Y"]
data.show_facility_map()

*Facilities that have been non-compliant with the CWA for the same number of quarters are not listed in any particular order. Because there are often more than 20 facilities that have been non-compliant during the last 13 quarters, some facilities may not appear on this chart. You can show more than 20 facilities by changing the number 20 in the cell to something else before running it.*  

In [None]:
# (b)
data.show_top_violators("CWA", 20)
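
To make the ranking described above concrete, here is a minimal sketch of a "top N by non-compliant quarters" calculation. It is not how `show_top_violators` is implemented; the `noncompliant_quarters` column, the `top_violators` function, and the data are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: how many of the last 13 quarters each facility
# was non-compliant.
facilities = pd.DataFrame({
    "FAC_NAME": ["Plant A", "Plant B", "Plant C", "Plant D"],
    "noncompliant_quarters": [13, 7, 13, 2],
})


def top_violators(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return the n facilities with the most non-compliant quarters."""
    ranked = df.sort_values("noncompliant_quarters", ascending=False, kind="stable")
    # Facilities tied on the count carry no meaningful order among themselves.
    return ranked.head(n)


print(top_violators(facilities, 2))
```

With a cutoff of 2, "Plant B" is left out even though it was non-compliant in 7 quarters, which is the same effect the note above describes for the cutoff of 20.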

*Information for Fiscal Year 2021 is coming soon! This cell may take up to 5 minutes to run if there are many facilities in the watershed(s) you have selected.*

In [None]:
# (c)
data.add("2020 Discharge Monitoring") 
top_pollutants = data.results['2020 Discharge Monitoring'].groupby(['PARAMETER_DESC'])[["FAC_NAME"]].nunique()
top_pollutants = top_pollutants.rename(columns={"FAC_NAME": "# of facilities"})
top_pollutants.sort_values(by="# of facilities", ascending=False)
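
Cell (c) counts how many distinct facilities reported each pollutant. The toy example below runs the same pandas steps on a few made-up rows so you can see what the grouping does. The column names match the cell above, but the data are invented.

```python
import pandas as pd

# Made-up discharge monitoring rows: one row per report.
dmr = pd.DataFrame({
    "FAC_NAME": ["Plant A", "Plant A", "Plant B", "Plant C", "Plant B"],
    "PARAMETER_DESC": ["Nitrogen", "Nitrogen", "Nitrogen", "Nitrogen", "pH"],
})

facilities_per_pollutant = (
    dmr.groupby("PARAMETER_DESC")[["FAC_NAME"]]
    .nunique()  # count each facility once per pollutant, however many reports it filed
    .rename(columns={"FAC_NAME": "# of facilities"})
    .sort_values(by="# of facilities", ascending=False)
)
print(facilities_per_pollutant)
```

"Plant A" filed two Nitrogen reports but is counted once, so Nitrogen shows 3 facilities and pH shows 1.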

In [None]:
# (d)
pollutants = widgets.Dropdown(
    options=list(top_pollutants.index),
    description='Pollutants:',
    disabled=False,
)
display(pollutants)

In [None]:
# (e)
top_pollutors = data.results['2020 Discharge Monitoring'].groupby(['PARAMETER_DESC', 'FAC_NAME', 'STANDARD_UNIT_DESC'])[["DMR_VALUE_STANDARD_UNITS"]].sum()
top_pollutors = top_pollutors.rename(columns={"STANDARD_UNIT_DESC": "units", "DMR_VALUE_STANDARD_UNITS": "values"})
display(HTML("<h3>"+pollutants.value+":</h3>"))
top_pollutors.loc[pollutants.value].sort_values(by="values", ascending=False)
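
Cell (e) totals the reported values for each pollutant, facility, and unit, then pulls out the rows for the pollutant you picked. The sketch below repeats those steps on invented data. The column names follow the cell above; the values and units are illustrative assumptions.

```python
import pandas as pd

# Made-up discharge monitoring rows.
dmr = pd.DataFrame({
    "PARAMETER_DESC": ["Nitrogen", "Nitrogen", "Nitrogen", "pH"],
    "FAC_NAME": ["Plant A", "Plant A", "Plant B", "Plant A"],
    "STANDARD_UNIT_DESC": ["kg/d", "kg/d", "kg/d", "SU"],
    "DMR_VALUE_STANDARD_UNITS": [1.5, 2.5, 10.0, 7.0],
})

totals = (
    dmr.groupby(["PARAMETER_DESC", "FAC_NAME", "STANDARD_UNIT_DESC"])[["DMR_VALUE_STANDARD_UNITS"]]
    .sum()
    .rename(columns={"DMR_VALUE_STANDARD_UNITS": "values"})
)

# Selecting on the first index level keeps facility and unit as the row labels.
nitrogen = totals.loc["Nitrogen"].sort_values(by="values", ascending=False)
print(nitrogen)
```

Keeping the unit in the grouping means values reported in different units are never added together.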

# 5. Explore!
There are several components of the Clean Water Act; we are focused on its National Pollutant Discharge Elimination System (NPDES). A facility can be *inspected* for compliance with NPDES, it can be found in *violation* of NPDES, and EPA or its state-level equivalents can levy *enforcement actions* against violating facilities. Additionally, facilities are required to submit reports summarizing their discharges into water bodies.

- **CWA Violations** = The number of facilities that were non-compliant with CWA NPDES in each quarter.
- **CWA Inspections** = The number of inspections of facilities made by state or federal regulators.
- **CWA Enforcements** = The amount of monetary penalties levied against polluting facilities, as well as the number of other enforcement actions such as administrative orders.
- **2020 Discharge Monitoring Reports** = The reports of wastewater discharges that facilities are required to submit. These are checked by EPA in order to determine Effluent Violations.

You can find more detailed definitions of these and other related terms [here](https://echo.epa.gov/tools/data-downloads/icis-npdes-download-summary).

![Instructions for exploring the data in depth](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/main/explore_instructions.png?raw=true) 


In [None]:
# (a)
data_sets = {k: v for k, v in attribute_tables.items() if v['echo_type'] == "NPDES"}

explore = show_data_set_widget( data_sets ) 
visualization = widgets.ToggleButtons(
    options=['Map', 'Chart', 'Table'],
    description='Visualization:',
    disabled=False,
    button_style='success', # 'success', 'info', 'warning', 'danger' or ''
    tooltips=['Show a map of facilities', 'Chart data over time', 'Show the data in a table'],
)
display(explore)
display(visualization)

In [None]:
# (b)
program = explore.value

if visualization.value == "Map":
  data.show_program_map(program)
elif visualization.value == "Chart":
  data.show_chart(program)
elif visualization.value == "Table":
  data.show_data(program)

In [None]:
# (c)
if ( len( data.results[program] ) > 0 ):
  write_dataset( df=data.results[program], base=program, type=region.value, state="", region=region.value )
  print( "Saved!" )
else:
  print( "There is no data for this program and region." )

# 6. Looking at Discharge Monitoring Reports
![Instructions for investigating DMRs](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/main/dmr_instructions.png?raw=true) 


In [None]:
# (a)
data.add("2020 Discharge Monitoring") # Adds the requisite data if it's not already added

facility_list = list(data.results["2020 Discharge Monitoring"][ 'FAC_NAME' ].unique())
facility_list.sort()
facility_widget = widgets.SelectMultiple(
    options = facility_list,
    description='Facility:',
    disabled=False,
)
display(facility_widget)
param_list = list(data.results["2020 Discharge Monitoring"][ 'PARAMETER_DESC' ].unique())
param_list.sort()
param_widget = widgets.SelectMultiple(
    options = param_list,
    description = 'Parameter:',
    disabled = False
)
display(param_widget)
display(visualization)

In [None]:
# (b)
data.results["2020 DMR Filtered"] = data.results["2020 Discharge Monitoring"]

if (len(facility_widget.value) == 0): # if no facilities actually selected, select all
  facility_widget.value = facility_list
if (len(param_widget.value) == 0): # if no parameters actually selected, select all
  param_widget.value = param_list
data.results["2020 DMR Filtered"] = data.results["2020 DMR Filtered"].loc[(
    data.results["2020 DMR Filtered"][ 'FAC_NAME' ].isin(facility_widget.value) & 
    data.results["2020 DMR Filtered"][ 'PARAMETER_DESC' ].isin(param_widget.value))]

# Create a custom-named program table "2020 DMR Filtered" for the Echo class based on the 2020 Discharge Monitoring table
presets.attribute_tables["2020 DMR Filtered"] = presets.attribute_tables["2020 Discharge Monitoring"]

if visualization.value == "Map":
  data.show_program_map("2020 DMR Filtered")
elif visualization.value == "Chart":
  data.show_chart("2020 DMR Filtered")
elif visualization.value == "Table":
  data.show_data("2020 DMR Filtered")
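
The filtering in cell (b) follows a simple rule: keep rows that match both selections, and treat an empty selection as "everything". A self-contained sketch of that rule is below. `filter_rows` is a hypothetical helper and the data are made up; only the column names come from the notebook.

```python
import pandas as pd


def filter_rows(df: pd.DataFrame, facilities=(), parameters=()) -> pd.DataFrame:
    """Keep rows matching the chosen facilities AND parameters."""
    # An empty selection is treated as "select everything".
    facilities = list(facilities) or list(df["FAC_NAME"].unique())
    parameters = list(parameters) or list(df["PARAMETER_DESC"].unique())
    mask = df["FAC_NAME"].isin(facilities) & df["PARAMETER_DESC"].isin(parameters)
    return df.loc[mask]


dmr = pd.DataFrame({
    "FAC_NAME": ["Plant A", "Plant A", "Plant B"],
    "PARAMETER_DESC": ["Nitrogen", "pH", "Nitrogen"],
    "DMR_VALUE_STANDARD_UNITS": [4.0, 7.0, 10.0],
})

print(filter_rows(dmr, facilities=["Plant A"]))  # both Plant A rows
print(len(filter_rows(dmr)))  # 3
```

Because the two conditions are combined with `&`, a facility and parameter that never occur together produce an empty table, which is why cell (c) checks the length before saving.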

In [None]:
# (c)
if ( len( data.results["2020 DMR Filtered"] ) > 0 ):
  write_dataset( df=data.results["2020 DMR Filtered"], base="2020 DMR Filtered", type=region.value, state="", region=region.value )
  print( "Saved!" )
else:
  print( "There is no data available to save." )

# 7. Share your work!
![Instructions for sharing results](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/main/share_instructions.png?raw=true) 

# 8. Tell us what you found, whether anything went wrong, or if you would like to arrange a 1:1 workshop.
Send an email to environmentalenforcementwatch@gmail.com or reach us on Twitter [@EEW_Network](https://www.twitter.com/eew_network)