| ![EEW logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/eew.jpg?raw=true) | ![EDGI logo](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/edgi.png?raw=true) |
|---|---|


#### This notebook is licensed under GPL 3.0. Please visit our GitHub repo for more information: 
#### The notebook was collaboratively authored by the Environmental Data & Governance Initiative (EDGI) following [our authorship protocol](https://docs.google.com/document/d/1CtDN5ZZ4Zv70fHiBTmWkDJ9mswEipX6eCYrwicP66Xw/)
#### For more information about this project, visit [our website](https://www.environmentalenforcementwatch.org/)

*Note: This notebook pulls data from a copy of EPA's ECHO database hosted by Stony Brook University. The data sets are updated sporadically, meaning that some of the results from your run may not exactly match those in other EEW data products or from your previous run.* 

# EEW's Watershed Notebook

---



This is a Jupyter Notebook - a way to organize Python computer programming code. Hosting the notebook on Google Colab allows you to access and visualize data without actually needing to do any coding! The code is left visible for individuals with a knowledge of Python or for those wondering how this site was put together. This allows for a more interactive user experience. 

In this notebook, we use [EPA's ECHO database](https://echo.epa.gov/) to understand who is polluting what and where in watersheds across the United States. We do this by accessing EPA data related to the Clean Water Act. More information on the Clean Water Act can be found [here](https://docs.google.com/presentation/d/1g6ZN3B5jvs3F1VAigiUtNNezjXdJnzuELfo9Deo9Y2w/edit?usp=sharing). 

The notebook asks you to select **a zip code** and finds the United States Geological Survey (USGS) watershed boundaries, known as Hydrologic Unit Codes or **HUC codes**, that intersect with your zip code. The rest of the notebook then gathers information about facilities that report under the Clean Water Act in those watersheds. 

Be sure to read the instructions in "How to Run" (below) and the notes above each cell for important tips and context! You may also wish to watch the tutorial video loaded by the first cell of code below.

## How to Run
* A "cell" in a Jupyter Notebook is a block of code that performs a set of actions, either making data available or using it. The notebook works by running one cell after another as you choose from the options offered.
* If you click on a gray **code** cell, a little “play button” arrow appears on the left. If you click the play button, it will run the code in that cell (“**running** a cell”). The button will animate. When the animation stops, the cell has finished running.
![Where to click to run the cell](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/pressplay.JPG?raw=true)
* When you run the first cell, you may get a warning that the notebook was not authored by Google. We know, we authored them! It’s okay. Click “Run Anyway” to continue. 
![Error Message](https://github.com/edgi-govdata-archiving/EEW-Image-Assets/blob/master/Jupyter%20instructions/warning-message.JPG?raw=true)
* **Unless otherwise noted, it is important to run cells in order because they depend on each other.** 

# 0. Getting Ready
Run the cell below to load a YouTube video where you can get a demo of how to use this notebook to learn about water pollution and polluters in your community.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('gR5jVqb43os')

For further reference you might be interested in our [guide available here](https://docs.google.com/document/d/1fOL1O30cAXS7iOZMItSekG8E8KIUfH-EaYG601W-ncM/edit?usp=sharing).

*Please note that we continue to refine the notebook, so there may be some differences between the tutorial video, the guide, and the steps that follow below in the notebook.*

# 1. Begin! 
a. First, we load some code to help us get going in our analysis of water pollution and polluters. You'll know this is completed when "Done!" appears at the bottom of the gray cell (right before Step 1b). 

In [None]:
# We have a folder of chunks of reusable code that we're using across different
#  Notebooks. This step goes and gets the relevant code from that folder so we
#  can use it here. (https://github.com/edgi-govdata-archiving/ECHO_modules/)
!git clone https://github.com/edgi-govdata-archiving/ECHO_modules.git -b reorganization &>/dev/null;

# Geopandas is an open source library for working with geographic data using the
#   data structures library "pandas" (common in Python for data processing).
#   (https://geopandas.org/)
!pip install geopandas  &>/dev/null;

# Topojson is an open source library that lets us keep file sizes small when
#   working with geographic data, so the Notebooks can run faster while still
#   working with detailed shapes. (https://github.com/mattijn/topojson)
!pip install topojson &>/dev/null;

# Install rtree to enable geopandas to clip data spatially
!pip install rtree &>/dev/null;

import warnings
warnings.filterwarnings('ignore')

# The commands above fetch and install the libraries; `&>/dev/null` hides their
#   output. When they have finished, the line below lets us know by printing "Done!"
print("Done!")

b. This cell must be run to load in some utility functions. As in Step 1a, we're just looking for a "Done!" at the bottom of this gray cell. 

In [None]:
# These code blocks come from our folder (https://github.com/edgi-govdata-archiving/ECHO_modules/)
# Each of the files contains a series of function definitions. By running
#   those files here, we make the functions available in this Notebook.
%run ECHO_modules/utilities.py
%run ECHO_modules/presets.py
%run ECHO_modules/class.py
from IPython.display import HTML  # public import path for the HTML display helper
print("Done!")

# 2. Where do you want to search?
a. Run the following cell to choose which zip code(s) you want to start with. Separate each zip code with a comma. For example, to list several zip codes you would enter: 98225, 14303, 40218. Once you enter your zip code(s), there's no need to press enter; just proceed to the next step, 2b.

In [None]:
units = show_pick_region_widget( "Zip Code" )
units

b. You don't need to know the name of your watershed - we'll find the “HUCs” that at least partially cover your zip code(s). HUC stands for Hydrologic Unit Code and is the method the USGS uses to identify watersheds across the United States. HUCs come in different sizes, named for the number of digits in the code (2, 4, 6, 8, 10, or 12). The HUC level essentially determines how big your search area is: a HUC8 is bigger and a HUC10 is smaller. HUC8 describes subbasins, of which there are approximately 2,200 in the U.S. There are approximately 20,000 HUC10 watersheds across the United States. The image below, from USGS, helps to visualize this. 
 
![USGS HUC visualization](https://prd-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/styles/atom_page_medium/public/thumbnails/image/WBD_Base_HUStructure_small.png)

Additional information on HUCs can be found [here](https://nas.er.usgs.gov/hucs.aspx). 
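
Because each HUC level adds two digits to the code of the larger unit that contains it, you can read the "parent" watersheds straight off a code. The sketch below illustrates that nesting in plain Python. The function name `huc_ancestors` and the sample code are made up for this example and are not part of the notebook's own modules.

```python
def huc_ancestors(huc_code):
    """Return the enclosing hydrologic units for a HUC code, keyed by level."""
    code = str(huc_code)
    if not code.isdigit() or len(code) % 2 != 0 or not 2 <= len(code) <= 12:
        raise ValueError(f"Not a valid HUC code: {huc_code!r}")
    # Each level adds two digits, so every even-length prefix names a larger unit.
    return {length: code[:length] for length in range(2, len(code) + 1, 2)}


# An illustrative 10-digit code (chosen only to show the pattern).
for level, prefix in huc_ancestors("0101000201").items():
    print(f"HUC{level}: {prefix}")
```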

Run the cell below to choose whether you want to look at the **HUC8, HUC10, or HUC12** watershed level. Once you've chosen from the dropdown menu, there’s no need to press enter; you can proceed to the next step (3).

In [None]:
region_field = {k: v for k, v in region_field.items() if k in ["HUC8 Watersheds", "HUC10 Watersheds", "HUC12 Watersheds"]} 
region = show_region_type_widget( region_field )
region

# 3. Get the data!

a. This step pulls the data for the zip code(s) and watershed level (HUC8, HUC10, or HUC12) you selected from our copy of EPA’s Enforcement and Compliance History Online (ECHO) database. You just need to press play!

In [None]:
units_list = [u.strip() for u in str(units.value).split(",") if u.strip()] # split on commas, trim spaces, skip blanks
data = Echo(units_list, region.value, [], intersection=True, intersecting_geo="Zip Codes")
print("Done!")
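
The comma-splitting step in the cell above can be tried on its own. This is a minimal sketch, assuming that zip codes should be five digits and that stray spaces or empty entries should be ignored. `parse_zip_codes` is a hypothetical helper for illustration; the notebook itself passes the list directly to `Echo`.

```python
def parse_zip_codes(text):
    """Split a comma-separated string into a list of 5-digit zip codes."""
    zip_codes = []
    for piece in text.split(","):
        piece = piece.strip()  # "98225, 14303" leaves a space before "14303"
        if not piece:
            continue  # ignore empty entries such as a trailing comma
        if not (piece.isdigit() and len(piece) == 5):
            raise ValueError(f"Not a 5-digit zip code: {piece!r}")
        zip_codes.append(piece)
    return zip_codes


print(parse_zip_codes("98225, 14303, 40218"))
```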

b. Next, let's actually see where the watersheds are. Running this cell produces a map of the watershed(s) intersecting your zip code. The watersheds are shaded blue and the zip code is orange. Hover your cursor over each watershed to see its HUC code.

In [None]:
data.show_map()

c. The zip code(s) you entered probably cross several watersheds, but you might want to focus on just one or two. Run the cell below and then click on the HUC code for the watershed(s) you want to examine more closely.

In [None]:
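region_widget = None
# List the unique HUC codes found for the chosen watershed level.
# (Named huc_options so we don't overwrite Python's built-in input function.)
huc_options = list(data.spatial_data[spatial_tables[region.value]['id_field'].lower()].unique())
region_widget = widgets.SelectMultiple(
    options=huc_options,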
    description='Watersheds',
    disabled=False
)
region_widget

To pick more than one watershed, hold the “shift” key while clicking on each code. After selecting one or more watersheds, run this cell to filter the data.

In [None]:
units_list = list(region_widget.value) # copy the selected HUC codes into a list
units_list = [str(u) if len(str(u)) in [8,10,12] else "0" + str(u) for u in units_list] # restore a leading 0 if it was dropped
data = Echo(units_list, region.value)
print("Done!")
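
Some HUC codes start with 0, and that leading zero disappears if a code is ever stored as a number. The cell above repairs this by putting a "0" in front of any code that is not 8, 10, or 12 characters long. When the watershed level is already known, the same repair can be sketched with Python's built-in `str.zfill`. `pad_huc` is a hypothetical helper written for this example.

```python
def pad_huc(code, level):
    """Left-pad a HUC code with zeros so it has exactly `level` digits."""
    text = str(code)
    if not text.isdigit() or len(text) > level:
        raise ValueError(f"{code!r} is not a valid HUC{level} code")
    return text.zfill(level)


print(pad_huc(1010002, 8))     # a HUC8 that lost its leading zero
print(pad_huc("17110004", 8))  # already complete, returned unchanged
```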

# 4. Show me the data! 
a. First, let's look at all the facilities regulated under the Clean Water Act (CWA) in this area. Most facilities regulated under the CWA are industrial facilities, like chemical-producing factories, or municipal facilities, like wastewater treatment plants. 

Running this cell produces a map:

In [None]:
data.facilities = data.results["facilities"].loc[data.results["facilities"]["NPDES_FLAG"]=="Y"]
data.show_facility_map()

The map will likely have some yellow circles that say, e.g., “40,” some green circles that say, e.g., “9,” and some orange circles without numbers. The numbers refer to how many regulated facilities are located in that general area. You can zoom in to see individual facilities, each represented by an orange dot. You can click on a dot to see the facility's name.

b. This next cell produces a chart that shows you 20 of the most consistent violators of the Clean Water Act in the selected watershed(s) for the last 13 quarters (the last 3 years plus the most recent quarter). Facilities that have been non-compliant with the CWA for the same number of quarters are not listed in any particular order. 

*Because there are often more than 20 facilities that have been non-compliant during the last 13 quarters, some facilities may not appear on this chart. You can show more than 20 facilities by changing the number 20 in the cell to something else before running it.*  

In [None]:
data.show_top_violators("CWA", 20)
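
To make the idea of "most consistent violators" concrete, here is a small sketch of one way such a ranking could be computed. It assumes each facility has a list of 13 True/False values, one per quarter, where True means non-compliant. The data, the facility names, and the function `top_violators` are invented for illustration; this is not necessarily how `show_top_violators` works internally.

```python
def top_violators(history, limit=20):
    """Rank facilities by the number of non-compliant quarters, highest first."""
    counts = {name: sum(quarters) for name, quarters in history.items()}
    violators = [(name, count) for name, count in counts.items() if count > 0]
    # Python's sort is stable, so tied facilities keep their input order.
    return sorted(violators, key=lambda item: item[1], reverse=True)[:limit]


# Invented compliance histories: 13 quarters each, True = non-compliant.
history = {
    "Plant A": [True] * 13,
    "Plant B": [False] * 10 + [True] * 3,
    "Plant C": [False] * 13,
    "Plant D": [True] * 13,
}
for name, quarters in top_violators(history, limit=3):
    print(f"{name}: {quarters} of 13 quarters in non-compliance")
```

Lowering `limit` cuts off the list, which is why some non-compliant facilities may not appear on a chart of only 20.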

c. Now, we’ll rank pollutants in terms of the number of facilities reporting having released them into the watershed(s) during Fiscal Year 2020 (October 2019 - September 2020). Information for Fiscal Year 2021 is coming soon! This cell may take up to 5 minutes to run if there are many facilities in the watershed(s) you have selected.

In [None]:
data.add("2020 Discharge Monitoring") 
top_pollutants = data.results['2020 Discharge Monitoring'].groupby(['PARAMETER_DESC'])[["FAC_NAME"]].nunique()
top_pollutants = top_pollutants.rename(columns={"FAC_NAME": "# of facilities"})
top_pollutants.sort_values(by="# of facilities", ascending=False)
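
The cell above counts, for each pollutant, how many different facilities reported it. The same counting idea is sketched here in plain Python on a few invented reports, so you can see why a facility that reports the same pollutant many times is still counted once. The facility and pollutant names are placeholders.

```python
from collections import defaultdict


def facilities_per_pollutant(reports):
    """Count distinct facilities per pollutant, most widely reported first."""
    facilities = defaultdict(set)
    for facility, pollutant in reports:
        facilities[pollutant].add(facility)  # a set ignores repeat reports
    counts = {pollutant: len(names) for pollutant, names in facilities.items()}
    # Sort by count (descending), then alphabetically to break ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented (facility, pollutant) pairs standing in for monitoring reports.
reports = [
    ("Plant A", "Nitrogen"),
    ("Plant A", "Nitrogen"),
    ("Plant B", "Nitrogen"),
    ("Plant A", "Zinc"),
]
for pollutant, count in facilities_per_pollutant(reports):
    print(f"{pollutant}: {count} facilities")
```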

d. You can select one of these pollutants to see which facility releases the most of it. First, run this cell to choose the pollutant.

In [None]:
pollutants = widgets.Dropdown(
    options=list(top_pollutants.index),
    description='Pollutants:',
    disabled=False,
)
display(pollutants)

Now, run this cell to see the ranking. 

In [None]:
top_pollutors = data.results['2020 Discharge Monitoring'].groupby(['PARAMETER_DESC', 'FAC_NAME', 'STANDARD_UNIT_DESC'])[["DMR_VALUE_STANDARD_UNITS"]].sum()
top_pollutors = top_pollutors.rename(columns={"STANDARD_UNIT_DESC": "units", "DMR_VALUE_STANDARD_UNITS": "values"})
display(HTML("<h3>"+pollutants.value+":</h3>"))
top_pollutors.loc[pollutants.value].sort_values(by="values", ascending=False)

Please note that some pollutants are measured in two or more different units. You may also see some values as "0". That is likely because the facility made a report to EPA, but reported no discharge of the pollutant.
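
The sketch below shows why units matter when reading the ranking. It adds up reported values separately for each facility and unit, so totals in different units stay on separate rows and are not mixed together. The rows, field names, and the function `total_discharge` are invented for this example.

```python
from collections import defaultdict


def total_discharge(rows, pollutant):
    """Sum reported values for one pollutant, per (facility, unit) pair."""
    totals = defaultdict(float)
    for row in rows:
        if row["pollutant"] == pollutant:
            totals[(row["facility"], row["unit"])] += row["value"]
    # Largest totals first; rows in different units are not directly comparable.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Invented reports: Plant A reports zinc in two units, Plant B reports zero.
rows = [
    {"facility": "Plant A", "pollutant": "Zinc", "unit": "kg/d", "value": 12.5},
    {"facility": "Plant A", "pollutant": "Zinc", "unit": "kg/d", "value": 7.5},
    {"facility": "Plant A", "pollutant": "Zinc", "unit": "mg/L", "value": 3.0},
    {"facility": "Plant B", "pollutant": "Zinc", "unit": "kg/d", "value": 0.0},
    {"facility": "Plant B", "pollutant": "Copper", "unit": "kg/d", "value": 9.0},
]
for (facility, unit), total in total_discharge(rows, "Zinc"):
    print(f"{facility}: {total} {unit}")
```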

# 5. Explore!
There are several components of the Clean Water Act; we are focused on its National Pollutant Discharge Elimination System (NPDES). A facility can be *inspected* for compliance with NPDES, it can be found in *violation* of NPDES, and EPA or its state-level equivalents can levy *enforcement actions* against violating facilities. Additionally, facilities are required to submit reports summarizing their discharges into water bodies.

- **CWA Violations** = The number of facilities that were non-compliant with CWA NPDES in each quarter. There are different kinds of non-compliance (see 5b below for more information).
- **CWA Inspections** = The number of inspections of facilities made by state or federal regulators.
- **CWA Enforcements** = The amount of monetary penalties levied against polluting facilities, as well as the number of other enforcement actions such as administrative orders.
- **2020 Discharge Monitoring Reports** = The reports of wastewater discharges that facilities are required to submit. These are checked by EPA in order to determine Effluent Violations.

You can find more detailed definitions of these and other related terms [here](https://echo.epa.gov/tools/data-downloads/icis-npdes-download-summary).

a. First, run this cell to select one of these aspects of the CWA NPDES and how you would like to see data about it for your watershed(s) displayed: as a map, chart, or table. You can run this as many times as you want! Just change your selection in the drop down menu and buttons below, then re-run the play button at 5b.


In [None]:
data_sets = {k: v for k, v in attribute_tables.items() if v['echo_type'] == "NPDES"}

explore = show_data_set_widget( data_sets ) 
visualization = widgets.ToggleButtons(
    options=['Map', 'Chart', 'Table'],
    description='Visualization:',
    disabled=False,
    button_style='success', # 'success', 'info', 'warning', 'danger' or ''
    tooltips=['Show a map of facilities', 'Chart data over time', 'Show a table of data'],
)
display(explore)
display(visualization)

b. Map, chart, or show a table of the data you’re interested in. 

In [None]:
program = explore.value

if visualization.value == "Map":
  data.show_program_map(program)
elif visualization.value == "Chart":
  data.show_chart(program)
elif visualization.value == "Table":
  data.show_data(program)

**If you're displaying a map**, the orange circles you’ll see represent facilities that match whatever item of interest you're looking at. For instance, if you've selected CWA Inspections, the orange circles represent facilities that have been inspected since 2001. Or, if you selected CWA Penalties the orange circles represent facilities that have been penalized since 2001. The size of the orange circle corresponds to the magnitude of the item of interest for each facility. For example, if you are looking at penalty data, a facility that was issued \$100,000 in penalties would have a larger orange circle than a facility that was issued \$10,000 in penalties. The black dots, on the other hand, represent other facilities in the watershed(s) that report under the CWA but that do not match the specific information you requested above in 5a. For instance, if you selected CWA Inspections, the black dots are facilities that haven’t reported an inspection since at least 2001. 

**If you're displaying the CWA Violations chart**, below is a detailed explanation of the legend that will appear:

*Teal bar = NUME90Q*: The number of effluent violations reported in each quarter. These violations occur when facilities report releasing more of a particular pollutant than is permitted. 

*Orange bar = NUMCVDT*: The number of compliance schedule violations reported in each quarter. A compliance schedule violation is based on whether there is a water-quality target for a pollutant in this watershed(s). 

*Pink bar = NUMSVCD*: The number of single event violations reported in each quarter.

*Dark blue bar = NUMPSCH*: The number of permit schedule violations reported in each quarter (the quarter is identified by YEARQTR). This type accounts for the majority of CWA NPDES violations. These violations are caused by, for instance, not submitting reports on time.

**If you're displaying the Effluent Violations or Discharge Monitoring Reports (DMR) charts**, you will see the number of violations or reports per year broken down by chemical.

**For other charts**, the green bars will indicate the total number of violations, inspections, or amount of penalties in each year since 2001.

More information can be found [here](https://echo.epa.gov/tools/data-downloads/icis-npdes-download-summary).
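
If you download the violations data and want to rebuild the stacked bars yourself, the idea is to add up each of the four violation counts within each quarter. Here is a minimal sketch on invented records. It assumes every record carries a `YEARQTR` value plus the four count columns named in the legend above; the exact `YEARQTR` format shown is an assumption.

```python
from collections import Counter

VIOLATION_COLUMNS = ["NUME90Q", "NUMCVDT", "NUMSVCD", "NUMPSCH"]


def violations_by_quarter(records):
    """Total each violation type per quarter, with quarters in sorted order."""
    totals = {}
    for record in records:
        quarter_totals = totals.setdefault(record["YEARQTR"], Counter())
        for column in VIOLATION_COLUMNS:
            quarter_totals[column] += record.get(column, 0)
    return {quarter: dict(counts) for quarter, counts in sorted(totals.items())}


# Invented records: two facilities in one quarter, one facility in the next.
records = [
    {"YEARQTR": "20201", "NUME90Q": 2, "NUMCVDT": 0, "NUMSVCD": 0, "NUMPSCH": 1},
    {"YEARQTR": "20201", "NUME90Q": 1, "NUMCVDT": 0, "NUMSVCD": 1, "NUMPSCH": 0},
    {"YEARQTR": "20202", "NUME90Q": 0, "NUMCVDT": 1, "NUMSVCD": 0, "NUMPSCH": 3},
]
for quarter, counts in violations_by_quarter(records).items():
    print(quarter, counts)
```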

c. In this cell you can save the data to your computer.

In [None]:
if ( len( data.results[program] ) > 0 ):
  write_dataset( df=data.results[program], base=program, type=region.value, state="", region=region.value )
  print( "Saved!" )
else:
  print( "There is no data for this program and region." )

You can access your files by clicking on the 'Files' tab in the menu on the left-hand side of the notebook (it looks like a folder). Then, click on the ... next to your file and choose "Download". The spreadsheet will download to wherever your browser usually saves files (e.g. the Downloads folder).
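
For readers curious about what "saving the data" involves, the sketch below writes a few rows to a CSV spreadsheet using only Python's standard library, and skips the save when there is nothing to write. It is not the notebook's `write_dataset` function; `save_rows`, the file name, and the sample row are assumptions made for illustration.

```python
import csv
import os
import tempfile


def save_rows(rows, path):
    """Write a list of dictionaries to a CSV file; return False if empty."""
    if not rows:
        return False
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()  # first line of the file: the column names
        writer.writerows(rows)
    return True


# Write one invented row into a temporary folder.
output_path = os.path.join(tempfile.mkdtemp(), "example-export.csv")
saved = save_rows([{"FAC_NAME": "Plant A", "PARAMETER_DESC": "Zinc"}], output_path)
print("Saved!" if saved else "There is no data available to save.")
```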

# 6. Looking at Discharge Monitoring Reports
Here we dive deeper into specific facilities and pollutants of interest. We will be using pollution reports (Discharge Monitoring Reports) submitted by facilities to EPA during Fiscal Year 2020 (October 1, 2019 to September 30, 2020). Information for Fiscal Years 2021 and 2022 coming soon!
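
A U.S. federal fiscal year runs from October 1 to September 30 and is named for the calendar year in which it ends, which is why Fiscal Year 2020 begins in October 2019. If you ever need to work out which fiscal year a report date belongs to, the rule can be sketched as follows (`federal_fiscal_year` is a helper written for this example):

```python
from datetime import date


def federal_fiscal_year(day):
    """Return the U.S. federal fiscal year that a calendar date falls in."""
    # October, November, and December belong to the *next* year's fiscal year.
    return day.year + 1 if day.month >= 10 else day.year


for day in (date(2019, 10, 1), date(2020, 9, 30), date(2020, 10, 1)):
    print(day.isoformat(), "is in FY", federal_fiscal_year(day))
```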


a. Run this cell and then select specific facilities and/or pollutants from your watershed(s).

In [None]:
data.add("2020 Discharge Monitoring") # Adds the requisite data if it's not already added

facility_list = list(data.results["2020 Discharge Monitoring"][ 'FAC_NAME' ].unique())
facility_list.sort()
facility_widget = widgets.SelectMultiple(
    options = facility_list,
    description='Facility:',
    disabled=False,
)
display(facility_widget)
param_list = list(data.results["2020 Discharge Monitoring"][ 'PARAMETER_DESC' ].unique())
param_list.sort()
param_widget = widgets.SelectMultiple(
    options = param_list,
    description = 'Parameter:',
    disabled = False
)
display(param_widget)
display(visualization)

You can look at more than one facility by holding the “shift” key and clicking on each. Likewise, you can look at more than one pollutant by holding the “shift” key and clicking on each. To view all pollutants reported by all facilities, you can just proceed to 6b without selecting anything.

b. Map, chart, or create a table for the data you selected! 

In [None]:
data.results["2020 DMR Filtered"] = data.results["2020 Discharge Monitoring"]

if (len(facility_widget.value) == 0): # if no facilities actually selected, select all
  facility_widget.value = facility_list
if (len(param_widget.value) == 0): # if no parameters actually selected, select all
  param_widget.value = param_list
data.results["2020 DMR Filtered"] = data.results["2020 DMR Filtered"].loc[(
    data.results["2020 DMR Filtered"][ 'FAC_NAME' ].isin(facility_widget.value) & 
    data.results["2020 DMR Filtered"][ 'PARAMETER_DESC' ].isin(param_widget.value))]

# Create a custom-named program table "2020 DMR Filtered" for the Echo class based on the 2020 Discharge Monitoring table
presets.attribute_tables["2020 DMR Filtered"] = presets.attribute_tables["2020 Discharge Monitoring"]

if visualization.value == "Map":
  data.show_program_map("2020 DMR Filtered")
elif visualization.value == "Chart":
  data.show_chart("2020 DMR Filtered")
elif visualization.value == "Table":
  data.show_data("2020 DMR Filtered")
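
The filtering rule used in the cell above is "keep a report only if its facility and its pollutant were both selected, and treat an empty selection as selecting everything." Here is that rule sketched in plain Python on invented rows. `filter_reports` is a hypothetical helper; the notebook does the same job with pandas.

```python
def filter_reports(rows, facilities=(), parameters=()):
    """Keep rows matching the selected facilities and parameters.

    An empty selection means "no filter", so every value is accepted.
    """
    facility_set = set(facilities) or {row["FAC_NAME"] for row in rows}
    parameter_set = set(parameters) or {row["PARAMETER_DESC"] for row in rows}
    return [
        row for row in rows
        if row["FAC_NAME"] in facility_set and row["PARAMETER_DESC"] in parameter_set
    ]


# Invented reports standing in for the Discharge Monitoring table.
rows = [
    {"FAC_NAME": "Plant A", "PARAMETER_DESC": "Zinc"},
    {"FAC_NAME": "Plant A", "PARAMETER_DESC": "Copper"},
    {"FAC_NAME": "Plant B", "PARAMETER_DESC": "Zinc"},
]
print(len(filter_reports(rows)))                         # nothing selected
print(filter_reports(rows, facilities=["Plant A"], parameters=["Zinc"]))
```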

**If you're displaying a map of a particular pollutant**, the orange circles represent facilities that reported discharging that pollutant. The size of the orange circle corresponds to the number of reports. For example, a facility that submitted 100 reports on benzene discharges would be represented by a larger orange circle than a facility that submitted 10 reports. The black dots represent other facilities in the watershed that are regulated under the CWA but which didn’t report discharging the pollutant.


c. In this cell you can save the data to your computer.

In [None]:
if ( len( data.results["2020 DMR Filtered"] ) > 0 ):
  write_dataset( df=data.results["2020 DMR Filtered"], base="2020 DMR Filtered", type=region.value, state="", region=region.value )
  print( "Saved!" )
else:
  print( "There is no data available to save." )

You can access your files by clicking on the 'Files' tab in the menu on the left-hand side of the notebook (it looks like a folder). Then, click on the ... next to your file and choose "Download". The spreadsheet will download to wherever your browser usually saves files (e.g. the Downloads folder).

# 7. Share your work!
The easiest way to save and share your work is to:

1.   Be sure to download the datasets you created in 5c and 6c
2.   Under "File" in the top-left corner of the Colab window, select "Print" and choose "Save as PDF". This will download a static PDF copy of the notebook to your computer.
3.   Email, tweet, and otherwise let the world know about your findings!

# 8. Tell us what you found, whether anything went wrong, or if you would like to arrange a 1:1 workshop.
Send an email to environmentalenforcementwatch@gmail.com or reach us on Twitter [@EEW_Network](https://www.twitter.com/eew_network)