PA2 Notebook 1: Fetch Data
===


This is the first notebook for DRWI Pollution Assessment Stage 2 (PA2) analysis.
It fetches and prepares all the input data and modeling necessary for the Stage 2 Assessment.

The general data analysis pipeline is to:
- Run the Model My Watershed (MMW) Multi-Year Model (GWLF-E) for every HUC12 in the DRWI.
- Process MMW HUC12 results through the [WikiSRAT microservice](https://github.com/TheAcademyofNaturalSciences/WikiSRATMicroService) API.
  - WikiSRAT downscales HUC12 loads to NHD+v2 catchments and routes pollution through the NHD+v2 stream reach network.
  - WikiSRAT runs group MMW HUC12 results by HUC8 to properly route loads through the stream network.
  - Stream network routing includes attenuation due to physical and biological processes within surface waters. 
  - WikiSRAT gets run multiple times to simulate:
    - Baseline results, with no restoration or protection practices.
    - Restoration results, which include the  

# Installation and Setup

Carefully follow our **[Installation Instructions](README.md#get-started)**, especially including:
- Creating a virtual environment for this repository (step 3)

## Import Python Dependencies

In [1]:
from pathlib import Path

import numpy     as np
import pandas    as pd
import geopandas as gpd

# packages for data requests
import requests
from requests.auth import HTTPBasicAuth
import json

In [2]:
print("Geopandas: ", gpd.__version__)
# print("spatialpandas: ", spd.__version__)
# print("datashader: ", ds.__version__)
# print("pygeos: ", pygeos.__version__)

Geopandas:  0.10.2


In [3]:
!conda-develop /Users/aaufdenkampe/Documents/Python/pollution-assessment/src

path exists, skipping /Users/aaufdenkampe/Documents/Python/pollution-assessment/src
completed operation for: /Users/aaufdenkampe/Documents/Python/pollution-assessment/src


In [4]:
# Custom functions for Pollution Assessment
import pollution_assessment as pa
import pollution_assessment.plot
# Confirm that the `pa.plot` sub-module is imported
dir(pa)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'plot']

## Set Paths to Input and Output Files with `pathlib`

Use the [pathlib](https://docs.python.org/3/library/pathlib.html) library (built into Python 3) to manage paths independently of OS or environment.

This blog post describes `pathlib`'s benefits relative to using the `os` library or manual approaches.
- https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f
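
As a minimal sketch of that benefit, the `/` operator joins path segments with the separator that matches the path flavor. The folder and file names here are made up for illustration, and `PurePosixPath` / `PureWindowsPath` are used (instead of `Path`) only so that the result is the same on any machine.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical project layout, for illustration only.
relative_parts = ("stage1", "data", "example.parquet")

# The same segments joined under two different path flavors.
posix_path = PurePosixPath("/home/user/project").joinpath(*relative_parts)
windows_path = PureWindowsPath("C:/Users/user/project").joinpath(*relative_parts)

# The `/` operator is equivalent to joinpath().
same_path = PurePosixPath("/home/user/project") / "stage1" / "data" / "example.parquet"

print(posix_path)
print(windows_path)
print(posix_path == same_path)
```

No separator is hard-coded in the segments themselves, which is why the same code can run unchanged on different systems.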

In [5]:
# Find your current working directory, which should be the folder for this notebook.
Path.cwd()

PosixPath('/Users/aaufdenkampe/Documents/Python/pollution-assessment/stage2')

In [6]:
# Set your project directory to your local folder for your clone of this repository
project_path = Path.cwd().parent
project_path

PosixPath('/Users/aaufdenkampe/Documents/Python/pollution-assessment')

In [7]:
# Assign relative paths for data folders. End with a slash character, `/`.
pa1_data_folder = Path('stage1/data/')
pa2_mmw_folder  = Path('stage2/DRB_GWLFE/mmw_results/')
pa2_wikisrat_folder = Path('stage2/wikiSRAT/')

# PA2 Names & Units

## PA2 File Naming Conventions
NOTE: For Stage 2 we changed naming slightly from Stage 1, to better facilitate building various options.

Level 1 names, scale:
* `catch` indicates catchment-level data, for the local catchment only
* `reach` indicates reach-level data, which includes all upstream contributions

Level 2 names, model:
* `_base` indicates model baseline outputs (no conservation)
* `_rest` indicates model with restoration reductions
* `_prot` indicates model with protection projects, avoided loads
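
To make the convention concrete, here is a small sketch that combines each Level 1 scale with each Level 2 model suffix. It assumes the two levels are simply concatenated (for example `catch_base`); the helper `pa2_name` is hypothetical and not part of the `pollution_assessment` package.

```python
from itertools import product

# Level 1 (scale) and Level 2 (model) name parts from the lists above.
SCALES = ("catch", "reach")
MODELS = ("_base", "_rest", "_prot")


def pa2_name(scale: str, model: str) -> str:
    """Combine a scale prefix and a model suffix (hypothetical helper)."""
    if scale not in SCALES:
        raise ValueError(f"unknown scale: {scale!r}")
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    return scale + model


# Every scale/model combination, in a fixed order.
all_names = [pa2_name(s, m) for s, m in product(SCALES, MODELS)]
print(all_names)
```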

**Clusters** are geographic priority areas, which include parts of the pristine headwaters and working forests of the upper watershed; farmlands, suburbs, and industrial and urban centers downstream; and the coastal plain, where the river and emerging groundwater empty into either the Delaware Bay or the Atlantic Coast.

There are 8 clusters in the DRB:
- Poconos-Kittatinny
- Upper Lehigh
- New Jersey Highlands
- Middle Schuylkill
- Schuylkill Highlands
- Upstream Suburban Philadelphia
- Brandywine-Christina
- Kirkwood-Cohansey Aquifer

**Focus areas** are smaller geographic units within clusters. 

# Create Catch & Reach Files for COMIDs & Geographies
Read Stage 1 dataframes to get the COMIDs & geometries for NHD+v2 catchments and reaches, as a foundation for all Stage 2 work.

- Background: stage1/WikiSRAT_AnalysisViz.ipynb
- Parquet to GeoDataFrame: https://geopandas.readthedocs.io/en/latest/docs/reference/api/geopandas.read_parquet.html

In [8]:
%%time
# read data from parquet files
catch_gdf = gpd.read_parquet(project_path / pa1_data_folder /'base_df_catch.parquet')
reach_gdf = gpd.read_parquet(project_path / pa1_data_folder /'base_df_reach.parquet')

CPU times: user 1.18 s, sys: 129 ms, total: 1.31 s
Wall time: 1.29 s


In [9]:
catch_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 1748535 to 932040370
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   tp_load             19496 non-null  float64 
 1   tn_load             19496 non-null  float64 
 2   tss_load            19496 non-null  float64 
 3   catchment_hectares  19496 non-null  float64 
 4   watershed_hectares  19496 non-null  float64 
 5   tp_loadrate_ws      19496 non-null  float64 
 6   tn_loadrate_ws      19496 non-null  float64 
 7   tss_loadrate_ws     19496 non-null  float64 
 8   maflowv             19496 non-null  float64 
 9   geom_catchment      19496 non-null  geometry
 10  cluster             17358 non-null  category
 11  sub_focusarea       186 non-null    Int64   
 12  nord                18870 non-null  Int64   
 13  nordstop            18844 non-null  Int64   
 14  huc12               19496 non-null  category
 15  streamorder       

In [10]:
reach_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 1748535 to 932040370
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   tp_conc             16823 non-null  float64 
 1   tn_conc             16823 non-null  float64 
 2   tss_conc            16823 non-null  float64 
 3   catchment_hectares  19496 non-null  float64 
 4   watershed_hectares  19496 non-null  float64 
 5   maflowv             19496 non-null  float64 
 6   geom                19494 non-null  geometry
 7   cluster             17358 non-null  category
 8   sub_focusarea       186 non-null    Int64   
 9   nord                18870 non-null  Int64   
 10  nordstop            18844 non-null  Int64   
 11  huc12               19496 non-null  category
 12  streamorder         19496 non-null  int64   
 13  headwater           19496 non-null  int64   
 14  phase               4082 non-null   category
 15  fa_name           

In [11]:
reach_gdf.headwater.unique()

array([1, 0])

## Remove Stage 1 results

In [12]:
pa1_catch_vars = ['tp_load', 'tn_load', 'tss_load',
                  'tp_loadrate_ws', 'tn_loadrate_ws', 'tss_loadrate_ws',]
catch_gdf.drop(pa1_catch_vars, axis='columns', inplace=True)

In [13]:
pa1_reach_vars = ['tp_conc', 'tn_conc', 'tss_conc',]
reach_gdf.drop(pa1_reach_vars, axis='columns', inplace=True)
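
Because the drops above run in place, re-running those cells raises a `KeyError` once the columns are gone. The sketch below shows one possible alternative on a toy dataframe, using `errors="ignore"` so that the step can be repeated safely. The toy data and the names `toy_df`, `stage1_vars`, and `cleaned` are illustrative assumptions, not part of this notebook's data.

```python
import pandas as pd

# Toy stand-in for `catch_gdf`; the values are made up for illustration.
toy_df = pd.DataFrame(
    {"tp_load": [1.0, 2.0], "tn_load": [3.0, 4.0], "maflowv": [0.5, 0.7]},
    index=pd.Index([101, 102], name="comid"),
)
stage1_vars = ["tp_load", "tn_load"]

# errors="ignore" skips labels that are already gone, so re-running is safe.
cleaned = toy_df.drop(columns=stage1_vars, errors="ignore")
cleaned_again = cleaned.drop(columns=stage1_vars, errors="ignore")

print(list(cleaned.columns))
```

Returning a new dataframe also leaves the original untouched, which can help when comparing Stage 1 and Stage 2 values later.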

# Read Results from MMW-WikiSRAT
Read Stage 2 baseline results from MMW-WikiSRAT, which were run and saved separately for five different scenarios, differentiated by `run_group`, using the `stage2/wikiSRAT/run_srat_with_bmps.py` script by Sara Damiano.

Run groups:
- 'No restoration or protection'
- 'Direct WPF Restoration'
- 'Direct and Indirect WPF Restoration'
- 'All Restoration'
- 'Direct WPF Protection'

Python Docs:
- Read CSV to Pandas: 
  - Guide: https://pandas.pydata.org/docs/user_guide/io.html#csv-text-files 
  - Ref:   https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
- Style (4-space hanging indent; nothing on first line)
  - https://google.github.io/styleguide/pyguide.html#34-indentation
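
The CSV-reading pattern used in the cells below can be sketched on a small in-memory file. The rows here are made up and only mimic a few of the real columns. Passing `index_col=0` is an optional variation, not what this notebook does: it assumes the unnamed first column is a saved row index, and keeps it from appearing as an `Unnamed: 0` data column.

```python
import io

import pandas as pd

# Made-up rows that mimic part of the column layout of the WikiSRAT CSV exports.
csv_text = """,Source,TotalN,comid,run_group,with_attenuation
0,Total Local Load,12.5,101,No restoration or protection,True
1,Total Local Load,11.0,101,All Restoration,True
"""

toy_loads = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,  # treat the unnamed first column as the row index
    dtype={
        "Source": "category",
        "run_group": "category",
        "with_attenuation": bool,
    },
)
toy_loads.info()
```

Declaring repeated text columns as `category` at read time keeps memory use low for long tables with few distinct values.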

## Catchment Loads
- Loads are local loads from the land to the stream reach.
- Load units are `kg/y`.

In [14]:
# Catchment Total Loads
catch_loads = pd.read_csv(
    project_path / 'stage2/wikiSRAT/' / 'catchment_total_local_load.csv',
    dtype = {
        'Source': 'category',
        'huc': 'category',
        'gwlfe_endpoint': 'category',
        'run_group': 'category',
        'funding_sources': 'category',
        'with_attenuation': bool,
    }
)
catch_loads.info()  # Not setting index, as each `comid` has 5 runs

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109200 entries, 0 to 109199
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        109200 non-null  int64   
 1   Source            109200 non-null  category
 2   Sediment          109200 non-null  float64 
 3   TotalN            109200 non-null  float64 
 4   TotalP            109200 non-null  float64 
 5   comid             109200 non-null  int64   
 6   huc               109200 non-null  category
 7   gwlfe_endpoint    109200 non-null  category
 8   huc_level         109200 non-null  int64   
 9   run_group         109200 non-null  category
 10  funding_sources   87360 non-null   category
 11  with_attenuation  109200 non-null  bool    
dtypes: bool(1), category(5), float64(3), int64(3)
memory usage: 5.7 MB


In [15]:
# Confirm only total local loads
catch_loads.Source.unique()

['Total Local Load']
Categories (1, object): ['Total Local Load']

In [16]:
# List Run Groups
catch_loads.run_group.unique()

['No restoration or protection', 'Direct WPF Restoration', 'Direct and Indirect WPF Restoration', 'All Restoration', 'Direct WPF Protection']
Categories (5, object): ['All Restoration', 'Direct WPF Protection', 'Direct WPF Restoration', 'Direct and Indirect WPF Restoration', 'No restoration or protection']
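
Since each `comid` appears once per run group, the table is in "long" format. One possible way to compare run groups side by side is to pivot to one column per run group, sketched below with made-up values. The names `toy_long`, `toy_wide`, and `reduction` are illustrative, and this is not necessarily how later notebooks combine the runs.

```python
import pandas as pd

# Made-up long-format rows: one row per (comid, run_group) pair.
toy_long = pd.DataFrame(
    {
        "comid": [101, 101, 102, 102],
        "run_group": [
            "No restoration or protection", "All Restoration",
            "No restoration or protection", "All Restoration",
        ],
        "TotalP": [10.0, 8.0, 5.0, 5.0],
    }
)

# One row per comid, one column per run group.
toy_wide = toy_long.pivot(index="comid", columns="run_group", values="TotalP")

# Difference between the baseline run and the restoration run, per comid.
reduction = toy_wide["No restoration or protection"] - toy_wide["All Restoration"]
print(reduction)
```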

In [17]:
# List Funding Sources per run group
catch_loads.funding_sources.unique()

[NaN, 'Delaware River Restoration Fund', 'Delaware River Restoration Fund, Delaware Riv..., 'Delaware River Restoration Fund, Delaware Riv..., 'Delaware River Watershed Protection Fund - Fo...]
Categories (4, object): ['Delaware River Restoration Fund', 'Delaware River Restoration Fund, Delaware Riv..., 'Delaware River Restoration Fund, Delaware Riv..., 'Delaware River Watershed Protection Fund - Fo...]

In [20]:
# Catchment loads by source (point sources are selected below)
catch_loads_sources = pd.read_csv(
    project_path / 'stage2/wikiSRAT/catchment_sources_local_load.csv',
    dtype = {
        'Source': 'category',
        'huc': 'category',
        'gwlfe_endpoint': 'category',
        'run_group': 'category',
        'funding_sources': 'category',
        'with_attenuation': bool,
    }
)
catch_loads_sources.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371280 entries, 0 to 371279
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        371280 non-null  int64   
 1   Source            371280 non-null  category
 2   Sediment          279948 non-null  float64 
 3   TotalN            367308 non-null  float64 
 4   TotalP            367308 non-null  float64 
 5   comid             371280 non-null  int64   
 6   huc               371280 non-null  category
 7   gwlfe_endpoint    371280 non-null  category
 8   huc_level         371280 non-null  int64   
 9   run_group         371280 non-null  category
 10  funding_sources   0 non-null       category
 11  with_attenuation  371280 non-null  bool    
dtypes: bool(1), category(5), float64(3), int64(3)
memory usage: 19.5 MB


In [21]:
list(catch_loads_sources.Source.unique())

['Barren Areas',
 'Cropland',
 'Farm Animals',
 'Hay/Pasture',
 'High-Density Mixed',
 'Low-Density Mixed',
 'Low-Density Open Space',
 'Medium-Density Mixed',
 'Open Land',
 'Point Sources',
 'Reach Concentration',
 'Septic Systems',
 'Stream Bank Erosion',
 'Subsurface Flow',
 'Total Local Load',
 'Wetlands',
 'Wooded Areas']

In [22]:
# Copy the selection so the later in-place `set_index` acts on an independent dataframe
catch_loads_ps = catch_loads_sources.loc[catch_loads_sources.Source == 'Point Sources'].copy()
catch_loads_ps.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21840 entries, 9 to 371272
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   Unnamed: 0        21840 non-null  int64   
 1   Source            21840 non-null  category
 2   Sediment          0 non-null      float64 
 3   TotalN            21840 non-null  float64 
 4   TotalP            21840 non-null  float64 
 5   comid             21840 non-null  int64   
 6   huc               21840 non-null  category
 7   gwlfe_endpoint    21840 non-null  category
 8   huc_level         21840 non-null  int64   
 9   run_group         21840 non-null  category
 10  funding_sources   0 non-null      category
 11  with_attenuation  21840 non-null  bool    
dtypes: bool(1), category(5), float64(3), int64(3)
memory usage: 1.3 MB


In [23]:
# Confirm that there are no duplicate comids (i.e. the unique count equals the number of rows)
catch_loads_ps.comid.unique().size

21840
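
The same check can be written more directly with pandas' built-in uniqueness helpers. This is a minimal sketch on made-up COMIDs; `unique_ids` and `repeated_ids` are illustrative names.

```python
import pandas as pd

# Made-up COMIDs; the second series contains a repeat on purpose.
unique_ids = pd.Series([101, 102, 103], name="comid")
repeated_ids = pd.Series([101, 102, 102], name="comid")

# `is_unique` is True only when no value occurs more than once.
print(unique_ids.is_unique)
print(repeated_ids.is_unique)

# duplicated(keep=False) flags every member of a repeated group.
repeated_values = repeated_ids[repeated_ids.duplicated(keep=False)].tolist()
print(repeated_values)
```

Unlike comparing counts by eye, `is_unique` returns a boolean that can be used in an `assert` before setting the index.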

In [24]:
catch_loads_ps.set_index('comid', inplace=True)

In [25]:
catch_loads_ps.index

Int64Index([ 2612780,  2612782,  2612792,  2612794,  2612920,  2612948,
             2612950,  2612952,  2612790,  2612800,
            ...
             9889028,  9889030,  9889032,  9889034,  9889036,  9891532,
            10466473, 10466475, 10466691, 27081103],
           dtype='int64', name='comid', length=21840)

## Reach Concentrations
- Concentrations are average annual concentrations leaving the stream reach, including all upstream loads and attenuation (if turned on).
- Concentration units are `mg/L`.

In [26]:
# Reach total concentrations
reach_concs = pd.read_csv(
    project_path / 'stage2/wikiSRAT/reach_concentrations.csv',
    dtype = {
        'Source': 'category',
        'huc': 'category',
        'gwlfe_endpoint': 'category',
        'run_group': 'category',
        'funding_sources': 'category',
        'with_attenuation': bool,
    }
)
reach_concs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109200 entries, 0 to 109199
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        109200 non-null  int64   
 1   Source            109200 non-null  category
 2   Sediment          89340 non-null   float64 
 3   TotalN            89340 non-null   float64 
 4   TotalP            89340 non-null   float64 
 5   comid             109200 non-null  int64   
 6   huc               109200 non-null  category
 7   gwlfe_endpoint    109200 non-null  category
 8   huc_level         109200 non-null  int64   
 9   run_group         109200 non-null  category
 10  funding_sources   87360 non-null   category
 11  with_attenuation  109200 non-null  bool    
dtypes: bool(1), category(5), float64(3), int64(3)
memory usage: 5.7 MB


In [27]:
# Confirm only reach concentrations
reach_concs.Source.unique()

['Reach Concentration']
Categories (1, object): ['Reach Concentration']

In [28]:
# List Run Groups
reach_concs.run_group.unique()

['No restoration or protection', 'Direct WPF Restoration', 'Direct and Indirect WPF Restoration', 'All Restoration', 'Direct WPF Protection']
Categories (5, object): ['All Restoration', 'Direct WPF Protection', 'Direct WPF Restoration', 'Direct and Indirect WPF Restoration', 'No restoration or protection']

In [30]:
# Reach Point Source Concentrations
reach_concs_ps_runs = pd.read_csv(
    project_path / 'stage2/wikiSRAT/reach_pt_source_conc.csv',
    dtype = {
        'Source': 'category',
        'huc': 'category',
        'gwlfe_endpoint': 'category',
        'run_group': 'category',
        'funding_sources': 'category',
        'with_attenuation': bool,
    }
)
reach_concs_ps_runs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109200 entries, 0 to 109199
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        109200 non-null  int64   
 1   Source            109200 non-null  category
 2   Sediment          89340 non-null   float64 
 3   TotalN            89340 non-null   float64 
 4   TotalP            89340 non-null   float64 
 5   comid             109200 non-null  int64   
 6   huc               109200 non-null  category
 7   gwlfe_endpoint    109200 non-null  category
 8   huc_level         109200 non-null  int64   
 9   run_group         109200 non-null  category
 10  funding_sources   87360 non-null   category
 11  with_attenuation  109200 non-null  bool    
dtypes: bool(1), category(5), float64(3), int64(3)
memory usage: 5.7 MB


In [31]:
# Copy the selection so the later in-place `set_index` acts on an independent dataframe
reach_concs_ps = reach_concs_ps_runs.loc[reach_concs_ps_runs.run_group == 'No restoration or protection'].copy()
reach_concs_ps.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21840 entries, 0 to 109195
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   Unnamed: 0        21840 non-null  int64   
 1   Source            21840 non-null  category
 2   Sediment          17868 non-null  float64 
 3   TotalN            17868 non-null  float64 
 4   TotalP            17868 non-null  float64 
 5   comid             21840 non-null  int64   
 6   huc               21840 non-null  category
 7   gwlfe_endpoint    21840 non-null  category
 8   huc_level         21840 non-null  int64   
 9   run_group         21840 non-null  category
 10  funding_sources   0 non-null      category
 11  with_attenuation  21840 non-null  bool    
dtypes: bool(1), category(5), float64(3), int64(3)
memory usage: 1.3 MB


In [32]:
reach_concs_ps.set_index('comid', inplace=True)

In [33]:
reach_concs_ps.index

Int64Index([ 2612780,  2612782,  2612792,  2612794,  2612920,  2612948,
             2612950,  2612952,  2612790,  2612800,
            ...
             9889028,  9889030,  9889032,  9889034,  9889036,  9891532,
            10466473, 10466475, 10466691, 27081103],
           dtype='int64', name='comid', length=21840)

# Add Stage 2 data to Stage 1 dataframe