DRWI Pollution Assessment Stage 2  
Notebook 1: Fetch Data
===

This first notebook fetches and prepares all the input data and modeling necessary for the Stage 2 Assessment.

The general data analysis pipeline is to:
- Run Model My Watershed (MMW) Multi-Year Model (GWLF-E) for every HUC12 in DRWI.
- Process MMW HUC12 results through the [WikiSRAT microservice](https://github.com/TheAcademyofNaturalSciences/WikiSRATMicroService) API.
  - WikiSRAT downscales HUC12 loads to NHD+v2 catchments and routes pollution through the NHD+v2 stream reach network.
  - WikiSRAT runs group MMW HUC12 results by HUC8 to properly route loads through the stream network.
  - Stream network routing includes attenuation due to physical and biological processes within surface waters. 
  - WikiSRAT gets run multiple times to simulate:
    - Baseline results, with no restoration or protection practices.
    - Restoration results, which include the  

# Installation and Setup

Carefully follow our **[Installation Instructions](README.md#get-started)**, especially including:
- Creating a virtual environment for this repository (step 3)

## Import Python Dependencies

In [1]:
from pathlib import Path

import numpy     as np
import pandas    as pd
import geopandas as gpd

# packages for data requests
import requests
from requests.auth import HTTPBasicAuth
import json

In [2]:
print("Geopandas: ", gpd.__version__)
# print("spatialpandas: ", spd.__version__)
# print("datashader: ", ds.__version__)
# print("pygeos: ", pygeos.__version__)

Geopandas:  0.10.2


In [3]:
import sys
sys.path

['/Users/aaufdenkampe/Documents/Python/pollution-assessment/stage2',
 '/Users/aaufdenkampe/opt/anaconda3/envs/drwi_pa/lib/python39.zip',
 '/Users/aaufdenkampe/opt/anaconda3/envs/drwi_pa/lib/python3.9',
 '/Users/aaufdenkampe/opt/anaconda3/envs/drwi_pa/lib/python3.9/lib-dynload',
 '',
 '/Users/aaufdenkampe/opt/anaconda3/envs/drwi_pa/lib/python3.9/site-packages',
 '/Users/aaufdenkampe/Documents/Python/pollution-assessment/src']

In [4]:
!conda-develop /Users/aaufdenkampe/Documents/Python/pollution-assessment/src

path exists, skipping /Users/aaufdenkampe/Documents/Python/pollution-assessment/src
completed operation for: /Users/aaufdenkampe/Documents/Python/pollution-assessment/src


In [5]:
# Custom functions for Pollution Assessment
import pollution_assessment as pa
import pollution_assessment.plot
# Confirm that the `pa.plot` sub-module is imported
dir(pa)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'plot']


## Set Paths to Input and Output Files with `pathlib`

Use the [pathlib](https://docs.python.org/3/library/pathlib.html) library (built into Python 3) to manage paths independently of OS or environment.

This blog post describes `pathlib`'s benefits relative to using the `os` library or manual approaches.
- https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f

In [6]:
# Find your current working directory, which should be the folder for this notebook.
Path.cwd()

PosixPath('/Users/aaufdenkampe/Documents/Python/pollution-assessment/stage2')

In [7]:
# Set your project directory to your local folder for your clone of this repository
project_path = Path.cwd().parent
project_path

PosixPath('/Users/aaufdenkampe/Documents/Python/pollution-assessment')

In [8]:
# Assign relative paths for data folders. End with a slash character, `/`.
pa1_data_folder = Path('stage1/data/')
pa2_mmw_folder  = Path('stage2/DRB_GWLFE/mmw_results/')
pa2_mmw_rest_folder = Path('stage2/Restoration/')

# PA2 Names & Units

## PA2 File Naming Conventions
NOTE: For Stage 2 we changed naming slightly from Stage 1, to better facilitate building various options.

Level 1 names, scale:
* `catch` indicates catchment-level data, for the local catchment only
* `reach` indicates reach-level data, which includes all upstream contributions

Level 2 names, model:
* `_base` indicates model baseline outputs (no conservation)
* `_rest` indicates model with restoration reductions
* `_prot` indicates model with protection projects, avoided loads

**Clusters** are geographic priority areas, which include parts of pristine headwaters and working forests of the upper watershed, farmlands, suburbs, and industrial and urban centers downstream, and the coastal plain where the river and emerging groundwater empties into either the Delaware Bay or the Atlantic Coast.

There are 8 clusters in the DRB:
- Poconos-Kittatinny
- Upper Lehigh
- New Jersey Highlands
- Middle Schuylkill
- Schuylkill Highlands
- Upstream Suburban Philadelphia
- Brandywine-Christina
- Kirkwood-Cohansey Aquifer

**Focus areas** are smaller geographic units within clusters. 


## Correct WikiSRAT Name for Load

The [WikiSRAT microservice](https://github.com/TheAcademyofNaturalSciences/WikiSRATMicroService) output gives `tploadrate_total`, etc., which Sarah D. saved to the `catchment_load_rates.csv` file.

This name is misleading: the value is actually the **local** NHDPlus **catchment annual load (kg/ha)**.

Mike suspected this error and we confirmed it during Stage 1 by comparing to Model My Watershed subbasin modeling output for specific COMIDs.

Upstream loads are not returned from WikiSRAT, although they are calculated internally to generate average annual stream reach concentrations.
Upstream loads can therefore be calculated by multiplying average annual concentrations (mg/L) by mean annual flow (cfs) to get loads (kg/y), after applying unit conversions.

## Units for Loads & Concs

Loads are in kg/y.  
Concentrations are in mg/L.

From Mike:
- `Loadrate_total_ws` was my attempt to get the loadrate totals to kg per hectare
  - Fails if mean annual flow (`maflowv` (CFS)) doesn’t exist (-9999).
  - `loadrate_total_ws  = ((loadate_conc * 28.3168 * 31557600 / 1000000) * maflowv) / watershed_hectares`
    - `31557600 = 365.25 * 24 * 60 * 60`
    - 28.3168 liters in a cubic foot
    - 1000000 mg in kg

From: https://www.fws.gov/r5gomp/gom/nhd-gom/NHDPLUS_UserGuide.pdf
> MAFlowV: Mean Annual Flow (cfs) at bottom of flowline as computed by Vogel Method
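
The conversion described above can be sketched as a small helper. This is a minimal illustration that follows the constants in Mike's formula; the function name `upstream_load_kg_per_y` and the demo column names are hypothetical, and treating negative flows such as -9999 as missing is an assumption about how the sentinel should be handled.

```python
import pandas as pd

LITERS_PER_CUBIC_FOOT = 28.3168
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31557600
MG_PER_KG = 1_000_000


def upstream_load_kg_per_y(conc_mg_per_l: pd.Series, maflowv_cfs: pd.Series) -> pd.Series:
    """Convert mean annual concentration (mg/L) and flow (cfs) to load (kg/y)."""
    # Assumption: negative flows (e.g. the -9999 sentinel) mean "no data".
    valid_flow = maflowv_cfs.where(maflowv_cfs >= 0)
    liters_per_year = valid_flow * LITERS_PER_CUBIC_FOOT * SECONDS_PER_YEAR
    return conc_mg_per_l * liters_per_year / MG_PER_KG


# Hypothetical demo data, indexed by COMID like the notebook's dataframes.
demo = pd.DataFrame(
    {"tp_conc": [1.0, 0.5, 0.2], "maflowv": [1.0, 10.0, -9999.0]},
    index=pd.Index([101, 102, 103], name="comid"),
)
demo["tp_load_ws"] = upstream_load_kg_per_y(demo["tp_conc"], demo["maflowv"])
```

Dividing the result by `watershed_hectares` would give a load rate in kg/ha/y, matching the `loadrate_total_ws` expression quoted above.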

### Unit tracking options for future
Presently, we are not keeping track of metadata on units for data series because Pandas does not readily support it. There are approaches that we could implement in the future.

Options for the future:
- https://stackoverflow.com/questions/14688306/adding-meta-information-metadata-to-pandas-dataframe
- https://pandas.pydata.org/docs/reference/api/pandas.Series.attrs.html
- https://stackoverflow.com/questions/39419178/how-can-i-manage-units-in-pandas-data
- https://github.com/hgrecco/pint
- https://towardsdatascience.com/add-metadata-to-your-dataset-using-apache-parquet-75360d2073bd
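
As one illustration of the `attrs` option linked above, units could be stored in a dictionary on the dataframe. This is only a sketch: the `"units"` key and the `units_of` helper are hypothetical conventions, and pandas documents `attrs` as experimental, so the metadata may not survive every operation or file format.

```python
import pandas as pd

# Hypothetical demo data using the notebook's column naming.
loads = pd.DataFrame(
    {"tp_load": [1.2, 3.4], "tp_conc": [0.02, 0.10]},
    index=pd.Index([101, 102], name="comid"),
)

# Store units per column under a key of our own choosing.
loads.attrs["units"] = {"tp_load": "kg/y", "tp_conc": "mg/L"}


def units_of(df: pd.DataFrame, column: str) -> str:
    """Look up the recorded unit for a column, or 'unknown' if none is set."""
    return df.attrs.get("units", {}).get(column, "unknown")
```

A library such as `pint` goes further by attaching units to the values themselves, at the cost of an extra dependency.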

# Create Catch & Reach Files for COMIDs & Geographies
Read Stage 1 dataframes to get the COMIDs & geometries for NHD+v2 catchments and reaches, as a foundation for all Stage 2 work.

- Background: stage1/WikiSRAT_AnalysisViz.ipynb
- Parquet to GeoDataFrame: https://geopandas.readthedocs.io/en/latest/docs/reference/api/geopandas.read_parquet.html

In [9]:
%%time
# read data from parquet files
catch_base_gdf = gpd.read_parquet(project_path / pa1_data_folder /'base_df_catch.parquet')
reach_base_gdf = gpd.read_parquet(project_path / pa1_data_folder /'base_df_reach.parquet')

CPU times: user 1.36 s, sys: 174 ms, total: 1.53 s
Wall time: 1.58 s


In [10]:
catch_base_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 1748535 to 932040370
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   tp_load             19496 non-null  float64 
 1   tn_load             19496 non-null  float64 
 2   tss_load            19496 non-null  float64 
 3   catchment_hectares  19496 non-null  float64 
 4   watershed_hectares  19496 non-null  float64 
 5   tp_loadrate_ws      19496 non-null  float64 
 6   tn_loadrate_ws      19496 non-null  float64 
 7   tss_loadrate_ws     19496 non-null  float64 
 8   maflowv             19496 non-null  float64 
 9   geom_catchment      19496 non-null  geometry
 10  cluster             17358 non-null  category
 11  sub_focusarea       186 non-null    Int64   
 12  nord                18870 non-null  Int64   
 13  nordstop            18844 non-null  Int64   
 14  huc12               19496 non-null  category
 15  streamorder       

In [11]:
reach_base_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 1748535 to 932040370
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   tp_conc             16823 non-null  float64 
 1   tn_conc             16823 non-null  float64 
 2   tss_conc            16823 non-null  float64 
 3   catchment_hectares  19496 non-null  float64 
 4   watershed_hectares  19496 non-null  float64 
 5   maflowv             19496 non-null  float64 
 6   geom                19494 non-null  geometry
 7   cluster             17358 non-null  category
 8   sub_focusarea       186 non-null    Int64   
 9   nord                18870 non-null  Int64   
 10  nordstop            18844 non-null  Int64   
 11  huc12               19496 non-null  category
 12  streamorder         19496 non-null  int64   
 13  headwater           19496 non-null  int64   
 14  phase               4082 non-null   category
 15  fa_name           

## Remove Stage 1 results

In [12]:
pa1_catch_vars = ['tp_load', 'tn_load', 'tss_load',
                  'tp_loadrate_ws', 'tn_loadrate_ws', 'tss_loadrate_ws',]
catch_base_gdf.drop(pa1_catch_vars, axis='columns', inplace=True)

In [13]:
pa1_reach_vars = ['tp_conc', 'tn_conc', 'tss_conc',]
reach_base_gdf.drop(pa1_reach_vars, axis='columns', inplace=True)

# Read Baseline Results from MMW-WikiSRAT
Read Stage 2 baseline results from MMW-WikiSRAT, which were run and saved separately by Sara D.

- CSV to Pandas: 
  - Guide: https://pandas.pydata.org/docs/user_guide/io.html#csv-text-files 
  - Ref:   https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html



In [14]:

wikisrat_catchment_loads = pd.read_csv(project_path / pa2_mmw_folder /
                                            'catchment_loading_rates.csv',
                                            index_col = 'comid',
                                            dtype = {
                                                'Source':'category',
                                                'gwlfe_endpoint':'category',
                                                'huc_run_name':'category',
                                                'huc_run_states':'category',
                                                'land_use_source':'category',
                                                'closest_weather_stations':'category',
                                                'stream_layer':'category',
                                                'weather_source':'category',
                                            }
                                           )


In [15]:
wikisrat_catchment_loads.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21840 entries, 2612780 to 9891532
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Unnamed: 0                21840 non-null  int64   
 1   TotalN                    21840 non-null  float64 
 2   TotalP                    21840 non-null  float64 
 3   Sediment                  21840 non-null  float64 
 4   gwlfe_endpoint            21840 non-null  category
 5   huc                       21840 non-null  int64   
 6   huc_level                 21840 non-null  int64   
 7   huc_run                   21840 non-null  int64   
 8   huc_run_level             21840 non-null  int64   
 9   huc_run_name              21840 non-null  category
 10  huc_run_states            21840 non-null  category
 11  huc_run_areaacres         21840 non-null  float64 
 12  land_use_source           21840 non-null  category
 13  closest_weather_stations  21840 non-nu

In [16]:
wikisrat_catchment_concs = pd.read_csv(project_path / pa2_mmw_folder /
                                       'reach_concentrations.csv',
                                       index_col = 'comid',
                                       dtype = {
                                           'Source':'category',
                                           'gwlfe_endpoint':'category',
                                           'huc_run_name':'category',
                                           'huc_run_states':'category',
                                           'land_use_source':'category',
                                           'closest_weather_stations':'category',
                                           'stream_layer':'category',
                                           'weather_source':'category',
                                       }
                                      )

In [17]:
wikisrat_catchment_concs.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 21840 entries, 2612780 to 9891532
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Unnamed: 0                21840 non-null  int64   
 1   TotalN                    15834 non-null  float64 
 2   TotalP                    15834 non-null  float64 
 3   Sediment                  15834 non-null  float64 
 4   gwlfe_endpoint            21840 non-null  category
 5   huc                       21840 non-null  int64   
 6   huc_level                 21840 non-null  int64   
 7   huc_run                   21840 non-null  int64   
 8   huc_run_level             21840 non-null  int64   
 9   huc_run_name              21840 non-null  category
 10  huc_run_states            21840 non-null  category
 11  huc_run_areaacres         21840 non-null  float64 
 12  land_use_source           21840 non-null  category
 13  closest_weather_stations  21840 non-nu

In [18]:
# Explore categoricals
wikisrat_catchment_concs.weather_source.unique()

['NASA_NLDAS_2000_2019', 'USEPA_1960_1990']
Categories (2, object): ['NASA_NLDAS_2000_2019', 'USEPA_1960_1990']

## Add Stage 2 data to Stage 1 dataframe

In [19]:
catch_base_gdf['tp_load'] = wikisrat_catchment_loads.TotalP
catch_base_gdf['tn_load'] = wikisrat_catchment_loads.TotalN
catch_base_gdf['tss_load'] = wikisrat_catchment_loads.Sediment
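
The assignments above rely on pandas index alignment: values are matched by COMID, not by row position. The toy example below sketches that behavior with hypothetical data. Rows present only in the results are dropped, and rows missing from the results become NaN, which is why the non-null counts from `.info()` are worth checking afterwards.

```python
import pandas as pd

# Hypothetical stand-in for the Stage 1 geodataframe, indexed by COMID.
base = pd.DataFrame(
    {"catchment_hectares": [10.0, 20.0, 30.0]},
    index=pd.Index([1, 2, 3], name="comid"),
)

# Hypothetical stand-in for the WikiSRAT results, with a different COMID set.
results = pd.DataFrame(
    {"TotalP": [0.5, 0.7, 0.9]},
    index=pd.Index([2, 3, 4], name="comid"),
)

# Assignment aligns on the index: COMID 1 gets NaN, COMID 4 is ignored.
base["tp_load"] = results["TotalP"]

n_missing = int(base["tp_load"].isna().sum())
```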

In [20]:
catch_base_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 1748535 to 932040370
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   catchment_hectares  19496 non-null  float64 
 1   watershed_hectares  19496 non-null  float64 
 2   maflowv             19496 non-null  float64 
 3   geom_catchment      19496 non-null  geometry
 4   cluster             17358 non-null  category
 5   sub_focusarea       186 non-null    Int64   
 6   nord                18870 non-null  Int64   
 7   nordstop            18844 non-null  Int64   
 8   huc12               19496 non-null  category
 9   streamorder         19496 non-null  int64   
 10  headwater           19496 non-null  int64   
 11  phase               4082 non-null   category
 12  fa_name             4082 non-null   category
 13  tp_load             19496 non-null  float64 
 14  tn_load             19496 non-null  float64 
 15  tss_load          

In [21]:
reach_base_gdf['tp_conc2'] = wikisrat_catchment_concs.TotalP
reach_base_gdf['tn_conc2'] = wikisrat_catchment_concs.TotalN
reach_base_gdf['tss_conc2'] = wikisrat_catchment_concs.Sediment

In [22]:
reach_base_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 1748535 to 932040370
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   catchment_hectares  19496 non-null  float64 
 1   watershed_hectares  19496 non-null  float64 
 2   maflowv             19496 non-null  float64 
 3   geom                19494 non-null  geometry
 4   cluster             17358 non-null  category
 5   sub_focusarea       186 non-null    Int64   
 6   nord                18870 non-null  Int64   
 7   nordstop            18844 non-null  Int64   
 8   huc12               19496 non-null  category
 9   streamorder         19496 non-null  int64   
 10  headwater           19496 non-null  int64   
 11  phase               4082 non-null   category
 12  fa_name             4082 non-null   category
 13  tp_conc2            14712 non-null  float64 
 14  tn_conc2            14712 non-null  float64 
 15  tss_conc2         

## Baseline results by source

In [39]:

wikisrat_catch_base_loads_source = pd.read_csv(project_path / 'stage2/Restoration/' /
                                        'catchment_sources_local_load.csv',
                                        # index_col = 'comid',
                                        dtype = {
                                            'Source': 'category',
                                            'huc': 'category',
                                            'gwlfe_endpoint': 'category',
                                            'run_group': 'category',
                                            'with_attenuation': bool,
                                        }
                                       )
wikisrat_catch_base_loads_source.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371280 entries, 0 to 371279
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        371280 non-null  int64   
 1   Source            371280 non-null  category
 2   Sediment          279948 non-null  float64 
 3   TotalN            367308 non-null  float64 
 4   TotalP            367308 non-null  float64 
 5   comid             371280 non-null  int64   
 6   huc               371280 non-null  category
 7   gwlfe_endpoint    371280 non-null  category
 8   huc_level         371280 non-null  int64   
 9   run_group         371280 non-null  category
 10  funding_sources   0 non-null       float64 
 11  with_attenuation  371280 non-null  bool    
dtypes: bool(1), category(4), float64(4), int64(3)
memory usage: 22.0 MB
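
The `Unnamed: 0` column in the listing above is the usual sign of a CSV that was written with its row index included. The sketch below, using a small in-memory CSV with hypothetical values, shows how `index_col=0` reads that first column as the index instead. Whether that is desirable here depends on how the files were saved, which this notebook does not show.

```python
import io

import pandas as pd

# Hypothetical CSV text whose first header field is empty, as written by
# DataFrame.to_csv() when the index is included.
csv_text = ",Source,TotalP,comid\n0,Cropland,0.5,101\n1,Wetlands,0.1,101\n"

# Default read: the unnamed first column becomes 'Unnamed: 0'.
default_df = pd.read_csv(io.StringIO(csv_text))

# Reading the first column as the index avoids the extra column.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
```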


In [42]:
# Confirm Baseline only
wikisrat_catch_base_loads_source.run_group.unique()

['No restoration or protection']
Categories (1, object): ['No restoration or protection']

In [44]:
# Source types
wikisrat_catch_base_loads_source.Source.unique()

['Barren Areas', 'Cropland', 'Farm Animals', 'Hay/Pasture', 'High-Density Mixed', ..., 'Stream Bank Erosion', 'Subsurface Flow', 'Total Local Load', 'Wetlands', 'Wooded Areas']
Length: 17
Categories (17, object): ['Barren Areas', 'Cropland', 'Farm Animals', 'Hay/Pasture', ..., 'Subsurface Flow', 'Total Local Load', 'Wetlands', 'Wooded Areas']

# Read Restoration Results from MMW-WikiSRAT
Read Stage 2 restoration results from MMW-WikiSRAT, which were run and saved separately by Sara D.

In [50]:

wikisrat_catch_loads = pd.read_csv(project_path / 'stage2/Restoration/' /
                                        'catchment_total_local_load.csv',
                                        # index_col = 'comid',
                                        dtype = {
                                            'Source': 'category',
                                            'huc': 'category',
                                            'gwlfe_endpoint': 'category',
                                            'run_group': 'category',
                                            'funding_sources': 'category',
                                            'with_attenuation': bool,
                                        }
                                       )
wikisrat_catch_loads.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109200 entries, 0 to 109199
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        109200 non-null  int64   
 1   Source            109200 non-null  category
 2   Sediment          109200 non-null  float64 
 3   TotalN            109200 non-null  float64 
 4   TotalP            109200 non-null  float64 
 5   comid             109200 non-null  int64   
 6   huc               109200 non-null  category
 7   gwlfe_endpoint    109200 non-null  category
 8   huc_level         109200 non-null  int64   
 9   run_group         109200 non-null  category
 10  funding_sources   87360 non-null   category
 11  with_attenuation  109200 non-null  bool    
dtypes: bool(1), category(5), float64(3), int64(3)
memory usage: 5.7 MB


In [48]:
# Confirm only total local loads
wikisrat_catch_loads.Source.unique()

['Total Local Load']
Categories (1, object): ['Total Local Load']

In [49]:
# List Run Groups
wikisrat_catch_loads.run_group.unique()

['No restoration or protection', 'Direct WPF Restoration', 'Direct and Indirect WPF Restoration', 'All Restoration', 'Direct WPF Protection']
Categories (5, object): ['All Restoration', 'Direct WPF Protection', 'Direct WPF Restoration', 'Direct and Indirect WPF Restoration', 'No restoration or protection']
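
Because each COMID appears once per run group, one way to compare runs is to pivot the long table so that each run group becomes a column. The sketch below uses hypothetical values and only two of the run groups listed above; the `tp_reduction` column name is an assumption for illustration, not a name used elsewhere in this repository.

```python
import pandas as pd

# Hypothetical long-format data: one row per COMID and run group.
long_df = pd.DataFrame(
    {
        "comid": [101, 101, 102, 102],
        "run_group": ["No restoration or protection", "All Restoration"] * 2,
        "TotalP": [1.0, 0.8, 2.0, 2.0],
    }
)

# One row per COMID, one column per run group.
wide = long_df.pivot(index="comid", columns="run_group", values="TotalP")

# Load reduction relative to baseline (positive means restoration lowered the load).
wide["tp_reduction"] = wide["No restoration or protection"] - wide["All Restoration"]
```

In the real tables `run_group` is read as a categorical, so casting it to `str` before pivoting keeps the column labels as plain strings.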

In [55]:
# List Funding Sources
wikisrat_catch_loads.funding_sources.unique()

[NaN, 'Delaware River Restoration Fund', 'Delaware River Restoration Fund, Delaware Riv..., 'Delaware River Restoration Fund, Delaware Riv..., 'Delaware River Watershed Protection Fund - Fo...]
Categories (4, object): ['Delaware River Restoration Fund', 'Delaware River Restoration Fund, Delaware Riv..., 'Delaware River Restoration Fund, Delaware Riv..., 'Delaware River Watershed Protection Fund - Fo...]

In [51]:
wikisrat_reach_concs = pd.read_csv(project_path / 'stage2/Restoration/' /
                                        'reach_concentrations.csv',
                                        # index_col = 'comid',
                                        dtype = {
                                            'Source': 'category',
                                            'huc': 'category',
                                            'gwlfe_endpoint': 'category',
                                            'run_group': 'category',
                                            'funding_sources': 'category',
                                            'with_attenuation': bool,
                                        }
                                       )
wikisrat_reach_concs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109200 entries, 0 to 109199
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        109200 non-null  int64   
 1   Source            109200 non-null  category
 2   Sediment          89340 non-null   float64 
 3   TotalN            89340 non-null   float64 
 4   TotalP            89340 non-null   float64 
 5   comid             109200 non-null  int64   
 6   huc               109200 non-null  category
 7   gwlfe_endpoint    109200 non-null  category
 8   huc_level         109200 non-null  int64   
 9   run_group         109200 non-null  category
 10  funding_sources   87360 non-null   category
 11  with_attenuation  109200 non-null  bool    
dtypes: bool(1), category(5), float64(3), int64(3)
memory usage: 5.7 MB


In [52]:
wikisrat_reach_concs

Unnamed: 0.1,Unnamed: 0,Source,Sediment,TotalN,TotalP,comid,huc,gwlfe_endpoint,huc_level,run_group,funding_sources,with_attenuation
0,0,Reach Concentration,17.456653,0.240183,0.022851,2612780,020401010101,wikiSRAT,12,No restoration or protection,,True
1,1,Reach Concentration,17.456653,0.240183,0.022851,2612780,020401010101,wikiSRAT,12,Direct WPF Restoration,Delaware River Restoration Fund,True
2,2,Reach Concentration,17.456653,0.240183,0.022851,2612780,020401010101,wikiSRAT,12,Direct and Indirect WPF Restoration,"Delaware River Restoration Fund, Delaware Rive...",True
3,3,Reach Concentration,17.456559,0.240182,0.022851,2612780,020401010101,wikiSRAT,12,All Restoration,"Delaware River Restoration Fund, Delaware Rive...",True
4,4,Reach Concentration,17.456653,0.240183,0.022851,2612780,020401010101,wikiSRAT,12,Direct WPF Protection,Delaware River Watershed Protection Fund - For...,True
...,...,...,...,...,...,...,...,...,...,...,...,...
109195,109195,Reach Concentration,240.331181,2.089320,0.103850,27081103,020403040501,wikiSRAT,12,No restoration or protection,,True
109196,109196,Reach Concentration,240.331181,2.089320,0.103850,27081103,020403040501,wikiSRAT,12,Direct WPF Restoration,Delaware River Restoration Fund,True
109197,109197,Reach Concentration,240.331181,2.089320,0.103850,27081103,020403040501,wikiSRAT,12,Direct and Indirect WPF Restoration,"Delaware River Restoration Fund, Delaware Rive...",True
109198,109198,Reach Concentration,240.290098,2.089113,0.103779,27081103,020403040501,wikiSRAT,12,All Restoration,"Delaware River Restoration Fund, Delaware Rive...",True


In [53]:
# Point source derived conc
wikisrat_reach_concs_ps = pd.read_csv(project_path / 'stage2/Restoration/' /
                                        'reach_pt_source_conc.csv',
                                        # index_col = 'comid',
                                        dtype = {
                                            'Source': 'category',
                                            'huc': 'category',
                                            'gwlfe_endpoint': 'category',
                                            'run_group': 'category',
                                            'funding_sources': 'category',
                                            'with_attenuation': bool,
                                        }
                                       )
wikisrat_reach_concs_ps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109200 entries, 0 to 109199
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        109200 non-null  int64   
 1   Source            109200 non-null  category
 2   Sediment          89340 non-null   float64 
 3   TotalN            89340 non-null   float64 
 4   TotalP            89340 non-null   float64 
 5   comid             109200 non-null  int64   
 6   huc               109200 non-null  category
 7   gwlfe_endpoint    109200 non-null  category
 8   huc_level         109200 non-null  int64   
 9   run_group         109200 non-null  category
 10  funding_sources   87360 non-null   category
 11  with_attenuation  109200 non-null  bool    
dtypes: bool(1), category(5), float64(3), int64(3)
memory usage: 5.7 MB


In [54]:
wikisrat_reach_concs_ps

Unnamed: 0.1,Unnamed: 0,Source,Sediment,TotalN,TotalP,comid,huc,gwlfe_endpoint,huc_level,run_group,funding_sources,with_attenuation
0,0,Point Source Derived Concentration,0.0,0.0,0.0,2612780,020401010101,wikiSRAT,12,No restoration or protection,,True
1,1,Point Source Derived Concentration,0.0,0.0,0.0,2612780,020401010101,wikiSRAT,12,Direct WPF Restoration,Delaware River Restoration Fund,True
2,2,Point Source Derived Concentration,0.0,0.0,0.0,2612780,020401010101,wikiSRAT,12,Direct and Indirect WPF Restoration,"Delaware River Restoration Fund, Delaware Rive...",True
3,3,Point Source Derived Concentration,0.0,0.0,0.0,2612780,020401010101,wikiSRAT,12,All Restoration,"Delaware River Restoration Fund, Delaware Rive...",True
4,4,Point Source Derived Concentration,0.0,0.0,0.0,2612780,020401010101,wikiSRAT,12,Direct WPF Protection,Delaware River Watershed Protection Fund - For...,True
...,...,...,...,...,...,...,...,...,...,...,...,...
109195,109195,Point Source Derived Concentration,0.0,0.0,0.0,27081103,020403040501,wikiSRAT,12,No restoration or protection,,True
109196,109196,Point Source Derived Concentration,0.0,0.0,0.0,27081103,020403040501,wikiSRAT,12,Direct WPF Restoration,Delaware River Restoration Fund,True
109197,109197,Point Source Derived Concentration,0.0,0.0,0.0,27081103,020403040501,wikiSRAT,12,Direct and Indirect WPF Restoration,"Delaware River Restoration Fund, Delaware Rive...",True
109198,109198,Point Source Derived Concentration,0.0,0.0,0.0,27081103,020403040501,wikiSRAT,12,All Restoration,"Delaware River Restoration Fund, Delaware Rive...",True


# Other stuff....

In [27]:

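# Assumption: the sources file sits in the MMW results folder defined above.
# The original cell referenced `mmw_data_folder`, which is never defined in this notebook.
base_sourc_df = pd.read_csv(project_path / pa2_mmw_folder / 'wikisrat_catchment_sources.csv')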

In [5]:
%%time
# read data from parquet files
base_catch_gdf = gpd.read_parquet(data_folder /'base_df_catch.parquet')
base_reach_gdf = gpd.read_parquet(data_folder /'base_df_reach.parquet')

rest_catch_gdf = gpd.read_parquet(data_folder /'rest_df_catch.parquet')
rest_reach_gdf = gpd.read_parquet(data_folder /'rest_df_reach.parquet')

point_src_gdf = gpd.read_parquet(data_folder /'point_source_df.parquet')

proj_prot_gdf = gpd.read_parquet(data_folder /'prot_proj_df.parquet')
proj_rest_gdf = gpd.read_parquet(data_folder /'rest_proj_df.parquet')

cluster_gdf = gpd.read_parquet(data_folder /'cluster_df.parquet')   

mmw_huc12_loads_df = pd.read_parquet(data_folder /'mmw_huc12_loads_df.parquet')

Wall time: 2.2 s


In [6]:
focusarea_gdf = gpd.read_parquet(data_folder /'fa_phase2_df.parquet')
focusarea_gdf.cluster = focusarea_gdf.cluster.replace('Kirkwood Cohansey Aquifer', 'Kirkwood - Cohansey Aquifer') # update name for consistency with other files 
focusarea_gdf.set_index('name', inplace=True)

Follow this notebook with WikiSRAT_Analysis.ipynb for analysis of fetched data. 