PA2 Notebook 1: Fetch Data
===


This is the first notebook for DRWI Pollution Assessment Stage 2 (PA2) analysis.
It fetches and prepares all the input data and modeling necessary for the Stage 2 Assessment.

The general data fetching pipeline is to:
- Run the Model My Watershed (MMW) Multi-Year Model (GWLF-E)
  - for every HUC12 in DRWI.
- Process MMW HUC12 results through the [WikiSRAT microservice](https://github.com/TheAcademyofNaturalSciences/WikiSRATMicroService) API.
  - WikiSRAT downscales HUC12 loads to NHD+v2 catchments and routes pollution through the NHD+v2 stream reach network.
  - WikiSRAT runs group MMW HUC12 results by HUC8 to properly route loads through the stream network.
  - Stream network routing includes attenuation due to physical and biological processes within surface waters. 
  - WikiSRAT gets run multiple times to simulate:
    - Baseline results, with no restoration or protection practices.
    - Restoration results, which include 3 run groups
    - Protection results
- Read, organize, and save combined results for PA2 analysis calculations (Notebook 2).
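
The HUC8 grouping step in the pipeline above can be sketched with pandas. This is a minimal illustration, not the project's actual run script: it assumes HUC12 codes are stored as zero-padded strings, so that the first 8 characters give the parent HUC8. The middle example code and the `huc8` name are made up for illustration.

```python
import pandas as pd

# Illustrative HUC12 codes; keep them as strings to preserve the leading zero.
huc12s = pd.Series(
    ['020401010101', '020401010102', '020403040501'], name='huc12'
)

# A HUC8 code is the first 8 digits of any HUC12 nested within it.
huc8s = huc12s.str[:8].rename('huc8')

# Collect the HUC12s that would be routed together in one run.
huc12_by_huc8 = huc12s.groupby(huc8s).apply(list).to_dict()
print(huc12_by_huc8)
```

Reading the codes as integers would silently drop the leading `0`, which is one reason the CSV reads later in this notebook declare `huc` as a `category` rather than letting pandas infer a numeric type.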

# Installation and Setup

Carefully follow our **[Installation Instructions](README.md#get-started)**, especially:
- Creating a virtual environment for this repository (step 3)

## Import Python Dependencies

In [1]:
from pathlib import Path

import pandas    as pd
import geopandas as gpd

# packages for data requests
import requests
from requests.auth import HTTPBasicAuth
import json

# To save to Parquet files
import pyarrow

In [2]:
!conda-develop /Users/aaufdenkampe/Documents/Python/pollution-assessment/src

path exists, skipping /Users/aaufdenkampe/Documents/Python/pollution-assessment/src
completed operation for: /Users/aaufdenkampe/Documents/Python/pollution-assessment/src


In [3]:
# Custom functions for Pollution Assessment
import pollution_assessment as pa
import pollution_assessment.calc
import pollution_assessment.plot
# Confirm that sub-modules are imported
dir(pa)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'calc',
 'plot']

## Set Paths to Input and Output Files with `pathlib`

Use the [pathlib](https://docs.python.org/3/library/pathlib.html) library (built into Python 3) to manage paths independently of OS or environment.

This blog post describes `pathlib`'s benefits relative to using the `os` library or manual approaches.
- https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f
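
As a small sketch of what `pathlib` offers, the example below joins path segments with the `/` operator. It uses the "pure" path classes so that it behaves the same on any OS; the file name is only an example.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The `/` operator joins path segments without hard-coding a separator.
relative = PurePosixPath('stage2') / 'data_output' / 'reach_gdf.parquet'
print(relative)

# The same segments rendered with Windows conventions.
windows = PureWindowsPath('stage2', 'data_output', 'reach_gdf.parquet')
print(windows)

# Trailing slashes are normalized away, so both spellings are the same path.
same_folder = PurePosixPath('stage2/data_output/') == PurePosixPath('stage2/data_output')
print(same_folder)
```

In the notebook cells, `Path` picks the right flavor for the current OS automatically, so the same `project_path / folder / 'file.parquet'` expressions work unchanged on Windows, macOS, and Linux.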

In [4]:
# Find your current working directory, which should be folder for this notebook.
Path.cwd()

PosixPath('/Users/aaufdenkampe/Documents/Python/pollution-assessment/stage2')

In [5]:
# Set your project directory to your local folder for your clone of this repository
project_path = Path.cwd().parent
project_path

PosixPath('/Users/aaufdenkampe/Documents/Python/pollution-assessment')

In [6]:
# Assign relative paths for data INPUT folders. 
# For folders, end with a slash character, `/`.
pa1_data_folder = Path('stage1/data/')
pa2_mmw_folder  = Path('stage2/DRB_GWLFE/mmw_results/')
pa2_wikisrat_folder = Path('stage2/wikiSRAT/')

In [7]:
# Assign relative paths for the data OUTPUT folder.
pa2_data_output_folder = Path('stage2/data_output')

# PA2 Names & Units

## PA2 File Naming Conventions
NOTE: For Stage 2 we changed naming slightly from Stage 1, to better facilitate building various options.

Level 1 names, scale:
* `reach` indicates reach-level data, which includes all upstream contributions
* `catch` indicates catchment-level data, for the local catchment only

Level 2 names, model:
* `_base` indicates model baseline outputs (no conservation)
* `_rest` indicates model with restoration reductions
* `_prot` indicates model with protection projects, avoided loads
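
To make the two naming levels concrete, here is a minimal sketch that enumerates every scale and model combination. It assumes names are formed by simple concatenation of a Level 1 and a Level 2 part, which is an illustration rather than a rule stated by the project.

```python
from itertools import product

scales = ['reach', 'catch']            # Level 1: spatial scale
models = ['_base', '_rest', '_prot']   # Level 2: model scenario

# Hypothetical composition: concatenate one name from each level.
names = [scale + model for scale, model in product(scales, models)]
print(names)
```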

**Clusters** are geographic priority areas. Together they span the pristine headwaters and working forests of the upper watershed; the farmlands, suburbs, and industrial and urban centers downstream; and the coastal plain, where the river and emerging groundwater empty into either the Delaware Bay or the Atlantic Coast.

There are 8 clusters in the DRB:
- Poconos-Kittatinny
- Upper Lehigh
- New Jersey Highlands
- Middle Schuylkill
- Schuylkill Highlands
- Upstream Suburban Philadelphia
- Brandywine-Christina
- Kirkwood-Cohansey Aquifer

**Focus areas** are smaller geographic units within clusters. 

# Create GeoDataframes for COMID Catchment & Reach Geometries
Read Stage 1 dataframes to get the COMIDs and geometries for NHD+v2 catchments and reaches, as a foundation for all Stage 2 work.

- Background: stage1/WikiSRAT_AnalysisViz.ipynb
- Parquet to GeoDataFrame: https://geopandas.readthedocs.io/en/latest/docs/reference/api/geopandas.read_parquet.html

In [8]:
%%time
# read data from parquet files
catch_gdf = gpd.read_parquet(project_path / pa1_data_folder /'base_df_catch.parquet')
reach_gdf = gpd.read_parquet(project_path / pa1_data_folder /'base_df_reach.parquet')

CPU times: user 1.13 s, sys: 105 ms, total: 1.24 s
Wall time: 1.22 s


In [9]:
catch_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 1748535 to 932040370
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   tp_load             19496 non-null  float64 
 1   tn_load             19496 non-null  float64 
 2   tss_load            19496 non-null  float64 
 3   catchment_hectares  19496 non-null  float64 
 4   watershed_hectares  19496 non-null  float64 
 5   tp_loadrate_ws      19496 non-null  float64 
 6   tn_loadrate_ws      19496 non-null  float64 
 7   tss_loadrate_ws     19496 non-null  float64 
 8   maflowv             19496 non-null  float64 
 9   geom_catchment      19496 non-null  geometry
 10  cluster             17358 non-null  category
 11  sub_focusarea       186 non-null    Int64   
 12  nord                18870 non-null  Int64   
 13  nordstop            18844 non-null  Int64   
 14  huc12               19496 non-null  category
 15  streamorder       

In [10]:
reach_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 1748535 to 932040370
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   tp_conc             16823 non-null  float64 
 1   tn_conc             16823 non-null  float64 
 2   tss_conc            16823 non-null  float64 
 3   catchment_hectares  19496 non-null  float64 
 4   watershed_hectares  19496 non-null  float64 
 5   maflowv             19496 non-null  float64 
 6   geom                19494 non-null  geometry
 7   cluster             17358 non-null  category
 8   sub_focusarea       186 non-null    Int64   
 9   nord                18870 non-null  Int64   
 10  nordstop            18844 non-null  Int64   
 11  huc12               19496 non-null  category
 12  streamorder         19496 non-null  int64   
 13  headwater           19496 non-null  int64   
 14  phase               4082 non-null   category
 15  fa_name           

In [11]:
reach_gdf.headwater.unique()

array([1, 0])

## Remove Stage 1 results

In [12]:
pa1_catch_vars = ['tp_load', 'tn_load', 'tss_load',
                  'tp_loadrate_ws', 'tn_loadrate_ws', 'tss_loadrate_ws',]
catch_gdf.drop(pa1_catch_vars, axis='columns', inplace=True)

In [13]:
pa1_reach_vars = ['tp_conc', 'tn_conc', 'tss_conc',]
reach_gdf.drop(pa1_reach_vars, axis='columns', inplace=True)

## Set CRS

In [14]:
# Check CRS, which appears preserved in Parquet file metadata.
catch_gdf.crs

<Derived Projected CRS: EPSG:32618>
Name: WGS 84 / UTM zone 18N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Between 78°W and 72°W, northern hemisphere between equator and 84°N, onshore and offshore. Bahamas. Canada - Nunavut; Ontario; Quebec. Colombia. Cuba. Ecuador. Greenland. Haiti. Jamaica. Panama. Turks and Caicos Islands. United States (USA). Venezuela.
- bounds: (-78.0, 0.0, -72.0, 84.0)
Coordinate Operation:
- name: UTM zone 18N
- method: Transverse Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [15]:
%%time
# Test method to reproject CRS to 3857, which is useful for visualization
catch_gdf.to_crs(epsg=3857, inplace=True)
reach_gdf.to_crs(epsg=3857, inplace=True)

CPU times: user 5.89 s, sys: 52.5 ms, total: 5.95 s
Wall time: 5.98 s


In [16]:
# Check CRS
catch_gdf.crs

<Derived Projected CRS: EPSG:3857>
Name: WGS 84 / Pseudo-Mercator
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: World between 85.06°S and 85.06°N.
- bounds: (-180.0, -85.06, 180.0, 85.06)
Coordinate Operation:
- name: Popular Visualisation Pseudo-Mercator
- method: Popular Visualisation Pseudo Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [17]:
reach_gdf.crs

<Derived Projected CRS: EPSG:3857>
Name: WGS 84 / Pseudo-Mercator
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: World between 85.06°S and 85.06°N.
- bounds: (-180.0, -85.06, 180.0, 85.06)
Coordinate Operation:
- name: Popular Visualisation Pseudo-Mercator
- method: Popular Visualisation Pseudo Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

# Read Results from MMW-WikiSRAT
Read Stage 2 results from MMW-WikiSRAT, which were run and saved separately for five different scenarios, differentiated by `run_group`, using the `stage2/wikiSRAT/run_srat_with_bmps.py` script by Sara Damiano.

Run groups
- 'No restoration or protection', 
- 'Direct WPF Restoration', 
- 'Direct and Indirect WPF Restoration', 
- 'All Restoration', 
- 'Direct WPF Protection'

Python Docs:
- Read CSV to Pandas: 
  - Guide: https://pandas.pydata.org/docs/user_guide/io.html#csv-text-files 
  - Ref:   https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
- Style (4-space hanging indent; nothing on first line)
  - https://google.github.io/styleguide/pyguide.html#34-indentation
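
The `read_csv` options used in the cells of this section can be sketched on a tiny inline CSV. The CSV text here is made up for illustration; only the column names mirror the WikiSRAT files. The sketch also shows `index_col=0`, an alternative (not used in this notebook) that reads a saved row index back as the index instead of producing an `Unnamed: 0` column.

```python
import io
import pandas as pd

# Tiny stand-in for a WikiSRAT CSV that was written with its row index.
csv_text = """,Source,TotalN,comid,huc,run_group
0,Total Local Load,413.59,2612780,020401010101,No restoration or protection
1,Total Local Load,413.59,2612780,020401010101,Direct WPF Restoration
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,  # use the saved row index instead of creating 'Unnamed: 0'
    dtype={
        'Source': 'category',
        'huc': 'category',  # parsed as text, so the leading zero is kept
        'run_group': 'category',
    },
)
print(df.dtypes)
print(df['huc'].iloc[0])
```

Declaring the repetitive text columns as `category` keeps memory use low, since each distinct label is stored once and rows hold only small integer codes.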

## Reach Concentrations
- Concentrations are average annual concentrations leaving the stream reach, including all upstream loads and attenuation (if turned on).
- Concentration units are `mg/L`.

In [29]:
# Reach total concentrations
reach_concs = pd.read_csv(
    project_path / 'stage2/wikiSRAT/reach_concentrations.csv',
    dtype = {
        'Source': 'category',
        'huc': 'category',
        'gwlfe_endpoint': 'category',
        'run_group': 'category',
        'funding_sources': 'category',
        'with_attenuation': bool,
    }
)
reach_concs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109200 entries, 0 to 109199
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        109200 non-null  int64   
 1   Source            109200 non-null  category
 2   Sediment          89340 non-null   float64 
 3   TotalN            89340 non-null   float64 
 4   TotalP            89340 non-null   float64 
 5   comid             109200 non-null  int64   
 6   huc               109200 non-null  category
 7   gwlfe_endpoint    109200 non-null  category
 8   huc_level         109200 non-null  int64   
 9   run_group         109200 non-null  category
 10  funding_sources   87360 non-null   category
 11  with_attenuation  109200 non-null  bool    
dtypes: bool(1), category(5), float64(3), int64(3)
memory usage: 5.7 MB


In [30]:
# Confirm only total concentration
reach_concs.Source.unique()

['Reach Concentration']
Categories (1, object): ['Reach Concentration']

In [31]:
# List Run Groups
reach_concs.run_group.unique()

['No restoration or protection', 'Direct WPF Restoration', 'Direct and Indirect WPF Restoration', 'All Restoration', 'Direct WPF Protection']
Categories (5, object): ['All Restoration', 'Direct WPF Protection', 'Direct WPF Restoration', 'Direct and Indirect WPF Restoration', 'No restoration or protection']

In [32]:
# Reach Point Source Concentrations
reach_concs_ps_runs = pd.read_csv(
    project_path / 'stage2/wikiSRAT/reach_pt_source_conc.csv',
    dtype = {
        'Source': 'category',
        'huc': 'category',
        'gwlfe_endpoint': 'category',
        'run_group': 'category',
        'funding_sources': 'category',
        'with_attenuation': bool,
    }
)
reach_concs_ps_runs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109200 entries, 0 to 109199
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        109200 non-null  int64   
 1   Source            109200 non-null  category
 2   Sediment          89340 non-null   float64 
 3   TotalN            89340 non-null   float64 
 4   TotalP            89340 non-null   float64 
 5   comid             109200 non-null  int64   
 6   huc               109200 non-null  category
 7   gwlfe_endpoint    109200 non-null  category
 8   huc_level         109200 non-null  int64   
 9   run_group         109200 non-null  category
 10  funding_sources   87360 non-null   category
 11  with_attenuation  109200 non-null  bool    
dtypes: bool(1), category(5), float64(3), int64(3)
memory usage: 5.7 MB


In [33]:
# Confirm Source = 'Point Source Derived Concentration'
reach_concs_ps_runs.Source.unique()

['Point Source Derived Concentration']
Categories (1, object): ['Point Source Derived Concentration']

In [34]:
# NOTE that this contains all run groups
# We only need baseline results, as all have identical values
reach_concs_ps_runs.run_group.unique()

['No restoration or protection', 'Direct WPF Restoration', 'Direct and Indirect WPF Restoration', 'All Restoration', 'Direct WPF Protection']
Categories (5, object): ['All Restoration', 'Direct WPF Protection', 'Direct WPF Restoration', 'Direct and Indirect WPF Restoration', 'No restoration or protection']
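
The claim that point-source values are identical across run groups can be checked before discarding the other groups. The sketch below shows one way to do that on made-up rows; the column names follow the WikiSRAT tables, but the values are illustrative only.

```python
import pandas as pd

# Illustrative rows: two reaches, each repeated for two run groups.
ps = pd.DataFrame({
    'comid':     [2612780, 2612780, 9891532, 9891532],
    'run_group': ['No restoration or protection', 'All Restoration'] * 2,
    'TotalN':    [0.12, 0.12, 0.0, 0.0],
    'TotalP':    [0.03, 0.03, 0.0, 0.0],
})

# Count distinct values per reach. `dropna=False` counts NaN as a value,
# so a reach that is NaN in every run group still counts as consistent.
n_distinct = ps.groupby('comid')[['TotalN', 'TotalP']].nunique(dropna=False)

# A count of 1 everywhere means every run group agrees for every reach.
identical_across_runs = bool((n_distinct == 1).all().all())
print(identical_across_runs)
```

If this check returned `False` on the real table, filtering to the baseline run group alone would discard information, so it is worth running once whenever the WikiSRAT outputs are regenerated.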

In [35]:
%%time
# Concatenate baseline Point Source concentrations to the main dataframe
frames = [reach_concs,
          reach_concs_ps_runs.loc[reach_concs_ps_runs.run_group == 'No restoration or protection'],
         ]
reach_concs_df = pd.concat(frames)
reach_concs_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 131040 entries, 0 to 109195
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        131040 non-null  int64   
 1   Source            131040 non-null  object  
 2   Sediment          107208 non-null  float64 
 3   TotalN            107208 non-null  float64 
 4   TotalP            107208 non-null  float64 
 5   comid             131040 non-null  int64   
 6   huc               131040 non-null  category
 7   gwlfe_endpoint    131040 non-null  category
 8   huc_level         131040 non-null  int64   
 9   run_group         131040 non-null  category
 10  funding_sources   87360 non-null   category
 11  with_attenuation  131040 non-null  bool    
dtypes: bool(1), category(4), float64(3), int64(3), object(1)
memory usage: 8.8+ MB
CPU times: user 29.8 ms, sys: 2.48 ms, total: 32.3 ms
Wall time: 31.3 ms
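
Note that `Source` is no longer a `category` column after `pd.concat` (it shows as `object` in the output above). This happens when the frames being combined carry different category sets. The sketch below reproduces the effect on toy data and shows two common ways to get a categorical back; it is an optional refinement, not something this notebook does.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.DataFrame({'Source': pd.Categorical(['Reach Concentration'] * 2)})
b = pd.DataFrame({'Source': pd.Categorical(['Point Source Derived Concentration'])})

# The category sets differ, so concat falls back to a non-categorical dtype.
combined = pd.concat([a, b], ignore_index=True)
dtype_after_concat = combined['Source'].dtype
print(dtype_after_concat)

# Option 1: cast back to category after concatenating.
combined['Source'] = combined['Source'].astype('category')

# Option 2: build the combined categorical directly from the two columns.
merged = union_categoricals([a['Source'], b['Source']])
print(merged.categories.tolist())
```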


In [36]:
# Confirm two source types
reach_concs_df.Source.value_counts()

Reach Concentration                   109200
Point Source Derived Concentration     21840
Name: Source, dtype: int64

In [37]:
# Drop the redundant index column left over from the CSV export
reach_concs_df.drop('Unnamed: 0', axis='columns', inplace=True)

In [38]:
reach_concs_df

Unnamed: 0,Source,Sediment,TotalN,TotalP,comid,huc,gwlfe_endpoint,huc_level,run_group,funding_sources,with_attenuation
0,Reach Concentration,17.456653,0.240183,0.022851,2612780,020401010101,wikiSRAT,12,No restoration or protection,,True
1,Reach Concentration,17.456653,0.240183,0.022851,2612780,020401010101,wikiSRAT,12,Direct WPF Restoration,Delaware River Restoration Fund,True
2,Reach Concentration,17.456653,0.240183,0.022851,2612780,020401010101,wikiSRAT,12,Direct and Indirect WPF Restoration,"Delaware River Restoration Fund, Delaware Rive...",True
3,Reach Concentration,17.456559,0.240182,0.022851,2612780,020401010101,wikiSRAT,12,All Restoration,"Delaware River Restoration Fund, Delaware Rive...",True
4,Reach Concentration,17.456653,0.240183,0.022851,2612780,020401010101,wikiSRAT,12,Direct WPF Protection,Delaware River Watershed Protection Fund - For...,True
...,...,...,...,...,...,...,...,...,...,...,...
109175,Point Source Derived Concentration,0.000000,0.000000,0.000000,9891532,020403040501,wikiSRAT,12,No restoration or protection,,True
109180,Point Source Derived Concentration,0.000000,0.000000,0.000000,10466473,020403040501,wikiSRAT,12,No restoration or protection,,True
109185,Point Source Derived Concentration,0.000000,0.000000,0.000000,10466475,020403040501,wikiSRAT,12,No restoration or protection,,True
109190,Point Source Derived Concentration,0.000000,0.000000,0.000000,10466691,020403040501,wikiSRAT,12,No restoration or protection,,True


## Catchment Loads
- Loads are local loads from the land to the stream reach.
- Load units are `kg/y`.

In [18]:
# Catchment 'Total Local Load' for every COMID and Run Group
catch_loads = pd.read_csv(
    project_path / 'stage2/wikiSRAT/' / 'catchment_total_local_load.csv',
    dtype = {
        'Source': 'category',
        'huc': 'category',
        'gwlfe_endpoint': 'category',
        'run_group': 'category',
        'funding_sources': 'category',
        'with_attenuation': bool,
    }
)
catch_loads.info()  # Not setting index, as each `comid` has 5 runs

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109200 entries, 0 to 109199
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        109200 non-null  int64   
 1   Source            109200 non-null  category
 2   Sediment          109200 non-null  float64 
 3   TotalN            109200 non-null  float64 
 4   TotalP            109200 non-null  float64 
 5   comid             109200 non-null  int64   
 6   huc               109200 non-null  category
 7   gwlfe_endpoint    109200 non-null  category
 8   huc_level         109200 non-null  int64   
 9   run_group         109200 non-null  category
 10  funding_sources   87360 non-null   category
 11  with_attenuation  109200 non-null  bool    
dtypes: bool(1), category(5), float64(3), int64(3)
memory usage: 5.7 MB


In [19]:
# Confirm only total local loads
catch_loads.Source.unique()

['Total Local Load']
Categories (1, object): ['Total Local Load']

In [20]:
# List Run Groups
catch_loads.run_group.unique()

['No restoration or protection', 'Direct WPF Restoration', 'Direct and Indirect WPF Restoration', 'All Restoration', 'Direct WPF Protection']
Categories (5, object): ['All Restoration', 'Direct WPF Protection', 'Direct WPF Restoration', 'Direct and Indirect WPF Restoration', 'No restoration or protection']

In [21]:
# List Funding Sources per run group
catch_loads.funding_sources.unique()

[NaN, 'Delaware River Restoration Fund', 'Delaware River Restoration Fund, Delaware Riv..., 'Delaware River Restoration Fund, Delaware Riv..., 'Delaware River Watershed Protection Fund - Fo...]
Categories (4, object): ['Delaware River Restoration Fund', 'Delaware River Restoration Fund, Delaware Riv..., 'Delaware River Restoration Fund, Delaware Riv..., 'Delaware River Watershed Protection Fund - Fo...]

In [22]:
# Catchment loads by 'Source' for Run Group 0 (baseline)
catch_loads_sources = pd.read_csv(
    project_path / 'stage2/wikiSRAT/catchment_sources_local_load.csv',
    dtype = {
        'Source': 'category',
        'huc': 'category',
        'gwlfe_endpoint': 'category',
        'run_group': 'category',
        'funding_sources': 'category',
        'with_attenuation': bool,
    }
)
catch_loads_sources.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371280 entries, 0 to 371279
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        371280 non-null  int64   
 1   Source            371280 non-null  category
 2   Sediment          279948 non-null  float64 
 3   TotalN            367308 non-null  float64 
 4   TotalP            367308 non-null  float64 
 5   comid             371280 non-null  int64   
 6   huc               371280 non-null  category
 7   gwlfe_endpoint    371280 non-null  category
 8   huc_level         371280 non-null  int64   
 9   run_group         371280 non-null  category
 10  funding_sources   0 non-null       category
 11  with_attenuation  371280 non-null  bool    
dtypes: bool(1), category(5), float64(3), int64(3)
memory usage: 19.5 MB


In [23]:
# Confirm Run Group is for baseline results (i.e. 'No restoration or protection')
catch_loads_sources.run_group.unique()

['No restoration or protection']
Categories (1, object): ['No restoration or protection']

In [24]:
# Get Source names, for 'Point Sources'
list(catch_loads_sources.Source.unique())

['Barren Areas',
 'Cropland',
 'Farm Animals',
 'Hay/Pasture',
 'High-Density Mixed',
 'Low-Density Mixed',
 'Low-Density Open Space',
 'Medium-Density Mixed',
 'Open Land',
 'Point Sources',
 'Reach Concentration',
 'Septic Systems',
 'Stream Bank Erosion',
 'Subsurface Flow',
 'Total Local Load',
 'Wetlands',
 'Wooded Areas']

In [25]:
%%time
# Concatenate Point Sources data to main dataframe
frames = [catch_loads,
          catch_loads_sources.loc[catch_loads_sources.Source == 'Point Sources'],
         ]
catch_loads_df = pd.concat(frames)
catch_loads_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 131040 entries, 0 to 371272
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Unnamed: 0        131040 non-null  int64   
 1   Source            131040 non-null  object  
 2   Sediment          109200 non-null  float64 
 3   TotalN            131040 non-null  float64 
 4   TotalP            131040 non-null  float64 
 5   comid             131040 non-null  int64   
 6   huc               131040 non-null  category
 7   gwlfe_endpoint    131040 non-null  category
 8   huc_level         131040 non-null  int64   
 9   run_group         131040 non-null  object  
 10  funding_sources   87360 non-null   category
 11  with_attenuation  131040 non-null  bool    
dtypes: bool(1), category(3), float64(3), int64(3), object(2)
memory usage: 9.6+ MB
CPU times: user 45.3 ms, sys: 1.69 ms, total: 47 ms
Wall time: 46.2 ms


In [26]:
# Confirm two source types
catch_loads_df.Source.value_counts()

Total Local Load    109200
Point Sources        21840
Name: Source, dtype: int64

In [27]:
# Drop the redundant index column left over from the CSV export
catch_loads_df.drop('Unnamed: 0', axis='columns', inplace=True)

In [28]:
catch_loads_df

Unnamed: 0,Source,Sediment,TotalN,TotalP,comid,huc,gwlfe_endpoint,huc_level,run_group,funding_sources,with_attenuation
0,Total Local Load,30060.129179,413.59125,39.349002,2612780,020401010101,wikiSRAT,12,No restoration or protection,,True
1,Total Local Load,30060.129179,413.59125,39.349002,2612780,020401010101,wikiSRAT,12,Direct WPF Restoration,Delaware River Restoration Fund,True
2,Total Local Load,30060.129179,413.59125,39.349002,2612780,020401010101,wikiSRAT,12,Direct and Indirect WPF Restoration,"Delaware River Restoration Fund, Delaware Rive...",True
3,Total Local Load,30059.966727,413.59035,39.348603,2612780,020401010101,wikiSRAT,12,All Restoration,"Delaware River Restoration Fund, Delaware Rive...",True
4,Total Local Load,30060.129179,413.59125,39.349002,2612780,020401010101,wikiSRAT,12,Direct WPF Protection,Delaware River Watershed Protection Fund - For...,True
...,...,...,...,...,...,...,...,...,...,...,...
371204,Point Sources,,0.00000,0.000000,9891532,020403040501,wikiSRAT,12,No restoration or protection,,True
371221,Point Sources,,0.00000,0.000000,10466473,020403040501,wikiSRAT,12,No restoration or protection,,True
371238,Point Sources,,0.00000,0.000000,10466475,020403040501,wikiSRAT,12,No restoration or protection,,True
371255,Point Sources,,0.00000,0.000000,10466691,020403040501,wikiSRAT,12,No restoration or protection,,True


# Save Outputs to Parquet Files
The code below saves the data as local Parquet files, to avoid having to access the database every time we run the visualization script.

Apache Parquet has become the high-performance binary cloud format of choice for storing dataframes.
- https://pandas.pydata.org/docs/user_guide/io.html#io-parquet
- https://anaconda.org/TomAugspurger/pandas-performance/notebook
- https://geopandas.readthedocs.io/en/latest/docs/reference/api/geopandas.GeoDataFrame.to_parquet.html

NOTE: If you get a `UserWarning`, upgrade to GeoPandas >= 0.11 to get full support for GeoParquet.
- Release Notes: https://geopandas.org/en/stable/docs/changelog.html#version-0-11-june-20-2022

In [39]:
print("Geopandas: ", gpd.__version__)

Geopandas:  0.11.1


In [40]:
# Save Outputs to this folder
output_path = project_path / pa2_data_output_folder
output_path

PosixPath('/Users/aaufdenkampe/Documents/Python/pollution-assessment/stage2/data_output')

In [41]:
%%time
# Save GeoDataframes for Plotting by COMID
reach_gdf.to_parquet(output_path /'reach_gdf.parquet',compression='gzip')
catch_gdf.to_parquet(output_path /'catch_gdf.parquet',compression='gzip')

CPU times: user 4.42 s, sys: 59.1 ms, total: 4.48 s
Wall time: 4.49 s


In [42]:
%%time
# Save WikiSRAT results
reach_concs_df.to_parquet(output_path /'reach_concs_df.parquet',compression='gzip')
catch_loads_df.to_parquet(output_path /'catch_loads_df.parquet',compression='gzip')

CPU times: user 332 ms, sys: 8.42 ms, total: 340 ms
Wall time: 335 ms
