DRWI Pollution Assessment Stage 2  
Notebook 1: Fetch Data
===

This first notebook fetches and prepares the input data and modeling results needed for the Stage 2 Assessment.

# Installation and Setup

Carefully follow our **[Installation Instructions](README.md#get-started)**, in particular:
- Creating a virtual environment for this repository (step 3)

## Import Python Dependencies

In [1]:
from pathlib import Path

import numpy     as np
import pandas    as pd
import geopandas as gpd

# packages for data requests
import requests
from requests.auth import HTTPBasicAuth
import json

In [2]:
print("Geopandas: ", gpd.__version__)
# print("spatialpandas: ", spd.__version__)
# print("datashader: ", ds.__version__)
# print("pygeos: ", pygeos.__version__)

Geopandas:  0.10.2


## Set Paths to Input and Output Files with `pathlib`

Use the [pathlib](https://docs.python.org/3/library/pathlib.html) library (built into Python 3) to manage paths independently of OS or environment.

This blog post describes `pathlib`'s benefits relative to using the `os` library or manual approaches.
- https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f
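
As a minimal sketch of the idea, the `/` operator joins path segments without hand-written separators. The folder and file names below are hypothetical, and `PurePosixPath` is used only so that the printed result is the same on every OS.

```python
from pathlib import Path, PurePosixPath

# Hypothetical folder and file names, used only for illustration.
project = PurePosixPath("my_project")
data_file = project / "data" / "example.parquet"

print(data_file)         # my_project/data/example.parquet
print(data_file.name)    # example.parquet
print(data_file.suffix)  # .parquet
print(data_file.parent)  # my_project/data

# In a notebook, use the concrete Path class, which selects the right
# flavor (PosixPath or WindowsPath) for the current OS.
here = Path.cwd()
print(here.is_absolute())
```

The same `/` joining works on `Path` objects, which is how the data folder paths in this notebook are combined with file names.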

In [3]:
# Find your current working directory, which should be the folder containing this notebook.
Path.cwd()

PosixPath('/Users/aaufdenkampe/Documents/Python/pollution-assessment/stage2')

In [13]:
# Set your project directory to your local folder for your clone of this repository
project_path = Path.cwd().parent
project_path

PosixPath('/Users/aaufdenkampe/Documents/Python/pollution-assessment')

In [14]:
# Assign relative paths for data folders. End with a slash character, `/`.
pa1_data_folder = Path('stage1/data/')
pa2_mmw_folder  = Path('stage2/DRB_GWLFE/mmw_results/')

# Naming Conventions
Info to help parse table names below:
* `base_` indicates model baseline outputs (no conservation)
* `rest_` indicates model with restoration reductions
* `prot_` indicates protection projects
* `catch` indicates catchment-level data
* `reach` indicates reach data
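
One way to make these conventions concrete is a small lookup that decodes a table name into its parts. This is only a sketch: `describe_table` is a hypothetical helper, not part of the repository, and it knows only the keywords listed above.

```python
# Keyword meanings copied from the naming conventions list.
PREFIXES = {
    "base": "model baseline outputs (no conservation)",
    "rest": "model with restoration reductions",
    "prot": "protection projects",
}
LEVELS = {
    "catch": "catchment-level data",
    "reach": "reach data",
}

def describe_table(name):
    """Return the convention descriptions matching parts of a table name."""
    parts = name.split("_")
    found = [PREFIXES[p] for p in parts if p in PREFIXES]
    found += [LEVELS[p] for p in parts if p in LEVELS]
    return found

print(describe_table("base_catch_gdf"))
print(describe_table("rest_reach_gdf"))
```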

**Clusters** are geographic units. Eight are included in the Delaware River Basin (DRB): Poconos-Kittatinny, Upper Lehigh, New Jersey Highlands, Middle Schuylkill, Schuylkill Highlands, Upstream Suburban Philadelphia, Brandywine-Christina, and Kirkwood-Cohansey Aquifer. These priority locations include parts of the pristine headwaters and working forests of the upper watershed; farmlands, suburbs, and industrial and urban centers downstream; and the coastal plain, where the river and emerging groundwater empty into either the Delaware Bay or the Atlantic Coast.

**Focus areas** are smaller geographic units within clusters. 

## Read Input Data Files


If reading a file raises an error, check that your working directory is the `stage2` folder.
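
A quick way to rule out this problem before reading any files is to check the name of the working directory. This is a sketch only; `in_expected_folder` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

def in_expected_folder(cwd, expected="stage2"):
    """Return True if the last component of `cwd` is the expected folder name."""
    return Path(cwd).name == expected

# Warn (rather than fail) if the notebook was started somewhere else.
if not in_expected_folder(Path.cwd()):
    print(f"Working directory is {Path.cwd()}; expected a folder named 'stage2'.")
```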

## Read Stage 1 Files for COMIDs & Geographies
- Background: stage1/WikiSRAT_AnalysisViz.ipynb
- Parquet to GeoDataFrame: https://geopandas.readthedocs.io/en/latest/docs/reference/api/geopandas.read_parquet.html

In [16]:
%%time
# read data from parquet files
base_catch_gdf = gpd.read_parquet(project_path / pa1_data_folder /'base_df_catch.parquet')
base_reach_gdf = gpd.read_parquet(project_path / pa1_data_folder /'base_df_reach.parquet')

CPU times: user 1.21 s, sys: 143 ms, total: 1.35 s
Wall time: 2.51 s


In [62]:
base_reach_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 1748535 to 932040370
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   tp_conc             16823 non-null  float64 
 1   tn_conc             16823 non-null  float64 
 2   tss_conc            16823 non-null  float64 
 3   catchment_hectares  19496 non-null  float64 
 4   watershed_hectares  19496 non-null  float64 
 5   maflowv             19496 non-null  float64 
 6   geom                19494 non-null  geometry
 7   cluster             17358 non-null  category
 8   sub_focusarea       186 non-null    Int64   
 9   nord                18870 non-null  Int64   
 10  nordstop            18844 non-null  Int64   
 11  huc12               19496 non-null  category
 12  streamorder         19496 non-null  int64   
 13  headwater           19496 non-null  int64   
 14  phase               4082 non-null   category
 15  fa_name           

## Read MMW Results
- CSV to Pandas: 
  - Guide: https://pandas.pydata.org/docs/user_guide/io.html#csv-text-files 
  - Ref:   https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

In [60]:

wikisrat_catchment_load_rates = pd.read_csv(project_path / pa2_mmw_folder /
                                            'wikisrat_catchment_load_rates.csv',
                                            index_col = 'comid',
                                            dtype = {
                                                'Source':'category',
                                                'gwlfe_endpoint':'category',
                                                'huc_run_name':'category',
                                                'huc_run_states':'category',
                                                'land_use_source':'category',
                                                'closest_weather_stations':'category',
                                                'stream_layer':'category',
                                                'weather_source':'category',
                                            }
                                           )


In [59]:
wikisrat_catchment_load_rates.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21840 entries, 2612780 to 27081103
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Unnamed: 0                21840 non-null  int64   
 1   Source                    21840 non-null  category
 2   Sediment                  21840 non-null  float64 
 3   TotalN                    21840 non-null  float64 
 4   TotalP                    21840 non-null  float64 
 5   huc                       21840 non-null  int64   
 6   gwlfe_endpoint            21840 non-null  category
 7   huc_level                 21840 non-null  int64   
 8   huc_run                   21840 non-null  int64   
 9   huc_run_level             21840 non-null  int64   
 10  huc_run_name              21840 non-null  category
 11  huc_run_states            21840 non-null  category
 12  huc_run_areaacres         21840 non-null  float64 
 13  land_use_source           21840 non-n

In [38]:
wikisrat_catchment_concs = pd.read_csv(project_path / pa2_mmw_folder /
                                       'wikisrat_catchment_concs.csv',
                                       index_col = 'comid',
                                       dtype = {
                                           'Source':'category',
                                           'gwlfe_endpoint':'category',
                                           'huc_run_name':'category',
                                           'huc_run_states':'category',
                                           'land_use_source':'category',
                                           'closest_weather_stations':'category',
                                           'stream_layer':'category',
                                           'weather_source':'category',
                                       }
                                      )

In [39]:
wikisrat_catchment_concs.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 21840 entries, 2612780 to 27081103
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Unnamed: 0                21840 non-null  int64   
 1   Source                    21840 non-null  category
 2   Sediment                  15834 non-null  float64 
 3   TotalN                    15834 non-null  float64 
 4   TotalP                    15834 non-null  float64 
 5   huc                       21840 non-null  int64   
 6   gwlfe_endpoint            21840 non-null  category
 7   huc_level                 21840 non-null  int64   
 8   huc_run                   21840 non-null  int64   
 9   huc_run_level             21840 non-null  int64   
 10  huc_run_name              21840 non-null  category
 11  huc_run_states            21840 non-null  category
 12  huc_run_areaacres         21840 non-null  float64 
 13  land_use_source           21840 non-n

In [47]:
# Explore categoricals
wikisrat_catchment_concs.weather_source.unique()

['NASA_NLDAS_2000_2019', 'USEPA_1960_1990']
Categories (2, object): ['NASA_NLDAS_2000_2019', 'USEPA_1960_1990']
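
To illustrate what the `index_col` and `dtype = {...: 'category'}` options do, here is a minimal sketch on a small in-memory CSV. The rows are made-up placeholder values, not MMW results; only the column names and the two weather source labels come from this notebook.

```python
import io
import pandas as pd

# Made-up placeholder rows for illustration only.
csv_text = """comid,TotalP,weather_source
101,0.5,NASA_NLDAS_2000_2019
102,0.7,USEPA_1960_1990
103,0.2,NASA_NLDAS_2000_2019
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="comid",                     # use comid as the row index
    dtype={"weather_source": "category"},  # store repeated labels compactly
)

print(df.dtypes)
# Distinct labels stored by the categorical column.
print(list(df["weather_source"].cat.categories))
# Number of rows per label.
print(df["weather_source"].value_counts())
```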

## Add Stage 2 data to Stage 1 dataframe

In [64]:
base_reach_gdf['tp_conc2'] = wikisrat_catchment_concs.TotalP
base_reach_gdf['tn_conc2'] = wikisrat_catchment_concs.TotalN
base_reach_gdf['tss_conc2'] = wikisrat_catchment_concs.Sediment
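
The assignments above rely on pandas index alignment: values are matched by index label (here, `comid`), not by row position, and labels with no match on the right-hand side become `NaN`. A minimal sketch with made-up values, assuming the right-hand index is unique:

```python
import pandas as pd

# Made-up values for illustration only.
reaches = pd.DataFrame({"tp_conc": [0.10, 0.20, 0.30]}, index=[1, 2, 3])
concs = pd.Series([0.25, 0.15], index=[2, 1], name="TotalP")

# This sketch assumes unique labels on the right-hand side.
assert concs.index.is_unique

# Values are matched on index label; label 3 has no match and becomes NaN.
reaches["tp_conc2"] = concs
print(reaches)
```

Comparing non-null counts before and after the assignment, as `info()` does, is a simple way to see how many rows found a match.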

In [65]:
base_reach_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 19496 entries, 1748535 to 932040370
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   tp_conc             16823 non-null  float64 
 1   tn_conc             16823 non-null  float64 
 2   tss_conc            16823 non-null  float64 
 3   catchment_hectares  19496 non-null  float64 
 4   watershed_hectares  19496 non-null  float64 
 5   maflowv             19496 non-null  float64 
 6   geom                19494 non-null  geometry
 7   cluster             17358 non-null  category
 8   sub_focusarea       186 non-null    Int64   
 9   nord                18870 non-null  Int64   
 10  nordstop            18844 non-null  Int64   
 11  huc12               19496 non-null  category
 12  streamorder         19496 non-null  int64   
 13  headwater           19496 non-null  int64   
 14  phase               4082 non-null   category
 15  fa_name           

# Additional Input Files

In [None]:

# Read catchment source data from the MMW results folder defined above.
base_sourc_df = pd.read_csv(project_path / pa2_mmw_folder / 'wikisrat_catchment_sources.csv')


In [5]:
%%time
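# read data from parquet files
# `data_folder` points to the Stage 1 data folder defined above.
data_folder = project_path / pa1_data_folder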
base_catch_gdf = gpd.read_parquet(data_folder /'base_df_catch.parquet')
base_reach_gdf = gpd.read_parquet(data_folder /'base_df_reach.parquet')

rest_catch_gdf = gpd.read_parquet(data_folder /'rest_df_catch.parquet')
rest_reach_gdf = gpd.read_parquet(data_folder /'rest_df_reach.parquet')

point_src_gdf = gpd.read_parquet(data_folder /'point_source_df.parquet')

proj_prot_gdf = gpd.read_parquet(data_folder /'prot_proj_df.parquet')
proj_rest_gdf = gpd.read_parquet(data_folder /'rest_proj_df.parquet')

cluster_gdf = gpd.read_parquet(data_folder /'cluster_df.parquet')   

mmw_huc12_loads_df = pd.read_parquet(data_folder /'mmw_huc12_loads_df.parquet')

Wall time: 2.2 s


In [6]:
focusarea_gdf = gpd.read_parquet(data_folder /'fa_phase2_df.parquet')
# Update the cluster name for consistency with other files.
focusarea_gdf.cluster = focusarea_gdf.cluster.replace('Kirkwood Cohansey Aquifer', 'Kirkwood - Cohansey Aquifer')
focusarea_gdf.set_index('name', inplace=True)

Follow this notebook with `WikiSRAT_Analysis.ipynb` to analyze the fetched data.