# Dataset Compilation
This script walks you through how to compile and clean a dataset. We will use the USGS dataRetrieval tool to compile flow and water quality data. We will then compile watershed characteristics.

## Set up
I use Anaconda as a package manager because it simplifies package management, dependencies, and deployment. I also like the built-in applications, including Spyder (my preferred IDE because its UI is most similar to RStudio and MATLAB) and Jupyter Notebooks.

Front end steps that are not shown: 
1. Set up your virtual environment. I do this in my Anaconda Prompt terminal 
    - conda create --name dataExploration
2. Activate the virtual environment
    - conda activate dataExploration
3. Download the required packages. 
    - conda install dataretrieval
    - ...

In [1]:
# Import all libraries at the top of your code so you can easily see and organize all the packages you are using.
import pandas as pd
import numpy as np
from dataretrieval import nwis, wqp
import os
from pathlib import Path

**TIP: File and folder organization**

My preferred approach to folder organization is to have 3 folders: 
- INPUT: All my raw input data. I will rarely save edited files here.
- OUTPUT: All my code outputs and intermediate files.
- CODE: All my code will be in this folder. This is the folder I push to git. I like to label my scripts based on their function. For example:
    - 'DATA' for scripts that organize data.
    - "ANA" for scripts that are used for analysis.
    - "FIG" for scripts that generate figures.
    - "MODEL" for model wrapper scripts.
    - "FUN" for functions that other scripts will call.

When calling files from the INPUTS or OUTPUT folders, absolute paths are more reliable and easier to debug. Relative paths are more flexible and often require less code. I consider absolute paths good practice, but ultimately it comes down to preference.
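One way to get absolute paths without hard-coding a user-specific location is `pathlib`, which is already imported above. This is only a sketch: it assumes the notebook is launched from the project folder, and the names `projectRoot`, `inputDir`, `outputDir`, and `basinFile` are illustrative.

```python
from pathlib import Path

# Assumption: the notebook is launched from the project folder.
projectRoot = Path.cwd().resolve()

inputDir = projectRoot / "INPUTS"
outputDir = projectRoot / "OUTPUT"

# Path objects join with "/" and handle separators on any operating system.
basinFile = inputDir / "Dataset1_BasinID" / "BasinID.txt"

print(basinFile)                # full absolute path, handy when debugging
print(basinFile.is_absolute())
```

`pd.read_csv` accepts a `Path` object directly, so `basinFile` could be passed in place of a concatenated string.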


In [2]:
inputDataFilepath ='C:/Users/danyk/Work/4_Data_Science/DataExplorationWorkshop/INPUTS/'
outputDataFilepath ='C:/Users/danyk/Work/4_Data_Science/DataExplorationWorkshop/OUTPUT/'

In [3]:
# First we want to get the list of watersheds we will be working with. Since we will want all our sites to have watershed characteristics, we will use that dataset to subset all available flow and water quality sites. 
MetricTable = pd.read_csv(inputDataFilepath+'Dataset1_BasinID/BasinID.txt', sep=",", dtype={'STAID': str})

# Always useful to check your data. In this case, it's important to note that 'STAID' needs to be read as a string because of its leading '0's.
MetricTable.head()

Unnamed: 0,STAID,STANAME,DRAIN_SQKM,HUC02,LAT_GAGE,LNG_GAGE,STATE,HCDN-2009,CLASS,AGGECOREGION
0,1011000,"Allagash River near Allagash, Maine",3186.8,1,47.069611,-69.079544,ME,,Non-ref,NorthEast
1,1013500,"Fish River near Fort Kent, Maine",2252.7,1,47.237394,-68.582642,ME,yes,Ref,NorthEast
2,1015800,"Aroostook River near Masardis, Maine",2313.8,1,46.523003,-68.371764,ME,,Non-ref,NorthEast
3,1016500,"MACHIAS RIVER NEAR ASHLAND, ME",847.8,1,46.628311,-68.434792,ME,,Non-ref,NorthEast
4,1017000,"Aroostook River at Washburn, Maine",4278.9,1,46.777294,-68.157194,ME,,Non-ref,NorthEast


It's good practice to check the columns when you read in a new dataframe.
- DRAIN_SQKM: Watershed area in square km.
- LAT_GAGE and LNG_GAGE: Latitude and longitude of the gage/pour point.
- HUC02: The HUC region that the watershed falls in.
- CLASS: GAGES reference or non-ref watershed. Ref watersheds are watersheds with a lower human impact (agriculture, hydrology changes, etc.)
- AGGECOREGION: Ecoregion based on topology.
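The leading-zero issue noted for 'STAID' is easy to reproduce. The sketch below reads two rows adapted from the table above from an in-memory string (so no file is needed) and compares the default read with `dtype={'STAID': str}`. The variable names are illustrative.

```python
from io import StringIO
import pandas as pd

# Two rows in the same layout as BasinID.txt, held in memory for illustration.
sample = StringIO("STAID,DRAIN_SQKM\n01011000,3186.8\n01013500,2252.7\n")

asInt = pd.read_csv(sample)                        # STAID inferred as integer
sample.seek(0)                                     # rewind before reading again
asStr = pd.read_csv(sample, dtype={"STAID": str})  # STAID kept as text

print(asInt["STAID"].tolist())   # leading zeros are lost
print(asStr["STAID"].tolist())   # leading zeros are kept
print(asStr.dtypes)              # quick check of every column's type
```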

## Compiling water quality and discharge data

The USGS has a library called 'dataRetrieval', which helps with loading hydrologic and water quality data. It was originally built for R, and the R version has better documentation, but there is also a Python version ('dataretrieval'), which is what we will use.

We want to compile inorganic nitrogen concentration and discharge. Specifically, we will pull the dissolved fraction of inorganic nitrogen (nitrate and nitrite) and daily discharge. The units are likely mg/L as N and ft3/s, but we need to double check. You always want to double check your data to make sure you are pulling the **right** data and that you understand **what** you are pulling. It sounds trivial, but this is how mistakes happen.

In [71]:
# What do we want to pull? The Water Quality Portal identifies solutes by characteristic name, and USGS NWIS identifies discharge by parameter code.
N_paramCd = 'Inorganic nitrogen (nitrate and nitrite)'
Q_paramCd = '00060'

In [None]:
# Over what period? We only pull data within this timeframe.
# We define the dates in two formats because the two functions expect different date formats (MM-DD-YYYY for wqp, YYYY-MM-DD for nwis).
N_startDate = "01-01-1980"
N_endDate = "12-31-2020"
Q_startDate = "1980-01-01"
Q_endDate = "2020-12-31"
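To avoid the two sets of date strings drifting apart, they could be derived from a single pair of dates. This is a small sketch using the standard library; `startDate` and `endDate` are names introduced here for illustration.

```python
from datetime import date

# Single source of truth for the period of interest.
startDate = date(1980, 1, 1)
endDate = date(2020, 12, 31)

# MM-DD-YYYY strings for the wqp request.
N_startDate = startDate.strftime("%m-%d-%Y")
N_endDate = endDate.strftime("%m-%d-%Y")

# YYYY-MM-DD strings for the nwis request.
Q_startDate = startDate.strftime("%Y-%m-%d")
Q_endDate = endDate.strftime("%Y-%m-%d")

print(N_startDate, N_endDate, Q_startDate, Q_endDate)
```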

## Inorganic Nitrogen Concentration Data
Now that we have the stations we will be using, we want to find which ones have nitrate concentration data and discharge data. We will start by pulling inorganic nitrogen concentration data because there are fewer stations with available data.

In [28]:
# Now we want to read in the data. We will read in nitrate solute data first, because these data are a lot more sparse than discharge. 

# There are too many sites to call all at once, so we will chunk the list of sites. 
siteNumbers = MetricTable['STAID'].tolist()
split_siteNumbers = np.array_split(siteNumbers, 50)
rawDailyWQData_l = []

# loop through chunks and make requests 
for site_list_a in split_siteNumbers:
    #site_list = site_list_a.tolist()
    site_list = ['USGS-' + site for site in site_list_a]
    data, metadata = wqp.get_results(siteid=site_list, startDateLo=N_startDate, startDateHi=N_endDate,characteristicName=N_paramCd)
    rawDailyWQData_l.append(data)  # Append results to list

rawDailyWQData = pd.concat(rawDailyWQData_l, ignore_index=True)
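The chunking step can be tried offline on a short list. Note that the second argument of `np.array_split` is the number of chunks, not the chunk size, and the chunks may be uneven. The site list below is a hypothetical stand-in for `MetricTable['STAID'].tolist()`.

```python
import numpy as np

# Hypothetical stand-in for MetricTable['STAID'].tolist().
siteNumbers = ["01011000", "01013500", "01015800", "01016500", "01017000"]

# Split into 2 chunks; 5 sites do not divide evenly, so the sizes differ.
chunks = np.array_split(siteNumbers, 2)

for chunk in chunks:
    # Add the 'USGS-' prefix used for Water Quality Portal site IDs.
    site_list = ["USGS-" + str(site) for site in chunk]
    print(len(site_list), site_list)
```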



In [6]:
pd.set_option('display.max_columns', None)
rawDailyWQData.head(8)

Unnamed: 0,OrganizationIdentifier,OrganizationFormalName,ActivityIdentifier,ActivityTypeCode,ActivityMediaName,ActivityMediaSubdivisionName,ActivityStartDate,ActivityStartTime/Time,ActivityStartTime/TimeZoneCode,ActivityEndDate,ActivityEndTime/Time,ActivityEndTime/TimeZoneCode,ActivityDepthHeightMeasure/MeasureValue,ActivityDepthHeightMeasure/MeasureUnitCode,ActivityDepthAltitudeReferencePointText,ActivityTopDepthHeightMeasure/MeasureValue,ActivityTopDepthHeightMeasure/MeasureUnitCode,ActivityBottomDepthHeightMeasure/MeasureValue,ActivityBottomDepthHeightMeasure/MeasureUnitCode,ProjectIdentifier,ActivityConductingOrganizationText,MonitoringLocationIdentifier,ActivityCommentText,SampleAquifer,HydrologicCondition,HydrologicEvent,SampleCollectionMethod/MethodIdentifier,SampleCollectionMethod/MethodIdentifierContext,SampleCollectionMethod/MethodName,SampleCollectionEquipmentName,ResultDetectionConditionText,CharacteristicName,ResultSampleFractionText,ResultMeasureValue,ResultMeasure/MeasureUnitCode,MeasureQualifierCode,ResultStatusIdentifier,StatisticalBaseCode,ResultValueTypeName,ResultWeightBasisText,ResultTimeBasisText,ResultTemperatureBasisText,ResultParticleSizeBasisText,PrecisionValue,ResultCommentText,USGSPCode,ResultDepthHeightMeasure/MeasureValue,ResultDepthHeightMeasure/MeasureUnitCode,ResultDepthAltitudeReferencePointText,SubjectTaxonomicName,SampleTissueAnatomyName,ResultAnalyticalMethod/MethodIdentifier,ResultAnalyticalMethod/MethodIdentifierContext,ResultAnalyticalMethod/MethodName,MethodDescriptionText,LaboratoryName,AnalysisStartDate,ResultLaboratoryCommentText,DetectionQuantitationLimitTypeName,DetectionQuantitationLimitMeasure/MeasureValue,DetectionQuantitationLimitMeasure/MeasureUnitCode,PreparationStartDate,ProviderName
0,USGS-ME,USGS Maine Water Science Center,nwisma.01.98100751,Sample-Routine,Water,Surface Water,1980-10-20,15:00:00,EDT,,,,,,,,,,,,U.S. Geological Survey-Water Resources Discipline,USGS-01049265,,,Not determined,Routine sample,USGS,USGS,USGS,Unknown,,Inorganic nitrogen (nitrate and nitrite),Total,0.12,mg/l as N,,Historical,,Actual,,,,,,,630,,,,,,,,,,,,,,,,,NWIS
1,USGS-ME,USGS Maine Water Science Center,nwisma.01.98100751,Sample-Routine,Water,Surface Water,1980-10-20,15:00:00,EDT,,,,,,,,,,,,U.S. Geological Survey-Water Resources Discipline,USGS-01049265,,,Not determined,Routine sample,USGS,USGS,USGS,Unknown,,Inorganic nitrogen (nitrate and nitrite),Dissolved,0.13,mg/l as N,,Historical,,Actual,,,,,,,631,,,,,,,,,,,,,,,,,NWIS
2,USGS-MA,USGS Massachusetts Water Science Center,nwisma.01.98001110,Sample-Routine,Water,Surface Water,1980-04-15,15:15:00,EST,,,,,,,,,,,,U.S. Geological Survey-Water Resources Discipline,USGS-01103500,,,Not determined,Routine sample,USGS,USGS,USGS,Unknown,,Inorganic nitrogen (nitrate and nitrite),Total,0.14,mg/l as N,,Historical,,Actual,,,,,,,630,,,,,,,,,,,,,,,,,NWIS
3,USGS-MA,USGS Massachusetts Water Science Center,nwisma.01.98001110,Sample-Routine,Water,Surface Water,1980-04-15,15:15:00,EST,,,,,,,,,,,,U.S. Geological Survey-Water Resources Discipline,USGS-01103500,,,Not determined,Routine sample,USGS,USGS,USGS,Unknown,,Inorganic nitrogen (nitrate and nitrite),Dissolved,0.15,mg/l as N,,Historical,,Actual,,,,,,,631,,,,,,,,,,,,,,,,,NWIS
4,USGS-MA,USGS Massachusetts Water Science Center,nwisma.01.98001114,Sample-Routine,Water,Surface Water,1980-06-24,12:30:00,EDT,,,,,,,,,,,,U.S. Geological Survey-Water Resources Discipline,USGS-01103500,,,Not determined,Routine sample,USGS,USGS,USGS,Unknown,,Inorganic nitrogen (nitrate and nitrite),Total,0.3,mg/l as N,,Historical,,Actual,,,,,,,630,,,,,,,,,,,,,,,,,NWIS
5,USGS-MA,USGS Massachusetts Water Science Center,nwisma.01.98001114,Sample-Routine,Water,Surface Water,1980-06-24,12:30:00,EDT,,,,,,,,,,,,U.S. Geological Survey-Water Resources Discipline,USGS-01103500,,,Not determined,Routine sample,USGS,USGS,USGS,Unknown,,Inorganic nitrogen (nitrate and nitrite),Dissolved,0.3,mg/l as N,,Historical,,Actual,,,,,,,631,,,,,,,,,,,,,,,,,NWIS
6,USGS-ME,USGS Maine Water Science Center,nwisma.01.98000220,Sample-Routine,Water,Surface Water,1980-01-22,16:00:00,EST,,,,,,,,,,,,U.S. Geological Survey-Water Resources Discipline,USGS-01022500,,,Not determined,Routine sample,USGS,USGS,USGS,Unknown,,Inorganic nitrogen (nitrate and nitrite),Total,0.05,mg/l as N,,Historical,,Actual,,,,,,,630,,,,,,,,,,,,,,,,,NWIS
7,USGS-ME,USGS Maine Water Science Center,nwisma.01.98000220,Sample-Routine,Water,Surface Water,1980-01-22,16:00:00,EST,,,,,,,,,,,,U.S. Geological Survey-Water Resources Discipline,USGS-01022500,,,Not determined,Routine sample,USGS,USGS,USGS,Unknown,,Inorganic nitrogen (nitrate and nitrite),Dissolved,0.06,mg/l as N,,Historical,,Actual,,,,,,,631,,,,,,,,,,,,,,,,,NWIS


Take a look at the columns from the water quality data. There is a lot of information included, and I suggest you go through these columns before you use the data! It's crucial that you check your data so you can see whether any columns have relevant information.
Things to note:
- 'ActivityStartDate' has 2 entries every day. Strange! 
- 'ResultSampleFractionText': we can see that there are both 'Dissolved' and 'Total' (unfiltered) values. These are two different solutes. We need to investigate this.

But for now, since we are assuming a familiarity with the data, we'll just isolate the columns we want and move on. 
- 'ActivityStartDate': Date of sample collected
- 'MonitoringLocationIdentifier': Site ID
- 'ResultSampleFractionText': Type of DIN 
- 'ResultMeasureValue': Measured value
- 'ResultMeasure/MeasureUnitCode': Units
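The "2 entries every day" observation can be checked by counting records per site and date and then spreading the fractions into columns. This sketch uses four rows copied from the preview above rather than the full download. The names `preview`, `perDay`, and `wide` are illustrative.

```python
import pandas as pd

# Four rows copied from the preview above (site, date, fraction, value).
preview = pd.DataFrame({
    "MonitoringLocationIdentifier": ["USGS-01049265", "USGS-01049265",
                                     "USGS-01103500", "USGS-01103500"],
    "ActivityStartDate": ["1980-10-20", "1980-10-20", "1980-04-15", "1980-04-15"],
    "ResultSampleFractionText": ["Total", "Dissolved", "Total", "Dissolved"],
    "ResultMeasureValue": [0.12, 0.13, 0.14, 0.15],
})

keys = ["MonitoringLocationIdentifier", "ActivityStartDate"]

# How many records does each site have on each day?
perDay = preview.groupby(keys).size()
print(perDay)

# One row per site and day, one column per fraction.
wide = preview.pivot_table(index=keys, columns="ResultSampleFractionText",
                           values="ResultMeasureValue")
print(wide)
```

In these rows, the two entries per day are one 'Total' and one 'Dissolved' measurement from the same sample.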

### Data cleaning
Data cleaning is often a tedious and time-consuming step, but arguably one of the most important ones. Spending time on the front end cleaning your data will save you time and energy down the line.

#### Getting to know your data
Starting off, you might want to check the unique values (categorical) or the max and min (numeric) in each relevant column.

In [109]:
# Data is messy! Check it and clean it. This is our first (and definitely not our last) use of  Split-Apply-Combine.
uncleanDailyWQData = rawDailyWQData[['MonitoringLocationIdentifier', 'ActivityStartDate','ResultMeasureValue','ResultSampleFractionText','ResultMeasure/MeasureUnitCode']]
uncleanDailyWQData = uncleanDailyWQData.rename(columns={"MonitoringLocationIdentifier": "USGSSite", 
                                                        "ActivityStartDate": "Date", 
                                                        "ResultMeasureValue": "Conc", 
                                                        "ResultSampleFractionText": "Fraction", 
                                                        "ResultMeasure/MeasureUnitCode": "Units"})

DailyWQData = uncleanDailyWQData.copy()

# There are a few values we can remove off the hop. We don't want NA values or 0.  
DailyWQData = DailyWQData.dropna(subset=['Conc'])
DailyWQData = DailyWQData.loc[DailyWQData['Conc'] != 0]
DailyWQData.head(10)

Unnamed: 0,USGSSite,Date,Conc,Fraction,Units
0,USGS-01049265,1980-10-20,0.12,Total,mg/l as N
1,USGS-01049265,1980-10-20,0.13,Dissolved,mg/l as N
2,USGS-01103500,1980-04-15,0.14,Total,mg/l as N
3,USGS-01103500,1980-04-15,0.15,Dissolved,mg/l as N
4,USGS-01103500,1980-06-24,0.3,Total,mg/l as N
5,USGS-01103500,1980-06-24,0.3,Dissolved,mg/l as N
6,USGS-01022500,1980-01-22,0.05,Total,mg/l as N
7,USGS-01022500,1980-01-22,0.06,Dissolved,mg/l as N
8,USGS-01103500,1980-09-09,0.06,Total,mg/l as N
9,USGS-01103500,1980-09-09,0.05,Dissolved,mg/l as N


In [110]:
grouped_solute = DailyWQData.groupby('Fraction').agg({
    'Date': ['min', 'max'],
    'Conc': ['min', 'mean', 'max']
})
grouped_solute.head()

Unnamed: 0_level_0,Date,Date,Conc,Conc,Conc
Unnamed: 0_level_1,min,max,min,mean,max
Fraction,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Bed Sediment,1980-01-02,2014-11-24,0.92,11.348767,240.0
Dissolved,1980-01-02,2020-12-31,0.001,1.527721,637.0
Total,1980-01-02,2020-12-28,0.001,1.577619,250.0


In [111]:
grouped_solute = DailyWQData.groupby('USGSSite').agg({
    'Conc': ['min', 'mean', 'max']
})
grouped_solute = grouped_solute.sort_values(by=('Conc', 'max'), ascending=False)
grouped_solute.head(10)

Unnamed: 0_level_0,Conc,Conc,Conc
Unnamed: 0_level_1,min,mean,max
USGSSite,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
USGS-04199000,0.14,4.400685,637.0
USGS-07019185,0.03,4.270769,250.0
USGS-07249400,0.01,1.285598,240.0
USGS-02336526,0.044,1.609385,230.0
USGS-08010000,0.05,3.6375,130.0
USGS-09431500,0.02,1.08468,130.0
USGS-07189540,0.467,11.270645,117.0
USGS-04212100,0.044,0.548967,97.1
USGS-06893300,0.63,6.79125,86.0
USGS-04193500,0.02,5.423071,65.4


Four takeaways:
- Station identifiers have 'USGS-' in front of them.
- We definitely do not want bed sediment. Remove that!
- We want dissolved, and the number of dissolved and total datapoints is similar, so we can just take dissolved.
- Those maximum values are also very high for mg/L as N! We will need to remove outliers.

In [112]:
# We are using Boolean Indexing to remove all other fractions. You can also use the 'drop' function. 
DailyWQData = DailyWQData[DailyWQData['Fraction'] == 'Dissolved']

In [113]:
# Let's check the units to make sure they are what we expect and that they are consistent through the entire dataset. 
unitsAvail = DailyWQData['Units'].unique()
print(unitsAvail)

['mg/l as N']


The only units available are 'mg/l as N' (mg-N/L), which is what we want! Great, we don't need to convert anything.

#### Removing Outliers
**Removing outliers is more an art than a science.**

There are many different ways to remove outliers, and some are better suited to some data distributions than others. A lot of environmental/hydrological data is approximately log-normally distributed (skewed, with heavy tails), so removing outliers based on a normal distribution of the raw values might remove non-outliers. Try out a few different approaches and pick the one that maintains the distribution of your data but removes some values that are likely erroneous.
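To see why the scale matters, the sketch below applies a mean ± 2 standard deviation rule to synthetic, skewed data on both the raw and the log10 scale. The data are randomly generated for illustration and are not from the USGS download. `two_sd_bounds` is a helper defined here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed "concentrations" (log-normal), for illustration only.
conc = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

def two_sd_bounds(values):
    """Return (mean - 2*sd, mean + 2*sd) using the sample standard deviation."""
    m, s = values.mean(), values.std(ddof=1)
    return m - 2 * s, m + 2 * s

lowRaw, highRaw = two_sd_bounds(conc)
logConc = np.log10(conc)
lowLog, highLog = two_sd_bounds(logConc)

flaggedRaw = ((conc < lowRaw) | (conc > highRaw)).sum()
flaggedLog = ((logConc < lowLog) | (logConc > highLog)).sum()

print("raw-scale bounds:", lowRaw, highRaw, "flagged:", flaggedRaw)
print("log-scale bounds:", lowLog, highLog, "flagged:", flaggedLog)
```

On the raw scale the lower bound falls below zero, so no low value can ever be flagged and only the upper tail is trimmed. On the log scale both tails are treated symmetrically.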

In [114]:
uniqSites = DailyWQData['USGSSite'].unique()
keepIdx = []
for uniqSite in uniqSites:
    temp = DailyWQData[DailyWQData['USGSSite'] == uniqSite].copy()
    
    # Calculating log10 concentration. 
    temp['LogConc'] = np.log10(temp['Conc'])

    # Finding the thresholds of concentration for this specific site. 
    mean = temp['LogConc'].mean()
    std_dev = temp['LogConc'].std()
    lowerThreshold = mean - 2 * std_dev
    upperThreshold = mean + 2 * std_dev
    
    # Indices are from DailyWQData, so we can collect the indices of the rows to keep
    # (those within the thresholds) and filter the data after the loop.
    keepIdx_t = temp[(temp['LogConc'] >= lowerThreshold) & (temp['LogConc'] <= upperThreshold)].index
    keepIdx.extend(keepIdx_t)
    
DailyWQData = DailyWQData.loc[keepIdx]
print(len(DailyWQData)/len(uncleanDailyWQData))

0.5739667395561063


After cleaning, we have just under 60% of our data remaining.
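The per-site loop above can also be written without an explicit loop by using `groupby(...).transform`, which returns the site mean and standard deviation aligned to every row. This is a sketch on a small made-up dataset, not a drop-in replacement tested on the full download.

```python
import numpy as np
import pandas as pd

# Made-up data: site "A" has one extreme value (50.0), site "B" has none.
df = pd.DataFrame({
    "USGSSite": ["A"] * 10 + ["B"] * 4,
    "Conc": [0.1, 0.12, 0.11, 0.09, 0.1, 0.13, 0.08, 0.1, 0.11, 50.0,
             1.0, 1.2, 0.9, 1.1],
})

df["LogConc"] = np.log10(df["Conc"])

# Per-site mean and standard deviation, broadcast back to each row.
siteMean = df.groupby("USGSSite")["LogConc"].transform("mean")
siteStd = df.groupby("USGSSite")["LogConc"].transform("std")

# Keep rows within 2 standard deviations of their own site's mean.
keep = (df["LogConc"] - siteMean).abs() <= 2 * siteStd
cleaned = df[keep]

print(len(df), "rows before,", len(cleaned), "rows after")
```

One thing to be aware of with either version: a site with a single sample has an undefined standard deviation (NaN), so the comparison is False and that site is dropped entirely.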

Now we also have to remove the 'USGS-' prefix in front of the site ID for consistency with our basin data and because USGS data from NWIS does not have that prefix.

In [116]:
DailyWQData['Site'] = DailyWQData['USGSSite'].str.slice(start=5)
col = DailyWQData.pop("Site")
DailyWQData.insert(0, col.name, col)
DailyWQData.head(1)

Unnamed: 0,Site,USGSSite,Date,Conc,Fraction,Units
1,1049265,USGS-01049265,1980-10-20,0.13,Dissolved,mg/l as N
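Slicing off the first five characters works as long as every ID starts with 'USGS-'. A slightly more defensive alternative, sketched below on a made-up series, removes the prefix only where it is present. The third ID deliberately has no prefix to show the difference.

```python
import pandas as pd

# Made-up series; the last ID has no prefix on purpose.
sites = pd.Series(["USGS-01049265", "USGS-01103500", "01022500"])

# Positional slice: assumes the prefix is always there.
bySlice = sites.str.slice(start=5)

# Anchored pattern: strips 'USGS-' only at the start of the string.
byPattern = sites.str.replace(r"^USGS-", "", regex=True)

print(bySlice.tolist())
print(byPattern.tolist())
```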


Great! Moving along...

### Discharge
Now that we have our final set of water quality stations, let's compile the flow data.
USGS records flow daily, but water quality is sampled monthly at most. We will use the water quality dataframe to pull only the relevant data.

We will be using the parameter code set at the start of the script, "00060", which is the USGS parameter code for daily mean discharge. Let's check the units!

In [117]:
qInfo, md = nwis.get_pmcodes(parameterCd='00060', partial=False)
qInfo.head()

Unnamed: 0,parameter_cd,group,parm_nm,epa_equivalence,result_statistical_basis,result_time_basis,result_weight_basis,result_particle_size_basis,result_sample_fraction,result_temperature_basis,CASRN,SRSName,parm_unit
0,60,Physical,"Discharge, cubic feet per second",Not checked,Mean,1 Day,,,,,,"Stream flow, mean. daily",ft3/s


Units are in cubic feet per second (ft3/s). So we will have to convert that to cubic meters per second (m3/s). 
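The conversion factor follows from the exact definition 1 ft = 0.3048 m, so 1 ft3 = 0.3048^3 m3 ≈ 0.0283168 m3. A minimal sketch is shown below. The example discharge values and the output column name `Q_cms` are chosen here for illustration.

```python
import pandas as pd

# 1 ft = 0.3048 m exactly, so 1 ft3/s = 0.3048**3 m3/s.
CFS_TO_CMS = 0.3048 ** 3

# Example daily mean discharge values in ft3/s.
q = pd.DataFrame({"00060_Mean": [207.0, 188.0, 173.0]})

# Hypothetical output column holding discharge in m3/s.
q["Q_cms"] = q["00060_Mean"] * CFS_TO_CMS

print(CFS_TO_CMS)
print(q)
```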

In [150]:
uniqSites = DailyWQData['Site'].unique()
uniqSite = uniqSites[1]

#for uniqSite in uniqSites:

# Isolate the sampling dates for this site (stored as 'YYYY-MM-DD' strings)
WQ_dates = DailyWQData.loc[DailyWQData['Site'] == uniqSite, 'Date']

# Pull Q data
rawDailyQ, md = nwis.get_dv(sites=uniqSite, parameterCd=Q_paramCd, start=Q_startDate, end=Q_endDate)
rawDailyQ.reset_index(inplace=True)
rawDailyQ.rename(columns={'index': 'datetime'}, inplace=True)
# Format the dates as 'YYYY-MM-DD' strings so they match the WQ dates.
# (.dt.date gives datetime.date objects, which never equal strings in isin.)
rawDailyQ['Date'] = rawDailyQ['datetime'].dt.strftime('%Y-%m-%d')

# Only keep the data that has the same dates as WQ samples.
filtered_df = rawDailyQ[rawDailyQ['Date'].isin(WQ_dates)]
rawDailyQ.head()

Unnamed: 0,datetime,site_no,00060_Mean,00060_Mean_cd,Date
0,1980-01-01 00:00:00+00:00,1103500,207.0,A,1980-01-01
1,1980-01-02 00:00:00+00:00,1103500,188.0,A,1980-01-02
2,1980-01-03 00:00:00+00:00,1103500,173.0,A,1980-01-03
3,1980-01-04 00:00:00+00:00,1103500,155.0,A,1980-01-04
4,1980-01-05 00:00:00+00:00,1103500,148.0,A,1980-01-05


In [151]:
print(rawDailyQ['Date'].dtype)
print(WQ_dates.dtype)

object
object
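Both columns report `object`, but that dtype only says "generic Python objects" and hides what the elements actually are. The sketch below, on made-up values, shows how a column of `datetime.date` objects and a column of date strings can share the `object` dtype and still never match in `isin`, and how converting both sides with `pd.to_datetime` makes the comparison meaningful.

```python
import datetime
import pandas as pd

# Made-up values; dtype=object is set explicitly to mirror the output above.
asDates = pd.Series([datetime.date(1980, 1, 1), datetime.date(1980, 1, 2)], dtype=object)
asStrings = pd.Series(["1980-01-01", "1980-01-02"], dtype=object)

print(asDates.dtype, asStrings.dtype)                    # same dtype
print(type(asDates.iloc[0]), type(asStrings.iloc[0]))    # different element types
print(asDates.isin(asStrings).any())                     # dates never equal strings

# Converting both sides to datetime64 puts them on a common footing.
left = pd.to_datetime(asDates)
right = pd.to_datetime(asStrings)
print(left.isin(right).all())
```

Checking `type(series.iloc[0])` is a quick way to see past an `object` dtype when a filter unexpectedly returns nothing.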
