# Streamflow and precipitation data from Boulder Creek
Grouping and aggregating data

## Setup

In [None]:
# Libraries used in this demo
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Data file path
streamflow_pth = './boulder_creek_streamflow_wy2010-wy2020.tsv'

First, let's import the streamflow data into pandas:

In [None]:
# Read streamflow from the tab-separated (.tsv) file
streamflow_names = ['agency', 'site', 'date', 'streamflow', 'code']
streamflow_df = pd.read_csv(streamflow_pth,
                            sep = '\t',
                            comment = '#',
                            # Skip header rows so pandas gets datatypes right
                            skiprows = [29, 30, 3511, 3512],
                            names = streamflow_names,
                            # Dates as datetime instead of object
                            parse_dates = ['date'],
                            usecols = ['site', 'date', 'streamflow'])

# Rename sites to upstream and downstream
streamflow_df.site = streamflow_df.site.replace(
    6727500, 'upstream').replace(
    6730200, 'downstream')

# Preview data
streamflow_df.head()

The upstream gauge only has data from April 1 to September 30 because the stream is empty or frozen at other times. For a better comparison, let's limit the data to those months:

In [None]:
# Filter months between April and September, inclusive
streamflow_df = streamflow_df.loc[streamflow_df.date.dt.month.between(4, 9)]

# Check that the minimum month is 4 and the maximum is 9
streamflow_df.date.dt.month.describe(percentiles = [])

## How much of the streamflow at the downstream gauge comes from the upstream gauge’s branch, on average?

To calculate this value, we need to take the average streamflow for each gauge separately. We do this by *grouping* the DataFrame by site before computing the mean:

In [None]:
# Take the mean for each site separately
streamflow_df.groupby('site').streamflow.mean()


## Is the pattern of monthly streamflow the same or different for each of these gauge locations?

To figure this out, we must group by the site AND the month of the year before averaging (or summarizing in some other way):

In [None]:
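# Put the month in a separate column
streamflow_df = streamflow_df.assign(month = streamflow_df.date.dt.month)

# Group by site AND month
streamflow_mean_monthly = streamflow_df.groupby(['site', 'month'])[['streamflow']].mean()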

# View results
streamflow_mean_monthly

Since the downstream gauge is on a higher-order stream than the upstream gauge, the streamflow there is much higher. To compare the seasonal pattern at the two locations on the same scale, it helps to normalize each gauge's monthly mean streamflow by that gauge's maximum monthly mean, so that each gauge's peak month has a value of 1.

In [None]:
# Normalize by the maximum value
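streamflow_mean_monthly['streamflow_norm'] = (
    streamflow_mean_monthly['streamflow']
    / streamflow_mean_monthly.groupby(level = 'site')['streamflow'].transform('max'))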

# Now we should have 6 values for each site - one for each month from April through September
streamflow_mean_monthly
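
To make the group-wise normalization concrete, here is a minimal sketch on a made-up table. The name `demo_monthly` and all of its values are hypothetical stand-ins for the real monthly means; the point is only that `transform('max')` returns one maximum per row, aligned with the original index, so the division happens within each site.

```python
import pandas as pd

# Hypothetical monthly means for two sites (values are made up for illustration)
demo_monthly = pd.DataFrame(
    {'streamflow': [2.0, 4.0, 1.0, 50.0, 100.0, 75.0]},
    index=pd.MultiIndex.from_product(
        [['upstream', 'downstream'], [4, 5, 6]], names=['site', 'month']))

# transform('max') repeats each site's maximum on every row of that site
site_max = demo_monthly.groupby(level='site')['streamflow'].transform('max')

# Dividing by the per-site maximum scales each site's peak month to 1
demo_monthly['streamflow_norm'] = demo_monthly['streamflow'] / site_max
print(demo_monthly)
```

Using `transform` rather than a plain `.max()` keeps the result the same shape as the input, which is what makes the row-by-row division possible without a merge.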

Let's plot the seasonal (April to September) pattern for both gauges to see whether they look similar:

In [None]:
# Bar plot of monthly average streamflow
ax = streamflow_mean_monthly['streamflow_norm'].unstack(0).plot.bar()
ax.set_ylabel('Normalized streamflow')
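
The `unstack(0)` call in the plotting cell is easy to skim past, so here is a small sketch of what it does, using a made-up Series (`demo_norm` and its values are hypothetical). It moves the first index level (the site) into the columns, leaving one row per month, which is the layout `plot.bar()` needs to draw side-by-side bars for each site.

```python
import pandas as pd

# Hypothetical normalized values indexed by (site, month)
demo_norm = pd.Series(
    [0.5, 1.0, 0.25, 0.5, 1.0, 0.75],
    index=pd.MultiIndex.from_product(
        [['upstream', 'downstream'], [4, 5, 6]], names=['site', 'month']),
    name='streamflow_norm')

# Move index level 0 (site) into the columns: rows are months, columns are sites
demo_wide = demo_norm.unstack(0)
print(demo_wide)
```

Each column of the wide table becomes one bar color in the plot, and each row becomes one group of bars along the x-axis.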

### How are the monthly streamflow patterns similar for the two gauges? How are they different? Why?

Streamflow at both gauges peaks in the spring and is lower in the summer. The peak for the upstream gauge is earlier in the season, probably because that gauge is in an area where the snow melts out earlier than it does for the rest of the basin.

## Which of the gauges has the most variability (relative to the mean) in monthly and daily streamflow?

### What are the steps to computing and normalizing the daily variability?

### What are the steps to computing and normalizing the monthly variability?

### Compute the relative variability (ratio of standard deviation to the mean) of daily streamflow at each gauge
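
One possible approach, sketched on made-up data rather than the real gauge records: group by site, compute the mean and standard deviation of the daily values in one `agg` call, then divide. The table `demo_daily` and its values are hypothetical; with the real data, the same steps would be applied to `streamflow_df`.

```python
import pandas as pd

# Hypothetical daily streamflow values for two sites (made up for illustration)
demo_daily = pd.DataFrame({
    'site': ['upstream'] * 3 + ['downstream'] * 3,
    'streamflow': [1.0, 2.0, 3.0, 8.0, 10.0, 12.0],
})

# Mean and (sample) standard deviation of daily streamflow for each site
daily_stats = demo_daily.groupby('site')['streamflow'].agg(['mean', 'std'])

# Relative variability = standard deviation divided by the mean
daily_stats['rel_var'] = daily_stats['std'] / daily_stats['mean']
print(daily_stats)
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default; dividing by the mean makes the result unitless, so gauges with very different flow magnitudes can be compared directly.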

### Compute the relative variability of monthly streamflow at each gauge
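
A hedged sketch of one way to do this, again on made-up data: first aggregate the daily values to one mean per site, year, and month, then compute the standard deviation and mean of those monthly values for each site. This assumes "monthly streamflow" means the mean of each individual calendar month; other definitions (for example, monthly totals) are equally reasonable. The names `demo_df`, `monthly`, and `monthly_rel_var` are hypothetical.

```python
import pandas as pd

# Hypothetical daily records spanning two months (values are made up)
demo_df = pd.DataFrame({
    'site': ['upstream'] * 4 + ['downstream'] * 4,
    'date': pd.to_datetime(
        ['2010-04-01', '2010-04-02', '2010-05-01', '2010-05-02'] * 2),
    'streamflow': [1.0, 3.0, 5.0, 7.0, 10.0, 10.0, 20.0, 20.0],
})

# Step 1: aggregate daily values to one mean per site, year, and month
monthly = demo_df.groupby(
    ['site',
     demo_df.date.dt.year.rename('year'),
     demo_df.date.dt.month.rename('month')])['streamflow'].mean()

# Step 2: relative variability of the monthly values within each site
by_site = monthly.groupby(level='site')
monthly_rel_var = by_site.std() / by_site.mean()
print(monthly_rel_var)
```

Because averaging smooths out day-to-day fluctuations, the relative variability of monthly values is often lower than that of daily values, although that is something to check against the real data rather than assume.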

### What do you observe about the relative variability of streamflow at each gauge? What about aggregated to different time intervals? Why?

## Think of an additional way to group and aggregate the streamflow data and implement it, or use grouping and aggregating to analyse your own data