# Examples for Trend Analysis

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt
from scipy import stats
import urllib.parse
import urllib.request

## 1. Data preparation - Here we use `St Lawrence River` as an example

In this section, we download daily streamflow data from the USGS for the St Lawrence River.

In [None]:
args = {
    'site_no': '04264331',
    'begin_date': '1936-01-01',
    'end_date': '2022-12-31'
}
query = urllib.parse.urlencode(args)
usgs_url = (
    'https://waterdata.usgs.gov/nwis/dv?'
    f'cb_00060=on&format=rdb&referred_module=sw&{query}'
)
response = urllib.request.urlopen(usgs_url)
df1 = pd.read_table(
    response,
    comment='#',
    sep=r'\s+',
    names=['agency', 'site', 'date', 'streamflow', 'quality_flag'],
    index_col=2,
    parse_dates=True,
    date_format='%Y-%m-%d',
    engine='python')
# discard the first two rows
df1 = df1.iloc[2:]
# Now convert the streamflow data to floats and
# the index to datetimes. When processing raw data
# it's common to have to do some extra postprocessing
df1['streamflow'] = df1['streamflow'].astype(np.float64)
df1.index = pd.DatetimeIndex(df1.index)

# calculate the annual peak flow for the St Lawrence River
st_law_peakflow = df1[['streamflow']].groupby(df1.index.year).max()
st_law_peakflow.columns = ['peakflow_cfs']

# 2. Linear regression

This section is mostly illustrative. It contains **two** practice exercises.

## 2.1. Least Squares Linear Regression

**St Lawrence River Example: Can we use annual mean flow to predict annual peak flow?**

As a continuation from Problem 1, we look at the St Lawrence River. The annual peak flow was calculated above with the variable name `st_law_peakflow`. We first calculate the annual mean flow.

In [None]:
st_law_mean_flow = df1[['streamflow']].groupby(df1.index.year).mean()
st_law_mean_flow.columns = ['meanflow_cfs']

In [None]:
# concatenate the annual peak flow and annual mean flow data
# into one DataFrame for convenience
st_law_flow_df = pd.concat([st_law_peakflow,st_law_mean_flow],axis=1)

In [None]:
# add years column for plotting convenience
st_law_flow_df['year'] = st_law_flow_df.index.values

#### Practice #1: Plot the time series of Annual Peak Flow and Annual Mean flow
Please plot both time series in one figure and assign different colors to them <br>
Plot type: Line plot <br>
Colors: Orange for Annual Mean Flow, and Blue for Annual Peak Flow

In [None]:
# INSERT your code here

#### What does the above plot show?

What you see above is a plot of the time series of annual mean flow (orange line) and annual peak flow (blue line). For a year with more water availability (annual mean flow), we might expect a high peak flow as well. We can check this by examining a regression between the annual mean flow and annual peak flow.

### 2.1.1. The first step to any regression or correlation analysis is to create a scatter plot of the data.

#### Practice #2: Please generate the scatter plot. 

* Please add the xlabel "Annual Mean Flow [cfs]", and ylabel "Annual Peak Flow [cfs]"
* Please add the title "Streamflow at St Lawrence River, NY \n1936-2022"

In [None]:
# Insert your code here

#### Linear regression: Could we use Annual Mean flow to predict Annual Peak Flow?

The plot above suggests that this is a borderline case for applying linear regression analysis. Which assumptions of linear regression might we worry about here? (heteroscedasticity, i.e. the scatter around the line changing with $x$)

We will proceed with calculating the regression and then look at the residuals to get a better idea of whether this is the best approach.
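
One quick, informal way to probe for heteroscedasticity is to check whether the size of the residuals grows with $x$. The sketch below uses synthetic data (an assumption for illustration, not the USGS record) and the names `x_demo`, `y_demo`, and `resid_demo` are hypothetical. It is a rough screening check, not a formal test.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration, not the USGS record):
# the noise standard deviation grows in proportion to x
rng = np.random.default_rng(0)
x_demo = rng.uniform(1.0, 10.0, size=200)
y_demo = 3.0 + 2.0 * x_demo + rng.normal(0.0, 0.5 * x_demo)

# Fit an ordinary least squares line and compute the residuals
fit_demo = stats.linregress(x_demo, y_demo)
resid_demo = y_demo - (fit_demo.intercept + fit_demo.slope * x_demo)

# Rank correlation between |residual| and x; a clearly positive value
# suggests that the scatter grows with x
rho, p_rho = stats.spearmanr(x_demo, np.abs(resid_demo))
print('Spearman rho = {:.2f}, p = {:.3g}'.format(rho, p_rho))
```

With real data, a plot of residuals against $x$ is usually the first thing to look at; a number like this only supplements it.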

---

### 2.1.2. Manual calculation of linear regression

Here we'll first compute it manually, solving for our y-intercept, $B_0$, and slope $B_1$:

$B_1 = \displaystyle \frac{n(\sum_{i=1}^{n}x_iy_i)-(\sum_{i=1}^{n}x_i)(\sum_{i=1}^{n}y_i)}{n(\sum_{i=1}^{n}x_i^2)-(\sum_{i=1}^{n}x_i)^2}$

$B_0 = \displaystyle \frac{(\sum_{i=1}^{n}y_i)-B_1(\sum_{i=1}^{n}x_i)}{n} = \bar{y} - B_1\bar{x}$

In [None]:
n = len(st_law_flow_df) # length of our dataset

x = st_law_flow_df.meanflow_cfs # using x for shorthand below
y = st_law_flow_df.peakflow_cfs # using y for shorthand below

B1 = ( n*np.sum(x*y) - np.sum(x)*np.sum(y) ) / ( n*np.sum(x**2) - np.sum(x)**2 ) # B1 parameter, slope
B0 = np.mean(y) - B1*np.mean(x) # B0 parameter, y-intercept

print('B0 : {}'.format(np.round(B0,4)))
print('B1 : {}'.format(np.round(B1,4)))

Then our linear model to predict $y$ at each $x_i$ is: $\hat{y}_i = B_0 + B_1x_i$

In [None]:
y_predicted = B0 + B1*x

And our residuals are: $(y_i - \hat{y}_i)$

In [None]:
residuals = (y - y_predicted)

Finally, compute our Sum of Squared Errors (from our residuals) and Total Sum of Squares to get the coefficient of determination, $R^2$, and the correlation coefficient, R, for this linear model.

$SSE = \displaystyle\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ 

$SST = \displaystyle\sum_{i=1}^{n} (y_i - \bar{y})^2$

$R^2 = 1 - \displaystyle \frac{SSE}{SST}$

And compute the standard error of the estimate, $\sigma$ for this model.

$\sigma = \sqrt{\displaystyle\frac{SSE}{(n-2)}}$

In [None]:
sse = np.sum(residuals**2)

sst = np.sum( (y - np.mean(y))**2 )

r_squared = 1 - sse/sst
r = np.sign(B1) * np.sqrt(r_squared) # R takes the sign of the slope

s = np.sqrt(sse/(n-2))

In [None]:
print('SSE : {} cfs^2'.format(np.round(sse,2)))
print('SST : {} cfs^2'.format(np.round(sst,2)))
print('R^2 : {}'.format(np.round(r_squared,3)))
print('R : {}'.format(np.round(r,3)))
print('sigma : {} cfs'.format(np.round(s,3)))

In [None]:
fig, [ax1, ax2, ax3] = plt.subplots(nrows=1, ncols=3, figsize=(14,4), tight_layout=True)

# Scatterplot
st_law_flow_df.plot.scatter(x='meanflow_cfs', y='peakflow_cfs', c='k', ax=ax1);

# Plot the regression line, we only need two points to define a line, use xmin and xmax
ax1.plot([x.min(), x.max()], [B0 + B1*x.min(), B0 + B1*x.max()] , '-r')

ax1.set_xlabel('Annual Mean Flow (cfs)')
ax1.set_ylabel('Annual Peak Flow (cfs)');

# ax1.set_xlim((0,3000))
# ax1.set_ylim((0,1000));

# Plot the residuals
ax2.plot(st_law_flow_df.year,residuals,'-o')

ax2.set_xlabel('Years')
ax2.set_ylabel('Residuals (cfs)');

# Plot a histogram of the residuals
ax3.hist(residuals, bins=10)

ax3.set_xlabel('Residuals (cfs)')
ax3.set_ylabel('Number of Data Points');

---

### 2.1.3. Linear regression using the scipy library

Now we'll use the `scipy.stats.linregress()` function to do the same thing. Review the documentation or help text for this function before proceeding. 

In [None]:
stats.linregress?

In [None]:
# use the linear regression function
slope, intercept, rvalue, pvalue, stderr = stats.linregress(st_law_flow_df.meanflow_cfs, 
                                                            st_law_flow_df.peakflow_cfs)

print('B0 : {}'.format(np.round(intercept,4)))
print('B1 : {}'.format(np.round(slope,4)))

print('R^2 : {}'.format(np.round(rvalue**2,3)))
print('R : {}'.format(np.round(rvalue,3)))
print('stderr : {}'.format(np.round(stderr,3)))

Do we get the same results as above?

No, our "standard error" is different. Why is that? If you look into the documentation for the `linregress` function, you'll see that it calls this output the "standard error of the **gradient**", meaning the standard error of the slope, $B_1$.

This is related to the "standard error", $\sigma$, by:

$SE_{B_1} = \displaystyle \frac{\sigma}{\sqrt{SST_x}} $ where $SST_x = \displaystyle\sum_{i=1}^{n} (x_i - \bar{x})^2$

Compute the standard error from the standard error of the gradient:

In [None]:
# Compute the SST for x
sst_x = np.sum( (x - np.mean(x))**2 )

# Compute the standard error
sigma = stderr * np.sqrt(sst_x)
print('sigma : {}'.format(np.round(sigma,3)))

This should now match what we solved for manually above.
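
The same relationship can be checked on a small synthetic dataset, independent of the streamflow record. This is a minimal sketch; the data and the names `x_chk`, `y_chk`, and `res_chk` are made up for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(1)
x_chk = rng.uniform(0.0, 100.0, size=60)
y_chk = 5.0 + 0.8 * x_chk + rng.normal(0.0, 10.0, size=60)
n_chk = len(x_chk)

res_chk = stats.linregress(x_chk, y_chk)

# Standard error of the estimate, computed directly from the residuals
resid_chk = y_chk - (res_chk.intercept + res_chk.slope * x_chk)
sigma_from_residuals = np.sqrt(np.sum(resid_chk**2) / (n_chk - 2))

# The same quantity recovered from the standard error of the slope
sst_x_chk = np.sum((x_chk - x_chk.mean())**2)
sigma_from_stderr = res_chk.stderr * np.sqrt(sst_x_chk)

print(sigma_from_residuals, sigma_from_stderr)
print(np.isclose(sigma_from_residuals, sigma_from_stderr))
```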

Finally, plot the result

In [None]:
fig, ax = plt.subplots(figsize=(6,6),dpi=200)

# Scatterplot
st_law_flow_df.plot.scatter(x='meanflow_cfs', y='peakflow_cfs', c='k', ax=ax);

# Create points for the regression line
x_1 = np.linspace(st_law_flow_df.meanflow_cfs.min(), 
                st_law_flow_df.meanflow_cfs.max(), 2) # make two x coordinates from min and max values of meanflow_cfs
y_1 = slope * x_1 + intercept # y coordinates using the slope and intercept from our linear regression to draw a regression line

# Plot the regression line
ax.plot(x_1, y_1, '-r')

ax.set_xlabel("Annual Mean Flow [cfs]")
ax.set_ylabel("Annual Peak Flow [cfs]")



We've used the slope and intercept from the linear regression. What were the other values the function returned to us?

This function gives us our R value. We can report how well our linear regression fits our data with this or with R-squared (you can see that in this case linear regression did a poor job).

In [None]:
print('r-value = {}'.format(rvalue))

print('r-squared = {}'.format(rvalue**2))

This function also performed a two-sided "Wald Test" (t-distribution) to test whether the slope of the linear regression is different from zero (the null hypothesis is that the slope is zero). Be careful using this default statistical test, though: is this the test that you really need for your data set?

In [None]:
print('p-value = {}'.format(pvalue))
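
To see where this p-value comes from, the sketch below reproduces it by hand on synthetic data: the test statistic is the slope divided by its standard error, compared against a t-distribution with $n-2$ degrees of freedom. The data and the names `x_t`, `y_t`, and `res_t` are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(2)
x_t = rng.uniform(0.0, 50.0, size=40)
y_t = 1.0 + 0.3 * x_t + rng.normal(0.0, 5.0, size=40)

res_t = stats.linregress(x_t, y_t)

# t statistic for the null hypothesis that the slope is zero
t_stat = res_t.slope / res_t.stderr
dof_t = len(x_t) - 2

# Two-sided p-value from the t-distribution's survival function
p_manual = 2 * stats.t.sf(np.abs(t_stat), dof_t)

print('p-value from linregress : {}'.format(res_t.pvalue))
print('p-value computed by hand: {}'.format(p_manual))
```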

And finally it gave us the standard error of the gradient

In [None]:
print('standard error = {}'.format(stderr))

Now use this linear model to predict a $y$ (Annual Peak Flow) value for each $x$ (Annual Mean Flow) value:

In [None]:
y_predicted = slope * st_law_flow_df.meanflow_cfs + intercept

**Plot residuals**

We should make a plot of the residuals (actual - predicted values)

In [None]:
residuals = st_law_flow_df.peakflow_cfs - y_predicted

For a good linear fit, we hope that our residuals are small, show no trends or patterns of their own, and are normally distributed:

In [None]:
f, (ax1, ax2) = plt.subplots(1,2,figsize=(9,4),dpi=200)

ax1.plot(st_law_flow_df.year,residuals)
ax1.set_xlabel('years')
ax1.set_ylabel('residuals (cfs)')

ax2.hist(residuals)
ax2.set_xlabel('residuals (cfs)')
ax2.set_ylabel('count')

f.tight_layout()

## 2.2. Confidence Interval for the Slope (B1)

**Compute the confidence intervals around our B1 parameter, the slope**

We first specify our $\alpha$ for our chosen level of confidence (95%), and our degrees of freedom $dof = n - 2$

In [None]:
# our alpha for 95% confidence
alpha = 0.05

# length of the dataset
n = len(x)
print(n)
# degrees of freedom
dof = n - 2

Now, compute the Standard Error of the Gradient (Slope):

$s_{B_1} = \displaystyle \frac{s}{\sqrt{SST_x}} $

In [None]:
# standard error of the gradient (slope)
sB1 = s/np.sqrt(sst_x)

This follows a t-distribution, so we find the t-value that corresponds to our $\alpha$ and $dof$:

In [None]:
# t-value for alpha/2 with n-2 degrees of freedom
t = stats.t.ppf(1-alpha/2, dof)

Compute the upper and lower limits for the B1 parameter

In [None]:
# compute the upper and lower limits on our B1 (slope) parameter
B1_upper = B1 + t * sB1
B1_lower = B1 - t * sB1

# compute the corresponding upper and lower B0 values (y intercepts)
B0_upper = y.mean() - B1_upper*x.mean()
B0_lower = y.mean() - B1_lower*x.mean()

**Plot the data, linear regression model, and confidence intervals for B1**

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7,7), dpi=200, tight_layout=True)

# Scatterplot of original data
ax.scatter(x, y, c='k', label='Original Data')

# Plot the regression line, we only need two points to define a line, use xmin and xmax
ax.plot([x.min(), x.max()], [B0 + B1*x.min(), B0 + B1*x.max()] , '-r', label='Least Squares Linear Regression Model')

# Plot the mean line, we only need two points to define a line, use xmin and xmax
ax.plot([x.min(), x.max()], [y.mean(), y.mean()] , '--m', label='Mean Y')

# Plot the lines corresponding to the upper and lower confidence limits of the slope (B1)
ax.plot([x.min(), x.max()], [B0_upper + B1_upper*x.min(), B0_upper + B1_upper*x.max()] , '--r', label='Upper B1 confidence limit (95%)')
ax.plot([x.min(), x.max()], [B0_lower + B1_lower*x.min(), B0_lower + B1_lower*x.max()] , '--r', label='Lower B1 confidence limit (95%)')


# Add legend
plt.legend(loc='lower right');

# Add axes labels and title
ax.set_xlabel("Annual Mean Flow [cfs]")
ax.set_ylabel("Annual Peak Flow [cfs]")
ax.set_title('Linear Regression Model with Confidence Intervals');

## 2.3. Confidence Interval for Predicted Values of y

**Compute confidence limits for the predicted values of y**

To compute confidence limits on our predicted values of y, we need to predict some values of y first!

For the prediction intervals, I'm naming the variables `p_x` and `p_y`, in the equations below these correspond to $x^*$ and $\hat{y}^*$.

In [None]:
# an array of x values
p_x = np.linspace(x.min(),x.max(),100)

# using our model parameters to predict y values
p_y = B0 + B1*p_x

For some value $x^*$ we want to predict a corresponding $y^*$ using our model.

$\hat{y}^* = \hat{B}_0 + \hat{B}_1x^*$

But what is the uncertainty of the $\hat{y}^*$ we'll calculate? We can compute a prediction interval for a given confidence level (such as 95%).

The error of our prediction is the difference between the "true" value of $y^*$ for $x^*$, and our predicted $\hat{y}^*$:

$(B_0 + B_1x^*) - (\hat{B}_0 + \hat{B}_1x^*)$

The variance of this prediction error ($\sigma_{E_p}^2$) will help define our prediction intervals, and can be computed as follows:

$\sigma_{E_p}^2(x^*) = s^2 \Bigg[ 1 + \displaystyle\frac{1}{n} + \displaystyle\frac{n(x^*-\bar{x})^2}{n \sum{x_i^2} - (\sum{x_i})^2} \Bigg]$

or

$\sigma_{E_p}^2(x^*) = s^2 \Bigg[ 1 + \displaystyle\frac{1}{n} + \displaystyle\frac{(x^*-\bar{x})^2}{SST_x} \Bigg]$
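
The two forms agree because $n \cdot SST_x = n \sum{x_i^2} - (\sum{x_i})^2$. A quick numerical check on synthetic values (an assumption for illustration; `x_eq` and `x_star` are hypothetical names) is sketched below.

```python
import numpy as np

# Synthetic x values (an assumption for illustration only)
rng = np.random.default_rng(3)
x_eq = rng.uniform(100.0, 200.0, size=30)
n_eq = len(x_eq)

# A few points at which we would like to make predictions
x_star = np.linspace(x_eq.min(), x_eq.max(), 5)

# Third term of the bracket, written with raw sums
term_sums = (n_eq * (x_star - x_eq.mean())**2
             / (n_eq * np.sum(x_eq**2) - np.sum(x_eq)**2))

# Third term of the bracket, written with SST_x
sst_x_eq = np.sum((x_eq - x_eq.mean())**2)
term_sst = (x_star - x_eq.mean())**2 / sst_x_eq

print(np.allclose(term_sums, term_sst))
```

The $SST_x$ form is usually the safer one numerically, since the raw-sum form subtracts two large, nearly equal numbers.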

Now compute our error of prediction ($\sigma_{E_p}$) for each p_x:

In [None]:
sigma_ep = np.sqrt( s**2 * (1+ 1/n + ( ( n*(p_x-x.mean())**2 ) / 
                                      ( n*np.sum(x**2) - np.sum(x)**2 ) ) ) )

The lower and upper limits of the prediction interval, which follow a t-distribution with $n-2$ degrees of freedom (where $n$ is the number of original data points), can be computed as:

$\hat{y}^* \pm t_{\frac{\alpha}{2},n-2} \cdot \sigma_{E_p}(x^*)$

In [None]:
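alpha = 0.05

# degrees of freedom come from the original dataset (n - 2),
# not from the number of prediction points
n = len(x)
dof = n - 2

t = stats.t.ppf(1-alpha/2, dof)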

p_y_lower = p_y - t * sigma_ep
p_y_upper = p_y + t * sigma_ep

**Finally, plot the upper and lower confidence limits for the predicted y values**

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7,7), dpi=300, tight_layout=True)

# Scatterplot of original data
ax.scatter(x, y, c='k', label='Original Data')

# Plot the regression line, we only need two points to define a line, use xmin and xmax
ax.plot([x.min(), x.max()], [B0 + B1*x.min(), B0 + B1*x.max()] , '-r', label='Least Squares Linear Regression Model')

# Plot the mean line, we only need two points to define a line, use xmin and xmax
ax.plot([x.min(), x.max()], [y.mean(), y.mean()] , '--m', label='Mean Y')

# Plot the mean x line
plt.axvline(x.mean(),c='k', linestyle='--', label='Mean X Value')

# Plot the lines corresponding to the upper and lower confidence limits of the slope (B1)
ax.plot([x.min(), x.max()], [B0_upper + B1_upper*x.min(), B0_upper + B1_upper*x.max()] , '--r', label='Upper B1 confidence limit (95%)')
ax.plot([x.min(), x.max()], [B0_lower + B1_lower*x.min(), B0_lower + B1_lower*x.max()] , '--r', label='Lower B1 confidence limit (95%)')

# Plot confidence limits on our predicted Y values
ax.plot(p_x, p_y_upper, ':b', label='Upper Y prediction interval (95%)')
ax.plot(p_x, p_y_lower, ':b', label='Lower Y prediction interval (95%)')

# Add legend
plt.legend(loc='lower right');

# Add axes labels and title
ax.set_xlabel("Annual Mean Flow [cfs]")
ax.set_ylabel("Annual Peak Flow [cfs]")
ax.set_title('Linear Regression Model with Confidence Intervals');

Our upper and lower predicted y confidence limits look almost parallel, but are they? 

To inspect this, we can plot the difference between the two versus x to see how our 95% interval changes shape as we move along the x axis, and see that they "pivot" around the mean x value of the original dataset.

In [None]:
p_y_difference = p_y_upper - p_y_lower
plt.figure(figsize=[5,3],dpi=300)
plt.plot(p_x, p_y_difference, label='p_y_difference')
plt.axvline(x.mean(),c='k', linestyle='--', label='Mean X Value')

plt.legend()
plt.xlabel('Prediction Input (Annual Mean Flow, cfs)')
plt.ylabel('Difference Between Upper and Lower\nY Prediction Confidence Bounds ($\\Delta$cfs)')
plt.title('Difference Between Upper and Lower\nY Prediction 95% Confidence Bounds');

As we'd expect, they're not quite parallel (they vary along the x-axis) and are narrowest at $\bar{x}$ where we have higher confidence in our ability to make predictions with the model.
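
The whole procedure can be wrapped in a small function. The sketch below is one possible way to do it, run on synthetic data; `prediction_half_width` is a hypothetical helper, not part of scipy, and it takes its degrees of freedom from the number of observations.

```python
import numpy as np
from scipy import stats

def prediction_half_width(x_obs, y_obs, x_new, alpha=0.05):
    """Half-width of the (1 - alpha) prediction interval at each x_new.

    Hypothetical helper for illustration; not part of scipy.
    """
    x_obs = np.asarray(x_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    n_obs = len(x_obs)

    # Least squares fit and standard error of the estimate
    fit = stats.linregress(x_obs, y_obs)
    resid = y_obs - (fit.intercept + fit.slope * x_obs)
    s_est = np.sqrt(np.sum(resid**2) / (n_obs - 2))

    # Standard deviation of the prediction error at each x_new
    sst_x_obs = np.sum((x_obs - x_obs.mean())**2)
    sigma_pred = s_est * np.sqrt(1 + 1/n_obs + (x_new - x_obs.mean())**2 / sst_x_obs)

    # t-value with n - 2 degrees of freedom from the observations
    t_crit = stats.t.ppf(1 - alpha/2, n_obs - 2)
    return t_crit * sigma_pred

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(4)
x_pi = rng.uniform(0.0, 10.0, size=50)
y_pi = 2.0 + 1.5 * x_pi + rng.normal(0.0, 1.0, size=50)

grid = np.linspace(x_pi.min(), x_pi.max(), 101)
half_width = prediction_half_width(x_pi, y_pi, grid)
x_narrowest = grid[np.argmin(half_width)]

print('interval is narrowest near x = {:.2f}'.format(x_narrowest))
print('mean of x                    = {:.2f}'.format(x_pi.mean()))
```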

---
## 2.4. Linear regression with scipy

**How do we do this quickly in python?**

As always, there are a few options. Two of the easier ones, from packages we have already imported, are:
- `scipy.stats.linregress()` we've used this previously
- `numpy.polyfit()` we can fit a 1st order polynomial (linear function)

I'm going to use the scipy function below (remember, this outputs our standard error of the gradient for us already):

In [None]:
B1, B0, r, p, sB1 = stats.linregress(x, y)
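
For comparison, the other option listed above, `numpy.polyfit()`, should give the same slope and intercept. A minimal sketch on synthetic data (an assumption for illustration; `x_pf` and `y_pf` are hypothetical names):

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(5)
x_pf = rng.uniform(0.0, 100.0, size=80)
y_pf = 10.0 + 0.5 * x_pf + rng.normal(0.0, 5.0, size=80)

# polyfit returns coefficients from the highest power down,
# so a degree-1 fit gives [slope, intercept]
slope_pf, intercept_pf = np.polyfit(x_pf, y_pf, 1)

res_pf = stats.linregress(x_pf, y_pf)

print('polyfit    : slope = {:.4f}, intercept = {:.4f}'.format(slope_pf, intercept_pf))
print('linregress : slope = {:.4f}, intercept = {:.4f}'.format(res_pf.slope, res_pf.intercept))
```

Called this way, `numpy.polyfit()` returns only the coefficients, so `linregress` is the more convenient choice when we also want the R value, p-value, and standard error of the slope.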

Compute the upper and lower limits for the B1 parameter

In [None]:
# our alpha for 95% confidence
alpha = 0.05

# length of the original dataset
n = len(x)
# degrees of freedom
dof = n - 2

# t-value for alpha/2 with n-2 degrees of freedom
t = stats.t.ppf(1-alpha/2, dof)

# compute the upper and lower limits on our B1 (slope) parameter
B1_upper = B1 + t * sB1
B1_lower = B1 - t * sB1

# compute the corresponding upper and lower B0 values (y intercepts)
B0_upper = y.mean() - B1_upper*x.mean()
B0_lower = y.mean() - B1_lower*x.mean()

Create some prediction values, compute our error of prediction (sigma_ep) for each p_x, then compute the lower and upper limits (for 95%):

In [None]:
# an array of x values
p_x = np.linspace(x.min(),x.max(),100)

# using our model parameters to predict y values
p_y = B0 + B1*p_x

# calculate the standard error of the predictions
sigma_ep = np.sqrt( s**2 * (1 + 1/n + ( ( n*(p_x-x.mean())**2 ) / ( n*np.sum(x**2) - np.sum(x)**2 ) ) ) )

# our chosen alpha
alpha = 0.05

# degrees of freedom come from the original dataset (n - 2),
# not from the number of prediction points
dof = len(x) - 2

# get the t-value for our alpha and degrees of freedom
t = stats.t.ppf(1-alpha/2, dof)

# compute the upper and lower limits at each of the p_x values
p_y_lower = p_y - t * sigma_ep
p_y_upper = p_y + t * sigma_ep

**Plot it all again**

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7,7), dpi=300, tight_layout=True)

# Scatterplot of original data
ax.scatter(x, y, c='k', label='Original Data')

# Plot the mean line, we only need two points to define a line, use xmin and xmax
ax.plot([x.min(), x.max()], [y.mean(), y.mean()] , '--m', label='Mean Y')

# Plot the mean x line
plt.axvline(x.mean(),c='k', linestyle='--', label='Mean X Value')

# Plot the linear regression model
ax.plot([x.min(), x.max()], [B0 + B1*x.min(), B0 + B1*x.max()], '-r', label='Least Squares Linear Regression Model')

# Plot the lines corresponding to the upper and lower confidence limits of the slope (B1)
ax.plot([x.min(), x.max()], [B0_upper + B1_upper*x.min(), B0_upper + B1_upper*x.max()] , '--r', label='Upper B1 confidence limit (95%)')
ax.plot([x.min(), x.max()], [B0_lower + B1_lower*x.min(), B0_lower + B1_lower*x.max()] , '--r', label='Lower B1 confidence limit (95%)')

# Plot confidence limits on our predicted Y values
ax.plot(p_x, p_y_upper, ':b', label='Upper Y prediction interval (95%)')
ax.plot(p_x, p_y_lower, ':b', label='Lower Y prediction interval (95%)')

# Add legend
plt.legend(loc='lower right');

# Add axes labels and title
ax.set_xlabel("Annual Mean Flow [cfs]")
ax.set_ylabel("Annual Peak Flow [cfs]")
ax.set_title('Flow Scatterplot');

## 2.5. Quantile Regression

## Steps to create a quantile regression model:

**1)** For each of your two datasets, create an empirical CDF

We can do this with a custom function like the `cunnane_quantile_array()` function below, which gives us quantile values given an array of numbers.

However, in this case, we want to be able to "look up" any quantile value (even those that lie between data points). For this, we can use `scipy.stats.mstats.mquantiles()` instead.

Review the documentation for [scipy.stats.mstats.mquantiles](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.mquantiles.html), recall that the default options give us the Cunnane plotting position. Note how the function handles quantiles as they approach 0 or 1 at the lowest and highest end of our values. How many (quantile) values should we use in the input to this function to create an empirical CDF?

In [None]:
# This function should be able to accept any one-dimensional numpy array or list, of numbers
# It returns two numpy arrays, one of the sorted numbers, the other of the plotting position
def cunnane_quantile_array(numbers):
    '''This function computes the Cunnane plotting position given an array or list of numbers (rather than a pandas dataframe).
    It has two outputs, first the sorted numbers, second the Cunnane plotting position for each of those numbers.
    [Steven Pestana, spestana@uw.edu, Oct. 2020]'''
    
    # 1) sort the data, using the numpy sort function (np.sort())
    sorted_numbers = np.sort(numbers)
    
    # length of the list of numbers
    n = len(sorted_numbers) 
    
    # make an empty array, of the same length. below we will add the plotting position values to this array
    cunnane_plotting_position = np.empty(n)
    
    # 2) compute the Cunnane plotting position for each number, using a for loop and the enumerate function
    for rank, number in enumerate(sorted_numbers):
        cunnane_plotting_position[rank] = ( (rank+1) - (2/5) ) / ( n + (1/5) )
    
    return sorted_numbers, cunnane_plotting_position
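
As a sanity check on the claim that the `mquantiles()` defaults correspond to the Cunnane plotting position, the sketch below computes the plotting positions in vectorized form and asks `mquantiles()` for the values at exactly those probabilities; it should hand back the sorted data. The input values are synthetic (an assumption for illustration).

```python
import numpy as np
from scipy import stats

# Synthetic positive values (an assumption for illustration only)
rng = np.random.default_rng(6)
values = rng.gamma(shape=2.0, scale=100.0, size=25)

# Vectorized Cunnane plotting position: (rank - 2/5) / (n + 1/5)
sorted_values = np.sort(values)
n_values = len(sorted_values)
ranks = np.arange(1, n_values + 1)
cunnane_pp = (ranks - 0.4) / (n_values + 0.2)

# Look up the values at exactly those probabilities with the default
# alphap=0.4, betap=0.4 settings of mquantiles
looked_up = np.asarray(stats.mstats.mquantiles(values, prob=cunnane_pp))

print(np.allclose(looked_up, sorted_values))
```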

We can create both types of quantile plots and look at them together. When building the quantile regression model, we'll use both.

In [None]:
plt.figure(figsize=(10,10),dpi=300)

# Here we use the actual values from the dataset to create the plots
# PF -> annual peak flow
# MF -> annual mean flow
PF_ordered, PF_quantile = cunnane_quantile_array(st_law_flow_df['peakflow_cfs'])
MF_ordered, MF_quantile = cunnane_quantile_array(st_law_flow_df['meanflow_cfs'])
plt.plot(PF_ordered, PF_quantile, 'o', markeredgecolor='b', markerfacecolor='None', markersize=7, label='Annual Peak Flow Quantile Plot from observed values')
plt.plot(MF_ordered, MF_quantile, 'o', markeredgecolor='r', markerfacecolor='None', markersize=7, label='Annual Mean Flow Quantile Plot from observed values')


# We can also create these by picking arbitrary quantile values, then using the scipy.stats.mstats.mquantiles function
quantiles = np.linspace(0,1,100) # 100 quantile values linearly spaced between 0 and 1
plt.plot(stats.mstats.mquantiles(st_law_flow_df['peakflow_cfs'], quantiles), quantiles, 
         'b.', label='Annual Peak Flow Quantile Plot from interpolated probabilities', alpha=0.7)
plt.plot(stats.mstats.mquantiles(st_law_flow_df['meanflow_cfs'], quantiles), quantiles, 
         'r.', label='Annual Mean Flow Quantile Plot from interpolated probabilities', alpha=0.7)

plt.ylabel('Quantile')
plt.xlabel('Flow (cfs)')
# plt.xlim((0,2500))
plt.ylim((0,1))
plt.title('Quantiles of Flow data')
plt.legend(loc="best");

**2)** Use the two empirical CDFs as a way of looking-up (or mapping) values from the predictor to the predictand, by matching which physical value corresponds to the same quantile.

The example below does this with one data point: we start with a value of annual mean flow, look up its quantile, then find the annual peak flow value at the same quantile.

In [None]:
# we will also need this 1d interpolation function
from scipy.interpolate import interp1d

# This is our empirical CDF of the annual mean flow data, which also includes values down to 0 and up to 1.
MF_quantile = np.linspace(0,1,100)
MF_ordered = stats.mstats.mquantiles(st_law_flow_df['meanflow_cfs'], MF_quantile)

# When annual mean flow equals its median, what annual peak flow can we expect?
MF_test = st_law_flow_df['meanflow_cfs'].median()

# Create a linear interpolation object based on these values (this lets us look up any value, x, and get back the y value)
f_MF = interp1d(MF_ordered, MF_quantile)
MF_test_quantile = f_MF(MF_test)

print('In the empirical Annual Mean Flow CDF, '+ 
      'a value of {} cfs (the median) corresponds '.format(MF_test)+
      'to a quantile of {}'.format(np.round(MF_test_quantile,2)))

In [None]:
plt.figure(figsize=(10,10),dpi=300)

# We can also create these by picking arbitrary quantile values, then using the scipy.stats.mstats.mquantiles function
quantiles = np.linspace(0,1,100) # 100 quantile values linearly spaced between 0 and 1
plt.plot(stats.mstats.mquantiles(st_law_flow_df['peakflow_cfs'], quantiles), quantiles, 
         'b.', label='Annual Peak Flow Quantile Plot from interpolated probabilities', alpha=0.7)
plt.plot(stats.mstats.mquantiles(st_law_flow_df['meanflow_cfs'], quantiles), quantiles, 
         'r.', label='Annual Mean Flow Quantile Plot from interpolated probabilities', alpha=0.7)

# Plot the test point value
plt.plot(MF_test,MF_test_quantile,'D', markerfacecolor='m', markeredgecolor='k',markersize=10, label='MF_test ({},{})'.format(MF_test, np.round(MF_test_quantile,2)))
# Plot a line from the x-axis to the test point
plt.plot([MF_test, MF_test], [0, MF_test_quantile], c='m', linestyle='-')
# Plot a line from the test point to the y-axis
plt.plot([0, MF_test], [MF_test_quantile, MF_test_quantile], c='k', linestyle='-')

plt.ylabel('Quantile')
plt.xlabel('Flow (cfs)')
plt.xlim((180000,380000))
plt.ylim((0,1))
plt.title('Quantiles of Flow data')
plt.legend(loc="best");

We see that our test value, the median annual mean flow, corresponds to a quantile value of 0.5.

(Yes, you would hope so, since I defined it as the median to begin with, but it's always best practice to start coding with a situation where you know the right answer.)

Now we need to take this annual mean flow quantile value (0.5) and find the annual peak flow value that corresponds to the same quantile (the median annual peak flow in this case).

We first need to create an interpolation object that lets us translate from annual peak flow quantile values to annual peak flow values:

In [None]:
# This is our empirical CDF of the annual peak flow data, which also includes values down to 0 and up to 1.
PF_quantile = np.linspace(0,1,100)
PF_ordered = stats.mstats.mquantiles(st_law_flow_df['peakflow_cfs'], PF_quantile)

# Create a linear interpolation object based on these values (this lets us look up any quantile and get back the flow value)
# *note we've reversed the order of quantiles and flow values compared to the first interpolation object we created
g_PF = interp1d(PF_quantile, PF_ordered)

# So if we look up a quantile value in our function g_PF()
PF_test = g_PF(MF_test_quantile)

print('In the empirical Annual Peak Flow CDF, ' +
      'a quantile of {} corresponds'.format(np.round(MF_test_quantile,2))+
      ' to a flow value of {} cfs (the median)'.format(PF_test))

Visualize the complete problem:

In [None]:
plt.figure(figsize=(10,10),dpi=300)

# We can also create these by picking arbitrary quantile values, then using the scipy.stats.mstats.mquantiles function
quantiles = np.linspace(0,1,100) # 100 quantile values linearly spaced between 0 and 1
plt.plot(stats.mstats.mquantiles(st_law_flow_df['peakflow_cfs'], quantiles), quantiles, 
         'b.', label='Annual Peak Flow Quantile Plot from interpolated probabilities', alpha=0.7)
plt.plot(stats.mstats.mquantiles(st_law_flow_df['meanflow_cfs'], quantiles), quantiles, 
         'r.', label='Annual Mean Flow Quantile Plot from interpolated probabilities', alpha=0.7)

# Plot the test point value
plt.plot(MF_test,MF_test_quantile,'D', markerfacecolor='m', markeredgecolor='k',markersize=10, label='MF_test ({},{})'.format(MF_test, np.round(MF_test_quantile,2)))
# Plot a line from the x-axis to the test point
plt.plot([MF_test, MF_test], [0, MF_test_quantile], c='m', linestyle='-')
# Plot a line from the test point to the y-axis
plt.plot([0, PF_test], [MF_test_quantile, MF_test_quantile], c='k', linestyle='-')

# Plot the annual peak flow test point value
plt.plot(PF_test,MF_test_quantile,'D', markerfacecolor='c', markeredgecolor='k',markersize=10, label='PF_test ({},{})'.format(PF_test, np.round(MF_test_quantile,2)))
# Plot a line from the test point to the x-axis
plt.plot([PF_test, PF_test], [0, MF_test_quantile], c='c', linestyle='-')

plt.ylabel('Quantile')
plt.xlabel('Flow (cfs)')
plt.xlim((180000,380000))
plt.ylim((0,1))
plt.title('Quantiles of Flow data')
plt.legend(loc="best");

---
### 2.5.1. Apply to full dataset

Now that we've walked through a single-point example, we can apply these steps efficiently to the whole dataset, starting from the beginning:

1) Create empirical CDFs for both data sets

In [None]:
quantiles = np.linspace(0,1,100)

# This is our empirical CDF of the annual mean flow data, which also includes values down to 0 and up to 1.
MF_ordered = stats.mstats.mquantiles(st_law_flow_df['meanflow_cfs'], quantiles)

# This is our empirical CDF of the annual peak flow data, which also includes values down to 0 and up to 1.
PF_ordered = stats.mstats.mquantiles(st_law_flow_df['peakflow_cfs'], quantiles)

2) Use the CDFs to "look up" Annual Mean Flow to predict Annual Peak Flow

In [None]:
# Create our interpolation function for looking up a quantile given a value of annual mean flow
f_MF = interp1d(MF_ordered, quantiles)
# Create our interpolation function for looking up annual peak flow given a quantile
g_PF = interp1d(quantiles, PF_ordered)

# Now, we can create a prediction for every value in the annual mean flow dataset to come up with a matching prediction for annual peak flow
PF_predicted=g_PF( f_MF( st_law_flow_df['meanflow_cfs'] ) )
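
The two look-up steps can also be packaged into one reusable function. The sketch below is one possible version, not the method used in the cells above: `quantile_map` is a hypothetical helper, it swaps `interp1d` for `numpy.interp`, and it is run on synthetic data. One practical difference is that `numpy.interp` clamps inputs outside the observed range to the end values, whereas `interp1d` with its default settings raises an error for them.

```python
import numpy as np
from scipy import stats

def quantile_map(predictor, predictand, new_values, n_quantiles=100):
    """Map predictor values to predictand values through matched quantiles.

    Hypothetical helper for illustration; uses numpy.interp in place of
    scipy.interpolate.interp1d.
    """
    q = np.linspace(0, 1, n_quantiles)
    # Empirical CDFs of both datasets on a common set of quantiles
    predictor_cdf = np.asarray(stats.mstats.mquantiles(predictor, q))
    predictand_cdf = np.asarray(stats.mstats.mquantiles(predictand, q))
    # Step 1: predictor value -> quantile
    new_q = np.interp(new_values, predictor_cdf, q)
    # Step 2: quantile -> predictand value
    return np.interp(new_q, q, predictand_cdf)

# Synthetic data (an assumption for illustration only): the "peak" series
# is an exact increasing function of the "mean" series, so the mapping
# should recover it
rng = np.random.default_rng(7)
mean_demo = rng.uniform(200_000, 300_000, size=60)
peak_demo = 1.2 * mean_demo + 10_000

peak_mapped = quantile_map(mean_demo, peak_demo, mean_demo)
print(np.allclose(peak_mapped, peak_demo))
```

With real data the relationship is not exact, so the mapped values reproduce the distribution of the predictand rather than its year-to-year values.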

Plot the results:

In [None]:
# And we can see how well this did by making a time series plot of our actual and predicted values
# Original data:
plt.figure(figsize=(10,5),dpi=300)
plt.plot(st_law_flow_df['year'],st_law_flow_df['meanflow_cfs'],'b-', label='Annual Mean Flow');
plt.plot(st_law_flow_df['year'],st_law_flow_df['peakflow_cfs'],'r-', label='Annual Peak Flow');

# Predicted with quantile regression from annual mean flow
plt.plot(st_law_flow_df['year'],PF_predicted,'k--', 
         label='Annual Peak Flow Predicted from Quantile Regression')
plt.legend()
plt.title('Timeline of Annual Peak Flow')
plt.xlabel('Year')
plt.ylabel('Flow [cfs]');
