# Lecture 14 - Parameter Estimation, Least Squares 🔲

## C&C Problem 20.25

Find the best-fit linear relationship between annual precipitation (cm) and annual streamflow (m$^3$/s). Data given in the code below.
- Report $r^2$ value and compare to `scipy.stats.linregress` (docs [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html))
- Estimate annual streamflow if precipitation is 120 cm
- If drainage area is 1100 km$^2$, estimate what fraction of precipitation is lost via other processes (evaporation, groundwater infiltration, consumptive use).

In a real application we would do this with many more than 8 data points, and also likely with a hydrologic model instead of linear regression. This is just an example to show the method.

In [None]:
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt

### 🤝 Load & Visualize Data

First, make a scatter plot of the data. Here the points are copied into the code for convenience. For larger datasets, we would read the points from a file using `np.loadtxt` ([docs](https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html)) or similar functions.

In [None]:
# Load Data
P = np.array([88.9, 108.5, 104.1, 139.7, 127, 94, 116.8, 99.1]) # cm
Q = np.array([14.6, 16.7, 15.3, 23.2, 19.5, 16.1, 18.1, 16.6]) # m^3/s

# Plot Data
plt.scatter(P,Q)
plt.xlabel('Precipitation (cm)')
plt.ylabel('Streamflow (m^3/s)')
plt.show()

### 💪 Use least-squares regression to determine the linear fitting parameters $a_0$ and $a_1$:

1. $a_1=\frac{n\sum{x_i y_i}-\sum{x_i}\sum{y_i}}{n\sum{x^2_i}-\big( \sum{x_i} \big)^2}$

2. $a_0 = \bar{y}-a_1 \bar{x}$

Find the best-fit line and report $r^2$. Add to plot. Compare to built-in SciPy function.

In [None]:
# Fitting Parameters
n = #[insert code here]
a1 = #[insert code here]
a0 = #[insert code here]
print('Slope:', a1)
print('Intercept:', a0)
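One possible way to fill in the blanks above is sketched here. It evaluates the two formulas directly with NumPy sums and compares the result to `np.polyfit`, which serves only as an independent reference. The names `x`, `y`, `a1_check`, and `a0_check` are illustrative and not part of the original notebook.

```python
import numpy as np

# Data copied from the problem statement
P = np.array([88.9, 108.5, 104.1, 139.7, 127, 94, 116.8, 99.1])  # cm
Q = np.array([14.6, 16.7, 15.3, 23.2, 19.5, 16.1, 18.1, 16.6])   # m^3/s

x, y = P, Q   # illustrative aliases matching the symbols in the formulas
n = len(x)

# Slope: (n*sum(x*y) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
a1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
# Intercept: ybar - a1*xbar
a0 = np.mean(y) - a1 * np.mean(x)

# Independent check with a degree-1 polynomial fit (returns the slope first)
a1_check, a0_check = np.polyfit(x, y, 1)

print('Slope:', a1, '(polyfit:', a1_check, ')')
print('Intercept:', a0, '(polyfit:', a0_check, ')')
```

The two approaches should agree to within floating-point rounding, since both minimize the same sum of squared residuals.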

### 💪 Calculate the $r^2$ Value

1. $S_t = \sum{(y_i-\bar{y})^2}$
2. $S_r=\sum{(y_i - a_0 - a_1 x_i)^2}$
3. $r^2 = \frac{S_t-S_r}{S_t}$

In [None]:
# r^2 Determination
St = #[insert code here]
Sr = #[insert code here]
print('r2 = ', #[insert code here])
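A minimal sketch of the three steps above, assuming the fit coefficients are already available. Here `np.polyfit` stands in for the hand-computed `a0` and `a1` so that the block runs on its own; that substitution is an assumption of this example, not part of the exercise.

```python
import numpy as np

P = np.array([88.9, 108.5, 104.1, 139.7, 127, 94, 116.8, 99.1])  # cm
Q = np.array([14.6, 16.7, 15.3, 23.2, 19.5, 16.1, 18.1, 16.6])   # m^3/s

# Stand-in for the hand-computed coefficients (polyfit returns the slope first)
a1, a0 = np.polyfit(P, Q, 1)

St = np.sum((Q - np.mean(Q))**2)    # total spread of y around its mean
Sr = np.sum((Q - a0 - a1 * P)**2)   # remaining spread around the fitted line
r2 = (St - Sr) / St                 # fraction of the spread explained by the line

print('r2 = ', r2)
```

For a straight-line fit with an intercept, this $r^2$ equals the square of the Pearson correlation coefficient, which is why it can be compared to `res.rvalue ** 2` from SciPy.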

### 🤝 Compare to SciPy Function

In [None]:
# Compare to SciPy Function
from scipy import stats
res = stats.linregress(P,Q)
print('Scipy function:')
print('Slope: ', res.slope)
print('Intercept: ', res.intercept)
print('r2: ', res.rvalue ** 2)

### 🤝 Add Best-Fit Line to Scatter Plot

In [None]:
# Plot Results
plt.scatter(P,Q)
plt.plot(P, a0 + a1*P, c='red')
plt.xlabel('Precipitation [cm/yr]')
plt.ylabel('Streamflow [m^3/s]')
plt.show()

❓ What is the streamflow if precipitation is 120 cm?

In [None]:
# Use linear regression values to solve for this:


❓ If drainage area is 1100 km$^2$, what fraction of the precipitation is lost via other processes (e.g., evaporation, groundwater infiltration, consumptive use) at P = 120 cm?

In [None]:
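# Total Precipitation Volume in km^3
A_drainage = 1100 #km^2
P_km = 120*1e-5 #km/yr (1 cm = 1e-5 km)
P_total = P_km*A_drainage #km^3/yr

# Total Streamflow Volume in km^3/yr: multiply by (km^3/m^3)*(s/yr)
# Parentheses are needed so the conversion applies to the whole prediction a0 + a1*120
Qtotal = (a0 + a1*120) * (1e-9) * (3.154e7) #km^3/yr

# Fraction of Streamflow
print(f'~{(Qtotal/P_total)*100:0.2f}% of the total annual precipitation becomes streamflow.')
# Fraction lost via other processes
print(f'~{(1 - Qtotal/P_total)*100:0.2f}% of the total annual precipitation is lost via other processes.')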
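The unit conversions are the easiest place to slip, so the sketch below wraps them in a small helper that can be checked on round numbers. The function name `streamflow_fraction`, the constant `SECONDS_PER_YEAR` (which assumes a 365-day year), and the use of `np.polyfit` as a stand-in for the fit are all assumptions of this example.

```python
import numpy as np

SECONDS_PER_YEAR = 365 * 24 * 3600   # assumes a 365-day year

def streamflow_fraction(Q_m3s, P_cm, area_km2):
    """Fraction of the annual precipitation volume that leaves as streamflow."""
    Q_km3_yr = Q_m3s * 1e-9 * SECONDS_PER_YEAR   # m^3/s -> km^3/yr
    P_km3_yr = P_cm * 1e-5 * area_km2            # cm/yr over km^2 -> km^3/yr
    return Q_km3_yr / P_km3_yr

P = np.array([88.9, 108.5, 104.1, 139.7, 127, 94, 116.8, 99.1])  # cm
Q = np.array([14.6, 16.7, 15.3, 23.2, 19.5, 16.1, 18.1, 16.6])   # m^3/s
a1, a0 = np.polyfit(P, Q, 1)   # stand-in for the least-squares fit

Q_120 = a0 + a1 * 120                         # predicted streamflow at P = 120 cm
frac = streamflow_fraction(Q_120, 120, 1100)  # fraction that becomes streamflow

print(f'Streamflow at P = 120 cm: {Q_120:0.2f} m^3/s')
print(f'Fraction lost via other processes: {1 - frac:0.2f}')
```

Keeping the conversion in one function makes it possible to test it separately, for example with 1 m$^3$/s over 1 km$^2$ receiving 100 cm of precipitation.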

## 🤝 Logistic growth with harvesting - ODE

This problem is adapted from the SIMIODE textbook, Project 3.5.2. Given the logistic growth model with harvesting (from L11):

$$ \frac{dP}{dt} = rP \big(1-\frac{P}{K} \big) - hP $$

Along with annual data below for $P$ and $h$ over the period 1978 - 2007, find the best-fit parameters $r$ and $K$ to minimize the sum of squared residuals. Report $r^2$.

In [None]:
# Load Data
P_data = [72148, 73793, 74082, 92912, 82323, 59073, 59920, 48789, 70638, 67462, 68702, 61191, 49599, 46266, 34877, 28827, 21980, 17463, 18057, 22681, 20196, 25776, 23796, 19240, 16495, 12167, 21104, 18871, 21241, 22962]
h_data = [0.18847, 0.149741, 0.21921, 0.17678, 0.28203, 0.34528, 0.20655, 0.33819, 0.14724, 0.19757, 0.23154, 0.20860, 0.33565, 0.29534, 0.33185, 0.35039, 0.28270, 0.19928, 0.18781, 0.19357, 0.18953, 0.17011, 0.15660, 0.28179, 0.25287, 0.25542, 0.08103, 0.087397, 0.081952, 0.10518]

# Plot Data
plt.scatter(np.arange(1978,2007+1),P_data)
plt.xlabel('Year')
plt.ylabel('Population (metric tons)')
plt.show()

Solving our ODE would give us $P(t)$, which we can then compare to `P_data`. However, we don't know what `r` and `K` are. Therefore, we define `params` as an input to our function, where `params = [r_guess, K_guess]`. This lets a gradient-based optimization method vary `r` and `K` while it searches for the values that best fit the data.

Note that we are using Euler's method because we're working with discrete time steps of $\Delta t = 1$ year. Since Euler's method evaluates our derivative $f(t,P)$, we also need to specify `params` as an input.

In [None]:
# Define our Rate Function with `params`
def f(t,P,params):
    r, K = params
    return r * P * (1 - P / K) - h_data[t] * P

# Define Euler's Method with `params`
def euler(f, y0, xmin, xmax, h, params):
    x = np.arange(xmin, xmax+h, h)
    y = np.zeros(len(x))
    y[0] = y0
    for i in range(len(x)-1):
        y[i+1] = y[i] + f(x[i],y[i],params) * h
    return x,y

* Note that the derivative function needs to access `h_data[t]` from the table. This will only work if we are using an integer timestep. The same goes for the evaluation of the squared residuals - the ODE solution must be defined at all points in the data table. More accurate solution methods would require us to evaluate fractional timesteps, which would be more challenging.

* Do not confuse the step size `h` in Euler's method with the harvesting rate `h_data` in the logistic growth equation, which comes from the data table.
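Before fitting, it can help to check the Euler routine on a case with a known answer. The sketch below is a hypothetical test, not part of the original notebook: it replaces the tabulated harvest rate with a constant `h_const`, for which setting $dP/dt = 0$ gives the steady state $P^* = K(1 - h/r)$ when $r > h$. The function `f_const` and all parameter values are illustrative.

```python
import numpy as np

# Euler's method with an extra `params` argument, as defined above
def euler(f, y0, xmin, xmax, h, params):
    x = np.arange(xmin, xmax + h, h)
    y = np.zeros(len(x))
    y[0] = y0
    for i in range(len(x) - 1):
        y[i + 1] = y[i] + f(x[i], y[i], params) * h
    return x, y

# Hypothetical test case: constant harvest rate instead of the data table
def f_const(t, P, params):
    r, K, h_const = params
    return r * P * (1 - P / K) - h_const * P

r, K, h_const = 0.2, 10000.0, 0.1   # illustrative values only
t, P = euler(f_const, 1000.0, 0, 300, 1, params=(r, K, h_const))

P_eq = K * (1 - h_const / r)        # analytical steady state
print('Euler final value:', P[-1])
print('Analytical steady state:', P_eq)
```

If the two numbers agree, the time stepping and the `params` plumbing are working, and any misfit later on comes from the parameters rather than the solver.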

Next, set up the residual function: for a given $(r,K)$ combination, we want to run the ODE solution and return $S_r$. Use the observed `P_data[0]` as the initial condition of the model, and run from $t=0$ to $t=29$ (30 years).

In [None]:
# Define the Sum of Squared Residuals
def Sr(x):
    t,P = euler(f, P_data[0], 0, 29, 1, params=x)
    return np.sum((P_data - P) ** 2)

The input vector `x = (r_guess,K_guess)`. We can evaluate $S_r$ for any combination of parameters. For our guess values, we know:
* We assume $r \in [0,1]$ for the annual growth rate.
* The carrying capacity $K$ is more difficult, but will probably be on the order of the population values in the table, $K \approx 100,000$.

In [None]:
print(Sr([0.2,100000]))

❓ Describe what we did in the code above.

* [insert response here]

Is this even a good value of $S_r$?

Let's see how it changes with different parameter values, in this case a random sample.

In [None]:
# Define the number of random samples
num_samples = 1000
# Define the same number of random guesses for r and K
r_test = np.random.rand(num_samples) # between 0-1
K_test = np.random.rand(num_samples) * 1e6 # between 0 - 1e6

# Solve for Sr for all of these guesses
Sr_test = np.zeros(num_samples) # Initialize array to fill in

for i in range(num_samples):
    Sr_test[i] = Sr([r_test[i], K_test[i]])

In [None]:
# plot the log of Sr_test to see differences across orders of magnitude
plt.scatter(r_test, K_test, c=np.log(Sr_test))
plt.xlabel('r')
plt.ylabel('K')
plt.title('Sum Squared Residuals')
plt.colorbar()
plt.show()

* Note that some of our $r,K$ combinations produce an overflow warning and return `NaN`, likely because the Euler solution diverges for those parameters and the errors exceed the floating-point range.
* From this plot it looks like an optimal parameter combination might fall near $r=0.25, K=300,000$.
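The visual estimate can be cross-checked numerically by picking the random sample with the smallest $S_r$ while skipping the `NaN` entries. The sketch below is self-contained, so it repeats the data and folds the Euler loop into one helper. The helper name `sum_sq_resid`, the fixed random seed, and the use of `np.errstate` to silence warnings are assumptions of this example.

```python
import numpy as np

P_data = np.array([72148, 73793, 74082, 92912, 82323, 59073, 59920, 48789, 70638, 67462,
                   68702, 61191, 49599, 46266, 34877, 28827, 21980, 17463, 18057, 22681,
                   20196, 25776, 23796, 19240, 16495, 12167, 21104, 18871, 21241, 22962])
h_data = np.array([0.18847, 0.149741, 0.21921, 0.17678, 0.28203, 0.34528, 0.20655, 0.33819,
                   0.14724, 0.19757, 0.23154, 0.20860, 0.33565, 0.29534, 0.33185, 0.35039,
                   0.28270, 0.19928, 0.18781, 0.19357, 0.18953, 0.17011, 0.15660, 0.28179,
                   0.25287, 0.25542, 0.08103, 0.087397, 0.081952, 0.10518])

def sum_sq_resid(r, K):
    """Euler-integrate the harvested logistic model (dt = 1 yr) and return Sr."""
    P = np.zeros(len(P_data))
    P[0] = P_data[0]
    for i in range(len(P_data) - 1):
        P[i + 1] = P[i] + r * P[i] * (1 - P[i] / K) - h_data[i] * P[i]
    return np.sum((P_data - P) ** 2)

rng = np.random.default_rng(0)           # fixed seed, chosen only for repeatability
num_samples = 1000
r_test = rng.random(num_samples)         # between 0 and 1
K_test = rng.random(num_samples) * 1e6   # between 0 and 1e6

# Diverging runs overflow to inf or NaN; silence the warnings and keep the values
with np.errstate(all='ignore'):
    Sr_test = np.array([sum_sq_resid(r, K) for r, K in zip(r_test, K_test)])

i_best = np.nanargmin(Sr_test)           # index of the smallest non-NaN Sr
print('Best sample: r =', r_test[i_best], ', K =', K_test[i_best])
print('Fraction of non-finite samples:', np.mean(~np.isfinite(Sr_test)))
```

The best random sample is only a starting point; its value depends on the seed and the number of samples, so it is best used as the initial guess for an optimizer.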

Let's use a gradient-based optimization to find out.

In [None]:
from scipy.optimize import minimize
res = minimize(Sr, x0=[0.25, 300000])
print(res)

* The result seems reasonable ($r=0.19, K=340,000$).
* The message about precision loss could be because the parameters are very different orders of magnitude, or because the `Sr` values are so large.
* The scatter plot also brings up the concept of parameter sensitivity: the value of $r$ has a large effect on the error, but if we find the right $r$, the value of $K$ can take a range of values without changing the error too much. We will cover parameter sensitivity in more detail next class.

Similar to how we plotted the log-transform of `Sr`, it might make sense to optimize `log(Sr)` too.
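A minimal sketch of that idea, assuming two changes that are not in the original notebook: the objective is `log(Sr)`, and $K$ is divided by a hypothetical constant `K_SCALE` so that both parameters are of order one. The names `sum_sq_resid`, `log_objective`, and `res_log`, and the finite penalty returned for diverging runs, are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

P_data = np.array([72148, 73793, 74082, 92912, 82323, 59073, 59920, 48789, 70638, 67462,
                   68702, 61191, 49599, 46266, 34877, 28827, 21980, 17463, 18057, 22681,
                   20196, 25776, 23796, 19240, 16495, 12167, 21104, 18871, 21241, 22962])
h_data = np.array([0.18847, 0.149741, 0.21921, 0.17678, 0.28203, 0.34528, 0.20655, 0.33819,
                   0.14724, 0.19757, 0.23154, 0.20860, 0.33565, 0.29534, 0.33185, 0.35039,
                   0.28270, 0.19928, 0.18781, 0.19357, 0.18953, 0.17011, 0.15660, 0.28179,
                   0.25287, 0.25542, 0.08103, 0.087397, 0.081952, 0.10518])

K_SCALE = 1e5   # hypothetical scaling so r and K/K_SCALE are both order one

def sum_sq_resid(r, K):
    """Euler-integrate the harvested logistic model (dt = 1 yr) and return Sr."""
    P = np.zeros(len(P_data))
    P[0] = P_data[0]
    for i in range(len(P_data) - 1):
        P[i + 1] = P[i] + r * P[i] * (1 - P[i] / K) - h_data[i] * P[i]
    return np.sum((P_data - P) ** 2)

def log_objective(x):
    """x = (r, K / K_SCALE). Returns log(Sr), or a large penalty if the run diverges."""
    r, K_scaled = x
    with np.errstate(all='ignore'):
        s = sum_sq_resid(r, K_scaled * K_SCALE)
    return np.log(s) if (np.isfinite(s) and s > 0) else 1e3

x0 = [0.25, 3.0]                        # same starting guess, with K in units of K_SCALE
res_log = minimize(log_objective, x0)   # default method for unconstrained problems (BFGS)

r_fit, K_fit = res_log.x[0], res_log.x[1] * K_SCALE
print('r =', r_fit, ', K =', K_fit)
print('Sr =', np.exp(res_log.fun))
```

Because the logarithm is monotonic, the minimum of `log(Sr)` is at the same parameters as the minimum of `Sr`; the transformation only changes how the optimizer sees the surface. Whether it removes the precision-loss message here would need to be confirmed by running it.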

In [None]:
best_x = res.x
t,P = euler(f, P_data[0], 0, 29, 1, params=best_x)
plt.scatter(t+1978, P_data)
plt.plot(t+1978, P, c='red')
plt.xlabel('Year')
plt.ylabel('Population [metric tons]')
plt.show()

How does our fitted model look?

* The model is not a smooth curve because of the `h_data` values from the table. With a constant `h` this would look more like our L11 code.

What is the $r^2$ value?

In [None]:
best_Sr = res.fun # minimum Sr value from the optimization
St = np.sum((P_data - np.mean(P_data))**2)
print('r2 = ', (St - best_Sr) / St)