# Worksheet 5-1: Nonlinear Regression

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/WCC-Engineering/ENGR240/blob/main/Class%20Demos%20and%20Activities/Week%205/Worksheet%205-1_template%20nonlinear%20regression.ipynb)

## Overview

In this worksheet, we'll explore three approaches to fitting nonlinear models to data:

1. **Linear regression on transformed data**: Converting a nonlinear model to linear form
2. **Nonlinear regression using scipy.optimize.minimize**: Direct optimization of the sum of squared residuals
3. **Nonlinear regression using scipy.optimize.curve_fit**: A more convenient wrapper for nonlinear regression

We'll apply these techniques to a bacterial growth rate model and compare their results and implementation complexity.

## The Model

We'll work with a bacterial growth rate model that describes how the growth rate depends on substrate concentration:

$$k = k_{max} \frac{c^2}{c_s + c^2}$$

where:
- $k$ is the growth rate (number/day)
- $c$ is the substrate concentration (mg/L)
- $k_{max}$ is the maximum possible growth rate (number/day)
- $c_s$ is the half-saturation constant, with units of (mg/L)²; the growth rate equals $k_{max}/2$ when $c^2 = c_s$

Our goal is to find the values of $k_{max}$ and $c_s$ that best fit our experimental data.
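As a quick numerical check of how the model behaves, the sketch below evaluates it at the half-saturation point and at a very large concentration. The function name `growth_rate` and the parameter values are hypothetical, chosen only for illustration; they are not fitted values.

```python
import numpy as np

def growth_rate(c, kmax, cs):
    """Growth rate model k = kmax * c^2 / (cs + c^2)."""
    return kmax * c**2 / (cs + c**2)

# Hypothetical parameter values, for illustration only (not fitted)
kmax_demo, cs_demo = 10.0, 4.0

# At c = sqrt(cs) the model gives exactly half of kmax
c_half = np.sqrt(cs_demo)
k_half = growth_rate(c_half, kmax_demo, cs_demo)
print(f"k at c = sqrt(cs) = {c_half:.1f} mg/L: {k_half:.2f}")

# At very large c the rate approaches (but never exceeds) kmax
k_large = growth_rate(1000.0, kmax_demo, cs_demo)
print(f"k at c = 1000 mg/L: {k_large:.4f}")
```

This is why $k_{max}$ sets the plateau of the curve while $c_s$ controls how quickly the plateau is approached.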

In [None]:
# Import necessary libraries
import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt

# Set plotting style
plt.style.use('seaborn-v0_8-whitegrid')

# Set random seed for reproducibility
np.random.seed(42)

## Experimental Data

Let's start with our experimental measurements of bacterial growth rate at different substrate concentrations:

In [None]:
# Substrate concentration (mg/L)
c = np.array([0.5, 0.8, 1.5, 2.5, 4.0])

# Growth rate (number/day)
k = np.array([1.0, 2.5, 5.1, 7.3, 9.1])

# Plot the experimental data
plt.figure(figsize=(10, 6))
plt.scatter(c, k, color='black', s=50, label='Experimental data')
plt.xlabel('Substrate Concentration (mg/L)')
plt.ylabel('Growth Rate (number/day)')
plt.title('Bacterial Growth Rate vs. Substrate Concentration')
plt.legend()
plt.grid(True)
plt.show()

Let's define our model function that we'll use throughout this worksheet:

In [None]:
def kmodel(c, kmax, cs):
    """Bacterial growth rate model
    
    Parameters:
    -----------
    c : array_like
        Substrate concentration (mg/L)
    kmax : float
        Maximum growth rate (number/day)
    cs : float
        Half-saturation constant
        
    Returns:
    --------
    k : array_like
        Growth rate (number/day)
    """
    return kmax * c**2 / (cs + c**2)

## Task 1: Linear Regression with Transformed Data

Often, we can transform a nonlinear model into a linear form that allows us to use simple linear regression. For our model:

$$k = k_{max} \frac{c^2}{c_s + c^2}$$

### 1.1 Derive the Linear Transformation

In the cell below, derive a linear transformation of this model. Hint: Start by taking the reciprocal (1/k) of both sides.

**Your derivation here:**

Taking the reciprocal of both sides:

$$\frac{1}{k} = \frac{c_s + c^2}{k_{max} \cdot c^2} = \frac{c_s}{k_{max} \cdot c^2} + \frac{1}{k_{max}}$$

This is now in the form of a linear equation:

$$\frac{1}{k} = \frac{c_s}{k_{max}} \cdot \frac{1}{c^2} + \frac{1}{k_{max}}$$

Which has the form $Y = mX + b$ where:
- $Y = \frac{1}{k}$
- $X = \frac{1}{c^2}$
- $m = \frac{c_s}{k_{max}}$
- $b = \frac{1}{k_{max}}$
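One way to sanity-check the derivation is to generate noise-free data from known parameters, apply the transformation, and confirm that a straight-line fit recovers those parameters. The sketch below does this; the "true" values `kmax_true` and `cs_true` are hypothetical and used only for the check.

```python
import numpy as np

# Hypothetical "true" parameters used to generate noise-free synthetic data
kmax_true, cs_true = 10.0, 2.0
c_syn = np.array([0.5, 0.8, 1.5, 2.5, 4.0])
k_syn = kmax_true * c_syn**2 / (cs_true + c_syn**2)

# Transform (X = 1/c^2, Y = 1/k) and fit a straight line Y = m*X + b
slope, intercept = np.polyfit(1 / c_syn**2, 1 / k_syn, 1)

# Invert the relations b = 1/kmax and m = cs/kmax
kmax_rec = 1 / intercept
cs_rec = slope * kmax_rec
print(f"recovered kmax = {kmax_rec:.4f}, cs = {cs_rec:.4f}")
```

With noise-free data the transformed points lie exactly on a line, so the recovered values should match the true ones to within floating-point error. With real, noisy data the recovery is only approximate.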

### 1.2 Implement the Linear Regression

Now, implement the linear regression using the transformed variables and NumPy's `polyfit` function.

In [None]:
# Transform the data
X = 1/c**2  # X = 1/c²
Y = 1/k     # Y = 1/k

# Use np.polyfit to perform linear regression:
# fit a 1st-degree polynomial (straight line) to X and Y; polyfit returns [slope, intercept]
p = np.polyfit(X, Y, 1)

# Extract the slope (m) and intercept (b)
m, b = p

# Calculate kmax and cs from m and b
# Remember: b = 1/kmax and m = cs/kmax
kmax1 = 1/b
cs1 = m * kmax1

print("Linear Regression Results:")
print(f"m = {m:.4f}, b = {b:.4f}")
print(f"kmax = {kmax1:.4f}")
print(f"cs = {cs1:.4f}")

### 1.3 Visualize the Linear Fit

Let's visualize both the transformed data with the linear fit and the original data with the nonlinear model using our fitted parameters.

In [None]:
# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot the transformed data and linear fit
ax1.scatter(X, Y, color='black', s=50)
X_line = np.linspace(min(X), max(X), 100)
Y_line = m * X_line + b
ax1.plot(X_line, Y_line, 'r-', linewidth=2)
ax1.set_xlabel('1/c² (L²/mg²)')
ax1.set_ylabel('1/k (day/number)')
ax1.set_title('Transformed Data with Linear Fit')
ax1.grid(True)

# Plot the original data and nonlinear model
ax2.scatter(c, k, color='black', s=50)
c_line = np.linspace(0, max(c)*1.2, 100)
k_line = kmodel(c_line, kmax1, cs1)
ax2.plot(c_line, k_line, 'r-', linewidth=2)
ax2.set_xlabel('Substrate Concentration (mg/L)')
ax2.set_ylabel('Growth Rate (number/day)')
ax2.set_title('Original Data with Nonlinear Model')
ax2.grid(True)

plt.tight_layout()
plt.show()

## Task 2: Fit Quality Statistics

Now that we have our first fit, let's evaluate its quality using standard statistical measures. The two most common metrics for assessing curve fit quality are:

1. **Coefficient of Determination (R²)**: Indicates the proportion of variance in the dependent variable that is explained by the model. R² = 1 indicates perfect prediction and R² = 0 means the model does no better than predicting the mean; for a nonlinear model evaluated this way, R² can even be negative if the fit is worse than the mean.

2. **Standard Error of the Estimate (Syx)**: Measures the typical size of the residuals, i.e. the spread of the observed values about the fitted curve, in the same units as $k$. Lower values indicate a better fit.
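Since these two metrics are computed several times in this worksheet, they could be wrapped in a small helper. The sketch below is one possible version; the name `fit_metrics` and its `n_params` argument are assumptions introduced here, not part of the worksheet template. The toy data illustrate three reference cases.

```python
import numpy as np

def fit_metrics(y_obs, y_pred, n_params=2):
    """Return (Sr, R², Syx) for observed vs. predicted values.

    n_params is the number of fitted parameters (hypothetical helper argument).
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    Sr = np.sum((y_obs - y_pred)**2)         # sum of squared residuals
    St = np.sum((y_obs - y_obs.mean())**2)   # total sum of squares about the mean
    r2 = 1 - Sr / St
    syx = np.sqrt(Sr / (len(y_obs) - n_params))
    return Sr, r2, syx

y = np.array([1.0, 2.0, 3.0, 4.0])  # toy data, for illustration only

perfect = fit_metrics(y, y)                             # R² = 1
mean_only = fit_metrics(y, np.full_like(y, y.mean()))   # R² = 0
reversed_pred = fit_metrics(y, y[::-1])                 # R² = -3 here

for label, (Sr, r2, syx) in [("perfect", perfect),
                             ("mean only", mean_only),
                             ("reversed", reversed_pred)]:
    print(f"{label:<10} Sr = {Sr:.2f}, R² = {r2:.2f}, Syx = {syx:.4f}")
```

The last case shows that the formula $R^2 = 1 - S_r/S_t$ returns a negative value whenever the predictions are worse than simply using the mean.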

### 2.1 Calculate Fit Quality Metrics

For our nonlinear model, we need to calculate these metrics using the original (non-transformed) data, even though we performed the fit on transformed data.
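The distinction matters because the straight-line fit minimizes squared errors in $1/k$, not in $k$. The sketch below, which repeats the worksheet data so it runs on its own, prints the residuals in both spaces; the variable names ending in `_lin` are introduced here for illustration.

```python
import numpy as np

# Worksheet data
c = np.array([0.5, 0.8, 1.5, 2.5, 4.0])
k = np.array([1.0, 2.5, 5.1, 7.3, 9.1])

# Straight-line fit on the transformed variables X = 1/c^2, Y = 1/k
m_lin, b_lin = np.polyfit(1 / c**2, 1 / k, 1)
kmax_lin, cs_lin = 1 / b_lin, m_lin / b_lin

# Model predictions in the original space
k_pred_lin = kmax_lin * c**2 / (cs_lin + c**2)

# Residuals in the space where the fit was done (1/k) ...
res_transformed = 1 / k - (m_lin / c**2 + b_lin)
# ... and in the space we actually care about (k)
res_original = k - k_pred_lin

print("Residuals in 1/k:", np.round(res_transformed, 4))
print("Residuals in k:  ", np.round(res_original, 4))
print(f"Sr in 1/k = {np.sum(res_transformed**2):.6f}")
print(f"Sr in k   = {np.sum(res_original**2):.4f}")
```

Because small values of $k$ become large values of $1/k$, low-rate points tend to carry more weight in the transformed fit, so the parameters that are optimal in $1/k$ are generally not the ones that minimize $S_r$ in $k$.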

In [None]:
# Calculate predicted values using our model with kmax1 and cs1
k_pred1 = kmodel(c, kmax1, cs1)

# Calculate residuals (differences between observed and predicted values)
residuals = k - k_pred1

# Sum of Squared Residuals (Sr)
Sr = np.sum(residuals**2)

# Total Sum of Squares (St) - variation around the mean
St = np.sum((k - np.mean(k))**2)

# Coefficient of Determination (R²)
r_squared1 = 1 - Sr/St

# Standard Error of the Estimate (Syx)
# n-2 represents degrees of freedom (n data points minus 2 parameters)
Syx_1 = np.sqrt(Sr/(len(k)-2))

print("Fit Quality Metrics for Linear Regression Approach:")
print(f"Sum of Squared Residuals (Sr) = {Sr:.4f}")
print(f"Total Sum of Squares (St) = {St:.4f}")
print(f"Coefficient of Determination (R²) = {r_squared1:.4f}")
print(f"Standard Error of the Estimate (Syx) = {Syx_1:.4f}")

### 2.2 Visualize the Fit

Let's create a more detailed visualization of our first fit, including the residuals.

In [None]:
# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10), gridspec_kw={'height_ratios': [3, 1]})

# Plot the data and model fit
ax1.scatter(c, k, color='black', s=50, label='Experimental data')
c_line = np.linspace(0, max(c)*1.2, 100)
k_line = kmodel(c_line, kmax1, cs1)
ax1.plot(c_line, k_line, 'r-', linewidth=2, label=f'Model fit: kmax={kmax1:.2f}, cs={cs1:.2f}')

# Add fit quality statistics to the plot
stats_text = f"R² = {r_squared1:.4f}\nSyx = {Syx_1:.4f}"
ax1.text(0.05, 0.95, stats_text, transform=ax1.transAxes, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

ax1.set_xlabel('Substrate Concentration (mg/L)')
ax1.set_ylabel('Growth Rate (number/day)')
ax1.set_title('Bacterial Growth Rate: Data and Model Fit')
ax1.legend()
ax1.grid(True)

# Plot the residuals
ax2.scatter(c, residuals, color='blue', s=50)
ax2.axhline(y=0, color='r', linestyle='-', alpha=0.3)
ax2.set_xlabel('Substrate Concentration (mg/L)')
ax2.set_ylabel('Residuals')
ax2.set_title('Residuals (Observed - Predicted)')
ax2.grid(True)

plt.tight_layout()
plt.show()

## Task 3: Nonlinear Regression using scipy.optimize.minimize

Instead of transforming our model, we can directly fit the nonlinear model to the data using optimization techniques. In this task, we'll use `scipy.optimize.minimize` to minimize the sum of squared residuals.

### 3.1 Define the Objective Function

First, we need to define an objective function that calculates the sum of squared residuals (Sr). This is what we're trying to minimize.

In [None]:
def objective_function(params, c_data, k_data):
    """Calculate the sum of squared residuals between model predictions and data
    
    Parameters:
    -----------
    params : array_like
        Model parameters [kmax, cs]
    c_data : array_like
        Substrate concentration data
    k_data : array_like
        Growth rate data
        
    Returns:
    --------
    Sr : float
        Sum of squared residuals
    """
    # Extract parameters
    kmax, cs = params
    
    # Calculate predicted values
    k_pred = kmodel(c_data, kmax, cs)
    
    # Calculate residuals
    residuals = k_data - k_pred
    
    # Return sum of squared residuals
    return np.sum(residuals**2)

### 3.2 Perform the Optimization

Now, let's use `scipy.optimize.minimize` to find the parameters that minimize our objective function.

In [None]:
# Initial guess for parameters [kmax, cs]
initial_guess = [1, 1]

# Minimize the objective function
result = optimize.minimize(objective_function, 
                          initial_guess, 
                          args=(c, k), 
                          method='Nelder-Mead',
                          tol=1e-8)

# Extract the optimized parameters
kmax2, cs2 = result.x

print("Nonlinear Regression Results (minimize):")
print(f"Optimization successful: {result.success}")
print(f"Function evaluations: {result.nfev}")
print(f"kmax = {kmax2:.4f}")
print(f"cs = {cs2:.4f}")
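Nelder-Mead is a local search, so its answer can depend on the initial guess. One simple check, sketched below, is to repeat the optimization from several starting points and compare the results. The starting points and the name `sum_sq_residuals` are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy import optimize

# Worksheet data
c = np.array([0.5, 0.8, 1.5, 2.5, 4.0])
k = np.array([1.0, 2.5, 5.1, 7.3, 9.1])

def sum_sq_residuals(params, c_data, k_data):
    """Sum of squared residuals for the growth rate model."""
    kmax, cs = params
    k_pred = kmax * c_data**2 / (cs + c_data**2)
    return np.sum((k_data - k_pred)**2)

# Arbitrary starting points [kmax, cs], chosen for illustration
starts = [[1.0, 1.0], [5.0, 2.0], [20.0, 10.0]]

results = []
for guess in starts:
    res = optimize.minimize(sum_sq_residuals, guess, args=(c, k),
                            method='Nelder-Mead', tol=1e-8)
    results.append(res)
    print(f"start={guess}: kmax={res.x[0]:.4f}, cs={res.x[1]:.4f}, "
          f"Sr={res.fun:.4f}, success={res.success}")
```

If the runs agree on the parameters and on `Sr`, that gives some confidence the optimizer has found the same minimum each time. If they disagree, the run with the smallest `Sr` is the better candidate, and the initial guess deserves more care.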

### 3.3 Calculate Fit Quality Metrics

Let's evaluate the quality of this fit.

In [None]:
# Calculate predicted values
k_pred2 = kmodel(c, kmax2, cs2)

# Calculate sum of squared residuals
Sr2 = np.sum((k - k_pred2)**2)

# Calculate coefficient of determination (R²)
r_squared2 = 1 - Sr2/St

# Calculate standard error of the estimate
Syx_2 = np.sqrt(Sr2/(len(k)-2))

print("Fit Quality Metrics for Nonlinear Regression (minimize):")
print(f"Sum of Squared Residuals (Sr) = {Sr2:.4f}")
print(f"Coefficient of Determination (R²) = {r_squared2:.4f}")
print(f"Standard Error of the Estimate (Syx) = {Syx_2:.4f}")

## Task 4: Nonlinear Regression using scipy.optimize.curve_fit

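The `scipy.optimize.curve_fit` function provides a more convenient interface for fitting models to data. Under the hood, it does not call `scipy.optimize.minimize`; it solves a nonlinear least-squares problem directly (by default with the Levenberg-Marquardt algorithm when no bounds are given) and exposes a user-friendly API designed specifically for curve fitting.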
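To make the relationship concrete, the sketch below solves the same problem twice: once with `curve_fit`, and once with `scipy.optimize.least_squares`, which takes a function returning the vector of residuals rather than a scalar sum of squares. The starting guess `[10.0, 2.0]` is an assumption chosen by eye from the data, and `growth_rate` and `residual_vector` are names introduced here.

```python
import numpy as np
from scipy import optimize

# Worksheet data
c = np.array([0.5, 0.8, 1.5, 2.5, 4.0])
k = np.array([1.0, 2.5, 5.1, 7.3, 9.1])

def growth_rate(c, kmax, cs):
    """Growth rate model k = kmax * c^2 / (cs + c^2)."""
    return kmax * c**2 / (cs + c**2)

def residual_vector(params):
    """Residuals (model minus data), one entry per data point."""
    kmax, cs = params
    return growth_rate(c, kmax, cs) - k

start = [10.0, 2.0]  # assumed starting guess, chosen by eye

# Lower-level interface: minimize the sum of squares of residual_vector
ls = optimize.least_squares(residual_vector, x0=start, method='lm')

# Convenience wrapper: pass the model and the data directly
popt, pcov = optimize.curve_fit(growth_rate, c, k, p0=start)

print("least_squares:", ls.x)
print("curve_fit:    ", popt)
```

The two calls should agree closely. What `curve_fit` adds is the bookkeeping: it builds the residual function from the model and data, and it returns the parameter covariance estimate `pcov`.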

### 4.1 Implement the Curve Fit

In [None]:
# Use curve_fit to find the optimal parameters
# The first argument is the model function
# The second and third arguments are the x and y data
# p0 is the initial guess for parameters
popt, pcov = optimize.curve_fit(kmodel, c, k, p0=[1, 1])

# Extract the optimized parameters
kmax3, cs3 = popt

# Extract the parameter covariance matrix
# The diagonal elements are the variances of the parameters
parameter_variances = np.diag(pcov)
parameter_std_dev = np.sqrt(parameter_variances)

print("Nonlinear Regression Results (curve_fit):")
print(f"kmax = {kmax3:.4f} ± {parameter_std_dev[0]:.4f}")
print(f"cs = {cs3:.4f} ± {parameter_std_dev[1]:.4f}")
print("Parameter covariance matrix:")
print(pcov)
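The covariance matrix carries more information than the standard deviations alone. The sketch below, under the usual assumption of independent, normally distributed errors, converts it into a parameter correlation matrix and approximate 95% confidence intervals based on the t-distribution with $n - 2$ degrees of freedom. The starting guess and the helper name `growth_rate` are assumptions made for this illustration.

```python
import numpy as np
from scipy import optimize, stats

# Worksheet data
c = np.array([0.5, 0.8, 1.5, 2.5, 4.0])
k = np.array([1.0, 2.5, 5.1, 7.3, 9.1])

def growth_rate(c, kmax, cs):
    """Growth rate model k = kmax * c^2 / (cs + c^2)."""
    return kmax * c**2 / (cs + c**2)

# Assumed starting guess, chosen by eye from the data
popt, pcov = optimize.curve_fit(growth_rate, c, k, p0=[10.0, 2.0])
perr = np.sqrt(np.diag(pcov))  # one-standard-deviation parameter errors

# Correlation matrix: off-diagonal values near ±1 mean the two
# parameters are hard to determine independently from this data
corr = pcov / np.outer(perr, perr)
print("Parameter correlation matrix:")
print(corr)

# Approximate 95% confidence intervals (linearized, t-distribution)
dof = len(k) - len(popt)           # n data points minus 2 parameters
t_crit = stats.t.ppf(0.975, dof)
for name, value, err in zip(["kmax", "cs"], popt, perr):
    print(f"{name} = {value:.3f}, 95% CI ≈ "
          f"[{value - t_crit * err:.3f}, {value + t_crit * err:.3f}]")
```

With only five data points the intervals are wide, and they rely on a linear approximation of the model near the optimum, so they are best read as rough indications rather than exact bounds.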

### 4.2 Calculate Fit Quality Metrics

In [None]:
# Calculate predicted values
k_pred3 = kmodel(c, kmax3, cs3)

# Calculate sum of squared residuals
Sr3 = np.sum((k - k_pred3)**2)

# Calculate coefficient of determination (R²)
r_squared3 = 1 - Sr3/St

# Calculate standard error of the estimate
Syx_3 = np.sqrt(Sr3/(len(k)-2))

print("Fit Quality Metrics for Nonlinear Regression (curve_fit):")
print(f"Sum of Squared Residuals (Sr) = {Sr3:.4f}")
print(f"Coefficient of Determination (R²) = {r_squared3:.4f}")
print(f"Standard Error of the Estimate (Syx) = {Syx_3:.4f}")

### 4.3 Compare All Three Methods

Let's visualize the results of all three fitting methods on a single plot.

In [None]:
# Define a range of concentrations for plotting
c_plot = np.linspace(0, 5, 100)

# Calculate model predictions for each method
k_model1 = kmodel(c_plot, kmax1, cs1)
k_model2 = kmodel(c_plot, kmax2, cs2)
k_model3 = kmodel(c_plot, kmax3, cs3)

# Create the plot
plt.figure(figsize=(12, 8))

# Plot the data points
plt.scatter(c, k, color='black', s=80, label='Experimental data')

# Plot the model fits
plt.plot(c_plot, k_model1, 'r-', linewidth=2, 
         label=f'Linear regression: kmax={kmax1:.2f}, cs={cs1:.2f}, R²={r_squared1:.4f}')
plt.plot(c_plot, k_model2, 'b--', linewidth=2, 
         label=f'Nonlinear (minimize): kmax={kmax2:.2f}, cs={cs2:.2f}, R²={r_squared2:.4f}')
plt.plot(c_plot, k_model3, 'g-.', linewidth=2, 
         label=f'Nonlinear (curve_fit): kmax={kmax3:.2f}, cs={cs3:.2f}, R²={r_squared3:.4f}')

# Add labels and title
plt.xlabel('Substrate Concentration (mg/L)', fontsize=12)
plt.ylabel('Growth Rate (number/day)', fontsize=12)
plt.title('Comparison of Three Curve Fitting Approaches', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True)

# Show the plot
plt.tight_layout()
plt.show()

## Task 5: Reflection and Discussion

### 5.1 Compare the Results

Let's create a summary table to compare the results of the three methods.

In [None]:
# Create a comparison table
methods = ['Linear Regression (transformed)', 'Nonlinear Regression (minimize)', 'Nonlinear Regression (curve_fit)']
kmax_values = [kmax1, kmax2, kmax3]
cs_values = [cs1, cs2, cs3]
r_squared_values = [r_squared1, r_squared2, r_squared3]
syx_values = [Syx_1, Syx_2, Syx_3]

# Print the comparison table
print("Comparison of Curve Fitting Methods:")
print("-" * 80)
# The Method column is 34 characters wide so the longest method name (32 characters) fits
print(f"{'Method':<34} {'kmax':<10} {'cs':<10} {'R²':<10} {'Syx':<10}")
print("-" * 80)
for i, method in enumerate(methods):
    print(f"{method:<34} {kmax_values[i]:<10.4f} {cs_values[i]:<10.4f} {r_squared_values[i]:<10.4f} {syx_values[i]:<10.4f}")
print("-" * 80)