# ENGR 240: Curve Fitting with Linear and Nonlinear Regression

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/WCC-Engineering/ENGR240-Demos-and-Worksheets/blob/main/Week%205/Worksheet%205-1%20Curve%20Fitting%20Intro.ipynb)

## Introduction

Many engineering problems involve fitting mathematical models to experimental data. In this worksheet, we'll explore three different approaches to fitting a bacterial growth model to a set of experimental data.

### Learning Objectives
- Apply linear regression to transformed nonlinear models
- Implement nonlinear regression using `scipy.optimize.minimize`
- Implement nonlinear regression using `scipy.optimize.curve_fit`
- Calculate and analyze fit quality statistics
- Compare different curve fitting approaches for the same problem

### Mathematical Background

The bacterial growth model we'll be working with relates the growth rate $k$ to substrate concentration $c$ with the following equation:

$$k = k_{max} \frac{c^2}{c_s + c^2}$$

where:
- $k$ = growth rate (number/day)
- $c$ = substrate concentration (mg/L)
- $k_{max}$ = maximum growth rate (number/day)
- $c_s$ = half-saturation constant ((mg/L)$^2$); $k = k_{max}/2$ when $c^2 = c_s$

This relationship is nonlinear in both $c$ and the parameter $c_s$, but it can be linearized through an algebraic transformation. We'll explore how to fit this model using both transformation-based and direct nonlinear regression approaches.

## Setup and Imports

Let's import the necessary libraries and set up our experimental data:

In [None]:
import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt

# Experimental data
c = np.array([0.5, 0.8, 1.5, 2.5, 4.0])  # Substrate concentration (mg/L)
k = np.array([1.0, 2.5, 5.1, 7.3, 9.1])  # Growth rate (number/day)

## Task 1: Linear Regression on Transformed Data

One approach to fitting nonlinear models is to transform them into a linear form that can be analyzed using simple linear regression.

### Step 1: Transform the Equation

Our original model is: $k = k_{max} \frac{c^2}{c_s + c^2}$

To transform this into a linear relationship, let's start by taking the reciprocal of both sides:

$$\frac{1}{k} = \frac{c_s + c^2}{k_{max} \cdot c^2} = \frac{c_s}{k_{max} \cdot c^2} + \frac{1}{k_{max}}$$

This gives us:

$$\frac{1}{k} = \frac{c_s}{k_{max}} \cdot \frac{1}{c^2} + \frac{1}{k_{max}}$$

This has the form $y = m \cdot x + b$, where:
- $y = \frac{1}{k}$
- $x = \frac{1}{c^2}$
- $m = \frac{c_s}{k_{max}}$
- $b = \frac{1}{k_{max}}$
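
As a quick sanity check on the algebra above, the sketch below generates noise-free data from the model and confirms that a straight-line fit to $(1/c^2, 1/k)$ recovers the parameters. The values `kmax_true = 10.0` and `cs_true = 2.0`, the concentrations in `c_demo`, and all `_demo` names are assumptions chosen for illustration; they are not part of the worksheet data.

```python
import numpy as np

# Hypothetical "true" parameters, chosen only for illustration
kmax_true, cs_true = 10.0, 2.0

# Noise-free synthetic data generated from the growth model
c_demo = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
k_demo = kmax_true * c_demo**2 / (cs_true + c_demo**2)

# Straight-line fit in the transformed variables x = 1/c^2, y = 1/k
slope, intercept = np.polyfit(1 / c_demo**2, 1 / k_demo, 1)

# Invert b = 1/kmax and m = cs/kmax to get back the model parameters
kmax_demo = 1 / intercept
cs_demo = slope * kmax_demo

print(f"kmax = {kmax_demo:.4f}, cs = {cs_demo:.4f}")
```

Because the synthetic data contain no noise, the recovered values should match the assumed ones to within rounding error. With real, noisy measurements the agreement will generally not be exact.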

### Step 2: Perform Linear Regression

Now let's use `numpy.polyfit` to find the linear relationship between $\frac{1}{c^2}$ and $\frac{1}{k}$:

In [None]:
# Calculate transformed variables for linear regression
x_transformed = 1 / c**2  # 1/c^2
y_transformed = 1 / k     # 1/k

# Perform linear regression using numpy.polyfit
# polyfit returns [m, b] for the line y = m*x + b
p = np.polyfit(x_transformed, y_transformed, 1)

# Extract the slope (m) and intercept (b)
m, b = p
print(f"Linear regression results: y = {m:.4f}x + {b:.4f}")

# Calculate the original model parameters
kmax1 = 1 / b
cs1 = m * kmax1

print("Derived model parameters:")
print(f"kmax = {kmax1:.4f}")
print(f"cs = {cs1:.4f}")

### Step 3: Visualize the Fit

Let's define a function for our bacterial growth model and plot the fitted curve along with the original data:

In [None]:
def kmodel(c, kmax, cs):
    """Growth rate model: k = kmax * c^2 / (cs + c^2)"""
    return kmax * c**2 / (cs + c**2)

# Create a range of concentrations for smooth curve plotting
c_model = np.linspace(0, 5, 100)

# Calculate predicted growth rates using our fitted parameters
k_pred1 = kmodel(c_model, kmax1, cs1)

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(c, k, 'ko', label='Experimental data')
plt.plot(c_model, k_pred1, 'r-', label='Linear regression (transformed)')
plt.xlabel('Substrate Concentration (mg/L)')
plt.ylabel('Growth Rate (number/day)')
plt.title('Bacterial Growth Model - Linear Regression on Transformed Data')
plt.legend()
plt.grid(True)
plt.show()

## Task 2: Nonlinear Regression using scipy.optimize.minimize

Instead of transforming the model, we can directly fit the nonlinear model by minimizing the squared norm of residuals using `scipy.optimize.minimize`.
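
The overall pattern (write an objective that returns a single misfit value, then hand it to `minimize`) can be sketched on a toy problem that is separate from the worksheet tasks. The model `a * exp(-b * x)`, the synthetic data, and the names `toy_objective`, `toy_result`, `x_demo`, and `y_demo` are all assumptions made for illustration only.

```python
import numpy as np
from scipy import optimize

# Hypothetical toy data generated from y = 2 * exp(-0.5 * x), no noise
x_demo = np.linspace(0, 4, 9)
y_demo = 2.0 * np.exp(-0.5 * x_demo)

def toy_objective(params, x, y):
    """Sum of squared residuals for the toy model y = a * exp(-b * x)."""
    a, b = params
    return np.sum((y - a * np.exp(-b * x))**2)

# minimize passes the extra data arrays to the objective through args
toy_result = optimize.minimize(toy_objective, [1.0, 1.0],
                               args=(x_demo, y_demo),
                               method='Nelder-Mead')

# Check the convergence flag before trusting the parameters
print(toy_result.success, toy_result.x)
```

Checking `success` is a useful habit: a derivative-free method such as Nelder-Mead can stop at its iteration limit without having converged, especially from a poor initial guess.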

### Step 1: Define the Squared Norm of Residuals Function

We need to define a function that calculates the squared norm of residuals (sum of squared differences) between our model predictions and the experimental data. Think of this as the squared Euclidean distance between our model vector and the data vector.
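
To make the "squared Euclidean distance" picture concrete, here is a small sketch on made-up vectors. The values in `observed_demo` and `predicted_demo` are hypothetical and unrelated to the growth data. It suggests that the sum of squared residuals and the square of `np.linalg.norm` are the same quantity.

```python
import numpy as np

# Hypothetical observed and predicted vectors, for illustration only
observed_demo = np.array([1.0, 2.0, 3.0])
predicted_demo = np.array([1.5, 2.0, 2.0])

# Residuals are observed minus predicted: [-0.5, 0.0, 1.0]
residuals_demo = observed_demo - predicted_demo

# Two equivalent ways to compute the squared norm of the residual vector
sum_of_squares = np.sum(residuals_demo**2)
norm_squared = np.linalg.norm(residuals_demo)**2

print(sum_of_squares, norm_squared)
```

Both expressions evaluate to $0.25 + 0 + 1 = 1.25$ up to floating-point rounding.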

Your task is to implement this function following these guidelines:

In [None]:
def residual_norm_squared(params, c_data, k_data):
    """Calculate the squared norm of residuals between model and data

    Parameters:
    -----------
    params : array-like
        Model parameters [kmax, cs]
    c_data : array-like
        Substrate concentration data
    k_data : array-like
        Growth rate data

    Returns:
    --------
    float
        Squared norm of residuals
    """
    # Extract the model parameters from the params array
    # Remember that params[0] is kmax and params[1] is cs

    # Calculate the predicted k values using the kmodel function
    # defined earlier and the current parameter estimates

    # Calculate the residuals (difference between observed and predicted values)

    # Return the squared norm of residuals (sum of squared residuals)
    # Hint: You can use np.sum()

    pass  # Replace with your implementation

### Step 2: Perform Nonlinear Regression using minimize

Now let's use `scipy.optimize.minimize` to find the parameter values that minimize the squared norm of residuals:

In [None]:
# Initial guess for parameters [kmax, cs]
initial_guess = [1.0, 1.0]

# Perform the optimization
result = optimize.minimize(residual_norm_squared, initial_guess,
                           args=(c, k),
                           method='Nelder-Mead',
                           tol=1e-8)

# Extract the optimized parameters
kmax2, cs2 = result.x

print(f"Optimization result: {result.message}")
print(f"Number of function evaluations: {result.nfev}")
print("\nOptimized parameters:")
print(f"kmax = {kmax2:.4f}")
print(f"cs = {cs2:.4f}")

# Add this fit to our plot
k_pred2 = kmodel(c_model, kmax2, cs2)

plt.figure(figsize=(10, 6))
plt.plot(c, k, 'ko', label='Experimental data')
plt.plot(c_model, k_pred1, 'r-', label='Linear regression (transformed)')
plt.plot(c_model, k_pred2, 'b--', label='Nonlinear regression (minimize)')
plt.xlabel('Substrate Concentration (mg/L)')
plt.ylabel('Growth Rate (number/day)')
plt.title('Bacterial Growth Model - Comparison of Fitting Methods')
plt.legend()
plt.grid(True)
plt.show()

## Task 3: Nonlinear Regression using scipy.optimize.curve_fit

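SciPy provides a more specialized function, `curve_fit`, that fits a model function to data directly. It takes the model itself rather than a hand-written objective function, and it also returns an estimate of the parameter covariance matrix. Let's implement this approach: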

In [None]:
# Perform curve fitting
# curve_fit automatically minimizes the sum of squared residuals
popt, pcov = optimize.curve_fit(kmodel, c, k, p0=[1.0, 1.0])

# Extract the optimized parameters
kmax3, cs3 = popt

# Estimate the standard errors of the parameters from the covariance diagonal
perr = np.sqrt(np.diag(pcov))

print("Optimized parameters (curve_fit):")
print(f"kmax = {kmax3:.4f} ± {perr[0]:.4f}")
print(f"cs = {cs3:.4f} ± {perr[1]:.4f}")

# Add this fit to our plot
k_pred3 = kmodel(c_model, kmax3, cs3)

plt.figure(figsize=(10, 6))
plt.plot(c, k, 'ko', label='Experimental data')
plt.plot(c_model, k_pred1, 'r-', label='Linear regression (transformed)')
plt.plot(c_model, k_pred2, 'b--', label='Nonlinear regression (minimize)')
plt.plot(c_model, k_pred3, 'g-.', label='Nonlinear regression (curve_fit)')
plt.xlabel('Substrate Concentration (mg/L)')
plt.ylabel('Growth Rate (number/day)')
plt.title('Bacterial Growth Model - Comparison of Three Fitting Methods')
plt.legend()
plt.grid(True)
plt.show()

## Task 4: Fit Quality Statistics

To evaluate and compare the quality of our fits, we'll calculate two important statistics for each method:

1. **Coefficient of Determination (R²)**: Measures the proportion of variance in the dependent variable that is predictable from the independent variable(s).
   $$R^2 = 1 - \frac{S_r}{S_t}$$
   
   where:
   - $S_r = \sum_i (y_i - f_i)^2$ is the sum of squares of residuals
   - $S_t = \sum_i (y_i - \bar{y})^2$ is the total sum of squares
   - $y_i$ are the observed values
   - $f_i$ are the predicted values
   - $\bar{y}$ is the mean of the observed data

2. **Standard Error of the Estimate (Syx)**: Estimates the typical distance between the observed values and the fitted curve, in the units of the dependent variable.
   $$S_{yx} = \sqrt{\frac{S_r}{n-p}}$$
   
   where:
   - $n$ is the number of observations
   - $p$ is the number of parameters in the model
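
As a worked illustration of both formulas, the sketch below evaluates them on a small made-up dataset. The four observed and predicted values are hypothetical, and `p_demo = 2` is an assumption chosen to mirror a two-parameter model.

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
f_demo = np.array([1.1, 1.9, 3.2, 3.8])
p_demo = 2  # assumed number of fitted parameters

s_r = np.sum((y_demo - f_demo)**2)           # sum of squares of residuals
s_t = np.sum((y_demo - np.mean(y_demo))**2)  # total sum of squares

r2_demo = 1 - s_r / s_t
syx_demo = np.sqrt(s_r / (len(y_demo) - p_demo))

print(f"R^2 = {r2_demo:.4f}, Syx = {syx_demo:.4f}")
```

Here $S_r = 0.10$ and $S_t = 5.0$, so $R^2 = 0.98$ and $S_{yx} = \sqrt{0.10/2} \approx 0.224$.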

### Implement a Function to Calculate Fit Quality Statistics

Your task is to implement a function that calculates these statistics for any model and dataset. This function will be useful not only for this worksheet but also for future assignments involving curve fitting.

In [None]:
def fit_quality_stats(y_observed, y_predicted, num_params):
    """Calculate fit quality statistics for a model

    Parameters:
    -----------
    y_observed : array-like
        Observed data values
    y_predicted : array-like
        Model predicted values
    num_params : int
        Number of parameters in the model

    Returns:
    --------
    tuple
        (r_squared, standard_error)
    """
    # Calculate the mean of the observed data

    # Calculate the total sum of squares (S_t)
    # Hint: This is the sum of squared deviations from the mean

    # Calculate the sum of squares of residuals (S_r)
    # Hint: This is the sum of squared differences between observed and predicted values

    # Calculate the coefficient of determination (R²)
    # R² = 1 - S_r/S_t

    # Calculate the standard error of the estimate (Syx)
    # Syx = sqrt(S_r/(n-p)) where n is the number of data points and p is the number of parameters

    # Return the fit quality statistics as a tuple
    pass  # Replace with your implementation

# Now let's use this function to evaluate all three fitting methods

# Calculate predictions for the actual data points (not the smooth curve)
k_pred1_actual = kmodel(c, kmax1, cs1)
k_pred2_actual = kmodel(c, kmax2, cs2)
k_pred3_actual = kmodel(c, kmax3, cs3)

# Calculate fit quality statistics for each method
# Each model has 2 parameters (kmax and cs)

# Your code here to call fit_quality_stats for each method

# Create a table of results
print("Fit Quality Statistics:")
# Column widths: 34 for the method label (the longest label is 32 characters)
# plus four 10-character numeric columns and four separating spaces = 78
print("-" * 78)
print(f"{'Method':<34} {'R²':>10} {'Syx':>10} {'kmax':>10} {'cs':>10}")
print("-" * 78)
print(f"{'Linear Regression (transformed)':<34} {r_squared1:10.4f} {syx1:10.4f} {kmax1:10.4f} {cs1:10.4f}")
print(f"{'Nonlinear Regression (minimize)':<34} {r_squared2:10.4f} {syx2:10.4f} {kmax2:10.4f} {cs2:10.4f}")
print(f"{'Nonlinear Regression (curve_fit)':<34} {r_squared3:10.4f} {syx3:10.4f} {kmax3:10.4f} {cs3:10.4f}")
print("-" * 78)

# Plot residuals (observed - predicted)
residuals1 = k - k_pred1_actual
residuals2 = k - k_pred2_actual
residuals3 = k - k_pred3_actual

plt.figure(figsize=(10, 6))
plt.plot(c, residuals1, 'ro-', label='Linear regression (transformed)')
plt.plot(c, residuals2, 'bs-', label='Nonlinear regression (minimize)')
plt.plot(c, residuals3, 'g^-', label='Nonlinear regression (curve_fit)')
plt.axhline(y=0, color='k', linestyle='--')
plt.xlabel('Substrate Concentration (mg/L)')
plt.ylabel('Residuals (observed - predicted)')
plt.title('Residuals for Different Fitting Methods')
plt.legend()
plt.grid(True)
plt.show()

## Task 5: Discussion and Reflection

Based on the curve fitting results and quality statistics, consider and respond to the following questions:

1. **Comparison of Methods**: Compare the three fitting methods (transformed linear regression, nonlinear regression with minimize, and nonlinear regression with curve_fit) in terms of:
   - Parameter estimates (kmax and cs)
   - Fit quality (R² and Syx)
   - Ease of implementation
   - Situations in which each method would be preferred

2. **Transformation Impacts**: How does transforming a nonlinear model to a linear form impact the fitting process? What advantages and disadvantages did you observe?

3. **Residuals Analysis**: What can you infer from the patterns in the residuals plot? Are there any trends that indicate potential issues with any of the fitting methods?

Your answers here: