# Least Squares Regression, RSS, RMSE, R-squared

When you have a set of experimental data points $(x_i, y_i)$, where $i$ ranges from 1 to $n$ (the number of data points), and you want to find a mathematical function that best describes the relationship between $x$ and $y$, you are performing **curve fitting** or **regression analysis**. The goal is to find the parameters of a chosen function that make the function's output as close as possible to your observed $y$ values for the corresponding $x$ values.

Let's consider a specific non-linear case and suppose we want to approximate the data with the function:

$$f(x; A, B) = A \cdot (e^{-B \cdot x} - 1) + 100$$

Here, $A$ and $B$ are the parameters that we need to determine from the data points. The '100' is a constant offset in this specific function.

```{note}
For now, we assume that the experimental points carry no measurement errors. If the $y$ values, or both the $x$ and $y$ values, have errors, different algorithms are needed. We will consider such algorithms later.
```

## The Core Idea: Minimizing Differences

> The fundamental idea behind most curve fitting methods is to minimize the "difference" between your experimental $y_i$ values and the $y$ values predicted by your chosen function, $f(x_i; A, B)$. This "difference" is often called the **residual**.

For each data point $(x_i, y_i)$, the residual, $e_i$, is defined as:

$$e_i = y_i - f(x_i; A, B)$$

Our goal is to find the values of $A$ and $B$ that make these residuals, collectively, as small as possible.
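As a minimal sketch, the residuals for one trial pair of parameters can be computed with NumPy. The data values and the trial parameters `A_trial` and `B_trial` below are hypothetical, chosen only for illustration.

```python
import numpy as np

def model(x, A, B):
    """f(x; A, B) = A * (exp(-B * x) - 1) + 100."""
    return A * (np.exp(-B * x) - 1.0) + 100.0

# Hypothetical data points (x_i, y_i); not taken from a real experiment.
x = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
y = np.array([100.0, 81.0, 70.0, 57.0, 51.0])

# Hypothetical trial parameters.
A_trial, B_trial = 50.0, 0.5

# One residual per data point: e_i = y_i - f(x_i; A, B).
residuals = y - model(x, A_trial, B_trial)
print(residuals)
```

Each entry of `residuals` is the vertical distance between a data point and the curve for this particular choice of parameters.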

There are various methods for approximating data, but for continuous functions fitted to points without explicit error bars, the most common and widely used method is **Least Squares Regression**.

## Least Squares Regression

The principle of least squares is to find the parameters (in our case, $A$ and $B$) that **minimize the sum of the squares of the residuals**. Why squares?
* Squaring the residuals ensures that positive and negative differences don't cancel each other out.
* It penalizes larger errors more heavily than smaller errors, which is often desirable.

So, we want to minimize the following quantity, which is the **Residual Sum of Squares (RSS)**:

$$ RSS(A, B) = \sum_{i=1}^{n} (y_i - f(x_i; A, B))^2 $$

Substituting our specific function:

$$ RSS(A, B) = \sum_{i=1}^{n} (y_i - (A \cdot (e^{-B \cdot x_i} - 1) + 100))^2 $$
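The RSS can be treated as an ordinary function of $A$ and $B$. The sketch below evaluates it for two hypothetical parameter pairs on hypothetical data, to show that different parameter choices give different RSS values.

```python
import numpy as np

def model(x, A, B):
    return A * (np.exp(-B * x) - 1.0) + 100.0

def rss(A, B, x, y):
    """Residual sum of squares for the parameters (A, B)."""
    residuals = y - model(x, A, B)
    return float(np.sum(residuals ** 2))

# Hypothetical data, for illustration only.
x = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
y = np.array([100.0, 81.0, 70.0, 57.0, 51.0])

rss_close = rss(50.0, 0.5, x, y)  # parameters that follow the data closely
rss_far = rss(30.0, 0.5, x, y)    # parameters that miss the data
print(rss_close, rss_far)
```

Least squares regression searches for the pair $(A, B)$ that makes this value as small as possible.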

To find the values of $A$ and $B$ that minimize $RSS$, we typically use calculus. We take the partial derivatives of $RSS$ with respect to each parameter ($A$ and $B$), set them equal to zero, and solve the resulting system of equations.

$$\frac{\partial RSS}{\partial A} = 0$$

$$\frac{\partial RSS}{\partial B} = 0$$

For linear regression, these equations are linear and have a direct analytical solution. For non-linear functions like ours (because of the $e^{-B \cdot x}$ term), the equations are non-linear in the parameters and generally have no closed-form solution, so they are solved with iterative numerical optimization algorithms (such as the Levenberg-Marquardt algorithm, the default in `scipy.optimize.curve_fit` for unconstrained problems). We won't derive the partial derivatives for this function here, as the algebra gets quite involved and is typically handled by computational tools. The core idea remains the same: find the $A$ and $B$ at which the slope of the $RSS$ surface is zero.
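A minimal sketch of such a numerical fit with `scipy.optimize.curve_fit` is shown below. The data are synthetic: they are generated from the model with assumed "true" parameters `A_true = 50` and `B_true = 0.5` plus Gaussian noise, and the starting guess `p0` is also an assumption.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, A, B):
    return A * (np.exp(-B * x) - 1.0) + 100.0

# Synthetic data: assumed true parameters plus random noise.
rng = np.random.default_rng(0)
A_true, B_true = 50.0, 0.5
x = np.linspace(0.0, 10.0, 30)
y = model(x, A_true, B_true) + rng.normal(scale=0.5, size=x.size)

# Iterative least-squares fit starting from the initial guess p0.
popt, pcov = curve_fit(model, x, y, p0=[40.0, 0.3])
A_fit, B_fit = popt

rss_fit = np.sum((y - model(x, A_fit, B_fit)) ** 2)
rss_true = np.sum((y - model(x, A_true, B_true)) ** 2)
print(A_fit, B_fit, rss_fit, rss_true)
```

The fitted parameters usually differ slightly from the values used to generate the data, because the fit minimizes the RSS for this particular noisy sample.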

Once you've found the best-fit parameters $A$ and $B$, you need to evaluate how "good" your approximation is. This is where metrics like RSS, RMSE, and R-squared come in.

## Residual Sum of Squares (RSS)

As defined above, RSS is:

$$RSS = \sum_{i=1}^{n} (y_i - f(x_i; A, B))^2$$

RSS is a direct measure of the total discrepancy between your observed data points and your fitted function. A smaller RSS indicates a better fit to the data.

**Understanding:**
* It's always non-negative.
* Its units are the square of the units of $y$.
* It's an absolute, scale-dependent quantity: RSS cannot be compared directly between datasets with different numbers of data points or vastly different scales of $y$.

## R-squared (Coefficient of Determination)

R-squared is a very popular metric because it provides a standardized measure of how well your model explains the variability in the dependent variable $y$.

First, let's re-state the key sums of squares:

1.  **Total Sum of Squares (TSS):** This measures the total variability in the observed $y$ values around their mean $\bar{y}$. It represents how much the $y$ values vary in total, without considering any model.

    $$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

    where $y_i$ are the observed data points and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ is the mean of the observed $y$ values. Equivalently, TSS is the sum of squared differences you would get by approximating every $y_i$ with the mean $\bar{y}$ (which is essentially a horizontal line).


2.  **Residual Sum of Squares (RSS) / Sum of Squared Residuals (SSR) / Sum of Squares Error (SSE):** This measures the variability in the observed $y$ values that is *not* explained by the regression model. It's the sum of the squared differences between the observed $y_i$ and the predicted $f(x_i)$ (often denoted as $\hat{y}_i$).

    $$RSS = \sum_{i=1}^{n} (y_i - f(x_i))^2$$

    where $f(x_i)$ (or $\hat{y}_i$) are the predicted values from your model.

3.  **Sum of Squares due to Regression (SSR) / Explained Sum of Squares (ESS):** This measures the variability in the dependent variable ($y$) that *is* explained by the regression model. It's the sum of the squared differences between the predicted values $f(x_i)$ and the mean of the observed $y$ values $\bar{y}$.

    $$ESS = \sum_{i=1}^{n} (f(x_i) - \bar{y})^2$$

    NOTE: There can be some confusion with the acronym "SSR" as it may refer to "Sum of Squares due to Regression" or "Sum of Squared Residuals". For these materials, we will use Residual Sum of Squares (RSS) and Explained Sum of Squares (ESS) to avoid any confusion.

R-squared is defined as:

$$R^2 = 1 - \frac{RSS}{TSS}$$

**Meaning:** R-squared tells you the proportion of the variance in the dependent variable ($y$) that is predictable from the independent variable ($x$) using your regression model. In simpler terms, it indicates how much of the variation in $y$ can be explained by your chosen function.

**Understanding:**
* $R^2$ ranges from 0 to 1 (or 0% to 100%) for ordinary least squares with an intercept.
* An $R^2$ of 1 (or 100%) means that your model perfectly explains all the variability in $y$. The residuals are all zero, and the function passes through every data point. This is rare in experimental data.
* An $R^2$ of 0 means that your model explains none of the variability in $y$. In this case, your model performs no better than simply predicting the mean of $y$ for all $x$ values.
* A higher $R^2$ generally indicates a better fit.
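* NOTE: For non-linear regression, or for linear models without an intercept, $R^2$ computed as $1 - RSS/TSS$ can be less than zero. It can never exceed 1, because $RSS \ge 0$. The alternative definition $ESS/TSS$ can exceed 1 in those cases.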
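The definition $R^2 = 1 - RSS/TSS$ translates directly into code. The sketch below uses a hypothetical helper `r_squared` and hypothetical data, fits a straight line with `np.polyfit`, and also evaluates the mean-only baseline, which gives $R^2 = 0$.

```python
import numpy as np

def r_squared(y, y_pred):
    """R^2 = 1 - RSS / TSS."""
    rss = np.sum((y - y_pred) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - rss / tss

# Hypothetical, nearly linear data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Ordinary least squares line with an intercept.
slope, intercept = np.polyfit(x, y, 1)
r2_line = r_squared(y, slope * x + intercept)

# Baseline model: predict the mean of y everywhere.
r2_mean = r_squared(y, np.full_like(y, np.mean(y)))
print(r2_line, r2_mean)
```

For this dataset the straight line gives an $R^2$ close to 1, while the baseline gives exactly 0 because its RSS equals the TSS.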

**Interpretation Caveats:**
* A high $R^2$ doesn't necessarily mean the model is "correct" or that the chosen function is the true underlying relationship. It just means it explains a lot of the variance.
* Adding more parameters to a model will generally increase $R^2$, even if those parameters don't significantly improve the model's predictive power (this is why **adjusted R-squared** is sometimes used, which penalizes for added complexity).
* **IMPORTANT**: $R^2$ is most appropriate for linear models. For non-linear models, using $R^2$ as the primary metric is not recommended because it is often misleading.
* It's possible to have a low $R^2$ for a valid model if the inherent variability in the data (noise) is very high, even if the model captures the underlying trend.

## Adjusted R-squared $R^2_{adj}$

The standard R-squared $R^2$ measures the proportion of variance in the dependent variable that is explained by the independent variables in a regression model. While useful, $R^2$ has a significant drawback: **it always increases or stays the same when you add more parameters to a least-squares model, even if those new parameters do not genuinely improve the model's explanatory power.** This can lead to misleading conclusions, as a more complex model might appear better simply because it has more terms, not because it's truly a better fit to the underlying phenomenon.

**Adjusted R-squared** addresses this issue by penalizing the inclusion of unnecessary parameters. It adjusts the $R^2$ value based on the number of estimated parameters in the model and the number of data points.

Imagine you have a model with a certain $R^2$. If you add a new parameter that genuinely helps explain the variance in $y$, the $RSS$ will decrease significantly, and the $R^2$ will increase. However, if you add a new parameter that is irrelevant (e.g., random noise), $RSS$ will still decrease slightly (due to random chance or fitting noise), causing $R^2$ to increase, but this increase is not meaningful. Adjusted $R^2$ accounts for this by considering the degrees of freedom.

The formula for adjusted R-squared $R^2_{adj}$ is:

$$R^2_{adj} = 1 - \frac{RSS / (n - k)}{TSS / (n - 1)}$$

Let's break down the components and explain why this formula works:

* $n$: The number of data points (observations).
* $k$: The number of estimated parameters in the model.
* $RSS$: Residual Sum of Squares (unexplained variation).
* $TSS$: Total Sum of Squares (total variation).

To understand the division terms, we need to introduce the concept of **degrees of freedom (df)**:

* **Degrees of freedom for residuals ($df_{res}$):** This is the number of data points minus the number of parameters estimated by the model. So, $df_{res} = n - k$.
* **Degrees of freedom for total variation ($df_{tot}$):** This is the number of data points minus 1 (because the mean $\bar{y}$ is estimated from the data). So, $df_{tot} = n - 1$.

Now, let's rewrite the formula using degrees of freedom:

$$R^2_{adj} = 1 - \frac{RSS / df_{res}}{TSS / df_{tot}}$$

This can also be expressed in terms of **Mean Squared Error (MSE)**:

* **Mean Squared Error of Residuals $MSE$:** This is the residual sum of squares per residual degree of freedom.
    $MSE = \frac{RSS}{n - k}$
* **Mean Squared Total $MST$:** This is the sample variance of $y$.
    $MST = \frac{TSS}{n - 1}$

Substituting these into the adjusted R-squared formula:

$$R^2_{adj} = 1 - \frac{MSE}{MST}$$

**Why this adjustment works:**

* **Penalizing Complexity:** When you add an irrelevant parameter, $RSS$ will decrease only slightly, but $k$ (the number of estimated parameters) increases by 1. This means $n - k$ (the denominator of $MSE$) decreases. If the decrease in $RSS$ is not large enough to offset the decrease in $n - k$, then $MSE$ *increases* and $R^2_{adj}$ decreases. This is the "penalty" for adding useless parameters.
* **Fair Comparison:** Adjusted $R^2$ allows for a more fair comparison between models with different numbers of parameters. A model with a higher adjusted $R^2$ is generally preferred, as it suggests a better fit that is not merely a result of adding more terms.

You can also derive adjusted R-squared from standard R-squared:

$$R^2 = 1 - \frac{RSS}{TSS}$$
$$\frac{RSS}{TSS} = 1 - R^2$$
$$RSS = (1 - R^2) \cdot TSS$$

Substitute $RSS$ into the adjusted R-squared formula:

$$R^2_{adj} = 1 - \frac{(1 - R^2) \cdot TSS / (n - k)}{TSS / (n - 1)}$$

Cancel out $TSS$:

$$R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - k}$$

This form clearly shows how $R^2_{adj}$ relates to $R^2$ and the degrees of freedom.
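The penalty can be seen on a small example. The sketch below uses hypothetical, nearly linear data and a hypothetical helper `r2_and_adjusted`, and compares a straight line ($k = 2$) with a quadratic ($k = 3$), both fitted with `np.polyfit`.

```python
import numpy as np

def r2_and_adjusted(y, y_pred, k):
    """Return (R^2, adjusted R^2) for a model with k estimated parameters."""
    n = y.size
    rss = np.sum((y - y_pred) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (rss / (n - k)) / (tss / (n - 1))
    return r2, r2_adj

# Hypothetical, nearly linear data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

line = np.polyval(np.polyfit(x, y, 1), x)  # 2 parameters
quad = np.polyval(np.polyfit(x, y, 2), x)  # 3 parameters

r2_line, adj_line = r2_and_adjusted(y, line, k=2)
r2_quad, adj_quad = r2_and_adjusted(y, quad, k=3)
print(r2_line, adj_line)
print(r2_quad, adj_quad)
```

For this dataset the extra quadratic term raises $R^2$ slightly but lowers $R^2_{adj}$, because the small drop in $RSS$ does not compensate for the lost degree of freedom.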

## R-squared for Non-Linear Models

**Adjusted R-squared can be used for both linear and non-linear functions. However, it should be used with caution for non-linear functions/models.**

* **Linear Functions:** Adjusted R-squared is very commonly used in linear regression. It's the preferred metric over standard $R^2$ when comparing linear models with different numbers of parameters, as it helps identify models that are parsimonious (simple yet effective).

* **Non-Linear Functions:** The principles behind adjusted R-squared (penalizing for model complexity and providing a more robust measure of explained variance) apply equally to non-linear models, especially when they are fitted using the least squares method (for example, using `scipy.optimize.curve_fit`).

Technically, we can compute adjusted R-squared for any model fitted using least squares methods, including non-linear least squares.

However, for non-linear models, adjusted R-squared as well as R-squared can behave unexpectedly:
* It may not represent the proportion of variance explained in the same intuitive way
* It can sometimes be negative (and the $ESS/TSS$ form of $R^2$ can exceed 1)
* The penalty for additional parameters may not adequately capture model complexity in non-linear cases

Therefore, when you are comparing different non-linear models, or trying to decide if adding another parameter to your non-linear function is truly beneficial, adjusted R-squared is a more appropriate metric than the standard R-squared. However, the general recommendation is to use adjusted R-squared with caution for non-linear models.

For a linear model that includes an intercept term (e.g., $f(x) = \beta_0 + \beta_1 x$), the following identity is always true:

$$ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (f(x_i) - \bar{y})^2 + \sum_{i=1}^{n} (y_i - f(x_i))^2 $$

Or, more simply:

$$ TSS = ESS + RSS $$

(Total Sum of Squares = Explained Sum of Squares + Residual Sum of Squares)

This identity holds because, in linear OLS regression, the residuals $e_i = y_i - f(x_i)$ are mathematically guaranteed to be uncorrelated with the predicted values $f(x_i)$. This leads to a key cross-product term being exactly zero. Let's show this:

Start with the definition of TSS:

$$ TSS = \sum (y_i - \bar{y})^2 $$

We can add and subtract the predicted value $f(x_i)$ inside the parentheses:

$$ TSS = \sum (y_i - f(x_i) + f(x_i) - \bar{y})^2 $$

Group the terms:

$$ TSS = \sum ( (y_i - f(x_i)) + (f(x_i) - \bar{y}) )^2 $$

Expand the square:

$$ TSS = \sum (y_i - f(x_i))^2 + \sum (f(x_i) - \bar{y})^2 + 2 \sum (y_i - f(x_i))(f(x_i) - \bar{y}) $$
$$ TSS = RSS + ESS + 2 \sum e_i (f(x_i) - \bar{y}) $$

For this identity to hold, the final cross-product term must be zero. In OLS linear regression with an intercept, setting the partial derivatives of the RSS to zero ensures that:
* $\sum e_i f(x_i) = 0$
* $\sum e_i = 0$

Since $\bar{y}$ is a constant, the cross-product term $2 \sum (y_i - f(x_i))(f(x_i) - \bar{y})$ simplifies to $2 (\sum e_i f(x_i) - \bar{y} \sum e_i) = 2 (0 - \bar{y} \cdot 0) = 0$.

Thus, for linear OLS, we have the clean decomposition: $TSS = ESS + RSS$.

This allows $R^2$ to be interpreted as the proportion of variance explained:

$$ R^2 = \frac{ESS}{TSS} = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS} $$
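The decomposition can be checked numerically. The sketch below fits a straight line with an intercept to hypothetical data using `np.polyfit` and confirms, up to floating-point rounding, that the cross-product term vanishes and that both expressions for $R^2$ agree.

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# OLS straight line with an intercept.
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept
e = y - y_pred
y_bar = np.mean(y)

tss = np.sum((y - y_bar) ** 2)
ess = np.sum((y_pred - y_bar) ** 2)
rss = np.sum(e ** 2)
cross = 2.0 * np.sum(e * (y_pred - y_bar))

print(tss, ess + rss)            # equal up to rounding
print(cross)                     # zero up to rounding
print(ess / tss, 1 - rss / tss)  # the two forms of R^2 agree
```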


**Why This Fails for Non-Linear Regression**

When you fit a non-linear model using an iterative method like Levenberg-Marquardt, the algorithm minimizes the RSS. At the minimum, the residuals are orthogonal to the partial derivatives of $f$ with respect to each parameter, but this **does not guarantee** that they are uncorrelated with the predicted values themselves, or that they sum to zero. The cross-product term $\sum (y_i - f(x_i))(f(x_i) - \bar{y})$ is therefore generally **not zero**.

As a result, the neat identity breaks down:

$$ TSS \neq ESS + RSS \quad (\text{for non-linear models}) $$
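The same check can be repeated for the non-linear model from the beginning of this page. The sketch below uses synthetic data (assumed true parameters 50 and 0.5 plus Gaussian noise) and an assumed starting guess. The gap between $TSS$ and $ESS + RSS$ is exactly twice the cross-product term.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, A, B):
    return A * (np.exp(-B * x) - 1.0) + 100.0

# Synthetic noisy data generated from assumed parameters.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 20)
y = model(x, 50.0, 0.5) + rng.normal(scale=2.0, size=x.size)

popt, _ = curve_fit(model, x, y, p0=[40.0, 0.3])
y_pred = model(x, *popt)
e = y - y_pred
y_bar = np.mean(y)

tss = np.sum((y - y_bar) ** 2)
ess = np.sum((y_pred - y_bar) ** 2)
rss = np.sum(e ** 2)

gap = tss - (ess + rss)
cross = 2.0 * np.sum(e * (y_pred - y_bar))
print(gap, cross)                # the same number, generally not zero
print(ess / tss, 1 - rss / tss)  # the two forms of R^2 differ
```

For a well-fitting model the gap is often small compared with $TSS$, but it is not forced to zero the way it is in linear OLS with an intercept.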

**Consequences for R-squared Interpretation:**

1.  **Proportion of Variance is Lost:** Since $TSS$ no longer neatly partitions into ESS and RSS, the formula $R^2 = 1 - RSS/TSS$ can no longer be interpreted as the "proportion of variance explained." It is simply a statement comparing the model's error ($RSS$) to the error of a baseline model (a horizontal line at $\bar{y}$, whose error is $TSS$).

2.  **R-squared Can Be Negative:** If a non-linear model provides a worse fit to the data than a simple horizontal line at the mean $\bar{y}$, the $RSS$ can be larger than the TSS. If $RSS > TSS$, then $RSS/TSS > 1$, resulting in a **negative R-squared**.

A negative $R^2$ is a clear signal that the chosen non-linear model is a very poor fit for the data, performing worse than the most basic baseline model. For this reason, while you can calculate a value for $R^2$ for any model, its interpretation must be handled with extreme care outside of linear regression. It is better viewed as a comparative metric rather than an absolute measure of explained variance.

If we're going to use $R^2$ for the non-linear regression, then we need to select between these two formulas:

$$R^2 = \frac{ESS}{TSS}$$

or

$$R^2 = 1 - \frac{RSS}{TSS}$$

The first formula cannot be negative but can exceed 1. The second formula cannot exceed 1 but can be negative. While the use of $R^2$ is not recommended for non-linear regression, if you still decide to use this metric, the recommendation is to use the second formula:

$$R^2 = 1 - \frac{RSS}{TSS}$$

This formula is directly tied to the quantity ($RSS$) that is minimized during the regression. It compares the sum of the squared errors of your model ($RSS$) to the sum of the squared errors you would get from a very simple model that just predicts the mean of the data ($TSS$). This gives $R^2$ a clear meaning even for non-linear models: it is the relative reduction in squared error compared with the mean-only baseline, rather than a strict proportion of variance explained. An $R^2$ of 1 means your model has zero residual error, while an $R^2$ of 0 means your model is no better than simply predicting the mean. Also, this formula ensures that $R^2$ will not exceed 1. While it can become negative if the model is a very poor fit (i.e., $RSS > TSS$), this is a valid and meaningful result that indicates the model is worse than a simple horizontal line at the mean.
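The difference between the two formulas is easiest to see on a deliberately bad model. In the sketch below, `y_pred` holds hypothetical predictions that run against the trend of the data. They are not the result of a least-squares fit and serve only to illustrate the arithmetic.

```python
import numpy as np

# Hypothetical observations and deliberately poor predictions.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([10.0, 8.0, 6.0, 4.0, 2.0])

y_bar = np.mean(y)
tss = np.sum((y - y_bar) ** 2)       # 10.0
rss = np.sum((y - y_pred) ** 2)      # 135.0
ess = np.sum((y_pred - y_bar) ** 2)  # 85.0

r2_from_ess = ess / tss        # 8.5, exceeds 1
r2_from_rss = 1.0 - rss / tss  # -12.5, negative
print(r2_from_ess, r2_from_rss)
```

The $ESS/TSS$ form reports a value far above 1 and hides the fact that the predictions are poor, while $1 - RSS/TSS$ becomes strongly negative and flags the model as worse than the mean.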

## Root Mean Squared Error (RMSE) and Standard Error of the Regression (SER)

RMSE is derived directly from RSS and is often more interpretable:

$$RMSE = \sqrt{\frac{RSS}{n}}$$

RMSE represents the typical magnitude of the residuals (their root-mean-square value). It gives you a sense of the typical "error" your model makes in predicting $y$.

**Understanding:**
* It's in the same units as your dependent variable $y$. This makes it easier to interpret: "Our predictions are typically off by about RMSE units of $y$."
* It's sensitive to outliers because of the squaring of errors. Large errors contribute disproportionately to RMSE.
* Like RSS, a smaller RMSE indicates a better fit.
* You can compare RMSE values between different models **on the same dataset** to see which one performs better, provided the models have a similar number of parameters. Comparing RMSE across different datasets or datasets with vastly different scales of $y$ can still be misleading.

While the Root Mean Squared Error (RMSE) provides a direct and intuitive measure of the average error magnitude, you may encounter a closely related metric known as the **Standard Error of the Regression (SER)**, also called the *Residual Standard Error* or *Standard Error of the Estimate*.

The formula for the SER is:

$$ SER = \sqrt{\frac{RSS}{n - k}} $$

Let's break down the components and the reasoning behind this formula:

*   **$RSS$ (Residual Sum of Squares):** As before, this is the sum of the squared differences between the observed and predicted values.
*   **$n$:** The number of data points.
*   **$k$:** The number of estimated parameters in the model.
*   **$n - k$:** This term represents the **degrees of freedom** of the residuals.

**Derivation and Rationale: Why $n - k$?**

The concept of degrees of freedom is central to understanding the SER. When we estimate the $k$ parameters of our model from the data, we "use up" $k$ pieces of information. This leaves $(n - k)$ independent pieces of information (the residuals) to estimate the variance of the underlying error in our model.

The term $RSS / (n - k)$ is the **Mean Squared Error (MSE)**, which is considered an **unbiased estimator** of the variance of the random errors $\sigma^2$ in the data. An unbiased estimator is one whose expected value is equal to the true population parameter it is trying to estimate. We discuss unbiased estimators in more detail [here](variance-covariance.ipynb).

Let's formalize this. Assume our model is fundamentally correct and that the observed values $y_i$ are composed of the true function value plus a random error term $\epsilon_i$, which is assumed to have a mean of 0 and a variance of $\sigma^2$.

$$ y_i = f(x_i; \beta_1, ..., \beta_k) + \epsilon_i $$

The RSS is the sum of the squared *sample* residuals, not the true errors $\epsilon_i$. It can be shown mathematically that the expected value of the RSS is (exactly for models that are linear in their parameters, and approximately for non-linear models):

$$ E[RSS] = (n - k)\sigma^2 $$

NOTE: We do not show this derivation here. A derivation for a similar case is available [here](variance-covariance.ipynb).

Therefore, to get an unbiased estimate of the true error variance $\sigma^2$, we must divide RSS by the degrees of freedom:

$$ \hat{\sigma}^2 = MSE = \frac{RSS}{n - k} $$

The Standard Error of the Regression (SER) is simply the square root of this unbiased variance estimate:

$$ SER = \hat{\sigma} = \sqrt{\frac{RSS}{n - k}} $$
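The relation $E[RSS] = (n - k)\sigma^2$ can be checked with a small simulation. The sketch below assumes a straight-line model ($k = 2$) with hypothetical true coefficients and noise level, refits it to many simulated datasets with `np.linalg.lstsq`, and averages the resulting RSS values.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k, sigma = 10, 2, 1.5  # assumed sample size, parameter count, noise level
n_trials = 20000

x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])  # design matrix: intercept and slope
beta_true = np.array([2.0, 3.0])      # hypothetical true coefficients

# Each column of Y is one simulated dataset.
noise = rng.normal(scale=sigma, size=(n, n_trials))
Y = (X @ beta_true)[:, None] + noise

# Least-squares fit of every dataset at once.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
rss = np.sum((Y - X @ beta_hat) ** 2, axis=0)

print(np.mean(rss), (n - k) * sigma**2)   # average RSS vs (n - k) * sigma^2
print(np.mean(rss / (n - k)), sigma**2)   # RSS / (n - k) is close to sigma^2
print(np.mean(rss / n), sigma**2)         # RSS / n underestimates sigma^2
```

Dividing by $n$ instead of $n - k$ gives an estimate that is too small on average, which is the reason for the degrees-of-freedom correction in the SER.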

**Comparison: RMSE vs. SER**

*   **RMSE:**

    $$ RMSE = \sqrt{\frac{RSS}{n}} $$

    *   **Interpretation:** The typical (root-mean-square) magnitude of the residuals. It answers: "By how much is our model typically wrong in its predictions?"
    *   **Use Case:** Best for comparing the predictive accuracy of different models on the *same dataset*, especially in machine learning contexts.

*   **SER:**

    $$ SER = \sqrt{\frac{RSS}{n - k}} $$
    
    *   **Interpretation:** An estimate of the standard deviation of the underlying, unobservable errors. It answers: "What is the typical size of the random noise in the data that our model cannot explain?"
    *   **Use Case:** Rooted in statistical inference. It is used for calculating confidence intervals and p-values for the model parameters.

For large datasets (where $n$ is much larger than $k$), the two values will be very close.
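Both quantities come from the same RSS and differ only in the divisor. A minimal sketch for the model from the beginning of this page, using synthetic data (assumed true parameters plus noise) and an assumed starting guess:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, A, B):
    return A * (np.exp(-B * x) - 1.0) + 100.0

# Synthetic data: assumed true parameters plus Gaussian noise.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 12)
y = model(x, 50.0, 0.5) + rng.normal(scale=1.0, size=x.size)

popt, _ = curve_fit(model, x, y, p0=[40.0, 0.3])

n = x.size     # number of data points
k = len(popt)  # number of estimated parameters (A and B)
rss = np.sum((y - model(x, *popt)) ** 2)

rmse = np.sqrt(rss / n)
ser = np.sqrt(rss / (n - k))
print(rmse, ser)
```

Since $n - k < n$, the SER is always somewhat larger than the RMSE: the ratio is $\sqrt{n / (n - k)}$, which approaches 1 as $n$ grows.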

## What is the Levenberg-Marquardt Algorithm (LMA)?

The Levenberg-Marquardt Algorithm (LMA), also known as the damped least-squares (DLS) method, is a powerful and widely used iterative algorithm for solving **non-linear least squares problems**.

Imagine you have a set of data points and you want to fit a mathematical model to them. If the model is non-linear with respect to its parameters, finding the best parameters to minimize the difference between the model's predictions and the actual data can be challenging. That's where LMA comes in.

LMA is a hybrid optimization algorithm that cleverly combines two other optimization methods:

1.  **Gradient Descent (or Steepest Descent):** This method takes steps in the direction opposite to the gradient of the objective function. It's good at finding a solution when you are far from the minimum but can be slow to converge as you get closer.
2.  **Gauss-Newton Algorithm (GNA):** This method uses the Jacobian matrix (first derivatives) to approximate the objective function as a quadratic and then finds the minimum of that quadratic approximation. It converges very fast when you are close to the minimum, but it can struggle or even diverge if the initial guess is far from the solution or if the Jacobian matrix is singular (not invertible).

**How LMA combines them:**

The LMA introduces a "damping factor" ($\lambda$) that controls the blend between these two methods:

* **When $\lambda$ is large:** The algorithm behaves more like **gradient descent**. This makes it more robust when the current parameters are far from the optimal solution, as it ensures a decrease in the error, even if it's a small step. This helps it avoid getting stuck or diverging in difficult regions of the parameter space.
* **When $\lambda$ is small:** The algorithm behaves more like the **Gauss-Newton method**. As the algorithm approaches the minimum, $\lambda$ is decreased, allowing for faster convergence.

The LMA adaptively adjusts this damping factor at each iteration. If a step leads to a significant reduction in the sum of squares, $\lambda$ is decreased, moving towards the faster Gauss-Newton method. If a step increases the sum of squares (meaning it overshot or moved in the wrong direction), $\lambda$ is increased, making it more like gradient descent to take smaller, more cautious steps.

This adaptive nature makes LMA very robust and efficient for a wide range of non-linear problems.
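To make the damping idea concrete, here is a simplified, illustrative Levenberg-Marquardt loop for the model from the beginning of this page. It is a teaching sketch under several assumptions (analytic Jacobian, Marquardt-style diagonal scaling, damping changed by a fixed factor of 10, noise-free synthetic data) and is not the implementation used by SciPy. The names `lm_fit`, `lam` and `step_tol` are hypothetical.

```python
import numpy as np

def model(x, A, B):
    return A * (np.exp(-B * x) - 1.0) + 100.0

def jacobian(x, A, B):
    """Columns hold df/dA and df/dB evaluated at each x."""
    decay = np.exp(-B * x)
    return np.column_stack([decay - 1.0, -A * x * decay])

def lm_fit(x, y, p0, lam=1e-2, max_iter=200, step_tol=1e-10):
    p = np.asarray(p0, dtype=float)
    r = y - model(x, *p)  # residuals at the current parameters
    cost = r @ r          # current sum of squares
    for _ in range(max_iter):
        J = jacobian(x, *p)
        JTJ = J.T @ J
        # Damped normal equations: small lam is close to Gauss-Newton,
        # large lam gives short, gradient-descent-like steps.
        damping = lam * np.diag(np.diag(JTJ))
        step = np.linalg.solve(JTJ + damping, J.T @ r)
        if np.linalg.norm(step) < step_tol * (1.0 + np.linalg.norm(p)):
            break
        p_trial = p + step
        r_trial = y - model(x, *p_trial)
        cost_trial = r_trial @ r_trial
        if cost_trial < cost:
            # Step accepted: keep it and trust Gauss-Newton more.
            p, r, cost = p_trial, r_trial, cost_trial
            lam /= 10.0
        else:
            # Step rejected: increase damping and try a more cautious step.
            lam *= 10.0
    return p, cost

# Noise-free synthetic data from assumed parameters A = 50, B = 0.5.
x = np.linspace(0.0, 10.0, 25)
y = model(x, 50.0, 0.5)

p_fit, cost_fit = lm_fit(x, y, p0=[40.0, 0.3])
print(p_fit, cost_fit)
```

Production implementations add safeguards that are omitted here, such as parameter scaling, more careful damping updates and several convergence tests.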

### How is it related to Least Squares Regression?

Least squares regression, in its essence, is about finding the parameters of a model that minimize the sum of the squared differences (residuals) between the observed data and the values predicted by the model.

* **Linear Least Squares:** When the model is linear in its parameters (e.g., $y = ax + b$), the problem can be solved directly using linear algebra (e.g., the normal equations). LMA can also solve this problem, but it is less resource-efficient. If performance is not a concern, either approach can be used. The next subpage contains code examples for a non-linear problem and a comparison of several methods applied to the same linear problem, which shows what other options exist for linear problems.
* **Non-linear Least Squares:** When the model is non-linear in its parameters (e.g., $y = A \cdot e^{-Bx} + C$), there is generally no direct analytical solution. This is where iterative optimization algorithms like the Levenberg-Marquardt algorithm come into play.
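For the linear case, the direct and iterative routes can be compared on a small example. The sketch below uses hypothetical data and solves $y = ax + b$ three ways: the normal equations, `np.linalg.lstsq`, and `scipy.optimize.curve_fit` (which uses Levenberg-Marquardt by default for unconstrained problems).

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix for y = a * x + b.
X = np.column_stack([x, np.ones_like(x)])

# 1) Normal equations: (X^T X) beta = X^T y.
coef_normal = np.linalg.solve(X.T @ X, X.T @ y)

# 2) Linear least-squares solver.
coef_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# 3) Iterative non-linear solver applied to the same linear model.
def line(x, a, b):
    return a * x + b

coef_lm, _ = curve_fit(line, x, y)

print(coef_normal, coef_lstsq, coef_lm)
```

All three approaches should agree to within numerical tolerance. The direct solvers need no starting guess and no iterations.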

**The LMA is specifically designed to solve non-linear least squares problems.** It iteratively refines the model parameters by minimizing the sum of squared residuals, which is the core objective of least squares regression.

The objective function that LMA minimizes is typically of the form:

$$S(\beta) = \sum_{i=1}^{n} (y_i - f(x_i, \beta))^2$$

where:
* $y_i$ are the observed data points.
* $f(x_i, \beta)$ is the model function that predicts the dependent variable based on the independent variable $x_i$ and the parameter vector $\beta$.
* $n$ is the number of data points.

The LMA tries to find the $\beta$ that minimizes $S(\beta)$.

### `scipy.optimize.curve_fit`

The Levenberg-Marquardt algorithm **is the default method for `scipy.optimize.curve_fit` for unconstrained problems.**

NOTE: In the context of optimization, an **unconstrained problem** is an optimization problem where there are **no restrictions or limitations on the values that the parameters (or variables) can take.** The goal is to find the minimum (or maximum) of an objective function over the entire domain of the variables, which is typically the set of all real numbers ($\mathbb{R}^n$ for $n$ parameters).

**Examples of unconstrained problems:**

* Minimizing $f(x) = x^2 - 4x + 5$. Here, $x$ can be any real number.
* Fitting a non-linear curve to data where the parameters of the curve can take any real value (e.g., an amplitude, decay rate, or phase that isn't physically limited to a specific range).

**In contrast, a constrained problem** has limitations or bounds on the parameter values. These constraints can be:

* **Box constraints (bounds):** Parameters must lie within a specific range (e.g., $0 \le x \le 10$, or $-5 \le y \le 5$).
* **Equality constraints:** Parameters must satisfy certain equations (e.g., $x + y = 1$).
* **Inequality constraints:** Parameters must satisfy certain inequalities (e.g., $x^2 + y^2 \le 1$).

The Levenberg-Marquardt algorithm, in its original formulation, is designed for unconstrained optimization. When bounds or other constraints are introduced, other algorithms like Trust Region Reflective (`trf`) or Dogbox (`dogbox`) are typically employed, as they are specifically designed to handle such constraints efficiently. This is why `scipy.optimize.curve_fit` switches to `trf` when bounds are provided.
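A minimal sketch of both cases for the model from the beginning of this page is shown below. The synthetic data, the starting guess and the bound values are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, A, B):
    return A * (np.exp(-B * x) - 1.0) + 100.0

# Synthetic data: assumed true parameters plus Gaussian noise.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 30)
y = model(x, 50.0, 0.5) + rng.normal(scale=0.5, size=x.size)

# Unconstrained fit: no bounds, so the default method is Levenberg-Marquardt.
popt_lm, _ = curve_fit(model, x, y, p0=[40.0, 0.3])

# Constrained fit: hypothetical box bounds 0 <= A <= 200 and 0 <= B <= 5.
lower, upper = [0.0, 0.0], [200.0, 5.0]
popt_trf, _ = curve_fit(model, x, y, p0=[40.0, 0.3], bounds=(lower, upper))

print(popt_lm, popt_trf)
```

When the unconstrained minimum lies inside the bounds, as it should here, both fits are expected to give nearly the same parameters. The bounds matter when the unconstrained solution would fall outside the allowed range.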

Link: https://docs.scipy.org/doc/scipy-1.16.0/reference/generated/scipy.optimize.curve_fit.html