# Goodness of Fit and Chi-Squared Statistic

## Definitions

The concepts of goodness of fit and the chi-squared statistic are central to the statistical rigor of a least squares fit. Let's build them from the ground up.

> In a general sense, **"goodness of fit"** is a measure of how well a statistical model describes a set of observations. When you fit a curve to data, you're building a model. "Goodness of fit" is the answer to the question: "How close are the values predicted by my model to the actual data values I measured?"

The fundamental building block for this is the **residual**. For each data point $i$, the residual, $e_i$, is simply the difference between the observed value ($y_i$) and the value predicted by the model ($f(x_i, \beta)$):

$$e_i = y_i - f(x_i, \beta)$$

A perfect fit would have all residuals equal to zero. However, in reality, measurements have uncertainty, and a perfect fit is impossible. Our goal is to find a single number that summarizes all these residuals to tell us if the overall fit is "good enough."

A simple approach to summarizing the residuals would be to just sum their squares, as in Ordinary Least Squares (OLS):

$$\text{Residual Sum of Squares (RSS)} = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i - f(x_i, \beta))^2$$
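
As a minimal sketch of the two formulas above, the snippet below computes residuals and the RSS with NumPy. The straight-line model `f`, the data values, and the parameter vector `beta` are all made up for illustration.

```python
import numpy as np

# Hypothetical model for illustration: f(x, beta) = beta[0] + beta[1] * x
def f(x, beta):
    return beta[0] + beta[1] * x

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])   # made-up observations
beta = np.array([1.0, 2.0])               # assumed parameter values

residuals = y - f(x, beta)     # e_i = y_i - f(x_i, beta)
rss = np.sum(residuals**2)     # Residual Sum of Squares

print("residuals:", residuals)
print("RSS:", rss)
```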

The problem with RSS is that it doesn't account for the **precision of the individual measurements**. Imagine two different experiments:
* **Experiment A:** Data points are very precise (low uncertainty). A small RSS might still represent a poor fit because the residuals are much larger than the measurement uncertainties.
* **Experiment B:** Data points are very noisy (high uncertainty). A large RSS might be a perfectly good fit because the residuals are consistent with the large measurement uncertainties.

To create a goodness-of-fit metric that is meaningful regardless of the measurement units or precision, we need to **normalize** each residual by its own uncertainty. This is the key idea behind the $\chi^2$ statistic.

**The Derivation:**
1.  **Define a standardized residual.** For each data point $i$, we know its residual $e_i = y_i - f(x_i, \beta)$ and its measurement uncertainty, quantified by the standard deviation $\sigma_i$. Let's create a new, dimensionless value by dividing the residual by its standard deviation:

    $$\frac{e_i}{\sigma_i} = \frac{y_i - f(x_i, \beta)}{\sigma_i}$$
    
    This new value represents how many standard deviations the observed value lies from the model's prediction. If its magnitude is about 1 or less, the model's prediction is within roughly one standard deviation of the measurement, which is what we would expect for most points in a good fit.

2.  **Sum the squares.** The **chi-squared statistic** ($\chi^2$) is defined as the sum of the squares of these standardized residuals. We square them to ensure all values are positive and to give more weight to larger deviations, just as in OLS.

    $$\chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - f(x_i, \beta)}{\sigma_i} \right)^2$$

This simple formula is the foundation of the $\chi^2$ test. It is a single number that quantifies the overall discrepancy between the data and the model, scaled by the known measurement uncertainties.
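
The effect of the normalization can be sketched with a small helper. The function name `chi_squared` and all numbers are illustrative assumptions. The same residuals are evaluated twice, once with small uncertainties (like Experiment A) and once with large ones (like Experiment B).

```python
import numpy as np

def chi_squared(y, y_model, sigma):
    """Sum of squared standardized residuals."""
    z = (y - y_model) / sigma      # residuals in units of sigma
    return np.sum(z**2)

y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])         # made-up observations
y_model = np.array([1.0, 3.0, 5.0, 7.0, 9.0])   # made-up model predictions

# Identical residuals, different assumed measurement precision
chi2_precise = chi_squared(y, y_model, sigma=np.full(5, 0.05))
chi2_noisy = chi_squared(y, y_model, sigma=np.full(5, 0.5))

print("chi2 with sigma = 0.05:", chi2_precise)
print("chi2 with sigma = 0.5: ", chi2_noisy)
```

The RSS is identical in both cases, yet $\chi^2$ differs by a factor of 100 because the uncertainties differ by a factor of 10.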

Notice that this is exactly the objective function that the Weighted Least Squares algorithm minimizes. When you provide `sigma` values to `curve_fit`, the function internally calculates the weights as $w_i = 1/\sigma_i^2$ and minimizes this same sum.
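
A short sketch of this, assuming a straight-line model and synthetic data with point-dependent noise (both invented for the example): the parameters returned by `curve_fit` should give a lower $\chi^2$ than slightly perturbed parameters.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a + b * x

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 25)
sigma = np.linspace(0.2, 1.0, x.size)            # assumed per-point uncertainties
y = line(x, 1.0, 2.0) + rng.normal(0.0, sigma)   # synthetic data

popt, pcov = curve_fit(line, x, y, sigma=sigma, absolute_sigma=True)

def chi2(params):
    return np.sum(((y - line(x, *params)) / sigma) ** 2)

chi2_best = chi2(popt)
chi2_nudged = chi2(popt + np.array([0.05, 0.0]))  # shift the intercept slightly

print("best-fit parameters:", popt)
print("chi2 at best fit:   ", chi2_best)
print("chi2 after nudging: ", chi2_nudged)
```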

### Reduced Chi-Squared

A raw $\chi^2$ value on its own isn't very informative because its expected value depends on the number of data points. To make it more meaningful, we use the concept of **degrees of freedom**.

* **Degrees of Freedom ($\nu$):** This is defined as the number of data points minus the number of parameters you are fitting.

    $$\nu = N - P$$

    * $N$: Number of data points.
    * $P$: Number of parameters in your model.

    The degrees of freedom represent the number of independent pieces of information available to test the goodness of fit after the model parameters have been determined.

* **Reduced Chi-Squared ($\chi^2_{red}$):** To get a normalized measure that is easier to interpret, we divide $\chi^2$ by the degrees of freedom.

    $$\chi^2_{red} = \frac{\chi^2}{\nu}$$
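
The two definitions above can be sketched as follows. The exponential-decay model, the noise level, and the starting guess `p0` are assumptions chosen for the example, not part of the definitions.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(x, amplitude, rate):
    return amplitude * np.exp(-rate * x)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
sigma = np.full(x.size, 0.2)                      # assumed known uncertainty
y = decay(x, 5.0, 0.5) + rng.normal(0.0, sigma)   # synthetic data

popt, pcov = curve_fit(decay, x, y, p0=[4.0, 0.4],
                       sigma=sigma, absolute_sigma=True)

chi2 = np.sum(((y - decay(x, *popt)) / sigma) ** 2)
nu = x.size - popt.size        # degrees of freedom: N - P
chi2_red = chi2 / nu

print(f"chi2 = {chi2:.2f}, nu = {nu}, chi2_red = {chi2_red:.2f}")
```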

The interpretation of the reduced chi-squared value is intuitive and powerful:

* **$\chi^2_{red} \approx 1$ (Ideal):** This is the gold standard for a good fit. It means that the total squared deviation of the data from the model is, on average, about what you would expect given your known measurement uncertainties. The residuals are consistent with the random noise you've specified with your `sigma` values.

* **$\chi^2_{red} > 1$ (Poor Fit or Underestimated Errors):** This indicates that the residuals are larger than your stated uncertainties would predict. A value well above 1 could mean that:
    1.  The model function `f(x)` is simply incorrect and doesn't describe the underlying physical process.
    2.  Your measurement uncertainties (`sigma`) were underestimated, and the data is actually more noisy than you assumed.

* **$\chi^2_{red} < 1$ ("Too Good" Fit or Overestimated Errors):** This suggests that the model fits the data *better* than would be expected by random chance alone. This could be a sign that:
    1.  Your measurement uncertainties (`sigma`) were overestimated.
    2.  The model is over-fitting the data by matching the random noise instead of the underlying trend.
    3.  The data might have been cherry-picked, or there's a systematic error in how the uncertainties were determined.
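
The role of the assumed uncertainties in these three cases can be illustrated with synthetic data. In this sketch the true noise level is 0.5 (an arbitrary choice), the model is correct by construction, and the same data are fitted with a correct, an underestimated, and an overestimated `sigma`.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a + b * x

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 50)
true_sigma = 0.5
y = line(x, 1.0, 2.0) + rng.normal(0.0, true_sigma, x.size)

def reduced_chi2(assumed_sigma):
    sigma = np.full(x.size, assumed_sigma)
    popt, _ = curve_fit(line, x, y, sigma=sigma, absolute_sigma=True)
    chi2 = np.sum(((y - line(x, *popt)) / sigma) ** 2)
    return chi2 / (x.size - popt.size)

# Correct, underestimated, and overestimated uncertainties
results = {s: reduced_chi2(s) for s in (0.5, 0.25, 1.0)}
for assumed, value in results.items():
    print(f"assumed sigma = {assumed:4.2f} -> reduced chi2 = {value:.2f}")
```

Halving the assumed uncertainty multiplies the reduced chi-squared by four, and doubling it divides the value by four, even though the data and the fitted curve are unchanged.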

In summary, the chi-squared statistic provides a rigorous, quantitative way to answer the question of "goodness of fit" by comparing the observed discrepancies between the model and the data against the known uncertainties of the data points themselves.

## Unknown standard deviation for `scipy.optimize.curve_fit`

There is an apparent paradox: the $\chi^2$ formula needs $\sigma_i$, but `scipy.optimize.curve_fit` can still return a parameter covariance matrix, which depends on the goodness of fit, without it.

The answer lies in a clever statistical trick `scipy.optimize.curve_fit` uses when you don't provide explicit uncertainties.

When you skip the `sigma` parameter, `curve_fit` doesn't assume that the errors are zero. Instead, it makes a different, less strict assumption: it assumes that all your data points have the same **unknown** standard deviation, $\sigma_{unknown}$.

This is a specific form of the homoscedasticity assumption. Since the standard deviation is the same for every point, you don't need to specify it for the purpose of finding the best-fit parameters, because all the weights would be equal ($w_i = 1/\sigma_{unknown}^2$). As we discussed, a constant weight doesn't change the location of the minimum of the sum of squares.
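
A quick check of this claim, using a made-up straight-line data set: fitting with no `sigma` and fitting with an arbitrary constant `sigma` (3.7 here, chosen at random) should give the same best-fit parameters. With the default `absolute_sigma=False` the covariance matrix should agree as well.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a + b * x

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 30)
y = line(x, 1.0, 2.0) + rng.normal(0.0, 0.5, x.size)   # synthetic data

# No uncertainties supplied
popt_none, pcov_none = curve_fit(line, x, y)

# The same arbitrary uncertainty for every point
popt_const, pcov_const = curve_fit(line, x, y, sigma=np.full(x.size, 3.7))

print("popt without sigma:     ", popt_none)
print("popt with constant sigma:", popt_const)
```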

### How the Goodness of Fit is Calculated

Even though the standard deviation is unknown, we can **estimate** it from the data itself after the fit is complete. The logic is as follows:

1.  **Perform a standard OLS fit:** The algorithm finds the parameters that minimize the plain old Residual Sum of Squares (RSS):

    $$RSS = \sum_{i=1}^{N} (y_i - f(x_i, \beta))^2$$

2.  **Estimate the Variance:** The usual estimate for the unknown variance, $\sigma_{unknown}^2$, is the sum of squared residuals divided by the degrees of freedom ($\nu = N-P$) rather than by $N$, which corrects for the $P$ parameters already fitted to the data. This is known as the **residual variance** or **Mean Squared Error (MSE)**:

    $$\hat{\sigma}^2_{unknown} = \frac{RSS}{\nu} = \frac{\sum_{i=1}^{N} (y_i - f(x_i, \beta))^2}{N-P}$$

    This value, $\hat{\sigma}^2_{unknown}$, is a measure of the average scatter of your data points around the fitted curve.

3.  **The Scaling Factor:** The reduced chi-squared value is conceptually $\chi^2/\nu$. When no `sigma` is provided, the chi-squared value is effectively calculated with $\sigma_i = 1$ for all points, so $\chi^2 = RSS$ and the reduced chi-squared is simply $RSS/\nu$. This value is used as the scaling factor for the covariance matrix.

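    More formally, the `scipy.optimize.curve_fit` documentation states that when `absolute_sigma=False`, the covariance matrix is based on scaling `sigma` by a constant factor, chosen so that the reduced chi-squared for the optimal parameters equals unity. With no `sigma` supplied, this is equivalent to scaling the covariance matrix by $RSS / \nu$. Mathematically,

    `pcov(absolute_sigma=False) = pcov(absolute_sigma=True) * chisq(popt)/(M-N)`

    Note that in SciPy's notation `M` is the number of data points and `N` is the number of parameters, so `M-N` corresponds to $\nu = N - P$ in the notation used here.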

    For more details, review the documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
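
The scaling relation quoted from the documentation can be checked numerically. The sketch below uses a made-up straight-line data set with point-dependent uncertainties and compares the two covariance matrices.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a + b * x

rng = np.random.default_rng(4)
x = np.linspace(0.0, 10.0, 30)
sigma = np.linspace(0.3, 0.9, x.size)            # assumed uncertainties
y = line(x, 1.0, 2.0) + rng.normal(0.0, sigma)   # synthetic data

popt, pcov_rel = curve_fit(line, x, y, sigma=sigma, absolute_sigma=False)
_, pcov_abs = curve_fit(line, x, y, sigma=sigma, absolute_sigma=True)

chisq = np.sum(((y - line(x, *popt)) / sigma) ** 2)
nu = x.size - popt.size          # data points minus parameters

print("pcov (absolute_sigma=False):\n", pcov_rel)
print("pcov (absolute_sigma=True) * chisq / nu:\n", pcov_abs * chisq / nu)
```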

In essence, when you skip the `sigma` parameter, the function does two things:

1.  It assumes all points have the same, unknown uncertainty.
2.  It uses the **observed scatter of the data** (quantified by the residual variance, $RSS/\nu$) as a post-hoc *estimate* of that unknown uncertainty.

This estimate of the uncertainty is then used to calculate the standard errors of your fitted parameters. This is a practical approach when you have no prior knowledge of your measurement errors, but it is less statistically rigorous than providing known uncertainties via the `sigma` parameter, and it cannot be used to test the model, because the reduced chi-squared equals 1 by construction. The parameter uncertainties you get reflect how well your model explains the observed variation in your data, rather than a known, independent measure of your experimental precision.
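
Putting the steps together, a sketch with synthetic data (true noise level 0.5, chosen arbitrarily): fit without `sigma`, estimate the common uncertainty from the residuals, and read the parameter standard errors off the diagonal of `pcov`. As a consistency check, refitting with the estimated uncertainty and `absolute_sigma=True` should reproduce the same covariance matrix.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a + b * x

rng = np.random.default_rng(6)
x = np.linspace(0.0, 10.0, 40)
y = line(x, 1.0, 2.0) + rng.normal(0.0, 0.5, x.size)   # synthetic data

# Step 1: plain least-squares fit, no uncertainties supplied
popt, pcov = curve_fit(line, x, y)

# Step 2: estimate the common standard deviation from the residuals
rss = np.sum((y - line(x, *popt)) ** 2)
nu = x.size - popt.size
sigma_hat = np.sqrt(rss / nu)

# Standard errors of the fitted parameters
perr = np.sqrt(np.diag(pcov))

# Consistency check: treat sigma_hat as the known uncertainty
_, pcov_check = curve_fit(line, x, y, sigma=np.full(x.size, sigma_hat),
                          absolute_sigma=True)

print("estimated sigma:", sigma_hat)
print("parameter standard errors:", perr)
```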

## Chi-Squared Statistic for Non-Linear Regression

The use of reduced chi-squared statistics for non-linear models involves **significant interpretational challenges** that require careful consideration. While the statistic can still be calculated, its meaning and reliability become more complex compared to linear models.

The reduced chi-squared ($\chi^2_{red}$) is defined as the chi-squared statistic ($\chi^2$) divided by the number of degrees of freedom ($\nu$):

$$\chi^2_{red} = \frac{\chi^2}{\nu}$$

The chi-squared statistic itself, a measure of how well a model fits the data, is straightforward to calculate for **both linear and non-linear models**:

$$\chi^2 = \sum_{i=1}^{N} \frac{(O_i - E_i)^2}{\sigma_i^2} = \sum_{i=1}^{N} \left( \frac{y_i - f(x_i, \beta)}{\sigma_i} \right)^2$$

Here, $O_i$ are the observed data points, $E_i$ are the expected values from the model, and $\sigma_i$ are the measurement uncertainties.

The conventional calculation for degrees of freedom ($\nu$) is the number of data points ($N$) minus the number of parameters ($P$) that were fitted to the data ($\nu = N - P$).

* **For linear models**, this approach is well-established. The number of fitted parameters ($P$) corresponds directly to the model parameters (e.g., slope and intercept in linear regression). Because the model is a linear superposition of basis functions, $\chi^2$ is a quadratic function of the parameters. If the basis functions are linearly independent and the errors are Gaussian with the stated $\sigma_i$, the minimum $\chi^2$ follows a chi-squared distribution with exactly $N - P$ degrees of freedom, so the calculation is straightforward.

* **For non-linear models**, several complications arise that challenge the traditional interpretation:
  - The relationship between parameters and data becomes more complex
  - Parameters can exhibit strong correlations and interdependencies
  - Small changes in one parameter may have disproportionate effects on others
  - The concept of "effective degrees of freedom" may differ significantly from the simple parameter count
  - The number of constraints imposed by the fit may not equal the number of parameters
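
For the linear case, the $\nu = N - P$ rule can be checked by simulation. The sketch below repeatedly generates synthetic straight-line data with known Gaussian noise, fits each data set by linear least squares, and compares the average minimum $\chi^2$ with $N - P$. All settings are arbitrary choices for the example, and it does not show how a non-linear model would behave.

```python
import numpy as np

rng = np.random.default_rng(42)
N, P, sigma = 20, 2, 0.3
x = np.linspace(0.0, 1.0, N)
A = np.column_stack([np.ones(N), x])    # design matrix: intercept and slope

chi2_values = []
for _ in range(5000):
    y = 1.0 + 2.0 * x + rng.normal(0.0, sigma, N)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares solution
    chi2_values.append(np.sum(((y - A @ beta) / sigma) ** 2))

mean_chi2 = np.mean(chi2_values)
print(f"mean chi2 over trials = {mean_chi2:.2f}, N - P = {N - P}")
```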

**You can still calculate** the chi-squared ($\chi^2$) value, and minimize it to find the best-fit parameters of your non-linear function through weighted least-squares fitting. Many statistical software packages routinely report reduced chi-squared values for nonlinear fits.

**However, exercise caution when interpreting** the reduced chi-squared ($\chi^2_{red}$) for nonlinear models:
- A value close to 1 does not guarantee a good model, as the degrees of freedom calculation may not be strictly valid
- The traditional interpretation guidelines may be less reliable
- Consider the result as one piece of evidence rather than a definitive assessment

**Bottom line**: Reduced chi-squared can be calculated for nonlinear models and may provide useful information, but its interpretation is less straightforward than for linear cases, and it should be used alongside other model assessment tools for more reliable conclusions. It's also recommended to use information criteria like the **Akaike Information Criterion (AIC)** or **Bayesian Information Criterion (BIC)**, which penalize model complexity in a more robust way and allow comparison between different models.
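
As a rough sketch of such a comparison: if the errors are assumed to be Gaussian with known $\sigma_i$, then up to an additive constant shared by all models fitted to the same data, $AIC = \chi^2 + 2P$ and $BIC = \chi^2 + P \ln N$. The helper name `information_criteria` and the synthetic data are illustrative assumptions. The example compares a straight line with a quadratic on data generated from a straight line.

```python
import numpy as np

def information_criteria(chi2, n_params, n_points):
    """AIC and BIC up to a model-independent constant (Gaussian errors, known sigma)."""
    aic = chi2 + 2 * n_params
    bic = chi2 + n_params * np.log(n_points)
    return aic, bic

rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 40)
sigma = np.full(x.size, 0.5)                      # assumed known uncertainty
y = 1.0 + 2.0 * x + rng.normal(0.0, sigma)        # synthetic straight-line data

scores = {}
for degree in (1, 2):
    # np.polyfit expects weights of 1/sigma (not 1/sigma**2)
    coeffs = np.polyfit(x, y, degree, w=1.0 / sigma)
    chi2 = np.sum(((y - np.polyval(coeffs, x)) / sigma) ** 2)
    aic, bic = information_criteria(chi2, degree + 1, x.size)
    scores[degree] = (chi2, aic, bic)
    print(f"degree {degree}: chi2 = {chi2:.2f}, AIC = {aic:.2f}, BIC = {bic:.2f}")
```

The model with the lower AIC or BIC is preferred. The extra parameter of the quadratic can only lower $\chi^2$, so the penalty terms decide whether that reduction is worth the added complexity.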


## Additional Materials

* https://www.nbi.dk/~petersen/Teaching/IntroStat2020/IS2021_01_14_ChiSquare.pdf
