# Weighted Least Squares

## What is Weighted Least Squares (WLS)?

**Weighted Least Squares (WLS)** is a variation of the Ordinary Least Squares (OLS) method. In OLS, it's assumed that the variance of the errors (residuals) is constant across all observations. This assumption is called **homoscedasticity**. However, in many real-world scenarios, this assumption doesn't hold; the errors might be larger for some observations than for others. This situation is called **heteroscedasticity**.

When heteroscedasticity is present, OLS gives equal weight to all data points. This can lead to inefficient parameter estimates (meaning the estimates are not the most precise possible) and incorrect standard errors, which in turn affect the reliability of confidence intervals and hypothesis tests.

WLS addresses this by assigning different **weights** to each data point in the regression. The goal of WLS is to minimize the sum of the *weighted* squared residuals:

$$\sum_{i=1}^{n} w_i (y_i - f(x_i, \beta))^2$$

where:
* $w_i$ is the weight for the $i$-th data point.
* $y_i$ is the observed dependent variable for the $i$-th point.
* $f(x_i, \beta)$ is the predicted value from the model for the $i$-th point, with parameters $\beta$.

**How are weights determined?**
The weights are typically inversely proportional to the variance of the errors for each observation. If $\sigma_i^2$ is the variance of the error for the $i$-th observation, then the weight $w_i$ is usually $1/\sigma_i^2$. This means:
* Observations with smaller errors (lower variance) get larger weights, influencing the fit more.
* Observations with larger errors (higher variance) get smaller weights, influencing the fit less.
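The weighting rule above can be sketched for a straight-line model with plain NumPy. The data, the noise model (`sigma = 0.1 * x`), and the variable names are all made up for illustration; the only technique shown is that scaling each row by $\sqrt{w_i}$ turns the weighted problem into an ordinary least-squares problem.

```python
import numpy as np

# Hypothetical data: a straight line y = 2x + 1 with noise that grows with x.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 20)
sigma = 0.1 * x                      # assumed known per-point standard deviations
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma)

w = 1.0 / sigma**2                   # weights: inverse of the error variance

# Design matrix for the linear model f(x, beta) = beta[0] * x + beta[1].
X = np.column_stack([x, np.ones_like(x)])

# Minimizing sum(w * (y - X @ beta)**2) is an ordinary least-squares problem
# once every row of X and y is multiplied by sqrt(w).
sqrt_w = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)

# Unweighted (OLS) fit for comparison: every point counts equally.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("WLS slope, intercept:", beta_wls)
print("OLS slope, intercept:", beta_ols)
```

In this sketch the precise points at small `x` dominate the weighted fit, while the unweighted fit lets the noisy points at large `x` pull on the line just as hard.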

## Standard deviation

In experimental sciences and data analysis, observations are inherently subject to **measurement uncertainty**. This uncertainty reflects the lack of perfect knowledge about the true value of a quantity due to limitations of instruments, environmental variations, or inherent stochastic processes. In regression analysis and statistics, this uncertainty is also called the "error".

The **standard deviation ($\sigma$)** is the most common statistical measure used to quantify the spread or dispersion of a set of data points around their mean. In the context of individual experimental measurements, the standard deviation of a measurement (or its uncertainty) refers to the expected variability if that measurement were to be repeated multiple times under identical conditions.

If a measurement $Y$ is reported as $Y \pm \delta Y$, where $\delta Y$ represents the uncertainty, this $\delta Y$ is frequently taken to be the **standard deviation** of that measurement. It implies that approximately 68.3% of repeated measurements would fall within the range $[Y - \delta Y, Y + \delta Y]$, assuming a normal distribution of errors.
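The 68.3% figure can be checked numerically. The sketch below computes the exact normal probability with the error function and compares it with a simulation; the values `Y = 10.0` and `delta_Y = 0.5` are arbitrary placeholders.

```python
import math
import numpy as np

# Probability that a normal variable lies within one standard deviation
# of its mean: erf(1 / sqrt(2)).
p_exact = math.erf(1.0 / math.sqrt(2.0))

# Monte Carlo check with hypothetical values for the measurement and its
# uncertainty.
rng = np.random.default_rng(1)
Y, delta_Y = 10.0, 0.5
samples = rng.normal(Y, delta_Y, size=200_000)
p_sim = np.mean(np.abs(samples - Y) <= delta_Y)

print(f"exact: {p_exact:.4f}, simulated: {p_sim:.4f}")
```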

### Interpretation of Individual Data Points in Weighted Least Squares

When performing regression analysis, especially Weighted Least Squares (WLS), each data point $(x_i, y_i)$ is treated as follows:
* **$x_i$ (Independent Variable):** Typically assumed to be known precisely, or to have negligible uncertainty compared to $y_i$.
* **$y_i$ (Dependent Variable):** This value is considered the **best estimate** (or the sample mean) of the true underlying value of the dependent variable at $x_i$. This implies that if multiple independent measurements of $y$ were taken at $x_i$, $y_i$ would represent their average, aiming to minimize random errors.
* **$\sigma_i$ (Uncertainty/Standard Deviation of $y_i$):** This parameter, provided to the fitting algorithm (e.g., via the `sigma` argument in `scipy.optimize.curve_fit`), quantifies the **standard deviation of the measurement $y_i$**. It reflects the precision with which $y_i$ was determined. A smaller $\sigma_i$ indicates a more precise (less uncertain) measurement, and vice-versa.


### Role in Weighted Least Squares (WLS)

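The `scipy.optimize.curve_fit` function performs a Weighted Least Squares minimization whenever the `sigma` array is provided. The `absolute_sigma=True` flag does not change the minimization itself; it tells `curve_fit` to treat `sigma` as absolute standard deviations when computing the parameter covariance matrix.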

* **Weight Calculation:** For each data point $(x_i, y_i)$ with associated standard deviation $\sigma_i$, a weight $w_i$ is calculated as the inverse of the variance: $w_i = \frac{1}{\sigma_i^2}$.
* **Minimization Objective:** The Levenberg-Marquardt algorithm (the default for `curve_fit` when no bounds are given) then seeks to minimize the **weighted sum of squared residuals (WSSR)**:
    $$\text{WSSR} = \sum_{i=1}^{N} w_i (y_i - f(x_i, \beta))^2 = \sum_{i=1}^{N} \frac{(y_i - f(x_i, \beta))^2}{\sigma_i^2}$$
    where $f(x_i, \beta)$ is the model's predicted value and $\beta$ represents the model parameters.
* **Impact of Weights:** Measurements with smaller $\sigma_i$ (higher precision) receive larger weights ($w_i$), thus exerting a greater influence on the determination of the fitted parameters. Conversely, measurements with larger $\sigma_i$ (lower precision) receive smaller weights, having less impact on the fit. This ensures that the fitting process prioritizes minimizing deviations for the more reliable data points.

By incorporating these standard deviations, WLS provides more statistically efficient (more precise) estimates of the model parameters when the assumption of constant error variance (homoscedasticity) is violated, as is the case when different data points have different known uncertainties.
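As a rough illustration of this workflow, the following sketch fits a hypothetical exponential-decay model with per-point uncertainties. The model, the true parameter values, the noise model, and the starting guess `p0` are all assumptions chosen for the example, not part of any particular experiment.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, A, B):
    """Hypothetical two-parameter model used only for illustration."""
    return A * np.exp(-B * x)

# Synthetic data whose noise level differs from point to point.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 4.0, 30)
sigma = 0.02 + 0.05 * model(x, 3.0, 0.8)   # assumed known standard deviations
y = model(x, 3.0, 0.8) + rng.normal(0.0, sigma)

# Passing sigma makes curve_fit minimize sum(((y - f) / sigma)**2).
# absolute_sigma=True keeps pcov in absolute units instead of rescaling it.
popt, pcov = curve_fit(model, x, y, p0=[1.0, 1.0],
                       sigma=sigma, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))              # one-sigma parameter uncertainties

# Weighted sum of squared residuals at the fitted parameters.
wssr = np.sum(((y - model(x, *popt)) / sigma) ** 2)

print("popt:", popt)
print("perr:", perr)
print("WSSR:", wssr)
```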

## What if each point has the same standard deviation?

If each data point has the same standard deviation, homoscedasticity is not violated. Even so, supplying the known standard deviation can still improve the results, specifically the reported parameter uncertainties.

**Homoscedasticity** is the statistical assumption that the **variance of the errors** is constant across all levels of the independent variable. In our notation, this means $\sigma_i^2 = \sigma^2$ for every data point $i$, where $\sigma^2$ is a single constant.

Since standard deviation is the square root of variance ($\sigma = \sqrt{\sigma^2}$), a constant variance implies a constant standard deviation. If every data point has the same standard deviation, then the assumption of homoscedasticity is **satisfied and not violated**.


This leads to a crucial and interesting point about the relationship between Weighted Least Squares (WLS) and Ordinary Least Squares (OLS).

1.  **WLS Objective:** WLS minimizes the weighted sum of squared residuals:

    $$\sum_{i=1}^{N} w_i (y_i - f(x_i, \beta))^2$$

    where the weight $w_i = 1/\sigma_i^2$.

2.  **When Homoscedasticity is Satisfied:** If every data point has the same standard deviation, let's call it $\sigma_{const}$. Then, for all $i$, we have $\sigma_i = \sigma_{const}$. This means all the weights are also the same:

    $$w_i = \frac{1}{\sigma_i^2} = \frac{1}{\sigma_{const}^2} = C$$

    where $C$ is a constant.

3.  **Mathematical Equivalence:** The WLS minimization objective then becomes:

    $$\sum_{i=1}^{N} C (y_i - f(x_i, \beta))^2 = C \sum_{i=1}^{N} (y_i - f(x_i, \beta))^2$$

    Since $C$ is just a positive constant, minimizing this expression is mathematically identical to minimizing the expression without the constant:

    $$\sum_{i=1}^{N} (y_i - f(x_i, \beta))^2$$
    
    This is exactly the objective of **Ordinary Least Squares (OLS)**.

If the assumption of homoscedasticity holds true (i.e., every data point has the same standard deviation), then **Weighted Least Squares and Ordinary Least Squares will produce the exact same parameter estimates**.

However, WLS can still be valuable even in this situation if you have a known, constant `sigma` and use `absolute_sigma=True`, because the reported uncertainties (standard errors) of the fitted parameters are then propagated directly from your measurement uncertainty. Without `sigma`, `curve_fit` instead estimates the noise level from the scatter of the residuals around the fit, which might not be an accurate reflection of the true experimental uncertainties.

### Scenario A: Using a constant `sigma` array with `absolute_sigma=True`

Let's assume your known, constant standard deviation is $\sigma_{known} = 0.5$. Your `sigma` array would be `[0.5, 0.5, 0.5, ...]`

* **Fitted Parameters (`popt`):** The optimal parameter values (A and B for a two-parameter model) will be **identical** to the case where you don't use `sigma`. As we discussed, the minimization is mathematically equivalent to OLS, and the location of the minimum of the objective function is the same.
* **Covariance Matrix (`pcov`) and Parameter Errors (`perr`):** This is where the difference lies. By providing `sigma` and setting `absolute_sigma=True`, you are telling the algorithm: "My measurements have an absolute standard deviation of 0.5. Calculate the parameter uncertainties based on this known fact." The covariance matrix (`pcov`) and the standard errors (`perr`) derived from it will directly reflect the propagated uncertainty from your measurements.

This is the **statistically correct** approach when you have known measurement uncertainties. The resulting parameter errors are a more accurate representation of the true uncertainty in your fitted parameters, grounded in the physical reality of your experiment.
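For a straight-line model the propagated covariance has a closed form, $\sigma_{known}^2 (X^T X)^{-1}$, which makes Scenario A easy to check. The sketch below assumes a hypothetical line `A * x + B` and synthetic data; only the comparison between `pcov` and the closed form is the point.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, A, B):
    """Hypothetical linear model with parameters A (slope) and B (intercept)."""
    return A * x + B

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 25)
sigma_known = 0.5                          # known, constant standard deviation
y = line(x, 1.5, -2.0) + rng.normal(0.0, sigma_known, size=x.size)

# Scenario A: constant sigma array, interpreted as absolute uncertainties.
sigma = np.full_like(x, sigma_known)
popt, pcov = curve_fit(line, x, y, sigma=sigma, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))

# Closed-form covariance for a linear model with design matrix X.
X = np.column_stack([x, np.ones_like(x)])
pcov_expected = sigma_known**2 * np.linalg.inv(X.T @ X)

print("perr from curve_fit:  ", perr)
print("perr from closed form:", np.sqrt(np.diag(pcov_expected)))
```

Note that the closed form does not involve `y` at all: with `absolute_sigma=True`, the parameter errors depend only on where the points were measured and how precise each one is.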

### Scenario B: Skipping the `sigma` parameter

In this case, you simply call `curve_fit` without specifying `sigma`.

* **Fitted Parameters (`popt`):** The optimal parameter values (A and B for a two-parameter model) will be **identical** to Scenario A.
* **Covariance Matrix (`pcov`) and Parameter Errors (`perr`):** The `curve_fit` function handles this differently. It performs a standard OLS fit (which is equivalent to WLS with constant weights), but then it **scales the covariance matrix by the reduced chi-squared value (the goodness of fit)**. This is the behavior of `absolute_sigma=False` (the default).

    * The standard errors you get are based on the **observed scatter of your data points around the fitted curve**, not on any known measurement uncertainty.
    * If your data points are very close to the fitted curve, the reduced chi-squared value will be small, and the calculated `perr` will be artificially small.
    * If your data points are very scattered around the fitted curve, the reduced chi-squared value will be large, and the calculated `perr` will be artificially large.

Summary:

| Feature                  | Using `sigma` with `absolute_sigma=True` (constant `sigma`) | Skipping `sigma` (OLS)                                    |
| ------------------------ | ------------------------------------------------------------- | ----------------------------------------------------------- |
| **Fitted Parameters (`popt`)** | **Identical** | **Identical** |
| **Parameter Errors (`perr`)** | **More Accurate.** Based on your known measurement uncertainty. | **Less Accurate.** Based on the observed scatter of the data. |
| **Statistical Assumption** | You know the absolute uncertainty of your measurements.         | You don't know the absolute uncertainty, and the uncertainty is constant. |
| **Purpose** | To find the parameters and their uncertainties based on your experimental knowledge. | To find the parameters and their uncertainties based on the model's goodness of fit. |
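The two scenarios can be compared side by side on the same synthetic data. This is a minimal sketch with a hypothetical linear model; it shows that `popt` agrees while `perr` differs by the ratio of the scatter-based noise estimate to the known `sigma`.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, A, B):
    """Hypothetical linear model used for both scenarios."""
    return A * x + B

rng = np.random.default_rng(4)
x = np.linspace(0.0, 10.0, 25)
sigma_known = 0.5
y = line(x, 1.5, -2.0) + rng.normal(0.0, sigma_known, size=x.size)
sigma = np.full_like(x, sigma_known)

# Scenario A: known constant sigma, absolute uncertainties.
popt_a, pcov_a = curve_fit(line, x, y, sigma=sigma, absolute_sigma=True)
# Scenario B: sigma omitted, covariance scaled by the observed scatter.
popt_b, pcov_b = curve_fit(line, x, y)

perr_a = np.sqrt(np.diag(pcov_a))
perr_b = np.sqrt(np.diag(pcov_b))

# Noise level estimated from the residuals (N points, 2 fitted parameters).
residuals = y - line(x, *popt_b)
sigma_hat = np.sqrt(np.sum(residuals**2) / (x.size - 2))

print("popt A:", popt_a, " popt B:", popt_b)
print("perr A:", perr_a, " perr B:", perr_b)
print("perr_b / perr_a:", perr_b / perr_a)
print("sigma_hat / sigma_known:", sigma_hat / sigma_known)
```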

## Goodness of Fit and Chi-Squared Statistic

The concepts of goodness of fit and the chi-squared statistic are central to the statistical rigor of a least squares fit. Let's build them from the ground up.

> In a general sense, **"goodness of fit"** is a measure of how well a statistical model describes a set of observations. When you fit a curve to data, you're building a model. "Goodness of fit" is the answer to the question: "How close are the values predicted by my model to the actual data values I measured?"

The fundamental building block for this is the **residual**. For each data point $i$, the residual, $e_i$, is simply the difference between the observed value ($y_i$) and the value predicted by the model ($f(x_i, \beta)$):

$$e_i = y_i - f(x_i, \beta)$$

A perfect fit would have all residuals equal to zero. However, in reality, measurements have uncertainty, and a perfect fit is impossible. Our goal is to find a single number that summarizes all these residuals to tell us if the overall fit is "good enough."

A simple approach to summarizing the residuals would be to just sum their squares, as in Ordinary Least Squares (OLS):

$$\text{Sum of Squared Residuals (SSR)} = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i - f(x_i, \beta))^2$$

The problem with SSR is that it doesn't account for the **precision of the individual measurements**. Imagine two different experiments:
* **Experiment A:** Data points are very precise (low uncertainty). A small SSR might still represent a poor fit because the residuals are much larger than the measurement uncertainties.
* **Experiment B:** Data points are very noisy (high uncertainty). A large SSR might be a perfectly good fit because the residuals are consistent with the large measurement uncertainties.

To create a goodness-of-fit metric that is meaningful regardless of the measurement units or precision, we need to **normalize** each residual by its own uncertainty. This is the key idea behind the $\chi^2$ statistic.

**The Derivation:**
1.  **Define a standardized residual.** For each data point $i$, we know its residual $e_i = y_i - f(x_i, \beta)$ and its measurement uncertainty, quantified by the standard deviation $\sigma_i$. Let's create a new, dimensionless value by dividing the residual by its standard deviation:
    $$\frac{e_i}{\sigma_i} = \frac{y_i - f(x_i, \beta)}{\sigma_i}$$
    This new value represents how many standard deviations the observed value is away from the model's prediction. If its magnitude is about 1 or less, the model's prediction is within roughly one standard deviation of the measurement, which is what we would expect for most points in a good fit.

2.  **Sum the squares.** The **chi-squared statistic** ($\chi^2$) is defined as the sum of the squares of these standardized residuals. We square them to ensure all values are positive and to give more weight to larger deviations, just as in OLS.

    $$\chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - f(x_i, \beta)}{\sigma_i} \right)^2$$

This simple formula is the foundation of the $\chi^2$ test. It is a single number that quantifies the overall discrepancy between the data and the model, scaled by the known measurement uncertainties.

Notice that this is exactly the objective function that the Weighted Least Squares algorithm minimizes. When you provide `sigma` values to `curve_fit`, the function internally calculates the weights as $w_i = 1/\sigma_i^2$ and minimizes this same sum.
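A small numeric sketch makes the contrast between SSR and $\chi^2$ concrete. The residuals and the two uncertainty levels below are invented; the helper `chi_squared` is a hypothetical name, not a SciPy function.

```python
import numpy as np

def chi_squared(y, y_model, sigma):
    """Sum of squared residuals, each divided by its standard deviation."""
    return float(np.sum(((y - y_model) / sigma) ** 2))

# The same four residuals, interpreted under two different uncertainties.
residuals = np.array([0.2, -0.1, 0.3, -0.2])
y_model = np.zeros(4)
y = y_model + residuals

ssr = float(np.sum(residuals**2))                      # 0.18 in both cases

chi2_precise = chi_squared(y, y_model, sigma=0.05)     # about 72
chi2_noisy = chi_squared(y, y_model, sigma=0.5)        # about 0.72

print(ssr, chi2_precise, chi2_noisy)
```

The SSR is identical in both cases, yet the residuals are several standard deviations wide for the precise measurements and well inside the noise for the imprecise ones.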

### Interpreting the Chi-Squared Value

A raw $\chi^2$ value on its own isn't very informative because its expected value depends on the number of data points. To make it more meaningful, we use the concept of **degrees of freedom**.

* **Degrees of Freedom ($\nu$):** This is defined as the number of data points minus the number of parameters you are fitting.
    $$\nu = N - k$$
    * $N$: Number of data points.
    * $k$: Number of parameters in your model.
    The degrees of freedom represent the number of independent pieces of information available to test the goodness of fit after the model parameters have been determined.

* **Reduced Chi-Squared ($\chi^2_\nu$):** To get a normalized measure that is easier to interpret, we divide $\chi^2$ by the degrees of freedom.
    $$\chi^2_\nu = \frac{\chi^2}{\nu}$$

The interpretation of the reduced chi-squared value is intuitive and powerful:

* **$\chi^2_\nu \approx 1$ (Ideal):** This is the gold standard for a good fit. It means that the total squared deviation of the data from the model is, on average, about what you would expect given your known measurement uncertainties. The residuals are consistent with the random noise you've specified with your `sigma` values.

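* **$\chi^2_\nu > 1$ (Poor Fit or Underestimated Errors):** This indicates that the residuals are larger than what your uncertainties would predict. Values only slightly above 1 are expected from random fluctuation (the standard deviation of $\chi^2_\nu$ is about $\sqrt{2/\nu}$ for Gaussian errors), but a value well above 1 signals a problem. This could be because: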
    1.  The model function `f(x)` is simply incorrect and doesn't describe the underlying physical process.
    2.  Your measurement uncertainties (`sigma`) were underestimated, and the data is actually more noisy than you assumed.

* **$\chi^2_\nu < 1$ ("Too Good" Fit or Overestimated Errors):** A value well below 1 suggests that the model fits the data *better* than the stated uncertainties allow for. This could be a sign that:
    1.  Your measurement uncertainties (`sigma`) were overestimated.
    2.  The model is over-fitting the data by matching the random noise instead of the underlying trend.
    3.  The data might have been cherry-picked, or there's a systematic error in how the uncertainties were determined.

In summary, the chi-squared statistic provides a rigorous, quantitative way to answer the question of "goodness of fit" by comparing the observed discrepancies between the model and the data against the known uncertainties of the data points themselves.
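The three cases can be reproduced on synthetic data by fitting the same points with a correct, an underestimated, and an overestimated `sigma`. The linear model, the true noise level, and the helper `reduced_chi_squared` are assumptions made for this sketch.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, A, B):
    """Hypothetical linear model."""
    return A * x + B

def reduced_chi_squared(x, y, sigma_assumed):
    """Fit the line with a constant assumed sigma and return chi^2 / nu."""
    sigma = np.full_like(x, sigma_assumed)
    popt, _ = curve_fit(line, x, y, sigma=sigma, absolute_sigma=True)
    chi2 = np.sum(((y - line(x, *popt)) / sigma) ** 2)
    nu = x.size - len(popt)            # degrees of freedom: N - k
    return chi2 / nu

rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 200)
true_sigma = 0.3
y = line(x, 1.5, -2.0) + rng.normal(0.0, true_sigma, size=x.size)

rc_correct = reduced_chi_squared(x, y, true_sigma)       # expected near 1
rc_under = reduced_chi_squared(x, y, true_sigma / 2)     # sigma too small
rc_over = reduced_chi_squared(x, y, true_sigma * 2)      # sigma too large

print("correct sigma:       ", rc_correct)
print("underestimated sigma:", rc_under)
print("overestimated sigma: ", rc_over)
```

Halving `sigma` multiplies the reduced chi-squared by four and doubling it divides by four, because the same residuals are divided by a different assumed variance.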

## Unknown standard deviation for `scipy.optimize.curve_fit`

There is an apparent paradox: the $\chi^2$ formula needs `sigma`, but the `scipy.optimize.curve_fit` function calculates a "goodness of fit" without it.

The answer lies in a clever statistical trick `scipy.optimize.curve_fit` uses when you don't provide explicit uncertainties.

When you skip the `sigma` parameter, `curve_fit` doesn't assume that the errors are zero. Instead, it makes a different, less strict assumption: it assumes that all your data points have the same **unknown** standard deviation, $\sigma_{unknown}$.

This is a specific form of the homoscedasticity assumption. Since the standard deviation is the same for every point, you don't need to specify it for the purpose of finding the best-fit parameters, because all the weights would be equal ($w_i = 1/\sigma_{unknown}^2$). As we discussed, a constant weight doesn't change the location of the minimum of the sum of squares.

### How the Goodness of Fit is Calculated

Even though the standard deviation is unknown, we can **estimate** it from the data itself after the fit is complete. The logic is as follows:

1.  **Perform a standard OLS fit:** The algorithm finds the parameters that minimize the plain old Sum of Squared Residuals (SSR):
    $$\text{SSR} = \sum_{i=1}^{N} (y_i - f(x_i, \beta))^2$$

2.  **Estimate the Variance:** The standard estimate for the unknown variance, $\sigma_{unknown}^2$, is essentially the average squared residual, with the degrees of freedom ($\nu = N-k$) in place of $N$ in the denominator to account for the $k$ fitted parameters. This is known as the **residual variance** or **Mean Squared Error (MSE)**:
    $$\hat{\sigma}^2_{unknown} = \frac{\text{SSR}}{\nu} = \frac{\sum_{i=1}^{N} (y_i - f(x_i, \beta))^2}{N-k}$$
    This value, $\hat{\sigma}^2_{unknown}$, is a measure of the average scatter of your data points around the fitted curve.

3.  **The Scaling Factor:** The reduced chi-squared value is conceptually $\chi^2/\nu$. When no `sigma` is provided, the chi-squared value is effectively calculated with $\sigma_i = 1$ for all points, so $\chi^2 = \text{SSR}$. Therefore, the reduced chi-squared is simply $\text{SSR}/\nu$. This value is used as the scaling factor for the covariance matrix.

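    More formally, the `scipy.optimize.curve_fit` documentation states that when `absolute_sigma=False`, `sigma` is treated as known only up to a constant factor, and that factor is set by demanding that the reduced chi-squared for the optimal parameters equals unity. When no `sigma` is given, this is equivalent to scaling the covariance matrix by $\text{SSR} / \nu$. Writing $\text{pcov}_{\text{rel}}$ for the result with `absolute_sigma=False` and $\text{pcov}_{\text{abs}}$ for the result with `absolute_sigma=True`,

    $$\text{pcov}_{\text{rel}} = \text{pcov}_{\text{abs}} \times \frac{\chi^2(\text{popt})}{N - k}$$

    (The SciPy documentation writes the denominator as $M - N$, with $M$ data points and $N$ parameters.)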

    For more details, review the documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
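The scaling described in the steps above can be verified directly. This sketch, again on a hypothetical linear model with synthetic data, compares the default covariance (no `sigma`) with the covariance obtained for unit uncertainties multiplied by $\text{SSR}/\nu$.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, A, B):
    """Hypothetical linear model."""
    return A * x + B

rng = np.random.default_rng(6)
x = np.linspace(0.0, 10.0, 40)
y = line(x, 1.5, -2.0) + rng.normal(0.0, 0.4, size=x.size)

# Default call: sigma omitted, absolute_sigma=False.
popt, pcov_default = curve_fit(line, x, y)

# Same fit with sigma_i = 1 for every point, kept in absolute units.
_, pcov_unit = curve_fit(line, x, y, sigma=np.ones_like(x),
                         absolute_sigma=True)

# Residual variance estimated after the fit: SSR / (N - k).
ssr = np.sum((y - line(x, *popt)) ** 2)
nu = x.size - len(popt)
sigma_hat_sq = ssr / nu

print("estimated sigma:", np.sqrt(sigma_hat_sq))
print("max abs difference:",
      np.max(np.abs(pcov_default - pcov_unit * sigma_hat_sq)))
```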

In essence, when you skip the `sigma` parameter, the function does two things:

1.  It assumes all points have the same, unknown uncertainty.
2.  It uses the **observed scatter of the data** (quantified by the residual variance, $\text{SSR}/\nu$) as a post-hoc *estimate* of that unknown uncertainty.

This estimate of the uncertainty is then used to calculate the standard errors of your fitted parameters. This is a robust approach when you don't have prior knowledge of your measurement errors, but it's not as statistically rigorous as providing a known uncertainty via the `sigma` parameter. The parameter uncertainties you get are a reflection of how well your model explains the observed variation in your data, rather than being based on a known, independent measure of your experimental precision.