# Weighted Least Squares

## What is Weighted Least Squares (WLS)?

**Weighted Least Squares (WLS)** is a variation of the Ordinary Least Squares (OLS) method. In OLS, it's assumed that the variance of the errors (residuals) is constant across all observations. This assumption is called **homoscedasticity**. However, in many real-world scenarios, this assumption doesn't hold; the errors might be larger for some observations than for others. This situation is called **heteroscedasticity**.

When heteroscedasticity is present, OLS still gives equal weight to all data points, including the noisier ones. This can lead to inefficient parameter estimates (meaning the estimates are not the most precise possible) and incorrect standard errors, which in turn affect the reliability of confidence intervals and hypothesis tests.

WLS addresses this by assigning different **weights** to each data point in the regression. The goal of WLS is to minimize the sum of the *weighted* squared residuals:

$$\sum_{i=1}^{n} w_i (y_i - f(x_i, \beta))^2$$

where:
* $w_i$ is the weight for the $i$-th data point.
* $y_i$ is the observed dependent variable for the $i$-th point.
* $f(x_i, \beta)$ is the predicted value from the model for the $i$-th point, with parameters $\beta$.

**How are weights determined?**
The weights are typically inversely proportional to the variance of the errors for each observation. If $\sigma_i^2$ is the variance of the error for the $i$-th observation, then the weight $w_i$ is usually $1/\sigma_i^2$. This means:
* Observations with smaller errors (lower variance) get larger weights, influencing the fit more.
* Observations with larger errors (higher variance) get smaller weights, influencing the fit less.
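
As a minimal sketch of the idea, the snippet below fits a straight line to synthetic data whose noise grows with $x$, once with weights $w_i = 1/\sigma_i^2$ and once without. The data, the true parameter values, and the helper name `weighted_rss` are assumptions made up for illustration. For a linear model, WLS can be solved by scaling each row by $\sqrt{w_i}$ and running ordinary least squares.

```python
import numpy as np

# Hypothetical straight-line data y = 1 + 2*x with noise that grows with x
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)
sigma = 0.2 + 0.3 * x                  # assumed known standard deviations
y = 1.0 + 2.0 * x + rng.normal(0.0, sigma)

w = 1.0 / sigma**2                     # weights: inverse of the error variance

# Design matrix for the linear model f(x, beta) = beta0 + beta1 * x
X = np.column_stack([np.ones_like(x), x])

# WLS: scale each row by sqrt(w_i), then solve an ordinary least-squares problem
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

# OLS for comparison: every point gets the same weight
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)


def weighted_rss(beta):
    """Weighted sum of squared residuals for a given parameter vector."""
    return np.sum(w * (y - X @ beta) ** 2)


print("WLS estimate:", beta_wls, "weighted RSS:", weighted_rss(beta_wls))
print("OLS estimate:", beta_ols, "weighted RSS:", weighted_rss(beta_ols))
```

By construction, the WLS estimate gives the smaller weighted sum of squared residuals of the two.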

NOTE: For additional statistical context, Weighted Least Squares is a special case of a broader method known as Generalized Least Squares (GLS). GLS is designed for situations where the errors are either heteroscedastic (have non-constant variance) or are correlated with each other. WLS simplifies the GLS framework by assuming that the errors, while having different variances, are not correlated with one another. This means the covariance matrix of the errors is a diagonal matrix, which makes the calculations more straightforward than in the full GLS approach.
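
For a model that is linear in its parameters, $y = X\beta + \varepsilon$ with error covariance matrix $\Omega$, this relationship can be written out explicitly. The symbols $X$, $\Omega$, and $W$ are introduced here only for illustration:

$$\hat{\beta}_{GLS} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y$$

$$\Omega = \mathrm{diag}(\sigma_1^2, \dots, \sigma_n^2) \;\Rightarrow\; \hat{\beta}_{WLS} = (X^T W X)^{-1} X^T W y, \qquad W = \Omega^{-1} = \mathrm{diag}(w_1, \dots, w_n), \quad w_i = \frac{1}{\sigma_i^2}$$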

## Standard deviation

In experimental sciences and data analysis, observations are inherently subject to **measurement uncertainty**. This uncertainty reflects the lack of perfect knowledge about the true value of a quantity due to limitations of instruments, environmental variations, or inherent stochastic processes. In regression analysis and statistics, this uncertainty is also commonly called the "error".

The **standard deviation ($\sigma$)** is the most common statistical measure used to quantify the spread or dispersion of a set of data points around their mean. In the context of individual experimental measurements, the standard deviation of a measurement (or its uncertainty) refers to the expected variability if that measurement were to be repeated multiple times under identical conditions.

If a measurement $Y$ is reported as $Y \pm \delta Y$, where $\delta Y$ represents the uncertainty, this $\delta Y$ is frequently taken to be the **standard deviation** of that measurement. It implies that approximately 68.3% of repeated measurements would fall within the range $[Y - \delta Y, Y + \delta Y]$, assuming a normal distribution of errors.
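
The 68.3% figure can be checked numerically. The sketch below computes the one-sigma coverage of a normal distribution and compares it with a simulation; the values `Y_true = 10.0` and `dY = 0.5` are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

# Theoretical probability that a normal variable lies within one sigma of its mean
coverage_1sigma = norm.cdf(1.0) - norm.cdf(-1.0)

# Simulated repeated measurements of a hypothetical quantity Y = 10.0 +/- 0.5
rng = np.random.default_rng(1)
Y_true, dY = 10.0, 0.5
samples = rng.normal(Y_true, dY, size=100_000)
empirical = np.mean(np.abs(samples - Y_true) <= dY)

print(f"theoretical coverage: {coverage_1sigma:.4f}")
print(f"simulated coverage:   {empirical:.4f}")
```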

### Interpretation of Individual Data Points in Weighted Least Squares

When performing regression analysis, especially Weighted Least Squares (WLS), each data point $(x_i, y_i)$ is treated as follows:
* **$x_i$ (Independent Variable):** Typically assumed to be known precisely, or to have negligible uncertainty compared to $y_i$.
* **$y_i$ (Dependent Variable):** This value is considered the **best estimate** (or the sample mean) of the true underlying value of the dependent variable at $x_i$. This implies that if multiple independent measurements of $y$ were taken at $x_i$, $y_i$ would represent their average, aiming to minimize random errors.
* **$\sigma_i$ (Uncertainty/Standard Deviation of $y_i$):** This parameter, provided to the fitting algorithm (e.g., via the `sigma` argument in `scipy.optimize.curve_fit`), quantifies the **standard deviation of the measurement $y_i$**. It reflects the precision with which $y_i$ was determined. A smaller $\sigma_i$ indicates a more precise (less uncertain) measurement, and vice-versa.

### Role in Weighted Least Squares (WLS)

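When given a `sigma` array, the `scipy.optimize.curve_fit` function performs a Weighted Least Squares minimization. Setting `absolute_sigma=True` additionally tells it to treat the `sigma` values as absolute standard deviations when computing the parameter covariance matrix; it does not change the fitted parameters.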

* **Weight Calculation:** For each data point $(x_i, y_i)$ with associated standard deviation $\sigma_i$, a weight $w_i$ is calculated as the inverse of the variance: $w_i = \frac{1}{\sigma_i^2}$.
* **Minimization Objective:** The Levenberg-Marquardt algorithm (the default for `curve_fit` when no bounds are given) then seeks to minimize the **weighted residual sum of squares (WRSS)**:

    $$RSS_w = \sum_{i=1}^{N} w_i (y_i - f(x_i, \beta))^2 = \sum_{i=1}^{N} \frac{(y_i - f(x_i, \beta))^2}{\sigma_i^2}$$
    
    where $RSS_w$ is WRSS, $f(x_i, \beta)$ is the model's predicted value and $\beta$ represents the model parameters.
* **Impact of Weights:** Measurements with smaller $\sigma_i$ (higher precision) receive larger weights ($w_i$), thus exerting a greater influence on the determination of the fitted parameters. Conversely, measurements with larger $\sigma_i$ (lower precision) receive smaller weights, having less impact on the fit. This ensures that the fitting process prioritizes minimizing deviations for the more reliable data points.

By incorporating these standard deviations, WLS provides more statistically efficient (more precise) estimates of the model parameters when the assumption of constant error variance (homoscedasticity) is violated, as is the case when different data points have different known uncertainties.
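
A minimal sketch of such a fit is shown below. The two-parameter linear `model`, the synthetic data, and the noise law `sigma = 0.1 * x` are assumptions chosen for illustration; only the `sigma` and `absolute_sigma` arguments of `curve_fit` come from the discussion above.

```python
import numpy as np
from scipy.optimize import curve_fit


def model(x, A, B):
    """Hypothetical two-parameter model used for illustration."""
    return A * x + B


# Synthetic heteroscedastic data: the noise level grows with x
rng = np.random.default_rng(2)
x = np.linspace(1.0, 10.0, 30)
sigma = 0.1 * x                        # assumed known per-point standard deviations
y = model(x, 2.0, 1.0) + rng.normal(0.0, sigma)

# Weighted fit: residuals are divided by sigma, i.e. weights are 1/sigma**2
popt, pcov = curve_fit(model, x, y, sigma=sigma, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))          # one-standard-deviation parameter errors

# Weighted residual sum of squares at the optimum
wrss = np.sum(((y - model(x, *popt)) / sigma) ** 2)

print("A = %.3f +/- %.3f" % (popt[0], perr[0]))
print("B = %.3f +/- %.3f" % (popt[1], perr[1]))
print("WRSS = %.2f" % wrss)
```

Because this particular model is linear in `A` and `B`, the result can be cross-checked against the closed-form weighted normal equations; for a nonlinear model the same call applies, but no closed form is available.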


## What if each point has the same standard deviation?

If each data point has the same standard deviation, homoscedasticity is not violated. However, we can still use the known standard deviation to obtain better uncertainty estimates for the fitted parameters.

**Homoscedasticity** is the statistical assumption that the **variance of the errors** is constant across all levels of the independent variable. In our notation, this means $\sigma_i^2 = \sigma_{const}^2$, the same value for all data points $i$.

Since standard deviation is the square root of variance ($\sigma = \sqrt{\sigma^2}$), a constant variance implies a constant standard deviation. If every data point has the same standard deviation, then the assumption of homoscedasticity is **satisfied**.

This leads to a crucial and interesting point about the relationship between Weighted Least Squares (WLS) and Ordinary Least Squares (OLS).

1.  **WLS Objective:** WLS minimizes the weighted sum of squared residuals:

    $$\sum_{i=1}^{N} w_i (y_i - f(x_i, \beta))^2$$

    where the weight $w_i = 1/\sigma_i^2$.

2.  **When Homoscedasticity is Satisfied:** If every data point has the same standard deviation, let's call it $\sigma_{const}$. Then, for all $i$, we have $\sigma_i = \sigma_{const}$. This means all the weights are also the same:

    $$w_i = \frac{1}{\sigma_i^2} = \frac{1}{\sigma_{const}^2} = C$$

    where $C$ is a constant.

3.  **Mathematical Equivalence:** The WLS minimization objective then becomes:

    $$\sum_{i=1}^{N} C (y_i - f(x_i, \beta))^2 = C \sum_{i=1}^{N} (y_i - f(x_i, \beta))^2$$

    Since $C$ is just a positive constant, minimizing this expression is mathematically identical to minimizing the expression without the constant:

    $$\sum_{i=1}^{N} (y_i - f(x_i, \beta))^2$$
    
    This is exactly the objective of **Ordinary Least Squares (OLS)**.

If the assumption of homoscedasticity holds true (i.e., every data point has the same standard deviation), then **Weighted Least Squares and Ordinary Least Squares will produce the exact same parameter estimates**.

However, WLS can still be valuable in this situation. If you have a known, constant `sigma` and use `absolute_sigma=True`, `curve_fit` reports parameter uncertainties (standard errors) based on that known measurement uncertainty. Without `sigma`, the covariance matrix is instead scaled by the observed scatter of the residuals (the goodness of fit), which might not be an accurate reflection of the true experimental uncertainties.

### Scenario A: Using a constant `sigma` array with `absolute_sigma=True`

Let's assume your known, constant standard deviation is $\sigma_{known} = 0.5$. Your `sigma` array would be `[0.5, 0.5, 0.5, ...]`

* **Fitted Parameters (`popt`):** The optimal values for parameters A and B will be **identical** to the case where you don't use `sigma`. As we discussed, the minimization is mathematically equivalent to OLS, and the location of the minimum of the objective function is the same.
* **Covariance Matrix (`pcov`) and Parameter Errors (`perr`):** This is where the difference lies. By providing `sigma` and setting `absolute_sigma=True`, you are telling the algorithm: "My measurements have an absolute standard deviation of 0.5. Calculate the parameter uncertainties based on this known fact." The covariance matrix (`pcov`) and the standard errors (`perr`) derived from it will directly reflect the propagated uncertainty from your measurements.

This is the **statistically correct** approach when you have known measurement uncertainties. The resulting parameter errors are a more accurate representation of the true uncertainty in your fitted parameters, grounded in the physical reality of your experiment.

### Scenario B: Skipping the `sigma` parameter

In this case, you simply call `curve_fit` without specifying `sigma`.

* **Fitted Parameters (`popt`):** The optimal values for parameters A and B will be **identical** to Scenario A.
* **Covariance Matrix (`pcov`) and Parameter Errors (`perr`):** The `curve_fit` function handles this differently. It performs a standard OLS fit (which is equivalent to WLS with constant weights), but then it **scales the covariance matrix by the reduced chi-squared value (the goodness of fit)**. This is the behavior of `absolute_sigma=False` (the default).

    * The standard errors you get are based on the **observed scatter of your data points around the fitted curve**, not on any known measurement uncertainty.
    * If your data points are very close to the fitted curve, the reduced chi-squared value will be small, and the calculated `perr` will be artificially small.
    * If your data points are very scattered around the fitted curve, the reduced chi-squared value will be large, and the calculated `perr` will be artificially large.
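
The two scenarios can be compared directly. The sketch below uses a hypothetical linear model with parameters `A` and `B` and synthetic data with a constant standard deviation of 0.5; all numerical values are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def model(x, A, B):
    """Hypothetical two-parameter model used for illustration."""
    return A * x + B


rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 25)
sigma_known = 0.5                                  # assumed constant uncertainty
y = model(x, 2.0, 1.0) + rng.normal(0.0, sigma_known, size=x.size)
sigma = np.full_like(x, sigma_known)

# Scenario A: constant sigma, treated as absolute
popt_a, pcov_a = curve_fit(model, x, y, sigma=sigma, absolute_sigma=True)

# Scenario B: no sigma (absolute_sigma=False is the default)
popt_b, pcov_b = curve_fit(model, x, y)

perr_a = np.sqrt(np.diag(pcov_a))
perr_b = np.sqrt(np.diag(pcov_b))

# Reduced chi-squared computed with the known sigma
dof = x.size - len(popt_a)
chi2_red = np.sum(((y - model(x, *popt_a)) / sigma) ** 2) / dof

print("popt A:", popt_a, " popt B:", popt_b)
print("perr A:", perr_a, " perr B:", perr_b)
print("reduced chi-squared:", chi2_red)
```

In this setup the fitted parameters agree, and the Scenario B covariance matrix equals the Scenario A covariance matrix multiplied by the reduced chi-squared, so the two sets of parameter errors differ by a factor of $\sqrt{\chi^2_{red}}$.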

Summary:

| Feature                  | Using `sigma` with `absolute_sigma=True` (constant `sigma`) | Skipping `sigma` (OLS)                                    |
| ------------------------ | ------------------------------------------------------------- | ----------------------------------------------------------- |
| **Fitted Parameters (`popt`)** | **Identical** | **Identical** |
| **Parameter Errors (`perr`)** | **More Accurate.** Based on your known measurement uncertainty. | **Less Accurate.** Based on the observed scatter of the data. |
| **Statistical Assumption** | You know the absolute uncertainty of your measurements.         | You don't know the absolute uncertainty, but assume it is the same for every point. |
| **Purpose** | To find the parameters and their uncertainties based on your experimental knowledge. | To find the parameters and their uncertainties based on the model's goodness of fit. |

## Weighted Residual Sum of Squares (WRSS)

The objective of any "least squares" method is to find the model parameters that minimize the sum of the squared residuals. In Weighted Least Squares (WLS), this objective function is the **Weighted Residual Sum of Squares (WRSS)**. A small WRSS indicates a tight fit of the model to the data.

The formula for the WRSS is:

$$ RSS_w = \sum_{i=1}^{N} w_i (y_i - f(x_i, \beta))^2 $$

where:
* $RSS_w$ is WRSS.
* $y_i$ is the observed value for the i-th data point.
* $f(x_i, \beta)$ is the value predicted by the model for the i-th data point.
* $(y_i - f(x_i, \beta))$ is the residual (the error) for the i-th data point.
* $w_i$ is the weight assigned to the i-th data point.

The crucial difference from the ordinary Residual Sum of Squares (RSS) is the inclusion of the weight, $w_i$. This weight ensures that not all squared residuals contribute equally to the final sum. As defined in the earlier sections, the weight is typically the inverse of the error variance ($w_i = 1/\sigma_i^2$), giving more influence to more precise data points.

### The Connection Between WRSS and the Chi-Squared (χ²) Statistic

In the context of curve fitting where meaningful, known uncertainties ($\sigma_i$) are provided for each data point, the WRSS takes on a profound statistical meaning. In this case, the WRSS value is identical to the **chi-squared statistic (χ²)**.

$$ \chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - f(x_i, \beta)}{\sigma_i} \right)^2 = \sum_{i=1}^{N} \frac{(y_i - f(x_i, \beta))^2}{\sigma_i^2} = WRSS $$

This is more than just a notational change; it recasts the WRSS as a goodness-of-fit statistic. By comparing the calculated $\chi^2$ value with the theoretical chi-squared distribution for a given number of degrees of freedom, one can quantitatively assess how well the model describes the data. For more details about the $\chi^2$ statistic, see [here](../01_basics/goodness-of-fit-and-chi-squared.ipynb).
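
As a rough sketch of that comparison, the helper below computes $\chi^2$, the degrees of freedom $N - p$, the reduced $\chi^2$, and the upper-tail probability from `scipy.stats.chi2`. The function name `chi_squared_summary` and the small arrays are hypothetical, chosen so that each normalized residual is close to $\pm 1$.

```python
import numpy as np
from scipy.stats import chi2


def chi_squared_summary(y, y_fit, sigma, n_params):
    """Return chi-squared, degrees of freedom, reduced chi-squared and p-value."""
    normalized_residuals = (y - y_fit) / sigma
    chi2_value = np.sum(normalized_residuals**2)
    dof = y.size - n_params
    # Probability of a chi-squared value at least this large if the model is correct
    p_value = chi2.sf(chi2_value, dof)
    return chi2_value, dof, chi2_value / dof, p_value


# Hypothetical observations, model predictions and known uncertainties
y = np.array([1.1, 1.9, 3.2, 3.9])
y_fit = np.array([1.0, 2.0, 3.0, 4.0])
sigma = np.array([0.1, 0.1, 0.2, 0.1])

chi2_value, dof, chi2_red, p_value = chi_squared_summary(y, y_fit, sigma, n_params=2)
print(f"chi2 = {chi2_value:.3f}, dof = {dof}, reduced chi2 = {chi2_red:.3f}, p = {p_value:.3f}")
```

A reduced $\chi^2$ near 1 is what one would expect when the model is adequate and the stated $\sigma_i$ are realistic.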

### Calculating Goodness-of-Fit Metrics with WLS

Standard metrics like RMSE and R-squared should be adapted to use the weights in a WLS context to be consistent with the weighted regression.

**Root Mean Squared Error (RMSE) and Standard Error of the Residuals (SER)**

In WLS, both the RMSE and the SER (also known as the Residual Standard Error) are calculated using the WRSS to properly reflect the weighted nature of the fit.

*   **Weighted Root Mean Squared Error (WRMSE):** A common formulation for the weighted RMSE is:

    $$ RMSE_w = \sqrt{\frac{\sum_{i=1}^{N} w_i (y_i - f(x_i, \beta))^2}{\sum_{i=1}^{N} w_i}} = \sqrt{\frac{RSS_w}{\sum w_i}} $$

    This calculates a weighted average of the squared errors. Note that some definitions might use $N$ (the number of data points) in the denominator instead of the sum of the weights.

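*   **Standard Error of the Residuals (SER) for WLS:** The SER is the square root of the WRSS divided by the degrees of freedom $N - p$, where $p$ is the number of fitted parameters. The quantity under the root estimates the error variance in the weighted space (for a linear model it is an unbiased estimator), and when $w_i = 1/\sigma_i^2$ it equals the reduced chi-squared. Its formula is: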

    $$ \text{SER} = \sqrt{\frac{\text{WRSS}}{N - p}} $$

    This value quantifies the typical deviation of the data points from the fitted curve in the weighted space. A smaller SER indicates that the model's predictions are closer to the actual observations.
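
Both quantities follow directly from the WRSS. The sketch below uses the sum-of-weights convention for the weighted RMSE; the function name `wls_error_metrics` and the sample arrays are hypothetical.

```python
import numpy as np


def wls_error_metrics(y, y_fit, sigma, n_params):
    """Return (WRSS, weighted RMSE, SER) for weights w_i = 1 / sigma_i**2."""
    w = 1.0 / sigma**2
    wrss = np.sum(w * (y - y_fit) ** 2)
    wrmse = np.sqrt(wrss / np.sum(w))            # sum-of-weights convention
    ser = np.sqrt(wrss / (y.size - n_params))    # degrees of freedom N - p
    return wrss, wrmse, ser


# Hypothetical observations, model predictions and known uncertainties
y = np.array([1.1, 1.9, 3.2, 3.9])
y_fit = np.array([1.0, 2.0, 3.0, 4.0])
sigma = np.array([0.1, 0.1, 0.2, 0.1])

wrss, wrmse, ser = wls_error_metrics(y, y_fit, sigma, n_params=2)
print(f"WRSS = {wrss:.3f}, weighted RMSE = {wrmse:.4f}, SER = {ser:.4f}")
```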

**R-squared (R²) and Adjusted R-squared in WLS**

The interpretation of R-squared in WLS is more complex than in OLS. The standard R-squared formula is $R^2 = 1 - RSS/TSS$, where TSS is the Total Sum of Squares around the mean. In WLS, both the RSS and TSS must be weighted to be meaningful.

*   **Weighted R-squared ($R^2_w$):** A widely accepted approach is to calculate a weighted version of both the residual sum of squares (which is WRSS) and the total sum of squares (WTSS).

    The **Weighted Total Sum of Squares (WTSS)** is defined as:

    $$ \text{WTSS} = \sum_{i=1}^{N} w_i (y_i - \bar{y}_w)^2 $$

    where $\bar{y}_w$ is the **weighted mean** of the dependent variable $y$, calculated as $\bar{y}_w = \frac{\sum_{i=1}^{N} w_i y_i}{\sum_{i=1}^{N} w_i}$.

    The weighted R-squared is then:

    $$ R^2_w = 1 - \frac{\text{WRSS}}{\text{WTSS}} $$

    This $R^2_w$ represents the proportion of the total weighted variance in the dependent variable that is explained by the weighted model. It is important to note that different statistical packages might use slightly different formulations, which can sometimes lead to confusion when comparing results.

*   **Adjusted Weighted R-squared:** The adjusted R² is modified similarly, using the weighted R-squared and accounting for the number of data points ($N$) and the number of predictors ($k$). Here $k$ excludes the intercept, so a model with an intercept has $k + 1$ fitted parameters:

    $$ \text{Adjusted } R^2_w = 1 - (1 - R^2_w) \frac{N-1}{N-k-1} $$
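
A minimal sketch of these formulas follows. The function name `weighted_r_squared` and the sample data are hypothetical, and `n_predictors` counts the predictors excluding the intercept, matching the $N - 1$ over $N - (\text{predictors}) - 1$ ratio above. Other packages may define these quantities differently.

```python
import numpy as np


def weighted_r_squared(y, y_fit, w, n_predictors):
    """Return (weighted R^2, adjusted weighted R^2)."""
    wrss = np.sum(w * (y - y_fit) ** 2)
    y_bar_w = np.sum(w * y) / np.sum(w)          # weighted mean of y
    wtss = np.sum(w * (y - y_bar_w) ** 2)
    r2_w = 1.0 - wrss / wtss
    n = y.size
    adj_r2_w = 1.0 - (1.0 - r2_w) * (n - 1) / (n - n_predictors - 1)
    return r2_w, adj_r2_w


# Hypothetical data; with equal weights the result reduces to the ordinary R^2
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_fit = np.array([1.1, 1.9, 3.0, 4.1, 4.9])
w_equal = np.ones_like(y)
r2_w, adj_r2_w = weighted_r_squared(y, y_fit, w_equal, n_predictors=1)
print(f"R2_w = {r2_w:.4f}, adjusted R2_w = {adj_r2_w:.4f}")

# Unequal weights, e.g. from per-point standard deviations
sigma = np.array([0.1, 0.1, 0.2, 0.2, 0.4])
r2_w2, adj_r2_w2 = weighted_r_squared(y, y_fit, 1.0 / sigma**2, n_predictors=1)
print(f"R2_w = {r2_w2:.4f}, adjusted R2_w = {adj_r2_w2:.4f}")
```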

## Additional Materials

* https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
* https://online.stat.psu.edu/stat501/book/export/html/990
* https://www.stat.uchicago.edu/~yibi/teaching/stat224/L14.pdf
* https://ms.mcmaster.ca/canty/teaching/stat3a03/Lectures7.pdf
* https://en.wikipedia.org/wiki/Reduced_chi-squared_statistic
* https://en.wikipedia.org/wiki/Weighted_arithmetic_mean
* https://stats.stackexchange.com/questions/51442/weighted-variance-one-more-time
* https://stats.stackexchange.com/questions/61225/correct-equation-for-weighted-unbiased-sample-covariance/61298#61298
* https://stats.stackexchange.com/questions/330548/difference-in-r-squared-observed-from-statsmodels-when-wls-is-used
* https://stats.stackexchange.com/questions/439590/how-does-r-compute-r-squared-for-weighted-least-squares
* https://en.wikipedia.org/wiki/Pseudo-R-squared