## Statement

There are two standard formulas given in textbooks to compute the sample variance $s^2$ of the numbers $x_1, ..., x_n$:

$$
\begin{align*}
s^2 &= \frac{1}{n-1} \Big[\sum^n_{i=1} x_i^2 - \frac{1}{n} \Big(\sum^n_{i=1} x_i\Big)^2\Big] \tag{1}\\
\text{First compute } \bar{x} = \frac{1}{n} \sum^n_{i=1} x_i,
\text{ and then } s^2 &= \frac{1}{n-1} \sum^n_{i=1} \big(x_i-\bar{x}\big)^2 \tag{2}
\end{align*}
$$

Which one is more stable?

## Numerical Explanation

In [15]:
X = [1.253, 2.411, 3.174]   # sample data
y = 0.935562                # reference value of s^2, rounded to 6 decimals
n = length(X)               # number of samples

3

Calculate $s^2$ using **formula (1)**:

$$s^2 = \frac{1}{n-1} \Big[\sum^n_{i=1} x_i^2 - \frac{1}{n} \big(\sum^n_{i=1} x_i\big)^2\Big]$$

In [17]:
s2_1 = 1 / (n - 1) * (sum(X.^2) - 1 / n * sum(X)^2)

0.9355623333333352

Calculate $s^2$ using **formula (2)**, by first calculating $$\bar{x} = \frac{1}{n} \sum^n_{i=1} x_i$$

In [18]:
x_bar = 1 / n * sum(X)

2.2793333333333328

And then calculate using

$$s^2 = \frac{1}{n-1} \sum^n_{i=1} \big(x_i-\bar{x}\big)^2$$

In [21]:
s2_2 = 1 / (n - 1) * sum((X .- x_bar).^2)

0.9355623333333334

Let's compare the errors of the two formulas against the reference value $y$:

In [22]:
println("formula 1 error: $(s2_1 - y)")
println("formula 2 error: $(s2_2 - y)")

# compare magnitudes: a signed difference would wrongly favor a large negative error
if abs(s2_1 - y) < abs(s2_2 - y)
    println("--> formula 1 is more stable")
else
    println("--> formula 2 is more stable")
end

formula 1 error: 3.3333333515628283e-7
formula 2 error: 3.33333333379926e-7
--> formula 2 is more stable


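**According to the results, formula (2) is more stable than formula (1).** Note that both errors are dominated by the rounding of $y$ to six decimals (about $3.33 \times 10^{-7}$); the two computed values themselves differ by only about $1.8 \times 10^{-15}$.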
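As a rough sketch of a case where the gap becomes clearly visible, the same data can be shifted by a large constant, which leaves the true variance unchanged but makes the squared terms huge. The offset `shift = 1.0e9` and the `BigFloat` reference `ref` are assumptions added here for illustration; they are not part of the original experiment.

```julia
# Sketch: same data, shifted by a large (hypothetical) offset.
X = [1.253, 2.411, 3.174]
shift = 1.0e9
Xs = X .+ shift
n = length(Xs)

# formula (1)
s2_1 = 1 / (n - 1) * (sum(Xs .^ 2) - 1 / n * sum(Xs)^2)

# formula (2)
x_bar = 1 / n * sum(Xs)
s2_2 = 1 / (n - 1) * sum((Xs .- x_bar) .^ 2)

# Reference: formula (2) evaluated in BigFloat on the exact Float64 inputs
Xb = big.(Xs)
ref = sum((Xb .- sum(Xb) / n) .^ 2) / (n - 1)

err1 = Float64(abs(s2_1 - ref))
err2 = Float64(abs(s2_2 - ref))
println("formula 1 error: $err1")
println("formula 2 error: $err2")
```

With an offset this large, formula (1) should lose essentially all of its correct digits, while formula (2) should stay close to the reference.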

## Mathematical Explanation

***If the given formulas are too complex, don't try to prove it with Jacobian matrices or relative condition numbers. I tried and it sucks!***

There are three possible sources of numerical instability:

- Round-off error
- Overflow / Underflow
- Cancellation Error

Let's consider them one by one.

### Round-off error

In this case, **formula (1) is prone to round-off error.** The squared terms $x_i^2$ and $(\sum_{i=1}^nx_i)^2$ generally need about twice as many significant digits as their inputs, so they have to be rounded to fit the 53-bit significand (mantissa) of [IEEE 754's double precision](https://en.wikipedia.org/wiki/Double-precision_floating-point_format). Because the squares are large, each rounding introduces a large absolute error.

Meanwhile, **formula (2)** is also prone to round-off error, but less so: the only squares are $(x_i - \overline{x})^2$, and they are taken of values that are already small in magnitude.
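To make the effect of squaring concrete, the short sketch below prints the gap between adjacent `Float64` values near a number and near its square, using Julia's built-in `eps`. The sample value `x = 1.0e9` is a hypothetical choice for illustration.

```julia
x = 1.0e9                 # hypothetical sample value
gap_x  = eps(x)           # distance from x to the next larger Float64
gap_x2 = eps(x^2)         # distance from x^2 to the next larger Float64

println("spacing near x   : $gap_x")
println("spacing near x^2 : $gap_x2")
```

The spacing near $x^2$ is many orders of magnitude larger than near $x$, so every rounded square carries a correspondingly larger absolute error.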

### Overflow / Underflow

In this case, **formula (1) is more prone to overflow or underflow**, particularly in the terms $x_i^2$ and especially $(\sum^n_{i=1} x_i)^2$, which may become very large once squared. The squared value may then exceed the range of, let's say, [IEEE 754's double precision](https://en.wikipedia.org/wiki/Double-precision_floating-point_format), whose largest finite value is about $1.8 \times 10^{308}$.

Meanwhile, **formula (2)** subtracts $\overline{x}$ from each $x_i$, which brings the values much closer to zero before they are squared, so the squares are less likely to overflow.

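*Note: underflow happens when a result is too close to zero to be represented, for example when squaring a very small value. A (big) negative number does not underflow when squared; it overflows like a big positive one.*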
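The overflow case can be sketched with made-up data of huge magnitude but tiny relative spread. The values in `X` below are hypothetical and chosen only so that the squares in formula (1) exceed the `Float64` range while the squared deviations in formula (2) still fit. The last line squares a tiny value to show underflow to zero.

```julia
# Hypothetical data: huge magnitude, tiny relative spread
X = 1.0e160 .* [1.0, 1.0 + 1e-10, 1.0 + 2e-10]
n = length(X)

# formula (1): each x_i^2 is about 1e320, beyond the Float64 range
s2_1 = 1 / (n - 1) * (sum(X .^ 2) - 1 / n * sum(X)^2)

# formula (2): deviations are about 1e150, so their squares still fit
x_bar = 1 / n * sum(X)
s2_2 = 1 / (n - 1) * sum((X .- x_bar) .^ 2)

println("formula 1: $s2_1")
println("formula 2: $s2_2")
println("underflow: $((1.0e-200)^2)")
```

In formula (1) both terms overflow to `Inf`, and `Inf - Inf` yields `NaN`, whereas formula (2) returns a finite value.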

### Cancellation Error

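**BOTH** **formula (1)** and **formula (2)** are prone to *cancellation error*, which occurs when two nearly equal floating-point numbers are subtracted:

- In **formula (1)**, $\sum^n_{i=1} x_i^2 - \frac{1}{n} (\sum^n_{i=1} x_i)^2$ subtracts two large numbers that are nearly equal whenever the mean is large compared to the spread of the data.
- In **formula (2)**, $x_i-\bar{x}$ subtracts two numbers that are close whenever $x_i$ lies near the mean.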

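But **formula (1)** is worse because its subtraction amplifies the round-off errors already present in the two large terms $\sum_{i=1}^n x_i^2$ and $\frac{1}{n}(\sum_{i=1}^nx_i)^2$, which we mentioned earlier. In **formula (2)**, the $x_i$ are the input data themselves, so only the rounding error in $\bar{x}$ gets amplified.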

The subtraction $n-1$ doesn't count here because $n$ is an integer, so it is computed exactly.
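A rough way to quantify this is the standard bound on how a subtraction amplifies relative errors. Here $\hat{a}$ and $\hat{b}$ denote computed values of $a$ and $b$ with relative errors $\delta_a$ and $\delta_b$; these symbols are introduced only for this sketch and are not part of the original notes.

$$
\frac{\big|(\hat{a}-\hat{b})-(a-b)\big|}{|a-b|} \le \max\big(|\delta_a|,|\delta_b|\big)\,\frac{|a|+|b|}{|a-b|},
\qquad \hat{a}=a(1+\delta_a),\ \hat{b}=b(1+\delta_b)
$$

For formula (1), take $a = \sum^n_{i=1} x_i^2$ and $b = \frac{1}{n} \big(\sum^n_{i=1} x_i\big)^2 = n\bar{x}^2$, so that $a - b = (n-1)s^2$. Assuming $s^2 > 0$, the amplification factor becomes

$$
\frac{|a|+|b|}{|a-b|} = 1 + \frac{2n\,\bar{x}^2}{(n-1)\,s^2}
$$

which grows with the square of the ratio $\bar{x}/s$: the further the data sit from zero relative to their spread, the more digits formula (1) loses.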

### Conclusion

From the discussion, **formula (1)** is clearly the less stable of the two because it suffers from all the problems mentioned: round-off error, overflow/underflow and cancellation error.

**Formula (2)** is also prone to those errors, but much less severely.

*Note: parts of this summary were inspired by a [conversation with ChatGPT](https://chat.openai.com/share/064cb66a-4070-45d5-9a2f-2394c7270165).*