-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Description
What happens?
The population correlation coefficient is defined as
where
When we have a constant X, the standard deviation corr(X, Y) should be NaN. However, this is all mathematically speaking. When doing floating point arithmetic, there are roundoff errors. For example,
>>> a = np.array([5.4, 5.4, 5.4])
>>> a.mean()
5.400000000000001
>>> a.mean() == 5.4
False
>>> a.std()
8.881784197001252e-16
DuckDB's stddev implementation uses Welford's algorithm which accumulates the mean differences and is thus immune to this problem. Fantastic!
However, the Combine method, when used in a multithreaded context, makes the results non-deterministically.
These tiny near-zero standard deviations will then create non-deterministic correlations for corr aggregations!
To Reproduce
import pandas as pd
import numpy as np
import duckdb
num_rows = 1_000_000
data = {
's': np.random.choice(list('ABCDEFG'), num_rows),
'x': [3.7] * num_rows, # constant value
'y': range(num_rows)
}
df = pd.DataFrame(data)
con = duckdb.connect()
for i in range(2):
res = con.query("""
select s,
stddev(x),
corr(x, y)
from df
group by s
""").df()
print(res)
con.sql("SET threads=1")
for i in range(2):
res = con.query("""
select s,
stddev(x),
corr(x, y)
from df
group by s
""").df()
print(res)may produce:
s stddev(x) corr(x, y)
0 A 4.450146e-16 -0.183090
1 D 2.439442e-16 -0.219159
2 E 0.000000e+00 NaN
3 G 1.795092e-16 0.247403
4 B 0.000000e+00 NaN
5 C 1.637839e-16 0.227031
6 F 0.000000e+00 NaN
s stddev(x) corr(x, y)
0 C 1.700127e-16 0.058311
1 A 0.000000e+00 NaN
2 D 1.555774e-16 -0.181494
3 B 1.691312e-16 0.091954
4 G 0.000000e+00 NaN
5 E 2.059132e-16 0.111619
6 F 0.000000e+00 NaN
set threads=1 deterministically produces
s stddev(x) corr(x, y)
0 E 0.0 NaN
1 B 0.0 NaN
2 C 0.0 NaN
3 D 0.0 NaN
4 G 0.0 NaN
5 A 0.0 NaN
6 F 0.0 NaN
OS:
Ubuntu 22.04.2 LTS, aarch64
DuckDB Version:
1.0.0 & 1.0.1.dev5143
DuckDB Client:
Python
Full Name:
Mark T.
Affiliation:
bq
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a nightly build
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have