digoal
2022-11-10
PostgreSQL , DuckDG , kurtosis , skewness , 峰度 , 偏度
kurtosis , skewness , 峰度 , 偏度 用来描述数据的分布.
1、数据集中度. (是否大概率集中在某个中心点, 由中心点向两端概率曲线下降.)
四阶中心矩除以概率分布方差的平方再减去3:这也被称为超值峰度(excess kurtosis)。“减3”是为了让正态分布的峰度为0。
kurtosis 越靠近0, 表示数据越集中在中间部分(钟型部分.)
https://baike.baidu.com/item/%E5%B3%B0%E5%BA%A6/10840865
2、数据偏离度. (概率最高的点是否为数据的中心点上? 还是说偏左, 或偏右?)
偏度(skewness),是统计数据分布偏斜方向和程度的度量,是统计数据分布非对称程度的数字特征。定义上偏度是样本的三阶标准化矩。
https://blog.csdn.net/xbmatrix/article/details/69360167
使用PostgreSQL产生正态分布数据
random_gaussian ( lb, ub, parameter ) a integer
Computes a Gaussian-distributed random integer in [lb, ub], see below.
random_gaussian(1, 10, 2.5) a an integer between 1 and 10
For a Gaussian distribution, the interval is mapped onto a standard normal distribution
(the classical bell-shaped Gaussian curve) truncated at -parameter on the left and +parameter on the right.
Values in the middle of the interval are more likely to be drawn. To be precise, if PHI(x)
is the cumulative distribution function of the standard normal distribution, with mean mu
defined as (max + min) / 2.0, with:
f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /
(2.0 * PHI(parameter) - 1)
then value i between min and max inclusive is drawn with probability:
f(i + 0.5) - f(i - 0.5).
Intuitively, the larger the parameter, the more frequently values close to
the middle of the interval are drawn, and the less frequently values close
to the min and max bounds.
About 67% of values are drawn from the middle 1.0 / parameter,
that is a relative 0.5 / parameter around the mean,
and 95% in the middle 2.0 / parameter,
that is a relative 1.0 / parameter around the mean;
for instance,
if parameter is 4.0, 67% of values are drawn from the middle quarter (1.0 / 4.0) of the interval
(i.e., from 3.0 / 8.0 to 5.0 / 8.0) and 95% from the middle half (2.0 / 4.0) of the
interval (second and third quartiles).
The minimum allowed parameter value is 2.0.
postgres=# create unlogged table t1 (id int, info text);
postgres=# create unlogged table t2 (id int, info text);
postgres=# create unlogged table t3 (id int, info text);
postgres=# create unlogged table t4 (id int, info text);
cat t1.sql
\set id1 random_gaussian(1,1000,2)
\set id2 random_gaussian(1,1000,3)
\set id3 random_gaussian(1,1000,4)
\set id4 random_gaussian(1,1000,10)
begin;
insert into t1 values (:id1, md5(random()::text));
insert into t2 values (:id2, md5(random()::text));
insert into t3 values (:id3, md5(random()::text));
insert into t4 values (:id4, md5(random()::text));
end;
pgbench -M prepared -n -r -P 1 -f ./t1.sql -c 4 -j 4 -T 120
copy t1 to '/Users/digoal/Downloads/t1.csv' with (format csv);
copy t2 to '/Users/digoal/Downloads/t2.csv' with (format csv);
copy t3 to '/Users/digoal/Downloads/t3.csv' with (format csv);
copy t4 to '/Users/digoal/Downloads/t4.csv' with (format csv);
将数据导入duckdb, 计算峰度和偏度
D create table t1 (id int, info text);
D create table t2 (id int, info text);
D create table t3 (id int, info text);
D create table t4 (id int, info text);
D insert into t1 select * from '/Users/digoal/Downloads/t1.csv';
D insert into t2 select * from '/Users/digoal/Downloads/t2.csv';
D insert into t3 select * from '/Users/digoal/Downloads/t3.csv';
D insert into t4 select * from '/Users/digoal/Downloads/t4.csv';
D select id,count(*) from t1 group by id order by count(*) desc limit 10;
┌─────┬──────────────┐
│ id │ count_star() │
├─────┼──────────────┤
│ 486 │ 1496 │
│ 537 │ 1457 │
│ 464 │ 1443 │
│ 539 │ 1442 │
│ 496 │ 1429 │
│ 524 │ 1428 │
│ 528 │ 1422 │
│ 482 │ 1421 │
│ 503 │ 1421 │
│ 546 │ 1420 │
└─────┴──────────────┘
D select id,count(*) from t1 group by id order by count(*) limit 10;
┌──────┬──────────────┐
│ id │ count_star() │
├──────┼──────────────┤
│ 999 │ 175 │
│ 1 │ 178 │
│ 3 │ 184 │
│ 983 │ 184 │
│ 988 │ 186 │
│ 996 │ 187 │
│ 1000 │ 187 │
│ 995 │ 188 │
│ 11 │ 190 │
│ 9 │ 191 │
└──────┴──────────────┘
D select id,count(*) from t2 group by id order by count(*) desc limit 10;
┌─────┬──────────────┐
│ id │ count_star() │
├─────┼──────────────┤
│ 474 │ 2097 │
│ 507 │ 2060 │
│ 496 │ 2056 │
│ 501 │ 2051 │
│ 522 │ 2049 │
│ 458 │ 2032 │
│ 459 │ 2026 │
│ 500 │ 2025 │
│ 512 │ 2025 │
│ 497 │ 2020 │
└─────┴──────────────┘
D select id,count(*) from t2 group by id order by count(*) limit 10;
┌─────┬──────────────┐
│ id │ count_star() │
├─────┼──────────────┤
│ 999 │ 17 │
│ 7 │ 19 │
│ 989 │ 19 │
│ 996 │ 19 │
│ 988 │ 20 │
│ 20 │ 21 │
│ 1 │ 22 │
│ 3 │ 22 │
│ 12 │ 22 │
│ 10 │ 23 │
└─────┴──────────────┘
D select id,count(*) from t4 group by id order by count(*) desc limit 10;
┌─────┬──────────────┐
│ id │ count_star() │
├─────┼──────────────┤
│ 496 │ 6232 │
│ 497 │ 6200 │
│ 501 │ 6157 │
│ 493 │ 6149 │
│ 504 │ 6144 │
│ 500 │ 6143 │
│ 502 │ 6137 │
│ 498 │ 6127 │
│ 505 │ 6125 │
│ 507 │ 6124 │
└─────┴──────────────┘
D select id,count(*) from t4 group by id order by count(*) limit 10;
┌─────┬──────────────┐
│ id │ count_star() │
├─────┼──────────────┤
│ 263 │ 1 │
│ 281 │ 1 │
│ 284 │ 1 │
│ 285 │ 1 │
│ 287 │ 1 │
│ 291 │ 1 │
│ 292 │ 1 │
│ 293 │ 1 │
│ 296 │ 1 │
│ 299 │ 1 │
└─────┴──────────────┘
D select kurtosis(id) from t1;
┌─────────────────────┐
│ kurtosis(id) │
├─────────────────────┤
│ -0.6318288828341471 │
└─────────────────────┘
D select kurtosis(id) from t2;
┌──────────────────────┐
│ kurtosis(id) │
├──────────────────────┤
│ -0.17422601070650803 │
└──────────────────────┘
D select kurtosis(id) from t3;
┌───────────────────────┐
│ kurtosis(id) │
├───────────────────────┤
│ -0.022138186449561795 │
└───────────────────────┘
D select kurtosis(id) from t4;
┌───────────────────────┐
│ kurtosis(id) │
├───────────────────────┤
│ -0.009409321541938891 │
└───────────────────────┘
D select skewness(id) from t1;
┌───────────────────────┐
│ skewness(id) │
├───────────────────────┤
│ 0.0001219640608811309 │
└───────────────────────┘
D select skewness(id) from t2;
┌─────────────────────┐
│ skewness(id) │
├─────────────────────┤
│ 0.00398621173574203 │
└─────────────────────┘
D select skewness(id) from t3;
┌───────────────────────┐
│ skewness(id) │
├───────────────────────┤
│ -0.004060786734657334 │
└───────────────────────┘
D select skewness(id) from t4;
┌────────────────────────┐
│ skewness(id) │
├────────────────────────┤
│ -0.0012135024934198028 │
└────────────────────────┘