Skip to content

Latest commit

 

History

History
281 lines (228 loc) · 11.7 KB

20221110_01.md

File metadata and controls

281 lines (228 loc) · 11.7 KB

超值峰度(excess kurtosis)、 偏度(skewness)

作者

digoal

日期

2022-11-10

标签

PostgreSQL , DuckDG , kurtosis , skewness , 峰度 , 偏度


背景

kurtosis , skewness , 峰度 , 偏度 用来描述数据的分布.

1、数据集中度. (是否大概率集中在某个中心点, 由中心点向两端概率曲线下降.)

四阶中心矩除以概率分布方差的平方再减去3:这也被称为超值峰度(excess kurtosis)。“减3”是为了让正态分布的峰度为0。

kurtosis 越靠近0, 表示数据越集中在中间部分(钟型部分.)

https://baike.baidu.com/item/%E5%B3%B0%E5%BA%A6/10840865

2、数据偏离度. (概率最高的点是否为数据的中心点上? 还是说偏左, 或偏右?)

偏度(skewness),是统计数据分布偏斜方向和程度的度量,是统计数据分布非对称程度的数字特征。定义上偏度是样本的三阶标准化矩。

https://blog.csdn.net/xbmatrix/article/details/69360167

例子

使用PostgreSQL产生正态分布数据

random_gaussian ( lb, ub, parameter ) a integer                                                                                         
  Computes a Gaussian-distributed random integer in [lb, ub], see below.    
  random_gaussian(1, 10, 2.5) a an integer between 1 and 10    
  
For a Gaussian distribution, the interval is mapped onto a standard normal distribution   
(the classical bell-shaped Gaussian curve) truncated at -parameter on the left and +parameter on the right.   
Values in the middle of the interval are more likely to be drawn. To be precise, if PHI(x)   
is the cumulative distribution function of the standard normal distribution, with mean mu   
defined as (max + min) / 2.0, with:   
  
    f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /  
           (2.0 * PHI(parameter) - 1)  
  
then value i between min and max inclusive is drawn with probability:   
  f(i + 0.5) - f(i - 0.5).   
  
Intuitively, the larger the parameter, the more frequently values close to  
the middle of the interval are drawn, and the less frequently values close   
to the min and max bounds.   
About 67% of values are drawn from the middle 1.0 / parameter,   
  that is a relative 0.5 / parameter around the mean,   
and 95% in the middle 2.0 / parameter,   
  that is a relative 1.0 / parameter around the mean;   
  
for instance,   
if parameter is 4.0, 67% of values are drawn from the middle quarter (1.0 / 4.0) of the interval   
(i.e., from 3.0 / 8.0 to 5.0 / 8.0) and 95% from the middle half (2.0 / 4.0) of the  
interval (second and third quartiles).   
  
The minimum allowed parameter value is 2.0.  
postgres=# create unlogged table t1 (id int, info text);  
postgres=# create unlogged table t2 (id int, info text);  
postgres=# create unlogged table t3 (id int, info text);  
postgres=# create unlogged table t4 (id int, info text);  
  
cat t1.sql  
\set id1 random_gaussian(1,1000,2)  
\set id2 random_gaussian(1,1000,3)  
\set id3 random_gaussian(1,1000,4)  
\set id4 random_gaussian(1,1000,10)  
begin;  
insert into t1 values (:id1, md5(random()::text));  
insert into t2 values (:id2, md5(random()::text));  
insert into t3 values (:id3, md5(random()::text));  
insert into t4 values (:id4, md5(random()::text));  
end;  
  
pgbench -M prepared -n -r -P 1 -f ./t1.sql -c 4 -j 4 -T 120  
  
  
copy t1 to '/Users/digoal/Downloads/t1.csv' with (format csv);  
copy t2 to '/Users/digoal/Downloads/t2.csv' with (format csv);  
copy t3 to '/Users/digoal/Downloads/t3.csv' with (format csv);  
copy t4 to '/Users/digoal/Downloads/t4.csv' with (format csv);  

将数据导入duckdb, 计算峰度和偏度

D create table t1 (id int, info text);  
D create table t2 (id int, info text);  
D create table t3 (id int, info text);  
D create table t4 (id int, info text);  
  
D insert into t1 select * from '/Users/digoal/Downloads/t1.csv';  
D insert into t2 select * from '/Users/digoal/Downloads/t2.csv';  
D insert into t3 select * from '/Users/digoal/Downloads/t3.csv';  
D insert into t4 select * from '/Users/digoal/Downloads/t4.csv';  
  
  
D select id,count(*) from t1 group by id order by count(*) desc limit 10;  
┌─────┬──────────────┐  
│ id  │ count_star() │  
├─────┼──────────────┤  
│ 486 │ 1496         │  
│ 537 │ 1457         │  
│ 464 │ 1443         │  
│ 539 │ 1442         │  
│ 496 │ 1429         │  
│ 524 │ 1428         │  
│ 528 │ 1422         │  
│ 482 │ 1421         │  
│ 503 │ 1421         │  
│ 546 │ 1420         │  
└─────┴──────────────┘  
D select id,count(*) from t1 group by id order by count(*)  limit 10;  
┌──────┬──────────────┐  
│  id  │ count_star() │  
├──────┼──────────────┤  
│ 999  │ 175          │  
│ 1    │ 178          │  
│ 3    │ 184          │  
│ 983  │ 184          │  
│ 988  │ 186          │  
│ 996  │ 187          │  
│ 1000 │ 187          │  
│ 995  │ 188          │  
│ 11   │ 190          │  
│ 9    │ 191          │  
└──────┴──────────────┘  
D select id,count(*) from t2 group by id order by count(*) desc limit 10;  
┌─────┬──────────────┐  
│ id  │ count_star() │  
├─────┼──────────────┤  
│ 474 │ 2097         │  
│ 507 │ 2060         │  
│ 496 │ 2056         │  
│ 501 │ 2051         │  
│ 522 │ 2049         │  
│ 458 │ 2032         │  
│ 459 │ 2026         │  
│ 500 │ 2025         │  
│ 512 │ 2025         │  
│ 497 │ 2020         │  
└─────┴──────────────┘  
D select id,count(*) from t2 group by id order by count(*)  limit 10;  
┌─────┬──────────────┐  
│ id  │ count_star() │  
├─────┼──────────────┤  
│ 999 │ 17           │  
│ 7   │ 19           │  
│ 989 │ 19           │  
│ 996 │ 19           │  
│ 988 │ 20           │  
│ 20  │ 21           │  
│ 1   │ 22           │  
│ 3   │ 22           │  
│ 12  │ 22           │  
│ 10  │ 23           │  
└─────┴──────────────┘  
  
D select id,count(*) from t4 group by id order by count(*) desc limit 10;  
┌─────┬──────────────┐  
│ id  │ count_star() │  
├─────┼──────────────┤  
│ 496 │ 6232         │  
│ 497 │ 6200         │  
│ 501 │ 6157         │  
│ 493 │ 6149         │  
│ 504 │ 6144         │  
│ 500 │ 6143         │  
│ 502 │ 6137         │  
│ 498 │ 6127         │  
│ 505 │ 6125         │  
│ 507 │ 6124         │  
└─────┴──────────────┘  
D select id,count(*) from t4 group by id order by count(*) limit 10;  
┌─────┬──────────────┐  
│ id  │ count_star() │  
├─────┼──────────────┤  
│ 263 │ 1            │  
│ 281 │ 1            │  
│ 284 │ 1            │  
│ 285 │ 1            │  
│ 287 │ 1            │  
│ 291 │ 1            │  
│ 292 │ 1            │  
│ 293 │ 1            │  
│ 296 │ 1            │  
│ 299 │ 1            │  
└─────┴──────────────┘  
D select kurtosis(id) from t1;  
┌─────────────────────┐  
│    kurtosis(id)     │  
├─────────────────────┤  
│ -0.6318288828341471 │  
└─────────────────────┘  
D select kurtosis(id) from t2;  
┌──────────────────────┐  
│     kurtosis(id)     │  
├──────────────────────┤  
│ -0.17422601070650803 │  
└──────────────────────┘  
D select kurtosis(id) from t3;  
┌───────────────────────┐  
│     kurtosis(id)      │  
├───────────────────────┤  
│ -0.022138186449561795 │  
└───────────────────────┘  
D select kurtosis(id) from t4;  
┌───────────────────────┐  
│     kurtosis(id)      │  
├───────────────────────┤  
│ -0.009409321541938891 │  
└───────────────────────┘  
D select skewness(id) from t1;  
┌───────────────────────┐  
│     skewness(id)      │  
├───────────────────────┤  
│ 0.0001219640608811309 │  
└───────────────────────┘  
D select skewness(id) from t2;  
┌─────────────────────┐  
│    skewness(id)     │  
├─────────────────────┤  
│ 0.00398621173574203 │  
└─────────────────────┘  
D select skewness(id) from t3;  
┌───────────────────────┐  
│     skewness(id)      │  
├───────────────────────┤  
│ -0.004060786734657334 │  
└───────────────────────┘  
D select skewness(id) from t4;  
┌────────────────────────┐  
│      skewness(id)      │  
├────────────────────────┤  
│ -0.0012135024934198028 │  
└────────────────────────┘  

digoal's wechat