
run 200 million rows group by crashed #5641

Closed
wonb168 opened this issue Dec 9, 2022 · 11 comments

@wonb168 commented Dec 9, 2022

What happens?

>>> import duckdb as dd
>>> con = dd.connect("yierkang.duckdb")
>>> con.execute("select count(*) from tmp_stock_kpi").fetch_df()  #3513697 231571997
   count_star()
0     220042544
>>> sql="""-- aggregation
... drop table if exists tmp_fct_stock_sku_stockorg_kpi;
... create table tmp_fct_stock_sku_stockorg_kpi as
... select
...     operating_unit_sk, sku_sk, max(skc_sk) as skc_sk, max(product_code) as product_code, max(color_code) as color_code,
...     max(size_code) as size_code, stockorg_sk, max(stockorg_code) as stockorg_code,
...     sum(stock_qty) as stock_qty, sum(onroad_stock_qty) as onroad_stock_qty,
...     sum(onorder_stock_in_qty) as onorder_stock_in_qty, sum(onorder_stock_out_qty) as onorder_stock_out_qty
... from tmp_stock_kpi
... group by operating_unit_sk, sku_sk, stockorg_sk, day_date"""
>>> con.execute(sql)
Killed

I checked the database file: it is 3.1 GB, and the memory_limit in DuckDB is 53.9 GB.

To Reproduce

>>> con.execute("SELECT current_setting('memory_limit')").df()
  current_setting('memory_limit')
0                          53.9GB
con.execute("PRAGMA database_size").df()
  database_size  block_size  total_blocks  used_blocks  free_blocks wal_size memory_usage memory_limit
0         3.3GB      262144         12657        12657            0  0 bytes      0 bytes       53.9GB

ls -lh yi*
-rw-rw-r-- 1 etl_yierkang_prod etl_yierkang_prod 3.1G Dec  8 16:14 yierkang.duckdb
-rw-rw-r-- 1 etl_yierkang_prod etl_yierkang_prod    0 Dec  8 18:41 yierkang.duckdb.wal

OS:

CentOS 7

DuckDB Version:

0.6.1

DuckDB Client:

Python

Full Name:

Changzhen Wang

Affiliation:

linezonedata.com

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@wonb168 wonb168 added the bug label Dec 9, 2022
@Mause (Member) commented Dec 9, 2022

Can you please try with 0.6.1?

@wonb168 (Author) commented Dec 9, 2022

> Can you please try with 0.6.1?

Sorry, my version is 0.6.1; I wrote the wrong version.

>>> dd.__version__
'0.6.1'

Connecting to my .duckdb file or reading the parquet file directly, both get killed.

>>> sql="""-- aggregation
... drop table if exists tmp_fct_stock_sku_stockorg_kpi;
... create table tmp_fct_stock_sku_stockorg_kpi as
... select
...     operating_unit_sk, sku_sk, max(skc_sk) as skc_sk, max(product_code) as product_code, max(color_code) as color_code,
...     max(size_code) as size_code, stockorg_sk, max(stockorg_code) as stockorg_code,
...     sum(stock_qty) as stock_qty, sum(onroad_stock_qty) as onroad_stock_qty,
...     sum(onorder_stock_in_qty) as onorder_stock_in_qty, sum(onorder_stock_out_qty) as onorder_stock_out_qty
... from 'tmp_stock_kpi.parquet'
... group by operating_unit_sk, sku_sk, stockorg_sk, day_date"""
>>> con.execute(sql)
Killed


@hannes (Member) commented Dec 9, 2022

Can you please try to make this bug report reproducible, e.g. by uploading the parquet file somewhere? Thanks

@wonb168 (Author) commented Dec 9, 2022

> Can you please try to make this bug report reproducible, e.g. by uploading the parquet file somewhere? Thanks

The duckdb file is 3.1 GB, and the parquet file is even larger at 4.1 GB. How can I share such a big file with you?

@hannes (Member) commented Dec 9, 2022

Well you could try to reduce the dataset to make it smaller (e.g. by dropping columns not used in the query)?
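One way to build such a reduced reproducer (a sketch, untested against this dataset; the 5 percent sample fraction and the output file name are arbitrary choices, not from this thread):

import duckdb

con = duckdb.connect()
# Keep only the columns the failing query touches and a fraction of the
# rows, then write a smaller parquet file that can be shared.
con.execute("""
    COPY (
        SELECT operating_unit_sk, sku_sk, stockorg_sk, day_date,
               skc_sk, product_code, color_code, size_code, stockorg_code,
               stock_qty, onroad_stock_qty, onorder_stock_in_qty, onorder_stock_out_qty
        FROM 'tmp_stock_kpi.parquet'
        USING SAMPLE 5 PERCENT (system)
    ) TO 'tmp_stock_kpi_small.parquet' (FORMAT PARQUET)
""")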

@Mytherin (Collaborator) commented Dec 9, 2022

The “killed” indicates to me this is most likely the system running out of memory and triggering the OOM killer. Note that the database size is the compressed size; the group by requires the entire uncompressed result of the query to fit in memory.

I would also try lowering the memory limit; I have seen the OOM killer get triggered on Linux systems before the actual system RAM limit was reached, particularly in setups with multiple NUMA nodes.
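A minimal sketch of that experiment (the 20GB cap and thread count are arbitrary illustrative values, not from this thread; pick a limit safely below the machine's physical RAM):

import duckdb

con = duckdb.connect("yierkang.duckdb")
# Cap DuckDB's buffer manager well below physical RAM so the process
# hits DuckDB's own limit before the kernel OOM killer steps in.
con.execute("SET memory_limit='20GB'")
# Fewer threads also lowers the peak memory of a parallel GROUP BY.
con.execute("SET threads=4")
con.execute(sql)  # then re-run the failing CREATE TABLE ... GROUP BY from above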

@wonb168 (Author) commented Dec 9, 2022

> Well you could try to reduce the dataset to make it smaller (e.g. by dropping columns not used in the query)?

I tried with fewer rows: this SQL has no problem. I tried with fewer columns: also OK.

con.execute("select * from tmp_stock_kpi limit 1").df()
operating_unit_sk   sku_sk   skc_sk product_code color_code size_code  stockorg_sk stockorg_code  stock_qty  onroad_stock_qty  onorder_stock_in_qty  onorder_stock_out_qty   day_date


sql="""--聚合
drop table if exists tmp_fct_stock_sku_stockorg_kpi;
create table tmp_fct_stock_sku_stockorg_kpi as
select
    operating_unit_sk, sku_sk, stockorg_sk, day_date, max(skc_sk) as skc_sk
    ,max(product_code) as product_code
    , max(color_code) as color_code
    --,max(size_code) as size_code, max(stockorg_code) as stockorg_code
    --,sum(stock_qty) as stock_qty, sum(onroad_stock_qty) as onroad_stock_qty
    --,sum(onorder_stock_in_qty) as onorder_stock_in_qty, sum(onorder_stock_out_qty) as onorder_stock_out_qty
--from (select * from 'tmp_stock_kpi.parquet' limit 10) t
from 'tmp_stock_kpi.parquet'
group by operating_unit_sk, sku_sk, stockorg_sk, day_date"""
con.execute(sql)
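If the full query keeps getting killed, a hypothetical workaround (not from this thread) is to aggregate one operating_unit_sk partition at a time, so each GROUP BY hash table only holds a slice of the groups. A sketch, assuming operating_unit_sk is numeric and has moderate cardinality:

import duckdb

con = duckdb.connect("yierkang.duckdb")
units = [r[0] for r in con.execute(
    "SELECT DISTINCT operating_unit_sk FROM 'tmp_stock_kpi.parquet'").fetchall()]

con.execute("DROP TABLE IF EXISTS tmp_fct_stock_sku_stockorg_kpi")
first = True
for unit in units:
    # Each pass aggregates a single partition, keeping peak memory low.
    select = f"""
        SELECT operating_unit_sk, sku_sk, stockorg_sk, day_date,
               max(skc_sk) AS skc_sk, max(product_code) AS product_code,
               max(color_code) AS color_code, max(size_code) AS size_code,
               max(stockorg_code) AS stockorg_code,
               sum(stock_qty) AS stock_qty,
               sum(onroad_stock_qty) AS onroad_stock_qty,
               sum(onorder_stock_in_qty) AS onorder_stock_in_qty,
               sum(onorder_stock_out_qty) AS onorder_stock_out_qty
        FROM 'tmp_stock_kpi.parquet'
        WHERE operating_unit_sk = {unit}
        GROUP BY operating_unit_sk, sku_sk, stockorg_sk, day_date"""
    if first:
        con.execute(f"CREATE TABLE tmp_fct_stock_sku_stockorg_kpi AS {select}")
        first = False
    else:
        con.execute(f"INSERT INTO tmp_fct_stock_sku_stockorg_kpi {select}")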

@wonb168 (Author) commented Dec 9, 2022

> The “killed” indicates to me this is most likely the system running out of memory and triggering the OOM killer. Note that the database size is the compressed size; the group by requires the entire uncompressed result of the query to fit in memory.

> I would also try lowering the memory limit; I have seen the OOM killer get triggered on Linux systems before the actual system RAM limit was reached, particularly in setups with multiple NUMA nodes.

con.execute("PRAGMA database_size").df()  # 3.3GB, only one table

free -m
              total        used        free      shared  buff/cache   available
Mem:          32010       22748        6048        1636        3214        7407
Swap:             0           0           0

Note that free -m reports only about 32 GB of physical RAM and no swap, which is well below the 53.9 GB memory_limit.


con.execute("""SELECT * EXCLUDE (column_path, segment_id, start, stats, persistent, block_id, block_offset, has_updates)
FROM pragma_storage_info('tmp_stock_kpi')
USING SAMPLE 10 ROWS
ORDER BY row_group_id;""").df()

row_group_id column_name column_id segment_type count compression
130 onorder_stock_out_qty 11 VALIDITY 122880 Uncompressed
293 stockorg_code 7 VARCHAR 122880 Dictionary
366 skc_sk 2 VALIDITY 122880 Constant
460 skc_sk 2 VALIDITY 122880 Constant
535 onroad_stock_qty 9 INTEGER 122880 RLE
537 stock_qty 8 INTEGER 122880 BitPacking
704 color_code 4 VALIDITY 122880 Constant
879 size_code 5 VARCHAR 122880 Dictionary
1108 size_code 5 VARCHAR 122880 Dictionary
1130 stockorg_code 7 VARCHAR 122880 Dictionary

@l1t1's comment was marked as abuse.

@github-actions (bot) commented

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Jul 29, 2023
@github-actions (bot) commented

This issue was closed because it has been stale for 30 days with no activity.

@github-actions github-actions bot closed this as not planned Aug 28, 2023