
run 200 million rows group by crashed #5641

Closed
wonb168 opened this issue Dec 9, 2022 · 11 comments

@wonb168 commented Dec 9, 2022

What happens?

>>> import duckdb as dd
>>> con = dd.connect("yierkang.duckdb")
>>> con.execute("select count(*) from tmp_stock_kpi").fetch_df()  #3513697 231571997
   count_star()
0     220042544
>>> sql="""-- aggregation
... drop table if exists tmp_fct_stock_sku_stockorg_kpi;
... create table tmp_fct_stock_sku_stockorg_kpi as
... select
...     operating_unit_sk, sku_sk, max(skc_sk) as skc_sk, max(product_code) as product_code, max(color_code) as color_code,
...     max(size_code) as size_code, stockorg_sk, max(stockorg_code) as stockorg_code,
...     sum(stock_qty) as stock_qty, sum(onroad_stock_qty) as onroad_stock_qty,
...     sum(onorder_stock_in_qty) as onorder_stock_in_qty, sum(onorder_stock_out_qty) as onorder_stock_out_qty
... from tmp_stock_kpi
... group by operating_unit_sk, sku_sk, stockorg_sk, day_date"""
>>> con.execute(sql)
Killed

I checked the database file: it is 3.1 GB, and the memory_limit in DuckDB is 53.9 GB.

To Reproduce

>>> con.execute("SELECT current_setting('memory_limit')").df()
  current_setting('memory_limit')
0                          53.9GB
con.execute("PRAGMA database_size").df()
  database_size  block_size  total_blocks  used_blocks  free_blocks wal_size memory_usage memory_limit
0         3.3GB      262144         12657        12657            0  0 bytes      0 bytes       53.9GB

ls -lh yi*
-rw-rw-r-- 1 etl_yierkang_prod etl_yierkang_prod 3.1G Dec  8 16:14 yierkang.duckdb
-rw-rw-r-- 1 etl_yierkang_prod etl_yierkang_prod    0 Dec  8 18:41 yierkang.duckdb.wal

OS:

CentOS 7

DuckDB Version:

0.6.1

DuckDB Client:

Python

Full Name:

Changzhen Wang

Affiliation:

linezonedata.com

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@wonb168 wonb168 added the bug label Dec 9, 2022
@Mause (Member) commented Dec 9, 2022

Can you please try with 0.6.1?

@wonb168 (Author) commented Dec 9, 2022

> Can you please try with 0.6.1?

Sorry, my version is 0.6.1; I wrote the wrong version.

>>> dd.__version__
'0.6.1'

Connecting to my .duckdb file or reading the parquet file directly, both get killed.

>>> sql="""-- aggregation
... drop table if exists tmp_fct_stock_sku_stockorg_kpi;
... create table tmp_fct_stock_sku_stockorg_kpi as
... select
...     operating_unit_sk, sku_sk, max(skc_sk) as skc_sk, max(product_code) as product_code, max(color_code) as color_code,
...     max(size_code) as size_code, stockorg_sk, max(stockorg_code) as stockorg_code,
...     sum(stock_qty) as stock_qty, sum(onroad_stock_qty) as onroad_stock_qty,
...     sum(onorder_stock_in_qty) as onorder_stock_in_qty, sum(onorder_stock_out_qty) as onorder_stock_out_qty
... from 'tmp_stock_kpi.parquet'
... group by operating_unit_sk, sku_sk, stockorg_sk, day_date"""
>>> con.execute(sql)
Killed


@hannes (Member) commented Dec 9, 2022

Can you please try to make this bug report reproducible, e.g. by uploading the parquet file somewhere? Thanks

@wonb168 (Author) commented Dec 9, 2022

> Can you please try to make this bug report reproducible, e.g. by uploading the parquet file somewhere? Thanks

The duckdb file is 3.1 GB, and the parquet file is even larger at 4.1 GB. How can I share such a big file with you?

@hannes (Member) commented Dec 9, 2022

Well you could try to reduce the dataset to make it smaller (e.g. by dropping columns not used in the query)?
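One way to build such a reduced reproducer (a sketch, untested against this dataset; the 5 percent sample fraction and the output file name are arbitrary choices, not from this thread):

import duckdb

con = duckdb.connect()
# Keep only the columns the failing query touches and a fraction of the
# rows, then write a smaller parquet file that can be shared.
con.execute("""
    COPY (
        SELECT operating_unit_sk, sku_sk, stockorg_sk, day_date,
               skc_sk, product_code, color_code, size_code, stockorg_code,
               stock_qty, onroad_stock_qty, onorder_stock_in_qty, onorder_stock_out_qty
        FROM 'tmp_stock_kpi.parquet'
        USING SAMPLE 5 PERCENT (system)
    ) TO 'tmp_stock_kpi_small.parquet' (FORMAT PARQUET)
""")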

@Mytherin (Collaborator) commented Dec 9, 2022

The “killed” indicates to me this is most likely the system running out of memory and triggering the OOM killer. Note that the database size is the compressed size; the group by requires the entire uncompressed result of the query to fit in memory.

I would also try lowering the memory limit; I have seen the OOM killer get triggered on Linux systems before the actual system RAM limit was reached, particularly in setups with multiple NUMA nodes.
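A minimal sketch of that experiment (the 20GB cap and thread count are arbitrary illustrative values, not from this thread; pick a limit safely below the machine's physical RAM):

import duckdb

con = duckdb.connect("yierkang.duckdb")
# Cap DuckDB's buffer manager well below physical RAM so the process
# hits DuckDB's own limit before the kernel OOM killer steps in.
con.execute("SET memory_limit='20GB'")
# Fewer threads also lowers the peak memory of a parallel GROUP BY.
con.execute("SET threads=4")
con.execute(sql)  # then re-run the failing CREATE TABLE ... GROUP BY from above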

@wonb168 (Author) commented Dec 9, 2022

> Well you could try to reduce the dataset to make it smaller (e.g. by dropping columns not used in the query)?

I tried with fewer rows: this SQL has no problem. I tried with fewer columns: also OK.

con.execute("select * from tmp_stock_kpi limit 1").df()
operating_unit_sk   sku_sk   skc_sk product_code color_code size_code  stockorg_sk stockorg_code  stock_qty  onroad_stock_qty  onorder_stock_in_qty  onorder_stock_out_qty   day_date


sql="""--聚合
drop table if exists tmp_fct_stock_sku_stockorg_kpi;
create table tmp_fct_stock_sku_stockorg_kpi as
select
    operating_unit_sk, sku_sk, stockorg_sk, day_date, max(skc_sk) as skc_sk
    ,max(product_code) as product_code
    , max(color_code) as color_code
    --,max(size_code) as size_code, max(stockorg_code) as stockorg_code
    --,sum(stock_qty) as stock_qty, sum(onroad_stock_qty) as onroad_stock_qty
    --,sum(onorder_stock_in_qty) as onorder_stock_in_qty, sum(onorder_stock_out_qty) as onorder_stock_out_qty
--from (select * from 'tmp_stock_kpi.parquet' limit 10) t
from 'tmp_stock_kpi.parquet'
group by operating_unit_sk, sku_sk, stockorg_sk, day_date"""
con.execute(sql)
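If the full query keeps getting killed, a hypothetical workaround (not from this thread) is to aggregate one operating_unit_sk partition at a time, so each GROUP BY hash table only holds a slice of the groups. A sketch, assuming operating_unit_sk is numeric and has moderate cardinality:

import duckdb

con = duckdb.connect("yierkang.duckdb")
units = [r[0] for r in con.execute(
    "SELECT DISTINCT operating_unit_sk FROM 'tmp_stock_kpi.parquet'").fetchall()]

con.execute("DROP TABLE IF EXISTS tmp_fct_stock_sku_stockorg_kpi")
first = True
for unit in units:
    # Each pass aggregates a single partition, keeping peak memory low.
    select = f"""
        SELECT operating_unit_sk, sku_sk, stockorg_sk, day_date,
               max(skc_sk) AS skc_sk, max(product_code) AS product_code,
               max(color_code) AS color_code, max(size_code) AS size_code,
               max(stockorg_code) AS stockorg_code,
               sum(stock_qty) AS stock_qty,
               sum(onroad_stock_qty) AS onroad_stock_qty,
               sum(onorder_stock_in_qty) AS onorder_stock_in_qty,
               sum(onorder_stock_out_qty) AS onorder_stock_out_qty
        FROM 'tmp_stock_kpi.parquet'
        WHERE operating_unit_sk = {unit}
        GROUP BY operating_unit_sk, sku_sk, stockorg_sk, day_date"""
    if first:
        con.execute(f"CREATE TABLE tmp_fct_stock_sku_stockorg_kpi AS {select}")
        first = False
    else:
        con.execute(f"INSERT INTO tmp_fct_stock_sku_stockorg_kpi {select}")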

@wonb168 (Author) commented Dec 9, 2022

> The “killed” indicates to me this is most likely the system running out of memory and triggering the OOM killer. Note that the database size is the compressed size; the group by requires the entire uncompressed result of the query to fit in memory.

> I would also try lowering the memory limit; I have seen the OOM killer get triggered on Linux systems before the actual system RAM limit was reached, particularly in setups with multiple NUMA nodes.

con.execute("PRAGMA database_size").df()  # 3.3GB, only one table

free -m
              total        used        free      shared  buff/cache   available
Mem:          32010       22748        6048        1636        3214        7407
Swap:             0           0           0

Note that free -m reports only about 32 GB of physical RAM and no swap, which is well below the 53.9 GB memory_limit.


con.execute("""SELECT * EXCLUDE (column_path, segment_id, start, stats, persistent, block_id, block_offset, has_updates)
FROM pragma_storage_info('tmp_stock_kpi')
USING SAMPLE 10 ROWS
ORDER BY row_group_id;""").df()

row_group_id column_name column_id segment_type count compression
130 onorder_stock_out_qty 11 VALIDITY 122880 Uncompressed
293 stockorg_code 7 VARCHAR 122880 Dictionary
366 skc_sk 2 VALIDITY 122880 Constant
460 skc_sk 2 VALIDITY 122880 Constant
535 onroad_stock_qty 9 INTEGER 122880 RLE
537 stock_qty 8 INTEGER 122880 BitPacking
704 color_code 4 VALIDITY 122880 Constant
879 size_code 5 VARCHAR 122880 Dictionary
1108 size_code 5 VARCHAR 122880 Dictionary
1130 stockorg_code 7 VARCHAR 122880 Dictionary

@l1t1's comment was marked as abuse.

@github-actions (bot) commented

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Jul 29, 2023
@github-actions (bot) commented

This issue was closed because it has been stale for 30 days with no activity.

@github-actions github-actions bot closed this as not planned Aug 28, 2023