Skip to content

After hash join, the size of the generated table is three times that of vanilla Spark #10693

@xiaojie19852006

Description

@xiaojie19852006

Backend

VL (Velox)

Bug description

[TEST SQL:
origin spark,
drop table tmpxx purge;
create table tmpxx using orc as select * from store_sales s1 left outer join store_returns s2 on sr_item_sk = ss_item_sk where s1.ss_sold_date_sk>2452630 and s2.sr_returned_date_sk>2452790;

gluten,
drop table tmpyy purge;
create table tmpyy using orc as select * from store_sales s1 left outer join store_returns s2 on sr_item_sk = ss_item_sk where s1.ss_sold_date_sk>2452630 and s2.sr_returned_date_sk>2452790;

the result:
91.3 G 273.8 G hdfs://tpc/warehouse/tablespace/managed/hive/tpcds_10000_orc.db/tmpxx
575.2 G 1.7 T hdfs://tpc/warehouse/tablespace/managed/hive/tpcds_10000_orc.db/tmpyy

Tests show that whether native write is used or not does not affect the result,without hash join occurring and with direct insertion, the data volume shows little difference.Expected behavior] and [actual behavior].

Gluten version

Gluten-1.2

Spark version

Spark-3.4.x

Spark configurations

No response

System information

No response

Relevant logs

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingtriage

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions