
resize hash table before building #9069

Closed
Wants to merge 340 commits.

Conversation

englefly
Contributor

Proposed changes

Initialize the hash table size from the build-side tuple count instead of the fixed value 1024, to reduce BuildTableExpanseTime.
With the hash table pre-sized, the total build time decreased by 8.9% on TPC-H 10G for:
select count(*) from lineitem join orders on l_orderkey = o_orderkey
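The idea behind the change can be sketched as follows. Doris's build-side hash table lives in C++ BE code, so this is only an illustrative Java sketch under assumed names (`PreSizedHashTable`, `capacityFor` are hypothetical): a hash table that starts at a small fixed size must rehash repeatedly while the build side is inserted, whereas sizing it up front from the known tuple count avoids every expansion.

```java
import java.util.HashMap;
import java.util.Map;

public class PreSizedHashTable {
    // Capacity that avoids rehashing while inserting `expected` entries,
    // given HashMap's default load factor of 0.75.
    static int capacityFor(int expected) {
        return (int) Math.ceil(expected / 0.75);
    }

    public static void main(String[] args) {
        int tupleCount = 1_000_000; // known from the build side's row count

        // Fixed small initial size: the table rehashes repeatedly as it grows.
        Map<Long, Integer> fixed = new HashMap<>(1024);

        // Pre-sized from the tuple count: no rehash during the build phase.
        Map<Long, Integer> preSized = new HashMap<>(capacityFor(tupleCount));

        for (int i = 0; i < tupleCount; i++) {
            fixed.put((long) i, i);
            preSized.put((long) i, i);
        }
        // Both tables hold the same data; only the build cost differs.
        System.out.println(preSized.size()); // 1000000
    }
}
```

The same reasoning applies to the BE hash table: the contents are identical either way, only the number of expansions during the build phase changes.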

Issue Number: close #xxx

Problem Summary:

Describe the overview of changes.

Checklist(Required)

  1. Does it affect the original behavior: (Yes/No/I Don't know)
  2. Have unit tests been added: (Yes/No/No Need)
  3. Has documentation been added or modified: (Yes/No/No Need)
  4. Does it need to update dependencies: (Yes/No)
  5. Are there any changes that cannot be rolled back: (Yes/No)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@zbtzbtzbt
Contributor

What is the hash table build time before and after your PR?

yiguolei
yiguolei previously approved these changes May 23, 2022
Contributor

@yiguolei yiguolei left a comment


LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label May 23, 2022
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@yiguolei yiguolei added the dev/1.0.1-deprecated should be merged into dev-1.0.1 branch label May 23, 2022
@yiguolei yiguolei added this to the v1.1 milestone May 23, 2022
@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label May 23, 2022
SleepyBear96 and others added 18 commits May 23, 2022 11:11
apache#9180)

* avoiding a corrupt image file when there is image.ckpt with non-zero size

Currently, saveImage writes data to image.ckpt via an appending FileOutputStream. If a non-zero-size file named image.ckpt already exists, the result is a corrupt image file. Even worse, the FE keeps only the latest image file and removes the others.

Additionally, the image file should be synced to disk.

Keeping only the latest image file is dangerous, because an image file is only validated when the next image file is generated: we may keep a non-validated image file while removing validated ones. A follow-up PR will keep at least two image files.

* append other data after MetaHeader

* use channel.force instead of sync
Co-authored-by: Rongqian Li <rongqian_li@idgcapital.com>
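The fix described in this commit can be sketched as below. This is a hypothetical stand-in for the FE code (`ImageWriter` and `writeCheckpoint` are invented names): the checkpoint file is opened with `append = false` so a stale non-empty image.ckpt is truncated rather than appended to, and `FileChannel.force` (the replacement for sync mentioned in the commit) flushes the bytes to the storage device before the file is treated as complete.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class ImageWriter {
    // Hypothetical sketch, not the actual FE code: open the checkpoint in
    // truncate mode so leftover bytes cannot corrupt the new image, then
    // force the write to disk. Returns the resulting file size.
    static long writeCheckpoint(Path ckpt, byte[] data) throws IOException {
        // append = false: truncate any stale non-empty image.ckpt.
        try (FileOutputStream out = new FileOutputStream(ckpt.toFile(), false)) {
            out.write(data);
            FileChannel channel = out.getChannel();
            channel.force(true); // flush file data and metadata to the device
        }
        return Files.size(ckpt);
    }

    public static void main(String[] args) throws IOException {
        Path ckpt = Files.createTempFile("image", ".ckpt");
        Files.write(ckpt, new byte[]{9, 9, 9, 9, 9}); // simulate a stale leftover image.ckpt
        long size = writeCheckpoint(ckpt, new byte[]{1, 2, 3});
        System.out.println(size); // 3: the stale bytes were truncated, not appended to
    }
}
```

Had the stream been opened in append mode, the file would end up with the stale bytes followed by the new image, which is exactly the corruption the commit describes.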
…n mode (apache#9195)

Co-authored-by: yiguolei <yiguolei@gmail.com>
* rename ImageSeq to LatestImageSeq in Storage

* keep at least one validated image file
…he#9011)

* load the newly generated image file as soon as it is generated, to check that it is valid

* delete the latest invalid image file

* fix

* fix

* get filePath from saveImage() to ensure the correct file is deleted when an exception happens

* fix

Co-authored-by: wuhangze <wuhangze@jd.com>
… spark load. (apache#9136)

Buffer flip was used incorrectly.
When the hash key is a string type, the hash value is always zero.
The reason is that the buffer for a string type is obtained via wrap, which does not need to be flipped:
flipping it sets the buffer's read limit to zero.
…ge back to SSD (apache#9158)

1. fix bug described in apache#9159
2. fix a `fill_tuple` bug introduced in apache#9173
start_fe.sh: line 174: [: -eq: unary operator expected
xleoken and others added 24 commits May 23, 2022 11:11
…hread (apache#9472)

* add ArrowReaderProperties to parquet::arrow::FileReader

* support prefetch batch
Co-authored-by: lihaopeng <lihaopeng@baidu.com>
…e it consistent with hive and trino behavior. (apache#9190)

Hive and Trino/Presto automatically trim trailing spaces, but Doris doesn't,
which causes query results to differ from Hive's.

Add a new session variable "trim_tailing_spaces_for_external_table_query".
If set to true, the broker scan node trims the trailing spaces of each column when reading CSV.
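The trimming behavior described above can be sketched like this (a hypothetical helper, not Doris's actual scan-node code): only trailing spaces are removed, so leading and interior spaces survive, matching how Hive and Trino treat padded values.

```java
public class TrimTrailing {
    // Remove trailing spaces only; leading and interior spaces are kept.
    static String trimTrailing(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == ' ') {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + trimTrailing("  abc  ") + "]"); // [  abc]
    }
}
```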
…vparquet/vbroker scanner (apache#9666)

* [Refactor][Bug-Fix][Load Vec] Refactor code of basescanner and vjson/vparquet/vbroker scanner
1. fix a bug where the vjson scanner did not support `range_from_file_path`
2. fix a core dump in the vjson/vbroker scanner when src/dest slot nullability differs
3. fix a `filter_block` bug in the vparquet scanner when a column's reference count is not 1
4. refactor to simplify the code

It only changed vectorized load, not original row based load.

Co-authored-by: lihaopeng <lihaopeng@baidu.com>
Enhance Java style.

The checkstyle rules for code order are now described on the "Class and Interface Declarations" page.

This PR lets IDEA automatically rearrange code accordingly.
Currently, the libhdfs3 library integrated into Doris BE does not support accessing clusters with Kerberos authentication
enabled: the Kerberos-related dependencies (gsasl and krb5) were not added when building libhdfs3.

So this PR enables Kerberos support and rebuilds libhdfs3 with the gsasl and krb5 dependencies:

- gsasl version: 1.8.0
- krb5 version: 1.19
select column from table where column is null
Disabled by default, because the current checksum logic has some bugs and also adds some overhead.
…pache#9703)

Due to the current architecture, predicate derivation at rewrite time cannot cover all cases,
because rewriting processes the ON clause before the WHERE clause, and when subqueries are present not all predicates can be derived.
So the predicate pushdown method is kept here.

eg.
select * from t1 left join t2 on t1 = t2 where t1 = 1;

InferFiltersRule can't infer t2 = 1, because that is outside its specification.

The expression (t2 = 1) can nevertheless be deduced and pushed down to the scan node.
@github-actions github-actions bot added area/sql/function Issues or PRs related to the SQL functions kind/docs Categorizes issue or PR as related to documentation. labels May 23, 2022
@morningman morningman removed the dev/1.0.1-deprecated should be merged into dev-1.0.1 branch label May 24, 2022
@englefly englefly deleted the resize_join_hash_table branch August 5, 2022 16:48
Labels
area/sql/function, area/vectorization, kind/docs, reviewed