
resize hash table before building #9069

Closed
Wants to merge 340 commits.

Conversation

englefly
Contributor

Proposed changes

Initialize the hash table size from the build-side tuple count instead of the fixed value 1024, to reduce BuildTableExpanseTime.
With the hash table pre-sized, the total build time decreased by 8.9% on TPC-H 10G for:
select count(*) from lineitem join orders on l_orderkey = o_orderkey
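The idea behind the change can be sketched as follows. Doris's build-side hash table lives in C++ BE code, so this is only an illustrative Java sketch under assumed names (`PreSizedHashTable`, `capacityFor` are hypothetical): a hash table that starts at a small fixed size must rehash repeatedly while the build side is inserted, whereas sizing it up front from the known tuple count avoids every expansion.

```java
import java.util.HashMap;
import java.util.Map;

public class PreSizedHashTable {
    // Capacity that avoids rehashing while inserting `expected` entries,
    // given HashMap's default load factor of 0.75.
    static int capacityFor(int expected) {
        return (int) Math.ceil(expected / 0.75);
    }

    public static void main(String[] args) {
        int tupleCount = 1_000_000; // known from the build side's row count

        // Fixed small initial size: the table rehashes repeatedly as it grows.
        Map<Long, Integer> fixed = new HashMap<>(1024);

        // Pre-sized from the tuple count: no rehash during the build phase.
        Map<Long, Integer> preSized = new HashMap<>(capacityFor(tupleCount));

        for (int i = 0; i < tupleCount; i++) {
            fixed.put((long) i, i);
            preSized.put((long) i, i);
        }
        // Both tables hold the same data; only the build cost differs.
        System.out.println(preSized.size()); // 1000000
    }
}
```

The same reasoning applies to the BE hash table: the contents are identical either way, only the number of expansions during the build phase changes.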

Issue Number: close #xxx

Problem Summary:

Describe the overview of changes.

Checklist(Required)

  1. Does it affect the original behavior: (Yes/No/I Don't know)
  2. Have unit tests been added: (Yes/No/No Need)
  3. Has documentation been added or modified: (Yes/No/No Need)
  4. Does it need to update dependencies: (Yes/No)
  5. Are there any changes that cannot be rolled back: (Yes/No)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@zbtzbtzbt
Contributor

What is the hash table build time before and after your PR?

yiguolei
yiguolei previously approved these changes May 23, 2022
Contributor

@yiguolei yiguolei left a comment


LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label May 23, 2022
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@yiguolei yiguolei added the dev/1.0.1-deprecated should be merged into dev-1.0.1 branch label May 23, 2022
@yiguolei yiguolei added this to the v1.1 milestone May 23, 2022
@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label May 23, 2022
SleepyBear96 and others added 18 commits May 23, 2022 11:11
apache#9180)

* avoiding a corrupt image file when there is image.ckpt with non-zero size

Currently, saveImage writes data to image.ckpt via an appending FileOutputStream. If a non-zero-size file named image.ckpt already exists, the result is a corrupt image file. Even worse, the FE keeps only the latest image file and removes the others.

Additionally, the image file should be synced to disk.

Keeping only the latest image file is dangerous, because an image file is only validated when the next image file is generated: we may keep a non-validated image file while removing validated ones. A follow-up PR will keep at least two image files.

* append other data after MetaHeader

* use channel.force instead of sync
Co-authored-by: Rongqian Li <rongqian_li@idgcapital.com>
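The fix described in this commit can be sketched as below. This is a hypothetical stand-in for the FE code (`ImageWriter` and `writeCheckpoint` are invented names): the checkpoint file is opened with `append = false` so a stale non-empty image.ckpt is truncated rather than appended to, and `FileChannel.force` (the replacement for sync mentioned in the commit) flushes the bytes to the storage device before the file is treated as complete.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class ImageWriter {
    // Hypothetical sketch, not the actual FE code: open the checkpoint in
    // truncate mode so leftover bytes cannot corrupt the new image, then
    // force the write to disk. Returns the resulting file size.
    static long writeCheckpoint(Path ckpt, byte[] data) throws IOException {
        // append = false: truncate any stale non-empty image.ckpt.
        try (FileOutputStream out = new FileOutputStream(ckpt.toFile(), false)) {
            out.write(data);
            FileChannel channel = out.getChannel();
            channel.force(true); // flush file data and metadata to the device
        }
        return Files.size(ckpt);
    }

    public static void main(String[] args) throws IOException {
        Path ckpt = Files.createTempFile("image", ".ckpt");
        Files.write(ckpt, new byte[]{9, 9, 9, 9, 9}); // simulate a stale leftover image.ckpt
        long size = writeCheckpoint(ckpt, new byte[]{1, 2, 3});
        System.out.println(size); // 3: the stale bytes were truncated, not appended to
    }
}
```

Had the stream been opened in append mode, the file would end up with the stale bytes followed by the new image, which is exactly the corruption the commit describes.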
…n mode (apache#9195)

Co-authored-by: yiguolei <yiguolei@gmail.com>
* rename ImageSeq to LatestImageSeq in Storage

* keep at least one validated image file
…he#9011)

* load the newly generated image file as soon as it is generated, to check that it is valid

* delete the latest invalid image file

* fix

* fix

* get filePath from saveImage() to ensure the correct file is deleted when an exception happens

* fix

Co-authored-by: wuhangze <wuhangze@jd.com>
… spark load. (apache#9136)

Buffer flip was used incorrectly.
When the hash key is a string type, the hash value is always zero.
The reason is that the buffer for a string type is obtained via wrap, which does not need to be flipped:
flipping it sets the buffer's read limit to zero.
…ge back to SSD (apache#9158)

1. fix bug described in apache#9159
2. fix a `fill_tuple` bug introduced in apache#9173
start_fe.sh: line 174: [: -eq: unary operator expected
xleoken and others added 24 commits May 23, 2022 11:11
…hread (apache#9472)

* add ArrowReaderProperties to parquet::arrow::FileReader

* support prefetch batch
Co-authored-by: lihaopeng <lihaopeng@baidu.com>
…e it consistent with hive and trino behavior. (apache#9190)

Hive and Trino/Presto automatically trim trailing spaces, but Doris doesn't,
which causes query results to differ from Hive's.

Add a new session variable "trim_tailing_spaces_for_external_table_query".
If set to true, the broker scan node trims the trailing spaces of each column when reading CSV.
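The trimming behavior described above can be sketched like this (a hypothetical helper, not Doris's actual scan-node code): only trailing spaces are removed, so leading and interior spaces survive, matching how Hive and Trino treat padded values.

```java
public class TrimTrailing {
    // Remove trailing spaces only; leading and interior spaces are kept.
    static String trimTrailing(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == ' ') {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + trimTrailing("  abc  ") + "]"); // [  abc]
    }
}
```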
…vparquet/vbroker scanner (apache#9666)

* [Refactor][Bug-Fix][Load Vec] Refactor code of basescanner and vjson/vparquet/vbroker scanner
1. fix a bug where the vjson scanner did not support `range_from_file_path`
2. fix a core dump in the vjson/vbroker scanner when src/dest slot nullability differs
3. fix a `filter_block` bug in the vparquet scanner when a column's reference count is not 1
4. refactor to simplify the code

It only changed vectorized load, not original row based load.

Co-authored-by: lihaopeng <lihaopeng@baidu.com>
Enhance Java style.

The checkstyle rules for code order are now described on the "Class and Interface Declarations" page.

This PR lets IDEA automatically rearrange code accordingly.
Currently, the libhdfs3 library integrated into Doris BE does not support accessing clusters with Kerberos authentication
enabled: the Kerberos-related dependencies (gsasl and krb5) were not added when building libhdfs3.

So this PR enables Kerberos support and rebuilds libhdfs3 with the gsasl and krb5 dependencies:

- gsasl version: 1.8.0
- krb5 version: 1.19
select column from table where column is null
Disabled by default, because the current checksum logic has some bugs and also adds some overhead.
…pache#9703)

Due to the current architecture, predicate derivation at rewrite time cannot cover all cases,
because rewriting processes the ON clause before the WHERE clause, and when subqueries are present not all predicates can be derived.
So the predicate pushdown method is kept here.

eg.
select * from t1 left join t2 on t1 = t2 where t1 = 1;

InferFiltersRule can't infer t2 = 1, because that is outside its specification.

The expression (t2 = 1) can nevertheless be deduced and pushed down to the scan node.
@github-actions github-actions bot added area/sql/function Issues or PRs related to the SQL functions kind/docs Categorizes issue or PR as related to documentation. labels May 23, 2022
@morningman morningman removed the dev/1.0.1-deprecated should be merged into dev-1.0.1 branch label May 24, 2022
@englefly englefly deleted the resize_join_hash_table branch August 5, 2022 16:48
Labels
area/sql/function, area/vectorization, kind/docs, reviewed