Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(expr): allow sparse column id in chunk #8789

Merged
merged 2 commits into from
Nov 14, 2022

Conversation

andylokandy
Copy link
Collaborator

@andylokandy andylokandy commented Nov 14, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

Columns in a Chunk has an individual column id that is not necessarily continuous.

When converting between Chunk and DataBlock, we assume the ids are continuous.

@vercel
Copy link

vercel bot commented Nov 14, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Updated
databend ⬜️ Ignored (Inspect) Nov 14, 2022 at 9:37AM (UTC)

@mergify mergify bot added the pr-refactor this PR changes the code base without new features or bugfix label Nov 14, 2022
@BohuTANG BohuTANG merged commit 2fb477c into databendlabs:main Nov 14, 2022
@RinChanNOWWW
Copy link
Contributor

The memory consumption of HashMap may be larger than Vec when the chunk is almost full.

@sundy-li
Copy link
Member

The index of the hashmap seems useless to represent a chunk.

Maybe we should have this map out of the chunk.

@andylokandy
Copy link
Collaborator Author

andylokandy commented Nov 14, 2022

The idea behind this change is to use the global column id all the way through the process of planning, block pruning, and execution. It helps reduce the complexity of reusing the same column id in different phrases in which the column refernces used in the Expr are different(some of the column refs are eliminated by constant folder), especially when considering to give every chunk a unique Expr with unique column ref set in the future.

@RinChanNOWWW
Copy link
Contributor

The idea behind this change is to use the global column id all the way through the process of planning, block pruning, and execution.

The columns order is the same as the fields order in the schema. When we wanted to operate one column in the chunk, we would write such codes:

let index = schema.index_of(&col_name)?; // which is to find an index  in  Vec<DataField>
let col = chunk.columns()[[index];

How do we achieve this after refactoring Chunk?

@b41sh
Copy link
Member

b41sh commented Nov 14, 2022

The idea behind this change is to use the global column id all the way through the process of planning, block pruning, and execution. It helps reduce the complexity of reusing the same column id in different phrases in which the column refernces used in the Expr are different(some of the column refs are eliminated by constant folder), especially when considering to give every chunk a unique Expr with unique column ref set in the future.

Can we put ColumnId into Schema? When using it, the index of Column in Chunk can be obtained from the Schema according to the ColumnId.
This makes the structure of Chunk simpler. At the same time, Schema and Chunk also need to be used together, and will not bring too many additional operations.

@sundy-li
Copy link
Member

The idea behind this change is to use the global column id all the way through the process of planning, block pruning, and execution. It helps reduce the complexity of reusing the same column id in different phrases in which the column refernces used in the Expr are different(some of the column refs are eliminated by constant folder), especially when considering to give every chunk a unique Expr with unique column ref set in the future.

Chunk shall be enabled to create without any information with Expr or binder on the storage side. How to convert a chunk from arrow::Chunk ?

@andylokandy
Copy link
Collaborator Author

This makes the structure of Chunk simpler. At the same time, Schema and Chunk also need to be used together, and will not bring too many additional operations.

In order to release the full power of the adaptive constant folder, each chunk should be able to have different numbers of columns even if they are for the same query. Therefore, every Chunk will have its own Schema to retrieve its unique id mapping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-refactor this PR changes the code base without new features or bugfix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants