Read Parquet data file with projection #244
Firstly, it's great to see someone else helping out on this. Getting projection and filtering working on reads will unlock the most important (for me, anyway 😅) use cases, so thanks for the contribution! I think you need to make a change, though: the Parquet file columns should be matched by ID rather than by name, since the schema can evolve. See https://iceberg.apache.org/spec/#column-projection for more details. This makes it a bit trickier, but you are thinking along the right lines!
Thank you @sdd. I will take a look at the doc tomorrow and update the PR accordingly.
Thank you @liurenjie1024. I have looked at the doc @sdd mentioned that describes Iceberg column projection. It looks like the projection is specified by field id, as @sdd said, because of schema evolution. At the user API level, though, I think columns should be selected by name. I will take a look at #251 and #252 and see if I can implement them first.
Ah, I see. Thanks for the pointer. I've started looking at the Java implementation for #251.
**Problem Statement**

When converting a Parquet file to Arrow in Iceberg, there are several problems to take into consideration.
**Example**

Let's use an example to illustrate these problems. Say the current Iceberg table schema is the following:

```
schema {
  struct person [id = 1] {
    struct address [id = 2] {
      string city [id = 3]
      string street [id = 4]
    }
    string name [id = 5]
  }
  struct hometown [id = 6] {
    string city [id = 7]
    string state [id = 8]
  }
  long age [id = 9]
}
```

And a Parquet file with the following schema:

```
schema {
  struct person [id = 1] {
    struct address [id = 2] {
      string city [id = 3]
      string street [id = 4]
    }
    string name [id = 5]
  }
  struct hometown [id = 6] {
    string city [id = 7]
  }
  int age [id = 9]
}
```

Now we want to do the following projection: `("person.address", "person.name", "hometown.state", "age")`. The result schema is supposed to be the following:
**Solution**

After #251 and #252, we have finished the necessary building blocks for projection. Here is a proposed algorithm for this:
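The proposed algorithm was also elided from this copy of the thread, but from the surrounding discussion it roughly amounts to: walk the selected Iceberg fields, match them against the file schema by field id (never by name), prune structs to the surviving children, and materialize fields missing from the file as null columns. A minimal sketch of the id-based pruning step, using toy types rather than the actual iceberg-rust/arrow APIs:

```rust
use std::collections::HashSet;

// Toy schema node: either a primitive or a struct, each carrying an Iceberg field id.
#[derive(Clone, Debug)]
enum Field {
    Primitive { id: i32, name: String },
    Struct { id: i32, name: String, children: Vec<Field> },
}

// Prune `field` down to the field ids in `selected`, matching by id (not name).
// Returns None when nothing under this node survives. Ids that are selected but
// absent from the file are simply not produced here; a real reader would
// materialize them as null columns.
fn project(field: &Field, selected: &HashSet<i32>) -> Option<Field> {
    match field {
        Field::Primitive { id, .. } => selected.contains(id).then(|| field.clone()),
        Field::Struct { id, name, children } => {
            if selected.contains(id) {
                // The whole struct was selected: keep it as-is.
                return Some(field.clone());
            }
            let kept: Vec<Field> =
                children.iter().filter_map(|c| project(c, selected)).collect();
            if kept.is_empty() {
                None
            } else {
                // Keep the struct, pruned to the selected children.
                Some(Field::Struct { id: *id, name: name.clone(), children: kept })
            }
        }
    }
}

fn main() {
    // person { address { city, street }, name } from the example above.
    let person = Field::Struct {
        id: 1,
        name: "person".into(),
        children: vec![
            Field::Struct {
                id: 2,
                name: "address".into(),
                children: vec![
                    Field::Primitive { id: 3, name: "city".into() },
                    Field::Primitive { id: 4, name: "street".into() },
                ],
            },
            Field::Primitive { id: 5, name: "name".into() },
        ],
    };
    // Select person.address (id 2) and person.name (id 5): the person struct
    // is pruned, not flattened, so the nesting survives.
    let selected: HashSet<i32> = [2, 5].into_iter().collect();
    let projected = project(&person, &selected).expect("something was selected");
    if let Field::Struct { children, .. } = &projected {
        assert_eq!(children.len(), 2);
    }
    println!("{:?}", projected);
}
```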
Thanks @liurenjie1024. I read through the summary above. I think #245 has already done the first part. When the Arrow reader goes to read files, it uses these field ids to find the corresponding leaf column indices in the Parquet schema, and the leaf column indices are then used to construct the column projection for the read.
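The field-id-to-leaf-index step can be pictured with a toy depth-first walk (Parquet numbers leaf columns in depth-first order; these are assumed toy types, and in a real reader the resulting indices would feed something like the parquet crate's `ProjectionMask::leaves`):

```rust
// Toy schema tree: groups contain children, leaves carry an Iceberg field id.
#[derive(Debug)]
enum Node {
    Leaf { field_id: i32 },
    Group { children: Vec<Node> },
}

// Walk the tree depth-first, counting leaves; record the index of every leaf
// whose field id is in `selected`.
fn leaf_indices(node: &Node, selected: &[i32], next: &mut usize, out: &mut Vec<usize>) {
    match node {
        Node::Leaf { field_id } => {
            if selected.contains(field_id) {
                out.push(*next);
            }
            *next += 1;
        }
        Node::Group { children } => {
            for c in children {
                leaf_indices(c, selected, next, out);
            }
        }
    }
}

fn main() {
    // Mirrors the example file schema:
    // person { address { city, street }, name }, hometown { city }, age.
    let root = Node::Group { children: vec![
        Node::Group { children: vec![
            Node::Group { children: vec![
                Node::Leaf { field_id: 3 }, // person.address.city   -> leaf 0
                Node::Leaf { field_id: 4 }, // person.address.street -> leaf 1
            ]},
            Node::Leaf { field_id: 5 },     // person.name           -> leaf 2
        ]},
        Node::Group { children: vec![
            Node::Leaf { field_id: 7 },     // hometown.city         -> leaf 3
        ]},
        Node::Leaf { field_id: 9 },         // age                   -> leaf 4
    ]};

    let mut next = 0;
    let mut out = Vec::new();
    leaf_indices(&root, &[5, 9], &mut next, &mut out);
    assert_eq!(out, vec![2, 4]);
    println!("selected leaf indices: {:?}", out);
}
```

Here selecting field ids `[5, 9]` (`person.name` and `age`) yields leaf indices 2 and 4 in the example file schema.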
Currently #245 has not implemented these yet. I think we can finish the basic projection in #245, and I will continue working on the rest afterwards.
Yes, but with one extra requirement: reconstructing struct arrays. For example, when we select only some children of a struct, a naive read returns those leaves as flat top-level columns, but what we expect is the enclosing struct, pruned to the selected children, with the nesting preserved.
This sounds reasonable, but we need to add a verification that the selected fields are not nested fields and are primitive types, which covers most cases.
Yeah. I think the Parquet reader should return not the flattened schema but the pruned struct.
@liurenjie1024 Thanks for reviewing and merging #245.
I will work on this soon. Do we want to reuse this ticket?
Yes, given we have discussions in this ticket, I think reusing it would make keeping the context easier. |
We can read a Parquet file with `TableScan` as a stream of Arrow `RecordBatch`es now. However, it reads all columns without any column projection: `TableScanBuilder.select` is currently a no-op. It would be better to propagate the selected columns to `TableScan` so that the projection is applied to the scan operation.