Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add lookup tables to StructType #12

Merged
merged 10 commits into from
Jul 28, 2023
Merged

Conversation

JanKaul
Copy link
Collaborator

@JanKaul JanKaul commented Jul 25, 2023

This PR adds lookup tables for field ids and field names to the StructType. This speeds up the field access for structs with a lot of fields.

@JanKaul
Copy link
Collaborator Author

JanKaul commented Jul 25, 2023

cc @Fokko @nastra @ConeyLiu @liurenjie1024 @Xuanwo @ZENOTME could you have a look at this one. Thanks

@Fokko
Copy link
Contributor

Fokko commented Jul 25, 2023

Can we add a test with a map, list, and nested record?

This is the test in Python: https://github.com/apache/iceberg/blob/f967317de564ec4af28d962fa48a26844b71007f/python/tests/test_schema.py#L155-L178

This is a great fixture that we use in Python for these kind of tests: https://github.com/apache/iceberg/blob/master/python/tests/conftest.py#L129-L177

@JanKaul
Copy link
Collaborator Author

JanKaul commented Jul 25, 2023

How are nested structs typically handled? Are they flattened on creation? Does it make sense to flatten them in the constructor?

@Fokko
Copy link
Contributor

Fokko commented Jul 25, 2023

@JanKaul They are not flattened. This is why we use the visitor pattern, so we can easily traverse the structure.

schema {
  1: position: struct<10: lat, 11: lon>
}

Results in:

names = {
  1: "position",
  10: "position.lat",
  11: "position.lon"
}

Let me know if this answers your question.

@JanKaul
Copy link
Collaborator Author

JanKaul commented Jul 26, 2023

Ah okay, I think I had a different idea for the lookup tables. Generally you could think of different lookup tables, these could be:

  1. Lookup table to obtain the full field name (i.e. "position.lat") from a field id
  2. Lookup table to obtain the index in the storage vector from a field id (not nested)
  3. Lookup table to obtain the index in the storage vector from a field name (not nested)

I've currently implemented 2 & 3 with the goal to speed up the access of a structfield.

I didn't think of nested structs and that you need to obtain the full field name. That opens the question how best to implement this.

We could use two lookup tables (field id -> full field name, field name -> storage index) which means that an access by field id has to always perform a field name lookup first.

@Fokko
Copy link
Contributor

Fokko commented Jul 26, 2023

Thinking of it, I believe it is fine for now. The nested types are more applicable to a schema, rather than a StructType (they are similar, both have a fields object).

@liurenjie1024
Copy link
Collaborator

Thinking of it, I believe it is fine for now. The nested types are more applicable to a schema, rather than a StructType (they are similar, both have a fields object).

Agree that it's fine for now, to implement 1 we many need some advanced data structures such as prefix tree, anyway we can postpone the optimization later.

#[serde(rename = "struct", tag = "type")]
pub struct StructType {
/// Struct fields
fields: Vec<StructField>,
/// Lookup for index by field id
#[serde(skip_serializing)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to say this is why I want an in-memory represents 😢. Since we have already reached a consensus in the previous discussion, I will not initiate a new round here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to force this serialization/deserialization design. If we don't have a consensus on the design, I think we should get back to the discussion. It doesn't make sense to merge something that might get changed later on.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per our discussion, if we want to avoid such annotation, we could implement a serializer for StructType.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to avoid such annotation, we could implement a serializer for StructType.

Currently, it appears that there is no immediate need for this. Let's proceed and observe how the situation develops.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, it appears that there is no immediate need for this. Let's proceed and observe how the situation develops.

Totally agree, for simple cases like this, it doesn't make code more difficult to read, so we can go on with it.

src/spec/datatypes.rs Outdated Show resolved Hide resolved
src/spec/datatypes.rs Outdated Show resolved Hide resolved
@ZENOTME
Copy link
Contributor

ZENOTME commented Jul 27, 2023

Agree that it's fine for now, to implement 1 we many need some advanced data structures such as prefix tree, anyway we can postpone the optimization later.

+1. And I think maintaining lookup table of full field name from a field id will cause that renaming of a field will affect multiple place (cmiiw)🤔.

struct posistion {
    struct a{
        b
    }
}

struct posistion will maintain full name "posistion.a.b"

struct a will maintain full name "a.b"

if b be rename, we also need to rename full name in struct posistion and struct a.

Thinking of it, I believe it is fine for now. The nested types are more applicable to a schema, rather than a StructType (they are similar, both have a fields object).

But if the full name is only maintained in schema, I guess there is not this problem.

Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

src/spec/datatypes.rs Outdated Show resolved Hide resolved
Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

@ZENOTME ZENOTME left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@Fokko Fokko merged commit bb5ea69 into apache:main Jul 28, 2023
7 checks passed
@Fokko
Copy link
Contributor

Fokko commented Jul 28, 2023

Thanks @JanKaul for working on this, and @Xuanwo and @ZENOTME for the reviews! 👏🏻

@JanKaul JanKaul deleted the feat-struct-lookup branch August 2, 2023 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants