Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement Iceberg values #20

Merged
merged 42 commits into from
Aug 9, 2023
Merged

feat: Implement Iceberg values #20

merged 42 commits into from
Aug 9, 2023

Conversation

JanKaul
Copy link
Collaborator

@JanKaul JanKaul commented Aug 2, 2023

This pull request defines the representation of iceberg values. Additionally the serialization/deserialization is implemented.

Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the effort. I think deriving serialization/deserialization for primitive types is ok. But I'm not sure we should implement ser/de for composite types in this way. When ser/de composite types, we should associate more information(such as schema, type) with it to avoid keeping them in memory.

crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/error.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @JanKaul, I left some comments

crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
Value::String(_) => Type::Primitive(PrimitiveType::String),
Value::UUID(_) => Type::Primitive(PrimitiveType::Uuid),
Value::Decimal(dec) => Type::Primitive(PrimitiveType::Decimal {
precision: 38,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to make the precision configurable as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we then might need to store the precision in another field. I don't know how to calculate the precision from the scale and I don't see a way to store it in the rust decimal.

crates/iceberg/src/spec/values.rs Show resolved Hide resolved
crates/iceberg/src/spec/values.rs Outdated Show resolved Hide resolved
@liurenjie1024
Copy link
Collaborator

I would suggest two changes to this pr:

  1. Create a PrimitiveValue type so that we can be type safe when ser/de from bytes.
  2. Remove the ser/de implementation for composite types, I think we would need to have data types when implementing them, and we can do it later.

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

@liurenjie1024
Copy link
Collaborator

Thanks @JanKaul I think we are almost there, just need to update date/time with long/int.

@liurenjie1024 liurenjie1024 mentioned this pull request Aug 9, 2023
Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one question about the naming. A lot has happened since I last checked, and I'll leave it up to you to leave it as Literal or revert it to Value, it is up to you

PrimitiveLiteral::String(_) => Type::Primitive(PrimitiveType::String),
PrimitiveLiteral::UUID(_) => Type::Primitive(PrimitiveType::Uuid),
PrimitiveLiteral::Decimal(dec) => Type::Primitive(PrimitiveType::Decimal {
precision: 38,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't always 38. It actually depends on the number of bytes that are read. But we can break that out in a separate chore

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to store the precision in an extra field because I don't see a way to get the precision from the rust Decimal. I'm not sure if this can be somehow calculated depending on the mantissa and the scale.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how we do it in Python:

MAX_PRECISION = tuple(math.floor(math.log10(math.fabs(math.pow(2, 8 * pos - 1) - 1))) for pos in range(24))
REQUIRED_LENGTH = tuple(next(pos for pos in range(24) if p <= MAX_PRECISION[pos]) for p in range(40))


def decimal_required_bytes(precision: int) -> int:
    """Compute the number of bytes required to store a precision.

    Args:
        precision: The number of digits to store.

    Returns:
        The number of bytes required to store a decimal with a certain precision.
    """
    if precision <= 0 or precision >= 40:
        raise ValueError(f"Unsupported precision, outside of (0, 40]: {precision}")

    return REQUIRED_LENGTH[precision]

And how we do it in Java: https://github.com/apache/iceberg/blob/f5f543a54ff7460648bb864f4f06a29eb28938b9/api/src/main/java/org/apache/iceberg/types/TypeUtil.java#L679-L715

// under the License.

/*!
* Value in iceberg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of happened since I reviewed this PR the last time. I'm also okay with calling this a value instead of a literal.

fn avro_bytes_long() {
let bytes = vec![32u8, 0u8, 0u8, 0u8, 0u8, 0u8, 0u8, 0u8];

check_avro_bytes_serde(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we testing here? Keep in mind that Avro uses zig-zag encoding for integers.

Copy link
Collaborator Author

@JanKaul JanKaul Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically the upper and lower bounds are stored as avro "bytes" inside the manifest_entry. These bytes conform to the iceberg binary single value serialization. This is a test to check if bytes serialized into avro "bytes" can be deserialized into a literal value.

crates/iceberg/src/error.rs Outdated Show resolved Hide resolved
}
}

fn to_optional_literal(value: Result<Literal, Error>) -> Result<Option<Literal>, Error> {

This comment was marked as duplicate.

Copy link
Contributor

@ZENOTME ZENOTME left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM!

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JanKaul for working on this, and @liurenjie1024 @Xuanwo @ZENOTME for the reviews 🚀

@Fokko Fokko merged commit 4ee05b4 into apache:main Aug 9, 2023
7 checks passed
@JanKaul JanKaul deleted the values branch August 9, 2023 14:00
@ZENOTME ZENOTME mentioned this pull request Oct 12, 2023
let reader = apache_avro::Reader::new(&*encoded).unwrap();

for record in reader {
let result = apache_avro::from_value::<ByteBuf>(&record.unwrap()).unwrap();
Copy link
Contributor

@ZENOTME ZENOTME Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I can't figure it out what is the difference between result and bytes? 🤔 cc @JanKaul

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! The literal is supposed to be written to the Avro Writer instead of the bytes. Like so:

writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap();

I will create a PR to fix this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a PR to fix it.

@Xuanwo Xuanwo mentioned this pull request Nov 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants