feat: better ways to use deletion vectors #215

Open
wants to merge 18 commits into
base: main

Collaborator

@hntd187 hntd187 commented May 22, 2024

These are utilities meant to make deletion vector (DV) usage easier. You can:

  • Get a deletion vector (DV) or a selection vector (SV)
  • Get a DV or SV as bits or as bools, depending on what the caller needs
  • Get either a DV or an SV in batches, or as one full vector, as specified by the caller
  • Pass an array, slice, or Vec to use as the container
  • Implement the traits to provide your own container to operate on
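
For readers unfamiliar with the two representations, here is a minimal sketch of how they relate, assuming a DV is exposed as the indexes of deleted rows and an SV as one bool per row (`true` = keep). The function names are hypothetical and are not the API added by this PR.

```rust
/// Convert deleted-row indexes into a selection vector with one bool per row
/// (`true` = keep the row). Hypothetical helper, for illustration only.
pub fn selection_from_deleted(deleted_rows: &[u64], num_rows: usize) -> Vec<bool> {
    let mut selection = vec![true; num_rows];
    for &row in deleted_rows {
        // Indexes past the end of the file are ignored in this sketch.
        if row < num_rows as u64 {
            selection[row as usize] = false;
        }
    }
    selection
}

/// Inverse direction: recover the deleted-row indexes from a selection vector.
pub fn deleted_from_selection(selection: &[bool]) -> Vec<u64> {
    selection
        .iter()
        .enumerate()
        .filter(|&(_, &keep)| !keep)
        .map(|(i, _)| i as u64)
        .collect()
}
```

Row indexes are compact when few rows are deleted, while the bool form can be applied directly as a per-row filter mask.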

TODO:

  1. Docs
  2. More tests
  3. FFI integration for this. I'd prefer to get the Rust side right first, and then I can extend it for FFI usage.
  4. Correctness: is this what we want?

Collaborator

@nicklan nicklan left a comment

Took a first pass. Overall I think this is the right idea, but I'm not sold on the batch thing.

Collaborator

@scovich scovich left a comment

Flushing first pass. Seems like an interesting start, but I'm not sure I have a "big picture" view of the idea quite yet:
how selection vs. deletion vectors would work, how an engine could provide its own bit vector implementation, etc.

Collaborator

@scovich scovich left a comment

Lots of comments but not much substance in the end... Rust pointers are just messy no matter how you "slice" it :(

ffi/src/lib.rs Outdated
@@ -122,6 +126,12 @@ mod private {
len: usize,
}

#[repr(C)]
pub struct KernelRowIndexArray {
ptr: *mut u64,
Collaborator

is this accurate?

Suggested change
- ptr: *mut u64,
+ ptr: NonNull<u64>,

Collaborator

update: this mirrors the existing KernelBoolSlice, which uses a null pointer to represent the empty slice.

Collaborator

However, according to std::slice::from_raw_parts:

You can obtain a pointer that is usable as data for zero-length slices using NonNull::dangling().

... that way, we wouldn't need any null check in the as_ref() method below.
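
A minimal sketch of the point quoted from the `std::slice::from_raw_parts` docs: a dangling but non-null, aligned pointer is acceptable as the data pointer of a zero-length slice. The function name is hypothetical.

```rust
use std::ptr::NonNull;

/// Build an empty `&[u64]` from a dangling (non-null, well-aligned) pointer.
pub fn empty_from_dangling() -> &'static [u64] {
    let ptr: NonNull<u64> = NonNull::dangling();
    // SAFETY: a zero-length slice only needs a non-null, aligned pointer,
    // which `NonNull::dangling()` provides; no memory is ever read.
    unsafe { std::slice::from_raw_parts(ptr.as_ptr(), 0) }
}
```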

Collaborator

The docs for Vec don't specifically mention NonNull::dangling, but Vec::as_ptr does return

a dangling raw pointer valid for zero sized reads if the vector didn’t allocate.

The docs for Vec::from_raw_parts are not clear though:

ptr must have been allocated using the global allocator, such as via the alloc::alloc function. ... These requirements are always upheld by any ptr that has been allocated via Vec<T> [which presumably includes the pointer returned by as_ptr which we already trust for the non-empty case?]. Other allocation sources are allowed if the invariants are upheld.

Further, both Unique::dangling and NonNull::dangling state that they are

useful for initializing types which lazily allocate, like Vec::new does.

That said, it's probably safest of all to just use Vec::new().as_mut_ptr() as pointer for an empty index slice.

Collaborator

However, the null pointer approach is simpler if the engine might ever need to create one of these to pass into kernel? Actually, given how messy all the non-null/unique/etc constraints are, nullable pointers are probably the simplest even if engine doesn't need to create one.

Collaborator Author

Sorry, it's not clear to me what changes should be made. Should we switch both slice types over to a new way in a subsequent PR?

Collaborator

Yeah sorry that was a lot of rambling/exploring. Basically, we have two options for representing an empty slice:

  • Pointer is always non-null, length 0 and dangling pointer means empty
  • Null pointer means empty

Rust and its library classes don't have a really nice way to handle either case, so IMO we should choose the least-bad approach (= least code and least error prone) and use that consistently. Maybe the current code is already the correct choice.

Collaborator

Actually, the current code does both -- the slice -> vec code assumes nullable pointer, while the vec -> slice (impl From) uses a dangling pointer. We should pick one and stick with it.
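
One of the two conventions, the always-non-null one, could be sketched roughly as follows, with both conversion directions agreeing on it. `RowIndexSlice` is a hypothetical stand-in and not the kernel's actual FFI type; the sketch assumes the slice is always created from a `Vec<u64>`.

```rust
use std::ptr::NonNull;

/// Hypothetical stand-in for an FFI row-index slice (not the kernel's type).
#[repr(C)]
pub struct RowIndexSlice {
    ptr: NonNull<u64>,
    len: usize,
}

impl From<Vec<u64>> for RowIndexSlice {
    fn from(vec: Vec<u64>) -> Self {
        let len = vec.len();
        // `into_boxed_slice` drops excess capacity. An empty boxed slice holds
        // a dangling but non-null pointer, so `NonNull::new` succeeds here too.
        let raw: *mut [u64] = Box::into_raw(vec.into_boxed_slice());
        let ptr = NonNull::new(raw as *mut u64).expect("Box pointers are never null");
        RowIndexSlice { ptr, len }
    }
}

impl RowIndexSlice {
    /// # Safety
    /// `self` must have been created via `From<Vec<u64>>` and not yet consumed.
    pub unsafe fn as_ref(&self) -> &[u64] {
        // No null check needed: the pointer is non-null even when `len == 0`.
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.len) }
    }

    /// # Safety
    /// Same requirements as `as_ref`; must be called at most once per slice.
    pub unsafe fn into_vec(self) -> Vec<u64> {
        let raw = std::ptr::slice_from_raw_parts_mut(self.ptr.as_ptr(), self.len);
        // Rebuild the box that `from` leaked, then turn it back into a Vec.
        unsafe { Box::from_raw(raw) }.into_vec()
    }
}
```

The nullable-pointer convention would instead store a `*mut u64` and branch on `is_null()` in both directions; whichever is chosen, the two conversions need to agree.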

Collaborator Author

Any opinion on which you prefer @scovich ? I'd probably go with the always non-null length zero.

Collaborator Author

I updated both the bool and row index slices to use a non-null pointer with zero length as empty.

ffi/src/lib.rs Outdated
Comment on lines 214 to 215
let mut vec = value.row_indexes();
vec.shrink_to_fit();
Collaborator

Suggested change
- let mut vec = value.row_indexes();
- vec.shrink_to_fit();
+ let vec = value.row_indexes();

Vec::into_boxed_slice states:

Before doing the conversion, this method discards excess capacity like shrink_to_fit.

Collaborator

Aside: This feels like a very messy part of rust. There are so many ways to convert to and from pointers, and it's not clear that any of them is the "best" -- especially when bits don't fit together nicely. Like here -- into_raw returns a pointer that satisfies NonNull, but there's no "safe" way to actually create a NonNull from it.

Collaborator

@nicklan nicklan left a comment

Generally fine, but we need to augment the actual FFI APIs. Today we have selection_vector_from_dv (in ffi/src/scan.rs), which always returns a bool slice.

We should probably add a row_indexes_from_dv to mirror that.

Alternatively, we could have a single function that can return whatever the user asks for, but while that could be clean in Rust (return an enum), I think over FFI it's probably easier/cleaner to just have multiple functions, each with its own return type.
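
To make the trade-off concrete, here is a rough sketch of both shapes. Every name and signature below is hypothetical (including the caller-provided output buffers) and is not the kernel's actual FFI surface; the deletion vector is assumed to be available as a list of deleted row indexes.

```rust
/// Rust-only option: one entry point, result shape chosen by the caller.
pub enum DvOutput {
    Selection(Vec<bool>),
    RowIndexes(Vec<u64>),
}

/// Build a selection vector (`true` = keep) from deleted row indexes.
fn selection(deleted: &[u64], num_rows: usize) -> Vec<bool> {
    let mut keep = vec![true; num_rows];
    for &row in deleted {
        // Out-of-range indexes are ignored in this sketch.
        if row < num_rows as u64 {
            keep[row as usize] = false;
        }
    }
    keep
}

pub fn dv_output(deleted: &[u64], num_rows: usize, want_row_indexes: bool) -> DvOutput {
    if want_row_indexes {
        DvOutput::RowIndexes(deleted.to_vec())
    } else {
        DvOutput::Selection(selection(deleted, num_rows))
    }
}

/// FFI-style option: one `extern "C"` function per result type, so the
/// caller never has to unpack a tagged union.
///
/// # Safety
/// `deleted` must point to `n` readable `u64`s and `out` to `num_rows`
/// initialized, writable `bool`s; both must be non-null and aligned.
pub unsafe extern "C" fn fill_selection_vector(
    deleted: *const u64,
    n: usize,
    out: *mut bool,
    num_rows: usize,
) {
    let deleted = unsafe { std::slice::from_raw_parts(deleted, n) };
    let out = unsafe { std::slice::from_raw_parts_mut(out, num_rows) };
    out.copy_from_slice(&selection(deleted, num_rows));
}

/// Copies up to `capacity` row indexes into `out`; returns how many were written.
///
/// # Safety
/// Same pointer requirements as above, with `out` holding `capacity` `u64`s.
pub unsafe extern "C" fn copy_row_indexes(
    deleted: *const u64,
    n: usize,
    out: *mut u64,
    capacity: usize,
) -> usize {
    let deleted = unsafe { std::slice::from_raw_parts(deleted, n) };
    let out = unsafe { std::slice::from_raw_parts_mut(out, capacity) };
    let count = n.min(capacity);
    out[..count].copy_from_slice(&deleted[..count]);
    count
}
```

The enum version keeps a single call site in Rust, while the per-type functions each expose a signature that maps directly onto a C prototype.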

ffi/src/lib.rs Outdated
@@ -4,12 +4,14 @@
#[cfg(any(feature = "default-engine", feature = "sync-engine"))]
use std::collections::HashMap;
use std::default::Default;
use std::ffi::{c_long, c_longlong, c_ulonglong};
Collaborator

Suggested change
- use std::ffi::{c_long, c_longlong, c_ulonglong};

not used

ffi/src/lib.rs Outdated
use std::os::raw::{c_char, c_void};
use std::ptr::NonNull;
use std::sync::Arc;
use tracing::debug;
use url::Url;

use delta_kernel::actions::deletion_vector::DeletionVector;
Collaborator

Suggested change
- use delta_kernel::actions::deletion_vector::DeletionVector;

not used

@hntd187 hntd187 requested a review from nicklan June 21, 2024 21:19
Collaborator

@nicklan nicklan left a comment

mostly good! I think there's still some stuff Ryan pointed out that needs to be fixed, and I had one more thing.

@hntd187 hntd187 requested a review from nicklan June 27, 2024 12:57
Collaborator

@nicklan nicklan left a comment

thanks for iterating on this. just a couple of doc comments and we're good to go (i think)

@hntd187 hntd187 requested a review from scovich July 23, 2024 00:04
Collaborator

@zachschuermann zachschuermann left a comment

just a few nits - and can we update the PR description to accurately represent the changes?

@@ -153,10 +163,10 @@ mod private {
/// The slice must have been originally created `From<Vec<bool>>`, and must not have
/// already been consumed by a previous call to this method.
pub unsafe fn as_ref(&self) -> &[bool] {
- if self.ptr.is_null() {
+ if self.len == 0 {
Collaborator

safety nit: should we add a comment saying that the pointer is assumed valid when len > 0 and is invalid when len == 0?

@@ -190,13 +204,57 @@ mod private {
/// memory, but must only free it by calling [super::free_bool_slice]. Since the global
/// allocator is threadsafe, it doesn't matter which engine thread invokes that method.
unsafe impl Send for KernelBoolSlice {}
unsafe impl Send for KernelRowIndexArray {}
Collaborator

safety comment
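
As an illustration of the kind of safety comment being requested, here is a sketch on a hypothetical `RowIndexArray` (not the actual `KernelRowIndexArray`). The reasoning mirrors the existing comment for `KernelBoolSlice`: the value owns its allocation, and the global allocator is thread-safe, so any thread may free it.

```rust
/// Hypothetical FFI array that owns a heap allocation of row indexes.
#[repr(C)]
pub struct RowIndexArray {
    ptr: *mut u64,
    len: usize,
}

impl RowIndexArray {
    pub fn new(rows: Vec<u64>) -> Self {
        let len = rows.len();
        // Leak the boxed slice; ownership now lives in `ptr`/`len`.
        let ptr = Box::into_raw(rows.into_boxed_slice()) as *mut u64;
        RowIndexArray { ptr, len }
    }

    /// Reclaim ownership so the allocation is freed exactly once.
    pub fn into_vec(self) -> Vec<u64> {
        let raw = std::ptr::slice_from_raw_parts_mut(self.ptr, self.len);
        // SAFETY: `ptr` and `len` come from `Box::into_raw` in `new`, the
        // fields are private, and taking `self` by value means this runs at
        // most once per array.
        unsafe { Box::from_raw(raw) }.into_vec()
    }
}

// SAFETY: a `RowIndexArray` exclusively owns its allocation (no other copy of
// the pointer exists), and that memory came from the global allocator, which
// is thread-safe. It therefore does not matter which thread frees it.
unsafe impl Send for RowIndexArray {}
```

Without the `unsafe impl Send`, the raw pointer field makes the struct `!Send`, so it could not be moved into another thread. This sketch has no `Drop` impl, so an array that is never consumed leaks its allocation.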

@@ -382,4 +392,16 @@ mod tests {
expected[4294967300] = false;
assert_eq!(bools, expected);
}

#[test]
fn test_dv_wrapper() {
Collaborator

Suggested change
- fn test_dv_wrapper() {
+ fn test_dv_row_indexes() {

4 participants