feat: better ways to use deletion vectors #215
Took a first pass. Overall I think this is the right idea, but I'm not sold on the batch thing
Flushing first pass. Seems like an interesting start, but not sure I have a "big picture" view of the idea quite yet?
How selection vs. deletion vectors would work, how engine could provide its own bit vector implementation, etc.
Lots of comments but not much substance in the end... rust pointers are just messy no matter how you "slice" it :(
ffi/src/lib.rs
Outdated
@@ -122,6 +126,12 @@ mod private {
        len: usize,
    }

    #[repr(C)]
    pub struct KernelRowIndexArray {
        ptr: *mut u64,
is this accurate?
-        ptr: *mut u64,
+        ptr: NonNull<u64>,
update: this mirrors the existing KernelBoolSlice, which uses a null pointer to represent the empty slice.
However, according to std::slice::from_raw_parts:
You can obtain a pointer that is usable as data for zero-length slices using NonNull::dangling().
... that way, we wouldn't need any null check in the as_ref() method below.
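To make the quoted guarantee concrete, here is a minimal sketch (not code from this PR) of building an empty slice from a dangling pointer. The helper name `empty_u64_slice` is hypothetical and only exists for illustration.

```rust
use std::ptr::NonNull;
use std::slice;

/// Hypothetical helper: builds an empty `&[u64]` without any null check.
pub fn empty_u64_slice<'a>() -> &'a [u64] {
    // `NonNull::dangling()` is non-null and well-aligned for `u64`, which is
    // what `from_raw_parts` requires of the data pointer when the length is zero.
    let ptr: NonNull<u64> = NonNull::dangling();
    // SAFETY: the length is 0, so no memory is ever read through `ptr`.
    unsafe { slice::from_raw_parts(ptr.as_ptr(), 0) }
}
```

The point is only that a zero-length read through a dangling, aligned pointer is allowed, whereas a null pointer is not, even for length zero.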
The docs for Vec don't specifically mention NonNull::dangling, but Vec::as_ptr does return "a dangling raw pointer valid for zero sized reads if the vector didn't allocate."
The docs for Vec::from_raw_parts are not clear tho:
"ptr must have been allocated using the global allocator, such as via the alloc::alloc function. ... These requirements are always upheld by any ptr that has been allocated via Vec<T> [which presumably includes the pointer returned by as_ptr, which we already trust for the non-empty case?]. Other allocation sources are allowed if the invariants are upheld."
Further, both Unique::dangling and NonNull::dangling state that they are "useful for initializing types which lazily allocate, like Vec::new does."
That said, it's probably safest of all to just use Vec::new().as_mut_ptr() as the pointer for an empty index slice.
However, the null pointer approach is simpler if the engine might ever need to create one of these to pass into kernel? Actually, given how messy all the non-null/unique/etc constraints are, nullable pointers are probably the simplest even if engine doesn't need to create one.
Sorry, it's not clear to me what changes should be made. Should we switch both slice types over to a new way in a subsequent PR?
Yeah sorry that was a lot of rambling/exploring. Basically, we have two options for representing an empty slice:
- Pointer is always non-null, length 0 and dangling pointer means empty
- Null pointer means empty
Rust and its library classes don't have a really nice way to handle either case, so IMO we should choose the least-bad approach (= least code and least error prone) and use that consistently. Maybe the current code is already the correct choice.
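For comparison, the two options could be sketched roughly as follows. Both types (`NonNullIndexes`, `NullableIndexes`) are hypothetical stand-ins invented for this illustration, not the kernel's actual `KernelBoolSlice` or `KernelRowIndexArray`.

```rust
use std::ptr::NonNull;
use std::slice;

/// Option 1 (hypothetical type): the pointer is never null; `len == 0` means empty.
#[repr(C)]
pub struct NonNullIndexes {
    ptr: NonNull<u64>,
    len: usize,
}

impl NonNullIndexes {
    pub fn empty() -> Self {
        Self { ptr: NonNull::dangling(), len: 0 }
    }

    /// # Safety
    /// `ptr` must be valid for reading `len` values of `u64`.
    pub unsafe fn as_ref(&self) -> &[u64] {
        // No branch needed: a dangling pointer is acceptable when `len` is 0.
        unsafe { slice::from_raw_parts(self.ptr.as_ptr(), self.len) }
    }
}

/// Option 2 (hypothetical type): a null pointer means empty.
#[repr(C)]
pub struct NullableIndexes {
    ptr: *mut u64,
    len: usize,
}

impl NullableIndexes {
    pub fn empty() -> Self {
        Self { ptr: std::ptr::null_mut(), len: 0 }
    }

    /// # Safety
    /// If non-null, `ptr` must be valid for reading `len` values of `u64`.
    pub unsafe fn as_ref(&self) -> &[u64] {
        if self.ptr.is_null() {
            // `from_raw_parts` rejects null even for length 0, so special-case it.
            &[]
        } else {
            unsafe { slice::from_raw_parts(self.ptr, self.len) }
        }
    }
}
```

Option 1 keeps `as_ref` branch-free but makes it harder for a C caller to construct an empty value; option 2 is trivial to construct from C but needs the null check on every read.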
Actually, the current code does both -- the slice -> vec code assumes nullable pointer, while the vec -> slice (impl From) uses a dangling pointer. We should pick one and stick with it.
Any opinion on which you prefer @scovich ? I'd probably go with the always non-null length zero.
I updated both the bool and row index slices to use a non-null pointer with zero length as empty.
ffi/src/lib.rs
Outdated
        let mut vec = value.row_indexes();
        vec.shrink_to_fit();
-        let mut vec = value.row_indexes();
-        vec.shrink_to_fit();
+        let vec = value.row_indexes();
Vec::into_boxed_slice states:
Before doing the conversion, this method discards excess capacity like shrink_to_fit.
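A quick way to check that claim, as a sketch: round-trip a vector with spare capacity through a boxed slice and look at the capacity that comes back. The helper name `capacity_after_boxing` is made up for this example.

```rust
/// Hypothetical helper: converts a vector to a boxed slice and reports the
/// capacity that survives the round trip back to a `Vec`.
pub fn capacity_after_boxing(vec: Vec<u64>) -> usize {
    // No explicit `shrink_to_fit` here: `into_boxed_slice` drops excess capacity itself.
    let boxed: Box<[u64]> = vec.into_boxed_slice();
    // `into_vec` reuses the boxed allocation, so capacity now equals length.
    boxed.into_vec().capacity()
}
```

Calling it with a vector built via `Vec::with_capacity(16)` and holding three elements should report a capacity of 3, which is why the separate `shrink_to_fit` call looks redundant.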
Aside: This feels like a very messy part of rust. There are so many ways to convert to and from pointers, and it's not clear that any of them is the "best" -- especially when bits don't fit together nicely. Like here -- into_raw returns a pointer that satisfies NonNull, but there's no "safe" way to actually create a NonNull from it.
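One possible workaround, sketched here and not taken from this PR, is to skip `into_raw` and go through `Box::leak`, since `NonNull` can be built from a `&mut` reference without `unsafe`. The helper names `leak_indexes` and `reclaim_indexes` are hypothetical.

```rust
use std::ptr::NonNull;

/// Hypothetical helper: leaks a vector's storage and returns a non-null
/// pointer to the first element plus the element count.
pub fn leak_indexes(vec: Vec<u64>) -> (NonNull<u64>, usize) {
    let boxed: Box<[u64]> = vec.into_boxed_slice();
    let len = boxed.len();
    // `Box::leak` is safe and yields a `&'static mut [u64]`.
    let leaked: &'static mut [u64] = Box::leak(boxed);
    // `NonNull::from(&mut T)` is safe; `cast` turns the slice pointer into an element pointer.
    let ptr: NonNull<u64> = NonNull::from(leaked).cast();
    (ptr, len)
}

/// Hypothetical counterpart that takes ownership back so the memory is freed.
///
/// # Safety
/// `ptr` and `len` must come from exactly one earlier call to `leak_indexes`,
/// and must not be reclaimed twice.
pub unsafe fn reclaim_indexes(ptr: NonNull<u64>, len: usize) -> Vec<u64> {
    let raw: *mut [u64] = std::ptr::slice_from_raw_parts_mut(ptr.as_ptr(), len);
    unsafe { Box::from_raw(raw) }.into_vec()
}
```

Only the reclaim side needs `unsafe`; whether this reads better than `Box::into_raw` plus `NonNull::new` is a judgment call.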
generally fine, but we need to augment the actual ffi apis. So today we have selection_vector_from_dv (in ffi/src/scan.rs), which always returns a bool slice. We should probably add a row_indexes_from_dv to mirror that.
Alternately, we could have a single function that can return whatever the user asks for, but while that could be clean in rust (return an enum), I think over ffi it's probably easier/cleaner to just have multiple functions each with its own return type.
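To illustrate how the two return shapes differ, here is a plain-Rust sketch with no FFI wrappers. Both function names are hypothetical stand-ins, not the real `selection_vector_from_dv` or the proposed `row_indexes_from_dv`, and it assumes that `false` in the selection vector marks a deleted row.

```rust
/// Hypothetical stand-in for the row-index flavor: returns the deleted row
/// indexes themselves, so the size scales with the number of deletions.
pub fn row_indexes_sketch(deleted: &[u64]) -> Vec<u64> {
    deleted.to_vec()
}

/// Hypothetical stand-in for the selection-vector flavor: one bool per row,
/// `false` where the row is deleted (an assumption made for this sketch).
pub fn selection_vector_sketch(deleted: &[u64], num_rows: usize) -> Vec<bool> {
    let mut selected = vec![true; num_rows];
    for &idx in deleted {
        // Indexes past `num_rows` are ignored rather than growing the vector.
        if let Some(slot) = selected.get_mut(idx as usize) {
            *slot = false;
        }
    }
    selected
}
```

Because the element types differ (`u64` vs. `bool`), exposing each through its own FFI function with its own `#[repr(C)]` return struct avoids having to model a tagged union on the C side.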
ffi/src/lib.rs
Outdated
@@ -4,12 +4,14 @@
#[cfg(any(feature = "default-engine", feature = "sync-engine"))]
use std::collections::HashMap;
use std::default::Default;
use std::ffi::{c_long, c_longlong, c_ulonglong};
-use std::ffi::{c_long, c_longlong, c_ulonglong};
not used
ffi/src/lib.rs
Outdated
use std::os::raw::{c_char, c_void};
use std::ptr::NonNull;
use std::sync::Arc;
use tracing::debug;
use url::Url;

use delta_kernel::actions::deletion_vector::DeletionVector;
-use delta_kernel::actions::deletion_vector::DeletionVector;
not used
mostly good! I think there's still some stuff ryan pointed out that needs to be fixed, and I had one more thing.
thanks for iterating on this. just a couple of doc comments and we're good to go (i think)
just a few nits - and can we update the PR description to accurately represent the changes?
@@ -153,10 +163,10 @@ mod private {
         /// The slice must have been originally created `From<Vec<bool>>`, and must not have been
         /// already been consumed by a previous call to this method.
         pub unsafe fn as_ref(&self) -> &[bool] {
-            if self.ptr.is_null() {
+            if self.len == 0 {
safety nit: should we add a comment saying that the pointer is assumed valid when len > 0 and is invalid when len == 0?
@@ -190,13 +204,57 @@ mod private {
    /// memory, but must only free it by calling [super::free_bool_slice]. Since the global
    /// allocator is threadsafe, it doesn't matter which engine thread invokes that method.
    unsafe impl Send for KernelBoolSlice {}
    unsafe impl Send for KernelRowIndexArray {}
needs a safety comment
@@ -382,4 +392,16 @@ mod tests {
        expected[4294967300] = false;
        assert_eq!(bools, expected);
    }

    #[test]
    fn test_dv_wrapper() {
-    fn test_dv_wrapper() {
+    fn test_dv_row_indexes() {
This is meant to be some utils to help make DV usage easier. You can
TODO: