Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trait bounds for persistence #1229

Merged
merged 5 commits into from
Jan 18, 2024
Merged

Trait bounds for persistence #1229

merged 5 commits into from
Jan 18, 2024

Conversation

gz
Copy link
Collaborator

@gz gz commented Jan 5, 2024

DBData now has an additional constraint where we make sure the archived variant of something that implements DBData also satisfies this constraint:

<T as ArchivedDBData>::Repr: Ord + PartialOrd<T>

Ideally we would also have Clone for the Archived types, but I'm not sure it will get merged into rkyv so currently I removed the Clone bound again until this is resolved:

Some other things this changes:

  • gdelt used ArcStr for interned strings until now, but with persistence this wouldn't work anyways so I removed it from dbsp and gdelt
  • we can't store Rust tuples anymore directly since they don't satisfy the PartialOrd constraint and we can't implement it ourselves due to rust trait rules so we added Tup{2,3,4} types
  • we can't use isize for weights or usize as types that goes into an Ord* data-structure anymore (since the serialization is platform dependent, so replaced them all with u64/i64)
  • chrono by default enabled the size-32 feature of rkyv, the PR I sent to them let us enable the size-64 feature which means we can serialize/deserialize things correctly on x86-64

Is this a user-visible change (yes/no): no

@blp
Copy link
Member

blp commented Jan 5, 2024

This should allow me to make my storage format faster! Currently it does more deserialization than would be ideal.

I have not looked at this yet.

@blp
Copy link
Member

blp commented Jan 5, 2024

@gz Even before I start reading the changes, I can tell how much work this was from the commit message. You had to coordinate with multiple other maintainers and crates, and deal with (presumably) so many compiler trait errors. Thank you so much!

@blp
Copy link
Member

blp commented Jan 5, 2024

Just to make sure: ultimately, we will not be using a forked version of rkyv, right? (I assume that's just an intermediate step.)

@gz
Copy link
Collaborator Author

gz commented Jan 5, 2024

Thanks for the kind words <3.. Yea this patch was a bit of a nightmare for sure, glad it's now pretty much ready

ultimately, we will not be using a forked version of rkyv, right?

yes we should be able to get all patches in rkyv .. except for maybe the one that adds Clone to ArchivedString and ArchivedVec but if we need to I feel we can work around that

@blp
Copy link
Member

blp commented Jan 5, 2024

yes we should be able to get all patches in rkyv .. except for maybe the one that adds Clone to ArchivedString and ArchivedVec but if we need to I feel we can work around that

That's probably for the best, if needed, because otherwise we can't update dbsp on crates.io.

@gz
Copy link
Collaborator Author

gz commented Jan 16, 2024

@mihaibudiu

so here are the necessary changes for the compiler for my PR:
high level change is tha some arguments that were a tuple before now need to be called using dbsp::utils::Tup2(a, b) instead of (a, b)

these operators are affected:

group/lag.rs changed to

pub fn lag<OV, PF>(
        &self,
        offset: usize,
        project: PF,
    ) -> Stream<RootCircuit, OrdIndexedZSet<B::Key, Tup2<B::Val, OV>, B::R>>

pub fn lead<OV, PF>(
        &self,
        offset: usize,
        project: PF,
    ) -> Stream<RootCircuit, OrdIndexedZSet<B::Key, Tup2<B::Val, OV>, B::R>>

then in operator/index.rs

    pub fn index<K, V>(&self) -> Stream<C, OrdIndexedZSet<K, V, CI::R>>
    where
        K: DBData,
        V: DBData,
        CI: BatchReader<Key = Tup2<K, V>, Val = (), Time = ()>,

    pub fn index_generic<CO>(&self) -> Stream<C, CO>
    where
        CI: BatchReader<Key = Tup2<CO::Key, CO::Val>, Val = (), Time = (), R = CO::R>,
        CO: Batch<Time = ()>,

    pub fn index_with<K, V, F>(&self, index_func: F) -> Stream<C, OrdIndexedZSet<K, V, CI::R>>
    where
        CI: BatchReader<Time = (), Val = ()>,
        F: Fn(&CI::Key) -> Tup2<K, V> + Clone + 'static,
        K: DBData,
        V: DBData,

impl<CI, CO> UnaryOperator<CI, CO> for Index<CI, CO>
where
    CO: Batch<Time = ()>,
    CI: BatchReader<Key = Tup2<CO::Key, CO::Val>, Val = (), Time = (), R = CO::R>,
{

impl<CI, CO, F> UnaryOperator<CI, CO> for IndexWith<CI, CO, F>
where
    CO: Batch<Time = ()>,
    CI: BatchReader<Val = (), Time = (), R = CO::R>,
    F: Fn(&CI::Key) -> Tup2<CO::Key, CO::Val> + 'static,

operators/sample.rs:

    pub fn stream_sample_unique_key_vals(
        &self,
        sample_size: &Stream<RootCircuit, usize>,
    ) -> Stream<RootCircuit, OrdZSet<(B::Key, B::Val), B::R>>


impl<T> BinaryOperator<T, usize, OrdZSet<Tup2<T::Key, T::Val>, T::R>> for SampleUniqueKeyVals<T>

gz and others added 5 commits January 17, 2024 16:54
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Also generate dbsp tuples using the macro.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
@gz gz marked this pull request as ready for review January 18, 2024 02:12
@gz
Copy link
Collaborator Author

gz commented Jan 18, 2024

will merge this so it is out of my sight

@gz gz merged commit db75fa0 into main Jan 18, 2024
5 checks passed
@gz gz deleted the trait-boundsv2 branch January 18, 2024 02:12
@blp
Copy link
Member

blp commented Jan 18, 2024

Hurray! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants