-
-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
compare anyOf based on handmade diff score #32
Conversation
Cargo.toml
Outdated
@@ -25,6 +25,7 @@ schemars = { version = "0.8.12", default_features = false } | |||
serde = "1.0.158" | |||
serde_json = "1.0.94" | |||
thiserror = "1.0.40" | |||
hungarian = "1.1.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
istm hungarian is dead: https://github.com/nwtnni/hungarian
src/diff_walker.rs
Outdated
.diff_score(&mut rhs.clone().into_object()) | ||
} | ||
} | ||
impl DiffScore for SchemaObject { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually! why not invoke the main diff
function as a way to compute the score?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice idea
src/diff_walker.rs
Outdated
for (i, l) in lhs_any_of.iter_mut().enumerate() { | ||
for (j, r) in rhs_any_of.iter_mut().enumerate() { | ||
let mut walker = DiffWalker { | ||
changes: vec![], | ||
lhs_root: self.lhs_root.clone(), | ||
rhs_root: self.rhs_root.clone(), | ||
}; | ||
walker.diff( | ||
"", | ||
&mut l.clone().into_object(), | ||
&mut r.clone().into_object(), | ||
)?; | ||
mat[(i, j)] = i32::try_from(walker.changes.len()).expect("too many changes"); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more convenient for Diffwalker::diff
to return Vec<Change>
instead of mutating internal changes. By doing so we can reuse Diffwalker
and avoid cloning RootSchema
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is generally way too much cloning going on. I haven't wrapped my head around it yet, but I already have a problem with the fact that we cannot just pass an immutable reference when we recurse into schemas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not change it for now, I want to figure out how we can eliminate as many clones as possible. changing DiffWalker now might make it harder to remove clone() in a hotter codepath.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@6293 actually, I think we should change the design of diff walker in the following way:
- instead of a vector of changes, it is generic over the collection that holds the changes. this can be implemented by making
DiffWalker
take a callback for appending a change - at this point I think
DiffWalker
should have a propernew
constructor as well, because we instantiate it in two places - in the public API, we can pass
myvector.push
as callback. but here, we can increment an integer and discard the change. that's a lot less memory used - if there are too many changes, we clip them at
i32::MAX
rhs_any_of.sort_by_cached_key(|x| format!("{x:?}")); | ||
match (lhs_any_of.len(), rhs_any_of.len()) { | ||
(l, r) if l <= r => { | ||
lhs_any_of.append(&mut vec![Schema::Bool(false); r - l]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems you're trying to get the vectors to be the same length by padding the shorter one with false
.
if that is the case I think you can do this:
let max_len = cmp::max(lhs_any_of.len(), rhs_any_of.len());
lhs_any_of.resize(max_len, Schema::Bool(false));
rhs_any_of.resize(max_len, Schema::Bool(false));
src/diff_walker.rs
Outdated
for (i, l) in lhs_any_of.iter_mut().enumerate() { | ||
for (j, r) in rhs_any_of.iter_mut().enumerate() { | ||
let mut walker = DiffWalker { | ||
changes: vec![], | ||
lhs_root: self.lhs_root.clone(), | ||
rhs_root: self.rhs_root.clone(), | ||
}; | ||
walker.diff( | ||
"", | ||
&mut l.clone().into_object(), | ||
&mut r.clone().into_object(), | ||
)?; | ||
mat[(i, j)] = i32::try_from(walker.changes.len()).expect("too many changes"); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@6293 actually, I think we should change the design of diff walker in the following way:
- instead of a vector of changes, it is generic over the collection that holds the changes. this can be implemented by making
DiffWalker
take a callback for appending a change - at this point I think
DiffWalker
should have a propernew
constructor as well, because we instantiate it in two places - in the public API, we can pass
myvector.push
as callback. but here, we can increment an integer and discard the change. that's a lot less memory used - if there are too many changes, we clip them at
i32::MAX
src/diff_walker.rs
Outdated
impl DiffWalker { | ||
impl<'cb> DiffWalker<'cb> { | ||
pub fn new( | ||
cb: Box<dyn FnMut(Change) + 'cb>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we instead do:
struct DiffWalker<F: FnMut(Change)> {
pub cb: F
}
I don't think we need a nameable DiffWalker type
the benefit of this is that the compiler can inline cb
into the parent function, then it enables further optimization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we would get an reached the recursion limit while instantiating ...
error if we try to specify the type of callback as a generic parameter. This is because generic instantiation occurs infinitely in diff_any_of
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just have private DiffWalker::count
method for saving memory? Although it is not as efficient as "callback" approach, the memory for Vec<Change>
would be freed by the time DiffWalker::count
is dropped and I think that is fair enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't understand this comment at first because we are not instantiatign DiffWalker
recursively in DiffWalker::new
. But you are right, there is a problem with generics. The reason is apparently that the closure contains the parent type in its signature, so the expansion of the type name recurses infinitely.
I have never seen that compiler error before.
but we can still get rid of mandatory boxing, so I pushed a change that does that. Instead we can box only for the counter, and allow for inlining of the other callback
the super-optimized solution would define a new Counter
struct that holds an integer and implements FnMut
(or rather, a custom trait), but that is too annoying, even for me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can still get rid of mandatory boxing
that was beyond my imagination. thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just have private DiffWalker::count method for saving memory?
the problem is having a vector at all. if we have any vector and keep pushing stuff into it, it will use memory
the vector in its entirety is short-lived, so it is not like json-schema-diff will consume a lot of memory overall. but it will be slow because allocating, resizing, deallocating is slow.
relatively slow, overall it will still be super fast. arguably this is premature optimization, but we have a O(n^2)
loop here already to build the cross-product of all anyOf items between LHS and RHS, so I guess for some pathological input it could be a problem
Had to box the callback because we recursively instantiate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome
Closes #26