Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compare anyOf based on handmade diff score #32

Merged
merged 9 commits into from
May 17, 2023

Conversation

6293
Copy link
Contributor

@6293 6293 commented May 16, 2023

Closes #26

Cargo.toml Outdated
@@ -25,6 +25,7 @@ schemars = { version = "0.8.12", default_features = false }
serde = "1.0.158"
serde_json = "1.0.94"
thiserror = "1.0.40"
hungarian = "1.1.1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

istm hungarian is dead: https://github.com/nwtnni/hungarian

.diff_score(&mut rhs.clone().into_object())
}
}
impl DiffScore for SchemaObject {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually! why not invoke the main diff function as a way to compute the score?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice idea

@6293 6293 marked this pull request as ready for review May 17, 2023 04:33
@6293 6293 requested a review from a team as a code owner May 17, 2023 04:33
@6293 6293 requested a review from untitaker May 17, 2023 04:47
Comment on lines 41 to 55
for (i, l) in lhs_any_of.iter_mut().enumerate() {
for (j, r) in rhs_any_of.iter_mut().enumerate() {
let mut walker = DiffWalker {
changes: vec![],
lhs_root: self.lhs_root.clone(),
rhs_root: self.rhs_root.clone(),
};
walker.diff(
"",
&mut l.clone().into_object(),
&mut r.clone().into_object(),
)?;
mat[(i, j)] = i32::try_from(walker.changes.len()).expect("too many changes");
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more convenient for Diffwalker::diff to return Vec<Change> instead of mutating internal changes. By doing so we can reuse Diffwalker and avoid cloning RootSchema here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is generally way too much cloning going on. I haven't wrapped my head around it yet, but I already have a problem with the fact that we cannot just pass an immutable reference when we recurse into schemas

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not change it for now, I want to figure out how we can eliminate as many clones as possible. changing DiffWalker now might make it harder to remove clone() in a hotter codepath.

Copy link
Member

@untitaker untitaker May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@6293 actually, I think we should change the design of diff walker in the following way:

  1. instead of a vector of changes, it is generic over the collection that holds the changes. this can be implemented by making DiffWalker take a callback for appending a change
  2. at this point I think DiffWalker should have a proper new constructor as well, because we instantiate it in two places
  3. in the public API, we can pass myvector.push as callback. but here, we can increment an integer and discard the change. that's a lot less memory used
  4. if there are too many changes, we clip them at i32::MAX

rhs_any_of.sort_by_cached_key(|x| format!("{x:?}"));
match (lhs_any_of.len(), rhs_any_of.len()) {
(l, r) if l <= r => {
lhs_any_of.append(&mut vec![Schema::Bool(false); r - l]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems you're trying to get the vectors to be the same length by padding the shorter one with false.

if that is the case I think you can do this:

let max_len = cmp::max(lhs_any_of.len(), rhs_any_of.len());
lhs_any_of.resize(max_len, Schema::Bool(false));
rhs_any_of.resize(max_len, Schema::Bool(false));

Comment on lines 41 to 55
for (i, l) in lhs_any_of.iter_mut().enumerate() {
for (j, r) in rhs_any_of.iter_mut().enumerate() {
let mut walker = DiffWalker {
changes: vec![],
lhs_root: self.lhs_root.clone(),
rhs_root: self.rhs_root.clone(),
};
walker.diff(
"",
&mut l.clone().into_object(),
&mut r.clone().into_object(),
)?;
mat[(i, j)] = i32::try_from(walker.changes.len()).expect("too many changes");
}
}
Copy link
Member

@untitaker untitaker May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@6293 actually, I think we should change the design of diff walker in the following way:

  1. instead of a vector of changes, it is generic over the collection that holds the changes. this can be implemented by making DiffWalker take a callback for appending a change
  2. at this point I think DiffWalker should have a proper new constructor as well, because we instantiate it in two places
  3. in the public API, we can pass myvector.push as callback. but here, we can increment an integer and discard the change. that's a lot less memory used
  4. if there are too many changes, we clip them at i32::MAX

impl DiffWalker {
impl<'cb> DiffWalker<'cb> {
pub fn new(
cb: Box<dyn FnMut(Change) + 'cb>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we instead do:

struct DiffWalker<F: FnMut(Change)> {
    pub cb: F
}

I don't think we need a nameable DiffWalker type

the benefit of this is that the compiler can inline cb into the parent function, then it enables further optimization

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we would get an reached the recursion limit while instantiating ... error if we try to specify the type of callback as a generic parameter. This is because generic instantiation occurs infinitely in diff_any_of

Copy link
Contributor Author

@6293 6293 May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just have private DiffWalker::count method for saving memory? Although it is not as efficient as "callback" approach, the memory for Vec<Change> would be freed by the time DiffWalker::count is dropped and I think that is fair enough

Copy link
Member

@untitaker untitaker May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't understand this comment at first because we are not instantiatign DiffWalker recursively in DiffWalker::new. But you are right, there is a problem with generics. The reason is apparently that the closure contains the parent type in its signature, so the expansion of the type name recurses infinitely.

I have never seen that compiler error before.

but we can still get rid of mandatory boxing, so I pushed a change that does that. Instead we can box only for the counter, and allow for inlining of the other callback

the super-optimized solution would define a new Counter struct that holds an integer and implements FnMut (or rather, a custom trait), but that is too annoying, even for me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can still get rid of mandatory boxing

that was beyond my imagination. thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just have private DiffWalker::count method for saving memory?

the problem is having a vector at all. if we have any vector and keep pushing stuff into it, it will use memory

the vector in its entirety is short-lived, so it is not like json-schema-diff will consume a lot of memory overall. but it will be slow because allocating, resizing, deallocating is slow.

relatively slow, overall it will still be super fast. arguably this is premature optimization, but we have a O(n^2) loop here already to build the cross-product of all anyOf items between LHS and RHS, so I guess for some pathological input it could be a problem

@6293
Copy link
Contributor Author

6293 commented May 17, 2023

Had to box the callback because we recursively instantiate Diffwalker

Copy link
Member

@untitaker untitaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome

@untitaker untitaker merged commit 0276ef2 into getsentry:main May 17, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Certain changes to anyOf will change internal sorting and produce huge diffs
2 participants