Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syntactical Ordering of Object Keys #53

Merged
merged 16 commits into from
Mar 12, 2020
Merged

Syntactical Ordering of Object Keys #53

merged 16 commits into from
Mar 12, 2020

Conversation

fugu13
Copy link
Contributor

@fugu13 fugu13 commented Mar 3, 2020

This adds a flag that makes object key ordering based on JSonnet syntactical ordering, by using ordered maps internally and skipping the sort step when the flag is provided.

I'm part of a team reusing SJSonnet to support data transformation (https://datasonnet.com, and we're still very early days). Our main audience is highly technical business analysts who are experts in the data schemas and formats they're manipulating, but infrequent programmers. The reordering behavior of JSonnet is one of the most confusing things to them about JSonnet programs. Additionally, some of the code we work with taking JSonnet output is ordering-sensitive (unfortunately).

Beyond our use case, I've noticed that requests for an ordering option have popped up several times in general JSonnet discussions. That does not look likely to happen across all implementations, but I think having it available would likely still be of interest to many SJSonnet users.

While this PR goes beyond that, since ordering flows naturally once ordered maps are used, the only two well-defined ordering guarantees this PR attempts to provide when using the flag are:

  1. keys within a single syntactical JSonnet object are ordered lexically, and
  2. when two objects are combined, any ordering of keys from the first guarantee in the first object are honored in the result before any from the second object, and then all keys only in the second object are ordered in the way they are in the second object, after the keys in the first object

Since there are many places output can be generated, and in most/all of those cases consistent behavior is important (such as objects rendered during errors vs in output), we've made that a global flag on the evaluator.

We're happy to revise or do other further work as needed if it will help include this capability in SJSonnet.

Thanks for looking,
Russell

@lihaoyi-databricks
Copy link
Contributor

This looks reasonable, but could you run benchmarks (I think we have some in the test suite) to see if this has a significant impact on performance? Our Jsonnet compilation at work is moderately slow, so I want to make sure this doesn't negatively impact performance too much.

If using LinkedHashMap affects performance too much, we may need to dynamically swap between mutable.LinkedHashMap and mutable.Map depending on the value of preserveOrder

@lihaoyi-databricks
Copy link
Contributor

lihaoyi-databricks commented Mar 5, 2020

One more thing is I think we should have a slightly more thorough test suite, e.g. to verify the behavior when:

  • We override keys: does it preserve the original key or move it to the end of the key list?
  • hide keys: I think you can override keys with key:: value to hide a previously visible key; do we preserve the order of the remaining keys? What if we use key ::: value to re-introduce it?
  • using standard library methods:does std.mergePatch preserve order? How about std.objectFields and std.objectFieldsAll? std.manifestIni? etc.

There are probably more cases you can come up with, but we should err on the side of thoroughness when for this PR, since otherwise I'm sure this functionality will be accidentally broken in future

@fugu13
Copy link
Contributor Author

fugu13 commented Mar 5, 2020

Thanks for taking a look, I'll work on the performance tests and increased test coverage and hopefully have updates for you soon!

@fugu13
Copy link
Contributor Author

fugu13 commented Mar 9, 2020

I've now updated this PR with all the suggested tests and a number of additional tests, it went pretty smoothly other than a brief sidetracking when I accidentally left off a comma between objects and had them combine on me instead, which is syntax I'd forgotten about.

It did uncover one additional code change needed, since objectFields and objectFieldsAll did their own sorting of keys, which is done. In addition to adding tests for all methods that look plausible, I searched for places where sorting happens, and did not discover any more that this PR should change the behavior of.

I also discovered one (possible?) bug, around manifestXmlJsonml, where the example in the jsonnet docs errors in SJSonnet due to numerical attribute values. I've filed that as #56. I say possible because, while the example works on jsonnet.org, the function is pretty underspecified. I believe the JsonML spec does allow numbers in that particular place, though.

@fugu13
Copy link
Contributor Author

fugu13 commented Mar 9, 2020

I hope to have benchmarking results soon.

@fugu13
Copy link
Contributor Author

fugu13 commented Mar 10, 2020

@lihaoyi-databricks We've run the benchmarks, and the results suggest any speed differences due to the changes are very very small, so I'm hopeful we'll be able to merge with this approach.

Two of us ran the benchmarks, on different but basically identical computers. Once was before a small number of the changes due to tests were in, and the other time was after.

In both cases, we ran the version of SjsonnetTestMain at commit 2108b89, which is one commit behind master. The most recent version is unrunnable due to including several kubernetes configs that do not exist in the repository. We ran the tests with ./mill -i show sjsonnet[2.13.0].jvm.test.run. Both computers were 15” 2017 MacBook Pro, 2.9 GHz Quad-Core Intel Core i7, 16 GB 2133 MHz LPDDR3.

The two codebases were modusintegration:master (first run 6366b0a, next run 06da4df) and databricks:master (83cabbe)

Results are the number of times a set of tests can be run in 20 seconds. Higher numbers are better.

First runs

In the first set of runs we hadn’t yet established a formal procedure, but while no applications were quit, nothing was being done on the computer during the tests. No burn in run was done. Ten runs were recorded for each codebase.

The ten test results, in order for modusintegration:master (6366b0a)

2936
2811
2797
2829
2628
2590
2927
2740
2806
2795

The ten test results, in order, for databricks:master

2808
3009
2932
2969
2778
2679
3035
2845
2794
2865

The means are very close. The mean without ordering code is about 85 higher than the mean with. This about about 3/4 the standard deviation of each sample, and the standard deviation is low relative to the values, so not very large. Specifically, a one-sided t-test says there’s a slightly greater than 5% chance they’re both from the same distribution, so the null hypothesis should be rejected and we should assume there’s no difference.

Second runs

In the second set of runs, non-involved applications were quit, and nothing was done on the computer during test runs. For each codebase, the benchmark was run first once without recording the result, then ten runs that were recorded.

The ten run results, in order, for modusintegration:master (06da4df)

2880
3119
2791
2996
3039
3098
2962
3145
3057
3084

The ten run results, in order, for databricks:master

2990
3016
3027
3066
3022
3084
2997
2959
3039
2907

The means were virtually identical. In fact, the mean for with ordering is a very tiny bit higher. The standard deviation for each is much higher than the mean, though about twice as high for the with ordering runs. In both cases the standard deviation is modest relative to the mean. The t-test probability that they are from same the distribution is very high, so we should again assume there is no difference.

@lihaoyi-databricks
Copy link
Contributor

@fugu13 this looks good to me! Happy to merge as is.

Going forward I'll be leaning heavily on your test suite to make sure future changes do not violate any ordering-related properties that are important to you. Hopefully it is enough to catch regressions, otherwise we can add further tests when any issues turn up in future

@lihaoyi-databricks lihaoyi-databricks merged commit 12fe9ed into databricks:master Mar 12, 2020
@fugu13
Copy link
Contributor Author

fugu13 commented Mar 12, 2020

Thanks! And yeah, I feel pretty good about the test suite maintaining ordering everywhere that makes sense, but we're also going to be measured about what we tell people is definitely not going to change.

Would it be possible to have a released pushed to the maven repos soon with the updates?

By the way, thank you so much for creating SJSonnet, it's made integrating JSonnet with the java codebases common in the enterprise service bus world feasible. We've really enjoyed working with the SJSonnet codebase, too, it's worked well for our needs even though it wasn't written with our somewhat different data transformation scenario in mind.

@lihaoyi-databricks
Copy link
Contributor

tagged and uploaded sjsonnet 0.2.4 to maven central

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants