Generate YAML with comments indicating which line of code generated each item #105

lihaoyi-databricks · 2020-12-04T01:01:51Z

Generates YAML that looks like this:

builds: # kube-config/runbot/dev/runbot-app.jsonnet:159:5
- args: # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:147:29
  - name: sha # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:147:37
    required: true # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:147:54
  autorun: # kube-config/runbot/autorun.jsonnet.TEMPLATE:3:22
  - repeat: true # kube-config/runbot/autorun.jsonnet.TEMPLATE:12:13
    sequence: # kube-config/runbot/autorun.jsonnet.TEMPLATE:4:16
    - args: # kube-config/runbot/autorun.jsonnet.TEMPLATE:10:15
        sha: # kube-config/runbot/autorun.jsonnet.TEMPLATE:10:22
          value: origin/master # kube-config/runbot/autorun.jsonnet.TEMPLATE:10:29
      interval: 3600000 # kube-config/runbot/autorun.jsonnet.TEMPLATE:8:29
  cleanupCommands: # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:44
  - - bash # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:221:8
    - -c # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:221:16
    - docker ps -aq | xargs docker rm -f; true # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:221:22
  - - df # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:222:8
    - -h # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:222:14
  - - git # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:48
    - clean # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:55
    - -xdf # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:64
  - - git # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:74
    - reset # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:81
    - --hard # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:90
    - HEAD # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:100

The idea is to make it easier to figure out how the materialized values are constructed. Once the user can get back to the last line of jsonnet that generated a value, they can use the https://github.com/databricks/intellij-jsonnet IDE plugin to explore the jsonnet code further and see how the values are passed into that line.

How it works

We capture a Position value along with every runtime value sjsonnet.Val in our program. This is a thin wrapper around a file path and a character offset; we don't do any heavy computation until the end when we need to generate path+linenumber comments, and so it's reasonably lightweight and cheap as long as comment-generation isn't enabled (see perf numbers below). Every time a new value is computed, we store the current Position of the expression that computed it, and the positions of any upstream values that fed into the computation are discarded/ignored.
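A minimal sketch of that idea, for illustration only. The names `PositionSketch`, `Num`, `add` and `render` are hypothetical stand-ins and not sjsonnet's actual classes; the point is only that a value carries a cheap (file, offset) pair, that a computed value takes the position of the expression that produced it, and that the offset is turned into a line and column only when a comment is actually rendered.

```scala
object PositionSketch {
  // Cheap to create: just a file path and a character offset.
  final case class Position(file: String, offset: Int)

  // A runtime value that remembers where it was computed.
  final case class Num(pos: Position, value: Double)

  // The result records the position of the expression that computed it;
  // the positions of the two inputs are deliberately dropped.
  def add(exprPos: Position, l: Num, r: Num): Num =
    Num(exprPos, l.value + r.value)

  // The offset -> line:col conversion is deferred until rendering.
  // Assumes 0 <= pos.offset <= source.length.
  def render(pos: Position, source: String): String = {
    val before = source.substring(0, pos.offset)
    val line = before.count(_ == '\n') + 1
    val col = pos.offset - (before.lastIndexOf('\n') + 1) + 1
    s"${pos.file}:$line:$col"
  }
}
```

This naive `render` rescans the source text on every call, which is the kind of cost the caching and binary search described under Performance are meant to avoid.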

Although we could in theory capture the entire computation graph (DAG) leading up to each value, this forest of graphs is likely to be large and too verbose for general usage. The IDE effectively lets the user interactively explore a subset of the graph, which is probably good enough for now. Capturing and making sense of the entire computation forest can be explored in the future.

Only works with direct YAML output, since JSON doesn't allow comments, though in theory we could generate a look-aside-table/source-map-file with the same information and somehow get the IDE to hook them up.

As the upickle.core.Visitor interface doesn't support anything other than Ints as offsets to each callback method, we instead plumb the currentPos in a mutable variable which the Materializer sets using a callback and the PrettyYamlRenderer reads using a callback. Not pretty, but it works. While in theory we could encode the entire Position value using the offset: Int that upickle.core.Visitor receives, I suspect it would be even more inelegant than this shared mutable variable thing.
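The shape of that plumbing can be sketched roughly as follows. This is a simplified stand-in that does not use upickle at all: `Sink`, `PosCell`, `materialize` and `CommentingSink` are hypothetical names, and the real Materializer and PrettyYamlRenderer are considerably more involved.

```scala
object CurrentPosSketch {
  final case class Pos(file: String, offset: Int)

  // Visitor-like interface whose callbacks can only carry an Int index.
  trait Sink { def visitString(s: String, index: Int): Unit }

  // Shared mutable cell: written by the producer, read by the consumer.
  final class PosCell { var current: Pos = null }

  // Producer side: sets the cell just before each callback.
  def materialize(values: Seq[(String, Pos)], cell: PosCell, sink: Sink): Unit =
    values.foreach { case (s, p) =>
      cell.current = p
      sink.visitString(s, -1)
    }

  // Consumer side: reads the position through a callback instead of
  // trying to squeeze it into the Int index.
  final class CommentingSink(getCurrentPos: () => Pos) extends Sink {
    val out = new StringBuilder
    def visitString(s: String, index: Int): Unit = {
      out.append("- ").append(s)
      val p = getCurrentPos()
      if (p != null) out.append(" # ").append(p.file).append(':').append(p.offset)
      out.append('\n')
    }
  }
}
```

Wiring it up amounts to `new CommentingSink(() => cell.current)`, so the renderer never needs to know who owns the cell.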

I had to plumb offset: Int or pos: Position information into a bunch of places where we previously didn't need it (e.g. all of Std.scala). Where previously we only needed position information where we have a chance of throwing errors, now we need position information in all places where we construct new sjsonnet.Val objects.

Output comment generation is controlled by the --yaml-debug flag.

Comment Positioning

Due to the somewhat heterogeneous output style of PrettyYamlRenderer, which is copied from the PyYAML package, there is some subtlety in how the positioning of comments works. At a high level:

Most terminal values (bools, strings, nulls, numbers, empty lists, empty dictionaries) are only ever printed at the end of a line, so the source line comment is simply appended at the end of the line after the value, EXCEPT
For strings rendered block-style (|-, |+, |2, etc.) the only place we can put a comment is after the block delimiter, before the first line of the string, so we put the source line comment there.
Lists and dictionaries in general are not guaranteed to appear on a line of their own (e.g. you can have nested lists in one line - - - foo or nested dictionaries - foo: bar), so in general we do not include a source line comment for where collection values were constructed, even though we have the data available, EXCEPT
Lists and dictionaries that are themselves values in another dictionary do have a spot where we can place a comment: after the preceding key of the enclosing dictionary. This is because nested lists/dictionaries on one line such as - - foo: can only ever have a dictionary in the right-most position, and if the dictionary value is a collection (whether list or dict) it is always placed on the following lines. Thus in this case we put the source line comment of the following dictionary value at the end of the line of the preceding dictionary key.
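As a rough illustration of the first and last rules, here is a toy renderer for nested dictionaries only. Lists, block strings and empty collections are left out, and `Node`, `Scalar`, `Dict` and `render` are hypothetical names rather than anything in PrettyYamlRenderer.

```scala
object CommentPlacementSketch {
  sealed trait Node { def src: String }
  final case class Scalar(text: String, src: String) extends Node
  final case class Dict(entries: Seq[(String, Node)], src: String) extends Node

  // Scalars end the line, so their comment goes right after the value.
  // A dictionary-valued entry starts on the next line, so its comment
  // goes after the key of the enclosing dictionary instead.
  def render(d: Dict, indent: Int = 0): String = {
    val pad = " " * indent
    d.entries.map {
      case (k, Scalar(t, src)) => s"$pad$k: $t # $src\n"
      case (k, child: Dict)    => s"$pad$k: # ${child.src}\n" + render(child, indent + 2)
    }.mkString
  }
}
```

Note that the outermost dictionary gets no comment at all in this sketch, since there is no enclosing key line to attach it to.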

Performance

Performance on my laptop, measured as the number of iterations completed in 20s of generating an arbitrary set of config files:

Case	Run1	Run2	Run3	Avg	StdDev
Master Generating JSON	1157	1142	1140	1146	9
PR Generating JSON, positions disabled with env var	1139	1107	1119	1122	16
PR Generating JSON, positions enabled	1126	1093	1101	1106	17
PR Generating PrettyYAML, positions enabled	670	698	681	683	14
PR Generating PrettyYAML + Comments, positions enabled	263	294	292	283	17

In addition to the code in this PR, I tried disabling Position-management "globally" via an environment variable (made Position.apply return null, wrapped a lot of position-related code blocks in if guards) to see if it made a significant difference. It turns out that even with positions enabled, the performance degradation of "PR Generating JSON, positions enabled" compared to the "Master Generating JSON" benchmark is small enough (roughly 3%) that it's probably not worth the complexity of juggling environment variables.

"PR Generating PrettyYAML + Comments, positions enabled" is roughly 2.4x slower than the same run without source line comment generation (283 vs 683 iterations); this is about as much of a slowdown as I expected. To reach this, I had to cache calls to fastparse.internal.Util.lineNumberLookup, which builds the line-number tables for the Jsonnet files we load, and I re-implemented fastparse.IndexedParserInput.prettyIndex in sjsonnet.Util.prettyIndex using a binary search rather than a linear scan. Without these changes, it's about 300x slower on the arbitrary set of configs I was profiling above.
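The two optimizations can be sketched like this. It is an illustrative reconstruction, not the code in sjsonnet.Util: `LineIndexSketch` and its members are hypothetical names, the cache is keyed by a file name purely for simplicity, and the 1-based `line:col` output format is an assumption.

```scala
import scala.collection.mutable

object LineIndexSketch {
  // Offsets at which each line starts; always begins with 0.
  def lineStarts(text: String): Array[Int] =
    (0 +: text.indices.filter(text.charAt(_) == '\n').map(_ + 1)).toArray

  // Build the table once per file instead of once per lookup.
  private val cache = mutable.HashMap.empty[String, Array[Int]]
  def cachedLineStarts(file: String, text: String): Array[Int] =
    cache.getOrElseUpdate(file, lineStarts(text))

  // Binary search for the index of the last line start <= offset.
  // Assumes offset >= 0, so starts(0) == 0 always qualifies.
  def lineOf(starts: Array[Int], offset: Int): Int = {
    var lo = 0
    var hi = starts.length - 1
    while (lo < hi) {
      val mid = (lo + hi + 1) >>> 1 // round up so the loop always makes progress
      if (starts(mid) <= offset) lo = mid else hi = mid - 1
    }
    lo
  }

  // 1-based "line:col" for a 0-based character offset.
  def prettyIndex(starts: Array[Int], offset: Int): String = {
    val line = lineOf(starts, offset)
    s"${line + 1}:${offset - starts(line) + 1}"
  }
}
```

With the table cached, each lookup costs O(log n) in the number of lines rather than a scan over the file.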

Materializing the largest configuration file I could find on my laptop (not part of the benchmarks above) with a hot Sjsonnet JVM took ~3s with --yaml-debug vs ~2.7s with --yaml-out, and on a cold JVM it was ~19s vs ~17s. Either way it's probably fast enough for now, especially since it's intended mostly for manual usage when debugging things and not really intended to be part of automated workflows.

Merge-Patch

In order to preserve Positions, I rewrote the mergePatch std lib function to avoid round-tripping the sjsonnet.Val data structures through ujson.Values, and instead perform the merge-patch algorithm lazily, directly on the lazy sjsonnet.Val data structure. This allows us to do a best-effort preservation of Positions:

Terminal types (numbers, booleans, nulls, strings) are passed through unchanged with positions preserved
Arrays are also passed through unchanged (since they do not participate in the merge-patch algorithm) with positions preserved
Objects that do not get merged on the LHS are passed through unchanged with positions preserved
Objects that do not get merged on the RHS are passed through with nulls removed, but with positions preserved
Objects that do get merged are the only case where the position is over-written with the position of the std.mergePatch call
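The five rules above can be sketched on a toy value tree. This version is strict rather than lazy, uses a bare Int as the position, and the names `V`, `Lit`, `Null`, `Arr`, `Obj`, `stripNulls` and `mergePatch` are hypothetical; it only shows which position each kind of result ends up with.

```scala
object MergePatchSketch {
  sealed trait V { def pos: Int }
  final case class Lit(value: Any, pos: Int) extends V   // numbers, booleans, strings
  final case class Null(pos: Int) extends V
  final case class Arr(items: Seq[V], pos: Int) extends V
  final case class Obj(fields: Map[String, V], pos: Int) extends V

  // RHS objects that are not merged keep their position but lose their nulls.
  def stripNulls(v: V): V = v match {
    case Obj(fs, p) =>
      Obj(fs.collect { case (k, x) if !x.isInstanceOf[Null] => k -> stripNulls(x) }, p)
    case other => other
  }

  def mergePatch(target: V, patch: V, callPos: Int): V = (target, patch) match {
    case (Obj(tf, _), Obj(pf, _)) =>
      // LHS-only fields pass through untouched, positions included.
      val kept = tf.filter { case (k, _) => !pf.contains(k) }
      // A null in the patch deletes the field; anything else is merged or added.
      val patched = pf.collect {
        case (k, x) if !x.isInstanceOf[Null] =>
          k -> (tf.get(k) match {
            case Some(t) => mergePatch(t, x, callPos)
            case None    => stripNulls(x)
          })
      }
      // Only a genuinely merged object takes the position of the call site.
      Obj(kept ++ patched, callPos)
    case (_, p) =>
      // Terminals, arrays and unmerged RHS objects: the patch wins as-is.
      stripNulls(p)
  }
}
```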

There are probably other std.* methods that we could rewrite in order to better preserve Position information while they do their work, but empirically it seems mergePatch is the most commonly used one in our company's config codebase

Testing

Added some rudimentary unit tests in PrettyYamlRendererTests, covering both this new line-comment generation as well as the previously-untested PrettyYamlRenderer. Not very thorough, but at least it's something...

Added a small fuzz-test to compare the output of the two prettyIndex implementations on a range of inputs, to try and catch any deviations between them. All such tests pass on both Scala-JVM and Scala.js.
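A differential fuzz test of this kind might look roughly like the following. It is a sketch under assumptions, not the test added in this PR: a plain linear scan serves as the reference, the implementation under test is passed in as a function, and `PrettyIndexFuzzSketch` and its members are hypothetical names.

```scala
import scala.util.Random

object PrettyIndexFuzzSketch {
  // Reference implementation: linear scan for the last line start <= offset.
  def linearLineOf(starts: Array[Int], offset: Int): Int = {
    var i = 0
    while (i + 1 < starts.length && starts(i + 1) <= offset) i += 1
    i
  }

  // Random strictly increasing line-start table that begins at 0.
  def randomStarts(rnd: Random): Array[Int] =
    Iterator.iterate(0)(_ + 1 + rnd.nextInt(5)).take(1 + rnd.nextInt(20)).toArray

  // Returns the first (table, offset) pair on which the candidate disagrees
  // with the reference, or None if they agree on every generated input.
  def firstMismatch(candidate: (Array[Int], Int) => Int,
                    seed: Long,
                    rounds: Int): Option[(Seq[Int], Int)] = {
    val rnd = new Random(seed)
    Iterator.fill(rounds)(randomStarts(rnd)).flatMap { starts =>
      (0 to starts.last + 5).iterator.collect {
        case off if candidate(starts, off) != linearLineOf(starts, off) =>
          (starts.toSeq, off)
      }
    }.take(1).toList.headOption
  }
}
```

Seeding the generator keeps any failure reproducible, which matters when the same test has to run on more than one platform.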

Added a few test cases in StdMergePatchTests to cover cases where I had trouble with semantic differences during implementation. These are in addition to the existing upstream test cases vendored in sjsonnet/test/resources/test_suite/merge.jsonnet.

Cleaned up the test suite a bit to DRY up copy-paste code, extracting a few common helpers in TestUtils.scala

All existing unit tests pass. The correctness of the changes when --yaml-debug is not passed is also validated by running this on our work configuration codebase and verifying that no output YAMLs changed as a result of these refactors.

We don't actually have any fuzz tests to validate that adding the source-line comments doesn't change the parsed contents of the YAML output, but since it's mostly for manual debugging that's probably fine for now


iuliand-db

I left a couple of comments, but overall ship it!

iuliand-db · 2020-12-15T10:10:52Z

sjsonnet/src/sjsonnet/Util.scala

+
+
+object Util{
+  def binarySearch(lineStartsMin: Int, lineStartsMax: Int, lineStarts: Array[Int], index: Int): Int = {


I think Arrays.binarySearch would work as well.

The issue with Arrays.binarySearch is that it can only check for exact value == input, whereas this binary search needs to check for value > input. In theory it should be about the same logic, but I didn't find any convenient version built in :/

I think it returns the -insertion_point for a value that's not in the array, if I understand this correctly:

Returns:
index of the search key, if it is contained in the array within the specified range; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the array: the index of the first element in the range greater than the key, or toIndex if all elements in the range are less than the specified key. Note that this guarantees that the return value will be >= 0 if and only if the key is found.

Ah cool find, I'll make use of that then!

Turns out that the comment explaining how we use the result of java.util.Arrays.binarySearch is longer than the binary search implementation in the first place! OTOH it lets me delete 115 lines of test suite, so definitely still a win overall
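For reference, decoding the return value along the lines of the Javadoc quoted above could look like this. It is a sketch of the technique rather than the code that was merged; `ArraysSearchSketch` and `lineOf` are hypothetical names.

```scala
import java.util.Arrays

object ArraysSearchSketch {
  // Index of the last element <= key in a sorted array.
  // Assumes lineStarts(0) <= key, which holds when the table starts at 0
  // and key is a non-negative offset.
  def lineOf(lineStarts: Array[Int], key: Int): Int = {
    val r = Arrays.binarySearch(lineStarts, key)
    if (r >= 0) r // exact hit: key is itself a line start
    else -r - 2   // r == -(insertionPoint) - 1, and we want insertionPoint - 1
  }
}
```

The insertion point is the index of the first element greater than the key, so the element just before it is the line that contains the offset.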

iuliand-db · 2020-12-15T10:19:46Z

sjsonnet/src/sjsonnet/PrettyYamlRenderer.scala

+  def saveCurrentPos() = {
+    val current = getCurrentPosition()
+    if (current != null){
+      bufferedComment = " # " + current.currentFile.renderOffsetStr(current.offset, loadedFileContents)


This is in a hot code path (almost every line of output will have a comment), so maybe it would be better to avoid string concatenation. The logic where the buffer is appended to the StringWriter could do out.append(" # ") before appending the actual offset. Not sure if this would result in an observable speedup, but operations on Strings used to be a significant slowdown in the Scala compiler. Since this seems like an easy change, I'd be curious to know the outcome for very large files.

I'll try this and see if there's any measurable difference
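The suggested change, reduced to its essentials, is the difference between the two variants below. This is an illustration only: `AppendSketch`, `withConcat` and `withAppend` are hypothetical names, and whether the second variant is measurably faster is exactly the open question in this thread.

```scala
import java.io.StringWriter

object AppendSketch {
  // Variant A: builds an intermediate String for every comment.
  def withConcat(out: StringWriter, rendered: String): Unit =
    out.write(" # " + rendered)

  // Variant B: appends the two pieces directly, skipping the temporary String.
  def withAppend(out: StringWriter, rendered: String): Unit = {
    out.append(" # ").append(rendered)
    ()
  }
}
```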

lihaoyi-databricks · 2020-12-16T01:54:24Z

All tests pass, merging this

lihaoyi-databricks added 2 commits December 3, 2020 16:57

wip

018d91b

merge

70f4765

szeiger mentioned this pull request Dec 4, 2020

Proof of concept for OpenAPI schema validation #106

Open

lihaoyi-databricks added 13 commits December 11, 2020 16:46

fix a few equality checks

81aa257

wip

8fec0f7

refactor mergePatch

6593009

Convert mergePatch to work directly on Vals rather than `ujson.Value`s

7deedac

tweaks

16d2842

add ability to disable position comments via an env variable

56abf7d

fix mergePatch with nested objects, add some tests

7029372

cleanpu

fca3a2d

basic PrettyYamlRendererTests

b9b7795

add flag

76bf9ce

Merge branch 'master' into source-comments

a239d5b

tweak flag

22ce3b1

Add source line comment at the start of block strings, rather than at the end

357ba53

lihaoyi-databricks changed the title ~~POC generate YAML with comments indicating which line of code generated each item~~ Generate YAML with comments indicating which line of code generated each item Dec 14, 2020

getOrElseUpdate

52d5f62

lihaoyi-databricks requested a review from iuliand-db December 14, 2020 17:00

iuliand-db approved these changes Dec 15, 2020

View reviewed changes

lihaoyi-databricks added 2 commits December 15, 2020 17:43

merge master and fix tests

7937798

swap over to java.util.Arrays binary search

db57596

lihaoyi-databricks merged commit ee72a0b into master Dec 16, 2020
