Reversibility principle

Bob Nystrom edited this page Jun 15, 2023 · 1 revision

The formatter is deliberately designed to maintain a principle that's somewhat subtle but I think important. I haven't been able to find a good name for it, so I'll call it reversibility.

An example using undo

Let's say you have some code like this (step 1):

main() {
  someReallyVeryLongFunction(aReallyQuiteLongArgument, anotherShorterOne);
}

You decide to rename the second argument to something longer (step 2):

main() {
  someReallyVeryLongFunction(aReallyQuiteLongArgument, anotherActuallyLongerOne);
}

Then later you undo the change (step 3):

main() {
  someReallyVeryLongFunction(aReallyQuiteLongArgument, anotherShorterOne);
}

The code at step 3 is exactly the same as it was at step 1. Now let's say that you run the formatter between each of those steps. You start the same:

main() {
  someReallyVeryLongFunction(aReallyQuiteLongArgument, anotherShorterOne);
}

After renaming the argument, the argument list is longer and the formatter splits it:

main() {
  someReallyVeryLongFunction(
      aReallyQuiteLongArgument, anotherActuallyLongerOne);
}

When you rename the argument back, the argument list fits again, and the formatter removes the split:

main() {
  someReallyVeryLongFunction(aReallyQuiteLongArgument, anotherShorterOne);
}

As expected, you're right back where you started, even though during the process, the formatter did make formatting changes to your code.
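The round trip above can be sketched with a toy model in Python. This is not how the real formatter is implemented: `parse_call`, `format_call`, and `PAGE_WIDTH` are hypothetical names, the 80-column limit is an assumption, and the sketch only understands a single simple call statement passed in as a string.

```python
import re

PAGE_WIDTH = 80  # assumed line limit for this sketch


def parse_call(source: str) -> tuple[str, str]:
    """Parse a statement like `f(a, b);`, ignoring all incoming whitespace."""
    compact = re.sub(r"\s+", "", source)
    match = re.fullmatch(r"(\w+)\((.*)\);", compact)
    if match is None:
        raise ValueError(f"not a simple call statement: {source!r}")
    name, arg_text = match.groups()
    return name, ", ".join(arg_text.split(","))


def format_call(source: str, indent: int = 2) -> str:
    """Keep the call on one line if it fits, otherwise split after `(`."""
    name, args = parse_call(source)
    pad = " " * indent
    one_line = f"{pad}{name}({args});"
    if len(one_line) <= PAGE_WIDTH:
        return one_line
    return f"{pad}{name}(\n{pad}    {args});"


# Format after every step: original, rename, undo.
step1 = format_call(
    "someReallyVeryLongFunction(aReallyQuiteLongArgument, anotherShorterOne);"
)
step2 = format_call(step1.replace("anotherShorterOne", "anotherActuallyLongerOne"))
step3 = format_call(step2.replace("anotherActuallyLongerOne", "anotherShorterOne"))

print(step2)           # the argument list is split onto a second line
print(step3 == step1)  # True
```

Because the split is decided only from the name and arguments, undoing the rename also undoes the split.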

A counterexample

Now let's consider a formatter that has different formatting rules. In particular, it has two apparently reasonable rules:

  1. When an argument list doesn't fit on one line, split it so that each argument is on its own line, and add a trailing comma to the last one. This aligns with how multi-line argument lists in Flutter-like code are idiomatically formatted.

  2. If a user wants an argument list to split even if it would fit on a single line, they can hand-author a trailing comma to signal that to the formatter. This gives users some discretionary control when they think an argument list is less readable if packed onto one line.

Let's walk through the scenario again while running the formatter after each step:

main() {
  someReallyVeryLongFunction(aReallyQuiteLongArgument, anotherShorterOne);
}

At first, everything fits and there's no trailing comma, so neither rule comes into play. Then you make the second argument longer. The argument list no longer fits, so it gets wrapped like so:

main() {
  someReallyVeryLongFunction(
    aReallyQuiteLongArgument,
    anotherActuallyLongerOne,
  );
}

That's what you expect. But you decide to undo that rename and go back to your original code:

main() {
  someReallyVeryLongFunction(
    aReallyQuiteLongArgument,
    anotherShorterOne,
  );
}

Only, that isn't your original code. The trailing comma is still there, so the second rule keeps the argument list split. Even though you exactly reverted a change, the resulting code isn't back to what you started with. The intent of the second rule is that it only applies to hand-authored trailing commas, but after a file has been formatted once, there's no way to tell which trailing commas came from the programmer and which were inserted by the formatter itself.

The fact that you happened to run the formatter at some point in the file's history has left a mark on the structure of the code. To avoid that, you would have to be careful not to run the formatter at certain points in your development process if you don't want it to leave any stylistic remnants of the file's previous contents. Or you'd have to go back and carefully remove them based on your memory of which bits of code were authored versus inserted.
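The same scenario can be replayed against a toy model of the two trailing-comma rules. This is a hedged sketch in Python, not any real formatter: `format_with_comma_rules` and `PAGE_WIDTH` are hypothetical names, the 80-column limit is an assumption, and only a single simple call statement is handled.

```python
import re

PAGE_WIDTH = 80  # assumed line limit for this sketch


def format_with_comma_rules(source: str, indent: int = 2) -> str:
    """Toy formatter implementing the two trailing-comma rules."""
    compact = re.sub(r"\s+", "", source)
    match = re.fullmatch(r"(\w+)\((.*)\);", compact)
    if match is None:
        raise ValueError(f"not a simple call statement: {source!r}")
    name, arg_text = match.groups()

    # The formatter reads the trailing comma as input (rule 2)...
    has_trailing_comma = arg_text.endswith(",")
    args = [arg for arg in arg_text.split(",") if arg]

    pad = " " * indent
    one_line = f"{pad}{name}({', '.join(args)});"
    if not has_trailing_comma and len(one_line) <= PAGE_WIDTH:
        return one_line

    # ...and also writes one as output when it splits (rule 1).
    lines = [f"{pad}{name}("]
    lines += [f"{pad}  {arg}," for arg in args]
    lines.append(f"{pad});")
    return "\n".join(lines)


step1 = format_with_comma_rules(
    "someReallyVeryLongFunction(aReallyQuiteLongArgument, anotherShorterOne);"
)
step2 = format_with_comma_rules(
    step1.replace("anotherShorterOne", "anotherActuallyLongerOne")
)
step3 = format_with_comma_rules(
    step2.replace("anotherActuallyLongerOne", "anotherShorterOne")
)

print(step3)
print(step3 == step1)  # False: the inserted comma keeps the list split
```

The comma that rule 1 wrote at step 2 is indistinguishable from a hand-authored one when rule 2 reads it at step 3, which is why the undo doesn't restore the original layout.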

Reversible formatting rules

I don't want users to ever have to worry about when they can or can't run the formatter. In particular, many users have their editor set to format on every save, and also to auto-save periodically. That should be an entirely safe workflow.

To ensure that, the formatter is careful to only have rules that are "reversible". By that I mean that any formatting change the formatter introduces because the code has some property is a change that it will remove when the code no longer has that property. If it splits an argument list that's too long, it will unsplit it once it's no longer too long. The same goes for collection literals and other sequences of elements. If it allows you to not have a blank line between two statements, it will never insert one there itself.

Another way to state the principle is that the formatter never uses as formatting input any output that it itself produces. Once a file has been formatted once, there's no way to tell which bits were hand-written versus produced by the formatter, so the way this rule is implemented is that the only parts of the code that affect formatting are parts that the formatter never touches.

Mainly, this means non-whitespace code. The actual code in your file is what mainly determines how it is formatted: the kinds of expressions, numbers of arguments, lengths of identifiers, etc. The formatter never changes any of that. It only modifies whitespace. The latter is one reason why the formatter mostly ignores the whitespace of the original program: doing so helps ensure that formatting is reversible. (There are some places where incoming whitespace is preserved, like newlines between statements.)
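One way to picture these two properties is as checks you could run against a formatter. The sketch below is a toy in Python under stated assumptions: `tokens`, `format_call`, and `PAGE_WIDTH` are hypothetical names, the 80-column limit is assumed, and the "formatter" only handles one simple call statement. It does not model the places where the real formatter preserves incoming whitespace.

```python
import re

PAGE_WIDTH = 80  # assumed line limit for this sketch


def tokens(source: str) -> list[str]:
    """Split source into identifiers and punctuation, dropping whitespace."""
    return re.findall(r"\w+|[^\w\s]", source)


def format_call(source: str, indent: int = 2) -> str:
    """Toy formatter whose layout depends only on the non-whitespace code."""
    compact = re.sub(r"\s+", "", source)
    match = re.fullmatch(r"(\w+)\((.*)\);", compact)
    if match is None:
        raise ValueError(f"not a simple call statement: {source!r}")
    name, arg_text = match.groups()
    args = ", ".join(arg_text.split(","))
    pad = " " * indent
    one_line = f"{pad}{name}({args});"
    if len(one_line) <= PAGE_WIDTH:
        return one_line
    return f"{pad}{name}(\n{pad}    {args});"


tidy = "someReallyVeryLongFunction(aReallyQuiteLongArgument, anotherShorterOne);"
messy = (
    "someReallyVeryLongFunction(\n\n   aReallyQuiteLongArgument ,\n"
    " anotherShorterOne\n)  ;"
)

# The formatter only modifies whitespace: the token stream is unchanged.
assert tokens(format_call(messy)) == tokens(messy)

# Incoming whitespace has no influence on the result.
assert format_call(messy) == format_call(tidy)

# Formatting already formatted code is a no-op, so running it often is safe.
assert format_call(format_call(messy)) == format_call(messy)
```

If the output can only differ from the input in whitespace, and whitespace is never read as input, then nothing the formatter wrote can influence a later run.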

Constraints and trade-offs

I believe this principle is important and is worth upholding. Users take for granted that they can run the formatter whenever and as often as they like. They never have to go back and clean up any gunk or remnants left by the formatter from being run when their code was still in progress.

Inside Google, large-scale refactorings across thousands of files are quite common. The person performing them often has little context about the surrounding code and no time to hand-massage its formatting. The behavior of the formatter ensures that they can add or remove parameters or rename identifiers globally. When they do so, the resulting code will be formatted as if the code had always had that signature.

This principle does mean that some desirable formatting features are harder or infeasible:

  • As mentioned before, using a trailing comma as a signal to keep an argument list split means the formatter can't have the discretion to insert trailing commas itself.

  • Splitting string literals that don't fit the line length. In order to do this, the formatter would need to be able to unsplit adjacent strings that do fit in a single line. Joining isn't too hard. But if the string changed again such that it no longer fit, the formatter would have to split it. Deciding where to split a string literal that could contain text in potentially any language or any other format is very difficult to automate.

    Instead, we consider it the human author's responsibility to decide the best place to split strings. That way, in the places where that really is difficult, they have the ability to control it.

  • Likewise, wrapping and re-wrapping comments gets harder.
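The asymmetry in the string literal case can be illustrated with a small Python sketch. It works on plain string contents rather than Dart literals, and `join_adjacent` and `naive_split` are hypothetical helpers, not anything the formatter provides.

```python
def join_adjacent(parts: list[str]) -> str:
    """Joining adjacent pieces is mechanical and loses nothing."""
    return "".join(parts)


def naive_split(text: str, width: int) -> list[str]:
    """Split at a fixed width, with no idea what the text means."""
    return [text[i : i + width] for i in range(0, len(text), width)]


message = "Could not open the file."
parts = naive_split(message, 10)

print(parts)                            # ['Could not ', 'open the f', 'ile.']
print(join_adjacent(parts) == message)  # True
```

The join always recovers the original text, but the fixed-width split cuts "file." in the middle of a word. Picking a better split point requires knowing what the string contains, which is why that choice is left to the author.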