Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft OEP-33: Code Autoformatting #107

Closed
wants to merge 8 commits into from
127 changes: 127 additions & 0 deletions oeps/oep-0033-bp-code-autoformatting.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@
===========================
OEP-33: Code Autoformatting
===========================

+-----------------+--------------------------------------------------------+
| OEP | :doc:`OEP-33 </oeps/oep-0033-bp-code-autoformatting>` |
+-----------------+--------------------------------------------------------+
| Title | Code Autoformatting |
+-----------------+--------------------------------------------------------+
| Last Modified | 2019-03-04 |
+-----------------+--------------------------------------------------------+
| Authors | Calen Pennington <cale@edx.org> |
+-----------------+--------------------------------------------------------+
| Arbiter | Ned Batchelder <ned@edx.org> (Tentatively) |
+-----------------+--------------------------------------------------------+
| Status | Draft |
+-----------------+--------------------------------------------------------+
| Type | Best Practice |
+-----------------+--------------------------------------------------------+
| Created | 2019-03-04 |
+-----------------+--------------------------------------------------------+
| `Review Period` | 2019-03-04 - 2019-04-01 |
+-----------------+--------------------------------------------------------+

Context
-------

In a large software project or ecosystem like the Open edX codebase, consistency
of code formatting makes it easier to read the code. In an ecosystem with many
contributors, maintaining that consistency requires automated tools (such as
linters), consistent code review, or both. In general, minimizing the number
of review comments that a reviewer has to make on an incoming change makes
review easier, and lessens the number of back-and-forths required to merge
the code. As such, linters that can be run locally should be preferred.
Similarly, ambiguity in style guidelines should be minimized, so that there
are fewer places for commentary around what amounts to personal preference
(which might then instigate unnecessary debate).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


In short, more consistent, less ambiguous standards for code formatting
make for more productive code reviews and waste less time on inconsequential
details.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have there been recent incidents where inconsistent formatting has resulted in longer PR reviews? I have not found this to be a major issue in my own personal experience to-date working on our platform.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I run into pylint violations that are simply a question of spacing/indentation fairly frequently. I also often refrain from commenting on formatting that would make the code more consistent because it's not something that I want to have another review cycle on. Maybe it's just me that's bothered by inconsistent code formatting?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. I've also run into pylint violations, complaining of extra spaces that it doesn't allow. We could also just update our linter to ignore that violation if it's not necessary to enforce.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My view is that there is value in consistency and in the spacing guidelines that pylint suggests, from a readability perspective, and that there isn't much value in making engineers actually enforce it themselves, since tooling can do that job faster and more effectively.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nasthagiri, great question. Yes, formatting and CI in general has been a painpoint for me while contributing to Open edX.

Despite its great value, it still puts a lot of unrelated issues to the code that I write -- say I edited line no. 1, it would complain about line no. 100. This is especially true for home-grown linters such as the XSS linter.

I'm not saying that we should loosen and "de-regulate" Open edX quality, but at the same time please don't underestimate the effort that infrequent contributors have to go through.

This is especially true for the edx-platform repository.

I have learned my way through the process, but I find it hard to recruit other open source contributors whether from my company or from the community.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a separate OEP (#108) that I want to use to cut off the accidental flagging from the XSS Linter. In general, I'm all for making it so that the only things that actually get flagged on a PR are substantive issues in code that you actually wrote in the PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cpennington! That would be great. Afterwards, we'd only be left with the inevitable cost of dealing with the CI on such a monolithic piece of code -- the edx-platform. As a long time (sine 2014) but infrequent contributor, I do see the value of the CI, but also don't want to undermine the cost of it to me as a contributor.

Therefore, I think we should look at such initiatives a bit more critically that we're doing now -- which is mostly focusing on the pro's of adding yet another step.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the impact on time-to-feedback for edx-platform specifically is a good question.

On my machine (not devstack), w/ no black cache, it takes 10 minutes to format all of edx-platform. After reformatting everything, it takes 13s to reformat/check again.

After I reformat everything, and then delete the cache, a full format takes 5 minutes (and again, once cached, takes about 13s).

So, that suggests that a) we want to make sure jenkins keeps that cache around, and b) that devstacks are pre-built with the cache so that running black inside devstack to do a full-format (for someone w/o editor integration) is fast enough to not be a blocker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh... turns out that black needs python 3.6 to run, which we don't have installed on devstack... So, we'd need some more investment to make edx-platform work.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also often refrain from commenting on formatting that would make the code more consistent because it's not something that I want to have another review cycle on. Maybe it's just me that's bothered by inconsistent code formatting?

Internally at Stanford, I enforce this fairly rigidly, through sheer force of will :)
That doesn't always scale though...


Decision
--------

We will select automatic code formatters for all of our major languages. These
formatters will have both a lint-mode (that will be run in continuous
integration to ensure consistent formatting) and an autoformat mode (which
can be integrated with editors to free developers from having to think about
code formatting). These formatting standards will be applied across all
Open edX-authored repositories.

Selected tools should, ideally, require minimal configuration. Any required
configuration should be added to `edx/edx-lint`_ in order to allow it to
be centralized and standardized across repositories in the Open edX ecosystem.

The current recommended formatters are:

- Python: `black`_

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last I looked into it, Black was too opinionated and barely, if at all configurable.
I found it too rigid. This would seem to maximize the amount of the codebase to be re-written,
which I'd like to avoid, per reasoning elsewhere in thread..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, opinionated is actually the goal, here. Whatever formatter is chosen should format any given code construct identically no matter how it's originally input. I think the pain of the rewrite is going to be there no matter how big or small it is (there's gonna be one big PR against the repo either way).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, opinionated is actually the goal, here.

My hangup here isn't being opinionated;
it's being "poorly" opinionated.

I can rerun my test cases again if you'd like the specific violations,
but as mentioned, when I tried this on our code, black mandated changes that
I considered "wrong".

There where specific style patterns that we chose intentionally to make our fork more manageable, but we were no longer able to use these patterns under black
That's why I want flexibility/configuration; so we can forgo any overbearing rules.
Otherwise, we're getting hit on both the front and back ends;

  • the upfront pain of the initial translation
  • the ongoing cost of not being able to write code as we see fit for our fork

I'm not suggesting flexibility for the sake of endless and ongoing style debates
or a needlessly iterative process, but I don't want to be backed into a corner.
I want options if needed.

In RFC terms, think
MAY reconfigure SOME options
not
WILL reconfigure MOST/ALL options

Whatever formatter is chosen should format any given code construct identically no matter how it's originally input.

Agreed.

I think the pain of the rewrite is going to be there no matter how big or small it is

It's really a matter of perspective...
For edx, yeah, it's going to a pain so a single shot makes it easier.
Downstream, consider the following:

  • most side repos will be fairly painless for us;
    platform is my major concern re:transition
  • So if you enable it for new directories, there's no pain downstream (or upstream, for that matter)

Now when it comes to running it against existing platform code, consider:

  • single shot: for you
    • run the script
    • commit the code
    • move on
  • single shot: for us
    • merge the code
    • spend weeks debugging/resolving conflicts because every code file in the platform has been modified
    • hope we got it all
    • run the script against our code now
    • commit the code
    • move on... to significant regression testing because we didn't just run an autoformating tool, we ran an autoformating tool, reworked virtually all of our custom integrations in every code file, then ran an autoformating tool again. It'll be more than a chore.

versus

  • staged roll out: for you
    • run the script against a subset of additional directories
    • commit the code
    • repeat each named release until complete
  • staged roll out: for us
    • merge the code
    • spend a much smaller amount of time resolving format-conflicts in the handful of newly-formatted subdirs
    • feel more confident that we got everything in the now smaller surface area
    • defer running the script against our code now, so that we can deal w/ it out-of-band of this upgrade
    • spend significantly less time debugging and regression testing since the whole platform didn't change out from us over night, which allows us to get the named release live and not fall further behind, allowing us to
    • repeat each named release until complete

I concede, single shot is easiest for you, it's one run of an automated tool.
Just consider how much more it ends up being for anyone with non-trivial platform modifications.
While a staged roll out would yield fewer commits, I'm not sure that's a selling point for us.
One commit Auto format isn't a practical difference for us versus multiple Auto format 1/N.

And a staged roll out let's us absorb those hits over time, ideally over the course of named releases. When there have been sweeping changes to the edx-platform codebase in the past,
having to "eat it" all at once basically drops our integration velocity down to zero for an extended period of time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My hangup here isn't being opinionated;
it's being "poorly" opinionated.

Yeah, that's fair enough. Like I said, I hadn't run into places where I objected w/ Black's formatting, so it would be helpful to me (while making a recommendation in this OEP) to see your specific counterexamples.

Just consider how much more it ends up being for anyone with non-trivial platform modifications.
While a staged roll out would yield fewer commits, I'm not sure that's a selling point for us.
One commit Auto format isn't a practical difference for us versus multiple Auto format 1/N.

And a staged roll out let's us absorb those hits over time, ideally over the course of named releases. When there have been sweeping changes to the edx-platform codebase in the past,
having to "eat it" all at once basically drops our integration velocity down to zero for an extended period of time.

I'm definitely willing to defer to your experience on this. It shouldn't be too much trouble to set up our integration tooling in a way that we can mark specific directories as having been blackened, and slowly expand that coverage. It's a bit cumbersome w/ editor auto-integration, but perhaps worth it. What pace of conversions is something that you think you and other integrators could manage? You mentioned being over the course of multiple open-edx releases, with puts the timeframe in months or years, which is a bit unfortunate from my POV.

- Javascript: `prettier`_
- CSS/SASS: `prettier`_
- HTML: `prettier`_
- JSON: `prettier`_
- PHP: `prettier-php`_

.. _black: https://github.com/ambv/black
.. _prettier: https://prettier.io/
.. _prettier-php: https://github.com/prettier/plugin-php
.. _`edx/edx-lint`: https://github.com/edx/edx-lint

Formatting should exposed via the standard ``Makefile`` commands
``format`` (which should autoformat all code in the repository)
and ``check-format`` (which should validate that all code in the
repository is formatted). ``check-format`` should also be called
as part of ``make quality``.

When bulk-formatting the entire repo for the first time, all changes should
be done in as few commits as possible (ideally, only one). Then, the
commit hashes should be added to a ``.git-blame-ignore-revs`` file.
This will allow the `git hyper-blame`_ tool to ignore them by default,
to preserve the easy ability to see who changed particular file lines.

.. _git hyper-blame: https://commondatastorage.googleapis.com/chrome-infra-docs/flat/depot_tools/docs/html/git-hyper-blame.html

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this seems interesting, I don't think it helps for many use cases.
Does Github suport it?
I know that my text editor integration (vim-fugitive) explicitly doesn't support it now [1].
I'm worried the impact here would be small.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Github doesn't support it. I'm willing to include it if people want it (or even if it would be marginally useful), because it isn't a big lift to add (but I agree, it's also doesn't solve a lot of use-cases).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess I'm not saying don't mention it, as much as I wanted to be clear that it shouldn't be interpreted as end-all solution.


Consequences
------------

There will be commits to all Open edX repositories that cause large-scale
reformatting without functional changes. This will muddy the blame-file
history. Users with auto-formatters configured in their editors working
in a repo that is not yet OEP-33 compliant will need to disable their
autoformatter, or risk large-scale changes that are unrelated to their
current tasks. All Open edX code will be formatted consistently and
unambiguously. Developers will not have arguments about
code layout.

References
----------

`gofmt`_ was the first code-autoformatter to gain wide-scale acceptance.

.. _gofmt: https://blog.golang.org/go-fmt-your-code

`Black editor integration`_
`Prettier editor integration`_

.. _Black editor integration: https://github.com/ambv/black#editor-integration
.. _Prettier editor integration: https://prettier.io/docs/en/editors.html

An example ``tox.ini``:

.. code-block::

[testenv:format]
basepython =
python3.6
deps=black
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From an OEP-18 perspective, this should use a requirements file generated by pip-compile instead (so we get version consistency and regular updates). I realize this is just an example, but I don't want to encourage people to copy this pattern verbatim.

commands=
black .

[testenv:check-format]
basepython =
python3.6
deps=black
commands=
black . --check