Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persistent removal of trailing whitespace for clean diffs #505

Open
sadielbartholomew opened this issue Jan 25, 2024 · 13 comments
Open

Persistent removal of trailing whitespace for clean diffs #505

sadielbartholomew opened this issue Jan 25, 2024 · 13 comments
Labels
GitHub Usage Improvement to how we use GitHub for this repository

Comments

@sadielbartholomew
Copy link
Member

As per my review comment in #401 (review), it has been agreed to re-remove trailing whitespace (and any other non-visible characters and features) from the document (AsciiDoc format) files. I will do this later today.

That said, I've raised this before in #393 where the blitzing of all whitespce was previously completed, but it can easily get added to the document pages as PRs go in. For that reason, I propose we have a means to automate or at least regulate the removal of any trailing whitespace that gets added. This could be part of further linting automation, or standalone, though removal of trailing whitespace on all files is quite trivial in terms of applying operations to do it, so would be particularly easy to set up, maybe as a first step.

I would suggest the following, on these lines. Either we:

  • ensure no trailing whitespace is added for any PR that gets merged after I do the next removal, either by reviewer insistence or by means of a tool such as pre-commit hooks to ensure all commits obey the linting standards we configure;
  • or (my preferred option since it will PR submitters won't need to concern themselves at all about any trailing whitespace) we have some automated process to do the removal at regular periods. This could be configured as a CI job by GitHub Actions which we already use, to check for it and if it exists, submit a PR to remove it.

I am happy to set something above towards the above.

If folk, esp. on the Info. Management Committee, could let me know if they agree with the proposal, that would be appreciated. In the meantime, I'll do a PR to remove the whitespace introduced after more recent PRs including #401.

@erget
Copy link
Member

erget commented Jan 25, 2024

Hi Sadie,

Thanks for taking the lead here - if we were all devs I'd say let's use a pre-commit hook, but considering the diversity of technical setups that our contributors have, my recommendation would be to setup an automation that remotes trailing whitespace on top of PRs. Thinking out loud here, not sure about feasibility, what if at every PR there's a check that looks for trailing whitespace and if it's found adds a commit on top of that that kills the trailing whitespace?

In fact with such an action, a manual blitz would no longer be necessary...

What could possibly go wrong?
image

@sadielbartholomew
Copy link
Member Author

Thanks @erget, I agree that pre-commit hooks are probably too much of a barrier to employ and that some automation solution is best.

what if at every PR there's a check that looks for trailing whitespace and if it's found adds a commit on top of that that kills the trailing whitespace

Sounds good to me. I will look into options and see if I can get a job running that implements this, once I've waited a few working days to see if anyone has any issue with the idea in general.

@ChrisBarker-NOAA
Copy link
Contributor

I"m no expert in git -- but I'm pretty sure that tehre is a way to set up a pre-commit hook that does this beofre the commit gets pushed.

what if at every PR there's a check that looks for trailing whitespace and if it's found adds a commit on top of that that kills the trailing whitespace

This is less than ideal, because you get an extra "wasted" commit -- a somewhat ugly history -- as an example "git blame" would show that every line that had whitespace removed has been last edited by the system, rather than whoever actually made a change.

Some quick googling did not reveal an answer, but this is a VERY common issue, there has got to be a standard solution out there!

@larsbarring
Copy link
Contributor

I agree that pre-commit probably will be a barrier. Is the auto-blitz intended to happen when opening a PR, and every time time the PR is updated? Then I agree agree with @ChrisBarker-NOAA that this will complicate the history, and that there ought to be a smart solution already out there. Could the auto-bliz happen when the PR is merged?

@ethanrd
Copy link
Member

ethanrd commented Jan 25, 2024

I expect most PRs contain more than one commit. So I don't think we should worry too much about an extra commit (or two) on a PR messing up the commit history.

There are lots of style guide checkers for code (PEP8, Spotless, etc.) but I haven't found any for Asciidoc. I've work on other projects where the workflow involves a test on PRs that fail if the style isn't followed. A committer must then push another commit with the fix so the test passes before the PR can be merged. Not sure if that is better or easier than an automated process but it is a bit more visible (which might help with awareness among contributors of the style guide).

@sadielbartholomew
Copy link
Member Author

sadielbartholomew commented Jan 29, 2024

Thanks all for your thoughts and ideas. And it is good to hear some background to this issue e.g. from @ethanrd. It's probably wise as Ethan suggests to consider style linting as a whole whilst we think about this, because if we implement a solution here we would want it to be extensible to any further style fixes we might want to use too, eventually at least. And I think trailing whitespace removal is a good thing to start with, since it's non-controversial, easy to run operations to remove, and clear to review for manually to check our solution is doing what we want.

As everyone's indicated, there's no perfect solution, but plenty of (fairly-)standard approaches to choose from. It seems we have collectively thought of the following possibilities, which I've numbered to help us discuss, and moreover are agreed (please do correct me if anyone feels this is wrong or mis-representative) that we don't want the first so I've crossed it out:

  1. pre-commit hooks: would work but are too much of a barrier to those who contribute, not all of who are (very) familiar with the command-line or git etc., so avoid;
  2. a regular PR/commit dedicated to style fixes, ideally by a CI workflow and/or cron or similar running a regular job;
  3. linting checks on (all) PRs which warn of style issues such as trailing whitespace could be a solution where either:
    1. the PR author needs to add extra commits to fix any issues before it is allowed to be merged;
    2. a CI job runs after every PR or set of PRs merge to check for any style issues and automate the fixing of any that have been introduced by the PR.

Maybe if everyone could outline their favourite (or unfavoured) option(s), or ask any questions about them, then we can determine the best way forward? If it helps, based on past experience I have I think it would be simple enough for me to set up any of these solutions, so ease of implementation isn't a concern (to me, at least).

@erget
Copy link
Member

erget commented Jan 29, 2024

Thanks @sadielbartholomew , excellently summarised :)

I'm happy with both 2 and 3, but I could see a certain charm in traceability of implementing 2 and agreeing to do this before minting a new release. That means that it would take place en bloc before the release goes out. Of course, I don't mint the releases so that's easy for me to say; @davidhassell and others may have a different perspective.

@JonathanGregory JonathanGregory added the GitHub Usage Improvement to how we use GitHub for this repository label Jan 29, 2024
@ChrisBarker-NOAA
Copy link
Contributor

(2) has the disadvantage of creating gaps in the history -- any lines that were fixed by the job will show as hvng been edited by the CI, not whoever actually made the change (and the data will be wring, too, though if it's run regularly, not by much). On the other hand, maybe history isn't important for this kind of project. If there's something of concern with a part of the doc, does it matter exactly who last touched it or when they did?

I think (3 i. ) is not great -- similar to commit hooks, it puts a burden on the author of the PR -- and depending on their skill set could be a barrier to entry.

I'm not sure I fully understand (3 ii.) -- where / when would the fixing occur? Depending on the answer to that, it could introduce similar history issue.

So I suggest (4) (which might be 3 ii, if I've misunderstood:

(4) Summary: all changed, including auto-linting, are squashed as a single commit when merged into main

Some of this is (hopefully) policy already, but I'm putting it here, as it's important to the process.

  • main is never updated directly -- the only way to update main is via a merge from another branch.
  • In all non-main branches, a CI job is run that fixes style issue on every push: those branches will always be "clean".
  • When a branch is merged, commits are "squashed", so appear as a single commit -- so the history shows the whole thing as a single, correct, change, including style fixing.

This could be a bit complicated by PRs from a third-party repo -- I'd have to think about that more, as asking folks to first merge into a temporary branch,and then we merge into main could be a bit challenging.

@larsbarring
Copy link
Contributor

I am nothing like an expert in git or github, more the opposite. Maybe that is why I do not fully see the same disadvantage alternative 2 as @ChrisBarker-NOAA does:
Wouldn't it possible to have an action that kicks in immediately after the files from the PR has been merged into the main? That is, it will be clear who authored the changes and when they where merged from a separate branch. Immediately after the merge the action kicks in and does any cleaning of those specific files in the main branch. I.e. the action is the only one that do changes directly in the main branch, and they happens seconds/minutes after the merge.
--- Or am I envisioning the impossible here? There seems to to be some disussion on this subject over at Stack Overflow, but now I am definitively out of my depth.

@ChrisBarker-NOAA
Copy link
Contributor

ChrisBarker-NOAA commented Jan 30, 2024

Wouldn't it possible to have an action that kicks in immediately after the files from the PR has been merged into the main?

yes, I'm pretty sure that's possible.

That is, it will be clear who authored the changes and when they where merged from a separate branch.

That's the problem -- for example:

  • Person A makes some changes, with some extra whitespace, etc.
  • They make a PR, and it gets merged in to main with commits "squashed"
  • The git log shows something like: (taken from today)
$ git log
commit 44e7c0784007ebf0f21f67e3e891a746f30edda5 (HEAD -> main, origin/main, origin/HEAD)
Author: Daniel Lee <daniel.lee@eumetsat.int>
Date:   Wed Jan 24 17:04:11 2024 +0100

    Update line breaks (and minor corrections) (#401)
    
    * Update line breaks (and minor corrections)
    
    * Homogenise monotype and bolding

And git blame shows:
(small excerpt)

44e7c078 (Daniel Lee         2024-01-24 17:04:11 +0100  93) The format for the `formula_terms` attribute is 
940afbc0 (Steve Jones        2022-09-14 16:26:49 +0200  94) 
44e7c078 (Daniel Lee         2024-01-24 17:04:11 +0100  95) ----
44e7c078 (Daniel Lee         2024-01-24 17:04:11 +0100  96) formula_terms = "a: var1 b: var2 ps: var3 p0: var4" 
^eca16f4 (Richard Hattersley 2015-06-04 11:42:26 +0100  97) ----
^eca16f4 (Richard Hattersley 2015-06-04 11:42:26 +0100  98) 

All good. We know that lines 95 and 96 were touched by Daniel on 2024-01-24.

Now the bot runs, and corrects whitespace, etc.

Now git log will look like:

$ git log
commit bf0f21f67e3e891a74644e7c0784007ef30edda5 (HEAD -> main, origin/main, origin/HEAD)
Author: cf-autoformat-bot 
Date:   Wed Jan 24 17:06:00 2024 +0100

    Auto-formatting run

not too bad, you can always look at previous commits.

But now git blame looks like this:

44e7c078 (cf-autoformat-bot    2024-01-24 17:06:00 +0100  93) The format for the `formula_terms` attribute is 
940afbc0 (Steve Jones        2022-09-14 16:26:49 +0200  94) 
44e7c078 (cf-autoformat-bot    2024-01-24 17:06:00 +0100  95) ----
44e7c078 (cf-autoformat-bot    2024-01-24 17:06:00 +0100  96) formula_terms = "a: var1 b: var2 ps: var3 p0: var4" 
^eca16f4 (Richard Hattersley 2015-06-04 11:42:26 +0100  97) ----
^eca16f4 (Richard Hattersley 2015-06-04 11:42:26 +0100  98) 

So we know when those lines were changed, but not by who.

Anyway, not fatal -- in some cases with real code, you really want to know who made the changes, but for this kind of thing, maybe we don't care (and it is there in the history if you really want to dig for it) -- but that's my point.

If we can find a way to get the autoformat changes to be squashed with the other changes as one merge, then it'll be a clearer history. maybe that's not important, though.

@erget
Copy link
Member

erget commented Jan 30, 2024

TL;DR: I'd keep it simple for the merging procedures because that takes place a lot more often than forensics do, and I don't think we incur huge debts by having a bit of a dirty history. Note that we have a lot of commits that have very undescriptive messages as they are.

I can unpack that a bit:

Yes, in essence you can still find out who really authored each line by going back to the commit previous to the one made by the autoformatter and then you'd still get the line-by-line allocated to (probably) humans.

In the end though I think the squashing or rebasing or whatever is probably down to the release procedures, because there you have similar concerns. One could e.g. rebase such that the autoformatter changes are always squashed into the previous commit. That'd be how I'd do it, personally. It really depends on how we do the branching and merging - currently after agreement essentially the work branch is merged into master which will eventually be the next tag. If we wanted great traceability and great formatting, options I see are to

  • have a community that abides by the formatting rules (I have shown that I'm not personally trustworthy here because in the absence of an IDE I used the GitHub editor, which introduces trailing white space - mea culpa!) or
  • have a pre-commit hook that forces people to abide - not in favour because it could be confusing to many contributors or
  • run a job before merging to main that would squash the commit. That's a bit more cumbersome for the merger, and given that the Conventions Committee are not all git experts, that could hamper their efforts, or
  • involve the IM team to finalise merges - then they squash and merge after e.g. the committee has merged to a next branch or something, which means that we have an overall higher complexity in our branching approach but it doesn't really hurt the Committee's work

In my view all of those are feasible but come at costs that outweigh the benefits. In the end, in the event that one really wants line-by-line forensics, it's gonna involve a bit of digging anyway, but that digging doesn't take place often and people who are expert in git can be asked for advise at that point.

@ethanrd
Copy link
Member

ethanrd commented Jan 30, 2024

Hi all - I favor option 3.1 (test and fix by hand-ish [1]) because it increases the visibility of the style guide rather than hiding it behind an automatic process. I think awareness of the style guide/rules is worth a bit of extra effort. Also, that effort doesn't need to be provided by the original committer. The committer can allow CF repo maintainers to push to the PR branch.

[1] By hand or by a CI job that runs after the failed test and adds a commit to the PR.

@ethanrd
Copy link
Member

ethanrd commented Jan 30, 2024

After reading @ChrisBarker-NOAA's "squash" comment, I took a closer look and notice that we have a combination of PRs being merged and PRs being squashed onto the 'main' branch. While I think a clean commit history is important to work towards, I think we might want to develop some process around how to decide on what makes a clean commit for a given PR. Encouraging PR committer(s) to decide on and rebase to get a clean (minimal but informative) commit history would be my go to preference. But that adds yet another barrier. Though, again, I think it could be handled by CF repo maintainers.

This should probably be a separate discussion. I'll try and start an issue on PR merge process and clean commit history later today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GitHub Usage Improvement to how we use GitHub for this repository
Projects
None yet
Development

No branches or pull requests

6 participants