Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More robust collection of file paths #596

Conversation

sanmai-NL
Copy link
Contributor

@sanmai-NL sanmai-NL commented Dec 15, 2023

Closes #564.

⚡ Summary

Lefthook is currently not completely ready for repositories with filenames that contain commas.

  1. Add --file as the successor of --files. This option can be repeated for a list of file paths. --files is dangerous, e.g., one can create a file README.md,.. and if it gets supplied to a fixer, who knows what happens.
  2. Add --files-from-stdin. This option reads file paths from stdin, separated by a null character. This allows for file names with a comma or newline, and is interoperable with Git and Unix tools (e.g., git ls-files -z, git diff --name-only -z, find -print0).

@mrexox Can you test and add automated tests? I specifically advise to test interaction with existing stdin-related features.

☑️ Checklist

  • Check locally
  • Add tests
  • Add documentation

Using a null char as separator, which allows files with e.g. commas or
end-of-line characters in their names.
@mrexox
Copy link
Member

mrexox commented Dec 18, 2023

Lefthook is currently not completely ready for repositories with filenames that contain commas

Could you please share an example repo with such files? Why there's a need to have commas in file names? I am just curious, because I've never met this before.

cmd/run.go Outdated Show resolved Hide resolved
Interactive bool `mapstructure:"interactive" yaml:",omitempty" json:"interactive,omitempty" toml:"interactive,omitempty"`
UseStdin bool `mapstructure:"use_stdin" yaml:",omitempty" json:"use_stdin,omitempty" toml:"use_stdin,omitempty"`
StageFixed bool `mapstructure:"stage_fixed" yaml:"stage_fixed,omitempty" json:"stage_fixed,omitempty" toml:"stage_fixed,omitempty"`
FilesFromStdin bool `mapstructure:"files_from_stdin" yaml:",omitempty" json:"files_from_stdin,omitempty" toml:"files_from_stdin,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't imagine a case when you need this option implicitly, because as I know git does not pass files to STDIN. So, I assume the only use case is when you explicitly run something like echo file1 file2 | lefthook run do-something. Do I miss something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With implicitly you mean, from configuration?

In the use case of Git hooks, then true. But Lefthook can be invoked in multiple ways for multiple use cases. For example, an organization may prohibit acceptance of paths via --files (for example, because it does have filenames with comma's in it, like data science CSV batches produced outside of their control) by enforcing this in the team-wide Lefthook config via files_from_stdin.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With implicitly I mean from the git hook. If you specify lefthook run do-something --files-from-stdin, then you probably expect that {files} or {staged_files} are replaced from the data parsed from STDIN. But there's no need to have a special template for this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for that case, indeed. Please note my previous reply though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An organization may prohibit acceptance of paths via --files (for example, because it does have filenames with comma's in it, like data science CSV batches produced outside of their control) by enforcing this in the team-wide Lefthook config via files_from_stdin

I am sorry, I can't agree on this use case because it doesn't align with the main purpose of lefthook – git hooks management. Adding {files_from_stdin} template doesn't fit into my vision of what lefthook should support because it adds a very special flow not related to git hooks at all.

I am ok with this change: --files-from-stdin option forces lefthook to get the file list from STDIN and replace {files} or {staged_files} with the files from STDIN.

Copy link
Contributor Author

@sanmai-NL sanmai-NL Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean to add a files_from_stdin template, but a configuration option. Checking that source code again, it does look it's only adding a configuration option.

But coming to think of your interpretation: yes, it may be pretty useful or required to provide the list of files to commands through stdin as well. Not via a run template, I agree. But that should be supported though (either with or without roundtrip through Lefthook scanning stdin).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrexox Can you check again whether I'm truly adding a template {files_from_stdin} and if so, please point out which source code sections do this, so that I can remove them while keeping a configuration option files_from_stdin.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was talking about the template, but I meant the configuration option.

Copy link
Contributor Author

@sanmai-NL sanmai-NL Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An organization may prohibit acceptance of paths via --files (for example, because it does have filenames with comma's in it, like data science CSV batches produced outside of their control) by enforcing this in the team-wide Lefthook config via files_from_stdin

I am sorry, I can't agree on this use case because it doesn't align with the main purpose of lefthook – git hooks management. Adding {files_from_stdin} template doesn't fit into my vision of what lefthook should support because it adds a very special flow not related to git hooks at all.

Then the above reflects where we stood, in our discussion.

Data science teams do record data files in Git (e.g., with Git LFS), and then these issues with commas pop up. An organization should be able to enforce files_from_stdin and the only thing that can be made constant is an immutable config (file).

This official guidelines document warns against using commas in filenames, which means it does happen in some datasets and software should not break (indeed, a linter should be possible that detects this, this is where Lefthook would help).

Anecdotes:

internal/lefthook/run/prepare_command.go Outdated Show resolved Hide resolved
@@ -124,6 +125,7 @@ Run 'lefthook install' manually.`,
SkipSettings: logSettings,
DisableTTY: cfg.NoTTY || args.NoTTY,
AllFiles: args.AllFiles,
FilesFromStdin: args.FilesFromStdin,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's no need to pass this argument. You can parse the files here and pass them via Files option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the functionality isn't behind a CLI option, it will break cases where users supply other information to standard input (e.g., using yes).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't get it. If the functionality isn't behind a CLI option, then lefthook is called from within a git hook. In this case files from STDIN don't make sense, since git doesn't pass files via STDIN.

And if lefthook is called explicitly, I'd like to make sure that it is called correctly. If the override is required – let's rely on the CLI options.

I want to distinguish the implicit calls that must make sense and explicit calls that tweak lefthook behavior, for instance change the files passed to the commands. That's my vision and I believe it aligns with KISS principle.

Copy link
Contributor Author

@sanmai-NL sanmai-NL Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't get it. If the functionality isn't behind a CLI option, then lefthook is called from within a git hook. In this case files from STDIN don't make sense, since git doesn't pass files via STDIN.

First, let's discuss Git hook behavior. According to its docs, Git hooks do get information through standard input, see e.g.: https://git-scm.com/docs/githooks#_pre_push

And if lefthook is called explicitly, I'd like to make sure that it is called correctly. If the override is required – let's rely on the CLI options.

That was exactly my intention, but previously you wrote:

I think there's no need to pass this argument.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's no need to pass this argument.

It was related to Go code. I meant it's better to read from STDIN in internal/lefthook/run.go and pass the results as Files option. It's Ok to have --files-from-stdin CLI option that replaces files in lefthook commands.

Git hooks do get information through standard input

Yes, but it is related to Git while getting files from STDIN is not about git.

Copy link
Contributor Author

@sanmai-NL sanmai-NL Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry but perhaps there's a bit of a language barrier. Can you explain in particular your last comment?

The following I think are the facts:

  • Git passes paths via stdin to a hook script, such as Lefthook, in some cases.
  • Some Lefthook commands (i.e., Pylint) can read file paths via stdin.
  • With or without passing of paths to Lefthook via stdin, Lefthook commands can execute command lines that may output paths via stdout.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Git passes paths via stdin to a hook script, such as Lefthook, in some cases.

Git passes data with a special format: https://git-scm.com/docs/githooks#_pre_push. This is not files list separated with newlines. So, an option files_from_stdin can't be safely used for a git hook. In this case this option is only valid for the explicit call of lefthook. This means we can keep just a CLI argument --files-from-stdin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me show how I see the configuration:

# lefthook.yml

custom-hook:
  commands:
    lint:
      files: git diff --name-only main # custom files
      run: python scripts/linter.py {files}

When called like lefthook run custom-hook, {files} will be replaced with the results of git diff --name-only main. When called like find . -name '*.py' -print0 | lefthook run custom-hook --files-from-stdin {files} will be replaced from the \0-separated list from STDIN.

Does it sound good?

Copy link
Contributor Author

@sanmai-NL sanmai-NL Jan 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrexox Only for pre-push, pre-receive and reference-transaction stdin gets used. I was intending for the configuration option to apply even when Lefthook is invoked from the CLI. I prefer to check for those hooks that use stdin in the source code rather than not having the config key.

@sanmai-NL
Copy link
Contributor Author

sanmai-NL commented Dec 18, 2023

What you're asking to me is like, can you show an example of someone needing a macro in Word document sent over e-mail? The problem is that you don't know which files people supply, e.g. in a PR branch. These are valid filenames.

If you're asking for user stories rather than attacker stories, then as a user, I want to script Lefthook in a cross-platform way, without manipulating e.g., Git tool output to replace null chars or newlines with a comma. Mind that tr is a Unix tool.

@mrexox
Copy link
Member

mrexox commented Dec 18, 2023

I am just curious. I agree that it is a possible case. I don't mind adding the support for getting files from STDIN and passing files via --file option if you address my comments above.

@sanmai-NL
Copy link
Contributor Author

@mrexox I'm very keen to get this merged, so that my team has another blocker removed.

Would you be so kind to develop tests and to write a (draft) docs/changelog entry to your liking already, even if one thing still must be ironed out?

@mrexox
Copy link
Member

mrexox commented Dec 21, 2023

I'll be glad to help with polishing the PR and adding documentation and changelog entries when we finish the discussion 👍

@sanmai-NL
Copy link
Contributor Author

@mrexox Can we continue this ?

@mrexox
Copy link
Member

mrexox commented Jan 18, 2024

@sanmai-NL, yes, I plan to add a few edits to the PR:

  • Instead of passing AllFiles and FilesFromStdin, just get the files once and pass them in a Files option
  • Remove files_from_stdin config option. I think that it is impossible to use this option with git hooks, so I plan to leave it only as an override for the explicit call like find . -name '*.go' | lefthook run pre-commit --files-from-stdin

I can do them myself since I think I've got the vision.

@sanmai-NL
Copy link
Contributor Author

sanmai-NL commented Jan 18, 2024

Yes please continue, that sounds acceptable to me personally (although as discussed, removing files_from_stdin as config option entirely seems a bit unnecessary, but alas). I don't know about the advantages of your first design change, but I suppose you have thought that through.

@mrexox
Copy link
Member

mrexox commented Jan 19, 2024

@sanmai-NL, I've added the tests and refactored some parts. Now overrides are applied not only to {staged_files} template but to all templates ({files}, {all_files}, {push_files}).

I also added parsing files from STDIN by \n as well as \0, since I think it's not a common thing to have file names that include \n, but having this option makes the feature use more convenient.

Please, review the change, and if it's OK I will merge the PR and prepare a release next week.

@mrexox mrexox merged commit 3246821 into evilmartians:master Jan 22, 2024
13 of 16 checks passed
@sanmai-NL
Copy link
Contributor Author

sanmai-NL commented Jan 23, 2024

@mrexox, but splitting on both \n and \0 breaks my intentional functionality ...

I'm only now ready to review. Regrettably, it seems I assessed your change differently from how it was implemented. I thought you also added an option to split on newlines (exclusively), but you combined it with the null character based splitting.

Splitting on \n is not safe, same as on ,, since anyone can create and supply files with such characters in their names. This choice has the consequence that I can still not pipe Unix tool output (see OP) since sometimes, newlines are in there too even without a file name containing one. The freedom to combine Unix tool outputs, in addition to safety, is why all other tools do not split on newlines ...

@mrexox
Copy link
Member

mrexox commented Jan 23, 2024

@sanmai-NL, having newlines in file names is not conventional (however is possible). Do you have such files, or are you afraid that it can break something for your work?

@sanmai-NL
Copy link
Contributor Author

The latter, such files can be added to pull request branches and it would break tooling.

@sanmai-NL
Copy link
Contributor Author

Or rather, it would break Lefthook but not other tooling.

@mrexox
Copy link
Member

mrexox commented Jan 23, 2024

Alright, I will change this in next fix version to use only \0 by default

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

--files path separator not specified, whitespace doesn't work so it doesn't process globs
2 participants