Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --from-commit argument to allow commit-incremental update scans #295

Closed
wants to merge 2 commits into from

Conversation

psedge
Copy link
Contributor

@psedge psedge commented Apr 22, 2020

Problem
In situations where use of detect-secrets-server isn't possible, it can take a long time to re-scan all files in a repository if it's sufficiently large, and it's rare that every file is worth re-checking.

Proposal
Rather than re-scanning all git-tracked files, store the commit of the last successful scan and if a flag is passed only compare the files changed since then. This allows incremental scans which are much more lightweight than a standard scan action. This is functionally similar to the way that detect-secrets-hook works.

  • Add a --from-commit argument which accepts either:
    • a commit ref to diff against
    • no argument, taking the commit ref from the baseline
  • Scan only the files that have changed
  • Store the HEAD ref in a new baseline field commit

@psedge psedge closed this Apr 26, 2020
Copy link
Contributor

@domanchi domanchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First off, I love this idea. I think it has great potential -- why was it closed?

Secondly, as I did some deeper thinking about this, would we be able to leverage the existing scan_diff functionality instead? There is a nuanced difference between git diff --name-only <from-commit> and piping git diff <from-commit> into the scan_diff functionality, as the former would identify secrets in a file, even if the secret wasn't added between that time period. Which one were you going for?

@@ -1 +1 @@
VERSION = '0.13.1'
VERSION = '0.13.2'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably minor version bump for this new feature.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to bump the version mentioned in the README as well.

You can use python scripts/bumpity.py --bump minor to do this if you want to.

'Scan only files which have changed since a specified commit reference.',
'Leave empty to read commit reference from baseline',
),
dest='from_commit',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dest parameter is not required.

@@ -80,6 +99,9 @@ def initialize(
for file in sorted(files_to_scan):
output.scan_file(file)

if from_commit:
output.files_scanned.append(file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line had me thinking for a while. Why do we need to keep track of all the files that we scan in the baseline?

By definition, this would produce a different set of files as compared to git ls-files. However, you could also argue that scanning the differential wouldn't seek to create a baseline as we know it, rather, attempts to see whether there's any secrets between the git commit provided, and current HEAD.

If this is true (and there's no other need for having a list of files scanned), you could just use the results field to see if there are any secrets between the from_commit and HEAD. If there isn't, our job is done. If there is, it's not like we'll be creating a baseline from it anyway?

Granted, this is overloading the baseline format as another way to present detected secrets, but 🤷

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this implementation it was necessary due to problems if we're passing the --update flag, as we need to know which files were 'scanned'; if a scan has found that secrets have been removed then the new_results won't have a key for eg. 'fileA', whereas old_results may do. In this case, we need to not include old_results['fileA'] during our merge_results call.

I actually struggled with the changed files concept; if we proceed with this idea I'll revisit this to use scan_diff as it sounds like it solves the problem! My concern was that it's possible to change a file and not commit, scan, then revert the file. We'd now have secrets entries which have been invalidated.

@KevinHock
Copy link
Collaborator

KevinHock commented Apr 28, 2020

To summarize my thoughts on this, I think this is worth having, since it will do what -server does but out of the box not needing something like S3.

It may contribute to merge conflicts like generated_at currently does, but we could always not include the commit field if the flag for #92 was passed.

@killuazhu
Copy link
Contributor

  • Add a --from-commit argument which accepts either:
    • a commit ref to diff against
    • no argument, taking the commit ref from the baseline

@psedge I'm also curious what the behavior would be if the commit ref from the baseline no longer exists? Would it scan all the commits? Or just give up on --from-commit argument? This can happen consistently if your team use a squash and merge PR approval flow. The commit still existed and referenced by the hidden PR ref on Github server, but local clones might not know that commit at all.

Comment on lines +352 to +372
try:
with open(os.devnull, 'w') as fnull:
git_files = subprocess.check_output(
[
'git',
'--no-pager',
'diff',
'--name-only',
from_commit,
rootdir,
],
stderr=fnull,
)
for filename in git_files.decode('utf-8').split():
# As opposed to ls-files which gives path relative to -C, diff gives
# paths from git root directory
relative_path = util.get_relative_path_if_in_cwd('.', filename)
if relative_path:
output.append(relative_path)
except subprocess.CalledProcessError:
pass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Might be better practice to only wrap the relevant subprocess call in a try: except:

@@ -383,6 +413,12 @@ def test_load_baseline_without_exclude(self, mock_log):
)
assert mock_log.error_messages == 'Incorrectly formatted baseline!\n'

def get_point_thirteen_point_two_and_later_baseline_dict(self, gmtime):
# In v0.13.2 --from-commit got added
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++, great tests!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! I wanted to ask though, I added my additional assertion at secrets_collection_test.py:390 trying to stay in style by duplicating the contents of the previous test_load_baseline_from_string_with_X methods; would it have been better to simply call those previous version tests in mine, and then test my assertions?

@psedge
Copy link
Contributor Author

psedge commented May 2, 2020

To summarize my thoughts on this, I think this is worth having, since it will do what -server does but out of the box not needing something like S3.

It may contribute to merge conflicts like generated_at currently does, but we could always not include the commit field if the flag for #92 was passed.

I've been thinking about this over the last few days and I think you are right that merge conflicts are fairly likely on this field in a many-engineers-one-codebase setup, and it's going to sometimes require manual intervention. They become far more likely if we update the commit field automatically in baseline.py:85, so would it be best to have this as a hand-updated field only? I'm thinking that this should be used more as a 'last safe point' which can be modified when the delta becomes prohibitively large/slow to scan.

Throughout our discussion I've actually moved towards using the -server component in my CI flow as it has numerous stability advantages over this implementation, but that hasn't solved the problems of the client component repeating work when scanning, where I think this might still be a valid feature. It is a permanent change to the .baseline format though as you say @KevinHock; and its naming may mislead users - if you can suggest better wording of either the flag or the usage instructions please do!

@psedge
Copy link
Contributor Author

psedge commented May 2, 2020

  • Add a --from-commit argument which accepts either:

    • a commit ref to diff against
    • no argument, taking the commit ref from the baseline

@psedge I'm also curious what the behavior would be if the commit ref from the baseline no longer exists? Would it scan all the commits? Or just give up on --from-commit argument? This can happen consistently if your team use a squash and merge PR approval flow. The commit still existed and referenced by the hidden PR ref on Github server, but local clones might not know that commit at all.

I ran into this problem when testing; I initially wrote the command to exit in this case but didn't know whether that was a good UX. I'm open to suggestions though, in the current implementation I believe it fails silently and acts as if no files changed between the non-existent ref and HEAD.

killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request May 28, 2020
killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request Jul 9, 2020
killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request Sep 17, 2020
@perryzjc
Copy link
Contributor

@psedge I am love to see this feature get implemented... but the PR has been closed for 4 years and no further discussions.

@lorenzodb1
Copy link
Member

@perryzjc consider opening a PR with the changes contained in this one so we can move from there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants