Add --from-commit argument to allow commit-incremental update scans #295

psedge · 2020-04-22T16:17:19Z

Problem
In situations where use of detect-secrets-server isn't possible, it can take a long time to re-scan all files in a repository if it's sufficiently large, and it's rare that every file is worth re-checking.

Proposal
Rather than re-scanning all git-tracked files, store the commit of the last successful scan and if a flag is passed only compare the files changed since then. This allows incremental scans which are much more lightweight than a standard scan action. This is functionally similar to the way that detect-secrets-hook works.

Add a --from-commit argument which accepts either:
- a commit ref to diff against
- no argument, taking the commit ref from the baseline
Scan only the files that have changed
Store the HEAD ref in a new baseline field commit

… provided commit ref

…ement version to 0.13.2

domanchi

First off, I love this idea. I think it has great potential -- why was it closed?

Secondly, as I did some deeper thinking about this, would we be able to leverage the existing scan_diff functionality instead? There is a nuanced difference between git diff --name-only <from-commit> and piping git diff <from-commit> into the scan_diff functionality, as the former would identify secrets in a file, even if the secret wasn't added between that time period. Which one were you going for?

domanchi · 2020-04-22T20:07:30Z

detect_secrets/__init__.py

@@ -1 +1 @@
-VERSION = '0.13.1'
+VERSION = '0.13.2'


We should probably minor version bump for this new feature.

Remember to bump the version mentioned in the README as well.

You can use python scripts/bumpity.py --bump minor to do this if you want to.

domanchi · 2020-04-27T20:25:00Z

detect_secrets/core/usage.py

+                'Scan only files which have changed since a specified commit reference.',
+                'Leave empty to read commit reference from baseline',
+            ),
+            dest='from_commit',


This dest parameter is not required.

domanchi · 2020-04-28T12:33:53Z

detect_secrets/core/baseline.py

@@ -80,6 +99,9 @@ def initialize(
    for file in sorted(files_to_scan):
        output.scan_file(file)

+        if from_commit:
+            output.files_scanned.append(file)


This line had me thinking for a while. Why do we need to keep track of all the files that we scan in the baseline?

By definition, this would produce a different set of files as compared to git ls-files. However, you could also argue that scanning the differential wouldn't seek to create a baseline as we know it, rather, attempts to see whether there's any secrets between the git commit provided, and current HEAD.

If this is true (and there's no other need for having a list of files scanned), you could just use the results field to see if there are any secrets between the from_commit and HEAD. If there isn't, our job is done. If there is, it's not like we'll be creating a baseline from it anyway?

Granted, this is overloading the baseline format as another way to present detected secrets, but 🤷

In this implementation it was necessary due to problems if we're passing the --update flag, as we need to know which files were 'scanned'; if a scan has found that secrets have been removed then the new_results won't have a key for eg. 'fileA', whereas old_results may do. In this case, we need to not include old_results['fileA'] during our merge_results call.

I actually struggled with the changed files concept; if we proceed with this idea I'll revisit this to use scan_diff as it sounds like it solves the problem! My concern was that it's possible to change a file and not commit, scan, then revert the file. We'd now have secrets entries which have been invalidated.

KevinHock · 2020-04-28T20:58:50Z

To summarize my thoughts on this, I think this is worth having, since it will do what -server does but out of the box not needing something like S3.

It may contribute to merge conflicts like generated_at currently does, but we could always not include the commit field if the flag for #92 was passed.

killuazhu · 2020-04-30T13:19:21Z

Add a --from-commit argument which accepts either:

a commit ref to diff against

no argument, taking the commit ref from the baseline

@psedge I'm also curious what the behavior would be if the commit ref from the baseline no longer exists? Would it scan all the commits? Or just give up on --from-commit argument? This can happen consistently if your team use a squash and merge PR approval flow. The commit still existed and referenced by the hidden PR ref on Github server, but local clones might not know that commit at all.

KevinHock · 2020-04-30T19:40:35Z

detect_secrets/core/baseline.py

+    try:
+        with open(os.devnull, 'w') as fnull:
+            git_files = subprocess.check_output(
+                [
+                    'git',
+                    '--no-pager',
+                    'diff',
+                    '--name-only',
+                    from_commit,
+                    rootdir,
+                ],
+                stderr=fnull,
+            )
+        for filename in git_files.decode('utf-8').split():
+            # As opposed to ls-files which gives path relative to -C, diff gives
+            # paths from git root directory
+            relative_path = util.get_relative_path_if_in_cwd('.', filename)
+            if relative_path:
+                output.append(relative_path)
+    except subprocess.CalledProcessError:
+        pass


Nit: Might be better practice to only wrap the relevant subprocess call in a try: except:

KevinHock · 2020-04-30T19:42:43Z

tests/core/secrets_collection_test.py

@@ -383,6 +413,12 @@ def test_load_baseline_without_exclude(self, mock_log):
            )
        assert mock_log.error_messages == 'Incorrectly formatted baseline!\n'

+    def get_point_thirteen_point_two_and_later_baseline_dict(self, gmtime):
+        # In v0.13.2 --from-commit got added


++, great tests!

thanks! I wanted to ask though, I added my additional assertion at secrets_collection_test.py:390 trying to stay in style by duplicating the contents of the previous test_load_baseline_from_string_with_X methods; would it have been better to simply call those previous version tests in mine, and then test my assertions?

psedge · 2020-05-02T08:55:17Z

To summarize my thoughts on this, I think this is worth having, since it will do what -server does but out of the box not needing something like S3.

It may contribute to merge conflicts like generated_at currently does, but we could always not include the commit field if the flag for #92 was passed.

I've been thinking about this over the last few days and I think you are right that merge conflicts are fairly likely on this field in a many-engineers-one-codebase setup, and it's going to sometimes require manual intervention. They become far more likely if we update the commit field automatically in baseline.py:85, so would it be best to have this as a hand-updated field only? I'm thinking that this should be used more as a 'last safe point' which can be modified when the delta becomes prohibitively large/slow to scan.

Throughout our discussion I've actually moved towards using the -server component in my CI flow as it has numerous stability advantages over this implementation, but that hasn't solved the problems of the client component repeating work when scanning, where I think this might still be a valid feature. It is a permanent change to the .baseline format though as you say @KevinHock; and its naming may mislead users - if you can suggest better wording of either the flag or the usage instructions please do!

psedge · 2020-05-02T09:00:25Z

Add a --from-commit argument which accepts either:

a commit ref to diff against

no argument, taking the commit ref from the baseline

@psedge I'm also curious what the behavior would be if the commit ref from the baseline no longer exists? Would it scan all the commits? Or just give up on --from-commit argument? This can happen consistently if your team use a squash and merge PR approval flow. The commit still existed and referenced by the hidden PR ref on Github server, but local clones might not know that commit at all.

I ran into this problem when testing; I initially wrote the command to exit in this case but didn't know whether that was a good UX. I'm open to suggestions though, in the current implementation I believe it fails silently and acts as if no files changed between the non-existent ref and HEAD.

perryzjc · 2024-03-19T20:11:11Z

@psedge I am love to see this feature get implemented... but the PR has been closed for 4 years and no further discussions.

lorenzodb1 · 2024-04-16T19:09:59Z

@perryzjc consider opening a PR with the changes contained in this one so we can move from there

psedge added 2 commits April 22, 2020 14:57

Add --from-commit arg to scan action, allowing incremental scans from…

d6a9a18

… provided commit ref

Add default empty --from-commit arg, write HEAD to baseline, and incr…

4312075

…ement version to 0.13.2

psedge closed this Apr 26, 2020

domanchi reviewed Apr 28, 2020

View reviewed changes

KevinHock reviewed Apr 30, 2020

View reviewed changes

KevinHock mentioned this pull request May 18, 2020

Enhancement proposal: no date in file? #302

Open

killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request May 28, 2020

Include wordlist by default (Yelp#295)

897fa22

killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request Jul 9, 2020

Include wordlist by default (Yelp#295)

8b761e1

killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request Sep 17, 2020

Include wordlist by default (Yelp#295)

3993664

perryzjc mentioned this pull request Mar 19, 2024

Using Template for Secrets Detection Guide NASA-AMMOS/slim#143

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add --from-commit argument to allow commit-incremental update scans #295

Add --from-commit argument to allow commit-incremental update scans #295

psedge commented Apr 22, 2020

domanchi left a comment

domanchi Apr 22, 2020

KevinHock Apr 30, 2020

domanchi Apr 27, 2020

domanchi Apr 28, 2020

psedge Apr 30, 2020

KevinHock commented Apr 28, 2020 •

edited

Loading

killuazhu commented Apr 30, 2020

KevinHock Apr 30, 2020

KevinHock Apr 30, 2020

psedge May 2, 2020

psedge commented May 2, 2020

psedge commented May 2, 2020

perryzjc commented Mar 19, 2024

lorenzodb1 commented Apr 16, 2024

		@@ -1 +1 @@
		VERSION = '0.13.1'
		VERSION = '0.13.2'

Add --from-commit argument to allow commit-incremental update scans #295

Add --from-commit argument to allow commit-incremental update scans #295

Conversation

psedge commented Apr 22, 2020

domanchi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KevinHock commented Apr 28, 2020 • edited Loading

killuazhu commented Apr 30, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

psedge commented May 2, 2020

psedge commented May 2, 2020

perryzjc commented Mar 19, 2024

lorenzodb1 commented Apr 16, 2024

KevinHock commented Apr 28, 2020 •

edited

Loading