-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add --from-commit argument to allow commit-incremental update scans #295
Conversation
… provided commit ref
…ement version to 0.13.2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First off, I love this idea. I think it has great potential -- why was it closed?
Secondly, as I did some deeper thinking about this, would we be able to leverage the existing scan_diff
functionality instead? There is a nuanced difference between git diff --name-only <from-commit>
and piping git diff <from-commit>
into the scan_diff
functionality, as the former would identify secrets in a file, even if the secret wasn't added between that time period. Which one were you going for?
@@ -1 +1 @@ | |||
VERSION = '0.13.1' | |||
VERSION = '0.13.2' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably minor version bump for this new feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember to bump the version mentioned in the README as well.
You can use python scripts/bumpity.py --bump minor
to do this if you want to.
'Scan only files which have changed since a specified commit reference.', | ||
'Leave empty to read commit reference from baseline', | ||
), | ||
dest='from_commit', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This dest
parameter is not required.
@@ -80,6 +99,9 @@ def initialize( | |||
for file in sorted(files_to_scan): | |||
output.scan_file(file) | |||
|
|||
if from_commit: | |||
output.files_scanned.append(file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line had me thinking for a while. Why do we need to keep track of all the files that we scan in the baseline?
By definition, this would produce a different set of files as compared to git ls-files
. However, you could also argue that scanning the differential wouldn't seek to create a baseline as we know it, rather, attempts to see whether there's any secrets between the git commit provided, and current HEAD.
If this is true (and there's no other need for having a list of files scanned), you could just use the results
field to see if there are any secrets between the from_commit
and HEAD. If there isn't, our job is done. If there is, it's not like we'll be creating a baseline from it anyway?
Granted, this is overloading the baseline format as another way to present detected secrets, but 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this implementation it was necessary due to problems if we're passing the --update
flag, as we need to know which files were 'scanned'; if a scan has found that secrets have been removed then the new_results
won't have a key for eg. 'fileA', whereas old_results
may do. In this case, we need to not include old_results['fileA']
during our merge_results
call.
I actually struggled with the changed files concept; if we proceed with this idea I'll revisit this to use scan_diff
as it sounds like it solves the problem! My concern was that it's possible to change a file and not commit, scan, then revert the file. We'd now have secrets entries which have been invalidated.
To summarize my thoughts on this, I think this is worth having, since it will do what -server does but out of the box not needing something like S3. It may contribute to merge conflicts like |
@psedge I'm also curious what the behavior would be if the commit ref from the baseline no longer exists? Would it scan all the commits? Or just give up on |
try: | ||
with open(os.devnull, 'w') as fnull: | ||
git_files = subprocess.check_output( | ||
[ | ||
'git', | ||
'--no-pager', | ||
'diff', | ||
'--name-only', | ||
from_commit, | ||
rootdir, | ||
], | ||
stderr=fnull, | ||
) | ||
for filename in git_files.decode('utf-8').split(): | ||
# As opposed to ls-files which gives path relative to -C, diff gives | ||
# paths from git root directory | ||
relative_path = util.get_relative_path_if_in_cwd('.', filename) | ||
if relative_path: | ||
output.append(relative_path) | ||
except subprocess.CalledProcessError: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Might be better practice to only wrap the relevant subprocess
call in a try:
except:
@@ -383,6 +413,12 @@ def test_load_baseline_without_exclude(self, mock_log): | |||
) | |||
assert mock_log.error_messages == 'Incorrectly formatted baseline!\n' | |||
|
|||
def get_point_thirteen_point_two_and_later_baseline_dict(self, gmtime): | |||
# In v0.13.2 --from-commit got added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++, great tests!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! I wanted to ask though, I added my additional assertion at secrets_collection_test.py:390 trying to stay in style by duplicating the contents of the previous test_load_baseline_from_string_with_X
methods; would it have been better to simply call those previous version tests in mine, and then test my assertions?
I've been thinking about this over the last few days and I think you are right that merge conflicts are fairly likely on this field in a many-engineers-one-codebase setup, and it's going to sometimes require manual intervention. They become far more likely if we update the Throughout our discussion I've actually moved towards using the -server component in my CI flow as it has numerous stability advantages over this implementation, but that hasn't solved the problems of the client component repeating work when scanning, where I think this might still be a valid feature. It is a permanent change to the .baseline format though as you say @KevinHock; and its naming may mislead users - if you can suggest better wording of either the flag or the usage instructions please do! |
I ran into this problem when testing; I initially wrote the command to exit in this case but didn't know whether that was a good UX. I'm open to suggestions though, in the current implementation I believe it fails silently and acts as if no files changed between the non-existent ref and HEAD. |
@psedge I am love to see this feature get implemented... but the PR has been closed for 4 years and no further discussions. |
@perryzjc consider opening a PR with the changes contained in this one so we can move from there |
Problem
In situations where use of detect-secrets-server isn't possible, it can take a long time to re-scan all files in a repository if it's sufficiently large, and it's rare that every file is worth re-checking.
Proposal
Rather than re-scanning all git-tracked files, store the commit of the last successful scan and if a flag is passed only compare the files changed since then. This allows incremental scans which are much more lightweight than a standard scan action. This is functionally similar to the way that detect-secrets-hook works.
commit