-
Notifications
You must be signed in to change notification settings - Fork 482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Keyword regexes improvement #418
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing this! I was going to work on a fix, but I'm so glad that you were able to fix these issues yourselves.
I like the detail that you have in your PR summary; I think that this should be captured in code as comments (especially your thorough breakdown of the new SECRET
regex rationale).
With regards to test cases, I want to abandon the "old-style" tests, and (eventually) replace them with much more readable tests. After all, readable tests are a great way of documenting intended usage of functions. I think you have a good set of test cases, and maybe my interim test cases will help you get started?
def test_ignore_case():
secret = next(scan_line('os.environ["AWS_SECRET_ACCESS_KEY"] = "testing"'))
assert secret.secret_value == 'testing'
def test_secret_with_whitespace():
secret = next(scan_line('password = "not a secret"'))
assert secret.secret_value == 'not a secret'
def test_comparison_operator():
secret = next(scan_line('api_key == "hunter2"'))
assert secret.secret_value == 'hunter2'
def test_bounds():
for index, secret in scan_line(
'detect-secrets scan '
'--exclude-lines "password = blah" '
'--exclude-lines "password = fake"',
):
assert secret.secret_value in {'blah', 'fake'}
# TODO(2021-03-03): This should probably yield `fake` as well.
# assert index == 1
def test_backticks():
secret = next(scan_line('api_key = `cat password`'))
assert secret.secret_value == 'cat password'
@pytest.mark.parametrize(
'line',
(
'if ("value" == password)',
'else if ("value" === password)',
...
),
)
def test_preceded_by_equals_sign(line):
pass
@pytest.fixture(autouse=True)
def use_keyword_detector():
with transient_settings({
'plugins_used': [{
'name': 'KeywordDetector',
}],
}):
yield
* feat: add auth keyword to detectors Another common word for secrets that isn't currently picked up. * feat: tests for new db2 keywords too * test: test case for user reported false negative
We implement a new We obtain a 98% of coverage with the new Thank you very much! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow! What you've done with the KeywordDetector test cases is amazing! So much better, and more than I expected. Thanks!
detect_secrets/plugins/keyword.py
Outdated
''' | ||
Secret regex details: | ||
[^\r\n]* -> this section match with every character except line breaks. | ||
This allows to find secrets that starts with symbols or | ||
alphanumeric characters. | ||
[a-zA-Z0-9]+ -> this section match only with alphanumeric characters, and at | ||
least one is required. This allows to reduce the false positives | ||
number. | ||
[^\r\n]* -> this section match with every character except line breaks. | ||
This allows to find secrets with symbols at the end. | ||
[^\r\n,\'"`] -> this section match with the last secret character that can be | ||
everything except line breaks, comma, backticks or quotes. This | ||
allows to reduce the false positives number and to prevent | ||
errors in the code snippet highlighting. | ||
''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, are you running pre-commit
? I'm hitting this error:
$ pre-commit run --all-files
Check builtin type constructor use.......................................Passed
Check docstring is first.................................................Failed
hookid: check-docstring-first
detect_secrets/plugins/keyword.py:53 Multiple module docstrings (first docstring on line 1).
Debug Statements (Python)................................................Passed
Fix double quoted strings................................................Passed
Fix End of Files.........................................................Passed
Tests should end in _test.py.............................................Passed
Flake8...................................................................Passed
Trim Trailing Whitespace.................................................Passed
Reorder python imports...................................................Passed
Add trailing commas......................................................Passed
autopep8.................................................................Passed
Detect secrets...........................................................Failed
hookid: detect-secrets
Files were modified by this hook. Additional output:
The baseline file was updated.
Probably to keep line numbers of secrets up-to-date.
Please `git add .secrets.baseline`, thank you.
Your baseline file (.secrets.baseline) is unstaged.
`git add .secrets.baseline` to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we are running pre-commit
. This error is in the line 53 that is the comment of FALSE_POSITIVES
variable. The details of the SECRET
regex can be added using one line comments (#
) instead of multiple line comments ('''
). What do you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let's just do multiple line comments, so pre-commit will be happy. Then I can merge this in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I think that pre-commit
is working fine now
Thank you very much!!! |
# everything except line breaks, comma, backticks or quotes. This | ||
# allows to reduce the false positives number and to prevent | ||
# errors in the code snippet highlighting. | ||
SECRET = r'[^\r\n]*[a-zA-Z0-9]+[^\r\n]*[^\r\n,\'"`]' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pablosantiagolopez : I think you might have introduced an accidental exponential performance degradation here, with files with long lines.
(stable_version) $ time detect-secrets scan detect_secrets/filters/gibberish/rfc.model
...
real 0m0.286s
user 0m0.243s
sys 0m0.036s
(version_on_master) $ time detect-secrets scan detect_secrets/filters/gibberish/rfc.model
...
real 29m28.225s
user 29m24.396s
sys 0m2.074s
(version_on_master) $ time detect-secrets scan detect_secrets/filters/gibberish/rfc.model --disable-plugin KeywordDetector
...
real 0m0.385s
user 0m0.322s
sys 0m0.054s
Please look into this.
@pablosantiagolopez : sorry it's been a while since I was able to look at your other pending PRs. I realized that since the In my investigations, I have found that accidental exponential performance degradation is still there, on the master branch. Here is the example: <img src="
"
> On the latest published version, this payload works. However, on the current master ( Maybe we should just add a max char limit to the regex? |
Hi @domanchi, how are you? About our pending PRs, no problem, we understand that reviewing them suppose very much time and effort, so don't worry. About the performance problem of the
Based in this links, I have changed the SECRET = r'(?=[^\v\'\"]*)(?=\w+)[^\v\'\"]*[^\v,\'\"`]' The regex content is the same, but the first two sub-groups are moved into Please, test this regex and tell me what do you think about this. If it works for you, I can add it into one of our pending PRs because they are also related to the |
Thanks for your quick response! I'll test it out, and let you know. |
Hi! I totally agree with the backtracking performance problem that @pablosantiagolopez has mentioned. I would like to comment that it might be interesting to create a filter for these cases, in addition to fixing the backtracking problem (if the @domanchi 's tests confirm it). I think that a Also, other good idea can be include more non-code files extensions into the This tips can improve the |
@pablosantiagolopez : I can confirm it performs a lot better. Let's cut out a PR for this change alone, so that I can merge it independently of other modifications to the EDIT: Feel free to use the example payload as a test case for long lines, so that we can add regression tests. |
Perfect, it's comming...Thank you! |
Amazing PR @pablosnt 😮 |
Thank you very much @KevinHock!!! Everything is easier with the help of @syn-4ck ;) |
This Pull Request solves some issues detected in #414 and includes some improvements described in #396 to the latest version of detect-secrets.
This code includes:
FOLLOWED_BY_EQUAL_SIGNS_REGEX
andFOLLOWED_BY_EQUAL_SIGNS_QUOTES_REQUIRED_REGEX
regexes. For example:PRECEDED_BY_EQUAL_COMPARISON_SIGNS_QUOTES_REQUIRED_REGEX
regex. For example:Corrections in the
SECRET
regex to avoid some problems detected in Bugfix of Yaml exception with simple quotes #414 and reduce the false positive percentage. The regex can be decomposed in the following sections:[^\r\n]*
: this section match with every character except line breaks. This allows to find secrets that starts with symbols or alphanumeric characters.[a-zA-Z0-9]+
: this section match only with alphanumeric characters, and at least one is required. This allows to reduce the false positives number.[^\r\n]*
: this section match with every character except line breaks. This allows to find secrets with symbols at the end.[^\r\n,\'"]
: this section match with the last secret character that can be everything except line breaks, comma or quotes. This allows to reduce the false positives number and to prevent errors in the code snippet highlighting. Next you can see an example:The following image shows a test with this new regex:
Of course,
keyword_test.py
has been updated to check all this changes.