Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keyword regexes improvement #418

Merged
merged 8 commits into from
Mar 6, 2021
Merged

Keyword regexes improvement #418

merged 8 commits into from
Mar 6, 2021

Conversation

pablosnt
Copy link
Contributor

@pablosnt pablosnt commented Mar 4, 2021

This Pull Request solves some issues detected in #414 and includes some improvements described in #396 to the latest version of detect-secrets.

This code includes:

  • Support to secret detection in comparisons. You can see it in FOLLOWED_BY_EQUAL_SIGNS_REGEX and FOLLOWED_BY_EQUAL_SIGNS_QUOTES_REQUIRED_REGEX regexes. For example:
if (password == "value") {}
else if (password === "value") {}
else if (password != "value") {}
else if (password !== "value") {}
  • Support to secret detection in reverse comparisons. You can see it in the PRECEDED_BY_EQUAL_COMPARISON_SIGNS_QUOTES_REQUIRED_REGEX regex. For example:
if ("value" == password) {}
else if ("value" === password) {}
else if ("value" != password) {}
else if ("value" !== password) {}
  • Corrections in the SECRET regex to avoid some problems detected in Bugfix of Yaml exception with simple quotes #414 and reduce the false positive percentage. The regex can be decomposed in the following sections:

    • [^\r\n]*: this section match with every character except line breaks. This allows to find secrets that starts with symbols or alphanumeric characters.
    • [a-zA-Z0-9]+: this section match only with alphanumeric characters, and at least one is required. This allows to reduce the false positives number.
    • [^\r\n]*: this section match with every character except line breaks. This allows to find secrets with symbols at the end.
    • [^\r\n,\'"]: this section match with the last secret character that can be everything except line breaks, comma or quotes. This allows to reduce the false positives number and to prevent errors in the code snippet highlighting. Next you can see an example:

last_comma

The following image shows a test with this new regex:

regex_example

Of course, keyword_test.py has been updated to check all this changes.

Copy link
Contributor

@domanchi domanchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing this! I was going to work on a fix, but I'm so glad that you were able to fix these issues yourselves.

I like the detail that you have in your PR summary; I think that this should be captured in code as comments (especially your thorough breakdown of the new SECRET regex rationale).

With regards to test cases, I want to abandon the "old-style" tests, and (eventually) replace them with much more readable tests. After all, readable tests are a great way of documenting intended usage of functions. I think you have a good set of test cases, and maybe my interim test cases will help you get started?

def test_ignore_case():
    secret = next(scan_line('os.environ["AWS_SECRET_ACCESS_KEY"] = "testing"'))
    assert secret.secret_value == 'testing'


def test_secret_with_whitespace():
    secret = next(scan_line('password = "not a secret"'))
    assert secret.secret_value == 'not a secret'


def test_comparison_operator():
    secret = next(scan_line('api_key == "hunter2"'))
    assert secret.secret_value == 'hunter2'


def test_bounds():
    for index, secret in scan_line(
        'detect-secrets scan '
        '--exclude-lines "password = blah" '
        '--exclude-lines "password = fake"',
    ):
        assert secret.secret_value in {'blah', 'fake'}

    # TODO(2021-03-03): This should probably yield `fake` as well.
    # assert index == 1


def test_backticks():
    secret = next(scan_line('api_key = `cat password`'))

    assert secret.secret_value == 'cat password'


@pytest.mark.parametrize(
    'line',
    (
        'if ("value" == password)',
        'else if ("value" === password)',
        ...
    ),
)
def test_preceded_by_equals_sign(line):
    pass


@pytest.fixture(autouse=True)
def use_keyword_detector():
    with transient_settings({
        'plugins_used': [{
            'name': 'KeywordDetector',
        }],
    }):
        yield

detect_secrets/plugins/keyword.py Outdated Show resolved Hide resolved
killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request Mar 4, 2021
* feat: add auth keyword to detectors

Another common word for secrets that isn't currently picked up.

* feat: tests for new db2 keywords too

* test: test case for user reported false negative
@pablosnt
Copy link
Contributor Author

pablosnt commented Mar 5, 2021

With regards to test cases, I want to abandon the "old-style" tests, and (eventually) replace them with much more readable tests. After all, readable tests are a great way of documenting intended usage of functions. I think you have a good set of test cases, and maybe my interim test cases will help you get started?

We implement a new keyword_test.py. I think it is much more readable and simpler, so it should be very easy to add new test cases. I hope you would like it.

We obtain a 98% of coverage with the new keyword_test, so I think that the included test cases are enough:

keyword_coverage

Thank you very much!

Copy link
Contributor

@domanchi domanchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! What you've done with the KeywordDetector test cases is amazing! So much better, and more than I expected. Thanks!

Comment on lines 155 to 169
'''
Secret regex details:
[^\r\n]* -> this section match with every character except line breaks.
This allows to find secrets that starts with symbols or
alphanumeric characters.
[a-zA-Z0-9]+ -> this section match only with alphanumeric characters, and at
least one is required. This allows to reduce the false positives
number.
[^\r\n]* -> this section match with every character except line breaks.
This allows to find secrets with symbols at the end.
[^\r\n,\'"`] -> this section match with the last secret character that can be
everything except line breaks, comma, backticks or quotes. This
allows to reduce the false positives number and to prevent
errors in the code snippet highlighting.
'''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, are you running pre-commit? I'm hitting this error:

$ pre-commit run --all-files
Check builtin type constructor use.......................................Passed
Check docstring is first.................................................Failed
hookid: check-docstring-first

detect_secrets/plugins/keyword.py:53 Multiple module docstrings (first docstring on line 1).

Debug Statements (Python)................................................Passed
Fix double quoted strings................................................Passed
Fix End of Files.........................................................Passed
Tests should end in _test.py.............................................Passed
Flake8...................................................................Passed
Trim Trailing Whitespace.................................................Passed
Reorder python imports...................................................Passed
Add trailing commas......................................................Passed
autopep8.................................................................Passed
Detect secrets...........................................................Failed
hookid: detect-secrets

Files were modified by this hook. Additional output:

The baseline file was updated.
Probably to keep line numbers of secrets up-to-date.
Please `git add .secrets.baseline`, thank you.


Your baseline file (.secrets.baseline) is unstaged.
`git add .secrets.baseline` to fix this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we are running pre-commit. This error is in the line 53 that is the comment of FALSE_POSITIVES variable. The details of the SECRET regex can be added using one line comments (#) instead of multiple line comments ('''). What do you prefer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's just do multiple line comments, so pre-commit will be happy. Then I can merge this in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think that pre-commit is working fine now

@pablosnt
Copy link
Contributor Author

pablosnt commented Mar 6, 2021

Wow! What you've done with the KeywordDetector test cases is amazing! So much better, and more than I expected. Thanks!

Thank you very much!!!

@domanchi domanchi merged commit deb1705 into Yelp:master Mar 6, 2021
# everything except line breaks, comma, backticks or quotes. This
# allows to reduce the false positives number and to prevent
# errors in the code snippet highlighting.
SECRET = r'[^\r\n]*[a-zA-Z0-9]+[^\r\n]*[^\r\n,\'"`]'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pablosantiagolopez : I think you might have introduced an accidental exponential performance degradation here, with files with long lines.

(stable_version) $ time detect-secrets scan detect_secrets/filters/gibberish/rfc.model
...
real    0m0.286s
user    0m0.243s
sys     0m0.036s

(version_on_master) $ time detect-secrets scan detect_secrets/filters/gibberish/rfc.model
...
real    29m28.225s
user    29m24.396s
sys     0m2.074s


(version_on_master) $ time detect-secrets scan detect_secrets/filters/gibberish/rfc.model --disable-plugin KeywordDetector
...
real    0m0.385s
user    0m0.322s
sys     0m0.054s

Please look into this.

@domanchi
Copy link
Contributor

domanchi commented Apr 1, 2021

@pablosantiagolopez : sorry it's been a while since I was able to look at your other pending PRs. I realized that since the KeywordDetector regexes are becoming increasingly complicated, the only way to properly test them is to stress test them against a subset of our internal code. As such, I've been retooling internal tools to perform this analysis, and have been trying to use that to approach testing these proposed changes with much more rigor.

In my investigations, I have found that accidental exponential performance degradation is still there, on the master branch. Here is the example:

<img src="
"
>

On the latest published version, this payload works. However, on the current master (465902312), it takes LONG time. Please investigate.

Maybe we should just add a max char limit to the regex?

@pablosnt
Copy link
Contributor Author

pablosnt commented Apr 1, 2021

Hi @domanchi, how are you? About our pending PRs, no problem, we understand that reviewing them suppose very much time and effort, so don't worry.

About the performance problem of the KeywordDetector, I was working in that and I think that the problem of the SECRET regex is the called Catastrophic backtracking. I found this interesting articles about the problem:

Based in this links, I have changed the SECRET regex to the following:

SECRET = r'(?=[^\v\'\"]*)(?=\w+)[^\v\'\"]*[^\v,\'\"`]'

The regex content is the same, but the first two sub-groups are moved into (?= ... ). I have tested it with your example, and detect-secrets finish the scan in similar times when KeywordDetector is enabled and when it is disabled. I have also runned the tests and everything works, so I think that it could be a good solution.

Please, test this regex and tell me what do you think about this. If it works for you, I can add it into one of our pending PRs because they are also related to the KeywordDetector plugin. Thank you very much for your feedback!

@domanchi
Copy link
Contributor

domanchi commented Apr 1, 2021

Thanks for your quick response! I'll test it out, and let you know.

@syn-4ck
Copy link
Contributor

syn-4ck commented Apr 1, 2021

Hi! I totally agree with the backtracking performance problem that @pablosantiagolopez has mentioned.

I would like to comment that it might be interesting to create a filter for these cases, in addition to fixing the backtracking problem (if the @domanchi 's tests confirm it). I think that a <img> HTML tag with a base64 source content will never contain secrets, so implement a HTML tags filter can be a good feature for detect-secrets... This filter will improve performance and reduce the rate of false positives.

Also, other good idea can be include more non-code files extensions into the IGNORED_FILE_EXTENSIONS (binary files, compiled packages...). We contributed adding some extensions in #419, but I think this list should grow as the version is used.

This tips can improve the KeywordDetector plugin performance. What do you think @pablosantiagolopez @domanchi?

@domanchi
Copy link
Contributor

domanchi commented Apr 1, 2021

@pablosantiagolopez : I can confirm it performs a lot better. Let's cut out a PR for this change alone, so that I can merge it independently of other modifications to the KeywordDetector.

EDIT: Feel free to use the example payload as a test case for long lines, so that we can add regression tests.

@pablosnt
Copy link
Contributor Author

pablosnt commented Apr 1, 2021

Perfect, it's comming...Thank you!

@pablosnt pablosnt mentioned this pull request Apr 1, 2021
@KevinHock
Copy link
Collaborator

Amazing PR @pablosnt 😮

@pablosnt
Copy link
Contributor Author

Amazing PR @pablosnt 😮

Thank you very much @KevinHock!!! Everything is easier with the help of @syn-4ck ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants