Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the performance of line regexes #244

Closed
KevinHock opened this issue Sep 25, 2019 · 6 comments
Closed

Improve the performance of line regexes #244

KevinHock opened this issue Sep 25, 2019 · 6 comments
Assignees

Comments

@KevinHock
Copy link
Collaborator

Due to the way we pass a file to every single plugin, rather than a line, we end up regex.searching the same line P times, where P is the number of plugins. This holds true for both ALLOWLIST_REGEXES and --exclude-lines. For large diffs on a tightly provisioned box this can be quite inefficient.

The relevant control flow is as follows

try:
log.info('Checking file: %s', filename)
for results, plugin in self._results_accumulator(filename):
results.update(plugin.analyze(f, filename))
f.seek(0)

def analyze(self, file, filename):
"""
:param file: The File object itself.
:param filename: string; filename of File object, used for creating
PotentialSecret objects
:returns dictionary representation of set (for random access by hash)
{ detect_secrets.core.potential_secret.__hash__:
detect_secrets.core.potential_secret }
"""
potential_secrets = {}
file_lines = tuple(file.readlines())
for line_num, line in enumerate(file_lines, start=1):
results = self.analyze_string(line, line_num, filename)

def analyze_string(self, string, line_num, filename):
"""
:param string: string; the line to analyze
:param line_num: integer; line number that is currently being analyzed
:param filename: string; name of file being analyzed
:returns: dictionary
NOTE: line_num and filename are used for PotentialSecret creation only.
"""
if (
any(
allowlist_regex.search(string) for allowlist_regex in ALLOWLIST_REGEXES
)
or (
self.exclude_lines_regex and
self.exclude_lines_regex.search(string)

@KevinHock
Copy link
Collaborator Author

To put the above differently, we ask "should we skip this line?" number of plugin times, where as we could ask once.

@OiCMudkips pointed out it might make sense to skip 'em after secrets are found, I think they're right.

Let's say the likelihood of having a line hit a --line-exclude or # allowlist is 1 in 1000, we would need over 1,000 plugins in order for skipping lines before analyze()ing to be more efficient, so let's skip them afterwards.

@killuazhu
Copy link
Contributor

To put the above differently, we ask "should we skip this line?" number of plugin times, where as we could ask once.

This makes sense, so the allowed list can be checked once early in the process.

@OiCMudkips pointed out it might make sense to skip 'em after secrets are found, I think they're right.

Sometimes we find a line might contain several secrets with different types. It would be nice not to skip a line after a secret been found, or give an option to control the behavior.

@KevinHock
Copy link
Collaborator Author

It would be nice not to skip a line after a secret been found, or give an option to control the behavior.

I think we used to have, (my memory is super hazy about this), a really long time ago, an --ignore-pragma type option that would still report lines with the pragma on them. I might be wrong about that. But you're right it makes a lot of sense to give users that option @killuazhu 👍

@OiCMudkips
Copy link
Contributor

OiCMudkips commented Sep 26, 2019

@OiCMudkips pointed out it might make sense to skip 'em after secrets are found, I think they're right.

Hmm, to be clear, I meant that we should delay the whitelist check until after the secret is found. This means that we only run the whitelist check on # secrets found, instead of (# plugins)*(# lines).

I agree that the --ignore-pragma option is a good idea.

@killuazhu
Copy link
Contributor

I meant that we should delay the whitelist check until after the secret is found.

This sounds like a really good idea to reduce some overhead given secret lines are rarer than normal lines. @OiCMudkips

@KevinHock KevinHock self-assigned this Apr 6, 2020
@KevinHock
Copy link
Collaborator Author

I can do this one, it's pretty easy.

KevinHock added a commit that referenced this issue Apr 7, 2020
This fixes issue #244.
Only check the line for allowlist regexes or --exclude-lines if a secret was found.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants