Improve the performance of line regexes #244

KevinHock · 2019-09-25T23:25:43Z

Because we pass the whole file to every plugin, rather than passing each line, we end up regex.searching the same line P times, where P is the number of plugins. This holds for both ALLOWLIST_REGEXES and --exclude-lines. For large diffs on a tightly provisioned box this can be quite inefficient.

The relevant control flow is as follows:

detect-secrets/detect_secrets/core/secrets_collection.py

Lines 336 to 341 in 5d3e065

    
    try:
        log.info('Checking file: %s', filename)
        for results, plugin in self._results_accumulator(filename):
            results.update(plugin.analyze(f, filename))
            f.seek(0)

detect-secrets/detect_secrets/plugins/base.py

Lines 45 to 57 in 5d3e065

    
    def analyze(self, file, filename):
        """
        :param file:     The File object itself.
        :param filename: string; filename of File object, used for creating
                         PotentialSecret objects
        :returns         dictionary representation of set (for random access by hash)
                         { detect_secrets.core.potential_secret.__hash__:
                               detect_secrets.core.potential_secret         }
        """
        potential_secrets = {}
        file_lines = tuple(file.readlines())
        for line_num, line in enumerate(file_lines, start=1):
            results = self.analyze_string(line, line_num, filename)

detect-secrets/detect_secrets/plugins/base.py

Lines 81 to 97 in 5d3e065

    
    def analyze_string(self, string, line_num, filename):
        """
        :param string:    string; the line to analyze
        :param line_num:  integer; line number that is currently being analyzed
        :param filename:  string; name of file being analyzed
        :returns:         dictionary
        NOTE: line_num and filename are used for PotentialSecret creation only.
        """
        if (
            any(
                allowlist_regex.search(string) for allowlist_regex in ALLOWLIST_REGEXES
            )
            or (
                self.exclude_lines_regex and
                self.exclude_lines_regex.search(string)

KevinHock · 2019-09-26T00:10:23Z

To put the above differently, we ask "should we skip this line?" once per plugin, whereas we could ask it just once.

@OiCMudkips pointed out it might make sense to skip 'em after secrets are found, I think they're right.

Let's say the likelihood of a line hitting --exclude-lines or a # allowlist comment is 1 in 1000. We would need over 1,000 plugins for skipping lines before analyze()ing to be more efficient, so let's skip them afterwards.
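The break-even estimate above can be checked with a toy cost model. This is only a sketch under assumptions the thread does not state: the function name `break_even_plugins` is made up for illustration, one exclusion check is assumed to cost about as much as one plugin analyzing one line, and the few checks run on lines where a secret was found are ignored.

```python
def break_even_plugins(p_excluded, check_cost=1.0, plugin_cost=1.0):
    """Number of plugins above which checking every line up front pays off.

    Checking up front costs one exclusion check per line and saves
    p_excluded * plugins plugin analyses per line, so it only wins when
    plugins > check_cost / (p_excluded * plugin_cost).
    """
    return check_cost / (p_excluded * plugin_cost)


# With 1 line in 1000 excluded and equal costs, the threshold is
# roughly 1000 plugins, matching the estimate in the comment.
threshold = break_even_plugins(1 / 1000)
print(round(threshold))
```

If an exclusion check is actually cheaper than a plugin pass, the threshold drops proportionally, so the conclusion depends on that cost ratio.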

killuazhu · 2019-09-26T11:08:26Z

> To put the above differently, we ask "should we skip this line?" once per plugin, whereas we could ask it just once.

This makes sense, so the allowed list can be checked once early in the process.

> @OiCMudkips pointed out it might make sense to skip 'em after secrets are found, I think they're right.

Sometimes a line contains several secrets of different types. It would be nice not to skip a line after a secret has been found, or to give an option to control the behavior.
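For reference, the "check once early" variant mentioned in this comment might look roughly like the sketch below. It is not the project's code: `lines_to_scan` is a hypothetical helper, and the single pattern in `ALLOWLIST_REGEXES` is a simplified stand-in for the real list.

```python
import re

# Simplified stand-in for the real allowlist patterns (assumption).
ALLOWLIST_REGEXES = [re.compile(r'pragma: ?allowlist[ -]secret')]


def lines_to_scan(lines, exclude_lines_regex=None):
    """Yield (line_num, line) for lines that pass one up-front exclusion check.

    Each line is checked exactly once here, so plugins that consume the
    result never need to repeat the check themselves.
    """
    for line_num, line in enumerate(lines, start=1):
        if any(regex.search(line) for regex in ALLOWLIST_REGEXES):
            continue
        if exclude_lines_regex and exclude_lines_regex.search(line):
            continue
        yield line_num, line


sample = [
    'a = 1',
    'b = "x"  # pragma: allowlist secret',
    'c = 3  # skipme',
]
kept = list(lines_to_scan(sample, exclude_lines_regex=re.compile('skipme')))
```

Every plugin would then iterate over `kept` instead of the raw file, which brings the number of exclusion checks down from (# plugins)*(# lines) to (# lines).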

KevinHock · 2019-09-26T17:59:42Z

> It would be nice not to skip a line after a secret has been found, or to give an option to control the behavior.

I think that a really long time ago (my memory is super hazy about this) we had an --ignore-pragma type option that would still report lines with the pragma on them. I might be wrong about that. But you're right, it makes a lot of sense to give users that option @killuazhu 👍

OiCMudkips · 2019-09-26T21:22:02Z

> @OiCMudkips pointed out it might make sense to skip 'em after secrets are found, I think they're right.

Hmm, to be clear, I meant that we should delay the whitelist check until after the secret is found. This means that we run the whitelist check only (# secrets found) times, instead of (# plugins)*(# lines) times.

I agree that the --ignore-pragma option is a good idea.
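The delayed check described above (consult the whitelist only once a plugin has found something) can be sketched as follows. This is an illustration under assumptions, not the change that was eventually committed: `scan`, `PLUGIN_REGEXES`, and the patterns are hypothetical stand-ins, and each "plugin" is reduced to a single regex.

```python
import re

# Hypothetical stand-ins for illustration; not detect-secrets' real patterns.
ALLOWLIST_REGEXES = [re.compile(r'#\s*pragma: allowlist secret')]
PLUGIN_REGEXES = [
    re.compile(r'AKIA[0-9A-Z]{16}'),
    re.compile(r'password\s*=\s*\S+'),
]


def scan(lines, check_first):
    """Return (findings, number of exclusion checks performed)."""
    checks = 0
    findings = []
    for line_num, line in enumerate(lines, start=1):
        for plugin in PLUGIN_REGEXES:
            if check_first:
                # Behavior described in the issue: every plugin re-checks every line.
                checks += 1
                if any(r.search(line) for r in ALLOWLIST_REGEXES):
                    continue
                if plugin.search(line):
                    findings.append((line_num, plugin.pattern))
            elif plugin.search(line):
                # Delayed check: consult the allowlist only after a hit.
                checks += 1
                if not any(r.search(line) for r in ALLOWLIST_REGEXES):
                    findings.append((line_num, plugin.pattern))
    return findings, checks


lines = [
    'x = 1',
    'password = hunter2',
    'password = example  # pragma: allowlist secret',
    'key = "AKIAABCDEFGHIJKLMNOP"',
]
eager_findings, eager_checks = scan(lines, check_first=True)
lazy_findings, lazy_checks = scan(lines, check_first=False)
print(eager_checks, lazy_checks)  # 8 3
```

Both modes report the same findings here. Because the delayed variant still runs every plugin on every line, a line holding several secrets of different types would still produce one finding per matching plugin.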

killuazhu · 2019-09-27T09:16:32Z

> I meant that we should delay the whitelist check until after the secret is found.

This sounds like a really good idea to reduce some overhead given secret lines are rarer than normal lines. @OiCMudkips

KevinHock · 2020-04-06T06:33:49Z

I can do this one, it's pretty easy.

This fixes issue #244. Only check the line for allowlist regexes or --exclude-lines if a secret was found.

KevinHock added the performance label Sep 25, 2019

KevinHock self-assigned this Apr 6, 2020

KevinHock added a commit that referenced this issue Apr 7, 2020

🎭 Improve the performance of line regexes

74d3787


KevinHock mentioned this issue Apr 7, 2020

Various small improvements #293

Merged

OiCMudkips closed this as completed Jul 9, 2020
