adding gibberish detector #416

domanchi · 2021-03-04T16:06:48Z

Summary

This leverages the gibberish-detector with a model based on all RFCs (at the time of writing). I wanted to use a source that was somewhat representative of Computer Science jargon: after all, we're attempting to look at variable names / strings in code, and trying to determine whether they are actually secret values. The typical "Complete List of Sherlock Holmes" as a corpus won't cut it.

This commit also introduces a few minor bug fixes, including:

configuring settings/plugins will now invalidate the cache
pre-commit hook now runs with the local detect-secrets-hook, which makes a lot more sense

Methodology

Turns out, you can download all RFCs to local disk for easier processing. So I did that, and ran this simple script to train the model.

from typing import Iterator
import argparse
import os
import string

from tqdm import tqdm

from gibberish_detector.model import Model
from gibberish_detector import serializer


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument('path', type=str, help='Path to extracted RFCs.')
    args = parser.parse_args()

    model = Model(string.ascii_lowercase)
    for content in tqdm(get_rfc_content(args.path)):
        for line in content.splitlines():
            model.train(line.strip().lower())

    print(serializer.serialize(model))


def get_rfc_content(path: str) -> Iterator[str]:
    for filename in os.listdir(path):
        # Only process the plain-text RFCs.
        if os.path.splitext(filename)[1] != '.txt':
            continue

        with open(os.path.join(path, filename)) as f:
            try:
                yield f.read()
            except UnicodeDecodeError:
                # Skip files that cannot be decoded with the default encoding.
                pass


if __name__ == '__main__':
    main()

Then:

$ time python -m rfc_processor ~/Documents/scratch/rfc > rfc.model
8753it [02:33, 57.12it/s]

TBH, I'm pretty surprised how quickly it went.
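For intuition about what "training" on a corpus buys us, here is a minimal sketch of a character-bigram model. It is a toy written for illustration and is not the gibberish-detector implementation: the `BigramModel` class, its add-one smoothing, and the `score` method are all assumptions. The idea is simply that character pairs seen often in the corpus are cheap, and unseen pairs are expensive.

```python
import math
import string
from collections import defaultdict
from typing import Dict, Iterator, Tuple


class BigramModel:
    """Toy character-bigram model (hypothetical, for illustration only)."""

    def __init__(self, charset: str) -> None:
        self.charset = set(charset)
        # counts[a][b] = number of times character b followed character a
        self.counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def _bigrams(self, text: str) -> Iterator[Tuple[str, str]]:
        # Simplification: drop characters outside the charset, then pair neighbours.
        chars = [c for c in text.lower() if c in self.charset]
        return zip(chars, chars[1:])

    def train(self, text: str) -> None:
        for a, b in self._bigrams(text):
            self.counts[a][b] += 1

    def score(self, text: str) -> float:
        """Average negative log-probability per transition (higher = stranger)."""
        total, transitions = 0.0, 0
        for a, b in self._bigrams(text):
            row = self.counts.get(a, {})
            # Add-one smoothing so unseen pairs get a small, non-zero probability.
            probability = (row.get(b, 0) + 1) / (sum(row.values()) + len(self.charset))
            total -= math.log(probability)
            transitions += 1
        return total / transitions if transitions else 0.0


model = BigramModel(string.ascii_lowercase)
model.train(
    'this document specifies an internet standards track protocol for the '
    'internet community and requests discussion and suggestions for improvements'
)
word_score = model.score('protocol')
noise_score = model.score('xqzjvkw')
```

Even with one sentence of training text, `noise_score` comes out higher than `word_score`, because none of the pairs in `xqzjvkw` appear in the corpus. A corpus full of technical jargon should push that separation in the right direction for identifiers found in code.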

Testing

I tested this model with all the FALSE_POSITIVES found in the KeywordDetector, and measured their scores. I also tested this with known secrets, and measured their scores. Finally, based on internal data, this was able to reduce false positive rates by a whopping ~60% -- which is pretty superb, IMO.

It looks like 3.7 is a pretty conservative bar for gibberish strings, based on this model (conservative defined here as preferring false positives over false negatives), with one false negative from the corpus (dummy).
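To make the role of the bar concrete, here is a rough sketch of how a limit like 3.7 could be applied to scored strings. It assumes that a higher score means "more gibberish-like" and that strings above the limit are kept as potential secrets; the `split_by_limit` helper is hypothetical, and the scores below are invented for illustration, not measured values from this model.

```python
from typing import Dict, List, Tuple

GIBBERISH_LIMIT = 3.7  # the bar discussed above


def split_by_limit(
    scores: Dict[str, float],
    limit: float = GIBBERISH_LIMIT,
) -> Tuple[List[str], List[str]]:
    """Return (kept, dropped); a string is kept when its score exceeds the limit."""
    kept = [value for value, score in scores.items() if score > limit]
    dropped = [value for value, score in scores.items() if score <= limit]
    return kept, dropped


# Hypothetical scores, made up purely to show the mechanics.
example_scores = {'password': 2.1, 'changeme': 2.8, 'x9Kq2LmZ': 5.4}
kept, dropped = split_by_limit(example_scores)
```

Lowering the limit keeps more strings (more false positives, fewer missed secrets), which is the conservative direction described above.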

Reviewers Notes

I need to bundle the rfc.model with the package, but I'm not sure whether I'm doing it right. I followed these instructions, and will test on test.pypi.org to see whether it works.

calvinli

lgtm

I need to bundle the rfc.model with the package, but I'm not sure whether I'm doing it right.

You can build a wheel locally and install it into a different folder and see if it works

domanchi · 2021-03-06T01:55:12Z

Looks like bundling works!

(venv) $ python setup.py bdist_wheel

$ virtualenv --python=python3.6 test_venv
$ source test_venv/bin/activate
(test_venv) $ pip install dist/detect_secrets-1.0.3-py2.py3-none-any.whl
(test_venv) $ detect-secrets scan test_data/each_secret.py | jq -c '.results["test_data/each_secret.py"]' | jq length
7

(test_venv) $ pip install gibberish-detector
(test_venv) $ detect-secrets scan test_data/each_secret.py | jq -c '.results["test_data/each_secret.py"]' | jq length
4
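A quicker sanity check, before installing anything, is to look inside the wheel directly, since a wheel is just a zip archive. This is a minimal sketch: `find_in_wheel` is a hypothetical helper, not part of detect-secrets, and it only confirms the file was packaged, not that it loads correctly at runtime.

```python
import zipfile
from typing import List


def find_in_wheel(wheel_path: str, suffix: str) -> List[str]:
    """Return the archive members whose path ends with `suffix`."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return [name for name in wheel.namelist() if name.endswith(suffix)]
```

For the wheel built above, calling `find_in_wheel('dist/detect_secrets-1.0.3-py2.py3-none-any.whl', 'rfc.model')` should return a non-empty list if the model was bundled.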

adding gibberish detector

2a06392

domanchi assigned calvinli Mar 4, 2021

calvinli approved these changes Mar 4, 2021

Merge branch 'master' of github.com:Yelp/detect-secrets into feature/…

82c7b1d

…adding-gibberish-detector

domanchi merged commit ab844fe into master Mar 6, 2021

domanchi deleted the feature/adding-gibberish-detector branch March 11, 2021 18:05
