adding gibberish detector #416

Merged 2 commits on Mar 6, 2021

domanchi (Contributor) commented Mar 4, 2021

Summary

This leverages the gibberish-detector with a model based on all RFCs (at the time of writing). I wanted to use a source that was somewhat representative of Computer Science jargon: after all, we're attempting to look at variable names / strings in code, and trying to determine whether they are actually secret values. The typical "Complete List of Sherlock Holmes" as a corpus won't cut it.

This commit also introduces a variety of minor bug fixes. These include:

  • configuring settings/plugins will now invalidate the cache
  • pre-commit hook now runs with the local detect-secrets-hook, which makes a lot more sense

Methodology

Turns out, you can download all RFCs to local disk for easier processing. So I did that, and ran this simple script to train the model.

from typing import Iterator
import argparse
import os
import string

from tqdm import tqdm

from gibberish_detector.model import Model
from gibberish_detector import serializer


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument('path', type=str, help='Path to extracted RFCs.')
    args = parser.parse_args()

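    # The model's character set is the lowercase ASCII letters.
    model = Model(string.ascii_lowercase)
    for content in tqdm(get_rfc_content(args.path)):
        # Train line by line on lowercased, whitespace-stripped text.
        for line in content.splitlines():
            model.train(line.strip().lower())

    # Write the serialized model to stdout so it can be redirected to a file.
    print(serializer.serialize(model))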


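def get_rfc_content(path: str) -> Iterator[str]:
    """Yield the text of every .txt file in `path`, skipping undecodable files."""
    for filename in os.listdir(path):
        if os.path.splitext(filename)[1] != '.txt':
            continue

        with open(os.path.join(path, filename)) as f:
            try:
                yield f.read()
            except UnicodeDecodeError:
                # Skip files that cannot be decoded with the default encoding.
                pass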


if __name__ == '__main__':
    main()

Then:

$ time python -m rfc_processor  ~/Documents/scratch/rfc > rfc.model
8753it [02:33, 57.12it/s]

TBH, I'm pretty surprised how quickly it went.

Testing

I scored this model against all the FALSE_POSITIVES found in the KeywordDetector, and against known secrets. Finally, based on internal data, this was able to reduce the false positive rate by a whopping ~60%, which is pretty superb, IMO.

It looks like 3.7 is a pretty conservative bar for gibberish strings, based on this model (conservative defined here as preferring false positives over false negatives), with one false negative from the corpus (dummy).
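
For readers who want intuition for what a score threshold means here, below is a minimal, self-contained sketch of a character-bigram scorer. It is an assumption about the general technique, not the gibberish-detector implementation: `BigramModel`, `is_gibberish`, the add-one smoothing, and the threshold of 3.0 are all hypothetical and chosen for this toy corpus only. Its scores are not comparable to the 3.7 bar above.

```python
import math
import string


class BigramModel:
    """Toy character-bigram model (hypothetical; not the gibberish_detector API)."""

    def __init__(self, charset: str = string.ascii_lowercase) -> None:
        self.charset = set(charset)
        # Start every pair count at 1 (add-one smoothing) so that unseen
        # transitions never get a probability of zero.
        self.counts = {a: {b: 1 for b in charset} for a in charset}

    def _pairs(self, text: str):
        # Simplification: characters outside the charset (spaces, digits)
        # are dropped, so transitions are counted across word boundaries.
        chars = [c for c in text.lower() if c in self.charset]
        return zip(chars, chars[1:])

    def train(self, text: str) -> None:
        for a, b in self._pairs(text):
            self.counts[a][b] += 1

    def score(self, text: str) -> float:
        """Average negative log-probability per transition (higher = stranger)."""
        pairs = list(self._pairs(text))
        if not pairs:
            return 0.0
        total = 0.0
        for a, b in pairs:
            row = self.counts[a]
            total -= math.log(row[b] / sum(row.values()))
        return total / len(pairs)


def is_gibberish(model: BigramModel, text: str, threshold: float) -> bool:
    # A lower threshold flags more strings as gibberish.
    return model.score(text) > threshold


model = BigramModel()
model.train("the server sends the response to the client")
print(model.score("the server"), model.score("xqzjvk"))
```

Strings made of transitions the model has seen often score low; strings of unseen transitions score near the uniform baseline or above it, which is why a single cut-off can separate the two groups.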

Reviewers Notes

I need to bundle the rfc.model with the package, but I'm not sure whether I'm doing it right. I followed these instructions, and will test on test.pypi.org to see whether it works.

calvinli (Member) left a comment

lgtm

> I need to bundle the rfc.model with the package, but I'm not sure whether I'm doing it right.

You can build a wheel locally, install it into a different folder, and see if it works.
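
As a quicker complementary check before installing anything, the wheel itself can be inspected, since a wheel is a zip archive. The sketch below is only an illustration: `wheel_contains` is a hypothetical helper, and it confirms only that the file was packaged, not that the installed package can load it.

```python
import zipfile


def wheel_contains(wheel_path: str, suffix: str) -> bool:
    """Return True if any file inside the wheel ends with `suffix`."""
    # A wheel is a zip archive, so the standard library can list its contents.
    with zipfile.ZipFile(wheel_path) as wheel:
        return any(name.endswith(suffix) for name in wheel.namelist())
```

For example, after building, `wheel_contains('dist/<built wheel>.whl', 'rfc.model')` should return True if the model file was bundled.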

domanchi (Contributor, Author) commented Mar 6, 2021

Looks like bundling works!

(venv) $ python setup.py bdist_wheel

$ virtualenv --python=python3.6 test_venv
$ source test_venv/bin/activate
(test_venv) $ pip install dist/detect_secrets-1.0.3-py2.py3-none-any.whl
(test_venv) $ detect-secrets scan test_data/each_secret.py | jq -c '.results["test_data/each_secret.py"]' | jq length
7

(test_venv) $ pip install gibberish-detector
(test_venv) $ detect-secrets scan test_data/each_secret.py | jq -c '.results["test_data/each_secret.py"]' | jq length
4
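
The same before/after comparison could be scripted rather than eyeballed. This is a rough sketch that mirrors the jq filter above; `count_results` is a hypothetical helper, and it assumes the scan output is JSON with a top-level `results` object mapping each filename to a list of findings.

```python
import json


def count_results(scan_output: str, filename: str) -> int:
    """Count the findings reported for `filename` in a scan's JSON output."""
    report = json.loads(scan_output)
    # Mirrors the jq filter: .results["<filename>"] | length
    # A missing file counts as zero findings.
    return len(report.get("results", {}).get(filename, []))
```

Capturing the scan output once without gibberish-detector installed and once with it, then comparing the two counts, gives the same 7 versus 4 check as the shell session.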

domanchi merged commit ab844fe into master on Mar 6, 2021
domanchi deleted the feature/adding-gibberish-detector branch on March 11, 2021 18:05