Add filtering of english words from entropy checks #240

Closed
KevinHock opened this issue Sep 16, 2019 · 1 comment
@KevinHock
Collaborator

By "filter" I mean: before alerting on a candidate secret, we check whether the string contains an English word (of length 4 or greater). If it does, we don't alert on it, which reduces false positives.

To get the wordlist, we can either:

  • add an option such as --word-list words.txt
    or
  • do what the "How Bad Can It Git?" paper did and hard-code a list of 2,298 English words.

Once we have the word list, we need an Aho-Corasick-style algorithm to efficiently tell whether a secret contains an English word. We can either:

For a first iteration, accepting a word list argument and using a library seems to be the quickest way to get an MVP.
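
To make the matching step concrete, here is a minimal pure-Python sketch of an Aho-Corasick automaton. It is not the code from the PR: the names `build_automaton` and `contains_word` are hypothetical, case-insensitive matching is an assumption, and in practice a library such as `pyahocorasick` would replace all of this.

```python
from collections import deque


def build_automaton(words):
    """Build a small Aho-Corasick automaton (hypothetical helper)."""
    goto = [{}]         # per-node transitions: character -> node index
    fail = [0]          # failure link of each node
    terminal = [False]  # True if some word ends at this node

    # Insert every word into a trie.
    for word in words:
        if not word:
            continue  # an empty word would match every string
        node = 0
        for ch in word.lower():
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                terminal.append(False)
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        terminal[node] = True

    # Breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            fallback = fail[node]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[child] = goto[fallback].get(ch, 0)
            # A node also counts as a match if its failure target does.
            terminal[child] = terminal[child] or terminal[fail[child]]
            queue.append(child)
    return goto, fail, terminal


def contains_word(candidate, automaton):
    """Return True if any word in the automaton occurs inside candidate."""
    goto, fail, terminal = automaton
    node = 0
    for ch in candidate.lower():
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        if terminal[node]:
            return True
    return False


automaton = build_automaton(["http", "json", "value"])
print(contains_word("xXhTtPzz", automaton))          # True
print(contains_word("c3VwZXIgc2VjcmV0", automaton))  # False
```

The scan visits each character of the candidate once, so the cost per secret does not grow with the number of words in the list, which is the reason to prefer this over checking every word separately.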

From the How Bad Can It Git? paper:

Words Filter

Another intuition is that a random string should not contain linguistic sequences of characters [12]. For this check, we compiled a dictionary of English words of length at least as long as a defined threshold. Then we searched each candidate string for each one of these words and failed the check if detected.

A trade-off exists in choosing this threshold. If it is too small, randomly occurring sequences that happen to create words will create false negatives (marking valid secrets as invalid), but if it is too large, legitimate words will be missed and create false positives (marking invalid secrets as valid). In our experiments, we set the word length threshold to be 5. This threshold was chosen as a best judgment after careful manual review; unfortunately, experimental derivation of this threshold was not possible given limited initial ground truth.

A dictionary of every English word would contain words that would not likely be used as part of a string in a code file and cause high amounts of false negatives. Therefore, we took the intersection of an English dictionary [45] and a dictionary of the 5 most common words used in source code files on GitHub [40]. The resulting dictionary contained the 2,298 English words that were likely to be used within code files, reducing the potential for false negatives.
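
The dictionary construction the excerpt describes (intersect two word sources, then apply the length threshold) could be sketched roughly as follows. The function name `build_dictionary` and the sample words are hypothetical; the paper's actual sources are its references [45] and [40], which are not reproduced here.

```python
def build_dictionary(english_words, code_words, min_length=5):
    """Intersect two word collections and keep words of at least min_length.

    Hypothetical helper; min_length=5 mirrors the threshold quoted above.
    """
    english = {w.strip().lower() for w in english_words}
    code = {w.strip().lower() for w in code_words}
    return {w for w in english & code if len(w) >= min_length}


# Made-up sample data, for illustration only.
english = ["value", "debug", "zymurgy", "info"]
code = ["value", "debug", "info", "foo"]
print(sorted(build_dictionary(english, code)))  # ['debug', 'value']
```

Here "zymurgy" is dropped because it does not appear in the code-word source, and "info" is dropped because it is shorter than the threshold.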

@KevinHock KevinHock self-assigned this Sep 16, 2019
KevinHock added a commit that referenced this issue Sep 19, 2019
- Add `pyahocorasick` as an optional dependency

See issue #240 for more information.
@KevinHock
Collaborator Author

The PR was merged, so I'm going to close this. I'll release a new version soon (today or tomorrow).

Note for posterity: aside from a base dictionary such as /usr/share/dict/words, you'll probably have to add entries like the following to get the most use out of this functionality:

  .org
  addr
  http
  attr
  href
  html
  yaml
  info
  %.2d
  json
  uri
  debug
  value
  123456789
  abcd
  !@#$%^&*(
  utf-8
  ISO-
  %2c
  %3A
  Mon:
  Fri:
  Sat:
  Mon-
  Fri-
  Sat-
  5:30PM
  10AM
  6PM
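
One way to combine a base dictionary with extra tokens like the ones above is sketched below. The helper `build_word_list` is hypothetical, and two choices in it are assumptions rather than documented behavior: the length filter is applied only to the base words (several of the extras, such as uri and 6PM, are shorter than 4 characters), and everything is lowercased on the assumption that matching is case-insensitive.

```python
def build_word_list(base_words, extra_tokens, min_length=4):
    """Merge base dictionary words with hand-picked extra tokens.

    Hypothetical helper: base words shorter than min_length are dropped,
    extra tokens are kept regardless of length.
    """
    words = set()
    for word in base_words:
        word = word.strip().lower()
        if len(word) >= min_length:
            words.add(word)
    for token in extra_tokens:
        token = token.strip().lower()
        if token:
            words.add(token)
    return sorted(words)


# Small in-memory stand-ins for a dictionary file and the extras above.
base = ["a", "Word\n", "tree"]
extras = ["uri", "ISO-"]
print(build_word_list(base, extras))  # ['iso-', 'tree', 'uri', 'word']
```

The result can be written out one entry per line and passed as the word list file.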

killuazhu pushed a commit to IBM/detect-secrets that referenced this issue May 28, 2020
)

* Update CONTRIBUTING.md to outline detector development process

Supports git-defenders/detect-secrets-discuss#312

* Minor wording update

* Address comments
killuazhu pushed a commit to IBM/detect-secrets that referenced this issue Jul 9, 2020
)

* Update CONTRIBUTING.md to outline detector development process

Supports git-defenders/detect-secrets-discuss#312

* Minor wording update

* Address comments
killuazhu pushed a commit to IBM/detect-secrets that referenced this issue Sep 17, 2020
)

* Update CONTRIBUTING.md to outline detector development process

Supports git-defenders/detect-secrets-discuss#312

* Minor wording update

* Address comments