feat(secretsscan): add 20 more secret patterns#2692
Conversation
covers discord (bot tokens + webhooks), telegram, fly.io, groq, perplexity, xai, cohere, buildkite, circleci, cloudinary, mongodb / postgres / azure-storage connection strings, mapbox secret keys, vault batch + recovery tokens, netlify, asana, and cloudflare origin ca keys. also bumps kwMask from [2]uint64 to [4]uint64 to make room.
rumpl
left a comment
There was a problem hiding this comment.
Sure.
Question, doesn't a secret scanning library already exist?
docker-agent
left a comment
There was a problem hiding this comment.
Assessment: 🟢 APPROVE
20 new secret detection rules reviewed. The kwMask expansion from [2]uint64 to [4]uint64 is arithmetically correct, all keyword/regex pairs are properly aligned (mixed-case keywords are correctly normalized via strings.ToLower in compiledRuleSet), operator precedence in overlaps() is correct per the Go spec (| has higher precedence than !=), and the named (?P<secret>...) groups in the MongoDB/Postgres/Azure rules redact only the password/key span.
One low-severity concern noted inline.
| // / `AccountName` framing is only metadata. The base64 value | ||
| // is typically 88 chars (44-byte key) but we accept anything | ||
| // from 20 chars upwards to cover shorter SAS-signing keys. | ||
| expression: `DefaultEndpointsProtocol=https?;AccountName=[^;]+;AccountKey=(?P<secret>[A-Za-z0-9+/=]{20,})`, |
There was a problem hiding this comment.
[LOW] Azure AccountKey regex may over-redact on connection strings missing a semicolon separator
The (?P<secret>[A-Za-z0-9+/=]{20,}) capture group is unbounded and greedily consumes any [A-Za-z0-9+/=] characters. For a well-formed Azure connection string the semicolons between fields act as natural stop-bytes — but if a consumer assembles a string without the trailing semicolon the regex swallows text beyond the key value.
Reproducible example:
"DefaultEndpointsProtocol=https;AccountName=foo;AccountKey=AAAA==BlobEndpoint=http://127.0.0.1"
Captured secret: "AAAA==BlobEndpoint=http" (the field name BlobEndpoint is entirely in [A-Za-z]).
Impact: only over-redaction (the secret is still detected), not a missed detection. Well-formed strings produced by Azure SDKs always include semicolons so this is unlikely to be triggered in practice. A safe fix is to add an upper bound, e.g. {20,200}, or to change the character class to exclude = at the end of the value by anchoring on (?:;|$), e.g.:
`DefaultEndpointsProtocol=https?;AccountName=[^;]+;AccountKey=(?P<secret>[A-Za-z0-9+/=]{20,200})(?:;|$)`
Adds 20 new detection rules to
pkg/secretsscan, on top of the existing upstream-derived catalogue:[MNO]<24>.<6>.<27+>(3-part dotted)https://discord(?:app)?.com/api/webhooks/<id>/<token><8-10 digits>:AA<33 base64url>FlyV1 fm2_…gsk_<52>pplx-<48-56>xai-<80>co_<40>bkua_<40>CCIPRJ_<org>_<token>cloudinary://<key>:<secret>@<cloud>mongodb(?:+srv)?://user:<pwd>@…(only the password is redacted)postgres(?:ql)?://user:<pwd>@…(only the password is redacted)DefaultEndpointsProtocol=…;AccountKey=<base64>(only the key is redacted)sk.<60>.<22>hvb.<90-200>hvr.<90-200>nfp_<40>1/<≥14 digits>:<32 hex>v1.0-<32 hex>-<146 base64>Each rule has a documented vendor prefix or framing structure unique enough to keep the keyword pre-filter useful and the false-positive rate low; the in-source comments cite the format they target.
Supporting changes
pkg/secretsscan/aho.go:kwMaskgrown from[2]uint64(128-bit) to[4]uint64(256-bit). The catalogue went from 121 → 149 unique keywords, exceeding the previous 128 cap.empty,overlaps,scan,buildAhoCorasick's panic limit, and the BFS merge are updated accordingly.pkg/secretsscan/aho_test.go:TestAhoCorasickPanicOnTooManyPatternsnow exercises the 257-pattern boundary;TestKwMaskOperationscovers bits in words 3 and 4.pkg/secretsscan/secrets_test.go: 21 new detection cases inTestContainsSecretsRecognisesKnownTokens(Discord gets two, one per first-segment letter class).pkg/secretsscan/edge_cases_test.go(new): three suites pinning the new behaviour:TestNoisyKeywordsFalsePositives— common noise (12:00 AA,1/2 cup,v1.0-beta,disk., …) does not trip detection.TestDiscordTokenPrefixes— every documented Discord token first-segment prefix (MT/Mz/ND/NT/Nz/OD) is detected; this caught a regex/keyword skew during review and now guards against future drift.TestConnectionStringContextPreservation— the new MongoDB / Postgres / Azure-storage rules redact only the(?P<secret>…)span and keep scheme + username + host visible.Validation
go test ./pkg/secretsscan/... ./pkg/hooks/...— all passgolangci-lint run ./...— 0 issuesBenchmarkRedactCleanInput: 19.1 µs/op (was 18.6 µs)BenchmarkRedactWithSecret: 4.3 µs/op (unchanged)