Heuristics for "random" values to help with base-encoded typo false positives #484

repi · 2022-05-07T10:00:31Z

This is a base58-encoded string from our codebase. Is there some heuristic the typo checker could use to recognize that such a long "random" string is not a word, and so not suggest anything for it? This was part of a larger JSON string in a test.

error: `Wew` should be `We`
  --> ./desc.rs:200:49
    |
200 |             "bytes_cid": "z177xERgbqgBdC97Y5GYXZWew1cFgkttqr5ipF2b8iCN17",
    |                                                 ^^^
    |

Here is another similar one, also from an embedded JSON string:

error: `nd` should be `and`
  --> ./test.rs:31:221
   |
31 | pub const JWK: &str = r#"{"alg":"sig","n":"wnI2iD6F7qAg0qKGpFQ6L7qYdGbPkHSUHzigaW3p89fWBbZRT-WawqdU4vu3vANL9whlXMGlzLsPNUwXsoDKu6CnzAUUO9pr7E6CukN9A1UN13L-ZRKHAGv33NkdygDpTsYXUVAoQLykPnjToNVDKA0ohy96kzPkT4vql9n_5ev7Dhy69nd79mI09QhHo62RGzZDDanjdjXRBLBFA3Hm-CKiu"]}"#;
   |                                                                                                                                                                                                                             ^^
   |

These are the last two major false positives we've been seeing in our codebase with typos. It works really well otherwise!

epage · 2022-05-08T00:58:23Z

Yes, we have several issues related to hashes / base encodings of some sort.

Having some kind of heuristic to discard hashes / base encodings beyond a strict syntax check would be a big help. What that would look like is the question, though. To start off brainstorming:

- X numbers (groups of digits) in the string
- X "words" (groups of letters) shorter than Y characters

We can probably treat base encodings the same as hashes (i.e. no special heuristics for how "much" of a word exists between `-` characters to protect against math between variables), since identifiers used in math will show up somewhere else in the code and get flagged there.

Any other ideas for heuristics and for what the Xs and Ys should be?
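To make the brainstorm concrete, here is a minimal Rust sketch of the two counts above. It is only an illustration, not how typos works today: the names `RunStats`, `run_stats`, and `looks_random` are hypothetical, the thresholds passed in are placeholders for the open X and Y values, and it assumes ASCII letters and digits only.

```rust
/// Counts gathered from one candidate string (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct RunStats {
    /// Number of maximal runs of ASCII digits.
    digit_groups: usize,
    /// Number of maximal runs of ASCII letters shorter than the cutoff.
    short_words: usize,
}

/// Walks the string once and counts digit runs and short letter runs.
/// `max_short_len` plays the role of "Y" in the brainstorm above.
fn run_stats(s: &str, max_short_len: usize) -> RunStats {
    let mut stats = RunStats { digit_groups: 0, short_words: 0 };
    let mut chars = s.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_ascii_digit() {
            // Consume the whole digit run and count it once.
            while chars.next_if(|d| d.is_ascii_digit()).is_some() {}
            stats.digit_groups += 1;
        } else if c.is_ascii_alphabetic() {
            // Consume the whole letter run and remember its length.
            let mut len = 0usize;
            while chars.next_if(|l| l.is_ascii_alphabetic()).is_some() {
                len += 1;
            }
            if len < max_short_len {
                stats.short_words += 1;
            }
        } else {
            // Separators such as `-` or `_` only split runs.
            chars.next();
        }
    }
    stats
}

/// Treats a string as "random" when both counts reach their thresholds.
/// The thresholds are the undecided "X" values; callers pick them.
fn looks_random(
    s: &str,
    min_digit_groups: usize,
    min_short_words: usize,
    max_short_len: usize,
) -> bool {
    let stats = run_stats(s, max_short_len);
    stats.digit_groups >= min_digit_groups && stats.short_words >= min_short_words
}
```

With a cutoff of 4, the base58 value from the original report has 8 digit groups and 5 short letter runs, so thresholds as low as 3 and 3 would already discard it, while an ordinary word has no digit groups at all. Whether such thresholds are safe for real identifiers like `sha256` is exactly the open question.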

#316 has a list of alternative approaches. Feel free to share how useful or not those approaches would be in that issue.

repi · 2022-08-05T13:03:45Z

I think the most important thing would be to have a way to opt out of tricky situations. For example, you may want to include a text string that has typos in it in test code or similar, and there will be cases that are difficult to detect properly, such as these types of base-encoded numbers or JSON strings and other such things.

So having some solution like the ones in #316 to opt out would be a great, robust fallback. For our particular use cases, a way to disable handling through comments that enable/disable the spell check would work.
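As a rough illustration of the comment-based opt-out, the Rust sketch below tracks an on/off state while walking the lines of a file. The marker strings `spellcheck:off` and `spellcheck:on` and the function `checked_lines` are made up for this example and are not existing typos syntax.

```rust
/// Returns the 1-based numbers of the lines that would still be checked.
/// A line ending in the hypothetical marker `spellcheck:off` disables
/// checking until a line ending in `spellcheck:on` re-enables it.
/// Marker lines themselves are never checked.
fn checked_lines(source: &str) -> Vec<usize> {
    let mut enabled = true;
    let mut checked = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        let trimmed = line.trim();
        if trimmed.ends_with("spellcheck:off") {
            enabled = false;
            continue;
        }
        if trimmed.ends_with("spellcheck:on") {
            enabled = true;
            continue;
        }
        if enabled {
            checked.push(idx + 1);
        }
    }
    checked
}
```

Matching on the end of the line keeps the sketch independent of the comment syntax (`//`, `#`, and so on) of the file being checked.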

boris-smidt-klarrio · 2024-08-22T13:32:33Z

Maybe this will help: there is 'ripsecrets', a tool written in Rust that uses ripgrep to find secrets in an existing project.

So these regexes could be reused for 'high heuristic' values:
https://github.com/sirwart/ripsecrets/blob/main/src/lib.rs

epage · 2024-08-22T15:43:38Z

@boris-smidt-klarrio thanks! For now, I've at least linked to that in the docs in 8b729e1

boris-smidt-klarrio · 2024-08-22T16:30:04Z

Not sure if it will work with the ignores because of the way the tokenizer works. I had a look at it, but I assumed it keeps splitting tokens until it finds UUIDs, words, or numbers. So is there a setting to add other entries to the tokenizer with these regexes?

epage · 2024-08-22T17:10:00Z

@boris-smidt-klarrio `extend-ignore-re` is independent of the tokenizer. If we see a typo, we run `extend-ignore-re` against it and see if the typo is within the matched range. This is different from `extend-ignore-identifiers-re` and `extend-ignore-words-re`, which work on tokenized values.
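The range check described here can be sketched in Rust as follows. This is not the typos implementation: to stay free of dependencies, a hand-written scan for long runs of ASCII alphanumerics, `-`, and `_` stands in for a user-supplied regex, and the names `ignored_ranges` and `is_suppressed` are hypothetical.

```rust
use std::ops::Range;

/// Stand-in for an ignore pattern: byte ranges of runs of ASCII
/// alphanumerics, `-`, or `_` that are at least `min_len` bytes long.
fn ignored_ranges(line: &str, min_len: usize) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start: Option<usize> = None;
    for (i, b) in line.bytes().enumerate() {
        let in_run = b.is_ascii_alphanumeric() || b == b'-' || b == b'_';
        match (in_run, start) {
            // A run begins at this byte.
            (true, None) => start = Some(i),
            // A run just ended; keep it only if it is long enough.
            (false, Some(s)) => {
                if i - s >= min_len {
                    ranges.push(s..i);
                }
                start = None;
            }
            _ => {}
        }
    }
    // Handle a run that extends to the end of the line.
    if let Some(s) = start {
        if line.len() - s >= min_len {
            ranges.push(s..line.len());
        }
    }
    ranges
}

/// A reported typo is dropped when its byte range lies entirely
/// inside one of the ignored ranges.
fn is_suppressed(typo: &Range<usize>, ignored: &[Range<usize>]) -> bool {
    ignored
        .iter()
        .any(|r| r.start <= typo.start && typo.end <= r.end)
}
```

Because the check compares byte ranges on the original text, it does not matter how the tokenizer split the value into identifiers and words before the typo was reported.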

boris-smidt-klarrio · 2024-08-26T09:22:41Z

@epage Thank you, it works!

repi changed the title ~~Long base-encoded typo false positive~~ Long base-encoded typo false positives May 7, 2022

epage mentioned this issue May 10, 2022

False positive in filename, hugo markdown custom shortcode #485

Closed

epage added the bug Not as expected label May 23, 2022

epage mentioned this issue Jun 15, 2022

Hashes/encodings below the heuristic limit are treated as typos #415

Open

moreal mentioned this issue Jun 27, 2022

ci(gh-actions): introduce typos job planetarium/lib9c#1155

Closed

epage changed the title ~~Long base-encoded typo false positives~~ Base-encoded typo false positives Aug 1, 2022

epage mentioned this issue Aug 1, 2022

Hex/base64 detection is not aggressive enough #526

Closed

epage mentioned this issue Sep 12, 2022

Bare hex colors are being treated as typos #568

Open

epage mentioned this issue Dec 29, 2022

Shell command short parameters typo false positive #643

Closed

epage mentioned this issue Jul 3, 2023

Project full of JWT keys (hexadecimal) #775

Closed

epage mentioned this issue Dec 13, 2023

Bad case when used with cert #883

Closed

Borda mentioned this issue Mar 31, 2024

lint: add typos check gitpython-developers/GitPython#1888

Merged

epage changed the title ~~Base-encoded typo false positives~~ Heuristics for "random" values to help with base-encoded typo false positives Apr 3, 2024

This was referenced Apr 3, 2024

False positive for random strings #978

Closed

False positive for commits id #982

Closed

Multiline strings cause false positives #984

Closed

epage mentioned this issue Aug 21, 2024

Add option to ignore high entropy strings (like secrets) #1080

Closed
