Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find some way to tolerate hexadecimal better #326

Closed
nmathewson opened this issue Aug 3, 2021 · 6 comments · Fixed by #330
Closed

Find some way to tolerate hexadecimal better #326

nmathewson opened this issue Aug 3, 2021 · 6 comments · Fixed by #330
Labels
enhancement Improve the expected

Comments

@nmathewson
Copy link
Contributor

I have a program that uses a fair amount of hex strings in order to represent binary sequences in its tests. But this doesn't sit so well with typos, which gives lots of false positives like:

error: `ba` should be `by`, `be`
  --> ./tor-llcrypto/src/pk.rs:77:21
   |
77 |              0816ed13ba3303ac5deb911548908025"

It seems that this is a byproduct of treading any sequence of non-numeric wordlike characters as a possible word, even if they occur adjacent to numbers, not specified by spaces.

@epage
Copy link
Collaborator

epage commented Aug 3, 2021

0816ed13ba3303ac5deb911548908025"

Is there anything that can help identify these as hex?

For example, we exclude

  • 0x<hexdigit>
  • UUIDs
  • Hashes between 40 (sha1) and 64 (sha256) characters
  • Base64 that is at least 90 characters long and contain the appropriate padding

We could exclude number prefixed but that will only help in 10/16 cases.

It seems that this is a byproduct of treading any sequence of non-numeric wordlike characters as a possible word, even if they occur adjacent to numbers, not specified by spaces.

This is because we are focused on spell checking code and want to get all words within identifiers, regardless of whether they split on case (e.g. PascalCase) or not (e.g. snake_case)

@nmathewson
Copy link
Contributor Author

nmathewson commented Aug 4, 2021

Hashes between 40 (sha1) and 64 (sha256) characters

Thinking aloud:

How about "strings of at least 32 characters containing only hexadecimal characters"?

Alternatively, excluding r"xxx" strings would probably work for me, and might make regex-users happier. (Though it might make somebody else unhappy.)

Alternatively, "strings within hex!()" would work here, though that is probably less general.

In the most general case, it might be okay to have a way to annotate something as "not text".

These would all work for my use case; do you think any of them are clean enough to use in practice?

@epage
Copy link
Collaborator

epage commented Aug 4, 2021

How about "strings of at least 32 characters containing only hexadecimal characters"?

While its unlikely for someone to spell out an identifier 32 characters long with only hex digits, I don't think its impossible. I guess the question is how likely it would happen since we do err on the side of not spell checking vs reporting false positives.

In the most general case, it might be okay to have a way to annotate something as "not text".

See #316

@nmathewson
Copy link
Contributor Author

While its unlikely for someone to spell out an identifier 32 characters long with only hex digits, I don't think its impossible. I guess the question is how likely it would happen since we do err on the side of not spell checking vs reporting false positives.

I think this is rare in practice. To try to measure this empirically, I used this ripgrep command pipe:

rg -I -o '[a-fA-F0-9]{32,}' | sort |uniq > all_hex_strings

I then ran this command against my ~/src directory (where I put software that I work on), and found 105204 distinct hex strings. That's too many to scan by hand, so I took a random sample of 1000. I then read all of those 1000 strings, and did not find any that were real text.

I reproduced this same result in my ~/build directory (where I put software that I build but do not work on) and got the same results. (Sampled 1000 strings out of 189683 and found none with real text.)

So I think that this rule is probably safe? You might want to double-check it on our own collection of software, though, to be sure.

scop added a commit to scop/hashpipe that referenced this issue Aug 15, 2021
For more frequent updates, more and quicker checks. Not that for a
project this size speed of spellcheck would matter at all.

Stick with 1.0.11 due to crate-ci/typos#326
for now.
@nmathewson
Copy link
Contributor Author

Friendly ping. If you agree with my analysis above, I'm happy to try to write a patch here. Alternatively, if the analysis isn't convincing, I'd be happy to re-run a similar experiment with a corpus, or sample size of your choosing.

@epage
Copy link
Collaborator

epage commented Aug 20, 2021

I think restricting ourselves to 32 same-case hex characters would work.

@epage epage added the enhancement Improve the expected label Aug 20, 2021
nmathewson added a commit to nmathewson/typos that referenced this issue Aug 20, 2021
By experimentation (see ticket), it seems that same-case hexadecimal
strings of 32 characters or longer are almost never intended to hold
text.  By treating such strings as ignored, we can resist a larger
category of false positives.

Closes crate-ci#326.
nmathewson added a commit to nmathewson/typos that referenced this issue Aug 20, 2021
By experimentation (see ticket), it seems that same-case hexadecimal
strings of 32 characters or longer are almost never intended to hold
text.  By treating such strings as ignored, we can resist a larger
category of false positives.

Closes crate-ci#326.
nmathewson added a commit to nmathewson/typos that referenced this issue Aug 20, 2021
By experimentation (see ticket), it seems that same-case hexadecimal
strings of 32 characters or longer are almost never intended to hold
text.  By treating such strings as ignored, we can resist a larger
category of false positives.

Closes crate-ci#326.
nmathewson added a commit to nmathewson/typos that referenced this issue Aug 20, 2021
By experimentation (see ticket), it seems that same-case hexadecimal
strings of 32 characters or longer are almost never intended to hold
text.  By treating such strings as ignored, we can resist a larger
category of false positives.

Closes crate-ci#326.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improve the expected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants