Add support for hash sum files (tools like sha256sum) #5130

Closed
theonlypwner opened this issue Jan 5, 2021 · 10 comments · Fixed by #5138
Labels
Add Language, Good First Issue

Comments

@theonlypwner
Contributor

Tools like md5sum, shasum, sha1sum, sha256sum, sha3sum, and b2sum output one line per input file:

  • hash digest in hexadecimal
  • space
  • space or asterisk (indicates text or binary mode, respectively)
  • file name
  • newline

If the file was stdin, the file name is -.

For example, in binary mode, the output looks like this:

0000000000000000000000000000000000000000 *file0
0000000000000000000000000000000000000001 *file1

In text mode, the output looks like this:

0000000000000000000000000000000000000000  file0
0000000000000000000000000000000000000001  file1
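The line layout described above can be sketched as a small parser. This is only an illustration of the format as listed in this comment, not the grammar Linguist uses; the names `SumLine` and `parse_sum_line` are made up for the example, and it does not handle any escaping a particular tool might apply to unusual file names.

```python
import re
from typing import NamedTuple, Optional


class SumLine(NamedTuple):
    digest: str   # hexadecimal digest, lower-cased
    binary: bool  # True if the mode marker was an asterisk
    name: str     # file name, or "-" for stdin


# Hex digest, one space, then a space (text) or asterisk (binary), then the name.
SUM_LINE = re.compile(r"^([0-9A-Fa-f]+) ([ *])(.+)$")


def parse_sum_line(line: str) -> Optional[SumLine]:
    """Parse one line of md5sum/sha256sum-style output, or return None."""
    match = SUM_LINE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    digest, mode, name = match.groups()
    return SumLine(digest.lower(), mode == "*", name)
```

For instance, `parse_sum_line("0000000000000000000000000000000000000000 *file0")` yields a `SumLine` with `binary=True` and `name="file0"`, while a line that does not start with hex digits yields `None`.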

cksum is similar, but each line consists of the checksum in decimal, a space, the file size (with all trailing null bytes trimmed), a space, and the file name. If the input was stdin, the last two parts (the space and the file name) are omitted.

@Alhadis
Collaborator

Alhadis commented Jan 5, 2021

There's also OpenBSD's checksumming tools that generate output of the form:

SHA256 (BOOTAA64.EFI) = 9fc44c7ba4f0d13f928835656567fa3ab454a0edcd88a8cbfe6d61f8b6ec183c
SHA256 (BUILDINFO) = 32dc74faeccaffcab5e155c5c7b070dfeaab14b28862f50759b4b83df4d241af
SHA256 (INSTALL.arm64) = 861045854729dd67067fc35f55bf2a6387b7b53e0c0f843c97debd1bda1c676c
SHA256 (base68.tgz) = a8917ac83fc104449b7eaa507a20d493b18ae4ffdd394bdb3c520be4ebd68af5
SHA256 (bsd) = 29889c632e967172c3cd813e1655be79b89168771f1a8b25b683a11502272c79
SHA256 (bsd.mp) = 0ab9d4a1f984bf9ac9d2b3727d2f3645333e6315a003ad7b73bdfdcbb2cf41bd
SHA256 (bsd.rd) = 203952bde49260a2eae39ee7d5d4c93d9f1006b547b65338ea8bc344168b7c77

Syntax aside, this sounds like it would be doable. We just need to have a concrete list of filenames to go by (e.g., SHA256, CHECKSUM.txt, et al). Any recommendations?
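As a rough sketch, the BSD-style lines shown above could be split into their three parts like this. The pattern is inferred only from the sample output in this comment; `parse_bsd_line` is a hypothetical helper, and the set of characters allowed in the algorithm name is an assumption.

```python
import re
from typing import Optional, Tuple

# ALGORITHM (file name) = hexdigest
# The greedy ".+" lets file names that themselves contain parentheses through,
# because the match must still end with ") = <hex>".
BSD_LINE = re.compile(r"^([A-Za-z0-9-]+) \((.+)\) = ([0-9A-Fa-f]+)$")


def parse_bsd_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (algorithm, file name, digest) or None if the line does not fit."""
    match = BSD_LINE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    algorithm, name, digest = match.groups()
    return algorithm, name, digest.lower()
```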

lildude added the Add Language and Good First Issue labels Jan 6, 2021
@theonlypwner
Contributor Author

theonlypwner commented Jan 10, 2021

Ubuntu uses SHA256SUMS as the file name.

The list should include file names for various hashes (MD5, SHA1, SHA224, SHA256, SHA384, SHA512, etc.) like

  • <HASH>
  • <HASH>.txt
  • <hash>.txt
  • <HASH>SUM
  • <HASH>SUMS

plus file names that do not include the hash type, like

  • CHECKSUM.txt
  • CHECKSUMS.txt

If there are other common file names, they should be added, and if some are not common, they should be removed.
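To make the proposed list concrete, here is a minimal sketch that expands the name templates above for the hashes mentioned. The function name `candidate_filenames` is invented for this example, and the resulting list is only the proposal from this comment, not what Linguist ended up registering.

```python
from typing import Iterable, List

# Hash names taken from the list above; the "etc." is deliberately not guessed at.
HASHES = ["MD5", "SHA1", "SHA224", "SHA256", "SHA384", "SHA512"]


def candidate_filenames(hashes: Iterable[str] = HASHES) -> List[str]:
    """Expand the <HASH> templates, then append the hash-agnostic names."""
    names: List[str] = []
    for h in hashes:
        names += [h, f"{h}.txt", f"{h.lower()}.txt", f"{h}SUM", f"{h}SUMS"]
    names += ["CHECKSUM.txt", "CHECKSUMS.txt"]
    return names
```

With the six hashes above this produces 32 names, including `SHA256SUMS` as used by Ubuntu.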

It should also support ``` blocks, possibly something like one of these: ```hash, ```sum, ```sums

Alhadis added a commit to Alhadis/language-etc that referenced this issue Jan 11, 2021
@Alhadis
Collaborator

Alhadis commented Jan 11, 2021

@theonlypwner If you feel like submitting a pull-request, I've added a basic grammar that highlights checksum lists. Since the language-etc repository is already used by Linguist, there's no need to add a new submodule (although you may need to bump the submodule's commit).

@theonlypwner
Contributor Author

What changes need to be made to Linguist? Is it just adding an entry to 3 files (lib/linguist/languages.yml, grammars.yml, vendor/README.md) and updating the submodule vendor/grammars/language-etc?

Also, in Alhadis/language-etc@2f8309a, would (?=\\S{24}) work for short hashes? For example, cksum might output 0 4 filename or 0 4 (for 4 bytes from stdin that have a checksum of zero). Would another entry be needed to handle cksum, where the third capture group would be a file size instead of a file name? Should we just not support cksum?
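The concern about short hashes can be checked in isolation. The sketch below tests only the quoted lookahead, assuming the doubled backslash is string escaping for `\S` and that the lookahead applies at the start of the line; the rest of the grammar's pattern is not reproduced here, and `passes_lookahead` is a hypothetical helper.

```python
import re

# Only the lookahead quoted above: at least 24 non-whitespace characters
# must follow the start of the line.
LOOKAHEAD = re.compile(r"^(?=\S{24})")


def passes_lookahead(line: str) -> bool:
    """True if the line begins with a run of 24 or more non-whitespace characters."""
    return LOOKAHEAD.match(line) is not None
```

Under these assumptions a 40-digit digest line passes, while cksum-style lines such as `0 4 filename` or `0 4` do not, since their first token is far shorter than 24 characters.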

@Alhadis
Collaborator

Alhadis commented Jan 12, 2021

> What changes need to be made to Linguist?

The process is documented in CONTRIBUTING.md § Adding a language. It's okay if you don't have the time or motivation; I'm sure I'll get around to adding it when I have the time. 👍

The time-consuming part isn't so much the changes to Linguist, but researching in-the-wild usage using Harvester. Of course, that step becomes moot if a search tosses up a six-figure number of results, at which point we can safely assume there's 100+ users among them. 😉

> Should we just not support cksum?

Ah sorry, I could have been clearer on that point... it's common for checksum lists to contain blocks of prose before and/or after the actual checksums. I wasn't comfortable highlighting cksum(1) output, as it has a wide potential for errant matches. Consider a line of the form date time some arbitrary string:

20200112 3 files removed john.doe

Such a line would match a CRC-32 of 20200112 for a 3-byte file named files removed john.doe. Whether it's computationally feasible isn't really relevant; it's better to have no highlighting than wrong highlighting. 😉
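The false positive is easy to reproduce with a naive pattern for cksum(1) output. This is a sketch of the kind of rule being argued against, not anything taken from the grammar; `CKSUM_LINE` and `parse_cksum_line` are names invented for the example.

```python
import re
from typing import Optional, Tuple

# Decimal checksum, space, decimal size, then optionally a space and a file name.
CKSUM_LINE = re.compile(r"^(\d+) (\d+)(?: (.+))?$")


def parse_cksum_line(line: str) -> Optional[Tuple[int, int, Optional[str]]]:
    """Return (checksum, size, name) for anything shaped like cksum output."""
    match = CKSUM_LINE.match(line)
    if match is None:
        return None
    checksum, size, name = match.groups()
    return int(checksum), int(size), name


# The log-like line is happily "parsed" as a checksum entry.
print(parse_cksum_line("20200112 3 files removed john.doe"))
```

Any prose line that happens to start with two numbers is accepted, which is exactly the errant match described above.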

(Also, I reasoned that cyclic redundancy checks are the least likely to be stored on-disk, since CRCs are fast to compute—it would behove authors to pick a secure algorithm if they wanted to prepare a checksum list ahead-of-time).

@theonlypwner
Contributor Author

theonlypwner commented Jan 12, 2021

My Ruby environment has issues when running bundle install on this repo, and I don't have time to fix them right now. I added it manually in a draft PR and will run the scripts after I fix the issues.

I asked

> Should we just not support cksum?

because I was wondering whether it'll be too tricky to support, with the false positives. It looks like we shouldn't try to detect it, but is there a way to highlight it when it is explicitly specified, such as ```hash blocks in Markdown?

@Alhadis
Collaborator

Alhadis commented Jan 12, 2021

> but is there a way to highlight it when it is explicitly specified, such as ```hash blocks in Markdown?

Unfortunately not. ```hash blocks are highlighted by the same mechanism responsible for highlighting a file's source-code. For all intents and purposes, they're exactly the same.

@smola
Contributor

smola commented Jan 14, 2021

Would it make sense to cover single-hash files that are usually identified with the extension (e.g. xxx.sha1, xxx.md5)?

@theonlypwner
Contributor Author

Yes, I've added entries under extensions.
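For context on how such single-hash files are typically used, here is a small sketch that picks the algorithm from the extension and compares digests. The extension table is hypothetical (it is not the list added to Linguist), and the helper names `algo_for` and `digest_matches` are made up for the example; it assumes the first whitespace-separated token in the file is the hex digest.

```python
import hashlib
from pathlib import PurePath
from typing import Optional

# Hypothetical mapping from sidecar extension to a hashlib algorithm name.
EXT_TO_ALGO = {".md5": "md5", ".sha1": "sha1", ".sha256": "sha256", ".sha512": "sha512"}


def algo_for(sidecar_name: str) -> Optional[str]:
    """Guess the algorithm from a name like xxx.sha1; None if unrecognized."""
    return EXT_TO_ALGO.get(PurePath(sidecar_name).suffix.lower())


def digest_matches(sidecar_name: str, sidecar_text: str, data: bytes) -> bool:
    """Compare the first token of the sidecar file with the digest of data."""
    algo = algo_for(sidecar_name)
    tokens = sidecar_text.split()
    if algo is None or not tokens:
        return False
    return hashlib.new(algo, data).hexdigest() == tokens[0].lower()
```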

@theonlypwner
Contributor Author

theonlypwner commented Jan 28, 2021

> Unfortunately not. ```hash blocks are highlighted by the same mechanism responsible for highlighting a file's source-code. For all intents and purposes, they're exactly the same.

Don't the blocks below highlight differently? They are ```c, ```html, and ``` blocks, respectively.

int main() {} // <strong attr="val">HTML</strong>
int main() {} // <strong attr="val">HTML</strong>
int main() {} // <strong attr="val">HTML</strong>
