copyleftdev/kindi

kindi

Cryptographic evidence hunter for embedded firmware and software supply chains.

Performance · Quick Start · Field Test · AI Output · Fuzzing · Website


Named for Al-Kindi (801-873 CE), the Arab polymath who wrote Risalah fi Istikhraj al-Mu'amma -- the first known treatise on breaking ciphers through frequency analysis. 1,200 years later, Kindi does what he did: finds cryptographic patterns hiding in text, but at 200 MB/s with SIMD-accelerated Aho-Corasick automata.

cargo install kindi

Performance

| Operation | Regex alternation | Kindi | Speedup |
|---|---|---|---|
| Keyword search (434 patterns) | 10,814 µs | 332 µs | 33x |
| API search (4,427 patterns) | 55,795 µs | 318 µs | 175x |
| Throughput | 1.2 - 6.1 MB/s | 199 - 208 MB/s | 33 - 175x |

How

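Traditional scanners build (?:kw1|kw2|...|kw4427) and run re.finditer() line-by-line. With a backtracking regex engine that costs roughly O(n * m) per line, where n is the line length and m the number of alternatives.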

Kindi compiles all patterns into an Aho-Corasick automaton -- single O(n) pass, SIMD memchr inner loop, zero allocation during search. Word boundaries enforced post-match. Line numbers via binary-searched LineIndex. Files scanned in parallel via Rayon.
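To make the single-pass idea concrete, here is a minimal sketch of a textbook Aho-Corasick automaton plus a binary-searched line index. It is an illustration only and not Kindi's code: it uses HashMap transitions instead of a SIMD inner loop, it allocates a result vector, and the names `Automaton`, `find_all`, and `line_of` are assumptions (only `LineIndex` is named above). Patterns are assumed to be non-empty.

```rust
use std::collections::{HashMap, VecDeque};

/// Textbook Aho-Corasick over bytes (no SIMD, HashMap transitions).
pub struct Automaton {
    next: Vec<HashMap<u8, usize>>, // trie edges per state
    fail: Vec<usize>,              // longest proper suffix that is also a trie path
    out: Vec<Vec<usize>>,          // pattern ids that end at each state
    lens: Vec<usize>,              // pattern lengths, indexed by pattern id
}

impl Automaton {
    pub fn new(patterns: &[&str]) -> Self {
        let mut a = Automaton {
            next: vec![HashMap::new()],
            fail: vec![0],
            out: vec![vec![]],
            lens: vec![],
        };
        // Phase 1: insert every pattern into the trie.
        for (id, p) in patterns.iter().enumerate() {
            a.lens.push(p.len());
            let mut s = 0;
            for &b in p.as_bytes() {
                let existing = a.next[s].get(&b).copied();
                s = match existing {
                    Some(t) => t,
                    None => {
                        let t = a.next.len();
                        a.next.push(HashMap::new());
                        a.fail.push(0);
                        a.out.push(vec![]);
                        a.next[s].insert(b, t);
                        t
                    }
                };
            }
            a.out[s].push(id);
        }
        // Phase 2: breadth-first pass that sets failure links and inherits outputs.
        let mut queue: VecDeque<usize> = a.next[0].values().copied().collect();
        while let Some(s) = queue.pop_front() {
            let edges: Vec<(u8, usize)> = a.next[s].iter().map(|(&b, &t)| (b, t)).collect();
            for (b, t) in edges {
                let mut f = a.fail[s];
                while f != 0 && !a.next[f].contains_key(&b) {
                    f = a.fail[f];
                }
                a.fail[t] = a.next[f].get(&b).copied().unwrap_or(0);
                let inherited = a.out[a.fail[t]].clone();
                a.out[t].extend(inherited);
                queue.push_back(t);
            }
        }
        a
    }

    /// Returns (pattern id, begin, end) for every occurrence, in one pass over `text`.
    pub fn find_all(&self, text: &[u8]) -> Vec<(usize, usize, usize)> {
        let mut hits = Vec::new();
        let mut s = 0;
        for (i, &b) in text.iter().enumerate() {
            while s != 0 && !self.next[s].contains_key(&b) {
                s = self.fail[s];
            }
            s = self.next[s].get(&b).copied().unwrap_or(0);
            for &id in &self.out[s] {
                hits.push((id, i + 1 - self.lens[id], i + 1));
            }
        }
        hits
    }
}

/// Start offset of every line, so a byte offset maps to a line by binary search.
pub struct LineIndex {
    starts: Vec<usize>,
}

impl LineIndex {
    pub fn new(text: &[u8]) -> Self {
        let mut starts = vec![0];
        for (i, &b) in text.iter().enumerate() {
            if b == b'\n' {
                starts.push(i + 1);
            }
        }
        LineIndex { starts }
    }

    /// 1-based line number of the byte at `offset`.
    pub fn line_of(&self, offset: usize) -> usize {
        // partition_point counts the line starts that are <= offset.
        self.starts.partition_point(|&start| start <= offset)
    }
}
```

Scanning "use 3DES" with the patterns AES, DES, 3DES reports both 3DES and the DES embedded in it. That overlap is the kind of hit a post-match word-boundary check would presumably reject, since the byte before DES is a word character.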

Quick Start

# Scan firmware source tree
kindi --keywords patterns/keyword_list.txt --api-defs patterns/api_definitions.txt ./firmware

# Binary file analysis
kindi --keywords patterns.txt --scan-binary ./build/output

# AI agent mode (3 KB instead of 100 MB)
kindi --keywords patterns.txt --toon ./target

# Quick scan + CSV export
kindi --keywords patterns.txt --quick --csv -o results/ ./src

# Filter out noise
kindi --keywords patterns.txt --ignore-evidence-types generic,Hash ./src
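As a rough illustration of what the last command asks for, the sketch below parses a comma-separated type list and drops matching evidence. The function names and the case-insensitive comparison are assumptions made for this example; how Kindi actually compares type names is not stated here.

```rust
use std::collections::HashSet;

/// Parses a comma-separated list such as "generic,Hash" into a lowercase set.
/// Lowercasing is an assumption of this sketch.
pub fn parse_ignored_types(arg: &str) -> HashSet<String> {
    arg.split(',')
        .map(|t| t.trim().to_ascii_lowercase())
        .filter(|t| !t.is_empty())
        .collect()
}

/// Keeps only the evidence types that are not in the ignore set.
pub fn filter_evidence<'a>(types: &[&'a str], ignored: &HashSet<String>) -> Vec<&'a str> {
    types
        .iter()
        .copied()
        .filter(|t| !ignored.contains(&t.to_ascii_lowercase()))
        .collect()
}
```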

Field Test

Four legendary C monorepos. Identical pattern databases. Real benchmarks.

| Repo | Files | Source | Hits | Time | Throughput | Toon |
|---|---|---|---|---|---|---|
| Linux kernel | 93,188 | 1.7 GB | 108,010 | 0.64s | 146K files/s | 3.5 KB |
| OpenSSL | 6,098 | 144 MB | 272,051 | 1.03s | 5.9K files/s | 3.4 KB |
| Wireshark | 7,256 | 357 MB | 48,923 | 1.40s | 5.2K files/s | 3.5 KB |
| FFmpeg | 10,260 | 112 MB | 4,281 | 0.38s | 27K files/s | 2.7 KB |
| Total | 116,802 | 2.3 GB | 433,265 | 3.45s | | 13 KB |

All four toon outputs combined: 3,272 tokens. That's 0.3% of a 1M-token context window.

What It Finds

48+ cryptographic algorithms, protocols, and library calls:

  • Symmetric -- AES, DES, 3DES, Blowfish, Twofish, Camellia, ChaCha20, RC4, CAST5
  • Asymmetric -- RSA, DSA, ECC, Diffie-Hellman, ElGamal
  • Hash -- SHA-1, SHA-2, SHA-3, MD5, MD4, BLAKE, RIPEMD, HMAC
  • Protocols -- TLS, SSL, SSH, Kerberos, PKCS, PKI
  • Libraries -- OpenSSL, libgcrypt, Crypto++, WinCrypt
  • Weak crypto flagged -- DES (NIST-withdrawn), MD5 (collision attacks), SHA-1 (NIST-2030), RC4 (biased keystream), Blowfish (64-bit block)
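The weak-crypto flags in the last bullet amount to a small lookup table. A minimal sketch, assuming a hypothetical `weak_reason` function and using the reasons exactly as listed above; the name normalization is also an assumption:

```rust
/// Returns the reason an algorithm is flagged as weak, or None if it is not flagged.
/// Hypothetical helper; the reasons are copied from the list above.
pub fn weak_reason(algorithm: &str) -> Option<&'static str> {
    // Normalize spellings such as "SHA-1", "sha1", and "Sha1" to one key.
    let key: String = algorithm
        .chars()
        .filter(|c| c.is_ascii_alphanumeric())
        .map(|c| c.to_ascii_uppercase())
        .collect();
    match key.as_str() {
        "DES" => Some("NIST-withdrawn"),
        "MD5" => Some("collision attacks"),
        "SHA1" => Some("NIST-2030"),
        "RC4" => Some("biased keystream"),
        "BLOWFISH" => Some("64-bit block"),
        _ => None,
    }
}
```

Because the whole normalized name is compared, 3DES does not inherit the DES flag in this sketch.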

AI Agent Output

--toon compresses output for LLM consumption. The Linux kernel scan produces 100 MB of JSON (26M tokens). Toon gives you 3 KB (864 tokens): 30,228:1 compression by token count.

@kindi pkg=linux files=93188 hits=108010 matched=4909 t=0.934s
#by_type count=8
AES 10804 908f  drivers/crypto/atmel-aes.c:323
SHA2 3851 449f  include/crypto/sha2.h:203
DES 2948 263f  drivers/gpu/drm/amd/...:225
#weak count=4
DES 2948 263f DEPRECATED:NIST-withdrawn
MD5 1193 219f BROKEN:collision-attacks
SHA1 1364 330f DEPRECATED:NIST-2030
RC4 87 12f BROKEN:biased-keystream
#hot top=3
1010 drivers/crypto/inside-secure/safexcel_cipher.c AES,DES,SHA1
940  drivers/crypto/axis/artpec6_crypto.c AES,SHA2,HMAC
927  drivers/md/dm-crypt.c AES,HMAC,MD5
#methods
keyword 107562
api 448

Design principles:

  • I (the AI) never need all N matches. I need aggregates and drill-down capability.
  • Repeated JSON field names are pure token waste. Positional format eliminates them.
  • Evidence grouped by type, not by file. That's how questions are asked.
  • Weak crypto pre-classified. Don't make me consult training data for NIST status.
  • Context lines omitted (70% of JSON volume). Available in drill-down mode.
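Because the format is positional, a consumer needs very little code to read it. The sketch below parses the `@kindi` header and one `#by_type` row from the sample above. The column meanings (hit count, file count with an `f` suffix, example location) are inferred from that sample rather than from a specification, and the function and struct names are hypothetical.

```rust
use std::collections::HashMap;

/// Parses the `@kindi key=value ...` header line into a map.
pub fn parse_header(line: &str) -> Option<HashMap<&str, &str>> {
    let rest = line.strip_prefix("@kindi")?;
    Some(
        rest.split_whitespace()
            .filter_map(|pair| pair.split_once('='))
            .collect(),
    )
}

/// One positional row of the `#by_type` section (field meanings are assumed).
#[derive(Debug, PartialEq)]
pub struct TypeRow<'a> {
    pub evidence_type: &'a str,
    pub hits: u64,
    pub files: u64,
    pub example: &'a str,
}

/// Parses a row such as `AES 10804 908f  drivers/crypto/atmel-aes.c:323`.
pub fn parse_type_row(line: &str) -> Option<TypeRow<'_>> {
    let mut fields = line.split_whitespace();
    let evidence_type = fields.next()?;
    let hits = fields.next()?.parse().ok()?;
    let files = fields.next()?.strip_suffix('f')?.parse().ok()?;
    let example = fields.next()?;
    Some(TypeRow { evidence_type, hits, files, example })
}
```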

Architecture

src/
  pattern.rs   -- Aho-Corasick SIMD automaton, per-pattern \b word boundaries
  detect.rs    -- Rayon parallel scanning, encoding fallback, method tracking
  extract.rs   -- Archive extraction, SHA-256 cycle detection, path sanitization
  toon.rs      -- Token-optimized output with weak crypto classification
  language.rs  -- File classification (13 languages + binary)
  output.rs    -- Crypto-spec v3.0 JSON + CSV
  error.rs     -- Typed errors
  main.rs      -- CLI
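The parallel scanning attributed to detect.rs can be pictured with the dependency-free sketch below. It deliberately swaps Rayon for `std::thread::scope` and the automaton for a naive substring count, so it shows only the fan-out and ordered collection of per-file results. `scan_parallel` and `count_occurrences` are hypothetical names.

```rust
use std::thread;

/// Naive overlapping substring count; a stand-in for the real matcher.
fn count_occurrences(haystack: &[u8], needle: &[u8]) -> usize {
    if needle.is_empty() || haystack.len() < needle.len() {
        return 0;
    }
    haystack.windows(needle.len()).filter(|w| *w == needle).count()
}

/// Scans in-memory files on up to `workers` threads and returns one hit
/// count per file, in input order.
pub fn scan_parallel(files: &[Vec<u8>], needle: &[u8], workers: usize) -> Vec<usize> {
    let workers = workers.max(1);
    let chunk_len = ((files.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Spawn every worker first, then join them in order.
        let handles: Vec<_> = files
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|f| count_occurrences(f, needle)).collect::<Vec<usize>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

A work-stealing pool such as Rayon balances uneven file sizes better than the fixed chunks used here.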

Memory Safety

Zero unsafe. Zero unwrap() in production code. Bounds-checked .get() on every hot-path index.
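A minimal sketch of that style, assuming a hypothetical `matched_text` helper and `SliceError` type: the match span is read with `.get()` and every failure is returned as a typed error instead of panicking.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum SliceError {
    OutOfBounds { begin: usize, end: usize, len: usize },
    NotUtf8,
}

/// Returns content[begin..end] as text without indexing or unwrap().
pub fn matched_text(content: &[u8], begin: usize, end: usize) -> Result<&str, SliceError> {
    // `get` returns None when the range is reversed or runs past the end.
    let bytes = content
        .get(begin..end)
        .ok_or(SliceError::OutOfBounds { begin, end, len: content.len() })?;
    std::str::from_utf8(bytes).map_err(|_| SliceError::NotUtf8)
}
```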

| Protection | Mechanism |
|---|---|
| Buffer overflow | PathBuf, Vec, .get() bounds checking |
| Null deref | Result<T, E> at every I/O boundary |
| Use-after-free | RAII via Drop |
| OOM | MAX_FILE_SIZE (256 MB) guard before allocation |
| Path traversal | sanitize_entry_path() strips ../ and leading / |
| Archive bombs | SHA-256 cycle detection in extraction graph |
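The path traversal defence can be sketched with the standard library alone. The function name comes from the table, but its signature and exact behaviour here are assumptions: this version keeps only normal path components, which drops `../`, `./`, a leading `/`, and Windows drive prefixes.

```rust
use std::path::{Component, Path, PathBuf};

/// Reduces an archive entry name to a relative path made of normal
/// components only, so joining it to a target directory cannot escape it.
pub fn sanitize_entry_path(entry: &str) -> PathBuf {
    Path::new(entry)
        .components()
        .filter_map(|c| match c {
            Component::Normal(part) => Some(part),
            _ => None, // RootDir, ParentDir, CurDir, Prefix
        })
        .collect()
}
```

Joining the result onto the extraction directory, for example `Path::new("out").join(sanitize_entry_path(name))`, then always yields a path under `out`.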

Fuzzing

8 libFuzzer targets with structure-aware Arbitrary inputs:

| Target | Invariants |
|---|---|
| fuzz_pattern_parse | No panics on arbitrary INI+JSON |
| fuzz_pattern_search | matched_text == content[begin..end], boundary enforcement, filtered types absent |
| fuzz_extract_zip | All files under target dir (no traversal) |
| fuzz_extract_tar | No panics on corrupt tar/gz/bz2/xz |
| fuzz_encoding | Output always valid UTF-8 |
| fuzz_full_pipeline | detection_method correct, verification code valid, JSON round-trips, CSV well-formed |
| fuzz_extract_path_sanitize | No directory escapes |
| fuzz_language_classify | Source implies text, binary implies not-source |

rustup toolchain install nightly
cargo +nightly fuzz run fuzz_pattern_search
cargo +nightly fuzz run fuzz_full_pipeline -- -max_total_time=300
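The three invariants listed for fuzz_pattern_search can be written as an ordinary function that a fuzz target would call on every reported match. This is a sketch under assumptions: the `Match` struct and `check_match` are hypothetical, boundary enforcement is simplified to "no ASCII word byte directly before or after the span", and type names are compared case-insensitively.

```rust
/// Hypothetical match record; field names follow the invariant wording above.
pub struct Match<'a> {
    pub begin: usize,
    pub end: usize,
    pub matched_text: &'a str,
    pub evidence_type: &'a str,
}

fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Returns a description of the first violated invariant, if any.
pub fn check_match(content: &str, m: &Match<'_>, filtered: &[&str]) -> Result<(), String> {
    // 1. The reported span must reproduce the reported text. `str::get`
    //    returns None for out-of-range or non-char-boundary spans.
    if content.get(m.begin..m.end) != Some(m.matched_text) {
        return Err(format!("span {}..{} does not yield matched_text", m.begin, m.end));
    }
    // 2. Boundary enforcement: no word byte directly before or after the span.
    let bytes = content.as_bytes();
    let before = m.begin.checked_sub(1).and_then(|i| bytes.get(i));
    let after = bytes.get(m.end);
    if before.into_iter().chain(after).any(|&b| is_word_byte(b)) {
        return Err(format!("span {}..{} is glued to a word character", m.begin, m.end));
    }
    // 3. Filtered evidence types must never be reported.
    if filtered.iter().any(|t| t.eq_ignore_ascii_case(m.evidence_type)) {
        return Err(format!("filtered type {} was reported", m.evidence_type));
    }
    Ok(())
}
```

Inside a real target, a failed check would typically be turned into a panic so that libFuzzer records the input as a crash.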

Testing

35 unit tests. Zero compiler warnings.

cargo test    # 35 tests
cargo bench   # Criterion benchmarks

CLI Reference

kindi [OPTIONS] <PACKAGES>...

Options:
    --keywords <FILE>            Keyword pattern file [default: keyword_list.txt]
    --api-defs <FILE>            API pattern file (whole-word matching)
-i, --ignore-case                Case-insensitive
-q, --quick                      Quick scan (presence only)
    --source-only                Source files only
    --scan-binary                Include binary files
-s, --stop-after <N>             Stop after N matched files
    --ignore-evidence-types <T>  Filter types (comma-separated)
-o, --output <DIR>               Output directory [default: .]
    --csv                        CSV alongside JSON
    --toon                       AI agent token-optimized output
    --pretty                     Pretty JSON
-v, --verbose                    Verbose

Origin

"One way to solve an encrypted message, if we know its language, is to find a different plaintext of the same language long enough to fill one sheet or so, and then we count the occurrences of each letter."

-- Abu Yusuf Ya'qub ibn Ishaq al-Kindi, Risalah fi Istikhraj al-Mu'amma, Baghdad, c. 850 CE

License

Apache-2.0
