A high-performance lossless data compressor using context mixing, written in C#. Originally a port of MCM (Mathieu's Compression Method) by Mathieu Chartier.
CMSharp achieves strong compression ratios through adaptive context modeling with up to 8 specialized models mixed via a neural network, combined with preprocessing (dictionary, base64, CSV/JSON columnar transforms, deduplication, x86 BCJ, preflate, JPEG DCT, PNG defiltering) and arithmetic coding. Archives support optional AES-256-GCM encryption, Reed-Solomon recovery records, ECDSA P-256 digital signing, multi-volume splitting, and DEFLATE reconstruction (preflate) for ZIP containers.
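The heart of the context-mixing backend is a single logistic neuron that combines the models' bit predictions, in the style of PAQ's squash/stretch mixing. The sketch below is an illustration of that technique only — class and parameter names are invented here, and CMSharp's actual mixer uses fixed-point tables and per-context weight sets.

```python
import math

def stretch(p):
    """Logit: map a probability in (0,1) onto the real line."""
    return math.log(p / (1.0 - p))

def squash(x):
    """Inverse logit: map the real line back into (0,1)."""
    return 1.0 / (1.0 + math.exp(-x))

class LogisticMixer:
    """Mix N model predictions with online-trained weights (one neuron)."""
    def __init__(self, n, lr=0.02):
        self.w = [0.0] * n
        self.lr = lr

    def mix(self, probs):
        self.st = [stretch(p) for p in probs]
        self.p = squash(sum(w * s for w, s in zip(self.w, self.st)))
        return self.p

    def update(self, bit):
        err = bit - self.p              # error for the observed bit (0 or 1)
        for i, s in enumerate(self.st):
            self.w[i] += self.lr * err * s

# Toy run: model 0 confidently predicts 1, model 1 is uninformative; the
# data is all 1-bits, so model 0's weight grows and the mix approaches 1.
mixer = LogisticMixer(2)
for _ in range(200):
    p = mixer.mix([0.9, 0.5])
    mixer.update(1)
assert p > 0.85
```

The mixed probability then drives the arithmetic coder; models that predict well in the current context accumulate weight, which is what makes the mixture adaptive.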
On enwik8 (100MB Wikipedia text), AOT-compiled on Intel Arrow Lake-S (36MB L3):
| Mode | Compressed Size | Speed | Memory |
|---|---|---|---|
| Sequential (-m6) | 18,288 KB (1.498 bpb) | ~23s | ~160 MB |
| Parallel (-p, auto prewarm) | ~19,750 KB | ~6s | ~160 MB per thread |
| Fast (-cL) | ~20,500 KB | ~5s | ~40 MB per thread |
| Zstd (-cZ) | 26,967 KB (2.209 bpb) | ~2.5s | ~50 MB |
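The bpb figures in the table follow directly from the sizes (enwik8 is exactly 100,000,000 bytes; the KB column uses 1 KB = 1024 bytes):

```python
# Bits per byte (bpb) = compressed bits / original bytes.
def bpb(compressed_kb, original_bytes=100_000_000):
    return compressed_kb * 1024 * 8 / original_bytes

assert round(bpb(18_288), 3) == 1.498   # sequential -m6 row
assert round(bpb(26_967), 3) == 2.209   # zstd row
```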
CMSharp.sln
src/CMSharp/ Core library (NuGet package)
Core/ Context mixer, hash table, range coder, BinaryIO utilities
Models/ Word, bracket, match, interval, special char, PPM-D models + ProbMap
Compression/ Archive format, compressor, parallel compressor, data detection
Preprocessing/ Dictionary, Base64, CSV/JSON columnar, LZP, dedup, X86 BCJ, preflate
Postprocessing/ AES-256-GCM encryption, Reed-Solomon recovery, ECDSA signing
src/CMSharp.Cli/ Command-line application
src/CMSharp.CUDA/ CUDA Mamba-2 neural predictor (optional, requires CUDA GPU)
src/CMSharp.ReedSolomon/ Reed-Solomon erasure coding (vendored from Witteborn.ReedSolomon)
src/CMSharp.Preflate/ Preflate DEFLATE reconstruction engine
src/CMSharp.Tests/ Unit tests (569 tests)
Requires .NET 10 SDK.
# JIT build
dotnet build CMSharp.sln -c Release
# AOT build (recommended, ~10% faster)
dotnet publish src/CMSharp.Cli/CMSharp.Cli.csproj -c Release -p:PublishAot=true -o publish_aot
# Run tests
dotnet test src/CMSharp.Tests -c Release

cmsharp <command> [options] <inputs...> [-o output]
| Command | Description |
|---|---|
| a | Create or add to archive |
| x | Extract archive |
| l | List archive contents |
| t | Test archive integrity (decompress + verify CRC32) |
| r | Repair damaged archive using recovery record |
| bench (or b) | Benchmark compress + decompress N times |
| sweep (or s) | Compress only; machine-parseable output |
| Option | Description |
|---|---|
| -c0 | Store only (no compression) |
| -cZ | Zstd backend (fast, good compression) |
| -cL | Limited CM (4 models, faster) |
| -cS | Standard CM (7 models, default) |
| -cX | Extended CM (all models, best compression) |
| Option | Description |
|---|---|
| -m[N] | Memory level (1-10, default 6). Higher = better compression, more RAM |
| -f[N] | Fast mode (N=2-7 models). Fine-grained model count control |
| -p[N] | Parallel compression (N=max threads, default all cores) |
| -ppm | Enable PPM-D model (slower, ~0.7% smaller) |
| -adaptive | Adaptive model selection (profile 1MB, prune to optimal set) |
| -e | Encrypt archive with AES-256-GCM (prompts for password) |
| -rr[:N] | Add recovery record (N=overhead %, default 0.25) |
| -sign:FILE | Sign archive with ECDSA P-256 private key PEM |
| -verify | Verify signature on extraction (embedded public key) |
| -vol:SIZE | Split into volumes (e.g., -vol:100M, -vol:4.7G) |
| -acl | Preserve NTFS ACLs (Windows) |
| -o <path> | Output path (default: <input>.cms for a, current dir for x) |
| -test | Verify decompression after compression |
| -v | Verbose output |
| Option | Description |
|---|---|
| -e | Encrypt with AES-256-GCM; prompts for password interactively |
| --password:VALUE | Set password inline (no prompt) |
Encryption uses Argon2id key derivation (64 MB memory, OWASP 2023 recommended parameters). Decryption auto-detects encrypted archives and prompts for the password if needed.
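The key-derivation step has the usual password-plus-salt shape. Python's standard library has no Argon2id, so the sketch below substitutes `hashlib.scrypt` purely to illustrate the flow — CMSharp itself uses Argon2id with 64 MB of memory, and the cost parameters here are illustrative:

```python
import hashlib, os

# scrypt stands in for Argon2id here (stdlib-only sketch): a memory-hard
# KDF turns a password + random salt into a 256-bit AES key.
def derive_key(password: str, salt: bytes) -> bytes:
    return hashlib.scrypt(password.encode(), salt=salt,
                          n=2**14, r=8, p=1,        # illustrative cost params
                          maxmem=64 * 1024 * 1024,  # memory ceiling
                          dklen=32)                 # 32 bytes = AES-256 key

salt = os.urandom(16)
k1 = derive_key("hunter2", salt)
assert k1 == derive_key("hunter2", salt) and len(k1) == 32
assert derive_key("hunter2", os.urandom(16)) != k1  # new salt -> new key
```

The salt is stored in the archive header so decryption can re-derive the same key from the prompted password.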
| Option | Description |
|---|---|
| -rr | Add Reed-Solomon recovery record (0.25% overhead) |
| -rr:N | Custom overhead percentage (0.1-50) |
Recovery records enable the r command to repair bit-rot or partial corruption. The t command reports recovery record status and recoverability.
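The idea behind a recovery record can be shown with the simplest possible erasure code: one XOR parity block that can rebuild any single lost data block. Real Reed-Solomon coding over GF(256), as used here, generalizes this to many parity blocks recovering many lost blocks; the code below is a deliberately simplified stand-in.

```python
# One XOR parity block over equal-sized data blocks: losing any single
# block is recoverable by XOR-ing the parity with the surviving blocks.
def make_parity(blocks):
    parity = bytes(len(blocks[0]))
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return parity

def recover(blocks_with_hole, parity):
    """blocks_with_hole: list where exactly one entry is None (lost)."""
    missing = blocks_with_hole.index(None)
    acc = parity
    for i, b in enumerate(blocks_with_hole):
        if i != missing:
            acc = bytes(x ^ y for x, y in zip(acc, b))
    return missing, acc

data = [b"AAAA", b"BBBB", b"CCCC"]
p = make_parity(data)
idx, rebuilt = recover([b"AAAA", None, b"CCCC"], p)
assert (idx, rebuilt) == (1, b"BBBB")
```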
| Option | Description |
|---|---|
| -sign:FILE | Sign archive with ECDSA P-256 private key PEM |
| -verify | Verify signature on extraction (embedded public key) |
| -verify:FILE | Verify against specific public key PEM |
Signing embeds the public key in the archive for self-contained verification. Generate keys with OpenSSL:
openssl ecparam -genkey -name prime256v1 -noout -out private.pem
openssl ec -in private.pem -pubout -out public.pem

| Option | Description |
|---|---|
| -vol:SIZE | Split archive into volumes (e.g., -vol:100M, -vol:4.7G, -vol:650K) |
Volumes are named .001, .002, etc. and automatically reassembled on extraction.
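The split/reassemble behavior is straightforward fixed-size chunking. This sketch mirrors the `.001`, `.002`, … naming described above; the chunking code itself is an illustration, not CMSharp's implementation:

```python
# Split a byte payload into fixed-size volumes and join them back in
# numeric order. Naming (.001, .002, ...) follows the README.
def split_volumes(data: bytes, vol_size: int):
    return {f"archive.cms.{i + 1:03d}": data[off:off + vol_size]
            for i, off in enumerate(range(0, len(data), vol_size))}

def join_volumes(volumes: dict) -> bytes:
    return b"".join(volumes[name] for name in sorted(volumes))

payload = bytes(range(256)) * 10          # 2,560-byte stand-in archive
vols = split_volumes(payload, 1000)
assert sorted(vols) == ["archive.cms.001", "archive.cms.002", "archive.cms.003"]
assert join_volumes(vols) == payload      # lossless round trip
```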
| Option | Description |
|---|---|
| -dT / -dB / -dI | Force text/binary/image mode (skip auto-detection) |
| -nodict | Skip dictionary preprocessing |
| -nobase64 | Skip inline base64 decoding |
| -nojson | Skip JSON columnar transform |
| -nocsv | Skip CSV columnar transform |
| -nox86 | Skip x86 BCJ filter for binary data |
| -nopreflate | Skip preflate preprocessing for ZIP containers |
| -nojpeg | Skip JPEG DCT coefficient extraction |
| -nopng | Skip PNG preprocessing (preflate-only: decompress IDAT) |
| -pngfull | Full PNG preprocessing with defiltering and channel separation |
| -dedup | Enable duplicate block removal |
| -dict1:N | Max 1-byte dictionary codes (1-125, default 40) |
| -dictmin:N | Minimum word frequency for dictionary inclusion (default 5) |
| -dictprefix:N | Min prefix length for substring matching (1-20, default 6) |
| Option | Description |
|---|---|
| -w[N] | Override prewarm size in KB (default: auto; implies -p) |
| -runs:N | Number of benchmark runs (default 3) |
| -mamba | Enable CUDA Mamba-2 neural predictor (requires CUDA GPU) |
| --max-extract-size:N | Reject archives claiming more than N bytes (safety limit) |
| --enableLargePages | Grant Lock Pages in Memory privilege (Windows, admin) |
# Compress a file
cmsharp a input.txt
# Compress a directory
cmsharp a mydir/
# Compress multiple inputs to named archive
cmsharp a -o archive.cms file1.txt file2.txt dir/
# Add files to existing archive
cmsharp a -o archive.cms newfile.txt
# Zstd compression (fast)
cmsharp a -cZ input.txt
# Limited CM (4 models, faster)
cmsharp a -cL input.txt
# Best compression (extended models, high memory)
cmsharp a -cX -m8 input.txt
# Parallel CM compression
cmsharp a -p input.txt
# Encrypt with recovery record
cmsharp a -e -rr:1 input.txt
# Sign an archive
cmsharp a -sign:private.pem input.txt
# Split into 100MB volumes
cmsharp a -vol:100M largefile.bin
# Extract
cmsharp x archive.cms -o outdir/
# Extract with signature verification
cmsharp x archive.cms -o outdir/ -verify
# Test archive integrity
cmsharp t archive.cms
# Repair damaged archive
cmsharp r archive.cms
# Benchmark (5 runs, parallel)
cmsharp bench -runs:5 -p input.txt
# Piping
cmsharp a - # compress stdin -> stdout
cmsharp a - -o out.cms # compress stdin -> file
cmsharp x input.cms -o - # decompress -> stdout

For ~8% faster compression, enable 2MB large pages:
cmsharp --enableLargePages # Run once (triggers UAC), then log out/in
flowchart TD
A[Input Files] --> B[Profile Detection<br/>text / binary / image]
B --> C[File Ordering<br/>group by type + extension]
C --> D{Data Type?}
D -->|Text| T1[Base64 Decode<br/>inline blocks → raw]
T1 --> T2[JSON/CSV Columnar<br/>rows → columns]
T2 --> T3[Dictionary<br/>words → short codes]
D -->|Binary| B1[Preflate<br/>ZIP DEFLATE → raw]
B1 --> B2[X86 BCJ Filter<br/>rel→abs addresses]
D -->|Image| I1[JPEG DCT Extraction<br/>coefficients → stream]
D -->|Image| I2[PNG Defilter<br/>IDAT → raw pixels]
D -->|Optional| O1[Deduplication<br/>remove duplicate blocks]
T3 --> E{Backend}
B2 --> E
I1 --> E
I2 --> E
O1 --> E
E -->|Default| CM[Context Mixing<br/>7 models + neural mixer<br/>+ SSE + arithmetic coding]
E -->|-cZ| ZS[Zstd<br/>LZ77 + Huffman/FSE]
CM --> F[.cms Archive]
ZS --> F
F -.->|Optional| G1[AES-256-GCM Encryption]
G1 -.-> G2[Reed-Solomon Recovery]
G2 -.-> G3[ECDSA P-256 Signature]
G3 -.-> G4[Multi-Volume Split]
The archive format (v2.0) uses .cms extension with CMSARCHIVE magic header, CRC32 per-file verification, CRC-based deduplication, solid-block compression, and metadata preservation (timestamps, permissions, symlinks, NTFS ACLs). Post-compression records (recovery, signature) are backward-compatible: older decompressors ignore them. See FORMAT.md for the full wire format specification.
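A minimal sketch of the magic-header plus per-file CRC32 check described above. The CMSARCHIVE magic string comes from this README, but the record layout below is invented for illustration — see FORMAT.md for the real wire format:

```python
import io, struct, zlib

MAGIC = b"CMSARCHIVE"   # magic string from the README; layout below is invented

def write_entry(buf, name: bytes, data: bytes):
    # Hypothetical record: name length, data length, CRC32, name, data.
    buf.write(struct.pack("<HII", len(name), len(data), zlib.crc32(data)))
    buf.write(name)
    buf.write(data)

def read_entry(buf):
    name_len, size, crc = struct.unpack("<HII", buf.read(10))
    name, data = buf.read(name_len), buf.read(size)
    ok = zlib.crc32(data) == crc     # the kind of check `t` performs per file
    return name, data, ok

buf = io.BytesIO()
buf.write(MAGIC)
write_entry(buf, b"hello.txt", b"hello world")
buf.seek(0)
assert buf.read(len(MAGIC)) == MAGIC          # reject non-archives early
name, data, ok = read_entry(buf)
assert (name, data, ok) == (b"hello.txt", b"hello world", True)
```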
# AOT benchmark (3 runs on enwik8, builds AOT internally)
powershell -File bench_aot.ps1 -Runs 3 -Label TEST -HighPriority
# JIT benchmark
powershell -File bench.ps1 -Runs 3 -Label TEST

- Mathieu Chartier - Original MCM compressor
- Matt Mahoney - PAQ compression research and squash/stretch tables
- Sami Runsas - NanoZip compressor (inspiration for adaptive model selection)
- Eugene Shelwien - PPM-D implementation (mod_ppmd_v2)
- Stephan Busch - Contributions to MCM
- Christopher Mattern - Contributions to MCM
- Dirk Steinke - Preflate DEFLATE reconstruction (Apache 2.0)
- Thomas Witteborn - Reed-Solomon implementation (Witteborn.ReedSolomon, MIT license)
- Oleg Stepanischev - ZstdSharp C# zstd port (BSD)
- Keef Aragon - Konscious.Security.Cryptography Argon2id (MIT)