CMSharp

A high-performance lossless data compressor using context mixing, written in C#. Originally a port of MCM (Mathieu's Compression Method) by Mathieu Chartier.

CMSharp achieves strong compression ratios through adaptive context modeling with up to 8 specialized models mixed via a neural network, combined with preprocessing (dictionary, base64, CSV/JSON columnar transforms, deduplication, x86 BCJ, preflate, JPEG DCT, PNG defiltering) and arithmetic coding. Archives support optional AES-256-GCM encryption, Reed-Solomon recovery records, ECDSA P-256 digital signing, multi-volume splitting, and DEFLATE reconstruction (preflate) for ZIP containers.

Performance

On enwik8 (100 MB of Wikipedia text), AOT-compiled, on an Intel Arrow Lake-S CPU (36 MB L3 cache):

Mode	Compressed Size	Time	Memory
Sequential (-m6)	18,288 KB (1.498 bpb)	~23s	~160 MB
Parallel (-p, auto prewarm)	~19,750 KB	~6s	~160 MB per thread
Fast (-cL)	~20,500 KB	~5s	~40 MB per thread
Zstd (-cZ)	26,967 KB (2.209 bpb)	~2.5s	~50 MB
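
The bpb (bits per byte) figures in the table can be reproduced from the compressed sizes. The short sketch below assumes that 1 KB means 1024 bytes and that enwik8 is exactly 100,000,000 bytes; neither is stated in the table, but both are consistent with the published numbers. The function name `bits_per_byte` is only illustrative.

```python
ENWIK8_BYTES = 100_000_000  # assumption: enwik8 is 10^8 bytes


def bits_per_byte(compressed_kb: float, original_bytes: int = ENWIK8_BYTES) -> float:
    """Bits of compressed output per byte of input, assuming 1 KB = 1024 bytes."""
    return compressed_kb * 1024 * 8 / original_bytes


# Sizes taken from the table above.
for mode, size_kb in [("Sequential (-m6)", 18_288), ("Zstd (-cZ)", 26_967)]:
    print(f"{mode}: {bits_per_byte(size_kb):.3f} bpb")
```

The same formula gives a rough bpb for the approximate rows (Parallel and Fast), which the table reports by size only.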

Project Structure

CMSharp.sln
  src/CMSharp/            Core library (NuGet package)
    Core/                 Context mixer, hash table, range coder, BinaryIO utilities
    Models/               Word, bracket, match, interval, special char, PPM-D models + ProbMap
    Compression/          Archive format, compressor, parallel compressor, data detection
    Preprocessing/        Dictionary, Base64, CSV/JSON columnar, LZP, dedup, X86 BCJ, preflate
    Postprocessing/       AES-256-GCM encryption, Reed-Solomon recovery, ECDSA signing
  src/CMSharp.Cli/        Command-line application
  src/CMSharp.CUDA/       CUDA Mamba-2 neural predictor (optional, requires CUDA GPU)
  src/CMSharp.ReedSolomon/ Reed-Solomon erasure coding (vendored from Witteborn.ReedSolomon)
  src/CMSharp.Preflate/   Preflate DEFLATE reconstruction engine
  src/CMSharp.Tests/      Unit tests (569 tests)

Build

Requires .NET 10 SDK.

# JIT build
dotnet build CMSharp.sln -c Release

# AOT build (recommended, ~10% faster)
dotnet publish src/CMSharp.Cli/CMSharp.Cli.csproj -c Release -p:PublishAot=true -o publish_aot

# Run tests
dotnet test src/CMSharp.Tests -c Release

Usage

cmsharp <command> [options] <inputs...> [-o output]

Commands

Command	Description
`a`	Create or add to archive
`x`	Extract archive
`l`	List archive contents
`t`	Test archive integrity (decompress + verify CRC32)
`r`	Repair damaged archive using recovery record
`bench` (or `b`)	Benchmark compress + decompress N times
`sweep` (or `s`)	Compress-only, machine-parseable output

Compression Tiers

Option	Description
`-c0`	Store only (no compression)
`-cZ`	Zstd backend (fast, good compression)
`-cL`	Limited CM (4 models, faster)
`-cS`	Standard CM (7 models, default)
`-cX`	Extended CM (all models, best compression)

Common Options

Option	Description
`-m[N]`	Memory level (1-10, default 6). Higher = better compression, more RAM
`-f[N]`	Fast mode (N=2-7 models). Fine-grained model count control
`-p[N]`	Parallel compression (N=max threads, default all cores)
`-ppm`	Enable PPM-D model (slower, ~0.7% smaller)
`-adaptive`	Adaptive model selection (profile 1MB, prune to optimal set)
`-e`	Encrypt archive with AES-256-GCM (prompts for password)
`-rr[:N]`	Add recovery record (N=overhead %, default 0.25)
`-sign:FILE`	Sign archive with ECDSA P-256 private key PEM
`-verify`	Verify signature on extraction (embedded public key)
`-vol:SIZE`	Split into volumes (e.g., `-vol:100M`, `-vol:4.7G`)
`-acl`	Preserve NTFS ACLs (Windows)
`-o <path>`	Output path (default: `<input>.cms` for `a`, current dir for `x`)
`-test`	Verify decompression after compression
`-v`	Verbose output

Encryption Options

Option	Description
`-e`	Encrypt with AES-256-GCM; prompts for password interactively
`--password:VALUE`	Set password inline (no prompt)

Encryption uses Argon2id key derivation (64 MB memory, OWASP 2023 recommended parameters). Decryption auto-detects encrypted archives and prompts for the password if needed.

Recovery Record Options

Option	Description
`-rr`	Add Reed-Solomon recovery record (0.25% overhead)
`-rr:N`	Custom overhead percentage (0.1-50)

Recovery records enable the `r` command to repair bit rot or partial corruption. The `t` command reports recovery record status and recoverability.

Digital Signing Options

Option	Description
`-sign:FILE`	Sign archive with ECDSA P-256 private key PEM
`-verify`	Verify signature on extraction (embedded public key)
`-verify:FILE`	Verify against specific public key PEM

Signing embeds the public key in the archive for self-contained verification. Generate keys with OpenSSL:

openssl ecparam -genkey -name prime256v1 -noout -out private.pem
openssl ec -in private.pem -pubout -out public.pem

Multi-Volume Options

Option	Description
`-vol:SIZE`	Split archive into volumes (e.g., `-vol:100M`, `-vol:4.7G`, `-vol:650K`)

Volumes are named `.001`, `.002`, etc., and are automatically reassembled on extraction.
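
As a rough illustration of how a `-vol:SIZE` value maps to a set of volumes, here is a minimal sketch. It is not CMSharp's code: it assumes that the suffixes are binary multiples (K = 1024 bytes) and that the numeric suffix is appended to the archive name, neither of which this README specifies. The helper names `parse_volume_size` and `volume_names` are hypothetical.

```python
import math

# Assumption: K/M/G are binary multiples; the README does not say which base is used.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_volume_size(text: str) -> int:
    """Convert a size such as '100M', '4.7G' or '650K' to a byte count."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)  # plain number: already a byte count


def volume_names(archive: str, total_bytes: int, volume_bytes: int) -> list[str]:
    """Names of the volumes needed to hold total_bytes, numbered from .001."""
    count = max(1, math.ceil(total_bytes / volume_bytes))
    return [f"{archive}.{i:03d}" for i in range(1, count + 1)]


# A 250 MiB archive split with -vol:100M needs three volumes under these assumptions.
print(volume_names("archive.cms", 250 * 1024**2, parse_volume_size("100M")))
```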

Preprocessing Options

Option	Description
`-dT` / `-dB` / `-dI`	Force text/binary/image mode (skip auto-detection)
`-nodict`	Skip dictionary preprocessing
`-nobase64`	Skip inline base64 decoding
`-nojson`	Skip JSON columnar transform
`-nocsv`	Skip CSV columnar transform
`-nox86`	Skip x86 BCJ filter for binary data
`-nopreflate`	Skip preflate preprocessing for ZIP containers
`-nojpeg`	Skip JPEG DCT coefficient extraction
`-nopng`	Skip the default PNG preprocessing (preflate-only: decompress IDAT)
`-pngfull`	Full PNG preprocessing with defiltering and channel separation
`-dedup`	Enable duplicate block removal
`-dict1:N`	Max 1-byte dictionary codes (1-125, default 40)
`-dictmin:N`	Minimum word frequency for dictionary inclusion (default 5)
`-dictprefix:N`	Min prefix length for substring matching (1-20, default 6)
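
To show what the x86 BCJ filter (disabled by `-nox86`) is for, here is a minimal sketch of the classic technique: the relative operand of each `CALL` (opcode `0xE8`) is replaced by an absolute target, so repeated calls to the same function become identical byte sequences that compress better. This is a simplified illustration, not CMSharp's filter; real BCJ filters typically add heuristics and may handle other opcodes. The function name `bcj_x86` is hypothetical.

```python
import struct


def bcj_x86(data: bytes, encode: bool = True) -> bytes:
    """Convert CALL rel32 operands to absolute targets (encode) or back (decode)."""
    out = bytearray(data)
    i = 0
    while i + 5 <= len(out):
        if out[i] == 0xE8:  # CALL rel32: opcode followed by a 4-byte little-endian offset
            (value,) = struct.unpack_from("<I", out, i + 1)
            next_instruction = i + 5  # the offset is relative to the following instruction
            if encode:
                value = (value + next_instruction) & 0xFFFFFFFF
            else:
                value = (value - next_instruction) & 0xFFFFFFFF
            struct.pack_into("<I", out, i + 1, value)
            i += 5  # skip the operand so encoder and decoder scan identically
        else:
            i += 1
    return bytes(out)


# Two calls to the same target (0x100) from different positions.
code = bytearray(0x20)
code[0x00] = 0xE8
code[0x01:0x05] = struct.pack("<I", 0x100 - 0x05)
code[0x10] = 0xE8
code[0x11:0x15] = struct.pack("<I", 0x100 - 0x15)
encoded = bcj_x86(bytes(code))
```

After encoding, both operands hold the same absolute address, and decoding with `encode=False` restores the original bytes.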

Other Options

Option	Description
`-w[N]`	Override prewarm size in KB (default: auto; implies `-p`)
`-runs:N`	Number of benchmark runs (default 3)
`-mamba`	Enable CUDA Mamba-2 neural predictor (requires CUDA GPU)
`--max-extract-size:N`	Reject archives claiming more than N bytes (safety limit)
`--enableLargePages`	Grant Lock Pages in Memory privilege (Windows, admin)

Examples

# Compress a file
cmsharp a input.txt

# Compress a directory
cmsharp a mydir/

# Compress multiple inputs to named archive
cmsharp a -o archive.cms file1.txt file2.txt dir/

# Add files to existing archive
cmsharp a -o archive.cms newfile.txt

# Zstd compression (fast)
cmsharp a -cZ input.txt

# Limited CM (4 models, faster)
cmsharp a -cL input.txt

# Best compression (extended models, high memory)
cmsharp a -cX -m8 input.txt

# Parallel CM compression
cmsharp a -p input.txt

# Encrypt with recovery record
cmsharp a -e -rr:1 input.txt

# Sign an archive
cmsharp a -sign:private.pem input.txt

# Split into 100MB volumes
cmsharp a -vol:100M largefile.bin

# Extract
cmsharp x archive.cms -o outdir/

# Extract with signature verification
cmsharp x archive.cms -o outdir/ -verify

# Test archive integrity
cmsharp t archive.cms

# Repair damaged archive
cmsharp r archive.cms

# Benchmark (5 runs, parallel)
cmsharp bench -runs:5 -p input.txt

# Piping
cmsharp a -                          # compress stdin -> stdout
cmsharp a - -o out.cms               # compress stdin -> file
cmsharp x input.cms -o -             # decompress -> stdout

Large Pages (Windows)

For ~8% faster compression, enable 2 MB large pages:

cmsharp --enableLargePages   # Run once (triggers UAC), then log out/in

How It Works

flowchart TD
    A[Input Files] --> B[Profile Detection<br/>text / binary / image]
    B --> C[File Ordering<br/>group by type + extension]

    C --> D{Data Type?}

    D -->|Text| T1[Base64 Decode<br/>inline blocks → raw]
    T1 --> T2[JSON/CSV Columnar<br/>rows → columns]
    T2 --> T3[Dictionary<br/>words → short codes]

    D -->|Binary| B1[Preflate<br/>ZIP DEFLATE → raw]
    B1 --> B2[X86 BCJ Filter<br/>rel→abs addresses]

    D -->|Image| I1[JPEG DCT Extraction<br/>coefficients → stream]
    D -->|Image| I2[PNG Defilter<br/>IDAT → raw pixels]

    D -->|Optional| O1[Deduplication<br/>remove duplicate blocks]

    T3 --> E{Backend}
    B2 --> E
    I1 --> E
    I2 --> E
    O1 --> E

    E -->|Default| CM[Context Mixing<br/>7 models + neural mixer<br/>+ SSE + arithmetic coding]
    E -->|-cZ| ZS[Zstd<br/>LZ77 + Huffman/FSE]

    CM --> F[.cms Archive]
    ZS --> F

    F -.->|Optional| G1[AES-256-GCM Encryption]
    G1 -.-> G2[Reed-Solomon Recovery]
    G2 -.-> G3[ECDSA P-256 Signature]
    G3 -.-> G4[Multi-Volume Split]
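
The "neural mixer" stage in the diagram combines the per-model bit predictions into one probability. The sketch below shows the general PAQ-style logistic mixing idea in floating point: predictions are mapped through `stretch`, combined with learned weights, mapped back with `squash`, and the weights are nudged toward whichever models predicted the actual bit well. It is an illustration under simplifying assumptions, not CMSharp's mixer, which (like most context-mixing coders) presumably uses fixed-point arithmetic and context-selected weight sets. The class name `LogisticMixer` and the learning rate are hypothetical.

```python
import math


def stretch(p: float) -> float:
    """Logit: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def squash(x: float) -> float:
    """Logistic function, the inverse of stretch; input clamped to avoid overflow."""
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))


class LogisticMixer:
    def __init__(self, n_inputs: int, learning_rate: float = 0.02):
        self.weights = [1.0 / n_inputs] * n_inputs
        self.learning_rate = learning_rate
        self._inputs = [0.0] * n_inputs
        self._p = 0.5

    def predict(self, probabilities) -> float:
        """Mix the models' probabilities that the next bit is 1."""
        self._inputs = [stretch(min(max(p, 1e-6), 1.0 - 1e-6)) for p in probabilities]
        self._p = squash(sum(w * x for w, x in zip(self.weights, self._inputs)))
        return self._p

    def update(self, bit: int) -> None:
        """Gradient step on coding cost: reward models that predicted `bit`."""
        error = bit - self._p
        for i, x in enumerate(self._inputs):
            self.weights[i] += self.learning_rate * error * x


# Model 0 is usually right, model 1 is usually wrong; the true bit is always 1.
mixer = LogisticMixer(2)
first = mixer.predict([0.9, 0.1])
for _ in range(200):
    mixer.predict([0.9, 0.1])
    mixer.update(1)
final = mixer.predict([0.9, 0.1])
```

In this toy run the mixer starts undecided and learns to trust the first model, so `final` ends up well above `first`. The mixed probability is what an arithmetic coder would then use to encode the bit.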

The archive format (v2.0) uses the `.cms` extension with a `CMSARCHIVE` magic header, per-file CRC32 verification, CRC-based deduplication, solid-block compression, and metadata preservation (timestamps, permissions, symlinks, NTFS ACLs). Post-compression records (recovery, signature) are backward-compatible: older decompressors ignore them. See FORMAT.md for the full wire format specification.
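
The CRC32 checks and CRC-based deduplication mentioned above can be sketched as follows. This is a hedged illustration only: it assumes the standard CRC-32 polynomial as implemented by `zlib.crc32` and file-level deduplication, and it confirms candidate duplicates with a byte comparison because CRC32 values can collide. FORMAT.md remains the authority on what the archive actually stores. The function names are hypothetical.

```python
import zlib


def crc32_of(data: bytes) -> int:
    """CRC-32 of data as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def verify(data: bytes, expected_crc: int) -> bool:
    """True if the extracted bytes match the checksum recorded at compression time."""
    return crc32_of(data) == expected_crc


def find_duplicates(files: dict[str, bytes]) -> dict[str, str]:
    """Map each duplicate file name to the earlier file with identical content."""
    seen: dict[tuple[int, int], list[str]] = {}  # (crc, length) -> distinct files
    duplicates: dict[str, str] = {}
    for name, data in files.items():
        key = (crc32_of(data), len(data))
        for candidate in seen.setdefault(key, []):
            if files[candidate] == data:  # confirm: equal CRCs do not prove equal content
                duplicates[name] = candidate
                break
        else:
            seen[key].append(name)
    return duplicates


files = {"a.txt": b"hello", "b.txt": b"world", "c.txt": b"hello"}
print(find_duplicates(files))
```

A deduplicating archiver would store the content of `a.txt` once and record `c.txt` as a reference to it.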

Benchmarking

# AOT benchmark (3 runs on enwik8, builds AOT internally)
powershell -File bench_aot.ps1 -Runs 3 -Label TEST -HighPriority

# JIT benchmark
powershell -File bench.ps1 -Runs 3 -Label TEST

Credits

Mathieu Chartier - Original MCM compressor
Matt Mahoney - PAQ compression research and squash/stretch tables
Sami Runsas - NanoZip compressor (inspiration for adaptive model selection)
Eugene Shelwien - PPM-D implementation (mod_ppmd_v2)
Stephan Busch - Contributions to MCM
Christopher Mattern - Contributions to MCM
Dirk Steinke - Preflate DEFLATE reconstruction (Apache 2.0)
Thomas Witteborn - Reed-Solomon implementation (Witteborn.ReedSolomon, MIT license)
Oleg Stepanischev - ZstdSharp C# zstd port (BSD)
Keef Aragon - Konscious.Security.Cryptography Argon2id (MIT)
