bobjase/mcmSharp

CMSharp

A high-performance lossless data compressor using context mixing, written in C#. Originally a port of MCM (Mathieu's Compression Method) by Mathieu Chartier.

CMSharp achieves strong compression ratios through adaptive context modeling with up to 8 specialized models mixed via a neural network, combined with preprocessing (dictionary, base64, CSV/JSON columnar transforms, deduplication, x86 BCJ, preflate, JPEG DCT, PNG defiltering) and arithmetic coding. Archives support optional AES-256-GCM encryption, Reed-Solomon recovery records, ECDSA P-256 digital signing, multi-volume splitting, and DEFLATE reconstruction (preflate) for ZIP containers.
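
The mixing step described above can be sketched as follows. This is a generic single-layer logistic mixer in the style of the PAQ family, written in Python for brevity. It is not CMSharp's implementation, and the class name, learning rate, and initial weights are assumptions chosen for illustration.

```python
import math


def stretch(p):
    """Logit: map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def squash(x):
    """Logistic function, the inverse of stretch."""
    return 1.0 / (1.0 + math.exp(-x))


class LogisticMixer:
    """Weighted sum of stretched model predictions (hypothetical sketch)."""

    def __init__(self, n_inputs, learning_rate=0.02):
        # Assumed starting point: every model gets the same weight.
        self.weights = [1.0 / n_inputs] * n_inputs
        self.learning_rate = learning_rate
        self.inputs = [0.0] * n_inputs
        self.p = 0.5

    def mix(self, probabilities):
        """Combine per-model P(next bit = 1) into a single prediction.

        Each probability must lie strictly between 0 and 1.
        """
        self.inputs = [stretch(p) for p in probabilities]
        dot = sum(w * x for w, x in zip(self.weights, self.inputs))
        self.p = squash(dot)
        return self.p

    def update(self, bit):
        """Gradient step on coding cost once the actual bit is known."""
        error = bit - self.p
        self.weights = [
            w + self.learning_rate * error * x
            for w, x in zip(self.weights, self.inputs)
        ]
```

In use, the compressor would call `mix` before coding each bit and `update` afterwards, so models that predict well gradually receive larger weights.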

Performance

On enwik8 (100 MB of Wikipedia text), AOT-compiled, running on Intel Arrow Lake-S (36 MB L3):

| Mode | Compressed Size | Time | Memory |
|---|---|---|---|
| Sequential (-m6) | 18,288 KB (1.498 bpb) | ~23s | ~160 MB |
| Parallel (-p, auto prewarm) | ~19,750 KB | ~6s | ~160 MB per thread |
| Fast (-cL) | ~20,500 KB | ~5s | ~40 MB per thread |
| Zstd (-cZ) | 26,967 KB (2.209 bpb) | ~2.5s | ~50 MB |
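
The bpb (bits per byte) figures above can be reproduced from the compressed sizes. The short Python check below assumes 1 KB = 1024 bytes and uses the enwik8 size of 100,000,000 bytes; the function name is illustrative.

```python
def bits_per_byte(compressed_kb, original_bytes=100_000_000):
    """Compressed bits per input byte, taking 1 KB = 1024 bytes."""
    return compressed_kb * 1024 * 8 / original_bytes


print(f"{bits_per_byte(18_288):.3f}")  # 1.498 (Sequential, -m6)
print(f"{bits_per_byte(26_967):.3f}")  # 2.209 (Zstd, -cZ)
```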

Project Structure

CMSharp.sln
  src/CMSharp/            Core library (NuGet package)
    Core/                 Context mixer, hash table, range coder, BinaryIO utilities
    Models/               Word, bracket, match, interval, special char, PPM-D models + ProbMap
    Compression/          Archive format, compressor, parallel compressor, data detection
    Preprocessing/        Dictionary, Base64, CSV/JSON columnar, LZP, dedup, X86 BCJ, preflate
    Postprocessing/       AES-256-GCM encryption, Reed-Solomon recovery, ECDSA signing
  src/CMSharp.Cli/        Command-line application
  src/CMSharp.CUDA/       CUDA Mamba-2 neural predictor (optional, requires CUDA GPU)
  src/CMSharp.ReedSolomon/ Reed-Solomon erasure coding (vendored from Witteborn.ReedSolomon)
  src/CMSharp.Preflate/   Preflate DEFLATE reconstruction engine
  src/CMSharp.Tests/      Unit tests (569 tests)

Build

Requires .NET 10 SDK.

# JIT build
dotnet build CMSharp.sln -c Release

# AOT build (recommended, ~10% faster)
dotnet publish src/CMSharp.Cli/CMSharp.Cli.csproj -c Release -p:PublishAot=true -o publish_aot

# Run tests
dotnet test src/CMSharp.Tests -c Release

Usage

cmsharp <command> [options] <inputs...> [-o output]

Commands

| Command | Description |
|---|---|
| a | Create or add to archive |
| x | Extract archive |
| l | List archive contents |
| t | Test archive integrity (decompress + verify CRC32) |
| r | Repair damaged archive using recovery record |
| bench (or b) | Benchmark compress + decompress N times |
| sweep (or s) | Compress-only, machine-parseable output |

Compression Tiers

| Option | Description |
|---|---|
| -c0 | Store only (no compression) |
| -cZ | Zstd backend (fast, good compression) |
| -cL | Limited CM (4 models, faster) |
| -cS | Standard CM (7 models, default) |
| -cX | Extended CM (all models, best compression) |

Common Options

| Option | Description |
|---|---|
| -m[N] | Memory level (1-10, default 6). Higher = better compression, more RAM |
| -f[N] | Fast mode (N=2-7 models). Fine-grained model count control |
| -p[N] | Parallel compression (N=max threads, default all cores) |
| -ppm | Enable PPM-D model (slower, ~0.7% smaller) |
| -adaptive | Adaptive model selection (profile 1MB, prune to optimal set) |
| -e | Encrypt archive with AES-256-GCM (prompts for password) |
| -rr[:N] | Add recovery record (N=overhead %, default 0.25) |
| -sign:FILE | Sign archive with ECDSA P-256 private key PEM |
| -verify | Verify signature on extraction (embedded public key) |
| -vol:SIZE | Split into volumes (e.g., -vol:100M, -vol:4.7G) |
| -acl | Preserve NTFS ACLs (Windows) |
| -o <path> | Output path (default: <input>.cms for a, current dir for x) |
| -test | Verify decompression after compression |
| -v | Verbose output |

Encryption Options

| Option | Description |
|---|---|
| -e | Encrypt with AES-256-GCM; prompts for password interactively |
| --password:VALUE | Set password inline (no prompt) |

Encryption uses Argon2id key derivation (64 MB memory, OWASP 2023 recommended parameters). Decryption auto-detects encrypted archives and prompts for the password if needed.

Recovery Record Options

| Option | Description |
|---|---|
| -rr | Add Reed-Solomon recovery record (0.25% overhead) |
| -rr:N | Custom overhead percentage (0.1-50) |

Recovery records enable the r command to repair bit-rot or partial corruption. The t command reports recovery record status and recoverability.
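
To illustrate how redundant data makes repair possible, here is a minimal Python sketch using single-block XOR parity. This is a deliberately simpler technique than the Reed-Solomon coding CMSharp uses: it can rebuild only one lost block whose position is known, whereas Reed-Solomon codes extend the idea to several damaged blocks. All function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(blocks):
    """The parity block is the XOR of all data blocks."""
    return xor_blocks(blocks)


def repair(blocks, parity, lost_index):
    """Rebuild the block at lost_index from the survivors and the parity."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_index]
    return xor_blocks(survivors + [parity])


blocks = [b"abcd", b"efgh", b"ijkl"]
parity = make_parity(blocks)
restored = repair(blocks, parity, lost_index=1)  # pretend block 1 was damaged
```

XOR-ing the surviving blocks with the parity cancels everything except the missing block, which is why `restored` equals the original second block.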

Digital Signing Options

| Option | Description |
|---|---|
| -sign:FILE | Sign archive with ECDSA P-256 private key PEM |
| -verify | Verify signature on extraction (embedded public key) |
| -verify:FILE | Verify against specific public key PEM |

Signing embeds the public key in the archive for self-contained verification. Generate keys with OpenSSL:

openssl ecparam -genkey -name prime256v1 -noout -out private.pem
openssl ec -in private.pem -pubout -out public.pem

Multi-Volume Options

| Option | Description |
|---|---|
| -vol:SIZE | Split archive into volumes (e.g., -vol:100M, -vol:4.7G, -vol:650K) |

Volumes carry the suffixes .001, .002, and so on, and are reassembled automatically on extraction.
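
As a rough illustration of volume splitting, the Python sketch below parses a SIZE value and lists the resulting volume names. Two details are assumptions not confirmed by this README: that K/M/G are binary units (1024-based), and that the numeric suffix is appended to the full archive name. The function names are hypothetical.

```python
import math

# Assumption: K, M, and G are binary multiples.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text):
    """Convert strings such as '100M', '4.7G', or '650K' to a byte count."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(float(text[:-1]) * UNITS[suffix])
    return int(text)


def volume_names(archive_name, archive_bytes, volume_bytes):
    """List the volume file names needed to hold archive_bytes."""
    count = max(1, math.ceil(archive_bytes / volume_bytes))
    # Assumption: the suffix is appended to the archive name.
    return [f"{archive_name}.{i:03d}" for i in range(1, count + 1)]


names = volume_names("backup.cms", 250 * 1024**2, parse_size("100M"))
```

Under these assumptions, a 250 MB archive split with `-vol:100M` would need three volumes.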

Preprocessing Options

| Option | Description |
|---|---|
| -dT / -dB / -dI | Force text/binary/image mode (skip auto-detection) |
| -nodict | Skip dictionary preprocessing |
| -nobase64 | Skip inline base64 decoding |
| -nojson | Skip JSON columnar transform |
| -nocsv | Skip CSV columnar transform |
| -nox86 | Skip x86 BCJ filter for binary data |
| -nopreflate | Skip preflate preprocessing for ZIP containers |
| -nojpeg | Skip JPEG DCT coefficient extraction |
| -nopng | Skip PNG preprocessing (preflate-only: decompress IDAT) |
| -pngfull | Full PNG preprocessing with defiltering and channel separation |
| -dedup | Enable duplicate block removal |
| -dict1:N | Max 1-byte dictionary codes (1-125, default 40) |
| -dictmin:N | Minimum word frequency for dictionary inclusion (default 5) |
| -dictprefix:N | Min prefix length for substring matching (1-20, default 6) |
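
The x86 BCJ filter that -nox86 disables rewrites relative CALL/JMP operands as absolute addresses, so repeated calls to the same target become identical byte sequences. The Python sketch below shows the general idea only. It is a simplified, assumed variant (it converts every E8/E9 opcode it encounters, with none of the heuristics production filters apply) and is not CMSharp's code.

```python
import struct


def bcj_x86(data, encode=True):
    """Convert rel32 operands after E8/E9 to absolute form, or back."""
    out = bytearray(data)
    i = 0
    while i + 5 <= len(out):
        if out[i] in (0xE8, 0xE9):  # CALL rel32 / JMP rel32
            (value,) = struct.unpack_from("<I", out, i + 1)
            next_ip = i + 5  # x86 targets are relative to the next instruction
            if encode:
                value = (value + next_ip) & 0xFFFFFFFF
            else:
                value = (value - next_ip) & 0xFFFFFFFF
            struct.pack_into("<I", out, i + 1, value)
            i += 5  # skip the operand so both directions scan identically
        else:
            i += 1
    return bytes(out)


# Two calls to the same target (offset 0x100) from different positions.
code = (
    b"\xE8" + struct.pack("<i", 0x100 - 5)
    + b"\x90" * 3
    + b"\xE8" + struct.pack("<i", 0x100 - 13)
)
filtered = bcj_x86(code, encode=True)
```

After filtering, both operands hold the same absolute value, which gives the downstream models a repeated pattern to exploit. Calling `bcj_x86(filtered, encode=False)` restores the original bytes.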

Other Options

| Option | Description |
|---|---|
| -w[N] | Override prewarm size in KB (default: auto, implies -p) |
| -runs:N | Number of benchmark runs (default 3) |
| -mamba | Enable CUDA Mamba-2 neural predictor (requires CUDA GPU) |
| --max-extract-size:N | Reject archives claiming more than N bytes (safety limit) |
| --enableLargePages | Grant Lock Pages in Memory privilege (Windows, admin) |

Examples

# Compress a file
cmsharp a input.txt

# Compress a directory
cmsharp a mydir/

# Compress multiple inputs to named archive
cmsharp a -o archive.cms file1.txt file2.txt dir/

# Add files to existing archive
cmsharp a -o archive.cms newfile.txt

# Zstd compression (fast)
cmsharp a -cZ input.txt

# Limited CM (4 models, faster)
cmsharp a -cL input.txt

# Best compression (extended models, high memory)
cmsharp a -cX -m8 input.txt

# Parallel CM compression
cmsharp a -p input.txt

# Encrypt with recovery record
cmsharp a -e -rr:1 input.txt

# Sign an archive
cmsharp a -sign:private.pem input.txt

# Split into 100MB volumes
cmsharp a -vol:100M largefile.bin

# Extract
cmsharp x archive.cms -o outdir/

# Extract with signature verification
cmsharp x archive.cms -o outdir/ -verify

# Test archive integrity
cmsharp t archive.cms

# Repair damaged archive
cmsharp r archive.cms

# Benchmark (5 runs, parallel)
cmsharp bench -runs:5 -p input.txt

# Piping
cmsharp a -                          # compress stdin -> stdout
cmsharp a - -o out.cms               # compress stdin -> file
cmsharp x input.cms -o -             # decompress -> stdout

Large Pages (Windows)

For ~8% faster compression, enable 2 MB large pages:

cmsharp --enableLargePages   # Run once (triggers UAC), then log out/in

How It Works

flowchart TD
    A[Input Files] --> B[Profile Detection<br/>text / binary / image]
    B --> C[File Ordering<br/>group by type + extension]

    C --> D{Data Type?}

    D -->|Text| T1[Base64 Decode<br/>inline blocks → raw]
    T1 --> T2[JSON/CSV Columnar<br/>rows → columns]
    T2 --> T3[Dictionary<br/>words → short codes]

    D -->|Binary| B1[Preflate<br/>ZIP DEFLATE → raw]
    B1 --> B2[X86 BCJ Filter<br/>rel→abs addresses]

    D -->|Image| I1[JPEG DCT Extraction<br/>coefficients → stream]
    D -->|Image| I2[PNG Defilter<br/>IDAT → raw pixels]

    D -->|Optional| O1[Deduplication<br/>remove duplicate blocks]

    T3 --> E{Backend}
    B2 --> E
    I1 --> E
    I2 --> E
    O1 --> E

    E -->|Default| CM[Context Mixing<br/>7 models + neural mixer<br/>+ SSE + arithmetic coding]
    E -->|-cZ| ZS[Zstd<br/>LZ77 + Huffman/FSE]

    CM --> F[.cms Archive]
    ZS --> F

    F -.->|Optional| G1[AES-256-GCM Encryption]
    G1 -.-> G2[Reed-Solomon Recovery]
    G2 -.-> G3[ECDSA P-256 Signature]
    G3 -.-> G4[Multi-Volume Split]
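
The "rows → columns" step in the flowchart can be pictured with a small Python sketch. It assumes a rectangular CSV with no quoted fields and no trailing newline, and the function names are hypothetical. CMSharp's actual transform is not documented in this README and will differ in detail.

```python
def to_columnar(csv_text, delimiter=","):
    """Transpose a rectangular CSV so each output line holds one column."""
    rows = [line.split(delimiter) for line in csv_text.splitlines()]
    if not rows:
        return ""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("rows have differing field counts")
    return "\n".join(delimiter.join(column) for column in zip(*rows))


def from_columnar(columnar_text, delimiter=","):
    """Transposing a second time restores the original row layout."""
    return to_columnar(columnar_text, delimiter)


table = "id,name\n1,ann\n2,bob"
columns = to_columnar(table)  # "id,1,2\nname,ann,bob"
```

Grouping values from the same column puts similar data side by side (all ids, then all names), which tends to make the stream easier for context models to predict.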

The archive format (v2.0) uses the .cms extension and a CMSARCHIVE magic header. It provides per-file CRC32 verification, CRC-based deduplication, solid-block compression, and metadata preservation (timestamps, permissions, symlinks, NTFS ACLs). Post-compression records (recovery, signature) are backward-compatible: older decompressors ignore them. See FORMAT.md for the full wire format specification.
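
Per-file CRC32 verification and CRC-based deduplication can be sketched in Python with the standard `zlib` module. The function names are hypothetical, and the byte-for-byte comparison on a checksum match is an assumption added here to guard against collisions. FORMAT.md is the authority on what CMSharp actually stores and checks.

```python
import zlib


def crc32_of(data):
    """CRC32 of a byte string as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def verify(data, expected_crc):
    """True when extracted data matches the checksum stored for it."""
    return crc32_of(data) == expected_crc


def dedup_by_crc(files):
    """Split {name: bytes} into unique contents and aliases to earlier files."""
    seen = {}  # checksum -> first file name stored with that checksum
    unique, aliases = {}, {}
    for name, data in files.items():
        crc = crc32_of(data)
        first = seen.get(crc)
        # Assumption: confirm equality, since different data can share a CRC.
        if first is not None and unique[first] == data:
            aliases[name] = first
        else:
            seen.setdefault(crc, name)
            unique[name] = data
    return unique, aliases


unique, aliases = dedup_by_crc({"a.txt": b"x", "b.txt": b"y", "copy.txt": b"x"})
```

Here `copy.txt` is recorded as an alias of `a.txt`, so its contents would only need to be stored once.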

Benchmarking

# AOT benchmark (3 runs on enwik8, builds AOT internally)
powershell -File bench_aot.ps1 -Runs 3 -Label TEST -HighPriority

# JIT benchmark
powershell -File bench.ps1 -Runs 3 -Label TEST

Credits

  • Mathieu Chartier - Original MCM compressor
  • Matt Mahoney - PAQ compression research and squash/stretch tables
  • Sami Runsas - NanoZip compressor (inspiration for adaptive model selection)
  • Eugene Shelwien - PPM-D implementation (mod_ppmd_v2)
  • Stephan Busch - Contributions to MCM
  • Christopher Mattern - Contributions to MCM
  • Dirk Steinke - Preflate DEFLATE reconstruction (Apache 2.0)
  • Thomas Witteborn - Reed-Solomon implementation (Witteborn.ReedSolomon, MIT license)
  • Oleg Stepanischev - ZstdSharp C# zstd port (BSD)
  • Keef Aragon - Konscious.Security.Cryptography Argon2id (MIT)

About

LLM-assisted C# port of Mathieu Chartier's MCM compressor
