loxc

Train a compression model on your text. Ship it. Compress and decompress at hardware speed.


A trainable, frequency-optimized text codec in pure C99. Made for embedded systems, log pipelines, and any domain-specific text.

At a glance

Input:             "The quick brown fox jumps over the lazy dog"  (43 bytes)
Encoded:           29 bytes  (67% of original)
Decode:            < 0.1 ms
Compression ratio: ~60% on the bundled sample text corpus
Decode speed:      ~8x faster than encode
Decoder size:      ~5 KB compiled
Dependencies:      none (pure C99)

Language agnostic

loxc has no built-in language assumptions. The included demo module is trained on a public-domain text sample (Pride and Prejudice) for testing purposes only. For your own data:

  • Natural-language text (Slovak, Czech, Polish, ...): train on a corpus in that language
  • JSON, XML, and log lines: train on representative samples of your format
  • URLs and file paths: train on a representative set
  • Source code: train on files from your codebase

The compression algorithm works on bytes, not characters or words. Any text-like data with repeated patterns can compress well when the module is trained on a matching corpus.

Three lines to compress text

loxc_ctx_t *ctx = loxc_open("modules/loxc_demo.loxctab");
loxc_buffer_t out = loxc_compress_buffer(ctx, "Hello world!", 12, 0);
loxc_close(ctx);

That's it. out.data now holds the compressed bytes; call loxc_buffer_free(&out) when you are done with them.

See full examples -> | 5-minute tutorial ->

Why loxc?

Built for

  • Domain-specific text: JSON APIs, log lines, URL paths, localization files
  • Embedded systems: small decoder, no heavyweight runtime dependencies
  • Repeated payloads: train once on your corpus, compress millions of similar messages
  • Predictable latency: decode is table-driven, not entropy-decoder heavy

Not for

  • Archival compression, where zstd or brotli will give better ratios
  • Random-access string storage in databases, where FSST is a better fit
  • One-shot compression of unknown text, where gzip or zstd is simpler

How it works

TRAINING (offline, once per corpus)

your_corpus.txt  -->  loxc_train  -->  mytable.loxctab
    |
    +-- Counts byte frequencies
    +-- Extracts repeated phrases ("the", "and", "that", ...)
    +-- Greedy filter keeps only entries that reduce total output size
    `-- Picks best strategy:
        FLAT   (fixed-width)
        HIER4  (4x4 matrix with escapes)
        HIER8  (8x8 matrix with escapes)

RUNTIME (online, many times)

input text  -->  [encode via lookup tables]  -->  .loxc file
.loxc file  -->  [decode via lookup tables]  -->  output text

The hierarchical matrix idea

Top-frequency symbols live in a 6-bit grid in HIER8:

Position 0-55:  direct symbol (6 bits)
Position 56-63: ESCAPE -> read 6 more bits for the next grid

Frequent symbols:  [space] [e] [t] [o]   -> 6 bits each
Less frequent:     [q] [z] [x]           -> 12 bits each
Rare:                                      18 bits each

This is mathematically close to an adaptive (56,8)-Dense Code with auto-selected parameters per corpus.

Benchmarks

Measured with make bench-full on the current benchmark suite (ratio = compressed size as a percentage of the original, so lower is better):

  • CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
  • OS: Linux 6.6.87.2-microsoft-standard-WSL2
  • Compiler: cc 13.3.0
  • Iterations: 100 after warmup

Bundled sample-text module

File                                         Mode            Ratio  Encode                Decode
trainings/demo_corpus.txt (720.7 KiB)        loxc-ext(demo)  60.8%  114.55 ms / 6.1 MB/s  13.72 ms / 51.3 MB/s
benchmarks/plain_sample_text.txt (29.3 KiB)  loxc-ext(demo)  62.2%  4.80 ms / 6.0 MB/s    0.55 ms / 51.8 MB/s

Domain-tuned modules

Domain       Table size  Held-out file                      Ratio  Decode throughput
sample-text  1985 B      benchmarks/plain_sample_text.txt   62.2%  51.8 MB/s
json         1497 B      benchmarks/corpora/json_test.json  62.9%  77.7 MB/s
csrc         1447 B      benchmarks/corpora/csrc_test.c     45.1%  83.6 MB/s

Same-run baseline context

File                                Tool            Ratio  Decode throughput
trainings/demo_corpus.txt           gzip:6          36.0%  73.1 MB/s
trainings/demo_corpus.txt           xz:6            29.3%  40.4 MB/s
trainings/demo_corpus.txt           loxc-ext(demo)  60.8%  51.3 MB/s
benchmarks/corpora/text_524288.txt  gzip:6          36.0%  62.4 MB/s
benchmarks/corpora/text_524288.txt  loxc-ext(demo)  60.7%  52.3 MB/s

These numbers are measured on the same machine and reflect the current implementation. Baseline tools are invoked through their CLI, so small-file latency is dominated by process launch overhead; the larger corpus rows are the meaningful throughput comparison points. On the larger text slices, loxc decode throughput stays roughly flat with input size, while LZ-based codecs like gzip trend downward as their state-management overhead becomes more visible.

Full benchmark details ->

Quick start

Build

git clone https://github.com/Vanderhell/loxc
cd loxc && make

Use the CLI

./tools/loxc_cli compress \
    --table modules/loxc_demo.loxctab --embed \
    your_file.txt your_file.loxc

./tools/loxc_cli decompress your_file.loxc restored.txt

Use as a library

#include "loxc_simple.h"
#include <stdio.h>
#include <string.h>

int main(void) {
    loxc_ctx_t *ctx = loxc_open("modules/loxc_demo.loxctab");
    const char *text = "compress me";
    loxc_buffer_t out = loxc_compress_buffer(ctx, text, strlen(text), 0);

    printf("Original: %zu bytes, Compressed: %zu bytes\n",
           strlen(text), out.size);

    loxc_buffer_free(&out);
    loxc_close(ctx);
    return 0;
}
cc -Iinclude -Imodules myapp.c libloxc.a -o myapp && ./myapp

Train on your own data

./tools/loxc_train \
    --input your_data.txt \
    --output modules/loxc_mytable \
    --module-name mytable --module-id 50

Full tutorial -> | Cookbook ->

Examples

Working code in examples/:

#  File                    Shows
1  01_hello_world.c        Smallest possible usage
2  02_compress_file.c      File operations with timing
3  03_embedded_mode.c      Self-contained .loxc files
4  04_error_handling.c     Error handling paths
5  05_training_pipeline.c  Train and use a custom module
6  06_compare_modes.c      External vs embedded size tradeoff
7  07_streaming_chunks.c   Current large-file workaround

Run them with make examples && ./examples/01_hello_world.

API

Simple API (recommended)

loxc_ctx_t *loxc_open(const char *table_path);
void        loxc_close(loxc_ctx_t *ctx);

int loxc_compress_file(loxc_ctx_t *ctx, const char *in_path,
                       const char *out_path, int embed_table);
int loxc_decompress_file(loxc_ctx_t *ctx, const char *in_path,
                         const char *out_path);

loxc_buffer_t loxc_compress_buffer(loxc_ctx_t *ctx,
                                   const void *data, size_t len,
                                   int embed_table);
loxc_buffer_t loxc_decompress_buffer(loxc_ctx_t *ctx,
                                     const void *data, size_t len);
void          loxc_buffer_free(loxc_buffer_t *buf);

const char   *loxc_strerror(int code);

Full API reference ->

Advanced API

For direct registry and buffer control, see docs/API.md#advanced-api.

Architecture

+-----------------------------------------------------------+
| Application                                               |
|  +-----------------------------------------------+        |
|  | loxc_simple.h  (recommended)                  |        |
|  | loxc.h         (low-level)                    |        |
|  +-----------------------------------------------+        |
+--------------------------+--------------------------------+
                           |
                           v
+-----------------------------------------------------------+
| libloxc.a                                                 |
|  +--------------+  +---------------+  +----------------+ |
|  | Strategy     |  | Hierarchical  |  | Stream Reader/ | |
|  | Selector     |  | Encoder/      |  | Writer         | |
|  |              |  | Decoder       |  |                | |
|  +--------------+  +---------------+  +----------------+ |
|  +--------------+  +---------------+                     |
|  | Dictionary   |  | Module        |                     |
|  | Filter       |  | Registry      |                     |
|  +--------------+  +---------------+                     |
+--------------------------+--------------------------------+
                           |
                           v
+-----------------------------------------------------------+
| Module tables (.loxctab files)                            |
| Hardcoded C modules or runtime-loaded portable tables     |
+-----------------------------------------------------------+

Full architecture ->

File structure

include/    public headers
src/        library implementation
tools/      loxc_train, loxc_cli, loxc_bench
tests/      unit tests
modules/    generated modules (git-ignored)
benchmarks/ benchmark inputs
trainings/  training data
examples/   runnable example programs
docs/       implementation documentation

Roadmap

  • v0.1.0 - Initial release
  • v0.2.0 - Benchmark suite + documentation overhaul
  • v0.2.4 - Release workflow + documentation fixes
  • v0.2.x - Huffman strategy (planned)
  • v0.3.0 - Multi-module support, streaming API
  • v1.0.0 - Production-stable release

Related work

Detailed comparison with Dense Codes, FSST, zstd dictionary mode, and Shared Brotli ->

Briefly, loxc is not a new compression principle. It is a practical recombination of:

  • Dense-code-like prefix structure
  • Learned per-corpus symbol tables
  • Trained dictionary deployment
  • External or embedded packaging

License

MIT - see LICENSE

Contributing

PRs welcome. See CONTRIBUTING.md.

Questions or bug reports: open an issue.
