Train a compression model on your text. Ship it. Compress and decompress at hardware speed.
A trainable, frequency-optimized text codec in pure C99. Made for embedded systems, log pipelines, and any domain-specific text.
```
Input:    "The quick brown fox jumps over the lazy dog"  (43 bytes)
Encoded:  29 bytes (67% of original)
Decode:   < 0.1 ms
```
- Compression ratio: ~60% on the bundled sample text corpus
- Decode speed: ~8x faster than encode
- Decoder size: ~5 KB compiled
- Dependencies: none (pure C99)
loxc has no built-in language assumptions. The included demo module is
trained on a public-domain text sample (Pride and Prejudice) for testing
purposes only. For your own data:
- Slovak, Czech, Polish, and other natural language corpora: train on your corpus
- JSON, XML, and log lines: train on representative samples of your format
- URLs and file paths: train on a representative set
- Source code: train on files from your codebase
The compression algorithm works on bytes, not characters or words. Any text-like data with repeated patterns can compress well when the module is trained on a matching corpus.
```c
loxc_ctx_t *ctx = loxc_open("modules/loxc_demo.loxctab");
loxc_buffer_t out = loxc_compress_buffer(ctx, "Hello world!", 12, 0);
loxc_close(ctx);
```

That's it. `out.data` now holds the compressed bytes.
See full examples -> | 5-minute tutorial ->
- Domain-specific text: JSON APIs, log lines, URL paths, localization files
- Embedded systems: small decoder, no heavyweight runtime dependencies
- Repeated payloads: train once on your corpus, compress millions of similar messages
- Predictable latency: decode is table-driven, not entropy-decoder heavy
- Archival compression, where `zstd` or `brotli` will give better ratios
- Random-access string storage in databases, where `FSST` is a better fit
- One-shot compression of unknown text, where `gzip` or `zstd` is simpler
```
TRAINING (offline, once per corpus)

your_corpus.txt --> loxc_train --> mytable.loxctab
                        |
                        +-- Counts byte frequencies
                        +-- Extracts repeated phrases ("the", "and", "that", ...)
                        +-- Greedy filter keeps only entries that reduce total output size
                        `-- Picks best strategy:
                              FLAT  (fixed-width)
                              HIER4 (4x4 matrix with escapes)
                              HIER8 (8x8 matrix with escapes)
```
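The "greedy filter" step above can be sketched with a simple cost model: a candidate phrase earns its table slot only if the output bytes it saves across all occurrences exceed the bytes its table entry costs. This is an illustrative model with hypothetical names (`phrase_cand_t`, `keep_phrase`, `entry_overhead`), not loxc's actual heuristic:

```c
#include <string.h>

/* Hypothetical cost model for the greedy dictionary filter.
 * code_bytes:     bytes a dictionary reference costs in the output.
 * entry_overhead: fixed per-entry metadata cost in the table. */
typedef struct { const char *phrase; int count; } phrase_cand_t;

int keep_phrase(const phrase_cand_t *c, int code_bytes, int entry_overhead) {
    int len   = (int)strlen(c->phrase);
    int saved = c->count * (len - code_bytes); /* output bytes saved */
    int cost  = len + entry_overhead;          /* table bytes spent  */
    return saved > cost;
}
```

With a 1-byte code and 2 bytes of overhead, "the" seen 500 times saves 1000 bytes against a 5-byte cost and is kept, while a phrase seen only twice saves 4 bytes and is dropped.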
```
RUNTIME (online, many times)

input text --> [encode via lookup tables] --> .loxc file
.loxc file --> [decode via lookup tables] --> output text
```
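Both runtime arrows are plain table lookups. As a toy illustration of the FLAT (fixed-width) idea, the sketch below trains a table on a tiny corpus and packs an alphabet of at most 16 distinct bytes at 4 bits per symbol. All names here (`flat_table_t`, `flat_train`, `flat_encode`, `flat_decode`) are hypothetical; this is not the real `.loxctab`/`.loxc` format.

```c
#include <stddef.h>

/* Toy FLAT codec: if training found at most 16 distinct bytes, every
 * symbol fits in 4 bits, so two symbols pack into each output byte. */
typedef struct {
    unsigned char sym[16]; /* code -> byte                */
    int code[256];         /* byte -> code, -1 if unknown */
    int n;
} flat_table_t;

void flat_train(flat_table_t *t, const unsigned char *corpus, size_t len) {
    t->n = 0;
    for (int i = 0; i < 256; i++) t->code[i] = -1;
    for (size_t i = 0; i < len; i++)
        if (t->code[corpus[i]] < 0 && t->n < 16) {
            t->code[corpus[i]] = t->n;
            t->sym[t->n++] = corpus[i];
        }
}

/* Returns packed size; every input byte must appear in the corpus. */
size_t flat_encode(const flat_table_t *t, const unsigned char *in,
                   size_t len, unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i < len; i += 2) {
        int hi = t->code[in[i]];
        int lo = (i + 1 < len) ? t->code[in[i + 1]] : 0; /* pad last nibble */
        out[o++] = (unsigned char)((hi << 4) | lo);
    }
    return o;
}

size_t flat_decode(const flat_table_t *t, const unsigned char *in,
                   size_t orig_len, unsigned char *out) {
    for (size_t o = 0; o < orig_len; o++)
        out[o] = t->sym[(o & 1) ? (in[o / 2] & 0xF) : (in[o / 2] >> 4)];
    return orig_len;
}
```

Trained on "hello world" (8 distinct bytes), the 11-byte input packs into 6 bytes and round-trips exactly.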
Top-frequency symbols live in a 6-bit grid in HIER8:
```
Position 0-55:   direct symbol (6 bits)
Position 56-63:  ESCAPE -> read 6 more bits for the next grid

Frequent symbols:  [space] [e] [t] [o]  -> 6 bits each
Less frequent:     [q] [z] [x]          -> 12 bits each
Rare:                                   -> 18 bits each
```
This is mathematically close to an adaptive (56,8)-Dense Code with auto-selected parameters per corpus.
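The escape chain fixes each symbol's code length by its frequency rank: 56 direct slots in the top grid, then 8 escapes, each opening another grid. A back-of-the-envelope sketch of the resulting bit costs, assuming a 56-direct/8-escape layout at every level (the library's exact table geometry may differ, and `hier8_code_bits` is a made-up name):

```c
/* Bits to code a symbol at frequency rank r, assuming every grid has
 * 56 direct slots plus 8 escapes and each escape adds 6 bits. */
int hier8_code_bits(int rank) {
    if (rank < 56)          return 6;   /* top grid, direct       */
    if (rank < 56 + 8 * 56) return 12;  /* one escape, then grid  */
    return 18;                          /* two escapes            */
}
```

So a space at rank 0 costs 6 bits, a `q` somewhere past rank 55 costs 12, and a byte that almost never occurs costs 18, matching the figures above.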
Measured with `make bench-full` on the current benchmark suite:

- CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
- OS: Linux 6.6.87.2-microsoft-standard-WSL2
- Compiler: cc 13.3.0
- Iterations: 100 after warmup
| File | Mode | Ratio | Encode | Decode |
|---|---|---|---|---|
| trainings/demo_corpus.txt (720.7 KiB) | loxc-ext(demo) | 60.8% | 114.55 ms / 6.1 MB/s | 13.72 ms / 51.3 MB/s |
| benchmarks/plain_sample_text.txt (29.3 KiB) | loxc-ext(demo) | 62.2% | 4.80 ms / 6.0 MB/s | 0.55 ms / 51.8 MB/s |
| Domain | Table size | Held-out file | Ratio | Decode throughput |
|---|---|---|---|---|
| sample-text | 1985 B | benchmarks/plain_sample_text.txt | 62.2% | 51.8 MB/s |
| json | 1497 B | benchmarks/corpora/json_test.json | 62.9% | 77.7 MB/s |
| csrc | 1447 B | benchmarks/corpora/csrc_test.c | 45.1% | 83.6 MB/s |
| File | Tool | Ratio | Decode throughput |
|---|---|---|---|
| trainings/demo_corpus.txt | gzip:6 | 36.0% | 73.1 MB/s |
| trainings/demo_corpus.txt | xz:6 | 29.3% | 40.4 MB/s |
| trainings/demo_corpus.txt | loxc-ext(demo) | 60.8% | 51.3 MB/s |
| benchmarks/corpora/text_524288.txt | gzip:6 | 36.0% | 62.4 MB/s |
| benchmarks/corpora/text_524288.txt | loxc-ext(demo) | 60.7% | 52.3 MB/s |
These numbers are measured on the same machine and reflect the current
implementation. Baseline tools are invoked through their CLI, so small-file
latency is dominated by process launch overhead; the larger corpus rows are the
meaningful throughput comparison points. On the larger text slices, loxc
decode throughput stays roughly flat with input size, while LZ-based codecs
like gzip trend downward as their state-management overhead becomes more
visible.
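As a quick sanity check of a baseline row (no substitute for `make bench-full`), a gzip ratio can be spot-checked with standard tools; the throwaway file, its content, and the line count below are arbitrary choices for illustration:

```shell
# Generate a repetitive throwaway file, then print the gzip -6 ratio
# (compressed size as a percentage of the original, as in the tables).
f=$(mktemp)
yes "the quick brown fox jumps over the lazy dog" | head -n 2000 > "$f"
orig=$(wc -c < "$f")
comp=$(gzip -6 -c "$f" | wc -c)
echo "gzip ratio: $((100 * comp / orig))%"
rm -f "$f"
```

Highly repetitive input like this compresses far better than real corpora, so the printed ratio will be much lower than the 36.0% shown above for natural text.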
```sh
git clone https://github.com/Vanderhell/loxc
cd loxc && make
./tools/loxc_cli compress \
    --table modules/loxc_demo.loxctab --embed \
    your_file.txt your_file.loxc
./tools/loxc_cli decompress your_file.loxc restored.txt
```

From C:

```c
#include "loxc_simple.h"
#include <stdio.h>
#include <string.h>

int main(void) {
    loxc_ctx_t *ctx = loxc_open("modules/loxc_demo.loxctab");
    const char *text = "compress me";
    loxc_buffer_t out = loxc_compress_buffer(ctx, text, strlen(text), 0);
    printf("Original: %zu bytes, Compressed: %zu bytes\n",
           strlen(text), out.size);
    loxc_buffer_free(&out);
    loxc_close(ctx);
    return 0;
}
```

Build and run:

```sh
cc -Iinclude -Imodules myapp.c libloxc.a -o myapp && ./myapp
```

Train your own table:

```sh
./tools/loxc_train \
    --input your_data.txt \
    --output modules/loxc_mytable \
    --module-name mytable --module-id 50
```

Full tutorial -> | Cookbook ->
Working code in examples/:
| # | File | Shows |
|---|---|---|
| 1 | 01_hello_world.c | Smallest possible usage |
| 2 | 02_compress_file.c | File operations with timing |
| 3 | 03_embedded_mode.c | Self-contained .loxc files |
| 4 | 04_error_handling.c | Error handling paths |
| 5 | 05_training_pipeline.c | Train and use a custom module |
| 6 | 06_compare_modes.c | External vs embedded size tradeoff |
| 7 | 07_streaming_chunks.c | Current large-file workaround |

Run them with `make examples && ./examples/01_hello_world`.
```c
loxc_ctx_t *loxc_open(const char *table_path);
void loxc_close(loxc_ctx_t *ctx);

int loxc_compress_file(loxc_ctx_t *ctx, const char *in_path,
                       const char *out_path, int embed_table);
int loxc_decompress_file(loxc_ctx_t *ctx, const char *in_path,
                         const char *out_path);

loxc_buffer_t loxc_compress_buffer(loxc_ctx_t *ctx,
                                   const void *data, size_t len,
                                   int embed_table);
loxc_buffer_t loxc_decompress_buffer(loxc_ctx_t *ctx,
                                     const void *data, size_t len);

void loxc_buffer_free(loxc_buffer_t *buf);
const char *loxc_strerror(int code);
```

For direct registry and buffer control, see docs/API.md#advanced-api.
```
+-----------------------------------------------------------+
|                      Application                          |
|   +-----------------------------------------------+       |
|   |  loxc_simple.h (recommended)                  |       |
|   |  loxc.h (low-level)                           |       |
|   +-----------------------------------------------+       |
+--------------------------+--------------------------------+
                           |
                           v
+-----------------------------------------------------------+
|                        libloxc.a                          |
|  +--------------+ +---------------+ +----------------+    |
|  | Strategy     | | Hierarchical  | | Stream Reader/ |    |
|  | Selector     | | Encoder/      | | Writer         |    |
|  |              | | Decoder       | |                |    |
|  +--------------+ +---------------+ +----------------+    |
|  +--------------+ +---------------+                       |
|  | Dictionary   | | Module        |                       |
|  | Filter       | | Registry      |                       |
|  +--------------+ +---------------+                       |
+--------------------------+--------------------------------+
                           |
                           v
+-----------------------------------------------------------+
|           Module tables (.loxctab files)                  |
|  Hardcoded C modules or runtime-loaded portable tables    |
+-----------------------------------------------------------+
```
```
include/     public headers
src/         library implementation
tools/       loxc_train, loxc_cli, loxc_bench
tests/       unit tests
modules/     generated modules (git-ignored)
benchmarks/  benchmark inputs
trainings/   training data
examples/    runnable example programs
docs/        implementation documentation
```
- v0.1.0 - Initial release
- v0.2.0 - Benchmark suite + documentation overhaul
- v0.2.4 - Release workflow + documentation fixes
- v0.2.x - Huffman strategy (planned)
- v0.3.0 - Multi-module support, streaming API
- v1.0.0 - Production-stable release
Detailed comparison with Dense Codes, FSST, zstd dictionary mode, and Shared Brotli ->
Briefly, loxc is not a new compression principle. It is a practical
recombination of:
- Dense-code-like prefix structure
- Learned per-corpus symbol tables
- Trained dictionary deployment
- External or embedded packaging
MIT - see LICENSE
PRs welcome. See CONTRIBUTING.md.
Questions or bug reports: open an issue.