datastore

Comparing OHLC storage formats in Java 25: CSV, JSON, Parquet, and Vortex — with GZIP and ZSTD compression variants.

Why

Financial time-series (OHLC: date, symbol, open, high, low, close, volume) is simple, repetitive, and write-once/read-many. Choosing the right format matters:

CSV — universal, human-readable, no schema, large
JSON — self-describing, ~2× larger than CSV, poor fit for columnar data
Parquet — columnar, typed, compresses well at scale
Vortex — columnar with built-in encoding (no separate compressor needed at scale)

The question: at what scale does each format win, and how much does compression help?

Data model

record OhlcRecord(LocalDate date, String symbol,
                  double open, double high, double low, double close,
                  long volume)

All stores write/read identical OhlcRecord values. Correctness is verified by CrossFormatConsistencyTest (1 000-record round-trip across all 10 formats).

Benchmark

Sizes in bytes vs plain CSV. Times in milliseconds (2 warmup runs + 10 measured iterations via System.nanoTime(); columns show avg / min / max). Scales: 252 ≈ 1 trading year; 2 520 ≈ 10 years; 10 k; 100 k.

252 records (1 year)

Format	Size (bytes)	vs CSV	Write ms (avg/min/max)	Read ms (avg/min/max)
CSV	12,708	1.0×	0 / 0 / 3	0 / 0 / 2
CSV+GZIP	4,451	0.4×	0 / 0 / 0	0 / 0 / 0
CSV+ZSTD	9,085	0.7×	1 / 0 / 2	0 / 0 / 0
JSON	27,537	2.2×	0 / 0 / 2	0 / 0 / 0
JSON+GZIP	5,519	0.4×	0 / 0 / 0	0 / 0 / 0
JSON+ZSTD	5,585	0.4×	0 / 0 / 0	0 / 0 / 1
Parquet	13,039	1.0×	5 / 4 / 8	3 / 2 / 4
Parquet+GZIP	6,981	0.5×	5 / 4 / 8	3 / 2 / 5
Parquet+ZSTD	7,091	0.6×	4 / 4 / 5	2 / 2 / 3
Vortex	18,916	1.5×	2 / 2 / 3	0 / 0 / 0

At small scale: CSV/JSON ops are sub-millisecond. Parquet pays per-call codec init (~4–5 ms).

2 520 records (10 years)

Format	Size (bytes)	vs CSV	Write ms (avg/min/max)	Read ms (avg/min/max)
CSV	127,181	1.0×	3 / 3 / 4	1 / 1 / 4
CSV+GZIP	43,595	0.3×	6 / 6 / 6	1 / 1 / 1
CSV+ZSTD	86,116	0.7×	5 / 5 / 6	1 / 1 / 1
JSON	275,822	2.2×	1 / 1 / 2	1 / 1 / 2
JSON+GZIP	53,637	0.4×	4 / 4 / 4	1 / 1 / 1
JSON+ZSTD	54,408	0.4×	1 / 1 / 1	1 / 1 / 1
Parquet	112,874	0.9×	5 / 5 / 6	3 / 3 / 3
Parquet+GZIP	43,755	0.3×	8 / 7 / 9	3 / 2 / 4
Parquet+ZSTD	46,089	0.4×	4 / 4 / 5	2 / 2 / 3
Vortex	43,276	0.3×	3 / 3 / 4	0 / 0 / 1

Vortex reaches CSV+GZIP size parity with faster reads and no explicit compression step.

10 000 records

Format	Size (bytes)	vs CSV	Write ms (avg/min/max)	Read ms (avg/min/max)
CSV	484,118	1.0×	13 / 13 / 17	4 / 4 / 4
CSV+GZIP	163,079	0.3×	23 / 23 / 24	5 / 5 / 5
CSV+ZSTD	322,420	0.7×	22 / 22 / 25	5 / 5 / 5
JSON	1,074,079	2.2×	3 / 3 / 6	4 / 4 / 4
JSON+GZIP	202,119	0.4×	17 / 17 / 18	4 / 4 / 5
JSON+ZSTD	209,980	0.4×	4 / 4 / 5	4 / 4 / 5
Parquet	384,085	0.8×	8 / 8 / 10	4 / 3 / 5
Parquet+GZIP	185,250	0.4×	18 / 18 / 19	4 / 4 / 4
Parquet+ZSTD	197,340	0.4×	8 / 8 / 9	4 / 4 / 4
Vortex	130,860	0.3×	6 / 6 / 8	0 / 0 / 1

Vortex smallest and fastest to read; Parquet+ZSTD best write/size tradeoff among compressed formats.

100 000 records

Format	Size (bytes)	vs CSV	Write ms (avg/min/max)	Read ms (avg/min/max)
CSV	4,638,435	1.0×	131 / 128 / 134	45 / 43 / 48
CSV+GZIP	1,291,129	0.3×	189 / 187 / 201	50 / 49 / 54
CSV+ZSTD	2,753,718	0.6×	228 / 222 / 242	51 / 48 / 56
JSON	10,538,396	2.3×	41 / 34 / 60	53 / 42 / 77
JSON+GZIP	1,572,003	0.3×	138 / 137 / 142	46 / 46 / 49
JSON+ZSTD	1,693,655	0.4×	45 / 45 / 48	43 / 43 / 45
Parquet	2,560,131	0.6×	36 / 35 / 39	17 / 17 / 18
Parquet+GZIP	1,266,857	0.3×	125 / 122 / 129	21 / 20 / 24
Parquet+ZSTD	1,417,150	0.3×	40 / 40 / 42	19 / 19 / 22
Vortex	1,453,420	0.3×	21 / 21 / 22	4 / 4 / 5

Parquet+GZIP is smallest. Vortex reads in 4 ms — 5× faster than any Parquet variant. JSON shows high variance (34–60 ms write, 42–77 ms read) from GC pressure at this scale.

Key findings

JSON is always the worst for size — 2.2–2.3× CSV before compression, never competitive.
Plain CSV beats plain Parquet only at tiny scale (<252 records); at 10 k records Parquet is 20% smaller uncompressed.
Vortex breaks even with compressed formats around 2 500 records — with zero explicit compression step.
Parquet+GZIP is smallest at 100 k (1.27 MB), but Vortex reads 5× faster (4 ms vs 19–21 ms).
ZSTD lags GZIP on size — OHLC data (smooth prices, repeated symbol) suits GZIP's sliding-window; ZSTD dictionary advantage not triggered here.
CSV+ZSTD write is slow — streaming ZSTD without dictionary shows high overhead at all scales; single-shot compression would help.
JSON variance grows with scale — GC pressure from string allocation causes 2× spread between min/max at 100 k records.

Running

# verify all formats produce identical data
./mvnw test -Dtest=CrossFormatConsistencyTest

# storage + timing benchmark (prints tables to stdout)
./mvnw test -Dtest=BenchmarkTest

Stack

Library	Version	Role
Jackson CSV / JSON	2.18.3	CSV + JSON I/O
Parquet + Hadoop	1.15.0 / 3.4.1	Parquet read/write
Apache Arrow	19.0.0	Vortex in-memory columnar
Vortex JNI	0.72.0	Vortex file format
zstd-jni	1.5.6-3	ZSTD compression
JUnit 5	5.11.4	Tests
Java	25	Runtime

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
.github/workflows		.github/workflows
.mvn/wrapper		.mvn/wrapper
src		src
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md
mvnw		mvnw
mvnw.cmd		mvnw.cmd
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

datastore

Why

Data model

Benchmark

252 records (1 year)

2 520 records (10 years)

10 000 records

100 000 records

Key findings

Running

Stack

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

datastore

Why

Data model

Benchmark

252 records (1 year)

2 520 records (10 years)

10 000 records

100 000 records

Key findings

Running

Stack

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages