Open a ZIP like a table. Still a ZIP, now queryable.
cozip glues a Parquet manifest onto an ordinary ZIP and drops a tiny fixed index at byte 0 that points to it. Fetch the index, fetch the manifest, query it locally, then range-request just the bytes you actually want. A 20 GB archive becomes a queryable dataset in two reads.
It works because nothing about the ZIP changes. `unzip` works. `zipfile.ZipFile` works. Your OS preview pane works. The manifest is just the first entry, and any conforming ZIP reader walks right past it.
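The compatibility claim can be illustrated with the standard library alone. The sketch below does not use cozip: it builds a plain ZIP in memory whose first entry is a stand-in for the manifest. The entry name `manifest.parquet` and its contents are placeholders, not what the real format stores (see SPEC.md for that). It only shows that a reader looking entries up by name is unaffected by an extra first entry.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # Placeholder first entry standing in for the manifest (hypothetical name).
    archive.writestr("manifest.parquet", b"placeholder bytes")
    archive.writestr("tile_001.tif", b"tile one")
    archive.writestr("tile_002.tif", b"tile two")

# Reopen the same bytes with an ordinary ZIP reader.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()            # entries in the order they were written
    tile = archive.read("tile_002.tif")   # lookup by name skips the first entry

print(names)
print(tile)
```

The extra entry shows up in the listing like any other file, and every other entry stays readable by name.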
```python
import cozip
import pyarrow as pa
import pyarrow.compute as pc  # pyarrow.compute has to be imported explicitly

# One row per file to pack; extra columns travel into the manifest.
table = pa.table({
    "path": ["local/tile_001.tif", "local/tile_002.tif", "local/tile_003.tif"],
    "name": ["tile_001.tif", "tile_002.tif", "tile_003.tif"],
    "split": ["train", "val", "train"],
    "label": ["cloud", "water", "forest"],
})

# Write the archive locally.
cozip.write("dataset.zip", table)

# Read the manifest of a hosted copy, then filter it like any Arrow table.
manifest = cozip.read("https://example.com/dataset.zip")
train = manifest.filter(pc.equal(manifest["split"], "train"))
```

`path` says where each file lives on disk. `name` is how it shows up inside the archive. Everything else rides along into the manifest and becomes queryable on read. R and Julia expose the same API; see their READMEs.
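The last step of the read path, range-requesting only the bytes you want, is ordinary HTTP. Here is a minimal sketch using only the standard library, assuming the server honors `Range` headers. `range_header` and `fetch_range` are hypothetical helpers, not part of the cozip API. The offsets and lengths would come from the manifest, whose layout is defined in SPEC.md and not assumed here.

```python
import urllib.request


def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={offset}-{offset + length - 1}"


def fetch_range(url: str, offset: int, length: int) -> bytes:
    """Fetch one byte range from a remote file (hypothetical helper, not cozip API)."""
    request = urllib.request.Request(
        url, headers={"Range": range_header(offset, length)}
    )
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content means the server honored the range;
        # 200 means it ignored the header and is sending the whole file.
        if response.status != 206:
            raise RuntimeError(
                f"server ignored Range header (status {response.status})"
            )
        return response.read()
```

Checking for status 206 matters with large archives: a server that ignores the header would otherwise stream the whole file.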
| Language | Install | Docs |
|---|---|---|
| Python | `pip install cozip` | python/ |
| R | `install.packages("cozip", repos = "https://asterisk-labs.r-universe.dev")` | r/ |
| Julia | `Pkg.Registry.add("https://github.com/asterisk-labs/AsteriskRegistry"); Pkg.add("Cozip")` | julia/ |
Every binding wraps the same C core. A cozip written by R reads byte for byte identically in Julia, in Python, in C.
See SPEC.md. The format is short and stable. Any conforming reader handles any conforming writer.
License: MIT.