COVT - Cloud optimized archive

tar + Mapbox vector tiles (MVT) + Tar index = ❤️

Requirements

Supports any tar file and any contents (currently focused on MVT)
Should be able to handle large (100GB+) tar files containing millions of internal files
Index should be small and/or compressed to save space
Should be able to fetch any file inside an archive with a minimal number of requests (ideally 2)

Usage

Create a cloud optimized vector tile file

npm i -g @covt/cli

covt create --output outputFile.covt inputFile.mbtiles

Tar files

TAR files contain a collection of files stored sequentially, with each file preceded by a 512 byte header.

This makes it very easy to add new files to an archive, as more files can just be appended to the end. However, it makes random reads impractical, as every file header would have to be read in turn until the wanted file is found.

A TAR index (.tar.index) is a JSON document containing the location and size of every file inside a tar file. With this index, any file in the tar can be read directly.
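To illustrate why a sequential scan is needed, here is a minimal sketch of walking the headers of an in-memory tar to collect each file's name, data offset and size. It assumes the standard header layout (name in bytes 0-99, size as octal text in bytes 124-135) and ignores extensions such as long names. `buildTarIndex`, `tarEntry` and `concatBytes` are hypothetical helpers written for this example, not part of the COVT packages.

```typescript
/** One record: [path, offset of the file data, size in bytes] */
type ScannedFile = [string, number, number];

const BlockSize = 512;

/** Read a NUL-terminated ASCII field out of a header block */
function readAscii(buf: Uint8Array, start: number, length: number): string {
  let out = '';
  for (let i = start; i < start + length && buf[i] !== 0; i++) out += String.fromCharCode(buf[i]);
  return out;
}

/** Walk every 512 byte header in order, recording where each file's data lives */
function buildTarIndex(tar: Uint8Array): ScannedFile[] {
  const found: ScannedFile[] = [];
  let offset = 0;
  while (offset + BlockSize <= tar.length) {
    const path = readAscii(tar, offset, 100); // bytes 0-99: file name
    if (path === '') break; // an all-zero block marks the end of the archive
    const size = parseInt(readAscii(tar, offset + 124, 12).trim(), 8); // octal text
    found.push([path, offset + BlockSize, size]);
    // File data is padded up to the next 512 byte boundary
    offset += BlockSize + Math.ceil(size / BlockSize) * BlockSize;
  }
  return found;
}

/** Demo only: build a simplified tar entry (header + padded data) in memory */
function tarEntry(path: string, data: string): Uint8Array {
  const out = new Uint8Array(BlockSize + Math.ceil(data.length / BlockSize) * BlockSize);
  const size = data.length.toString(8);
  for (let i = 0; i < path.length; i++) out[i] = path.charCodeAt(i);
  for (let i = 0; i < size.length; i++) out[124 + i] = size.charCodeAt(i);
  for (let i = 0; i < data.length; i++) out[BlockSize + i] = data.charCodeAt(i);
  return out;
}

/** Demo only: join byte arrays end to end */
function concatBytes(parts: Uint8Array[]): Uint8Array {
  let total = 0;
  for (const part of parts) total += part.length;
  const out = new Uint8Array(total);
  let at = 0;
  for (const part of parts) {
    out.set(part, at);
    at += part.length;
  }
  return out;
}

const demoTar = concatBytes([
  tarEntry('0/0/0.pbf', 'hello'),
  tarEntry('1/0/0.pbf', 'tiles!'),
  new Uint8Array(2 * BlockSize), // end-of-archive marker
]);
const scanned = buildTarIndex(demoTar);
console.log(JSON.stringify(scanned)); // [["0/0/0.pbf",512,5],["1/0/0.pbf",1536,6]]
```

Every header before the wanted file has to be visited, which over HTTP would mean one request per header.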

/** List of index records, one per file inside the tar */
type TarIndex = TarIndexFile[];

type TarIndexFile = [
    string, // Name of the file @see header.path
    number, // Offset to the start of the data
    number, // Number of bytes inside the file
];
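As a rough sketch of how such an index could be used, the example below looks up a path and turns its record into an HTTP Range header value, which would be the second of the two requests (the first fetches the index itself). `findRecord` and `toRangeHeader` are hypothetical names for this example, and the linear lookup is only for illustration.

```typescript
type IndexRecord = [string, number, number]; // [path, data offset, size in bytes]

/** Find the index record for a path, or null if the archive does not contain it */
function findRecord(records: IndexRecord[], path: string): IndexRecord | null {
  for (const record of records) {
    if (record[0] === path) return record;
  }
  return null;
}

/** HTTP Range header value covering exactly the file's data (the end byte is inclusive) */
function toRangeHeader(record: IndexRecord): string {
  const offset = record[1];
  const size = record[2];
  return `bytes=${offset}-${offset + size - 1}`;
}

const demoIndex: IndexRecord[] = [
  ['0/0/0.pbf', 512, 5],
  ['1/0/0.pbf', 1536, 6],
];

const wanted = findRecord(demoIndex, '1/0/0.pbf');
const rangeHeader = wanted == null ? null : toRangeHeader(wanted);
console.log(rangeHeader); // bytes=1536-1541
```

The resulting value can be sent as the `Range` header of a GET request against the tar file.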

Future investigation

Zip files

ZIP files store their metadata at the end of the file, so the metadata can be read with a single range request for the last 1+MB of data. Individual files can then be read directly from the ZIP.
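A minimal sketch of that first step, assuming the tail of the ZIP has already been fetched (for example with a suffix range such as `bytes=-1048576`): it scans backwards for the end of central directory record and reads where the central directory starts. `findCentralDirectory` is a hypothetical helper, and Zip64 archives are not handled.

```typescript
interface CentralDirectoryInfo {
  entries: number; // total number of files in the archive
  size: number; // size of the central directory in bytes
  offset: number; // offset of the central directory from the start of the ZIP
}

const EndOfCentralDirectory = 0x06054b50; // "PK\x05\x06" read as little-endian
const MinRecordSize = 22;

/** Locate the end of central directory record inside the last bytes of a ZIP */
function findCentralDirectory(tail: Uint8Array): CentralDirectoryInfo | null {
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength);
  // Scan backwards, as an optional archive comment may follow the record
  for (let i = tail.length - MinRecordSize; i >= 0; i--) {
    if (view.getUint32(i, true) !== EndOfCentralDirectory) continue;
    return {
      entries: view.getUint16(i + 10, true),
      size: view.getUint32(i + 12, true),
      offset: view.getUint32(i + 16, true),
    };
  }
  return null;
}

// Demo only: a fake 64 byte tail ending in a record describing 3 files
const demoTail = new Uint8Array(64);
const demoView = new DataView(demoTail.buffer);
const recordStart = demoTail.length - MinRecordSize;
demoView.setUint32(recordStart, EndOfCentralDirectory, true);
demoView.setUint16(recordStart + 10, 3, true);
demoView.setUint32(recordStart + 12, 150, true);
demoView.setUint32(recordStart + 16, 1000, true);

const directory = findCentralDirectory(demoTail);
console.log(JSON.stringify(directory)); // {"entries":3,"size":150,"offset":1000}
```

If the central directory is larger than the fetched tail, a further range request using `offset` and `size` would be needed.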

See: https://github.com/tapalcatl/tapalcatl-2-spec

2021-04 comments: The internal zip header is quite large. With a 600,000 file test zip, the header was 55MB, versus a 5MB gzipped header using JSON.

Combine tar with tar.index into a single tar

Having a single tar file greatly simplifies distribution. It would be quite simple to tar both the index (.tar.index) and the data tar into another tar, combining them into a single file for distribution.

Use AWS S3's response-content-encoding to decompress internal gzipped content on the fly
Change the index structure to a binary format; there could be multiple indexes
Investigate a BTree index
Investigate if MPQ could be a better format
Store only the pointer to the header, as the file size is stored in the tar file header.
Store only the offset difference from the last index to save space
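As a sketch of the offset-difference idea, the functions below store each offset as the distance from the previous one and rebuild the absolute offsets by summing. The names and the sample offsets are made up for this example; whether this saves space in practice depends on how the index is serialized and compressed.

```typescript
/** Replace each absolute offset with its difference from the previous one */
function deltaEncode(offsets: number[]): number[] {
  const deltas: number[] = [];
  let last = 0;
  for (const offset of offsets) {
    deltas.push(offset - last);
    last = offset;
  }
  return deltas;
}

/** Rebuild absolute offsets by keeping a running sum of the differences */
function deltaDecode(deltas: number[]): number[] {
  const offsets: number[] = [];
  let current = 0;
  for (const delta of deltas) {
    current += delta;
    offsets.push(current);
  }
  return offsets;
}

const demoOffsets = [512, 1536, 3072, 4096];
const demoDeltas = deltaEncode(demoOffsets);
console.log(JSON.stringify(demoDeltas)); // [512,1024,1536,1024]
console.log(JSON.stringify(deltaDecode(demoDeltas))); // [512,1536,3072,4096]
```

The differences stay small however large the tar grows, so they take fewer characters in JSON and fewer bytes in a variable-length binary encoding. The trade-off is that the index has to be decoded from the start before any single offset is known.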
