The ZDW archival format targets row-level, well-structured data (e.g., TSV with a schema file, such as a MySQL dump) and is used in tandem with standard compression formats to yield highly efficient compression. It is best suited to minimizing storage footprint for archival data, accessing large segments of the data, and outputting data row by row, as opposed to extracting only a few columns.
ZDW uses a combination of:
- A global, sorted dictionary of unique strings across all text columns for normalization
- Typed numeric and text values, as specified in an accompanying SQL-like schema file
- Variable byte-size values for integers as well as dictionary indexes
- Minimum value baseline per column for integers and dictionary indexes, to reduce the magnitude of the value needed on each row of that column
- Bit-flagging repeated column values across consecutive rows (similar to run-length encoding, but applied to the same column across rows)
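To make the last three techniques concrete, here is a minimal illustrative sketch in Java: it encodes a single integer column using a per-column minimum baseline, the smallest byte width that fits the baseline-adjusted values, and a one-byte flag for values repeated from the previous row. This is a simplified illustration of the idea, not the actual ZDW on-disk layout.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Illustrative sketch only: shows the idea behind per-column baselines,
// variable byte-size values, and repeat flags. It is NOT the real ZDW on-disk layout.
public class ColumnEncodingSketch {

    public static byte[] encodeColumn(long[] values) {
        // Sketch simplification: assumes (max - min) does not overflow a signed long.
        long min = Arrays.stream(values).min().orElse(0);
        long max = Arrays.stream(values).max().orElse(0);
        int width = bytesNeeded(max - min);           // smallest width that fits (max - baseline)

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLittleEndian(out, min, 8);               // column baseline (fixed 8 bytes here for simplicity)
        out.write(width);                             // byte width used for this column's values

        boolean hasPrevious = false;
        long previous = 0;
        for (long v : values) {
            if (hasPrevious && v == previous) {
                out.write(1);                         // repeat flag: same value as the previous row
            } else {
                out.write(0);                         // literal follows, reduced by the baseline
                writeLittleEndian(out, v - min, width);
                previous = v;
                hasPrevious = true;
            }
        }
        return out.toByteArray();
    }

    private static int bytesNeeded(long range) {
        int bytes = 1;
        while (range > 0xFFL && bytes < 8) {
            range >>>= 8;
            bytes++;
        }
        return bytes;
    }

    private static void writeLittleEndian(ByteArrayOutputStream out, long value, int width) {
        for (int i = 0; i < width; i++) {
            out.write((int) (value & 0xFF));
            value >>>= 8;
        }
    }
}
```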
It has evolved through multiple internal iterations (version 10 as of September 2018, version 11 as of September 2019). See ZDW v10 format for a diagram of that layout.
The ZDW compressor performs two passes over the uncompressed data. The first pass compiles and sorts the global string dictionary, then outputs the header, dictionary, per-column offset sizes, and baseline values for numeric columns. The second pass converts the rows and outputs the compressed row-oriented data. When a standard compression flag is supplied, this output is piped into the respective compression binary (e.g., 'zstd', 'xz', 'gz') for additional compression.
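The sketch below illustrates that two-pass flow for text columns under simplifying assumptions (all columns treated as text, dictionary indexes left as plain ints): pass one builds the sorted global dictionary, pass two re-emits each row as dictionary indexes. The real compressor additionally emits the header, per-column offset sizes, and baselines described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Simplified two-pass sketch for text columns: pass 1 builds a sorted global
// dictionary of unique strings; pass 2 re-emits each row as dictionary indexes.
// The real compressor also emits a header, per-column offset sizes, and baselines.
public class TwoPassSketch {

    public static List<int[]> compress(List<String[]> rows, List<String> dictionaryOut) {
        // Pass 1: collect every unique string across all text columns, sorted.
        TreeSet<String> unique = new TreeSet<>();
        for (String[] row : rows) {
            for (String cell : row) {
                unique.add(cell);
            }
        }
        Map<String, Integer> index = new HashMap<>();
        for (String s : unique) {
            index.put(s, dictionaryOut.size());
            dictionaryOut.add(s);                    // dictionary is written out before the row data
        }

        // Pass 2: convert each row to dictionary indexes (the row-oriented payload).
        List<int[]> encodedRows = new ArrayList<>();
        for (String[] row : rows) {
            int[] encoded = new int[row.length];
            for (int c = 0; c < row.length; c++) {
                encoded[c] = index.get(row[c]);
            }
            encodedRows.add(encoded);
        }
        return encodedRows;
    }
}
```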
Multiple internal data blocks are supported in order to keep dictionary sizes manageable. These blocks are for memory management and are not currently intended to provide intra-file seek optimizations.
ZDW is designed to complement standard text or binary compression algorithms layered on top.
As ZDW is best suited for highly efficient compression of long-term archival data, the current recommended compression to apply on top of the ZDW file format is ZSTD. ZSTD incurs a one-time, compute-intensive compression cost, while keeping the compute cost of each subsequent decompression low. ZSTD compression is enabled by supplying the '-z' option to the compressor. Additional options for the underlying compressor, such as parallel ZSTD compression on large datasets, can be passed through via the '--zargs' option, e.g., "-z --zargs='-17 -T0'" for parallel ZSTD at compression level 17, or "-J --zargs='-7 --threads=2'" for parallel XZ at level 7.
Less efficient formats like XZ (LZMA) or GZ (gzip) may alternatively be used. XZ may produce smaller files for certain data patterns, while requiring more compute time for compression and decompression. GZ may reduce compression compute time but will likely produce larger files than XZ or ZSTD.
For well-structured data, expected compression efficiency is substantially higher than that of alternative formats such as ORC and Parquet.
Sample benchmark data, illustrating expected efficiency when compressing files with varying schema widths and properties, can be found in the test-files directory.
ZDW understands the data type of each column. These types are based on SQL column types. The supported types are:
Data Type | Internal ID | Java/Scala Type | Spark SQL Type
---|---|---|---
VARCHAR | 0 | String | StringType |
TEXT | 1 | String | StringType |
DATETIME | 2 | java.util.Date in UTC timezone | TimestampType |
CHAR_2 | 3 | String | StringType |
CHAR | 6 | String | StringType |
TINY | 7 | Short | ShortType |
SHORT | 8 | Int | IntegerType |
LONG | 9 | Long | LongType |
LONG LONG | 10 | BigInt | StringType (no big-int in Spark SQL) |
DECIMAL | 11 | Double | DoubleType |
TINY SIGNED | 12 | Byte | ByteType |
SHORT SIGNED | 13 | Short | ShortType |
LONG SIGNED | 14 | Int | IntegerType |
LONG LONG SIGNED | 15 | Long | LongType |
TINYTEXT | 16 | String | StringType |
MEDIUMTEXT | 17 | String | StringType |
LONGTEXT | 18 | String | StringType |
For more details see ZDWColumn.
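As a quick reference, the mapping in the table above can be restated in code. The sketch below is not the project's ZDWColumn implementation; it simply maps the internal IDs listed above onto Spark SQL's public DataTypes constants.

```java
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

// Restates the table above as code; an illustrative sketch, not the
// project's actual ZDWColumn implementation.
public class ZdwTypeMapping {

    public static DataType sparkSqlTypeFor(int internalId) {
        switch (internalId) {
            case 0:  case 1:  case 3:  case 6:        // VARCHAR, TEXT, CHAR_2, CHAR
            case 16: case 17: case 18:                // TINYTEXT, MEDIUMTEXT, LONGTEXT
                return DataTypes.StringType;
            case 2:  return DataTypes.TimestampType;  // DATETIME (UTC)
            case 7:  return DataTypes.ShortType;      // TINY
            case 8:  return DataTypes.IntegerType;    // SHORT
            case 9:  return DataTypes.LongType;       // LONG
            case 10: return DataTypes.StringType;     // LONG LONG (no big-int in Spark SQL)
            case 11: return DataTypes.DoubleType;     // DECIMAL
            case 12: return DataTypes.ByteType;       // TINY SIGNED
            case 13: return DataTypes.ShortType;      // SHORT SIGNED
            case 14: return DataTypes.IntegerType;    // LONG SIGNED
            case 15: return DataTypes.LongType;       // LONG LONG SIGNED
            default: throw new IllegalArgumentException("Unknown ZDW column type: " + internalId);
        }
    }
}
```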
ZDW best caters to the following access patterns:
- Decompressing an entire file's contents, whether to disk, streamed to stdout, or unpacked into a row-level in-memory text buffer
- Accessing data row by row
- Accessing multiple columns across all rows
Currently, this project provides write support only via the cplusplus/convertDWfile binary.
A TSV file with a ".sql" extension must reside on disk alongside a ".desc.sql" file containing the schema of the ".sql" file.
Run "./convertDWfile infile.sql" to create a ZDW file of the specified input file.
Include a '-J' argument to compress ZDW files with XZ. Include '-v' to validate the created file with cmp (to confirm the uncompressed ZDW data is byte-for-byte identical to the source file).
Run without arguments to view usage and all supported ZDW file creation options.
For reading, this project provides the following interfaces:
- The cplusplus/unconvertDWfile binary
- A C++ library API (see test_unconvert_api.cpp for an example)
- Java/Scala read layers (described below)
The first (Java/Scala) read layer works on a generic DataInputStream. It doesn't handle the filesystem or compression for you, but you can layer it on top of any Java/Scala filesystem or compression tooling that yields a DataInputStream.
See format for more details.
<dependency>
  <groupId>com.adobe.analytics.zdw</groupId>
  <artifactId>zdw-format</artifactId>
  <version>0.1.0</version>
</dependency>
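Because this layer consumes a plain DataInputStream, any stream source will do. The sketch below, which uses only JDK classes, builds such a stream over a local, gzip-compressed file; the zdw-format reader entry points themselves are covered in the format module's documentation and are not shown here.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

// Builds a DataInputStream over a local, gzip-compressed ZDW file using only
// JDK classes; the resulting stream can then be handed to the zdw-format reader.
public class LocalGzipStream {

    public static DataInputStream open(String path) throws IOException {
        return new DataInputStream(
            new GZIPInputStream(
                new BufferedInputStream(new FileInputStream(path))));
    }
}
```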
The second read layer uses the standard Java filesystem implementation and Apache Commons tools to provide local and FTP server filesystem support, as well as GZIP and XZ (LZMA) compression support.
<dependency>
  <groupId>com.adobe.analytics.zdw</groupId>
  <artifactId>zdw-file</artifactId>
  <version>0.1.0</version>
</dependency>
See file for more details and here for example usage.
The third read layer is an alternative to the Apache Commons implementation above. It instead uses Hadoop to provide the filesystem and compression support, so you can use any filesystem or compression supported by your Hadoop environment.
<dependency>
  <groupId>com.adobe.analytics.zdw</groupId>
  <artifactId>zdw-hadoop</artifactId>
  <version>0.1.0</version>
</dependency>
See hadoop for more details and here for example usage.
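For orientation, the sketch below shows roughly how a path is resolved through Hadoop's own API (plain Hadoop code, not the zdw-hadoop entry points). Because FSDataInputStream extends DataInputStream, any Hadoop-supported filesystem can feed the DataInputStream-based read layer.

```java
import java.io.DataInputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Opens a path through whatever filesystem the Hadoop configuration resolves
// (local, HDFS, s3a, etc.). Shown for orientation only; the zdw-hadoop layer
// performs this resolution (plus compression codec lookup) for you.
public class HadoopStream {

    public static DataInputStream open(String uri, Configuration conf) throws IOException {
        Path path = new Path(uri);
        FileSystem fs = path.getFileSystem(conf);   // resolves the scheme from the URI/config
        FSDataInputStream in = fs.open(path);       // FSDataInputStream extends DataInputStream
        return in;
    }
}
```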
The last read layer is a Spark SQL FileFormat that builds on the Hadoop layer, so it can use any filesystem or compression supported by your Hadoop/Spark environment.
<dependency>
  <groupId>com.adobe.analytics.zdw</groupId>
  <artifactId>zdw-spark-sql</artifactId>
  <version>0.1.0</version>
</dependency>
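A hypothetical usage sketch in Java follows. The data source identifier passed to format() is an assumption (shown here as "zdw"); check the zdw-spark-sql module for the actual registered name or fully qualified FileFormat class.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hedged sketch: "zdw" is assumed to be the registered data source name for
// this FileFormat; substitute the identifier documented by the zdw-spark-sql module.
public class SparkZdwExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("zdw-read-example")
            .getOrCreate();

        Dataset<Row> rows = spark.read()
            .format("zdw")                        // assumed short name; may require the full class name
            .load("hdfs:///path/to/data.zdw.xz"); // any path/compression your Hadoop/Spark env supports

        rows.show(10);
        spark.stop();
    }
}
```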