byte-store

This utility generates and queries strictly upper triangular (SUT) and square matrices storing one-byte encoded correlation score values between -1.0 and +1.0. A so-called "byte-store" file is a string of bytes in SUT or square matrix form. Three encoding strategies are offered: a "full" strategy, which encodes scores in 0.01 increments; a "mid-quarter-zero" strategy, which encodes scores within the interval (-0.25, +0.25) as the +0.00 byte-equivalent; and a "custom" strategy, which allows specification of custom minimum and maximum bounds for encoding zero-score bytes.

Compressed square matrix byte-store files store a compressed form of encoded scores, along with metadata to perform indexed retrieval of rows. A compressed byte store can be a monolithic file or split into per-block files, each containing a specified number of rows.
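
The exact metadata layout is internal to byte-store, but the idea behind indexed row retrieval from a compressed store can be sketched with a simple, hypothetical per-block offset table. The structure and lookup below are illustrative assumptions, not the tool's actual on-disk format:

/* Hypothetical per-block index for a compressed square-matrix store.
 * This layout is illustrative only and does not reflect byte-store's
 * actual metadata format. */
#include <stddef.h>
#include <stdint.h>

typedef struct block_record {
    uint64_t offset;     /* byte offset of the compressed block in the archive */
    uint64_t size;       /* compressed size of the block, in bytes */
    uint32_t first_row;  /* index of the first matrix row stored in the block */
    uint32_t row_count;  /* number of rows stored in the block */
} block_record_t;

/* Find the block record containing the requested row, or NULL if out of range;
 * a query would then seek to record->offset, decompress record->size bytes,
 * and read the desired row from the decompressed buffer. */
static const block_record_t *
find_block(const block_record_t *index, size_t n_blocks, uint32_t row)
{
    for (size_t i = 0; i < n_blocks; i++) {
        if (row >= index[i].first_row &&
            row < index[i].first_row + index[i].row_count) {
            return &index[i];
        }
    }
    return NULL;
}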

Filesizes

We initially compare SUT and square matrix byte storage methods because of the significant filesize differences at large scales. SUT byte-store files take up n(n-1)/2 bytes for n elements, while a square matrix takes up n^2 bytes. Compressed square matrix files generally take considerably fewer bytes, depending on the nature of the score data, the chosen encoding strategy, and the efficiency of the selected compression algorithm.
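
As a rough illustration of the difference, the sketch below computes both sizes and the standard row-major offset of an element (i, j), with i < j, within an SUT byte string. This is shown for explanation only; byte-store's internal indexing may differ:

/* Illustrative size comparison and row-major SUT offset calculation;
 * byte-store's internal indexing may differ. */
#include <stdio.h>
#include <inttypes.h>

static uint64_t sut_bytes(uint64_t n)    { return n * (n - 1) / 2; }
static uint64_t square_bytes(uint64_t n) { return n * n; }

/* Offset of the score for pair (i, j), with i < j, in a row-major SUT byte string */
static uint64_t sut_offset(uint64_t n, uint64_t i, uint64_t j)
{
    return i * n - i * (i + 1) / 2 + (j - i - 1);
}

int main(void)
{
    uint64_t n = 10000;
    /* e.g. n = 10000: SUT = 49995000 bytes, square = 100000000 bytes */
    printf("n = %" PRIu64 ": SUT = %" PRIu64 " bytes, square = %" PRIu64 " bytes\n",
           n, sut_bytes(n), square_bytes(n));
    printf("offset of pair (2, 5) for n = 10: %" PRIu64 "\n", sut_offset(10, 2, 5));
    return 0;
}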

Score values

Score values are either simulated with the MT19937 RNG or sourced from a BED4 file (example), where each element's fourth column contains a comma-delimited string of numerical values from which pairwise Pearson's r correlation scores are calculated and encoded:

chr1    1193700 1193850 20,8,10,31,...
chr1    1459000 1459150 17,10,9,42,...
chr1    2244600 2244750 149,72,71,87,...
chr1    3568000 3568150 70,34,17,70,...
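
The pairwise score for two such signal vectors is the standard Pearson's r; a minimal sketch of that calculation follows (illustrative code, not byte-store's actual implementation):

/* Standard Pearson's r over two equal-length signal vectors (assumes n > 0
 * and nonzero variance in both vectors); illustrative, not byte-store's
 * actual implementation. */
#include <math.h>
#include <stddef.h>

static double pearson_r(const double *x, const double *y, size_t n)
{
    double mean_x = 0.0, mean_y = 0.0;
    for (size_t i = 0; i < n; i++) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= (double) n;
    mean_y /= (double) n;
    double cov = 0.0, var_x = 0.0, var_y = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = x[i] - mean_x;
        double dy = y[i] - mean_y;
        cov += dx * dy;
        var_x += dx * dx;
        var_y += dy * dy;
    }
    return cov / sqrt(var_x * var_y);
}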

Byte-store files are encoded with the "full", "mid-quarter-zero", or "custom" strategy. The default "full" strategy maps all scores in [-1.0, +1.0] to an unsigned char in 0.01 increments. The "mid-quarter-zero" strategy is similar, but any scores within (-0.25, +0.25) are mapped directly to the +0.00 unsigned-char equivalent. Tests of the "custom" strategy encode scores within (-0.50, +0.50) to the +0.00 byte-equivalent.
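
Conceptually, the mapping from a floating-point score to its one-byte encoding under these strategies can be sketched as follows; the rounding and byte assignments here are assumptions for illustration, not byte-store's exact encoding:

/* Conceptual sketch of the three strategies: collapse scores inside an open
 * zero interval to +0.00, then quantize [-1.00, +1.00] in 0.01 increments
 * into a single unsigned char. Byte values and rounding are assumptions for
 * illustration, not byte-store's exact encoding. */
#include <math.h>

static unsigned char encode_score(double score, double zero_min, double zero_max)
{
    /* "full": zero_min == zero_max == 0.0 (no collapsing)
     * "mid-quarter-zero": zero_min = -0.25, zero_max = +0.25
     * "custom": user-specified bounds, e.g. -0.50 and +0.50 */
    if (score > zero_min && score < zero_max) {
        score = 0.0;
    }
    /* shift [-1.00, +1.00] onto 0..200 in 0.01 steps */
    return (unsigned char) lround((score + 1.0) * 100.0);
}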

Getting started

To build byte-store, download the source and compile:

$ git clone https://github.com/alexpreynolds/byte-store.git
$ cd byte-store && make 

To date, compilation and score tests have passed with GNU gcc 4.6.3 and Clang 3.4 on Ubuntu 12.04 LTS Server (64-bit), GNU gcc 4.8.2 on RHEL6, and Clang 3.6 on Mac OS X Yosemite 10.10.4.

Testing

Some basic tests are included with the makefile to evaluate byte-store creation and query timing, as well as compression efficiency. BED files are included for basic tests. Refer to the test-pearsonr-sut-4 and subsequent targets for a detailed description of creation and querying of a variety of input sizes and encoding/compression parameters.

A more comprehensive test suite uses E. Rynes' master DHS signal dataset (~192 MB) to sample and encode inputs at levels between ~500 and 10k elements, which can generate 6 GB or more of intermediate data. Refer to the test-sample-performance target in the project makefile for more details. We discuss the results of these tests further below.

Inputs

The reservoir sampling utility sample is used to generate a sort-bed-sorted BED4 file containing a randomly sampled subset of elements from the source input. Here, each file is a uniformly random sample of the ~680 MB master DHS signal dataset at the following levels: 562, 1000, 1779, 3162, 5623, and 10000 elements. Because of the variability among signal vectors in the subset files, we sample three trials at each level to get a more general sense of how this tool performs overall.

Creation timing

GNU time is used to measure the wall-clock time required to create SUT ("strictly upper triangular") and square-matrix byte-store files, both raw and compressed. As described above, a byte-store file is just a string of bytes representing encoded floating-point values between -1.00 and +1.00; for this application, these scores are Pearson's r correlations. Scores are calculated and encoded from the score vectors in the input subsamples.

For visualization, we measure the ratio of elapsed time to number of processed elements, which yields a seconds-per-element rate or average cost for byte-store file creation.

Query timing

Queries of raw and compressed byte-store files are performed at each level, for each trial, to measure the retrieval time for all encoded, pairwise correlation scores. A query writes the two paired regions and their respective correlation score as BED7 (BED3 + BED3 + encoded score); since the operation is symmetric (and to facilitate downstream BEDOPS set operations), each result is written twice: once as A-B-score, and again as B-A-score.
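
As a rough illustration of this symmetric output, the sketch below writes one pair of regions as two BED7 lines; the region structure and formatting are assumptions for the sketch, not byte-store's exact output code:

/* Sketch of writing one pair of regions as two symmetric BED7 lines
 * (BED3 + BED3 + score); the region type and formatting are assumptions
 * for illustration, not byte-store's exact output code. */
#include <stdio.h>
#include <inttypes.h>

typedef struct region {
    const char *chrom;
    uint64_t start;
    uint64_t stop;
} region_t;

static void write_pair(FILE *os, const region_t *a, const region_t *b, double score)
{
    /* A-B-score */
    fprintf(os, "%s\t%" PRIu64 "\t%" PRIu64 "\t%s\t%" PRIu64 "\t%" PRIu64 "\t%.2f\n",
            a->chrom, a->start, a->stop, b->chrom, b->start, b->stop, score);
    /* B-A-score: the symmetric duplicate, which eases downstream BEDOPS set operations */
    fprintf(os, "%s\t%" PRIu64 "\t%" PRIu64 "\t%s\t%" PRIu64 "\t%" PRIu64 "\t%.2f\n",
            b->chrom, b->start, b->stop, a->chrom, a->start, a->stop, score);
}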

For visualization, we measure the ratio of elapsed time to number of processed elements, which gives an average seconds-per-element rate or cost for querying a byte-store file.

Compression efficiency

To evaluate the potential longer-term archival of a SUT or square-matrix byte-store file, we compress it with the open-source bzip2 and gzip compression methods and measure the efficiency ratio, defined as the compressed filesize divided by the raw filesize.

Results

Intermediate test files are written to a 3 TB LaCie d2 Quadra USB3/FW800 external hard drive (7200 RPM) attached to a Mac mini (2.3 GHz Intel Core i5, 16 GB RAM) over a FireWire 800 connection (local test results can be written to an alternative destination by overriding the SAMPLEDIR variable setting in the makefile).

Creation

An uncompressed, "raw" SUT byte store takes less average time per element to make than an uncompressed square-matrix store for all encoding strategies.

Compression of a square byte store with bzip2 (either as a monolithic file or a series of "split" per-block files) writes fewer bytes as the encoding strategy goes from "full" to "m.q.z." to "custom". This is because the "custom" strategy pushes more score values to zeroes, which increases redundancy and improves compression efficiency (with the tradeoff of lost information). The average per-element creation time for a compressed archive is lower than for the raw (uncompressed) store types, which suggests that much of the cost of store creation is in file I/O (under the listed test conditions).

Querying

As the number of elements increases, querying an uncompressed, "raw" SUT byte store quickly becomes more expensive per element than lookups against a raw or compressed square-matrix store.

Query times are virtually identical for all three encoding strategies.

Further, there is little practical difference between average retrieval times for single-file raw and bzip2-compressed square matrices. Relative query times rise for "split" compression, where the archive is broken into 512-row chunk files.

If we normalize by the filesizes of the respective encoding types, querying has a roughly similar per-byte cost across store types.

Compression

For the "full" encoding strategy, compressing a raw SUT or square-matrix byte-store with gzip (default parameters) gives better results than bzip2 (default parameters). Byte-stores of typical encoded scores can be compressed down to roughly 76-80% of the original size. However, use of the "mid-quarter-zero" encoding strategy improves the compression efficiency to ~20%, with bzip2 offering more optimal results. The "custom" strategy more broadly encodes correlation values of (-0.50, +0.50) to 0, leading to a compression efficiency of ~4% when used with bzip2 routines.