add compression usage to README

davidssmith · May 15, 2017 · 23ccb4a
1 parent b134c2b
Showing 1 changed file with 123 additions and 10 deletions.
diff --git a/README.md b/README.md
@@ -19,9 +19,9 @@ mispronunciation *rah* (as in "raw" in some dialects) also makes sense.
 RA was designed to be portable, fast, and storage
 efficient. For scientific applications in particular, it can allow the simple
 storage of large arrays without a separate header file to store the
-dimensions and type metadata. 
+dimensions and type metadata.
 
-I believe the world doesn't need another hierarchical data container. We already have one of 
+I believe the world doesn't need another hierarchical data container. We already have one of
 those---it's called a filesystem. What is needed is a simple one-to-one mapping of data structures to disk files that preserves metadata and is fast and simple to read and write.
 
 In addition to int, uint, and float of arbitrary sizes, RA also supports
@@ -39,9 +39,9 @@ The file format is a simple concatenation of a header array and a data array. Th
 
 ### File Structure
 
-| offset (bytes) | object | type           | meaning 
+| offset (bytes) | object | type           | meaning
 |----------------|--------|----------------|---------
-|                |        |                | **HEADER**	
+|                |        |                | **HEADER**
 | 0              | magic  | UInt64         | magic number
 | 8              | flags  | UInt64         | endianness, future options
 | 16             | eltype | UInt64         | element type code
@@ -62,7 +62,7 @@ The file format is a simple concatenation of a header array and a data array. Th
 | 3    | floating point (IEEE-754 standard)
 | 4    | complex float (pairs of IEEE floats)
 
-The width of these types is defined separately in the `elbyte` field. For example, 
+The width of these types is defined separately in the `elbyte` field. For example,
 
 * a 32-bit unsigned integer would be `eltype = 2`, `elbyte = 4`;
 * a single-precision complex float (pairs of 32-bit floats) would be `eltype = 4`, `elbyte = 8`;
@@ -76,7 +76,7 @@ struct foo {
    uint32_t index;
    double v[8];
 }
-``` 
+```
 
 contains a 12-byte string, a 4-byte int, and 8 8-byte floats, so the total size is 80 bytes. It would be coded as `eltype = 0`, `elbyte = 80`.
 
@@ -90,7 +90,7 @@ The RA format is **column major**, so the first dimension will be the fastest va
 
 File Introspection
 ------------------
-To get a better handle on the format of an RA file, let's look inside one. If you are on a Unix system or have Cygwin installed on Windows, you can examine the contents of an RA file using command line tools.  For this section, we will use the `test.ra` file provided in the `examples/` subdirectory. 
+To get a better handle on the format of an RA file, let's look inside one. If you are on a Unix system or have Cygwin installed on Windows, you can examine the contents of an RA file using command line tools.  For this section, we will use the `test.ra` file provided in the `examples/` subdirectory.
 
 First, let's pretend you don't know the dimensionality of the array. Then
 
@@ -139,17 +139,131 @@ Pkg.add("RawArray")
 Usage
 -----
 
-To use RawArray, simply add the following line to your file:
+To use RawArray, simply add the following line to your Julia script:
 
 ```
 using RawArray
 ```
 
 Now you can call `raread` and `rawrite` for Julia objects of type `Array{T,N}`.
+See the test script `test/runtests.jl` for some examples of use.
+
+
+A simple example of reading and writing a float array looks like this:
+```
+julia> using RawArray
+
+julia> x = rand(8,8);
+
+julia> rawrite(x, "test.ra")
+
+julia> y = raread("test.ra")
+
+julia> x == y
+true
+```
 
 A test file called `test/runtests.jl` has been included, as well as a demo RA file in `examples/test.ra`.  You can test the code on your machine at the command line by running `julia runtests.jl`. If the tests pass, you'll get a message saying so.
 
-Notice the Julia version also contains a `raquery()` function that produces a YAML dump of the file header.
+Note that the Julia version also provides a `raquery()` function that produces a YAML dump of the file header, which is easier to parse from other programs.
+
+Integer Compression
+-------------------
+
+If you are storing integers, RawArray has built-in lossless compression based on
+variable-length integer encoding, enabled with the `compress=true` keyword:
+```
+julia> using RawArray
+
+julia> n = rand(1:100, 8, 8);
+
+julia> rawrite(n, "ints.ra", compress=true)
+
+julia> m = raread("ints.ra")
+
+julia> m == n
+true
+```
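+
+To give a feel for why variable-length encoding saves space, here is a minimal sketch of a generic 7-bits-per-byte (LEB128-style) scheme for unsigned integers. It is an illustration only and is not taken from RawArray's source; the names `encode_varint!`, `decode_varint`, and `decode_all` are hypothetical, and signed values would first need to be mapped to unsigned ones.
+```julia
+# Illustration only: NOT RawArray's actual implementation.
+
+# Append `v` to `out` using 7 payload bits per byte, low-order group first.
+function encode_varint!(out::Vector{UInt8}, v::Unsigned)
+    while v >= 0x80
+        push!(out, UInt8(v & 0x7f) | 0x80)  # high bit set: more bytes follow
+        v >>= 7
+    end
+    push!(out, UInt8(v))                    # last byte has the high bit clear
+    return out
+end
+
+# Decode one value starting at index `i`; return the value and the next index.
+function decode_varint(buf::Vector{UInt8}, i::Int)
+    v = UInt64(0)
+    shift = 0
+    while true
+        b = buf[i]
+        i += 1
+        v |= UInt64(b & 0x7f) << shift
+        (b & 0x80) == 0 && break
+        shift += 7
+    end
+    return v, i
+end
+
+# Decode every value stored in `buf`.
+function decode_all(buf::Vector{UInt8})
+    out = UInt64[]
+    i = 1
+    while i <= length(buf)
+        v, i = decode_varint(buf, i)
+        push!(out, v)
+    end
+    return out
+end
+
+vals = UInt64[0, 1, 127, 128, 300, 1000]
+buf = UInt8[]
+for v in vals
+    encode_varint!(buf, v)
+end
+
+length(buf)              # 9 bytes, versus 48 bytes as raw UInt64
+decode_all(buf) == vals  # true: the round trip is lossless
+```
+Values below 128 fit in a single byte instead of eight, which is where the savings for small integers come from.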
+
+Float Compression
+-----------------
+
+This compression also works for floats when the data have limited precision, because you can convert them to integers for storage without losing any true precision. For example, assume you have data on the real interval [0,1] with three decimal digits of true precision. Converting to integers for compressed storage would look something like this:
+```
+julia> x = rand(3,3)
+3×3 Array{Float64,2}:
+ 0.269812   0.116996  0.415197
+ 0.950308   0.583864  0.844317
+ 0.0306206  0.558326  0.610574
+
+julia> m = round.(Int, x * 1000)
+3×3 Array{Int64,2}:
+ 270  117  415
+ 950  584  844
+  31  558  611
+
+julia> rawrite(m, "mydata.ra", compress=true)
+
+julia> n = raread("mydata.ra")
+3×3 Array{Int64,2}:
+ 270  117  415
+ 950  584  844
+  31  558  611
+
+julia> y = n * 0.001
+3×3 Array{Float64,2}:
+ 0.27   0.117  0.415
+ 0.95   0.584  0.844
+ 0.031  0.558  0.611
+```
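+
+The conversion above can be wrapped in two small helpers. This is a sketch under the assumption that a fixed decimal scale suits your data; `quantize` and `dequantize` are hypothetical names and are not part of RawArray.
+```julia
+# Hypothetical helpers (not part of RawArray) for fixed-precision storage.
+quantize(x, scale) = round.(Int, x .* scale)   # floats -> integers
+dequantize(m, scale) = m ./ scale              # integers -> floats
+
+x = rand(3, 3)
+m = quantize(x, 1000)      # keep three decimal digits
+y = dequantize(m, 1000)
+
+# Rounding moves each value by at most half a unit of the last kept digit,
+# so the round-trip error is bounded by 0.0005 (plus floating-point noise).
+maximum(abs.(x .- y)) <= 0.0005 + 1e-12
+```
+The integer array `m` is what you would pass to `rawrite` with `compress=true`.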
+
+To see what the potential size savings are, let's write a large, image-sized float array both as the original float and as a compressed Int array with three digits of precision:
+```
+julia> x = rand(512,512);
+
+julia> rawrite(x,"x_float.ra")
+
+julia> m = round.(Int, x * 1000);
+
+julia> rawrite(m, "x_int.ra", compress=true)
+
+julia> sf = stat("x_float.ra").size
+2097216
+
+julia> si = stat("x_int.ra").size
+507801
+
+julia> sf / si
+4.129995805443471
+```
+So this method, which is simple, fast, and built into the RawArray package, achieved better than 4x compression.
+
+External compression tools, such as 7-Zip, can then be used to further compress the compressed-integer RA file:
+```
+shell> 7z a x_int.7z x_int.ra
+
+7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
+p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,8 CPUs x64)
+
+Scanning the drive:
+1 file, 507801 bytes (496 KiB)
+
+Creating archive: x_int.7z
+
+Items to compress: 1
+
+
+Files read from disk: 1
+Archive size: 337078 bytes (330 KiB)
+Everything is Ok
+
+julia> siz = stat("x_int.7z").size
+337078
+
+julia> sf / siz
+6.221752828722136
+```
+So you can see that external compression algorithms are complementary to the variable-length integer compression. The final compressed size was 337 kB, which for 512 x 512 floats works out to *10.3 bits per float*. That is even smaller than the IEEE-754 half-precision float format, which uses 16 bits per float.
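+
+The bits-per-float figure follows directly from the sizes reported above; a quick check, using only those numbers (the variable names are just for illustration):
+```julia
+nbytes = 337078       # size of x_int.7z in bytes, as reported above
+nelem  = 512 * 512    # number of array elements
+
+bits_per_float = 8 * nbytes / nelem   # roughly 10.29, i.e. the 10.3 quoted above
+```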
 
 Getting Help
 ------------
@@ -163,4 +277,3 @@ David S. Smith [<david.smith@gmail.com>](mailto:david.smith@gmail.com)
 Disclaimer
 ----------
 This code comes with no warranty. Use at your own risk. If it breaks, let us know, and we'll try to help you fix it.
-