Huge rewrite of buffer reading in rzip.c. We use a wrapper instead of
accessing the buffer directly, thus allowing us to have window sizes larger than
available ram. This is done with a "sliding mmap" implementation, which uses
two mmapped buffers: one large one, as before, and one smaller, page-sized one.
When an attempt is made to read beyond the end of the large buffer, the small
buffer is remapped to the file area being accessed. While this is roughly 100
times slower than direct mmapping, it allows us to implement compression
windows of unlimited size.
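
A minimal sketch of the idea (illustration only, with invented names; not
lrzip's actual code): reads that land inside the base window go straight to
the large mapping, while reads beyond it slide a single page-sized mapping
onto the page containing the requested offset.

#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

struct sliding_map {
	int	 fd;		/* file being compressed */
	uint8_t	*big;		/* large fixed mapping from offset 0 */
	size_t	 big_len;	/* bytes covered by the large mapping */
	uint8_t	*small;		/* page-sized mapping, remapped on demand */
	off_t	 small_off;	/* file offset the small mapping points at */
	size_t	 page;		/* system page size */
};

static int sliding_get_byte(struct sliding_map *m, off_t off, uint8_t *out)
{
	if (off < (off_t)m->big_len) {		/* fast path */
		*out = m->big[off];
		return 0;
	}
	/* slow path: move the small mapping onto the right page */
	off_t page_off = off - (off % (off_t)m->page);
	if (!m->small || m->small_off != page_off) {
		if (m->small)
			munmap(m->small, m->page);
		m->small = mmap(NULL, m->page, PROT_READ, MAP_SHARED,
				m->fd, page_off);
		if (m->small == MAP_FAILED) {
			m->small = NULL;
			return -1;
		}
		m->small_off = page_off;
	}
	*out = m->small[off - page_off];
	return 0;
}

Every page crossing beyond the base window costs a munmap/mmap pair, which is
broadly why this path is so much slower than reading the large mapping
directly.
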
Implement the -U option with unlimited sized windows.
Rework the selection of compression windows. Instead of trying to guess how
much ram the machine might be able to access, we try to safely buffer as much
ram as we can, and then use that to determine the file buffer size. Do not
choose an arbitrary upper window limit unless -w is specified.
Rework the -M option to try to buffer the entire file, reducing the buffer
size until we succeed.
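
A hedged sketch of that shrink-until-it-fits idea (illustration only; the
helper below is invented, and lrzip actually mmaps the file rather than
calling malloc):

#include <stdlib.h>
#include <unistd.h>

/* Keep halving the requested buffer size, page-aligned, until an
 * allocation succeeds.  Returns NULL if even one page cannot be had. */
static void *grab_largest_buffer(size_t want, size_t *got)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	void *buf = NULL;

	want = (want + page - 1) & ~(page - 1);	/* round up to a page */
	while (want >= page) {
		buf = malloc(want);
		if (buf)
			break;
		if (want == page)
			break;			/* even one page failed */
		want = ((want / 2) + page - 1) & ~(page - 1);
	}
	*got = buf ? want : 0;
	return buf;
}

Keeping the size page-aligned matches the buffer alignment change noted below.
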
Align buffer sizes to page size.
Clean up lots of unneeded variables.
Fix lots of minor logic issues to do with window sizes accepted/passed to rzip
and the compression backends.
More error handling.
Change -L to affect rzip compression level directly as well as backend
compression level and use 9 by default now.
More cleanups of information output.
Use 3 point release numbering in case one minor version has many subversions.
Numerous minor cleanups and tidying.
Updated docs and manpages.
ckolivas committed Nov 4, 2010
1 parent c106128 commit 29b1666
Showing 12 changed files with 394 additions and 250 deletions.
34 changes: 34 additions & 0 deletions ChangeLog
@@ -1,4 +1,38 @@
lrzip ChangeLog
NOVEMBER 2010, version 0.5.1 Con Kolivas
* Fix Darwin build - Darwin doesn't support mremap so introduce a fake wrapper
for it (a sketch of such a wrapper follows this list of changes).
* Fix the memopen routines; a wrongly implemented wrapper for the Darwin
equivalents was also using the faked versions on all builds.
* Fix dodgy include ordering.
* Clean up excessive use of #ifdefs.
* Huge rewrite of buffer reading in rzip.c. We use a wrapper instead of
accessing the buffer directly, thus allowing us to have window sizes larger than
available ram. This is done with a "sliding mmap" implementation, which uses
two mmapped buffers: one large one, as before, and one smaller, page-sized one.
When an attempt is made to read beyond the end of the large buffer, the small
buffer is remapped to the file area being accessed. While this is roughly 100
times slower than direct mmapping, it allows us to implement compression
windows of unlimited size.
* Implement the -U option with unlimited sized windows.
* Rework the selection of compression windows. Instead of trying to guess how
much ram the machine might be able to access, we try to safely buffer as much
ram as we can, and then use that to determine the file buffer size. Do not
choose an arbitrary upper window limit unless -w is specified.
* Rework the -M option to try to buffer the entire file, reducing the buffer
size until we succeed.
* Align buffer sizes to page size.
* Clean up lots of unneeded variables.
* Fix lots of minor logic issues to do with window sizes accepted/passed to rzip
and the compression backends.
* More error handling.
* Change -L to affect rzip compression level directly as well as backend
compression level and use 9 by default now.
* More cleanups of information output.
* Use 3 point release numbering in case one minor version has many subversions.
* Numerous minor cleanups and tidying.
* Updated docs and manpages.
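
Regarding the Darwin mremap fix above, a minimal sketch of what such a fake
wrapper could look like for a read-only, file-backed mapping (name, signature
and behaviour are assumptions for illustration, not lrzip's actual code;
unlike Linux mremap it takes the fd and offset because it simply recreates the
mapping from the file):

#include <sys/mman.h>
#include <sys/types.h>

/* Drop the old mapping and map the same file region again at the new
 * size.  Only suitable for read-only, MAP_SHARED, file-backed maps. */
static void *fake_mremap(void *old_addr, size_t old_size, size_t new_size,
			 int fd, off_t offset)
{
	if (munmap(old_addr, old_size) == -1)
		return MAP_FAILED;
	return mmap(NULL, new_size, PROT_READ, MAP_SHARED, fd, offset);
}
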

NOVEMBER 2010, version 0.5 Con Kolivas
* Changed offset encoding in rzip stage to use variable byte width offsets
instead of 64 bits wide. Makes for better compression and slightly faster.
54 changes: 29 additions & 25 deletions README
@@ -1,4 +1,4 @@
lrzip v0.5
lrzip v0.5.1

Long Range ZIP or Lzma RZIP

@@ -66,6 +66,17 @@ less ram and works on smaller ram machines.
stdin/stdout work but in a very inefficient manner generating temporary files
on disk so this method of using lrzip is not recommended.

The unique feature of lrzip is that it tries to make the most of the available
ram in your system at all times for maximum benefit. It does this by default,
choosing the largest sized window possible without running out of memory. It
also has a unique "sliding mmap" feature which makes it possible to use a
compression window larger than your ram size, if the file is that large. It
does this (with the -U option) by implementing one large mmap buffer as per
normal, and a smaller moving buffer to track which part of the file is
currently being examined, emulating a much larger single mmapped buffer.
Unfortunately this mode is 100 times slower once lrzip begins examining the
data beyond the large base window.

See the file README.benchmarks in doc/ for performance examples and what kind
of data lrzip is very good with.

@@ -91,10 +102,10 @@ Q. How do I make a static build?
A. make static

Q. I want the absolute maximum compression I can possibly get, what do I do?
A. Try the command line options -Mz. This will use all available ram and ZPAQ
compression. Expect serious swapping to occur if your file is larger than your
ram. It may even fail to run if you do not have enough swap space allocated.
Why? Well the more ram lrzip uses the better the compression it can achieve.
A. Try the command line options -MUz. This will use all available ram and ZPAQ
compression, and even use a compression window larger than your ram. Expect
serious swapping to occur if your file is larger than your ram, and expect it
to take 1000 times longer. A more practical option is just -M.

Q. Can I use your tool for even more compression than lzma offers?
A. Yes, the rzip preparation of files makes them more compressible by every
@@ -111,11 +122,12 @@ used windows larger than 2GB.

Q. How about 64bit?
A. 64bit machines with their ability to address massive amounts of ram will
excel with lrzip due to being able to use compresion windows limited only in
excel with lrzip due to being able to use compression windows limited only in
size by the amount of physical ram.

Q. Other operating systems?
A. Patches are welcome. Version 0.43+ should build on MacOSX 10.5+
A. The code is POSIXy with GNU extensions. Patches are welcome. Version 0.43+
should build on MacOSX 10.5+

Q. Does it work on stdin/stdout?
A. Yes it does. Compression from stdin works nicely.. However the other
@@ -146,7 +158,7 @@ to compress at all). If no compressible data is found, then the subsequent
compression is not even attempted. This can save a lot of time during the
compression phase when there is incompressible data. Theoretically it may be
possible that data is compressible by the other backend (zpaq, lzma etc) and not
at all by lzo, but in practice such data achieves only miniscule amounts of
at all by lzo, but in practice such data achieves only minuscule amounts of
compression which are not worth pursuing. Most of the time it is clear one way
or the other that data is compressible or not. If you wish to disable this
test and force it to try compressing it anyway, use -T 0.
@@ -156,8 +168,7 @@ generated file be decompressed on machines with less ram?
A. Yes. Ram requirements for decompression go up only by the -L compression
option with lzma and are never anywhere near as large as the compression
requirements. However if you're on 64bit and you use a compression window
greater than 2GB, it may NOT be possible to decompress it on 32bit machines.
lrzip will warn you and fail if you try.
greater than 2GB, it might not be possible to decompress it on 32bit machines.

Q. I've changed the compression level with -L in combination with -l or -z and
the file size doesn't vary?
@@ -212,28 +223,21 @@ good performing ones that will scale with memory and file size.
Q. How do you use lrzip yourself?
A. Two basic uses. I compress large files currently on my drive with the
-l option since it is so quick to get a space saving, and when archiving
data for permament storage I compress it with the default options.
data for permanent storage I compress it with the default options.

Q. I found a file that compressed better with plain lzma. How can that be?
A. When the file is more than 5 times the size of the compression window
you have available, the efficiency of rzip preparation drops off as a means
of getting better compression. Eventually when the file is large enough,
plain lzma compression will get better ratios. The lrzip compression will be
a lot faster though. Currently I have no way around this problem without
throwing more and more ram at the compression because trying to do this off
disk (whether directly on the file or from swap) will mean the file is read
a ridulous number of times over and over again. It presents an interesting
problem for which there is no perfect solution but it certainly has us
thinking hard about how to tackle it.
a lot faster though. The only way around this is to use as much ram as
possible with the -M option, and to go beyond that with the -U option.

Q. Can I use swapspace as ram for lrzip with a massive window?
A. No. To make lrzip work completely from disk would make the data be read
off disk an unrealistic number of times over again and again. For example, if
you have 1GB of ram and a 2GB file to compress, it might read the file a
billion times off disk. Most hard drives would fail in that time :) See the
previous question. Update; I have been informed that people have successfully
done this without destroying their hard drives and they've been _very_ patient,
but it didn't take as long as I had predicted.
A. It will indirectly do this with -M mode enabled. If you want the windows
even larger, -U (unlimited) mode will make the compression window as big as
the file itself no matter how big it is, but it will slow down 100 times
during the compression phase once it has reached your full ram.

Q. Why do you nice it to +19 by default? Can I speed up the compression by
changing the nice value?
@@ -331,7 +335,7 @@ Ed Avis for various fixes. Thanks to Matt Mahoney for zpaq code. Thanks to
Jukka Laurila for Darwin support. Thanks to George Makrydakis for lrztar.

Con Kolivas <kernel@kolivas.org>
Mon, 1 Nov 2010
Thu, 4 Nov 2010

Also documented by
Peter Hyman <pete@peterhyman.com>
15 changes: 15 additions & 0 deletions WHATS-NEW
@@ -1,3 +1,18 @@
lrzip-0.5.1

Fixed the build on Darwin.
Rewrote the rzip compression phase to make it possible to use unlimited sized
windows, no longer limited by ram. Unfortunately it's 100 times slower in this
mode, but with the new -U option you can compress a file of any size as one big
compression window.
Changed the memory selection system to simply find the largest reasonably
sized window and use that by default instead of guessing the window size.
Setting -M now only affects the window size, trying to find the largest
unreasonably sized window that will still work.
The default compression level is now 9 and affects the rzip compression stage
as well as the backend compression.
Changed to 3 point releases in case we get more than 9 subversions ;)

lrzip-0.50

Rewrote the file format to be up to 5% more compact and slightly faster.
22 changes: 11 additions & 11 deletions configure
@@ -1,6 +1,6 @@
#! /bin/sh
# Guess values for system-dependent variables and create Makefiles.
# Generated by GNU Autoconf 2.67 for lrzip 0.5.
# Generated by GNU Autoconf 2.67 for lrzip 0.5.1.
#
# Report bugs to <kernel@kolivas.org>.
#
@@ -551,9 +551,9 @@ MAKEFLAGS=

# Identity of this package.
PACKAGE_NAME='lrzip'
PACKAGE_TARNAME='lrzip-0.5'
PACKAGE_VERSION='0.5'
PACKAGE_STRING='lrzip 0.5'
PACKAGE_TARNAME='lrzip-0.5.1'
PACKAGE_VERSION='0.5.1'
PACKAGE_STRING='lrzip 0.5.1'
PACKAGE_BUGREPORT='kernel@kolivas.org'
PACKAGE_URL=''

@@ -1221,7 +1221,7 @@ if test "$ac_init_help" = "long"; then
# Omit some internal or obsolete options to make the list less imposing.
# This message is too long to be a string in the A/UX 3.1 sh.
cat <<_ACEOF
\`configure' configures lrzip 0.5 to adapt to many kinds of systems.
\`configure' configures lrzip 0.5.1 to adapt to many kinds of systems.
Usage: $0 [OPTION]... [VAR=VALUE]...
@@ -1269,7 +1269,7 @@ Fine tuning of the installation directories:
--infodir=DIR info documentation [DATAROOTDIR/info]
--localedir=DIR locale-dependent data [DATAROOTDIR/locale]
--mandir=DIR man documentation [DATAROOTDIR/man]
--docdir=DIR documentation root [DATAROOTDIR/doc/lrzip-0.5]
--docdir=DIR documentation root [DATAROOTDIR/doc/lrzip-0.5.1]
--htmldir=DIR html documentation [DOCDIR]
--dvidir=DIR dvi documentation [DOCDIR]
--pdfdir=DIR pdf documentation [DOCDIR]
@@ -1286,7 +1286,7 @@ fi

if test -n "$ac_init_help"; then
case $ac_init_help in
short | recursive ) echo "Configuration of lrzip 0.5:";;
short | recursive ) echo "Configuration of lrzip 0.5.1:";;
esac
cat <<\_ACEOF
@@ -1375,7 +1375,7 @@ fi
test -n "$ac_init_help" && exit $ac_status
if $ac_init_version; then
cat <<\_ACEOF
lrzip configure 0.5
lrzip configure 0.5.1
generated by GNU Autoconf 2.67
Copyright (C) 2010 Free Software Foundation, Inc.
@@ -2014,7 +2014,7 @@ cat >config.log <<_ACEOF
This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.
It was created by lrzip $as_me 0.5, which was
It was created by lrzip $as_me 0.5.1, which was
generated by GNU Autoconf 2.67. Invocation command line was
$ $0 $@
@@ -5324,7 +5324,7 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
# report actual input values of CONFIG_FILES etc. instead of their
# values after options handling.
ac_log="
This file was extended by lrzip $as_me 0.5, which was
This file was extended by lrzip $as_me 0.5.1, which was
generated by GNU Autoconf 2.67. Invocation command line was
CONFIG_FILES = $CONFIG_FILES
@@ -5386,7 +5386,7 @@ _ACEOF
cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1
ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`"
ac_cs_version="\\
lrzip config.status 0.5
lrzip config.status 0.5.1
configured by $0, generated by GNU Autoconf 2.67,
with options \\"\$ac_cs_config\\"
2 changes: 1 addition & 1 deletion configure.ac
@@ -1,5 +1,5 @@
dnl Process this file with autoconf to produce a configure script.
AC_INIT([lrzip],[0.5],[kernel@kolivas.org],[lrzip-0.5])
AC_INIT([lrzip],[0.5.1],[kernel@kolivas.org],[lrzip-0.5.1])
AC_CONFIG_HEADER(config.h)
# see what our system is!
AC_CANONICAL_HOST
8 changes: 5 additions & 3 deletions doc/README.benchmarks
@@ -89,12 +89,14 @@ gzip 2772899756 25.8 7m52.667s 4m8.661s
bzip2 2704781700 25.2 20m34.269s 7m51.362s
xz 2272322208 21.2 58m26.829s 4m46.154s
7z 2242897134 20.9 29m28.152s 6m35.952s
lrzip 1361276826 12.7 27m45.874s 9m20.046
lrzip(lzo) 1837206675 17.1 4m48.167s 8m28.842s
lrzip* 1354237684 12.6 29m13.402s 6m55.441s
lrzip(lzo)* 1828073980 17.0 3m34.816s 5m06.266s
lrzip(zpaq) 1341008779 12.5 4h11m14s
lrzip(zpaq)M 1270134391 11.8 4h30m14
lrzip(zpaq)MW 1066902006 9.9

(The benchmarks with * were done with version 0.5)

At this end of the spectrum things really start to heat up. The compression
advantage is massive, with the lzo backend even giving much better results
than 7z, and over a ridiculously short time. Note that it's not much longer
@@ -117,4 +119,4 @@ Or, to make things easier, just use the default settings all the time and be
happy as lzma gives good results. :D

Con Kolivas
Sat, 19 Dec 2009
Tue, 2nd Nov 2010