Huge rewrite of buffer reading in rzip.c. We use a wrapper instead of
accessing the buffer directly, thus allowing us to have window sizes larger than
available ram. This is done with a "sliding mmap" implementation, which uses
two mmapped buffers: one large one, as before, and one smaller, page-sized one.
When an attempt is made to read beyond the end of the large buffer, the small
buffer is remapped to the file area being accessed. While this is roughly 100
times slower than direct mmapping, it allows us to implement compression
windows of unlimited size.
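
A minimal sketch of the idea (illustration only, with invented names; not
lrzip's actual code): reads that land inside the base window go straight to
the large mapping, while reads beyond it slide a single page-sized mapping
onto the page containing the requested offset.

#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

struct sliding_map {
	int	 fd;		/* file being compressed */
	uint8_t	*big;		/* large fixed mapping from offset 0 */
	size_t	 big_len;	/* bytes covered by the large mapping */
	uint8_t	*small;		/* page-sized mapping, remapped on demand */
	off_t	 small_off;	/* file offset the small mapping points at */
	size_t	 page;		/* system page size */
};

static int sliding_get_byte(struct sliding_map *m, off_t off, uint8_t *out)
{
	if (off < (off_t)m->big_len) {		/* fast path */
		*out = m->big[off];
		return 0;
	}
	/* slow path: move the small mapping onto the right page */
	off_t page_off = off - (off % (off_t)m->page);
	if (!m->small || m->small_off != page_off) {
		if (m->small)
			munmap(m->small, m->page);
		m->small = mmap(NULL, m->page, PROT_READ, MAP_SHARED,
				m->fd, page_off);
		if (m->small == MAP_FAILED) {
			m->small = NULL;
			return -1;
		}
		m->small_off = page_off;
	}
	*out = m->small[off - page_off];
	return 0;
}

Every page crossing beyond the base window costs a munmap/mmap pair, which is
broadly why this path is so much slower than reading the large mapping
directly.
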
Implement the -U option with unlimited sized windows.
Rework the selection of compression windows. Instead of trying to guess how
much ram the machine might be able to access, we try to safely buffer as much
ram as we can, and then use that to determine the file buffer size. Do not
choose an arbitrary upper window limit unless -w is specified.
Rework the -M option to try to buffer the entire file, reducing the buffer
size until we succeed.
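
A hedged sketch of that shrink-until-it-fits idea (illustration only; the
helper below is invented, and lrzip actually mmaps the file rather than
calling malloc):

#include <stdlib.h>
#include <unistd.h>

/* Keep halving the requested buffer size, page-aligned, until an
 * allocation succeeds.  Returns NULL if even one page cannot be had. */
static void *grab_largest_buffer(size_t want, size_t *got)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	void *buf = NULL;

	want = (want + page - 1) & ~(page - 1);	/* round up to a page */
	while (want >= page) {
		buf = malloc(want);
		if (buf)
			break;
		if (want == page)
			break;			/* even one page failed */
		want = ((want / 2) + page - 1) & ~(page - 1);
	}
	*got = buf ? want : 0;
	return buf;
}

Keeping the size page-aligned matches the buffer alignment change noted below.
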
Align buffer sizes to page size.
Clean up lots of unneeded variables.
Fix lots of minor logic issues to do with window sizes accepted/passed to rzip
and the compression backends.
More error handling.
Change -L to affect rzip compression level directly as well as backend
compression level and use 9 by default now.
More cleanups of information output.
Use 3 point release numbering in case one minor version has many subversions.
Numerous minor cleanups and tidying.
Updated docs and manpages.
ckolivas committed Nov 4, 2010
1 parent c106128 commit 29b1666
Showing 12 changed files with 394 additions and 250 deletions.
34 changes: 34 additions & 0 deletions ChangeLog
@@ -1,4 +1,38 @@
lrzip ChangeLog
NOVEMBER 2010, version 0.5.1 Con Kolivas
* Fix Darwin build - Darwin doesn't support mremap so introduce a fake wrapper
for it (a sketch of such a wrapper follows this list of changes).
* Fix the memopen routines; a wrongly implemented wrapper for the Darwin
equivalents was also using the faked versions on all builds.
* Fix dodgy include ordering.
* Clean up excessive use of #ifdefs.
* Huge rewrite of buffer reading in rzip.c. We use a wrapper instead of
accessing the buffer directly, thus allowing us to have window sizes larger than
available ram. This is done with a "sliding mmap" implementation, which uses
two mmapped buffers: one large one, as before, and one smaller, page-sized one.
When an attempt is made to read beyond the end of the large buffer, the small
buffer is remapped to the file area being accessed. While this is roughly 100
times slower than direct mmapping, it allows us to implement compression
windows of unlimited size.
* Implement the -U option with unlimited sized windows.
* Rework the selection of compression windows. Instead of trying to guess how
much ram the machine might be able to access, we try to safely buffer as much
ram as we can, and then use that to determine the file buffer size. Do not
choose an arbitrary upper window limit unless -w is specified.
* Rework the -M option to try to buffer the entire file, reducing the buffer
size until we succeed.
* Align buffer sizes to page size.
* Clean up lots of unneeded variables.
* Fix lots of minor logic issues to do with window sizes accepted/passed to rzip
and the compression backends.
* More error handling.
* Change -L to affect rzip compression level directly as well as backend
compression level and use 9 by default now.
* More cleanups of information output.
* Use 3 point release numbering in case one minor version has many subversions.
* Numerous minor cleanups and tidying.
* Updated docs and manpages.
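
Regarding the Darwin mremap fix above, a minimal sketch of what such a fake
wrapper could look like for a read-only, file-backed mapping (name, signature
and behaviour are assumptions for illustration, not lrzip's actual code;
unlike Linux mremap it takes the fd and offset because it simply recreates the
mapping from the file):

#include <sys/mman.h>
#include <sys/types.h>

/* Drop the old mapping and map the same file region again at the new
 * size.  Only suitable for read-only, MAP_SHARED, file-backed maps. */
static void *fake_mremap(void *old_addr, size_t old_size, size_t new_size,
			 int fd, off_t offset)
{
	if (munmap(old_addr, old_size) == -1)
		return MAP_FAILED;
	return mmap(NULL, new_size, PROT_READ, MAP_SHARED, fd, offset);
}
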

NOVEMBER 2010, version 0.5 Con Kolivas
* Changed offset encoding in rzip stage to use variable byte width offsets
instead of 64 bits wide. Makes for better compression and slightly faster.
54 changes: 29 additions & 25 deletions README
@@ -1,4 +1,4 @@
lrzip v0.5
lrzip v0.5.1

Long Range ZIP or Lzma RZIP

@@ -66,6 +66,17 @@ less ram and works on smaller ram machines.
stdin/stdout work but in a very inefficient manner generating temporary files
on disk so this method of using lrzip is not recommended.

The unique feature of lrzip is that it tries to make the most of the available
ram in your system at all times for maximum benefit. It does this by default,
choosing the largest sized window possible without running out of memory. It
also has a unique "sliding mmap" feature which makes it possible to use a
compression window larger than your ram size, if the file is that large. It
does this (with the -U option) by implementing one large mmap buffer as per
normal, and a smaller moving buffer to track which part of the file is
currently being examined, emulating a much larger single mmapped buffer.
Unfortunately this mode is 100 times slower once lrzip begins examining the
data beyond the large base window.

See the file README.benchmarks in doc/ for performance examples and what kind
of data lrzip is very good with.

@@ -91,10 +102,10 @@ Q. How do I make a static build?
A. make static

Q. I want the absolute maximum compression I can possibly get, what do I do?
A. Try the command line options -Mz. This will use all available ram and ZPAQ
compression. Expect serious swapping to occur if your file is larger than your
ram. It may even fail to run if you do not have enough swap space allocated.
Why? Well the more ram lrzip uses the better the compression it can achieve.
A. Try the command line options -MUz. This will use all available ram and ZPAQ
compression, and even use a compression window larger than your ram. Expect
serious swapping to occur if your file is larger than your ram, and expect it
to take 1000 times longer. A more practical option is just -M.

Q. Can I use your tool for even more compression than lzma offers?
A. Yes, the rzip preparation of files makes them more compressible by every
@@ -111,11 +122,12 @@ used windows larger than 2GB.

Q. How about 64bit?
A. 64bit machines with their ability to address massive amounts of ram will
excel with lrzip due to being able to use compresion windows limited only in
excel with lrzip due to being able to use compression windows limited only in
size by the amount of physical ram.

Q. Other operating systems?
A. Patches are welcome. Version 0.43+ should build on MacOSX 10.5+
A. The code is POSIXy with GNU extensions. Patches are welcome. Version 0.43+
should build on MacOSX 10.5+

Q. Does it work on stdin/stdout?
A. Yes it does. Compression from stdin works nicely.. However the other
@@ -146,7 +158,7 @@ to compress at all). If no compressible data is found, then the subsequent
compression is not even attempted. This can save a lot of time during the
compression phase when there is incompressible data. Theoretically it may be
possible that data is compressible by the other backend (zpaq, lzma etc) and not
at all by lzo, but in practice such data achieves only miniscule amounts of
at all by lzo, but in practice such data achieves only minuscule amounts of
compression which are not worth pursuing. Most of the time it is clear one way
or the other that data is compressible or not. If you wish to disable this
test and force it to try compressing it anyway, use -T 0.
@@ -156,8 +168,7 @@ generated file be decompressed on machines with less ram?
A. Yes. Ram requirements for decompression go up only by the -L compression
option with lzma and are never anywhere near as large as the compression
requirements. However if you're on 64bit and you use a compression window
greater than 2GB, it may NOT be possible to decompress it on 32bit machines.
lrzip will warn you and fail if you try.
greater than 2GB, it might not be possible to decompress it on 32bit machines.

Q. I've changed the compression level with -L in combination with -l or -z and
the file size doesn't vary?
@@ -212,28 +223,21 @@ good performing ones that will scale with memory and file size.
Q. How do you use lrzip yourself?
A. Two basic uses. I compress large files currently on my drive with the
-l option since it is so quick to get a space saving, and when archiving
data for permament storage I compress it with the default options.
data for permanent storage I compress it with the default options.

Q. I found a file that compressed better with plain lzma. How can that be?
A. When the file is more than 5 times the size of the compression window
you have available, the efficiency of rzip preparation drops off as a means
of getting better compression. Eventually when the file is large enough,
plain lzma compression will get better ratios. The lrzip compression will be
a lot faster though. Currently I have no way around this problem without
throwing more and more ram at the compression because trying to do this off
disk (whether directly on the file or from swap) will mean the file is read
a ridulous number of times over and over again. It presents an interesting
problem for which there is no perfect solution but it certainly has us
thinking hard about how to tackle it.
a lot faster though. The only way around this is to use as much ram as
possible with the -M option, and to go beyond that with the -U option.

Q. Can I use swapspace as ram for lrzip with a massive window?
A. No. To make lrzip work completely from disk would make the data be read
off disk an unrealistic number of times over again and again. For example, if
you have 1GB of ram and a 2GB file to compress, it might read the file a
billion times off disk. Most hard drives would fail in that time :) See the
previous question. Update; I have been informed that people have successfully
done this without destroying their hard drives and they've been _very_ patient,
but it didn't take as long as I had predicted.
A. It will indirectly do this with -M mode enabled. If you want the windows
even larger, -U (unlimited) mode will make the compression window as big as
the file itself no matter how big it is, but it will slow down 100 times
during the compression phase once it has reached your full ram.

Q. Why do you nice it to +19 by default? Can I speed up the compression by
changing the nice value?
@@ -331,7 +335,7 @@ Ed Avis for various fixes. Thanks to Matt Mahoney for zpaq code. Thanks to
Jukka Laurila for Darwin support. Thanks to George Makrydakis for lrztar.

Con Kolivas <kernel@kolivas.org>
Mon, 1 Nov 2010
Thu, 4 Nov 2010

Also documented by
Peter Hyman <pete@peterhyman.com>
15 changes: 15 additions & 0 deletions WHATS-NEW
@@ -1,3 +1,18 @@
lrzip-0.5.1

Fixed the build on Darwin.
Rewrote the rzip compression phase to make it possible to use unlimited sized
windows, no longer limited by ram. Unfortunately it's 100 times slower in this
mode, but with the new -U option you can compress a file of any size as one big
compression window.
Changed the memory selection system to simply find the largest reasonably
sized window and use that by default instead of guessing the window size.
Setting -M now only affects the window size, trying to find the largest
unreasonably sized window that will still work.
The default compression level is now 9 and affects the rzip compression stage
as well as the backend compression.
Changed to 3 point releases in case we get more than 9 subversions ;)

lrzip-0.50

Rewrote the file format to be up to 5% more compact and slightly faster.
22 changes: 11 additions & 11 deletions configure
@@ -1,6 +1,6 @@
#! /bin/sh
# Guess values for system-dependent variables and create Makefiles.
# Generated by GNU Autoconf 2.67 for lrzip 0.5.
# Generated by GNU Autoconf 2.67 for lrzip 0.5.1.
#
# Report bugs to <kernel@kolivas.org>.
#
@@ -551,9 +551,9 @@ MAKEFLAGS=

# Identity of this package.
PACKAGE_NAME='lrzip'
PACKAGE_TARNAME='lrzip-0.5'
PACKAGE_VERSION='0.5'
PACKAGE_STRING='lrzip 0.5'
PACKAGE_TARNAME='lrzip-0.5.1'
PACKAGE_VERSION='0.5.1'
PACKAGE_STRING='lrzip 0.5.1'
PACKAGE_BUGREPORT='kernel@kolivas.org'
PACKAGE_URL=''

@@ -1221,7 +1221,7 @@ if test "$ac_init_help" = "long"; then
# Omit some internal or obsolete options to make the list less imposing.
# This message is too long to be a string in the A/UX 3.1 sh.
cat <<_ACEOF
\`configure' configures lrzip 0.5 to adapt to many kinds of systems.
\`configure' configures lrzip 0.5.1 to adapt to many kinds of systems.
Usage: $0 [OPTION]... [VAR=VALUE]...
@@ -1269,7 +1269,7 @@ Fine tuning of the installation directories:
--infodir=DIR info documentation [DATAROOTDIR/info]
--localedir=DIR locale-dependent data [DATAROOTDIR/locale]
--mandir=DIR man documentation [DATAROOTDIR/man]
--docdir=DIR documentation root [DATAROOTDIR/doc/lrzip-0.5]
--docdir=DIR documentation root [DATAROOTDIR/doc/lrzip-0.5.1]
--htmldir=DIR html documentation [DOCDIR]
--dvidir=DIR dvi documentation [DOCDIR]
--pdfdir=DIR pdf documentation [DOCDIR]
@@ -1286,7 +1286,7 @@ fi

if test -n "$ac_init_help"; then
case $ac_init_help in
short | recursive ) echo "Configuration of lrzip 0.5:";;
short | recursive ) echo "Configuration of lrzip 0.5.1:";;
esac
cat <<\_ACEOF
@@ -1375,7 +1375,7 @@ fi
test -n "$ac_init_help" && exit $ac_status
if $ac_init_version; then
cat <<\_ACEOF
lrzip configure 0.5
lrzip configure 0.5.1
generated by GNU Autoconf 2.67
Copyright (C) 2010 Free Software Foundation, Inc.
@@ -2014,7 +2014,7 @@ cat >config.log <<_ACEOF
This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.
It was created by lrzip $as_me 0.5, which was
It was created by lrzip $as_me 0.5.1, which was
generated by GNU Autoconf 2.67. Invocation command line was
$ $0 $@
@@ -5324,7 +5324,7 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
# report actual input values of CONFIG_FILES etc. instead of their
# values after options handling.
ac_log="
This file was extended by lrzip $as_me 0.5, which was
This file was extended by lrzip $as_me 0.5.1, which was
generated by GNU Autoconf 2.67. Invocation command line was
CONFIG_FILES = $CONFIG_FILES
@@ -5386,7 +5386,7 @@ _ACEOF
cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1
ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`"
ac_cs_version="\\
lrzip config.status 0.5
lrzip config.status 0.5.1
configured by $0, generated by GNU Autoconf 2.67,
with options \\"\$ac_cs_config\\"
2 changes: 1 addition & 1 deletion configure.ac
@@ -1,5 +1,5 @@
dnl Process this file with autoconf to produce a configure script.
AC_INIT([lrzip],[0.5],[kernel@kolivas.org],[lrzip-0.5])
AC_INIT([lrzip],[0.5.1],[kernel@kolivas.org],[lrzip-0.5.1])
AC_CONFIG_HEADER(config.h)
# see what our system is!
AC_CANONICAL_HOST
8 changes: 5 additions & 3 deletions doc/README.benchmarks
@@ -89,12 +89,14 @@ gzip 2772899756 25.8 7m52.667s 4m8.661s
bzip2 2704781700 25.2 20m34.269s 7m51.362s
xz 2272322208 21.2 58m26.829s 4m46.154s
7z 2242897134 20.9 29m28.152s 6m35.952s
lrzip 1361276826 12.7 27m45.874s 9m20.046
lrzip(lzo) 1837206675 17.1 4m48.167s 8m28.842s
lrzip* 1354237684 12.6 29m13.402s 6m55.441s
lrzip(lzo)* 1828073980 17.0 3m34.816s 5m06.266s
lrzip(zpaq) 1341008779 12.5 4h11m14s
lrzip(zpaq)M 1270134391 11.8 4h30m14
lrzip(zpaq)MW 1066902006 9.9

(The benchmarks with * were done with version 0.5)

At this end of the spectrum things really start to heat up. The compression
advantage is massive, with the lzo backend even giving much better results
than 7z, and over a ridiculously short time. Note that it's not much longer
@@ -117,4 +119,4 @@ Or, to make things easier, just use the default settings all the time and be
happy as lzma gives good results. :D

Con Kolivas
Sat, 19 Dec 2009
Tue, 2nd Nov 2010