Apr 10, 2007

  1. compute a CRC32 for each object as stored in a pack

    The most important optimization for performance when repacking is the
    ability to reuse data from a previous pack as is and bypass any delta
    or even SHA1 computation by simply copying the raw data from one pack
    to another directly.
    
    The problem with this is that any data corruption within a copied object
    would go unnoticed and the new (repacked) pack would be self-consistent
    with its own checksum despite containing a corrupted object.  This is a
    real issue that has already happened at least once in the past.
    
    In an attempt to prevent this, we validate the copied data by inflating
    it and making sure no error is signaled by zlib.  But this is still not
    perfect: a significant portion of a pack's content is made of object
    headers and references to delta base objects, which are not deflated and
    therefore not validated when repacking.  Pack data reuse is thus still
    not as safe as it could be.
    
    Of course a full SHA1 validation could be performed, but that implies
    full data inflating and delta replaying, which is extremely costly and is
    exactly the cost the data reuse optimization was designed to avoid in the
    first place.
    
    So the best solution to this is simply to store a CRC32 of the raw pack
    data for each object in the pack index.  This way any object in a pack can
    be validated before being copied as is into another pack, including its
    header and any other non-deflated data.
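
A minimal sketch of the validate-before-copy idea, assuming a bitwise CRC-32 (the same polynomial zlib uses) and a hypothetical helper `copy_entry_if_valid`; neither name comes from the git sources, and the expected CRC is assumed to have been read from the pack index:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib). */
uint32_t crc32_bytes(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Hypothetical helper: copy one raw pack entry (header, delta base
 * reference and deflated data alike) into dst only if its CRC32 matches
 * the value recorded for it.  Returns 0 on success, -1 on mismatch.
 */
int copy_entry_if_valid(unsigned char *dst, const unsigned char *raw,
                        size_t len, uint32_t expected_crc)
{
    if (crc32_bytes(raw, len) != expected_crc)
        return -1;          /* corrupted: caller must not reuse this entry */
    memcpy(dst, raw, len);
    return 0;
}
```

Because the checksum covers the raw bytes, nothing has to be inflated and no delta has to be replayed to detect a damaged entry.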
    
    Why CRC32 instead of a faster checksum like Adler32?  Quoting Wikipedia:
    
       Jonathan Stone discovered in 2001 that Adler-32 has a weakness for very
       short messages. He wrote "Briefly, the problem is that, for very short
       packets, Adler32 is guaranteed to give poor coverage of the available
       bits. Don't take my word for it, ask Mark Adler. :-)" The problem is
       that sum A does not wrap for short messages. The maximum value of A for
       a 128-byte message is 32640, which is below the value 65521 used by the
       modulo operation. An extended explanation can be found in RFC 3309,
       which mandates the use of CRC32 instead of Adler-32 for SCTP, the
       Stream Control Transmission Protocol.
    
    In the context of a GIT pack, we have lots of small objects, especially
    deltas, which are likely to be quite small and in a size range for which
    Adler32 is deemed not to be sufficient.  Another advantage of CRC32 is the
    possibility of recovery from certain types of small corruption, like
    single-bit errors, which are the most probable type of corruption.
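
The short-message weakness quoted above can be checked with a small sketch. The function names are illustrative only. Note that the byte sum of a 128-byte message is at most 128 × 255 = 32640; since sum A starts at 1, A itself reaches at most 32641, still well below 65521:

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u

/* Compute the two Adler-32 running sums: A (low half) and B (high half). */
void adler32_sums(const unsigned char *buf, size_t len,
                  uint32_t *a_out, uint32_t *b_out)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % ADLER_MOD;   /* never wraps for short inputs */
        b = (b + a) % ADLER_MOD;
    }
    *a_out = a;
    *b_out = b;
}

/* Combine both sums into the usual 32-bit Adler-32 value. */
uint32_t adler32_bytes(const unsigned char *buf, size_t len)
{
    uint32_t a, b;
    adler32_sums(buf, len, &a, &b);
    return (b << 16) | a;
}
```

For any message of 128 bytes or fewer, the low 16 bits of the checksum therefore never use values above 32641, which is the poor coverage the quote refers to.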
    
    OK, what this patch does is compute the CRC32 of each object written to
    a pack within pack-objects.  It is not written to the index yet, and it
    is obviously not validated when reusing pack data yet either.
    
    Signed-off-by: Nicolas Pitre <nico@cam.org>
    Signed-off-by: Junio C Hamano <junkio@cox.net>
    authored April 09, 2007 Junio C Hamano committed April 10, 2007
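
As a rough illustration of computing a per-object CRC32 while the object is being written, here is a sketch with an in-memory writer. `struct pack_writer`, `entry_begin` and `entry_write` are hypothetical names, not the functions this patch touches; `crc32_update` follows the same calling convention as zlib's `crc32()` (start from 0, feed chunks in order):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Feed more bytes into a running CRC-32; start with crc = 0. */
uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc ^= 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical writer that tracks the CRC of the entry being written. */
struct pack_writer {
    unsigned char *buf;   /* destination buffer */
    size_t cap, len;      /* capacity and bytes written so far */
    uint32_t crc;         /* CRC32 of the current entry */
};

/* Reset the running CRC at the start of each object. */
void entry_begin(struct pack_writer *w)
{
    w->crc = 0;
}

/* Append bytes (header or data) and fold them into the entry's CRC. */
int entry_write(struct pack_writer *w, const void *data, size_t n)
{
    if (n > w->cap - w->len)
        return -1;        /* not enough room */
    memcpy(w->buf + w->len, data, n);
    w->len += n;
    w->crc = crc32_update(w->crc, data, n);
    return 0;
}
```

After the last `entry_write` for an object, `w->crc` holds the value that a later patch could store in the index for that object.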

Aug 10, 2005

  1. sirainen

    [PATCH] -Werror fixes

    GCC's format __attribute__ is good for checking errors, especially
    with the -Wformat=2 parameter. This fixes most of the reported problems
    against the 2005-08-09 snapshot.
    authored August 09, 2005 Junio C Hamano committed August 09, 2005

Jun 28, 2005

  1. csum-file: add "sha1fd()" to create a SHA1 csum file from an existing file descriptor

    We'll use this soon to write pack-files to stdout.
    authored June 28, 2005

Jun 27, 2005

  1. csum-file interface updates: return resulting SHA1

    Also, make the writing of the SHA1 as an end-header conditional: not
    every user will necessarily want to write the SHA1 to the file itself,
    even though current users do (but we might end up using the same helper
    functions for the object files themselves, which don't do this).
    
    This also makes the packed index file contain the SHA1 of the packed
    data file at the end (just before its own SHA1).  That way you can
    validate the pairing of the two if you want to.
    authored June 26, 2005
  2. git-pack-objects: write the pack files with a SHA1 csum

    We want to be able to check their integrity later, and putting the
    sha1-sum of the contents at the end is a good thing.  The writing
    routines are generic, so we could try to re-use them for the index file,
    instead of having the same logic duplicated.
    
    Update unpack-objects to know about the extra 20 bytes at the end
    of the index.
    authored June 26, 2005
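
The pairing check described in the June 27 entry above can be sketched as follows, assuming the layout those commits describe: the pack file ends with a 20-byte SHA1 of its contents, and the index file ends with the pack's SHA1 followed by its own SHA1. The function name `pack_idx_paired` is hypothetical, and the sketch only compares the stored checksums; it does not recompute either SHA1:

```c
#include <stddef.h>
#include <string.h>

#define SHA1_LEN 20

/*
 * Returns 1 if the pack checksum embedded in the index matches the
 * pack's own trailer, 0 if it does not, and -1 if either buffer is
 * too short to contain the expected trailer.
 */
int pack_idx_paired(const unsigned char *pack, size_t pack_len,
                    const unsigned char *idx, size_t idx_len)
{
    if (pack_len < SHA1_LEN || idx_len < 2 * SHA1_LEN)
        return -1;

    /* Last 20 bytes of the pack: SHA1 of the pack contents. */
    const unsigned char *pack_trailer = pack + pack_len - SHA1_LEN;

    /* In the index: pack SHA1 sits just before the index's own SHA1. */
    const unsigned char *idx_pack_sum = idx + idx_len - 2 * SHA1_LEN;

    return memcmp(pack_trailer, idx_pack_sum, SHA1_LEN) == 0;
}
```

A full integrity check would additionally hash each file's contents and compare the result against its own trailing SHA1.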