# Can git compress 'similar' (binary) files in repositories?

## Preparations

In [1]:
py348 = "https://www.python.org/ftp/python/3.4.8/Python-3.4.8.tgz"
py348rc1 = "https://www.python.org/ftp/python/3.4.8/Python-3.4.8rc1.tgz"

In [2]:
!wget $py348

--2018-05-31 14:54:21--  https://www.python.org/ftp/python/3.4.8/Python-3.4.8.tgz
Resolving www.python.org... 151.101.112.223, 2a04:4e42:1b::223
Connecting to www.python.org|151.101.112.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19663810 (19M) [application/octet-stream]
Saving to: ‘Python-3.4.8.tgz’


2018-05-31 14:54:23 (13.1 MB/s) - ‘Python-3.4.8.tgz’ saved [19663810/19663810]



In [3]:
!wget $py348rc1

--2018-05-31 14:54:23--  https://www.python.org/ftp/python/3.4.8/Python-3.4.8rc1.tgz
Resolving www.python.org... 151.101.112.223, 2a04:4e42:1b::223
Connecting to www.python.org|151.101.112.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19694450 (19M) [application/octet-stream]
Saving to: ‘Python-3.4.8rc1.tgz’


2018-05-31 14:54:24 (15.5 MB/s) - ‘Python-3.4.8rc1.tgz’ saved [19694450/19694450]



In [4]:
# strip off URL path
import os.path
py348 = os.path.basename(py348)
py348rc1 = os.path.basename(py348rc1)

# use letters A and B for convenience, and strip off file extension (.tgz)
A = os.path.splitext(py348)[0]
B = os.path.splitext(py348rc1)[0]


# Experiment 1: Adding gzipped tar files to git

In [5]:
Atgz = A + '.tgz'
!rm -rf .git # clean start
!git init .
!git add $Atgz
!git commit -m "file A"
!du -hs .git

Initialized empty Git repository in /Users/fangohr/git/git-better-to-add-gzipped-tar-file-or-tar-file/.git/
[master (root-commit) 05dcfb2] file A
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 Python-3.4.8.tgz
 19M	.git


Can ``git gc`` reduce the disk usage further? No:

In [6]:
# Can we optimise git's storage?
!git gc --aggressive
!du -hs .git

Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
 19M	.git


In [7]:
Btgz = B + '.tgz'
!git add $Btgz
!git commit -m "file B"
!du -hs .git

[master df10613] file B
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 Python-3.4.8rc1.tgz
 38M	.git


In [8]:
!ls -lh Python*tgz

-rw-r--r--@ 1 fangohr  staff    19M  5 Feb 00:53 Python-3.4.8.tgz
-rw-r--r--@ 1 fangohr  staff    19M 23 Jan 13:51 Python-3.4.8rc1.tgz


In [9]:
# Can we optimise git's storage?
!git gc --aggressive
!du -hs .git

Counting objects: 6, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 0), reused 3 (delta 0)
 38M	.git


Conclusion 1: 
- Adding an already compressed file to git -> git cannot compress it further
- Adding two similar (but already compressed) files -> git cannot exploit their similarity, even after ``git gc --aggressive``
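The first point can be reproduced outside git. Git stores objects with zlib (deflate), the same family of compression that gzip uses, so a second pass over gzipped data has little left to remove. The following is a minimal sketch with synthetic data; the payload and variable names are assumptions for illustration and not the Python tarballs used above.

```python
import gzip
import random
import zlib

# Synthetic stand-in for an uncompressed tar file (an assumption, not the
# real Python tarball): random words from a small vocabulary compress well.
random.seed(0)
words = [b"def", b"return", b"import", b"class", b"self", b"None"]
raw = b" ".join(random.choice(words) for _ in range(50_000))

gz = gzip.compress(raw)         # comparable to the content of a .tgz file
blob_raw = zlib.compress(raw)   # deflate applied to the uncompressed data
blob_gz = zlib.compress(gz)     # deflate applied to already gzipped data

print("raw bytes:       ", len(raw))
print("gzip(raw):       ", len(gz))
print("zlib(raw):       ", len(blob_raw))
print("zlib(gzip(raw)): ", len(blob_gz))
```

The last two numbers should differ strongly: deflate shrinks the raw payload considerably, but leaves the gzipped payload at roughly its original size.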

# Experiment 2: Adding (uncompressed) tar files to git

In [10]:
!rm -f *.tar            # tidy up
!gunzip $Atgz $Btgz
!ls -lh Python*tar

-rw-r--r--  1 fangohr  staff    72M  5 Feb 00:53 Python-3.4.8.tar
-rw-r--r--  1 fangohr  staff    72M 23 Jan 13:51 Python-3.4.8rc1.tar


In [11]:
Atar = A + ".tar"
!rm -rf .git # clean start
!git init .
!git add $Atar
!git commit -m "file A"
!du -hs .git

Initialized empty Git repository in /Users/fangohr/git/git-better-to-add-gzipped-tar-file-or-tar-file/.git/
[master (root-commit) 7ef509e] file A
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 Python-3.4.8.tar
 22M	.git


In [12]:
# Can we optimise git's storage?
!git gc --aggressive
!du -hs .git

Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
 19M	.git


Yes: after ``git gc --aggressive``, git achieves approximately the same compression (19MB) as gzip did on the tarball.
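To see where the space goes, it can help to separate loose objects from packed objects instead of measuring the whole ``.git`` directory with ``du``. The sketch below parses the output of ``git count-objects -v``, which reports sizes in KiB. The helper names `parse_count_objects` and `repo_stats` are hypothetical, and the sample text is made up for illustration rather than taken from this experiment.

```python
import subprocess

def parse_count_objects(text):
    """Turn 'key: value' lines from `git count-objects -v` into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value)
    return stats

def repo_stats(path="."):
    """Run `git count-objects -v` in `path` (requires git and a repository)."""
    out = subprocess.run(
        ["git", "count-objects", "-v"],
        cwd=path, capture_output=True, text=True, check=True,
    ).stdout
    return parse_count_objects(out)

# Made-up sample output, for illustration only (sizes are in KiB).
sample = (
    "count: 0\nsize: 0\nin-pack: 6\npacks: 1\n"
    "size-pack: 19456\nprune-packable: 0\ngarbage: 0\nsize-garbage: 0\n"
)
stats = parse_count_objects(sample)
print("loose KiB:", stats["size"], "packed KiB:", stats["size-pack"])
```

Inside a repository, `repo_stats()` could be called before and after ``git gc`` to compare the `size` (loose) and `size-pack` (packed) entries.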

In [13]:
Btar = B + ".tar"
!git add $Btar
!git commit -m "file B"
!du -hs .git

[master 1a65e62] file B
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 Python-3.4.8rc1.tar
 41M	.git


In [14]:
!git log --stat

commit 1a65e6228b2e5cc0e34d3640cfd3e36c9086c8d6 (HEAD -> master)
Author: Hans Fangohr <hans.fangohr@xfel.eu>
Date:   Thu May 31 14:54:35 2018 +0200

    file B

 Python-3.4.8rc1.tar | Bin 0 -> 75673600 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)

commit 7ef509e1c5cfeb1c73922d64ec719d7d423801cb
Author: Hans Fangohr <hans.fangohr@xfel.eu>
Date:   Thu May 31 14:54:31 2018 +0200

    file A

 Python-3.4.8.tar | Bin 0 -> 75673600 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)


In [15]:
# Can we optimise git's storage?
!git gc --aggressive
!du -hs .git

Counting objects: 6, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 1), reused 2 (delta 0)
 19M	.git


Conclusion 2: Adding uncompressed files to git:

- each file is compressed on its own (although not as efficiently as by the original gzip: 22MB instead of 19MB)
- there is no apparent benefit from the similarity of the two files (41MB for both)
- *until* ``git gc`` is run; then the two 'similar' files A and B are compressed very well together (19MB in total).
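The gain after ``git gc`` comes from delta compression: one object is stored as a set of differences against a similar one. The sketch below illustrates the idea only. It uses a line-based delta from Python's `difflib`, which is not git's binary delta algorithm, and the two payloads are synthetic stand-ins (assumptions) for two similar uncompressed archives.

```python
import difflib
import random
import zlib

random.seed(1)
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"

# Hypothetical payloads: B is a copy of A with three lines changed.
lines_a = ["".join(random.choices(alphabet, k=60)) for _ in range(2000)]
lines_b = list(lines_a)
for i in (10, 500, 1500):
    lines_b[i] = "".join(random.choices(alphabet, k=60))

def pack(lines):
    """Compress a list of lines on its own, without reference to other data."""
    return zlib.compress("\n".join(lines).encode())

# Delta: only the ranges of A that change, with their replacement lines.
matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
delta = [
    (tag, i1, i2, lines_b[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]

def apply_delta(base, delta):
    """Rebuild the second payload from the base lines and the delta."""
    out, pos = [], 0
    for _tag, i1, i2, new_lines in delta:
        out.extend(base[pos:i1])   # unchanged lines before the edit
        out.extend(new_lines)      # replacement (empty for a deletion)
        pos = i2
    out.extend(base[pos:])
    return out

separately = len(pack(lines_a)) + len(pack(lines_b))
with_delta = len(pack(lines_a)) + len(zlib.compress(repr(delta).encode()))
print("A and B compressed separately:", separately)
print("A compressed plus delta for B:", with_delta)
```

Compressing A and B separately costs about twice the size of A, while A plus a delta for B costs barely more than A alone. The delta can only be found while the similarity is still visible, which is no longer the case once each file has been gzipped.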

# Conclusion

- Adding the two similar gzipped tar files to a git repository results in a repository size (i.e. size of ``.git``) of 38MB, which ``git gc`` cannot reduce.

- Adding the files as uncompressed tar files results in a repository size of 41MB. Once ``git gc`` is executed, the repository size shrinks to 19MB.

So based on this, we should add tar files, not gzipped archives.

## Acknowledgements

Thanks to Martin Teichmann, who taught me about ``git gc`` and pointed me to this study: https://gist.github.com/matthewmccullough/2695758

