This notebook _roughly_ follows [git-book chapter 10.4 - Git Internals - Packfiles](https://git-scm.com/book/en/v2/Git-Internals-Packfiles).

# Before we git started, let's set up your environment


## NOTE: Run the [git-references notebook](../2_git-references/git-references.ipynb) before running this notebook

In [None]:
from pprint import pprint

!sh clean.sh
!sh setup.sh

# Git Internals - Packfiles

![a green trash compactor](http://hwestequipment.com/wp-content/uploads/2018/07/4-Reasons-Why-You-Need-an-Industrial-Trash-Compactor-on-Your-Companys-Grounds.png)

## Packfiles

Git is pretty good at compacting files - after all of our work, our repo is not even 1 kilobyte

However, that may not always be the case. Let's add a large file to our repo to demonstrate

In [None]:
!curl https://raw.githubusercontent.com/mojombo/grit/master/lib/grit/repo.rb > repo.rb
!git checkout master
!git add repo.rb
!git commit -m 'added repo.rb'

Let's grab the SHA-1 of the file

In [None]:
reporb_commit_sha_1 = !git cat-file -p master^{tree} | awk '$4 == "repo.rb" { print $3 }'
reporb_commit_sha_1 = reporb_commit_sha_1[0]
pprint(reporb_commit_sha_1)

And use it to find the size of the file

In [None]:
!git cat-file -s $reporb_commit_sha_1

Now, let's modify this file and see what changes

In [None]:
!echo '# testing' >> repo.rb
!git commit -am 'modified repo.rb a bit'

And grab the SHA-1 once again

In [None]:
mod_reporb_commit_sha_1 = !git cat-file -p master^{tree} | awk '$4 == "repo.rb" { print $3 }'
mod_reporb_commit_sha_1 = mod_reporb_commit_sha_1[0]
pprint(mod_reporb_commit_sha_1)

Woah - did you catch that?
The blob is a _completely_ different blob, even though we tacked on just a tiny bit of text!
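
The new ID follows from how Git names blobs: the object ID is the SHA-1 of a short header (`blob <size>` plus a NUL byte) followed by the file content, so any change to the content produces an unrelated hash. Here is a minimal sketch in plain Python, assuming a SHA-1 repository; `git_blob_sha1` is a hypothetical helper name, not part of Git:

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Git hashes "blob <size>\0" followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


original = b"test content\n"
modified = original + b"# testing\n"

# Same value as: echo 'test content' | git hash-object --stdin
print(git_blob_sha1(original))
# Appending one short line gives an entirely different ID.
print(git_blob_sha1(modified))
```

Because the ID depends on every byte, Git cannot reuse the old blob for the new content; it has to write a second object.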

Let's check the size of the new blob

In [None]:
!git cat-file -s $mod_reporb_commit_sha_1

So, every time we modify a file, even a large file, git will create a whole new blob... that's not great. There must be some way for git to store the similar bits of a file separate from where they differ...

Thankfully, git has another trick up its sleeve

The format we've been using to store data so far is what we call the "loose" object format
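
A loose object is a single zlib-compressed file under `.git/objects`: the first two hex characters of its SHA-1 name the directory and the remaining 38 name the file. Below is a rough sketch of reading one back. `read_loose_object` is a hypothetical helper, and it assumes a SHA-1 repository and an object that has not been packed yet:

```python
import os
import zlib


def read_loose_object(git_dir: str, sha1: str):
    """Read a loose object and return (type, content)."""
    # e.g. objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
    path = os.path.join(git_dir, "objects", sha1[:2], sha1[2:])
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    # The decompressed data is "<type> <size>\0<content>".
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return obj_type.decode("ascii"), body
```

In this notebook, something like `read_loose_object(".git", mod_reporb_commit_sha_1)` should return the blob type and the file content, as long as it is run before the objects are packed.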

The more compact form of storage in git is called a "packfile" - a single binary file that holds many objects and stores similar objects as deltas against one another

Normally, this "packing" is done automatically when there are too many loose objects or when you push to a remote, but we can manually trigger it by calling ```git gc```

First let's list the files in our .git/objects directory before and after we run ```git gc```

In [None]:
!find .git/objects -type f
!git gc
!find .git/objects -type f

As you can see, there is a noticeable change in the number of files in our objects directory - and a few new faces
* .git/objects/pack*/pack-*\.idx
* .git/objects/pack*/pack-*\.pack
* .git/objects/info/packs

The objects that remain are the blobs which are not pointed to by any commit - like the "what is up, doc?" and "test content" test blobs

Since the aforementioned blobs were not added to a commit, they are considered to be dangling and are not picked up in the new packfile

The new .pack file contains the contents of all the objects which were removed from your filesystem.

The .idx file contains offsets into that packfile so we can quickly seek to a specific object
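
To make the "offsets" idea concrete, here is a hedged sketch of a reader for the version 2 index layout: a magic number and version, a 256-entry fanout table, the sorted object names, a CRC32 table, and then a table of 4-byte offsets. `read_pack_index` is a hypothetical helper; it assumes 20-byte SHA-1 names and does not handle packs large enough to need 8-byte offsets:

```python
import struct


def read_pack_index(data: bytes) -> dict:
    """Map each object name in a version 2 .idx file to its offset in the .pack file."""
    if data[:4] != b"\xfftOc" or struct.unpack(">I", data[4:8])[0] != 2:
        raise ValueError("not a version 2 pack index")
    # fanout[i] = number of objects whose first byte is <= i,
    # so the last entry is the total object count.
    fanout = struct.unpack(">256I", data[8:8 + 1024])
    count = fanout[255]
    names_start = 8 + 1024
    crc_start = names_start + 20 * count      # sorted 20-byte SHA-1 names
    offsets_start = crc_start + 4 * count     # one CRC32 per object
    index = {}
    for i in range(count):
        name = data[names_start + 20 * i:names_start + 20 * (i + 1)].hex()
        (offset,) = struct.unpack_from(">I", data, offsets_start + 4 * i)
        if offset & 0x80000000:
            # High bit set: the real offset lives in a later 8-byte table.
            raise NotImplementedError("64-bit offsets are not handled in this sketch")
        index[name] = offset
    return index
```

Given the path to an .idx file, `read_pack_index(open(path, "rb").read())` should return one entry per packed object, which is how Git can jump straight to an object without scanning the whole packfile.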

All in all ```git gc``` reduced the size of our objects from roughly 15K to a cool 7K

Git accomplishes this feat by looking for files which are named and sized similarly, and storing just the deltas from one version of the file to the next

We can look into the packfile using ```git verify-pack``` to see what git packed up
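
A delta is essentially a list of "copy this range from the base" and "insert these new bytes" instructions. The toy sketch below illustrates the idea with Python's `difflib`, working line by line. It is not Git's actual delta algorithm or binary encoding, and `make_delta` and `apply_delta` are hypothetical names:

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy/insert instructions relative to base (line based)."""
    base_lines = base.splitlines(keepends=True)
    target_lines = target.splitlines(keepends=True)
    matcher = SequenceMatcher(None, base_lines, target_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))  # reuse base lines i1..i2
        elif j2 > j1:
            delta.append(("insert", b"".join(target_lines[j1:j2])))
        # "delete" needs no instruction: those base lines are simply not copied
    return delta


def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild the target from the base and the instructions."""
    base_lines = base.splitlines(keepends=True)
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(base_lines[op[1]:op[2]])
        else:
            out.append(op[1])
    return b"".join(out)


newer = b"line 1\nline 2\nline 3\n# testing\n"
older = b"line 1\nline 2\nline 3\n"
delta = make_delta(newer, older)
print(delta)  # [('copy', 0, 3)]
```

Expressing the older version relative to the newer one takes a single copy instruction here, which gives a feel for why a delta can be only a few bytes long even when the file itself is large.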

In [None]:
pack_idx_file = !find .git/objects -type f -name '*.idx'
pack_idx_file = pack_idx_file[0]
pprint(pack_idx_file)
!git verify-pack -v $pack_idx_file

In the output above, look for the line that starts with the SHA-1 of the original repo.rb blob

In [None]:
pprint(reporb_commit_sha_1)

The last column of that line references the SHA-1 of the modified repo.rb blob, which serves as its delta base

In [None]:
pprint(mod_reporb_commit_sha_1)

Given that the third column in the table represents the size of the object, we can see that the modified repo.rb blob takes up the full 22K, but that the original repo.rb blob now only takes ~9 bytes.
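
If reading the table by eye gets tedious, the output can be parsed. The sketch below assumes the usual `git verify-pack -v` line layout (SHA-1, type, size, size in packfile, offset, and for deltified objects a depth and a base SHA-1). `parse_verify_pack` is a hypothetical helper, and the sample lines use made-up SHA-1s and sizes:

```python
import re

# Object lines start with a 40-character hex SHA-1 followed by the object type.
OBJECT_LINE = re.compile(r"^[0-9a-f]{40} (commit|tree|blob|tag) ")


def parse_verify_pack(lines):
    """Turn `git verify-pack -v` output lines into a list of dicts."""
    entries = []
    for line in lines:
        if not OBJECT_LINE.match(line):
            continue  # skip summary lines such as "non delta: ..."
        fields = line.split()
        entries.append({
            "sha1": fields[0],
            "type": fields[1],
            "size": int(fields[2]),
            "packed_size": int(fields[3]),
            "offset": int(fields[4]),
            # Deltified objects carry two extra columns: depth and base SHA-1.
            "base": fields[6] if len(fields) >= 7 else None,
        })
    return entries


sample = [
    "a" * 40 + " blob   22054 5799 1463",
    "b" * 40 + " blob   9 20 7262 1 " + "a" * 40,
    "non delta: 1 object",
]
for entry in parse_verify_pack(sample):
    print(entry["sha1"][:7], entry["size"], entry["base"])
```

In the notebook, capturing the real output with `lines = !git verify-pack -v $pack_idx_file` and passing `lines` to the function should make it easy to filter for entries whose `base` is set.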

Git purposefully keeps the newest version of the file intact and stores older versions as deltas, as you are more likely to need fast access to the most recent version

This can be repacked at any time! Even though git will occasionally repack your database automatically, you can experiment and see if running ```git gc``` improves the performance of your repo!