This notebook _roughly_ follows [git-book chapter 10.2 - Git Internals - Git Objects](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects).

# Before we git started, let's set up your environment


TODO: Read from [dotfile](https://github.com/theskumar/python-dotenv)

In [None]:
from pprint import pprint

!sh clean.sh

%env USERNAME="<git config user.name>"
%env USEREMAIL="<git config user.email>"

# Git Internals - Git Objects

![A Complex-Looking Object](https://talkingwithimage.files.wordpress.com/2014/10/f5ac465df56759f1fecb01e677ceeb34.jpg)

## Git Objects

### [Fact #1 - Git is a Content Addressable File System][factlink]
This means that git is, at its core, a key-value store: you hand it content, and it hands back a unique key that you can later use to retrieve that content.
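As a rough illustration of the idea (not git's actual implementation), a content-addressable store can be sketched as a dictionary whose keys are derived from the values. The class name `ContentStore` and its methods are made up for this example, and the sketch hashes the raw content only, whereas git also hashes a header.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is computed from the value."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # Simplification: real git hashes a small header plus the content.
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ContentStore()
key = store.put(b"test content\n")
print(key, store.get(key))

# Storing identical content again yields the same key, so nothing is duplicated.
print(store.put(b"test content\n") == key)
```

Because the key depends only on the content, two identical files always map to the same object, no matter what they are called.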

[factlink]: https://notebooks.azure.com/dalinwilliams/projects/git-good/html/Git%20Facts

#### Great... what does that mean?

Let's go ahead and create a new repo

In [None]:
!git init

And use ```git hash-object``` to add content to our repo

In [None]:
!echo 'test content' | git hash-object -w --stdin

Just like that, we have content! By using ```hash-object``` with the ```-w``` flag, we've written ```'test content'``` to our repo. ```--stdin``` tells git to read the content from standard input (here, the output of ```echo```) rather than from a file.

In [None]:
!find .git/objects -type f

The value returned by ```hash-object``` is the key to retrieve the data: the SHA-1 checksum of the content plus a **header** (remember **header** for later).

We can retrieve this data from our repo using ```git cat-file``` and the checksum.

In [None]:
!git cat-file -p d670460b4b4aece5915caf5c68d12f560a9fe3e4

We can even show how git implements version control - albeit on a micro scale

Let's create a file, and add it to our repo

In [None]:
!echo 'version 1' > test.txt
!git hash-object -w test.txt

Let's modify that file to create a 'v2' of that file

In [None]:
!echo 'version 2' > test.txt
!git hash-object -w test.txt

And make sure these new objects have been created in our .git/objects dir

In [None]:
!find .git/objects -type f

Now, we can delete our local copy of test.txt, and use git to retrieve any version we want

In [None]:
!git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30 > test.txt
!cat test.txt

In [None]:
!git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a > test.txt
!cat test.txt

Now we can fetch data using the SHA-1 of a file - yay! However, since we are not storing the file by name, we now need to memorize the SHA-1 of every file and every version of that file, and remember its name...

![focus, focus, focus](https://www.theamericanconservative.com/wp-content/uploads/2013/09/student-studying.jpg)

Never fear! Git has a method of handling this complexity. For the time being, remember that each stored version of a file's content is called a ```blob``` - git's lowest-level representation of data. Git can even tell you that the object you are looking at is a ```blob```.

In [None]:
!git cat-file -t 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a

## Tree Objects


Tree objects solve the problem of memorizing file-version-SHA-1 pairs. 

Tree objects also allow us to store a group of files together.

All of the content in a git repo is stored as tree and blob objects, with trees representing directory entries and blobs representing file contents.

This is roughly similar to how UNIX file systems operate, with trees corresponding to UNIX directory entries and blobs corresponding to inodes/file contents.

A tree object contains one or more entries. Each entry holds the SHA-1 of a blob or subtree along with its associated mode, type, and filename.
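To make the entry layout concrete, here is a minimal Python sketch of how a tree could be serialized and hashed. The helper names `tree_entry` and `tree_sha1` are made up for this example. It assumes each entry is stored as `<mode> <name>`, a NUL byte, and the raw 20-byte SHA-1, prefixed with a `tree <size>` header (headers are covered under Object Storage). Sorting by plain filename is a simplification of git's real ordering rules.

```python
import hashlib


def tree_entry(mode: str, name: str, sha1_hex: str) -> bytes:
    """Serialize one entry: '<mode> <name>', a NUL byte, then the raw 20-byte SHA-1."""
    return f"{mode} {name}".encode() + b"\0" + bytes.fromhex(sha1_hex)


def tree_sha1(entries) -> str:
    """Hash a tree: a 'tree <size>' header followed by the serialized entries."""
    # Simplification: entries are sorted by plain filename.
    body = b"".join(tree_entry(*e) for e in sorted(entries, key=lambda e: e[1]))
    return hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()


# One entry: test.txt pointing at the "version 1" blob stored earlier.
entries = [("100644", "test.txt", "83baae61804e65cc73a7201a7252750c76066a30")]
print(tree_sha1(entries))
```

If the assumptions hold, the printed value should match the SHA-1 that ```git write-tree``` reports for an index containing only that entry.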

<Neat - what do you mean by trees?!>

![A simple representation of the git data model](https://git-scm.com/book/en/v2/images/data-model-1.png)

Alright, let's create our own tree. Git normally creates a tree by taking the state of your index (the staging area) and writing a series of tree objects from it, so let's stage some files.

To do that, we first need an index, which will function as our staging area.

We can use ```git update-index``` to stage the first version of test.txt, which we've already added to .git/objects. We'll need the ```--add``` flag because the file isn't in the staging area yet, combined with the ```--cacheinfo``` flag because the version we're adding lives in the object database rather than in our directory.

Finally, we'll need to provide the mode of the file, the SHA-1, and the filename. Since test.txt is a normal file, we provide the mode 100644. The other options are 100755 (executable file) and 120000 (symbolic link).

These modes are taken from UNIX modes. The three modes above are the only valid modes for files (i.e. blobs) in git, although other modes are available for submodules and directories.
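One way to see the UNIX heritage of these modes is to feed them to Python's standard `stat` module, which decodes the same bits that `ls -l` displays. The `GIT_MODES` dictionary is just an illustrative lookup table for this sketch, not something git exposes.

```python
import stat

# Modes that appear in git tree entries (octal), with their usual meaning.
GIT_MODES = {
    0o100644: "regular file",
    0o100755: "executable file",
    0o120000: "symbolic link",
    0o040000: "directory (subtree)",
}

for mode, meaning in GIT_MODES.items():
    # stat.filemode renders the file type and permission bits as ls -l would.
    print(f"{mode:06o}  {stat.filemode(mode)}  {meaning}")
```

The leading digits encode the file type (regular file, symlink, directory), and the trailing three digits are the familiar permission bits.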

In [None]:
!git update-index --add --cacheinfo 100644 \
  83baae61804e65cc73a7201a7252750c76066a30 test.txt

We can use ```git ls-files --stage``` to list the files staged in .git/index

In [None]:
!git ls-files --stage

Since we've added the file to the staging area, we can go ahead and use ```git write-tree```, which (unlike ```hash-object```) does not need a ```-w``` flag. It writes a tree object from the state of the index if that tree does not already exist.

In [None]:
!git write-tree
!git cat-file -p d8329fc1cc938780ffdd9f94e0d364e0ea74f579

We can now verify that the returned SHA-1 refers to a git tree object

In [None]:
!git cat-file -t d8329fc1cc938780ffdd9f94e0d364e0ea74f579

Alright, let's create a new tree with a second version of test.txt + a new file

In [None]:
!echo 'new file' > new.txt
!git update-index --add --cacheinfo 100644 \
  1f7a7a472abf3dd9643fd615f6da379c4acb3e3a test.txt
!git update-index --add new.txt

Awesome! Our staging area should have the new version of test.txt and a new file - new.txt. 

In [None]:
!git ls-files --stage

Let's go ahead and write that tree

In [None]:
!git write-tree
!git cat-file -p 0155eb4229851634a0f03eb265b69f5a2d56f341

Take note - this tree has both file entries, and the SHA-1 for test.txt is the one containing "version 2" (```1f7a7a```). For fun, let's add the first tree as a subdirectory of this tree. We can read trees into our staging area by calling ```git read-tree```.

In this case, we'll read an existing tree into our staging area as a subtree using the ```--prefix``` flag

In [None]:
!git read-tree --prefix=bak d8329fc1cc938780ffdd9f94e0d364e0ea74f579
!git write-tree
!git cat-file -p 3c4e9cd789d88d8d89c1073707c3585e41b0e614

So, what just happened?

If we were to create a new directory from this new tree, it would have new.txt and test.txt ("version 2") in the root directory, and a folder named bak which would contain test.txt ("version 1").

![Structure of git data after bak folder is added](https://git-scm.com/book/en/v2/images/data-model-2.png)

## Commit Objects

Alright so we have three trees which represent different snapshots of our project.

Wait - we still must remember the SHA-1s of these trees to recall these snapshots! We also do not have any information on who saved the snapshots, or when and why they were saved.

This is what commits will store for us.

In order to create a commit object, we need to call ```git commit-tree``` and specify a single tree SHA-1 and which commit objects (if any) directly precede it. Let's start with the first tree we wrote

In [None]:
!git config user.name "$USERNAME"
!git config --replace-all user.email "$USEREMAIL"
first_commit_sha_1 = !echo 'first commit' | git commit-tree d8329f 
first_commit_sha_1 = first_commit_sha_1[0]
pprint(first_commit_sha_1)

We're storing the commit SHA-1 in ```first_commit_sha_1``` because your SHA-1 will differ from ours: it is influenced by the config values of user.name and user.email, as well as by the time the commit is created.

We can now fetch the new commit object using ```git cat-file```

In [None]:
!git cat-file -p $first_commit_sha_1

The format you see here is simple: the top-level tree for the snapshot at that point; the parent commits, if any (the commit above should not have any parents); the author and committer information (name, email, and a timestamp); a blank line; and the commit message.
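The sketch below rebuilds that layout in Python to show why commit SHA-1s vary from person to person. `commit_sha1` is a hypothetical helper, the names and email addresses are placeholders, and the sketch assumes the same identity is used for both author and committer.

```python
import hashlib


def commit_sha1(tree, message, name, email, timestamp, tz="+0000", parents=()):
    """Build a commit body and hash it with a 'commit <size>' header."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    ident = f"{name} <{email}> {timestamp} {tz}"
    lines += [f"author {ident}", f"committer {ident}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()


tree = "d8329fc1cc938780ffdd9f94e0d364e0ea74f579"
a = commit_sha1(tree, "first commit", "Ada", "ada@example.com", 1700000000)
b = commit_sha1(tree, "first commit", "Bob", "bob@example.com", 1700000000)
c = commit_sha1(tree, "first commit", "Ada", "ada@example.com", 1700000001)

# Same tree and message, but a different identity or timestamp changes the hash.
print(a, b, c, sep="\n")
```

Since the identity and timestamp are part of the hashed bytes, two people committing the same tree with the same message still end up with different commit SHA-1s.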

Now, let's write two more commit objects, each referencing the commit that came directly before it.

In [None]:
second_commit_sha_1 = !echo 'second commit' | git commit-tree 0155eb -p $first_commit_sha_1
second_commit_sha_1 = second_commit_sha_1[0]
pprint(second_commit_sha_1)
third_commit_sha_1 = !echo 'third commit' | git commit-tree 3c4e9c -p $second_commit_sha_1
third_commit_sha_1 = third_commit_sha_1[0]
pprint(third_commit_sha_1)

What we have now are three new commits, each pointing to one of the three trees we created. We can use ```git log``` to see the all-too-familiar chain of commits and parent commits, provided we give it the last commit SHA-1

In [None]:
!git log --stat $third_commit_sha_1

First, give yourselves a round of applause - we managed to build up git history without using any of the front-end commands

What we've done here is what git does when we run ```git add``` and ```git commit``` - that is, we:

1. stored blobs for the files that have changed

2. updated the index (added files to the staging area)

3. wrote out the trees

4. wrote out the commit objects that reference the top-level trees and the commits that came immediately before them
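The four steps above can be sketched end to end in plain Python, using a dictionary as a stand-in for .git/objects. This is a simplified model rather than git's implementation: the helper names (`put`, `write_tree`, `commit`) are made up, the index is a flat filename-to-SHA-1 mapping with no subdirectories, every file gets mode 100644, and the identity and timestamp are hard-coded placeholders.

```python
import hashlib

objects = {}  # stand-in for .git/objects: SHA-1 -> (type, body)


def put(kind: str, body: bytes) -> str:
    """Hash '<type> <size>' + NUL + body and store the object under that key."""
    sha = hashlib.sha1(f"{kind} {len(body)}\0".encode() + body).hexdigest()
    objects[sha] = (kind, body)
    return sha


def write_tree(index: dict) -> str:
    """Turn a flat {filename: blob_sha} index into a single tree object."""
    body = b"".join(
        f"100644 {name}".encode() + b"\0" + bytes.fromhex(sha)
        for name, sha in sorted(index.items())
    )
    return put("tree", body)


def commit(tree: str, message: str, parent=None) -> str:
    """Write a commit pointing at a tree and, optionally, a parent commit."""
    ident = "Example User <user@example.com> 1700000000 +0000"  # placeholder identity
    lines = [f"tree {tree}"] + ([f"parent {parent}"] if parent else [])
    lines += [f"author {ident}", f"committer {ident}", "", message, ""]
    return put("commit", "\n".join(lines).encode())


index = {}
index["test.txt"] = put("blob", b"version 1\n")      # steps 1 and 2: store blob, stage it
first = commit(write_tree(index), "first commit")     # steps 3 and 4: tree, then commit

index["test.txt"] = put("blob", b"version 2\n")
index["new.txt"] = put("blob", b"new file\n")
second = commit(write_tree(index), "second commit", parent=first)

# Walk the parent chain, roughly what `git log` does.
sha = second
while sha:
    _, body = objects[sha]
    header, _, message = body.decode().partition("\n\n")
    print(sha[:7], message.strip())
    parents = [l.split()[1] for l in header.splitlines() if l.startswith("parent ")]
    sha = parents[0] if parents else None
```

Every object type goes through the same `put` function; only the type label and the body differ.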

Let's take a look at all of our shiny new git objects

In [None]:
!find .git/objects -type f

If we were to map the above pointers, we would have an object graph similar to the following:

![All the reachable objects in your git directory](https://git-scm.com/book/en/v2/images/data-model-3.png)

### Remember:
Your commit SHA-1s will be different from the ones in the diagram; however, the tree and blob SHA-1s should be the same.

## Object Storage

Earlier we mentioned that there is a header stored with every object we commit to our git object database. Let's take a closer look at how this header is built, and how it influences the computation of the object's SHA-1.

Using Ruby, let's set up sample data to be committed to our repo.

```ruby
irb
>> content = "what is up, doc?"
=> "what is up, doc?"
```

After receiving the content, git will generate the header. The header starts with the type of the object (blob in this case) and a space, followed by the size of the content in bytes, and a final null byte.

```ruby
>> header = "blob #{content.bytesize}\0"
=> "blob 16\u0000"
```

Git will concatenate the header and the original content. The SHA-1 is then calculated over the combined result, not over the content alone.

```ruby
>> store = header + content
=> "blob 16\u0000what is up, doc?"
>> require 'digest/sha1'
=> true
>> sha1 = Digest::SHA1.hexdigest(store)
=> "bd9dbf5aae1a3862dd1526723246b20206e5fc37"
```
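For readers following along in the notebook's own language, the same computation can be sketched in Python with the standard `hashlib` module. The variable names mirror the Ruby transcript. The one deliberate difference is that the content is encoded to bytes first, since the size in the header counts bytes rather than characters.

```python
import hashlib

content = "what is up, doc?".encode("utf-8")   # git hashes bytes, not characters
header = f"blob {len(content)}\0".encode()     # type, space, size in bytes, NUL
store = header + content

print(header)
print(hashlib.sha1(store).hexdigest())
```

For plain ASCII text the character count and byte count are the same, but for content with multi-byte characters only the byte count gives the hash git would compute.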

We have implemented the Ruby logic above in 'sha-1-example.rb'. The SHA-1 it prints will be the same SHA-1 that ```git hash-object``` computes!

In [None]:
!ruby sha-1-example.rb

In [None]:
!echo -n "what is up, doc?" | git hash-object --stdin

Git then compresses the new content with zlib.

```ruby
>> require 'zlib'
=> true
>> zlib_content = Zlib::Deflate.deflate(store)
=> "x\x9CK\xCA\xC9OR04c(\xCFH,Q\xC8,V(-\xD0QH\xC9O\xB6\a\x00_\x1C\a\x9D"
```

Now, we need to write this zlib-deflated content to an object on disk. The subdirectory name is the first two characters of the SHA-1 value, and the remaining 38 characters are the file name within that directory

```ruby
>> path = '.git/objects/' + sha1[0,2] + '/' + sha1[2,38]
=> ".git/objects/bd/9dbf5aae1a3862dd1526723246b20206e5fc37"
>> require 'fileutils'
=> true
>> FileUtils.mkdir_p(File.dirname(path))
=> ".git/objects/bd"
>> File.open(path, 'w') { |f| f.write zlib_content }
=> 32
```
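A Python sketch of the same write path is shown here, under the assumption that `zlib.compress` produces a stream git can read (it emits the same zlib format as Ruby's `Zlib::Deflate.deflate`). The function name `write_loose_object` is made up, and the example writes into a throwaway temporary directory instead of a real .git/objects so it cannot disturb the repo.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, kind: str, content: bytes) -> Path:
    """Compress header + content and store it under <first 2 hex>/<remaining 38>."""
    store = f"{kind} {len(content)}\0".encode() + content
    sha1 = hashlib.sha1(store).hexdigest()
    path = objects_dir / sha1[:2] / sha1[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))  # binary write, no newline translation
    return path


# Write into a temporary directory rather than a real .git/objects.
with tempfile.TemporaryDirectory() as tmp:
    path = write_loose_object(Path(tmp), "blob", b"what is up, doc?")
    print(path.relative_to(tmp))
    print(zlib.decompress(path.read_bytes()))
```

Decompressing the file gives back the header followed by the original content, which is a quick way to check the round trip.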

We have implemented the Ruby steps above in blob-example.rb. We can run the script and check whether we've created a valid git blob object

In [None]:
!ruby blob-example.rb
!git cat-file -p bd9dbf5aae1a3862dd1526723246b20206e5fc37

Done and done 🎉

All git objects are stored the same way, just with different types. The header would be 'tree' for tree objects, 'commit' for commit objects, etc.
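Because the layout is shared, one small reader can handle any loose object. The following is a minimal sketch with a hypothetical `parse_object` helper: it decompresses the bytes, splits the header from the content at the first NUL byte, and checks the recorded size. It assumes it is given the raw bytes of a loose object file and does not handle packfiles.

```python
import zlib


def parse_object(raw: bytes):
    """Decompress a loose object and split it into (type, content)."""
    store = zlib.decompress(raw)
    header, _, content = store.partition(b"\0")  # split at the first NUL only
    kind, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return kind, content


# Build two sample objects in memory and read them back.
for kind, body in [("blob", b"version 1\n"), ("tree", b"")]:
    raw = zlib.compress(f"{kind} {len(body)}\0".encode() + body)
    print(parse_object(raw))
```

The type label in the header is the only thing that tells the reader how to interpret the bytes that follow.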