BragdonD/rgit

Let's learn Git by building it (Part 1)

I intend to write this whole project in Rust to learn the language along the way.

How does Git handle files and directories?

In Git, each file and folder is considered a Git object. Git objects are stored in the .git/objects folder. The two object types that represent files and directories are blobs and trees.

Blobs

A blob is a git object that contains the content of a file.

Trees

A tree is a Git object that lists the direct content of a directory: its files and its subdirectories.

OID (Object IDentifier)

You might be wondering: how do I know which Git object represents which file or directory? Git creates a unique identifier for each Git object. This unique identifier is called the OID (Object IDentifier).

The OID is created by hashing the Git object, that is, a short header followed by the content. The hashing algorithm used is SHA-1, which produces a 160-bit digest written as a 40-character hexadecimal string. This string is the OID of the Git object.

Let's take an example. Let's say we have a file named foo.txt containing exactly the following 12 bytes, with no trailing newline:

Hello World!

The OID of this file is computed as:

SHA-1("blob 12\0Hello World!")

So what is happening here?

  • First comes the type of the Git object: blob or tree (here, blob).
  • Then a single space. (This is just a convention.)
  • Then the length of the content in bytes, written in decimal: 12.
  • Then a null character \0, which separates the header from the content.
  • Finally, the content of the file: Hello World! (12 bytes, no trailing newline).

The SHA-1 algorithm will generate the following hash:

c57eff55ebc0c54973903af5f72bac72762cf4f4
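The recipe above can be sketched in Rust using only the standard library. This is a minimal illustration, not how rgit or Git itself is implemented: the function names sha1, to_hex and git_oid are made up for this example, and the SHA-1 routine is a plain, unoptimized transcription of the standard algorithm. A real project would more likely rely on an existing SHA-1 crate.

```rust
/// Minimal SHA-1, for illustration only (hypothetical helper, not from rgit).
fn sha1(data: &[u8]) -> [u8; 20] {
    let mut h: [u32; 5] = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0];

    // Padding: 0x80, zeros up to 56 mod 64, then the bit length as a big-endian u64.
    let mut msg = data.to_vec();
    let bit_len = (data.len() as u64).wrapping_mul(8);
    msg.push(0x80);
    while msg.len() % 64 != 56 {
        msg.push(0);
    }
    msg.extend_from_slice(&bit_len.to_be_bytes());

    for chunk in msg.chunks(64) {
        // Message schedule: 16 words from the block, expanded to 80.
        let mut w = [0u32; 80];
        for i in 0..16 {
            w[i] = u32::from_be_bytes([chunk[4 * i], chunk[4 * i + 1], chunk[4 * i + 2], chunk[4 * i + 3]]);
        }
        for i in 16..80 {
            w[i] = (w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16]).rotate_left(1);
        }

        let (mut a, mut b, mut c, mut d, mut e) = (h[0], h[1], h[2], h[3], h[4]);
        for i in 0..80 {
            let (f, k): (u32, u32) = match i {
                0..=19 => ((b & c) | (!b & d), 0x5A827999),
                20..=39 => (b ^ c ^ d, 0x6ED9EBA1),
                40..=59 => ((b & c) | (b & d) | (c & d), 0x8F1BBCDC),
                _ => (b ^ c ^ d, 0xCA62C1D6),
            };
            let temp = a
                .rotate_left(5)
                .wrapping_add(f)
                .wrapping_add(e)
                .wrapping_add(k)
                .wrapping_add(w[i]);
            e = d;
            d = c;
            c = b.rotate_left(30);
            b = a;
            a = temp;
        }
        h[0] = h[0].wrapping_add(a);
        h[1] = h[1].wrapping_add(b);
        h[2] = h[2].wrapping_add(c);
        h[3] = h[3].wrapping_add(d);
        h[4] = h[4].wrapping_add(e);
    }

    let mut out = [0u8; 20];
    for (i, word) in h.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&word.to_be_bytes());
    }
    out
}

/// Formats bytes as lowercase hexadecimal.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Computes an OID as SHA-1("<type> <length>\0<content>").
fn git_oid(kind: &str, content: &[u8]) -> String {
    let mut store = format!("{} {}\0", kind, content.len()).into_bytes();
    store.extend_from_slice(content);
    to_hex(&sha1(&store))
}
```

Under these assumptions, git_oid("blob", b"Hello World!") should return the hash shown above. Note that the length is counted in bytes, so adding even a single trailing newline to the content produces a completely different OID.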

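So now we have a unique identifier for our file. Let's check it with Git. We write the file with printf rather than echo, because echo would append a newline and the content would be 13 bytes instead of 12.

mkdir git-objects
cd git-objects
git init
printf 'Hello World!' > foo.txt
git add foo.txt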

Now, go inside the .git/objects folder. You may be wondering where the file named c57eff55ebc0c54973903af5f72bac72762cf4f4 is. Having too many files in the same directory can slow the file system down, so Git spreads its objects across subdirectories.

Git uses the first 2 characters of the OID as a directory name and the remaining 38 characters as the file name. So in our case, the object is stored in the c5 directory under the name 7eff55ebc0c54973903af5f72bac72762cf4f4.
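The splitting rule can be sketched as a small Rust function. The name object_path is hypothetical and chosen for this example only; the sketch assumes a 40-character hexadecimal SHA-1 OID and a repository whose objects live under .git/objects.

```rust
use std::path::PathBuf;

/// Returns the path of a loose object relative to the repository root,
/// or None if `oid` is not made of exactly 40 hexadecimal characters.
fn object_path(oid: &str) -> Option<PathBuf> {
    if oid.len() != 40 || !oid.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    // First 2 characters: directory name. Remaining 38: file name.
    let (dir, file) = oid.split_at(2);
    Some(PathBuf::from(".git").join("objects").join(dir).join(file))
}
```

For the OID of our example, this yields .git/objects/c5/7eff55ebc0c54973903af5f72bac72762cf4f4.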

Let's try to see the content of the file.

cat .git/objects/c5/7eff55ebc0c54973903af5f72bac72762cf4f4

OK, all of this is fun, but so far we only have the name of the file, and SHA-1 cannot be reversed. So how do we get the content of the file back?

Git objects content

Each Git object stores the exact content of the file or directory it represents. If you tried to print the object file earlier, you should have seen something unreadable. This is completely intended.

Each object is meant to be stored in a database so that it can be restored later. Storing every file uncompressed would waste a lot of space, and splitting one file across several objects would break the OID concept we just explained.

To save space, Git compresses the object (header and content) with zlib. The resulting binary output is what is written to the object file.

If you want to see the content of the file, you need to decompress the Git object. To do that, you can use the git cat-file command.

git cat-file -p c57eff55ebc0c54973903af5f72bac72762cf4f4

You should see the content of the file.

Git objects header

Each Git object has a header. The header stores the type of the Git object and the length of its content.

To inspect what the header holds, you can use the git cat-file command with -t (type) and -s (size).

git cat-file -t c57eff55ebc0c54973903af5f72bac72762cf4f4

You should see the type of the git object: blob.

git cat-file -s c57eff55ebc0c54973903af5f72bac72762cf4f4

You should see the size of the content of the git object: 12.
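To make the header layout concrete, here is a hedged Rust sketch that splits an already decompressed object into its type, declared size, and content. The function name parse_object is an assumption made for this example, and zlib decompression is assumed to have happened beforehand (the standard library does not provide it).

```rust
/// Splits decompressed object bytes "<type> <size>\0<content>" into their parts.
/// Returns None if the header is malformed or the declared size does not
/// match the actual content length.
fn parse_object(raw: &[u8]) -> Option<(&str, usize, &[u8])> {
    // The header ends at the first null byte.
    let nul = raw.iter().position(|&b| b == 0)?;
    let header = std::str::from_utf8(&raw[..nul]).ok()?;

    // The header is "<type> <size>", separated by a single space.
    let (kind, size) = header.split_once(' ')?;
    let size: usize = size.parse().ok()?;

    let content = &raw[nul + 1..];
    if content.len() != size {
        return None;
    }
    Some((kind, size, content))
}
```

Given the bytes "blob 12\0Hello World!", this returns the type blob, the size 12, and the 12 content bytes, which matches what git cat-file -t, -s and -p report.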

Side note about SHA-1

SHA-1 is no longer considered secure. It is possible to craft two different contents that produce the same SHA-1 hash; this is called a collision. This is why Git is moving toward SHA-256: experimental support for SHA-256 repositories was added in Git 2.29.

About

A rust implementation of git
