I intend to write this whole project in Rust to learn the language along the way.
In Git, each file and folder is stored as a git object. Git objects live in the `.git/objects` folder. The two types that describe files and folders are blobs and trees.
A blob is a git object that contains the content of a file.
A tree is a git object that contains the direct content of a directory.
You might be wondering: how do I know which git object represents which file or directory? Well, git uses a special technique to create a unique identifier for each git object. This unique identifier is called the OID (Object IDentifier).
The OID is created by hashing the git object, that is, a short header followed by the content. The hashing algorithm used is SHA-1, which produces a 160-bit digest, written as a 40-character hexadecimal string. This string is the OID of the git object.
Let's take an example. Say we have a file named `foo.txt` with the following content:
Hello World!
The OID of this file will be generated by using:
SHA-1("blob 12\0Hello World!")
So what is happening here?
- First, we write the type of the git object: `blob` or `tree`.
- Then, we add a space. (This is just a convention.)
- Then, we add the length of the file's content in bytes: `12`.
- Then, we add a null character, `\0`. It separates the header from the content.
- Finally, we add the content of the file: `Hello World!`.
The SHA-1 algorithm will generate the following hash:
c57eff55ebc0c54973903af5f72bac72762cf4f4
So now we have a unique identifier for our file. Let's try to do that with git.
mkdir git-objects
cd git-objects
git init
# printf (unlike echo) adds no trailing newline, so the file is exactly 12 bytes
printf 'Hello World!' > foo.txt
git add foo.txt
Now, go inside the `.git/objects` folder. Right now, you should be wondering where the file named `c57eff55ebc0c54973903af5f72bac72762cf4f4` is. Well, git is smart: having too many files in the same directory can make the system slow. To prevent this, git uses a special technique to store git objects.
Git uses the first 2 characters of the OID as a directory name and the remaining 38 characters as the file name. So in our case, the file is stored in the `c5` directory under the name `7eff55ebc0c54973903af5f72bac72762cf4f4`.
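The split described above can be sketched in a few lines of Rust. The function name `object_path` and the validation it performs (exactly 40 hexadecimal characters) are assumptions made for this example, not something git exposes.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: maps a 40-character hex OID to its loose-object path,
/// using the first 2 characters as the directory and the other 38 as the file name.
fn object_path(oid: &str) -> Option<PathBuf> {
    // Reject anything that is not a full SHA-1 OID in hexadecimal.
    if oid.len() != 40 || !oid.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let (dir, file) = oid.split_at(2);
    Some(Path::new(".git").join("objects").join(dir).join(file))
}

fn main() {
    match object_path("c57eff55ebc0c54973903af5f72bac72762cf4f4") {
        Some(path) => println!("{}", path.display()),
        None => println!("not a valid OID"),
    }
}
```

Returning an `Option` keeps malformed input from turning into a bogus path; a fuller implementation might report why the OID was rejected.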
Let's try to see the content of the file.
cat .git/objects/c5/7eff55ebc0c54973903af5f72bac72762cf4f4
OK, all of this is fun, but so far we only have the name of the file. By the way, you cannot reverse the SHA-1 algorithm. So how do we get the content of the file back?
Each git object holds the exact content of the file or directory it represents. If you tried to display the object file above, you should have seen something unreadable. That is completely intended.
Each object is meant to be stored in a database and restored later. But a database cannot afford to store gigabytes of raw data per object, and splitting one file across multiple objects would break the OID concept we just explained.
To save space, git compresses the object (header and content) with zlib, which produces binary output. This binary output is what gets written to the object file.
If you want to see the content of the file, you need to decompress the git object. To do that, you can use the `git cat-file` command.
git cat-file -p c57eff55ebc0c54973903af5f72bac72762cf4f4
You should see the content of the file.
Each git object has a header. The header is used to store the type of the git object and the length of the content of the git object.
To see the header of a git object, you can use the `git cat-file` command.
git cat-file -t c57eff55ebc0c54973903af5f72bac72762cf4f4
You should see the type of the git object: `blob`.
git cat-file -s c57eff55ebc0c54973903af5f72bac72762cf4f4
You should see the size of the content of the git object: `12`.
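The type and size reported above come straight from the header. As a rough sketch, assuming the object has already been decompressed into a byte buffer, the header can be split from the content like this in Rust (`parse_object` is a name invented for this example):

```rust
/// Hypothetical helper: splits a decompressed git object into (type, size, content).
/// Returns None if the header is malformed or the size does not match the content.
fn parse_object(raw: &[u8]) -> Option<(String, usize, &[u8])> {
    // The header ends at the first null byte.
    let nul = raw.iter().position(|&b| b == 0)?;
    let header = std::str::from_utf8(&raw[..nul]).ok()?;

    // The header has the form "<type> <size>".
    let (kind, size) = header.split_once(' ')?;
    let size: usize = size.parse().ok()?;

    // Everything after the null byte is the content.
    let content = &raw[nul + 1..];
    if content.len() != size {
        return None;
    }
    Some((kind.to_string(), size, content))
}

fn main() {
    // Decompressed bytes of the example object, written out by hand.
    let raw = b"blob 12\0Hello World!";
    if let Some((kind, size, content)) = parse_object(raw) {
        println!("type: {}", kind);
        println!("size: {}", size);
        println!("content: {}", String::from_utf8_lossy(content));
    }
}
```

Checking the declared size against the actual content length is a cheap way to catch a truncated or corrupted object.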
The SHA-1 algorithm is no longer considered secure. It is possible to create two different contents that produce the same SHA-1 hash. This is called a collision. This is why Git is moving to SHA-256: experimental support for SHA-256 repositories was added in Git 2.29.