Migration to Git LFS inflates repository multiple times #3374

Closed
mloskot opened this issue Nov 12, 2018 · 4 comments

Comments

@mloskot

mloskot commented Nov 12, 2018

(I originally sent this to the Git mailing list, in the thread "Migration to Git LFS inflates repository multiple times", but I hope asking the Git LFS team directly is fine too.)

TL;DR: Is it normal that a repository migrated to Git LFS inflates multiple times,
and how do I deal with it?

I'm migrating a big SVN repository to Git.
In SVN, a collection of third-party SDKs is maintained along with the codebase.
Many of the third-party libraries come in binary form.
So, I'm migrating those binary files to Git LFS.

I'm following the Git LFS tutorial, section Migrating existing repository data to LFS.

First, I ran the initial translation of the SVN repo into Git.
The new repository is a Git bare repository.
There are 5 branches and 10+ tags in the proj.git repo.

It is quite large:

proj.git (BARE:master) $ du -sh
19G

Next, I performed the following sequence of steps to optimise it
and migrate to Git LFS:

  1. Optimise the repo
proj.git (BARE:master) $ git gc
Enumerating objects: 1432599, done.
Counting objects: 100% (1432599/1432599), done.
Delta compression using up to 48 threads
Compressing objects: 100% (864524/864524), done.
Writing objects: 100% (1432599/1432599), done.
Total 1432599 (delta 541698), reused 1405922 (delta 525738)
Removing duplicate objects: 100% (256/256), done.
Checking connectivity: 1432599, done.

proj.git (BARE:master) $ du -sh
11G
  2. List the file types taking up the most space in the repo
proj.git (BARE:master) $ git lfs migrate info --everything
migrate: Sorting commits: ..., done
migrate: Examining commits: 100% (29412/29412), done
*.lib   27 GB       3524/3524 files(s)  100%
*.pdb   5.6 GB      1412/1412 files(s)  100%
*.cpp   4.8 GB  131848/131854 files(s)  100%
*.exe   2.3 GB        798/798 files(s)  100%
*.dll   2.0 GB      1000/1000 files(s)  100%
  3. Migrate the repo to Git LFS
proj.git (BARE:master) $ git lfs migrate import --include="*.exe,*.dll,*.lib,*.pdb,*.zip" --everything
  4. Check size of the repo after migration to Git LFS
proj.git (BARE:master) $ du -sh
47G
  5. Clean up the .git directory after migration to Git LFS
proj.git (BARE:master) $ git reflog expire --expire-unreachable=now --all

and

proj.git (BARE:master) $ git gc --prune=now --aggressive
Enumerating objects: 1462310, done.
Counting objects: 100% (1462310/1462310), done.
Delta compression using up to 48 threads
Compressing objects: 100% (1422322/1422322), done.
Writing objects: 100% (1462310/1462310), done.
Total 1462310 (delta 577640), reused 845097 (delta 0)
Removing duplicate objects: 100% (256/256), done.
Checking connectivity: 1462310, done.
  6. Check final disk size of the repo
proj.git (BARE:master) $ du -sh
39G
  7. List the file types taking up the most space in the repository after migration to Git LFS
proj.git (BARE:master) $ git lfs migrate info --everything
migrate: Sorting commits: ..., done
migrate: Examining commits: 100% (29412/29412), done
*.cpp   4.8 GB  131848/131854 files(s)  100%
*.png   1.1 GB  696499/696499 files(s)  100%
*.h     828 MB    86386/86471 files(s)  100%
*.csv   820 MB        939/939 files(s)  100%
*.html  686 MB    34126/34126 files(s)  100%

Now, I'm looking for answers to the following questions:

  1. Is the procedure presented above the correct way to migrate (SVN ->) Git -> Git LFS?

  2. Given that the initial translation to Git generated a 19 GB repo (optimised to 11 GB), is it normal that the Git LFS migration inflates the repository to 47 GB (optimised to 39 GB)?

  3. Why does the inflation happen? Is it a function of the number of branches? How should I understand the jump from 11 GB to 39 GB?

  4. How can I optimise the repository to cut the size down further?

My next step is to somehow push the fat pig into GitHub, Bitbucket or Azure DevOps ;-)

I've used Git for a few years, but I'm pretty much a newbie regarding low-level or administration tasks, so I might have made basic errors.
I'll be thankful for any feedback.

@bturner

bturner commented Nov 13, 2018

As Ævar suggested on the Git mailing list, the most likely culprit here is the lack of delta compression on the LFS objects after they're extracted. Each large object is still individually compressed, so they don't use the total space git lfs migrate info --everything showed, but without the delta compression to capitalize on common bytes between files, they still end up much larger.
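
To make the effect concrete, here is a minimal Python sketch. It is not Git's delta algorithm: it uses zlib with a preset dictionary as a rough stand-in for "compress this version against an earlier version", and the two synthetic "versions" are made-up data, not anything from the repository discussed here.

```python
import random
import zlib


def compressed_alone(data: bytes) -> int:
    """Size of data when compressed with no reference to any other version."""
    return len(zlib.compress(data, 9))


def compressed_against(base: bytes, data: bytes) -> int:
    """Size of data when the compressor may reuse byte runs found in base."""
    comp = zlib.compressobj(level=9, zdict=base)
    return len(comp.compress(data) + comp.flush())


# Two hypothetical versions of a binary file: 20,000 pseudo-random bytes,
# with ten bytes changed in the second version.
rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(20_000))
v2_mutable = bytearray(v1)
v2_mutable[5000:5010] = b"0123456789"
v2 = bytes(v2_mutable)

# Each version stored on its own versus the second stored relative to the first.
stored_separately = compressed_alone(v1) + compressed_alone(v2)
stored_with_base = compressed_alone(v1) + compressed_against(v1, v2)

print("separately:", stored_separately, "bytes")
print("with base: ", stored_with_base, "bytes")
```

With many near-identical versions of the same library, storing each one independently multiplies the cost, which is the inflation described in the comment above.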

Rather than a simple du -sh, can you separate out Git objects and Git LFS objects in the resulting repository and calculate their combined sizes independently? Git's objects will have SHA-1 IDs and LFS objects will use SHA-256 IDs instead. (I'm not sufficiently familiar with the repository internals for how LFS objects are stored locally, but I'd wager they're in separate directories in your bare repository).
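
As a small illustration of the two ID schemes, the sketch below computes both kinds of ID for the same content. It assumes a repository using Git's default SHA-1 object format, where a blob ID is the SHA-1 of a `blob <size>\0` header plus the content, and a Git LFS OID that is the SHA-256 of the raw content. The helper names are hypothetical.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """SHA-1 ID Git assigns to a blob (default SHA-1 object format assumed)."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def lfs_oid(content: bytes) -> str:
    """SHA-256 OID Git LFS uses to identify the same content."""
    return hashlib.sha256(content).hexdigest()


def classify_id(hex_id: str) -> str:
    """Guess the ID scheme from the hex length: 40 for SHA-1, 64 for SHA-256."""
    return {40: "git (SHA-1)", 64: "lfs (SHA-256)"}.get(len(hex_id), "unknown")


sample = b"example binary payload"
print(classify_id(git_blob_id(sample)), git_blob_id(sample))
print(classify_id(lfs_oid(sample)), lfs_oid(sample))
```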

Have you tried doing a git clone --mirror proj.git check.git and then comparing how large the mirror is? I believe that wouldn't have any large objects in it, and would be much more indicative of the Git part of the repository size.

@mloskot

mloskot commented Nov 13, 2018

@bturner Thanks for the hints.

Can you separate out Git objects and Git LFS objects in the resulting repository
and calculate their combined sizes independently?
Git's objects will have SHA-1 IDs and LFS objects will use SHA-256 IDs instead.

I am aware of the different SHAs, but my Git-fu is still lacking.
So, I have no idea how to calculate the totals of objects in the two scopes separately.

Does git-sizer count objects managed by Git LFS? For the moment, I'll assume it does NOT. So, I have written a git lfs ls-files helper, git_lfs_calculate_size_by_type.py, that counts just the Git LFS objects.
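
The helper itself is not shown in this thread. A minimal sketch of what such a script might do is below. It is not the author's code, and it assumes that each line of `git lfs ls-files --size` has the form `<oid> <*|-> <path> (<number> <unit>)` with decimal units (KB = 1000 bytes). Both the line format and the unit base are assumptions to check against real output. Because the sizes are printed already rounded, any totals are approximate.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed line shape: "<oid> <*|-> <path> (<number> <unit>)"
LINE_RE = re.compile(
    r"^(?P<oid>[0-9a-f]+) [*-] (?P<path>.+) \((?P<num>[0-9.]+) (?P<unit>[KMGT]?B)\)$"
)
# Assumption: decimal units; change the multipliers if the tool uses 1024.
UNIT_BYTES = {"B": 1, "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4}


def summarise_by_extension(lines):
    """Return {extension: [count, approx_bytes]} for parsable lines."""
    summary = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that does not fit the assumed format
        ext = PurePosixPath(match["path"]).suffix.lower() or "(none)"
        size = int(round(float(match["num"]) * UNIT_BYTES[match["unit"]]))
        summary[ext][0] += 1
        summary[ext][1] += size
    return dict(summary)


# Hypothetical sample lines, not taken from the repository in this thread.
sample_lines = [
    "4d7a214614 * sdk/foo.lib (1.5 MB)",
    "9e3c1f0a2b - sdk/bar.lib (500 KB)",
    "0a1b2c3d4e * bin/tool.exe (2 GB)",
]
for ext, (count, size) in sorted(summarise_by_extension(sample_lines).items()):
    print(f"{ext}:\tcount: {count}\tsize: {size / 1000**2:.2f} MB")
```

In real use the lines would come from running `git lfs ls-files --size` in the repository, for example through `subprocess.run(..., capture_output=True, text=True)`.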

Here are the results for proj.git (bare), for which du -sh reports 38 GB:

proj.git (BARE:master) $ git-sizer
Processing blobs: 1107392
Processing trees: 178226
Processing commits: 29412
Matching commits to trees: 29412
Processing annotated tags: 0
Processing references: 24
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Blobs                      |           |                                |
|   * Total size               |  12.8 GiB | *                              |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Trees                      |           |                                |
|   * Maximum entries      [1] |  1.96 k   | *                              |
| * Blobs                      |           |                                |
|   * Maximum size         [2] |   113 MiB | ***********                    |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [3] |  13.3 k   | ******                         |
| * Maximum path depth     [4] |    18     | *                              |
| * Maximum path length    [5] |   232 B   | **                             |
| * Number of files        [6] |   910 k   | ******************             |
| * Total size of files    [7] |  3.37 GiB | ***                            |
proj.git (BARE:master) $ python git_lfs_calculate_size_by_type.py
Git LFS objects summary:
.lib:   count: 1111     size: 8764.66 MB
.dll:   count: 749      size: 1427.98 MB
.pdb:   count: 612      size: 2814.09 MB
.exe:   count: 786      size: 2005.72 MB
.zip:   count: 24       size: 1153.65 MB
Total:  count: 3282     size: 16166.11 MB

Then, would 12.8 GiB + 16166.11 MB be the grand total of the resulting repository?

Have you tried doing a git clone --mirror proj.git check.git and then comparing how large the mirror is?
I believe that wouldn't have any large objects in it, and would be much more indicative of the Git part of the repository size.

I've just tried that and here is what I got using

  • du - the mirror is small!
check.git (BARE:master) $ du -sh
2.4G    .
  • git-sizer
check.git (BARE:master) $ git-sizer
Processing blobs: 1107392
Processing trees: 178226
Processing commits: 29412
Matching commits to trees: 29412
Processing annotated tags: 0
Processing references: 24
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Blobs                      |           |                                |
|   * Total size               |  12.8 GiB | *                              |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Trees                      |           |                                |
|   * Maximum entries      [1] |  1.96 k   | *                              |
| * Blobs                      |           |                                |
|   * Maximum size         [2] |   113 MiB | ***********                    |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [3] |  13.3 k   | ******                         |
| * Maximum path depth     [4] |    18     | *                              |
| * Maximum path length    [5] |   232 B   | **                             |
| * Number of files        [6] |   910 k   | ******************             |
| * Total size of files    [7] |  3.37 GiB | ***                            |
  • Totals based on git lfs ls-files --size
proj.git (BARE:master) $ python git_lfs_calculate_size_by_type.py
Git LFS objects summary:
.lib:   count: 1111     size: 8764.66 MB
.dll:   count: 749      size: 1427.98 MB
.pdb:   count: 612      size: 2814.09 MB
.exe:   count: 786      size: 2005.72 MB
.zip:   count: 24       size: 1153.65 MB
Total:  count: 3282     size: 16166.11 MB

Interestingly, the git-sizer results above show the same sizes for both proj.git and its cloned mirror.

The git count-objects output seems consistent as well:

proj.git (BARE:master) $ git count-objects -v -H
count: 0
size: 0 bytes
in-pack: 1315030
packs: 1
size-pack: 2.39 GiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

and

check.git (BARE:master) $ git count-objects -v -H
count: 0
size: 0 bytes
in-pack: 1315030
packs: 1
size-pack: 2.39 GiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
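
The comparison between the two outputs can also be done mechanically. The sketch below parses the `key: value` lines shown above into a dictionary and compares them. The function name is hypothetical, and the text is pasted from this thread rather than captured by running Git.

```python
def parse_count_objects(text: str) -> dict:
    """Turn 'key: value' lines from `git count-objects -v -H` into a dict of strings."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # ignore lines that are not 'key: value'
            result[key.strip()] = value.strip()
    return result


# Output as reported above; it was identical for proj.git and check.git.
reported = """\
count: 0
size: 0 bytes
in-pack: 1315030
packs: 1
size-pack: 2.39 GiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
"""

proj_stats = parse_count_objects(reported)
check_stats = parse_count_objects(reported)
print("same pack statistics:", proj_stats == check_stats)
print("packs:", proj_stats["packs"], "size-pack:", proj_stats["size-pack"])
```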

Hmm, does that mean the whole set of content is compressed into a single pack file?

That would explain this, wouldn't it?

check.git (BARE:master) $ du -sh
2.4G    .

@bturner

bturner commented Nov 13, 2018

git-sizer is looking at the sizes of Git objects themselves, sans compression. The "Total size" for blobs is the combined size of every file that has ever existed in the repository. The "Total size of files" should be how large the repository would be when checked out minus any space used by LFS files.

Looking at your git_lfs_calculate_size_by_type.py, that's going to compute the size of the LFS files that would be checked out--it's not computing the size the compressed file data is using on disk.

The two together imply the work tree of a repository that cloned your proj.git with a checkout would weigh in at around 20 GB: 3.37 GiB for Git-tracked objects plus 16.2 GB for LFS-tracked objects.
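
The arithmetic behind that estimate, as a small sketch. The helper script's definition of "MB" is not stated in the thread, so both common interpretations are shown; that ambiguity is an assumption of this example, not something the thread settles.

```python
GIB = 1024**3

checkout_bytes = 3.37 * GIB  # git-sizer "Total size of files"
lfs_reported_mb = 16166.11   # total reported by the LFS helper script


def estimated_checkout_gb(bytes_per_mb: int) -> float:
    """Estimated work tree size in GB (10^9 bytes) for a given meaning of 'MB'."""
    return (checkout_bytes + lfs_reported_mb * bytes_per_mb) / 10**9


for label, bytes_per_mb in (("MB = 10^6 bytes", 10**6), ("MB = 2^20 bytes", 2**20)):
    print(f"{label}: about {estimated_checkout_gb(bytes_per_mb):.1f} GB")
```

Either interpretation lands close to the 20 GB figure quoted above.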

I walked through your steps locally (with a trivial, throwaway repository; it's not the data contents I was interested in). Based on having done so, in your proj.git run these two commands:

  • du -sh objects - This is the size of Git objects, packed/compressed
  • du -sh lfs - This is the size of the LFS objects, compressed but not packed (because LFS doesn't work that way)
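
Where `du` is not available, the same split can be approximated in Python. This is a minimal sketch: it sums file sizes rather than allocated disk blocks, so the numbers can differ slightly from what `du` reports, and the function name is hypothetical.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def human_gib(num_bytes: int) -> str:
    """Format a byte count in GiB with two decimals."""
    return f"{num_bytes / 1024**3:.2f} GiB"
```

Run from inside the bare repository, `human_gib(dir_size_bytes("objects"))` and `human_gib(dir_size_bytes("lfs"))` would give the two figures to compare.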

My expectation is that your du -sh objects will weigh in somewhere in the 2-4GB range, with du -sh lfs showing 30GB+. That would confirm that everything is basically "as expected": Migrating your objects to LFS has made the data left in "normal" Git objects quite a lot smaller, and the disk usage is because your initial conversion has every version of every large object in the entire repository available locally and, without delta compression, they take up significantly more space.

@mloskot

mloskot commented Nov 13, 2018

Right, I made a mistake: the total is not Git blobs + Git LFS (12.8 GiB + 16166.11 MB) but Git checkout + Git LFS, as you corrected.

My expectation is that your du -sh objects will weigh in somewhere in the 2-4GB range, with du -sh lfs showing 30GB+.

AFAICT, it is as you expected:

proj.git (BARE:master) $ du -sh objects && du -sh lfs
2.5G    objects
36G     lfs

By the way, for the git clone --mirror proj.git check.git, it is also as expected:

check.git (BARE:master) $ du -sh objects && du -sh lfs
2.5G    objects
1.9M    lfs

That would confirm that everything is basically "as expected" (...)

Thank you very much for walking me through this and helping me understand the issues.
I initially completely missed the role of delta compression for binary files.

(Closing the issue as it's been answered for me. Thanks a lot!)
