Lighten up repo #12

Closed
Robinlovelace opened this issue Mar 3, 2022 · 14 comments

Comments

@Robinlovelace
Contributor

The zip file alone is now 110 MB 😵‍💫

Good news: we can purge the giant files with https://rtyley.github.io/bfg-repo-cleaner/

I can give this a go... You may need to reclone the (much lighter) repo after the job is done as it 'rewrites history' so heads-up @Nowosad, @anitagraser and @michaeldorman.

@Robinlovelace
Contributor Author

Preliminary tests below. Going to reclone following the exact instruction...

java -jar ~/programs/bfg-1.14.0.jar --strip-blobs-bigger-than 1M .

Using repo : /mnt/57982e2a-2874-4246-a6fe-115c199bc6bd/orgs/geocompr/pytest/./.git

Scanning packfile for large blobs: 417
Scanning packfile for large blobs completed in 401 ms.
Found 3 blob ids for large blobs - biggest=34328852 smallest=1546282
Total size (unpacked)=47623904
Found 33 objects to protect
Found 9 commit-pointing refs : HEAD, refs/heads/main, refs/heads/motivations, ...

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit dd90d990 (protected by 'HEAD') - contains 2 dirty files : 
        - data/landsat.tif (11.2 MB)
        - data/nz_elev.tif (1.5 MB)

WARNING: The dirty content above may be removed from other commits, but as
the *protected* commits still use it, it will STILL exist in your repository.

Details of protected dirty content have been recorded here :

/mnt/57982e2a-2874-4246-a6fe-115c199bc6bd/orgs/geocompr/pytest/..bfg-report/2022-03-03/10-21-18/protected-dirt/

If you *really* want this content gone, make a manual commit that removes it,
and then run the BFG on a fresh copy of your repo.
       

Cleaning
--------

Found 173 commits
Cleaning commits:       100% (173/173)
Cleaning commits completed in 300 ms.

Updating 7 Refs
---------------

        Ref                               Before     After   
        -----------------------------------------------------
        refs/heads/main                 | dd90d990 | f0e2a140
        refs/heads/motivations          | 2683cfa8 | bb1fa3fd
        refs/heads/quartotest           | 1a6fd3fd | 69bce924
        refs/remotes/origin/gh-pages    | 1f60a2bb | 31e52c47
        refs/remotes/origin/main        | dd90d990 | f0e2a140
        refs/remotes/origin/motivations | 2683cfa8 | bb1fa3fd
        refs/remotes/origin/quartotest  | 1a6fd3fd | 69bce924

Updating references:    100% (7/7)
...Ref update completed in 19 ms.

Commit Tree-Dirt History
------------------------

        Earliest                                              Latest
        |                                                          |
        ...DDDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmmmmmmmmmmmmmmDDDDDDDDDDDm

        D = dirty commits (file tree fixed)
        m = modified commits (commit message or parents changed)
        . = clean commits (no changes to file tree)

                                Before     After   
        -------------------------------------------
        First modified commit | 46378c9b | 7b098c52
        Last dirty commit     | 0f9cadc4 | 2f219096

Deleted files
-------------

        Filename      Git id            
        --------------------------------
        landsat.tif | 9c25b46e (11.2 MB)
        nz_elev.tif | 4c3ff0e2 (1.5 MB) 
        state.shp   | 60cb67a2 (32.7 MB)


In total, 227 object ids were changed. Full details are logged here:

        /mnt/57982e2a-2874-4246-a6fe-115c199bc6bd/orgs/geocompr/pytest/..bfg-report/2022-03-03/10-21-18

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive

@Robinlovelace
Contributor Author

Just ran this:

git clone --mirror https://github.com/geocompr/py
java -jar ~/programs/bfg-1.14.0.jar --strip-blobs-bigger-than 1M py
cd py
cd py.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive

@Robinlovelace
Contributor Author

Heads-up: you will need to reclone this after I run git push.

@Robinlovelace
Contributor Author

Not sure if it's worked but have just pushed...
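
One way to check whether a clean-up actually shrank the repository is to look at the size of its object store before and after. A minimal sketch, assuming plain git is available; the function name `repo_size` is hypothetical and not part of the thread:

```shell
# Print object-store statistics for a repository (defaults to the current directory).
# The "size-pack" field is the size of the packed history.
repo_size() {
  git -C "${1:-.}" count-objects -vH
}
```

Running `repo_size py.git` before and after the clean-up shows whether the pack got smaller. The "Receiving objects" line of a fresh clone gives a similar figure from the remote side.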

@Robinlovelace
Contributor Author

Report:

git reflog expire --expire=now --all && git gc --prune=now --aggressive
Cloning into bare repository 'py.git'...
remote: Enumerating objects: 1142, done.
remote: Counting objects: 100% (1142/1142), done.
remote: Compressing objects: 100% (605/605), done.
remote: Total 1142 (delta 625), reused 972 (delta 457), pack-reused 0
Receiving objects: 100% (1142/1142), 125.42 MiB | 6.14 MiB/s, done.
Resolving deltas: 100% (625/625), done.

Using repo : /tmp/py/.git


This repo has been processed by The BFG before! Will prune repo before proceeding - to avoid unnecessary cleaning work on unused objects...
Completed prune of old objects - will now proceed with the main job!

Scanning packfile for large blobs: 1966
Scanning packfile for large blobs completed in 25 ms.
Found 3 blob ids for large blobs - biggest=101731257 smallest=1546282
Total size (unpacked)=115026309
Found 33 objects to protect
Found 5 commit-pointing refs : HEAD, refs/heads/main, refs/remotes/origin/HEAD, ...

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit a97a8c42 (protected by 'HEAD') - contains 3 dirty files : 
	- data/air.2x2.250.mon.anom.comb.nc (97.0 MB)
	- data/landsat.tif (11.2 MB)
	- data/nz_elev.tif (1.5 MB)

WARNING: The dirty content above may be removed from other commits, but as
the *protected* commits still use it, it will STILL exist in your repository.

Details of protected dirty content have been recorded here :

/tmp/py.bfg-report/2022-03-03/10-34-47/protected-dirt/

If you *really* want this content gone, make a manual commit that removes it,
and then run the BFG on a fresh copy of your repo.
       

Cleaning
--------

Found 173 commits
Cleaning commits:       100% (173/173)
Cleaning commits completed in 115 ms.

BFG aborting: No refs to update - no dirty commits found??

cd: no such file or directory: py.git
Enumerating objects: 983, done.
Counting objects: 100% (983/983), done.
Delta compression using up to 6 threads
Compressing objects: 100% (903/903), done.
Writing objects: 100% (983/983), done.
Total 983 (delta 557), reused 403 (delta 0), pack-reused 0

@Robinlovelace
Contributor Author

Tried and failed. Any ideas guys?
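
The BFG report above hints at a likely cause: the three large blobs it found all appear in the commit protected by HEAD, so there was nothing it was allowed to change. BFG's own message suggests making a manual commit that removes the files and then running it on a fresh copy. A minimal sketch of that first step, assuming an ordinary clone with a working tree (not the mirror clone); the function name `untrack_files` and the commit message are hypothetical:

```shell
# Stop tracking the given files but leave them on disk, then commit the removal.
# Run from the repository root.
untrack_files() {
  git rm --cached --quiet -- "$@" &&
    git commit --quiet -m "Remove large data files from version control"
}

# Example with the paths listed in the BFG report:
# untrack_files data/air.2x2.250.mon.anom.comb.nc data/landsat.tif data/nz_elev.tif
```

After pushing that commit, a new mirror clone would no longer have the files in HEAD, so a history-rewriting tool could strip them from the older commits.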

@anitagraser
Collaborator

My only idea would be deleting and re-creating the repo, potentially at the cost of losing the history.

@Robinlovelace
Contributor Author

Going to try again with this:

git filter-branch -f --tree-filter 'rm -f /path/to/file' HEAD --all

If that fails then yes, now is a good time to start over.

@Robinlovelace
Contributor Author

fd --help | rg size
    -S, --size <size>...
            Limit results based on the size of files using the format <+-><NUM><UNIT>.
               '+': file size must be greater than or equal to this
               '-': file size must be less than or equal to this
               'NUM':  The numeric size (e.g. 500)

@Robinlovelace
Contributor Author

fd --size +1MB
data/air.2x2.250.mon.anom.comb.nc
data/landsat.tif
data/nz_elev.tif
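
Note that `fd` only sees files in the current working tree, so a blob that exists only in older commits (such as `state.shp` in the first BFG report) would not show up here. A rough sketch for listing large blobs across the whole history, using plain git plumbing; the function name `list_large_blobs` and the default threshold are assumptions for illustration:

```shell
# List blobs anywhere in history at or above a byte threshold, largest first.
# Usage: list_large_blobs [min_bytes]   (run inside the repository)
list_large_blobs() {
  min_bytes="${1:-1048576}"   # default threshold: 1 MiB
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk -v min="$min_bytes" '$1 == "blob" && $3 + 0 >= min + 0 {
      size = $3
      sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")   # drop type, id and size, keep the path
      print size "\t" $0
    }' |
    sort -rn
}
```

Each output line is the blob size in bytes followed by its path, including paths that were deleted in later commits.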

@Robinlovelace
Contributor Author

Now trying with git filter-repo...

https://superuser.com/questions/1563034/how-do-you-install-git-filter-repo
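
The thread does not record the exact git filter-repo command that was used, so the following is only a sketch of one plausible invocation. It assumes git-filter-repo is installed and on PATH and that the target directory is a fresh clone (filter-repo refuses to run on anything else unless forced); the wrapper name `strip_large_blobs` is hypothetical:

```shell
# Strip every blob larger than max_size from all commits in repo_dir.
# Assumption: repo_dir is a fresh clone made specifically for this rewrite.
strip_large_blobs() {
  repo_dir="$1"
  max_size="${2:-1M}"
  if ! command -v git-filter-repo >/dev/null 2>&1; then
    echo "git-filter-repo not found on PATH" >&2
    return 127
  fi
  git -C "$repo_dir" filter-repo --strip-blobs-bigger-than "$max_size"
}
```

Unlike BFG, filter-repo does not protect the latest commit, so files over the limit are removed from HEAD as well. It also removes the `origin` remote after rewriting, so the remote has to be added again before a force push.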

@Robinlovelace
Contributor Author

Big datasets now can be found here: https://github.com/geocompr/py/releases/tag/0.1

@Robinlovelace
Contributor Author

Done, finally, what. a. mission!

gh repo clone geocompr/py               
Cloning into 'py'...
remote: Enumerating objects: 1047, done.
remote: Counting objects: 100% (1047/1047), done.
remote: Compressing objects: 100% (415/415), done.
remote: Total 1047 (delta 554), reused 1045 (delta 552), pack-reused 0
Receiving objects: 100% (1047/1047), 4.23 MiB | 7.42 MiB/s, done.
Resolving deltas: 100% (554/554), done.
