Delete configlet binaries from history? #300

Closed
ee7 opened this issue Mar 10, 2019 · 8 comments

Comments

@ee7
Member

ee7 commented Mar 10, 2019

Issue

Try running git clone https://github.com/exercism/ocaml.git

Expected behavior

It finishes instantly.

Actual behavior

It takes a long time.

Diagnosis

The repo download is about 24 MiB, which is roughly 100x larger than it should be.

Here is a useful shell script that finds the largest blobs (file versions) in the git history.

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 -r \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest \
| head -60

The output:

4f22d18ccb2c  3.4MiB bin/configlet-linux-amd64
18a995b36c29  3.4MiB bin/configlet-linux-amd64
e1dbe65dbb5b  3.4MiB bin/configlet-linux-amd64
45ec32d5ea4d  3.4MiB bin/configlet-linux-amd64
c398d23cd8fc  3.4MiB bin/configlet-linux-amd64
a4caae4f1ea3  3.3MiB bin/configlet-darwin-amd64
73f74a035fcc  3.3MiB bin/configlet-darwin-amd64
bafa72ee3e45  3.3MiB bin/configlet-darwin-amd64
1142a8c72410  3.3MiB bin/configlet-darwin-amd64
eb00d5d49d58  3.3MiB bin/configlet-darwin-amd64
24734c89d647  3.2MiB bin/configlet-windows-amd64.exe
6f3696e86b4d  3.2MiB bin/configlet-windows-amd64.exe
85e31801d978  3.2MiB bin/configlet-windows-amd64.exe
39fc7872f061  3.2MiB bin/configlet-windows-amd64.exe
04798ab3a125  3.2MiB bin/configlet-windows-amd64.exe
46fb7a1307a9  2.9MiB bin/configlet-windows-amd64.exe
ae5af0eb03b9  2.9MiB bin/configlet-linux-amd64
acbc3d12d1b5  2.9MiB bin/configlet-linux-amd64
81de536d325d  2.8MiB bin/configlet-darwin-amd64
3fc496954741  2.8MiB bin/configlet-darwin-amd64
c4f3977347d3  2.8MiB bin/configlet-linux-386
4d89e9747dfd  2.8MiB bin/configlet-linux-386
4f48e868e549  2.8MiB bin/configlet-linux-amd64
aef8742a53d0  2.8MiB bin/configlet-darwin-386
90865bc0e308  2.8MiB bin/configlet-windows-amd64.exe
d1123cb2d559  2.8MiB bin/configlet-windows-amd64.exe
6a9c5d110456  2.8MiB bin/configlet-linux-386
2ff44037950d  2.8MiB bin/configlet-linux-386
182ff4839524  2.8MiB bin/configlet-darwin-386
abde89f2fee5  2.8MiB bin/configlet-darwin-amd64
ce17f9698690  2.8MiB bin/configlet-linux-386
514ec09424c7  2.7MiB bin/configlet-darwin-386
08d77af3075c  2.7MiB bin/configlet-darwin-386
0cb316c56a21  2.7MiB bin/configlet-darwin-386
8900361e1956  2.7MiB bin/configlet-windows-386.exe
37ca9c724611  2.7MiB bin/configlet-windows-386.exe
64f20caf40c4  2.6MiB bin/configlet-windows-386.exe
4e535d6aa1c1  2.6MiB bin/configlet-windows-386.exe
eed4a12153db  2.6MiB bin/configlet-windows-386.exe
cbe164082082  2.4MiB bin/configlet-linux-386
b9d4d1d67fcd  2.4MiB bin/configlet-linux-386
79ab39e067a3  2.3MiB bin/configlet-windows-386.exe
db0b25da2985  2.3MiB bin/configlet-darwin-386
37b57e730031  2.3MiB bin/configlet-darwin-386
9f2726f349d4  2.3MiB bin/configlet-windows-386.exe
793b5cb5bce5  2.3MiB bin/configlet-windows-386.exe
ebc005e716b4  2.2MiB bin/configlet-linux-386
d552a15ecbe7  2.2MiB bin/configlet-darwin-386
ded745ed1998   21KiB tools/test-generator/test/beer-song.json
dcb56118d11f   15KiB tools/test-generator/test/beer-song.json

It looks like #11 was not successful. The configlet binaries were deleted in e615131, but they remain in the repo history.

These commits are still in master:
8dd7d33
e662023
296040f
2f99459
e71f60c
ffff920
16d5a30
89e9f96
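
One way to list such commits is to ask git which commits reachable from a ref touch a given path. This is only a sketch: `commits_touching` is a hypothetical helper name, and the pathspec `bin/configlet-*` is assumed from the file names in the output above.

```shell
# Hypothetical helper: list commits reachable from a ref that touch a pathspec.
# $1: ref (for example master), $2: pathspec (for example 'bin/configlet-*')
commits_touching() {
  # Print the abbreviated hash and subject of every matching commit.
  git log --format='%h %s' "$1" -- "$2"
}
```

Run inside a clone, `commits_touching master 'bin/configlet-*'` prints one line per commit that added, changed, or deleted a configlet binary.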

Solution

As mentioned in #11, you would need to rewrite the history. I don't know whether it's worth fixing now.
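
For reference, a rewrite of this kind could look roughly like the sketch below. It uses `git filter-branch` with an index filter. The function name `purge_configlet_binaries` is hypothetical and the path pattern is assumed from the output above. It is meant for a throwaway clone, since it changes the hash of every commit from the first one that contained a binary.

```shell
# Hypothetical helper: drop bin/configlet-* from every commit on every ref.
# Run it only in a disposable clone; all later commit hashes change.
purge_configlet_binaries() {
  # --index-filter rewrites each commit's index without checking out files.
  # --ignore-unmatch keeps git rm from failing on commits without binaries.
  # Empty commits are kept, so old and new history stay one-to-one.
  FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
    --index-filter "git rm --cached --ignore-unmatch --quiet -- 'bin/configlet-*'" \
    -- --all
}
```

The old objects stay in the local clone (under `refs/original` and the reflog) until they are expired and garbage-collected, so the size win only shows up in a fresh clone of the rewritten history.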

I didn't see any related discussion, so I'm just checking whether it's a known issue.

Other tracks

As far as I can tell, no other Exercism tracks still have the configlet binaries in the master history.

If you're curious, here are the other tracks with large files:

Go

3cf0bd07391e  2.9MiB error-handling/error-handling.test
97b101ea9701  131KiB img/icon.png
a1652b1e8e95  121KiB img/icon.png
3a0983ca681a   49KiB img/mars3.png
...

JavaScript

d3ca11d12cd0  1.2MiB node_modules/traceur/bin/traceur.js
f8abbf08d10f 1014KiB node_modules/jasmine-node/node_modules/requirejs/bin/r.js
ebbabd161870  895KiB node_modules/traceur/bin/traceur.js
2369f99a9edc  851KiB node_modules/jasmine-node/node_modules/jasmine-reporters/ext/js.jar
8da1c0738170  681KiB node_modules/jasmine-node/node_modules/jasmine-reporters/ext/env.rhino.1.2.js
7c70f81ef9bd  433KiB node_modules/traceur/node_modules/rsvp/dist/test/browserify.js
...

PHP

53f53f560fc2  2.6MiB assignments/php/phpunit.phar
...

Swift

cf571ed6df43  571KiB docs/img/tests.png
2ed3a543f492  497KiB docs/img/tests-fail.png
75d274d2464c  443KiB docs/img/tests-pass.png
9c5539d88d18  310KiB docs/img/tests-fail.png
b71ab2edd773  283KiB docs/img/tests-pass.png
1918f9944298  238KiB docs/img/tests.png
...
@sshine
Contributor

sshine commented Jun 18, 2019

Thanks for this analysis and for making the action easier.

I agree with your sentiment and think we should do this.

I'm investigating the implications.

@NobbZ
Member

NobbZ commented Jun 18, 2019

The problem is that the website actually keeps a note of the SHA to determine which version of an exercise has been solved by the student.

Rewriting history will change all SHAs since the first removed/altered commit.

I'm pretty sure it will affect the website if done blindly; @iHiD or @kytrinyx should give the go-ahead first.

@iHiD
Member

iHiD commented Jun 18, 2019

Yeah - we can't rewrite history. Basically. It'll break everything.

@sshine
Contributor

sshine commented Jun 19, 2019

To elaborate on at least one thing that I know will break: The exercism CLI uses git commit hashes to download and display the proper version of the test suite. So when a test suite is updated on a language track, the git commit hash on the master branch for that point in time is used as a key.

I suppose the lesson here is that we shouldn't use externally provided IDs that have another primary purpose as keys in our data structures. :-P

@iHiD: If we were to eventually want to prune our git histories, are there other places that depend on git commit hashes? The test suite version is one issue that can be addressed more easily for a small track like OCaml than for many other tracks.

@iHiD
Member

iHiD commented Jun 19, 2019

It's not really using it as a key. It literally checks out the git repo at that sha and then displays its contents to the user. So if the sha isn't there, the checkout will fail and nothing will be displayed to the user.
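
To illustrate the failure mode, here is a minimal sketch of how one might test whether a stored sha still resolves to a commit in a clone. `sha_exists` is a hypothetical helper, not something taken from the website's code.

```shell
# Hypothetical helper: succeed only if the given sha names a commit object
# that is present in the current repository.
sha_exists() {
  # cat-file -e exits non-zero when the object is missing or invalid;
  # ^{commit} additionally requires that it peels to a commit.
  git cat-file -e "$1^{commit}" 2>/dev/null
}
```

For example, `sha_exists "$stored_sha" || echo "missing: $stored_sha"` would flag any recorded sha that a rewritten history no longer contains.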

@sshine sshine closed this as completed Jun 19, 2019
@iHiD
Member

iHiD commented Jun 19, 2019

@sshine So the thing we could do here would be to map between the old shas and the new shas during the rebase (maybe matching on commit strings if they're unique?) and then update the database to change all old shas to new shas. And then push the rebased version over the top?

It's a lot of hassle, but it's a workable solution I think, if we ever actually need to do this.
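
A rough sketch of such a mapping, assuming the rewrite keeps each commit's author timestamp and subject line and that this pair is unique per commit. `pair_shas` is a hypothetical helper. Its two input files hold lines of `<key><TAB><sha>`, which could be produced with `git log --format='%at %s%x09%H' <ref>` on the old and the rewritten branch.

```shell
# Hypothetical helper: pair old and new shas that share the same key.
# $1: file listing the old history, $2: file listing the rewritten history.
# Each line in both files has the form "<key><TAB><sha>".
pair_shas() {
  tab=$(printf '\t')
  old_sorted=$(mktemp)
  new_sorted=$(mktemp)
  # join needs both inputs sorted on the join field in the same collation.
  LC_ALL=C sort -t "$tab" -k 1,1 "$1" > "$old_sorted"
  LC_ALL=C sort -t "$tab" -k 1,1 "$2" > "$new_sorted"
  # Emit "<old sha><TAB><new sha>" for every key present in both files.
  LC_ALL=C join -t "$tab" -o 1.2,2.2 "$old_sorted" "$new_sorted"
  rm -f "$old_sorted" "$new_sorted"
}
```

If a key occurs more than once, `join` emits every combination, so duplicated keys would need to be detected and resolved by hand before the output is used to update a database.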

@sshine
Contributor

sshine commented Jun 21, 2019

I went and sorted all track repos by size

for d in all-tracks/*; do [ -d "$d" ] && du -sh "$d"; done | sort -hr

and inspected the 12 largest repos using @ee7's snippet, yielding

  • cpp (29M), c (15M) are huge, but have no huge commits, just a lot of activity / some large test files
  • ocaml (26M), coffeescript (21M) are huge because of configlet commits
  • kotlin (12M), java (12M) are large, with some ~56K gradle-wrapper.jar commits
  • javascript (12M), ecmascript (11M) are large, mainly due to NPM noise (package-lock.json commits at ~200-400K and other things)
  • swift (12M) is large, with project.pbxproj commits at ~100K each
  • csharp (9.8M) is large, with Exercises.sln commits at ~50-100K each
  • haskell (8.7M) is large, but has no huge commits, just a lot of activity

I don't know how to estimate the win of purging stale history from tracks other than ocaml and coffeescript. And even though I or @ee7 could create a branch with rewritten history and a map from old hashes to new hashes, only @iHiD could really determine where in the website's and CLI's codebases this change would take effect.

On that account, I think @iHiD would have to call the shots, and I fully understand if this is a low priority given his limited resources.

@sshine
Contributor

sshine commented Jun 21, 2019

I suppose the fix that avoids depending on git commit hashes is to store a timestamp instead. Then, whenever one needs to find e.g. the test suite appropriate for a given point in time (this is the only use of git commit hashes that I know of), one converts the timestamp to a commit hash via the GitHub API or similar.

Given such a change, rewriting history would be less of an issue.
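
As an illustration of the "or similar" part, a local clone can do this lookup without the GitHub API. The sketch below assumes the stored timestamp should be compared against committer dates. `commit_at` is a hypothetical helper name.

```shell
# Hypothetical helper: print the newest commit on a ref whose committer date
# is at or before the given timestamp (prints nothing if there is none).
# $1: ref (for example master), $2: a date git understands,
#     for example "2019-06-21 12:00:00 +0000"
commit_at() {
  # --first-parent follows the main line of the branch only.
  git rev-list -n 1 --first-parent --before="$2" "$1"
}
```

For instance, `commit_at master "2019-06-21 12:00:00 +0000"` prints the hash of the commit that was the tip of the main line at that moment, as far as committer dates can tell.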

But this, I suppose, does not change the priority for now.
