broothie committed Jul 19, 2022
1 parent fbcfecf commit df2128c
Showing 0 changed files with 0 additions and 0 deletions.

1 comment on commit df2128c

@broothie commented on df2128c Jul 20, 2022


This commit is kind of a quine!

Look at that commit message. Mouseover it. Click it. It's the same as the short SHA of the commit itself! Crazy.

How it went down

The initial inspiration came from the quine tweet.

How it was built

You can probably guess the strategy here: brute-force guessing the short SHA. This project also depends on how GitHub works, since it leverages GitHub's commit SHA auto-linking. I ran a quick test to figure out the minimum number of characters required to trigger short-SHA auto-linking: it appears to be 7.

Armed with this knowledge, I wrote a Ruby script, trashed it, re-wrote it in Go, and iterated on that. As much as I wanted to play around with Ruby's new-ish async stuff, I'm much more comfortable with Go's concurrency patterns. Plus, my guess is the little bump in speed would come in handy in the long run.

The final version of the program does the following:

  • spawn n workers
  • each worker gets a unique path on the filesystem
  • each worker initializes a repo at their path
  • then, in a loop:
    • generate a 7-character hexadecimal string
    • make an empty commit with the hex string as the commit message
    • git rev-parse --short=7
      • if it's a match:
        • celebrate! 🎉
        • also, be sure to print it (and write it to a file to be safe)
        • then bail
      • if not:
        • git reset --hard HEAD~
    • occasionally, remove the repo and re-git init

How it was run

I didn't want to run the script on my own machine: I didn't want the hassle of keeping it plugged in, figuring out how to keep the script running, or confirming that it was actually running all the time. Plus, I figured I could get a fancier rig from the cloud ☁️.

Up on GCP, I went through 4 instances while trying to figure out the right setup. IIRC the first few were either too slow in disk ops, or had too little CPU. Running trials with 50 to 200 workers would either blow out the CPU and have all the goroutines fighting over time, or cause the single-iteration time to be too high e.g. on the order of 10s of seconds.

I eventually settled on GCP's lowest-tier compute-optimized instance, c2-standard-4, a choice based heavily on the fact that it seems you can only attach local disks to compute- or memory-optimized (or GPU) instances. By this point I was operating under the assumption that I was wasting precious time talking to the network-attached disks these instances have by default, so a local SSD seemed necessary.

On this box, I was able to run 300 workers, hitting almost exactly 80% CPU usage, and iterating roughly every 750ms. With 7 hex digits, we have a 1 in 16^7 = 268,435,456 chance of guessing the short SHA. So:

> 16**7 * 0.750 / 60 / 60 / 24 / 300
7.76722962962963

i.e. I should be expecting a result within ~7¾ days.

How did it actually shake out?

$ stat nohup.out
...
Modify: 2022-07-19 19:29:40.561706164 +0000
Change: 2022-07-19 19:29:40.561706164 +0000
 Birth: 2022-07-16 09:03:44.409662694 +0000

3.434675925925926 days! Pretty good!

Things I learned and open questions

git reset --hard doesn't completely get rid of your changes

This became clear after two things:

  • I started seeing messages about git automatically running git gc
  • I checked the reflog and, well, there're at least some remnants of the reset commits in there

I briefly had a version of the program which would run git gc after every reset, but that didn't seem to help upon checking the reflog, so I ended up adding an occasional nuke and re-init of the repo.

But then, how can you truly get rid of your changes without rm-rf-ing? (I'm sure there's a way, I just haven't gotten around to Googling it).
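For what it's worth, the recipe I believe works here is expiring the reflog entries and then pruning, though I'd call this a sketch rather than a guarantee:

```shell
# In a scratch repo: make a guess commit, reset it away, then scrub it.
cd "$(mktemp -d)"
git init -q
git -c user.email=a@example.com -c user.name=a commit -q --allow-empty -m base
git -c user.email=a@example.com -c user.name=a commit -q --allow-empty -m guess
guess_sha=$(git rev-parse HEAD)
git reset -q --hard HEAD~

# reset alone leaves the commit reachable via the reflog...
git cat-file -e "$guess_sha" && echo "still in the object store"

# ...so expire the reflog entries, then prune the now-unreachable objects.
git reflog expire --expire-unreachable=now --all
git gc --prune=now --quiet
git cat-file -e "$guess_sha" 2>/dev/null || echo "gone"
```

A plain `git gc` keeps recently-created and reflog-referenced objects around, which would explain why running it after every reset didn't seem to help.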

Did I really need 300 workers?

I made a lot of assumptions during this project, and this is kind of one of them. I did do a bit of tuning, starting up the program and watching the average iteration time of each worker. It seemed like iteration time scaled more with the number of workers than with (what I imagine is) disk usage.

That would be the other variable here, right? With more workers come more disk operations, and I would guess the bottleneck would then come from workers waiting on disk IO.

I ended up limiting it to 300 workers, in part because anything more than about 330 workers and I'd start seeing "too many open files" errors. I guessed there was a way to increase this limit, but I didn't spend the time on it; plus, my CPU usage was already at a comfortable 80% with 300 workers.
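As an aside, that error is the per-process file-descriptor limit, which (per shell, at least on Linux) can be inspected and raised; the numbers in the comments are illustrative:

```shell
# "too many open files" means the process hit its file-descriptor limit.
ulimit -n     # current soft limit (what processes started from this shell get)
ulimit -Hn    # hard ceiling; the soft limit can be raised up to it, e.g. `ulimit -n 4096`
```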

How much are empty commits interacting with the filesystem?

I know git works by performing lots of arcane operations on the .git directory. Empty commits seemed like the right way to limit disk operations, but I wonder how much time I was really saving, versus committing a single markdown file containing the guessed SHA or something.

What is the true meaning of df2128c?

D? F? 21? 28?? C??!?!1 What is the significance of these numbers and letters? We may never know ¯\_(ツ)_/¯
