# GitLeaks Codec Performance

Some folks are having performance issues with the GitLeaks recursive decode feature[^1], and GitLab disabled the feature by default because of it[^2]. The issues didn't call out the specific repos where the problems showed up, so I'll pick a few that I expect to have decently large histories and a fair amount of encoded content, see if I can recreate the issues, and maybe make some performance improvements to the code.

I'll run this notebook in a Fedora 43 toolbx container and have the notebook set up the environment to keep things consistent.

[^1]: Upstream & GitLab issues:
- https://github.com/gitleaks/gitleaks/issues/2019
- https://gitlab.com/gitlab-org/gitlab/-/issues/587467#note_3037857436

[^2]: GitLab MR disabling the flag:
- https://gitlab.com/gitlab-org/security-products/analyzers/secrets/-/merge_requests/437


In [21]:
import os

repos_dir = "testdata/repos"
test_repos = [
    "github.com/openssl/openssl.git",
    "github.com/microsoft/typescript.git",
    "github.com/leaktk/fake-leaks.git",
]

commands = [
    "hyperfine",
    "git",
    "go",
    "make",
]

gitleaks_bin = "testdata/gitleaks/gitleaks"
gitleaks_repo_dir = os.path.dirname(gitleaks_bin)

In [26]:
import os
import shutil

print("# installing deps")
for cmd in commands:
    if not shutil.which(cmd):
        print("installing " + cmd)
        # I'm running this in a toolbx container locally so sudo shouldn't prompt
        os.system("sudo dnf install -y " + cmd)
    else:
        print(cmd + " already installed")

print("\n# fetching testdata")
for repo_id in test_repos:
    repo_local_path = os.path.join(repos_dir, repo_id)

    if not os.path.exists(repo_local_path):
        assert os.system(f"git clone --mirror https://{repo_id} {repo_local_path} 2>&1") == 0
    else:
        # note the leading space so the repo id and message don't run together
        print(repo_id + " already cloned")

if not os.path.isfile(gitleaks_bin):
    print("building gitleaks")
    if os.path.isdir(gitleaks_repo_dir):
        shutil.rmtree(gitleaks_repo_dir)

    assert os.system("git clone --depth 1 --branch v8.30.0 https://github.com/gitleaks/gitleaks.git " + gitleaks_repo_dir) == 0
    assert os.system(f"cd {gitleaks_repo_dir} && go build -o gitleaks") == 0
else:
    print("gitleaks v8.30.0 already exists")
    


# installing deps
hyperfine already installed
git already installed
go already installed
make already installed

# fetching testdata
github.com/openssl/openssl.gitalready cloned
github.com/microsoft/typescript.gitalready cloned
github.com/leaktk/fake-leaks.gitalready cloned
gitleaks v8.30.0 already exists


## Initial Benchmarking

I'm going to run hyperfine against each repo with decoding disabled (`--max-decode-depth=0`) and enabled (`--max-decode-depth=5`) to see whether performance drops enough to make them good test candidates.

In [40]:
import shlex

cmd = shlex.join([
    "hyperfine",
    # "--show-output", # for debugging
    "--warmup", "3",
    "--parameter-list", "repo", ",".join(os.path.join(repos_dir, repo_id) for repo_id in test_repos),
    "--parameter-list", "max_decode_depth", "0,5",
    gitleaks_bin + " --exit-code=0 --no-banner --no-color --max-decode-depth={max_decode_depth} git {repo}",
])

print("# running tests before tweaks")
print("+", cmd)
assert os.system(cmd) == 0

# running tests before tweaks
+ hyperfine --warmup 3 --parameter-list repo testdata/repos/github.com/openssl/openssl.git,testdata/repos/github.com/microsoft/typescript.git,testdata/repos/github.com/leaktk/fake-leaks.git --parameter-list max_decode_depth 0,5 'testdata/gitleaks/gitleaks --exit-code=0 --no-banner --no-color --max-decode-depth={max_decode_depth} git {repo}'
Benchmark 1: testdata/gitleaks/gitleaks --exit-code=0 --no-banner --no-color --max-decode-depth=0 git testdata/repos/github.com/openssl/openssl.git
  Time (mean ± σ):     388.438 s ±  1.868 s    [User: 4130.044 s, System: 11.800 s]
  Range (min … max):   386.638 s … 391.560 s    10 runs
 
Benchmark 2: testdata/gitleaks/gitleaks --exit-code=0 --no-banner --no-color --max-decode-depth=0 git testdata/repos/github.com/microsoft/typescript.git
  Time (mean ± σ):     1359.206 s ±  0.928 s    [User: 11392.456 s, System: 48.334 s]
  Range (min … max):   1358.202 s … 1361.247 s    10 runs
 
Benchmark 3: testdata/gitleaks/gitleak

### Results

TODO
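
Once a full run finishes, one way to fill this section in is to rerun hyperfine with its `--export-json` option and compute the slowdown per repo. The helper below is only a sketch: `decode_slowdowns` is a name I made up, and it assumes the exported JSON has a `results` list whose entries carry a `mean` and a `parameters` mapping keyed by the `repo` and `max_decode_depth` names used in the command above.

```python
from collections import defaultdict


def decode_slowdowns(hyperfine_json):
    """Return {repo: {depth: mean(depth) / mean(depth=0)}} for every depth > 0.

    Assumes the layout of hyperfine's --export-json output when
    --parameter-list is used (a "parameters" dict on each result).
    """
    means = defaultdict(dict)
    for result in hyperfine_json["results"]:
        params = result["parameters"]
        # parameter values may be exported as strings, so normalize to int
        depth = int(params["max_decode_depth"])
        means[params["repo"]][depth] = result["mean"]

    slowdowns = {}
    for repo, by_depth in means.items():
        baseline = by_depth.get(0)
        if not baseline:
            continue  # no depth=0 run to compare against
        slowdowns[repo] = {
            depth: mean / baseline
            for depth, mean in by_depth.items()
            if depth != 0
        }
    return slowdowns
```

To use it, load the exported file with `json.load` and pass the result in. A repo whose ratio stays close to 1.0 probably isn't a useful candidate for tuning the decoder.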

## Profiling Details

Okay, now that I have the repos I want to tune against, I need some pprof data to see where the slow bits are.
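
As a rough sketch of the inspection step, the helper below builds a `go tool pprof` command that prints the hottest functions from a CPU profile. It assumes a profile has already been captured; the `cpu.pprof` path is hypothetical, `pprof_top_cmd` is a name I made up, and how the profile gets written is not covered here.

```python
import shlex


def pprof_top_cmd(binary, profile, node_count=20):
    """Build a `go tool pprof` command listing the top CPU consumers."""
    return [
        "go", "tool", "pprof",
        "-top",                       # text report sorted by flat time
        f"-nodecount={node_count}",   # limit the number of rows shown
        binary,
        profile,
    ]


# hypothetical profile path; the binary path matches gitleaks_bin above
cmd = pprof_top_cmd("testdata/gitleaks/gitleaks", "cpu.pprof")
print("+", shlex.join(cmd))
```

The printed command can be run with `os.system(shlex.join(cmd))` in the same style as the earlier cells.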