Analyze how a Git repo grows over time
erikbern Merge pull request #63 from MFreidank/master
Missing branch: fall back to default branch
Latest commit ed24f0a Jul 18, 2018



Some scripts to analyze Git repos. Produces cool looking graphs like this (running it on git itself):

[Image: plot generated by running the scripts on the Git repository]

Installing

Run pip install git-of-theseus

Running

First, you need to run git-of-theseus-analyze <path to repo> (see git-of-theseus-analyze --help for a bunch of config). This will analyze a repository and might take quite some time.

After that, you can generate plots! Here are some ways you can do that:

  1. Run git-of-theseus-stack-plot cohorts.json which will write to stack_plot.png
  2. Run git-of-theseus-survival-plot survival.json which will write to survival_plot.png (run it with --help for some options)

If you want to plot multiple repositories, you have to run git-of-theseus-analyze separately for each project and store the data in separate directories using the --outdir flag. Then you can run git-of-theseus-survival-plot <foo/survival.json> <bar/survival.json> (optionally with the --exp-fit flag to fit an exponential decay).
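The multi-repository workflow above can be scripted. The following is only a sketch: the helper names `build_commands` and `run_all` are made up for illustration, the repository paths and output directory names are placeholders, and it assumes the command-line tools are on your PATH and accept the flags exactly as described above.

```python
import subprocess


def build_commands(repos, exp_fit=False):
    """Build one analyze command per repo plus a single survival-plot command.

    `repos` maps an output directory name to a repository path. Both are
    placeholders chosen by the caller, not anything the tool prescribes.
    """
    analyze = [
        ["git-of-theseus-analyze", path, "--outdir", outdir]
        for outdir, path in repos.items()
    ]
    # Each analysis is assumed to leave a survival.json in its own directory.
    plot = ["git-of-theseus-survival-plot"]
    plot += [f"{outdir}/survival.json" for outdir in repos]
    if exp_fit:
        plot.append("--exp-fit")
    return analyze, plot


def run_all(repos, exp_fit=False):
    """Run every analysis, then the combined plot; stop on the first failure."""
    analyze, plot = build_commands(repos, exp_fit)
    for cmd in analyze + [plot]:
        subprocess.run(cmd, check=True)
```

Calling `run_all({"foo": "/path/to/foo", "bar": "/path/to/bar"}, exp_fit=True)` would analyze both repositories into `foo/` and `bar/` and then plot the two survival files together.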

Help

If you see AttributeError: Unknown property labels, upgrade matplotlib: pip install matplotlib --upgrade

Some pics

Survival of a line of code in a set of interesting repos:

[Image: survival curves for a set of repositories]

This curve is produced by the git-of-theseus-survival-plot script and shows the percentage of lines in a commit that are still present after x years. It aggregates over all commits, regardless of when they were made. So for x=0 it includes all commits, whereas for x>0 not all commits are counted (because we would have to look into the future for some of them). The survival curves are estimated using Kaplan-Meier.
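To make the Kaplan-Meier idea concrete, here is a minimal, generic sketch of the product-limit estimator. It is not taken from the git-of-theseus source; the function name and the input layout (one duration and one "was deleted" flag per line of code) are assumptions made for illustration. Lines that are still present when the history ends are treated as censored, which is how the "we would have to look into the future" problem is handled.

```python
def kaplan_meier(durations, observed):
    """Estimate a survival curve with the Kaplan-Meier product-limit formula.

    durations[i]: how long line i was tracked (for example, in years).
    observed[i]:  True if the line was deleted at that time, False if it was
                  still present when the history ended (censored).
    Returns a list of (time, survival_probability) steps.
    """
    # The curve only drops at times where at least one deletion was observed.
    event_times = sorted({t for t, d in zip(durations, observed) if d})
    survival = 1.0
    curve = [(0.0, 1.0)]
    for t in event_times:
        # Lines still being tracked just before time t.
        at_risk = sum(1 for u in durations if u >= t)
        # Lines deleted exactly at time t.
        deaths = sum(1 for u, d in zip(durations, observed) if d and u == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

Censored lines count toward the at-risk total up to the point where they leave the data, but never trigger a drop in the curve themselves.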

You can also add an exponential fit:

[Image: survival curves with an exponential fit]
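As a rough illustration of what fitting an exponential decay to a survival curve means, the sketch below fits survival ≈ exp(-k * t) by least squares on the logarithm of the survival values. This is an assumed, simplified approach with hypothetical function names; the fitting method used by the --exp-fit flag may well differ.

```python
import math


def fit_exponential_decay(times, survival):
    """Fit survival ~ exp(-k * t) and return the decay rate k.

    Uses a least-squares line through the origin on log(survival), so points
    with t <= 0 or survival <= 0 are skipped.
    """
    points = [(t, math.log(s)) for t, s in zip(times, survival) if t > 0 and s > 0]
    if not points:
        raise ValueError("need at least one point with t > 0 and survival > 0")
    # Minimizing sum((y + k*t)^2) over k gives k = -sum(t*y) / sum(t*t).
    return -sum(t * y for t, y in points) / sum(t * t for t, _ in points)


def half_life(k):
    """Time after which half of the lines are expected to be gone."""
    return math.log(2) / k
```

The half-life is a convenient single number for comparing repositories: a shorter half-life means code is rewritten or deleted more quickly.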

Linux – stack plot:

[Image: stack plot for Linux]

This curve is produced by the git-of-theseus-stack-plot script and shows the total number of lines in a repo broken down into cohorts by the year the code was added.
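The cohort breakdown can be pictured with a tiny sketch. It assumes you already know, for each line present in a snapshot, the year in which it was added (for example from blame output); the function name and input format are hypothetical and say nothing about how cohorts.json is actually laid out.

```python
from collections import Counter


def cohort_sizes(line_years):
    """Count the lines in a snapshot per cohort (the year each line was added)."""
    return dict(sorted(Counter(line_years).items()))
```

Computing this for a series of snapshots and stacking the per-year counts gives the kind of layered picture a stack plot shows.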

Node – stack plot:

[Image: stack plot for Node]

Rails – stack plot:

[Image: stack plot for Rails]

Plotting other stuff

git-of-theseus-analyze will write exts.json, cohorts.json and authors.json. You can run git-of-theseus-stack-plot authors.json to plot author statistics as well, or git-of-theseus-stack-plot exts.json to plot file extension statistics. For author statistics, you might want to create a .mailmap file to deduplicate authors. For instance, here's the author statistics for Kubernetes:

[Image: author statistics for Kubernetes]

You can also normalize it to 100%. Here are the author statistics for Git:

[Image: author statistics for Git, normalized to 100%]
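Normalizing to 100% simply means dividing each series by the total at every time step. The sketch below shows the arithmetic on a hypothetical data layout (a label mapped to a list of counts, one per time step); it is not the tool's own code or file format.

```python
def normalize_to_percent(series):
    """Rescale stacked series so that every time step sums to 100.

    `series` maps a label (such as an author) to a list of line counts,
    one per time step. All lists are assumed to have the same length.
    """
    labels = list(series)
    steps = len(series[labels[0]]) if labels else 0
    totals = [sum(series[label][i] for label in labels) for i in range(steps)]
    return {
        label: [
            # Guard against empty time steps to avoid dividing by zero.
            100.0 * series[label][i] / totals[i] if totals[i] else 0.0
            for i in range(steps)
        ]
        for label in labels
    }
```

After normalization the plot shows each author's share of the code base over time rather than absolute line counts.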

Other stuff

Markovtsev Vadim implemented a very similar analysis that claims to be 20% to 6x faster than Git of Theseus. It's named Hercules, and there's a great blog post about all the complexity that goes into analyzing Git history.
