TL;DR: I show how you can visualize the knowledge about source code by mining a version control system.

# Introduction
In software development, it's all about knowledge, both technical and about the business domain. But we software developers transfer only a small part of this knowledge into code, so code alone isn't sufficient to get a glimpse of the bigger picture and the interrelations of all the different concepts. There will always be developers who know more about a concept than what is laid down in the source code. It's important to make sure that this knowledge is spread across more than one head.

It's possible to estimate this knowledge distribution by analyzing the version control system. We can use active changes to the code as a proxy for "someone knows what they are doing", because otherwise they wouldn't be able to contribute code. To find spots where the knowledge about the code could be improved, we can identify areas of the code that are possibly known by only one developer. This gives you a hint about where you should start some pair programming or invest in redocumentation.

We can approximate the knowledge distribution by counting the number of additions per file that each developer contributed to a software system.

I'll show you step by step how you can do this by using Python and [Pandas](http://pandas.pydata.org/).


Attribution: This work is heavily inspired by Adam Tornhill's book ["Your Code as a Crime Scene"](https://pragprog.com/book/atcrime/your-code-as-a-crime-scene), in which he does a similar analysis called a "knowledge map".

# Import history
For this analysis, you need a log from your Git repository. In this example, we analyze a fork of the Spring PetClinic project. To avoid some noise, we add some parameters (<tt>--no-merges</tt> and <tt>--no-renames</tt>):

```bash
git log --no-merges --no-renames --numstat --pretty=format:"%x09%x09%x09%aN"
```

We read the log output into a Pandas <tt>DataFrame</tt> by using the method described in this [blog post](https://www.feststelltaste.de/reading-a-git-repos-commit-history-with-pandas-efficiently/), slightly modified (because we need less data):

In [1]:
import git
from io import StringIO
import pandas as pd

# connect to repo
git_bin = git.Repo("../../buschmais-spring-petclinic/").git

# execute log command
git_log = git_bin.execute('git log --no-merges --no-renames --numstat --pretty=format:"%x09%x09%x09%aN" -- *.java')

# read in the log
git_log = pd.read_csv(StringIO(git_log), sep="\x09", header=None, names=['additions', 'deletions', 'path','author'])

# copy each commit's author down to its file rows, then drop the author-only rows
commit_data = git_log[['additions', 'deletions', 'path']].join(git_log[['author']].ffill()).dropna()
commit_data.head()

| | additions | deletions | path | author |
|---|---|---|---|---|
| 1 | 4.0 | 5.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey |
| 2 | 25.0 | 7.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey |
| 3 | 21.0 | 9.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey |
| 4 | 23.0 | 3.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey |
| 5 | 10.0 | 6.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey |
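
To see why the forward fill is needed, here is a minimal sketch on a made-up log excerpt (the authors <tt>Alice</tt> and <tt>Bob</tt> and the file names are invented for illustration). With the format string used above, each commit contributes one line that holds only the author, followed by one line per changed file. <tt>ffill</tt> copies the author down to the file lines, and <tt>dropna</tt> removes the author-only lines.

```python
from io import StringIO
import pandas as pd

# invented excerpt in the shape of the log output: author line, then numstat lines
sample_log = (
    "\t\t\tAlice\n"
    "3\t1\tsrc/A.java\n"
    "10\t0\tsrc/B.java\n"
    "\n"
    "\t\t\tBob\n"
    "5\t2\tsrc/A.java\n"
)

raw = pd.read_csv(
    StringIO(sample_log),
    sep="\t",
    header=None,
    names=['additions', 'deletions', 'path', 'author'])

# file rows have no author yet, author rows have no file data
commits = raw[['additions', 'deletions', 'path']].join(raw[['author']].ffill()).dropna()
print(commits)
```

Every remaining row now carries the file, its line counts, and the author of the commit it belongs to.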


# Getting data that matters

In this example, we are only interested in Java source code files that still exist in the software project.

We can retrieve the existing Java source code files by using Git's <tt>ls-files</tt> combined with a filter for the Java source code file extension. The command returns a plain text string that we split at the line endings to get a list of files. Because we want to combine this information with the commit data above, we put it into a <tt>DataFrame</tt> with the column name <tt>path</tt>.

In [2]:
java_files = pd.DataFrame(git_bin.execute("git ls-files -- *.java ").split("\n"), columns=['path'])
java_files.head()

| | path |
|---|---|
| 0 | src/main/java/org/springframework/samples/petc... |
| 1 | src/main/java/org/springframework/samples/petc... |
| 2 | src/main/java/org/springframework/samples/petc... |
| 3 | src/main/java/org/springframework/samples/petc... |
| 4 | src/main/java/org/springframework/samples/petc... |


The next step is to combine the <tt>commit_data</tt> with the <tt>java_files</tt> information by using Pandas' <tt>merge</tt> function. By default, <tt>merge</tt> will
- combine the data by the columns with the same name in each <tt>DataFrame</tt>
- keep only those entries that have the same value in both (using an "inner join").

In plain English, <tt>merge</tt> will keep only the Java source code files that still exist in the <tt>DataFrame</tt>. This is exactly what we need.

In [3]:
contributions = pd.merge(commit_data, java_files)
contributions.head()

| | additions | deletions | path | author |
|---|---|---|---|---|
| 0 | 4.0 | 5.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey |
| 1 | 53.0 | 0.0 | src/test/java/org/springframework/samples/petc... | Colin But |
| 2 | 25.0 | 7.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey |
| 3 | 167.0 | 0.0 | src/test/java/org/springframework/samples/petc... | Colin But |
| 4 | 21.0 | 9.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey |
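
The filtering effect of the default inner join can be seen on a toy example. The file names and numbers below are invented; <tt>Gone.java</tt> stands for a file that appears in the history but no longer exists in the repository.

```python
import pandas as pd

# history mentions three files, but only two of them still exist
history = pd.DataFrame({
    'path': ['A.java', 'Gone.java', 'B.java'],
    'additions': [3, 7, 5],
})
existing = pd.DataFrame({'path': ['A.java', 'B.java']})

# the shared column 'path' is used as the join key; unmatched rows are dropped
kept = pd.merge(history, existing)
print(kept)
```

The row for <tt>Gone.java</tt> has no partner in <tt>existing</tt> and therefore disappears from the result.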


We can now convert some columns to their correct data types. The columns <tt>additions</tt> and <tt>deletions</tt> represent the added and deleted lines of code, respectively. We have to convert them accordingly.

In [4]:
contributions['additions'] = pd.to_numeric(contributions['additions'])
contributions['deletions'] = pd.to_numeric(contributions['deletions'])
contributions.head()

| | additions | deletions | path | author |
|---|---|---|---|---|
| 0 | 4.0 | 5.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey |
| 1 | 53.0 | 0.0 | src/test/java/org/springframework/samples/petc... | Colin But |
| 2 | 25.0 | 7.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey |
| 3 | 167.0 | 0.0 | src/test/java/org/springframework/samples/petc... | Colin But |
| 4 | 21.0 | 9.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey |
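
In this run the values already appear to be numeric, so the conversion changes nothing visible. It starts to matter when the log contains binary files, for which <tt>git log --numstat</tt> prints <tt>-</tt> instead of line counts. The column is then read as text, and a plain <tt>pd.to_numeric</tt> call would raise an error on such a value. A minimal sketch on invented values, assuming you want such entries to become <tt>NaN</tt> (the cell above does not do this):

```python
import pandas as pd

# invented column as it could look if one changed file were binary
raw_additions = pd.Series(['4', '-', '25'])

# errors='coerce' turns everything that is not a number into NaN
converted = pd.to_numeric(raw_additions, errors='coerce')
print(converted)
```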


We want to estimate the knowledge about code as the proportion of additions to the whole source code file. This means we need to calculate the relative amount of added lines for each developer. To be able to do this, we have to know the sum of all additions for a file.

Additionally, we calculate the same sum for the deletions, so that we can easily derive the number of lines of each file.

We have to normalize the <tt>additions</tt> column to be able to calculate the relative proportion that each author contributed to the source code. We use an additional <tt>DataFrame</tt> to do that (I think there is a more elegant way to do this).

In [5]:
additions_sum = contributions.groupby('path')[['additions', 'deletions']].sum().reset_index()
additions_sum.head()

| | path | additions | deletions |
|---|---|---|---|
| 0 | src/main/java/org/springframework/samples/petc... | 111.0 | 0.0 |
| 1 | src/main/java/org/springframework/samples/petc... | 70.0 | 23.0 |
| 2 | src/main/java/org/springframework/samples/petc... | 67.0 | 19.0 |
| 3 | src/main/java/org/springframework/samples/petc... | 290.0 | 137.0 |
| 4 | src/main/java/org/springframework/samples/petc... | 79.0 | 23.0 |
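
One candidate for the more elegant way mentioned above is <tt>groupby(...).transform('sum')</tt>, which returns the per-file sum aligned to the original rows, so that no second <tt>DataFrame</tt> and no merge are needed. This is only a sketch on invented data (one row per author and file for brevity), not the approach used in the rest of this post:

```python
import pandas as pd

# invented contributions: two authors touched A.java, one touched B.java
toy = pd.DataFrame({
    'path': ['A.java', 'A.java', 'B.java'],
    'author': ['Alice', 'Bob', 'Alice'],
    'additions': [30.0, 10.0, 5.0],
})

# transform keeps the original row layout and repeats the group sum on each row
toy['additions_sum'] = toy.groupby('path')['additions'].transform('sum')
toy['ownership'] = toy['additions'] / toy['additions_sum']
print(toy)
```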


We also want to have an indicator of the quantity of knowledge. This can easily be achieved by calculating the lines of code for each file, which is a simple subtraction of the deletions from the additions:

In [6]:
additions_sum['length'] = additions_sum['additions'] - additions_sum['deletions']
additions_sum.head()

| | path | additions | deletions | length |
|---|---|---|---|---|
| 0 | src/main/java/org/springframework/samples/petc... | 111.0 | 0.0 | 111.0 |
| 1 | src/main/java/org/springframework/samples/petc... | 70.0 | 23.0 | 47.0 |
| 2 | src/main/java/org/springframework/samples/petc... | 67.0 | 19.0 | 48.0 |
| 3 | src/main/java/org/springframework/samples/petc... | 290.0 | 137.0 | 153.0 |
| 4 | src/main/java/org/springframework/samples/petc... | 79.0 | 23.0 | 56.0 |


# Identify knowledge hotspots

While we are here, we also retrieve the lines of code for the existing files by simply counting the lines of each file. We do this with a simple function that reads in the whole file and counts the lines. It's not elegant, but it works pretty well.

We need that information for visualizing "knowledge islands" later on.
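
The line-counting function itself is not shown in this post. A minimal sketch of what it could look like follows; <tt>count_lines</tt> is a hypothetical name, and the demo runs on a throwaway file instead of a real repository path.

```python
import os
import tempfile

def count_lines(file_path):
    """Return the number of lines in a text file (hypothetical helper)."""
    # errors='replace' keeps the count going if a file is not valid UTF-8
    with open(file_path, encoding='utf-8', errors='replace') as source_file:
        return sum(1 for _ in source_file)

# demo on a temporary file with three lines
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, "Demo.java")
    with open(demo_file, mode='w', encoding='utf-8') as f:
        f.write("class Demo {\n    int x;\n}\n")
    line_count = count_lines(demo_file)

print(line_count)
```

Applied to the <tt>path</tt> column (relative to the repository root), such a function would yield one line count per existing file.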

We then combine it in the same way as above.

In [7]:
contributions_norm = pd.merge(
    contributions, 
    additions_sum, 
    left_on='path', 
    right_on='path', 
    suffixes=['', '_sum'])
contributions_norm.head()

| | additions | deletions | path | author | additions_sum | deletions_sum | length |
|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 5.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey | 57.0 | 5.0 | 52.0 |
| 1 | 53.0 | 0.0 | src/test/java/org/springframework/samples/petc... | Colin But | 57.0 | 5.0 | 52.0 |
| 2 | 25.0 | 7.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey | 192.0 | 7.0 | 185.0 |
| 3 | 167.0 | 0.0 | src/test/java/org/springframework/samples/petc... | Colin But | 192.0 | 7.0 | 185.0 |
| 4 | 21.0 | 9.0 | src/test/java/org/springframework/samples/petc... | Antoine Rey | 134.0 | 9.0 | 125.0 |


In [8]:
grouped_commits = contributions_norm.groupby(
    ['path']).sum(numeric_only=True)
grouped_commits.head()

| path | additions | deletions | additions_sum | deletions_sum | length |
|---|---|---|---|---|---|
| src/main/java/org/springframework/samples/petclinic/PetclinicInitializer.java | 111.0 | 0.0 | 111.0 | 0.0 | 111.0 |
| src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java | 70.0 | 23.0 | 560.0 | 184.0 | 376.0 |
| src/main/java/org/springframework/samples/petclinic/model/NamedEntity.java | 67.0 | 19.0 | 268.0 | 76.0 | 192.0 |
| src/main/java/org/springframework/samples/petclinic/model/Owner.java | 290.0 | 137.0 | 1740.0 | 822.0 | 918.0 |
| src/main/java/org/springframework/samples/petclinic/model/Person.java | 79.0 | 23.0 | 316.0 | 92.0 | 224.0 |


In [9]:
grouped_commits = contributions_norm.groupby(
    ['path', 'author']).agg(
    {'additions' : 'sum',
     'additions_sum' : 'first',
     'length' : 'first'})
grouped_commits

| path | author | additions | additions_sum | length |
|---|---|---|---|---|
| src/main/java/org/springframework/samples/petclinic/PetclinicInitializer.java | Antoine Rey | 111.0 | 111.0 | 111.0 |
| src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java | Antoine Rey | 3.0 | 70.0 | 47.0 |
| src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java | Faisal Hameed | 1.0 | 70.0 | 47.0 |
| src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java | Gordon Dickens | 14.0 | 70.0 | 47.0 |
| src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java | Michael Isvy | 51.0 | 70.0 | 47.0 |
| src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java | boly38 | 1.0 | 70.0 | 47.0 |
| src/main/java/org/springframework/samples/petclinic/model/NamedEntity.java | Antoine Rey | 3.0 | 67.0 | 48.0 |
| src/main/java/org/springframework/samples/petclinic/model/NamedEntity.java | Gordon Dickens | 15.0 | 67.0 | 48.0 |
| src/main/java/org/springframework/samples/petclinic/model/NamedEntity.java | Michael Isvy | 49.0 | 67.0 | 48.0 |
| src/main/java/org/springframework/samples/petclinic/model/Owner.java | Antoine Rey | 14.0 | 290.0 | 153.0 |


In [10]:
grouped_commits['ownership'] = grouped_commits['additions'] / grouped_commits['additions_sum']
grouped_commits.head()

| path | author | additions | additions_sum | length | ownership |
|---|---|---|---|---|---|
| src/main/java/org/springframework/samples/petclinic/PetclinicInitializer.java | Antoine Rey | 111.0 | 111.0 | 111.0 | 1.0 |
| src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java | Antoine Rey | 3.0 | 70.0 | 47.0 | 0.042857 |
| src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java | Faisal Hameed | 1.0 | 70.0 | 47.0 | 0.014286 |
| src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java | Gordon Dickens | 14.0 | 70.0 | 47.0 | 0.2 |
| src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java | Michael Isvy | 51.0 | 70.0 | 47.0 | 0.728571 |
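
Written as a formula, the ownership of an author $a$ for a file $f$ is that author's share of all lines ever added to the file (the symbols are introduced here only for illustration):

$$
\text{ownership}(a, f) = \frac{\text{additions}(a, f)}{\sum_{a'} \text{additions}(a', f)}
$$

For <tt>BaseEntity.java</tt>, the five authors in the grouped table further up added $3 + 1 + 14 + 51 + 1 = 70$ lines in total, so Michael Isvy's ownership is $51 / 70 \approx 0.7286$, which matches the value in the table.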


In [11]:
ownership_hotspots = grouped_commits.reset_index().groupby(['author']).mean(numeric_only=True).sort_values(by='ownership', ascending=False)
ownership_hotspots.head(5)

| author | additions | additions_sum | length | ownership |
|---|---|---|---|---|
| Colin But | 85.0 | 110.666667 | 100.833333 | 0.786699 |
| Michael Isvy | 83.270833 | 121.979167 | 67.1875 | 0.749534 |
| Costin Leau | 24.5 | 48.0 | 29.5 | 0.732955 |
| Gordon Dickens | 34.243243 | 136.594595 | 73.702703 | 0.216802 |
| Antoine Rey | 15.755556 | 124.622222 | 72.422222 | 0.140097 |


In [12]:
ownerships = grouped_commits.reset_index().groupby(['path']).max()
ownerships.head(5)

| path | author | additions | additions_sum | length | ownership |
|---|---|---|---|---|---|
| src/main/java/org/springframework/samples/petclinic/PetclinicInitializer.java | Antoine Rey | 111.0 | 111.0 | 111.0 | 1.0 |
| src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java | boly38 | 51.0 | 70.0 | 47.0 | 0.728571 |
| src/main/java/org/springframework/samples/petclinic/model/NamedEntity.java | Michael Isvy | 49.0 | 67.0 | 48.0 | 0.731343 |
| src/main/java/org/springframework/samples/petclinic/model/Owner.java | Michael Isvy | 164.0 | 290.0 | 153.0 | 0.565517 |
| src/main/java/org/springframework/samples/petclinic/model/Person.java | Michael Isvy | 59.0 | 79.0 | 56.0 | 0.746835 |
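
One caveat: <tt>groupby(...).max()</tt> takes the maximum of every column separately. The numeric columns are fine, but the <tt>author</tt> column then holds the alphabetically last author of a file rather than the one with the highest ownership. For <tt>BaseEntity.java</tt>, the table shows boly38 next to Michael Isvy's 51 additions. A possible alternative, sketched here on three of the rows from above (with the path shortened to the file name), is to select the whole row with the highest ownership via <tt>idxmax</tt>:

```python
import pandas as pd

# three of the BaseEntity.java rows from the grouped table above
shares = pd.DataFrame({
    'path': ['BaseEntity.java'] * 3,
    'author': ['Gordon Dickens', 'Michael Isvy', 'boly38'],
    'additions': [14.0, 51.0, 1.0],
})
shares['ownership'] = shares['additions'] / 70.0  # 70 = additions_sum of the file

# column-wise maximum: author and ownership come from different rows
column_wise = shares.groupby('path').max()

# row-wise alternative: keep the complete row that has the highest ownership
top_rows = shares.loc[shares.groupby('path')['ownership'].idxmax()].set_index('path')

print(column_wise.loc['BaseEntity.java', 'author'])
print(top_rows.loc['BaseEntity.java', 'author'])
```

The first variant reports boly38, the second Michael Isvy, which is the author who actually contributed the 51 lines.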


In [13]:
plot_data = ownerships.reset_index()
plot_data['responsible']  = plot_data['author']
plot_data.loc[plot_data['ownership'] < 0.7, 'responsible']  = "None"
plot_data.head()

| | path | author | additions | additions_sum | length | ownership | responsible |
|---|---|---|---|---|---|---|---|
| 0 | src/main/java/org/springframework/samples/petc... | Antoine Rey | 111.0 | 111.0 | 111.0 | 1.0 | Antoine Rey |
| 1 | src/main/java/org/springframework/samples/petc... | boly38 | 51.0 | 70.0 | 47.0 | 0.728571 | boly38 |
| 2 | src/main/java/org/springframework/samples/petc... | Michael Isvy | 49.0 | 67.0 | 48.0 | 0.731343 | Michael Isvy |
| 3 | src/main/java/org/springframework/samples/petc... | Michael Isvy | 164.0 | 290.0 | 153.0 | 0.565517 | None |
| 4 | src/main/java/org/springframework/samples/petc... | Michael Isvy | 59.0 | 79.0 | 56.0 | 0.746835 | Michael Isvy |


In [14]:
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors

authors = plot_data['author'].unique()

rgb_colors = [
                matplotlib.colors.rgb2hex(x) 
                for x in cm.RdYlGn_r(
                    np.linspace(0,1,len(authors)))
                ]

colors = plot_data[['author']].drop_duplicates()
colors['color'] = rgb_colors
colors

| | author | color |
|---|---|---|
| 0 | Antoine Rey | #006837 |
| 1 | boly38 | #39a758 |
| 2 | Michael Isvy | #9dd569 |
| 9 | Tomas Repel | #e3f399 |
| 42 | Tejas Metha | #fee999 |
| 44 | Rossen Stoyanchev | #fca55d |
| 45 | Costin Leau | #e34933 |
| 51 | Colin But | #a50026 |


In [15]:
colored_plot_data = pd.merge(plot_data, colors, left_on='responsible', right_on='author', how='left', suffixes=['', '_color'])
colored_plot_data.loc[colored_plot_data['responsible'] == 'None', 'color'] = "white"
colored_plot_data.head()

| | path | author | additions | additions_sum | length | ownership | responsible | author_color | color |
|---|---|---|---|---|---|---|---|---|---|
| 0 | src/main/java/org/springframework/samples/petc... | Antoine Rey | 111.0 | 111.0 | 111.0 | 1.0 | Antoine Rey | Antoine Rey | #006837 |
| 1 | src/main/java/org/springframework/samples/petc... | boly38 | 51.0 | 70.0 | 47.0 | 0.728571 | boly38 | boly38 | #39a758 |
| 2 | src/main/java/org/springframework/samples/petc... | Michael Isvy | 49.0 | 67.0 | 48.0 | 0.731343 | Michael Isvy | Michael Isvy | #9dd569 |
| 3 | src/main/java/org/springframework/samples/petc... | Michael Isvy | 164.0 | 290.0 | 153.0 | 0.565517 | None | NaN | white |
| 4 | src/main/java/org/springframework/samples/petc... | Michael Isvy | 59.0 | 79.0 | 56.0 | 0.746835 | Michael Isvy | Michael Isvy | #9dd569 |


# Visualizing
We export the <tt>DataFrame</tt> into D3's flare format, a JSON file with nested <tt>name</tt>/<tt>children</tt> entries:

In [16]:
import os
import json

def create_flare_json(data, json_file):
    
    json_data = {}
    json_data['name'] = 'flare'
    json_data['children'] = []
    
    for _, series in data.iterrows():
        path, filename = os.path.split(series['path'])

        last_children = None
        children = json_data['children']

        for path_part in path.split("/"):
            entry = None

            for child in children:
                if "name" in child and child["name"] == path_part:
                    entry = child
            if not entry:
                entry = {}
                children.append(entry)

            entry['name'] = path_part
            if 'children' not in entry:
                entry['children'] = []

            children = entry['children']
            last_children = children

        last_children.append({
            'name' : filename + " [" + series['responsible'] + ", " + "{:6.2f}".format(series['ownership']) + "]",
            'weight' : series['ownership'],
            'size' :  series['length'],
            'author_color' : series['color']})

    # use a separate name for the file handle so the path argument is not shadowed
    with open(json_file, mode='w', encoding='utf-8') as output_file:
        json.dump(json_data, output_file, indent=3)
        
create_flare_json(colored_plot_data, "vis/flare.json")
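
To make the target structure more tangible, here is a small self-contained sketch that builds the same kind of nested <tt>name</tt>/<tt>children</tt> tree for two files. <tt>add_file</tt> is a hypothetical helper and not part of the code above, and the paths are shortened for illustration. The sizes and weights are taken from the <tt>Owner.java</tt> and <tt>Person.java</tt> rows shown earlier.

```python
import json

def add_file(root, path, leaf):
    """Insert `leaf` below the directory nodes named in `path` (hypothetical helper)."""
    *directories, filename = path.split("/")
    children = root['children']
    for directory in directories:
        # reuse the directory node if it already exists, otherwise create it
        node = next(
            (c for c in children if c.get('name') == directory and 'children' in c),
            None)
        if node is None:
            node = {'name': directory, 'children': []}
            children.append(node)
        children = node['children']
    # the file itself becomes a leaf without children
    children.append({'name': filename, **leaf})

flare = {'name': 'flare', 'children': []}
add_file(flare, "src/main/Owner.java", {'size': 153.0, 'weight': 0.565517})
add_file(flare, "src/main/Person.java", {'size': 56.0, 'weight': 0.746835})

print(json.dumps(flare, indent=3))
```

Both files end up as leaves below the shared <tt>src</tt> and <tt>main</tt> nodes, which is the hierarchy the visualization reads.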

In [17]:
from IPython.display import HTML

url = "vis/knowledge_islands.html"
# quote every attribute value so the generated HTML is well-formed
iframe = '<iframe src="' + url + '" scrolling="no" width="800" height="800" style="border:none;"></iframe>'
HTML(iframe)