# Introduction
In software development, it's all about knowledge: both technical knowledge and knowledge of the business domain. We software developers transfer only a small part of this knowledge into code, so code alone isn't sufficient to get a glimpse of the bigger picture and the interrelations of all the different concepts. There will always be developers who know more about a concept than what is laid down in the source code. It's important to make sure that this knowledge is distributed over more than one developer. For source code, this means that more than one developer changes the code. Multiple contributors to one source file even out the different perspectives on the abstract concepts and knowledge represented as code.

To identify areas in the code that are possibly known by only one developer (and where you should do some pair programming or invest in redocumentation), you can visualize the knowledge about source code by mining a version control system. We can approximate the knowledge distribution by counting the number of added lines per file that each developer contributed to a software system.

I'll show you step by step how you can do this by using Python and [Pandas](http://pandas.pydata.org/).


Attribution: This work is heavily inspired by Adam Tornhill's book ["Your Code as a Crime Scene"](https://pragprog.com/book/atcrime/your-code-as-a-crime-scene), in which he does a similar analysis called "Knowledge map".

# Import history
For this analysis, you need a log from your Git repository. To avoid some noise, we add some parameters (<tt>--no-merges</tt> and <tt>--no-renames</tt>):

```bash
git log --no-merges --no-renames --numstat --pretty=format:"%x09%x09%x09%aN"
```

We read the log output into a Pandas <tt>DataFrame</tt> by using the method described in this [blog post](https://www.feststelltaste.de/reading-a-git-repos-commit-history-with-pandas-efficiently/), slightly modified (because we need less data):

In [20]:
import pandas as pd
import git
from io import StringIO

# connect to repo
git_bin = git.Repo("../../buschmais-spring-petclinic/").git

# execute log command (arguments are passed separately, so no shell quoting is needed)
git_log = git_bin.log('--no-merges', '--no-renames', '--numstat', '--pretty=format:%x09%x09%x09%aN')

# read in the log
git_log = pd.read_csv(
    StringIO(git_log), 
    sep="\t", 
    header=None,
    usecols=[0,2,3],
    names=['additions', 'path','author']
)

# copy each commit's author down to its file rows, then drop incomplete rows
commit_data = git_log[['additions', 'path']]\
    .join(git_log[['author']].ffill())\
    .dropna()
    
commit_data.head()

Unnamed: 0,additions,path,author
1,1,docs/README.md,Markus Harrer
3,76,docs/README.md,Markus Harrer
4,290,docs/assets/css/style.scss,Markus Harrer
5,-,docs/documentation/images/class-diagram.png,Markus Harrer
6,1224,docs/documentation/index.html,Markus Harrer


# Getting data that matters

This tells us which author added how many lines of code to which file. Because we are only interested in Java source code files that still exist in the software project, we filter out all others. We can retrieve a list of the still existing Java source code files by using Git's <tt>ls-files</tt> command combined with a filter. Because we want to combine this information with the data above, we put it into a <tt>DataFrame</tt> as well.

In [21]:
java_files = pd.DataFrame(git_bin.ls_files('--', '*.java').split("\n"), columns=['path'])
java_files.head()

Unnamed: 0,path
0,src/main/java/org/springframework/samples/petc...
1,src/main/java/org/springframework/samples/petc...
2,src/main/java/org/springframework/samples/petc...
3,src/main/java/org/springframework/samples/petc...
4,src/main/java/org/springframework/samples/petc...


While we are at it, we also retrieve the lines of code of the existing files by simply counting the lines of each file. We do this with a simple function that reads in the whole file and counts the lines. It's not elegant, but it works pretty well.

We need that information for visualizing "knowledge islands" later on.

In [22]:
def count_lines(file_path):
    
    abs_path = git_bin.working_dir + "/" + file_path
    with open(abs_path, 'r', encoding='utf-8') as file:
        return len(file.readlines())

java_files['length'] = java_files['path'].apply(count_lines)
java_files.head()

Unnamed: 0,path,length
0,src/main/java/org/springframework/samples/petc...,111
1,src/main/java/org/springframework/samples/petc...,47
2,src/main/java/org/springframework/samples/petc...,48
3,src/main/java/org/springframework/samples/petc...,153
4,src/main/java/org/springframework/samples/petc...,56
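If reading whole files into memory ever becomes a concern, a streaming variant is one possible alternative. This is only a sketch: the function name `count_lines_streaming` and the `errors="replace"` setting are my own assumptions, not part of the analysis above.

```python
import os


def count_lines_streaming(root_dir, relative_path):
    """Count lines by iterating over the file instead of loading it completely."""
    abs_path = os.path.join(root_dir, relative_path)
    # errors="replace" is an assumption: it keeps a file with unexpected
    # bytes from aborting the whole analysis.
    with open(abs_path, "r", encoding="utf-8", errors="replace") as source_file:
        return sum(1 for _ in source_file)
```

It could be applied in the same way as `count_lines`, for example with `java_files['path'].apply(lambda p: count_lines_streaming(git_bin.working_dir, p))`.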


# Refine the data
The next step is to combine the <tt>commit_data</tt> with the <tt>java_files</tt> information by using Pandas' <tt>merge</tt> function. It does the magic all on its own. By default, <tt>merge</tt> combines the data on the columns that have the same name in both <tt>DataFrame</tt>s and keeps only those entries whose values match on both sides (an inner join). In plain English, <tt>merge</tt> keeps only the still existing Java source code files in the <tt>DataFrame</tt>.

In [23]:
contributions = pd.merge(commit_data, java_files)
contributions.head()

Unnamed: 0,additions,path,author,length
0,4,src/test/java/org/springframework/samples/petc...,Antoine Rey,52
1,53,src/test/java/org/springframework/samples/petc...,Colin But,52
2,25,src/test/java/org/springframework/samples/petc...,Antoine Rey,185
3,167,src/test/java/org/springframework/samples/petc...,Colin But,185
4,21,src/test/java/org/springframework/samples/petc...,Antoine Rey,125
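The default behavior of <tt>merge</tt> can be spelled out explicitly. The following sketch uses two tiny, made-up frames (the paths and authors are hypothetical) to show that the implicit call equals an inner join on the shared `path` column, and that rows for files that no longer exist drop out.

```python
import pandas as pd

# Hypothetical commit data: one file still exists, one was deleted.
commits = pd.DataFrame({
    "additions": [4, 53, 7],
    "path": ["src/Kept.java", "src/Kept.java", "src/Deleted.java"],
    "author": ["Alice", "Bob", "Alice"],
})

# Hypothetical list of files that are still in the repository.
existing = pd.DataFrame({"path": ["src/Kept.java"], "length": [52]})

# Implicit form, as used in this analysis.
merged = pd.merge(commits, existing)

# Explicit form: join on the common column, keep only matching rows.
explicit = pd.merge(commits, existing, on="path", how="inner")

print(merged)
```

Writing out `on` and `how` costs a few characters, but it protects against surprises if both frames ever share a second column name.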


The column <tt>additions</tt> represents the added lines of code. We have to convert it to a numeric data type accordingly.

In [24]:
contributions['additions'] = pd.to_numeric(contributions['additions'])
contributions.head()

Unnamed: 0,additions,path,author,length
0,4,src/test/java/org/springframework/samples/petc...,Antoine Rey,52
1,53,src/test/java/org/springframework/samples/petc...,Colin But,52
2,25,src/test/java/org/springframework/samples/petc...,Antoine Rey,185
3,167,src/test/java/org/springframework/samples/petc...,Colin But,185
4,21,src/test/java/org/springframework/samples/petc...,Antoine Rey,125
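The plain conversion works here because only Java files are left. For binary files, <tt>--numstat</tt> prints <tt>-</tt> instead of a number (see the <tt>.png</tt> entry in the first output). If such rows were still present, one option would be to let Pandas turn them into missing values. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical additions column that still contains a binary-file marker.
additions = pd.Series(["4", "53", "-"])

# errors="coerce" converts anything that is not a number into NaN
# instead of raising an exception.
numeric = pd.to_numeric(additions, errors="coerce")
print(numeric)
```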


# Identify knowledge hotspots

We have to normalize the <tt>additions</tt> column to be able to calculate the relative proportion that each author contributed to each source code file. We use an additional <tt>DataFrame</tt> with the total additions per file to do that (I think there is a more elegant way to do this).

In [25]:
additions_sum = contributions.groupby('path')[['additions']].sum().reset_index()
additions_sum.head()

Unnamed: 0,path,additions
0,src/main/java/org/springframework/samples/petc...,111
1,src/main/java/org/springframework/samples/petc...,70
2,src/main/java/org/springframework/samples/petc...,67
3,src/main/java/org/springframework/samples/petc...,290
4,src/main/java/org/springframework/samples/petc...,79


We then merge it with the contributions in the same way as above.

In [26]:
contributions_norm = pd.merge(
    contributions, 
    additions_sum, 
    left_on='path', 
    right_on='path', 
    suffixes=['', '_sum'])
contributions_norm.head()

Unnamed: 0,additions,path,author,length,additions_sum
0,4,src/test/java/org/springframework/samples/petc...,Antoine Rey,52,57
1,53,src/test/java/org/springframework/samples/petc...,Colin But,52,57
2,25,src/test/java/org/springframework/samples/petc...,Antoine Rey,185,192
3,167,src/test/java/org/springframework/samples/petc...,Colin But,185,192
4,21,src/test/java/org/springframework/samples/petc...,Antoine Rey,125,134
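One candidate for the "more elegant way" is <tt>groupby(...).transform('sum')</tt>, which writes the per-file total directly next to each row and makes the helper frame and the second merge unnecessary. This is a sketch on made-up data, not the code that produced the outputs in this post; the column name `share` is my own.

```python
import pandas as pd

# Hypothetical contributions: two authors on A.java, one on B.java.
contrib = pd.DataFrame({
    "path": ["A.java", "A.java", "B.java"],
    "author": ["Alice", "Bob", "Alice"],
    "additions": [30, 70, 10],
})

# transform('sum') returns one value per original row:
# the total additions of the file that row belongs to.
contrib["additions_sum"] = contrib.groupby("path")["additions"].transform("sum")

# Relative contribution of each row to its file.
contrib["share"] = contrib["additions"] / contrib["additions_sum"]
print(contrib)
```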


In [27]:
grouped_commits = contributions_norm.groupby(
    ['path', 'author']).agg(
    {'additions' : 'sum',
     'additions_sum' : 'first',
     'length' : 'first'})
grouped_commits.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,additions,additions_sum,length
path,author,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
src/main/java/org/springframework/samples/petclinic/PetclinicInitializer.java,Antoine Rey,111,111,111
src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java,Antoine Rey,3,70,47
src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java,Faisal Hameed,1,70,47
src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java,Gordon Dickens,14,70,47
src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java,Michael Isvy,51,70,47


In [28]:
grouped_commits['ownership'] = grouped_commits['additions'] / grouped_commits['additions_sum']
grouped_commits.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,additions,additions_sum,length,ownership
path,author,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
src/main/java/org/springframework/samples/petclinic/PetclinicInitializer.java,Antoine Rey,111,111,111,1.0
src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java,Antoine Rey,3,70,47,0.042857
src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java,Faisal Hameed,1,70,47,0.014286
src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java,Gordon Dickens,14,70,47,0.2
src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java,Michael Isvy,51,70,47,0.728571


In [29]:
ownership_hotspots = grouped_commits.reset_index().groupby(['author']).mean(numeric_only=True).sort_values(by='ownership', ascending=False)
ownership_hotspots.head(5)

Unnamed: 0_level_0,additions,additions_sum,length,ownership
author,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Colin But,85.0,110.666667,100.833333,0.786699
Michael Isvy,83.270833,121.979167,67.1875,0.749534
Costin Leau,24.5,48.0,29.5,0.732955
Gordon Dickens,34.243243,136.594595,73.702703,0.216802
Antoine Rey,15.755556,124.622222,72.422222,0.140097


In [30]:
ownerships = grouped_commits.reset_index().groupby(['path']).max()
ownerships.head(5)

Unnamed: 0_level_0,author,additions,additions_sum,length,ownership
path,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
src/main/java/org/springframework/samples/petclinic/PetclinicInitializer.java,Antoine Rey,111,111,111,1.0
src/main/java/org/springframework/samples/petclinic/model/BaseEntity.java,boly38,51,70,47,0.728571
src/main/java/org/springframework/samples/petclinic/model/NamedEntity.java,Michael Isvy,49,67,48,0.731343
src/main/java/org/springframework/samples/petclinic/model/Owner.java,Michael Isvy,164,290,153,0.565517
src/main/java/org/springframework/samples/petclinic/model/Person.java,Michael Isvy,59,79,56,0.746835
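A word of caution about this step: <tt>groupby(...).max()</tt> takes the maximum of every column independently. The <tt>author</tt> column therefore holds the alphabetically last author of a file, not necessarily the one who owns the maximum <tt>ownership</tt> value. That is why <tt>BaseEntity.java</tt> appears above with <tt>boly38</tt> and 51 additions, although the earlier table attributes those 51 additions to Michael Isvy. A possible fix, sketched here on made-up data, is to select the complete row with the highest ownership via <tt>idxmax</tt>.

```python
import pandas as pd

# Hypothetical per-author ownership for a single file.
per_author = pd.DataFrame({
    "path": ["A.java", "A.java"],
    "author": ["Alice", "Zoe"],
    "ownership": [0.8, 0.2],
})

# Column-wise maximum: pairs "Zoe" (alphabetically last) with 0.8.
columnwise = per_author.groupby("path").max()

# Row-wise selection: idxmax returns the row label of the highest
# ownership per file, so author and ownership stay together.
top_rows = per_author.loc[per_author.groupby("path")["ownership"].idxmax()]
top_owner = top_rows.set_index("path")

print(columnwise)
print(top_owner)
```

Applied to the real data, this would change the rows for files like <tt>BaseEntity.java</tt>, so the outputs below would differ from the ones shown in this post.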


In [31]:
plot_data = ownerships.reset_index()
plot_data['responsible']  = plot_data['author']
plot_data.loc[plot_data['ownership'] < 0.7, 'responsible']  = "None"
plot_data.head()

Unnamed: 0,path,author,additions,additions_sum,length,ownership,responsible
0,src/main/java/org/springframework/samples/petc...,Antoine Rey,111,111,111,1.0,Antoine Rey
1,src/main/java/org/springframework/samples/petc...,boly38,51,70,47,0.728571,boly38
2,src/main/java/org/springframework/samples/petc...,Michael Isvy,49,67,48,0.731343,Michael Isvy
3,src/main/java/org/springframework/samples/petc...,Michael Isvy,164,290,153,0.565517,
4,src/main/java/org/springframework/samples/petc...,Michael Isvy,59,79,56,0.746835,Michael Isvy
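The two assignments above implement a simple rule: an author counts as responsible for a file only if their ownership reaches the threshold of 0.7. The same rule can be written in one step with <tt>Series.where</tt>, which keeps a value where the condition holds and substitutes a replacement elsewhere. A sketch with made-up rows:

```python
import pandas as pd

# Hypothetical files: one above and one below the 0.7 threshold.
sample = pd.DataFrame({
    "author": ["Alice", "Bob"],
    "ownership": [0.75, 0.55],
})

threshold = 0.7

# Keep the author where ownership is high enough, otherwise use "None".
sample["responsible"] = sample["author"].where(
    sample["ownership"] >= threshold, "None"
)
print(sample)
```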


In [32]:
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors

authors = plot_data['author'].unique()

rgb_colors = [
                matplotlib.colors.rgb2hex(x) 
                for x in cm.RdYlGn_r(
                    np.linspace(0,1,len(authors)))
                ]

colors = plot_data[['author']].drop_duplicates()
colors['color'] = rgb_colors
colors

Unnamed: 0,author,color
0,Antoine Rey,#006837
1,boly38,#39a758
2,Michael Isvy,#9dd569
9,Tomas Repel,#e3f399
42,Tejas Metha,#fee999
44,Rossen Stoyanchev,#fca55d
45,Costin Leau,#e34933
51,Colin But,#a50026


In [33]:
colored_plot_data = pd.merge(plot_data, colors, left_on='responsible', right_on='author', how='left', suffixes=['', '_color'])
colored_plot_data.loc[colored_plot_data['responsible'] == 'None', 'color'] = "white"
colored_plot_data.head()

Unnamed: 0,path,author,additions,additions_sum,length,ownership,responsible,author_color,color
0,src/main/java/org/springframework/samples/petc...,Antoine Rey,111,111,111,1.0,Antoine Rey,Antoine Rey,#006837
1,src/main/java/org/springframework/samples/petc...,boly38,51,70,47,0.728571,boly38,boly38,#39a758
2,src/main/java/org/springframework/samples/petc...,Michael Isvy,49,67,48,0.731343,Michael Isvy,Michael Isvy,#9dd569
3,src/main/java/org/springframework/samples/petc...,Michael Isvy,164,290,153,0.565517,,,white
4,src/main/java/org/springframework/samples/petc...,Michael Isvy,59,79,56,0.746835,Michael Isvy,Michael Isvy,#9dd569
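As an alternative to the merge, the color lookup could also be expressed with a plain dictionary and <tt>Series.map</tt>. The sketch below reuses two of the hex codes from the color table above; the small `sample` frame itself is made up.

```python
import pandas as pd

# Author-to-color lookup (hex codes taken from the color table above).
color_by_author = {
    "Antoine Rey": "#006837",
    "Michael Isvy": "#9dd569",
}

# Hypothetical "responsible" column, including a file without an owner.
sample = pd.DataFrame({"responsible": ["Antoine Rey", "None", "Michael Isvy"]})

# map() yields NaN for names missing from the dictionary;
# fillna() then assigns white to those files.
sample["color"] = sample["responsible"].map(color_by_author).fillna("white")
print(sample)
```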


# Visualizing
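We export the <tt>DataFrame</tt> into d3's hierarchical flare JSON format, where each directory becomes a node with <tt>children</tt> and each file becomes a leaf.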

In [34]:
import os
import json

def create_flare_json(data, json_file):
    
    # root node of the hierarchy
    json_data = {}
    json_data['name'] = 'flare'
    json_data['children'] = []
    
    for _, series in data.iterrows():
        path, filename = os.path.split(series['path'])

        last_children = None
        children = json_data['children']

        # walk down the directory hierarchy and create missing nodes on the way
        for path_part in path.split("/"):
            entry = None

            for child in children:
                if "name" in child and child["name"] == path_part:
                    entry = child
            if not entry:
                entry = {}
                children.append(entry)

            entry['name'] = path_part
            if 'children' not in entry:
                entry['children'] = []

            children = entry['children']
            last_children = children

        # add the file as a leaf below its directory
        last_children.append({
            'name' : filename + " [" + series['responsible'] + ", " + "{:6.2f}".format(series['ownership']) + "]",
            'weight' : series['ownership'],
            'size' :  series['length'],
            'author_color' : series['color']})

    # write the hierarchy to disk (the target directory must already exist)
    with open(json_file, mode='w', encoding='utf-8') as output_file:
        output_file.write(json.dumps(json_data, indent=3))
        
create_flare_json(colored_plot_data, "vis/flare.json")
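To give an idea of what the exported file looks like, here is a stripped-down sketch of the same nesting idea. The helper `insert_file` and the shortened paths are made up for illustration; the real export above additionally stores the weight and color of each file.

```python
import json


def insert_file(root, path, leaf):
    """Insert `leaf` below the directory nodes named by `path`."""
    *directories, _filename = path.split("/")
    children = root["children"]
    for directory in directories:
        # Reuse an existing directory node or create a new one.
        node = next(
            (c for c in children
             if c.get("name") == directory and "children" in c),
            None,
        )
        if node is None:
            node = {"name": directory, "children": []}
            children.append(node)
        children = node["children"]
    children.append(leaf)


flare = {"name": "flare", "children": []}
insert_file(flare, "src/model/Owner.java", {"name": "Owner.java", "size": 153})
insert_file(flare, "src/model/Person.java", {"name": "Person.java", "size": 56})

print(json.dumps(flare, indent=2))
```

Both files end up as leaves under the shared `src` and `model` nodes, which is the shape the visualization reads.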

In [35]:
from IPython.display import HTML

url = "vis/knowledge_island.html"
iframe = '<iframe src="' + url + '" scrolling="no" width="800" height="800" style="border:none;"></iframe>'
HTML(iframe)