# Estimating the Knowledge Distribution within a Modular System

## Exercise

_Level: Hard_

### Background

In software systems, several developers work on different parts of the code base. Each developer changes the code within specific parts to a different degree.

### Your Task

In this exercise, you should find out _which author knows what percentage of each of the system's modules?_

### The Dataset

The dataset in `../datasets/git_log_numstat_dropover.csv.gz` contains the following information:

* `additions`: The number of added lines per file per commit
* `deletions`: The number of deleted lines per file per commit
* `file`: The name of the file that was changed
* `sha`: The unique key of the commit
* `timestamp`: The time of the commit
* `author`: The name of the author who did the change

Here you can see the first nine entries of the dataset:

<small>
<code>
additions,deletions,file,sha,timestamp,author
191.0,0.0,backend/pom-2016-07-16_04-40-56-752.xml,8c686954,2016-07-22 17:43:38,Michael
1.0,1.0,backend/src/test/java/at/dropover/scheduling/interactor/SetFinalDateTest.java,97c6ef96,2016-07-16 09:51:15,Markus
55.0,0.0,backend/src/test/java/at/dropover/scheduling/interactor/SetFinalDateTest.java,432113a2,2016-07-15 21:17:07,Chris
19.0,3.0,backend/src/main/webapp/app/widgets/gallery/js/galleryController.js,3f7cf92c,2016-07-16 09:07:31,Markus
24.0,11.0,backend/src/main/webapp/app/widgets/gallery/js/galleryController.js,bf2b00ba,2014-10-26 05:52:48,Michael
294.0,0.0,backend/src/main/webapp/app/widgets/gallery/js/galleryController.js,62f4013b,2014-10-11 22:24:46,Michael
1.0,1.0,backend/src/main/webapp/app/widgets/gallery/views/galleryView.html,3f7cf92c,2016-07-16 09:07:31,Markus
5.0,5.0,backend/src/main/webapp/app/widgets/gallery/views/galleryView.html,bf2b00ba,2014-10-26 05:52:48,Michael
75.0,0.0,backend/src/main/webapp/app/widgets/gallery/views/galleryView.html,62f4013b,2014-10-11 22:24:46,Michael
</code>
</small>

### Further Information

The system under investigation has several peculiarities:

* It was developed by only three people: Chris, Markus and Michael. Each of them worked, to varying degrees, on one or more modules.
* The system was structured along business modules with functionality like comment, creator, scheduling and so on. The module name is the 7th element of the file path (for example, `scheduling` in the file paths shown above).
* The relevant source code for the backend was written in Java. These files use `.java` as their file extension.
* The interesting files of the system are the ones that begin with `backend/src/main/java/`.
* The backend also contains files named `package-info.java`, which are irrelevant for this analysis.

## Idea

Working assumption: The number of commits from an author within a certain module corresponds to the existing knowledge about that module.

## Data Loading
Load the Git log information that was retrieved from the Git repository.
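
A minimal sketch of the loading step with pandas. To keep it self-contained, it reads three rows copied from the excerpt above out of an in-memory buffer; the variable name `git_log` and the parsing options are assumptions, not part of the exercise.

```python
import io

import pandas as pd

# Three rows copied from the excerpt above. For the full analysis one would
# instead pass the path "../datasets/git_log_numstat_dropover.csv.gz";
# pandas infers the gzip compression from the ".gz" extension.
sample_csv = io.StringIO(
    "additions,deletions,file,sha,timestamp,author\n"
    "191.0,0.0,backend/pom-2016-07-16_04-40-56-752.xml,8c686954,2016-07-22 17:43:38,Michael\n"
    "1.0,1.0,backend/src/test/java/at/dropover/scheduling/interactor/SetFinalDateTest.java,97c6ef96,2016-07-16 09:51:15,Markus\n"
    "55.0,0.0,backend/src/test/java/at/dropover/scheduling/interactor/SetFinalDateTest.java,432113a2,2016-07-15 21:17:07,Chris\n"
)

git_log = pd.read_csv(
    sample_csv,
    dtype={"sha": str},          # keep commit keys as text, never as numbers
    parse_dates=["timestamp"],   # turn the commit time into real datetimes
)
print(git_log.dtypes)
```

Reading `sha` explicitly as a string avoids surprises with keys that happen to look like numbers.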

## Data Cleaning
Keep only the source code files that are of interest.
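
The filter rules from the "Further Information" section could be expressed roughly as follows. The tiny `git_log` frame is made up for illustration; the `SetFinalDate.java` and `package-info.java` paths in it are hypothetical, and `java_prod` is an assumed name for the cleaned result.

```python
import pandas as pd

# Hypothetical miniature log; only the `file` column matters for the filter.
git_log = pd.DataFrame({
    "file": [
        "backend/src/main/java/at/dropover/scheduling/interactor/SetFinalDate.java",
        "backend/src/main/java/at/dropover/scheduling/package-info.java",
        "backend/src/test/java/at/dropover/scheduling/interactor/SetFinalDateTest.java",
        "backend/src/main/webapp/app/widgets/gallery/js/galleryController.js",
    ],
    "author": ["Chris", "Chris", "Markus", "Michael"],
})

# One boolean mask per rule from the exercise description.
is_backend_source = git_log["file"].str.startswith("backend/src/main/java/")
is_java = git_log["file"].str.endswith(".java")
is_package_info = git_log["file"].str.endswith("package-info.java")

# Keep production Java files only; copy so later column additions are safe.
java_prod = git_log[is_backend_source & is_java & ~is_package_info].copy()
print(java_prod)
```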

## Analysis
First, find out _how much overall "knowledge" each author has_.
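
One possible reading of the working assumption is to count each commit once per author and express it as a share of all commits. The sample data and the names `commits_per_author` and `knowledge_overall` are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical cleaned log: one row per changed file per commit.
java_prod = pd.DataFrame({
    "sha": ["a1", "a1", "b2", "c3", "d4"],
    "author": ["Chris", "Chris", "Markus", "Michael", "Michael"],
})

# Count each commit once per author, even if it touched several files.
commits_per_author = java_prod.groupby("author")["sha"].nunique()

# Share of all commits, used here as a stand-in for overall "knowledge".
knowledge_overall = commits_per_author / commits_per_author.sum()
print(knowledge_overall)
```

Counting distinct `sha` values rather than rows keeps commits that touch many files from inflating an author's share; counting rows instead would be a different, equally debatable proxy.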

Extract the information about the business modules.
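
Since the module name sits at the 7th place of the file path, it can be picked out by splitting on `/` and taking index 6 (zero-based). The two file paths below are hypothetical examples that follow the layout described above.

```python
import pandas as pd

# Hypothetical production files following the path layout of the dataset.
java_prod = pd.DataFrame({"file": [
    "backend/src/main/java/at/dropover/scheduling/interactor/SetFinalDate.java",
    "backend/src/main/java/at/dropover/comment/boundary/AddComment.java",
]})

# 7th path element -> zero-based index 6 after splitting on "/".
java_prod["module"] = java_prod["file"].str.split("/").str[6]
print(java_prod["module"].tolist())
```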

## Interpretation
List all the existing knowledge ratios per module and author.
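
A sketch of how such a table might be built: count distinct commits per module and author, then divide by the total per module. The input rows are invented, and the `commits` column is an assumed intermediate; only the name `knowledge_per_module` and its `ratio` column are taken from the notebook.

```python
import pandas as pd

# Hypothetical cleaned log with the module already extracted.
java_prod = pd.DataFrame({
    "module": ["comment", "comment", "comment", "scheduling", "scheduling"],
    "author": ["Chris", "Chris", "Markus", "Michael", "Michael"],
    "sha":    ["a1", "b2", "c3", "d4", "d4"],
})

# Distinct commits per (module, author) pair.
knowledge_per_module = (
    java_prod.groupby(["module", "author"])["sha"]
    .nunique()
    .to_frame("commits")
)

# Total commits per module, aligned row by row with the table above.
module_totals = (
    knowledge_per_module.groupby(level="module")["commits"].transform("sum")
)
knowledge_per_module["ratio"] = knowledge_per_module["commits"] / module_totals
print(knowledge_per_module)
```

Within each module the ratios add up to 1, so they can be read directly as each author's share of that module's commits.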

## Visualization
Plot the result for each module in a bar chart.
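
One way to draw this, assuming matplotlib is installed, is a stacked bar chart with one bar per module, split by author. The ratios below are invented sample values, and the stacked layout is a design choice for this sketch rather than something the exercise prescribes.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook

import pandas as pd

# Hypothetical ratios in the same shape as `knowledge_per_module`.
index = pd.MultiIndex.from_tuples(
    [("comment", "Chris"), ("comment", "Markus"), ("scheduling", "Michael")],
    names=["module", "author"],
)
knowledge_per_module = pd.DataFrame({"ratio": [0.75, 0.25, 1.0]}, index=index)

# Modules become rows, authors become columns; missing pairs count as 0.
ratios = knowledge_per_module.unstack()["ratio"].fillna(0)

# One stacked bar per module, one colored segment per author.
ax = ratios.plot.bar(stacked=True)
ax.set_ylabel("share of commits")
```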

The module names form the index of the unstacked result:

    knowledge_per_module.unstack()['ratio'].index

    Index(['comment', 'creator', 'files', 'framework', 'mail', 'scheduling',
           'site', 'todo'],
          dtype='object', name='module')