# Fundamentals of Software Systems (FSS)
**Software Evolution – Part 02 Assignment**

## Submission Guidelines

To correctly complete this assignment you must:

* Carry out the assignment in a team of 2 to 4 students.
* Carry out the assignment with your team only. You are allowed to discuss solutions with other teams, but each team should come up its own personal solution. A strict plagiarism policy is going to be applied to all the artifacts submitted for evaluation.
* As your submission, upload the filled Jupyter Notebook (including outputs) together with the d3 visualization web pages (i.e. upload everything you downloaded including the filled Jupyter Notebook plus your `output.json`)
* The files must be uploaded to OLAT as a single ZIP (`.zip`) file by 2024-12-02 18:00.


## Group Members
Carlos Kirchdorfer, 19-720-002, carlos.kirchdorfer@uzh.ch

Hyeongseok Kim, 23-741-903, hyeongseok.kim@uzh.ch, 

Flavian Roland Thür, 16-562-274, flavianroland.thuer@uzh.ch


## Task Context

In this assigment we will be analyzing the _[Nautilus trader](https://nautilustrader.io/)_ project. The git repository is available here: https://github.com/nautechsystems/nautilus_trader 

All following tasks should be done with the subset of commits from tag `v1.165.0` to tag `v1.206.0`.

## Task 1: Author contributions

In the following, please consider only Rust, Python, and Cython (pyx and pxd) files.

The first task is to get an overview of the author ownership of the Nautilus Trader project. In particular, we want to understand who are the main authors in the system between the two considered tags and what is the amount of their contributions both in absolute terms and in percentages. We also want to investigate if the same patterns apply on the various subsystems.

To this end, you should:
* extract all the contributions between the two tags. If an author committed a file 3 times, then the number of contributions of that author on that file is 3





In [1]:
import pydriller
import tqdm
from collections import defaultdict
from datetime import datetime
import pytz

In [2]:
repo = pydriller.Repository("./../nautilus_trader", 
    from_tag="v1.165.0", 
    to_tag="v1.206.0", order="date-order")

In [3]:
contributions = defaultdict(lambda: defaultdict(int))  # {author: {file: count}}

allowed_extensions = {".rs", ".py", ".pyx"}

progress = tqdm.tqdm(unit="commit", desc="Processing commits")

# Traverse commits in the repository
for commit in repo.traverse_commits():
    author = commit.author.name
    for f in commit.modified_files:
        file_path = str(f.new_path)
        # Check if file extension is allowed
        if file_path and any(file_path.endswith(ext) for ext in allowed_extensions):
            # Count contributions (commits) for each author and file
            contributions[author][file_path] += 1
    progress.update(1)

progress.close()

author_file_contributions = defaultdict(lambda: defaultdict(int))

for author, files in contributions.items():
    for file, count in files.items():
        author_file_contributions[author][file] = count

Processing commits: 3859commit [08:34,  7.51commit/s]


In [4]:
print("Author to File Contributions:")
for author, files in contributions.items():
    print(f"Author: {author}")
    for file, count in files.items():
        print(f"  File: {file}, Contributions: {count}")

Author to File Contributions:
Author: Chris Sellers
  File: nautilus_trader\examples\strategies\ema_cross.py, Contributions: 16
  File: nautilus_trader\examples\strategies\ema_cross_bracket.py, Contributions: 10
  File: nautilus_trader\examples\strategies\ema_cross_bracket_algo.py, Contributions: 10
  File: nautilus_trader\examples\strategies\ema_cross_cython.pyx, Contributions: 13
  File: nautilus_trader\examples\strategies\ema_cross_stop_entry.py, Contributions: 12
  File: nautilus_trader\examples\strategies\ema_cross_trailing_stop.py, Contributions: 12
  File: nautilus_trader\examples\strategies\ema_cross_twap.py, Contributions: 10
  File: nautilus_trader\examples\strategies\volatility_market_maker.py, Contributions: 29
  File: nautilus_core\persistence\src\arrow\delta.rs, Contributions: 18
  File: nautilus_core\core\src\cvec.rs, Contributions: 2
  File: nautilus_core\model\src\enums.rs, Contributions: 54
  File: nautilus_trader\adapters\betfair\sockets.py, Contributions: 9
  File: 

In [5]:
def get_author_contribution_sum(auth_file_contribution):
    auth_contribution_sum = defaultdict()

    for author, files in auth_file_contribution.items():
        auth_contribution_sum[author] = sum(files.values())  # Sum up all contributions for this author

    return auth_contribution_sum
        


author_aggregated_sum = get_author_contribution_sum(author_file_contributions)
for auth, count in sorted(author_aggregated_sum.items(), key=lambda x: x[1], reverse=True):
    print(auth, count)

Chris Sellers 13565
Filip Macek 1298
Ishan Bhanuka 510
Pushkar Mishra 387
Brad 276
David Blom 221
rsmb7z 169
faysou 150
Benjamin Singleton 84
limx0 82
Miller Moore 63
DevRoss 21
graceyangfan 16
Ayush 13
Javaid 13
Ayush Singh Bhandari 10
sunlei 10
Nisayo 7
r3k4mn14r 5
Dia Kharrat 4
ghill2 4
Myles 4
Anurag Roy 3
Evgenii Prusov 3
Dimitar Petrov 2
Ben Singleton 2
freddiehill 2
Sunlei 2
imemo88 2
fred monroe 1
Troubladore 1
dodofarm 1


* sort the authors by total number of contributions, define a threshold and from now on consider only the authors above the threshold

In [6]:
min_contribution_threshold = 50

author_contribution_over_threshold = {
    author: count for author, count in author_aggregated_sum.items()
    if count > min_contribution_threshold
}

print("\nAuthor Contributions Above Threshold:")
for author, count in sorted(author_contribution_over_threshold.items(), key=lambda x: x[1], reverse=True):
    print(f"{author}: {count}")


filtered_author_file_contributions = {
    author: files for author, files in author_file_contributions.items()
    if author in author_contribution_over_threshold
}

# Print the filtered contributions
print("\nFiltered Author-File Contributions (Only Authors Above Threshold):")
for author, files in filtered_author_file_contributions.items():
    print(f"Author: {author}")
    for file, count in files.items():
        print(f"  File: {file}, Contributions: {count}")



Author Contributions Above Threshold:
Chris Sellers: 13565
Filip Macek: 1298
Ishan Bhanuka: 510
Pushkar Mishra: 387
Brad: 276
David Blom: 221
rsmb7z: 169
faysou: 150
Benjamin Singleton: 84
limx0: 82
Miller Moore: 63

Filtered Author-File Contributions (Only Authors Above Threshold):
Author: Chris Sellers
  File: nautilus_trader\examples\strategies\ema_cross.py, Contributions: 16
  File: nautilus_trader\examples\strategies\ema_cross_bracket.py, Contributions: 10
  File: nautilus_trader\examples\strategies\ema_cross_bracket_algo.py, Contributions: 10
  File: nautilus_trader\examples\strategies\ema_cross_cython.pyx, Contributions: 13
  File: nautilus_trader\examples\strategies\ema_cross_stop_entry.py, Contributions: 12
  File: nautilus_trader\examples\strategies\ema_cross_trailing_stop.py, Contributions: 12
  File: nautilus_trader\examples\strategies\ema_cross_twap.py, Contributions: 10
  File: nautilus_trader\examples\strategies\volatility_market_maker.py, Contributions: 29
  File: naut

* consider the following subsystems: nautilus_core and all the directories inside nautilus_trader (e.g., accounting, adapters, analysis, ...)

In [7]:
filtered_author_file_contributions_with_files = {
    author: {file: count for file, count in files.items() if file.startswith(('nautilus_trader', 'nautilus_core'))}
    for author, files in filtered_author_file_contributions.items()
}

# Print the filtered contributions
print("\nFiltered Author-File Contributions (Files Starting with 'nautilus_trader' or 'nautilus_core'):")
for author, files in filtered_author_file_contributions_with_files.items():
    if files:  # Only print authors with relevant files
        print(f"Author: {author}")
        for file, count in files.items():
            print(f"  File: {file}, Contributions: {count}")


Filtered Author-File Contributions (Files Starting with 'nautilus_trader' or 'nautilus_core'):
Author: Chris Sellers
  File: nautilus_trader\examples\strategies\ema_cross.py, Contributions: 16
  File: nautilus_trader\examples\strategies\ema_cross_bracket.py, Contributions: 10
  File: nautilus_trader\examples\strategies\ema_cross_bracket_algo.py, Contributions: 10
  File: nautilus_trader\examples\strategies\ema_cross_cython.pyx, Contributions: 13
  File: nautilus_trader\examples\strategies\ema_cross_stop_entry.py, Contributions: 12
  File: nautilus_trader\examples\strategies\ema_cross_trailing_stop.py, Contributions: 12
  File: nautilus_trader\examples\strategies\ema_cross_twap.py, Contributions: 10
  File: nautilus_trader\examples\strategies\volatility_market_maker.py, Contributions: 29
  File: nautilus_core\persistence\src\arrow\delta.rs, Contributions: 18
  File: nautilus_core\core\src\cvec.rs, Contributions: 2
  File: nautilus_core\model\src\enums.rs, Contributions: 54
  File: naut

* for the considered authors and the considered subsystems, create a matrix (for example a pandas dataframe) where the columns are the authors, the rows the subsystems, and the value of a cell is the number of contribution of that author on that subsystem

In [8]:
import pandas as pd

In [9]:
def create_datafram_from_contrib_matix(contrib_matrix):

    module_author_contributions = defaultdict(lambda: defaultdict(int))

    # Process the filtered data
    for author, files in contrib_matrix.items():
        for file, count in files.items():
            if file.startswith("nautilus_core"):
                module = "nautilus_core"
            elif file.startswith("nautilus_trader"):
                # Extract subfolder after 'nautilus_trader/'
                parts = file.split("/")
                if len(parts) > 2:  # Ensure there is a subfolder
                    module = f"nautilus_trader/{parts[1]}"
                else:
                    module = "nautilus_trader"
            else:
                continue  # Skip files not matching the criteria
            
            # Add contributions to the module for the author
            module_author_contributions[module][author] += count

    # Convert to a pandas DataFrame
    dataframe = pd.DataFrame(module_author_contributions).fillna(0).T

    return dataframe



df = create_datafram_from_contrib_matix(filtered_author_file_contributions_with_files)
print(df.head())




                 Chris Sellers  Ishan Bhanuka  Filip Macek  rsmb7z   Brad  \
nautilus_trader         4630.0           29.0        192.0    88.0  152.0   
nautilus_core           6513.0          464.0        976.0    12.0    0.0   

                 David Blom  Benjamin Singleton  limx0  faysou  Miller Moore  \
nautilus_trader       176.0                61.0   30.0    66.0          63.0   
nautilus_core           8.0                 0.0    0.0    29.0           0.0   

                 Pushkar Mishra  
nautilus_trader             0.0  
nautilus_core             384.0  


* now comment on the results: is the main author predominant in terms of contributions? how are the contributions distributed among the authors? are the subsystems similar in terms of distribution? To answer these questions compute an additional column that measure the percentage of contributions of the main author with respect to the total.

In [10]:
def calc_main_auth_and_add_percentage_to_df(df):

    # Calculate the total contributions for each author across all modules
    total_contributions_by_author = df.sum(axis=0)

    # Identify the main author (author with the highest contributions)
    main_author = total_contributions_by_author.idxmax()
    main_author_contributions = total_contributions_by_author[main_author]

    # Calculate the total contributions per module
    total_contributions_per_module = df.sum(axis=1)

    # Add a column for the percentage of contributions by the main author
    df['Main_Author_Percentage'] = (df[main_author] / total_contributions_per_module) * 100

    # Display the updated DataFrame
    print("\nUpdated DataFrame with Main Author Percentage:")
    print(df)

    return df, main_author, main_author_contributions

df, main_author, main_author_contributions = calc_main_auth_and_add_percentage_to_df(df)


Updated DataFrame with Main Author Percentage:
                 Chris Sellers  Ishan Bhanuka  Filip Macek  rsmb7z   Brad  \
nautilus_trader         4630.0           29.0        192.0    88.0  152.0   
nautilus_core           6513.0          464.0        976.0    12.0    0.0   

                 David Blom  Benjamin Singleton  limx0  faysou  Miller Moore  \
nautilus_trader       176.0                61.0   30.0    66.0          63.0   
nautilus_core           8.0                 0.0    0.0    29.0           0.0   

                 Pushkar Mishra  Main_Author_Percentage  
nautilus_trader             0.0               84.381265  
nautilus_core             384.0               77.665156  


In [11]:
print("\nComments on the Results:")
print(f"1. Main Author: {main_author} with total contributions: {main_author_contributions}")
print("2. Contribution Percentages by Main Author:")
print(df['Main_Author_Percentage'])


Comments on the Results:
1. Main Author: Chris Sellers with total contributions: 11143.0
2. Contribution Percentages by Main Author:
nautilus_trader    84.381265
nautilus_core      77.665156
Name: Main_Author_Percentage, dtype: float64


In [12]:
modules_with_main_author_dominance = (df['Main_Author_Percentage'] > 50).sum()


print(f"Total modules: {len(df)}")
print(f"Modules where main author has more than 50% contribution {modules_with_main_author_dominance}")

Total modules: 2
Modules where main author has more than 50% contribution 2


Yes the main author (Chris Sellers) is predominant in terms of contributions. He is in most of the modules (25 out of 26) the main author (more than 50% contribution)

* redo all the previous steps but instead of counting the number of contributions, for every file in every commit, count the number of lines added. Produce a matrix equivalent to the previous one, using the lines added and comment on the results. Are the results the same if we look at the lines added instead of the number of contributions? For which subsystems are the results different?

In [None]:
# Dictionary to store lines added per file per author
author_file_lines_added = defaultdict(lambda: defaultdict(int))  # {author: {file: lines_added}}

allowed_extensions = {".rs", ".py", ".pyx"}

progress = tqdm.tqdm(unit="commit", desc="Processing commits")

# Traverse commits in the repository
for commit in repo.traverse_commits():
    author = commit.author.name
    for f in commit.modified_files:
        file_path = str(f.new_path)
        # Check if file extension is allowed
        if file_path and any(file_path.endswith(ext) for ext in allowed_extensions):
            # Count lines added per file for each author
            author_file_lines_added[author][file_path] += f.added_lines or 0
    progress.update(1)

progress.close()

# Convert to a structured dictionary
author_file_lines_matrix = defaultdict(lambda: defaultdict(int))

for author, files in author_file_lines_added.items():
    for file, lines_added in files.items():
        author_file_lines_matrix[author][file] = lines_added

# Optionally, print the matrix for verification
print("\nAuthor-File Lines Added Matrix:")
for author, files in author_file_lines_matrix.items():
    print(f"Author: {author}")
    for file, lines in files.items():
        print(f"  File: {file}, Lines Added: {lines}")


Processing commits: 3859commit [11:47,  5.45commit/s]


Author-File Lines Added Matrix:
Author: Chris Sellers
  File: nautilus_trader\examples\strategies\ema_cross.py, Lines Added: 30
  File: nautilus_trader\examples\strategies\ema_cross_bracket.py, Lines Added: 30
  File: nautilus_trader\examples\strategies\ema_cross_bracket_algo.py, Lines Added: 45
  File: nautilus_trader\examples\strategies\ema_cross_cython.pyx, Lines Added: 23
  File: nautilus_trader\examples\strategies\ema_cross_stop_entry.py, Lines Added: 41
  File: nautilus_trader\examples\strategies\ema_cross_trailing_stop.py, Lines Added: 46
  File: nautilus_trader\examples\strategies\ema_cross_twap.py, Lines Added: 20
  File: nautilus_trader\examples\strategies\volatility_market_maker.py, Lines Added: 124
  File: nautilus_core\persistence\src\arrow\delta.rs, Lines Added: 286
  File: nautilus_core\core\src\cvec.rs, Lines Added: 3
  File: nautilus_core\model\src\enums.rs, Lines Added: 913
  File: nautilus_trader\adapters\betfair\sockets.py, Lines Added: 45
  File: nautilus_trader\c




In [14]:
author_line_added_aggregated_sum = get_author_contribution_sum(author_file_lines_matrix)
for auth, count in sorted(author_line_added_aggregated_sum.items(), key=lambda x: x[1], reverse=True):
    print(auth, count)

Chris Sellers 209618
Filip Macek 59453
Pushkar Mishra 18264
David Blom 12315
Ishan Bhanuka 10322
Miller Moore 8818
rsmb7z 7863
Benjamin Singleton 6673
faysou 6491
Brad 6100
DevRoss 1051
Ayush 857
Nisayo 828
Ayush Singh Bhandari 799
graceyangfan 551
limx0 435
Javaid 200
r3k4mn14r 138
Myles 134
Anurag Roy 69
ghill2 60
Dia Kharrat 36
freddiehill 31
sunlei 24
imemo88 23
Troubladore 17
Dimitar Petrov 16
Ben Singleton 15
Evgenii Prusov 10
dodofarm 2
Sunlei 1
fred monroe 1


In [15]:
min_added_line_contribution_threshold = 1000

author_contribution_line_added_over_threshold = {
    author: count for author, count in author_line_added_aggregated_sum.items()
    if count > min_added_line_contribution_threshold
}

print("\nAuthor Contributions Above Threshold:")
for author, count in sorted(author_contribution_line_added_over_threshold.items(), key=lambda x: x[1], reverse=True):
    print(f"{author}: {count}")


filtered_author_file_line_added_contributions = {
    author: files for author, files in author_file_lines_matrix.items()
    if author in author_contribution_line_added_over_threshold
}

# Print the filtered contributions
print("\nFiltered Author-File Contributions (Only Authors Above Threshold):")
for author, files in filtered_author_file_line_added_contributions.items():
    print(f"Author: {author}")
    for file, count in files.items():
        print(f"  File: {file}, Contributions: {count}")



Author Contributions Above Threshold:
Chris Sellers: 209618
Filip Macek: 59453
Pushkar Mishra: 18264
David Blom: 12315
Ishan Bhanuka: 10322
Miller Moore: 8818
rsmb7z: 7863
Benjamin Singleton: 6673
faysou: 6491
Brad: 6100
DevRoss: 1051

Filtered Author-File Contributions (Only Authors Above Threshold):
Author: Chris Sellers
  File: nautilus_trader\examples\strategies\ema_cross.py, Contributions: 30
  File: nautilus_trader\examples\strategies\ema_cross_bracket.py, Contributions: 30
  File: nautilus_trader\examples\strategies\ema_cross_bracket_algo.py, Contributions: 45
  File: nautilus_trader\examples\strategies\ema_cross_cython.pyx, Contributions: 23
  File: nautilus_trader\examples\strategies\ema_cross_stop_entry.py, Contributions: 41
  File: nautilus_trader\examples\strategies\ema_cross_trailing_stop.py, Contributions: 46
  File: nautilus_trader\examples\strategies\ema_cross_twap.py, Contributions: 20
  File: nautilus_trader\examples\strategies\volatility_market_maker.py, Contributio

In [16]:
filtered_author_file_contributions_with_files_added_lines = {
    author: {file: count for file, count in files.items() if file.startswith(('nautilus_trader', 'nautilus_core'))}
    for author, files in filtered_author_file_line_added_contributions.items()
}

# Print the filtered contributions
print("\nFiltered Author-File Contributions (Files Starting with 'nautilus_trader' or 'nautilus_core'):")
for author, files in filtered_author_file_contributions_with_files_added_lines.items():
    if files:  # Only print authors with relevant files
        print(f"Author: {author}")
        for file, count in files.items():
            print(f"  File: {file}, Contributions: {count}")


Filtered Author-File Contributions (Files Starting with 'nautilus_trader' or 'nautilus_core'):
Author: Chris Sellers
  File: nautilus_trader\examples\strategies\ema_cross.py, Contributions: 30
  File: nautilus_trader\examples\strategies\ema_cross_bracket.py, Contributions: 30
  File: nautilus_trader\examples\strategies\ema_cross_bracket_algo.py, Contributions: 45
  File: nautilus_trader\examples\strategies\ema_cross_cython.pyx, Contributions: 23
  File: nautilus_trader\examples\strategies\ema_cross_stop_entry.py, Contributions: 41
  File: nautilus_trader\examples\strategies\ema_cross_trailing_stop.py, Contributions: 46
  File: nautilus_trader\examples\strategies\ema_cross_twap.py, Contributions: 20
  File: nautilus_trader\examples\strategies\volatility_market_maker.py, Contributions: 124
  File: nautilus_core\persistence\src\arrow\delta.rs, Contributions: 286
  File: nautilus_core\core\src\cvec.rs, Contributions: 3
  File: nautilus_core\model\src\enums.rs, Contributions: 913
  File: n

In [17]:
df2 = create_datafram_from_contrib_matix(filtered_author_file_contributions_with_files_added_lines)
print(df2.head())

                 Chris Sellers  Ishan Bhanuka  Filip Macek  rsmb7z    Brad  \
nautilus_trader        64248.0          295.0       7919.0  4674.0  3590.0   
nautilus_core         112980.0         9938.0      45489.0   561.0     0.0   

                 David Blom  Benjamin Singleton  DevRoss  faysou  \
nautilus_trader      8624.0              4382.0    564.0  2726.0   
nautilus_core          59.0                 0.0      0.0  1288.0   

                 Miller Moore  Pushkar Mishra  
nautilus_trader        8818.0             0.0  
nautilus_core             0.0         18184.0  


In [18]:
df2, main_author2, main_author_contributions2 = calc_main_auth_and_add_percentage_to_df(df2)


Updated DataFrame with Main Author Percentage:
                 Chris Sellers  Ishan Bhanuka  Filip Macek  rsmb7z    Brad  \
nautilus_trader        64248.0          295.0       7919.0  4674.0  3590.0   
nautilus_core         112980.0         9938.0      45489.0   561.0     0.0   

                 David Blom  Benjamin Singleton  DevRoss  faysou  \
nautilus_trader      8624.0              4382.0    564.0  2726.0   
nautilus_core          59.0                 0.0      0.0  1288.0   

                 Miller Moore  Pushkar Mishra  Main_Author_Percentage  
nautilus_trader        8818.0             0.0               60.702948  
nautilus_core             0.0         18184.0               59.936657  


In [19]:
print("\nComments on the Results:")
print(f"1. Main Author: {main_author2} with total added line of code: {main_author_contributions2}")
print("2. Contribution Percentages by Main Author:")
print(df2['Main_Author_Percentage'])


Comments on the Results:
1. Main Author: Chris Sellers with total added line of code: 177228.0
2. Contribution Percentages by Main Author:
nautilus_trader    60.702948
nautilus_core      59.936657
Name: Main_Author_Percentage, dtype: float64


In [20]:
modules_with_main_author_dominance2 = (df2['Main_Author_Percentage'] > 50).sum()


print(f"Total modules: {len(df2)}")
print(f"Modules where main author has more than 50% contribution {modules_with_main_author_dominance2}")

Total modules: 2
Modules where main author has more than 50% contribution 2


The main author is still the main contributor. However, in certain modules he did not add more than 50% LOC.

## Task 2: Knowledge loss

We now want to analyze the knowledge loss when the main contributor of the analyzed project would leave. For this we will use the circle packaging layout introduced in the "Code as a Crime Scene" book. This assignment includes the necessary `knowledge_loss.html` file as well as the `d3` folder for all dependencies. Your task is to create the `output.json` file according to the specification below. This file can then be visualized with the files provided.

For showing the visualization, once you have the output as `output.json` you should

* make sure to have the `knowledge_loss.html` file in the same folder
* start a local HTTP server in the same folder (e.g. with python `python3 -m http.server`) to serve the html file (necessary for d3 to work)
* open the served `knowledge_loss.html` and look at the visualization

Based on the visualization, comment on how is the project in terms of project loss and what could happen if the main contributor would leave.


### Output Format for Visualization

* `root` is always the root of the tree
* `size` should be the total number of lines of contribution
* `weight` can be set to the same as `size`
* `ownership` should be set to the percentage of contributions from the main author (e.g. 0.98 for 98% if contributions coming from the main author)

```
{
  "name": "root",
  "children": [
    {
      "name": "test",
      "children": [
        {
          "name": "benchmarking",
          "children": [
            {
              "author_color": "red",
              "size": "4005",
              "name": "t6726-patmat-analysis.scala",
              "weight": 1.0,
              "ownership": 0.9,
              "children": []
            },
            {
              "author_color": "red",
              "size": "55",
              "name": "TreeSetIterator.scala",
              "weight": 0.88,
              "ownership": 0.9,
              "children": []
            }
          ]
        }
      ]
    }
  ]
}
```

### JSON Export

For exporting the data to JSON you can use the following snippet:

```
import json

with open("output.json", "w") as file:
    json.dump(tree, file, indent=4)
```

In [37]:
json_df = pd.DataFrame(author_file_lines_matrix).stack().reset_index()
json_df.columns = ['filename', 'author', 'contributions']
json_df = json_df[['author', 'filename', 'contributions']]
json_df.head()


Unnamed: 0,author,filename,contributions
0,Chris Sellers,nautilus_trader\examples\strategies\ema_cross.py,30.0
1,rsmb7z,nautilus_trader\examples\strategies\ema_cross.py,11.0
2,Chris Sellers,nautilus_trader\examples\strategies\ema_cross_...,30.0
3,Chris Sellers,nautilus_trader\examples\strategies\ema_cross_...,45.0
4,Chris Sellers,nautilus_trader\examples\strategies\ema_cross_...,23.0


In [93]:
grouped_df = json_df.groupby('filename', as_index=False)['contributions'].sum()
main_df = json_df[json_df['author'] == 'Chris Sellers']
merged_df = grouped_df.merge(main_df, on='filename', how='left', suffixes=('_total', '_main'))
merged_df.fillna({'contributions_main': 0}, inplace=True)
merged_df['main_percentage'] = (
    (merged_df['contributions_main'] / merged_df['contributions_total']) * 100
)
result_df = merged_df[['filename', 'contributions_total', 'contributions_main', 'main_percentage']]
result_df = result_df.fillna({'main_percentage': 0})
result_df.head()

Unnamed: 0,filename,contributions_total,contributions_main,main_percentage
0,build.py,121.0,121.0,100.0
1,docs\_pygments\monokai.py,17.0,17.0,100.0
2,docs\api_reference\conf.py,81.0,30.0,37.037037
3,docs\conf.py,14.0,13.0,92.857143
4,examples\backtest\betfair_backtest_orderbook_i...,14.0,9.0,64.285714


In [85]:
def make_tree(file, row, node):
    tree= file.split('\\')
    if len(tree) == 1:
        leaf = {'name': None, 'children': []}
        leaf['name'] = tree[0]
        if row['main_percentage'] >= 50:
            leaf['author_color'] = 'red'
        else:
            leaf['author_color'] = 'grey'
        leaf['size'] = row['contributions_total']
        leaf['weight'] = row['contributions_total']
        leaf['ownership'] = row['main_percentage']
        node['children'].append(leaf)
    else:
        matching_child = next((child for child in node['children'] if child['name'] == tree[0]), None)
        if matching_child:
            matching_child = make_tree('\\'.join(tree[1:]), row, matching_child)
        else:
            leaf = {'name': None, 'children': []}
            leaf['name'] = tree[0]
            node['children'].append(make_tree('\\'.join(tree[1:]), row, leaf))
    return node

In [94]:
tree = {
  "name": "root",
  "children":[]
}

for index, row in result_df.iterrows():
    tree = make_tree(row['filename'], row, tree)

In [87]:
import json

with open("output.json", "w") as file:
    json.dump(tree, file, indent=4)

<img src="task_2.png" width="400" /> <br>
- Based on the visualization, comment on how is the project in terms of project loss and what could happen if the main contributor would leave. <br>

If the main contributor leaves the organization, this project can face severe risk. In the module `nautilus_core`, Chris Sellers participated in most of the dependencies, especially in the core modules like model, common, and adapters.<br>
Also in the module `nautilus_trader`, he participated less in adapter sub-module but still in over half of sub-modules are dominated by his effort.<br>

Since his work is not limited in few modules but all over the system, if the main developer leave the project, system can face huge project loss.

## Task 3: Code Churn Analysis

The third and last task is to analyze the code churn of the _Nautilus Trader_ project. For this analysis we look at the code churn, meaning the daily change in the total number of lines of the project.

Visualize the code churn over time bucketing the data by day. Remember that you'll need to consider also the days when there are no commits.

Look at the churn trend over time, identify one outlier of your choice, and for it:

* investigate if it was caused by a single or multiple commits (since you are bucketing the data by day)
* find the involved commit(s) and look at the commit message(s)
* find the involved files, and for each file look at the number of lines added and/or deleted as well as the modification type (addition, deletion, modification, renaming)

Based on the above, discuss the potential reasons for the outlier and if it should be a reason for concern.

In [26]:
# your solution here