# Fundamentals of Software Systems (FSS)
**Software Evolution – Part 02 Assignment**

## Submission Guidelines

To correctly complete this assignment you must:

* Carry out the assignment in a team of 2 to 4 students.
* Carry out the assignment with your team only. You are allowed to discuss solutions with other teams, but each team must come up with its own solution. A strict plagiarism policy will be applied to all artifacts submitted for evaluation.
* As your submission, upload the completed Jupyter Notebook (including outputs) together with the d3 visualization web pages (i.e. upload everything you downloaded, including the completed Jupyter Notebook, plus your `output.json`).
* The files must be uploaded to OLAT as a single ZIP (`.zip`) file by 2023-12-04 18:00.


## Group Members
* Firstname, Lastname, Immatrikulation Number
* **TO BE FILLED**

## Task Context

In this assignment we will analyze the _elasticsearch_ project. All of the following tasks should be done with the subset of commits from tag `v1.6.0` to tag `v2.0.0`.

In [2]:
from pydriller import Repository, Git
import os
import matplotlib.pyplot as plt
import numpy as np

In [3]:
# Clone code from the repo and save it for code portability -> via normal git clone
url = "https://github.com/elastic/elasticsearch"
repo_path = os.path.join(os.getcwd(), 'elasticsearch')
clone = f"git clone {url} \"{repo_path}\""

os.system(clone) # Cloning

128

In [4]:
# check out the tag v2.0.0 (the end of the analyzed range)
os.chdir(repo_path)
os.system("git checkout v2.0.0")
# back to the "home" folder
os.chdir("..")

In [5]:
from_tag = "v1.6.0"
to_tag = "v2.0.0"

In [6]:
gr = Git(repo_path)
# get all commits between the two tags
from_commit_date = gr.get_commit_from_tag(from_tag).committer_date
to_commit_date = gr.get_commit_from_tag(to_tag).committer_date

## Task 1: Author analysis

In the following, please consider only `java` files.

The first task is to get an overview of the author ownership of the _elasticsearch_ project. In particular, we want to understand who the main authors in the system are between the two considered tags, how authors are distributed among files, and how files are distributed among authors. To this end, perform the following:

* create a dictionary (or a list of tuples) with the pairs author => number of modified files
* create a dictionary (or a list of tuples) with the pairs file => number of authors who modified the file
* visualize the distribution of authors among files: the visualization should have on the x axis the number of authors per file (from 1 to max), and on the y axis the number of files with the given number of authors (so, for example, the first bar represents the number of files with a single author)
* visualize the distribution of files among authors: the visualization should have on the x axis the number of files per author (from 1 to max), and on the y axis the number of authors who own the given number of files (so, for example, the first bar represents the minor contributors, i.e., the number of authors who own 1 file)

Comment on the two distribution visualizations.
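
Both distributions come down to counting how often each value occurs in a dictionary. The sketch below illustrates this with a small made-up `files_nr_authors` mapping (the file names and counts are hypothetical, not real elasticsearch data). The helper `value_distribution` is an illustrative name, not part of the assignment; it fills in zero for values that never occur so that the x axis runs from 1 to max without gaps.

```python
from collections import Counter

def value_distribution(counts_by_key):
    """Map each value 1..max to the number of keys that have that value."""
    occurrences = Counter(counts_by_key.values())
    max_value = max(occurrences)
    return {value: occurrences.get(value, 0) for value in range(1, max_value + 1)}

# Hypothetical example data: file => number of distinct authors
files_nr_authors = {"A.java": 1, "B.java": 1, "C.java": 2, "D.java": 4}

distribution = value_distribution(files_nr_authors)
print(distribution)
```

The same helper can be applied to the author => number of files dictionary to obtain the second distribution.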



Now, let's look at the following 3 packages in more detail:

1. `src/main/java/org/elasticsearch/common`
2. `src/main/java/org/elasticsearch/rest`
3. `src/main/java/org/elasticsearch/cluster`

Create a function that, given the path of a package and a modification type (see class Modification below), returns a dictionary of authors => number, where the number counts the total lines added or removed or added+removed or added-removed (depending on the given Modification parameter), for the given package. To compute the value at the package level, you should aggregate the data per file.

Using the function defined above, visualize the author contributions (lines added + lines removed). The visualization should have the authors on the x axis and the total lines on the y axis. Sort the bars by decreasing contribution, i.e., the main author should come first.

Compare the visualizations for the 3 packages and comment.
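
As a rough illustration of the aggregation and sorting described above, the following sketch sums added + removed lines per author for one package and ranks the authors in decreasing order. It works on a hand-written list of change records; the record layout, author names, file names, and numbers are all hypothetical and stand in for whatever is extracted from the repository.

```python
from collections import defaultdict

# Hypothetical per-file change records: (author, file path, lines added, lines removed)
changes = [
    ("alice", "src/main/java/org/elasticsearch/rest/A.java", 40, 10),
    ("bob", "src/main/java/org/elasticsearch/rest/B.java", 5, 5),
    ("alice", "src/main/java/org/elasticsearch/common/C.java", 100, 0),
    ("carol", "src/main/java/org/elasticsearch/rest/A.java", 70, 20),
]

def total_lines_per_author(changes, package_path):
    """Sum added + removed lines per author over all files in the package."""
    prefix = package_path.rstrip("/") + "/"
    totals = defaultdict(int)
    for author, path, added, removed in changes:
        if path.startswith(prefix):
            totals[author] += added + removed
    return dict(totals)

totals = total_lines_per_author(changes, "src/main/java/org/elasticsearch/rest")

# Sort by contribution, largest first, so the main author leads the chart
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

The author names and totals in `ranked` can then be passed as the x and y values of a bar chart.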

In [8]:
authors_nr_modifications = {}
files_nr_authors = {}

for commit in Repository(repo_path, since=from_commit_date, to=to_commit_date).traverse_commits():
    for file in commit.modified_files:
        # Key by full path: bare file names are not unique across packages
        filename = file.new_path or file.old_path
        if not filename.endswith(".java"):
            continue
        author = commit.author.name
        if filename not in files_nr_authors:
            files_nr_authors[filename] = set()
        files_nr_authors[filename].add(author)
        if author not in authors_nr_modifications:
            authors_nr_modifications[author] = set()
        authors_nr_modifications[author].add(filename)

for author in authors_nr_modifications:
    authors_nr_modifications[author] = len(authors_nr_modifications[author])

for filename in files_nr_authors:
    files_nr_authors[filename] = len(files_nr_authors[filename])

In [7]:
# Function to create a histogram for the distribution of authors among files
def plot_authors_distribution(authors_files):
    author_counts = list(authors_files.values())
    max_authors = max(author_counts)

    plt.hist(author_counts, bins=np.arange(1, max_authors + 2) - 0.5, edgecolor='black', alpha=0.7)
    plt.xlabel('Number of Authors per File')
    plt.ylabel('Number of Files')
    plt.title('Distribution of Authors Among Files')
    plt.show()

In [8]:
# Function to create a histogram for the distribution of files among authors
def plot_files_distribution(files_authors):
    files_counts = list(files_authors.values())
    max_files = max(files_counts)

    plt.hist(files_counts, bins=np.arange(1, max_files + 2) - 0.5, edgecolor='black', alpha=0.7)
    plt.xlabel('Number of Files per Author')
    plt.ylabel('Number of Authors')
    plt.title('Distribution of Files Among Authors')
    plt.show()

In [9]:
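# Authors per file needs file => number of authors; files per author needs author => number of files
plot_authors_distribution(files_nr_authors)
plot_files_distribution(authors_nr_modifications)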


In [7]:
from enum import Enum 

class Modification(Enum):
    ADDED = "Lines added"
    REMOVED = "Lines removed"
    TOTAL = "Lines added + lines removed"
    DIFF = "Lines added - lines removed"

In [11]:
def calc_authors_nr_modifications(package_path, modification):
    authors_nr_modifications = {}
    for commit in Repository(repo_path, since=from_commit_date, to=to_commit_date).traverse_commits():
        for file in commit.modified_files:
            filename = file.filename
            if not filename.endswith(".java"):
                continue
            if file.new_path is None:
                continue    
            if not file.new_path.startswith(package_path):
                continue
            author = commit.author.name
            if author not in authors_nr_modifications:
                authors_nr_modifications[author] = 0
            # PyDriller 2.x exposes added_lines / deleted_lines on each modified file
            if modification == Modification.ADDED:
                authors_nr_modifications[author] += file.added_lines
            elif modification == Modification.REMOVED:
                authors_nr_modifications[author] += file.deleted_lines
            elif modification == Modification.TOTAL:
                authors_nr_modifications[author] += file.added_lines + file.deleted_lines
            elif modification == Modification.DIFF:
                authors_nr_modifications[author] += file.added_lines - file.deleted_lines
    return authors_nr_modifications

In [12]:
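# Build the prefixes with the OS path separator so they match on any platform
package_paths = [
    os.path.join("src", "main", "java", "org", "elasticsearch", "common"),
    os.path.join("src", "main", "java", "org", "elasticsearch", "rest"),
    os.path.join("src", "main", "java", "org", "elasticsearch", "cluster"),
]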

# Separate name so the author => files dictionary from the first part is not overwritten
package_contributions = {}

for package_path in package_paths:
    package_contributions[package_path] = {}
    for modification in Modification:
        package_contributions[package_path][modification] = calc_authors_nr_modifications(package_path, modification)



## Task 2: Knowledge loss

We now want to analyze the knowledge loss that would occur if the main contributor of the analyzed project were to leave. For this we will use the circle packing layout introduced in the "Code as a Crime Scene" book. This assignment includes the necessary `knowledge_loss.html` file as well as the `d3` folder for all dependencies. Your task is to create the `output.json` file according to the specification below. This file can then be visualized with the files provided.

For showing the visualization, once you have the output as `output.json` you should

* make sure to have the `knowledge_loss.html` file in the same folder
* start a local HTTP server in the same folder (e.g. with Python: `python3 -m http.server`) to serve the HTML file (necessary for d3 to work)
* open the served `knowledge_loss.html` and look at the visualization

Based on the visualization, comment on how the project fares in terms of knowledge loss and what could happen if the main contributor were to leave.


### Output Format for Visualization

* `root` is always the root of the tree
* `size` should be the total number of lines of contribution
* `weight` can be set to the same as `size`
* `ownership` should be set to the fraction of contributions from the main author (e.g. 0.98 if 98% of contributions come from the main author)

```
{
  "name": "root",
  "children": [
    {
      "name": "test",
      "children": [
        {
          "name": "benchmarking",
          "children": [
            {
              "author_color": "red",
              "size": "4005",
              "name": "t6726-patmat-analysis.scala",
              "weight": 1.0,
              "ownership": 0.9,
              "children": []
            },
            {
              "author_color": "red",
              "size": "55",
              "name": "TreeSetIterator.scala",
              "weight": 0.88,
              "ownership": 0.9,
              "children": []
            }
          ]
        }
      ]
    }
  ]
}
```
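
One possible way to assemble such a tree is sketched below. It assumes the contributions have already been collected into a dictionary mapping file path => {author => lines contributed}; that input layout, the helper name `build_tree`, the fixed `author_color` value, and the sample data are all assumptions made for illustration, not part of the specification. Following the bullets above, `weight` is simply set to the same value as `size`.

```python
import json

def build_tree(lines_by_file, main_author):
    """Nest files under their folders and attach size, weight and ownership."""
    root = {"name": "root", "children": []}
    for path, lines_by_author in lines_by_file.items():
        node = root
        *folders, filename = path.split("/")
        for folder in folders:
            # Reuse the folder node if it already exists, otherwise create it
            child = next((c for c in node["children"] if c["name"] == folder), None)
            if child is None:
                child = {"name": folder, "children": []}
                node["children"].append(child)
            node = child
        total = sum(lines_by_author.values())
        ownership = lines_by_author.get(main_author, 0) / total if total else 0.0
        node["children"].append({
            "author_color": "red",  # assumption: same color as in the example above
            "size": str(total),
            "name": filename,
            "weight": total,
            "ownership": round(ownership, 2),
            "children": [],
        })
    return root

# Hypothetical input: file path => {author => lines contributed}
lines_by_file = {
    "core/a/X.java": {"alice": 90, "bob": 10},
    "core/Y.java": {"bob": 20},
}
tree = build_tree(lines_by_file, main_author="alice")
print(json.dumps(tree, indent=2))
```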

### JSON Export

For exporting the data to JSON you can use the following snippet:

```
import json

with open("output.json", "w") as file:
    json.dump(tree, file, indent=4)
```

## Task 3: Code Churn Analysis

The third and last task is to analyze the code churn of the _elasticsearch_ project. Here, code churn means the daily change in the total number of lines of the project.

Visualize the code churn over time, bucketing the data by day. Remember that you also need to include the days on which there are no commits.
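
The bucketing step could look roughly like the sketch below, which treats the daily churn as lines added minus lines removed and emits a zero for every day without commits. The commit records are hypothetical (made-up dates and numbers), and `daily_churn` is an illustrative helper name.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical commits: (commit date, lines added, lines removed)
commits = [
    (date(2015, 6, 1), 120, 30),
    (date(2015, 6, 1), 10, 50),
    (date(2015, 6, 4), 5, 0),
]

def daily_churn(commits):
    """Return (day, added - removed) for every day from the first to the last commit."""
    churn = defaultdict(int)
    for day, added, removed in commits:
        churn[day] += added - removed
    first, last = min(churn), max(churn)
    series = []
    for offset in range((last - first).days + 1):
        day = first + timedelta(days=offset)
        # Days without commits are kept in the series with a churn of 0
        series.append((day, churn.get(day, 0)))
    return series

series = daily_churn(commits)
for day, value in series:
    print(day.isoformat(), value)
```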

Look at the churn trend over time, identify one outlier, and for it:

* investigate if it was caused by a single or multiple commits (since you are bucketing the data by day)
* find the hash of the involved commit(s)
* find the involved files, and for each file look at the number of lines added and/or deleted as well as the modification type (addition, deletion, modification, renaming)
* look at the commit messages

Based on the above, discuss the potential reasons for the outlier and whether it should be a cause for concern.
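
As a starting point for the investigation, the sketch below picks the day with the largest absolute churn and lists the commits that fall on it. Taking the maximum is only one simple way to choose an outlier; the commit hashes, dates, and line counts are invented for illustration.

```python
from datetime import date

# Hypothetical commit records: (hash, commit date, lines added, lines removed)
commits = [
    ("a1b2c3", date(2015, 6, 1), 120, 30),
    ("d4e5f6", date(2015, 6, 2), 4000, 100),
    ("0a1b2c", date(2015, 6, 2), 2500, 0),
    ("3d4e5f", date(2015, 6, 3), 15, 20),
]

# Net churn (added - removed) per day
churn_by_day = {}
for _, day, added, removed in commits:
    churn_by_day[day] = churn_by_day.get(day, 0) + added - removed

# The day whose churn deviates most from zero is treated as the outlier here
outlier_day = max(churn_by_day, key=lambda day: abs(churn_by_day[day]))
outlier_commits = [commit_hash for commit_hash, day, _, _ in commits if day == outlier_day]

print(outlier_day, churn_by_day[outlier_day], outlier_commits)
```

With the hashes in hand, the files, line counts, modification types, and messages of each involved commit can be inspected individually.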