# Fundamentals of Software Systems (FSS)
**Software Evolution – Part 02 Assignment**

## Submission Guidelines

To correctly complete this assignment you must:

* Carry out the assignment in a team of 2 to 4 students.
* Carry out the assignment with your team only. You are allowed to discuss solutions with other teams, but each team should come up with its own solution. A strict plagiarism policy will be applied to all artifacts submitted for evaluation.
* As your submission, upload the filled Jupyter Notebook (including outputs) together with the d3 visualization web pages (i.e. upload everything you downloaded including the filled Jupyter Notebook plus your `output.json`)
* The files must be uploaded to OLAT as a single ZIP (`.zip`) file by 2023-12-04 18:00.


## Group Members
* Bauer Adam, 		20-744-694
* Brazerol Alessio, 	18-924-084
* Luley Paul, 		21-741-491

## Task Context

In this assignment we will be analyzing the _elasticsearch_ project. All following tasks should be done with the subset of commits from tag `v1.6.0` to tag `v2.0.0`.

## Preprocessing

Execute these cells before each subtask. The notebook does not have to be run as a whole for a subtask to work; all subtasks are implemented independently.

In [None]:
%pip install plotly
%pip install tqdm
%pip install nbformat
%pip install pydriller pandas

In [3]:
import os
from pydriller import Repository, Git
from tqdm import tqdm
from collections import defaultdict, Counter
import plotly.express as px
import subprocess
import platform
import json
import pandas as pd
from datetime import datetime
import numpy as np
import math
import plotly.graph_objs as go
from plotly.subplots import make_subplots

In [4]:
# clone the project -> easier setup (will not clone if folder already exists or is not empty)
url = "https://github.com/elastic/elasticsearch"
repo_path = os.path.join(os.getcwd(), 'elasticsearch')
clone = f"git clone {url} {repo_path}" 

os.system(clone)

128

In [5]:
# define Repository
gr = Git(repo_path)
from_tag = "v1.6.0"
to_tag = "v2.0.0"
since_time = gr.get_commit_from_tag(from_tag).committer_date
to_time = gr.get_commit_from_tag(to_tag).committer_date

repo = Repository(repo_path, since=since_time, to=to_time)
print(f"Analyzing repo from {from_tag} to {to_tag}, which are commits from {since_time} to {to_time}")


Analyzing repo from v1.6.0 to v2.0.0, which are commits from 2015-06-09 13:35:08+00:00 to 2015-10-21 23:01:03+02:00


In [18]:
from enum import Enum 

class Modification(Enum):
    ADDED = "Lines added"
    REMOVED = "Lines removed"
    TOTAL = "Lines added + lines removed"
    DIFF = "Lines added - lines removed"

## Task 1: Author analysis



In the following, please consider only `java` files.

The first task is to get an overview of the author ownership of the _elasticsearch_ project. In particular, we want to understand who are the main authors in the system between the two considered tags, the authors distribution among files and the files distribution among authors. To this aim, perform the following:

* create a dictionary (or a list of tuples) with the pairs author => number of modified files
* create a dictionary (or a list of tuples) with the pairs file => number of authors who modified the file
* visualize the distribution of authors among files: the visualization should have on the x axis the number of authors per file (from 1 to max), and on the y axis the number of files with the given number of authors (so for example the first bar represents the number of files with a single author)
* visualize the distribution of files among authors: the visualization should have on the x axis the number of files per author (from 1 to max), and on the y axis the number of authors that own the given number of files (so for example the first bar represents the minor contributors, i.e., the number of authors who own 1 file)

Comment the two distribution visualizations.
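
The two dictionaries and both distributions can be sketched on toy data as follows. The `commits` list with its author and file names is made up for illustration; in the notebook the pairs come from PyDriller.

```python
from collections import Counter, defaultdict

# Hypothetical (author, modified java files) records standing in for real commits.
commits = [
    ("alice", ["A.java", "B.java"]),
    ("bob", ["A.java"]),
    ("alice", ["A.java", "C.java"]),
    ("carol", ["C.java"]),
]

files_per_author = defaultdict(set)  # author -> set of files
authors_per_file = defaultdict(set)  # file -> set of authors
for author, files in commits:
    for name in files:
        files_per_author[author].add(name)
        authors_per_file[name].add(author)

# Sets remove duplicates, so len() gives the distinct counts.
author_counts = {a: len(f) for a, f in files_per_author.items()}
file_counts = {f: len(a) for f, a in authors_per_file.items()}

# x = authors per file, y = number of files with that many authors.
authors_per_file_dist = Counter(file_counts.values())
# x = files per author, y = number of authors with that many files.
files_per_author_dist = Counter(author_counts.values())

print(sorted(authors_per_file_dist.items()))
print(sorted(files_per_author_dist.items()))
```

Counting the values of each dictionary with `Counter` yields exactly the bar heights of the two requested charts.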



Now, let's look at the following 3 packages in more detail:

1. `src/main/java/org/elasticsearch/common`
2. `src/main/java/org/elasticsearch/rest`
3. `src/main/java/org/elasticsearch/cluster`

Create a function that, given the path of a package and a modification type (see class Modification above), returns a dictionary of authors => number, where the number counts the total lines added or removed or added+removed or added-removed (depending on the given Modification parameter), for the given package. To compute the value at the package level, you should aggregate the data per file.

Using the function defined above, visualize the author contributions (lines added + lines removed). The visualization should have the author on the x axis, and the total lines on the y axis. Sort the visualization in decreasing amount of contributions, i.e., the main author should be the first.

Compare the visualization for the 3 packages and comment.
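
A minimal sketch of the per-package aggregation on made-up data. The `Change` enum, the `records` rows and the `pkg/...` paths are hypothetical stand-ins; the notebook itself works with the `Modification` enum and PyDriller's modified files.

```python
from collections import defaultdict
from enum import Enum


class Change(Enum):  # hypothetical stand-in for the notebook's Modification enum
    ADDED = "added"
    REMOVED = "removed"
    TOTAL = "added + removed"
    DIFF = "added - removed"


def combine(added, removed, kind):
    """Combine the line counts of one file according to the requested kind."""
    if kind is Change.ADDED:
        return added
    if kind is Change.REMOVED:
        return removed
    if kind is Change.TOTAL:
        return added + removed
    return added - removed


def lines_per_author(records, package, kind):
    """Sum the per-file values of all files below `package`, grouped by author."""
    totals = defaultdict(int)
    prefix = package.rstrip("/") + "/"
    for author, path, added, removed in records:
        if path.startswith(prefix):
            totals[author] += combine(added, removed, kind)
    # Decreasing order, so the main author comes first.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Made-up (author, path, added, removed) rows for illustration.
records = [
    ("alice", "pkg/common/A.java", 10, 2),
    ("bob", "pkg/common/B.java", 30, 5),
    ("alice", "pkg/rest/C.java", 7, 7),
    ("alice", "pkg/common/B.java", 1, 0),
]
result = lines_per_author(records, "pkg/common", Change.TOTAL)
print(result)
```

Appending a separator to the prefix avoids matching sibling packages whose names merely start with the same characters.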

### Part 1

In [13]:
def create_dictionary(repo):
    authors_dict = defaultdict(set)
    files_dict = defaultdict(set)

    commits = [commit for commit in repo.traverse_commits()]

    for commit in tqdm(commits, desc="Analyzing commits", unit="commit"):
        for file in commit.modified_files:
            filename = file.filename
            if not filename.endswith(".java"):
                continue

            authors_dict[commit.author.name].add(filename) # the values are sets, so duplicates are ignored automatically
            files_dict[filename].add(commit.author.name) # the values are sets, so duplicates are ignored automatically
           
    # perform counting on both dictionaries to get the numbers
    authors_dict = {author: len(files) for author, files in authors_dict.items()}
    files_dict = {file: len(authors) for file, authors in files_dict.items()}
    return authors_dict, files_dict

In [14]:
# build both dictionaries for the visualizations
authors_dict, files_dict = create_dictionary(repo)

Analyzing commits: 100%|██████████| 1746/1746 [01:23<00:00, 20.97commit/s]


In [15]:
def plot_interactive_bar_chart(dictionary, title, ylabel, xlabel):
    # Counting how often each value occurs in the dictionary
    counts = Counter(dictionary.values())

    # Sorting the counts by key (the x value) in ascending order
    sorted_counts = dict(sorted(counts.items(), key=lambda item: item[0]))

    # Preparing data for plotting
    values = list(sorted_counts.keys())
    frequencies = list(sorted_counts.values())

    # Creating an interactive bar chart
    fig = px.bar(x=values, y=frequencies, labels={'x': xlabel, 'y': ylabel}, title=title)
    fig.update_layout(xaxis={'type':'category'})
    fig.show()



plot_interactive_bar_chart(files_dict, "Distribution of authors among files (the first bar is the number of files with a single author)", "Number of files", "Number of authors per file")
plot_interactive_bar_chart(authors_dict, "Distribution of files among authors (the first bar is the number of authors who own 1 file)", "Number of authors", "Number of files per author")

As the first chart shows, a large number of files have only a single author. This could reflect the fact that elasticsearch is an open-source project where everybody can contribute modifications. However, there are still some files to which multiple developers have contributed; these could be God classes or, for example, the package manager. The file with the most contributors has 12 authors.

In the second chart we again see that most authors own only a small number of files. However, there are two core developers: one owns 1235 files and the second one roughly 800 files.

### Part 2

In [16]:
def get_modified_value(file, modification_type):
    # isinstance() is used because `x in Modification` raises a TypeError
    # for non-members on Python versions before 3.12
    if not isinstance(modification_type, Modification):
        raise ValueError("Invalid modification type")

    if modification_type == Modification.ADDED:
        return file.added_lines
    elif modification_type == Modification.REMOVED:
        return file.deleted_lines
    elif modification_type == Modification.TOTAL:
        return file.added_lines + file.deleted_lines
    else:  # Modification.DIFF
        return file.added_lines - file.deleted_lines

def get_authors_numbers(package_paths, modification_types):
    if(not isinstance(package_paths, list)):
        raise ValueError("Package paths must be a list")
    if(not isinstance(modification_types, list)):
        raise ValueError("Modification types must be a list")
    
    # restrict repo to the previous defined tags
    repo = Repository(repo_path, since=since_time, to=to_time)
    commits = [commit for commit in repo.traverse_commits()]

    authors_dict = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 0)))
    for commit in tqdm(commits, desc="Analyzing commits", unit="commit"):
        for file in commit.modified_files:
            filename = file.new_path if file.new_path else file.old_path

            for package_path in package_paths:
                if filename.startswith(package_path):
                    for modification_type in modification_types:
                        authors_dict[package_path][modification_type][commit.author.name] += get_modified_value(file, modification_type)

    return authors_dict   

In [19]:
# Be aware, the sources were moved from the core folder into the server folder in 2018
# os.path.join keeps the package prefixes valid on both Windows and Unix
base_package = os.path.join("core", "src", "main", "java", "org", "elasticsearch")
packages = [os.path.join(base_package, name) for name in ("common", "rest", "cluster")]
modifications = [Modification.ADDED, Modification.REMOVED, Modification.TOTAL, Modification.DIFF]
analysis_packages = get_authors_numbers(packages, modifications)

Analyzing commits: 100%|██████████| 1746/1746 [01:32<00:00, 18.87commit/s]


In [20]:
def graph_interactive(data, title):
    # Sort the dictionary by values (number of lines), largest first
    sorted_data = dict(sorted(data.items(), key=lambda item: item[1], reverse=True))

    # Splitting the sorted dictionary into authors and lines
    authors = list(sorted_data.keys())
    lines = list(sorted_data.values())

    # Creating an interactive bar chart using Plotly
    fig = px.bar(x=authors, y=lines)
    fig.update_layout(
        title=title,
        xaxis_title="Authors",
        yaxis_title="Number of Lines",
        xaxis={'categoryorder':'total descending'}
    )
    fig.show()


for pack in packages:
    graph_interactive(analysis_packages[pack][Modification.TOTAL], "Package: "+pack)


As visible in the charts, Simon Willnauer is the main contributor in all of these packages. Judging by the number of changed lines, the package 'cluster' received considerably more contributions than the other 2 packages.

## Task 2: Knowledge loss

We now want to analyze the knowledge loss when the main contributor of the analyzed project would leave. For this we will use the circle packing layout introduced in the "Code as a Crime Scene" book. This assignment includes the necessary `knowledge_loss.html` file as well as the `d3` folder for all dependencies. Your task is to create the `output.json` file according to the specification below. This file can then be visualized with the files provided.

For showing the visualization, once you have the output as `output.json` you should

* make sure to have the `knowledge_loss.html` file in the same folder
* start a local HTTP server in the same folder (e.g. with python `python3 -m http.server`) to serve the html file (necessary for d3 to work)
* open the served `knowledge_loss.html` and look at the visualization

Based on the visualization, comment on how the project is doing in terms of knowledge loss and what could happen if the main contributor were to leave.


### Output Format for Visualization

* `root` is always the root of the tree
* `size` should be the total number of lines of contribution
* `weight` can be set to the same as `size`
* `ownership` should be set to the percentage of contributions from the main author (e.g. 0.98 for 98% of contributions coming from the main author)

```
{
  "name": "root",
  "children": [
    {
      "name": "test",
      "children": [
        {
          "name": "benchmarking",
          "children": [
            {
              "author_color": "red",
              "size": "4005",
              "name": "t6726-patmat-analysis.scala",
              "weight": 1.0,
              "ownership": 0.9,
              "children": []
            },
            {
              "author_color": "red",
              "size": "55",
              "name": "TreeSetIterator.scala",
              "weight": 0.88,
              "ownership": 0.9,
              "children": []
            }
          ]
        }
      ]
    }
  ]
}

{
  "name": "root",
  "children": [
    {
      "name": ".settings",
      "children": [
        {
          "name": "org.eclipse.core.resources.prefs",
          "val1": 0,
          "val2": 6,
          "children": []
        },
        {
          "name": "org.eclipse.jdt.core.prefs",
          "val1": 0,
          "val2": 18,
          "children": []
        }
      ]
    },
    {
      "name": "core",
      "children": [
        {
          "name": "license.txt",
          "val1": 0,
          "val2": 6,
          "children": []
        },
        {
          "name": "lsrc",
          "children": [
            {
              "name": "core.prefs",
              "val1": 0,
              "val2": 18,
              "children": []
            }
          ]
        }
      ]
    }
  ]
}
```
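
Before serving the visualization it can help to check a generated tree against the format above. The following is a small sketch of such a check; `check_tree` and the two sample trees are hypothetical helpers, and the sketch assumes that every node without children is a file.

```python
def check_tree(node, path="root"):
    """Return a list of problems found in a circle-packing tree (empty if none)."""
    problems = []
    if "name" not in node:
        problems.append(f"{path}: missing 'name'")
    children = node.get("children", [])
    if children:
        # Directory node: recurse into every child.
        for child in children:
            child_path = f"{path}/{child.get('name', '?')}"
            problems.extend(check_tree(child, child_path))
    else:
        # Leaf node (assumed to be a file): it must carry the metrics.
        for key in ("size", "weight", "ownership"):
            if key not in node:
                problems.append(f"{path}: missing '{key}'")
        ownership = node.get("ownership", 0)
        if not 0 <= ownership <= 1:
            problems.append(f"{path}: ownership {ownership} outside [0, 1]")
    return problems


good_tree = {"name": "root", "children": [
    {"name": "core", "children": [
        {"name": "Build.java", "size": 120, "weight": 120,
         "ownership": 0.4, "children": []},
    ]},
]}
bad_tree = {"name": "root", "children": [
    {"name": "Broken.java", "size": 10, "ownership": 1.5, "children": []},
]}

print(check_tree(good_tree))
print(check_tree(bad_tree))
```

An empty list means that every file node carries `size`, `weight` and an `ownership` between 0 and 1.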

### JSON Export

For exporting the data to JSON you can use the following snippet:

```
import json

with open("output.json", "w") as file:
    json.dump(tree, file, indent=4)
```

In [12]:
# Determine the main contributor -> look at the GitHub stats :) Shay Banon (kimchy)
def count_lines_total_and_by_authors(file_path, author_names):
    if os.getcwd() != repo_path:
        raise ValueError("You must be in the repository folder to run this function")

    # git blame needs the path relative to the repository root
    relative_file_path = os.path.relpath(file_path, repo_path)

    # Calling git with an argument list (no shell) avoids quoting problems with
    # spaces in paths or author names and behaves the same on Windows and Unix.
    blame = subprocess.run(
        ["git", "blame", "--line-porcelain", relative_file_path],
        capture_output=True, text=True, encoding="utf-8", errors="replace", check=True,
    )

    # --line-porcelain emits one "author <name>" header per blamed line;
    # the file content itself is prefixed with a tab, so it cannot match.
    wanted = {f"author {author}" for author in author_names}
    total_lines = 0
    author_lines = 0
    for line in blame.stdout.split("\n"):
        if line.startswith("author "):
            total_lines += 1
            if line in wanted:
                author_lines += 1

    return total_lines, author_lines
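
The counting idea can be tried without a repository by parsing a text in the shape of `git blame --line-porcelain` output. The `sample` below is shortened and made up (real output has full commit hashes and more header lines), and `lines_by_author` is a hypothetical helper.

```python
from collections import Counter


def lines_by_author(porcelain_text):
    """Count blamed lines per author in `git blame --line-porcelain` output."""
    counts = Counter()
    for line in porcelain_text.split("\n"):
        # Header lines look like "author <name>". File content is tab-prefixed,
        # and "author-mail"/"author-time" lack the space after "author".
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts


# Shortened, made-up sample in the shape of --line-porcelain output.
sample = "\n".join([
    "a1b2c3d4 1 1 1",
    "author Shay Banon",
    "author-mail <x@example.com>",
    "\tpackage org.elasticsearch;",
    "e5f6a7b8 2 2 1",
    "author kimchy",
    "\tauthor of this line is part of the content",
    "c9d0e1f2 3 3 1",
    "author Someone Else",
    "\tclass Build {}",
])

counts = lines_by_author(sample)
main_author_lines = counts["kimchy"] + counts["Shay Banon"]
total_lines = sum(counts.values())
print(main_author_lines, total_lines, main_author_lines / total_lines)
```

Summing the counts of both spellings treats `kimchy` and `Shay Banon` as one person, which is the assumption the notebook makes as well.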

In [None]:
# Test function
# fp = "c:\\school\\schweiz_UNI\\fss\\fss-se1\\se2\\elasticsearch\\core\\src\\main\\java\\org\\elasticsearch\\Build.java"
# count_lines_total_and_by_authors(fp, ["kimchy", "Shay Banon"])

In [20]:

gr = Git(repo_path)
commit = gr.get_commit_from_tag(to_tag)
gr.checkout(commit.hash)
files = {}
# change into the repo path (count_lines_total_and_by_authors requires it)
os.chdir(repo_path)
for file in tqdm(gr.files(), desc="Analyzing files", unit="file"):
    # lowercase the first character -> Windows fix: the path is returned with "C:" instead of "c:"
    file = file[0].lower() + file[1:]
    total, kimchy = count_lines_total_and_by_authors(file, ['kimchy', 'Shay Banon'])
    # store the result under the path relative to the repository root
    file_relative = os.path.relpath(file, repo_path)
    files[file_relative] = (total, kimchy)

os.chdir("..")

Analyzing files: 100%|██████████| 6059/6059 [1:58:06<00:00,  1.17s/file]   


In [26]:
def build_json_structure(data):
    def add_to_tree(base, parts, values):
        # walk down the directory nodes, creating the missing ones
        for part in parts[:-1]:
            found = next((item for item in base if item['name'] == part), None)
            if not found:
                found = {"name": part, "children": []}
                base.append(found)
            base = found['children']

        # Add the file as a child to the last directory
        total, kimchy = values
        ownership = kimchy / total if total else 0
        base.append({
            "name": parts[-1],
            # red: the main author wrote more than half of the lines
            "author_color": "red" if ownership > 0.5 else "green",
            "size": total,
            "weight": total,
            "ownership": ownership,
            "children": [],
        })

    tree = {"name": "root", "children": []}

    # Sort the paths for easier processing
    sorted_data = sorted(data.items())

    for path, values in sorted_data:
        # accept both Windows and Unix path separators
        parts = path.replace('\\', '/').split('/')
        add_to_tree(tree['children'], parts, values)

    return tree


In [27]:
# Check if the JSON file already exists
write_output = True
if os.path.exists("output.json"):
    overwrite = input("The JSON file already exists. Do you want to overwrite it? (y/n): ")
    write_output = overwrite.lower() == "y"

if write_output:
    # Generate the JSON and save it to file
    json_structure = build_json_structure(files)
    with open("output.json", "w") as file:
        json.dump(json_structure, file, indent=2)

    print("Now run your local python server by `python3 -m http.server` and open the knowledge_loss.html file in your browser")
    print("http://localhost:8000/knowledge_loss.html")
else:
    # exit() would shut down the notebook kernel, so the export is only skipped
    print("JSON file not overwritten.")

Now run your local python server and open the knowledge_loss.html file in your browser
http://localhost:8000/knowledge_loss.html


## Task 3: Code Churn Analysis



The third and last task is to analyze the code churn of the _elasticsearch_ project. For this analysis we look at the code churn, meaning the daily change in the total number of lines of the project.

Visualize the code churn over time bucketing the data by day. Remember that you'll need to consider also the days when there are no commits.
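
Filling in the days without commits can be sketched with the standard library alone. The `per_day` values and the date range below are made up; `daily_series` is a hypothetical helper.

```python
from datetime import date, timedelta


def daily_series(per_day, start, end):
    """Return {ISO date: value} for every day in [start, end], defaulting to 0."""
    series = {}
    current = start
    while current <= end:
        key = current.isoformat()
        series[key] = per_day.get(key, 0)
        current += timedelta(days=1)
    return series


# Made-up churn values (lines added minus lines removed) for two days.
per_day = {"2015-06-09": 120, "2015-06-11": -40}
series = daily_series(per_day, date(2015, 6, 9), date(2015, 6, 12))
print(series)
```

Days without commits appear with a churn of 0, so the plotted line drops to the baseline instead of skipping those days.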

Look at the churn trend over time, identify one outlier, and for it:

* investigate if it was caused by a single or multiple commits (since you are bucketing the data by day)
* find the hash of the involved commit(s)
* find the involved files, and for each file look at the number of lines added and/or deleted as well as the modification type (addition, deletion, modification, renaming)
* look at the commit messages

Based on the above, discuss the potential reasons for the outlier and if it should be a reason for concern.
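
One simple way to shortlist outlier candidates is to flag the days whose churn exceeds a multiple of the daily mean. This is only a sketch: the factor of 4 is an arbitrary assumption, the `churn` values are made up, and `find_outliers` is a hypothetical helper.

```python
from statistics import mean


def find_outliers(series, factor=4.0):
    """Return the days whose value is larger than `factor` times the mean."""
    threshold = factor * mean(series.values())
    return {day: value for day, value in series.items() if value > threshold}


# Made-up daily churn (total changed lines per day).
churn = {
    "2015-08-01": 10,
    "2015-08-02": 12,
    "2015-08-03": 500,
    "2015-08-04": 8,
    "2015-08-05": 10,
}
outliers = find_outliers(churn)
print(outliers)
```

A mean-based threshold is easy to explain but is itself pulled up by very large days, so a median-based rule may be worth comparing.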

### Generating the Timegraph

In [4]:
all_commits = [commit for commit in repo.traverse_commits()]

In [17]:
analysis = {
    'addition': defaultdict(int),
    'deletion': defaultdict(int),
}
# reuse the commits collected above instead of traversing the repository again
for commit in tqdm(all_commits, desc="Analyzing commits", unit="commit"):
    # use committer_date, otherwise there are also dates that are outside the analyzed range
    date = commit.committer_date.strftime("%Y-%m-%d")
    analysis['addition'][date] += commit.insertions
    analysis['deletion'][date] += commit.deletions


Analyzing commits: 100%|██████████| 1746/1746 [04:23<00:00,  6.62commit/s]


In [18]:
# fill the gaps in the daily time series

def fill_missing_days(serie, data_from, data_to):
    # create a new dict with one entry (0) for every day in the range,
    # then copy over the values of the days that have commits
    new_serie = {}
    for date in pd.date_range(data_from.date(), data_to.date()):
        new_serie[date.strftime("%Y-%m-%d")] = 0

    for date, value in serie.items():
        new_serie[date] = value

    return new_serie

In [19]:
analysis_e = {}
analysis_e['addition'] = fill_missing_days(analysis['addition'], since_time, to_time)
analysis_e['deletion'] = fill_missing_days(analysis['deletion'], since_time, to_time)


In [20]:
# sanity checks: the dates must be sorted and both series must have the same length
date_format = '%Y-%m-%d'
for series_name in analysis_e:
    prev_key = since_time.strftime(date_format)
    for key in analysis_e[series_name]:
        assert datetime.strptime(key, date_format) >= datetime.strptime(prev_key, date_format), "The dates are not sorted"
        prev_key = key
assert len(analysis_e['addition']) == len(analysis_e['deletion']), "The length of the series is not the same"

In [63]:
keys = list(analysis_e['addition'].keys())
df = {}
for i in keys:
    df[i] = [analysis_e['addition'][i], analysis_e['deletion'][i]]
df = pd.DataFrame(df).transpose()
df.columns = ['addition', 'deletion']

In [24]:
# 'df' is the DataFrame with the 'addition' and 'deletion' columns, indexed by date
def create_graph(df, important_days=()):
    datatimeseries = pd.to_datetime(df.index)

    # Create a figure with a single plot (the range slider serves as the overview)
    fig = make_subplots(rows=1, cols=1)

    # Add 'addition' trace
    fig.add_trace(
        go.Scatter(x=datatimeseries, y=df['addition'], mode='lines', name='Addition', line=dict(color='green')),
    )

    # Add 'deletion' trace
    fig.add_trace(
        go.Scatter(x=datatimeseries, y=df['deletion'], mode='lines', name='Deletion', line=dict(color='red')),
    )

    # Update layout
    fig.update_layout(
        height=600, width=800, title_text="Time Series Analysis (blue lines are the interesting days that are further analyzed)",
        xaxis=dict(
            rangeselector=dict(
                buttons=list([
                    dict(count=1, label="1m", step="month", stepmode="backward"),
                    dict(count=6, label="6m", step="month", stepmode="backward"),
                    dict(step="all")
                ])
            ),
            rangeslider=dict(
                visible=True
            ),
            type="date",
            tickformat="%Y-%m-%d"
        ),
        xaxis_title="Time",
        yaxis_title="Values"
    )

    # Highlight important days with vertical lines
    for day in important_days:
        if day in datatimeseries:
            fig.add_shape(
                go.layout.Shape(
                    type="line",
                    x0=day,
                    x1=day,
                    y0=min(df['deletion'].min(), df['addition'].min()),
                    y1=max(df['deletion'].max(), df['addition'].max()),
                    line=dict(color='blue', width=2, dash="dash"),
                )
            )

    # Show the plot
    fig.show()

In [61]:
# Find the dates automatically -> less tedious ('mean' flags days whose additions or deletions exceed 4x the daily mean; 'rolling' flags days whose total grows by more than 50% of the previous 6-day rolling sum)
def generate_outliners_df(analysis_e, method='mean'):
    df_analysis = pd.DataFrame(analysis_e)
    df_analysis['total'] = df_analysis['addition'] + df_analysis['deletion']

    if(method == 'mean'):
        mean_a = df_analysis['addition'].mean()*4
        mean_d = df_analysis['deletion'].mean()*4
        df_outliners = df_analysis[(df_analysis['addition'] > mean_a) | (df_analysis['deletion'] > mean_d)]
        return df_outliners
    
    elif(method == 'rolling'):
        df_analysis['rolling_total'] = df_analysis['total'].rolling(window=6).sum()
        df_analysis['diff'] = df_analysis['total'].diff()

        # divide diff by the previous rolling total
        df_analysis['diff_relative'] = df_analysis['diff'] / df_analysis['rolling_total'].shift(1)

        df_analysis.replace([np.inf, -np.inf], np.nan, inplace=True)
        df_outliners = df_analysis[df_analysis['diff_relative'] > 0.5]

        return df_outliners

df_outliners = generate_outliners_df(analysis_e=analysis_e, method ='mean')
outliner_days = df_outliners.index
df_outliners

| Date | addition | deletion | total |
|---|---|---|---|
| 2015-06-13 | 236 | 8104 | 8340 |
| 2015-06-17 | 12470 | 2531 | 15001 |
| 2015-06-23 | 24314 | 1133 | 25447 |
| 2015-06-24 | 20299 | 8823 | 29122 |
| 2015-06-29 | 5544 | 4574 | 10118 |
| 2015-07-21 | 21873 | 3054 | 24927 |
| 2015-08-03 | 58452 | 22592 | 81044 |
| 2015-08-06 | 12803 | 5662 | 18465 |


In [62]:
# display the previous graph with filtered days displayed as important
create_graph(df, important_days=df_outliners.index)

### Deep dive into hotspots

In [75]:
# Create 2 arrays -> one for files, one for commits for important days
commit_data = []        
files_in_important_days = []     


# Iterate over each day and commit
for day in tqdm(outliner_days, 'Analyzing commits', unit='day'):
    commit_count = 0

    for commit in all_commits:
        if commit.committer_date.strftime("%Y-%m-%d") == day:
            modified_files = 0
            modified_lines = 0
            for file in commit.modified_files:
                files_in_important_days.append({
                    'Date': commit.committer_date.strftime("%Y-%m-%d"), 
                    'Commit_Hash': commit.hash, 'File': file.filename, 
                    'Additions': file.added_lines, 'Deletions': file.deleted_lines, 
                    'Total': file.added_lines + file.deleted_lines,
                    'Modification_Type': file.change_type.name,
                    'nloc': file.nloc,
                    })
                
                modified_files += 1
                modified_lines += file.added_lines + file.deleted_lines

            commit_data.append({'Date': day, 'Commit_Hash': commit.hash, 'Modified_Files': modified_files, 'Modified_Lines': modified_lines})
            commit_count += 1

# Convert the list to a DataFrame
commit_df = pd.DataFrame(commit_data)
files_in_important_days_df = pd.DataFrame(files_in_important_days)

Analyzing commits: 100%|██████████| 8/8 [00:43<00:00,  5.39s/day]


In [None]:
# Summarize the commits of each outlier day
summary = commit_df.groupby('Date').agg(
    Total_Commits=pd.NamedAgg(column='Commit_Hash', aggfunc='count'),
    Total_Modified_Files=pd.NamedAgg(column='Modified_Files', aggfunc='sum'),
    Total_Modified_Lines=pd.NamedAgg(column='Modified_Lines', aggfunc='sum')
)
# Display the summary
print(summary)

# select the commits of one day, sort them by modified lines and keep the largest ones until 75% of the day's changes are covered
# it can only help to understand the changes, but not to find the most important commits
def get_most_important_commits_per_day(df, day):
    df_day = df[df['Date'] == day].sort_values(by=['Modified_Lines'], ascending=False)
    df_day['cumsum'] = df_day['Modified_Lines'].cumsum()
    df_day['cumsum_percentage'] = df_day['cumsum'] / df_day['Modified_Lines'].sum()
    # keep every commit up to and including the one that crosses the 75% mark
    covered_before = df_day['cumsum_percentage'].shift(fill_value=0)
    return df_day[covered_before < 0.75]

get_most_important_commits_per_day(commit_df, '2015-08-03')

In [68]:
# helper for the per-day graphs: select each day's top files across all commits
# that together explain 75% of all changed lines
def get_most_important_files_per_day(dataframe, day):
    df = dataframe[dataframe['Date'] == day]
    df = df.sort_values(by=['Total'], ascending=False)
    df['cumsum'] = df['Total'].cumsum()
    df['cumsum_percentage'] = df['cumsum'] / df['Total'].sum()

    # keep every file up to and including the one that crosses the 75% mark
    df = df[df['cumsum_percentage'].shift(fill_value=0) < 0.75]
    # # if there are more than 30 files select first 30
    # if(len(df) > 30):
    #     df = df.head(30)

    return df
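
The 75% selection idea can also be written in plain Python. In this sketch the item that crosses the threshold is included, so a single dominant file is still reported; `top_until_share` and the file names are hypothetical.

```python
def top_until_share(items, share=0.75):
    """Return the largest items, in order, until they cover `share` of the total."""
    total = sum(value for _, value in items)
    selected, running = [], 0
    for name, value in sorted(items, key=lambda kv: kv[1], reverse=True):
        # Stop once the already selected items cover the requested share.
        if total == 0 or running / total >= share:
            break
        selected.append((name, value))
        running += value
    return selected


# Made-up (file, changed lines) pairs for one day.
changes = [("C.java", 15), ("A.java", 50), ("D.java", 5), ("B.java", 30)]
top_files = top_until_share(changes)
print(top_files)
```

With these numbers `A.java` and `B.java` together cover 80% of the changes, so the remaining files are dropped.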

In [69]:
def plot_file_changes(unique_dates, files_in_important_days_df):
    # Create subplots, one for each date (two per row)
    rows = math.ceil(len(unique_dates) / 2)
    fig = make_subplots(rows=rows, cols=2, subplot_titles=list(unique_dates))

    # Iterate over each date and plot the bar graph
    for i, date in enumerate(unique_dates):
        row = (i // 2) + 1
        col = (i % 2) + 1


        # Filter data for the current date
        df_date = get_most_important_files_per_day(files_in_important_days_df, date)

        # Add the bar graph to the subplot
        fig.add_trace(go.Bar(
            x=df_date['File'],
            y=df_date['Additions'],
            name='Additions',
            marker_color='green'
        ), row=row, col=col)

        fig.add_trace(go.Bar(
            x=df_date['File'],
            y=df_date['Deletions'],
            name='Deletions',
            marker_color='red'
        ), row=row, col=col)

    # Update layout
    fig.update_layout(height=400 *rows, width=1200, showlegend=True, title_text="File Changes per Day")
    # label the x axis of the bottom-left subplot (the grid only has `rows` rows) and the y axes of the left column
    fig.update_xaxes(title_text="File Name", row=rows, col=1)
    fig.update_yaxes(title_text="Count", col=1)

    # Show the figure
    fig.show()

plot_file_changes(outliner_days, files_in_important_days_df)

In [54]:
# per day, sum the additions and deletions over the files and compare them with the commit-level totals (shows how many changes are not explained by the file-level data)

summary_commits = files_in_important_days_df.groupby('Date').agg(
    Total=pd.NamedAgg(column='Total', aggfunc='sum'),
    Additions=pd.NamedAgg(column='Additions', aggfunc='sum'),
    Deletions=pd.NamedAgg(column='Deletions', aggfunc='sum')
)

# merge summary_commits and df_outliners by index
all_changes = df_outliners.merge(summary_commits, left_index=True, right_index=True)
all_changes

# Overview: the first 3 columns are based on the commit-level insertions/deletions,
# the last 3 columns are based on the file-level additions/deletions.
# The difference cannot be explained by the file analysis -> maybe moved files?
# Another plausible cause are merge commits, for which PyDriller reports no modified files.

| Date | addition | deletion | total | Total | Additions | Deletions |
|---|---|---|---|---|---|---|
| 2015-06-13 | 236 | 8104 | 8340 | 8340 | 236 | 8104 |
| 2015-06-17 | 12470 | 2531 | 15001 | 7797 | 7702 | 95 |
| 2015-06-23 | 24314 | 1133 | 25447 | 14102 | 12989 | 1113 |
| 2015-06-24 | 20299 | 8823 | 29122 | 4898 | 2302 | 2596 |
| 2015-06-29 | 5544 | 4574 | 10118 | 5211 | 3418 | 1793 |
| 2015-07-21 | 21873 | 3054 | 24927 | 11644 | 10192 | 1452 |
| 2015-08-03 | 58452 | 22592 | 81044 | 6267 | 3238 | 3029 |
| 2015-08-06 | 12803 | 5662 | 18465 | 9061 | 6310 | 2751 |


In [94]:
modifications_df = files_in_important_days_df.groupby(['Date','Modification_Type']).agg(
    Total=pd.NamedAgg(column='Total', aggfunc='sum'),
    Additions=pd.NamedAgg(column='Additions', aggfunc='sum'),
    Deletions=pd.NamedAgg(column='Deletions', aggfunc='sum'),
    sum_NLOC=pd.NamedAgg(column='nloc', aggfunc='sum')
)
modifications_df.reset_index(inplace=True)

# Converting 'Date' to datetime
# modifications_df['Date'] = pd.to_datetime(modifications_df['Date'])

# Creating a bar chart with Plotly
fig = px.bar(modifications_df, x='Date', y='sum_NLOC', color='Modification_Type', title='Modification Types Over Time per NLOC',
             category_orders={"Date": modifications_df['Date'].sort_values().unique()},
             barmode='group',
             log_y=True
             )


fig.show()

As we can see, there are a lot of renames on 3 August, which also makes the analysis harder: a rename involves two paths (the old and the new one), so these modifications are spread over many files. As a result, many small files are needed to explain 75% of the changes.