# Preprocessing

In [94]:
%pip install pydriller --upgrade
%pip install ujson --upgrade
%pip install numpy --upgrade
%pip install scipy --upgrade

Note: you may need to restart the kernel to use updated packages.


In [95]:
from pydriller import Repository, Git
import numpy as np
from scipy.sparse import csc_matrix, vstack
from datetime import timedelta, datetime
import os
import ujson
from collections import deque

In [96]:
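# Clone the repository next to the notebook so the analysis stays portable.
# `git clone` exits with status 128 when the target directory already exists
# and is not empty, so only clone when the directory is missing.
import subprocess

url = "https://github.com/apache/kafka"
repo_path = os.path.join(os.getcwd(), 'kafka')

if not os.path.isdir(repo_path):
    subprocess.run(["git", "clone", url, repo_path], check=True)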

In [97]:
# checkout the tag 3.6.0
os.chdir(repo_path)
os.system("git checkout 3.6.0")
# back to the "home" folder
os.chdir("..")

# Exercise 3

In [98]:
# Configuration

# Timeframes to look at
timeframes = [168, 72, 48, 24]

temporal_coupling_save_file = "./ex_3_temporal_coupling.json"
logical_coupling_save_file = "./ex_3_logical_coupling.json"

repo_path = "./kafka"

In [99]:
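# Get all files currently in the repository (paths relative to the repo root)
files = [os.path.relpath(entity, repo_path) for entity in Git(repo_path).files()]

number_of_files = len(files)
# Note: commits_since is not passed to Repository below, so the full history is traversed
commits_since = datetime(2023, 9, 1, 0, 0, 0)

commits = list(Repository(repo_path).traverse_commits())

# Largest timeframe first: the sliding-window loop below relies on index 0 being the widest window
timeframes.sort(reverse=True)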

In [None]:
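# Sliding window over the commit history: every commit c1 is paired with each
# earlier commit c2 that lies inside a timeframe, and both modification
# vectors are recorded for that timeframe.
commit_window = deque()
temporal_update_vectors = [{'v': [], 'w': []} for _ in timeframes]
logical_update_vectors = []

for commit in commits:
    # new_path is None for deleted files, so skip those entries
    modifications = [file.new_path for file in commit.modified_files if file.new_path is not None]
    mask = np.isin(files, modifications)
    indices = np.where(mask)[0]
    # 1 x len(files) sparse row vector with a 1 for every file touched by this commit
    c1_modifications = csc_matrix((np.ones_like(indices), (np.zeros_like(indices), indices)), shape=(1, len(files)))
    c1_time = commit.committer_date
    c1 = {"time": c1_time, "modifications": c1_modifications}
    inside_timeframe = [False for _ in timeframes]
    i = 0
    logical_update_vectors.append(c1_modifications)
    while i < len(commit_window):
        c2 = commit_window[i]
        delta = c1_time - c2["time"]
        for idx, timeframe in enumerate(timeframes):
            # The window is ordered oldest to newest, so once one commit is inside
            # a timeframe, all following commits are treated as inside it as well
            if inside_timeframe[idx] or delta <= timedelta(hours=timeframe):
                temporal_update_vectors[idx]['v'].append(c1_modifications)
                temporal_update_vectors[idx]['w'].append(c2["modifications"])
                inside_timeframe[idx] = True
        if inside_timeframe[0]:
            i += 1
        else:
            # c2 is older than the widest timeframe and can never match again
            commit_window.popleft()
    commit_window.append(c1)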
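
The windowing idea in the loop above can be tried on toy data. The following is a minimal sketch, not the notebook's actual code: it handles a single timeframe, and the helper name `pairs_within` and the commit ids are made up for illustration. It assumes the commits are sorted from oldest to newest.

```python
from collections import deque
from datetime import datetime, timedelta


def pairs_within(commits, hours):
    """Return (earlier_id, later_id) pairs whose timestamps differ by at most `hours`.

    `commits` is a list of (commit_id, time) tuples, assumed sorted oldest first.
    """
    window = deque()
    pairs = []
    limit = timedelta(hours=hours)
    for commit_id, time in commits:
        # Drop commits that are too old to pair with this or any later commit
        while window and time - window[0][1] > limit:
            window.popleft()
        # Everything left in the window is inside the timeframe
        pairs.extend((old_id, commit_id) for old_id, _ in window)
        window.append((commit_id, time))
    return pairs


base = datetime(2023, 9, 1)
toy_commits = [
    ("a", base),
    ("b", base + timedelta(hours=10)),
    ("c", base + timedelta(hours=30)),
    ("d", base + timedelta(hours=100)),
]

pairs_24 = pairs_within(toy_commits, 24)  # [("a", "b"), ("b", "c")]
pairs_48 = pairs_within(toy_commits, 48)  # [("a", "b"), ("a", "c"), ("b", "c")]
```

Because old commits are removed from the left of the deque, each commit is only compared against the commits still inside the widest window rather than against the whole history.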

In [None]:
# For each timeframe compute a temporal coupling matrix
temporal_matrices = [vstack(update_matrix['v']).transpose() @ vstack(update_matrix['w']) for update_matrix in temporal_update_vectors]
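
To see why the product of the stacked vectors counts coupled commits, here is a small sketch with dense NumPy arrays instead of sparse matrices. The three-file setup and the values of `V` and `W` are invented for illustration; row `k` of `V` and row `k` of `W` stand for one (later commit, earlier commit) pair.

```python
import numpy as np

# Each row is a 0/1 modification vector over three files.
V = np.array([
    [1, 0, 0],  # later commit touched file 0
    [1, 0, 0],  # later commit touched file 0
    [0, 1, 1],  # later commit touched files 1 and 2
])
W = np.array([
    [0, 1, 0],  # earlier commit touched file 1
    [0, 1, 0],  # earlier commit touched file 1
    [1, 0, 0],  # earlier commit touched file 0
])

# C[i, j] = number of pairs where the later commit touched file i
# and the earlier commit touched file j.
C = V.T @ W
```

In this toy case `C[0, 1]` is 2 while `C[1, 0]` is 1, so the resulting matrix is direction-dependent and in general not symmetric.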

In [None]:
# Compute the logical coupling matrix and joint commits vector
logical_coupling_matrix = vstack(logical_update_vectors).transpose() @ vstack(logical_update_vectors)

joint_commits_vector = logical_coupling_matrix.sum(axis=0).A1 - logical_coupling_matrix.diagonal()
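
The same computation can be checked by hand on a tiny example. This is a sketch with a dense NumPy array and an invented three-commit history, not the notebook's sparse data: the diagonal of the product holds the number of commits per file, and subtracting it from the column sums leaves the number of co-changes with other files.

```python
import numpy as np

# Rows are commits, columns are files (1 = file modified in that commit).
M = np.array([
    [1, 1, 0],  # commit 1 touched files 0 and 1
    [1, 1, 0],  # commit 2 touched files 0 and 1
    [1, 0, 1],  # commit 3 touched files 0 and 2
])

# L[i, j] = number of commits that touched both file i and file j
L = M.T @ M

# Column sums minus the diagonal: co-changes of each file with all other files
joint = L.sum(axis=0) - np.diagonal(L)
```

Here files 0 and 1 share two commits (`L[0, 1] == 2`), and file 0 has three co-changes with other files in total.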

In [None]:
# Convert the temporal coupling matrix to the required format
temporal_coupling = {}
for idx, matrix in enumerate(temporal_matrices):
    for row, col in zip(*matrix.nonzero()):
        if row == col:
            continue
        val = int(matrix[row, col])
        key = f"{row}_{col}"
        if key not in temporal_coupling:
            temporal_coupling[key] = {
                "file_pair": [
                    files[row],
                    files[col]
                ],
                "coupled_commits": [
                    {
                        "time_window": timeframe,
                        "commit_count": 0
                    } for timeframe in reversed(timeframes)
                ]
            }
        temporal_coupling[key]["coupled_commits"][len(timeframes) - 1 - idx]["commit_count"] = val

In [None]:
# Convert the logical coupling matrix and vector to the required format
logical_coupling = []
for row, col in zip(*logical_coupling_matrix.nonzero()):
    # We just need to look at the upper triangle because the matrix is symmetric
    if row >= col:
        continue
    val = int(logical_coupling_matrix[row, col])
    file_name_1 = files[row]
    file_name_2 = files[col]
    
    logical_coupling.append({
        "file_pair": [file_name_1, file_name_2],
        "logical_coupling": {
            "Joint": val,
            file_name_1: int(joint_commits_vector[row]) - val,
            file_name_2: int(joint_commits_vector[col]) - val
        }
    })

In [None]:
# Saving temporal couplings
with open(temporal_coupling_save_file, "w") as f:
    ujson.dump(list(temporal_coupling.values()), f, indent=4)

In [None]:
# Saving logical couplings
with open(logical_coupling_save_file, "w") as f:
    ujson.dump(logical_coupling, f, indent=4)

# Analysis

## 1. KafkaApis.scala and ReplicaManager.scala
**KafkaApis.scala**:        core\src\main\scala\kafka\server\KafkaApis.scala
**ReplicaManager.scala**:   core\src\main\scala\kafka\server\ReplicaManager.scala

The files *KafkaApis.scala* and *ReplicaManager.scala* have a high temporal coupling, with 103 commits in a 24-hour timeframe, 188 commits in a 48-hour timeframe, 239 commits in a 72-hour timeframe and 519 commits in a 168-hour timeframe. There is also a high logical coupling, with 147 logically coupled commits. These commits include various keywords such as 'add(ed)', 'fix', 'refactor(ing)', 'feature' and 'bug', without one being significantly more frequent than the rest (TODO: ADD IS MORE).

A sample of these commits in the issue tracking system also shows various types of issues, ranging from *Bug* to *Improvement*. This lookup is possible because most commit messages begin with an issue tracking ID. For example, commit [45c8195fa14c766b200c720f316836dbb84e9d8b](https://github.com/apache/kafka/commit/45c8195fa14c766b200c720f316836dbb84e9d8b) is a sub-task of the issue [KAFKA-3259](https://issues.apache.org/jira/browse/KAFKA-3259), which is of type *Improvement*. The commit [660c0c0aa33ced5307ee70bfdb78ebde4b978d73](https://github.com/apache/kafka/commit/660c0c0aa33ced5307ee70bfdb78ebde4b978d73), on the other hand, is of type *Bug*.

We also took a closer look at the files themselves. In *KafkaApis.scala*, the class `KafkaApis` has a member variable that is an instance of `ReplicaManager`, which is defined in *ReplicaManager.scala*. Changing the method `appendRecords` in *ReplicaManager.scala* on lines 611-639 in the commit [56dcb837a2f1c1d8c016cfccf8268a910bb77a36](https://github.com/apache/kafka/commit/56dcb837a2f1c1d8c016cfccf8268a910bb77a36#diff-78812e247ffeae6f8c49b1b22506434701b1e1bafe7f92ef8f8708059e292bf0) required propagating the changes up to or down from *KafkaApis.scala*, demonstrating an architectural dependency between the two files.
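
The keyword and issue-ID inspection described in this section could be automated along the following lines. This is a minimal sketch, not the procedure actually used for the analysis: the regular expression assumes the `KAFKA-<number>` prefix convention, the keyword list is taken from the text, and the sample messages are made up for illustration.

```python
import re
from collections import Counter

# Assumption: issue IDs look like "KAFKA-1234" and sit at the start of the message
ISSUE_ID = re.compile(r"^\s*(KAFKA-\d+)")
KEYWORDS = ("add", "fix", "refactor", "feature", "bug")


def issue_id(message):
    """Return the leading issue ID of a commit message, or None if there is none."""
    match = ISSUE_ID.match(message)
    return match.group(1) if match else None


def keyword_counts(messages):
    """Count in how many messages each keyword occurs (case-insensitive substring match)."""
    counts = Counter()
    for message in messages:
        lowered = message.lower()
        counts.update(keyword for keyword in KEYWORDS if keyword in lowered)
    return counts


# Hypothetical commit messages, only for demonstration
messages = [
    "KAFKA-3259: Add an example option",
    "MINOR: Fix typo in docs",
    "KAFKA-3259: Refactor and fix example handling",
]

ids = [issue_id(m) for m in messages]
counts = keyword_counts(messages)
```

Substring matching means that 'add' also matches 'added' and 'refactor' also matches 'refactoring', which mirrors the keyword variants listed above but can overcount words that merely contain a keyword.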



## 2. KafkaApis.scala and KafkaConsumer.java
**KafkaApis.scala**:        core\src\main\scala\kafka\server\KafkaApis.scala
**KafkaConsumer.java**:     clients\src\main\java\org\apache\kafka\clients\consumer\KafkaConsumer.java

The files *KafkaApis.scala* and *KafkaConsumer.java* have a relatively high temporal coupling, with 84 commits in a 24-hour timeframe, 137 commits in a 48-hour timeframe, 183 commits in a 72-hour timeframe and 390 commits in a 168-hour timeframe. In contrast, there seems to be only a slight logical coupling between the two files. There are just 24 logically coupled commits, whereas *KafkaConsumer.java* has 4644 commits with other files and *KafkaApis.scala* has 10481 commits with other files.

The reason behind the high temporal coupling proved difficult to ascertain from the temporally coupled commits alone. Temporally coupled files may not be coupled in an obvious way, so understanding their influence on each other requires a deep understanding of the system. Additionally, two files can be temporally coupled even if they do not heavily influence each other, because the coupled commits arose purely by chance. This is especially likely if the files are modified frequently. We cannot rule out that the files affect each other, but the fact that both have very large commit counts with other files and are only weakly logically coupled speaks against it.

On the other hand, the file names could imply that *KafkaConsumer* consumes *KafkaApis*, so that the two share an architectural dependency. The logically coupled commit [69645f1fe5103adb00de6fa43152e7df989f3aea](https://github.com/apache/kafka/commit/69645f1fe5103adb00de6fa43152e7df989f3aea#diff-4537fba2845d55d73736763b6f555e04fb21cd56f183320eda45c92a3a52f11d) would support this. There seems to be a connection between the change in *KafkaConsumer.java* on lines 2314-2338 and the change in *KafkaApis.scala* on line 1693, where an optional join group request reason field is added in both files.
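
To put the weak logical coupling into proportion, the counts quoted in this section can be turned into a rough share. This is only a sketch: the helper `coupling_share` is made up for illustration, and since the two larger numbers count commits with other files rather than distinct commits, the result is a coarse indicator and not a proper Jaccard index.

```python
def coupling_share(joint, other_a, other_b):
    """Share of joint commits among all counted commits of both files."""
    total = joint + other_a + other_b
    return joint / total if total else 0.0


# Counts as reported in the text for KafkaConsumer.java and KafkaApis.scala
share = coupling_share(24, 4644, 10481)
print(f"{share:.4%}")
```

With these counts the joint commits make up well under one percent of the total, which fits the reading that the logical coupling between the two files is slight.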

## 3. Filename1 and Filename2 (TODO)
**Filename1**:  
**Filename2**: 