# Legacy System Excavation

This notebook offers an archaeology-inspired analysis of a software repository.

The main idea is to reveal the different *epochs* that make up a software system by visualizing its historical layers.
To do this, directories and files are color-coded based on their creation date (specifically, the date of their first Git commit).
This allows you to spot the foundational structures of the codebase and trace how the system evolved over time, across all source code elements.

## 1. Setup and Configuration

First, we import the necessary Python libraries and configure the path to the repository we want to analyze. 

**Important:** You must update the `REPO_PATH` variable to the absolute path of the Git repository you want to analyze, and set the `FILE_EXT` variable to match the file extension used by the programming language in that repository.

In [1]:
# --- Dependencies ---
import os
import subprocess
import plotly.graph_objects as go
from datetime import datetime
import sys
from collections import defaultdict

# --- Configuration ---
REPO_PATH = "../../../dropover-at/"
FILE_EXT = ".java"

absolute_repo_path = os.path.abspath(REPO_PATH)

print(f"Repository to analyze: {absolute_repo_path}")

Repository to analyze: /mnt/c/dev/repos/dropover-at


## 2. Helper Functions

These functions are the building blocks of our analysis. They handle the low-level tasks of interacting with Git and the file system.

In [2]:
def get_file_creation_time(filepath):
    """Gets the creation timestamp of a file from its first Git commit."""
    try:
        cmd = ['git', 'log', '--diff-filter=A', '--follow', '-1', '--format=%ct', os.path.basename(filepath)]
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, cwd=os.path.dirname(filepath))
        stdout = result.stdout.strip()
        return int(stdout) if stdout else int(os.path.getmtime(filepath))
    except (subprocess.CalledProcessError, FileNotFoundError, ValueError):
        return int(os.path.getmtime(filepath))

def get_directory_creation_time(file_creation_times, dir_path):
    """Returns the oldest creation timestamp of any file below dir_path, or 0 if there is none."""
    # Append the separator so that '/repo/src' does not also match '/repo/src2/...'.
    prefix = dir_path.rstrip(os.sep) + os.sep
    times = [created for path, created in file_creation_times.items() if path.startswith(prefix)]
    return min(times, default=0)

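def count_lines(filepath):
    """Counts the lines of a file; returns 0 if the file cannot be read."""
    try:
        with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
            # Iterate lazily instead of loading the whole file into memory.
            return sum(1 for _ in f)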
    except Exception:
        return 0
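
One detail in the directory lookup deserves care: a plain string-prefix test treats `/repo/src` as a parent of `/repo/src2/Main.java`. The following is a minimal sketch of a separator-aware check. `is_inside` is a hypothetical helper name that is not part of the notebook, and `posixpath` is used only so that the example behaves the same on every platform.

```python
import posixpath

def is_inside(path, dir_path, sep=posixpath.sep):
    """Return True if `path` lies inside `dir_path` (hypothetical helper)."""
    # Appending the separator prevents '/repo/src' from matching '/repo/src2/...'.
    return path.startswith(dir_path.rstrip(sep) + sep)

# Made-up sample data: two sibling directories that share a name prefix.
times = {
    "/repo/src/A.java": 1_500_000_000,
    "/repo/src2/B.java": 1_400_000_000,
}

# Oldest timestamp among the files that really live under /repo/src.
oldest = min(t for p, t in times.items() if is_inside(p, "/repo/src"))
print(oldest)
```

Without the separator, the older file from `/repo/src2` would be attributed to `/repo/src` and make that directory look older than it is.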

## 3. Data Analysis

### 3.1. File Discovery
In this step, we walk through the entire repository to find every source code file. For each file, we retrieve its creation timestamp from Git and count its lines.

*Hint: This can take a very long time for huge repositories, because a separate `git log` process is started for every file!*
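
If the per-file lookup is too slow, one possible alternative is to read all creation dates from a single `git log` run. The sketch below is not the method used in this notebook, and both function names are hypothetical. It assumes that a `\x01` marker byte (written by Git for `%x01`) never starts a file path. Unlike the per-file lookup it does not follow renames, and it returns paths relative to the repository root, so they would have to be joined with `absolute_repo_path` before use. Paths with unusual characters may also be quoted by Git, depending on the `core.quotePath` setting.

```python
import subprocess

MARKER = "\x01"  # produced by git for %x01; marks the timestamp lines

def parse_creation_log(log_output):
    """Map each repo-relative path to the earliest commit time that added it."""
    creation_times = {}
    current_ts = None
    for line in log_output.split("\n"):
        if line.startswith(MARKER):
            # A new commit begins: remember its committer timestamp.
            current_ts = int(line[len(MARKER):])
        elif line and current_ts is not None:
            # A path added by the current commit; keep the earliest time seen.
            previous = creation_times.get(line, current_ts)
            creation_times[line] = min(previous, current_ts)
    return creation_times

def get_all_creation_times(repo_path):
    """Run one `git log` for the whole repository (hypothetical helper)."""
    cmd = ["git", "log", "--diff-filter=A", "--name-only", "--format=%x01%ct"]
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, cwd=repo_path)
    return parse_creation_log(result.stdout)

# Parsing a hand-written sample that mimics the expected output layout.
sample = "\x011700000000\n\nsrc/B.java\n\n\x011600000000\n\nsrc/A.java\nsrc/B.java\n"
parsed = parse_creation_log(sample)
print(parsed)
```

Keeping the parsing separate from the `subprocess` call makes it possible to check the logic without a real repository.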

In [None]:
file_creation_times = {}
file_line_counts = {}
dir_paths = set()

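for dirpath, dirnames, filenames in os.walk(absolute_repo_path):
    # Prune Git's metadata directory in place so that os.walk never descends into it.
    # (A substring test on the path would also skip directories such as '.github'.)
    dirnames[:] = [d for d in dirnames if d != '.git']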
    dir_paths.add(dirpath)
    for filename in filenames:
        if filename.endswith(FILE_EXT):
            filepath = os.path.join(dirpath, filename)
            file_creation_times[filepath] = get_file_creation_time(filepath)
            file_line_counts[filepath] = count_lines(filepath)
            
print(f"Found {len(file_creation_times)} {FILE_EXT} files in {len(dir_paths)} directories.")

### 3.2. Directory Sizes Calculation

To ensure the treemap renders correctly, we must pre-calculate the total size (line count) of each directory. We do this by summing the line counts of all files contained within each directory and its subdirectories.

In [None]:
dir_sizes = defaultdict(int)
for file_path, line_count in file_line_counts.items():
    parent = os.path.dirname(file_path)
    while parent.startswith(absolute_repo_path):
        dir_sizes[parent] += line_count
        if parent == absolute_repo_path:
            break
        parent = os.path.dirname(parent)
dir_sizes[absolute_repo_path] = sum(file_line_counts.values())

print(f"Calculated sizes for {len(dir_sizes)} directories.")
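
To make the roll-up easier to follow, here is a small worked example on made-up paths. `roll_up_sizes` is a hypothetical name for the same idea as the cell above, and `posixpath` is used only to keep the example independent of the operating system.

```python
import posixpath
from collections import defaultdict

def roll_up_sizes(file_line_counts, root):
    """Sum line counts into every ancestor directory up to and including `root`."""
    sizes = defaultdict(int)
    for file_path, line_count in file_line_counts.items():
        parent = posixpath.dirname(file_path)
        while True:
            sizes[parent] += line_count
            # Stop at the root, or at the top of the file system as a safety net.
            if parent == root or parent == posixpath.dirname(parent):
                break
            parent = posixpath.dirname(parent)
    return dict(sizes)

sample_counts = {
    "/repo/a/X.java": 10,
    "/repo/a/b/Y.java": 5,
    "/repo/Z.java": 1,
}
sizes = roll_up_sizes(sample_counts, "/repo")
print(sizes)
```

Here `/repo/a/b` ends up with 5 lines, `/repo/a` with 15, and `/repo` with 16, so every directory value equals the sum of everything below it.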

### 3.3. Tree Construction

Now we build the hierarchical structure of the treemap. We create a list of unique IDs, labels, and parent-child relationships for every directory and file that will be displayed.

In [None]:
ids = []
labels = []
parents = []
values = []

ids.append(absolute_repo_path)
labels.append(os.path.basename(absolute_repo_path))
parents.append("")
values.append(dir_sizes[absolute_repo_path])

for path in sorted(dir_paths):
    if path == absolute_repo_path:
        continue
    parent_path = os.path.dirname(path)
    ids.append(path)
    labels.append(os.path.basename(path))
    parents.append(parent_path)
    values.append(dir_sizes.get(path, 0))

for path, line_count in file_line_counts.items():
    parent_path = os.path.dirname(path)
    ids.append(path)
    labels.append(os.path.basename(path))
    parents.append(parent_path)
    values.append(line_count)
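
A treemap hierarchy is only consistent if every non-root entry names a parent that is itself present in `ids`; an inconsistent hierarchy typically results in an empty or incomplete figure. The following sketch shows one way to check this before plotting. `find_orphans` is a hypothetical helper, and the sample lists are made up.

```python
def find_orphans(ids, parents):
    """Return IDs whose parent is neither the empty root marker nor a known ID."""
    known = set(ids)
    return [i for i, p in zip(ids, parents) if p != "" and p not in known]

sample_ids = ["/repo", "/repo/src", "/repo/src/A.java", "/repo/lib/B.java"]
sample_parents = ["", "/repo", "/repo/src", "/repo/lib"]

# '/repo/lib' was never added as a node, so B.java has no valid parent.
orphans = find_orphans(sample_ids, sample_parents)
print(orphans)
```

An empty result means that every node can be attached to the tree.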

### 3.4. Color and Data Calculation

In the final data processing step, we iterate through the structure we just built and calculate the color and hover text for each item from its creation date. Files use their own creation date; directories use the oldest creation date of any file below them.

In [None]:
colors = []
customdata = []

for unique_id in ids:
    is_file = unique_id in file_creation_times
    is_dir = os.path.isdir(unique_id) and not is_file

    if is_file:
        timestamp = file_creation_times[unique_id]
        date_str = datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d') if timestamp > 0 else 'N/A'
        customdata.append(f'Created on: {date_str}')
        colors.append(timestamp)
    elif is_dir:
        timestamp = get_directory_creation_time(file_creation_times, unique_id)
        if timestamp != float('inf') and timestamp > 0:
            date_str = datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d')
            customdata.append(f'First commit in dir: {date_str}')
            colors.append(timestamp)
        else:
            customdata.append('Directory')
            colors.append(-1)
    else:  # Fallback: neither a known file nor an existing directory
        customdata.append('Root Directory')
        colors.append(-1)

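# Entries without a creation date get the oldest known timestamp,
# so that they do not distort the color range.
# default=0 avoids a ValueError when no timestamp was found at all.
min_positive = min((x for x in colors if x > 0), default=0)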
colors = [min_positive if x == -1 else x for x in colors]
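
To get a feeling for how a timestamp turns into a color, the sketch below computes the relative position of a value between the oldest and the newest timestamp. It assumes that the color scale is applied linearly between the smallest and the largest color value, which is the usual behavior when no explicit range is configured. `colorscale_position` is a hypothetical helper, and the numbers are made up.

```python
def colorscale_position(value, all_values):
    """Return the position of `value` in [0, 1] between min and max of `all_values`."""
    lo, hi = min(all_values), max(all_values)
    if hi == lo:
        # All values are identical; treat everything as the oldest layer.
        return 0.0
    return (value - lo) / (hi - lo)

sample_timestamps = [1_000, 2_000, 3_000]  # stand-ins for Unix timestamps
position = colorscale_position(2_000, sample_timestamps)
print(position)
```

A position of 0.0 corresponds to the oldest element and 1.0 to the newest one.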

## 4. Visualization

Finally, we use Plotly to generate the interactive treemap. The size of each box represents the number of lines of code. The color represents the creation date for both files and directories: older elements appear darker and newer ones lighter, like layers of soil.

In [None]:
soil_colorscale = [
    (0.0, '#1b0f07'),   # Almost black with a brown tint
    (0.33, '#9c5e32'),  # Rich warm brown (more saturated than before)
    (0.66, '#d39b63'),  # Lighter and warmer mid-brown
    (1.0, '#fdf5e6')    # Very pale sand (close to ivory)
]

fig = go.Figure(go.Treemap(
    ids=ids,
    labels=labels,
    parents=parents,
    values=values,
    branchvalues="total",  # directory values already include all of their descendants
    customdata=customdata,
    hovertemplate='<b>%{label}</b><br>Lines: %{value}<br>%{customdata}<extra></extra>',
    marker_colors=colors,
    marker_colorscale=soil_colorscale,
    root_color="grey"
))

fig.update_layout(
    margin=dict(t=10, l=10, r=10, b=10),
    width=1920/2,
    height=1080/2
)

fig.show()

output_html_file = "legacy_system_excavation.html"
fig.write_html(output_html_file)
print(f"Treemap saved as interactive HTML: {output_html_file}")