# Dataset Complexity Analysis

This notebook explores the Random Forest complexity metrics contained in ``forest_report.json``. It replicates the logic used by ``sort_datasets_by_complexity.py`` and adds tabular views and visualisations that show how the datasets compare.

## Load the dataset report

The JSON report is generated by ``dataset_forest_report.py``. Each entry contains metadata about the dataset alongside summary statistics for the optimised Random Forest model.

In [None]:
from __future__ import annotations

from pathlib import Path
from typing import Any, Mapping
import json

import pandas as pd
import matplotlib.pyplot as plt

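# Matplotlib 3.6 renamed the bundled seaborn styles to 'seaborn-v0_8-*' and later
# removed the old names, so pick whichever name this installation provides.
_style = 'seaborn-v0_8-whitegrid'
plt.style.use(_style if _style in plt.style.available else 'seaborn-whitegrid')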
REPORT_PATH = Path('forest_report.json')
REPORT_PATH.resolve()

In [None]:
with REPORT_PATH.open('r', encoding='utf-8') as handle:
    report_data: list[dict[str, Any]] = json.load(handle)

len(report_data)
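
The cells that follow read a fixed set of keys from each entry: ``dataset``, ``metadata`` (``series_length``, ``train_size``, ``test_size``) and ``forest_statistics`` (``n_estimators``, ``avg_nodes``, ``avg_depth``, ``avg_leaves``). The sketch below shows that shape with invented values and a made-up dataset name. The helper ``missing_keys`` is hypothetical and not part of ``dataset_forest_report.py``. It only lists which of those keys an entry lacks.

```python
from __future__ import annotations

from typing import Any, Mapping

# Keys this notebook reads from each report entry.
EXPECTED_KEYS = {
    'metadata': ('series_length', 'train_size', 'test_size'),
    'forest_statistics': ('n_estimators', 'avg_nodes', 'avg_depth', 'avg_leaves'),
}

# Invented entry for illustration; real values come from forest_report.json.
example_entry: dict[str, Any] = {
    'dataset': 'ExampleDataset',  # hypothetical name
    'metadata': {'series_length': 128, 'train_size': 100, 'test_size': 50},
    'forest_statistics': {'n_estimators': 100, 'avg_nodes': 31.0, 'avg_depth': 5.2},
}


def missing_keys(entry: Mapping[str, Any]) -> list[str]:
    """Return dotted names of expected keys that are absent or null."""
    missing = []
    if entry.get('dataset') is None:
        missing.append('dataset')
    for section, keys in EXPECTED_KEYS.items():
        block = entry.get(section)
        # A missing or malformed section counts as empty.
        block = block if isinstance(block, Mapping) else {}
        missing.extend(f'{section}.{key}' for key in keys if block.get(key) is None)
    return missing


print(missing_keys(example_entry))
```

Running a check like this over ``report_data`` before building the summary table would show which rows will end up with empty cells or a zero ``forest_dimension``.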

## Build a summary table

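We derive a single ``forest_dimension`` metric by multiplying the number of estimators by the average number of nodes per tree, which approximates the total node count of the forest. This is the same approximation used by the CLI. Entries whose statistics are missing or non-numeric receive a dimension of ``0.0``.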

In [None]:
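def compute_forest_dimension(statistics: Mapping[str, Any] | None) -> float:
    """Approximate the forest's total node count as n_estimators * avg_nodes.

    Returns 0.0 when the statistics are missing or cannot be read as numbers.
    """
    if not isinstance(statistics, Mapping):
        return 0.0
    n_estimators = statistics.get('n_estimators')
    avg_nodes = statistics.get('avg_nodes')
    try:
        return float(n_estimators) * float(avg_nodes)
    except (TypeError, ValueError):
        # float(None) raises TypeError; float('abc') raises ValueError.
        return 0.0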

def extract_metadata(entry: Mapping[str, Any]) -> dict[str, Any]:
    """Flatten one report entry into a row of the summary table."""
    metadata = entry.get('metadata')
    statistics = entry.get('forest_statistics')
    # Treat a missing or malformed section as empty so every lookup is uniform.
    metadata = metadata if isinstance(metadata, Mapping) else {}
    statistics = statistics if isinstance(statistics, Mapping) else {}

    def cast(source: Mapping[str, Any], key: str, kind: type) -> Any:
        # Convert present values to the requested type; keep missing ones as None.
        value = source.get(key)
        return kind(value) if value is not None else None

    return {
        'dataset': str(entry.get('dataset', '')) or '<unknown>',
        'forest_dimension': compute_forest_dimension(statistics),
        'series_length': cast(metadata, 'series_length', int),
        'train_size': cast(metadata, 'train_size', int),
        'test_size': cast(metadata, 'test_size', int),
        'avg_depth': cast(statistics, 'avg_depth', float),
        'avg_leaves': cast(statistics, 'avg_leaves', float),
        'avg_nodes': cast(statistics, 'avg_nodes', float),
        'n_estimators': cast(statistics, 'n_estimators', int),
    }

summary_rows = [extract_metadata(entry) for entry in report_data]
summary_df = pd.DataFrame(summary_rows)
# Largest forests first; ties are broken by longer series, then by dataset name.
summary_df = summary_df.sort_values(
    ['forest_dimension', 'series_length', 'dataset'],
    ascending=[False, False, True],
).reset_index(drop=True)
summary_df.head()
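
To make the metric and the ranking order concrete, here is a small pure-Python sketch with invented numbers (none of them come from ``forest_report.json``). It computes ``n_estimators * avg_nodes`` for three made-up datasets and sorts them by descending dimension, then descending series length, then ascending name. It assumes no values are missing; pandas additionally places missing values last.

```python
# Invented rows: (dataset, n_estimators, avg_nodes, series_length)
rows = [
    ('A', 100, 31.0, 128),
    ('B', 50, 62.0, 256),
    ('C', 200, 10.5, 64),
]

ranked = sorted(
    (
        {'dataset': name, 'forest_dimension': n * nodes, 'series_length': length}
        for name, n, nodes, length in rows
    ),
    # Negating the numeric keys gives descending order; the name stays ascending.
    key=lambda r: (-r['forest_dimension'], -r['series_length'], r['dataset']),
)

for row in ranked:
    print(row['dataset'], row['forest_dimension'])
# B 3100.0
# A 3100.0
# C 2100.0
```

Datasets A and B have the same dimension (3100.0), so the longer series of B decides their order.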

## Full dataset ranking

The full table mirrors the CLI output but includes additional metadata for reference.

In [None]:
styled_summary = summary_df.style.format(
    {
        'forest_dimension': '{:,.2f}',
        'avg_depth': '{:.2f}',
        'avg_leaves': '{:,.2f}',
        'avg_nodes': '{:,.2f}',
    },
    na_rep='n/a',  # entries without statistics would otherwise render as 'nan'
)
styled_summary

## Top datasets by forest complexity

A horizontal bar chart provides a compact overview of the datasets with the largest forests.

In [None]:
top_n = 20
top_complex = summary_df.head(top_n)
fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(top_complex['dataset'], top_complex['forest_dimension'], color='#1f77b4')
ax.set_xlabel('Forest dimension (n_estimators × avg_nodes)')
ax.set_ylabel('Dataset')
ax.set_title(f'Top {top_n} datasets by forest complexity')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

## Complexity vs. series length

The scatter plot below highlights whether longer time series also require larger forests.

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(summary_df['series_length'], summary_df['forest_dimension'], s=60, alpha=0.7)
ax.set_xlabel('Series length')
ax.set_ylabel('Forest dimension')
ax.set_title('Forest complexity vs. series length')
plt.tight_layout()
plt.show()
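
One way to put a number on the visual impression from the scatter plot is a rank correlation. The sketch below is an addition, not part of the original analysis: it assumes Spearman's coefficient is an acceptable summary and implements it with the standard library only. The helper names ``average_ranks`` and ``spearman`` and the sample values are invented for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for i in order[start:end + 1]:
            ranks[i] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    # Raises ZeroDivisionError if either input is constant.
    return cov / (var_x * var_y) ** 0.5


# Invented values, not taken from forest_report.json.
lengths = [64, 128, 256, 512]
dimensions = [2100.0, 3100.0, 2900.0, 8000.0]
rho = spearman(lengths, dimensions)
print(round(rho, 3))  # 0.8
```

On the notebook's own data, rows with a missing ``series_length`` would need to be dropped first. ``summary_df[['series_length', 'forest_dimension']].corr(method='spearman')`` should give the pandas equivalent.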

## Distribution of forest complexity

Finally, a histogram shows the spread of the forest dimension metric across all datasets.

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(summary_df['forest_dimension'], bins=20, color='#ff7f0e', edgecolor='black', alpha=0.8)
ax.set_xlabel('Forest dimension')
ax.set_ylabel('Number of datasets')
ax.set_title('Distribution of forest complexity')
plt.tight_layout()
plt.show()